Sample solutions and discussion
Perl Expert Quiz of The Week #4 (20021106)

If you were to write a program to locate all the anagrams in a dictionary, you would find a lot of very boring anagrams. For example, Webster's International Dictionary, Second Edition, yields

        glossolabiopharyngeal   labioglossopharyngeal
        anatomicopathological   pathologicoanatomical

and such. Part of the problem with such 'anagrams' is that the words are extremely obscure, specialized technical jargon. But even well-known words can fail to produce interesting anagrams:

        addresser   readdress

The major reason these anagrams fail to be interesting is that they really consist of rearrangements of words, not rearrangements of letters. Even technical jargon can be interesting if it's sufficiently non-obvious; I'd submit that

        protomagnesium   spermatogonium

is reasonably interesting; when the words are more familiar, the result can be spectacular:

        urogenital    regulation
        ancestorial   lacerations

This suggests the following heuristic for the 'interestingness' of a pair of anagrammatic words: If one word can be rearranged to form the other after being broken into just a few chunks, it is probably not very interesting. If it must be broken into many chunks to be so rearranged, it is likely to be more interesting.

Applying this heuristic to the examples above, we see that the two 'boring' examples are only two-, three- and four-chunk anagrams:

        address/er                 re/address
        glosso/labio/pharyngeal    labio/glosso/pharyngeal
        anatomic/o/pathologic/al   pathologic/o/anatomic/al

whereas the 'reasonably interesting' example is a ten-chunk anagram:

        p/r/o/to/ma/g/n/e/s/ium    s/p/e/r/ma/to/g/o/n/ium

as are the 'spectacular' examples:

        u/r/o/g/e/n/i/t/a/l        r/e/g/u/l/a/t/i/o/n
        a/n/ce/s/t/o/r/i/a/l       l/a/ce/r/a/t/i/o/n/s

(No simpler decompositions are possible.)

Note that this heuristic naturally favors long anagrams over short ones; it is impossible for a three-letter anagram pair to get a score higher than 3. This is consistent with most people's judgement of anagram quality: short words' anagrams are rarely or never interesting.

Write a function, 'min_chunks', whose arguments are two words which are anagrams of one another, and which returns the minimum number of chunks into which the first word must be divided before the chunks can be rearranged to form the second word.

Use the function to generate interesting anagrams.

----------------------------------------------------------------

This week's report was difficult to write, because the question generated so much interesting discussion and attracted so many solutions.

It was quite some time before anyone submitted a correct solution. During most of Saturday and Sunday I imagined that this report was going to begin like this:

    There were three kinds of solutions sent to the -discuss list this
    week: Incorrect ones, slow ones, and ones written by Ben Kennedy.

I am glad to say that Mr. Kennedy's excellent solution was eventually joined by several others. The fastest one I tested was posted by Andrew Wilson. Andrew used essentially the same strategy that I tried when I first worked on this problem ten years ago: Find all possible matchings of single letters between the two words, compute how many chunks are produced by each matching, and report the minimum number of chunks.

For example, given 'trout' and 'tutor', Mr. Wilson's program will produce two matchings. The 'rou' of 'trout' must get mapped onto the 'u-or' of 'tutor', but there are two possible ways to match up the 't's.
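Before looking at how those matchings can be represented, here is a minimal brute-force sketch of the strategy just described (my illustration, not any of the submitted programs): assign each letter of the first word, left to right, to an unused copy of the same letter in the second word, count a new chunk whenever the chosen position does not immediately follow the previous one, and keep the smallest count found.

    use strict;
    use warnings;

    # Illustration only: exhaustively try every letter-matching.
    sub naive_min_chunks {
        my ($w1, $w2) = @_;
        my @target = split //, $w2;
        my @used   = (0) x @target;
        my $best   = length($w1) + 1;       # worse than any real answer

        my $assign;
        $assign = sub {
            my ($i, $prev, $chunks) = @_;
            return if $chunks >= $best;     # can't improve on the best so far
            if ($i == length $w1) { $best = $chunks; return }
            my $letter = substr $w1, $i, 1;
            for my $pos (0 .. $#target) {
                next if $used[$pos] || $target[$pos] ne $letter;
                $used[$pos] = 1;
                # a new chunk starts unless this position continues the last one
                $assign->($i + 1, $pos, $chunks + ($pos == $prev + 1 ? 0 : 1));
                $used[$pos] = 0;
            }
        };
        $assign->(0, -2, 0);
        return $best;
    }

    print naive_min_chunks('trout', 'tutor'), "\n";   # prints 4

For 'trout' and 'tutor' the only freedom is in the two 't's, which is exactly where the two matchings come from.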
We can represent each matching as a list of numbers from 0 to 4; the first number says where the first 't' goes; the second says where the 'r' goes, and so on. The two matchings are (04312) and (24310). In this notation, a chunk is just a contiguous sequence of consecutive numbers, so the two matchings break up as /0/4/3/12/ and /2/4/3/1/0/. Thus the minimum is 4, and the chunking is /t/r/o/ut/ - /t/ut/o/r/.

Here's an example that might be more perspicuous. Suppose the two 'words' are 'aaabbcc' and 'bbccaaa'. There are 24 matchings:

    (4560123) (4650123) (4561023) (4651023)
    (4560132) (4650132) (4561032) (4651032)
    (5460123) (5640123) (5461023) (5641023)
    (5460132) (5640132) (5461032) (5641032)
    (6450123) (6540123) (6451023) (6541023)
    (6450132) (6540132) (6451032) (6541032)

Of these matchings, the first one contains the fewest chunks: /456/0123/. (The matching (6451032), on the other hand, implies six chunks, /6/45/1/0/3/2/, and yields the chunking /a/aa/b/b/c/c/ - /b/b/c/c/aa/a/.)

Here's code I wrote after all the submissions were in, in an attempt to improve on all of them. I didn't quite succeed, but it may be simpler than some of the other code that uses the same techniques. The first function computes permutations:

    use strict;

    # Find and return the $n'th permutation
    # of the remaining arguments in some canonical order
    sub permutation_n {
      my $n = shift;
      my @result;
      while (@_) {
        ($n, my $r) = (int($n/@_), $n % @_);
        push @result, splice @_, $r, 1;
      }
      \@result;
    }

I cribbed this from pp. 126-127 of _Perl Cookbook_; it's essentially 'n2pat' from page 127. There are six ways to order a list of three items; given a number $n from 0 to 5, and a list of three items, this function produces one of the orderings. It produces a different ordering for every number $n from 0 to 5. Similarly, if there are seven items, then $n should be a number from 0 to 5039.

To get a clear idea of what this function is doing, try

    for (0 .. 23) {
      my $p = permutation_n($_, 'A' .. 'D');
      print "@$p\n";
    }

It will print out

    A B C D
    B A C D
    C A B D
    ...
    D C B A

The next function takes a word and manufactures an object which records information about where each of its letters occurs, and how those letters might be exchanged with one another:

    sub split_word {
      my ($word) = @_;
      my %graph;
      for my $index (0 .. length($word) - 1) {
        my $ch = substr $word, $index, 1;
        push @{$graph{$ch}{pos}}, $index;
        $graph{$ch}{max} = 1 unless defined $graph{$ch}{cur};
        $graph{$ch}{max} *= @{$graph{$ch}{pos}};
        $graph{$ch}{cur} = 0;
      }
      for my $ch (keys %graph) {
        $graph{$ch}{perm} = [@{$graph{$ch}{pos}}];
      }
      %graph;
    }

Given 'aababc', the function produces:

    ('a' => { max  => 6,
              cur  => 0,
              pos  => [0, 1, 3],
              perm => [0, 1, 3],
            },
     'b' => { max  => 2,
              cur  => 0,
              pos  => [2, 4],
              perm => [2, 4],
            },
     'c' => { max  => 1,
              cur  => 0,
              pos  => [5],
              perm => [5],
            },
    )

This object will record which possible permutations of the letters we've tried. Each different letter gets its instances permuted separately; the 'cur' member of each sub-hash records which permutation we should try next. 'cur' will start at 0 and cycle up to the value of 'max'. ('max' is just the factorial of the number of elements in 'pos'.) 'pos' is a list of original positions of the letters; 'perm' is a permuted version of 'pos', permuted according to permutation #'cur'.

The next function takes one of these structures and computes the next permutation. It does this by incrementing the 'cur' value of the first key.
But if 'cur' is equal to 'max' for that key, then we've run out of permutations for that key, so we start over again with permutation 0 and increment the 'cur' for the next key instead. Having adjusted the 'cur' values, we call permutation_n() to compute the permutations themselves.

    sub next_permutation {
      my $o = shift;
      my $DONE = 0;
      for (values %$o) {
        if (++$_->{cur} == $_->{max}) {
          $_->{cur} = 0;
          $_->{perm} = [@{$_->{pos}}];
        } else {
          $_->{perm} = permutation_n($_->{cur}, @{$_->{pos}});
          $DONE = 1;
          last;
        }
      }
      return $DONE;
    }

If all the 'cur' values have become equal to the 'max' values, then we have tried every possible permutation already. In this case we fall out of the loop without setting $DONE, and return a false value, to indicate completion.

    sub count_chunks {
      my $prev = -2;
      my $chunks = 0;
      for (@_) {
        $chunks++ unless $_ == $prev+1;
        $prev = $_;
      }
      $chunks;
    }

count_chunks just takes a list of indices and counts the number of runs of consecutive numbers.

Here's the root of the code. It takes the first word and constructs a permutation object for it; that's %o. The 'while' loop runs through all the possible permutations of each letter, and the inner 'for' loop maps those letters, in order, to the corresponding letters of the second word. We count the number of chunks in each mapping so constructed, and emit the one with the fewest chunks.

    sub min_chunks {
      my ($given, $target) = @_;
      my %o = split_word($given);
      my @targ = split //, $target;
      my $chunks = 99999999;
      my @best_mapping;
      while (1) {
        my @mapping;
        my %letct;
        for (@targ) {
          push @mapping, $o{$_}{perm}[$letct{$_}++||0];
        }
        my $map_chunks = count_chunks(@mapping);
        if ($map_chunks < $chunks) {
          $chunks = $map_chunks;
          @best_mapping = @mapping;
        }
        last unless next_permutation(\%o);
      }
      wantarray ? ($chunks, @best_mapping) : $chunks;
    }

Andrew Wilson's solution was very similar to this, and somewhat faster. This version is at http://perl.plover.com/qotw/misc/e004/mjd3.pl ; Andrew Wilson's is at http://perl.plover.com/qotw/misc/e004/wilson.pl .
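As a quick sanity check, here is how the sample solution above behaves on some of the word pairs discussed in this report (a usage sketch; the expected values are the ones worked out elsewhere in the report, so the loop should print matching numbers):

    for my $pair (['trout',      'tutor',      4],
                  ['xylidin',    'xylinid',    3],
                  ['caviar',     'variac',     5],
                  ['urogenital', 'regulation', 10]) {
        my ($first, $second, $expected) = @$pair;
        my $got = min_chunks($first, $second);   # scalar context: chunk count only
        print "$first / $second: $got (expected $expected)\n";
    }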
1. There was very wide variation in the quality of the programs sent to the -discuss list. Here's a report of the results of running most of them against an anagram list with 2931 pairs of words:

                             ------TIMES------
    Program            Sys     User    Total  Wrong  $?    Output
    wilson.pl          0.1      9.6      9.7      0  (  0, ***FAILURES 0 )
    sidhekin.pl        0.1     10.4     10.4      0  (  0, ***FAILURES 0 )
    mjd3.pl            0.0     10.4     10.5      0  (  0, ***FAILURES 0 )
    hofmann.pl         0.1     10.9     11.0      0  (  0, ***FAILURES 0 )
    mjd.pl             0.0     14.6     14.6      0  (  0, ***FAILURES 0 )
    mjd2.pl            0.0     14.9     14.9      0  (  0, ***FAILURES 0 )
    koenig2.pl         0.1     17.6     17.7      0  (  0, ***FAILURES 0 )
    kennedy2.pl        0.0     26.4     26.4      0  (  0, ***FAILURES 0 )
    koenig.pl          0.2     31.2     31.4      0  (  0, ***FAILURES 0 )
    kennedy.pl         0.4     40.7     41.1      0  (  0, ***FAILURES 0 )
    boessenkool.pl     0.1     50.0     50.1      0  (  0, ***FAILURES 0 )
    mccarthy.pl        0.1     52.8     53.0      0  (  0, ***FAILURES 0 )
    hek2.pl            0.9     66.1     67.1      0  (  0, ***FAILURES 0 )
    hek.pl             1.4     98.7    100.1      0  (  0, ***FAILURES 0 )
    blumberg.pl        0.6    126.6    127.3      0  (  0, ***FAILURES 0 )
    wilson2.pl         0.2    141.3    141.5      0  (  0, ***FAILURES 0 )
    trottmann.pl       0.7    144.3    145.1      0  (  0, ***FAILURES 0 )
    boger.pl           2.6    263.7    266.3      0  (  0, ***FAILURES 0 )
    kripa.pl           0.1      0.6      0.6    999  (652, )
    camilio.pl         0.0      2.6      2.6     92  (253, ***FAILURES 92 )
    bruhat.pl          0.0      3.9      4.0    224  (253, ***FAILURES 224 )
    sde.pl             0.1     10.9     11.0    100  (253, ***FAILURES 100 )
    sde-i.pl           0.1     11.5     11.6    100  (253, ***FAILURES 100 )
    marvell.pl         0.1     12.3     12.4    100  (253, ***FAILURES 100 )
    fysh.pl            0.2     12.3     12.5    100  (253, ***FAILURES 100 )
    barbie.pl          0.2     47.5     47.7    100  (253, ***FAILURES 100 )
    payrard.pl         0.1     61.6     61.7     37  (253, ***FAILURES 37 )
    sanger.pl          0.0     79.5     79.6    870  (253, ***FAILURES 870 )
    ryan2.pl           0.6     95.2     95.9    587  (253, ***FAILURES 587 )
    ryan.pl            0.7     97.4     98.1    897  (253, ***FAILURES 897 )
    finkel.pl          1.6    169.9    171.5    300  (253, ***FAILURES 300 )
    lopresto.pl       15.1    617.8    632.9    999  (307, )
    wilson1.pl         1.2   1464.8   1466.0    999  (366, )
    scott.pl           1.1   2272.3   2273.3    999  (366, )

'mjd3.pl' is the sample solution shown above. All these programs are available from http://perl.plover.com/qotw/misc/e004/ .

The 'Wrong' column indicates the number of test cases for which the program produced the wrong answer. '999' means that the program died, or that I killed it, or something else went wrong; in such cases the '$?' column indicates what happened, if you are a Unix system programmer. lopresto.pl ran out of memory; I killed wilson1.pl and scott.pl because they were taking too long; kripa.pl aborted for some reason I didn't bother to investigate. (The 652 is actually a 65280.) Some of these cases may be mistakes that I made in running the programs. The others exited with a failure code of 99 (the '253' in the table is really 25344) to indicate that they got some wrong answers; these are annotated with the number of examples they got wrong.

A few of these results are deceptive. For example, hofmann.pl looks good at first, but it sometimes gets the wrong answer. It just so happens that my 2931 pairs of words didn't include an example for which it failed.

2. I posed this question because I'd written a program several years ago to do this, and I hadn't been happy with the results. It got the right answers, but for long words with many repeated letters, it was very slow. I was hoping that someone else would find a better way to do it. This did happen.

I was completely surprised when I ran the report-generating program and found that my original solution (mjd.pl above) was among the better ones. I had fully expected it to be among the slowest of the correct solutions. Apparently my recollection of its slowness was due mostly to the intrinsic slowness of the computers of ten years ago!

3. Most of the incorrect solutions made the same sorts of mistakes. They assumed that the best way to proceed was to find the largest possible chunk, remove it from the two words, and repeat until nothing was left.
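Here is a minimal sketch of that flawed approach (my illustration, not any particular submission), included to make the discussion below concrete; as the next item explains, it reports 4 chunks for 'xylidin'/'xylinid' even though the correct answer is 3:

    use strict;
    use warnings;

    # Deliberately naive: repeatedly find the longest substring common to
    # both words, delete it from each, and count one chunk per deletion.
    sub greedy_chunks {
        my ($w1, $w2) = @_;
        my $chunks = 0;
        while (length $w1) {
            my $best = '';
            for my $start (0 .. length($w1) - 1) {
                for my $len (reverse 1 .. length($w1) - $start) {
                    my $piece = substr $w1, $start, $len;
                    if ($len > length($best) && index($w2, $piece) >= 0) {
                        $best = $piece;
                        last;
                    }
                }
            }
            last unless length $best;    # can't happen for true anagrams
            $w1 =~ s/\Q$best\E//;        # deleting (rather than marking) the
            $w2 =~ s/\Q$best\E//;        # letters is part of what goes wrong
            $chunks++;
        }
        return $chunks;
    }

    print greedy_chunks('xylidin', 'xylinid'), "\n";   # prints 4; the answer is 3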
4. Strategies like this are called 'greedy' because they grab as much as they can, without worrying about the future implications of their actions. For some problems, greedy strategies always work. But just as in life, greedy strategies aren't always optimal. Removing a large chunk early may eliminate some crucial letters and foreclose the chance of removing almost-as-large chunks later on.

One oft-discussed example of this type was 'xylidin' / 'xylinid'. The longest common chunk is /xyli/; if you remove this from the second word you have '----nid', and the rest of the letters chunk into /n/ /i/ /d/, for a total of four chunks. However, the correct answer is 3. The best chunking is

        /xyl/id/in/   /xyl/in/id/

Removing /xyli/ early on forecloses the possibility of removing /in/ and /id/ later on.

5. Another common approach was to compute all the possible chunks in the first word, search for them in the second word in order from longest to shortest, and remove any such chunks found. However, without great care, this may not work. Consider 'caviar' and 'variac': The longest matching chunks are /ar/ and /ia/, which we remove from 'variac', leaving 'v--iac' and then 'v----c'. Then we remove the /v/ and the /c/, for a total of four chunks.

Unfortunately, this is wrong. We chunked 'variac' into /v/ar/ia/c/, but in 'caviar', the /ar/ and /ia/ overlap, so there is no corresponding division into four chunks. The correct answer is 5, either as

        /c/a/v/i/ar/   /v/ar/i/a/c/

or as

        /c/a/v/ia/r/   /v/a/r/ia/c/

I posted some sample test words of this type on Saturday, but incorrect solutions kept showing up on the -discuss list.

6. At one point Randal Schwartz suggested that the problem might be NP-complete. Segher Boessenkool pointed out that this was not the case, and presented a (rather cryptic) solution which he claimed was polynomial. I believe he's correct, but the exact complexity of the algorithm seems elusive. See below for more details.

7. Avi Finkel suggested that it might be closer to the intuitive notion of anagram quality to be allowed to reverse chunks, so that in 'read'/'dare' there are two chunks (/re/ad/ and /da/re/) rather than 3. Unfortunately, nobody took him up on this interesting suggestion. It occurs to me now that it would be easy to do so, just by changing 'count_chunks' in the example solution above.

8. The sample solution has very poor worst-case behavior. The best case occurs when each letter in the anagram appears exactly once, as with 'logarithm'/'algorithm'; in such cases there's only one possible mapping between the letters, so the 'while (1)' loop executes exactly once. The worst cases are when there are many repeated letters. For 'aaaaaaaaaa', the program stupidly examines 3628800 mappings. For 'photochromolithograph' it examines 23,040 mappings. (There are 120 ways to permute the o's, times 24 ways to permute the h's, times 2 ways to permute the t's, times 2 ways to permute the r's, times 2 ways to permute the p's.) Other solutions did not have this worst-case behavior.
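For the record, those counts are just the product of the factorials of the letter multiplicities, which is easy to verify (a small sketch of the arithmetic; not part of any submitted program):

    use strict;
    use warnings;

    sub factorial {
        my ($n) = @_;
        my $f = 1;
        $f *= $_ for 2 .. $n;
        return $f;
    }

    # Number of mappings the sample solution's 'while (1)' loop walks
    # through: the product of factorial(multiplicity) over each letter.
    sub mappings_examined {
        my ($word) = @_;
        my %count;
        $count{$_}++ for split //, $word;
        my $total = 1;
        $total *= factorial($_) for values %count;
        return $total;
    }

    print mappings_examined('photochromolithograph'), "\n";  # 23040
    print mappings_examined('aaaaaaaaaa'), "\n";             # 3628800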
9. For example, here is Ben Kennedy's explanation of his solution. I was going to write this up myself, but I found I couldn't improve on his explanation:

    This solution I came up with is relatively simple (code repeated at
    bottom of this message) - it's based on the observation that if the
    string is truly an anagram, the first chunk of the string must go
    somewhere into the second string. This provides the basis for a
    simple recursive algorithm based around placing the first chunk in
    the first string.

    Since we already established that a greedy approach may not work,
    it has to look at *every* possible first chunk going into *every*
    possible location in the second word. For example, "aaabbb" going
    into "aabbba" has the following first chunks:

        a
        aa
        aaa
        aaab
        aaabb
        aaabbb

    Each chunk is then scanned for in the second string:

        a      can be matched to (a)abbba or a(a)bbba or aabbb(a)
        aa     can be matched to (aa)bbba
        aaa, aaab, aaabb, aaabbb have no matches

    Thus we now have four potential branches, so the function descends
    with each of the following word pairs:

        aabbb to -abbba
        aabbb to a-bbba
        aabbb to aabbb-
        abbb  to -bbba

    Note the presence of dashes to indicate removal of the letters in
    parentheses. If the chunk were simply removed, it would bring
    together letters that weren't together initially, and the algorithm
    would not work.

    These word pairs are then sent to the next level of recursion. Note
    the pairs after processing the third entry, which was:

        aabbb to aabbb-

    We have the following new pairs:

        Chunk is a,     pairs are abbb to -abbb-, abbb to a-bbb-
        Chunk is aa,    pairs are bbb to -bbb-
        Chunk is aab,   pairs are bb to -bb-
        Chunk is aabb,  pairs are b to -b-
        Chunk is aabbb, pairs are (null) to --

    That last condition means we have a complete anagram - the number
    of dashes (2) is equal to the depth of recursion (2), which means
    we eliminated the second word in only 2 chunks:

        (a)(aabbb) to (aabbb)(a)

    Once a solution is found, it's marked as the "best" solution and
    the code does not descend to levels beyond this depth.

    http://perl.plover.com/~alias/list.cgi?1:msp:461

The solutions from Sidhekin and Andreas Koenig were also of this type.

10. Jamie McCarthy's program iterated over all possible ways of breaking the first string into chunks, starting by breaking it into two chunks and then working up through three chunks, four chunks, etc. At each step it would look to see if all of the resulting chunks were present in the second string; if not, it would try a different way of breaking up the first string. If all the chunks were present in the second string, his program would see if they could in fact be assembled into the second string. (The chunks /ab/ and /ba/ are both present in "abab", but they can't be assembled to form it.) If so, that's a winner.

McCarthy's program is fast because it has some clever optimizations for skipping certain chunkings and certain permutations. It's worth more discussion than I gave it here, but I have a book to write.

11. John Macdonald suggested using dynamic programming techniques:

    http://perl.plover.com/~alias/list.cgi?1:mss:516

He wasn't sure it would work, but it does work: Segher Boessenkool did something very much like what John suggested.

    http://perl.plover.com/~alias/list.cgi?1:mss:537

The code, unfortunately, was difficult to follow. I was hoping that one of the other fast, correct solutions would be similar, so that I could ignore it, but I didn't see anything else similar. A couple of comments would have helped out a lot.

I'm not going to explain the whole thing in detail, but if you want to investigate, it will probably help you to know that the main data structure, @a, is a two-dimensional array. $a[$L][$P] is a hash that describes all of the minimum-chunk ways of mapping substr(Word1, $P, $L) onto Word2. Boessenkool computes the array elements starting with $L==1 (which is easy) and working up until $L is the length of the word, at which point the number of chunks in the mapping is the solution of the problem. He also represents chunks as bit vectors; he can use the & operator to determine whether two chunks overlap.
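The bit-vector trick is easy to see in isolation (a sketch of the idea only, not Boessenkool's code): represent the set of Word2 positions a chunk occupies as a string of '0' and '1' characters, and Perl's bitwise & on strings turns the overlap test into a single operation.

    use strict;
    use warnings;

    # Build a bit vector (as a string of '0'/'1') for a set of positions.
    sub positions_to_vec {
        my ($length, @positions) = @_;
        my $vec = '0' x $length;
        substr($vec, $_, 1) = '1' for @positions;
        return $vec;
    }

    my $chunk_a = positions_to_vec(7, 2, 3);    # a chunk occupying positions 2-3
    my $chunk_b = positions_to_vec(7, 3, 4);    # a chunk occupying positions 3-4
    my $chunk_c = positions_to_vec(7, 5, 6);    # a chunk occupying positions 5-6

    my $ab = ($chunk_a & $chunk_b) =~ /1/ ? "overlap" : "disjoint";
    my $ac = ($chunk_a & $chunk_c) =~ /1/ ? "overlap" : "disjoint";
    print "a vs b: $ab\n";    # a vs b: overlap  (both use position 3)
    print "a vs c: $ac\n";    # a vs c: disjoint

Because the vectors are ordinary strings, no extra modules are needed and the test stays cheap even for long words.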
12. I was surprised to see so much use of the 'Memoize' module; I thought that Memoize was one of those things that nobody ever used. I guess I was wrong! But I was also surprised to see so much *inappropriate* use of 'Memoize'. Memoize isn't magical; all it does is install a cache on the front of a function. Managing the cache incurs overhead. The hope is that the time saved by retrieving cached values will make up for the time spent managing the cache. If not, the memoization makes the function slower, not faster. In particular, if the function is not called repeatedly with the same arguments, the memoization will not help. (See http://perl.plover.com/yak/memoize-quant/ for details.)

One example of this is in sidhekin.pl. Eliminating the memoization speeds up the program substantially.

13. Dan Schmidt was puzzled as to why there were so many incorrect solutions. Dan said:

    I started by thinking of the greedy algorithm (start with the
    biggest match), but first I figured I would prove to myself that
    the greedy algorithm works... and instead I quickly found a
    counterexample.

    ...I am surprised how many people seemed to dash out an algorithm
    that looked reasonable without looking more carefully to see if it
    could fail. I think that skill is just as important as neat regexp
    tricks.

I wondered this myself. As you can see from the table above, more than half of the posted solutions didn't work. Even worse, people kept posting the same incorrect solutions even after the earlier incorrect solutions had been discussed, after I posted sample dictionaries, and after I posted a short list of 'hard cases' that tended to produce failures for all these programs! What does this mean?

Thanks to everyone who participated in the quiz, including both those who mailed the -discuss list and those who did not.

Quiz #5 will be along today at about 2100 GMT.