Sample solutions and discussion
Perl Expert Quiz of The Week #4 (20021106)

If you were to write a program to locate all the anagrams in a dictionary, you would find a lot of very boring anagrams. For example, Webster's International Dictionary, Second Edition, yields

        glossolabiopharyngeal   labioglossopharyngeal
        anatomicopathological   pathologicoanatomical

and such. Part of the problem with such 'anagrams' is that the words are extremely obscure, specialized technical jargon. But even well-known words can fail to produce interesting anagrams:

        addresser   readdress

The major reason these anagrams fail to be interesting is that they really consist of rearrangements of words, not rearrangements of letters. Even technical jargon can be interesting if it's sufficiently non-obvious; I'd submit that

        protomagnesium   spermatogonium

is reasonably interesting; when the words are more familiar, the result can be spectacular:

        urogenital    regulation
        ancestorial   lacerations

This suggests the following heuristic for the 'interestingness' of a pair of anagrammatic words: If one word can be rearranged to form the other after being broken into just a few chunks, it is probably not very interesting. If it must be broken into many chunks to be so rearranged, it is likely to be more interesting.

Applying this heuristic to the examples above, we see that the two 'boring' examples are only two-, three- and four-chunk anagrams:

        address/er                 re/address
        glosso/labio/pharyngeal    labio/glosso/pharyngeal
        anatomic/o/pathologic/al   pathologic/o/anatomic/al

whereas the 'reasonably interesting' example is a ten-chunk anagram:

        p/r/o/to/ma/g/n/e/s/ium    s/p/e/r/ma/to/g/o/n/ium

as are the 'spectacular' examples:

        u/r/o/g/e/n/i/t/a/l        r/e/g/u/l/a/t/i/o/n
        a/n/ce/s/t/o/r/i/a/l       l/a/ce/r/a/t/i/o/n/s

(No simpler decompositions are possible.)

Note that this heuristic naturally favors long anagrams over short ones; it is impossible for a three-letter anagram pair to get a score higher than 3. This is consistent with most people's judgement of anagram quality: short words' anagrams are rarely or never interesting.

Write a function, 'min_chunks', whose arguments are two words which are anagrams of one another, and which returns the minimum number of chunks into which the first word must be divided before the chunks can be rearranged to form the second word.

Use the function to generate interesting anagrams.

----------------------------------------------------------------

This week's report was difficult to write, because the question generated so much interesting discussion and attracted so many solutions.

It was quite some time before anyone submitted a correct solution. During most of Saturday and Sunday I imagined that this report was going to begin like this:

    There were three kinds of solutions sent to the -discuss list this
    week: Incorrect ones, slow ones, and ones written by Ben Kennedy.

I am glad to say that Mr. Kennedy's excellent solution was eventually joined by several others. The fastest one I tested was posted by Andrew Wilson. Andrew used essentially the same strategy that I tried when I first worked on this problem ten years ago: Find all possible matchings of single letters between the two words, compute how many chunks are produced by each matching, and report the minimum number of chunks.

For example, given 'trout' and 'tutor', Mr. Wilson's program will produce two matchings. The 'rou' of 'trout' must get mapped onto the 'u-or' of 'tutor', but there are two possible ways to match up the 't's.
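Before looking at how those matchings can be represented, here is a minimal brute-force sketch of the strategy just described (my illustration, not any of the submitted programs): assign each letter of the first word, left to right, to an unused copy of the same letter in the second word, count a new chunk whenever the chosen position does not immediately follow the previous one, and keep the smallest count found.

    use strict;
    use warnings;

    # Illustration only: exhaustively try every letter-matching.
    sub naive_min_chunks {
        my ($w1, $w2) = @_;
        my @target = split //, $w2;
        my @used   = (0) x @target;
        my $best   = length($w1) + 1;       # worse than any real answer

        my $assign;
        $assign = sub {
            my ($i, $prev, $chunks) = @_;
            return if $chunks >= $best;     # can't improve on the best so far
            if ($i == length $w1) { $best = $chunks; return }
            my $letter = substr $w1, $i, 1;
            for my $pos (0 .. $#target) {
                next if $used[$pos] || $target[$pos] ne $letter;
                $used[$pos] = 1;
                # a new chunk starts unless this position continues the last one
                $assign->($i + 1, $pos, $chunks + ($pos == $prev + 1 ? 0 : 1));
                $used[$pos] = 0;
            }
        };
        $assign->(0, -2, 0);
        return $best;
    }

    print naive_min_chunks('trout', 'tutor'), "\n";   # prints 4

For 'trout' and 'tutor' the only freedom is in the two 't's, which is exactly where the two matchings come from.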
We can represent each matching as a list of numbers from 0 to 4; the first number says where the first 't' goes; the second says where the 'r' goes, and so on. The two matchings are (04312) and (24310). In this notation, a chunk is just a contiguous sequence of consecutive numbers, so the two matchings break up as /0/4/3/12/ and /2/4/3/1/0/. Thus the minimum is 4, and the chunking is /t/r/o/ut/ - /t/ut/o/r/.

Here's an example that might be more perspicuous. Suppose the two 'words' are 'aaabbcc' and 'bbccaaa'. There are 24 matchings:

    (4560123) (4650123) (4561023) (4651023)
    (4560132) (4650132) (4561032) (4651032)
    (5460123) (5640123) (5461023) (5641023)
    (5460132) (5640132) (5461032) (5641032)
    (6450123) (6540123) (6451023) (6541023)
    (6450132) (6540132) (6451032) (6541032)

Of these matchings, the first one contains the fewest chunks: /456/0123/. (The matching (6451032), on the other hand, implies six chunks, /6/45/1/0/3/2/, and yields the chunking /a/aa/b/b/c/c/ - /b/b/c/c/aa/a/.)

Here's code I wrote after all the submissions were in, in an attempt to improve on all of them. I didn't quite succeed, but it may be simpler than some of the other code that uses the same techniques. The first function computes permutations:

    use strict;

    # Find and return the $n'th permutation
    # of the remaining arguments in some canonical order
    sub permutation_n {
      my $n = shift;
      my @result;
      while (@_) {
        ($n, my $r) = (int($n/@_), $n % @_);
        push @result, splice @_, $r, 1;
      }
      \@result;
    }

I cribbed this from pp. 126-127 of _Perl Cookbook_; it's essentially 'n2pat' from page 127. There are six ways to order a list of three items; given a number $n from 0 to 5, and a list of three items, this function produces one of the orderings. It produces a different ordering for every number $n from 0 to 5. Similarly, if there are seven items, then $n should be a number from 0 to 5039.

To get a clear idea of what this function is doing, try

    for (0 .. 23) {
      my $p = permutation_n($_, 'A' .. 'D');
      print "@$p\n";
    }

It will print out

    A B C D
    B A C D
    C A B D
    ...
    D C B A

The next function takes a word and manufactures an object which records information about where each of its letters occurs, and how those letters might be exchanged with one another:

    sub split_word {
      my ($word) = @_;
      my %graph;
      for my $index (0 .. length($word) - 1) {
        my $ch = substr $word, $index, 1;
        push @{$graph{$ch}{pos}}, $index;
        $graph{$ch}{max} = 1 unless defined $graph{$ch}{cur};
        $graph{$ch}{max} *= @{$graph{$ch}{pos}};
        $graph{$ch}{cur} = 0;
      }
      for my $ch (keys %graph) {
        $graph{$ch}{perm} = [@{$graph{$ch}{pos}}];
      }
      %graph;
    }

Given 'aababc', the function produces:

    ('a' => { max  => 6,
              cur  => 0,
              pos  => [0, 1, 3],
              perm => [0, 1, 3],
            },
     'b' => { max  => 2,
              cur  => 0,
              pos  => [2, 4],
              perm => [2, 4],
            },
     'c' => { max  => 1,
              cur  => 0,
              pos  => [5],
              perm => [5],
            },
    )

This object will record which possible permutations of the letters we've tried. Each different letter gets its instances permuted separately; the 'cur' member of each sub-hash records which permutation we should try next. 'cur' will start at 0 and cycle up to the value of 'max'. ('max' is just the factorial of the number of elements in 'pos'.) 'pos' is a list of original positions of the letters; 'perm' is a permuted version of 'pos', permuted according to permutation #'cur'.

The next function takes one of these structures and computes the next permutation. It does this by incrementing the 'cur' value of the first key.
But if 'cur' is equal to 'max' for that key, then we've run out of permutations for that key, so we start over again with permutation 0 and increment the 'cur' for the next key instead. Having adjusted the 'cur' values, we call permutation_n() to compute the permutations themselves.

    sub next_permutation {
      my $o = shift;
      my $DONE = 0;
      for (values %$o) {
        if (++$_->{cur} == $_->{max}) {
          $_->{cur} = 0;
          $_->{perm} = [@{$_->{pos}}];
        } else {
          $_->{perm} = permutation_n($_->{cur}, @{$_->{pos}});
          $DONE = 1;
          last;
        }
      }
      return $DONE;
    }

If all the 'cur' values have become equal to the 'max' values, then we have tried every possible permutation already. In this case we fall out of the loop without setting $DONE, and return a false value, to indicate completion.

    sub count_chunks {
      my $prev = -2;
      my $chunks = 0;
      for (@_) {
        $chunks++ unless $_ == $prev+1;
        $prev = $_;
      }
      $chunks;
    }

count_chunks just takes a list of indices and counts the number of runs of consecutive numbers.

Here's the root of the code. It takes the first word and constructs a permutation object for it; that's %o. The 'while' loop runs through all the possible permutations of each letter, and the inner 'for' loop maps those letters, in order, to the corresponding letters of the second word. We count the number of chunks in each mapping so constructed, and emit the one with the fewest chunks.

    sub min_chunks {
      my ($given, $target) = @_;
      my %o = split_word($given);
      my @targ = split //, $target;
      my $chunks = 99999999;
      my @best_mapping;
      while (1) {
        my @mapping;
        my %letct;
        for (@targ) {
          push @mapping, $o{$_}{perm}[$letct{$_}++||0];
        }
        my $map_chunks = count_chunks(@mapping);
        if ($map_chunks < $chunks) {
          $chunks = $map_chunks;
          @best_mapping = @mapping;
        }
        last unless next_permutation(\%o);
      }
      wantarray ? ($chunks, @best_mapping) : $chunks;
    }

Andrew Wilson's solution was very similar to this, and somewhat faster. This version is at http://perl.plover.com/qotw/misc/e004/mjd3.pl ; Andrew Wilson's is at http://perl.plover.com/qotw/misc/e004/wilson.pl .
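As a quick sanity check, here is how the sample solution above behaves on some of the word pairs discussed in this report (a usage sketch; the expected values are the ones worked out elsewhere in the report, so the loop should print matching numbers):

    for my $pair (['trout',      'tutor',      4],
                  ['xylidin',    'xylinid',    3],
                  ['caviar',     'variac',     5],
                  ['urogenital', 'regulation', 10]) {
        my ($first, $second, $expected) = @$pair;
        my $got = min_chunks($first, $second);   # scalar context: chunk count only
        print "$first / $second: $got (expected $expected)\n";
    }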
1. There was very wide variation in the quality of the programs sent to the -discuss list. Here's a report of the results of running most of them against an anagram list with 2931 pairs of words:

                             ------TIMES------
    Program            Sys     User    Total  Wrong  $?    Output
    wilson.pl          0.1      9.6      9.7      0  (  0, ***FAILURES 0 )
    sidhekin.pl        0.1     10.4     10.4      0  (  0, ***FAILURES 0 )
    mjd3.pl            0.0     10.4     10.5      0  (  0, ***FAILURES 0 )
    hofmann.pl         0.1     10.9     11.0      0  (  0, ***FAILURES 0 )
    mjd.pl             0.0     14.6     14.6      0  (  0, ***FAILURES 0 )
    mjd2.pl            0.0     14.9     14.9      0  (  0, ***FAILURES 0 )
    koenig2.pl         0.1     17.6     17.7      0  (  0, ***FAILURES 0 )
    kennedy2.pl        0.0     26.4     26.4      0  (  0, ***FAILURES 0 )
    koenig.pl          0.2     31.2     31.4      0  (  0, ***FAILURES 0 )
    kennedy.pl         0.4     40.7     41.1      0  (  0, ***FAILURES 0 )
    boessenkool.pl     0.1     50.0     50.1      0  (  0, ***FAILURES 0 )
    mccarthy.pl        0.1     52.8     53.0      0  (  0, ***FAILURES 0 )
    hek2.pl            0.9     66.1     67.1      0  (  0, ***FAILURES 0 )
    hek.pl             1.4     98.7    100.1      0  (  0, ***FAILURES 0 )
    blumberg.pl        0.6    126.6    127.3      0  (  0, ***FAILURES 0 )
    wilson2.pl         0.2    141.3    141.5      0  (  0, ***FAILURES 0 )
    trottmann.pl       0.7    144.3    145.1      0  (  0, ***FAILURES 0 )
    boger.pl           2.6    263.7    266.3      0  (  0, ***FAILURES 0 )
    kripa.pl           0.1      0.6      0.6    999  (652, )
    camilio.pl         0.0      2.6      2.6     92  (253, ***FAILURES 92 )
    bruhat.pl          0.0      3.9      4.0    224  (253, ***FAILURES 224 )
    sde.pl             0.1     10.9     11.0    100  (253, ***FAILURES 100 )
    sde-i.pl           0.1     11.5     11.6    100  (253, ***FAILURES 100 )
    marvell.pl         0.1     12.3     12.4    100  (253, ***FAILURES 100 )
    fysh.pl            0.2     12.3     12.5    100  (253, ***FAILURES 100 )
    barbie.pl          0.2     47.5     47.7    100  (253, ***FAILURES 100 )
    payrard.pl         0.1     61.6     61.7     37  (253, ***FAILURES 37 )
    sanger.pl          0.0     79.5     79.6    870  (253, ***FAILURES 870 )
    ryan2.pl           0.6     95.2     95.9    587  (253, ***FAILURES 587 )
    ryan.pl            0.7     97.4     98.1    897  (253, ***FAILURES 897 )
    finkel.pl          1.6    169.9    171.5    300  (253, ***FAILURES 300 )
    lopresto.pl       15.1    617.8    632.9    999  (307, )
    wilson1.pl         1.2   1464.8   1466.0    999  (366, )
    scott.pl           1.1   2272.3   2273.3    999  (366, )

'mjd3.pl' is the sample solution shown above. All these programs are available from http://perl.plover.com/qotw/misc/e004/ .

The 'Wrong' column indicates the number of test cases for which the program produced the wrong answer. '999' means that the program died, or that I killed it, or something else went wrong; in such cases the '$?' column indicates what happened, if you are a Unix system programmer. lopresto.pl ran out of memory; I killed wilson1.pl and scott.pl because they were taking too long; kripa.pl aborted for some reason I didn't bother to investigate. (The 652 is actually a 65280.) Some of these cases may be mistakes that I made in running the programs. The others exited with a failure code of 99 (the '253' in the table is really 25344) to indicate that they got some wrong answers; these are annotated with the number of examples they got wrong.

A few of these results are deceptive. For example, hofmann.pl looks good at first, but it sometimes gets the wrong answer. It just so happens that my 2931 pairs of words didn't include an example for which it failed.

2. I posed this question because I'd written a program several years ago to do this, and I hadn't been happy with the results. It got the right answers, but for long words with many repeated letters, it was very slow. I was hoping that someone else would find a better way to do it. This did happen.

I was completely surprised when I ran the report-generating program and found that my original solution (mjd.pl above) was among the better ones. I had fully expected it to be among the slowest of the correct solutions. Apparently my recollection of its slowness was due mostly to the intrinsic slowness of the computers of ten years ago!

3. Most of the incorrect solutions made the same sorts of mistakes. They assumed that the best way to proceed was to find the largest possible chunk, remove it from the two words, and repeat until nothing was left.
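Here is a minimal sketch of that flawed approach (my illustration, not any particular submission), included to make the discussion below concrete; as the next item explains, it reports 4 chunks for 'xylidin'/'xylinid' even though the correct answer is 3:

    use strict;
    use warnings;

    # Deliberately naive: repeatedly find the longest substring common to
    # both words, delete it from each, and count one chunk per deletion.
    sub greedy_chunks {
        my ($w1, $w2) = @_;
        my $chunks = 0;
        while (length $w1) {
            my $best = '';
            for my $start (0 .. length($w1) - 1) {
                for my $len (reverse 1 .. length($w1) - $start) {
                    my $piece = substr $w1, $start, $len;
                    if ($len > length($best) && index($w2, $piece) >= 0) {
                        $best = $piece;
                        last;
                    }
                }
            }
            last unless length $best;    # can't happen for true anagrams
            $w1 =~ s/\Q$best\E//;        # deleting (rather than marking) the
            $w2 =~ s/\Q$best\E//;        # letters is part of what goes wrong
            $chunks++;
        }
        return $chunks;
    }

    print greedy_chunks('xylidin', 'xylinid'), "\n";   # prints 4; the answer is 3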
4. Strategies like this are called 'greedy' because they grab as much as they can, without worrying about the future implications of their actions. For some problems, greedy strategies always work. But just as in life, greedy strategies aren't always optimal. Removing a large chunk early may eliminate some crucial letters and foreclose the chance of removing almost-as-large chunks later on.

One oft-discussed example of this type was 'xylidin' / 'xylinid'. The longest common chunk is /xyli/; if you remove this from the second word you have '----nid', and the rest of the letters chunk into /n/ /i/ /d/, for a total of four chunks. However, the correct answer is 3. The best chunking is

        /xyl/id/in/   /xyl/in/id/

Removing /xyli/ early on forecloses the possibility of removing /in/ and /id/ later on.

5. Another common approach was to compute all the possible chunks in the first word, search for them in the second word in order from longest to shortest, and remove any such chunks found. However, without great care, this may not work. Consider 'caviar' and 'variac': The longest matching chunks are /ar/ and /ia/, which we remove from 'variac', leaving 'v--iac' and then 'v----c'. Then we remove the /v/ and the /c/, for a total of four chunks.

Unfortunately, this is wrong. We chunked 'variac' into /v/ar/ia/c/, but in 'caviar', the /ar/ and /ia/ overlap, so there is no corresponding division into four chunks. The correct answer is 5, either as

        /c/a/v/i/ar/   /v/ar/i/a/c/

or as

        /c/a/v/ia/r/   /v/a/r/ia/c/

I posted some sample test words of this type on Saturday, but incorrect solutions kept showing up on the -discuss list.

6. At one point Randal Schwartz suggested that the problem might be NP-complete. Segher Boessenkool pointed out that this was not the case, and presented a (rather cryptic) solution which he claimed was polynomial. I believe he's correct, but the exact complexity of the algorithm seems elusive. See below for more details.

7. Avi Finkel suggested that it might be closer to the intuitive notion of anagram quality to be allowed to reverse chunks, so that in 'read'/'dare' there are two chunks (/re/ad/ and /da/re/) rather than 3. Unfortunately, nobody took him up on this interesting suggestion. It occurs to me now that it would be easy to do so, just by changing 'count_chunks' in the example solution above.

8. The sample solution has very poor worst-case behavior. The best case occurs when each letter in the anagram appears exactly once, as with 'logarithm'/'algorithm'; in such cases there's only one possible mapping between the letters, so the 'while (1)' loop executes exactly once. The worst cases are when there are many repeated letters. For 'aaaaaaaaaa', the program stupidly examines 3628800 mappings. For 'photochromolithograph' it examines 23,040 mappings. (There are 120 ways to permute the o's, times 24 ways to permute the h's, times 2 ways to permute the t's, times 2 ways to permute the r's, times 2 ways to permute the p's.) Other solutions did not have this worst-case behavior.
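For the record, those counts are just the product of the factorials of the letter multiplicities, which is easy to verify (a small sketch of the arithmetic; not part of any submitted program):

    use strict;
    use warnings;

    sub factorial {
        my ($n) = @_;
        my $f = 1;
        $f *= $_ for 2 .. $n;
        return $f;
    }

    # Number of mappings the sample solution's 'while (1)' loop walks
    # through: the product of factorial(multiplicity) over each letter.
    sub mappings_examined {
        my ($word) = @_;
        my %count;
        $count{$_}++ for split //, $word;
        my $total = 1;
        $total *= factorial($_) for values %count;
        return $total;
    }

    print mappings_examined('photochromolithograph'), "\n";  # 23040
    print mappings_examined('aaaaaaaaaa'), "\n";             # 3628800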
9. For example, here is Ben Kennedy's explanation of his solution. I was going to write this up myself, but I found I couldn't improve on his explanation:

    This solution I came up with is relatively simple (code repeated at
    bottom of this message) - it's based on the observation that if the
    string is truly an anagram, the first chunk of the string must go
    somewhere into the second string. This provides the basis for a
    simple recursive algorithm based around placing the first chunk in
    the first string.

    Since we already established that a greedy approach may not work,
    it has to look at *every* possible first chunk going into *every*
    possible location in the second word. For example, "aaabbb" going
    into "aabbba" has the following first chunks:

        a
        aa
        aaa
        aaab
        aaabb
        aaabbb

    Each chunk is then scanned for in the second string:

        a      can be matched to (a)abbba or a(a)bbba or aabbb(a)
        aa     can be matched to (aa)bbba
        aaa, aaab, aaabb, aaabbb have no matches

    Thus we now have four potential branches, so the function descends
    with each of the following word pairs:

        aabbb to -abbba
        aabbb to a-bbba
        aabbb to aabbb-
        abbb  to -bbba

    Note the presence of dashes to indicate removal of the letters in
    parentheses. If the chunk were simply removed, it would bring
    together letters that weren't together initially, and the algorithm
    would not work.

    These word pairs are then sent to the next level of recursion. Note
    the pairs after processing the third entry, which was:

        aabbb to aabbb-

    We have the following new pairs:

        Chunk is a,     pairs are abbb to -abbb-, abbb to a-bbb-
        Chunk is aa,    pairs are bbb to -bbb-
        Chunk is aab,   pairs are bb to -bb-
        Chunk is aabb,  pairs are b to -b-
        Chunk is aabbb, pairs are (null) to --

    That last condition means we have a complete anagram - the number
    of dashes (2) is equal to the depth of recursion (2), which means
    we eliminated the second word in only 2 chunks:

        (a)(aabbb) to (aabbb)(a)

    Once a solution is found, it's marked as the "best" solution and
    the code does not descend to levels beyond this depth.

    http://perl.plover.com/~alias/list.cgi?1:msp:461

The solutions from Sidhekin and Andreas Koenig were also of this type.

10. Jamie McCarthy's program iterated over all possible ways of breaking the first string into chunks, starting by breaking it into two chunks and then working up through three chunks, four chunks, etc. At each step it would look to see if all of the resulting chunks were present in the second string; if not, it would try a different way of breaking up the first string. If all the chunks were present in the second string, his program would see if they could in fact be assembled into the second string. (The chunks /ab/ and /ba/ are both present in "abab", but they can't be assembled to form it.) If so, that's a winner.

McCarthy's program is fast because it has some clever optimizations for skipping certain chunkings and certain permutations. It's worth more discussion than I gave it here, but I have a book to write.

11. John Macdonald suggested using dynamic programming techniques:

    http://perl.plover.com/~alias/list.cgi?1:mss:516

He wasn't sure it would work, but it does work: Segher Boessenkool did something very much like what John suggested.

    http://perl.plover.com/~alias/list.cgi?1:mss:537

The code, unfortunately, was difficult to follow. I was hoping that one of the other fast, correct solutions would be similar, so that I could ignore it, but I didn't see anything else similar. A couple of comments would have helped out a lot.

I'm not going to explain the whole thing in detail, but if you want to investigate, it will probably help you to know that the main data structure, @a, is a two-dimensional array. $a[$L][$P] is a hash that describes all of the minimum-chunk ways of mapping substr(Word1, $P, $L) onto Word2. Boessenkool computes the array elements starting with $L==1 (which is easy) and working up until $L is the length of the word, at which point the number of chunks in the mapping is the solution of the problem. He also represents chunks as bit vectors; he can use the & operator to determine whether two chunks overlap.
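The bit-vector trick is easy to see in isolation (a sketch of the idea only, not Boessenkool's code): represent the set of Word2 positions a chunk occupies as a string of '0' and '1' characters, and Perl's bitwise & on strings turns the overlap test into a single operation.

    use strict;
    use warnings;

    # Build a bit vector (as a string of '0'/'1') for a set of positions.
    sub positions_to_vec {
        my ($length, @positions) = @_;
        my $vec = '0' x $length;
        substr($vec, $_, 1) = '1' for @positions;
        return $vec;
    }

    my $chunk_a = positions_to_vec(7, 2, 3);    # a chunk occupying positions 2-3
    my $chunk_b = positions_to_vec(7, 3, 4);    # a chunk occupying positions 3-4
    my $chunk_c = positions_to_vec(7, 5, 6);    # a chunk occupying positions 5-6

    my $ab = ($chunk_a & $chunk_b) =~ /1/ ? "overlap" : "disjoint";
    my $ac = ($chunk_a & $chunk_c) =~ /1/ ? "overlap" : "disjoint";
    print "a vs b: $ab\n";    # a vs b: overlap  (both use position 3)
    print "a vs c: $ac\n";    # a vs c: disjoint

Because the vectors are ordinary strings, no extra modules are needed and the test stays cheap even for long words.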
12. I was surprised to see so much use of the 'Memoize' module; I thought that Memoize was one of those things that nobody ever used. I guess I was wrong! But I was also surprised to see so much *inappropriate* use of 'Memoize'. Memoize isn't magical; all it does is install a cache on the front of a function. Managing the cache incurs overhead. The hope is that the time saved by retrieving cached values will make up for the time spent managing the cache. If not, the memoization makes the function slower, not faster. In particular, if the function is not called repeatedly with the same arguments, the memoization will not help. (See http://perl.plover.com/yak/memoize-quant/ for details.)

One example of this is in sidhekin.pl. Eliminating the memoization speeds up the program substantially.

13. Dan Schmidt was puzzled as to why there were so many incorrect solutions. Dan said:

    I started by thinking of the greedy algorithm (start with the
    biggest match), but first I figured I would prove to myself that
    the greedy algorithm works... and instead I quickly found a
    counterexample.

    ...I am surprised how many people seemed to dash out an algorithm
    that looked reasonable without looking more carefully to see if it
    could fail. I think that skill is just as important as neat regexp
    tricks.

I wondered this myself. As you can see from the table above, more than half of the posted solutions didn't work. Even worse, people kept posting the same incorrect solutions even after the earlier incorrect solutions had been discussed, after I posted sample dictionaries, and after I posted a short list of 'hard cases' that tended to produce failures for all these programs! What does this mean?

Thanks to everyone who participated in the quiz, including both those who mailed the -discuss list and those who did not.

Quiz #5 will be along today at about 2100 GMT.