Sample solutions and discussion Perl Quiz of The Week #22 (20040825) Write a program, 'wordladder', which gets two arguments, which are words of the same length, and which constructs and prints a "word ladder" from the first word to the second word. A word ladder from word AAA to word BBB is a sequence of dictionary words such that: 1. the first word in the sequence is word AAA 2. each word in the sequence after the first differs from the previous word in exactly one letter position 3. the last word in the sequence is word BBB For example, given the two words "love" and "hate", the program might print the word ladder: love hove have hate Or it might print: love lave have hate It might also print a longer word ladder, such as love lore lobe robe role rose lose lost most mosh moth math hath hate If the program is unable to find a word ladder, it should print an appropriate error message to the standard error, and exit with a failure status. The program should also accept an optional third argument, which, if specified, is the name of a dictionary file which contains the permissible words. If the third argument is omitted, the program should use a default dictionary. Sample word lists are available from http://perl.plover.com/qotw/words/ --------------------------------------------------- [ MJD: This week's report is a collaboration between three people. Ron Isaacson tested and timed the many solutions and wrote up a timing report. Daniel Martin read over the solutions and analyzed the algorithms used. I selected one sample solution and wrote up an explanation of how it works. My very grateful thanks to Prs. Isaacson and Martin for bailing me out of this one when I bit of far more than I could chew. I now turn you over to Pr. Martin. ] [Daniel Martin] This was a popular quiz, judging by the number of solutions. 23 solutions were submitted from 21 different people. Three solutions weren't tested for one reason or another, leaving 21 solutions to be tested and timed. Many thanks to Ron Isaacson for doing the time-testing. The overall algorithm used by each solution can be classified as follows: - "Expanding circles": 1) Start with just the word AAA as your working set. 2) Find every word which is only one letter different from something in the working set and that hasn't been seen before and make that your new working set. 3) Repeat (2) until BBB is in the working set. This solution can be viewed as Dijkstra's shortest-path algorithm applied to the specifics of this problem. (see http://en.wikipedia.org/wiki/Dijkstra's_algorithm) A variant of this has the circles expanding from both AAA and BBB, and stopping when the two sets intersect. One solution, from Rod Adams, computed each subsequent expanding circle by gathering the words that might be in the next working set and then eliminating those that were not adjacent to something in the previous working set. (It then switched to the more standard version based on an estimate as to which method would be faster) - "workqueue": 1) Place AAA in a queue. 2) Pull a word from the queue and push onto the queue all the adjacent words we haven't seen before. 3) Repeat (2) until BBB is pulled from the queue. (Variation: until BBB is about to be pushed onto the queue) This solution can be viewed as the A* algorithm applied to the specifics of this problem with no estimate of future path length. (see http://en.wikipedia.org/wiki/A-star_search_algorithm) A few people extended this by using a full-blown priority queue, with the priority of a word being (distance from w to BBB) + (length of path to w from AAA) This is the A* algorithm with a distance estimate. As before, there were also bi-directional variants that attacked the path from both ends, stopping when the words seen intersected. Yes, sometimes the classification between "workqueue" and "expanding circles" was a bit of a judgement call about how the code was organized overall. In general if - "depth first": 1) Start at AAA. 2) Pick an adjacent word that hasn't been visited yet. 3) Repeat (2) until you reach BBB or visit everything. If there are no available adjacent words, backtrack. Only two solutions attempted this. Both also sorted the adjacent words so as to first try words closer to BBB. After overall algorithm choice, there were still several other design decisions that could be made. A common source of variation was in how the dictionary was stored. Most solutions used a variation on this design: (code adapted from Colin Meyer's solution) open DICT, '<', $DICT or die "can't open $DICT: $!"; @words = grep length( $_ ) == $SIZE, map { chomp; lc } ; close DICT; for my $w ( @words ) { my @rungs = map { my $c = $w; substr( $c, $_, 1 ) = '_'; $c } 0..($SIZE-1);; push @{ $rung2word{ $_ } }, $w for @rungs; } That is, stored the dictionary as a hash of pattern => wordlist, where "pattern" is the word in question with one character replaced by an '_' (or '?' or '.' or '\0') and "wordlist" is a reference to all words with that pattern. Occasionally, the map word => patternlist would also be recorded. Storing the words in a hash that mapped each word to 1 (or some other value) was also popular. (So that keys %words was used whenever the full list of words was needed) A few stored the dictionary in a single giant array, searching through it with regexps or with Inline::C functions when finding adjacent words. There were also some non-perl solutions submitted: 1 in PIR (Parrot Intermediate Representation - assembly language for Perl 6, sort of), 2 in python, and 1 in scheme. Unfortunately, I have no PIR interpreter and neither of my scheme interpreters was able to handle the submitted scheme solution (guile chokes on the variable name with asterisks in it and mzscheme doesn't know about 'sort'). The python solutions were tested along with the others. Several of the solutions failed to check for certain edge cases. For example, if AAA == BBB, 4 solutions found no path. 3 more solutions found a path only if AAA was adjacent to something (so testing on "transubstantiationalist transubstantiationalist" showed no path for those, but testing "love love" did find a path) Also, some solutions had trouble with very large dictionaries or with long paths. At least one solution made the erroneous assumption that if AAA and BBB differ only in positions 1, 3, and 5, then all the words along the path from AAA to BBB will differ from each other only in positions 1, 3, and 5. This is shown false by the path from "axal" to "utah" in the Web2 dictionary, which among other words passes through "itch". Because of the large number of solutions, I won't go through each one and list the classification. Instead, this chart might be useful: dictionary ->|sig>wordlist|word>1|@DICT search method | | | ================================================== Expanding circles| 0 | 1 | 2 2-dir version | 5 | 0 | 1 -------------------------------------------------- workqueue | 3 | 2 | 0 + distance ver | 2 | 0 | 0 2-dir version | 0 | 1 | 0 -------------------------------------------------- depth first | 0 | 2 | 1 I didn't classify the scheme solution from Greg Bacon, nor the PIR solution from Ingo Blechschmidt. I also did not classify the perl solution sent in by David B., since I was unable to puzzle it out. (I was also unable to test it - it's doing something that eats memory like crazy.) I will note that in general, the depth first solutions all performed very poorly on this quiz, despite two of them coding crucial parts in C. The overall fastest solution was mine, the 2-direction workqueue, though not by much. Ron Isaacson's beautifully-formatted timing report is available at: http://perl.plover.com/qotw/misc/e022/report.html The raw timing data, and the submitted programs, are available at http://perl.plover.com/qotw/misc/e022/ Although the problem merely specified _a_ path from AAA to BBB, many solutions found the shortest path. Therefore, some people speculated on what the "longest shortest path" might be for different word lengths in the different dictionaries. The results can be found in the thread starting here: http://perl.plover.com/~alias/list.cgi?1:mss:2054 ---------------------------------------------------------------- [Mark Dominus] I've decided to use Zed Lopez's program as this week's sample solution. According to Ron Isaacson's report, it's consistently one of the very fastest sumbissions, and it's much shorter than the other comparably fast solutions. Part of that may be due to Zed's use of the Tree::Simple module, which provides functions that manage tree structures: #!/usr/bin/perl use strict; use warnings; use Tree::Simple; And part of it may be due to a somewhat excessively terse coding style, as seen here: my ($start_word, $destination_word, $dictionary_file) = @ARGV; die "You must specify two words of equal length" unless defined $start_word and defined $destination_word and length $start_word == length $destination_word; But on the whole I found the program straightforward and easy to understand. In Daniel Martin's classification, this was an "expanding circles" solution, of the "two directions at once variety". Pr. Lopez's program maintains two working sets of words, and operates on them alternately. On each pass, it finds all the words that are one step removed from the words in the current working set, and replaces the current working set with this new set of words. When it finds a word that is already in the *other* working set, it knows it has found a complete path. $dictionary_file ||= "/usr/share/dict/web2"; open (FH, $dictionary_file) or die "Couldn't open $dictionary_file: $!"; $start_word = lc $start_word; $destination_word = lc $destination_word; my %dict; while () { chomp; $dict{lc $_} = 1 if length $_ == length $start_word and $_ !~ /[^a-zA-Z]/; } close(FH); die "$start_word is not in dictionary" unless exists $dict{$start_word}; die "$destination_word is not in dictionary" unless exists $dict{$destination_word}; The dictionary is the usual Perl implementation of a set, with words as keys, and the values always 1. This code handles the special case if the start word and the end word being the same. Had I thought about it in advance, I might have ruled this out when I posed the problem: print "$start_word\n" and exit if $start_word eq $destination_word; $list and $next_nodes are the main data structures of the program: my ($list, $next_nodes); { my $top = Tree::Simple->new($start_word); my $bottom = Tree::Simple->new($destination_word); $list = [{$start_word => $top}, {$destination_word => $bottom}]; $next_nodes = [[$top],[$bottom]]; } The Tree::Simple objects keep track of tree structures whose roots are labeled with the start and end words; each tree node is labeled with another word. If BBB is in the start-word tree, then a word ladder from the start word to BBB can be found by tracing the path from BBB back to the root. The program's wordchain() function does this: sub word_chain { my $node = shift; my @words; while (1) { push @words, $node->getNodeValue(); last if $node->isRoot; $node = $node->getParent; } return @words; } (->getNodeValue returns the word with which the node is labeled.) $list contains exactly two hashes, one working 'down' from the source word and the other working 'up' from the destination word. Each hash maps words to the Tree::Simple objects that represent them. $next_nodes contains the two 'working sets'. The main loop of the program is: for (my $x = 0; ; $x = !$x) { $next_nodes->[$x] = find_next_nodes($next_nodes->[$x], $x); } The '$x = !$x' is a little obscure; he really means '$x = 1 - $x' here. The main loop alternates between the two working sets, replacing the old working set with the set of words adjacent to it. Calculating this new set of words is the job of find_next_nodes(): sub find_next_nodes { my ($nodes, $x) = @_; my @next_nodes; for my $node (@$nodes) { my $orig_word = $node->getNodeValue(); for (my $i = 0; $i < length $orig_word; $i++) { my $word = $orig_word; for my $char ('a'..'z') { next if $char eq substr $orig_word, $i, 1; substr $word, $i, 1, $char; This last line uses the four-argument form of substr(), which is newish. It was introduced by analogy with the four-argument form of splice(). substr($word, $i, 1, $char) locates the length-1 substring of $word at position $i, just like substr($word, $i, 1), but then replaces this substring with the contents of $char. It is just like >MJD> substr($word, $i, 1) = $char; only faster. The program has just calculated a new word, $word, that is one step removed from some word $orig_word that was in the working set. If this new word is already in the *other* working set, then a complete path has been found: # if word is in other list, we're done success($x ? ($list->[!$x]->{$word}, $node) : ($node, $list->[!$x]->{$word})) if exists $list->[!$x]->{$word}; I would have formatted this differently. I think it's clearer like this: >MJD> success($x ? ($list->[1-$x]->{$word}, $node) >MJD> : ($node, $list->[1-$x]->{$word})) >MJD> if exists $list->[1-$x]->{$word}; The success() function, which we'll see shortly, is responsible for tracing the paths from the common word back to the roots of the two trees, and generating the resulting word ladder. Its arguments are the nodes of the two trees that have the same label, with the node from the top-down tree first. If $x is true (that is, 1) then the current working set is the bottom-up one, so needs to be put second in the argument list of success(). If the program hasn't been succesful, it extends the current tree structure and the new working set: if (exists $dict{$word} and !exists $list->[$x]->{$word}) { my $child = Tree::Simple->new($word); $node->addChild($child); push @next_nodes, $child; $list->[$x]->{$word} = $child; } } } If there were no new words that could be reached from the current node, it is a dead end and is deleted from the tree structure to save memory: prune($node) if $node->isLeaf; } return \@next_nodes; } Here's prune(), which deletes the dead end. After the dead-end node is deleted, prune() checks to see if its parent is now a dead end also: sub prune { my $node = shift; while ($node->isLeaf) { die "No path exists" if $node->isRoot; my $parent = $node->getParent; $parent->removeChild($node); $node = $parent; } } The success() function was called when two nodes were found with the same label, one in each of thee trees. It uses word_chain() to find the word ladders from the common word up to the source word (at $node1) and down to the destination word (at $node2) and appends the two ladders together: sub success { my ($node1, $node2) = @_; print join "\n", (reverse word_chain($node1)), word_chain($node2), ''; exit; } The function avoids listing the common word twice because the common word is never added to the second tree. One of the arguments to success() is the node containing the common word; the other is the node in the other tree that would have been adjacent to the common-word node. Once again, thanks to everyone who participated by sending a solution or adding to the discusion, and thanks also to everyone who participated *without* sending a solution or adding to the discussion. Thanks especially to Ron Isaacson and Daniel Martin. John Trammell is working hard on the writeup of last week's parenthesis-generating quiz; I will send that along when it is ready. I hope to send out the new quiz tonight.