Sample solutions and discussion Perl Quiz of The Week #8 (20021211) Bill Gosper, a famous programmer, once said that a good way to manufacture word puzzles was to look through the dictionary for a word that contains a sequence of four letters that does not appear in any other word. Then the puzzle is to guess the word, given only the four letters. For example, what common English word contains the contiguous sequence of the four letters 'acur'? (Gosper says that you see this word every week, but that it will take you a month to figure out what it is.) Write a Perl program which, given a dictionary, generates two output files, 'questions' and 'answers'. 'questions' should contain every sequence of four letters that appears in exactly one word of the dictionary, one sequence per line. 'answers' should contain the corresponding words that contain the sequences, in the same order, again one per line. For example, given the trivial dictionary containing only arrows carrots give me The outputs should be: 'questions' 'answers' carr carrots give give rots carrots rows arrows rrot carrots rrow arrows Of course, 'arro' does not appear in the output, since it is found in more than one word. Here's a sample program, provided by Jonathan Scott Duff. I trimmed it a little. # Well, I see bunches of other people posting their solutions to the # regular quiz, so here's mine: # #!/usr/bin/perl $SEG_LENGTH = 4; while (<>) { chomp; next if /\W/; $w = lc $_; %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 0..length($w)-$SEG_LENGTH; for $w (keys %w) { $wordmap{$w} = exists $wordmap{$w} ? undef : $_; } } open(Q,">questions") or die; open(A,">answers") or die; for (sort keys %wordmap) { next unless defined $wordmap{$_}; print Q "$_\n"; print A "$wordmap{$_}\n" } close Q; close A; The main data structure in the program is the hash %wordmap. Keys in %wordmap are strings of length 4. The value associated with a key $k is the word in which $k appears, if $k appears in only one word, and an undefined value if $k appears in more than one word. The program first converts each input word to all lowercase, and then uses 'map' to construct a hash, %w, whose keys are the length-4 segments of the word. For example, if the word is 'phlebotomy', the hash is ('phle' => 1, 'hleb' => 1, 'lebo' => 1, 'ebot' => 1, 'boto' => 1, 'otom' => 1, 'tomy' => 1, ) The 1's aren't significant'; they're just placeholders. Using a hash in this way is a common Perl idiom for representing a set of strings. The program then loops over the keys, looking up each one in %wordmap. If a key was already in wordmap, then this is at least the second time it has been seen, so the program sets the associated value to 'undef', to indicate that it has appeared more than once. If the key isn't in %wordmap yet, then it's inserted into %wordmap, and the associated value is the single word in which it has appeared. After generating %wordmap, the program writes out the questions and answers files, sipping over any elements of %wordmap whose values are undefined. *. Since the values in the %w hash are never used or examined at all, it might seem that we could dispense with them, replacing %w = map { substr($w,$_,$SEG_LENGTH) => 1 } ... ; for $w (keys %w) { ... } with @w = map { substr($w,$_,$SEG_LENGTH) } ... ; for $w (@w) { ... } This was a common error in the submitted programs. The problem it causes occurs with words like 'alfalfa' and 'lightweight' which contain the same sequence of four letters more than once. The second version of the code sets @w to ('alfa', 'lfal', 'falf', 'alfa') and then iterates over this list, processing 'alfa' twice. It then erroneously marks 'alfa' in %wordmap as appearing in two words when in fact it has appeared twice in only one word. To avoid this, we must be sure to process each sequence of four letters at most once per word. Storing the sequences as keys in the %w hash ensures this, because hash keys are unique. The %w generated for 'alfalfa' is ('alfa' => 1, 'lfal' => 1, 'falf' => 1) and so iterating over the keys processes 'alfa' only once. 1. A way to fix the problem without introducing another hash appears in Ronald Kimball's program. Ronald's program solves the problem more directly: for $w (keys %w) { $wordmap{$w} = exists $wordmap{$w} && $_ ne $wordmap{$w} ? undef : $_; } The second and subsequent times that the program sees a particular sequence, it throws away the stored word only if it's different from the current word. As written above, the program generates a huge number of 'uninitialized value' warnings because the 'undef' values stored in the hash to indicate a sequence that has been seen two or more times are compared with $_. Ronald's program uses 1 instead of undef, so doesn't generate any warnings. Another way to slience the code above is to shut off warnings.p 2. Mr. Duff's program, as submitted, actually finds unique sequences of 'n' letters, where 'n' defaults to 4, the number specified in the question. If it's run as duff.pl -n 2 < dictionary it finds digraphs (pairs of letters) that occur in only one word each. To make the code simpler, I trimmed this out and replaced the $opt{'n'} parameter with $SEG_LENGTH. 3. As a trivium, here's the output for n=2: bg bw dz fc fj fp fw gj hq hv iy jr lj qa sj vk vs vz wz xb xn xq xs xv xx yj yq zd zg zm zp Of these, 13 make good puzzles: bg bw fc fp fw gj hq lj sj wz xq xs yj The rest are either proper nouns (both English or otherwise) hv iy qa vs xb xn xv xx wq zd zg zm zp or are of visibly foreign origin ('resident aliens'): dz fj vk vz or are abbreviations: jr I think my favorite one is probably 'hq'. 4. People sometimes suggest that Perl's '..' operator should construct a backwards-counting range if the second operand is smaller than the first. For example, they say that 4..0 should produce the list (4, 3, 2, 1, 0). At present, it produces the empty list. This program demonstrates one of the many reasons why this is a bad idea. Consider this part of the code: %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 0..length($w)-$SEG_LENGTH; Suppose $SEG_LENGTH is 4 and $w is "cat". The operands of the '..' are 0..-1. With the existing semantics for '..', the '..' generates an empty list for 'map' it iterate over, the hash %w becomes empty, and the word is effectively skipped--just the right thing. With the defective alternative behavior, the '..' would generate the list (0, -1), and the 'map' generates the (bizarre) list ('cat', 't'). To get correct behavior, the code would have to be adjusted with a special case to check for length($w) < $SEG_LENGTH. A similar example concerns this construction: @rest = @a[2..$#a]; Here the intent is to copy the third-through-last elements of @a. For example, if @a contains 10 elements, $#a is 9, and @rest gets elements 2 through 9. If @a contains only one element, the '2..1' expands to an empty list, and @rest is assigned nothing---which is just what was wanted. With the alternative semantics, the '2..$#a' expands to (2, 1), and @rest is assigned two undefined values. Again, a special case is necessary to guard against precisely the behavior of .. that was proposed. (If you do want to count backwards, use something like reverse(1..$n) .) 5. As usual, many people submitted programs that did not adhere to the interface I asked for in the question, making various gratuitous changes to the input semantics, the output file names, the output format, or whatever. Unlike in the past, I decided not to repair these. The changes that puzzled me most were the ones that replaced the two output files ('questions' and 'answers') with a single output file. I agree that this is simpler and more natural. Usually I would have specified a single file with two columns. But in this case that format is no good, because when you try to pick out a puzzle, you see the answer right next to it, which spoils the fun. 6. One common variation, particularly among the shorter programs, was to use a tricky regex to generate the substrings, instead of the loop shown above. For example while (/(?=(.{4}))/g) { ... } was a popular trick. (This iterates the 'while' loop once for each four-letter sequence, with $1 set to each sequence in turn.) 7. When I first tested the programs, I got a surprise. Everyone's programs ran very quickly, except mine, which was by far the slowest of the bunch. I wondered what elementary mistake I must be making. Unfortunately, it turned out to be an error in the test apparatus, not a programming mistake in my program. (I had been looking forward to discussing it.) I had forgotten to trim the email headers out of the other programs, so mine was the only one that wasn't aborting immediately with multiple syntax errors. Once again, my thanks to everyone who participated. I will send out a new quiz on yesterday.