Sample solutions and discussion
Perl Quiz of The Week #8 (20021211)

        Bill Gosper, a famous programmer, once said that a good way to
        manufacture word puzzles was to look through the dictionary
        for a word that contains a sequence of four letters that does
        not appear in any other word.  Then the puzzle is to guess the
        word, given only the four letters.

        For example, what common English word contains the contiguous
        sequence of the four letters 'acur'?  (Gosper says that you
        see this word every week, but that it will take you a month to
        figure out what it is.)

        Write a Perl program which, given a dictionary, generates two
        output files, 'questions' and 'answers'.  'questions' should
        contain every sequence of four letters that appears in exactly
        one word of the dictionary, one sequence per line.  'answers'
        should contain the corresponding words that contain the
        sequences, in the same order, again one per line.

        For example, given the trivial dictionary containing only

                arrows
                carrots
                give
                me

        The outputs should be:

                'questions'             'answers'

                carr                    carrots
                give                    give
                rots                    carrots
                rows                    arrows
                rrot                    carrots
                rrow                    arrows

        Of course, 'arro' does not appear in the output, since it is
        found in more than one word.


Here's a sample program, provided by Jonathan Scott Duff.  I trimmed
it a little.

        # Well, I see bunches of other people posting their solutions to the
        # regular quiz, so here's mine:
        # 
        #!/usr/bin/perl

        $SEG_LENGTH = 4;

        while (<>) {
           chomp;
           next if /\W/;		
           $w = lc $_;
           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 
                    0..length($w)-$SEG_LENGTH;

           for $w (keys %w) { 
              $wordmap{$w} = exists $wordmap{$w} ? undef : $_; 
           }
        }

        open(Q,">questions") or die;
        open(A,">answers") or die;
        for (sort keys %wordmap) {
           next unless defined $wordmap{$_};
           print Q "$_\n";
           print A "$wordmap{$_}\n"
        }
        close Q; close A;


The main data structure in the program is the hash %wordmap.  Keys in
%wordmap are strings of length 4.  The value associated with a key $k
is the word in which $k appears, if $k appears in only one word, and
an undefined value if $k appears in more than one word.

The program first converts each input word to all lowercase, and then
uses 'map' to construct a hash, %w, whose keys are the length-4
segments of the word.  For example, if the word is 'phlebotomy', the
hash is

        ('phle' => 1,
         'hleb' => 1,
         'lebo' => 1,
         'ebot' => 1,
         'boto' => 1,
         'otom' => 1,
         'tomy' => 1,
        )

The 1's aren't significant'; they're just placeholders.  Using a hash
in this way is a common Perl idiom for representing a set of strings.
The program then loops over the keys, looking up each one in %wordmap.
If a key was already in wordmap, then this is at least the second time
it has been seen, so the program sets the associated value to 'undef',
to indicate that it has appeared more than once.  If the key isn't in
%wordmap yet, then it's inserted into %wordmap, and the associated
value is the single word in which it has appeared.

After generating %wordmap, the program writes out the questions and
answers files, sipping over any elements of %wordmap whose values are
undefined.  

*. Since the values in the %w hash are never used or examined at all,
   it might seem that we could dispense with them, replacing

           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } ... ; 
                    

           for $w (keys %w) { 
              ...
           }
   with

           @w = map { substr($w,$_,$SEG_LENGTH) } ... ;

           for $w (@w) { 
              ...
           }

   This was a common error in the submitted programs.  The problem it
   causes occurs with words like 'alfalfa' and 'lightweight' which
   contain the same sequence of four letters more than once.  

   The second version of the code sets @w to

        ('alfa', 'lfal', 'falf', 'alfa')

   and then iterates over this list, processing 'alfa' twice.  It then
   erroneously marks 'alfa' in %wordmap as appearing in two words when
   in fact it has appeared twice in only one word.  To avoid this, we
   must be sure to process each sequence of four letters at most once
   per word.  Storing the sequences as keys in the %w hash ensures
   this, because hash keys are unique.  The %w generated for 'alfalfa'
   is

        ('alfa' => 1, 'lfal' => 1, 'falf' => 1)

   and so iterating over the keys processes 'alfa' only once.

1. A way to fix the problem without introducing another hash appears
   in Ronald Kimball's program.  Ronald's program solves the problem
   more directly:

   for $w (keys %w) { 
      $wordmap{$w} = exists $wordmap{$w} && $_ ne $wordmap{$w} ? undef : $_; 
   }

   The second and subsequent times that the program sees a particular
   sequence, it throws away the stored word only if it's different
   from the current word.

   As written above, the program generates a huge number of
   'uninitialized value' warnings because the 'undef' values stored in
   the hash to indicate a sequence that has been seen two or more
   times are compared with $_.  Ronald's program uses 1 instead of
   undef, so doesn't generate any warnings. Another way to slience the
   code above is to shut off warnings.p

2. Mr. Duff's program, as submitted, actually finds unique sequences
   of 'n' letters, where 'n' defaults to 4, the number specified in
   the question.  If it's run as

        duff.pl -n 2 < dictionary

   it finds digraphs (pairs of letters) that occur in only one word
   each.  To make the code simpler, I trimmed this out and replaced
   the $opt{'n'} parameter with $SEG_LENGTH.

3.  As a trivium, here's the output for n=2:

        bg bw dz fc fj fp fw gj hq hv iy jr lj qa sj 
        vk vs vz wz xb xn xq xs xv xx yj yq zd zg zm zp

   Of these, 13 make good puzzles:

        bg bw fc fp fw gj hq lj sj wz xq xs yj

   The rest are either proper nouns (both English or otherwise)

        hv iy qa vs xb xn xv xx wq zd zg zm zp

   or are of visibly foreign origin ('resident aliens'):

        dz fj vk vz

   or are abbreviations:

        jr

   I think my favorite one is probably 'hq'.


4. People sometimes suggest that Perl's '..' operator should construct
   a backwards-counting range if the second operand is smaller than
   the first.  For example, they say that 4..0 should produce the list
   (4, 3, 2, 1, 0).  At present, it produces the empty list.

   This program demonstrates one of the many reasons why this is a bad
   idea.  Consider this part of the code:

           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 
                    0..length($w)-$SEG_LENGTH;

   Suppose $SEG_LENGTH is 4 and $w is "cat".  The operands of the '..'
   are 0..-1. With the existing semantics for '..', the '..' generates
   an empty list for 'map' it iterate over, the hash %w becomes empty,
   and the word is effectively skipped--just the right thing.

   With the defective alternative behavior, the '..' would generate
   the list (0, -1), and the 'map' generates the (bizarre) list 
   ('cat', 't').  To get correct behavior, the code would have to be
   adjusted with a special case to check for length($w) < $SEG_LENGTH.

   A similar example concerns this construction:

        @rest = @a[2..$#a];

   Here the intent is to copy the third-through-last elements of @a.
   For example, if @a contains 10 elements, $#a is 9, and @rest gets
   elements 2 through 9.  If @a contains only one element, the '2..1'
   expands to an empty list, and @rest is assigned nothing---which is
   just what was wanted.

   With the alternative semantics, the '2..$#a' expands to (2, 1), and
   @rest is assigned two undefined values.  Again, a special case is
   necessary to guard against precisely the behavior of .. that was
   proposed.  

   (If you do want to count backwards, use something like

        reverse(1..$n) .)

5. As usual, many people submitted programs that did not adhere to the
   interface I asked for  in the question, making various gratuitous
   changes to the input semantics, the output file names, the output
   format, or whatever.  Unlike in the past, I decided not to repair
   these.  

   The changes that puzzled me most were the ones that replaced the
   two output files ('questions' and 'answers') with a single output
   file.  I agree that this is simpler and more natural.  Usually I
   would have specified a single file with two columns.  But in this
   case that format is no good, because when you try to pick out a
   puzzle, you see the answer right next to it, which spoils the fun.

6. One common variation, particularly among the shorter programs, was
   to use a tricky regex to generate the substrings, instead of the
   loop shown above.  For example

        while (/(?=(.{4}))/g) {
          ...
        }

   was a popular trick.  (This iterates the 'while' loop once for each
   four-letter sequence, with $1 set to each sequence in turn.)  

7. When I first tested the programs, I got a surprise.  Everyone's
   programs ran very quickly, except mine, which was by far the
   slowest of the bunch.  I wondered what elementary mistake I must be
   making.  Unfortunately, it turned out to be an error in the test
   apparatus, not a programming mistake in my program. (I had been
   looking forward to discussing it.)  I had forgotten to trim the
   email headers out of the other programs, so mine was the only one
   that wasn't aborting immediately with multiple syntax errors.


Once again, my thanks to everyone who participated.  I will send out a
new quiz on yesterday.