Sample solutions and discussion
Perl Quiz of The Week #9 (20021218)

        You will write a simple spelling checker program, similar to the
        Unix 'spell' utility.  The program should be called 'spel'.  It
        will read a document from standard input and print on standard
        output a list of all the misspelled words in the document.  If
        any command line arguments are given, 'spel' should read those
        files instead of the standard input.

        The output words should be in lexicographic order, and no word
        should appear more than once in the output.

        'spel' will be given one or more dictionaries of words that are
        already spelled correctly.  It will always try to read the file
        '/usr/dict/words'.  It will also try to read '.spel' files from
        certain directories.  If the user has set an environment
        variable SPELWORDS, 'spel' should interpret its value as a
        :-separated list of directories to be searched for '.spel'
        files.  If no SPELWORDS variable is set, 'spel' should search
        in the user's home directory and in the current directory.

        If you need a sample dictionary, you can obtain one from

                http://perl.plover.com/qotw/words/

----------------------------------------------------------------

Here's sample code, submitted by Abigail:

    #!/usr/bin/perl
    #
    # The exercise isn't clear about what is to be considered a word,
    # or how to deal with capitalization.
    #
    # This program considers words to be substrings consisting of only
    # 'alpha' characters.  This means that 'words' like "isn't" are
    # considered to be two words, 'isn' and 't'.
    #
    # As for capitalization, words in the text should have the same
    # capitalization as in the dictionary.  However, since words starting
    # a sentence are capitalized, we permit the first letter of a word
    # to be capitalized, even if the dictionary only has the all lower
    # case version of the word.  No attempt at parsing sentences to
    # detect first words has been made.
    #

    use strict;
    use warnings;

    my @std_dicts;   # The default dictionaries.
    my @spel_dirs;   # Directories to look for .spel files.
    my @dicts;       # List of dictionaries.
    my %words;       # Words found in the dictionaries.
    my %mistakes;    # Mistakes in the file(s).

    @std_dicts = ("/usr/dict/words",         # The exercise specifies this file,
                  "/usr/share/dict/words");  # but on my system, it's found here.

    @dicts = grep {-f} @std_dicts;   # So, we'll do some juggling,
    splice @dicts => 1 if @dicts;    # making sure we use at most 1 file.

    # Adding the ".spel" files.
    @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                          : ($ENV {HOME}, ".");

    push @dicts => grep {-f} map {"$_/.spel"} @spel_dirs;

    #
    # Init the dictionaries.
    #
    {
        local @ARGV = @dicts;
        while (<>) {
            chomp;
            $words {$_} = 1;
            $words {+ucfirst} = 1 unless /[[:upper:]]/;
        }
    }

    #
    # Read the text, record all words not found in a dictionary.
    #
    while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

    #
    # Print the mistakes, sorted.
    #
    print "$_\n" for sort keys %mistakes;

    __END__

----------------------------------------------------------------

Abigail's program has four phases.  First there's an initialization section, in which it determines which dictionaries to use.  Then it loads words from the dictionaries into a hash, %words.  Third, the program loops over the manuscript input, checking each word against %words; words not present in %words are noted in %mistakes.  Finally, the program prints out the words from %mistakes.
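Before walking through the phases, here's a quick way to see the program's behavior end to end.  This harness is my own sketch, not part of any submission; it assumes the checker has been saved as './spel' and that a system dictionary is present:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Create a small personal dictionary.  With SPELWORDS unset, 'spel'
    # looks for a '.spel' file in $HOME and in the current directory.
    open my $dict, '>', '.spel' or die "Can't create .spel: $!";
    print $dict "qotw\n";
    close $dict;

    # Feed a sample document to the checker on its standard input.
    # ($^X is the path of the currently running perl binary.)
    open my $pipe, '|-', $^X, './spel' or die "Can't run spel: $!";
    print $pipe "This qotw sentence contaisn two mispeled words.\n";
    close $pipe;

    # Expected output, sorted, one word per line:
    #   contaisn
    #   mispeled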
The initialization section first decides where the standard dictionary is; my problem statement said it would be in '/usr/dict/words', but on many systems (including mine---hmmm) it's in '/usr/share/dict/words'.  Abigail's code prefers the former if it exists, but if not it uses the latter:

    @std_dicts = ("/usr/dict/words",
                  "/usr/share/dict/words");

    @dicts = grep {-f} @std_dicts;
    splice @dicts => 1 if @dicts;

Note that the code does not depend on there being exactly two items in @std_dicts; you can list as many standard dictionaries as you want, in the order you would prefer to try them, and the program will use the first one it finds.  But I wonder if it might not have been more perspicuous to write something like

    @std_dicts = ("/usr/dict/words",
                  "/usr/share/dict/words");
    my ($std_dict) = grep {-f} @std_dicts;

and then use

    local @ARGV = (@dicts, $std_dict);

later on.

Initializing the dictionaries uses a handy trick that all Perl programmers should be aware of.  Everyone knows about the <> operator, which reads a line of input from the files named on the command line, or from the standard input if none are named.  What many people aren't aware of is that you can fool it about what the command-line files are.  This is what Abigail is doing here:

    {
        local @ARGV = @dicts;
        while (<>) {
            chomp;
            $words {$_} = 1;
            $words {+ucfirst} = 1 unless /[[:upper:]]/;
        }
    }

'local @ARGV' temporarily resets the value of @ARGV, which is what <> looks at to determine the command-line arguments.  Since the files named in @ARGV are exactly the names of the dictionaries, the <> operator reads one line at a time from each dictionary.  At the end of the block, the effect of the 'local' is undone and @ARGV resumes its original value.

The 'ucfirst' code here takes care of a detail that several submitters forgot.  If the dictionary contains the word 'carrot', we would like to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes care of this; if the word 'carrot' appears in the dictionary file, then both 'carrot' and 'Carrot' are placed in the hash.

For the dictionaries I supplied, the '/[[:upper:]]/' special case is never meaningfully exercised.  It would become important if the dictionary contained a word like 'iMac', which contains uppercase letters but whose initial letter is not uppercase.  The guard condition would prevent 'IMac' from being added to the dictionary.  It's not clear to me that this is really the right thing to do, however.  (Does 'iMac' get capitalized at the start of a sentence?  I don't know.)

The '+' on the 'ucfirst' prevents Perl from taking 'ucfirst' as a literal hash key.

The manuscript input is read by another '<>' loop:

    while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

The order of control flow here may not be clear.  It's equivalent to:

    while (<>) {
        while (/([[:alpha:]]+)/g) {
            unless ($words {$1}) {
                $mistakes {$1} = 1;
            }
        }
    }

(This sort of thing is the reason that some people love statement modifiers and other people hate them.)

The interesting feature here is the use of '//g' to locate the words in each line.  In general,

    while (/..../g) { ... }

will search through $_, repeating the 'while' loop once for each time the pattern matches.  The pattern '[[:alpha:]]+' matches sequences of one or more alphabetic characters.  Note that this treats the word "isn't" as two words, 'isn' and 't', and similarly 'pot-au-feu' as 'pot', 'au', and 'feu'.
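Here is a minimal, standalone demonstration of that behavior (my example, not part of Abigail's program):

    my $line = "Isn't the pot-au-feu ready?";
    my @words;
    push @words, $1 while $line =~ /([[:alpha:]]+)/g;
    print "@words\n";    # prints: Isn t the pot au feu ready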
To add apostrophes and hyphens to the list of characters that may appear in a 'word', change this to

    while (/([[:alpha:]'-]+)/g) {

This is fraught with its own dangers; if the input now contains the line

    to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes

then "'carrot'" (quotes included) will be recognized as a 'word' and looked up in the dictionary; but "'carrot'" isn't in the dictionary; only "carrot" is.  Handling all such cases correctly can be rather tricky.

Finally, the list of misspellings is printed out with a straightforward loop:

    print "$_\n" for sort keys %mistakes;

----------------------------------------------------------------

1.  Loading the dictionary was a little tricky.  Several submitters wrote code like this to load the dictionary:

        while (<DICT>) {
            chomp;
            $WORDS{lc $_} = 1;
        }

    The words are smashed to all-lowercase before being stored, which leads their programs to accept some rather peculiar words.  For example, one of the dictionary files I supplied contains the 'word' 'IEEE', the acronym for the Institute of Electrical and Electronics Engineers, which is likely to appear in many technical contexts.  If the case is smashed, the spell-checker will silently accept the word 'ieee', and, since such programs typically smash the case of the input words as well, 'IEeE' and 'CArrot' too.

    Some submitters forgot that 'carrot' in the dictionary indicates that 'Carrot' is also acceptable.  Some remembered, but got the code wrong.  For example, the solution I wrote before I posed the problem loads the words into a hash exactly as they are given, and then checks for a word's presence with:

        $bad{$_}++ unless $d{$_} || $d{lcfirst $_} || $d{ucfirst $_};

    This takes care of 'Carrot' properly, even if the dictionary contains only 'carrot'.  Unfortunately, it also causes the program to silently accept 'larry', even though the dictionary contains only 'Larry'.  Whoops!  It also refuses to accept 'CARROT'; I would consider this a bug.

2.  The punctuational issue is one of those problems that gets more and more complicated the longer you look at it.  At first it seems that it can be solved by just treating hyphen and apostrophe as letters.  But if you do that, your program fails on words that are quoted by being placed between apostrophes, as 'Carrot' is in this sentence.  A second-order approximation is to trim punctuation from the beginning and the end of each word before checking it, but then (as one submitter observed):

        Since I strip off all trailing punctuation, my program as it
        stands will flag 'words' such as 'etc.', 'a.c.', 'a.k.a.' as
        wrong, even if they are in the dictionaries used.

3.  Nearly everyone's programs loaded the dictionary into a hash.  Two submissions didn't.  One loaded the dictionary into an array and did linear search on the array.  On a 10-word file, this program took 9 sec to check the file; Abigail's program took 25 sec; most of the extra time was taken up by constructing the hash.  (Some of the extra time occurred during the global destruction phase, after the program had completed; apparently the dictionary hash was destroyed one key at a time, which I don't understand.)  But even with this extra overhead, the hash approach won for any file that wasn't trivially small.  For a 270-word file, the linear search program took 123 seconds; Abigail's program still took 25 sec.
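If you'd like to reproduce this kind of comparison, here is a rough sketch of mine using the standard Benchmark module; the dictionary location and the probe word are assumptions, and your numbers will vary:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Load the word list once, both as an array and as a hash.
    open my $fh, '<', '/usr/share/dict/words' or die "Can't open dictionary: $!";
    chomp(my @list = <$fh>);
    close $fh;
    my %hash;
    @hash{@list} = (1) x @list;

    my $word = 'zymurgy';    # near the end of the list: worst case for a scan
    cmpthese(-2, {
        hash   => sub { my $ok  = $hash{$word} },
        linear => sub { my ($ok) = grep { $_ eq $word } @list },
    });

A hash lookup costs about the same no matter how large the dictionary is, while each linear search scans the whole array; that is why the linear-search program's total time grows with the size of the input while Abigail's stays flat.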
Another submission generated an enormous tree structure with hashes as the nodes.  This took a long time to build and search (17 seconds to load a small dictionary that Abigail's program dealt with in less than 2 seconds) and a humongous amount of memory (I could not load the Web2 dictionary file).

4.  One submission contained the following code:

        if ($#ARGV == -1) {
            foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
                print qq("$_"\n);
            }
        } else {
            foreach (@ARGV) {
                open FILE, "< $_" or die "Couldn't open input, $_. $!";
                print "\nMispellings in $_:\n";
                foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
                    print qq("$_"\n);
                }
                close FILE;
            }
        }

    The repeated 'foreach' block is a red flag; it suggests that the programmer should look for a way to merge the two blocks, and then see if that makes the code any easier to understand.  In this case, it's easy:

        my $prevARGV = '';
        while (<>) {
            foreach (grep {!exists($words{lc $_})} split /\W+/) {
                print "\nMisspellings in $ARGV:\n" if $ARGV ne $prevARGV;
                print qq("$_"\n);
                $prevARGV = $ARGV;
            }
        }

    Or perhaps

        @ARGV = ('-') unless @ARGV;
        foreach (@ARGV) {
            open FILE, "< $_" or die "Couldn't open input, $_. $!";
            print "\nMisspellings in $_:\n";
            foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
                print qq("$_"\n);
            }
            close FILE;
        }

5.  A number of programs had extra features that I thought substantially reduced the usefulness of the program.  For example, several of the programs produced extraneous output:

        print "\n$word misspelled at line $.\n";

    or

        print "\nMispellings in $_:\n";

    Extraneous output makes the program more difficult to use as a tool.  As specified, the program produces a list of misspelled words.  This list could be fed into another program, such as an editor, which could provide a convenient interface for correcting the misspellings.  The list could be piped into a program which might make guesses about what words were intended.  The output could be directed to the end of './.spel' and then edited.  Extra output makes all of these things more difficult.  At best, it would have to be filtered out before the output would be useful for anything other than human consumption.  Diagnostic messages, if they appear at all, should be printed to STDERR; that is what it is for.

    Here's an example that's particularly egregious:

        print "Do you want to change the default spellcheck settings? (y/n): ";
        chomp (my $choice = <STDIN>);
        if (uc $choice eq 'Y') {
            print "Ignore words all in CAPITALS? (y/n): ";
            chomp (my $ch1 = <STDIN>);
            $ignore_caps = 'Y' if uc $ch1 eq 'Y';
            print "Ignore words with numbers? (y/n): ";
            chomp (my $ch2 = <STDIN>);
            $ignore_nums = 'Y' if uc $ch2 eq 'Y';
        }

    Using this program in any way other than as an interactive application is very difficult.  In spite of all the work that went into the interface, it remains inflexible; the program will only do the things that the programmer imagined.  Options like whether to ignore numerals should be specified non-interactively, on the command line.
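As a sketch of that alternative (the option names are my invention, not from any submission), the two settings above could be accepted with the standard Getopt::Long module:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Long;

    GetOptions(
        'ignore-caps' => \my $ignore_caps,    # skip words in all capitals
        'ignore-nums' => \my $ignore_nums,    # skip words containing digits
    ) or die "Usage: spel [--ignore-caps] [--ignore-nums] [file ...]\n";

    while (<>) {
        for my $word (/([[:alnum:]]+)/g) {
            next if $ignore_caps && $word =~ /^[[:upper:]]+$/;
            next if $ignore_nums && $word =~ /[0-9]/;
            # ... look $word up in the dictionary hash as before ...
        }
    }

Because GetOptions removes the options it recognizes from @ARGV, whatever remains is exactly the list of files for the <> loop, and the program stays usable in a pipeline.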
6.  Robin Szemeti suggested using the Search::Dict module.  Search::Dict is one of those very clever pieces of software that never seems to be useful for anything.  The idea of Search::Dict is this: if the input file is in lexicographic order, items in it can be found with binary search; this should be quick and also memory-efficient, since the whole file needn't be loaded into memory at once.

I tried out a limited version of this, which looks words up in a single dictionary but doesn't actually construct the list of misspellings:

    use Search::Dict;

    open W, "< Web2" or die $!;
    while (<>) {
        for (split /[^a-zA-Z'-]+/) {
            look *W, $_;
        }
    }

On a 2387-word version of this postmortem file, Abigail's program took the usual 27 seconds; the program above took 13.  I was not expecting Search::Dict to do so well.  I then worked up a complete spelling checker based on Search::Dict, and it still took about 13 seconds on the postmortem file.  I was surprised; I had expected the file-reading overhead to be much higher.

On a 39,XXX-word file, however, the hash approach was a big winner.  Abigail's program still took about 27 seconds; the Search::Dict program took 192:

    #!/usr/bin/perl

    use Search::Dict;

    my @std_dicts = ("/usr/dict/words", "/usr/share/dict/words");
    my ($std_dict) = grep {-f} @std_dicts;

    my @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                             : ($ENV {HOME}, ".");
    my @dicts = grep {-f} map {"$_/.spel"} @spel_dirs;

    my @fh;
    for my $dict (@dicts, $std_dict) {
        open my $fh, $dict or die "Couldn't open $dict for reading: $!";
        push @fh, $fh;
    }

    my %missp;
    while (<>) {
        my @words = split /[^a-zA-Z'-]+/;
        WORD:
        for (@words) {
            next if $missp{$_};
            for my $fh (@fh) {
                look $fh, $_;
                my $w = <$fh>;
                chomp $w;
                next WORD if $w eq $_;
            }
            $missp{$_} = 1;
        }
    }

    print join "\n", sort(keys %missp), "";

Search::Dict, by the way, has a lousy interface.  It looks in the file to find the first word that is equal to or greater than its argument, and leaves the filehandle positioned at the place where it found that word; it also returns the position at which the handle was left.  But it doesn't return the word itself, or any indication of whether it matched the argument or not!  The return value, which is the file position, is useless, since you could have gotten it by calling tell() on the filehandle.  So my program has to reread the word at the current file position and then compare it with the argument word, even though Search::Dict has just finished doing exactly that.

7.  There were a bunch of defects which made me think that programs had not been well-tested.  One program wouldn't compile, because it had a missing semicolon.  One program did this:

        push(@dictionaries, qw(~/.spel .spel/));

    The author apparently didn't notice that neither '~/.spel' nor '.spel/' was ever read.  (Perl's open() does not expand '~' the way the shell does, and '.spel/', with its trailing slash, names a directory rather than a file.)

    One program had

        open DICT,"Web2" || die "Error: $!\n";

    so the error message would never appear: || binds more tightly than the comma, so this parses as open DICT, ("Web2" || die ...), and since the string "Web2" is always true, the die can never execute.  If 'Web2' was not present in the current directory when the program was run, it would cheerfully report every word as misspelled.

----------------------------------------------------------------

Other notes:

* Regular quiz #7 was about taking a list of numbers and then fudging the percentages so that they would add up to exactly 100%.  Douglas Wilson contributed a good explanation of why someone might really want to do this, apart from satisfying a demand from a clueless manager.  Suppose your company is in the business of selling assemble-it-yourself kits.  You would like to list a price for each individual part in the kit; the prices should add up to the cost of the whole kit exactly, even when they are rounded off to the nearest penny.  Mr. Wilson's message about this is at

        http://perl.plover.com/~alias/list.cgi?1:msp:1183

* Happy new year, everyone!  I will send out Quiz #10 later this evening.