Sample solutions and discussion
Perl Quiz of The Week #9 (20021218)

        You will write a simple spelling checker program, similar to the
        Unix 'spell' utility.  The program should be called 'spel'.  It
        will read a document from standard input and print on standard
        output a list of all the misspelled words in the document.  If
        any command line arguments are given, 'spel' should read those
        files instead of the standard input.

        The output words should be in lexicographic order, and no word
        should appear more than once in the output.

        'spel' will be given one or more dictionaries of words that are
        already spelled correctly.  It will always try to read the file
        '/usr/dict/words'.  It will also try to read '.spel' files from
        certain directories.  If the user has set an environment
        variable SPELWORDS, 'spel' should interpret its value as a
        :-separated list of directories to be searched for '.spel'
        files.  If no SPELWORDS variable is set, 'spel' should search
        in the user's home directory and in the current directory.

        If you need a sample dictionary, you can obtain one from

                http://perl.plover.com/qotw/words/

----------------------------------------------------------------

Here's sample code, submitted by Abigail:

    #!/usr/bin/perl
    #
    # The exercise isn't clear about what is to be considered a word,
    # or how to deal with capitalization.
    #
    # This program considers words to be substrings consisting of only
    # 'alpha' characters.  This means that 'words' like "isn't" are
    # considered to be two words, 'isn' and 't'.
    #
    # As for capitalization, words in the text should have the same
    # capitalization as in the dictionary.  However, since words starting
    # a sentence are capitalized, we permit the first letter of a word
    # to be capitalized, even if the dictionary only has the all lower
    # case version of the word.  No attempt at parsing sentences to
    # detect first words has been made.
    #

    use strict;
    use warnings;

    my @std_dicts;   # The default dictionaries.
    my @spel_dirs;   # Directories to look for .spel files.
    my @dicts;       # List of dictionaries.
    my %words;       # Words found in the dictionaries.
    my %mistakes;    # Mistakes in the file(s).

    @std_dicts = ("/usr/dict/words",         # The exercise specifies this file,
                  "/usr/share/dict/words");  # but on my system, it's found here.

    @dicts = grep {-f} @std_dicts;   # So, we'll do some juggling,
    splice @dicts => 1 if @dicts;    # making sure we use at most 1 file.

    # Adding the ".spel" files.
    @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                          : ($ENV {HOME}, ".");

    push @dicts => grep {-f} map {"$_/.spel"} @spel_dirs;

    #
    # Init the dictionaries.
    #
    {
        local @ARGV = @dicts;
        while (<>) {
            chomp;
            $words {$_} = 1;
            $words {+ucfirst} = 1 unless /[[:upper:]]/;
        }
    }

    #
    # Read the text, record all words not found in a dictionary.
    #
    while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

    #
    # Print the mistakes, sorted.
    #
    print "$_\n" for sort keys %mistakes;

    __END__

----------------------------------------------------------------

Abigail's program has four phases.  First there's an initialization section, in which it determines which dictionaries to use.  Then it loads words from the dictionaries into a hash, %words.  Third, the program loops over the manuscript input, checking each word against %words; words not present in %words are noted in %mistakes.  Finally, the program prints out the words from %mistakes.
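Before walking through the phases, here's a quick way to see the program's behavior end to end.  This harness is my own sketch, not part of any submission; it assumes the checker has been saved as './spel' and that a system dictionary is present:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Create a small personal dictionary.  With SPELWORDS unset, 'spel'
    # looks for a '.spel' file in $HOME and in the current directory.
    open my $dict, '>', '.spel' or die "Can't create .spel: $!";
    print $dict "qotw\n";
    close $dict;

    # Feed a sample document to the checker on its standard input.
    # ($^X is the path of the currently running perl binary.)
    open my $pipe, '|-', $^X, './spel' or die "Can't run spel: $!";
    print $pipe "This qotw sentence contaisn two mispeled words.\n";
    close $pipe;

    # Expected output, sorted, one word per line:
    #   contaisn
    #   mispeled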
The initialization section first decides where the standard dictionary is; my problem statement said it would be in '/usr/dict/words', but on many systems (including mine---hmmm) it's in '/usr/share/dict/words'.  Abigail's code prefers the former if it exists, but if not it uses the latter:

    @std_dicts = ("/usr/dict/words",
                  "/usr/share/dict/words");

    @dicts = grep {-f} @std_dicts;
    splice @dicts => 1 if @dicts;

Note that the code does not depend on there being exactly two items in @std_dicts; you can list as many standard dictionaries as you want, in the order you would prefer to try them, and the program will use the first one it finds.  But I wonder if it might not have been more perspicuous to write something like

    @std_dicts = ("/usr/dict/words",
                  "/usr/share/dict/words");
    my ($std_dict) = grep {-f} @std_dicts;

and then use

    local @ARGV = (@dicts, $std_dict);

later on.

Initializing the dictionaries uses a handy trick that all Perl programmers should be aware of.  Everyone knows about the <> operator, which reads a line of input from the files named on the command line, or from the standard input if none are named.  What many people aren't aware of is that you can fool it about what the command-line files are.  This is what Abigail is doing here:

    {
        local @ARGV = @dicts;
        while (<>) {
            chomp;
            $words {$_} = 1;
            $words {+ucfirst} = 1 unless /[[:upper:]]/;
        }
    }

'local @ARGV' temporarily resets the value of @ARGV, which is what <> looks at to determine the command-line arguments.  Since the files named in @ARGV are exactly the names of the dictionaries, the <> operator reads one line at a time from each dictionary.  At the end of the block, the effect of the 'local' is undone and @ARGV resumes its original value.

The 'ucfirst' code here takes care of a detail that several submitters forgot.  If the dictionary contains the word 'carrot', we would like to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes care of this; if the word 'carrot' appears in the dictionary file, then both 'carrot' and 'Carrot' are placed in the hash.

For the dictionaries I supplied, the '/[[:upper:]]/' special case is never meaningfully exercised.  It would become important if the dictionary contained a word like 'iMac', which contains uppercase letters but whose initial letter is not uppercase.  The guard condition would prevent 'IMac' from being added to the dictionary.  It's not clear to me that this is really the right thing to do, however.  (Does 'iMac' get capitalized at the start of a sentence?  I don't know.)

The '+' on the 'ucfirst' prevents Perl from taking 'ucfirst' as a literal hash key.

The manuscript input is read by another '<>' loop:

    while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

The order of control flow here may not be clear.  It's equivalent to:

    while (<>) {
        while (/([[:alpha:]]+)/g) {
            unless ($words {$1}) {
                $mistakes {$1} = 1;
            }
        }
    }

(This sort of thing is the reason that some people love statement modifiers and other people hate them.)

The interesting feature here is the use of '//g' to locate the words in each line.  In general,

    while (/..../g) { ... }

will search through $_, repeating the 'while' loop once for each time the pattern matches.  The pattern '[[:alpha:]]+' matches sequences of one or more alphabetic characters.  Note that this treats the word "isn't" as two words, 'isn' and 't', and similarly 'pot-au-feu' as 'pot', 'au', and 'feu'.
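Here is a minimal, standalone demonstration of that behavior (my example, not part of Abigail's program):

    my $line = "Isn't the pot-au-feu ready?";
    my @words;
    push @words, $1 while $line =~ /([[:alpha:]]+)/g;
    print "@words\n";    # prints: Isn t the pot au feu ready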
To add apostrophes and hyphens to the list of characters that may appear in a 'word', change this to

    while (/([[:alpha:]'-]+)/g) {

This is fraught with its own dangers; if the input now contains the line

    to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes

then "'carrot'" (quotes included) will be recognized as a 'word' and looked up in the dictionary; but "'carrot'" isn't in the dictionary; only "carrot" is.  Handling all such cases correctly can be rather tricky.

Finally, the list of misspellings is printed out with a straightforward loop:

    print "$_\n" for sort keys %mistakes;

----------------------------------------------------------------

1.  Loading the dictionary was a little tricky.  Several submitters wrote code like this to load the dictionary:

        while (<DICT>) {
            chomp;
            $WORDS{lc $_} = 1;
        }

    The words are smashed to all-lowercase before being stored, which leads their programs to accept some rather peculiar words.  For example, one of the dictionary files I supplied contains the 'word' 'IEEE', the acronym for the Institute of Electrical and Electronics Engineers, which is likely to appear in many technical contexts.  If the case is smashed, the spell-checker will silently accept the word 'ieee', and, since such programs typically smash the case of the input words as well, 'IEeE' and 'CArrot' too.

    Some submitters forgot that 'carrot' in the dictionary indicates that 'Carrot' is also acceptable.  Some remembered, but got the code wrong.  For example, the solution I wrote before I posed the problem loads the words into a hash exactly as they are given, and then checks for a word's presence with:

        $bad{$_}++ unless $d{$_} || $d{lcfirst $_} || $d{ucfirst $_};

    This takes care of 'Carrot' properly, even if the dictionary contains only 'carrot'.  Unfortunately, it also causes the program to silently accept 'larry', even though the dictionary contains only 'Larry'.  Whoops!  It also refuses to accept 'CARROT'; I would consider this a bug.

2.  The punctuational issue is one of those problems that gets more and more complicated the longer you look at it.  At first it seems that it can be solved by just treating hyphen and apostrophe as letters.  But if you do that, your program fails on words that are quoted by being placed between apostrophes, as 'Carrot' is in this sentence.  A second-order approximation is to trim punctuation from the beginning and the end of each word before checking it, but then (as one submitter observed):

        Since I strip off all trailing punctuation, my program as it
        stands will flag 'words' such as 'etc.', 'a.c.', 'a.k.a.' as
        wrong, even if they are in the dictionaries used.

3.  Nearly everyone's programs loaded the dictionary into a hash.  Two submissions didn't.  One loaded the dictionary into an array and did linear search on the array.  On a 10-word file, this program took 9 sec to check the file; Abigail's program took 25 sec; most of the extra time was taken up by constructing the hash.  (Some of the extra time occurred during the global destruction phase, after the program had completed; apparently the dictionary hash was destroyed one key at a time, which I don't understand.)  But even with this extra overhead, the hash approach won for any file that wasn't trivially small.  For a 270-word file, the linear search program took 123 seconds; Abigail's program still took 25 sec.
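If you'd like to reproduce this kind of comparison, here is a rough sketch of mine using the standard Benchmark module; the dictionary location and the probe word are assumptions, and your numbers will vary:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Load the word list once, both as an array and as a hash.
    open my $fh, '<', '/usr/share/dict/words' or die "Can't open dictionary: $!";
    chomp(my @list = <$fh>);
    close $fh;
    my %hash;
    @hash{@list} = (1) x @list;

    my $word = 'zymurgy';    # near the end of the list: worst case for a scan
    cmpthese(-2, {
        hash   => sub { my $ok  = $hash{$word} },
        linear => sub { my ($ok) = grep { $_ eq $word } @list },
    });

A hash lookup costs about the same no matter how large the dictionary is, while each linear search scans the whole array; that is why the linear-search program's total time grows with the size of the input while Abigail's stays flat.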
Another submission generated an enormous tree structure with hashes as the nodes.  This took a long time to build and search (17 seconds to load a small dictionary that Abigail's program dealt with in less than 2 seconds) and a humongous amount of memory (I could not load the Web2 dictionary file).

4.  One submission contained the following code:

        if ($#ARGV == -1) {
            foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
                print qq("$_"\n);
            }
        } else {
            foreach (@ARGV) {
                open FILE, "< $_" or die "Couldn't open input, $_. $!";
                print "\nMispellings in $_:\n";
                foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
                    print qq("$_"\n);
                }
                close FILE;
            }
        }

    The repeated 'foreach' block is a red flag; it suggests that the programmer should look for a way to merge the two blocks, and then see if that makes the code any easier to understand.  In this case, it's easy:

        my $prevARGV = '';
        while (<>) {
            foreach (grep {!exists($words{lc $_})} split /\W+/) {
                print "\nMisspellings in $ARGV:\n" if $ARGV ne $prevARGV;
                print qq("$_"\n);
                $prevARGV = $ARGV;
            }
        }

    Or perhaps

        @ARGV = ('-') unless @ARGV;
        foreach (@ARGV) {
            open FILE, "< $_" or die "Couldn't open input, $_. $!";
            print "\nMisspellings in $_:\n";
            foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
                print qq("$_"\n);
            }
            close FILE;
        }

5.  A number of programs had extra features that I thought substantially reduced the usefulness of the program.  For example, several of the programs produced extraneous output:

        print "\n$word misspelled at line $.\n";

    or

        print "\nMispellings in $_:\n";

    Extraneous output makes the program more difficult to use as a tool.  As specified, the program produces a list of misspelled words.  This list could be fed into another program, such as an editor, which could provide a convenient interface for correcting the misspellings.  The list could be piped into a program which might make guesses about what words were intended.  The output could be directed to the end of './.spel' and then edited.  Extra output makes all of these things more difficult.  At best, it would have to be filtered out before the output would be useful for anything other than human consumption.  Diagnostic messages, if they appear at all, should be printed to STDERR; that is what it is for.

    Here's an example that's particularly egregious:

        print "Do you want to change the default spellcheck settings? (y/n): ";
        chomp (my $choice = <STDIN>);
        if (uc $choice eq 'Y') {
            print "Ignore words all in CAPITALS? (y/n): ";
            chomp (my $ch1 = <STDIN>);
            $ignore_caps = 'Y' if uc $ch1 eq 'Y';
            print "Ignore words with numbers? (y/n): ";
            chomp (my $ch2 = <STDIN>);
            $ignore_nums = 'Y' if uc $ch2 eq 'Y';
        }

    Using this program in any way other than as an interactive application is very difficult.  In spite of all the work that went into the interface, it remains inflexible; the program will only do the things that the programmer imagined.  Options like whether to ignore numerals should be specified non-interactively, on the command line.
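As a sketch of that alternative (the option names are my invention, not from any submission), the two settings above could be accepted with the standard Getopt::Long module:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Long;

    GetOptions(
        'ignore-caps' => \my $ignore_caps,    # skip words in all capitals
        'ignore-nums' => \my $ignore_nums,    # skip words containing digits
    ) or die "Usage: spel [--ignore-caps] [--ignore-nums] [file ...]\n";

    while (<>) {
        for my $word (/([[:alnum:]]+)/g) {
            next if $ignore_caps && $word =~ /^[[:upper:]]+$/;
            next if $ignore_nums && $word =~ /[0-9]/;
            # ... look $word up in the dictionary hash as before ...
        }
    }

Because GetOptions removes the options it recognizes from @ARGV, whatever remains is exactly the list of files for the <> loop, and the program stays usable in a pipeline.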
6.  Robin Szemeti suggested using the Search::Dict module.  Search::Dict is one of those very clever pieces of software that never seems to be useful for anything.  The idea of Search::Dict is this: if the input file is in lexicographic order, items in it can be found with binary search; this should be quick and also memory-efficient, since the whole file needn't be loaded into memory at once.

I tried out a limited version of this, which looks words up in a single dictionary but doesn't actually construct the list of misspellings:

    use Search::Dict;

    open W, "< Web2" or die $!;
    while (<>) {
        for (split /[^a-zA-Z'-]+/) {
            look *W, $_;
        }
    }

On a 2387-word version of this postmortem file, Abigail's program took the usual 27 seconds; the program above took 13.  I was not expecting Search::Dict to do so well.  I then worked up a complete spelling checker based on Search::Dict, and it still took about 13 seconds on the postmortem file.  I was surprised; I had expected the file-reading overhead to be much higher.

On a 39,XXX-word file, however, the hash approach was a big winner.  Abigail's program still took about 27 seconds; the Search::Dict program took 192:

    #!/usr/bin/perl

    use Search::Dict;

    my @std_dicts = ("/usr/dict/words", "/usr/share/dict/words");
    my ($std_dict) = grep {-f} @std_dicts;

    my @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                             : ($ENV {HOME}, ".");
    my @dicts = grep {-f} map {"$_/.spel"} @spel_dirs;

    my @fh;
    for my $dict (@dicts, $std_dict) {
        open my $fh, $dict or die "Couldn't open $dict for reading: $!";
        push @fh, $fh;
    }

    my %missp;
    while (<>) {
        my @words = split /[^a-zA-Z'-]+/;
        WORD:
        for (@words) {
            next if $missp{$_};
            for my $fh (@fh) {
                look $fh, $_;
                my $w = <$fh>;
                chomp $w;
                next WORD if $w eq $_;
            }
            $missp{$_} = 1;
        }
    }

    print join "\n", sort(keys %missp), "";

Search::Dict, by the way, has a lousy interface.  It looks in the file to find the first word that is equal to or greater than its argument, and leaves the filehandle positioned at the place where it found that word; it also returns the position at which the handle was left.  But it doesn't return the word itself, or any indication of whether it matched the argument or not!  The return value, which is the file position, is useless, since you could have gotten it by calling tell() on the filehandle.  So my program has to reread the word at the current file position and then compare it with the argument word, even though Search::Dict has just finished doing exactly that.

7.  There were a bunch of defects which made me think that programs had not been well-tested.  One program wouldn't compile, because it had a missing semicolon.  One program did this:

        push(@dictionaries, qw(~/.spel .spel/));

    The author apparently didn't notice that neither '~/.spel' nor '.spel/' was ever read.  (Perl's open() does not expand '~' the way the shell does, and '.spel/', with its trailing slash, names a directory rather than a file.)

    One program had

        open DICT,"Web2" || die "Error: $!\n";

    so the error message would never appear: || binds more tightly than the comma, so this parses as open DICT, ("Web2" || die ...), and since the string "Web2" is always true, the die can never execute.  If 'Web2' was not present in the current directory when the program was run, it would cheerfully report every word as misspelled.

----------------------------------------------------------------

Other notes:

* Regular quiz #7 was about taking a list of numbers and then fudging the percentages so that they would add up to exactly 100%.  Douglas Wilson contributed a good explanation of why someone might really want to do this, apart from satisfying a demand from a clueless manager.  Suppose your company is in the business of selling assemble-it-yourself kits.  You would like to list a price for each individual part in the kit; the prices should add up to the cost of the whole kit exactly, even when they are rounded off to the nearest penny.  Mr. Wilson's message about this is at

        http://perl.plover.com/~alias/list.cgi?1:msp:1183

* Happy new year, everyone!  I will send out Quiz #10 later this evening.