Sample solutions and discussion Perl Quiz of The Week #13 (20030528) The 'MH' mail system stores email messages in a 'folder', which is just a plain directory. Messages are files in this directory whose names are numerals. The directory might contain other files or subdirectories; these are not messages. The 'scan' command summarizes the contents of a folder. Here's a typical output: 1349 03/22 Yitzchak Scott-Th Re: Hey, is this list alive?< Proposed advanced problem for this week. 1352 03/25 John_Wunderlich@C Re: Hey, is this list alive?< $b } grep {!/\D/ && -f "$dir/$_" } readdir D; closedir D; my %me; { my @addresses = $ENV{ADDRESS} ? split(/,\s+/, $ENV{ADDRESS}) : guess_addresses(); for (@addresses) { $me{lc $_} = 1; }; } There was some question about how to decide if a message had been sent by the user running the program. I said on the -discuss list that anything reasonably reasonable would suffice. My 'guess_addresses' function tries to guess the user's address from various system information, but allows itself to be overidden by the contents of an ADDRESS environment varuable. Here's 'guess_addresses': sub guess_addresses { my $username = $ENV{USER} || (getpwuid($<))[0] || return; my $host = $ENV{HOST} || do { require Config; "Config"->import; $Config{myhostname} . $Config{mydomain}; } || return; "$username\@$host"; } I suppose 'Config' probably isn't in the Llama Book, but it's not an essential part of the program, so if you don't like it, you can take it out and replace it with something else. Now the main loop of the program starts: for $msgno (@message_files) { local *F; my $file = File::Spec->catfile($dir, $msgno); unless (open F, "<", $file) { warn "Couldn't read message $msgno; skipping.\n"; next; } my $mo = Mail::Internet->new(\*F); Problem #1: Mail::Internet::new requires a glob reference argument, which violates my Llama-features-only rule. Problem #2: Mail::Internet is mighty slow. my $h = $mo->head; { my $datefield = $h->get('date'); my $time = $datefield ? Mail::Field->new('date', $datefield)->time : (stat($file))[9]; $date = strftime("%m/%d", localtime($time)); } Mail::Field is part of the MailTools package. Here it returns a Mail::Field::date object, which supports a ->time method that converts the date into an epoch time. I used strftime() (which may not be available everywhere) to convert this back to a date. This seems like an awful lot of machinery to use just to convert something like "Sun, 15 Jun 2003 01:55:30 +0200" to "06/15". The possible upside is that the date in the output shows that date when the message was actually sent, relative to the user of the program. For example, the "Sun, 15 Jun 2003 01:55:30 +0200" message was sent late on the evening of 14 June, Philadelphia time, and I live in Philadelphia. However, that doesn't seem to me like much of a benefit. If there isn't a date in the message, we use the date that the message file was written. "??/??" would also be a reasonable alternative. Now the program deals with the sender's address: { for my $mf (Mail::Address->parse($h->get('from'))) { if ($me{lc $mf->address}) { $whom = "To:" . $h->get('to'); last; } else { $whom = $mf->phrase || $mf->comment || $mf->address; } } } It took me much longer to figure this out than it should have. I tried using Mail::Field again, and discovered that it's a tremendous pain in the ass. Unfortunately I don't remember most of the details. Mail::Address seems to do the job adequately, however. $subjcontent = $h->get('subject') . "<<" . join "", @{$mo->body}; $subjcontent =~ tr/\n//d; write; } format STDOUT = @#### @<<<< @<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $msgno,$date,$whom, $subjcontent . Does the Llama book cover formats? I hope not. Of course the format here is not necessary; I could have used a simple 'print' or something similar. But opportunities to use formats come along so rarely that I like to take advantage of them when they appear. Now here's the non-modules version. The biggest cost is that it's about 50% longer than the other version. #!/usr/bin/perl use File::Spec; use POSIX 'strftime'; my $dir = shift || '.'; opendir D, $dir or die "Couldn't read directory $dir: $!; aborting"; my @message_files = sort { $a <=> $b } grep {!/\D/ && -f "$dir/$_" } readdir D; closedir D; my %me; { my @addresses = $ENV{ADDRESS} ? split(/,\s+/, $ENV{ADDRESS}) : guess_addresses(); for (@addresses) { $me{lc $_} = 1 }; } sub guess_addresses { my $username = $ENV{USER} || (getpwuid($<))[0] || return; my $host = $ENV{HOST} || do { require Config; "Config"->import; $Config{myhostname} . $Config{mydomain}; } || return; "$username\@$host"; } So far everything is the same. But now because I'm not using a Mail:: module to deal with the RFC822-format date, I have to do this: my %m2n = (jan => 1, feb => 2, mar => 3, apr => 4, may => 5, jun => 6, jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12, ); for $msgno (@message_files) { local *F; my $file = File::Spec->catfile($dir, $msgno); unless (open F, "<", $file) { warn "Couldn't read message $msgno; skipping.\n"; next; } my %mo = read_message(\*F); Instead of using Mail::Internet to read in the email message, I wrote a replacement function. Here it is: sub read_message { my $fh = shift; my ($header); my %m; { local $/ = ""; $header = <$fh>; undef $/; $m{BODY} = <$fh>; } my @fields = split /\n(?!\s)/, $header; for (@fields) { my ($t, $v) = split /:\s+/, $_, 2; $m{lc $t} = $v; } %m; } The split /\n(?!\s)/ is a little tricky, and I suppose (?!\s) is non-Llama. \n(?!\s) matches only those newlines which are *not* followed by whitespace. When we split the message header on those newlines, we get an array of fields; each field may contain one or more physical lines of the header. The main program then continues; here's my quick and dirty code to deal with RFC822-format dates: { my $datefield = $mo{date}; if ($datefield && $datefield =~ /(\d+) (\w+)/) { $date = sprintf "%02d/%02d", $m2n{lc $2}, $1; } else { $date = strftime("%m/%d", localtime((stat($file))[9])); } } In retrospect, it probably would have been better to avoid strftime() here, since I could have done something like: # { my $datefield = $mo{date}; # my ($m, $d); # if ($datefield && $datefield =~ /(\d+) (\w+)/) { # ($m, $d) = ($m2n{lc $2}, $1); # } else { # ($m, $d) = (localtime((stat($file))[9])))[4,3]; # $m++; # } # $date = sprintf "%02d/%02d", $m, $d; # } instead. The double-list-slice on the 'localtime' line should make Randal happy. Next is the section which extracts the addresses from the 'from' and 'to' fields. It depends on a homemade 'parse_addr' function, which is the dodgiest part of the program. { my ($phrase, $addr, $comment) = parse_addr($mo{from}); if ($me{lc $addr}) { ($phrase, $addr, $comment) = parse_addr($mo{to}); $whom = "To:" . ($comment || $phrase || $addr); } else { $whom = $comment || $phrase || $addr; } } Here's the rather questionable 'parse_addr': sub parse_addr { my $x = shift; my ($phrase, $addr, $comment) = $x =~ /([^<(]*) (?: < ( [^<\s]* ) > ) ? \s* (?: \( ([^\)]*) \)) ? /x; $addr = $phrase unless defined $addr; # warn "$x -> '$phrase', '$addr', '$comment'\n"; for ($phrase, $addr, $comment) { s/^\s+//; s/\s+$//; } return ($phrase, $addr, $comment); } It's questionable because it doesn't always work. For example, it'll misparse "Joseph (Joe) Smith" jsmith@example.com RFC822 address syntax is horrendously complicated and grossly overengineered. But it works well enough for almost all examples that one encounters in practice. (Which is why RFC822 is overengineered.) Essentially, the idea is that an address wil have this format: PHRASE
(COMMENT) Where the address and comment parts might be missing. Addresses might also have this format: ADDRESS (COMMENT) in which case the address will be mistaken for a phrase; in that case we use the phrase as the address. We use a rather ordinary regex to extract the three parts. Finally, the rest of the program is simple: $subjcontent = $mo{subject} . "<<" . join "", $mo{BODY}; $subjcontent =~ tr/\n//d; write; } format STDOUT = @#### @<<<< @<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $msgno,$date,$whom, $subjcontent . The costs for the homegrown version are that it's 50% longer, and that it doesn't work on peculiar addresses. We also lost the benefit of displaying the local date on which the message was sent; "Sun, 15 Jun 2003 01:55:30 +0200" is displayed as 06/15 even though the message was sent on 06/14 Philadelphia time. The benefits are that the program doesn't depend on a bunch of nonstandard modules, and that the program runs *twelve* times faster. Of course, anyone can write a program that runs really fast and produces the wrong output. But this program produces the right output almost all the time, and it's hard to believe that fixing it would slow it down by a factor of 12. What wen't wrong? I haven't looked closely, but I suspect that Mail::Internet is way overwritten. I've placed complete source code at http://perl.plover.com/qotw/misc/r013/ Thanks to everyone who participated quietly and said nothing. I can confidently predict that I will post new quizzes on June 11.