Sample solutions and discussion
Perl Quiz of The Week #20 (20040721)


        I run mailing lists. People subscribe, people unsubscribe, and
        people get unsubscribed automatically when their addresses
        generate too many bounces. I run these mailing lists using
        SmartList.

        I'd like to find out how my lists are being used - do people
        unsubscribe in a bunch when a flame war happens, or do they just
        drift in and out over time? What does the total-membership graph
        look like?

        You are to write a function, parse_smartlist_log. It takes three
        parameters:

        (1) the name of a SmartList log file.
        (2) the total current membership of the list.
        (3) the base name of the output file.

        It should parse a SmartList log file and generate a graph of
        total list membership against time.

        Note that not all subscriptions and unsubscriptions will be in
        the log; it's possible that the listmaster has added or removed
        addresses without using the administrative interface, especially
        when the list was first set up. This is the reason for the
        second parameter. Take whatever action seems appropriate.

        (The graph can be a bitmap, ASCII, or whatever else - just give
        it a sensible filename based on the third parameter.)

        A log file includes lines such as:

        subscribe: foo@bar.com by: foo@bar.com Thu Mar 21 15:30:35 GMT 2002
        unsubscribe: 9 foo@bar.com 32760 foo@bar.com by: foo@bar.com Sat Mar 23 16:27:35 GMT 2002
        procbounce: Removed: foo@bar.com 32718

        SmartList has fuzzy matching on unsubscription requests - if the
        addresses in the line differ, use the first one.

        There are many other lines that may appear in the log file.
        Sometimes, as seen above for procbounce, there may be no date on
        the log line.

        Some sample log files may be obtained from

                http://firedrake.org/roger/sample_logs.zip

        or from

                http://perl.plover.com/qotw/misc/r020/sample_logs.zip
                http://perl.plover.com/qotw/misc/r020/sample_logs.tgz

----------------------------------------------------------------

Only two solutions were submitted on the discuss list.
The only external solution that solved the problem came from Jesper
Dalberg. It uses Text::Graph, a CPAN module of which I was not
previously aware (thanks!), and Date::Manip. Date::Manip is a
relatively inefficient way to parse dates; as it happens, all dates I
have observed in SmartList log files are in a format which Date::Parse
can handle.

This solution takes the sensible approach of latching a date value
when it is spotted and using it for subsequent undated lines. However,
it does not use dates found on lines which do not also contain a
mailing list transaction. The totalling logic seems broken: the
default value for list membership on any date is the final membership
value, rather than being in any way affected by previous values. (Was
there perhaps a missing reassignment to $cnt?)

MJD submitted a solution which, while appealing (I am a great fan of
PostScript and would love to see a Perl-PostScript Quiz of the Week),
does not actually solve the problem. He is correct that the SmartList
log format is not particularly well-designed, and indeed that was part
of the reason why I chose it for this quiz; it is the output of a
variety of separate programs, including procmail, rather than coming
from an integrated system. In any case, working from the provided
PostScript output, it appears that the axes are unlabelled and
unscaled.

My own solution is designed for clarity. It parses every line in
search of a date (fed to Date::Parse), and looks for specific patterns
for subscription/unsubscription information. (It also looks for
something vaguely resembling an email address in the line; as David
Jones pointed out, not every line matching /^unsubscribe:/ will be an
unsubscription.) After parsing, the data are rebased: a constant
offset is added to every point so that the last point equals the
current membership passed as the second parameter, and any value that
would fall below zero is clamped to zero. The code then uses George A.
Fitch's GD::Graph::xylines module to provide a graph with labelled,
scaled axes.

Possible sophistications would be:

* choose a strftime format based on the sample's date span (e.g.
  "%H:%M" if the whole logfile only covers a day, "%b %Y" if it spans
  several years).

* if a subscriber is unsubscribed twice without an intervening
  resubscription, discount the earlier unsubscription (as he was
  clearly re-added without showing up in the log).

    #! /usr/bin/perl -w

    use strict;

    sub parse_smartlist_log {
      use Date::Parse;
      use GD::Graph::xylines;
      use POSIX qw(strftime);
      my ($logfile,$final,$outputfile)=@_;
      my $total=0;      # running membership, relative to an unknown start
      my (@x,@y);       # epoch times and the running total at each event
      my $date=0;       # most recent date seen; reused for undated lines
      open(IN,'<',$logfile) or die "Cannot open $logfile: $!";
      while (<IN>) {
        chomp;
        my $n=0;
        # Latch any ctime-style date found anywhere on the line.
        if (/([A-Z][a-z][a-z]\s+[A-Z][a-z][a-z]\s+\d+\s+\d+:\d+:\d+\s+\d+)/) {
          $date=str2time($1) || $date;
        }
        # Only count lines that also contain something like an address.
        if (/^subscribe: (\S+\@\S+)/) {
          $n=1;
        } elsif (/^(unsubscribe:\s+\d+|procbounce: Removed:)\s+(\S+\@\S+)/) {
          $n=-1;
        }
        # Events seen before the first date are not plotted.
        if ($n && $date) {
          $total+=$n;
          push @x,$date;
          push @y,$total;
        }
      }
      close IN;
      # Rebase so that the last point equals the supplied membership.
      my $offset=$final-$total;
      if ($offset) {
        foreach my $n (0..$#y) {
          $y[$n]+=$offset;
          if ($y[$n]<0) {
            $y[$n]=0;
          }
        }
      }
      my $graph=GD::Graph::xylines->new;
      $graph->set(
                  x_label => 'date',
                  y_label => 'subscribers',
                  title => $logfile,
                  x_number_format => sub{strftime('%d %b %Y',localtime(shift))},
                  y_min_value => 0,
                  transparent => 0
                  );
      my $img=$graph->plot([\@x,\@y]);
      open(OUT,'>',"$outputfile.png") or die "Cannot write $outputfile.png: $!";
      binmode OUT;
      print OUT $img->png;
      close OUT;
    }

    __END__

[ Thanks to Roger Burton West for running the QOTW this week. The
  solution was delayed because I was away at OSCON. I will send the
  new quiz tomorrow. -MJD ]