Note: This article originally appeared at the LinuxPlanet web site. I had intended it to be the first in a series of articles, but soon afterwards the LinuxPlanet editorship changed and they decided they didn't want the rest of the series. Perl.com will be carrying the rest of the articles. Part 2 and Part 3 are available now.

My Life with Spam or How I Caught the Spam and What I Did With it When I Caught it
An Unpleasant Surprise
Spam, Spam Everywhere
Filtering Strategies
How the Mail Gets Into our Filtering Program
Reading the Message
Tune in Next Time

My Life with Spam

How I Caught the Spam and What I Did With it When I Caught it

An Unpleasant Surprise

In May of 1994, while I was reading the phl.food newsgroup, I saw something new. It was a message with this subject:

        U.S. Green Card Lottery - New Immigration Opportunity

That was not what I expected to see in phl.food, so I wrote to the author:

        This has nothing at all to do with food, and you posted it to
        phl.food.  Please be more polite in the future and keep
        announcements in relevant and appropriate groups.

And he sent me a reply:

        People in your group are interested.  Why do you wish to
        deprive them of what they consider to be important
        information??

I was really startled. I had naïvely expected that the author would recognize that he had done something incorrect once it was pointed out to him. Gosh, was I wrong!

That was the beginning of my life with spam. It was the now-infamous `Green Card Spam' from Laurence Canter and Martha Siegel, a pair of incompetent lawyers. But they were on the leading edge of a big trend. Within two years the newsgroups were clogged with spam, and at the same time, email spam was becoming common.

Spam, Spam Everywhere

The email spam was really starting to bother me towards the end of 1996, when I was getting several junk messages each week. (How quaint that seems now!) I tried to figure out what to do about the spam. Some of the plans worked out well. Some were instant failures. Some were failures but it took me years to decide that they didn't work---those are the most interesting ones.

My basic problem was that it annoyed me to get spam mail in my inbox every morning. I wanted to address this problem directly. Complaining back to the source wasn't what I was looking for, because I would still be annoyed. One solution would have been to stop getting annoyed, but I didn't have much luck with that, so I tried to think of ways to prevent the spam from getting into my inbox in the first place. I started thinking about filtering.

The idea of filtering is simple: Write a program that will examine each incoming message, and if it looks like spam, throw it away. If not, the program delivers it to my inbox as usual.

The idea is simple, but the implementation is hard. What does spam look like? And if I couldn't recognize it infallibly, what would happen when I made a mistake?

Failing to recognize spam is called a false negative, and is only a minor problem. The only thing that would happen was that my blood pressure would go up imperceptibly. But the opposite false positive problem, of incorrectly recognizing and discarding some message that wasn't spam, was much more frightening. If I erroneously identified a non-spam message as spam, and threw it away, I might lose important messages from clients, or from my lawyers, or from someone else important. So I decided that, whatever happened, no message must ever be thrown away until I had had a chance to look at it. And messages that are rejected mustn't go into an oubliette somewhere unless someone is notified. I had nightmares about getting a call from my lawyer, asking why I hadn't gotten an important communication the previous month, and having to say it had gone into my spam box.

Filtering Strategies

Mail messages contain an actual message with some information that someone wanted to send you; that's called the body of the message. They also contain some meta-information such as who the message is from, who it was sent to, when it was sent, and so forth; this part of the message is called the header. I could try to recognize spam by looking in the body, or in the header, or both. But I couldn't think of a good way to recognize spam by looking at the body that wouldn't also lead to an unacceptably high false positive rate. Every time I thought about the problem, I got stuck at the same place: Suppose I know that lots of spam arrives exhorting me to see sexy Annabel Chong, and suppose I were to establish a policy of rejecting messages that mention Annabel. Now suppose someone sends me a message that discusses spam filtering strategies and mentions Annabel in connection with this. ``Dominus, is it a good strategy to reject messages that mention Annabel Chong?'' Oops, I've just thrown their message away. Now suppose someone sent me a copy of the article you're reading right now. Oops, it mentions Annabel Chong also.

I decided that since there wasn't any way to tell whether a message actually mentions Annabel directly rather than mentioning that messages sometimes mention her, content filtering wasn't going to work. I didn't want to cut myself off from discussions about spam filtering.

I decided I would have to filter based on information in the message header, not its body. The `subject' line is the most obvious place to start, but I didn't want to use it because it's really part of the message content and I would have had the same problem as if I were filtering on the message body. I didn't want to automatically reject mail that said

        Subject: These green card lottery articles are driving me crazy!

Now, among people who've followed this train of thought there seem to be two basic strategies. One strategy is to simply reject mail that's not addressed to you; if the `To:' address isn't yours, throw it out. That sounds good, but it has some problems. I get a lot of mailing list and carbon-copy mail that might not include me in the recipient list for perfectly good reasons. The typical mailing-list mail arrives in your inbox without your address on it anywhere; it says something like

        To: perl5-porters@perl.com

This is analogous to paper mailing list mail: You open it up and inside is a letter that says not `Dear Mark Dominus, ...' but rather `Dear Pigeon Fancier, ...'.

Also, I analyzed the spam messages I had collected and decided it wouldn't work well enough---about one spam message in five actually does arrive addressed to me.

Instead, the basic idea that I adopted was to make a list of domains that sent me a lot of spam, and to blacklist those domains. Any mail from a blacklisted domain would be rejected; other mail would be delivered. And now I'll show how to implement that, because the details turn out to be very interesting.
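
As a rough illustration of the blacklist idea only, and not the implementation developed later in this series, here is a minimal sketch. The domain names, the %blacklisted hash, and the is_blacklisted function are all hypothetical, and the address parsing is deliberately simplistic.

```perl
use strict;
use warnings;

# Hypothetical blacklist; these domain names are placeholders.
my %blacklisted = map { $_ => 1 } qw(spam.example junkmail.example);

# Return true if the address's domain, or any parent domain, is blacklisted.
sub is_blacklisted {
    my ($address) = @_;
    # Grab whatever follows the @ sign; give up if there is no domain.
    my ($domain) = $address =~ /\@([\w.-]+)/ or return 0;
    my @parts = split /\./, lc $domain;
    while (@parts) {
        return 1 if $blacklisted{ join '.', @parts };
        shift @parts;              # try the parent domain next
    }
    return 0;
}

for my $address ('LRS@mail.spam.example', 'friend@public.com') {
    print "$address: ", (is_blacklisted($address) ? "reject" : "deliver"), "\n";
}
```

Checking parent domains means that one blacklist entry also covers every host name underneath it. Whether that is desirable is a policy decision that this sketch simply assumes.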

How the Mail Gets Into our Filtering Program

Our filter program will be run by the mail transport agent (MTA), which is the program that is responsible for receiving mail from the network and for delivering it to the right place. With sendmail, for example, you can put a line like this one into your .forward file:

        "| /home/mjd/bin/mailfilter"

When mail arrives, sendmail will run the mailfilter program and hand it the mail message on its standard input. mailfilter can then decide whether to deliver the message by writing it to the mailbox, or whether to throw it away, or whether to do something else. Most MTAs have an option to deliver mail to a program in this way. My system was using the superb qmail MTA, so I would put the line

        | /home/mjd/bin/mailfilter

into my ~/.qmail file. (Actually my filter program is named deliver.aol.q2. Please don't ask why, because I don't remember.)

Reading the Message

What does this mailfilter program need to do? Obviously, the first thing it must do is read the mail message in from the standard input. Code to read in an email message is very simple in Perl:

1	{ local $/ = ""; 
2	  $header = <STDIN>; 
3	  undef $/; 
4	  $body = <STDIN>; 
5	}

This reads the header of the message into $header and the body into $body. What's going on here? The Perl <...> operator reads a line of input from some filehandle. But what's a line? Normally, it's any sequence of characters that's terminated by a newline character. Why a newline? Because that's the default setting of the Perl $/ special variable. If you change $/, that changes Perl's idea of what a line looks like. If you changed it to contain a period, then Perl would think that a `line' was any sequence of characters that ended with a period.
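
To see the effect of $/ in isolation, here is a small sketch that reads from an in-memory string instead of from STDIN. The sample text and the variable names are invented for the demonstration, and opening a filehandle on a string reference assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Sample input, invented for this demonstration.
my $text = "One. Two. Three.\nFour";

# Open a filehandle that reads from the string rather than from a file.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = ".";        # a `line' now ends with a period
    @records = <$fh>;      # list context: read every remaining record
}
close $fh;

# Bracket each record so that embedded white space is visible.
print "[$_]\n" for @records;
```

Each record includes the period that ended it. The last record has no period at all, because the input ran out first.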

There are two special settings for $/, however. If you set $/ to the empty string, as I did on line 1, the <...> operator reads by paragraphs instead of lines; consecutive paragraphs are separated by a blank line. Each call to <...> reads in one complete paragraph. Since the header of a mail message is a paragraph, separated from the body by a blank line, line 2 reads the entire header into $header.

The local on line 1 confines the changes to $/ to the block, so that when control reaches line 5, the original value of $/ is automatically restored. We might be doing file I/O in other parts of our program, and if we didn't put $/ back to normal we'd get weird results when the <...> operator didn't behave the way we expected. Using local ensures that we won't forget to put it back the way it was.

On line 3 we see the other special setting of $/. If $/ is undefined, then there is no line termination sequence, and Perl's <...> operator reads the entire rest of the input all at once. This is sometimes called `slurping' the input. Line 4 reads the entire message body into the variable $body.
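
The same technique can be tried out without a mail system by wrapping it in a subroutine and feeding it a string. This is only a sketch: the read_message name and the sample message are made up, and reading from a string reference assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: split the message on a filehandle into header and body.
sub read_message {
    my ($fh) = @_;
    local $/ = "";         # paragraph mode: read up to the first blank line
    my $header = <$fh>;
    undef $/;              # slurp mode: read everything that is left
    my $body = <$fh>;
    return ($header, $body);
}

# A made-up message for the demonstration.
my $message = "From: alice\@example.com\nSubject: lunch\n\nNoon?\nSee you there.\n";

open my $fh, '<', \$message or die "Cannot open in-memory file: $!";
my ($header, $body) = read_message($fh);
close $fh;

print "HEADER:\n$header";
print "BODY:\n$body";
```

Because $/ is localized inside the subroutine, the caller's setting is restored on return, even though the subroutine changes $/ twice.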

Message Headers

We're not done processing the message. A mail header is made up of several lines that carry different information. You're probably familiar with the structure already; it looks like this:

        Return-Path: LRS@getstartednow.com
        Return-Path: <LRS@getstartednow.com>
        Delivered-To: mjd-deliver@plover.com
        Received: (qmail 15266 invoked by uid 119); 10 Apr 1997 05:08:37 -0000
        Delivered-To: mjd-filter@plover.com
        Received: (qmail 15261 invoked by uid 119); 10 Apr 1997 05:08:33 -0000
        Delivered-To: mjd@plover.com
        Received: (qmail 15258 invoked from network); 10 Apr 1997 05:08:31 -0000
        Received: from renoir.op.net (root@206.84.208.4)
          by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
        Received: from major.globecomm.net (major.globecomm.net [207.51.48.5])
          by renoir.op.net (8.7.1/8.7.1/$Revision: 1.1 $) with ESMTP id BAA02191
          for <mjd@op.net>; Thu, 10 Apr 1997 01:06:35 -0400 (EDT)
        From: LRS@getstartednow.com
        Received: from globecomm.net (ip252.new-haven.ct.pub-ip.psi.net [38.11.102.252])
          by major.globecomm.net (8.8.5/8.8.0) with SMTP id BAA00454;
          Thu, 10 Apr 1997 01:06:23 -0400 (EDT)
        Received: from mailhost.greatchances.com (alt3.greatchances.com(917.876.92.65))
          by greatchances.com (8.8.5/8.6.5) with SMTP id GAA04352
          for <friend@public.com>; Thu, 10 Apr 1997 01:00:43 -0600 (EST)
        To: friend@public.com
        Message-ID: <282732679098.HAb9037@greatchances.com>
        Date: Thu, 10 Apr 97 01:00:43 EST
        Subject: MAKE MONEY AT HOME!
        X-UIDL: 698987574a97aqd1p134jud427k9a6d
        
        
We'd like to break this up into the individual lines and then put the information into a hash so that we can find the various parts easily. For example, we'd like to be able to find the `To:' address in $hash{To} and the subject in $hash{Subject}.

Breaking a Perl string into lines is easy; just use split:

        @lines = split /\n/, $header;

This tells Perl to take the string $header and break it into lines wherever it sees a \n character. The \n's are discarded and the parts in between are stored into the elements of the Perl array @lines. Then we can dismantle each individual line:

        foreach $line (@lines) {
          my ($label, $value) = split /:\s*/, $line, 2;
          $hash{$label} = $value;
        }

This runs the loop once for each line. We use split again, this time to cut each line into two pieces. The /:\s*/ says that the pieces will be separated by a : followed by some white space; \s is an abbreviation for `space' and the star means that we don't know how much space there will be. For the header line Date: Thu, 10 Apr 97 01:00:43 EST this places Date into $label and Thu, 10 Apr 97 01:00:43 EST into $value. Notice that Perl does not split on the :'s in the date; that's because the 2 in the split tells Perl that there are only two fields here, so that it ignores any :'s after the first one. If we had omitted the 2, Perl would have split this line into four fields: $value would have gotten Thu, 10 Apr 97 01, and the two other fields with 00 and 43 EST would have been thrown away.
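
The effect of the limit is easy to check by running split both ways on the Date line from the example header. The array names here are chosen only for the demonstration.

```perl
use strict;
use warnings;

my $line = "Date: Thu, 10 Apr 97 01:00:43 EST";

# With a limit of 2, only the first colon is treated as a separator.
my @limited = split /:\s*/, $line, 2;

# Without a limit, every colon is a separator.
my @unlimited = split /:\s*/, $line;

print scalar(@limited),   " fields: ", join(" | ", @limited),   "\n";
print scalar(@unlimited), " fields: ", join(" | ", @unlimited), "\n";
# prints:
#   2 fields: Date | Thu, 10 Apr 97 01:00:43 EST
#   4 fields: Date | Thu, 10 Apr 97 01 | 00 | 43 EST
```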

Continuation Lines

Actually though, there's a problem with this. When a header line is long, it can be broken across two or more physical lines. There's an example of this above. The two physical lines

        Received: from renoir.op.net (root@206.84.208.4)
          by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000

are actually one logical line that is broken in half. We are supposed to consider this to be one line, even though there's a newline character in the middle. The rule for recognizing these extended logical lines is simple: If a line starts with white space, it's a continuation of the previous line.

The code we have incorrectly breaks this header line in two. Then it sets $hash{Received} to the partial value from renoir.op.net (root@206.84.208.4). Then it tries to process the following line the same way, not realizing that it's a continuation. It breaks the continuation into two pieces, by plover.com with SMTP; 10 Apr 1997 05 and 08:31 -0000 and it interprets the first piece as the header field name and the second piece as the value. This is obviously all wrong.
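
To watch the failure happen, we can run the naive loop on a cut-down header that contains the continued Received line. The three-line header below is assembled from the example above purely for the demonstration.

```perl
use strict;
use warnings;

# A shortened header containing one continued line.
my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "Subject: MAKE MONEY AT HOME!\n";

# The naive approach: treat every physical line as a complete field.
my %hash;
foreach my $line (split /\n/, $header) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

# Quote the keys and values so that leading white space is visible.
foreach my $label (sort keys %hash) {
    print "'$label' => '$hash{$label}'\n";
}
```

The hash ends up with three keys instead of two. One of them is the bogus field name made from the first part of the continuation line, and the Received entry is missing everything after the line break.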

We need a way to handle the continuation lines. Here's one way: We'll split up the lines as before, and then put the continuations back together.

 1      @physical_lines = split /\n/, $header;
 2      @logical_lines = ();
 3      for $current_line (@physical_lines) {
 4        if ($current_line =~ /^\s/) {
 5          $previous_line .= $current_line;
 6        } else {
 7          push @logical_lines, $previous_line if defined $previous_line;
 8          $previous_line = $current_line;
 9        }
10      }       
11      push @logical_lines, $previous_line;

@logical_lines will contain the array of header lines after we've pasted the continued lines back together. We loop over the physical lines, and check each one to see if it begins with white space. We saw that \s is a Perl pattern for a white space character; the ^ in front of it requires that the white space occur at the beginning of the string. In the normal case, the line does not begin with white space, and we come to line 7. We store the previous line into the @logical_lines array, if there was one, because we can be sure we're done with it. And we remember the current line in the $previous_line variable in case it turns out that it is continued on the following line.

If the current line does begin with white space, it is a continuation of the previous line, and on line 5 we append it to $previous_line. $previous_line will keep getting longer and longer as long as we keep seeing continuation lines, and then finally when we see a line that's not a continuation, we'll push the entire $previous_line onto the list of logical lines at line 7.

When we reach the end of the header and exit the loop, the last logical line is still in $previous_line, so we have line 11 to take care of it.
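
For experimenting, the loop can be packaged as a subroutine with lexical variables. This is a sketch rather than the code from my filter: the logical_lines name is made up, and it adds two small guards that the loop above doesn't have, one for an empty header and one for a header whose very first line starts with white space.

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation lines back onto the line they continue.
sub logical_lines {
    my ($header) = @_;
    my @logical_lines;
    my $previous_line;                 # undefined until we see the first line
    for my $current_line (split /\n/, $header) {
        if ($current_line =~ /^\s/ && defined $previous_line) {
            $previous_line .= $current_line;   # continuation: glue it on
        } else {
            push @logical_lines, $previous_line if defined $previous_line;
            $previous_line = $current_line;
        }
    }
    # Flush the last line, unless the header was empty.
    push @logical_lines, $previous_line if defined $previous_line;
    return @logical_lines;
}

my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "To: friend\@public.com\n";

my @lines = logical_lines($header);
print scalar(@lines), " logical lines\n";
print "$_\n" for @lines;
```

The first split has already thrown away the newlines, so a continued line is pasted together with only its own leading white space in between.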

A Simpler Way to Deal with the Continuation Lines

This is a general pattern that you can apply to any problem that involves continuations or escape sequences. It's fairly simple, but it turns out that in Perl there's an even simpler way to write the same thing, if we're willing to use a little regex magic:

        @logical_lines = split /\n(?!\s)/, $header;

This replaces the 11-line loop we had above.

What's going on here? It says that the delimiters between header lines aren't just \n characters; an \n by itself isn't enough. (?!foo) says that in order to match, Perl must not see foo coming up at that position in the string. (?!\s) says that the next character after the \n must not be a whitespace character. So where /\n/ will match any newline character, /\n(?!\s)/ will only match the newline characters that are not immediately followed by whitespace. These are precisely the ones that are at the ends of logical lines.
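
Here is the lookahead split applied to a cut-down header, so that the result can be inspected. The four-line header is assembled from the example above purely for the demonstration.

```perl
use strict;
use warnings;

my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "To: friend\@public.com\n"
           . "Subject: MAKE MONEY AT HOME!\n";

# Split only at newlines that are not followed by white space.
my @logical_lines = split /\n(?!\s)/, $header;

print scalar(@logical_lines), " logical lines\n";

# Bracket each line so that embedded newlines are visible.
print "[$_]\n" for @logical_lines;
```

One difference from the loop is worth knowing about: the newline inside a continued line is not a delimiter here, so it stays in the string. Something like s/\n\s+/ /g on each element would flatten it if that matters. Also, if the header string still ends with the blank line that separates it from the body, the last field keeps a trailing newline, so it may be worth trimming trailing white space first.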

Tune in Next Time

I'm out of space now (Ha ha, that's a joke; on the web you have as much space as you want. But actually I'm out of time because I have to go to Boulder now) so I'll continue next time. I'll show how to analyze the header that we just dismantled, and how to check the address against a lot of patterns at the same time to make sure it doesn't come from a known spammer. I'll also show a lot of heuristics that turned out to be useful for recognizing spam messages from the header. Here's a teaser: What's wrong with this picture?

        Received: from login_2961.sayme2.net (mail.sayme2.net[103.12.210.92])
	  by sayme2.net (8.8.5/8.7.3) with SMTP id XAA02040
	  for creditc@aoI.net;  Thu, 28 August 1997 15:51:23 -0700 (EDT)

Author

Mark-Jason Dominus is the managing editor of www.perl.com and is a longtime Perl programmer, system administrator, hacker, and spam hater. He is not actually a pigeon fancier.

mjd-lp-spam@plover.com