© Copyright 1999 Mark-Jason Dominus.
This article originally appeared on the Linux Planet Web Site
Note: This article originally appeared at the LinuxPlanet web site. I had intended it to be the first in a series of articles, but soon afterwards the LinuxPlanet editorship changed and they decided they didn't want the rest of the series. Perl.com will be carrying the rest of the articles. Part 2 and Part 3 are available now.
In May of 1994, while I was reading the phl.food newsgroup, I saw something new. It was a message with this subject:
U.S. Green Card Lottery - New Immigration Opportunity
That was not what I expected to see in phl.food, so I wrote to the author:
This has nothing at all to do with food, and you posted it to phl.food. Please be more polite in the future and keep announcements in relevant and appropriate groups.
And he sent me a reply:
People in your group are interested. Why do you wish to deprive them of what they consider to be important information??
I was really startled. I had naïvely expected that the author would recognize that he had done something incorrect once it was pointed out to him. Gosh, was I wrong!
That was the beginning of my life with spam. It was the now infamous `Green Card Spam' from Laurence Canter and Martha Siegel, a pair of incompetent lawyers. But they were on the leading edge of a big trend. Within two years the newsgroups were clogged with spam, and at the same time, email spam was becoming common.
My basic problem was that it annoyed me to get spam mail in my inbox every morning. I wanted to address this problem directly. Complaining back to the source wasn't what I was looking for, because I would still be annoyed. One solution would have been to stop getting annoyed, but I didn't have much luck with that, so I tried to think of ways to prevent the spam from getting into my inbox in the first place. I started thinking about filtering.
The idea of filtering is simple: Write a program that will examine each incoming message, and if it looks like spam, throw it away. If not, the program delivers it to my inbox as usual.
The idea is simple, but the implementation is hard. What does spam look like? And if I couldn't recognize it infallibly, what would happen when I made a mistake?
Failing to recognize spam is called a false negative, and is only a minor problem. The only thing that would happen was that my blood pressure would go up imperceptibly. But the opposite problem, a false positive, in which I incorrectly recognize and discard some message that wasn't spam, was much more frightening. If I erroneously identified a non-spam message as spam, and threw it away, I might lose important messages from clients, or from my lawyers, or from someone else important. So I decided that whatever happened, no message must ever be thrown away until I had had a chance to look at it. And messages that are rejected mustn't go into an oubliette somewhere unless someone is notified. I had nightmares about getting a call from my lawyer, asking why I hadn't gotten an important communication the previous month, and having to say it had gone into my spam box.
Mail messages contain an actual message with some information that someone wanted to send you; that's called the body of the message. They also contain some meta-information such as who the message is from, who it was sent to, when it was sent, and so forth; this part of the message is called the header. I could try to recognize spam by looking in the body, or in the header, or both. But I couldn't think of a good way to recognize spam by looking at the body that wouldn't also lead to an unacceptably high false positive rate. Every time I thought about the problem, I got stuck at the same place: Suppose I know that lots of spam arrives exhorting me to see sexy Annabel Chong, and suppose I were to establish a policy of rejecting messages that mention Annabel. Now suppose someone sends me a message that discusses spam filtering strategies and mentions Annabel in connection with this. ``Dominus, is it a good strategy to reject messages that mention Annabel Chong?'' Oops, I've just thrown their message away. Now suppose someone sent me a copy of the article you're reading right now. Oops, it mentions Annabel Chong also.
I decided that since there wasn't any way to tell whether a message actually mentions Annabel directly rather than mentioning that messages sometimes mention her, content filtering wasn't going to work. I didn't want to cut myself off from discussions about spam filtering.
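To make the false-positive problem concrete, here is a minimal sketch of the kind of body filter I decided against. The subroutine name and both sample messages are made up for illustration; this is not code from my real filter.

```perl
use strict;
use warnings;

# Hypothetical naive content filter: any body that mentions the
# phrase is treated as spam.
sub looks_like_spam {
    my ($body) = @_;
    return $body =~ /Annabel Chong/i ? 1 : 0;
}

# One real spam, and one legitimate question about spam filtering.
my $spam  = "See sexy Annabel Chong now!!!\n";
my $legit = "Dominus, is it a good strategy to reject messages\n"
          . "that mention Annabel Chong?\n";

for my $body ($spam, $legit) {
    my $verdict = looks_like_spam($body) ? "REJECT" : "ACCEPT";
    print "$verdict: $body";
}
```

Both messages are rejected. The pattern has no way to tell a message that uses the phrase from a message that merely talks about it.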
I decided I would have to filter based on information in the message header, not its body. The `subject' line is the most obvious place to start, but I didn't want to use it because it's really part of the message content and I would have had the same problem as if I were filtering on the message body. I didn't want to automatically reject mail that said
Subject: These green card lottery articles are driving me crazy!
Now, among people who've followed this train of thought there seem to be two basic strategies. One strategy is to simply reject mail that's not addressed to you; if the `To:' address isn't yours, throw it out. That sounds good, but it has some problems. I get a lot of mailing list and carbon-copy mail that might not include me in the recipient list for perfectly good reasons. The typical mailing-list mail arrives in your inbox without your address on it anywhere; it says something like
To: perl5-porters@perl.com
This is analogous to paper mailing list mail: You open it up and inside is a letter that says not `Dear Mark Dominus, ...' but rather `Dear Pigeon Fancier, ...'.
Also, I analyzed the spam messages I had collected and decided it wouldn't work well enough: about one spam message in five actually does arrive addressed to me.
Instead, the basic idea that I adopted was to make a list of domains that sent me a lot of spam, and to blacklist those domains. Any mail from a blacklisted domain would be rejected; other mail would be delivered. And now I'll show how to implement that, because the details turn out to be very interesting.
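As a rough sketch of the idea only (not the implementation developed later in this series), a domain blacklist check might look something like this. The domain names, the subroutine names, and the rule that a blacklisted domain also covers its subdomains are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical blacklist; these domain names are placeholders.
my %blacklisted = map { $_ => 1 } qw(spammer.example bulkmail.example);

# Return the lowercased domain part of an address such as
# "user@host.dom" or "Name <user@host.dom>"; undef if there is no @.
sub domain_of {
    my ($address) = @_;
    return $address =~ /\@([\w.-]+)/ ? lc($1) : undef;
}

# Assumption: a domain counts as blacklisted if it, or any parent
# domain, appears on the list.
sub is_blacklisted {
    my ($address) = @_;
    my $domain = domain_of($address);
    return 0 unless defined $domain;
    my @parts = split /\./, $domain;
    while (@parts) {
        return 1 if $blacklisted{ join '.', @parts };
        shift @parts;    # try the parent domain next
    }
    return 0;
}

for my $from ('LRS@mail.spammer.example', 'mjd@plover.com') {
    print "$from: ", (is_blacklisted($from) ? "reject" : "deliver"), "\n";
}
```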
Our filter program will be run by the mail transport agent (MTA), which is the program that is responsible for receiving mail from the network and for delivering it to the right place. With sendmail, for example, you can put a line like this one into your .forward file:
"| /home/mjd/bin/mailfilter"
When mail arrives, sendmail will run the mailfilter program and hand it the mail message on its standard input. mailfilter can then decide whether to deliver the message by writing it to the mailbox, or whether to throw it away, or whether to do something else. Most MTAs have an option to deliver mail to a program in this way. My system was using the superb qmail MTA, so I would put the line
| /home/mjd/bin/mailfilter
into my ~/.qmail file. (Actually my filter program is named deliver.aol.q2. Please don't ask why, because I don't remember.)
What does this mailfilter program need to do? Obviously, the first thing it must do is read the mail message in from the standard input. Code to read in an email message is very simple in Perl:
1  { local $/ = "";
2    $header = <STDIN>;
3    undef $/;
4    $body = <STDIN>;
5  }
This reads the header of the message into $header and the body into $body. What's going on here? The Perl <...> operator reads a line of input from some filehandle. But what's a line? Normally, it's any sequence of characters that's terminated by a newline character. Why a newline? Because that's the default setting of the Perl $/ special variable. If you change $/, that changes Perl's idea of what a line looks like. If you changed it to contain a period, then Perl would think that a `line' was any sequence of characters that ended with a period.
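Here is a small experiment that shows the period case. It reads from a string instead of a real file, which assumes Perl 5.8 or later; the sample text and variable names are made up.

```perl
use strict;
use warnings;

my $text = "First sentence. Second one. Third";

# Open the string as if it were a file (in-memory filehandle).
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = ".";      # a `line' now ends at a period
    @records = <$fh>;    # list context: read every `line'
}
close $fh;

print "[$_]\n" for @records;
```

Each record keeps its terminating period, just as an ordinary line keeps its newline, and the last record has no terminator because the input ran out first.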
There are two special settings for $/, however. If you set $/ to the empty string, as I did on line 1, the <...> operator reads by paragraphs instead of lines; consecutive paragraphs are separated by a blank line. Each call to <...> reads in one complete paragraph. Since the header of a mail message is a paragraph, separated from the body by a blank line, line 2 reads the entire header into $header.
The local on line 1 confines the changes to $/ to the block, so that when control reaches line 5, the original value of $/ is automatically restored. We might be doing file I/O in other parts of our program, and if we didn't put $/ back to normal we'd get weird results when the <...> operator didn't behave the way we expected. Using local ensures that we won't forget to put it back the way it was.
On line 3 we see the other special setting of $/. If $/ is undefined, then there is no line termination sequence, and Perl's <...> operator reads the entire rest of the input all at once. This is sometimes called `slurping' the input. Line 4 reads the entire message body into the variable $body.
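If you'd like to try this without hooking anything up to your MTA, here is a self-contained sketch of the same five lines. It reads a made-up message from a string rather than from STDIN (which assumes Perl 5.8 or later), and the addresses in the sample are placeholders.

```perl
use strict;
use warnings;

# Sample message, invented for this example.
my $message = <<'END';
From: someone@example.com
To: mjd@plover.com
Subject: Lunch

Are you free at noon?

Let me know.
END

open my $in, '<', \$message or die "Cannot open in-memory file: $!";

my ($header, $body);
{
    local $/ = "";    # paragraph mode: read up to the first blank line
    $header = <$in>;
    undef $/;         # slurp mode: read everything that is left
    $body = <$in>;
}
close $in;

# Outside the block, local has put $/ back to a newline.
print "separator restored\n" if $/ eq "\n";
print "HEADER:\n$header";
print "BODY:\n$body";
```

Notice that the blank line inside the body does no harm: by the time Perl reaches it, $/ is undefined, so nothing counts as the end of a `line' any more.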
We're not done processing the message. A mail header is made up of several lines that carry different information. You're probably familiar with the structure already; it looks like this:
Return-Path: LRS@getstartednow.com
Return-Path: <LRS@getstartednow.com>
Delivered-To: mjd-deliver@plover.com
Received: (qmail 15266 invoked by uid 119); 10 Apr 1997 05:08:37 -0000
Delivered-To: mjd-filter@plover.com
Received: (qmail 15261 invoked by uid 119); 10 Apr 1997 05:08:33 -0000
Delivered-To: mjd@plover.com
Received: (qmail 15258 invoked from network); 10 Apr 1997 05:08:31 -0000
Received: from renoir.op.net (root@206.84.208.4)
  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
Received: from major.globecomm.net (major.globecomm.net [207.51.48.5])
  by renoir.op.net (8.7.1/8.7.1/$Revision: 1.1 $) with ESMTP id BAA02191
  for <mjd@op.net>; Thu, 10 Apr 1997 01:06:35 -0400 (EDT)
From: LRS@getstartednow.com
Received: from globecomm.net (ip252.new-haven.ct.pub-ip.psi.net [38.11.102.252])
  by major.globecomm.net (8.8.5/8.8.0) with SMTP id BAA00454;
  Thu, 10 Apr 1997 01:06:23 -0400 (EDT)
Received: from mailhost.greatchances.com (alt3.greatchances.com(917.876.92.65))
  by greatchances.com (8.8.5/8.6.5) with SMTP id GAA04352
  for <friend@public.com>; Thu, 10 Apr 1997 01:00:43 -0600 (EST)
To: friend@public.com
Message-ID: <282732679098.HAb9037@greatchances.com>
Date: Thu, 10 Apr 97 01:00:43 EST
Subject: MAKE MONEY AT HOME!
X-UIDL: 698987574a97aqd1p134jud427k9a6d

We'd like to break this up into the individual lines and then put the information into a hash so that we can find the various parts easily. For example, we'd like to be able to find the `To:' address in $hash{To} and the subject in $hash{Subject}.
Breaking a Perl string into lines is easy; just use split:
@lines = split /\n/, $header;
This tells Perl to take the string $header and break it into lines wherever it sees a \n character. The \n's are discarded and the parts in between are stored into the elements of the Perl array @lines. Then we can dismantle each individual line:
foreach $line (@lines) {
  my ($label, $value) = split /:\s*/, $line, 2;
  $hash{$label} = $value;
}
This runs the loop once for each line. We use split again, this time to cut each line into two pieces. The /:\s*/ says that the pieces will be separated by a : followed by some white space; \s is an abbreviation for `space' and the star means that we don't know how much space there will be. For the header line Date: Thu, 10 Apr 97 01:00:43 EST this places Date into $label and Thu, 10 Apr 97 01:00:43 EST into $value. Notice that Perl does not split on the :'s in the date; that's because the 2 in the split tells Perl that there are only two fields here, so that it ignores any :'s after the first one. If we had omitted the 2, Perl would have split this line into four fields: $value would have gotten Thu, 10 Apr 97 01, and the two other fields with 00 and 43 EST would have been thrown away.
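You can check the effect of the 2 for yourself with a few lines like these. The variable names are invented for the example.

```perl
use strict;
use warnings;

my $line = "Date: Thu, 10 Apr 97 01:00:43 EST";

my @limited   = split /:\s*/, $line, 2;   # stop after the first colon
my @unlimited = split /:\s*/, $line;      # split at every colon

print scalar(@limited),   " fields: ", join(" | ", @limited),   "\n";
print scalar(@unlimited), " fields: ", join(" | ", @unlimited), "\n";
```

With the limit there are two fields, and the time of day survives intact in the second one. Without it there are four, and an assignment to ($label, $value) would silently drop the last two.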
Actually though, there's a problem with this. When header lines are long, they can be broken across two or more physical lines. There's an example of this above. The two physical lines
Received: from renoir.op.net (root@206.84.208.4)
  by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000
are actually one logical line that is broken in half. We are supposed to consider this to be one line, even though there's a newline character in the middle. The rule for recognizing these extended logical lines is simple: If a line starts with white space, it's a continuation of the previous line.
The code we have incorrectly breaks this header line in two. Then it sets $hash{Received} to the partial value from renoir.op.net (root@206.84.208.4). Then it tries to process the following line the same way, not realizing that it's a continuation. It breaks the continuation into two pieces, by plover.com with SMTP; 10 Apr 1997 05 and 08:31 -0000, and it interprets the first piece as the header field name and the second piece as the value. This is obviously all wrong.
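To see the damage, you can run the simple loop on just that one folded header line. This sketch uses the Received: line from the example; the exact indentation of the continuation line is an assumption.

```perl
use strict;
use warnings;

# One logical header line, folded across two physical lines.
my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n";

my %hash;
for my $line (split /\n/, $header) {
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

# Show every label the simple parser thinks it found.
for my $label (sort keys %hash) {
    print "[$label] => [$hash{$label}]\n";
}
```

The hash ends up with two entries: a truncated Received, and a bogus field whose `name' is the first part of the continuation line, leading spaces and all.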
We need a way to handle the continuation lines. Here's one way: We'll split up the lines as before, and then put the continuations back together.
 1  @physical_lines = split /\n/, $header;
 2  @logical_lines = ();
 3  for $current_line (@physical_lines) {
 4    if ($current_line =~ /^\s/) {
 5      $previous_line .= $current_line;
 6    } else {
 7      push @logical_lines, $previous_line if defined $previous_line;
 8      $previous_line = $current_line;
 9    }
10  }
11  push @logical_lines, $previous_line;
@logical_lines will contain the array of header lines after we've pasted the continued lines back together. We loop over the physical lines, and check each one to see if it begins with white space. We saw that \s is a Perl pattern for white space; the ^ in front of it requires that the white space occur at the beginning of the string. In the normal case, the line does not begin with white space, and we come to line 7. We store the previous line into the @logical_lines array, if there was one, because we can be sure we're done with it. And we remember the current line in the $previous_line variable in case it turns out that it is continued on the following line.
If the current line does begin with white space, it is a continuation of the previous line, and on line 5 we append it to $previous_line. $previous_line will keep getting longer and longer as long as we keep seeing continuation lines, and then finally when we see a line that's not a continuation, we'll push the entire $previous_line onto the list of logical lines at line 7.
When we reach the end of the header and exit the loop, the last logical line is still in $previous_line, so we have line 11 to take care of it.
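Here is the same loop wrapped up as a subroutine so that you can try it on a small header. The subroutine name and the sample header are inventions for this example, and I've added one small guard that the version above doesn't have: the final push is skipped if the header was empty.

```perl
use strict;
use warnings;

# Join each continuation line onto the line it continues.
sub logical_lines {
    my ($header) = @_;
    my @logical_lines;
    my $previous_line;
    for my $current_line (split /\n/, $header) {
        if ($current_line =~ /^\s/) {
            $previous_line .= $current_line;     # continuation
        } else {
            push @logical_lines, $previous_line if defined $previous_line;
            $previous_line = $current_line;      # start of a new field
        }
    }
    # Guard (an addition): nothing to push if the header was empty.
    push @logical_lines, $previous_line if defined $previous_line;
    return @logical_lines;
}

my $header = "To: friend\@public.com\n"
           . "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "Subject: MAKE MONEY AT HOME!\n";

my @lines = logical_lines($header);
print scalar(@lines), " logical lines\n";
print "$_\n" for @lines;
```

Four physical lines go in and three logical lines come out. The continuation keeps its leading spaces, which is what separates it from the text before it once the newline is gone.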
This is a general pattern that you can apply to any problem that involves continuations or escape sequences. It's fairly simple, but it turns out that in Perl there's an even simpler way to write the same thing, if we're willing to use a little regex magic:
@logical_lines = split /\n(?!\s)/, $header;
This replaces the 11-line loop we had above.
What's going on here? It says that the delimiters between header lines aren't \n characters; just an \n by itself isn't enough. (?!foo) says that in order to match, Perl must not see foo coming up at that position in the string. (?!\s) says that the next character after the \n must not be a whitespace character. So where /\n/ will match any newline character, /\n(?!\s)/ will only match the newline characters that are not immediately followed by whitespace. These are precisely the ones that are at the ends of logical lines.
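Putting the pieces together, a sketch of the whole header-to-hash step might look like this. Two things here are my additions for the example and not part of the one-liner: trimming the blank line that paragraph mode leaves at the end of $header, and folding each embedded newline plus its indentation into a single space. The sample header is made up from the lines shown earlier.

```perl
use strict;
use warnings;

# A header as paragraph mode would deliver it: it ends with a blank line.
my $header = "Received: from renoir.op.net (root\@206.84.208.4)\n"
           . "    by plover.com with SMTP; 10 Apr 1997 05:08:31 -0000\n"
           . "To: friend\@public.com\n"
           . "Subject: MAKE MONEY AT HOME!\n\n";

# Addition: drop the trailing newlines, so the last field stays clean.
$header =~ s/\n+\z//;

my @logical_lines = split /\n(?!\s)/, $header;

my %hash;
for my $line (@logical_lines) {
    # Addition: the lookahead split leaves the continuation's newline
    # inside the logical line; turn it and the indentation into a space.
    $line =~ s/\n\s+/ /g;
    my ($label, $value) = split /:\s*/, $line, 2;
    $hash{$label} = $value;
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

One difference from the loop version is worth knowing about: the loop throws the newlines away when it splits, while the lookahead version keeps the ones inside a logical line, so something like the substitution above is needed if you want each field on a single line.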
I'm out of space now (Ha ha, that's a joke; on the web you have as much space as you want. But actually I'm out of time because I have to go to Boulder now) so I'll continue next time. I'll show how to analyze the header that we just dismantled, and how to check the address against a lot of patterns at the same time to make sure it doesn't come from a known spammer. I'll also show a lot of heuristics that turned out to be useful for recognizing spam messages from the header. Here's a teaser: What's wrong with this picture?
Received: from login_2961.sayme2.net (mail.sayme2.net[103.12.210.92]) by sayme2.net (8.8.5/8.7.3) with SMTP id XAA02040 for creditc@aoI.net; Thu, 28 August 1997 15:51:23 -0700 (EDT)
Mark-Jason Dominus is the managing editor of www.perl.com and is a longtime Perl programmer, system administrator, hacker, and spam hater. He is not actually a pigeon fancier.