Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.
Text-matching rule-based anti-spam systems are pretty common — SpamAssassin, of course, is probably the most well-known, and of course the proprietary apps built on SpamAssassin also use this. However, other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.
So far, most SpamAssassin text rules are manually developed; somebody looks at
a few spam samples, spots common phrases, and writes a rule to match
that. It’d be great to
automate more of that work. Here’s an algorithm I’ve developed to perform this
in a memory-efficient and time-efficient way. I’m quite proud of this, so
thought it was worth a blog posting. ;)
Corpus collection
First, we collect a corpus of spam and "ham" (non-spam) mails. Standard
enough, although in this case it helps to try to keep it to a specific type of
mail (for example, a recent stock spam run, or a run from the OEM spammer).
Typically, a simple "grep" will work here, as long as the source corpus is all
spam anyway; a small number of irrelevant messages can be left in, as long as
the majority 80% or so are variations on the target message set. (The
SpamAssassin mass-check tool can now perform this on the fly, which is
helpful, using the new ‘GrepRenderedBody’ mass-check plugin.)
Rendering
Next, for each spam message, render the body. This involves:
- decoding MIME structure
- discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
- decoding quoted-printable and base64 encoding
- rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
- normalising whitespace, "this is\na \ntest" -> "this is a test"
All pretty basic stuff, and performed by the SpamAssassin "body" rendering process during a "mass-check" operation. A SpamAssassin plugin outputs each message’s body string to a log file.
Next, we take the two log files, and process them using the following algorithm:
N-gram Extraction
Iterate through each mail message in the spam set. Each message is assigned a
short message ID number. Cut off all but the first 32 kbytes of the text (for
this algorithm, I think it’s safe to assume that anything past 32 KB will not
be a useful place for spammers to place their spam text). Save a copy of this
shortened text string for the later "collapse patterns" step.
Split the text into "words" — ie. space-separated chunks of non-whitespace
chars. Compress each "word" into a shorter ID to save space:
"this is a test" => "a b c d"
(The compression dictionary used here is shared between all messages, and also
needs to allow reverse lookups.)
Then tokenize the message into 2-word and 3-word phrase snippets (also known
as N-grams):
"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]
Remove duplicate N-grams, so each N-gram only appears once per message.
For each N-gram token in this token set, increment a counter in a global "token
count" hashtable, and add the message ID to the token’s entry in a "message
subset hit" table.
Next, process the ham set. Perform the same algorithm, except: don’t keep the
shortened text strings, don’t cut at 32KB, and instead of incrementing the
"token count" hash entries, simply delete the entries in the "token count" and
"message subset hit" tables for all N-grams that are found.
By the end of this process, all ham and spam have been processed, and in a
memory-efficient fashion.
We now have:
- a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
- the "message subset hit" table, containing info about exactly which subset of messages contain a given N-gram
- the token-to-word reverse-lookup table
To further reduce memory use, the word-to-token forward-lookup table can now be
freed. In addition, the values in the "message subset hit" table can be
replaced with their hashes; we don’t need to be able to tell exactly which
messages are listed there, we just need a way to tell if one entry is equal to
another.
Summarisation
Iterate through the hit-count table. Discard entries that occur too
infrequently to be listed; discard, especially, entries that occur only once.
(We’ve already discarded entries that hit any ham.)
Make a hash that maps the message subsets to the set of all N-gram patterns for
that message-subset. For each subset, pick a single N-gram, and note the
hit-count associated with it as the hit-count value for that entire
message-subset. (Since those N-grams all appear in the exact same subset of
messages, they will always have the same P(spam) — this is a safe shortcut.)
Iterate through the message subsets, in order of their hit-count. Take all of
the message-subset’s patterns, decode the N-grams in all patterns using the
token-to-word reverse-lookup table, and apply this algorithm to that pattern
set:
Collapse patterns
So, input here is an array of N-gram patterns, which we know always occur in
the same subset of messages. We also have the saved array of all spam
messages’ shortened text strings, from the N-gram extraction step. With this,
we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.
Pop the first entry off the array of patterns. Find any one mail from the
saved-mails array that hits this pattern. Find the single character before
the pattern in this mail, and prepend it to the pattern. See if the hits for
this new pattern are the same message set as hit the old pattern; if not,
restore the old pattern and break. If you hit the start of the mail message’s
text string, break. Then apply the same algorithm forward through the mail
text.
By the end of that, you have expanded the pattern from the basic N-gram
as far as it’s possible to go in both directions without losing a hit.
Next, discard all patterns in the pattern array that are subsumed by (ie.
appear in) this new expanded pattern. Add it to the output list of
expanded patterns, unless it in turn is already subsumed by a pattern
in that list; discard any patterns in the output list that are subsumed
by this new pattern; and move onto the next pattern in the input
list until they’re all exhausted.
(By the way, the "discard if subsumed" trick is the reason why we start off
with 3-word N-grams — it gives faster results than just 2-word N-grams alone,
presumably by reducing the amount of work that this collapse stage has to do,
by doing more of it upfront at a relatively small RAM cost.)
Summarisation (continued)
Finally, output a line listing the percentage of the input spam messages hit
(ie. (hit-count value / total number of spams) * 100
) and the list
of expanded patterns for that message-subset, then iterate on to the next
message-subset.
Example
Here’s an example of some output from recent "OEM" stock spam:
$ ./seek-phrases-in-corpus --grep 'OEM' \
spam:dir:/local/cor/recent/spam/*.2007022* \
ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
RATIO SPAM% HAM% DATA
1.000 72.421 0.000 / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
/ TOP 1O ITEMS/
1.000 73.745 0.000 / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
/s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
/ Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
/ Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
/ $49 Windows XP Pro w\/SP2 $/,
/ Top-ranked item. (/,
/, use electronic manuals. Pay for software only and save 75-90%! /,
/ Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
/ $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
/ Best choice for home and professional. (/,
/ OEM software - throw packing case, leave CD/,
/ Sales Rank: #1 (/,
/ $79 Microsoft Windows Vista /,
/ manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
1.000 73.855 0.000 / MS Office Enterprise 2007 /,
/9 Microsoft Windows Vista /,
/ Microsoft Windows Vista Ultimate /,
/9 Macromedia Studio 8 /,
/ Adobe Acrobat 8.0 /,
/ $79 Adobe /
1.000 74.242 0.000 / Windows XP Pro/
1.000 74.297 0.000 / Adobe Acrobat /
1.000 74.462 0.000 / Adobe Creative Suite /
1.000 74.573 0.000 / Adobe After Effects /
1.000 74.738 0.000 / Adobe Illustrator /
1.000 74.959 0.000 / Adobe Photoshop CS/
1.000 75.014 0.000 / Adobe Premiere /
1.000 75.290 0.000 / Macromedia Studio /
1.000 75.786 0.000 /OEM software/
1.000 75.841 0.000 / Creative Suite /
1.000 75.896 0.000 / Photoshop CS/
1.000 75.951 0.000 / After Effects /
1.000 76.062 0.000 /XP Pro/
1.000 82.460 0.000 / $899.00 Our /,
/ Microsoft Office 2007 Enterprise /,
/ $79.95 You/
Immediately, that provides several useful rules; in particular, that final
set of patterns can be combined with a SpamAssassin "meta" rule to hit 82% of
the samples. Generating this took a quite reasonable 58MB of virtual memory,
with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails
on a 1.7Ghz Pentium M laptop.
(Update:) here’s a sample message from that test set, demonstrating the top
extracted snippets in bold:
Return-Path: <tyokaluassa.com@ultradian.com>
X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
by dogma.boxhost.net (Postfix) with SMTP id E002F310055
for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
From: "Kevin Morris" <tyokaluassa.com@ultradian.com>
To: <xxxxxxxx@jmason.org>
Subject: Need S0ftware?
Date: Sun, 18 Feb 2007 03:57:56 -0500
OEM software - throw packing case, leave CD, use electronic manuals.
Pay for software only and save 75-90%!
Discounts! Special offers! Software for home and office!
TOP 1O ITEMS.
$79 Microsoft Windows Vista Ultimate
$79 MS Office Enterprise 2007
$79 Adobe Acrobat 8 Pro
$49 Windows XP Pro w/SP2
$99 Macromedia Studio 8
$59 Adobe Premiere 2.0
$59 Corel Grafix Suite X3
$59 Adobe Illustrator CS2
$129 Autodesk Autocad 2007
$149 Adobe Creative Suite 2
http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0
Mac Specials:
Adobe Acrobat PR0 7 $69
Adobe After Effects $49
Adobe Creative Suite 2 Premium $149
Ableton Live 5.0.1 $49
Adobe Photoshop CS $49
http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
6E1FAE&t6
See more by this manufacturers:
Microsoft...Mac...Adobe...Borland...Macromedia
http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4
Microsoft Windows Vista Ultimate
Retail price: $399.00
Proposition: $79.95
Your benefit: $319.05 (80%)
Availability: Can be downloaded INSTANTLY.
http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
Best choice for home and professional. (37268 reviews)
Microsoft Office 2007 Enterprise Edition
Regular price: $899.00
Our offer: $79.95
You save: $819.95 (89%)
Availability: Pay and download instantly.
http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
Sales Rank: #1 (121329 reviews)
Adobe Acrobat 8.0 Professional
Market price: $449.00
We propose: $79.95
Your profit: $369.05 (80%)
Availability: Available for INSTANT download.
http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
Top-ranked item. (31949 reviews)
Further work
Things that would be nice:
- It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching "anys", also known as "gapped alignment" searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
- A way to detect and reverse-engineer templates, e.g. "this is foo", "this is bar", "this is baz" => "this is (foo|bar|baz)", would be great.
- Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.
Tuits are the problem, of course, since $dayjob is the one that pays the
bills, not this work. :(
The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!
Some credit: I should note that IBM’s
Chung-Kwei system,
presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias
algorithm) being applied to spam.