Another good trip report, from ‘babbage’ at perl.org.
-
Again, and interestingly, quite a few folks agreed with one of SA’s core tenets; no single approach (stats, RBLs, rules, distributed hashes) can filter effectively on its own, as spammers will soon figure out a way to subvert that technique. However, if you combine several techniques, they cannot all be subverted at once, so your effectiveness in the face of active attacks is much better.
-
Also interesting to note how everyone working with learning-based approaches commented on how hard it was to persuade ‘normal people’ to keep a corpus. Let’s hope SA’s auto-training will work well enough to avoid that problem.
-
in passing — babbage noted the old canard about Hotmail selling their user database to spammers. That must really piss the Hotmail folks off ;) I think it’s much more likely that, with Moore’s Law and the modern internet, a dictionary attack *will* find your account eventually.
-
Good tip on the legal angle from John Praed of The Internet Law Group: if a spam misuses the name of a trademarked product like ‘Viagra’, get a copy to Pfizer pronto. Trademark holders have a particular desire to follow up on infringements like this, as an undefended trademark loses its TM status otherwise.
-
David Berlind, ZDNet executive editor: ‘They don’t want to be involved (in developing an SMTPng)’. He might say that, but I bet their folks working on sending out their bulk-mailed email newsletters might disagree ;). Legit bulk mail senders have to be involved for it to work, and they will want to be involved, too.
-
Brightmail have a patent on spam honeypots? Must take a look for this sometime.
-
the plural of ‘corpus’ is ‘corpora’ ;)
Great report, overall.
It’s interesting to see that Infoworld notes that reps from AOL, Yahoo! and MS were all present.
Since the conf, Paul Graham has a new paper up about ‘Better Bayesian Filtering’, and lists some new tokenization techniques he’s using:
-
keep dollar signs, exclamation and most punctuation intact (we do that!)
-
prepend header names to header-mined tokens (us too!)
-
case is preserved (ditto!)
-
keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to ‘Subject:free’, to ‘FREE!!!’, and ‘free’. (ditto! well, partly. We use degeneration of tokens, but we keep the degenerate tokens in a separate, prefixed namespace from the non-degenerate ones, as he contemplates in footnote 7. It’s worth noting that case-sensitivity didn’t work well compared to the database bloat it produced; each token needs to be duplicated into the case-insensitive namespace, but that doubled the database size, and the hit-rate didn’t go up nearly enough to make it worthwhile.)
Most of these were also discovered and verified experimentally by SpamBayes, too, BTW.
When we were working on SpamAssassin‘s Bayesian-ish implementation, we took a scientific approach, and used suggestions from the SpamBayes folks and from the SpamAssassin community on tokenizer and stats-combining techniques. We then tested these experimentally on a test corpus, and posted the results. In almost all cases, our results matched up with the SpamBayes folks’ results, which is very nice, in a scientific sense.
(PS: update on the Fly UI story — ‘apis’ is not French, it’s Latin. oops! Thanks Craig…)