Skip to content

Archives

Lotsa SpamConf linkage and commentary

Another good trip report, from ‘babbage’ at perl.org.

  • Again, and interestingly, quite a few folks agreed with one of SA’s core tenets; no single approach (stats, RBLs, rules, distributed hashes) can filter effectively on its own, as spammers will soon figure out a way to subvert that technique. However, if you combine several techniques, they cannot all be subverted at once, so your effectiveness in the face of active attacks is much better.

  • Also interesting to note how everyone working with learning-based approaches commented on how hard it was to persuade ‘normal people’ to keep a corpus. Let’s hope SA’s auto-training will work well enough to avoid that problem.

  • in passing — babbage noted the old canard about Hotmail selling their user database to spammers. That must really piss the Hotmail folks off ;) I think it’s much more likely that, with Moore’s Law and the modern internet, a dictionary attack *will* find your account eventually.

  • Good tip on the legal angle from John Praed of The Internet Law Group: if a spam misuses the name of a trademarked product like ‘Viagra’, get a copy to Pfizer pronto. Trademark holders have a particular desire to follow up on infringements like this, as an undefended trademark loses its TM status otherwise.

  • David Berlind, ZDNet executive editor: ‘They don’t want to be involved (in developing an SMTPng)’. He might say that, but I bet their folks working on sending out their bulk-mailed email newsletters might disagree ;). Legit bulk mail senders have to be involved for it to work, and they will want to be involved, too.

  • Brightmail have a patent on spam honeypots? Must take a look for this sometime.

  • the plural of ‘corpus’ is ‘corpora’ ;)

Great report, overall.

It’s interesting to see that Infoworld notes that reps from AOL, Yahoo! and MS were all present.

Since the conf, Paul Graham has a new paper up about ‘Better Bayesian Filtering’, and lists some new tokenization techniques he’s using:

  • keep dollar signs, exclamation and most punctuation intact (we do that!)

  • prepend header names to header-mined tokens (us too!)

  • case is preserved (ditto!)

  • keep ‘degenerate’ tokens; ‘Subject:FREE!!!’ degenerates to ‘Subject:free’, to ‘FREE!!!’, and ‘free’. (ditto! well, partly. We use degeneration of tokens, but we keep the degenerate tokens in a separate, prefixed namespace from the non-degenerate ones, as he contemplates in footnote 7. It’s worth noting that case-sensitivity didn’t work well compared to the database bloat it produced; each token needs to be duplicated into the case-insensitive namespace, but that doubled the database size, and the hit-rate didn’t go up nearly enough to make it worthwhile.)

Most of these were also discovered and verified experimentally by SpamBayes, too, BTW.

When we were working on SpamAssassin‘s Bayesian-ish implementation, we took a scientific approach, and used suggestions from the SpamBayes folks and from the SpamAssassin community on tokenizer and stats-combining techniques. We then tested these experimentally on a test corpus, and posted the results. In almost all cases, our results matched up with the SpamBayes folks’ results, which is very nice, in a scientific sense.

(PS: update on the Fly UI story — ‘apis’ is not French, it’s Latin. oops! Thanks Craig…)