Found on Paul Graham’s site: “according to a recent
study, the MAPS RBL, probably the best known blacklist, catches only
24% of spam, with 34% false positives. It would take a conscious effort to
write a content-based filter with performance that bad.”
The “recent study” was carried out by David Nelson at Giga Information Group,
sometime last year.
For the sake of it, I’ve checked out how the MAPS figures stack up using
TCR, Ion Androutsopoulos’
metric for measuring spam filter performance. TCR is a very nice
single-figure metric, which takes into account the “inconvenience
factor” of misfiled mails, based on a “lambda” setting indicating what
action is taken when a mail is classified as spam. For MAPS, I’m assuming a
lambda of 9, the guideline figure for systems which bounce mail back to
the sender, as opposed to 1 for simple tagging, or 999 for outright deletion
with no notification.
So: using a lambda of 9, MAPS gets a TCR of 0.0912, a Spam Recall of 24%,
and a Spam Precision of 17%. It’s worth noting that the baseline figure
for TCR is 1.0, which represents no filtering whatsoever, i.e. all the spam
comes right into your mailbox. Anything below 1.0 is worse than that.
In other words, using MAPS is more inconvenient all-round than not
filtering your mail at all, if these figures are to be believed ;)
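For anyone who wants to check the sums, here is a rough Python sketch of how TCR is usually computed: the cost of having no filter (every spam arrives) divided by the cost with the filter (each misfiled legitimate mail counts lambda times, each missed spam counts once). The function and parameter names are my own, and the counts in the usage lines are made up purely for illustration; they are not figures from the study.

```python
def tcr(n_spam, n_legit_as_spam, n_spam_as_legit, lam=9):
    """Total Cost Ratio from raw counts.

    n_spam           -- total number of spam messages in the test set
    n_legit_as_spam  -- legitimate mails wrongly classified as spam
    n_spam_as_legit  -- spam mails that got past the filter
    lam              -- the "lambda" weight for a misfiled legitimate mail
    """
    cost_with_filter = lam * n_legit_as_spam + n_spam_as_legit
    if cost_with_filter == 0:
        return float("inf")  # a perfect filter: no errors at all
    return n_spam / cost_with_filter


def tcr_from_rates(recall, precision, lam=9):
    """Same metric, expressed via spam recall and spam precision (0..1)."""
    # Per spam message: false positives = recall * (1 - precision) / precision,
    # missed spam = 1 - recall.
    false_pos = recall * (1.0 - precision) / precision
    return 1.0 / (lam * false_pos + (1.0 - recall))


# Hypothetical counts: 100 spams, 10 good mails bounced, 50 spams missed.
example = tcr(100, 10, 50, lam=9)

# A filter that catches nothing and blocks nothing sits at the baseline.
baseline = tcr(100, 0, 100, lam=9)
```

Plugging the rounded 24% recall and 17% precision into `tcr_from_rates` with a lambda of 9 gives a little under 0.09, in the same ballpark as the 0.0912 above; the gap is presumably down to rounding in the quoted percentages.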
More spam: I’ve just assembled a totally public corpus of spam
and non-spam mail, to allow spam filter developers to compare and
contrast results using the same data. Let’s hope it proves useful.
Not spam: finally, I’m off to Chester for a wedding tomorrow morning;
my good mates Kitty and Gerry are tying the knot, in Chester Zoo, no less.
Let’s hope this horrible cold I’ve had all week dies down before
Saturday…