Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.
In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using <a href=’http://www.sensorynetworks.com/pressreleases/PR0060_2006_05_02_NCASA-formatted.pdf’>dedicated hardware co-processors.
Sure enough, Matt’s results were pretty good — he says, ‘I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)
Unfortunately, turning this into something that works with SpamAssassin hasn’t been quite so easy. SpamAssassin rules are free to use the full perl regular expression language — and this language supports many features that re2c’s subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that’s not a massive problem.
More problematically, re2c itself does not support nested patterns — if one token is contained within another, e.g. "FOO" within "FOOD", then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.
For simple text patterns, this is easy enough to do using substring matching — e.g. "FOOD" =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .
The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as <a href=’http://en.wikipedia.org/wiki/Aho-Corasick_algorithm’>Aho-Corasick does.
I <a href=’http://sourceforge.net/tracker/index.php?func=detail&aid=1540845&group_id=96864&atid=616203′>requested support for this in re2c, but got a brush-off, unfortunately. So work continues…
In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty clear that it isn’t food poisoning or a stomach bug… but I still have no idea what it actually is. No fun :(