I’ve been looking some more into recent new features added to bleadperl by <a href=’http://use.perl.org/~demerphq/’>demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.
These are the "base strings" extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A "base string" is a simplified subset of the regular expression; specifically, these are the cases where the "base strings" of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.
The base strings appear in that file as "r" lines, like so:
r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE r I drive a:__DOS_I_DRIVE_A r I might be c:__DOS_COMING_TO_YOUR_PLACE r I might c:__DOS_COMING_TO_YOUR_PLACE
The base string is the part after "r" and before the ":"; after that, the rule names appear.
Now, here are some limitations that make this less easy:
-
One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.
-
One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.
-
No anchors: the strings may match anywhere inside the line, similar to
("foo bar baz" =~ /bar/)
. -
Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.
-
Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string "food" against the strings "food" and "foo" should just fire on "food", not on "foo".
-
Overlapping is permitted: on the other hand, overlapping is fine; "foobar" matched against "foo" and "oobar" should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)
-
Most rules are more complex: most of the ruleset — as you can see from the ‘orig’ lines in that file — are more complex than the base string alone. So this means that a base string match often needs to be followed by a "verification" match using the full regexp.
Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) "body text" of a mail message, with each paragraph appearing as a single "line", and run all those base strings in parallel, identifying the rule names that then need to be run.
This is turning out to be quite tricky with the bleadperl trie code.
For example, if we have 3 base strings, as follows:
hello:RULE_HELLO hi:RULE_HI foo:RULE_FOO
At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:
%base_to_rulename_lookup = ( 'hello' => ['RULE_HELLO'], 'hi' => ['RULE_HI'], 'foo' => ['RULE_FOO'] ); if ($line =~ m{(hello|hi|foo)}) { $rule_fired = $base_to_rulename_lookup{$1}; }
However, that will fail in the face of the string "hi foo!", since only one of the bases will be returned as $1, whereas we want to know about both "RULE_HI" and "RULE_FOO".
m//gc might help:
%base_to_rulename_lookup = ( 'hello' => ['RULE_HELLO'], 'hi' => ['RULE_HI'], 'foo' => ['RULE_FOO'] ); while ($line =~ m{(hello|hi|foo)}gc) { $rule_fired = $base_to_rulename_lookup{$1}; }
That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string "abcd", for example, will fire only on "abc", and miss the "bcd" hit.
Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:
if ($line =~ m{hello|hi|foo}) { $line =~ /hello/ and rule_fired("HELLO"); $line =~ /hi/ and rule_fired("HI"); $line =~ /foo/ and rule_fired("FOO"); }
Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(
This can be optimized a little by breaking into multiple trie/match sets:
if ($line =~ m{hello|hi}) { $line =~ /hello/ and rule_fired("HELLO"); $line =~ /hi/ and rule_fired("HI"); ... } if ($line =~ m{foo|bar}) { $line =~ /foo/ and rule_fired("FOO"); $line =~ /bar/ and rule_fired("BAR"); ... }
But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.
Suggestions, anyone?
(by the way, if you’re curious, the current code is here in SVN.)