An interesting technique, from the ClamAV development list — using Bloom filters to speed up string searching. This kind of thing works well when you’ve got 1 input stream, and a multitude of simple patterns that you want to match against the stream. Bloom filters are a hashing-based technique to perform extremely fast and memory-efficient, but false-positive-prone, binary lookups.
The mailing list posting (‘Faster string search alternative to Boyer-Moore‘) gives some benchmarks from the developers’ testing, along with the core (GPL-licensed) code:
Regular signatures (28,326) :
Extended Boyer-Moore: 11 MB/s
BloomAV 1-byte: 89 MB/s
BloomAV 4-bytes: 122 MB/s
Some implementation details:
the (implementation) we chose is a simple bit array of (256 K * 8) bits. The filter is at first initialized to all zeros. Then, for every virus signature we load, we take the first 7 bytes, and hash it with four very fast hash functions. The corresponding four bits in the bloom filter are set to 1s.
Our intuition is that if the filter is small enough to fit in the CPU cache, we should be able to avoid memory accesses that cost around 200 CPU cycles each.
Also, in followup discussion, the following paper was mentioned: A paper describing hardware-level Bloom filters in the Snort IDS — S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. W. Lockwood, “Deep packet inspection using parallel Bloom filters,” in Hot Interconnects, (Stanford, CA), pp. 44–51, Aug. 2003.
This system is dubbed ‘BloomAV’. Pretty cool. It’s unclear if the ClamAV developers were keen to incorporate it, though, but it does point at interesting new techniques for spam signatures.