Thanks to Pierce for pointing me at this review of an interesting-sounding book called Introduction to Information Retrieval. The book sounds quite useful, but I wanted to pick out a particularly noteworthy quote, on compression:
One benefit of compression is immediately clear. We need less disk space.
There are two more subtle benefits of compression. The first is increased use of caching … With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query … we instead access its postings list in memory and decompress it … Increased speed owing to caching — rather than decreased space requirements — is often the prime motivator for compression.
The second more subtle advantage of compression is faster transfer of data from disk to memory … We can reduce input/output (IO) time by loading a much smaller compressed postings list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.
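To make the "much smaller compressed postings list" idea concrete, here is a minimal sketch in Python. The quoted passage doesn't say which compression scheme is used, so this is an assumption on my part: it stores the gaps between sorted document IDs and encodes each gap with a variable-byte code (7 payload bits per byte, high bit set on the last byte of each number). The function names and the sample document IDs are made up for illustration.

```python
from typing import Iterable, List


def vb_encode_number(n: int) -> bytes:
    """Variable-byte encode one non-negative integer.

    Each byte carries 7 bits of the number; the high bit is set only on
    the final byte, which tells the decoder where the number ends.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.append(n % 128)
        n //= 128
    chunks.reverse()          # most significant chunk first
    chunks[-1] += 128         # flag the terminating byte
    return bytes(chunks)


def encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Encode a postings list as variable-byte coded gaps between sorted IDs."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Invert encode_postings: rebuild the gaps, then running-sum them into IDs."""
    doc_ids: List[int] = []
    current_gap = 0
    previous = 0
    for byte in data:
        if byte < 128:
            # continuation byte: shift in 7 more bits
            current_gap = current_gap * 128 + byte
        else:
            # terminating byte: finish this gap and emit a document ID
            current_gap = current_gap * 128 + (byte - 128)
            previous += current_gap
            doc_ids.append(previous)
            current_gap = 0
    return doc_ids


# Hypothetical postings list for a single term
postings = [824, 829, 215406]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Here the three document IDs take 6 bytes encoded, against 12 bytes as fixed 32-bit integers. Decoding is a single pass of shifts and adds, which is the kind of cheap CPU work the quote is trading against disk reads.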
This is something I’ve been thinking about recently: CPU speed has so far outstripped disk I/O speed and network bandwidth that pervasive compression may be worthwhile. It’s simply worth keeping data compressed for longer, since CPU is cheap. There’s certainly little point in not compressing data travelling over the internet, anyway.
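As a rough way to check that trade-off on your own data, here is a small sketch using Python's standard zlib module. It reports how many bytes you would save on disk or on the wire, and how long decompression takes. The `compression_report` helper and the sample payload are hypothetical, and the numbers will vary a lot with the data and the machine, so treat it as a starting point rather than a benchmark.

```python
import time
import zlib


def compression_report(payload: bytes, level: int = 6) -> dict:
    """Compress payload with zlib and time how long decompression takes."""
    compressed = zlib.compress(payload, level)

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    elapsed = time.perf_counter() - start

    if restored != payload:
        raise RuntimeError("round trip through zlib did not restore the payload")

    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(compressed) / len(payload),
        "decompress_seconds": elapsed,
    }


# Hypothetical, highly repetitive payload; real data will compress less well
sample = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 2000
report = compression_report(sample)
print(report)
```

If the time saved by moving `compressed_bytes` instead of `raw_bytes` over your disk or network is larger than `decompress_seconds`, compression is a net win for that payload.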
On other topics, it looks equally insightful; the quoted paragraphs on Naive Bayes and feature selection algorithms both cover things I learned myself, "in the field", so to speak, working on classifiers. I really should have read this book years ago, I think ;)
The entire book is online here, in PDF and HTML. One to read in that copious free time…