
Justin's Linklog Posts

Links for 2009-02-11

Config management as cookery

Interesting to see Chef, a configuration management framework using cooking as a metaphor.

Back in the early ’90s at Iona, I wrote a user/group synchronization tool called "greenpages" which used a cooking metaphor: "spice" (data) was added to "raw" (template) files to produce "cooked" output. Great minds, eh!

Links for 2009-02-09

IR book recommendation

Thanks to Pierce for pointing me at this review of an interesting-sounding book called Introduction to Information Retrieval. The book sounds quite useful, but I wanted to pick out a particularly noteworthy quote, on compression:

One benefit of compression is immediately clear. We need less disk space.

There are two more subtle benefits of compression. The first is increased use of caching … With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query … we instead access its postings list in memory and decompress it … Increased speed owing to caching — rather than decreased space requirements — is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory … We can reduce input/output (IO) time by loading a much smaller compressed postings list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

This is something I’ve been thinking about recently — we’re getting to the stage where CPU speed has so far outstripped disk I/O speed and network bandwidth that pervasive compression may be worthwhile. It’s simply worth keeping data compressed for longer, since CPU is cheap. There’s certainly little point in not compressing data travelling over the internet, anyway.
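
The book’s compression chapter goes into gap encoding and variable-byte codes; as a toy illustration of the basic idea (not the book’s actual encoding, just zlib over delta-encoded doc IDs, with made-up data), something like this Perl sketch shows the kind of saving involved:

#!/usr/bin/perl -w
# Toy illustration: compare the raw packed size of a postings list of
# ascending doc IDs against a delta-encoded, zlib-compressed version.
# The doc IDs are made up; a real index would use varbyte/gamma codes.
use strict;
use Compress::Zlib;

# made-up postings list: 100,000 ascending doc IDs with small random gaps
my ($doc, @postings) = (0);
push @postings, ($doc += 1 + int rand 20) for (1 .. 100_000);

my $raw = pack("N*", @postings);

# store the gaps between successive doc IDs rather than the IDs themselves;
# small, repetitive gaps compress far better than the absolute values
my @gaps = ($postings[0]);
push @gaps, $postings[$_] - $postings[$_ - 1] for (1 .. $#postings);
my $packed = compress(pack("N*", @gaps));

printf "raw: %d bytes, delta+zlib: %d bytes (%.1f%%)\n",
    length($raw), length($packed), 100 * length($packed) / length($raw);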

On other topics, the book looks equally insightful; the quoted paragraphs on Naive Bayes and feature-selection algorithms both cover things I learned "in the field", so to speak, working on classifiers. I really should have read this book years ago, I think ;)

The entire book is online here, in PDF and HTML. One to read in that copious free time…

Good reasons to host inelastically on EC2

Recently, there’s been a bit of discussion online about whether or not it makes sense for companies to host server infrastructure at Amazon EC2, or on traditional colo infrastructure. Generally, these discussions have focussed on one main selling point of EC2: its elasticity, the ability to horizontally scale the number of server instances at a moment’s notice.

If you’re in a position to gain from elasticity, that’s great. But even if you aren’t, there’s another good reason to host at an EC2-like cloud: the ability to deploy another copy of the app, either from a different version-control branch (dev vs staging vs production deployments), or with customizations for different customers. These aren’t cases of scaling an existing app up; they’re cases of creating new copies of it, and EC2 works nicely for this.

If you can deploy a set of servers with one click from a source code branch, this is entirely viable and quite useful.
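
As a rough sketch of what that “one click” might look like in Perl, here’s the idea using the Net::Amazon::EC2 CPAN module. The AMI ID, keypair and security group below are placeholders, and it assumes an AMI whose boot scripts check out and start whatever branch is named in the user-data:

#!/usr/bin/perl -w
# Hypothetical "deploy a branch" helper: boot a fresh EC2 instance and
# pass it the branch name via user-data, so the (assumed) AMI boot
# scripts can check out and start that copy of the app.
use strict;
use Net::Amazon::EC2;

my $branch = shift || 'staging';

my $ec2 = Net::Amazon::EC2->new(
    AWSAccessKeyId  => $ENV{AWS_ACCESS_KEY_ID},
    SecretAccessKey => $ENV{AWS_SECRET_ACCESS_KEY},
);

my $res = $ec2->run_instances(
    ImageId       => 'ami-00000000',      # placeholder AMI
    MinCount      => 1,
    MaxCount      => 1,
    KeyName       => 'deploy-key',        # placeholder keypair
    SecurityGroup => 'appserver',         # placeholder security group
    UserData      => "BRANCH=$branch",    # NB: some versions expect this base64-encoded
);

print "launched ", $res->instances_set->[0]->instance_id,
    " for branch '$branch'\n";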

Another reason: EC2-to-S3 traffic is extremely fast and cheap compared to external-to-S3. So if you’re hosting your data on S3, EC2 is a great way to crunch on it efficiently. Update: Walter observed this too on the backend for his Twitter Mosaic service.

Ice Cycling

I seem to have invented a new extreme sport on the way into work: Ice Cycling. The roads were like an ice-skating rink. Scary stuff :(

Here’s some advice for anyone in the same boat:

  • use a high gear: avoid low gears if possible, even when starting off. Low revs mean less torque at the wheel, so you’re more likely to keep traction.

  • try to avoid turns: keep the bike as upright as possible.

  • try to avoid braking: braking is very likely to start a skid in icy conditions.

  • use busy roads: ride where the cars have been, since their traffic will have melted the ice.

  • ride away from the gutters: they’re more likely to be iced over than the centre of a lane. Again, ride where the cars have been.

  • avoid road markings: it seems these were much icier than the other parts of the road; possibly because their high albedo meant the ice on them hadn’t been melted by the sun yet. So look out for that.

Here’s a good thread on cyclechat.co.uk, and don’t miss icebike.org: ‘Whether commuting to work, or just out for a romp in the woods, you arrive feeling very alive, refreshed, and surrounded with the aura of a cycling god. You will be looked upon with the smile of respect by friends and co-workers. – – – Or was that the sneer of derision…no matter, ICEBIKING is a blast!’ o-kay.

Their recommendations are pretty sane, though. ;)

Links for 2009-02-05

Links for 2009-02-03

Links for 2009-01-30

UK’s proposed anti-filesharing quango

Wow. The IFPI’s strategy of "divide and conquer" by taking individual ISPs to court to force them to institute a 3 strikes policy, as successfully deployed against Eircom this week, is possibly marginally better than this insane obsolete-business-model handout proposed by the UK government in their Digital Britain report:

Lord Carter of Barnes, the Communications Minister, will propose the creation of a quango, paid for by a charge that could amount to £20 a year per broadband connection.

The agency would act as a broker between music and film companies and internet service providers (ISPs). It would provide data about serial copyright-breakers to music and film companies if they obtained a court order. It would be paid for by a levy on ISPs, who inevitably would pass the cost on to consumers.

Jeremy Hunt, the Shadow Culture Secretary, said: “A new quango and additional taxes seem a bizarre way to stimulate investment in the digital economy. We have a communications regulator; why, when times are tough, should business have to fund another one?”

Well said. An incredibly bad idea.

By the way, I’ve noticed some misconceptions about the Eircom settlement. Telcos reselling Eircom bitstream DSL (i.e. the 2Mbps or 3Mbps DSL packages) are immune right now.

They are, however, next on the music industry’s hit-list, reportedly…

Links for 2009-01-29

Eircom forced to implement “3 strikes and you’re out” for filesharers

Eircom has been forced to implement "3 strikes and you’re out", according to Adrian Weckler:

If the music labels come to it with IP addresses that they have identified as illegal file-sharers, Eircom will, in its own words:

"1) inform its broadband subscribers that the subscribers IP address has been detected infringing copyright and

"2) warn the subscriber that unless the infringement ceases the subscriber will be disconnected and

"3) in default of compliance by the subscriber with the warning it will disconnect the subscriber."

My thoughts — it’s technically better than installing Audible Magic appliances to filter all outbound and inbound traffic, at least.

However, there’s no indication of the degree to which Eircom will verify the "proof" provided by the music labels, or that there’s any penalty for the labels when they accuse your laser printer of filesharing. I foresee a lot of false positives.

Update: LINX reports that the investigative company used will be DtecNet, a ‘company that identifies copyright infringers by participating in P2P file-sharing networks’. TorrentFreak says:

DtecNet […] stems from the anti-piracy lobby group Antipiratgruppen, which represents the music and movie industry in Denmark. There are more direct ties to the music industry though. Kristian Lakkegaard, one of DtecNet’s employees, used to work for the RIAA’s global partner, IFPI. […]

Just like most (if not all) anti-piracy outfits, they simply work from a list of titles their client wishes to protect and then hunt through known file-sharing networks to find them, in order to track the IP addresses of alleged infringers.

Their software appears as a normal client in, for example, BitTorrent swarms, while collecting IP addresses, file names and the unique hash values associated with the files. All this information is filtered in order to present the allegations to the appropriate ISP, in order that they can send off a letter admonishing their own customer, in line with their commitments under the MoU.

[…] it will be a big surprise if [Dtecnet’s evidence is] of a greater ‘quality’ than the data provided by MediaSentry.

More coverage of the issues raised by the RIAA’s international lobbying for the 3-strikes penalty:

Links for 2009-01-28

Links for 2009-01-23

Links for 2009-01-21

Links for 2009-01-20

Switched to Magnet

I’ve switched my home broadband from Eircom’s 3Mbps all-in-one package to Magnet’s 10Mbps LLU package. It’s about a tenner a month cheaper, and significantly faster of course.

The modem arrived last Friday, about 2 weeks after ordering; that night, when I went to check my mail, I noticed that the DSL had gone down, and indeed so had the phone. I was dreading a weekend without the interwebs, it being 9pm on Friday night — but lo, when I plugged in the Magnet router, it all came up perfectly first time!

Great instructions too. Extremely readable and quite comprehensible for a reasonably non-techie person, I’d reckon. So far, they’ve provided great service, too.

I’m not actually getting the full 10Mbps, unfortunately; it’s RADSL, and I’m only getting 5Mbps when I test it. Just as well I didn’t pay the extra tenner to get their 24Mbps package. Still, that’s a hell of a lot faster than the sub-1Mbps speeds I’ve been getting from Eircom.

It’s hard to notice an effective difference when browsing though, as that kind of traffic is dominated by latency effects rather than throughput.

I haven’t even tried their "PCTV" digital TV system; it seems a bit pointless really, I have a networked PVR already, and anyway I doubt they support Linux.

One thing that’s weird: when my wife attempts to view video on news.bbc.co.uk on her Mac running Firefox, it stalls with the spinny "loading video" image, and the status line claims that it’s downloading from "ad.doubleclick.net". This worked fine (of course) on Eircom. If I switch to my user account and use Firefox there, it works fine, too — the one possible difference being that I’m using Adblock Plus and she’s not. Something to do with the number of simultaneous TCP connections to multiple hosts, maybe? Very odd anyway. It’d be nice to get some time to sit down with tcpdump and figure this one out… any suggestions?

Links for 2009-01-19

Links for 2009-01-15

Google.ie HTTPS fail

Check out what happens when you visit https://www.google.ie/ :

Clicking through Firefox’s ridiculous hoops gets me these dialogs:

Good work, Google and Firefox respectively!

Links for 2009-01-14

Links for 2009-01-13

Hack: reassassinate

A coworker today, returning from a couple of weeks’ holiday, bemoaned the quantities of spam he had to wade through. I mentioned a hack I’ve often used in this situation: discard the spam, download the 2 weeks of supposed-nonspam as a huge mbox, and rescan it all with SpamAssassin. Since the intervening 2 weeks give plenty of time for the spam URLs to be blacklisted by URIBLs and the sending IPs to be listed by DNSBLs, this generally results in better spamfilter accuracy, at least in terms of reducing false negatives (the "missed spam"). In other words, it gets rid of most of the remaining spam nicely.

Chatting about this, we realized it’d be easy enough to generalize this hack into something more widely useful by hooking up the Mail::IMAPClient CPAN module with Mail::SpamAssassin, and that it was pretty likely someone else had already done so.

Sure enough, a search threw up this node on perlmonks.org, containing a script which did pretty much all that. Here’s a minor freshening: download

reassassinate – run SpamAssassin on an IMAP mailbox, then reupload

Usage: ./reassassinate --user jmason --host mail.example.com --inbox INBOX --junkfolder INBOX.crap

Runs SpamAssassin over all mail messages in an IMAP mailbox, skipping ones it’s processed before. It then reuploads the rewritten messages to one of two locations, depending on whether they are spam or not: nonspam messages are simply re-saved to the original mailbox, while spam messages are sent to the mailbox specified in "--junkfolder".

This is especially handy if some time has passed since the mails were originally delivered, allowing more of the message contents of the spam mails to be blacklisted by third-party DNSBLs and URIBLs in the meantime.

Prerequisites:

  • Mail::IMAPClient
  • Mail::SpamAssassin
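
For the curious, the heart of such a script looks roughly like this. It’s a simplified sketch rather than the actual code from the perlmonks node: no "already processed" check, no SSL, and only minimal error handling.

#!/usr/bin/perl -w
# Simplified sketch of the core loop: fetch each message from an IMAP
# folder, rescan it with SpamAssassin, and re-file it by the new verdict.
use strict;
use Mail::IMAPClient;
use Mail::SpamAssassin;

my ($host, $user, $pass, $inbox, $junk) = @ARGV;

my $imap = Mail::IMAPClient->new(
    Server => $host, User => $user, Password => $pass,
) or die "IMAP connect failed: $@";

my $sa = Mail::SpamAssassin->new();

$imap->select($inbox) or die "select failed: " . $imap->LastError;

foreach my $msgid ($imap->search('ALL')) {
    my $text   = $imap->message_string($msgid);
    my $mail   = $sa->parse($text);
    my $status = $sa->check($mail);

    # re-upload the rewritten message: spam to the junk folder,
    # nonspam back into the original mailbox
    my $dest = $status->is_spam() ? $junk : $inbox;
    $imap->append_string($dest, $status->rewrite_mail());
    $imap->delete_message($msgid);    # remove the original copy

    $status->finish();
    $mail->finish();
}

$imap->expunge($inbox);
$imap->logout();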

Links for 2009-01-09

Links for 2009-01-08

  • Map/Reduce and Queues for MySQL using Gearman : A talk by Eric Day and Brian Aker at the upcoming MySQL Conference in April: ‘[Gearman] development is now active again with an optimized rewrite in C, along with features such as persistent message queues, queue replication, improved statistics, and advanced job monitoring. For MySQL, there is also a new user defined function to run Gearman jobs, as well as the possibility to write your own aggregate UDFs using Gearman. This gives you the ability to run functions in separate processes, separate servers, and in other languages. The Gearman framework gives you a robust interface to also run these functions reliably in the “cloud”. This session will introduce these concepts and give examples of sample applications.’ Persistent queues (at last)? Gearman integration directly in the DB? excellent!
    (tags: gearman queueing mysql databases brian-aker mapreduce sql conferences talks papers)
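
If Gearman is new to you, the worker/client split the talk describes boils down to something like this toy Perl sketch, using the Gearman::Worker and Gearman::Client CPAN modules; the job-server address and the trivial "reverse" function are made up for illustration:

#!/usr/bin/perl -w
# Toy Gearman example: run with "worker" as the first argument to
# register a function with the job server, or with no argument to
# submit a job to it as a client.
use strict;
use Gearman::Worker;
use Gearman::Client;

my $job_server = '127.0.0.1:4730';    # assumed gearmand location

if (($ARGV[0] || '') eq 'worker') {
    my $worker = Gearman::Worker->new;
    $worker->job_servers($job_server);
    $worker->register_function(reverse => sub {
        my $job = shift;
        return scalar reverse $job->arg;    # the actual "work"
    });
    $worker->work while 1;                  # loop forever, handling jobs
}
else {
    my $client = Gearman::Client->new;
    $client->job_servers($job_server);
    my $result = $client->do_task(reverse => 'hello gearman')
        or die "job failed";
    print $$result, "\n";                   # do_task returns a scalar ref
}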

Links for 2009-01-07

Links for 2009-01-06

Links for 2009-01-02


Links for 2008-12-28