
Using Karabiner-Elements to remap shift-3 from “£” to “#”

I've just been setting up a new MacBook running macOS Sequoia, and the trick I'd previously used to handle an ANSI keyboard in an Irish/UK English locale -- specifically, remapping shift-3 from "£" to "#" -- no longer works. So here's a replacement approach, using a Karabiner-Elements "Complex Modification" rule:

{
    "description": "Change right-shift-3 to hash",
    "manipulators": [
        {
            "from": {
                "key_code": "3",
                "modifiers": { "mandatory": ["right_shift"] }
            },
            "to": [
                {
                    "key_code": "3",
                    "modifiers": ["right_option"]
                }
            ],
            "type": "basic"
        }
    ]
}
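
One note on installation: Karabiner-Elements won't import a bare rule like this one. It has to be wrapped in an envelope with a "title" and a "rules" array, saved as a file under ~/.config/karabiner/assets/complex_modifications/ ; it then shows up in the "Complex Modifications" / "Add rule" dialog. The envelope looks like this (the title is arbitrary, and "...the rule above..." stands for the rule, pasted in verbatim):

{
    "title": "ANSI keyboard fixes",
    "rules": [
        ...the rule above...
    ]
}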

My Self-Hosted GMail Backup

For the past few months, I’ve had a bit of a background project going to ensure that my cloud-hosted personal data is safely archived on my own, self-hosted hardware, just in case. Google services are nice 'n' all, but I’m not 100% happy trusting them with everything in the long run.

Part of this project has been to archive my old email collection from GMail, which dates back to the initial public beta in 2004(ish?) -- and make it searchable, because what’s the point in having all that email if you can’t find the needle in the 20-year haystack when you need it?

Enter “notmuch” -- a “fast, global-search and tag-based email system”, which runs as a set of UNIX CLI commands, and is inspired by Sup, a mailreader I used previously. I have a self-hosted home server running Ubuntu 20.04 with a chunky SATA disk, so that's where I'll run it.

Here’s the process I followed:

Order a Google Takeout of your GMail account. This takes a couple of days to prepare. Request the 50GB tgz files.

When you get the email telling you it's ready, download the files (awkwardly, you can only download them one at a time, and only via your web browser -- not fun). Then scp them to your server, onto a disk with lots of free space (/x/4 in my case).
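
For example, with "homeserver" standing in for your server's hostname:

scp takeout-20250322T145242Z-*.tgz homeserver:/x/4/tmp/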

Extract each one:

cd /x/4/tmp
tar xvfz takeout-20250322T145242Z-001.tgz
tar xvfz takeout-20250322T145242Z-002.tgz
...
rm takeout-20250322T145242Z-00*tgz

You will wind up with a few bits of uninteresting metadata, and one gigantic mbox file: Takeout/Mail/All\ mail\ Including\ Spam\ and\ Trash.mbox. In order to make this useful, it needs to be converted into Maildir format, so install "mb2md":

sudo apt install mb2md

Now run it, creating a GMailTakeout directory for the result:

mkdir -p /x/4/GMailTakeout
mb2md -s /x/4/tmp/Takeout/Mail/All\ mail\ Including\ Spam\ and\ Trash.mbox -d /x/4/GMailTakeout

This takes quite a while for 20 years of email! Unfortunately, the resulting single directory is still unusably huge, so split it into 100 new Maildir folders:

cd /x/4/GMailTakeout/cur
find . -type f -print > /tmp/dirlisting
perl -ne '
  # assign each file round-robin to one of 100 buckets, keyed on line number
  $dir = sprintf("dir_%03d", ($. % 100));
  (-d $dir) or mkdir($dir);
  chomp; rename($_, "$dir/$_") or die "cannot rename $_";
' /tmp/dirlisting

cd /x/4/GMailTakeout
mv cur/* .
for f in dir_* ; do mkdir mail$f mail$f/{new,tmp} ; mv $f mail$f/cur ; done

The result of this is 100 Maildirs, /x/4/GMailTakeout/maildir_000 to /x/4/GMailTakeout/maildir_099, each containing about 300MB of email, in my case.

There really isn't much need to keep the mails labelled as spam, so let's just nuke them in advance:

grep -rl 'X-Gmail-Labels: Spam' . | xargs -n 100 rm -f

Next step is to install “notmuch” and create a “notmuch” configuration. I used the Debian packaged “notmuch”, version 0.29.3. Install using apt-get, and then run “notmuch”. Accept the defaults for the config, and don’t add any mail folders yet.
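
For reference, the resulting ~/.notmuch-config looks something like this (the paths and name are from my setup; the email address here is a placeholder):

[database]
path=/home/jm/mail

[user]
name=Justin Mason
primary_email=jm@example.com

[new]
tags=unread;inbox;

[search]
exclude_tags=deleted;spam;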

My initial attempt was simply to import the lot in one go. This went badly: it threw up a multi-day progress indicator, offered no safe way to checkpoint partial progress, and quickly started consuming lots of RAM, which made me suspect a leak.

I aborted it and tried this instead to index each dir one-by-one:

for f in /x/4/GMailTakeout/maildir_* ; do ln -s $f ~/mail/ && nice notmuch new ; done

Unfortunately, this also turned out badly. The import of each maildir gradually slowed as data built up in notmuch’s Xapian indexes. After processing about 60 maildirs, memory consumption during the import became a problem, and the “notmuch” processes started being killed by the Linux OOM killer. In a couple of cases this resulted in corrupt index files and data loss. Ouch.

So I started again, with a new approach:

#!/bin/sh
set -exu
mkdir -p /x/4/GMailTakeout/notmuchbackup/xapian/
for f in /x/4/GMailTakeout/maildir_0*
do
    ln -s $f ~/mail/ && nice notmuch new
    nice notmuch compact
    cp /home/jm/mail/.notmuch/xapian/* /x/4/GMailTakeout/notmuchbackup/xapian/
done

Calling “notmuch compact” does seem to help, trimming the size of the indexes as it goes; taking a copy of the Xapian indexes in a backup dir is for extra safety. Since the “-e” shell flag is in place, any OOMs or other random failures will crash the entire script and ensure the last backup is still safe to use for recovery.

Unfortunately this still got bogged down and started OOMing fairly reliably after about maildir_065, 2 days into the process. At that point, I decided to keep that set of dirs as "notmuch config 1" and start a separate import process, into another index, as "notmuch config 2". Accordingly, I moved ~/mail to ~/mail1 and ~/.notmuch-config to ~/.notmuch-config1, created ~/mail2, and started a new notmuch config file pointing at that instead. Ideally I'll be able to merge the indexes at some point, but it's no biggie.

With these two aliases, it’s pretty painless:

alias notmuch1='notmuch --config=$HOME/.notmuch-config1'
alias notmuch2='notmuch --config=$HOME/.notmuch-config2'

After another day or so of indexing, this is the result --

du -sh /home/jm/mail?/.notmuch/xapian
19G     /home/jm/mail1/.notmuch/xapian
4.2G    /home/jm/mail2/.notmuch/xapian

Notmuch supports pretty much all the nice email search features that GMail does, but seemingly more reliably, and faster; I’ve already been able to use this new mail index to find a mail that (worryingly!) GMail's own search can’t seem to locate -- my license for the Moom OSX window manager tool purchased over a decade ago:

time notmuch1 search moom "Many Tricks"
thread:00000000000034fe   2013-10-15 [1/1] Many Tricks; Your Many Tricks purchase (inbox unread)
thread:00000000000c267b   2013-10-15 [1/1] sales@manytricks.com; Your Moom License (attachment inbox unread)

real    0m0.068s
user    0m0.048s
sys     0m0.016s

And it’s just nice to have 20 years of email archived safely, off the cloud, and indexed.

Next steps? Maybe lieer would be good to try, to download incremental updates as we go forward. Let's see.
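
If I do, the workflow would be roughly this (commands from the lieer docs; the maildir path is just my guess at where I'd put it):

pip install lieer                  # installs the 'gmi' command
mkdir -p ~/mail2/gmail && cd ~/mail2/gmail
gmi init myaddress@gmail.com       # one-off OAuth setup; address is a placeholder
gmi sync                           # incremental two-way sync of mail and tags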

My Solar PV Output For 2024

A couple of years ago, I had 5.8kW of solar panels and a 5kWh battery installed on my (fairly typical) Dublin house.

The tricky part with solar PV is that, while you may have a set of solar panels, these may not be generating electricity at the time you want to use it. Even with a battery, your available stored power may wind up fully discharged by 7pm, leaving you running from expensive, non-renewable grid power for the rest of the evening. And up here in the high latitudes of northern Europe, you just don't get a whole lot of solar energy in December and January.

2024 was the first year in which (a) my panels and battery were fully up and running, (b) we were using a day/peak/night rate for grid electricity, and (c) for much of the year I had load-shifting in place; in other words, charging the battery from cheap night-rate electricity, then discharging it gradually over the course of the day, topping up with solar power once the sun gets high enough. As such, it’s worth doing the sums for the entire year to see how effective it’s been in real-world usage terms.

The total solar power generated across the year, as reported by my Solis inverter, was 4119 kWh.

Over the course of 2024, the entire household consumption came to 8628 kWh. This comprised a fairly constant 800-ish kWh per month across the year; we still have gas-fired heating, so the winter months generally burn gas instead of scaling up our electricity consumption.

Of that, the power consumed from solar PV was 2653 kWh (reported from the Solis web app as “annual PV to consumption”), and that from the grid was 5975 kWh (reported by the ESB Networks data feed).

So the correct figure is that 30% of our household consumption came from solar. This is a big difference from the naive figure of 4119/8628 ≈ 48%; you can see that a big chunk of that power is being "lost", because it is generated at the wrong time to provide household power.

Of course, that power isn’t really “lost” -- it was exported to the grid instead. This export comprised 1403 kWh; this occurred when the battery was full, the household power usage was low, but there was still plenty of solar power being generated. (Arguably a bigger battery would be worthwhile to capture this, but at least we get paid for this export.)

There was a 2%-4% discrepancy between the Solis data and that from ESB Networks; Solis reported higher consumption (6102 kWh vs 5975) and higher export (1465 kWh vs 1403). I'm inclined to believe ESB Networks, though.

In monetary terms:

The household consumption was 8628 kWh. Had we consumed this with the normal 24-hour rate tariff, we'd have paid (€236.62 standing charge per year) + (8628 at 23.61 cents per kWh) = (236.62 + 8628 * 0.2361) = €2273.69.

Checking the bills received over the year -- taking into account load-shifting to take advantage of day/night variable rates and the power generated by the panels, and discounting the one-off government bill credits -- we spent €1325.97, which is 58.3% of the non-solar price.
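
Checking the arithmetic with bc:

echo '236.62 + 8628 * 0.2361' | bc -l     # => 2273.6908, i.e. €2273.69
echo '100 * 1325.97 / 2273.69' | bc -l    # => 58.3 (and change)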

Here! Have some graphs:

Ridding My Home Network of IP Addresses

(Republishing this one on the blog, instead of just as a gist)

Recent changes in the tech scene have made it clear that relying on commercial companies to provide services I rely on isn't a good strategy in the long term, and given that Tailscale is so effective these days as a remote-access system, I've gradually been expanding a small collection of self-hosted web apps and services running on my home network.

Until now they've mainly been addressed using their IP addresses and random high ports on the internal LAN, for example:

  1. Pihole: http://10.19.72.7/admin
  2. Home Assistant: http://10.19.72.11:8123/
  3. Linkding: http://10.19.72.6:9092/
  4. Grafana: http://10.19.72.6:3000/
  5. (plus a good few others)

Needless to say this is a bit messy and inelegant, so I've been planning to sort it out for a while. My requirements:

  1. no more ugly bare IP addresses!
  2. a DNS domain;
  3. with HTTPS URLs;
  4. one per service;
  5. no visible port numbers;
  6. fully valid TLS certs, no having to click through warnings or install funny CA certs;
  7. accessible regardless of which DNS server is in use -- i.e. using public DNS records. This may seem slightly unusual, but it's useful so that the internal services can still be accessed when I'm using my work VPN (which forces its own DNS servers);
  8. accessible internally;
  9. accessible externally, over Tailscale;
  10. not accessible externally without Tailscale.

After a few false starts, I'm pretty happy with the current setup, which uses Caddy.

Hosting The Domain At Cloudflare

First off, since the service URLs are not to be accessible externally without Tailscale active, the HTTP challenge approach to provision Let's Encrypt certs cannot be used. That would require an open-to-the-internet publicly-accessible HTTP server on my home network, which I absolutely want to avoid.

In order to use the ACME DNS challenge instead, I set up my public domain "taint.org" to use Cloudflare as the authoritative DNS server (in Cloudflare terms, "full setup"). This lets Caddy edit the DNS records via the Cloudflare API to handle the ACME challenge process.

One of the internal hosts is needed to run the Caddy server's reverse proxies; I picked "hass", 10.19.72.11, the Home Assistant host, which didn't have anything already running on port 80 or port 443. (All of my internal hosts are running on a private /24 IP range, at 10.19.72.0/24.)

The dedicated DNS domain I'm using for my home services is "home.taint.org". In order to use this, I clicked through to the Cloudflare admin panel and created a DNS record as follows:

Type   Name      Content             Proxy Status               TTL
A      *.home    10.19.72.11         DNS only - reserved IP     Auto

Now, any hostnames under "home.taint.org" will return the IP 10.19.72.11 (where Caddy will run).
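
A quick sanity-check that the wildcard record resolves -- any label under home.taint.org should do:

$ dig +short anything.home.taint.org
10.19.72.11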

I don't particularly care about exposing my internal home network IPs to the world; that's the trade-off which allows the URLs to work even if an internal host is using the work VPN, or resolving with 8.8.8.8, or whatever. Giving up that little bit of paranoia is worth it, since the IPs won't be accessible from outside without Tailscale anyway.

It is worth noting that the Cloudflare-hosted domain doesn't have to be the same one used for URLs in the home network; using dns_challenge_override_domain you can delegate the ACME challenge from any "home" domain to one which is hosted in Cloudflare.
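
In a Caddyfile, that looks roughly like the following -- "svc.home.internal" here is a hypothetical non-Cloudflare-hosted name, with _acme-challenge.svc.home.internal CNAMEd to a name under taint.org:

svc.home.internal {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
                dns_challenge_override_domain acme.taint.org
        }
        reverse_proxy /* 10.19.72.6:9092
}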

The Caddy Setup

One wrinkle is that I had to generate a custom Caddy build in order to get the non-standard "dns.providers.cloudflare" module, from https://caddyserver.com/download -- a click-and-download page which generates a custom Caddy binary on the fly. It would have been nicer if the Cloudflare module were standard, but hey.

Once that's installed, I can get this output:

$ /usr/local/bin/caddy list-modules
[long list of standard modules omitted]

dns.providers.cloudflare
dns.providers.route53

  Non-standard modules: 2

  Unknown modules: 0

(Yes, I have Caddy running as a normal service, not as a Docker container. No particular reason; I think Docker should work fine.)

Go to the Cloudflare account dashboard, and create a user API token as described at https://developers.cloudflare.com/fundamentals/api/get-started/create-token/ . In my case, it has Zone / DNS / Edit permission, on the specific zone taint.org.

Copy that token as it's needed in the "Caddyfile", which now looks like the following:

hass.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.11:8123
}

links.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:9092
}

pi.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        redir / /admin/
        reverse_proxy /admin/* 10.19.72.7:80
}

grafana.home.taint.org {
        tls {
                dns cloudflare cloudflare_api_token_goes_here
        }
        reverse_proxy /* 10.19.72.6:3000
}

[many other services omitted]
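
As an aside: the repeated tls stanza can be factored out with a Caddyfile snippet, which also lets the token come from the environment rather than sitting in the file. A sketch, assuming you export CF_API_TOKEN to the Caddy process:

(cloudflare) {
        tls {
                dns cloudflare {env.CF_API_TOKEN}
        }
}

grafana.home.taint.org {
        import cloudflare
        reverse_proxy /* 10.19.72.6:3000
}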

Running sudo caddy run in the same dir will start up and verbosely log what it's doing. (Once you're happy enough, you can get Caddy running in the normal systemd service way.)

After setting those up, I now have my services accessible locally as:

  1. Home Assistant: https://hass.home.taint.org/
  2. Pihole: https://pi.home.taint.org/
  3. Grafana: https://grafana.home.taint.org/
  4. Linkding: https://links.home.taint.org/

Caddy seamlessly goes off and configures fully valid TLS certs with no fuss. I found it much tidier than Certbot, or Nginx Proxy Manager.

The Tailscale Setup

So this has now sorted out all of the requirements bar one:

  1. accessible externally, over Tailscale.

To do this I had to log into Tailscale's admin console at https://login.tailscale.com/admin/machines , pick a host on the 10.19.72.0/24 internal LAN, click its dropdown menu, choose "Edit Route Settings...", and enable a Subnet Route for 10.19.72.0/24. By doing this, all of the service.home.taint.org DNS records are now accessible, remotely, once Tailscale is enabled; I don't even need to use ts.net names to access them! Perfect.
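
The same thing can be done with the tailscale CLI on the subnet-router host, rather than clicking around the console; note that Linux clients also need to opt in to accepting subnet routes:

# on the host that routes the LAN (then approve the route in the admin console)
sudo tailscale up --advertise-routes=10.19.72.0/24

# on Linux clients only; macOS/iOS accept subnet routes automatically
sudo tailscale up --accept-routes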

Anyway, that's the setup -- hopefully this writeup will help others. And kudos to Caddy, Let's Encrypt and Tailscale for making this relatively easy.

The Current State of This Blog’s Syndication

For the past several years, since the demise of Google Reader, I’ve been augmenting the RSS/Atom syndication of this linkblog with posts to various social media platforms using bot accounts. This is kind of a form of POSSE -- “Publish (on your) Own Site, Syndicate Elsewhere” (ideally I’d be self-hosting Pinboard to qualify for that I guess).

The cross-posts went first to Twitter (RIP), and more recently to Mastodon via botsin.space. With the shutdown of that instance, I've had to make a few changes to my syndication script which gateways the contents to Mastodon, and I also took the opportunity to set up a BlueSky gateway at the same time. On the prompting of @kellan, here's a quick write-up of where it all currently stands…

Primary Source: Pinboard

The primary source for the blog’s contents is my long-suffering account at https://pinboard.in/u:jm/, where I have been collecting links since 2009 (and before that, del.icio.us since I think 2004?, so that’s 20 years of links by now).

Pinboard has a pretty simple UI for link collection using a bookmarklet, which I’ve improved a tiny bit to open a large editor textbox instead of the default tiny one.

The resulting posts generally tend to include a blockquote, a short lede, and a few tags in the normal Pinboard/Del.icio.us style.

I find editing text posts in the Pinboard bare-bones UI to be easier and more pleasant than WordPress, so I generally use that as the primary source. Based on the POSSE principle, I should really figure out a way to get this onto something self-hosted, but Pinboard works for me (at the moment at least).

Publish from Pinboard to Blog

I use a Python script run from cron, to gateway new bookmarks from https://pinboard.in/u:jm/ as individual posts, formatted with Markdown, to this blog using the WordPress posting API: Github repo

Publish from Pinboard to Mastodon

This reads the Pinboard RSS feed for https://pinboard.in/u:jm/ and posts any new URLs (and the first 500 chars of each description) to the “jmason_links” account at mstdn.social: Github repo

Migration from the old Mastodon account at botsin.space to mstdn.social was really quite easy; after manually setting up the new account at mstdn.social and copying over the bio text, I hit the "Move from a different account" page, and entered @jm_links@botsin.space for the handle of the old account to migrate from.

I then logged in to the old account on botsin.space and hit the "Move to a different account" page, entering @jmason_links@mstdn.social for the handle to migrate to. This triggered copying of the followers from one account to the other, and left the old account dormant with a link to the new location instead.

(One thing to watch out for is that once the move is triggered, the profile for the old account becomes read-only; I've since had to temporarily undo the "moved" status in order to update the profile text, which was a bit messy.)

Publish from Pinboard to BlueSky

This reads the same Pinboard RSS feed as the Mastodon gateway, and gateways new posts from there to the “jmason.ie” account at BlueSky. This is slightly more involved than the Mastodon script, as it attempts to generate an embed card and mark up any links in the post appropriately: Github repo

I have a cron on my home server which runs those Mastodon and BlueSky gateway scripts every 15 minutes, and that seems to be a reasonable cadence without hammering the various APIs too much.
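
The crontab entries are nothing fancy; something along these lines, with the script paths being stand-ins for wherever you keep yours:

*/15 * * * *  $HOME/bin/pinboard-to-mastodon >> $HOME/logs/gw-mastodon.log 2>&1
*/15 * * * *  $HOME/bin/pinboard-to-bluesky  >> $HOME/logs/gw-bluesky.log  2>&1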

Migraines, and CGRP inhibitors

Over the past decade or so, I've been suffering with chronic migraine, sometimes with multiple attacks per week. It's been a curse -- not only do you have to suffer the periodic migraine attacks, but also the "prodrome", where unpleasant symptoms like brain fog and an inability to concentrate can impact you.

After a long process of getting a referral to the appropriate headache clinic, and eliminating other possible medications, I finally got approved to receive Ajovy (fremanezumab), one of the new generation of CGRP inhibitor monoclonals -- these work by blocking the action of a peptide on receptors in your brain. I started the course of these a month ago.

The results have, frankly, been amazing. As I hoped, the migraine episodes have reduced in frequency, and in impact; they are now milder. But on top of that, I hadn't realised just how much impact the migraine "prodrome" had been having on my day-to-day life. I now have more ability to concentrate, without it causing a headache or brain fog; I have more energy and am less exhausted on a day-to-day basis; judging by my CPAP metrics, I'm even sleeping better. It is a radical improvement. After 10 years I'd forgotten what it was like to be able to concentrate for prolonged periods!

They are so effective that the American Headache Society is now recommending them as a first-line option for migraine prevention, ahead of almost all other treatments.

If you're a migraine sufferer, this is a game changer. I'm delighted. It seems there may even be further options of concomitant treatment with other CGRP-targeting medications in the future, to improve matters further.

More papers on the topic: a real-world study on CGRP inhibitor effectiveness after 6 months; no "wearing-off" effect is expected.

My (Current) Solar PV Dashboard

About a year ago, I installed a solar PV system at my home. I wound up with a set of 14 panels on my roof, which can produce a max of 5.6 kilowatts output, and a 4.8 kWh Dyness battery to store any excess power.

Since my car is an EV, I already had a home car charger installed, but chose to upgrade this to a MyEnergi Zappi at the same time, as the Zappi has some good features to charge from solar power only -- and part of that feature set involved adding a Harvi power monitor.

With HomeAssistant, I’ve been able to extract metrics from both the MyEnergi components and the Solis inverter for the solar PV system, and can publish those from HomeAssistant to my Graphite store, where my home Grafana can access them -- and I can thoroughly nerd out on building an optimal dashboard.

I’ve gone through a couple of iterations, and here’s the current top-line dashboard graph which I’m quite happy with...

Let’s go through the components to explain it. First off, the grid power:

Grid Import sans Charging

This is power drawn from the grid, instead of from the solar PV system. Ideally, this is minimised, but generally after about 8pm at night the battery is exhausted, and the inverter switches to run the house’s power needs from the grid.

In this case, there are notable spikes just after midnight, where the EV charge is topped up by a scheduled charge on the Zappi, and then a couple of short duration load spikes of 2kW from some appliance or another over the course of the night.

(What isn’t visible on this graph is a longer spike of 2kW charging from 07:00 until about 08:40, when a scheduled charge on the Solis inverter charges the house batteries to 100%, in order to load shift -- I’m on the Energia Smart Data contract, which gives cheap power between 23:00 and 08:00. Since this is just a scheduled load shift, I’ve found it clearer to leave it off, hence “sans charging”.)


Solar Generation

This is the power generated by the panels; on this day, it peaked at 4kW (which isn't bad for a slightly sunny Irish day in April).


To Battery From Solar

Power charged from the panels to the Dyness battery. As can be seen here, during the period from 06:50 to 09:10, the battery charged using virtually all of the panels’ power output. From then on, it periodically applied short spikes of up to 1kW, presumably to maintain optimal battery operation.


From Battery

Pretty much any time the batteries are not charging, they are discharging at a low rate. So even during the day time with high solar output, there’s a little bit of battery drain going on -- until 20:00 when the solar output has tailed off and the battery starts getting used up.


Grid Export

This covers excess power, beyond what can be used directly by the house, or charged to the battery; the excess is exported back to the power grid, at the (currently) quite generous rate of 24 cents per kilowatt-hour.

Rendering

All usages of solar power (either from battery or directly from PV) are rendered as positive values, above the 0 axis line; usage of (expensive) grid power is represented as negative, below the line.

For clarity, a number of lines are stacked:

From Battery (orange) and Solar Generation (green) are stacked together, since those are two separate complementary power sources in the PV system.

Grid Export (blue) and To Battery From Solar (yellow) are also stacked together, since those are subsets of the (green) Solar Generation block.

The Grafana dashboard JSON export is available here, if you're curious.

Quick plug for Cronitor.IO

Quick plug for a good tool for self-hosting -- Cronitor.io. I have been using this for the past year or so as I migrate more of my personal stuff off cloud and back onto self-hosted setups, and it's been a really nice way to monitor simple cron-driven home workloads, and (together with graphite/grafana alerts) has saved my bacon many times. Integrates nicely with Slack, or even PagerDuty (although that would be overkill for my setup for sure).

Moving House

Bit of a meta update.

This blog has been at taint.org for a long time, but that's got to change...

When I started the blog, in March 2000 (!), "taint" had two primary meanings; one was (arguably) a technical term, referring to Perl's "taint checking" feature, which allowed dataflow tracing of "tainted" externally-sourced data as it is processed through a Perl program. The second meaning was the more common, less technical one: "a trace of a bad or undesirable substance or quality." The applicability of this to the first meaning is clear enough.

Both of those fit quite nicely for my intentions for a blog, with perl, computer security, and the odd trace of bad or undesirable substances. Perfect.

However. There was a third meaning, which was pretty obscure slang at the time.... for the perineum. The bad news is that in the intervening 23 years this has now by far become the primary meaning of the term, and everyone's entirely forgotten the computer-nerdy meanings.

I finally have to admit I've lost the battle on this one!

From now on, the blog's primary site will be the sensible-but-boring jmason.ie; I'll keep a mirror at taint.org, and all RSS URLs on that site will still work fine, but the canonical address for the site has moved. Change is inevitable!

An Irish Web Pioneer!

I'm happy to announce that I'm now listed on TechArchives.Irish as one of the pioneers of the Irish web!

After extensive interviewing and collaboration with John Sterne, my testimony and timeline of those early days of the Irish web is now up at TechArchives.

It's been a good opportunity to reflect on the differences between the tech scene then and now. I was very idealistic 30 years ago about the possibilities that the web and internet technologies had to offer; nowadays, I'm a bit more grizzled and pragmatic. But I still have hope -- particularly if we can apply this tech in a way that helps address climate change.... here's to the next 30 years!

Anyway, I hope writing this down helps record the history of those great early years of the web. Please take a look.

DynamoDB-local on Apple Silicon

DynamoDB Local is one of the best features of AWS DynamoDB. It allows you to run a local instance of the data store, and is perfect for use in unit tests to validate correctness of your DynamoDB client code without calling out to the real service "in the cloud" and involving all sorts of authentication trickiness.

Unfortunately, if you're using one of the new MacBooks with M1 Apple silicon, you may run into trouble:

11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB > Feb 04, 2022 11:08:56 AM com.almworks.sqlite4java.Internal log
11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB > SEVERE: [sqlite] SQLiteQueue[]: error running job queue
11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB > com.almworks.sqlite4java.SQLiteException: [-91] cannot load library: java.lang.UnsatisfiedLinkError: /.../DynamoDBLocal_lib/libsqlite4java-osx.dylib: dlopen(/.../DynamoDBLocal_lib/libsqlite4java-osx.dylib, 0x0001): tried: '/.../DynamoDBLocal_lib/libsqlite4java-osx.dylib' (fat file, but missing compatible architecture (have 'i386,x86_64', need 'arm64e')), '/usr/lib/libsqlite4java-osx.dylib' (no such file)
11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB >      at com.almworks.sqlite4java.SQLite.loadLibrary(SQLite.java:97)
11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB >      at com.almworks.sqlite4java.SQLiteConnection.open0(SQLiteConnection.java:1441)
11:08:56.893 [DEBUG] [TestEventLogger]          DynamoDB >      at com.almworks.sqlite4java.SQLiteConnection.open(SQLiteConnection.java:282)
11:08:56.894 [DEBUG] [TestEventLogger]          DynamoDB >      at com.almworks.sqlite4java.SQLiteConnection.open(SQLiteConnection.java:293)

It's possible to invoke it via Rosetta 2, Apple's x86_64 emulation layer, like so:

arch -x86_64 /path/to/openjdk/bin/java -jar dynamodb-local.jar

But if you don't have control over the invocation of the Java command, or just don't want to involve emulation, this is a bit hacky. Here's a better way to make it work.

First, download dynamodb_local_latest.tar.gz from the DynamoDB downloads page, and extract it.

The DynamoDBLocal_lib/libsqlite4java-osx.dylib file in this tarball is the problem. It's OSX x86 only, and will not run with an ARM64 JVM. However, the same lib is available for ARM64 in the libsqlite4java artifacts list, so this will work:

wget -O libsqlite4java-osx.dylib.arm64 'https://search.maven.org/remotecontent?filepath=io/github/ganadist/sqlite4java/libsqlite4java-osx-arm64/1.0.392/libsqlite4java-osx-arm64-1.0.392.dylib'
mv DynamoDBLocal_lib/libsqlite4java-osx.dylib libsqlite4java-osx.dylib.x86_64
lipo -create -output libsqlite4java-osx.dylib.fat libsqlite4java-osx.dylib.x86_64 libsqlite4java-osx.dylib.arm64
mv libsqlite4java-osx.dylib.fat DynamoDBLocal_lib/libsqlite4java-osx.dylib

This is now a "fat" lib which supports both ARM64 and x86 hardware. Hey presto, you can now invoke DynamoDBLocal in the normal Rosetta-free manner, and it'll all work -- on both hardware platforms.

(This post is correct as of version 2022-1-10 (1.18.0) of DynamoDB-Local -- let me know by mail, or at @jmason on Twitter, if things break in future, and I'll update it.)

Richard J. Hayes, Ireland’s WWII cryptographer and polymath

This is new to me -- Thanks to David Mee for the pointer.

'During WWII, one of Nazi Germany’s most notorious communication codes was broken by a mild mannered librarian and family man from West Limerick, Richard Hayes. His day-job was as Director of the National Library of Ireland - but during wartime, he secretly led a team of cryptanalysts as they worked feverishly on the infamous "Görtz Cipher" - a fiendish Nazi code that had stumped some of the greatest code breaking minds at Bletchley Park, the centre of British wartime cryptography.

But who was Richard Hayes? He was a man of many lives. An academic, an aesthete, a loving father and one of World War Two’s most prolific Nazi Codebreakers.

At the outbreak of WWII, Hayes, being highly regarded for his mathematical and linguistic expertise, was approached by the head of Irish Military Intelligence (G2), Colonel Dan Bryan, with a Top Secret mission. At the behest of Taoiseach Éamon de Valera, Hayes was given an office and three lieutenants to decode wireless messages being covertly transmitted via Morse code from a house in north Dublin owned by the German Embassy. The coded messages posed a huge threat to Irish national security and the wider war effort. As Hayes' team worked to break the code, it was all academic until he met his greatest challenge yet: the man who was to be his nemesis, Dr. Hermann Görtz, a German agent who parachuted into Ireland in 1940 in full Luftwaffe uniform in an attempt to spy and transmit his own coded messages back to Berlin. [...] The events that transpired were a battle of wits between the mild mannered genius librarian and his nemesis, the flamboyant Nazi spy.

Hayes has been referred to by MI5 as Ireland's "greatest unsung hero" and the American Office of Strategic Services as "a colossus of a man", yet due to the secret nature of his work he is virtually unheard of in his own country.'

Hayes was our lead code-breaker, director of the National Library of Ireland, and then director of the Chester Beatty Museum; he was the first to discover the German use of microdots to hide secret messages; and MI5 credited him with a "whole series of ciphers that couldn't have been solved without [his] input". Quite the polymath!

The book is apparently well worth a read: Code Breaker, by Marc McMenamin, and I can strongly recommend this RTE radio documentary. It's full of amazing details, such as the process of feeding Hermann Görtz false information while he was in prison, in order to mislead the Nazis.

After the war, he fruitlessly warned the Irish government not to use a "Swedish cipher machine", presumably one made by Boris Hagelin, who went on to found Crypto AG, which later proved to be providing backdoors in its machines to the CIA and BND.

Quite a towering figure in the history of Irish cryptography and cryptanalysis!

Peer-to-peer COVID-19 contact tracing without the surveillance

Maciej Ceglowski asks for a massive surveillance program to defeat COVID-19.

However, as I mentioned on twitter -- there IS an alternative, privacy-preserving approach, which is what is being done in Singapore with their TraceTogether app.

In summary, everyone carries a phone running an app which broadcasts an anonymized, random ID, periodically scans local Bluetooth for other people's apps and their random IDs, and records them locally (not uploading them to a server). If you find out you have COVID-19, you then trigger an upload of your contact history to a central server. That server then broadcasts out the list of IDs, and everyone you've been in contact with will then get a ping on their app to get tested, self-isolate, etc.

No central surveillance, no creepy big brother watching your location.
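
For the protocol-minded, here's a toy Python sketch of that flow -- my own illustration of the scheme as described above, not TraceTogether's actual code or wire format:

import os
import time

def new_ephemeral_id() -> bytes:
    # a rotating random ID, broadcast over Bluetooth; no identity attached
    return os.urandom(16)

# each phone's *local* contact log of (timestamp, id_seen) -- never uploaded
local_log = []

def on_bluetooth_sighting(peer_id: bytes) -> None:
    local_log.append((time.time(), peer_id))

def on_positive_diagnosis(server) -> None:
    # the only upload, and it's user-triggered: the IDs this phone has seen
    server.publish({seen for _, seen in local_log})

def check_exposure(my_broadcast_ids, published_ids) -> bool:
    # everyone else checks *locally* whether one of their own IDs was named
    return any(i in published_ids for i in my_broadcast_ids)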

My pinboard has a few more write-ups on basically the same idea from various other places, including MIT. This is similar to what China's app does, but (as far as I can tell) with more privacy.

It looks like the Singaporean government digital services team behind TraceTogether is putting together an open source version, at Bluetrace.io.

IMO we have to do this or we will never get out of COVID-19 lockdown before 2021. I am massively in favour of adopting this approach in Ireland and across the world.

Fixing echoing sound effects with Huawei Histen

Here's a quick tip for people using Huawei or Honor phones.

Huawei recently released EMUI version 9.1.0.326 as an OTA update, which I applied once it was offered as an upgrade option.

Once I installed that OS upgrade, however, I noticed that whenever I listened to music or podcasts using a Bluetooth headset or stereo speakers, there was a new and very noticeable 'echoing' effect on the audio.

It appears this was due to the addition of Huawei Histen, a 3D audio/equaliser feature, which adds 3D audio effects when listening on various kinds of wired headphones -- however, this is supposed to be disabled on Bluetooth devices.

I spent several days googling how to disable Histen, with no luck. Eventually, through trial and error, I discovered a workaround -- simply plug in a pair of wired headphones, go into Settings -> Sounds -> Huawei Histen sound effects, and choose "Natural sound". Hey presto, the next time you use Bluetooth headphones, the echo should be gone.

Recipe: clara con limón granizado

I came across this cocktail in Pals, in Catalonia, in 30 degree heat, a few weeks back -- I saw it on the menu at the cafe in the square of the old town, and had to give it a go. It's incredible. Basically, it's lager mixed with a lemon granita -- like a beer slushy. Nothing is better at thirst quenching on a hot day, and best of all it's quite low in alcohol so no worries about lorrying into it during the daytime :)

This year at Groovefest, our yearly get together/mini-festival, I got to serve up a few, with great results -- they were quite popular. So here's the recipe!

First off, a day or two in advance, make a batch of lemon granita. I based mine on this recipe which I'll copy here just in case the original goes away:

Lemon Granita

Serves: about 8

Ingredients:

  • 3-4 lemons
  • 1L water
  • 150g of sugar

Method:

  • Zest the lemons and set the zest aside. Juice the lemons until you have 150ml juice (you may not need all of them).

  • Add the water and sugar to a large pan and bring to the boil. Reduce to a simmer and cook for 2 minutes, stirring to dissolve the sugar.

  • Add the lemon juice and zest, remove from the heat and cover. Set aside to cool for 20 minutes.

  • Strain the mixture into 2 containers that will fit in your freezer and leave to cool to room temperature.

  • Freeze until the mixture is partially frozen, which should take several hours. (I just left them overnight)

  • Remove the granita from the freezer and leave at room temperature until you can break it into chunks with a large spoon or fork.

  • Either transfer to a blender or food processor and blitz, or break it up with a fork. It doesn't need to be perfectly smooth and snowy -- a slushy texture is just right for this drink.

  • Store in the freezer. Take out 30 minutes before serving and break it up again with a fork.

Clara Con Limón Granizado

To serve: half-fill a half-pint glass with the lemon granita. Pour the beer on top to fill the glass. Stir once or twice to mix. Enjoy!

PS: I think -- not sure as my Catalan is pretty terrible -- it may be a clara granitzada in Catalonia...

Don’t use Timers with exponentially-decaying reservoirs in Graphite

A common error when using the Metrics library is to record Timer metrics on things like API calls, using the default settings, then to publish those to a time-series store like Graphite. Here's why this is a problem.

By default, a Timer uses an Exponentially Decaying Reservoir. The docs say:

'A histogram with an exponentially decaying reservoir produces quantiles which are representative of (roughly) the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike the uniform reservoir, an exponentially decaying reservoir represents recent data, allowing you to know very quickly if the distribution of the data has changed.'

This is more-or-less correct -- but the key phrase is 'roughly'. In reality, if the frequency of updates to such a timer drops off, it could take a lot longer, and if you stop updating a timer which uses this reservoir type, it'll never decay at all. The GraphiteReporter will dutifully capture the percentiles, min, max, etc. from that timer's reservoir every minute thereafter, and record those to Graphite using the current timestamp -- even though the data it was derived from is becoming more and more ancient.

Here's a demo. Note the long stretch of 800ms 99th-percentile latencies on the green line in the middle of this chart:

However, the blue line displays the number of events. As you can see, there were no calls to this API for that 8-hour period -- this one was a test system, and the user population was safely at home, in bed. So while Graphite is claiming that there's an 800ms latency at 7am, in reality the 800ms-latency event occurred 8 hours previously.

I observed the same thing in our production systems for various APIs which suffered variable invocation rates; if rates dropped off during normal operation, the high-percentile latencies hung around for far longer than they should have. This is quite misleading when you're looking at a graph for 10pm and seeing a high 99th-percentile latency, when the actual high-latency event occurred hours earlier. On several occasions, this caused lots of user confusion and FUD with our production monitoring, so we needed to fix it.

Here are some potential fixes.

  • Modify ExponentiallyDecayingReservoir to also call rescaleIfNeeded() inside getSnapshot() -- but based on this discussion, it appears the current behaviour is intended (at least for the mean measurement), so that may not be acceptable. Another risk of this is that it leaves us in a position where the percentiles displayed for time T may actually have occurred several minutes prior to that, which is still misleading (albeit less so).

  • Switch to sliding time window reservoirs, but those are unbounded in size -- so a timer on an unexpectedly-popular API could create GC pressure and out-of-memory scenarios. It's also the slowest reservoir type, according to the docs. That made it too risky for us to adopt in our production code as a general-purpose Timer implementation.

  • Update, Dec 2017: as of version 3.2.3 of Dropwizard Metrics, there is a new SlidingTimeWindowArrayReservoir reservoir implementation, which is a drop-in replacement for SlidingTimeWindowReservoir, with much more acceptable memory footprint and GC impact. It costs roughly 128 bits per stored measurement, and is therefore judged to be 'comparable with ExponentiallyDecayingReservoir in terms of GC overhead and performance'. (thanks to Bogdan Storozhuk for the tip) A registration sketch follows this list.

  • What we eventually did in our code was to use this Reporter class instead of GraphiteReporter; it clears all Timer metrics' reservoirs after each write to Graphite. This is dumb and dirty, reaching across logical class boundaries, but at the same time it's simple and comprehensible behaviour: with this, we can guarantee that the percentile/min/max data recorded at timestamp T is measuring events in that timestamp's 1-minute window -- not any time before that. This is exactly what you want to see in a time-series graph like those in Graphite, so is a very valuable feature for our metrics, and one that others have noted to be important in comparable scenarios elsewhere.
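
As a concrete illustration of that third option, constructing and registering a timer backed by the newer reservoir looks something like this (the metric name and wiring here are illustrative, not our production code):

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SlidingTimeWindowArrayReservoir;
import com.codahale.metrics.Timer;

public class ApiTimers {
    public static Timer createLatencyTimer(MetricRegistry registry) {
        // each snapshot covers only the last minute of measurements, so a
        // stale outlier can't leak into later Graphite datapoints
        return registry.register("api.call.latency",
                new Timer(new SlidingTimeWindowArrayReservoir(1, TimeUnit.MINUTES)));
    }
}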

Here's an example of what a graph like the above should look like (captured from our current staging stack):

Note that when there are no invocations, the reported 99th-percentile latency is 0, and each measurement doesn't stick around after its 1-minute slot.

Another potential bug fix for a related issue would be to add support to Metrics so that it can use Gil Tene's LatencyUtils package, and its HdrHistogram class, as a reservoir. (Update: however, I don't think this would address the "old data leaking into newer datapoints" problem as fully.) This would address some other bugs in the Exponentially Decaying Reservoir, as Gil describes:

'In your example of a system logging 10K operations/sec with the histogram being sampled every second, you'll be missing 9 out of each 10 actual outliers. You can have an outlier every second and think you have one roughly every 10. You can have a huge business affecting outlier happening every hour, and think that they are only occurring once a day.'

Eek.

the coming world of automated mass anti-terror false positives

Man sues RMV after driver's license mistakenly revoked by automated anti-terror false positive:

John H. Gass hadn’t had a traffic ticket in years, so the Natick resident was surprised this spring when he received a letter from the Massachusetts Registry of Motor Vehicles informing him to cease driving because his license had been revoked. [...] After frantic calls and a hearing with Registry officials, Gass learned the problem: An antiterrorism computerized facial recognition system that scans a database of millions of state driver’s license images had picked his as a possible fraud. “We send out 1,500 suspension letters every day," said Registrar Rachel Kaprielian. [...] “There are mistakes that can be made."

See also this New Scientist story, which notes that the system's pretty widespread:

Massachusetts bought the system with a $1.5 million grant from the Department of Homeland Security. At least 34 states use such systems, which law enforcement officials say help prevent identity theft and ID fraud.

In my opinion, this kind of thing -- trial by inaccurate, false-positive-prone algorithm -- is one of the most worrying things about the post-PRISM world.

When we created SpamAssassin, we were well aware of the risk of automated misclassification. Any machine-learning classifier will always make mistakes. The key is to carefully calibrate the expected false-positive/false-negative ratio, so that the severity of a misclassification's side-effects is matched by a suitably low expected error rate.

These anti-terrorism machine learning systems are calibrated to catch as many potential cases as possible, but by aiming to reduce false negatives to this degree, they become wildly prone to false positives. And when they're applied as a dragnet across all citizens' interactions with the state -- or even in the case of PRISM, all citizens' interactions that can be surveilled en masse -- it's going to create buckets of bureaucratic false-positive horror stories, as random innocent citizens are incorrectly tagged as criminals due to software bugs and poor calibration.

The easy way to find JMX metrics in the field using jmxsh

(oh look, a proper blog post!)

JMX is the de-facto standard in the Java and JVM-based world for exposing service metrics, and feeds nicely to tools like Graphite using JMXTrans and others. However, it's pretty obtuse and over-complex, and it can be hard to figure out what path the JMX metrics will show up under once deployed.

Unfortunately, once a JVM-based service is deployed to EC2, it becomes very difficult to use jconsole to connect to it, due to deficiencies and crappy design in the JMX RMI protocol (I love the way they reinvented the broken parts of IIOP in that respect). Don't even bother; instead, use jmxsh: https://code.google.com/p/jmxsh/ .

To use this, you need to modify the service process' command line to include the following JVM args, so that the remote JMX API is exposed:

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=16660 -Dcom.sun.management.jmxremote.local.only=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

Change the port number if there is already a process running on that port. Ensure the port isn't accessible from off-host; in EC2, this should be safe enough to use once that port number is not in the EC2 security group.

Go to https://code.google.com/p/jmxsh/downloads/list and download the latest jmxsh-FOO.jar; e.g. 'wget https://jmxsh.googlecode.com/files/jmxsh-R5.jar'. Then on the host, as the UID the service is running under, run: 'java -jar jmxsh-R5.jar -h 127.0.0.1 -p 16660'. You can then hit "Enter" to go into "Browse Mode", and you'll get text menus like this:

 ====================================================

  Attribute List:

        1. -r- long        MaxFileDescriptorCount
        2. -r- long        OpenFileDescriptorCount
        3. -r- long        CommittedVirtualMemorySize
        4. -r- long        FreePhysicalMemorySize
        5. -r- long        FreeSwapSpaceSize
        6. -r- long        ProcessCpuTime
        7. -r- long        TotalPhysicalMemorySize
        8. -r- long        TotalSwapSpaceSize
        9. -r- String      Name
       10. -r- int         AvailableProcessors
       11. -r- String      Arch
       12. -r- double      SystemLoadAverage
       13. -r- String      Version

   SERVER: service:jmx:rmi:///jndi/rmi://127.0.0.1:16660/jmxrmi
   DOMAIN: java.lang
   MBEAN:  java.lang:type=OperatingSystem

 ====================================================

Navigate through the MBean tree looking for good Attributes which would make good metrics (5 in the list above, for example). Note the MBean and the Attribute names.
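
Those names are exactly what you then feed to JMXTrans. From memory, a classic JMXTrans config polling attribute 5 above and writing to Graphite looks roughly like this (the Graphite hostname is a placeholder; check the JMXTrans wiki for the current schema):

{
  "servers": [ {
    "host": "127.0.0.1",
    "port": "16660",
    "queries": [ {
      "obj": "java.lang:type=OperatingSystem",
      "attr": [ "FreeSwapSpaceSize" ],
      "outputWriters": [ {
        "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
        "settings": { "host": "graphite.example.com", "port": 2003 }
      } ]
    } ]
  } ]
}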

Leaving Amazon

So, after just over 3 and a half years, I'm leaving Amazon.

It's been great fun -- I can honestly say, even with my code being used by hundreds of millions of users in SpamAssassin and elsewhere, I hadn't really had to come to grips with the distributed systems problems that an Amazon-scale service involves.

During my time at Amazon, I've had the pleasure of building out a brand-new, groundbreaking innovative internal service, from scratch to its current status where it's deployed in production datacenters worldwide. It's a low-latency service, used to monitor Amazon's internal networks using massive quantities of measurement data and machine learning algorithms. It's really very nifty, and I'm quite proud of what we've achieved. I was lucky to work closely with some very smart people during this, too -- Amazon has some top-notch engineers.

But time to move on! In a week's time, I'll be joining Swrve to work on the server-side architecture of their system. Swrve have a very interesting product, extending the A/B-testing model into gaming, and a great team; and it'll be nice to get back into startup-land once again, for a welcome change. (It's not all roses working for a big company. ;) I'm looking forward to it. Who knows, I may even start blogging here again...

Pity about losing those 12 phone tool icons though!

Flood of posts

Sorry for the flood of recent posts -- turns out my cron job to gateway from Pinboard had stopped running due to cron fail. (I should really set up some monitoring someday ;)

Telegraph spam in 1864

Here's a letter to the editor of The Times, dated 1st June 1864:

TO THE EDITOR OF THE TIMES.
Sir, --- On my arrival home late yesterday evening a "telegram," by "London District Telegraph," addressed in full to me, was put into my hands. It was as follows :--
"Messrs. Gabriel, dentists, 27, Harley-street, Cavendish-square. Until October Messrs. Gabriel's professional attendance at 27, Harley-street, will be 10 till 5."
I have never had any dealings with Messrs. Gabriel, and beg to ask by what right do they disturb me by a telegram which is evidently simply the medium of advertisement? A word from you would, I feel sure, put a stop to this intolerable nuisance. I enclose the telegram, and am,
Your faithful servant,
M.P.
Upper Grosvenor-street, May 30.

(thanks to Tony Finch for the forward)

In Dublin? Hear me talk about AWS network monitoring!

Reminder to Dublin-based readers -- next week, Amazon (my employers) will be putting on Under the Hood at Amazon, billed as 'A night of Beer, Pizza and Cloud Computing for Software Developers'. I'll be speaking at it.

It's partially a recruiting event, but even if you're not looking for a new job, please come along. It's also useful for us to talk about some details of what we've been doing in Dublin, since we've been operating to date with a pretty low profile, and in reality there's some very interesting stuff going on here... particularly the product I'll be talking about, naturally.

Also, there'll be free beer and some Kindles to be won ;)

It's next Thursday night, in our offices in Kilmainham. More info on this Facebook page.

temporary Hackerspace at MindField

This sounds very cool! Nice one, hackerspace ppl.

Ireland's Hackerspaces and Makerspaces (091 Labs - Galway, Belfast Hackerspace, MilkLabs - Limerick, Nexus Cork and TOG - Dublin) have been asked to build and man a temporary hackerspace during the MindField - International Festival of Ideas (http://www.mindfield.ie/). MindField will take place over the weekend of 29 April - 1 May in Merrion Square.

During MindField our temporary hackerspace will provide a range of events where festival participants can learn about diybio, 3D printing, basic electronics and micro controllers, electronic fashion/crafting and open data. These events are included in the festival schedule (http://mindfield.ie/festival-schedul/).

In parallel with these events we have an opportunity to run a Hardware Hacking Challenge. In this challenge we will try to engage a group of willing hackers, makers and festival participants in the challenge to create or construct interesting or innovative projects out of recycled hardware. We are trying to source interesting materials, electronic devices or equipment that can be used as the basis of projects, or as sources of components.

We are particularly interested in devices that contain various types of transducers which can then be hooked up to micro controllers and computers. We're not looking for normal computer equipment or servers -- we've got lots of that -- but more unusual stuff that people have lying around.

If you think you've got something they might like, contact Robert Fitzsimons.

My Problem With Norris

I'm uncomfortable voting for David Norris for President. Here's why.

In November last year, he was a key voice in a Senate debate on the topic of "Protection of Intellectual Property Rights", where he quoted heavily from the flawed judgement by Mr. Justice Peter Charleton in the Warner, Universal, Sony BMG and EMI vs UPC case. (There are allegations that he called the debate after speaking to Paul McGuinness (U2's manager) and Niall Stokes (of Hot Press).)

In the debate, Norris quotes Mr Justice Charleton, saying:

'In failing to provide legislative provision for blocking, diverting and interrupting internet copyright theft, Ireland is not yet fully in compliance with its obligations under European law.' Norris then says: 'Irish law could be brought into alignment with the intention of the European directive through a simple statutory instrument.' [1]

Now, let me clarify my position -- I'm in favour of some means of addressing the widespread piracy of music and movies, and I believe there's a mutually agreeable way to do this. But what Norris and Mr Justice Charleton propose is not it. Here are the problems as I see them.

It Lets The Internet Filtering Genie Out Of The Bottle

The big one.

The problem is that any infrastructure for 'blocking, diverting and interrupting internet copyright theft' is effectively infrastructure for 'blocking, diverting and interrupting' any communication on the net. We have to be very careful about how this is permitted, as it'll very quickly suffer "feature creep" and become a general-purpose censorship system -- the Great Firewall Of Ireland. As Damien Mulley put it:

'first they’ll start with the Pirate Bay. Then comes Mininova, IsoHunt, then comes YouTube (they have dodgy stuff, right?), how long before we have Boards.ie because someone quoted a newspaper article or a section of a book? And don’t think they’ll stop there too, any site that links to The Pirate Bay and the others on the hate list will probably be added to the list too...'

In Australia, the anti-child-porn filtering system was quickly used to block gambling websites, gay and straight porn sites, political parties, Wikipedia entries, Christian sites, Wikileaks, and a dentist; in Thailand, a similar system was used to block criticism of the royal family.

Will It Help? I Don't Think So

Norris:

'As long as Irish law is deficient, Mr. Justice Charleton has found that all creative Irish industries are losing money.'

This is quite a hilariously overblown and sweeping statement. ALL creative Irish industries? What qualifies as a 'creative' industry? I suspect some in this country have been involved in industrial acts of creation that made money. ;)

While they're not Irish, the well-known indie label Beggar's Banquet has gone on the record as stating the opposite where the current music situation is concerned --

"There's fewer gatekeepers now. We don't have to knock on a TV station's door or a radio station's door and it's made us far more competitive. [...] There's a wide highway in front of us we can go speeding down, and it wasn't there even two years ago. It means the majors are looking at a world where only 35 Gold Albums a year are certified compared to ten times that recently. But going above Gold in the US is not a problem for us."

So it appears a 'creative' industry (albeit in the UK) is finding things not quite so bad.

Norris again:

'the facts were established in the judgment of Mr. Justice Charleton in which he stated: “Between 2005 and 2009 the recording companies experienced a reduction of 40% in the Irish market for the legal sale of recorded music.” That is a devastating blow. [...] He went on to state: “Some 675,000 people are likely to be engaged in some form of illegal downloading from time to time.”'

Without quite lining up one statement with the other, this reinforces the impression that the only reason the recording companies have seen these drops in revenues is due to internet-borne piracy. However, quoting the brilliant Mumblin' Deaf Ro on the topic of lies, damn lies, and music biz statistics:

'The drop in the value of Irish retail music sales was 11.7% between 2008 and 2009, which is significantly less than the 18% overall drop in retail sales for the economy that year. Digital album sales have increased by 30% since 2007 both in terms of volume and market value.'

So in other words, between 2008 and 2009, Irish retail music sales outperformed the retail sales economy as a whole!

In addition, Ro provides the following BPI figures for UK market volumes over the 2005-2009 period:

    Year  Albums  Singles
    2005  159.0m   47.9m
    2006  154.7m   66.9m
    2007  138.1m   86.6m
    2008  133.6m  115.1m
    2009  128.9m  152.7m

It's clear that singles sales went through the roof, more than tripling. Album sales did drop, but nowhere near 40% -- and that decline coincided with the general downturn in the global economy around that time. He also notes that digital sales in the UK surged on a number of metrics in 2009.

While this does not provide figures for the Irish market, I'm at a loss as to how it could be radically different -- Irish and UK consumers have pretty similar musical tastes and consumption habits, I would guess.

Here's a theory: perhaps the issue could be that "Irish" music sales are associated with bricks-and-mortar music shops selling the physical product, whereas digital music sales are associated with online services based outside Ireland, and an Irish buyer buying an album at 7digital.co.uk, or on iTunes, isn't counted as an "Irish retail sale"? Could the problem be that we don't have any significant Irish shops selling music online, I wonder?

Bricks-and-mortar music shops, such as ex-Senator Donie Cassidy's "Celtic Note" (who coincidentally was quite vociferous in that Seanad debate), are indeed hurting in this new model of music consumption -- and that's a problem. But given that good, working digital music sales systems are in operation, it doesn't necessarily appear to be due to massive volumes of internet-borne piracy, going by these figures.

Essentially, internet piracy is a convenient bogeyman, especially for the technophobic old guard, but may have little bearing on the current woes of the Irish record industry and bricks-and-mortar music shops.

(Update: a couple of days after this was posted, a pair of economists at the LSE said basically the same thing.)

Audible Magic Won't Work For Long Anyway

Audible Magic, which Norris suggests is IRMA's favoured filtering system, received the following verdict from the EFF back in 2004:

'Should Audible Magic's technology be widely adopted, it is likely that P2P file-sharing applications would be revised to implement encryption. Accordingly, network administrators will want to ask Audible Magic tough questions before investing in the company's technology, lest the investment be rendered worthless by the next P2P "upgrade."'

Naturally, encryption is widespread nowadays, so this may already be the case.

Internet Censorship Harms Our Global Image

As Adrian Weckler points out:

'do we really want to send out the message that, digitally, we're the new France? Come to think of it, do we want to tell Google, Facebook, Apple and Twitter that, digitally, we're the new Britain?'

Right now, more than ever, we need to put out an image that we're ready to do business on our end of the internet. Mandatory censorship systems don't exactly support this.

In Summary

So, in summary, I would hope to see a more balanced approach to the issue from Norris. Most of the problematic statements in his speech were sourced directly from Mr. Justice Charleton's flawed judgement, but some critical thinking about that source was surely called for. The fact that it was lacking -- particularly given the allegations of heavy music-biz lobbying beforehand -- leaves me less inclined to vote for him than I was before, especially since I haven't heard any clarification on these issues.

([1]: Funnily enough, an SI similar to this was nearly sneaked through a couple of weeks ago, according to reports.)

Against The Use Of Programming Languages in Configuration Files

It's pretty common for apps to require "configuration" -- external files containing settings that customise their behaviour. Ideally, apps shouldn't require configuration at all, and that's always a good aim; but in some situations, it's unavoidable.

In the abstract, it may seem attractive to use a fully-fledged programming language as the language to express configuration in. However, I think this is not a good idea. Here are some reasons why configuration files should not be expressed in a programming language (and yes, I include "Ruby without parentheses" in that bucket):

Provability

If a configuration language is Turing-incomplete, configuration files written in it can be validated "offline", ie. without executing the program it configures. A general-purpose programming language, by definition, is Turing-complete, meaning that the configuration must be executed in full before it can be considered valid.

Offline validation is a useful feature for operational usability, as we've found with "spamassassin --lint".
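
As an illustration, here's what offline validation can look like when the configuration language is a simple declarative key-value format. This is a minimal sketch, not SpamAssassin's actual lint code, and the schema is invented for the example:

    #!/usr/bin/perl
    # minimal sketch of "offline" config validation; the schema here is
    # invented for illustration, not SpamAssassin's real settings list
    use strict;
    use warnings;

    my %schema = (
        required_score => qr/^-?\d+(\.\d+)?$/,   # must be numeric
        report_safe    => qr/^[012]$/,           # enumerated value
    );

    my $errors = 0;
    while (my $line = <>) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;            # skip comments and blanks
        my ($key, $value) = split ' ', $line, 2;
        $value = '' unless defined $value;
        if (!exists $schema{$key}) {
            warn "line $.: unknown setting '$key'\n"; $errors++;
        }
        elsif ($value !~ $schema{$key}) {
            warn "line $.: bad value for '$key': '$value'\n"; $errors++;
        }
    }
    exit($errors ? 1 : 0);

Crucially, nothing here runs the configured application; a validator like this can live in a pre-commit hook or a deployment pipeline.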

Security

Some configuration settings may be insecure in certain circumstances; for example, in SpamAssassin, we allow certain classes of settings, such as whitelists/blacklists, to be set in a user's ~/.spamassassin/user_prefs file, while disallowing rule definitions (which can cause poor performance if poorly written).

If your configuration file is simply an evaluated chunk of code, it becomes much harder to stop an attacker introspecting the interpreter and overriding those security limitations. It's not impossible -- you can, for instance, use a sandboxed interpreter -- but that's typically not easy to implement.
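
By contrast, when the configuration is plain data, enforcing that privilege boundary is just a set-membership check at parse time. A minimal sketch, assuming a simple key-value user_prefs format; the allowed-settings list is illustrative, not SpamAssassin's actual one:

    # sketch: untrusted user_prefs parsing with a whitelist of settings
    use strict;
    use warnings;

    my %user_may_set = map { $_ => 1 }
        qw(whitelist_from blacklist_from required_score);

    sub parse_user_prefs {
        my ($path) = @_;
        my %prefs;
        open my $fh, '<', $path or die "cannot open $path: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^\s*(#|$)/;
            my ($key, $value) = split ' ', $line, 2;
            next unless $user_may_set{$key};   # privileged settings dropped
            $prefs{$key} = $value;
        }
        close $fh;
        return \%prefs;
    }

There's no interpreter for the attacker to introspect; anything not on the list simply never reaches the application.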

Usability

Here's a rather hairy configuration file I've concocted.

    #! /usr/bin/somelanguage
    !$ app.status load html
    !c = []
    ;c['sources'] = < >
    ;c['sources'].append(
        NewConfigurationThingy("foo_bar",
            baz="flargle"))
    ;c['builders'] = < >
    ;c['bots'] = < >
    !$ app.steps load source, shell
    ;bf_mc_generic = factory.SomethingFactory( <
        woo(source.SVN, svnurl="http://example.com/foo/bar"),
        woo(shell.Configure, command="/bar/baz start"),
        woo(shell.Test, command="/bar/baz test"),
        woo(shell.Configure, command="/bar/baz stop")
        > );
    ;b1 = < "name": "mc-fast", "slavename": "mc-fast",
                 "builddir": "mc-fast", "factory": ;bf_mc_generic >
    ;c['builders'].append(;b1)
    ;SomethingOrOther = ;c

This isn't entirely concocted from thin air -- it's actually bits of our BuildBot configuration file, from before we switched to Hudson. I've replaced the familiar Python syntax with deliberately-unfamiliar made-up syntax, to emulate the user experience I had when attempting to configure BuildBot with no pre-existing Python knowledge. ;)

Compare with this re-stating of the same configuration data in a simplified, "configuration-oriented" imaginary DSL:

add_source NewConfigurationThingy foo_bar baz=flargle

buildfactory bf_mc_generic source.SVN http://example.com/foo/bar
buildfactory bf_mc_generic shell.Configure /bar/baz start
buildfactory bf_mc_generic shell.Test /bar/baz test
buildfactory bf_mc_generic shell.Configure /bar/baz stop

add_builder name=mc-fast slavename=mc-fast
     builddir=mc-fast factory=bf_mc_generic

Essentially, I've extracted the useful configuration data from the hairy example, discarded the symbology used to indicate types, function calls and data-structure construction, and let domain knowledge of the configuration imply what's necessary. Not only is this easier for the casual reader to comprehend, it also reduces the risk of syntax errors, simply by minimising the number of syntactical components.
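
The parser for a DSL like that needn't be much work, either. Here's a rough sketch of the whole thing -- the handlers are stubs, and continuation lines are flagged by indentation, as in the add_builder example above:

    # rough sketch of a parser for the imaginary DSL above
    use strict;
    use warnings;

    my %handlers = (
        add_source   => sub { print "source: @_\n" },   # stubs, for
        buildfactory => sub { print "factory: @_\n" },  # illustration
        add_builder  => sub { print "builder: @_\n" },
    );

    my @directives;
    while (my $line = <>) {
        chomp $line;
        next if $line =~ /^\s*$/;
        if ($line =~ /^\s/) {            # indented line = continuation
            die "continuation with nothing to continue" unless @directives;
            $directives[-1] .= " $line";
            next;
        }
        push @directives, $line;
    }
    for my $d (@directives) {
        my ($cmd, @args) = split ' ', $d;
        my $handler = $handlers{$cmd} or die "unknown directive '$cmd'";
        $handler->(@args);
    }

And since the grammar is this small, an offline "--lint" mode falls out nearly for free.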

See Also

The Wikipedia page on DSLs is quite good on the topic, with a succinct list of pros and cons.

This StackOverflow thread has some good comments -- I particularly like this point:

When you need your application to be very "configurable" in ways that you cannot imagine today, then what you really need is a plugins system. You need to develop your application in a way that someone else can code a new plugin and hook it into your application in the future.

+1.

This seems to be a controversial topic -- as you can see, that page has people on both sides of the issue. Maybe it fundamentally comes down to a matter of taste. Anyway -- my $.02.

Update: discussions elsewhere: HackerNews

Another Update, 2012-04-06: Robey Pointer wrote a post called Why Config?, in which he describes a Scala-based configuration language in use at Twitter, which uses Scala's runtime code evaluation, and a Scala trait, to express configuration succinctly in a Scala source file and load it at runtime. The downside? It's a Scala source file, executed at runtime, containing configuration. :(

However, this comment in the comments section is worth a read:

At Netli (now part of Akamai) we had a configuration framework very similar in spirit and appearance to Configgy. It was in early 2000-s, we open sourced it since. (http://ncnf.sourceforge.net/). It would provide on-the-fly reload for the C-based programs (the ncnf is a C library). It also had some perks like attribute inheritance and a concept of block references. Most importantly though, it contained a separate schema language and a validator to allow configuration be checked before pushing in production. At Netli we used it to configure 1200 services on over 400 hardware boxes, the configuration becoming about 20+mb in length (assembled from several pieces by the CPP, then M4 templating library).

Naturally, it wasn't Netli's first attempt at doing configuration. One of the first attempts failed since it was Turing-complete. That approach was to specify the configuration as a Perl data specification. In a very short time the lure of unused expressiveness of such Turing-complete environment prevailed and people started to write for-loops around data pieces and doing other tricks to remove redundancy from the configuration. It turned out to be a disaster in the end, with configuration becoming unmaintainable and flaky.

One principle I got out of that exercise is that configuration shall not be Turing-complete. We've got burned specifically by that property far too many times. Yet I do agree with you that a validation facility is a must-have, which is something not usually part of the simple text-based frameworks. C-based NCNF had it almost from the very beginning though, and it proved to be a very useful harness.

+1. There's lots more info on that system at this post at lionet.livejournal.com.

Another Update, 2017-05-09: casio_juarez on Twitter:

Also related: The Configuration Complexity Clock.

(Image credit: Turn The Dial by VERY URGENT Photography)

Irish Times “Most Read” Article Feed

If you visit the Irish Times at all frequently, you'll probably have noticed a nifty "wisdom of crowds" feature in the right sidebar: the list of "most read" articles. It's quite good, since they're often very interesting articles. Unfortunately, there's no RSS feed for this feature.

Well, now there is:

I made a sled

Facing yet another day of being snowed in, with Dublin's icy roads and footpaths driving us all stir crazy, I came up with this:

More pics, vid -- fun!

Science Gallery Xmas Cards

The Dublin Science Gallery Greeting Cards are excellent!

Get 'em here, or pick up one of the great gadgets and gifts they have in stock.

(disclaimer: I am mates with the designer and the guy who runs the shop -- but I still think they're great work, regardless ;)

Name-checked in the Seanad

So, after I posted this post about Aslan's imaginary illegal downloads, someone on Twitter linked to this comment by Senator Paschal Mooney (Fianna Fail), in the Seanad the next day, repeating the incorrect Aslan factoid:

Sen. Paschal Mooney (Fianna Fail): There is a perception that the big five record companies, all international companies, have been ripping off the consumer for many years. I do not want to be seen as an apologist for the music industry, but at the lower level I can give a specific example to highlight the impact of illegal downloading on Aslan, an Irish band. It has sold 6,000 copies of its current album, but there have been 22,000 illegal downloads. [...] Why must we wait for a High Court judgment to be made before we introduce relevant legislation?

It appears a few people, Adam Beecher for one, got in touch with the Senator by email. To my surprise, a couple of days later, I got some Twitter messages telling me that I'd been mentioned in the Seanad! Indeed, here it is:

Sen. Paschal Mooney (Fianna Fail): Last week on the Order of Business I raised an issue relating to illegal downloading of music on the Internet which followed on a court case which the major international record companies had lost that had been taken the previous day. I asked the Leader what possible legislation could be introduced to address this gap, and I am repeating the request. I have had quite a significant amount of response to the comments I made last week, specifically from persons who state that the figures quoted in my report, and also the figures quoted in the court case to defend the record companies’ position, are inaccurate, and I was asked by a number of those who emailed me to correct the record. Having investigated this further - I recommend to the House that those who are interested log on to taint.org - there is no doubt that the figures that have been quoted to support the court case, which was subsequently lost, are not accurate. It related to the group Aslan. I do not want to delay the House on this other than to correct the record in that I put the figures as I had received them in good faith and such has been the response to the comments I made in the House last week that I feel obliged to correct the record and state that there is no doubt but that the figures that have been used are, at best, suspect.

It would be important if the Leader could have the Minister for Enterprise, Trade and Innovation, Deputy Batt O’Keeffe, come to the House to give some indication of his proposals because the music industry is currently lobbying in this House and in the other House to have legislation changed to benefit it. However, there is a wider view that illegal downloading will continue irrespective of what happens, the record companies are now on the defensive and there are other alternatives that could be brought forward such as licensing those who wish to download. In that context, I would be interested in the Leader’s response.

A few comments in response:

  • Credit is due to Senator Mooney in that he admitted that he'd been misled, and corrected the record in that regard.

  • It's amazing to see that the democratic process has opened up to this degree. I would never have expected to have this degree of input to our elected representatives without having to go through more traditional channels (face-to-face meetings, etc.)

  • Finally: 'The music industry is currently lobbying in this House and in the other House to have legislation changed to benefit it'. That is very, very worrying. Indeed, suzybie noted on Twitter:

@jmason not sure if you caught it but I saw Willie K and his mates entering Dáíl last Wednesday evening. FF backbenchers were being met

McGarr solicitors have been in touch with the relevant Ministers requesting that Digital Rights Ireland be included in any discussions regarding legislative change. This will be one to keep an eye on.

Irish Times Letter re EMI v UPC

Submitted via email to their letters page. This may be a bit too long for the format, but hey. Enjoy.

Madam, -- Commentary in this paper and elsewhere has given the impression that Mr. Justice Charleton's judgement on the EMI v. UPC case was a poor result for EMI and the other record companies represented. This is not necessarily the case. While UPC may not yet have to implement "three strikes", there are many things to worry the Irish internet user in the judgement.

Mr. Justice Charleton states that he is satisfied that the business of the recording companies is being devastated by piracy, entirely based on evidence submitted by the record companies and IRMA. One of these assertions was that over 20,000 illegal downloads of an "Aslan" album had been "traced" -- but no details of the methodology of this "tracing" have been produced.

Third-party attempts to reproduce this figure indicate that it is probable that an extremely naive approach was taken in this testing -- the putative copies of the album available to download, and their large download figures, are in reality a lure used by criminals to persuade unwitting victims to provide their credit card details to fraudulent websites.

Worryingly, this flawed evidence has already been represented as fact in the Seanad by FF senator Paschal Mooney.

Other studies cited in the judgement have been criticised widely elsewhere, including by the US Government Accountability Office in its April 2010 report to the US Congress.

Mr. Justice Charleton goes on to suggest that all internet access from UPC (and presumably other ISPs) be filtered through a piracy-detection system. One wonders what the many companies who currently run internet-based services from Ireland would make of this proposal.

The government now seems keen to rush in and implement the filtering and blocking systems requested by IRMA and the music companies, as Mr. Justice Charleton recommends, or possibly even to give hand-outs to the music industry to compensate them, as IRMA demands. One hopes that more technical expertise will be brought to bear on the supposed "evidence" before this happens.

Yours, etc., Justin Mason

Aslan’s hard times, from the UPC judgement

Oh dear. Quoting Mr Justice Charleton's judgement in favour of UPC vs. EMI, Sony, et al:

'This scourge of internet piracy strongly affects Irish musicians, most of whom pay tax in Ireland. ‘Aslan’ is a distinguished Irish group which has a loyal fan base; but not all of them believe in paying for music. Previous sales of their albums were excellent, about 35,000 per album, and in respect of one called “Platinum Collection”, a three CD box set, 50,000 copies were sold. More recently, an album called “Uncased” was released and only 6,000 copies were sold. Perhaps, it might be thought, the album was not popular and did not sell well? In contrast, a search was made to see how many illegal downloads had been made on the internet from that album, and 22,000 were traced.'

Aslan, eh?

So, that would be about the same figure as EMI quoted in a press statement in July 2009, which 'Gambra' on the thumped.com boards thoroughly debunked at the time:

'I've just been listening to the first minute or two of this and have done a mere 10 minutes of googling to try verify the claim of 25,000 downloads. The EMI press statement mentioned that they've tracked that amount of downloads "through Torrents Nova and Pirate Bay alone." The first problem with that is that there's no such site as Torrents Nova (I presumed they meant mininova but Aslan gets zero hits over there) but never mind, we'll carry on. Next I search for ever possible permutation for downloads of the new Aslan album and I kept getting the same result which is "Aslan - Uncase'd (2009) KompletlyWyred Dhz.inc" which was uploaded to thepiratebay. However this file only has a grand total of 9 seeders and 6 leechers and has been alive since the 26th of June. There's no way of telling how many times it's been leeched exactly but even if it was 6 new leechers every day it'd be a total of 108 downloads. It is fair to assume that only 9 of these bothered to seed back so I'd say the total is right.

Wondering still where the hell they got their mystical 25,000 total from I just searched for "Aslan Uncased" and was surprised to see 5 links to torrents of the album in the first two pages of results. However 4 of the 5 just link back to the one on TPB with 9 seeders. The 5th is where I think they got their mystical 25,000 total from:

http://www.nowtorrents.com/torrents/aslan-uncased.html

This is the 7th result you get on google for the album title and when you click it you actually get "No Matches were found" but up at the top are FAKE results that are actually just ad links. You could search for anything and you'll get those exact same four ad results.

http://www.nowtorrents.com/torrents/gambra-thumped.html

If you refresh the totals change each time so it's safe to say they found this link by googling the name, added up the total of listed downloads they got (which is totally random) and are using that to moan about their loss of sales. Incredible.'

Indeed, according to the site, an album called 'Justin Mason on the nose flute' has been downloaded 24,752 times -- I never knew! Where's my cheque?

Some quality facts and figures from EMI there, I suspect.

E-mail Address Validating Regular Expressions – a Warning

This page has been floating around in links over the past couple of weeks, as a collection of test cases to compare e-mail address validating regular expressions. However, watch out: it's wrong.

RFC 822 and RFC 2822 define an email address whose domain part is an IP address as using a bracketed "domain literal":

  domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]

In other words, this test case is not valid at all:

  IPInsteadOfDomain@127.0.0.1

Instead, it should be:

  IPInsteadOfDomain@[127.0.0.1]

Ditto for the other addresses using IP addresses in the domain part. They're rare, but the non-bracketed form is definitely not legal, and should not be treated as valid in the test cases.
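
For illustration, here's a minimal check of just this bracketed-domain-literal rule -- nowhere near a full RFC 2822 address parser, and the dotted-quad pattern is deliberately simplistic:

    # sketch: accept an IPv4 domain part only in its bracketed form.
    # NOT a full RFC 2822 parser -- it checks this one rule only
    use strict;
    use warnings;

    sub ip_domain_part_ok {
        my ($addr) = @_;
        my ($domain) = $addr =~ /\@(.*)$/s or return 0;
        my $quad = qr/\d{1,3}(?:\.\d{1,3}){3}/;
        return 0 if $domain =~ /^$quad$/;       # bare IP: not legal
        return 1 if $domain =~ /^\[$quad\]$/;   # bracketed: fine
        return 1;    # not an IP literal at all; out of scope here
    }

    print ip_domain_part_ok('IPInsteadOfDomain@127.0.0.1')   ? "ok\n" : "INVALID\n";
    print ip_domain_part_ok('IPInsteadOfDomain@[127.0.0.1]') ? "ok\n" : "INVALID\n";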

I sent a mail to the author a few days ago without response, hence this post.

Travel Insurance that works, even with ash clouds about?

Lazyweb request. In a few weeks I'll be taking a flight, along with the wife and kids, for some holidays.

The trip was booked before the whole ash-cloud thing and I used Ace Travel Insurance, a typical low-cost travel insurance agency, winding up with an 'ACE Travel Single Trip Travel HealthCover+ Insurance Policy'.

Looking at the policy doc now, it expressly excludes cover for 'a Public Conveyance being cancelled or curtailed because of adverse weather, industrial action, or mechanical breakdown or derangement' if 'an aircraft, sea vessel or train is withdrawn from service on the orders of the recognised regulatory authority in any country' -- which is exactly what's been happening in Ireland in the face of the Eyjafjallajoekull ash cloud.

That's pretty useless, isn't it? I'm considering booking another additional policy to cover the 'ash case'. Anyone got any tips on single-trip policies that don't use a similar exclusion?

what Colmcille really said

Mr. Justice Peter Charleton, in the course of his judgement on EMI Records & Ors -v- Eircom Ltd is quoted as having said the following:

' There is fundamental right to copyright in Irish Law. This has existed as part of Irish legal tradition since the time of Saint Colmcille. He is often quoted for his aphorism: le gach bó a buinín agus le gach leabhar a chóip (to each cow its calf and to every book its copy).'

As many have already noted, Colmcille didn't say that at all; his opponent did. If anything, Colmcille invented copyleft.

Manus O’Donnell's account:

Do inneis Finden a sceila art us don righ, ass ed adubhairt ris: “Do scrib C.C. mo leabhur gan fhis damh fen,”ar se, “aderim corub lim fen mac mo leabhur.”

“Aderim-se,” ar C.C., “nach mesde lebhur Findein ar scrib me ass, nach coir na neiche diadha do bi sa lebhur ud do muchadh no a bacudh dim fein no do duine eli a scribhadh no a leghadh no a siludh fan a cinedachaib; fos aderim ma do bi tarba dam-sa ina scribhadh, corb ail lium a chur a tarba do no poiplechaibh, gan dighbail Fhindein no a lebhair do techt ass, cor cedaigthe dam a scribudh.”

Is ansin ruc Diarmaid an breth oirrdearc .i. “le gach boin a boinin” .i. laugh “le gach lebhur a leabrán.”

Or, translated to English by A. O’ Kelleher and G. Schoepperle:

Finnen first told [High King Diarmaid] his story and he said “Colmcille hath copied my book without my knowing,” saith he “and I contend that the son of the book belongs to me.”

“I contend,” saith Colmcille, “that the book of Finnen is none the worse for my copying it, and it is not right that the divine words in that book should perish, or that I or any other should be hindered from writing them or reading them or spreading them among the tribes. And further I declare that it was right for me to copy it, seeing there was profit to me from doing in this wise, and seeing it was my desire to give the profit thereof to all peoples, with no harm therefore to Finnen or his book.”

Then it was that Diarmaid gave the famous judgement: “To every cow her young cow, that is, her calf, and to every book its transcript. And therefore to Finnen belongeth the book thou hast written, O Colmcille.”

Soon thereafter, of course, 3000 died in the Battle of the Book at Cooldrumman, bringing a rather literal meaning to the modern term "copyfight". 'Colmcille and the Battle of the Book: Technology, Law and Access to Knowledge in 6th Century Ireland' is recommended for more background.

Guinness vs independent breweries

Guinness' latest product, Guinness Black Lager, gets a panning in the Irish Times today.

I'm not a fan of Guinness. It's a good beer, but monotonous when it's the only thing available. This, from the old Dublin Brewing Company website, makes some interesting allegations as to why that may be the case:

In 1996 the Dublin Brewing Company was set up in Smithfield, in the old James Crean soap factory. As the only brewery in Dublin other than Guinness, Dublin Brewing Company represented a small but real challenge to the Guinness monopoly. Initially [Guinness'] reaction was "it won't work because Irish people were brand loyal and wouldn't change to anything new." However by November 1997 Guinness could see an increasing threat from a number of new microbreweries which were opening up around Ireland; it built its own microbrewery called St. James's Gate Beers. In the words of their Weekly News No: 44 "the four unique and distinctive draught beers are designed to meet perceived demand amongst ale and lager drinkers over the age of 28 for a wider choice of tastier draught beers."

The project team had spent 18 months conducting exhaustive R & D into the Irish drinking palate before the launch. This research included taking samples of Beckett's and D'Arcy's from public houses in Temple Bar and returning it to their citadel of brewing science for further analysis. Just exactly how do those "Fun Lovin Brewers" in Smithfield make beer? The code word for this return to basic brewing was affectionately known among company staff as "Operation Wolf".

The Dublin Brewing Company, amongst other small breweries was going to be lambs for slaughter. Of course, when you have a virtual monopoly on tap space in most bars, it's no problem launching no less than four beers in twenty pubs in Dublin overnight. Luckily drinkers in this country know what they want, and if they want a real beer they support the increasing number of microbreweries in Ireland, not a monopoly brewer masquerading as a small producer. The attempt at what was called "full taste" beers turned out to be a disaster. By October 1998 the operation was quietly closed down. However, now that St. James Gate is no more (£3-5m expenditure), we have its latest treat, Breo, being launched with the usual bravado Guinness display on these occasions - 10/15 kegs of beer free for every publican that takes it in. The pub gets the higher number of kegs if they take something else out. As the only other brewery in town, the Dublin Brewing Company is back on the firing line. The Dublin Brewing Company would like to dedicate D'Arcy's Dublin Stout to the memory of those old Dublin breweries.

Sadly, whether due to Guinness' tactics or not, the DBC appears to be no more. There are a few microbreweries around Ireland, but generally, the pub taps in this country are dominated by low-quality lagers, and Guinness. At least Paulaner is becoming widely available on tap, imported by Heineken...

spamass-milter != SpamAssassin

Just heading this one off before it gets too much further...

A couple of weeks ago, a researcher found a bug in the spamass-milter project, an open-source milter to integrate SpamAssassin filtering into an MTA. Here are the exploit details.

This H-Online story covered it:

Security vulnerability in SpamAssassin filter module

The SpamAssassin Milter plug-in which plugs in to Milter and calls SpamAssassin, contains a security vulnerability which can be exploited by attackers using a crafted email to inject and execute code on a mail server. The SpamAssassin Milter plug-in is frequently used to run SpamAssassin on Postfix servers.

(I think this is the source article on Heise.de.)

That was more-or-less accurate -- but the problem is the "chinese whispers" effect, where a news story on one site builds on misreadings of an earlier article. eSecurityPlanet:

Security Flaw Found in SpamAssassin Plug-in

The SpamAssassin Milter plug-in has been found to contain a security vulnerability. [...]

sigh.

To clarify: spamass-milter is not a part of SpamAssassin. It's a third-party product which allows sendmail/postfix users to integrate SpamAssassin into their message flows as a milter.

SAY2K10 Doh

Happy new year! Or maybe not. Doh.

Over a year ago, Lee Maguire noticed that a contributed SpamAssassin rule, FH_DATE_PAST_20XX, was naively written -- simply to match any date in the year 2010 or later -- and would start to false-positive on all mail in 14 months. We made the trivial fix to avoid this (for at least 10 years, by which point the rule would have obsoleted itself through normal means), and I committed it to SVN.
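
From memory, the rule boiled down to a date-header regex along these lines -- a reconstruction of the failure mode, not the exact rule source:

    # reconstruction, not the exact rule source: the contributed rule
    # matched any Date: header year from 2010 to 2099
    my $naive = qr/20[1-9][0-9]/;   # fires on all mail sent after 2009-12-31
    my $fixed = qr/20[2-9][0-9]/;   # the trivial fix: push the window out a decade

    my $hdr = 'Date: Fri, 01 Jan 2010 00:00:01 +0000';
    print "naive rule fires\n" if $hdr =~ $naive;   # it does
    print "fixed rule fires\n" if $hdr =~ $fixed;   # it doesn't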

Problem solved, right? Nope. I'd committed to trunk, but in a moment of inattention had forgotten to backport the fix to the stable release branch, 3.2.x, as well. Nobody else noticed the mistake, and several months later, boom:

Bugger.

Annoyingly, the GA (the genetic algorithm we use to generate rule scores) had assigned this rule 3.5 points in the 3.2.0 rescoring run. This meant that the effective default threshold had been lowered from 5.0 points to 1.5, which produced a 2% false-positive rate during the first 13 hours of the new year.

After that point, the fix was pushed to the sa-update channel, and anyone who runs sa-update regularly (as they should!) was brought back to normal filtering behaviour.

The rule is superfluous anyway, since it overlaps with a better-written "eval" rule, DATE_IN_FUTURE_96_XX. Accordingly, the most likely scenario is that it'll be removed.

Personally, I see a few lessons from this:

  • Obviously, I need to pay more attention. This is easier said than done though, since SpamAssassin has nothing to do with my day job anymore; it's a spare-time thing nowadays, and that's a rare resource, unfortunately. :( But still, a chastening result, and I'm very sorry for my part in this screwup.

  • We need more active committers on Apache SpamAssassin. If we'd had more eyes, the fact that I'd forgotten to backport the fix might have been spotted. We're definitely in a better situation now in this regard than we were 6 months ago, so that's good.

  • IMO, this is a good demonstration of how too many simple rules are risky; without careful vetting and moderation, it's easy for a bad one to slip past. Perhaps we need to move more towards a DNSBL/network-rule driven approach, although this has its downsides too. Still thinking about this.

  • It'd be good to fix the GA so that it wouldn't assign such high points to simple rules like this, without some indication that a human has vetted them and believes them trustworthy.

Daryl posted a good comment on /.:

Clearly we dropped the ball on this one. As far as I know it's our first big rule screw up in the project's 10 years. If you're going to screw up you might as well do it well.

+1 to that!

And to everyone who had to clean up the fallout and spend a holiday recovering lost mails from spam folders... sorry :(

Sup Rocks

For the past 2 years or so, I've been using GMail to handle my main mail feed for jmason.org. I'm an absolute convert to its "river of threads"/search-based workflow.

Since starting at Amazon, I've had to deal with a heavy volume of work mail. Previous jobs either had low mail volumes, or used Google Apps hosting for their mail; but Amazon's volumes are high and -- obviously -- they're not using Google. ;) For a while, I tried using Thunderbird, but it just didn't cut it; I could never keep track of mails I wanted archived, or remember which folder they were in, etc. -- the same old problems that GMail solved.

Enter Sup. It's a console-based *nix email client, with a Mutt-like curses interface, which offers something closely approximating the GMail experience:


Sup is a console-based email client for people with a lot of email. It supports tagging, very fast full-text search, automatic contact-list management, custom code insertion via a hook system, and more. If you're the type of person who treats email as an extension of your long-term memory, Sup is for you.

Inbox Zero is a daily occurrence for my work email now; I can simply archive pretty much everything, and reliably know the excellent full-text search support will allow me to find it again in an instant when I need it. The new-user guide is well worth a read to get an idea of its featureset and UI.

Setting it up

The process of getting it set up is quite hairy; here are some instructions for Ubuntu, which thoroughly failed to work for me on 9.04. I had a similarly tricky time using some Ruby packages on the Red Hat work desktop, but eventually avoided it by just building vanilla Ruby from source, then using that to install "gem" and from that, "sudo gem install sup". Much easier...

The next step is to get the mail. From some reading, it appears the most reliable way to deal with an MS Exchange 2007 server is to use offlineimap to sync it to a local set of maildirs, then add those as Sup "sources" using sup-add, one by one. This is very well supported in Sup, and works well. Offlineimap is very easy to install on Ubuntu, and can easily be built from source if that's not an option. My config is pretty much a vanilla copy of the minimal config.

There's a good Sup hook to run "offlineimap" every poll interval, and rescan synced sources that contain new mail. It works well.

Sup has an interesting approach to mail storage -- it doesn't. Instead, it stores pointers to the messages' locations in their source storage. This is a great idea, since bugs in Sup therefore cannot lose your mail -- just your metadata about your mail. However, it means that if the source changes in a way which moves or removes messages, you need to tell Sup to rescan (using "sup-sync"), but that's no big deal in practice; in the more usual case, if new mail arrives, it's automatically rescanned.

I have just under 7000 mail messages in my Sup index, and rescans are speedy and searches super-fast. It's very nicely done.

Outbound mail is delivered using /usr/sbin/sendmail by default, which should be working on any decent *nix desktop anyway ;)

Recommended Hooks

The Hooks wiki page has a few good hooks that you should install:

  • ~/.sup/hooks/before-poll.rb: the above-mentioned offlineimap poll hook
  • ~/.sup/hooks/mime-decode.rb: 'uses w3m to translate all HTML attachments that don't have a text/html alternative.' Well worth installing.
  • ~/.sup/hooks/before-add-message.rb: essential to filter out cron noise and the like so it doesn't hit the inbox; unfortunately Sup doesn't (yet) support GMail's "filter messages like this" UI.

Bad Points

  • Long URIs: unfortunately, very long URIs are broken by Sup's renderer, and it doesn't offer a native way to "activate" URIs and have them opened in the browser; instead one has to cut and paste them. This is pretty lame. I've hacked up a perl script that will reconstruct the full URLs from the broken rendering when the text is piped through it, but that's a horrible hack (a sketch of it follows this list).

  • Index Corruption: I've had the misfortune (once, in the month since I started) of corrupting my search index, causing Ruby exception stack traces when I attempted to run "sup-sync" to scan new mail. The only fix appeared to be to restore my index from a "sup-dump" backup. Thankfully all seems fine now, but it was a definite reminder of the product's beta status.

  • Calendaring: still as painful as it's ever been with UNIX command line email.

  • HTML: A good-quality, email-oriented, native HTML renderer would be awesome.

  • MIME: Sup again takes the traditional approach of UNIX command-line clients, delegating to the mailcap file and its rules; unfortunately my RHEL5 desktop is too crappy to have a good mailcap setup, so I've had to write one from scratch to deal with the usual .docs and .xls's etc. flying about.

  • Inconsistent Key Mapping: Given that it shares so much UI with GMail in other respects, it's a little annoying that Sup doesn't have the same key mapping. Not a big deal, as it took only a couple of hours to get the hang of Sup's, though.
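
For what it's worth, here's the flavour of that URL-reassembly hack -- a sketch rather than the exact script, and the rejoining heuristic is crude:

    #!/usr/bin/perl
    # rejoin URLs that the renderer has wrapped across lines; pipe the
    # message text through it. Assumes a broken URL runs to end-of-line
    # and its continuation is the first word of the next line -- a
    # heuristic, nothing more
    use strict;
    use warnings;

    local $/;               # slurp all input
    my $text = <STDIN>;

    # repeatedly glue "http://..." at end-of-line to the word following
    1 while $text =~ s{(https?://\S+)\n(\S+)}{$1$2}g;

    print $text;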

Overall

If you're happy enough to spend a day or two getting the damn thing installed, and aren't afraid of a little dalliance with the bleeding edge, I strongly recommend it. It's definitely the best *NIX mail reader at the moment.

Met iPhone

Irish iPhone users -- you may find this useful. I've written a web scraper which takes a couple of the more useful pages on Met Eireann's website -- the regional forecast and the rainfall radar page -- and reformats them in an iPhone-optimised style. Enjoy:

(updated: supports all the provincial forecasts now)
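
For the curious, the scraper itself is nothing fancy; the gist is along these lines. This is a sketch, not the real code -- the URL and the extraction pattern are placeholders, not Met Eireann's actual markup:

    #!/usr/bin/perl
    # sketch of the scrape-and-reformat approach; the URL and the
    # extraction pattern are placeholders
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $url  = 'http://www.met.ie/SOME-FORECAST-PAGE';   # placeholder
    my $html = get($url) or die "fetch of $url failed";

    # pull out just the forecast text (placeholder pattern)
    my ($forecast) = $html =~ m{<div id="forecast">(.*?)</div>}s
        or die "page layout changed?";

    # re-emit as a minimal page that renders well on a small screen
    print '<html><head>',
          '<meta name="viewport" content="width=device-width, initial-scale=1"/>',
          '</head><body>', $forecast, '</body></html>', "\n";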

Lest we forget

Regarding Google Wave's similarity to Lotus Notes, which is a meme I've heard from several angles -- David Jones hits the nail on the head:

Well, I used Notes from 1994 to 1999. It did have a database backend for e-mail and a rich collaborative editing model. But it didn't have realtime shared editing, or instant annotation.

And it was shit. No-one in their right minds would have wanted the future of the web to have been Notes. Even though, and I completely agree, it did things that the web is now only just getting round to.

+1 to that!

n+30 Days

Colm's "n+1" post reminded me that I'd forgotten to write about this.

On July 27th, I started at Amazon, in a new Dublin-based software dev team working on infrastructure automation. It's now (just over) a month later, and I'm enjoying it immensely.

Needless to say, this company does some very interesting web-scale technology, and getting to look inside the AWS sausage factory is really enjoyable, believe it or not ;)

(I should also post a pic of my glorious screen real-estate. The hardware is a massive improvement over the previous gig, thankfully.)

Unfortunately, however, this has coincided with a lack of free time to blog and keep up with interweb-based leisure pursuits, including SpamAssassin. Really though, this is more due to looking after two wonderful little girls under 2 years of age, rather than the job -- but still, I need to remedy my neglect of this site...

In SpamAssassin news: we've been putting out some alpha releases of 3.3.0, and are planning to do a mass-check for score-generation in the next couple of days. Hopefully we can drive 3.3.0 to a GA release in a few weeks.

Also -- we're still looking for more people in the Amazon team, and hiring aggressively. If you're looking for an interesting software dev role in Dublin, get in touch!

PS: it was Bea's second birthday last weekend. Check out the awesome Very Hungry Caterpillar cupcake cake made by the missus for the occasion:

Embedded software development

Found in an Ivan Krstic post about Sugar and the OLPC:

In truth, the XO ships a pretty shitty operating system, and this fact has very little to do with Sugar the GUI. It has a lot to do with the choice of incompetent hardware vendors that provided half-assedly built, unsupported and unsupportable components with broken closed-source firmware blobs that OLPC could neither examine nor fix. [...]

We had an embedded controller that blocks keyboard events and stops machine suspend, and to which we -- after a long battle -- received the source, under strict NDA, only to find a jungle of nested if statements, twelve levels deep, and no code history. (The company that wrote the code doesn't use version control, see. They put dates into code comments when they make changes, and the developers mail each other zip files with new versions.)

Haha. Been there, done that. Sometimes it's great not to have to work with custom hardware anymore...

YA link-blog aggregator

Alex Payne writing about "Fever", a new link-blog aggregator app:

Fever's proposition is straightforward: supply it with the feeds you always want to read, and supplement those with feeds that you only want to read the juicy bits of. Fever will then show you a sort of personal Techmeme or Google News, pulling together stories that reference common URLs.

Fever is commercial software, costing $30. Alternatively, I've been doing something very similar for the past few years using SpicyLinks, which is free (if a great deal less pretty on the UI end).

It's nice to see the idea getting some polish, though. ;)

Alex does raise an interesting point towards the end:

Fever is just fine for floating good techie content to the top, but poor for most any other subject. I'd love it if Fever could find me good posts from the set of minimal techno or cocktail blogs I subscribe to, but link blogs -- and, indeed, linking outside one's own site -- just aren't as prevalent in those communities.

True.

Eircom’s “DDOS”, or not

I woke up this morning to hear speculation on RTE Radio as to how Eircom's DDOS woes were possibly being caused by the Russian mob, of all things. This absurd speculation is not helped by lines in statements like this:

'The company blamed the problems on "an unusual and irregular volume of internet traffic" directed at its website, which affected the systems and servers that provide access to the internet for its customers.'

I'm speculating, too, but it seems a lot more likely to me that this isn't just a DDOS, and someone -- possibly just a lone Irish teenager -- is running an attempted DNS cache-poisoning attack. Here's why.

Last week, reports mentioned two features of the attack: DDOS levels of traffic, and incorrect pages coming up for some popular websites. Operating a Kaminsky DNS cache-poisoning attack requires buckets of packets -- easily perceivable as DDOS levels. This level of traffic would be the first noticeable symptom on Eircom's network-management consoles, so it'd be easy to jump to the conclusion that a simple DDOS attack was the root cause.
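
Rough numbers make the point. The attacker has to win a race against the real DNS reply by guessing the resolver's 16-bit transaction ID, so each poisoning attempt means spraying forged responses by the tens of thousands. A back-of-envelope sketch (the per-race figure is assumed; real attacks also juggle source ports and timing):

    # why Kaminsky-style poisoning looks like a DDOS, roughly.
    # assumes only the 16-bit transaction ID must be guessed
    # (i.e. a pre-patch resolver with a fixed source port)
    my $txid_space = 2 ** 16;   # 65,536 possible transaction IDs
    my $per_race   = 200;       # forged replies per race window (assumed)
    my $races      = $txid_space / $per_race;   # expected races to win one
    printf "~%d races x %d packets = on the order of %d forged packets\n",
        $races, $per_race, $races * $per_race;
    # the Kaminsky trick lets the attacker retry immediately with a new
    # random hostname, so the flood is sustained rather than one-shot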

This week, there's just the DDOS levels of traffic. No cache poisoning effects have been reported. This would be consistent with Eircom's engineers getting the finger out over the weekend, and upgrading the NSes to a non-vulnerable version. ;)

Once the attacker(s) realise this, they'll probably stop the attack.

It's not even a good attack for a bad guy to make, by the way. Given the timing -- right after major press coverage of a North Korean DDOS on US servers -- it's extremely high-profile, and made the news in several national newspapers (albeit in rather inept fashion). If someone wanted to make money from an attack, a massive-scale packet flood indistinguishable from a DDOS against the nation's largest ISP is not exactly a subtle way to do it.

In the meantime, apparently OpenDNS have really seen the effects, with mass switchover of Eircom's customers to the OpenDNS resolvers. Probably just as well...

I’m a Dermotologist!

Found here:

On Wednesday 20 May 2009, speaking at a parliamentary Justice Committee debating his new blasphemy law, Dermot Ahern joked that people were making blasphemous comments about him, and he compared his own purity to that of the baby Jesus.

So we have a Justice Minister joking about himself being blasphemed, at a parliamentary Justice Committee discussing his own blasphemy law, that could make his own jokes illegal.

In honour of this Ministerial revelation, we have founded the Church of Dermotology. We believe God sent Dermot Ahern to save Ireland from rational thinking. Our sacred symbol is the Star of Dermot.

Our sacred beliefs are quite similar to those of other religions.

  • We believe ice cream wafers are literally the body of Dermot Ahern.
  • We believe Dermot Ahern created the universe on Wed 20 may 2009.
  • We’re sometimes not sure whether Dermot Ahern really exists.
  • We believe it is blasphemous to publish an image of Dermot Ahern.
  • We refuse to gather sticks on the Sabbath, which is Wednesday.
  • We wear magic underpants that protect us from fire and bullets.
  • We are outraged whenever anybody insults our sacred beliefs.
  • We fervently support Dermot Ahern’s proposed blasphemy law.
  • If it is passed, we will be regularly outraged, and will take test cases.

Like Scientologists, Dermotologists offer a free personality test. Question one: are you vulnerable? Question two: have you money? If you answer yes to either of these questions, you’re in.

After you join, check out the campaign against the Irish blasphemy law at blasphemy.ie.

Health and Safety

A while back a friend of mine mailed us all with this classic of overweening health-and-safety bureaucrats gone wild:

The company are now installing wallpaper on our PCs with their 5 golden safety rules:

  1. Always hold the handrail

  2. Always reverse park

  3. Assess Risks

  4. Accept Challenges

  5. Wear PPE [Personal Protective Equipment] gear

We also have to drink from metal cups with plastic lids on them.

The thing that really got me was #2 -- 'always reverse park'. Apparently, someone decided that reversing into the parking space was safer than going in head-first, and to such a significant degree that it was worth mandating it across a medium-sized company. On the other hand, another friend noted:

The college i went to [in the US] would ticket you if you backed into a parking space -- they said it was a "fire hazard".

So we've got "fire hazard" in one direction and "unsafe" in the other. Parse that.

Another friend was told that she couldn't bring her folding bike in the lift because "what would happen if the president was in the lift going to the board room?". She says "I could not work out the health and safety implications."

What health and safety insanity have you encountered recently?

Gravatar Fail

Hey Gravatar. When you auto-generate an avatar image, like you did with the one to the right, could you do me a favour and omit the bits that look like swastikas? kthxbai!

Open source ‘full text’ bookmarklet and feed filter

Last year, I blogged about Full-Text RSS, a utility to convert those useless "partial-text" RSS/Atom feeds into the real, full-story-inline deal.

The only downside is that the author felt it necessary to withhold the source, saying:

Still, I wouldn't want to offer a feature that middlemen can resell at the expense of bloggers. So while I do want to open this up, I don't want to make things easy for the unscrupulous.

However, recently Keyvan Minoukadeh from the Five Filters project got in touch to say:

I recently created a similar service (along with a bookmarklet for it). [...] It’s a free software (open source) project so code is also available.

Here it is:

fivefilters.org: Create Full-Text Feeds

I've tried it out and it works great, and the source is indeed downloadable under the AGPL.

Five Filters -- its overarching project -- looks interesting, too:

Edward Herman and Noam Chomsky describe the media as businesses which sell a product (readers) to other businesses (advertisers). In their propaganda model of the media they point to five 'filters' which determine what we read in the newspapers and see on the television. These filters produce a very narrow view of the world that is in line with government policy and business interests.

In this project we try to encourage readers to explore the world of non-corporate online news, websites which avoid the five filters of the propaganda model. We also try to make these sources of news more accessible by allowing users to print the stories found on these alternative news sites in the format of a newspaper.

User script: add my delicious search results to Google

For years now, I've been collecting bookmarks at delicious.com/jm -- nearly 7000 of them by now. I've been scrupulous about tagging and describing each one, so they're eminently searchable, too. I've frequently found this to be a very useful personal reference resource.

I was quite pleased to come across the Delicious Search Results on Google Greasemonkey userscript, accordingly. It intercepts Google searches, adding Delicious tag-search results at the top of the search page, and works pretty well. Unfortunately though, that searches all of delicious, not specifically my own bookmarks.

So here's a quick hack fix to do just that:

my_delicious_search_results.user.js - My Delicious Search Results on Google

Shows tag-search results from my Delicious account on Google search pages, with links to more extensive Delicious searches. Use 'User Script Commands' -> 'Set Delicious Username' to specify your username.

Screenshot:

Enjoy!

Still using perl 5.6.x?

For the upcoming release of Apache SpamAssassin, we're considering dropping support for perl 5.6.x interpreters. Perl 5.6.0 is 9 years old, and the most recent maintenance release, 5.6.2, dates back to November 2003. The current 5.x release branch is 5.10, so this way we're still sticking with a "support the release branch before the current one" policy.
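
(As an aside, for module authors wondering how to enforce that sort of floor: perl can do it at compile time.)

    # a module or script can refuse ancient perls outright;
    # on 5.6.x this dies with a "Perl v5.8.0 required" error
    require 5.008;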

If you're still using one of the 5.6.x versions, or know of a (relatively recent) distro that does, please reply to highlight this....

IBM Ad Execs Who Should Be Fired

Watching television last night, I couldn't fail to take notice of this new IBM ad:

'For the first time in history, more people live in cities than anywhere else, which means cities have to get smarter.' [...] 'Paris has smart healthcare; smart traffic systems in Brisbane keep traffic moving; Galway has smart water'.

Jaw-dropping. That would be this Galway?

A major water crisis has left scores of people ill and tens of thousands at risk from contamination in a west of Ireland city. Galway's water supply has been hit by an outbreak of the parasite cryptosporidium, with up to 170 people now confirmed to have been affected by a serious stomach bug as a result. Tests found that the city's water supply contained nearly 60 times the safe limit of cryptosporidium pollution. Residents have already been unable to drink or use water for food preparation for weeks.

Residents in parts of Co. Galway have been hit by a new outbreak of the cryptosporidium parasite. Tests on the Roundstone Public Water Scheme showed trace elements of the parasite, as did water schemes for Inishnee and Errisbeg.

Council engineers in Galway have begun work on providing safe drinking water for up to 1,000 householders [...] where supplies have been contaminated by lead. The residents have been advised not to drink tap water until further notice.

Apparently the IBM ad is referring to something to do with tides and aquaculture in Galway Bay, rather than the worst sequence of water-quality disasters in Ireland for several decades. But really -- someone in IBM's marketing department should have done a little more research before using that line...

Mae’s OK!

Well, that was a really scary few days.

On Monday, the lovely C was nearly 2 weeks overdue, and was scheduled to come into the Rotunda for induction the next morning; then contractions started on Monday afternoon. We were happy, as avoiding induction was good news for a natural birth, allowing the process to be run through the excellent Domino scheme, etc.

So we went in, arriving at the Rotunda ER at 3.45 or so. They put on the CTG to monitor the baby's heartbeat, and the first 3 contractions were strong, but everything seemed OK. On the next one, however, the baby's heart rate dropped dramatically -- to a very low 40bpm; I called the ER nurses, they ran in and put C on oxygen, and that seemed to help, returning the rate to normal -- but on the next contraction the baby's heart rate dropped even further. Once that happened, the shit hit the fan. In seconds C was on a trolley heading for surgery. It was clear this was serious trouble.

I was left standing outside the theatre while she was operated on -- as an emergency Caesarean section there was no time for luxuries like hapless husbands stumbling around the background. Probably just as well. The midwives and surgical staff kept me as well informed as was possible, though.

After a terrifying 10 minutes, the prognosis improved a little. Initially they were worried that the baby had put pressure on the cord, but this was discounted -- in fact the baby had emptied its bowels of meconium in the womb, which irritated it enough to cause distress and crash its heart rate. After 10 minutes, the baby was out (and was a girl!), and C was going to be OK at least. However, the baby was at quite a lot of risk from aspiration of meconium, and possible brain damage due to reduced oxygen in the womb. Holy shit. :(

The baby had indeed aspirated some meconium, causing a collapsed lung. Over the next couple of days in an incubator in the neonatal intensive care unit, the little mite had surgery to introduce a chest tube into her pleura to re-inflate the lung, and was treated with a variety of treatments to deal with meconium in her stomach.

The best bit was this afternoon, when we got news that the results of her cranial ultrasound were in -- all clear, no brain damage. Then C got to feed her and hold her -- and she latched on like some kind of milk-seeking missile. What a little trooper.

Anyway, with any luck, 2 or 3 days from now they'll both be able to come home in one piece.

We were lucky btw -- if we hadn't been in the ER at the time, it was very unlikely that the prognosis would have been anywhere near as good. And I have to give credit to the Rotunda staff, they did a great job.

pics on Flickr!

Update, 7 June: C was released from hospital yesterday, and Mae got the all-clear this morning. We're now all back home, healthy and in one piece. Now we can just get on with the usual second-child excitement-slash-drama! phew!

Michael Woods saying “the Brits made us do it”

If you were listening to the Marian Finucane show on RTE Radio 1 last Saturday afternoon, you might have heard the mind-boggling stuff coming out of Michael Woods, the Fianna Fail former Education Minister with a "strong Catholic faith" who brokered the controversial backroom deal back in 2003 which allowed the Catholic Church and its institutions to evade prosecution on child abuse.

Here's a great thread on Politics.ie where quite a few folks boggle at the incredible things he said.

Thanks to Podcasting Ireland, I was able to track down and cut out this segment, so here is a recording of Michael Woods coming up with the pathetic excuse of how the British forced the Christian Brothers to abuse children:

Michael Woods - the brits made us do it.mp3 (951KB)

The last refuge of a cornered FFer -- blame the British. Absolutely incredible. It has to be heard to be believed. What century is this again?

Update: according to Mary Raftery in the Irish Times, this is a preview of the religious right's tactics:

'It is easy to discount former government minister and senior Fianna Fáil member Michael Woods. A former minister, he is no longer a prominent figure. He has, however, left a festering sore behind him which continues to weep poison every now and then. The infamous church-State deal on redress for victims of institutional child abuse, under which the religious orders pay a mere 10 per cent of the compensation bill, was at its most septic over the weekend.

Woods, the main architect of the deal, defended it on the television news and gave a long RTÉ radio interview on Saturday. We were beginning to hear some of the defences likely to be chosen by religious conservatives as soon as they manage to regroup and fight back.'

We marched in the streets about this stuff. It's like the 90's never happened.

New EC2 Features

Amazon Cloudwatch:

This is nifty. Monitor EC2 instances and load balancers: CPU, data transfer rates, disk usage, disk activity, HTTP/TCP request counts and latency, and "healthy/unhealthy" instances (see below). This data is exposed via web service APIs, and is also usable as input for their new "Auto Scaling" elastic scaling feature. Ideal for someone to write a Nagios plugin for. Also, I'm looking forward to some kick-ass sysadmin dataviz for this.
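
A plugin skeleton for that might look something like this -- a minimal sketch in Perl, with the CloudWatch fetch itself stubbed out and the thresholds made up:

#!/usr/bin/perl -w
# Nagios plugin skeleton: map a CloudWatch metric onto Nagios exit codes
# (0=OK, 1=WARNING, 2=CRITICAL). The metric fetch is a stub.
use strict;

my ($warn, $crit) = (70, 90);        # hypothetical CPU% thresholds
my $cpu = fetch_cloudwatch_cpu();    # stub -- wire this up to the API

if    ($cpu >= $crit) { print "CRITICAL: CPU at ${cpu}%\n"; exit 2; }
elsif ($cpu >= $warn) { print "WARNING: CPU at ${cpu}%\n";  exit 1; }
else                  { print "OK: CPU at ${cpu}%\n";       exit 0; }

sub fetch_cloudwatch_cpu { return 42 }   # placeholder value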

Auto-Scaling:

Elastically scale your grid of EC2 instances out (or in), based on Amazon CloudWatch metrics. An officially-supported take on what a myriad of third-party apps already offer. I expect to hear of people spending a fortune due to accidental misuse of this ;)

Elastic Load Balancing:

Load balance across multiple EC2 instances, report metrics to Cloudwatch such as requests/second and request latency, and -- most usefully of all in my opinion -- shift traffic away from EC2 instances that fail to respond to a "health-check" HTTP GET with a 200, or fail to accept a TCP connection.

In other words, this provides a way to do decent HA on EC2, something that's been much needed for a long time, and is quite tricky to set up using Linux-HA. I've done the latter, and found it full of potential reliability pitfalls; Elastic IP addresses were not useful for quickly failing over to backup servers -- in some cases it took about 5 minutes to fail over :( The only (relatively) snappy way to implement failover was to set up a dynamic DNS record with a short TTL, point to it using a CNAME, and use "ddclient" to switch it when failing over. And even that could leave sites down for as long as it takes the DNS client to time out the existing cached CNAME.
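
For illustration, the ddclient end of that hack can be driven by something as dumb as this -- a toy Perl monitor with hypothetical hostnames, not production-grade HA:

#!/usr/bin/perl -w
# Poll the primary's health-check URL; after 3 consecutive failures,
# force a dynamic-DNS update via ddclient (whose config decides which
# IP the short-TTL record now points at).
use strict;
use LWP::Simple qw(get);

my $check_url = "http://primary.example.com/health";
my $failures  = 0;

while (1) {
    if (defined get($check_url)) {
        $failures = 0;
    } elsif (++$failures >= 3) {
        system("ddclient", "-daemon=0", "-force");
        last;
    }
    sleep 10;
}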

Elastic Load Balancing supports HTTP or generic TCP connections. Unfortunately, it doesn't support "real" termination of HTTPS connections. (You can terminate them as generic TCP connections, though.)

More details on the RightScale blog, at the AWS dev blog, and on Werner Vogels' blog.

The Pay-No-Attention-To-Our-Tiny-Logo Party

In the current run-up to the local elections here in Ireland, it's pretty obvious that Fianna Fail, the ruling party who've screwed the economy with mismanagement and rampant cronyism, are in line for a massive drubbing. So much so, in fact, that their own candidates are attempting to hide their party affiliations.

Check out this poster for candidate Kenneth O'Flynn (son of FF TD Noel O'Flynn):

what logo, you ask? Look closer:

Compare that to what FF posters used to look like, 2 years ago:

Meath FF councillor Nick Killian has removed the logo from his leaflet's front page entirely, too.

Thanks to martinoc for the Bertie's Team poster, and Ivor in the comments of this post at On The Record for the photos of Kenny's posters. There's gold in those comments...

Spoon’s Rhubodka Recipe

Today on Twitter, the perennial rhubarb topic -- ie. what to do with all this rhubarb -- came up. Here's a recipe I picked up from a man called spoon which may help:

I've mentioned this before, but just in case.... Rhumember kids:

  • 1 empty 2 litre bottle
  • 4 or 5 sticks of pink rhubarb
  • 110g caster sugar
  • 1 litre of vodka

Cut the rhubarb into inch chunks and put them into the empty bottle until it is not empty any more and you have run out of rhubarb.

Add the sugar

Add the Vodka

Shake vigorously

Leave to stand in a dark corner for maybe 4 weeks or until you can't wait any longer. You should certainly wait until all the sugar has gone. The longer you leave it the more Rhubarby goodness will be pulled out by the sugar.

Strain all the rhubarb out.

CONGRATULATION. YOU HAVE UNLOCKED RHUBODKA.

DRINK THE RHUBODKA

It sounds awful, but instead of being that, it is fucking awesome.

I have a bottle of this stewing away on top of my kitchen cabinet. It should be ready just in time to toast the arrival of child #2 ;)

PS: "rhubodka" is a googlewhack!

Spirit of Ireland

Spirit of Ireland looks very nifty.

It's extremely simple -- a group of Irish 'entrepreneurs, engineers, academics, architects and legal and financial experts' are calling for Ireland to achieve energy independence and become a net exporter of green energy within five years, by building a number of wind farms on our western seaboard and buffering the generated energy in water reservoirs using pumped-storage hydroelectricity.

This kind of massive-scale public-works engineering project has a strong historical precedent in Ireland -- Ardnacrusha, opened in 1929, was the largest hydroelectric station in the world for a time. And given how Turlough Hill turned out, a pumped-storage facility can even be beautiful ;)

We can certainly do it, given sufficient government vision. I'd love to see it happen. Great stuff!

(image credit: CC-licensed image from Ganders on Flickr. thanks!)

Irish Examiner innumeracy

Here's a great example of numerical illiteracy spotted by my mate Tom:

some classic reporting in the Irish Examiner today...

"Department staff clocked up 20,000 sick days in the three years" is the headline. Closer examination of the article reveals there are 5,000 people in the department. Do the maths (which the paper doesn't - I wonder why) and that's a SHOCKING 1.3 sick days a year.

Even better is this quote: "Department of Agriculture staff clocked up 3,095 uncertified sick days last year - 653 of these on a Monday"

So that would be about a fifth of the sick days being taken on one of the five working days in the week. DISGRACE!
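
For the terminally pedantic, the arithmetic in Perl, using the article's own figures:

# 20,000 sick days, 3 years, 5,000 staff
printf "%.2f sick days per person per year\n", 20_000 / 3 / 5_000;    # => 1.33
# 653 of 3,095 uncertified sick days fell on a Monday
printf "%.0f%% of sick days fell on Mondays\n", 100 * 653 / 3_095;    # => 21%, vs. the 20% you'd expect by chance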

Let's hear it for old media's commitment to quality journalism!

Reminder: Irish computing history talk next Monday

Don't forget -- next Monday, the Heritage Society of Engineers Ireland, in association with The Irish Computer Society, and the ICT and Electronic and Electrical Divisions of Engineers Ireland, will be hosting an evening lecture entitled "Reminiscences of Early days of Computing in Ireland", by Gordon Clarke (M.A., CEng., F.B.C.S., C.I.T.P., F.I.C.S). Sounds like it'll be great. More details.

Update: it starts at 8pm; useful info! Also, the event's flyer can be found on this page, which notes:

For those new to using our webcast facility, please see www.engineersireland.ie/webcast for information on how to set-up and access our webcasts. To view the event, please log onto the url below: https://engineersireland.webex.com/engineersireland/onstage/g.php?t=a&d=841959965 The password: computer

Linux per-process I/O performance: measuring the wrong thing

A while back, I linkblogged about "iotop", a very useful top-like UNIX utility that shows which processes are consuming the most I/O bandwidth.

Teodor Milkov left a comment which is well worth noting, though:

Definitely iotop is a step in the right direction.

Unfortunately it's still hard to tell who's wasting most disk IO in too many situations.

Suppose you have two processes - dd and mysqld.

dd is doing massive linear IO and its throughput is 10MB/s. Let's say dd reads from a slow USB drive and it's limited to 10MB/s because of the slow reads from the USB.

At the same time MySQL is doing a lot of very small but random IO. A modern SATA 7200 rpm disk drive is only capable of about 90 IO operations per second (IOPS).

So ultimately most of the disk time would be occupied by the mysqld. Still iotop would show dd as the bigger IO user.

He goes into more detail on his blog. Fundamentally, iotop is based on what the Linux kernel offers for per-process I/O accounting, which measures I/O bandwidth (bytes per second), not I/O operations per second -- and most contemporary storage in desktops and low-end server equipment is IOPS-bound ('A modern 7200 rpm SATA drive is only capable of about 90 IOPS'). Good point! Here's hoping a future change to the Linux per-process I/O API allows measurement of IOPS as well...
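
In the meantime, if you want to see exactly what iotop has to work with, it's all in /proc/<pid>/io on kernels with per-task I/O accounting enabled -- byte counters and read/write syscall counts, but nothing resembling seeks or physical IOPS. A quick sketch:

#!/usr/bin/perl -w
# Dump the kernel's per-process I/O counters from /proc/<pid>/io.
use strict;

my $pid = shift || $$;   # default to this process
open my $fh, "<", "/proc/$pid/io" or die "cannot read /proc/$pid/io: $!";
my %io = map { /^(\w+): (\d+)/ ? ($1 => $2) : () } <$fh>;
close $fh;

printf "pid %s: %d bytes read, %d bytes written, %d read syscalls, %d write syscalls\n",
    $pid, $io{read_bytes}, $io{write_bytes}, $io{syscr}, $io{syscw};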

Big table desking

We have an extremely open-plan layout in work -- no partitions, just long benches of keyboards and monitors. It looks a bit like this, but with less designer furniture and more Office Depot:

Aman pointed out that this is a new trend in workplace design, which Workalicious calls "Big Table Desking":

I'm still not sure what to make of the frequent instances of Big Table Desking. While this kind of workstation arrangement is no doubt a new trend, the no-privacy work place is a throwback to the 1950s office pool, a line up of identical desks classroom style. Is it the peer to peer seating position that overcomes this? How would it? By building community? As opposed the pilot and passenger 747, catholic church model of everybody facing "forward". Does the Big Table Desk break down this heirarchy by facing people towards one another, sharing a big desk instead of staking out territory? Is the big table desk a microcosm, a representation of a healthy organizational structure?

No comment ;)

It seems to be popular with designers, presumably due to their collaborative working needs.

Mind you, it also looks a bit like a Taylorist workplace layout from 1904, of which Wired says:

American engineer Frederick Taylor was obsessed with efficiency and oversight and is credited as one of the first people to actually design an office space. Taylor crowded workers together in a completely open environment while bosses looked on from private offices, much like on a factory floor.

YouBloom plug

Last week I got a very nice mail looking to plug a new music site:

'I'm not sure if this would interest you at all but wanted to pass on the link to a new website called YouBloom.

It's a new social networking and e-commerce website set up with independent artists in mind - to help them to make real money (unlike MySpace etc which just make money from the artists)! It was set up by Irish Musician Phil Harrington and is backed by Sir Bob Geldof.

Admittedly I am involved with the website. I have been helping bring artists on site for the last few months, since I was introduced to the concept by a friend, but would love for you to take a look at the site anyway - even if it turns out to be of no interest to you.'

I normally wouldn't post these, but I'm a sucker for flattery ;) and the sender had taken the time to read my blog a little. It also looks like the site allows bands to offer free MP3 downloads of their tunes, which IMO is a key factor for bands trying to get promotion.

UPC.ie’s new Channel 4 frequency for MythTV

So, after spending an hour or two attempting to figure out where the hell UPC had moved Channel 4 to, I eventually found out that it was now being broadcast on 543 MHz. I also found out that this wasn't part of the standard list of A1 to A30 channels in the "pal-ireland" range. :(

Thankfully, I then found this Frequency to MythTV channel converter page; here are the correct values to use on the MythWeb channels page:

  • Freqid = 30
  • Finetune = -4

“you are, in fact, in the message queue business”

Oh man, this Twitter Ruby-vs-Scala language spat is hilarious; talk about handbags at dawn. I loved this exchange in the comments to this post in particular:

BJ Clark:

I'm mostly surprised that a guy who wrote the book on Scala comes out and says that Scala is better than everything else and someone actually listened and took him seriously. He has a vested interest in saying that Scala is the next big thing and I've yet to see any evidence that Kestrel is better (at anything) than RabbitMQ.

And frankly, I still get fail whales at Twitter on a daily basis, so, what exactly are they so proud about over there?

Steve Jenson:

Kestrel pages queues to disk: if you get more messages than you have memory, it's fine. If RabbitMQ gets more messages than memory, it crashes. We talked to them extensively about this problem and they're going to address it. We were hoping we'd be able to use RabbitMQ or another message queue. We didn't want to be in the message queue business. At this point, given that we know the code and it's performance inside and out, it makes sense to continue using and developing it.

BJ Clark:

I don't feel like arguing with you but your logic isn't clear to me. It would make sense that if you don't want to be in the message queue business, you'd submit patches against an established message queue to make it work in your situation instead of writing your own message queue, twice. This is overlooking the fact that twitter is basically a massive message queue and you are, in fact, in the message queue business.

Zing!

URL shortening services: my experience

A good post from Joshua Schachter about URL shortening services.

For what it's worth, I ran into the unwanted-interstitial risk. At one stage, before I'd bothered registering jmason.org, sitescooper.taint.org or my other domains, I used a URL-shortening service to provide a memorable, short URL for an open-source application I wrote -- http://zap.to/snarfnews/.

At some point a few years down the line, the forwarding process started accreting ads; eventually they became soft-porn in content, and I was forced to apologise to users for the forwarding I could no longer control!

By now, 10 years down the line, it seems to hijack the page entirely, returning a page in Cyrillic I can't even read :( (apparently it's a page of Flash games; thanks, Alexandr Ciornii, for the interpretation!)

Anyway, lesson learned.

“Report Says Deal”

Twitter has this "Trending Topics" sidebar now, which lists the following topics:

  • TGIF
  • National Cleavage
  • G20
  • Easter
  • #grammarsongs
  • France
  • #rp09
  • French
  • Grand National
  • Report Says Deal

Now, I'm not going to go into the topic of National Cleavage right now. 'Report Says Deal' is intriguing because it makes no sense, until you click through to see:

Real-time results for "Report Says Deal"

  1. dlloydsecret: Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb half a minute ago from twitterfeed
  2. dlloydthemlmpro: Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb 1 minute ago from twitterfeed
  3. techupdates: [PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont 3 minutes ago from twitterfeed
  4. icidade: Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9 4 minutes ago from TweetDeck
  5. chrisgraves: Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works - PC World http://bitly.com/LhT4 6 minutes ago from twhirl

So I'd say that Twitter's "Trending Topics" uses N-grams of between 1 and 3 "words" for topic identification. In this case, rather than "Report Says Deal", a better topic string would be something like:

Google to Buy Twitter? Report Says Deal is in the Works - PC World

or even:

Google to Buy Twitter? Report Says Deal is in the Works - PC World http://bitly.com/LhT4

Funnily enough, this is exactly the issue I ran into while developing this algorithm. The trick at this point is to apply a variant of the BLAST pattern-discovery algorithm: expand the patterns sideways, while they still match the same subsets of the corpus, until they're maximal.

Twitter folks, if you can read Perl, "assemble_regexps()" in seek-phrases-in-log in SpamAssassin SVN does this pretty nicely, and reasonably efficiently, and is licensed under the ASL 2.0. ;)
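
For everyone else, here's a toy illustration of the idea -- emphatically not the SpamAssassin code, just a minimal sketch of growing a seed phrase sideways while it still matches the same subset of the corpus:

#!/usr/bin/perl -w
# Grow a seed phrase one word at a time, leftwards and rightwards, for as
# long as the expanded phrase matches exactly the same texts as the seed.
use strict;

sub matching_subset {
    my ($phrase, $corpus) = @_;
    return join ",", grep { index($corpus->[$_], $phrase) >= 0 } 0 .. $#$corpus;
}

sub expand_maximal {
    my ($phrase, $corpus) = @_;
    my $subset = matching_subset($phrase, $corpus);
    my $grew = 1;

    while ($grew) {
        $grew = 0;
        TEXT: for my $text (@$corpus) {
            my @cands;
            push @cands, "$phrase$1" if $text =~ /\Q$phrase\E(\s+\S+)/;
            push @cands, "$1$phrase" if $text =~ /(\S+\s+)\Q$phrase\E/;
            for my $cand (@cands) {
                if (matching_subset($cand, $corpus) eq $subset) {
                    ($phrase, $grew) = ($cand, 1);
                    last TEXT;   # re-derive candidates from the new phrase
                }
            }
        }
    }
    return $phrase;
}

my @tweets = (
    "dlloydsecret Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "techupdates [PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
);
print expand_maximal("Report Says Deal", \@tweets), "\n";
# => "Google to Buy Twitter? Report Says Deal is in the Works"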

OSSBarCamp this weekend

It's two days until OSSBarCamp, a free open-source-focussed Bar Camp unconference at Kevin Street DIT, this Saturday. I'm looking forward to it -- although unfortunately I missed the boat on giving a talk. (Unlike the traditional Bar Camp model, this is using a pre-booked talk system.)

Particularly if you're working with open source in Ireland, you should come along!

I have high hopes for John Looney's discussion of cloud computing and how it interacts with open source. Let's hope he's not too Google-biased in his definition of "cloud computing". ;)

Also of interest -- Fintan Boyle's "An Introduction To Developing With Flex". To be honest, I hadn't even realised that Adobe Flex was now open source. cool.

Talk: Early days of Computing in Ireland

On Monday April 20th, the Heritage Society of Engineers Ireland, in association with The Irish Computer Society, and the ICT and Electronic and Electrical Divisions of Engineers Ireland, will be hosting an evening lecture: 'Reminiscences of Early days of Computing in Ireland':

In 1957 the Irish Sugar Company installed the first stored program computer in Ireland. Other large organisations slowly followed suit.

Gordon Clarke will discuss how the early computers enhanced the electro-mechanical systems that had developed over the previous 60 years. He will talk about their specifications, a few of the first applications and tell the story of the very early years of designing and developing computer based systems.

All Welcome. Admission Free. No booking required. This event will be web-cast

For Details: www.engineersireland.ie, or Con Kehely: (01) 6860113 (con.kehely /at/ dublincity.ie)

Location: Engineers Ireland, 22 Clyde Road D4

Sounds great! Thanks to Frank Duignan on the ILUG list for forwarding the notice.

4chan Memes, circa 1889

In the comments to this unremarkable story about 4chan's Boxxy fad, I came across this gem from CSClark:

I don't know why I didn't think to see if this sort of phenomenon was covered in Extraordinary Popular Delusions... Of course, it is.

Walk where we will, we cannot help hearing from every side a phrase repeated with delight, and received with laughter, by men with hard hands and dirty faces, by saucy butcher lads and errand-boys, by loose women, by hackney coachmen, cabriolet-drivers, and idle fellows who loiter at the corners of streets. Not one utters this phrase without producing a laugh from all within hearing. It seems applicable to every circumstance, and is the universal answer to every question; in short, it is the favourite slang phrase of the day, a phrase that, while its brief season of popularity lasts, throws a dash of fun and frolicsomeness over the existence of squalid poverty and ill-requited labour, and gives them reason to laugh as well as their more fortunate fellows in a higher stage of society.

Wherein we also learn that the FAIL of the day was Quoz:

When a disputant was desirous of throwing a doubt upon the veracity of his opponent, and getting summarily rid of an argument which he could not overturn, he uttered the word Quoz, with a contemptuous curl of his lip, and an impatient shrug of his shoulders. The universal monosyllable conveyed all his meaning, and not only told his opponent that he lied, but that he erred egregiously if he thought that any one was such a nincompoop as to believe him.

I'm also sure I've read of a fad - Greek, Roman, 18th century, something like that - where a group of young (aristocratic?) men who would suddenly grab a common woman and proclaim her Helen and make her their queen and swear to die for her and so on. And the tearing down of such idols could be seen, if you were wont to be pretentious like me, as part of Frazer's Golden Bough's Sacrificial King idea, although I'm not sure script kiddies care if the crops grow. (One other problem with that is that Frazer was romancing; but so are the more literal memecists, so yah!)

Since then however, it appears that "quoz" has entirely flipped meaning, according to UrbanDictionary:

slang for quality, a cockney term for something good. usually accompanied with a hand action of slaping ur index finger against the stationary thumb and middle finger. 'thats quoz man! propa quoz.' finger slappy hand thingy

“Fundamentally flawed”

Killer presentation -- "RPC And Its Offspring: Convenient, Yet Fundamentally Flawed" from Steve Vinoski, who presented it at QCon London last week. It's full of reminders of the mid-90's, hacking away on CORBA technology -- Steve was one of the key players at Iona while I was there.

But never mind where we've been; let me hit you with the summary slide to show where Steve's going:

  • RPC is a convenient but flawed accident of history

    • 1980s research focused on monoliths of programming languages, distributed applications, and operating systems
    • each computer vendor of the time owned their own full stack, from language to hardware and network, and you used what they gave you
    • imperative languages won back then simply because of their superior performance at that time
  • It’s almost 2010, folks — we can do WAY better

    • pull your head from the imperative language sand and learn functional programming
    • the world is many-core and highly distributed, and the old ways aren’t going to keep working much longer

Awesome ;)

A plug for Kiva.org

I just made a loan using Kiva.org to a weaver in Nepal and a group of Vietnamese broom makers.

You can go to Kiva's website and lend to someone in the developing world who needs a loan for their business. Each loan has a picture of the entrepreneur, a description of their business and how they plan to use the loan so you know exactly how your money is being spent -- and you get updates letting you know how the entrepreneur is getting on.

The best part is, when the entrepreneur pays back their loan you get your money back - and Kiva's loans are managed by microfinance institutions on the ground who have a lot of experience doing this, so you can trust that your money is being handled responsibly.

Kiva's microfinancing seems like a nice way of helping the developing world, and I've heard good things about it. Here's hoping it works out well for my two recipients!

Google Reader productivity hack: change your Home

So, if you use Google Reader, read your news with the "All items" page, and are subscribed to hundreds of feeds, it can be pretty overwhelming. I've found a better way to deal with this.

Select a 'most important' subset of feeds. For each of those, click through to the feed details page, hit the "Feed Settings..." menu, and select "Change folders...". Put the feed into a new "top" folder (creating it if necessary).

Now go to "Settings" -> "Preferences" and check out the "Start page" preference. By default, it's set to "Home"; change it to "Folders and Tags: top".

Hey presto -- now, when you load Google Reader, it'll come up with your "top" items. You can get through those quickly enough, and get on to other more important tasks. When you're bored and need something to read, though, just hit "Navigation" -> "All items" (or even just type 'ga'), and every other feed is now there for your delectation. Sweet!

Ready for the blackout?

Reminder -- Ireland's Blackout Week starts tomorrow:

Take part in Blackout Week

  1. To demonstrate your feelings about [IRMA's censorship demands], you can make your avatar black on any websites you have a presence on.
  2. This is inspired by Creative Freedom New Zealand's blackout campaign.
  3. From Black Thursday on the 5th of March, for one week, set your picture on sites like Facebook, Bebo, Twitter, MSN, etc black to raise awareness for Blackout Ireland.
  4. On that Thursday we encourage you to express yourself publicly about this issue, whether by blog posts, letters to newspapers or any form of communication you can think of.

Using VC to track system config changes by mail

Here's a great idea from a thread on the SpamAssassin users list, from Roger Marquis:

Karsten Bräckelmann [questioning the utility of a mechanism to dump the entire contents of the SpamAssassin configuration database]:

'postconf' without the handy -n switch dumps about 500 lines. The equivalent dump for SA including the rules is about 6000 lines. And that's a plain dump, without following and unfolding meta rules or anything.

Whether 6K or 60K would not necessarily make a difference to how I would like to use an SA 'postconf -n' equivalent. That use is change management. The intent is not in the full report itself but in its deltas.

As full time mail/systems admins we get invaluable data from tripwire/integrit, 'postconf -n', dconf, 'rpm -qa', 'dpkg -l *', 'pkg_info -a', ... whose output is checked in to RCS daily. This provides a nice configuration snapshot and historical record but its real usefulness comes from rcsdiff piped into a daily report. These are (usually) relatively concise, and IMO, absolutely essential for monitoring production Unix/Linux systems.

I like it! I think I'd check it into a git repo, though. The concept of applying VC smarts to traditional sysadmin tasks is definitely a meme on the way up -- see also etckeeper.
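
Something along these lines, say -- a rough sketch of the daily cron job, git flavour, with hypothetical paths and commands:

#!/usr/bin/perl -w
# Snapshot interesting config/command output into a git repo and mail
# the delta as a daily change report. Adjust commands and paths to taste.
use strict;

my $repo = "/var/lib/config-snapshots";
chdir $repo or die "chdir $repo: $!";

my %snapshots = (
    "postconf.txt" => "postconf -n",
    "packages.txt" => "dpkg -l",
);
for my $file (keys %snapshots) {
    system("$snapshots{$file} > $file 2>/dev/null");
}

system("git add -A");
# commit only if something actually changed, then mail the diff
if (system("git diff --cached --quiet") != 0) {
    system("git", "commit", "-q", "-m", "daily config snapshot");
    system("git show --stat HEAD | mail -s 'config changes' root");
}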

Blackout Ireland – a response to IRMA’s censorship demands

As Adrian noted last week, IRMA are demanding that Eircom block the Pirate Bay -- first on a list of websites they don't like -- on pain of being sued. On top of that, they intend for the other Irish ISPs to follow suit -- here's a key line from the letter they sent to Blacknight MD Michele Neylon:

in the event of a positive response to this letter it is proposed to make practical arrangements with Blacknight of a like nature to those made with eircom.

If that comes to pass, this will be an appalling situation for Irish internet users, and we need to act to ensure it doesn't happen. Digital Rights Ireland:

The net effect of this scheme, if it is allowed to go into effect, will be to impose an internet death penalty on two groups. On users, who will be cut off on the allegation of a private body, with no court involvement, and on websites, which could be blocked to Irish users based on a court hearing where only one side is heard.

Pace Mulley:

So first they’ll start with the Pirate Bay. Then comes Mininova, IsoHunt, then comes YouTube (they have dodgy stuff, right?), how long before we have Boards.ie because someone quoted a newspaper article or a section of a book?

Digital Rights Ireland have posted an excellent document detailing the following plan of action for Irish internet users concerned about this:

  • Contact your ISP and let them know that this is a key issue for you, as their customer.

  • Join up with your fellow netizens. Subscribe to the Blackout Ireland blog. Follow the #blackoutirl hashtag on Twitter. Join the Blackout Ireland Facebook group. It looks likely that there'll be a week-long blackout campaign starting next Thursday, March 5th.

  • Contact politicians. This is likely to cause irreparable damage to the Irish internet, so our pols should be very worried. See the DRI post for details on getting in touch with Minister for Communications Eamonn Ryan.

New Zealand is running their own blackout campaign right now, so that may help our planning.

International readers -- make no mistake, you're next. IRMA in this case is acting as the local delegate of IFPI, which stated in 2007 that this was one of the 3 technical options for ISPs to control piracy.

Here's some other interesting coverage:

Fantastic interview with BitBuzz CEO Alex French:

If ISPs, including Eircom, agree not to oppose blocking access to The Pirate Bay and other similar websites, is this not an agreement to web censorship? “I don’t think there is any other way to interpret it,” said French.

“They are essentially agreeing to censor certain websites at the behest of the recording industry, without these websites ever having necessarily shown to be illegal in the Republic of Ireland. I would have a huge concern over what other websites may be blocked and what other industries will pile in now that the precedent has been set.”

Some sample letters:

And further discussion -- here's a massive boards.ie discussion thread, now closed in favour of this newer thread.

Update: here's the letter I sent to the Minister, if you're curious or need inspiration.

Ubuntu to bundle Eucalyptus

Introducing Karmic Koala, Ubuntu 9.10:

What if you want to build an EC2-style cloud of your own? Of all the trees in the wood, a Koala's favourite leaf is Eucalyptus. The Eucalyptus project, from UCSB, enables you to create an EC2-style cloud in your own data center, on your own hardware. It's no coincidence that Eucalyptus has just been uploaded to universe and will be part of Jaunty - during the Karmic cycle we expect to make those clouds dance, with dynamically growing and shrinking resource allocations depending on your needs.

A savvy Koala knows that the best way to conserve energy is to go to sleep, and these days even servers can suspend and resume, so imagine if we could make it possible to build a cloud computing facility that drops its energy use virtually to zero by napping in the midday heat, and waking up when there's work to be done. No need to drink at the energy fountain when there's nothing going on. If we get all of this right, our Koala will help take the edge off the bear market.

AWESOME -- exactly where the Linux server needs to go. Eucalyptus is the future of server farms. Really looking forward to this...

Blimey, I won

Somehow or other, I seem to have won the 2009 Irish Blog Award for Best Technology Blog/Blogger! To be honest, for the last year I haven't been spending as much time on the blog as before, due mainly to a rather compelling distraction, so I'm doubly grateful for winning.

Unfortunately, I was out of the country, at Nishad and Janet's wedding, so missed my chance to get up on stage and thank my fellow bloggers in person -- but I asked John to do so instead. Seems he in turn got stage fright and delegated to his missus, who picked up the trophy. Thanks Fiona! That's probably just as well, since I'm pretty incoherent in that kind of situation myself.

Cheers to my fellow nominees, Eoghan, Robin, Michele and Pat. One of you guys should totally have won ;)

And last of all -- cheers to BitBuzz for sponsoring the category, and Mulley for the whole bash. I definitely have to turn up next year!

Now I need to put more time in this year to really earn that award...

Plenty of money for Dublin’s bikes

So it seems that JC Decaux have been complaining about the costs of running the Velib scheme in Paris:

Since the scheme's launch, nearly all the original bicycles have been replaced at a cost of 400 euros each.

Of course, this won't be a problem in Dublin. Going by Newstalk's estimate of what the advertising space handed to JC Decaux for free -- in exchange for the (as yet nonexistent) 450 bikes -- would have cost, each bike comes at a public cost of 111,000 Euros. That should cover a lot of "velib extreme".

(OK, that may be overestimating it. The Irish Times puts a more sober figure on it of EUR 1m per year; that works out at roughly EUR 2,200 per bike per year. Still should cover a few broken bikes.)

A quick reminder:

Paris                                        Dublin
20,000 bikes                                 450 promised
~1,600 billboards                            ~120 installed
~12.5 bikes per billboard                    ~3.8 bikes per billboard
10km range (15e to 19e arrondissement)       4km range (Mater Hospital to Grand Canal)

And, of course, there's no sign of the bikes here yet... assuming they ever arrive. Heck of a job, Dublin City Council.

BTW, here's the rate card for advertising on the "Metropole" ad platforms, if you're curious, via the charmingly-titled Go Ask Me Bollix.

Fixing the Gmail Tasks window bug

Hey Gmail users! If you're using Tasks, there's a slightly annoying bug in Gmail right now -- you may see the "Use this link to open Tasks" tip window appear every time you access the inbox page.

Several other people have reported it, and apparently the Google guys are 'working to resolve it' at the moment. In the meantime, though, here's a way to work around the issue without losing Tasks (you will lose the Offline Gmail functionality, unfortunately). Simply disable Offline Gmail (Settings -> Offline -> "Disable Offline Gmail for this computer"), and the bug no longer manifests itself.

You can allow Gmail to keep the stored mail on your computer if you like, which will be handy for when the bug is fixed and Offline can be re-enabled -- hopefully sooner rather than later.

Continuous deployment

This is awesome, if a little insane. Continuous Deployment at IMVU: Doing the impossible fifty times a day:

Continuous Deployment means running all your tests, all the time. That means tests must be reliable. We’ve made a science out of debugging and fixing intermittently failing tests. When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day. It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.

OK, so far, so sensible. But this is where it gets really hairy:

Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handfull of shell scripts.

Mental. So what we've got here is:

  • phased rollout: automated gradual publishing of a new version to small subsets of the grid.

  • stats-driven: rollout/rollback is controlled by statistical analysis of error rates, again on an automated basis.

Worth noting some stuff from the comments. MySQL schema changes break this system:

Schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires one-by-one applying it to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master’s standby and failing over. It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that’s a bottleneck to agility. If I started this process today, I’d probably invest some time in testing the limits of distributed key value stores which in theory don’t have any expensive manual processes.

They use an interesting two-phased approach to publishing of the deploy file tree:

We have a fixed queue of 5 copies of the website on each frontend. We rsync with the “next” one and then when every frontend is rsync’d we go back through them all and flip a symlink over.
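
That symlink flip is worth spelling out, since rename() over the old link is what makes the cutover atomic. A sketch with hypothetical paths:

#!/usr/bin/perl -w
# Five-slot deploy: rsync the build into the "next" slot, then flip the
# "current" symlink. rename(2) is atomic, so clients never see a gap.
use strict;

my $slots = 5;
my ($cur) = (readlink("/var/www/current") || "slot0") =~ /slot(\d+)/;
my $next  = ($cur + 1) % $slots;

system("rsync", "-a", "--delete", "build/", "/var/www/slot$next/") == 0
    or die "rsync failed";

symlink("/var/www/slot$next", "/var/www/current.tmp") or die "symlink: $!";
rename("/var/www/current.tmp", "/var/www/current")    or die "rename: $!";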

All in all, this is very intriguing stuff, and way ahead of most sites. Cool!

(thanks to Chris for the link)

Config management as cookery

Interesting to see Chef, a configuration management framework using cooking as a metaphor.

Back in the early '90s in Iona, I wrote a user/group synchronization tool called "greenpages" which used a cooking metaphor; "spice" (data) was added to "raw" (template) files to produce "cooked" output. Great minds, eh!

IR book recommendation

Thanks to Pierce for pointing me at this review of an interesting-sounding book called Introduction to Information Retrieval. The book sounds quite useful, but I wanted to pick out a particularly noteworthy quote, on compression:

One benefit of compression is immediately clear. We need less disk space.

There are two more subtle benefits of compression. The first is increased use of caching ... With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query ... we instead access its postings list in memory and decompress it ... Increased speed owing to caching -- rather than decreased space requirements -- is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory ... We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

This is something I've been thinking about recently -- we're getting to the stage where CPU speed has so far outstripped disk I/O speed and network bandwidth that pervasive compression may be worthwhile. It's simply worth keeping data compressed for longer, since CPU is cheap. There's certainly little point in not compressing data travelling over the internet, anyway.
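
A quick Compress::Zlib illustration of why it's cheap to try -- a sketch, not a benchmark, and the ratio will vary wildly with the data:

#!/usr/bin/perl -w
# Compress a stand-in "postings list" in memory and round-trip it back.
use strict;
use Compress::Zlib;

my $postings = join " ", (1 .. 100_000);   # stand-in for a postings list
my $packed   = compress($postings);

printf "raw: %d bytes, compressed: %d bytes (%.1f%% of original)\n",
    length($postings), length($packed),
    100 * length($packed) / length($postings);

# the cached-and-compressed read path: decompress from RAM, no disk seek
my $unpacked = uncompress($packed);
die "roundtrip failed" unless $unpacked eq $postings;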

On other topics, it looks equally insightful; the quoted paragraphs on Naive Bayes and feature-selection algorithms both cover things I learned myself "in the field", so to speak, working on classifiers. I really should have read this book years ago, I think ;)

The entire book is online here, in PDF and HTML. One to read in that copious free time...

Good reasons to host inelastically on EC2

Recently, there's been a bit of discussion online about whether or not it makes sense for companies to host server infrastructure at Amazon EC2, or on traditional colo infrastructure. Generally, these discussions have focussed on one main selling point of EC2: its elasticity, the ability to horizontally scale the number of server instances at a moment's notice.

If you're in a position to gain from elasticity, that's great. But it's still worth noting that even if you aren't, there's another good reason to host at an EC2-like cloud: when you want to deploy another copy of the app, either from a different version-control branch (dev vs staging vs production deployments), or to run separate apps with customizations for different customers. These aren't cases of scaling an existing app up; they're creating new copies of the app, and EC2 works nicely for this.

If you can deploy a set of servers with one click from a source code branch, this is entirely viable and quite useful.
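
As a sketch of what "one click" can look like -- assuming the Net::Amazon::EC2 module from CPAN, and an AMI pre-baked from the branch in question (the AMI ID and key name here are made up):

#!/usr/bin/perl -w
# Launch a fresh copy of the app's server set from a branch-specific AMI.
use strict;
use Net::Amazon::EC2;

my $ec2 = Net::Amazon::EC2->new(
    AWSAccessKeyId  => $ENV{AWS_ACCESS_KEY_ID},
    SecretAccessKey => $ENV{AWS_SECRET_ACCESS_KEY},
);

my $reservation = $ec2->run_instances(
    ImageId  => "ami-12345678",   # hypothetical; baked from the branch
    MinCount => 3,
    MaxCount => 3,
    KeyName  => "deploy-key",
);

print "launched: ",
    join(" ", map { $_->instance_id } @{ $reservation->instances_set }), "\n";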

Another reason: EC2-to-S3 traffic is extremely fast and cheap compared to external-to-S3. So if you're hosting your data on S3, EC2 is a great way to crunch on it efficiently. Update: Walter observed this too on the backend for his Twitter Mosaic service.

Ice Cycling

I seem to have invented a new extreme sport on the way into work: Ice Cycling. The roads were like an ice-skating rink. Scary stuff :(

Here's some advice for anyone in the same boat:

  • use a high gear: avoid low gears if possible, even when starting off. Lower pedalling revs mean smoother power delivery, so you're more likely to keep traction.

  • try to avoid turns: keep the bike as upright as possible.

  • try to avoid braking: braking is very likely to start a skid in icy conditions.

  • use busy roads: ride where the cars have been, since their traffic will have melted the ice.

  • ride away from the gutters: they're more likely to be iced over than the centre of a lane. Again, ride where the cars have been.

  • avoid road markings: it seems these were much icier than the other parts of the road; possibly because their high albedo meant the ice on them hadn't been melted by the sun yet. So look out for that.

Here's a good thread on cyclechat.co.uk, and don't miss icebike.org: 'Whether commuting to work, or just out for a romp in the woods, you arrive feeling very alive, refreshed, and surrounded with the aura of a cycling god. You will be looked upon with the smile of respect by friends and co-workers. - - - Or was that the sneer of derision...no matter, ICEBIKING is a blast!' o-kay.

Their recommendations are pretty sane, though. ;)

UK’s proposed anti-filesharing quango

Wow. The IFPI's strategy of "divide and conquer" -- taking individual ISPs to court to force them to institute a 3-strikes policy, as successfully deployed against Eircom this week -- is possibly marginally better than this insane obsolete-business-model handout proposed by the UK government in their Digital Britain report:

Lord Carter of Barnes, the Communications Minister, will propose the creation of a quango, paid for by a charge that could amount to £20 a year per broadband connection.

The agency would act as a broker between music and film companies and internet service providers (ISPs). It would provide data about serial copyright-breakers to music and film companies if they obtained a court order. It would be paid for by a levy on ISPs, who inevitably would pass the cost on to consumers.

Jeremy Hunt, the Shadow Culture Secretary, said: “A new quango and additional taxes seem a bizarre way to stimulate investment in the digital economy. We have a communications regulator; why, when times are tough, should business have to fund another one?”

Well said. An incredibly bad idea.

By the way, I've noticed some misconceptions about the Eircom settlement. Telcos selling Eircom bitstream DSL (ie. the 2MB or 3MB DSL packages) are immune right now.

They are, however, next on the music industry's hit-list, reportedly...

Eircom forced to implement “3 strikes and you’re out” for filesharers

Eircom has been forced to implement "3 strikes and you're out", according to Adrian Weckler:

If the music labels come to it with IP addresses that they have identified as illegal file-sharers, Eircom will, in its own words:

"1) inform its broadband subscribers that the subscribers IP address has been detected infringing copyright and

"2) warn the subscriber that unless the infringement ceases the subscriber will be disconnected and

"3) in default of compliance by the subscriber with the warning it will disconnect the subscriber."

My thoughts -- it's technically better than installing Audible Magic appliances to filter all outbound and inbound traffic, at least.

However, there's no indication of the degree to which Eircom will verify the "proof" provided by the music labels, or that there's any penalty for the labels when they accuse your laser printer of filesharing. I foresee a lot of false positives.

Update: LINX reports that the investigative company used will be Dtecnet, a 'company that identifies copyright infringers by participating in P2P file-sharing networks'. TorrentFreak says:

DtecNet [...] stems from the anti-piracy lobby group Antipiratgruppen, which represents the music and movie industry in Denmark. There are more direct ties to the music industry though. Kristian Lakkegaard, one of DtecNet’s employees, used to work for the RIAA’s global partner, IFPI. [...]

Just like most (if not all) anti-piracy outfits, they simply work from a list of titles their client wishes to protect and then hunts through known file-sharing networks to find them, in order to track the IP addresses of alleged infringers.

Their software appears as a normal client in, for example, BitTorrent swarms, while collecting IP addresses, file names and the unique hash values associated with the files. All this information is filtered in order to present the allegations to the appropriate ISP, in order that they can send off a letter admonishing their own customer, in line with their commitments under the MoU.

[...] it will be a big surprise if [Dtecnet's evidence is] of a greater ‘quality’ than the data provided by MediaSentry.

More coverage of the issues raised by the RIAA's international lobbying for the 3-strikes penalty:

Switched to Magnet

I've switched my home broadband from Eircom's 3Mbps all-in-one package to Magnet's 10Mbps LLU package. It's about a tenner a month cheaper, and significantly faster of course.

The modem arrived last Friday, about 2 weeks after ordering; that night, when I went to check my mail, I noticed that the DSL had gone down, and indeed so had the phone. I was dreading a weekend without the interwebs, it being 9pm on Friday night -- but lo, when I plugged in the Magnet router, it all came up perfectly first time!

Great instructions too. Extremely readable and quite comprehensible for a reasonably non-techie person, I'd reckon. So far, they've provided great service, too.

I'm not actually getting the full 10Mbps, unfortunately; it's RADSL, and I'm only getting 5Mbps when I test it. Just as well I didn't pay the extra tenner to get their 24Mbps package. Still, that's a hell of a lot faster than the sub-1Mbps speeds I've been getting from Eircom.

It's hard to notice an effective difference when browsing though, as that kind of traffic is dominated by latency effects rather than throughput.

I haven't even tried their "PCTV" digital TV system; it seems a bit pointless really, I have a networked PVR already, and anyway I doubt they support Linux.

One thing that's weird: when my wife attempts to view video on news.bbc.co.uk on her Mac running Firefox, it stalls with the spinny "loading video" image, and the status line claims that it's downloading from "ad.doubleclick.net". This worked fine (of course) on Eircom. If I switch to my user account and use Firefox there, it works fine too -- one possible difference being that I'm using AdBlock Plus and she's not. Something to do with the number of simultaneous TCP connections to multiple hosts, maybe? Very odd anyway. It'd be nice to get some time to sit down with tcpdump and figure this one out... any suggestions?

Google.ie HTTPS fail

Check out what happens when you visit https://www.google.ie/ :

Clicking through Firefox's ridiculous hoops gets me these dialogs:

Good work, Google and Firefox respectively!

Hack: reassassinate

A coworker today, returning from a couple of weeks of holiday, bemoaned the quantities of spam he had to wade through. I mentioned a hack I often use in this situation: discard the spam, download the 2 weeks of supposed-nonspam as a huge mbox, and rescan it all with spamassassin. Since the intervening 2 weeks gave plenty of time for the URLs to be blacklisted by URIBLs and the IPs to be listed by DNSBLs, this generally results in better spamfilter accuracy, at least in terms of reducing false negatives (the "missed spam"). In other words, it gets rid of most of the remaining spam nicely.

Chatting about this, it occurred to us that it'd be easy enough to generalize this hack into something more widely useful by hooking up the Mail::IMAPClient CPAN module with Mail::SpamAssassin, and in fact, it'd be pretty likely that someone else would already have done so.

Sure enough, a search threw up this node on perlmonks.org, containing a script which did pretty much all that. Here's a minor freshening: download

reassassinate - run SpamAssassin on an IMAP mailbox, then reupload

Usage: ./reassassinate --user jmason --host mail.example.com --inbox INBOX --junkfolder INBOX.crap

Runs SpamAssassin over all mail messages in an IMAP mailbox, skipping ones it's processed before. It then reuploads the rewritten messages to two locations depending on whether they are spam or not; nonspam messages are simply re-saved to the original mailbox, spam messages are sent to the mailbox specified in "--junkfolder".

This is especially handy if some time passed since the mails were originally delivered, allowing more of the message contents of spam mails to be blacklisted by third-party DNSBLs and URIBLs in the meantime.

Prerequisites:

  • Mail::IMAPClient
  • Mail::SpamAssassin
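
For the curious, the core of it boils down to something like this (a condensed sketch; the real script handles errors, already-scanned markers and option parsing properly):

#!/usr/bin/perl -w
# Rescan an IMAP folder with SpamAssassin, refiling spam to a junk folder.
use strict;
use Mail::IMAPClient;
use Mail::SpamAssassin;

my $imap = Mail::IMAPClient->new(
    Server => "mail.example.com", User => "jmason", Password => "secret",
) or die "IMAP connect failed";
my $sa = Mail::SpamAssassin->new();

$imap->select("INBOX") or die $imap->LastError;
for my $id ($imap->search("ALL")) {
    my $mail   = $sa->parse($imap->message_string($id));
    my $status = $sa->check($mail);
    my $folder = $status->is_spam() ? "INBOX.crap" : "INBOX";
    $imap->append($folder, $status->rewrite_mail());
    $imap->delete_message($id);
    $status->finish();
    $mail->finish();
}
$imap->expunge("INBOX");
$imap->logout();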

If only this were true

Some people, when facing a problem, think "I'll use regular expressions." Now they have HORDES OF CUTE PEOPLE WANTING TO SLEEP WITH THEM

-- Yoz, on twitter