Category: Uncategorized

Host monitoring with Jaiku

A few weeks back, we were having trouble with dogma, our shared server where taint.org is hosted, which would occasionally be unavailable for unknown reasons. We needed to monitor its availability so that it could be fixed when it crashed again, and we'd be able to investigate quickly. Since it was happening mostly out of working hours, SMS notification was essential.

Normally, that kind of monitoring is pretty basic stuff, and there's plenty of services out there, from Host-Tracker.com to the more complex self-hosted apps like monit and Nagios which can do that. But looking around, I found that none of them offered SMS notification for free, and since this was our personal-use server, I wasn't willing to sign up for a $10-per-month paid account to support it, or buy any hardware to act as a private SMS gateway.

Instead, I thought of Jaiku -- the Finnish company which offers a microblogging/presence platform similar to Twitter. Jaiku had a couple of cool features:

  • SMS notifications
  • it's possible to broadcast messages to a "channel", which others could subscribe to, IRC-style
  • it has an open API

This would allow me to notify any interested party of dogma's downtime, and let them subscribe and unsubscribe using whatever notification systems Jaiku supports.

With a little perl and LWP, I rigged up a quick monitoring script to check http://taint.org/ via HTTP, and report if it was unavailable over the course of 5 retries in 50 seconds. If it was broken, the script sends a JSON-formatted POST request to Jaiku's "presence.send" method, informing the target channel of the issue. (Perl source here.)
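For anyone who'd like to try the same trick, here's a rough sketch of the retry-then-alert logic. It's in Python rather than the original Perl, and it is not the script linked above. The function names and the JSON field names are my own placeholders (I'm not reproducing Jaiku's actual request schema here), so treat it as an illustration only.

```python
import http.client
import json
import time
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if an HTTP GET of `url` completes without an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def down_after_retries(check, retries=5, delay=10, sleep=time.sleep):
    """Return True only if `check()` fails on every one of `retries` attempts."""
    for _ in range(retries):
        if check():
            return False
        sleep(delay)  # 5 attempts x 10 seconds is roughly the 50-second window
    return True


def build_alert(channel, message):
    """Build a JSON body for the alert.

    The field names are assumptions for illustration, not Jaiku's schema.
    """
    return json.dumps(
        {"method": "presence.send", "channel": channel, "message": message}
    )


# Example wiring (left as a comment so importing this file has no side effects):
#   if down_after_retries(lambda: site_is_up("http://taint.org/")):
#       body = build_alert("#dogmastatus", "taint.org is not responding")
#       ... POST `body` to the API endpoint of your choice ...
```

Passing the check and the sleep function in as parameters keeps the retry loop easy to test without touching the network or actually waiting.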

You can see the '#dogmastatus' channel here -- as you can see, we fixed the problem with dogma just over 2 weeks ago ;)

It's worth noting that I had to set up an additional user, "downtimebot", on Jaiku to send the messages -- otherwise I'd never see them on my configured mobile phone! Jaiku uses the optimisation that, if I sent the message, there's no need to cc me with a copy of what I just sent; logical enough.

Anyway, if you're interested in dogma's availability (there might be one or two taint.org readers who are), feel free to add yourself to the #dogmastatus channel and receive any updates.

Update: Fergal noted that it's pretty simple to use Cape Clear's assembly framework to perform an HTTP ping test with output to Jabber/XMPP. Nifty!

A fishy Challenge-Response press release

I have a Google News notification set up for mentions of "SpamAssassin", which is how I came across this press release on PRNewsWire:

Study: Challenge-Response Surpasses Other Anti-Spam Technologies in Performance, User Satisfaction and Reliability; Worst Performing are Filter-based ISP Solutions

NORTHBOROUGH, Mass., July 17 /PRNewswire/ -- Brockmann & Company, a research and consulting firm, today released findings from its independent, self-funded "Spam Index Report-- Comparing Real-World Performance of Anti-Spam Technologies."

The study evaluated eight anti-spam technologies from the three main technology classes -- filters, real-time black list services and challenge- response servers. The technologies were evaluated using the Spam Index, a new method in anti-spam performance measurement that leverages users' real-world experiences.

[...] The report finds that the best performing anti-spam technology is challenge-response, based on that technology's lowest average Spam Index score of 160.

[...] Filter - Open Source software-(Spam Index: 388): This technology is frequently configured to work in conjunction with PC email client filters. The server adds * * SPAM * * to the subject line so that the client filter can move the message into the junk folder. This class of software includes projects such as ASSP, Mail Washer and SpamAssassin, among others.

The "Spam Index" is a proprietary measurement of spam filtering, created by Brockmann and Company. A lower "Spam Index" score is better, apparently, so C/R wins! (Funny that. The author, Peter Brockmann, seems to have some kind of relationship with C/R vendor Sendio, being quoted in Sendio press releases like this one and this one, and providing a testimonial on the Sendio.com front page.)

However, there's a fundamental flaw with that "Spam Index" measurement: it's designed to make C/R look good. Here's how it's supposed to work. Take these four measurements:

  • Average number of spam messages each day x 20 (to get approximate number per work-month)
  • Average minutes spent dealing with spam each day x 20 (to get approximate minutes per work-month)
  • Number of resend requests last month
  • Number of trapped messages last month

Then sum them, and that gives you a "Spam Index".
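Written out as code, the calculation described above looks something like the following. The function and parameter names are mine, and the sample figures are made up purely to show the arithmetic; they're not taken from the report.

```python
def spam_index(spam_per_day, minutes_per_day, resend_requests, trapped_messages,
               work_days=20):
    """Sum the four measurements, scaling the daily ones to a 20-day work-month."""
    return (spam_per_day * work_days
            + minutes_per_day * work_days
            + resend_requests
            + trapped_messages)


# Hypothetical user: 10 spams/day, 3 minutes/day, 2 resend requests, 40 trapped.
print(spam_index(10, 3, 2, 40))  # 200 + 60 + 2 + 40 = 302
```

Note that every term simply gets added in at face value, which matters for the first problem discussed below.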

First off, let's translate that into conventional spam filter accuracy terms. The 'minutes spent dealing with spam each day' measures false negatives, since having to 'deal with' (ie delete) spam means that the spam got past the filter into the user's inbox. The 'number of trapped messages' means, presumably, both true positives -- spam marked correctly as spam -- and false positives -- nonspam marked incorrectly as spam. The 'number of resend requests last month' also measures false positives, although it will vastly underestimate them.

Now, here's the first problem. The "Spam Index" therefore considers a false negative as about as important as a false positive. However, in real terms, if a user's legit mail is lost by a spam filter, that's a much bigger failure than letting some more spam through. When measuring filters, you have to consider false positives as much more serious! (In fact, when we test SpamAssassin, we consider FPs to be 50 times more costly than a false negative.)
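To make the difference concrete, here's a small sketch comparing an unweighted error count with one that charges 50 points per false positive, as mentioned above. The two filters and their error counts are invented for illustration, and the function is a simplification, not SpamAssassin's actual scoring code.

```python
def weighted_cost(false_positives, false_negatives, fp_weight=50):
    """Total cost of a filter's mistakes, with FPs costing `fp_weight` times an FN."""
    return false_positives * fp_weight + false_negatives


# Two hypothetical filters, each judged on the same batch of mail.
filter_a = {"false_positives": 1, "false_negatives": 100}  # cautious filter
filter_b = {"false_positives": 5, "false_negatives": 20}   # aggressive filter

# Treating both kinds of error as equal, the aggressive filter looks better.
print(weighted_cost(**filter_a, fp_weight=1))  # 101
print(weighted_cost(**filter_b, fp_weight=1))  # 25

# Weighting lost legit mail at 50x, the ranking flips.
print(weighted_cost(**filter_a))  # 150
print(weighted_cost(**filter_b))  # 270
```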

Here's the second problem. Spam is sent using forged sender info, so if a spammer's mail is challenged by a Challenge/Response filter, the challenge will be sent to one of:

  • (a) an address that doesn't exist, and be discarded (this is fine); or
  • (b) to an invalid address on an innocent third-party system (wasting that system's resources); or
  • (c) to an innocent third-party user on an innocent third-party system (wasting that system's resources and, worst of all, the user's time).

The "Spam Index" doesn't measure the latter two failure cases in any way, so C/R isn't penalised for that kind of abusive traffic it generates.

Also, if a good, nonspam mail is challenged, either

  • (a) the sender will receive the challenge and take the time to jump through the necessary hoops to get their mail delivered ("visit this web page, type in this CAPTCHA, click on this button" etc.); or
  • (b) they'll receive the challenge, and not bother jumping through hoops (maybe they don't consider the mail that important); or
  • (c) they'll not be able to act on the challenge at all (for example, if an automated mail is challenged).

Again, the "Spam Index" doesn't measure the latter two failure cases.

In other words, the situations where C/R fails are ignored. Is it any wonder C/R wins when the criteria are skewed to make that happen?

Stop with the fake phish data

An anonymous friend in the anti-phishing community writes:

For those of you who blog and/or have contacts in the general computer user 'go fight 'em' community:

Is there any way you can get the word out that dropping a couple hundred fake logins on a phishing site is NOT appreciated??

It creates havoc for those monitoring the drop since it's an unbelievable waste of time and resources to clean up the file. Also, for those drop files that 'recycle' after every 10 entries, valid data is lost.

It also creates havoc for those who get these files and try to notify victims. They waste time, too .. pulling legit info from amongst the trash.

I know there are programs out there that create/dump this stuff onto sites and some who call themselves 'phish phighters' enjoy the harassment aspect. But it wastes the time/effort of those who are seriously working these things.

New Science Gallery in Dublin

I just got this missive from the new Science Gallery at Trinity College Dublin:

The SCIENCE GALLERY is seeking EXPRESSIONS OF INTEREST for Festival of Light projects.

Calling all techno-artists, playful scientists, renegade engineers, architects, sculptors, lighting designers, fashion designers, guerilla projectionists and inventors...

The Science Gallery at Trinity College Dublin is developing a two week FESTIVAL OF LIGHT as its launching programme in January 2008 which will celebrate the art, science and technology of light through a range of installations and events in the Science Gallery and around Dublin's city centre.

We are seeking proposals for installations, events and workshops. You can download our Expression of Interest form here. We would like this to reach far and wide so please forward this onto anyone you think may be interested in submitting!

If you would like to discuss your ideas with us or would like further information prior to submitting an Expression of Interest Submission please contact Elizabeth Allen at elizabeth.allen /at/ sciencegallery.org .

I'm looking forward to seeing what happens with this; hope it works out well.

T9 in Ireland

Tobias DiPasquale notes that the iPhone's dictionary can correct the word 'f***ing' right out of the box. Handy!

The vagaries of various companies' autocompletion dictionaries are always worth a comment. I've noticed that swearing is generally omitted, presumably for prudish reasons to do with tabloid PR fears. But as an Irishman, I find it particularly galling that Nokia's T9 dictionary cycles through the following entries for "pints":

  • Shots
  • Pious
  • Riots
  • Pints

When I type "pints" (which happens a lot), believe me, I never mean to type "pious". Stupid phone!
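The reason all four words compete is that they share the same key sequence on a standard phone keypad. A quick sketch shows this; the keypad layout is the standard one, but the code is just an illustration and has nothing to do with Nokia's actual dictionary implementation.

```python
# Standard phone keypad letter groups.
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in T9_KEYS.items() for letter in letters
}


def t9_digits(word):
    """Return the sequence of keypad digits pressed to type `word`."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


for word in ("Shots", "Pious", "Riots", "Pints"):
    print(word, t9_digits(word))  # every one of them is 74687
```

So the phone has no way to tell them apart from the keypresses alone; the order it offers them in is entirely down to the dictionary's ranking.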

Planet Antispam unborked

Those of you who visit Planet Antispam may have noticed that it hadn't been updating for a few days. Somehow or other, the Planet software had corrupted its cache, and was dying with this error:

Traceback (most recent call last):
  File "planet.py", line 167, in ?
    main()
  File "planet.py", line 160, in main
    my_planet.run(planet_name, planet_link, template_files, offline)
  File "/home/planet/antispam/planet-2.0/planet/__init__.py", line 240, in run
    channel = Channel(self, feed_url)
  File "/home/planet/antispam/planet-2.0/planet/__init__.py", line 527, in __init__
    self.cache_read_entries()
  File "/home/planet/antispam/planet-2.0/planet/__init__.py", line 569, in cache_read_entries
    item = NewsItem(self, key)
  File "/home/planet/antispam/planet-2.0/planet/__init__.py", line 845, in __init__
    self.cache_read()
  File "/home/planet/antispam/planet-2.0/planet/cache.py", line 74, in cache_read
    self._type[key] = self._cache[cache_key + " type"]
  File "/usr/lib/python2.3/bsddb/__init__.py", line 116, in __getitem__
    return self.db[key]
KeyError: 'tag:blogger.com,1999:blog-9336495.post-117499582419244211 feedburner_origlink type'

Ah, Berkeley DB, always good for the infrequent inscrutable, yet fatal, error. A wipe of the contents of the cache directory, and it seems to be working again.
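If you'd rather script the wipe than do it by hand, something along these lines would do. This is a generic sketch, not part of the Planet software; the function name is mine, and you'd need to point it at wherever your own installation keeps its cache.

```python
import shutil
from pathlib import Path


def wipe_cache(cache_dir):
    """Delete everything inside `cache_dir`, keeping the directory itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in Path(cache_dir).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove subdirectories recursively
        else:
            entry.unlink()  # plain files and symlinks
        removed += 1
    return removed
```

The cache gets rebuilt from the feeds on the next run, so the only real cost is one slower update.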

Unfortunately, I had to drop the RSS feed for Aunty Spam; it seems the domain has lapsed, and I can't seem to find an RSS feed that contains just the spam-related Aunty Spam posts any more.

‘I Go Chop Your Dollar’ star arrested

The Register is reporting that 'Nigerian comedian and actor Nkem Owoh' has been arrested in Amsterdam as a suspected 419 scammer:

Nigerian comedian and actor Nkem Owoh was one of the 111 suspected 419 scammers arrested in Amsterdam recently as part of a seven month investigation, dubbed Operation Apollo.

Owoh became a well known star within the Nigerian film industry, sometimes colloquially known as Nollywood because of its trite plots, poor dialogue, terrible sound, and low production standards.

Owoh starred in the 2003 film Osuofia, and a year later was one of several actors temporarily banned from appearing in movies by Nigeria's Association of Movie Marketers and Producers because he demanded excessive fees and unreasonable contract demands.

Owoh became internationally known for his song "I Go Chop Your Dollar", the anthem for 419 scammers ("Oyinbo man I go chop your dollar, I go take your money and disappear / 419 is just a game, you are the loser, I am the winner", full lyrics here), which was banned in Nigeria after many complaints.

The song was the title track from the comedy, "The Master", starring Owoh as a scheming 419er.

The alleged scammers are suspected of running a series of lottery-based (AKA 419-lite) scams.

Here's the video for "I Go Chop Your Dollar".

It's not exactly cut and dried, though. This thread suggests that he wasn't arrested for fraud; instead, the Dutch authorities detained pretty much everyone at his concert. This article suggests something similar:

The Netherlands police were said to have stormed the venue of the show in a helicopter about 2a.m and arrested practically everybody at the venue. [...]

"Over 200 of them (Nigerians) were arrested that night. It was a big haul; they came with helicopter and cars and circled the whole area. As I speak with you, over 70 of those apprehended that night have been deported for possession of expired or fake immigration papers.

"Osuofia was also whisked away but was released hours after," the source said.

Update: It appears Osuofia was not arrested after all; lots more details here.

Hunting the wily mangosteen

A few weeks ago, I was in Tesco Clearwater when I spotted something I wasn't expecting; a tray of fruit labelled "Mangosteen".

Mangosteen are delicious. In Thailand, they're called "the queen of fruit" (with the oh-so-stinky and not quite as enjoyable Durian as the king). We once spent a week on a Thai beach snacking on bags of the things; they're so good.

Unfortunately the tray was empty. :(

Ever since then, every time I've gone back to that Tesco, there's been no sign of the mangosteen; not even another empty tray! Thing is, I now know they're importing them, so I'm really jonesing... if any Dublin taint.org readers happen to spot some, please (a) be sure to buy some for yourself and (b) let us know where you found it!

Linking for charidee

Tom tagged me with another blog link-meme -- a worthwhile one, though; the idea is to improve the page rank of charities in Ireland, by linking to them. Fair enough!

The list of charities so far is:

And I'll add Focus Ireland (who seem to have broken their website!). Thanks to Dorothy for the suggestion.

Who to pass it on to? How's about Una, James and Donncha?

NSAI invites comments on OOXML/OpenXML standard

Antoin writes:

NSAI (the Irish national standards body) has posted an invitation for comments on its site regarding the proposed new Office Open XML standard (ISO/IEC DIS 29500). NSAI has established an ad hoc committee to consider the matter, and I am a member of that committee, together with a number of far more important and qualified people.

Anyway, we are anxious to hear from anyone who has a view on what way NSAI should vote on this standard when it reaches committee. If you can provide links to any relevant articles, that would also be very helpful. If you have time, please review the documents and leave your comments either here or send them to the committee.

So if you've been following the ongoing drama (to be honest, I haven't), please feel free to make a submission; the deadline is 11 July.

UPS Ireland suck

I'm waiting for a replacement battery from Dell, covered under warranty. Dell service have been great, but UPS, not so much...

On Monday (25th June), after a little back-and-forth to establish that the battery was faulty, I got a mail from Dell saying:

The Part (Battery) will be with you tomorrow pre 17:00 (Next Business Day). Please note that you will require to return the faulty part at the same point of time, the courier person would not be delivering the part until you return the defective part.

Great! That's good warranty service. I'm happy.

So I wait... and wait. Finally, 2 days later, today (Wednesday 27th), at 17:45, a courier appears to pick up the faulty part. Unfortunately, he doesn't have the replacement with him.

I go online to see what's up via online tracking, and see this:

Location            Date        Local Time  Description
DUBLIN, IE          27/06/2007  16:41       A CORRECT STREET NAME IS NEEDED FOR DELIVERY. UPS IS ATTEMPTING TO OBTAIN THIS INFORMATION
                    27/06/2007  4:13        IN-TRANSIT SCAN
                    27/06/2007  4:12        IMPORT SCAN
DUBLIN, IE          26/06/2007  18:31       IMPORT SCAN
                    26/06/2007  5:59        IMPORT SCAN
                    26/06/2007  5:58        OUT FOR DELIVERY
                    26/06/2007  3:59        ARRIVAL SCAN
KOELN (COLOGNE), DE 26/06/2007  4:39        DEPARTURE SCAN
                    26/06/2007  4:14        DEPARTURE SCAN
HERKENBOSCH, NL     25/06/2007  10:09       ORIGIN SCAN
NL                  25/06/2007  14:02       BILLING INFORMATION RECEIVED

So, what, the street name is "INCORRECT" despite one UPS driver having no problem? I suspect someone just couldn't be arsed.

I rang up UPS, provided a hint, and it seems the delivery is now rescheduled for Friday. So much for "next business day" delivery! Lucky the laptop works on AC without the battery, otherwise I'd be quite annoyed.

I wonder if I can provide feedback to Dell about this? There's a possibility they might switch courier company if they get enough complaints about crappy service. It also makes me wonder if there's any decent international parcel delivery service in Ireland. At least UPS haven't yet required me to schlep over to a "local" depot 5 miles away to pick up the package myself, like An Post does...

How I wound up with a pond

My weekend went like this:

  1. buy a Green Cone composting system
  2. read instructions
  3. find out I had to dig a 3' by 2' deep hole
  4. spend all Saturday afternoon digging massive hole in the back garden, horny-handed son of toil style
  5. just as I finish, the skies open
  6. watch in horror as the hole rapidly becomes a pond
  7. since the green cone requires a dry hole, wait for it to drain...
  8. ...and wait...
  9. ...and wait...

I'm still waiting. :(

I just hope the flooded state of the pit is a side effect of the monsoon levels of rain over the last week, and will drain soon, rather than the normal situation for the garden. Otherwise, I'll have to fill the hole and give up on the Green Cone entirely... argh. I should have gone for the wormery option, like lisey suggested!

Update: Enda left a good tip in the comments -- dig deeper into the clay and fill in with more gravel. I did that and it looks like it's working... Let's see if the worms like it. I'll keep yis posted ;)

How to solve a maze with Photoshop

wow, this is cool. lod3n, confronted by this heinous puzzle, wrote:

'2 minutes in Photoshop. All too easy. So, where do I pick up my cake?

  1. Increase contrast.
  2. Select the right wall of the maze using the magic wand.
  3. Select > Modify > Expand 4 pixels
  4. Create new layer.
  5. Fill with Red.
  6. Select > Modify > Contract 2 pixels.
  7. Delete. Now you've got a line tracing the solution.
  8. Manually clean up the outer edge, and connect the dots.
  9. Cake!'

Here's the result. Seriously nifty!
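The trick works because a maze with exactly one route through it has walls that fall into two separate connected pieces, and the solution is the corridor running between them. Here's a rough sketch of that idea on a tiny text maze. To be clear, this isn't what Photoshop does internally: the flood fill stands in for the magic wand, and a "touches both wall pieces" check stands in for the expand/contract steps. The maze and all the names are invented for illustration.

```python
from collections import deque

# '#' is wall, '.' is corridor. Entrance at the top, exit at the bottom,
# with a dead end running off to the right on the second row.
MAZE = [
    "#.#####",
    "#.....#",
    "###.###",
    "#...###",
    "#.#####",
]


def wall_component(maze, seed):
    """Flood-fill the wall cells connected to `seed` (the 'magic wand' step)."""
    rows, cols = len(maze), len(maze[0])
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def solution_cells(maze, seed):
    """Corridor cells that touch both the selected wall piece and the other one."""
    rows, cols = len(maze), len(maze[0])
    selected = wall_component(maze, seed)
    all_walls = {(r, c) for r in range(rows) for c in range(cols)
                 if maze[r][c] == "#"}
    other = all_walls - selected
    result = set()
    for r in range(rows):
        for c in range(cols):
            if maze[r][c] != ".":
                continue
            # The 3x3 neighbourhood, diagonals included.
            near = {(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)}
            if near & selected and near & other:
                result.add((r, c))
    return result


path = solution_cells(MAZE, seed=(0, 6))  # seed: a cell on the right-hand wall
for r, row in enumerate(MAZE):
    print("".join("o" if (r, c) in path else ch for c, ch in enumerate(row)))
```

Dead ends drop out because they're surrounded by a single wall piece. The approach assumes the walls really do form exactly two pieces; a maze with loops or islands would need a proper search instead.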

(Update: wow, this got Dugg heavily -- 17000 pageviews from Digg alone! Unfortunately that caused a bit of a server meltdown. Should be back now though...)

7digital – a bit risky

Apparently EMI are now offering their DRM-free MP3s via 7digital, so I thought I'd give the newly-revamped 7digital site a go. Results were a little mixed, unfortunately.

I found a couple of tracks I wanted which were available as MP3 format, clicked the "purchase" button beside them, and they were added to the "basket" on the right-hand side. Pretty typical stuff, if you've used EMusic or iTunes. Then I created an account, chose to pay using Paypal, paid a couple of quid and all was well!

The good stuff:

  • the website works great in Firefox on Linux, and was nice and speedy.

  • the range of music seems pretty good; most of the catalogue is WMA-only unfortunately, but most of the new releases now seem to be coming out with MP3 as an option.

  • it's very easy to pay by credit card or with Paypal.

There were a couple of glitches, however.

First, it allowed me to buy a file, then not give it to me. My first tester track was the Soulwax remix of 'Standing in the Way of Control' by Gossip. I happily added it to my basket, checked out, and paid -- then when I got to my 'Your downloads' page, I was presented with this:

Gossip - Standing In The Way Of Control (Soulwax Nite Version) / 6:54 / Released 24.06.2007

No download links etc... hmm. A quick check of today's date reveals that the 24th is a week from now -- the track hasn't been released yet! It seems this isn't yet "available as a digital release" for some reason, despite the fact that as far as I can tell it's been out for ages on CD. The only way to spot this in advance of purchase is to look at the "Digital release date" on the album info page and compare with today's date; there's no other notification that you'll be buying a prerelease, and will have to wait to get your digital mitts on what you buy. Grrrr.

OK, next one; my other tester track was the title track from the new White Stripes, Icky Thump. At least this one was available. Now, supposedly we're getting 320kbps MP3s, right? Not so, it seems -- this one was 192kbps, a fact that's only revealed once you've already paid for the tracks. Double grrr...

(it turns out, by the way, that only the "EMI content" is delivered in 320kbps format. I guess the other MP3 labels are sticking with 192kbps.)

So, two for two, both of the test downloads turned out to be wonky in one way or another. A bit disappointing. I hope they'll improve though -- there seems to be a new willingness to offer a decent MP3 music-download service there... and this is still more convenient for me than having to boot up a Windows virtual machine to use the iTunes Music Store.

They could really do with signposting exactly what you're getting more clearly, though; in particular, being able to search by available format and bitrate would really help.

Lyris’ low SpamAssassin threshold

via jgc's newsletter, Lyris' latest ISP Deliverability Report (Q1 2007) makes an interesting point about legitimate bulk mail and SpamAssassin:

Contrary to popular belief among marketers, message content is not a major cause of deliverability challenges for most email marketers. This finding is a result of testing the content of more than 1,705 unique emails, using [Lyris] EmailAdvisor's content scoring tool. The content scoring function is based on the content scoring rules of the widely adopted Spam Assassin open source project. The emails tested had an average content point score of 1.04 well below the filter's generally accepted spam identification level of 3.0 or higher.

Now, that's broadly good advice -- SpamAssassin hasn't really given much strength to signatures found in message body text in the past couple of years, since the signatures from other sources (especially DNS blocklists and URI blocklists) are much more reliable.

However, note the last sentence of that quote. Since when is 3.0 the 'generally accepted spam identification level'? Only the most paranoid user would ever go that low, since at that level, they'd expect to find 2.22% of their nonspam mail going into the spam folder (according to our own tests). In reality, our recommended level has always been 5.0 points, and that's what we optimise for. I'm mystified as to where they're getting 3.0 from...

Irish medical tourism

Just got a mail from an old friend, Caelen, who's got a new start-up going with an interesting angle. Caelen and his (now-) wife, Barbara, spent a while travelling around Asia around the same time as we did. As I noted back in 2003, one thing he tried out, which I found particularly intriguing at the time, was to have some minor surgery in Bangkok:

This may seem foolish at first, but despite being in the heart of South East Asia, in what is generally thought to be a developing country, the Thai medical system is unbelievably good. Not only is it the medical hub for expatriates throughout the region, but tens of thousands fly here each year to have elective surgery, from laser eye treatments to boob jobs and face lifts. There are lots of reasons why they come to Bangkok but invariably quality of surgery and care comes top of the list. Simply put, medical care in Thailand is amongst the best in the world, available at a fraction of the cost.

The Thai government sees health care as the next logical step in its hospitality industry. As holiday makers in Thailand reach saturation point, growth has to come from other sectors and international healthcare has many of the same requirements as the tourism industry: good flight connections, plentiful accommodation and above all staff that are understanding and friendly. Gleaming hospitals, which could be mistaken for 5 star hotels, not only have rooms with all amenities but also have suites, restaurants, shops and cinemas. Menus from the finest restaurants in town are placed in the best rooms. Going to hospital doesn't mean you have to stop having fun - this is Bangkok after all. This is a long way from the cold greasy egg served by the kitchen's 'Miserable Person of the Year' award winner we get at home.

Back in 2002, this was pretty unprecedented -- of course, nowadays, the concept is a lot more widely practiced, what with healthcare costs rising in the US and waiting lists rising in the UK.

I can vouch that the quality of care in Bangkok was fantastic, by all accounts; fastidiously clean and professional. (I never did it myself, but many people I knew at the time took advantage of the opportunity, rather than risk something flaring up in the less, er, reliable settings of Luang Prabang or Phnom Penh.)

Anyway, turns out Caelen has come up with a new site that is related to this -- Reva Health Network. He says, 'basically, we are a medical tourism search engine where consumers can find and compare hospitals and clinics from around the world. We cover everything although the bulk of our business is currently in dental.'

If you're looking for some work done, it might be worth taking a look; it's at revahealthnetwork.com.

Update 2010-08-16: They've moved! The new URL is http://www.whatclinic.com , which makes much more sense really. Apparently they're getting 500,000 visitors a month, and proxy through 800 phone calls a day to clinics. Cool -- sounds like it's going well...

IKEA Dublin gets planning permission

Given that I'm trying to get a new house in order, here's a topic close to my heart right now -- massive IKEA store approved for Dublin:

An Bord Pleanála has given the go-ahead for the construction of a massive IKEA outlet in the Ballymun area of Dublin. Legal restrictions on the size of retail developments had already been changed to allow the Swedish furniture giant to build a 30,000 square foot shop in the area. However, several objections were received from the National Roads Authority, Green Party TD Eamon Ryan and a number of businesses which said they would be adversely affected by a huge increase in traffic on the M50 motorway. An Bord Pleanála has now decided to grant permission for the project, subject to 30 conditions aimed at preventing traffic congestion, protecting the visual amenity of the area and promoting sustainable development.

This is long overdue, and something Ireland's been crying out for -- the price and quality of furniture here is dire. I'm glad to see it.

The details are up on An Bord Pleanála's site, including the Board's conditions. For ease of reading, I've converted it to HTML using OpenOffice.

This one strikes me as potentially annoying:

A schedule of parking charges shall be applied to car park users (other than coaches and buses which shall not be charged for parking during opening hours) [...]

At least two months prior to the opening of the proposed development for trading, an initial schedule of charges shall be agreed in writing with the planning authority. Where the daily peak hour two-way traffic flows as measured by the automatic traffic counters do not comply with the thresholds set above, the schedule of parking charges shall be varied as directed by the planning authority until compliance is achieved, save that breaches or non-compliances of a very minor or trivial nature or arising from exceptional circumstances may be disregarded at the discretion of the planning authority.

Reason: To minimise traffic impacts and avoid serious traffic congestion.

Patronising pregnancy

Via Yoz comes this great article: Zoe Williams: Being pregnant and receiving unscientific advice go hand in hand. Here's a sample:

Listeria has been my particular bugbear ever since a midwife - that is, a trained prenatal professional who, unless I develop complications, represents the highest medical authority I can expect to deal with throughout my pregnancy - told me that I could get listeriosis, thereby brain-damaging my foetus, without knowing about it. Now, listeriosis is an incredibly serious disease, with extremely serious symptoms, taken extremely seriously by epidemiologists nationwide. Get it without noticing it? If I got listeriosis, the national papers would know about it. It would be the third outbreak that has occurred in [the UK] in the past 20 years.

Here are some other things that are wantonly untrue: pasteurisation, in fact, has nothing to do with a cheese's ability to harbour the listeria bacteria. The bacteria that characterise different cheeses are introduced after the pasteurisation process anyway. Listeria flourishes in moist environments, so parmesan is safe where camembert isn't, but even rinded and soft cheeses are safe once they have been cooked. But food hygiene is a much more important factor than moisture - raw fish does not come out of the sea carrying listeria, but contracts the bacteria from contact with dirty hands. Of the past two outbreaks of listeria in Britain, one was from butter and the other from lettuce (there have been other instances of product recalls, but no human contamination).

In fact the three worst recorded cases of listeria since 1992 have all been in France, and were all from pork tongue in jelly, which nobody in their right mind would ever eat. Of the past 10 listeriosis outbreaks in America, only two were from cheese, and one of those was a Mexican homemade cheese. The notion that there are pregnant people out there whipping themselves into a frenzy of guilt because they have eaten some gorgonzola is just infuriating.

This patronising "pregnant women mustn't do X" paranoia is C's pet hate of the moment; being a (pregnant) scientist, she's been checking these claims against Medline, looking into the extent of the real research they're based on, and generally writing them off one by one. I've been trying to persuade her to write a blog post about this for taint.org, so far with no luck though...

MAAWG Talk

Here's the talk I gave at MAAWG, entitled New Features in SpamAssassin 3.2.0 Of Interest To Large Receivers:

Abstract:

Many ISPs and mail receivers, at all scales, use SpamAssassin as part of their spam-filtering arsenal. The recent release of SpamAssassin 3.2.0 introduces much new functionality, and some of this is of particular interest to the large-scale mail receiver; in particular, rules compiled to parallel-matching native object code for increased speed, early short-circuiting based on administrator-specified rules, the new "msa_networks" setting to specify MSA hosts or pools, a new ruleset to detect spam/virus backscatter bounces, a way to run SpamAssassin in the Apache httpd server using mod_perl, and support for Amazon's EC2 virtual server farm. In this talk, I'll discuss each of these in detail, and discuss why it may be useful to you.

If you were at MAAWG, hope you enjoyed it ;)

DSPAM acquired by Sensory Networks

Whoa, didn't see that coming. Quoting Jonathan Zdziarski via jgc's newsletter:

...The [DSPAM] project had grown to a point where it would take others - with enough free time - to bring DSPAM to the next level as a widely accepted enterprise-class solution, and [I] decided that it would be in the best interest of the project to entrust it to someone with the technical knowhow and dedication to reach these goals. Many of you are aware of my work in the past with Sensory Networks in developing a hardware-accelerated version of DSPAM (capable of supporting multi-megabit speeds in large carrier environments). I've spent a considerable amount of time with SN's team over the past several years and when we initially discussed working together, they had shown to be very excited and motivated about the project.

After careful consideration and many discussions at length, I decided to allow Sensory Networks to acquire the rights to the project, and continue development on it with their own team. SN has displayed a strong commitment to the open source community and has been working closely with other leading projects such as Snort, Clam Antivirus, and SpamAssassin. They assured me that the project will remain open-source and available to all, and at the same time the project will receive exposure in commercial environments it has not seen before, as many of you have been asking for. We've now completed the acquisition for the project, and I'd like to encourage you to support them in helping them move forward as it grows into new areas.

More details at zdziarski.com.

Dealing with backscatter, revisited

Back in January, I wrote about how I deal with email backscatter nowadays. Since then, I've made a notable tweak.

This is that I no longer reject "null-sender" traffic during the SMTP transaction. It turned out that it broke Exim's implementation of Sender Address Verification, which performs the SAV check using a MAIL FROM of <>, rendering it indistinguishable from a bounce during the SMTP transaction.

Now, I've complained about SAV, but I have to be pragmatic anyway (Postel's law and all that!) -- so it was better to just allow other sites to perform SAV lookups against our server, and fix the anti-bounce stuff some other way.

The new method (below) does this, by allowing null-sender SMTP traffic just fine; it detects bounces in Postfix if they arrive via SMTP in RFC-3464 format, and bounces that slip past are then dealt with in a more CPU-intensive manner using the SpamAssassin "VBounce" ruleset (which is part of the now-released SpamAssassin 3.2.0, btw).

This increases the load, since some bounces cannot be rejected at MAIL FROM time now, and instead we have to wait 'til DATA -- but CPU hasn't been a problem recently, so this is ok.

Here are the updated instructions:

In Postfix

In my Postfix configuration, on the machine that acts as MX for my domains -- edit '/etc/postfix/header_checks', and add these lines:

/^Content-Type: multipart\/report; report-type=delivery-status\;/  REJECT no third-party DSNs
/^Content-Type: message\/delivery-status; /     REJECT no third-party DSNs

Edit '/etc/postfix/main.cf', and ensure it contains:

header_checks = regexp:/etc/postfix/header_checks

Then run:

sudo /etc/init.d/postfix restart

This catches most of the bounces -- RFC-3464-format Delivery-Status-Notification messages from other mail servers.
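To sanity-check patterns like these before restarting Postfix, here's a minimal Python sketch that applies the same two expressions to sample header lines. The sample headers are made up, and Python's `re` is only standing in for Postfix's regexp table (which matches case-insensitively by default, hence `re.IGNORECASE`); treat it as an approximation rather than an exact emulation.

```python
import re

# The two header_checks patterns from above, transcribed for Python.
# Postfix regexp tables are case-insensitive by default, so IGNORECASE
# is used here to mimic that.
DSN_PATTERNS = [
    re.compile(r"^Content-Type: multipart/report; report-type=delivery-status;",
               re.IGNORECASE),
    re.compile(r"^Content-Type: message/delivery-status; ", re.IGNORECASE),
]


def would_reject(header_line):
    """Return True if one (already unfolded) header line matches a pattern."""
    return any(p.search(header_line) for p in DSN_PATTERNS)


# Made-up sample headers, for illustration only.
samples = [
    'Content-Type: multipart/report; report-type=delivery-status; boundary="abc"',
    'Content-Type: text/plain; charset=us-ascii',
]
for line in samples:
    print(would_reject(line), line)
```

One thing this makes easy to spot: the patterns depend on the parameter order, so a DSN that puts `boundary=` before `report-type=` would not match and would fall through to the SpamAssassin stage. For testing against the real table, `postmap -q` with the `regexp:` table type should do the job.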

In SpamAssassin

As before, install the Virus-bounce ruleset and set it up. This will catch challenge-response mails, "out of office" noise, "virus scanner detected blah" crap, and bounce mails generated by really broken groupware MTAs -- the stuff that gets past the Postfix front-line.

Dead laptop time

Argh. My Thinkpad's power socket must have received a knock during the move. It no longer works with either of the two power bricks I have here -- so it looks like it's time to either (a) buy a soldering iron and some screwdrivers (incl Torx ones?) or (b) renew my IBM warranty service and send it in for some fixing :(

Bad timing.

Update: oh look, it's working again! phew. I guess I should probably set aside some time for warranty service here anyway though...

Back

Hey -- I'm back, rested and full of tasty, tasty Niçois and Provencal cuisine.

I got back just in time to vote, for what good that did with Bertie's gang leading strongly in the current counts... argh!

For what it's worth, I gave Patricia McKenna a preference, in the end. I was reminded that she'd been entirely on our side on software patents during her time as an MEP -- so credit where it's due, there; on top of that, a vote for the Greens is better than a vote going to Sinn Fein, after all, no matter what. ;)

Carbon offsetting

I'm off to Nice on vacation for two weeks, starting tomorrow -- back on May 25th. See ya then!

In the meantime, and appropriately enough given the jet fuel I'll be consuming, here's some interesting stuff from my mate Eoin on carbon offsetting...

'It's a fecking minefield to figure out. There are many conflicting standards, some of which sound impressive but are useless in reality.

Steer clear of tree planting, especially outside Europe; even a well-run forestry in Europe will take decades to make any difference.

The best quality-mark appears to be the CDM Gold Standard. The Gold Standard is a recent introduction, a response to the weak, conflicting Kyoto standards and many ad hoc government ones. Gold Standard specifically excludes tree plantations.

The following operators are the only ones I found that are Gold Standarded and also pass the bullshit smell test (which is far more stringent ;-) thanks to all who supplied links etc. -- eoin

  • My Climate -- Seem good. run out of Switzerland. Professional vibe. Mainly projects in the developing world.
  • Atmosfair -- like the swiss one except smaller and German. Again, seems professional, their projects page in particular reads well. Doing a German schools project as well as developing world ones.
  • Climate Friendly -- Aussies. Mainly wind power, in Oz & NZ. Again seem good, have been around for a few years. Website is decent if a bit all over the place.
  • Sustainable Travel International -- more an eco-holidays travel agent than offsetting per se. Useful bookmark.
  • Puretrust.org.uk -- These guys seem good. Interesting business model. They buy high quality carbon credits, from mainly Gold Standard providers, and retire these credits. Permanent retirement, I think, though this wasn't 100% clear on their site. So they both support the providers directly by doing business with them, and also jack up the market price by reducing supply. This supply choke isn't something that the rest of them do, at first glance anyway. Clever idea. As the market price gets higher it will put pressure on companies to reduce their emissions, not just buy their way out of it.'

Now it's worth noting that this is the state of play as of May 2007; it'll definitely change pretty quickly as time goes on. Good info, though.

Eircom broadband — it’s never easy

Argh, it's never easy.

After this post, the consensus was that nowadays, Eircom have a pretty good quality of service for their DSL offerings, taking both price and service into account. I was happy enough to go with that, so I ordered their "Eircom broadband always on 2MB and Eircom talktime anytime bundle", back around the middle of April.

I had a great call with the sales agent, Hazel. Everything went swimmingly, we were all set for the modem to be delivered and the service to be up and running in 10 working days -- by April 30th. I asked for an order reference number and she said I didn't need one, it was all handled in their system. Great!

Unfortunately it seems the call centre staff never got that quality-of-service memo.

Come May 1st, there was no sign of the modem, so I rang Eircom's order line to see how things were going. To my horror, the staff I talked to told me that there was no record of my previous order, or call... it was as if that call had never taken place at all. No part of the order had even started.

As a result, I've had to reorder from scratch. The previous 10 working days we've waited counts for nothing. (The agents lie through their teeth about this, though -- one agent says they'll send it out in the "next 3-5 days", the next agent insists that we have to wait the full 10 days, and the next says somewhere in between -- anything to get us off the line within 4 minutes.)

This is bad news, since we're waiting on the broadband to move in -- since I work from home, we can't move in until we have a good 'net connection.

We can't even make a complaint to Eircom about this fuckup, because they refuse to take complaints without the original order number to reference -- the one that "Hazel" told me wasn't needed anymore. Now that's bureaucracy. Attempts at escalation just wound up with a dead end, where supervisors had no names and had left the office at 10am anyway. >:(

Best of all, their online complaints system now takes a maximum message length of 400 characters, so you can't even provide a detailed written complaint online anymore. (That is, not unless you submit the complaint in 15 separate parts...)

What a fiasco.

So we now have to wait until May the 15th. We've submitted the complaint via the aforementioned 15 parts, and postally; if they don't take action on those, we'll complain to Comreg (and let's see what that's worth).

But here's a question -- assuming they fail to deliver the second order within time this time around, can we cancel at that stage? There's a minimum contract length of 6 months, but since the service hasn't been delivered, I would hope that hasn't started yet. The terms and conditions document says:

"Ready for Service date" (otherwise "RFS date") means the date on which eircom establishes the Facility for the Customer.

3.1 This Agreement shall commence on the Ready for Service date and shall be for the Initial Period. Provided that this Agreement has not been terminated in accordance with its terms or in accordance with the Regulations, this Agreement shall thereafter automatically renew for successive six-month periods. For the purposes of this clause 3, a six-month period will be calculated from the anniversary of the RFS date.

3.2 The Customer may cancel its order for the Facility at any time prior to the RFS date. In the event of such cancellation by the Customer it shall be obliged to return any Kit, which may have been provided to it by eircom. Any Kit shall be returned to eircom by posting it to the freepost address detailed in the welcome pack. In the event of any Kit not being returned to eircom within fourteen (14) days of the cancellation of the Order for the Facility, the Customer shall be charged by eircom and shall pay to eircom such sum as is set out in the Regulations as being the charge payable in respect of the non-return of any Kit.

So I guess as long as the facility -- the ADSL line -- is not up and running, I'm clear to cancel, right? It's a little worrying that the "facility" doesn't include the "kit" -- ie. the broadband modem, though; if they fuck up sending out the modem, but the line is up, am I liable for 200 Euros?

In terms of who are viable options to switch to -- in my opinion it's got to be fixed wireless, since everyone else now would have to go via Eircom's exchanges anyway, and be delayed there. So -- Irish Broadband. I know they had some pretty massive problems 2 or 3 years ago, but recently I've been hearing good things about them, Boards.ie has some reasonably good-sounding recent experiences, and half of my new neighbours (srsly!) are using them with great results. Anyone got recent news about how useful they are with service quality and install speed for their Breeze product in the D9/D11 area?

Alternatively, Ripwave might make a reasonable stop-gap option? 120 euros is the minimum fee (6 months at 18.95 per month), which is better than the money I'm paying now to live in two houses...

Alternatively anyone know an Eircom engineer in D9/D11 that can nip over to the exchange and plug in my connection on the DSLAM? ;)

Moin Moin attachment spam

Here's a new trick used by the web spammers -- attachments on a Moin Moin wiki. The taint.org/wk RecentChanges list illustrates it well:

2007-05-07  set bookmark
[UPDATED]       UserPreferences         04:17   Info    ?StepStep [1-21]        
  #01 Upload of attachment 'big-cocks.html'.
  #02 Upload of attachment 'big-cock.html'.
  #03 Upload of attachment 'big-boobs.html'.
  #04 Upload of attachment 'big-ass.html'.
  #05 Upload of attachment 'bdsm.html'.
  #06 Upload of attachment 'bbw.html'.
  #07 Upload of attachment 'bang-bros.html'.
  #08 Upload of attachment 'bangbros.html'.
  #09 Upload of attachment 'baby.html'.
  #10 Upload of attachment 'asian-porn.html'.
  #11 Upload of attachment 'asian-girls.html'.
  #12 Upload of attachment 'anime-porn.html'.
  #13 Upload of attachment 'anime-girls.html'.
  #14 Upload of attachment 'angelina-jolie.html '.
  #15 Upload of attachment 'amature.html'.
  #16 Upload of attachment 'amatuer.html'.
  #17 Upload of attachment 'adult-videos.html'.
  #18 Upload of attachment 'adult-stories.html' .
  #19 Upload of attachment 'adult-games.html'.
  #20 Upload of attachment '69.html'.
  #21 Upload of attachment '3d.html'.

Great. Lots of spam. This first started appearing on Feb 27 2007, in a multi-upload attack on a single page ("FindPage"), from IP address 212.26.129.162; then reoccurred on Apr 27 and May 7 from the (insecure open proxy) proxy.drevlanka.ru.

Annoyingly my "subscribe to wiki changes" patch doesn't catch this -- these aren't gatewayed through as "changes" via mail for review. I need to fix that in my copious free time. :(

Also, the RecentChanges RSS feed doesn't list them, although the HTML form does.

So unfortunately, the only way I can see to block this is either to review by visiting the RecentChanges page in a web browser regularly (how retro!), and delete them retrospectively, or simply to turn off attachments entirely -- which is what I've done, by editing "wikiconfig.py" and adding:

    actions_excluded = ['AttachFile']

It looks like quite a few other wikis around the web are running into the issue too :(
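For anyone who'd rather keep attachments enabled and review uploads instead, here's a minimal Python sketch that pulls the attachment filenames out of RecentChanges text in the format shown in the listing above. It assumes you've already fetched the page text somehow; the function name and the sample input are hypothetical, and the line format is taken purely from that listing.

```python
import re
from typing import List

# Matches log lines such as:  #01 Upload of attachment 'example.html'.
UPLOAD_RE = re.compile(r"Upload of attachment '([^']*)'")


def attachment_uploads(recent_changes_text: str) -> List[str]:
    """Return the uploaded attachment filenames found in RecentChanges text."""
    return [m.group(1).strip() for m in UPLOAD_RE.finditer(recent_changes_text)]


# Made-up sample in the same shape as the RecentChanges listing.
sample = """
[UPDATED]       UserPreferences         04:17   Info
  #01 Upload of attachment 'example-one.html'.
  #02 Upload of attachment 'example-two.html '.
  #03 Some ordinary page edit.
"""
print(attachment_uploads(sample))
```

Run from cron against the HTML RecentChanges page, something like this could mail out a list of new uploads for review, which is roughly what the "subscribe to wiki changes" patch misses.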

SpamAssassin 3.2.0!

W00t! SpamAssassin 3.2.0 has finally gone gold!

This release is a big one -- it's the first major release since 3.1.0, back in September 2005, just over a year and a half ago. Here is the release announcement mail, containing a list of major changes since version 3.1.8. There are a few major new features that I feel worth picking out in more detail and editorialising about:

sa-compile

This is a biggie. This new script takes the active SpamAssassin ruleset, and uses code contributed by Matt Sergeant to produce input for re2c. re2c in turn compiles the ruleset into a deterministic finite automaton, which can match multiple regular expressions in parallel. That's not all, though; re2c then compiles that DFA into C code -- which is then compiled into native object code. SpamAssassin will then load that object code and use it to replace the slower perl regexp tests, if it's available at scan-time.

Now, it's been a long time since SpamAssassin's ruleset consisted mainly of rudimentary regular expressions matched against the body text -- a good portion of SpamAssassin's ruleset these days operates against headers, performs network lookups, analyzes URLs extracted from the body, uses the more advanced features supported by Perl's NFA regexp engine, and so on. But even given that, the effects of 'sa-compile' seem to average between a 15% and 25% speedup, in my testing. That's good ;)

Many of the commercial versions of SpamAssassin include their own body-rule speedups -- but this is the first time anything similar has made it into the open source code.
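To give a feel for what "matching multiple patterns in parallel" with an automaton means, here's a small Python sketch of the classic Aho-Corasick algorithm. This is an assumption-laden stand-in, not what sa-compile or re2c actually generate: it only handles literal phrases rather than full regular expressions, and the phrases are made up. The point is just that the text is scanned once, however many patterns there are.

```python
from collections import deque


def build_automaton(phrases):
    """Build an Aho-Corasick automaton: goto table, failure links, outputs."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link for each state
    out = [set()]  # phrases recognised on reaching each state
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(phrase)
    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def scan(text, automaton):
    """Single pass over text; returns the set of phrases that occur in it."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits


# Made-up "rules", for illustration only.
automaton = build_automaton(["free money", "money back", "click here"])
print(sorted(scan("get your free money back today", automaton)))
```

Note how the overlapping phrases "free money" and "money back" are both found in one left-to-right pass; a loop that runs each pattern separately would have to rescan the body once per rule.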

Short-circuiting

Another good one for performance. There are some rules that you can reasonably assume will never hit nonspam mail, and others that will never hit spam mail, in a well-configured setup. For example, a hit on "ALL_TRUSTED" should mean that the message never traversed an untrusted network, therefore it cannot be spam, so why bother applying the expensive tests? It should be reasonable to "short-circuit" and immediately return a "ham" score for that mail.

This new plugin implements that algorithm -- and efficiently, too, which historically has been the hard part!

I've been using this for a while with a ruleset like this one -- in my experience, it's cut overall CPU time spent scanning mail by 20%.

It is pretty flexible, too -- there's lots of tweakage that can be done with this functionality to suit your own setup.
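The basic idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the plugin's actual implementation: the `Rule` class, the priority field, the threshold and the body rule are all hypothetical, and only the "ALL_TRUSTED" name comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    name: str
    test: Callable[[dict], bool]
    score: float
    priority: int = 0                   # lower values run first
    shortcircuit: Optional[str] = None  # "ham" or "spam": stop scanning on a hit


def scan(msg: dict, rules: List[Rule], threshold: float = 5.0) -> Tuple[str, List[str]]:
    """Run rules in priority order; a short-circuit hit skips all the rest."""
    hits: List[str] = []
    total = 0.0
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.test(msg):
            continue
        hits.append(rule.name)
        total += rule.score
        if rule.shortcircuit:
            return rule.shortcircuit, hits
    return ("spam" if total >= threshold else "ham"), hits


expensive_calls = []


def expensive_body_test(msg):
    expensive_calls.append(msg["id"])  # record that the costly rule really ran
    return "lottery" in msg["body"]


RULES = [
    Rule("ALL_TRUSTED", lambda m: m["all_trusted"], -1.0,
         priority=-10, shortcircuit="ham"),
    Rule("HYPOTHETICAL_BODY_RULE", expensive_body_test, 6.0),
]

print(scan({"id": 1, "all_trusted": True, "body": "you won the lottery"}, RULES))
print(scan({"id": 2, "all_trusted": False, "body": "you won the lottery"}, RULES))
print(expensive_calls)  # only message 2 reached the expensive rule
```

The saving comes entirely from ordering: the cheap, decisive rules need to run before the expensive ones, which is why they get the earliest priority here.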

Reduced memory footprint

One aim of this release has been to reduce the memory usage of SpamAssassin; the core code now uses less RAM than 3.1.x does, when tested with the same ruleset. (Unfortunately we've added lots more rules in the interim, so it's a bit of a wash overall. ;)

The VBounce anti-bounce ruleset

Detects spurious bounce messages sent by broken mail systems in response to spam or viruses. More info about that here.

Apache-spamd

apache-spamd implements spamd as a mod_perl module. This was contributed by Radoslaw Zielinski, as a Google Summer of Code project last year. Thanks Radoslaw!

There are plenty more new, useful features and rules -- these are just the top ones, in my opinion. Pretty cool stuff!

Patricia McKenna and MMR, again

Great! Patricia McKenna just called around, canvassing our area -- and just got a serious telling off from the wife ;)

Catherine -- unsurprisingly, given that she's a zoology Ph.D -- was fantastic, hitting every key point of the issue: that we're both long-time Green voters who've been forced to not vote Green this time around, due to this MMR issue and the anti-science/pro-hokum angle it represents.

Interestingly, she claimed that her stance on MMR was always her own point of view, and that it wasn't party policy -- and that the claim that it was mentioned on the party website was a rumour put about by the PDs.

While it turns out that Dr. Ruairi Hanley, the author of this letter to the Indo is indeed a PD (didn't realise that!), Treasa at Winds and Breezes also noted it appearing on the Green Party site, as follows:

Questioning the Benefits of Immunisation

There are significant question marks about the effectiveness of mass immunisation programs. We would launch a major study of the benefits of these programs looking at all aspects of health

So Treasa -- are you a stealth PD rumour-monger? ;)

Worth noting that at no time did McKenna reassure C that her policy would not become government policy if the Greens were elected... as an elected representative, surely her own policies would influence the government's thinking?

Screenclick devolve again

After a short period where things were looking up, Screenclick have once again reverted to type, by ditching the lovely simple Netflix-style queue they seemed to be using, and instead instituting some new kind of bizarre homebrew weirdness.

It looks like a queue, with a line-by-line listing of movies -- but then beside each title, there are 3 radio buttons: "High", "Medium", and "Low".

The instructions run as follows:

All titles are sorted in alphabetical order within their priority group
  • - High: Please deliver these titles as soon as possible
  • - Medium: Please deliver these titles as they become available
  • - Low: I don't mind when you send these titles

So what -- does this mean that if I put a title in as "High", I'm going to receive it next, or not, or what? And what's with the alphabetical order? WTF is going on? Argh.

Anyway, I just got out "Amores Perros", presumably due to this alphabetical ordering thing. Not what I wanted at all. What a mess.

A week of Bertiespam

We're in the run-up to a general election here in Ireland, and I live in Bertie's constituency. For the past year or so, things have been pretty quiet, but in the last week there's been a sudden flurry of activity and direct postal mail from Bertie's office -- and from many departments of local government, too:

Mon Apr 23:

  • Fianna Fail: "Fianna Fail delivers on education in Dublin Central", tabloid newspaper.

  • direct from the office of Bertie: a photocopied letter from the Environmental Health Officers of Dublin City Council about the standards of rented houses "in my area".

Tues Apr 24:

  • HSE: "Parents Who Listen, Protect" leaflet, a full-colour glossy handbook "on building good communication in families and communities" "as part of a national initiative on child protection".

  • Dept of Environment: a leaflet on the "National Climate Change Strategy, 2007-2012, Main Points". Printed on recycled paper, naturally ;)

Fri Apr 27:

  • Fianna Fail Senator Cyprian Brady: "dear resident, please vote for me" -- one-page full-colour glossy.

  • Spring 2007 "Central News", "Official Voice of Fianna Fail in Dublin Central", a 16-page tabloid newspaper, featuring stories like "Smithfield: the Temple Bar of the Northside" (like Temple Bar, but with more winos and Children's Court, and less stuff!)

Mon Apr 30:

  • HSE: "Need a doctor urgently? Call D-DOC out-of-hours GP service", full-colour glossy leaflet.

  • from Bertie: Evening of Election Letter. "Good evening constituents" etc.

It's a veritable flood of full-colour glossies! Could be worse, I suppose -- I hear the PDs have been blanketing selected Dublin constituencies in free books. However I suspect grimy Dublin 7 is a little off their list (see "winos", above).

It's worth noting that a good half of this flood (for which I've coined the term Bertiespam) isn't from Bertie's constituency office -- it's from government departments like the HSE and the Department of Environment. It's funny that we hadn't heard a peep from them all year, then once an election looms -- "here come the voters! look busy!" ;)

What Bertiespam have you been getting?

Hog’s Chip

Hey Google --

Since Fido.ie is throwing errors at me, and since you're probably a more searchable (and more global) database anyway -- the Trovan FDX-B RFID transponder number 956000000659388 is that of "Hog Dempsey", a small female black and white cat, whose owners can be contacted via any address on this page. Cheers!

HOWTO do a DOS-based BIOS upgrade without Windows

Wow, I can't believe I still have to do this in 2007 -- Taiwan really needs to discover FreeDOS! Here's how to run a DOS BIOS update on a PC without using Windows (in my case, it's a Dell laptop).

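  • unpack the boot floppy image (FDSTD.288.gz) and loop-mount it; the mount point has to exist first:
  gunzip FDSTD.288.gz
  mkdir -p /tmp/bootiso
  sudo mount -t msdos -o loop `pwd`/FDSTD.288 /tmp/bootiso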
  • ensure there's enough space, and copy the app into the disk image:
  df /tmp/bootiso
  sudo cp ME051A10.EXE /tmp/bootiso
  • Then make an ISO, using mkisofs' "-b" option to ensure it's bootable:
  mkdir /tmp/floppycopy
  cp -Rp /tmp/bootiso/* /tmp/floppycopy
  cp -p FDSTD.288 /tmp/floppycopy
  mkisofs -pad -b FDSTD.288 -R -o /tmp/cd.iso /tmp/floppycopy
  • And burn it:
  sudo umount /tmp/bootiso
  sudo cdrecord dev=0,0,0 -pad -v -eject /tmp/cd.iso
  • Now, take the burned CDROM, and boot it.

Answer "N" to all questions when booting, otherwise you're likely to see an error like "Cannot operate in Protected environment" when you run the BIOS update.

Thanks to the Motherboard Flash Boot CD from Linux Mini HOWTO; very helpful. I hope the next time I have to do this, they just issue a bootable ISO image instead...

Update, Sep 2013:

Wayno Guerrini emailed to say: 'I used your recipe to update the bios on a old Dell Dimension 8400. Worked like a champ, with a couple of modifications. I am running 64 bit debian wheezy.

apparently the mkisofs has been replaced by genisoimage. Syntax the same.

instead of cdrecord I had to use wodim: sudo wodim dev=/dev/sg1 -pad -v -eject /tmp/cd.iso

Thank you. Recipe worked very well. I will point people to this article, but add the changes as appropriate to my website.'

Using qpsmtpd for traps.spamassassin.org

Like many anti-spam systems these days, SpamAssassin operates a network of spamtraps. One set of these runs off traps.SpamAssassin.org, a server kindly donated by ISP Sonic.net.

Large-scale spam-trapping systems like this are generally run in quite a secretive manner, but we're an open source project -- so it may be interesting if I give some details of our setup. Here's a potted history of how this spamtrap server has run over the years...

The beginning

The architecture was initially very simple. The MX was Postfix, delivering to the "trapper" user, which in turn ran procmail, which directly ran a perl script. This perl script then performed the trap actions, namely: DoS prevention, discarding viruses and malware, discarding backscatter bounces, extraction and cleanup of the incoming mails, then onward reporting, archival, and further distribution.

Given that this was a target for spam -- and we want as much spam as possible here! -- this would predictably run into load issues. Right at the beginning, back in around 2001/2002, I ran this on our shared server, where it pretty quickly caused trouble for delivery of other, more useful mail. It was around this time that Sonic kindly donated the server.

With dedicated hardware, we weren't seeing much trouble -- it was enough to just wait for the few hours for a traffic spike to pass, and the Postfix queue would then clear.

Clearing the queues

After a few months, though, this wasn't enough -- the queue would get consistently clogged, and the backlog became enough to result in the incoming spam being delayed for days before it made it from the MX to the trap archives. For a spamtrap, you want fresh spam, but not necessarily all spam -- so I installed a cron job to simply clear the queue on a nightly basis. (I also had to restart the Postfix server, too, since it'd occasionally get hung and stop accepting connections on port 25, presumably due to load issues.)

IPC::DirQueue

The next problem was that the procmail/perl script end couldn't process mail fast enough for the MTA to keep up with the incoming connections, with follow-on problems caused by the load from the perl script impacting the MX's activity. To work around these, I designed a new queueing backend, based around IPC::DirQueue. This allowed a new split architecture: the procmail-run perl script became extremely lightweight, delivering all inbound mail to a dirqueue and exiting quickly, which let the MX get back to the next inbound spam message; the trap processing script was then split into a web of dirqueues, allowing each individual part of the trap backend pipeline to operate independently.

There were several benefits to this:

  1. Since dirqueues operate as a batch-processing model, load spikes become irrelevant; the load incurred is limited by how many dequeuer processes are run.
  2. The time taken in backend tasks becomes irrelevant to the MX throughput, since that is bottlenecked only by the lightweight perl script and its write speed to the "incoming" dirqueue.
  3. By splitting the backend work into multiple queues, outages in the spam-reporting systems or onward forwardings become much less of a problem, since they won't affect inbound spam, archival, outbound delivery to other reporting systems, forwards, etc.

Again, the dirqueues were cleared on a frequent basis, to discard the "spiky" traffic and ensure we were just seeing samples of the freshest spam. The dirqueues use a tmpfs as the backing storage directory, so it never hits the disk at all.
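To make the queueing model a bit more concrete, here's a minimal Python sketch of a directory-backed queue in the same spirit. It is not IPC::DirQueue's actual on-disk layout or API: the class, the directory names and the method names are all my own assumptions, and it skips things a real implementation needs (locking of stale claims, metadata, retries).

```python
import os
import tempfile
import time
import uuid


class DirQueue:
    """A tiny directory-backed queue: one file per message."""

    def __init__(self, root):
        self.tmp = os.path.join(root, "tmp")        # partially written files
        self.ready = os.path.join(root, "ready")    # waiting for a worker
        self.active = os.path.join(root, "active")  # claimed by a worker
        for d in (self.tmp, self.ready, self.active):
            os.makedirs(d, exist_ok=True)

    def enqueue(self, data):
        # The timestamp prefix keeps listings in rough arrival order.
        name = "%020d-%s" % (time.time_ns(), uuid.uuid4().hex)
        tmp_path = os.path.join(self.tmp, name)
        with open(tmp_path, "wb") as fh:
            fh.write(data)
        # rename() within one filesystem is atomic, so workers never
        # see a half-written message.
        os.rename(tmp_path, os.path.join(self.ready, name))
        return name

    def dequeue(self):
        for name in sorted(os.listdir(self.ready)):
            claimed = os.path.join(self.active, name)
            try:
                os.rename(os.path.join(self.ready, name), claimed)
            except FileNotFoundError:
                continue  # another worker claimed it first
            with open(claimed, "rb") as fh:
                data = fh.read()
            os.remove(claimed)
            return data
        return None  # queue is empty

    def clear(self):
        """Drop everything still waiting; returns how many were dropped."""
        dropped = 0
        for name in os.listdir(self.ready):
            try:
                os.remove(os.path.join(self.ready, name))
                dropped += 1
            except FileNotFoundError:
                pass
        return dropped


with tempfile.TemporaryDirectory() as root:
    q = DirQueue(root)
    q.enqueue(b"spam sample 1")
    print(q.dequeue())
    q.enqueue(b"spam sample 2")
    print(q.clear(), q.dequeue())
```

The enqueue side does almost no work, which is what lets the MX-facing script exit quickly; how much load the backend generates is then decided by how many dequeuer processes you choose to run. Pointing `root` at a tmpfs mount gives the "never hits the disk" behaviour, at the cost of losing the queue on reboot, which is fine for a spamtrap.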

This worked pretty well for several years -- from 80 megabytes of spam per day to the current level, which is around 130MB per day. However, we still occasionally saw problems from load spikes, where high load caused the traps to refuse incoming SMTP connections -- purely because the volume of inbound connections was too high for the Postfix MX to accept them all in a timely fashion.

qpsmtpd

Last weekend, I had a go at a project I'd been thinking of trying out for a long time -- switching from Postfix to qpsmtpd. A while back, Matt Sergeant rewrote qpsmtpd to use Danga::Socket, Danga Interactive / Six Apart's insanely scalable event-driven asynchronous socket class, as used in mogilefsd, perlbal and djabberd. This article notes that 'two large antispam companies' high-traffic spam traps have used this effectively since the second quarter of 2005, delivering concurrency as high as 10,000 on some occasions', so it seemed likely to work ;)

Sure enough, results have been great... we now have a pure-perl system handling heavy volumes without breaking a sweat, certainly compared to the previous system. qpsmtpd's plugin system was elegant, allowing me to annotate inbound spam with more details of the SMTP transaction, write plugins to deliver mail to a dirqueue directly instead of to an MTA, and do some conditional code (ie. basic "deliver this RCPT TO to this queue") where needed.

Full details are over on the QpsmtpdSpamtrap page on the taint.org wiki, for the curious.

Don’t worry about Blacklist.ie

Irish techies -- wondering what the next website to put the fear into your parents will be? Here it is: Blacklist.ie. It's been getting a bit of coverage from the Irish technology press recently, it seems, as the new site from IE Internet.

(IE Internet are the Irish internet company that puts out a press release every month or so telling us how much of their mail is being filtered as spam, which Silicon Republic et al dutifully report as news, month after month.)

I got a call from my mother last week, telling me that she'd been "blacklisted", and asking how to fix it. Sure enough, when I found out that she'd heard this on blacklist.ie, I went to the site, and her IP address was indeed listed -- as was mine:

The IP address 212.2.169.61 is blacklisted.

RBLs checked:

Spam Haus not listed

Spam Cop not listed

Mailwall RBL not listed

Abuse At not listed

SORBS not listed

NJABL listed: Dynamic/Residential IP range listed by NJABL dynablock - http://njabl.org/dynablock.html

510 SG not listed

Naturally, that IP is listed -- it's entirely ok for a home-user broadband machine to appear in SORBS or NJABL as a dynablock-listed IP. (Dynablock, for those who don't know, is a set of records for addresses which are known to be residential/end-user "dynamic" addresses, rather than mail relays -- so obviously most end-user desktop machines would fall under this category.)
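For the curious, checking an address against a DNS blocklist is just a DNS lookup: the IPv4 octets are reversed and the list's zone is appended, and an answer (conventionally an address in 127.0.0.0/8) means "listed". Here's a minimal Python sketch that builds the query name. The zone used is a placeholder, not a real list, and the sketch deliberately stops short of doing the network lookup.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    # IPv4Address() validates the input and raises ValueError on junk.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


# "dnsbl.example.org" is a hypothetical zone, for illustration only.
print(dnsbl_query_name("212.2.169.61", "dnsbl.example.org"))
```

A site like this one presumably runs that lookup against each list in turn. What the lookup can't tell you on its own is why an address is listed, which is exactly the distinction between a dynamic-range listing and a spam-source listing that gets lost in a big red warning.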

Unfortunately, this distinction isn't mentioned anywhere on the blacklist.ie page... just a large, red, "The IP address is blacklisted" warning.

Worried readers might then reasonably go on to read the site's Frequently Asked Questions list -- which, incredibly, includes a helpful suggestion that you sign up with IE Internet to avoid being listed in future! I'd be curious how that's supposed to help a home user get off the NJABL dynablock list... a little fishy, if you ask me!

Bar Camp Dublin next weekend

Dublin hackers/software people -- don't forget! Bar Camp Dublin is happening on April 21st -- that's 9 days from now.

It should be interesting -- there are 93 attendees signed up already, and I see a good few familiar names I haven't run into in a while! The last Bar Camp was a good opportunity to meet up for some very informal talks, and this looks likely to be the same.

Sign up here, go on...

Screenclick improve their site

Yay! They now have a proper queue! Also member reviews and other improvements -- it seems a lot better.

Can't figure out how to change my password, though ;)

Don’t vote Green in Dublin Central!

I've long held green views, and have always voted green -- I believe climate change, damage to the environment and pollution are extremely serious problems, especially for Ireland. At the same time, I also believe that science and technology has a key place in a better, greener future -- a Viridian, bright green / electric green viewpoint, in other words.

Given this, I was really shocked and appalled to hear (via the lovely C) of an interview on Today FM with Patricia McKenna, a Green Party candidate for my local constituency of Dublin Central -- one I've voted for before, no less! -- in which she revealed that she believes in the thoroughly discredited scaremongering regarding a link between the MMR vaccine and autism, and has taken the appallingly irresponsible position of not allowing her children to be vaccinated.

This blog post discusses the interview, which was broadcast on Today FM's The Last Word show on Tuesday 13 March. Here's an archived podcast of that interview so you can listen to it yourself, and here's a local copy of that WMV file in case that first link expires any time soon.

Here's a transcript of the part of the interview once the issue of vaccination is brought up. Matt Cooper is the host of the show. Keith Redmond is an opposing candidate, for the PDs. The timestamps are in minutes and seconds from the start of the audio file.

  • 8:30: Patricia McKenna: Parents have the right to choose what they opt to do, and in relation to some vaccinations, there are serious question marks hanging over them but that's not what we're talking about here...

  • 8:44: Matt Cooper (clearly annoyed): No it's not, but now that it's up there, couldn't it be irresponsible for parents not to vaccinate children against serious issues (sic), if they don't have reputable scientific facts to back up the decision not to vaccinate?

  • 8:54: Patricia McKenna: Many parents in this country have chosen not to vaccinate their children in relation to the MMR because of the links to autism.

  • 9:00: Matt Cooper: Utterly untrue, totally unproven, absolutely bogus and false.

  • 9:02: Patricia McKenna: Hold on a second...

  • 9:03: Matt Cooper: Andrew Wakefield has been utterly and totally discredited in relation to that. Anyone who doesn't give the MMR vaccine to their children because of a fear of autism is almost in danger of endangering their child themselves. We're going to have a rise of measles again in this country because of people not actually giving the vaccine.

  • 9:17: Patricia McKenna: First of all, we're moving away from the issue...

  • 9:22: Matt Cooper: Yeah we are, but it's come up now, let's deal with it...

  • 9:23: Patricia McKenna: It's come up, right. Eh, have you had the measles? I've had the measles, and I've got over them well, I have a strong immune system, my 10 year old son has had the measles...

  • 9:30: Matt Cooper: And you are aware that unhandled the measles can have very serious side effects?

  • 9:33: Patricia McKenna: Look -- the side effects that are linked to the measles are in relation to... there are other things linked to it in relation to the child's well being initially. Now you just look at the number of people when you were young, all of your peers I would say have had the measles as with mine, and I think we have a tendency to over-indulge in vaccinating our children and vaccinating ourselves, because what we need -- our immune systems are getting weaker and weaker by the day, it's a -- I think we need to be very careful about how we actually approach this so that when medicines are necessary, we will not be immune to them...

  • 10:08: Matt Cooper (interrupting): Do you know that children have died of the measles in this country in the last 5 years?

  • Keith Redmond: because of views like that.

  • Patricia McKenna: Well I'm saying is that, as far as I'm concerned...

  • 10:18: Matt Cooper (repeats): Do you know that children have died of the measles in this country in the last 5 years?

  • 10:30: Patricia McKenna: The children that have died of the measles because of other complications (sic), not the measles themselves.

  • Keith Redmond: that have not been vaccinated.

  • Patricia McKenna: Not the measles themselves, but other complications, right? Now if you're saying that parents should -- it's a bit like --

  • Keith Redmond: Matt, can I just come back to...

  • 10:32: Matt Cooper: Sorry, one second Keith. Would you also concede Patricia, that there is absolutely no link between the MMR and autism, that that link was a bogus link put up by Andrew Wakefield who has been completely and utterly discredited and it has done an awful lot of damage, the misrepresentation of his views in relation to the MMR and autism.

  • 10:50: Patricia McKenna: Well in relation to the MMR, I am not satisfied that it's safe, and I am not satisfied with the idea of lumping a whole lot of vaccines -- different vaccinations together en masse, inducing them (sic) to our children -- but having said that, parents should have the right to choose and decide what is best for their children...

  • 11:06: Matt Cooper: But would you concede that Andrew Wakefield, who is the man that pushed that whole agenda, was exposed as a fraud?

  • 11:11: Patricia McKenna: But the jury is still out in relation to...

  • 11:15: Matt Cooper: No, it's not.

  • 11:16: Patricia McKenna: Yeah well I'm sorry but the jury is still out in relation to how safe the MMR is. And I think it's unfair to label all parents who decide for their own children's safety, that they may not want to go down the route of vaccination, that they're being irresponsible, because I wouldn't consider myself irresponsible, I would consider I want what's best for my child.

  • 11:37: Keith Redmond: [again says something]

  • Matt Cooper: Give Keith a chance to come in.

  • 11:41: Keith Redmond: This totally exemplifies the Greens' approach to any kind of science. We have a woman there who knows, in her heart of hearts, that her argument is wrong but refuses to admit it because it relies on science. Now, we have exactly the same issue with fluoridation -- we know the science, we know the facts, and we still have this scaremongering every now and again. And the Green Party are totally irresponsible and you're right, they are frightening parents across the country right now and it's absolutely reprehensible.

My god, this insanity has me agreeing with a feckin' PD!

This is luddism, pure and simple. Matt Cooper is spot on the money -- children are dying in Dublin because of this "my child, my rules" selfishness and simple inability to understand the science surrounding vaccination as a public health policy.

This is appalling. To put it bluntly, there is no fucking way I'll be voting Green if this kind of cargo-cult, anti-science superstition is the kind of shite they're espousing these days. ...and if you think I'm feeling strongly about this, you should hear my (zoologist) wife.

But it goes on -- here's a letter to the Irish Independent on this issue from Feb 9 2007, which raises another worrying factor:

... until two days ago, there was a statement on the Green Party website informing voters that there were "serious question marks about the benefit of mass vaccination programs".

Furthermore, the party promised that there would be a "major review" of vaccination if they were returned to office.

Now that these statements have apparently been removed from the Green party website are we to take it that they are no longer Green policy?

This blog posting at Winds and Breezes also notes this. So -- is this official Green policy or not?

Update: In the comments, it was noted that McKenna is pretty much acting alone in this; it, apparently, is not Green Party policy at all. I've updated the title to reflect that it's only one constituency's candidate that needs to be shunned.

Also, Conor O'Neill has a great idea over here:

I was thinking further on this yesterday and I realised what the Greens need to do in order to be taken seriously... They need to become the “Party of Science”. Proper environmentalism is based on rigorous science and strategic thinking. Every policy they define should be backed up with rock-solid science and a detailed long-term financial analysis proving why it is in our best interests to adopt them.

Man, I would love to see that!

Eircom broadband?

I'm moving house. Naturally, first priority after getting the keys is getting the broadband set up ;)

Current broadband: BT DSL. Supposedly "up to" 3Mbps -- however, as with most DSL connections in Ireland, it's rate-adaptive RADSL, which means it trades off connection speed against distance to exchange and line quality.

Sadly, this has really deteriorated since the last time I checked! A "bing" test between the BT-supplied DSL router and the far end looks like this:

BING    10.18.72.1 (10.18.72.1) and 193.95.142.243 (193.95.142.243)
        44 and 108 data bytes (1024 bits)
193.95.142.243: minimum delay difference is zero, can't estimate link throughput
193.95.142.243:  6.966Mbps 0.147ms 0.143555us/bit
193.95.142.243: minimum delay difference is zero, can't estimate link throughput
193.95.142.243: 19.692Mbps 0.052ms 0.050781us/bit
193.95.142.243:  4.697Mbps 0.218ms 0.212891us/bit
193.95.142.243:  3.261Mbps 0.314ms 0.306641us/bit
193.95.142.243:  3.170Mbps 0.323ms 0.315430us/bit
193.95.142.243:  2.479Mbps 0.413ms 0.403320us/bit
193.95.142.243:  2.723Mbps 0.376ms 0.367187us/bit
193.95.142.243:  2.688Mbps 0.381ms 0.372070us/bit
193.95.142.243:  2.716Mbps 0.377ms 0.368164us/bit
193.95.142.243:  2.065Mbps 0.496ms 0.484375us/bit
193.95.142.243:  1.984Mbps 0.516ms 0.503906us/bit
193.95.142.243:  1.270Mbps 0.806ms 0.787109us/bit
193.95.142.243:  1.017Mbps 1.007ms 0.983398us/bit
193.95.142.243:  1.002Mbps 1.022ms 0.998047us/bit
193.95.142.243:  1.008Mbps 1.016ms 0.992187us/bit
193.95.142.243: 983.670Kbps 1.041ms 1.016602us/bit
193.95.142.243: 993.210Kbps 1.031ms 1.006836us/bit
193.95.142.243: 987.464Kbps 1.037ms 1.012695us/bit

--- 10.18.72.1 statistics ---
bytes   out    in   dup  loss   rtt (ms): min       avg       max   std dev
   44   762   758          0%           2.524     3.858    19.083     2.194
  108   762   762          0%           2.639     4.187    58.273     3.079

--- 193.95.142.243 statistics ---
bytes   out    in   dup  loss   rtt (ms): min       avg       max   std dev
   44   762   761          0%          13.061    20.025    78.689     8.226
  108   762   760          0%          14.213    17.954    61.137     4.697

--- estimated link characteristics ---
host                              bandwidth       ms
193.95.142.243                      987.464Kbps      10.536

987Kbps is not 3Mbps any more, not by a long shot. I'd say I now have a lot of new friends adding contention at the ol' DSLAM. I'm paying way too much money for what I'm getting :(
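For anyone wondering where that final figure comes from, here is a rough Python sketch of the arithmetic. It is a reading of the numbers in the output above, not a description of bing's source code; the function name and argument layout are made up for illustration. The idea is that the extra round-trip delay caused by the larger packet, measured to the far host and with the near host's share subtracted, is attributed to the link in between.

```python
def estimate_link_kbps(near_min_ms, far_min_ms, small_bytes=44, large_bytes=108):
    """Estimate link throughput from minimum RTTs at two packet sizes.

    near_min_ms and far_min_ms are (small_packet_rtt, large_packet_rtt)
    pairs in milliseconds for the near and far ends of the link.
    """
    # Extra payload bits, counted in both directions of the round trip:
    # (108 - 44) bytes * 8 bits * 2 = 1024 bits, as in the bing header line.
    extra_bits = (large_bytes - small_bytes) * 8 * 2
    # How much longer the large packet takes than the small one.
    near_delta = near_min_ms[1] - near_min_ms[0]
    far_delta = far_min_ms[1] - far_min_ms[0]
    # The part of that extra delay attributable to the link itself.
    link_delay_ms = far_delta - near_delta
    # Bits per millisecond is the same thing as kilobits per second.
    return extra_bits / link_delay_ms


# Minimum RTTs copied from the two statistics blocks above.
kbps = estimate_link_kbps(near_min_ms=(2.524, 2.639), far_min_ms=(13.061, 14.213))
print(f"{kbps:.3f} Kbps")
```

The delay difference works out at 1.037ms, which matches the last line of the per-sample output.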

(Update: actually, it may not be contention. Judging by boards.ie traffic, high-contention situations in Ireland are usually faster in the mornings and daytime, then slower from 4pm-9pm as the commuters and kids get home -- however, this slowdown is pretty consistent across all times of day.)

(Update 2: as of right now, late afternoon on Apr 12, it's the worst I've seen it -- packet rates of 600Kbps, and packet loss of 5%-20%.)

On top of this, they have the really annoying daily disconnection policy, which I have hacked around with IPv6 and a VPN, but which still manages to waste my time and cause aggravation, even after frickin' months of pissing about.

For this, and the packaged phone service, I'm paying just under EUR 60 per month, including all call charges and VAT.

At that price, Eircom are offering a pretty good bundle -- free connection, free modem, 2Mbps downstream, 256Kbps upstream, unlimited free local and national calls at all times, 5% off calls to mobiles, 10c/min calls to the UK and US.

Now, a drop to 2Mbps may seem a lot, but bear in mind I'm getting just under 1 right now! I'm pretty sure the new gaff will have similar-quality lines and exchanges. Also, if I get the 2Mbps line, and the attenuation and S/N statistics indicate that it can support 3Mbps, I can always upgrade pretty easily.

The only problem now is getting over my revulsion at buying from Eircom, ugh...

Am I missing something? Does that Eircom bundle not include line rental maybe?

About the title change

The eagle-eyed may have spotted a change that took place a month or two ago in the taint.org configuration -- I ditched the old weblog tagline.

Previously, this weblog was titled "taint.org: Happy Software Prole". This title had been in place since around October 2003, when Daniel Lyons wrote a particularly idiotic article for Forbes entitled "Linux's Hit Men", which I took umbrage to:

Here we go again -- the old 'free software is communism' line [...] The article goes on to bemoan how software companies who write proprietary extensions into GPL-licensed software, have to comply with the terms of the license. It's all a bit of an obvious dig -- but I am looking forward to the follow-up article -- that's the one where the author bemoans how commercial software companies send out their 'enforcers' to extort money from companies who don't bother paying the royalties and runtime license fees their licenses require.

As a free/open-source-software guy, I happily adopted 'happy software prole' as an absurd tagline, in the spirit of detournement. Fast-forward to 3.5 years on, however, and I'd say most people can't even remember the Forbes article, or that Daniel Lyons guy! So that tagline was a bit old and busted, really.

On top of this, I'd noticed something I do in my weblog reading -- I've started renaming blogs in the feed reader from their fancy title, to simply the name of the author.

I've found that when reading blogs, I'm interested in who's writing. When skimming through the feeds of a morning, having to spend 5 seconds to recall that "ByteSurgery.com" is Robin Blandford is just a wee bit superfluous, sorry Robin. ;)

As a favour for readers, I've saved them the trouble, and renamed the blog to be quite explicit about who's writing; the taint.org tagline is now just "taint.org: Justin Mason's Weblog". Let's face it -- it's a bit functional. Hopefully it's helpful, though!

(And finally, it gives me the edge in the ongoing Google war against the non-me "Justin Masons" out there... and against a heart surgeon and a Texan basketball player, I need it. ;)

A recycling puzzle

Myself and Tom were in a taxi last night, stopped at a stop light, when I noticed something odd.

A girl, about 20 or so, walking along the path stopped beside a bag of recycled rubbish, and bent over as if she was tying her shoelace. Instead of fixing her lace, though, she quickly ripped a hole in the (transparent) plastic bag, grabbed a crumpled Fanta can, and walked off.

WTF? Anyone got any theories?

Coworking.IE

Coworking.ie is a new community-driven coworking group-blog and promotion site, set up by Jason Roe.

Coworking's a pretty cool idea -- 'a movement to create a community of cafe-like collaboration spaces for developers, writers and independents.' Great news for us teleworkers.

I've subscribed -- it'll be interesting to track development of this concept, in Ireland and elsewhere...

New list for Irish users of MythTV

MythTV is a pretty great product, once you get it working -- however, it can be labour-intensive, involving lots of local knowledge to deal with the ins and outs of each area's TV provider, cable service, etc.

To that end, we've recently set up a new mailing list: mythtv-ireland, a list for discussion of topics of interest for MythTV users in Ireland.

Particularly on-topic:

  • the NTL frequencies list for areas in Ireland

  • hacks to scrape the Channel 6 schedule from their website

  • dealing with the NTL Digital set-top box

Sign up, if you're interested!

Twitter and del.icio.us

Walter Higgins says:

It's just occurred to me why I don't like twitter - It doesn't fulfill any need that isn't already fulfilled by del.icio.us. I usually post a note alongside each bookmark which lets me micro-blog (post short comments without having to think too much). If I want to signal to someone to take a look at the bookmarked item I just tag it for:[nameofperson] which I suppose you could loosely call 'chat'. Since I gave up personal blogging, del.icio.us has fulfilled a need for short-hand blogging. Thinking about it - twitter is like del.icio.us but without the bookmarks - viewed in that light it really is hard to understand why anyone would use twitter.

To my mind, though, there's a big difference:

  • My del.icio.us page is where I post things I'm reading, and things I think others may be interested in reading;

  • My twitter page is where I post things I'm doing, and chat.

There's no way I'd try to hold a conversation in my del.icio.us bookmarks! ;) Different tools for different uses.

Geeking out on the ‘leccy bill

A good post from Lars Wirzenius on measuring the electricity consumption of his computer hardware. Here's a previous post of mine on the subject.

With the rising cost of energy, a keenness to reduce consumption for green purposes, and an overweening nerdity in general, I did some more investigation around my house recently.

I have a pretty typical Irish electricity meter; it contains a visible disc with a red dot, which spins at a speed proportional to power usage. (There's a good pic of something similar at the Wikipedia page).

The fuse-board works out as follows (discarding the boring ones like the house alarm etc.):

  • Fuse 7 - gas-fired central heating (on), fridge (on), kitchen power sockets

  • Fuse 8 - TV in standby, idle PVR, Wii in standby, digital cable set-top box, washing machine

  • Fuse 9 - telephone, DSL router, Linksys WRT54G AP/router

  • Fuse 10 - bedroom sockets, home office with laptop, printer, speakers, laptop-server etc.

The approach was simply to turn off the house fuses at the fuse board, one by one, and measure how long it took the disc to make a full revolution; then invert that (1/n) to convert from units of time over a static power value, to some notional unit of power consumption over a static time interval (I haven't figured out how to convert to kW or anything like that, they're just makey-uppy units).

Fuses                       Time/power                 Power/time
Baseline (all fuses on)     22.71 seconds              0.0440
Fuse 7 off                  43.03                      0.0232
Fuses 7 and 8 off           57.92                      0.0172
Fuse 7, 8 and 9 off         84.88                      0.0117
Fuse 7, 8, 9, and 10 off    ~20 minutes (I'd guess)    0.0008?

(I stopped measuring on the last one and just estimated; it was crawling around.)
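On the units question: if the meter has a "revolutions per kWh" constant printed on its faceplate (many disc meters do, but that is an assumption about this one), the revolution time converts directly to watts. A minimal Python sketch, with a hypothetical function name and a placeholder constant:

```python
def watts_from_revolution(seconds_per_rev, revs_per_kwh):
    """Convert the time for one disc revolution into average power in watts.

    revs_per_kwh is the meter constant; the value must be read off the
    actual meter, nothing here knows what it is.
    """
    kwh_per_rev = 1.0 / revs_per_kwh          # energy used per revolution
    hours_per_rev = seconds_per_rev / 3600.0  # time taken per revolution
    return kwh_per_rev / hours_per_rev * 1000.0  # kW -> W


# 600 is a placeholder meter constant, NOT a value taken from the post.
HYPOTHETICAL_REVS_PER_KWH = 600
print(watts_from_revolution(22.71, HYPOTHETICAL_REVS_PER_KWH))
```

Since the conversion is just a constant factor, the relative comparisons between fuses hold regardless of the real meter constant.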

Breaking out the individual fuses, that works out as:

Fuse                                               Power/time
Fuse 7 (central heating, fridge, kitchen bits)     0.0208
Fuse 8 (TV, Wii, set-top box, washing machine)     0.0060
Fuse 9 (phones, routers)                           0.0055
Fuse 10 (home office, bedrooms)                    0.0109

Good results already: (a) it was pretty clear that fuse 7 was doing all the quotidian legwork, eating the majority of the power, and (b) the TV equipment and internet/wifi infrastructure was pretty good at low-power operation (yay). However (c) the computer bits aren't so great, but still only half the power consumption of the kitchen bits.
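The per-fuse breakdown is just successive subtraction of the inverted timings. A small Python sketch of that arithmetic, using the measured times from the first table (the variable names are illustrative, and the 20-minute figure is the guess noted above):

```python
# (label, seconds per disc revolution), in the order the fuses were switched off.
timings = [
    ("baseline", 22.71),
    ("fuse 7 off", 43.03),
    ("fuses 7-8 off", 57.92),
    ("fuses 7-9 off", 84.88),
    ("fuses 7-10 off", 20 * 60),  # estimated, not measured
]

# Invert each time to get the notional power/time figure.
rates = [1.0 / seconds for _label, seconds in timings]

# Each fuse's share is the drop in rate when that fuse was switched off.
per_fuse = {
    fuse: rates[i] - rates[i + 1]
    for i, fuse in enumerate([7, 8, 9, 10])
}

for fuse, share in per_fuse.items():
    print(f"Fuse {fuse}: {share:.4f} ({share / rates[0]:.0%} of baseline)")
```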

Breaking down the kitchen consumption further:

Appliances                                          Time/power    Power/time
Gas central heating on (rechecking the baseline)    20.46         0.0488
Gas central heating off                             34.15         0.0292
Washing machine on (40 degree wash)                 13.65         0.0732
Dishwasher on                                       2.53          0.3952
Dishwasher and dehumidifier on                      2.53          0.3952

Subtracting the baseline:

Appliance                      Power/time
Gas central heating            0.0196
Washing machine                0.0244
Dishwasher                     0.3464
Dishwasher and dehumidifier    0.3464

So the central heating, despite being supposedly gas-fired, eats lots of power! I guess this is the electric pump, used to drive the heated water around the house to the radiators. Ah well, I'm not skimping on that ;)

More practically: the dishwasher result is incredible. That's 30 times the power usage of the house's computer hardware. This is a ~7-year-old standard dishwasher; obviously green power consumption wasn't an issue back then! We're running it less frequently now, obviously; the odd hand-wash of bulky and nearly-clean items helps. With any luck when we move in a few months, we can replace it with a greener model.

The washing machine is about what I would expect, so I'm OK with that.

Also interesting to note that our dehumidifier is unnoticeable in the volume of the dishwasher; I could have tried to work it out properly in isolation, but couldn't be bothered by that stage ;)

Sender Address Verification considered harmful

(as an anti-spam technique, at least.)

Sender-address verification, also known as callback verification, is a technique to verify that mail is being sent with a valid envelope-sender return address. It is supported by Exim and Postfix, among others.
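To make the mechanism concrete, here is a minimal Python sketch of what a callback check does, using only the standard-library smtplib. It is an illustration, not how Exim or Postfix implement it: the function name, the `mx_host` argument and the `smtp_factory` hook are all hypothetical, and the MX lookup for the sender's domain is left out entirely.

```python
import smtplib


def sender_address_accepted(mx_host, envelope_sender,
                            smtp_factory=smtplib.SMTP, timeout=10):
    """Ask mx_host whether it would accept mail addressed to envelope_sender.

    Opens an SMTP session, offers a null-sender message to the address
    being verified, and hangs up before sending any message data.
    """
    conn = smtp_factory(mx_host, 25, timeout=timeout)
    try:
        conn.helo()
        conn.mail("")  # MAIL FROM:<> -- the null sender, as bounces use
        code, _message = conn.rcpt(envelope_sender)
        return 200 <= code < 300  # any 2xx reply means "recipient OK"
    finally:
        try:
            conn.quit()  # no DATA is ever sent
        except smtplib.SMTPException:
            pass  # the far end may already have dropped the connection
```

Note that the check only says the address exists, which is exactly why forging a real person's address defeats it.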

Some view this as a useful anti-spam technique. In my opinion, it's not.

Spam/anti-spam is an adversarial "game". Whenever you're considering anti-spam techniques, it's important to bear in mind game theory, and the possible countermeasures that spammers will respond with. Before SAV became prevalent, spam was often sent using entirely fake sender data; hence the initial attractiveness of SAV. Once SAV became worth evading, the spammers needed to find "real" sender addresses to evade it. And where's the obvious place to find real addresses? On the list of target addresses they're spamming!

Since the spam is now sent using forged sender addresses of "real" people, when a spam bounces (as much of it does), the bounce will be sent back not to an entirely fake address, but to a spam recipient's address.

Hence, the spam recipients now get twice as much mail from each spam run -- spam aimed at them, and bounce blowback from hundreds of spams aimed at others, forged to appear to be from them.

This is the obvious "next move" in response to SAV, which is one reason why we never implemented something like it in SpamAssassin.

On top of this -- it doesn't work well enough anymore. Verizon use SAV. Have you ever heard anyone talk about how great Verizon's spam filtering is? Didn't think so.

(This post is a little late, given that SAV has been used for years now, but better late than never ;)

By the way, it's worth noting that it's still marginally acceptable to use SAV as a general email acceptance policy for your site -- ie. as a way to assert that you're not going to accept mail from people who won't accept mail to the envelope sender address used to deliver it. Just don't be fooled into thinking it's helping the spam problem, or is helping anyone else but yourself.

Finally, this Sender Address Verification is different from what Sendio calls Sender Address Verification. That's just challenge-response, which is crap for an entirely different, and much worse, set of reasons.

Something in the oven

Check out what's cooking chez Mason:

Thrills and spills! I may have to cut down on the extra-curricular activities for a while, so we'd better get SpamAssassin 3.2.0 released before August 21st ;)

Spam volumes at accidental-DoS levels

Both Jeremy Zawodny and Dale Dougherty at O'Reilly Radar are expressing some pretty serious frustration with the current state of SMTP. I have to say, I've been feeling it too.

A couple of months back, our little server came under massive load; this had happened before, and normally in those situations it was a joe-job attack. Switching off all filtering and just collecting the targeted domain's mail in a buffer for later processing would work to ameliorate the problem, by allowing the load to "drain". Not this time, though.

Instead, when I turned off the filtering, the load was still too high -- the massive volume of spam (and spam blowback / backscatter) was simply too much for the Postfix MTA. The MTA could not handle all the connections and SMTP traffic in time to simply collect all the data and store it in a file!

Looking into the "attack" afterwards, once the load was back under control, it looked likely that it wasn't really an attack -- it was just a volume spike. Massive SMTP load, caused by spammers increasing the volume of their output for no apparent reason. (Since then, spam volumes have been increasing still further on a nearly weekly basis.)

This is the effect of botnets -- the number of compromised hosts is now big enough to amplify spam attacks to server-swamping levels. Our server is not a big one, but it serves fewer than 50 users' email I'd say; the user-to-CPU-power ratio is pretty good compared to most ISPs' servers.

So here's the thing. New SMTP-based methods of delivering nonspam email -- whether based on DKIM, SPF, webs of trusted servers, or whatever -- will not be able to operate if they have to compete for TCP connection slots with spammers, since spammers can now swamp the SMTP listener for port 25 with connections. In effect, spam will DDoS legitimate email, no matter what authentication system that legit mail uses to authenticate itself.

This, in my opinion, is a big problem.

What's the fix? A "new SMTP" on a whole different port, where only authed email is permitted? How do you make that DoS-resistant? Ideas?

(Obviously, counting on spammers to notice or care is not a good approach.)

A SpamAssassin rule-discovery algorithm

Just to get a little techie again... here's a short article on a new algorithm I've come up with.

Text-matching rule-based anti-spam systems are pretty common -- SpamAssassin, of course, is probably the most well-known, and of course the proprietary apps built on SpamAssassin also use this. However, other proprietary apps also seem to use similar techniques, such as Symantec's Brightmail and MessageLabs' scanner (hi Matt ;) -- and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It'd be great to automate more of that work. Here's an algorithm I've developed to perform this in a memory-efficient and time-efficient way. I'm quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and "ham" (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple "grep" will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority (80% or so) are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly, which is helpful, using the new 'GrepRenderedBody' mass-check plugin.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, "this is\na \ntest" -> "this is a test"

All pretty basic stuff, and performed by the SpamAssassin "body" rendering process during a "mass-check" operation. A SpamAssassin plugin outputs each message's body string to a log file.
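For readers who want to experiment outside SpamAssassin, the rendering steps can be roughly approximated with Python's standard email package. This is a simplified stand-in, not SpamAssassin's renderer: the function name is made up, it only keeps text/plain parts, and the HTML-rendering step is skipped altogether.

```python
import email
import re


def render_body(raw_message):
    """Return one whitespace-normalised string of a message's visible plain text."""
    msg = email.message_from_string(raw_message)
    chunks = []
    for part in msg.walk():  # walks the decoded MIME structure
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != "text/plain":
            continue  # simplification: HTML and non-textual parts are dropped
        if part.get_content_disposition() == "attachment":
            continue  # not shown to the viewer by default
        payload = part.get_payload(decode=True)  # undoes quoted-printable/base64
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label
            text = payload.decode("utf-8", errors="replace")
        chunks.append(text)
    # Collapse every run of whitespace, including newlines, to a single space.
    return re.sub(r"\s+", " ", " ".join(chunks)).strip()
```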

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it's safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later "collapse patterns" step.

Split the text into "words" -- ie. space-separated chunks of non-whitespace chars. Compress each "word" into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global "token count" hashtable, and add the message ID to the token's entry in a "message subset hit" table.

Next, process the ham set. Perform the same algorithm, except: don't keep the shortened text strings, don't cut at 32KB, and instead of incrementing the "token count" hash entries, simply delete the entries in the "token count" and "message subset hit" tables for all N-grams that are found.
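The extraction steps above can be sketched in Python along these lines. This is an illustrative reimplementation under simplifying assumptions, not the actual seek-phrases-in-corpus code: the class and method names are invented, word IDs are plain integers, and the 32 KB cut is applied to characters rather than bytes.

```python
MAX_TEXT = 32 * 1024  # keep only the first 32K characters of each spam


class NgramTables:
    def __init__(self):
        self.word_to_id = {}   # forward lookup: word -> short ID
        self.id_to_word = []   # reverse lookup: short ID -> word
        self.token_count = {}  # N-gram -> number of spam messages hit
        self.subset_hits = {}  # N-gram -> set of spam message IDs hit
        self.saved_texts = {}  # message ID -> shortened text, for later

    def _compress(self, text):
        """Split on whitespace and map each word to a shared short ID."""
        ids = []
        for word in text.split():
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.id_to_word)
                self.id_to_word.append(word)
            ids.append(self.word_to_id[word])
        return ids

    @staticmethod
    def _ngrams(ids):
        """All 2-word and 3-word N-grams, de-duplicated per message."""
        grams = set()
        for n in (2, 3):
            for i in range(len(ids) - n + 1):
                grams.add(tuple(ids[i:i + n]))
        return grams

    def add_spam(self, msg_id, text):
        text = text[:MAX_TEXT]
        self.saved_texts[msg_id] = text
        for gram in self._ngrams(self._compress(text)):
            self.token_count[gram] = self.token_count.get(gram, 0) + 1
            self.subset_hits.setdefault(gram, set()).add(msg_id)

    def add_ham(self, text):
        # No truncation and no saved copy; any N-gram seen in ham is deleted.
        for gram in self._ngrams(self._compress(text)):
            self.token_count.pop(gram, None)
            self.subset_hits.pop(gram, None)

    def decode(self, gram):
        """Turn an N-gram of IDs back into its space-separated words."""
        return " ".join(self.id_to_word[i] for i in gram)
```

As in the description, all spam has to be fed through `add_spam` before any ham goes through `add_ham`, otherwise a later spam could re-add an N-gram that ham had already ruled out.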

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 -- ie. where even a single ham message was hit -- already discarded
  • the "message subset hit" table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the "message subset hit" table can be replaced with their hashes; we don't need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be listed; discard, especially, entries that occur only once. (We've already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) -- this is a safe shortcut.)
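A minimal Python sketch of this grouping step, together with the ordering by hit-count used when iterating over the subsets afterwards. The function name and the `min_hits` threshold are assumptions for illustration; the inputs are the two tables built during extraction, and a frozenset stands in for the hashed message-subset value.

```python
from collections import defaultdict


def group_by_subset(token_count, subset_hits, min_hits=2):
    """Group N-grams by the exact set of spam messages they hit.

    Returns a list of (hit_count, [ngrams...]) pairs, most frequent first.
    """
    groups = defaultdict(list)
    for gram, count in token_count.items():
        if count < min_hits:
            continue  # too rare to be worth listing; drops one-off N-grams
        groups[frozenset(subset_hits[gram])].append(gram)

    # N-grams are de-duplicated per message, so every N-gram in a group has
    # a hit-count equal to the size of the subset; one value covers them all.
    ranked = [(len(subset), sorted(grams)) for subset, grams in groups.items()]
    ranked.sort(key=lambda item: -item[0])
    return ranked
```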

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset's patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages' shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message's text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it's possible to go in both directions without losing a hit.

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they're all exhausted.

(By the way, the "discard if subsumed" trick is the reason why we extract 3-word N-grams as well as 2-word ones -- it gives faster results than 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)
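Here is a compact Python sketch of the collapse stage as described in this section. It is a naive illustration with invented function names: it rescans every saved text for each candidate character, which is far slower than a real implementation would want, and it assumes every input pattern occurs in at least one saved text.

```python
def hits(pattern, texts):
    """Indexes of the saved texts that contain pattern as a substring."""
    return frozenset(i for i, text in enumerate(texts) if pattern in text)


def expand(pattern, texts):
    """Grow pattern one character at a time, backwards then forwards,
    for as long as it still hits exactly the same set of messages."""
    target = hits(pattern, texts)
    sample = texts[min(target)]  # any one mail that hits the pattern
    start = sample.index(pattern)
    end = start + len(pattern)
    while start > 0 and hits(sample[start - 1:end], texts) == target:
        start -= 1
    while end < len(sample) and hits(sample[start:end + 1], texts) == target:
        end += 1
    return sample[start:end]


def collapse(patterns, texts):
    """Expand each pattern and drop anything subsumed by a longer one."""
    pending = list(patterns)
    output = []
    while pending:
        expanded = expand(pending.pop(0), texts)
        # Input patterns that appear inside the expanded one are redundant.
        pending = [p for p in pending if p not in expanded]
        if any(expanded in done for done in output):
            continue  # already covered by an earlier, longer pattern
        output = [done for done in output if done not in expanded]
        output.append(expanded)
    return output
```

For example, given two spams that share the run " buy cheap pills now ", the N-grams "cheap pills", "buy cheap" and "pills now" all collapse into that single expanded pattern.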

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here's an example of some output from recent "OEM" stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, that final set of patterns can be combined with a SpamAssassin "meta" rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7GHz Pentium M laptop.

(Update:) here's a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <tyokaluassa.com@ultradian.com>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <tyokaluassa.com@ultradian.com>
  To: <xxxxxxxx@jmason.org>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft...Mac...Adobe...Borland...Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It'd be nice to extend this to support /.*/ and /.{0,10}/ -- matching "anys", also known as "gapped alignment" searches in bioinformatics, using algorithms such as Smith-Waterman or Needleman-Wunsch. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. "this is foo", "this is bar", "this is baz" => "this is (foo|bar|baz)", would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you're interested, have improvement ideas, or want more info on how to use it... I'd love to see more people trying it out!

Some credit: I should note that IBM's Chung-Kwei system, presented at CEAS 2004, was the first time I'd heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Irish Blog Awards 2007

Well, that was fun! Taint.org didn't make the shortlists, but I went along anyway just to hang out -- and lots of chat was had accordingly. Got to finally meet up with a few people I'd chatted with online, like Nialler9 -- and with a few old friends I don't get to see often enough: Antoin, Elana, Brendan, Clare Dillon (ex-Iona!), and another ex-Ionian, Aisling Mackey. A good laugh.

Have to say though, it seems a vote from me was the kiss of death in many of the categories: Sarah Carey, Blogorrah, Ireland from a Polish perspective, and (the late lamented) TCAL all got my thumbs-up in the shortlist voting, and all wound up missing out on the chunk'o'lucite. Sorry about that guys. ;)

Thanks again to Damien for organising the whole do! It's great to have an event like this to bring each of our disparate blogs physically together for a bit of community.

By the way I'd like to point out that, in contrast to the Blogorrah Bock the Robber mafiosi, I had a real moustache... ;)

BT’s daily disconnects, revisited

As I noted last year, BT, the ISP I use here in Ireland, disconnects broadband sessions on a daily basis, assigning a new IP address; this is really aggravating to anyone who uses a VPN, such as most telecommuters. Reportedly, this is done to work around deficiencies in their billing system.

A comment from Jeremy on that post suggested something interesting, though:

Just had a very helpful tech support guy on from BT. [... he] told me to restart the modem sometime that will make it convenient for the 24 hour IP change - i.e. restart it at 6am, and then it'll change IP every day at 6am.

I've tested this, and it works. Much more convenient! Now the renumbering and VPN breakage can take place when I want it to -- at the start of the workday, instead of some random point chosen by BT's billing system. Quite an improvement.

To make this useful, here's a script, "reboot-zyxel", which will reboot your Zyxel P-660RU router remotely over the LAN. (It requires perl and curl.)

MailScanner developer in hospital

According to this message, Julian Field, the main developer of MailScanner, was found collapsed at his home last Friday. More details via the SA list:

He is in ICU though stable condition. I'll not go into any details, anyone interested and not on the MS list can read the thread on the MS archive.

Currently any plans for cards and such as are on hold until further instructions are given to the MS list. However Matt Hampton has setup a clustermap at this address.

Matt will also forward any well wishes left on the website along with the map. Visiting the page will show Julian and his family just how far reaching his software is and how many people appreciate his efforts.

Get well, Julian! :(

Script: mythsshimport

Here's a useful script for users of a MythTV box equipped with a PVR-350 MPEG capture/playback card -- mythsshimport:

NAME

mythsshimport - transcode and install video files onto a MythTV box

SYNOPSIS

mythsshimport file1 [file2 ...]

DESCRIPTION

Transcodes video files (AVI, MPEG, MOV, WMV etc.) into MythTV-compatible and PVR-350-optimised MPEG-2 .nuv files, suitable for viewing on a 4:3 screen, then transfers them to the MythTV backend, inserts them into the "recorded programs" listings, and builds seek tables.

All this happens on-the-fly, at faster-than-real-time rates; with a recent CPU in the transcoding box, and over an 802.11b wifi home network, you can start the process and start watching the video within 20 seconds, while it is transcoded and transferred in the background.

SSH is used as the network transport. If you have the CPU power available on the MythTV backend itself, you can run this script there (as the mythtv user) and it will skip the SSH parts entirely.

REQUIREMENTS

  • ssh password-less key access from transcode box into mythtv@mythbox (this could be localhost, if you're transcoding on the mythbox). Test using: "ssh mythtv@mythbox echo hi". If you run this script on the mythbox as the mythtv user, this is not required.

  • mencoder. Tested with 2:0.99+1.0pre7try2+cvs20060117-0ubuntu8 (I swear that's a version string and not just me rolling my head around the keyboard)

  • MythTV. Tested with MythTV 0.20.

  • The "contrib/myth.rebuilddatabase.pl" script from the MythTV source tarball, installed on the mythbox in $PATH: download from svn.mythtv.org.

  • screen(1) installed on the transcoding box, used to keep the mencoder output readable

Download here.

Masonic spam

Wow, here's a new one -- and kind of appropriate, given my surname ;) Masonic spam!

To: xxxxxx at taint.org

Subject: Dear Benefactor Of 2007 Masory Grant,

From: Dr.Lavine Ferdon Ferdon

Date: Wed, 21 Feb 2007 15:40:26 +0100 (CET)

Dear Benefactor Of 2007 Masory Grant, The Freemason society of Bournemout under the jurisdiction of the all Seeing Eye, Master Nicholas Brenner has after series of secret deliberations selected you to be a beneficiary of our 2007 foundation laying grants and also an optional opening at the round table of the Freemason society. These grants are issued every year around the world in accordance with the objective of theFreemasons as stated by Thomas Paine in 1808 which is to ensure the continuous freedom of man and toenhance mans living conditions. We will also advice that these funds which amount to USD2.5million be used to better the lot of man through your own initiative and also we will go further to inform that the open slot to become a Freemason is optional, you can decline the offer. In order to claim your grant, contact the Grand Lodge Office co-secretary Dr.Lavine Ferdon Ferdon Grand Lodge Office Co-Secretary's email: (lavin_ferd_law at excite.com)

Dr.Lavine Ferdon Ferdon,

Co-Secretary Freemason Society of Holdenhurst Road,

Bournemouth.

Sir David Hurley,

Secretary Freemason Society of Holdenhurst Road,

Brilliant. But why Bournemouth?

HOWTO block editing of pages in Moin Moin

A useful Moin Moin anti-spam tip, via Upayavira at the ASF: adding ACLs to pages so that only certain users can edit them. This is an easy way to interfere with the wiki spammers who get past the existing (quite good) Moin Moin anti-spam subsystems. They tend to aim for the common Wiki pages, such as WikiSandBox, RecentChanges, and FrontPage, so if you make those pages uneditable, that'll cause them more trouble -- and hopefully cause them to move on to easier targets, instead of defacing your wiki. Here's how to do it (at least for Moin Moin >= 1.5.1).

Open a shell on the machine where the Moin Moin software is installed. Edit your "wikiconfig.py" file (in my case this is at /home/moinmoin/moin-1.5.1/share/moin/jmwiki/wikiconfig.py), and change the "acl_rights_before" line to read:

    acl_rights_before = u"JustinMason:read,write,delete,revert,admin"

Replace "JustinMason" with your wiki login name, of course.

Create an administrative group of trusted users. Do this by creating a page called "AdminGroup" containing:

#acl All:read
These are the members of this group, who can edit certain restricted pages:
 * JustinMason

Now, for the sensitive pages (like FrontPage etc.), edit each one and add an access-control list line at the top of each page containing:

#acl AdminGroup:read,write All:read

That's it. Users who are not in the AdminGroup will no longer be able to edit those pages. That should help... at least for a while ;)

Update: you should also use this in wikiconfig.py:

    acl_rights_default = u'Known:read,write,revert All:read'

This blocks non-logged-in users from writing to pages.

Irish Blog Awards

A quick note; the Irish Blog Awards shortlisting votes are about to end later today. I've been nominated in the long list (thanks!), for best technology blog -- feel free to vote for me if you like ;)

Update: boo, no shortlisting. Still, probably my own fault, I was a bit too wishy-washy with the vote hustling! Maybe next year...

Odd legal mail

Last week, I received an odd-looking mail from "Claims Administration Center" ClaimsAdministrationCenter /at/ enotice.info, sent to my private email address -- the one listed in an image on http://jmason.org/ (it never gets spam).

The mail reads:

Mittlholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang ("IMR Defendants"), aka Meco, et al. v. IMR, et al., case No. GIC846200.

We are requesting by order of the Court filed with the Superior Court for the County of San Diego, CA, that you post the attached Summary notice as a Public Service Announcement on your web-site.

Below is a link to the PDF Summary Notice (Note: The document is in the .PDF format. To view the documents you will need the Adobe Acrobat Reader)

http://echo.bluehornet.com/ct/ct.php?t=....

This message was intended for: webaddress@jmason.org You were added to the system January 17, 2007. For more information please follow the URL below: http://echo.bluehornet.com/subscribe/source.htm?c=...

Follow the URL below to update your preferences or opt-out: http://echo.bluehornet.com/phase2/survey1/survey.htm?CID=...

Googling for GIC846200, I find it on a cached "civil new filed cases index" page at sandiego.courts.ca.gov:

CASE NUMBER FILE DATE CATEGORY LOCATION

GIC846200 04/21/2005 A72120 - Personal Injury (Other) San Diego MECO vs INTERNATIONAL MEDICAL RESEARCH INCORPORATED

So the case exists. I have no idea who either of the parties are, however.

The URLs in the message were all web-bugged; but bluehornet seem legit in general.

The URL http://www.enotice.info/ times out. Seems to have no spam-related Google Groups hits, although there are a lot of discussions about some iffy-looking class-action suit about Google Adsense.

After quite a bit of discomfort and asking around about the reputation of both bluehornet.com and enotice.info, I eventually succumbed and clicked through. The Summary URL above, after logging my click, redirects to this PDF file, which reads:

This case, called Mittleholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang ('IMR Defendants'), et al., case No. GIC846200, is a class action lawsuit that alleges that the IMR Defendants unlawfully distributed a product containing synthetic chemicals, the presence of which was also concealed from the public as a result of the IMR Defendants' alleged failure to conduct any testing for adulteration by synthetic chemicals, including but not limited to diethylstilbestrol (DES) and warfarin (or coumadin), which is the active chemical in bloodthinners. Defendants deny the allegations. The Court has not formed any opinions concerning the merits of the lawsuit nor has it ruled for or against the Plaintiffs as to any of their claims. The sole purpose of this notice is to inform you of the lawsuit so that you may make an informed decision as to whether you wish to remain in or opt out of this class action.

You have legal rights and choices in this case. You can:

  • Join the case. You do not have to do or pay anything to be part of this case. And, you have to accept the final result in the case.

  • Exclude yourself and file your own lawsuit. If you want your own lawyer, you will have to exclude yourself as set forth below and pay your lawyer's fees and costs.

  • Exclude yourself and not sue. If you do not wish to be part of this case and do not want to bring your own lawsuit, please mail a first class letter stating that you want to be excluded from the Mittleholtz v IMR class action (Case No. GIC846200), or you may fill out the letter available at www.gilardi.com/mittleholtzsettlement. Make sure the letter has your full name, address and signature. Mail it to: PC-SPES Litigation, Class Administrator, c/o Gilardi & Co. LLC, P O Box 8060 San Rafael, CA 94912-8060 by March 23, 2007.

    *This is only a summary. For complete notice and further information go to: www.gilardi.com/mittleholtzsettlement or call the toll-free number 1-877-800-7853.

So in other words, it's hand-targeted unsolicited, but probably not bulk, email, flogging a class-action suit about 'synthetic chemicals' (presumably as opposed to the 'organic' variety). I suspect, given the phrasing in the initial mail, they probably googled for a keyword or company name, and found a hit somewhere in taint.org's 5 years of archives -- hence the PSA request.

In fact, I bet this forwarded story is what they found through Googling. Pity they didn't include a URL for that!

Does sending legal notices like this through email not seem particularly risky, given the lack of reliability of the medium?

An odd situation, all told...

More ‘Small Engine Repair’

Plug plug plug: next week is the 2007 Jameson Dublin International Film Festival -- some great movies being shown, I'm looking forward to it. Most of all, though, I want to recommend Small Engine Repair, which I've written about before. It's being shown in the festival at 6:20 PM on Wed 21st Feb in IFI 1 -- tickets can be booked online here, at EUR 9 apiece.

Writer and director, Niall Heery, won the Breakthrough Talent Award at this year's Irish Film and Television Awards at the weekend. Nice one Niall!

Go see it if you get a chance -- it's a fantastic movie, in my opinion. And be sure to vote for it for the festival's Audience Award...

Wikipedia and rel=”nofollow”

Apparently, Wikipedia has (possibly temporarily) decided to re-add the rel="nofollow" attribute to outbound links from their encyclopedia pages.

There's been a lot of heat and light generated about this, most missing one thing: there's no reason why Google needs to pay attention.

Google, or any other search engine, can treat links in the Wikipedia pages any way they like -- including ignoring 'nofollow', applying extra anti-spam heuristics of their own, or even trusting the links more highly.

'Nofollow' has had pretty much no effect on web-spam, and is now festooned all over weblog posts across the internet, spammed and non-spammed alike. It'd be interesting to see if it's yet flipped to mean a higher correlation with nonspam than spam content...

Update: It appears Wikipedia used 'nofollow' before, so this is not exactly new, either.

more on social whitelisting with OpenID

An interesting post from Simon Willison, noting that he is now publishing a list of "non-spammy" OpenID identities (namely people who posted one or more non-spammy comments to his blog).

I attempted to comment, but my comments haven't appeared -- either they got moderated as irrelevant (I hope not!) or his new anti-comment-spam heuristics are wonky ;) Anyway, I'll publish here instead.

It's possible to publish a whitelist in a "secure" fashion -- allowing third parties to verify against it, without explicitly listing the identities contained. One way is using Google's enchash format. Another is using something like the algorithm in LOAF.
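To illustrate the general idea, here's a minimal sketch of publishing a whitelist as salted hashes. Note that this is not Google's enchash format, nor the LOAF algorithm; it's just the simplest scheme of the same flavour, and the function names, the salt, and the example OpenID URLs are all made up for the example.

```python
import hashlib


def fingerprint(identity, salt):
    """Hash one identity (e.g. an OpenID URL) together with a public salt."""
    return hashlib.sha256(salt + identity.strip().encode("utf-8")).hexdigest()


def publish(identities, salt):
    """Return the set of fingerprints to publish in place of the raw list."""
    return {fingerprint(identity, salt) for identity in identities}


def is_listed(identity, published, salt):
    """A third party checks an identity it already knows against the list."""
    return fingerprint(identity, salt) in published


salt = b"example-salt"
published = publish(["http://alice.example.org/"], salt)
print(is_listed("http://alice.example.org/", published, salt))
print(is_listed("http://mallory.example.net/", published, salt))
```

The weakness of something this simple is that anyone can test guesses against the published hashes, so it only stops the list being read off directly.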

Also, a small group of people (myself included) tried social-network-driven whitelisting a few years back, with IP addresses and email, as the Web-o-Trust.

Social-network-driven whitelisting is not as simple as it first appears. Once someone in the web -- a friend of a friend -- trusts a marginally-spammy identity, and a spam is relayed via that identity, everyone will get the spam, and tracking down the culprit can be hard unless you've designed for that in the first place (this happened in our case, and pretty much killed the experiment). I think you need to use a more complex Advogato-style trust algorithm, and multiple "levels" of outbound trust, instead of the simplistic Web-o-Trust model, to avoid this danger.
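As a rough sketch of what "designing for that in the first place" could look like, the following walks a trust graph to a limited depth and records the chain of introductions for each identity, so that a spammy entry can be traced back to whoever vouched for it. This is not how the Web-o-Trust worked, and it is far simpler than an Advogato-style trust metric; the function name, the graph layout and the identities are hypothetical.

```python
from collections import deque


def trusted_with_paths(trust, root, max_depth):
    """Breadth-first walk of the trust graph from root, at most max_depth
    hops out, returning {identity: chain of who vouched for whom}."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        # Stop expanding once this identity is max_depth hops from the root.
        if len(paths[current]) - 1 >= max_depth:
            continue
        for friend in trust.get(current, ()):
            if friend not in paths:
                paths[friend] = paths[current] + [friend]
                queue.append(friend)
    return paths


trust = {"me": ["alice"], "alice": ["bob"], "bob": ["spammy"]}
print(trusted_with_paths(trust, "me", 2))   # "spammy" is too far away
print(trusted_with_paths(trust, "me", 3)["spammy"])
```

With the depth limit at 2, the marginally-spammy identity never makes it into the whitelist; at 3 it does, but the recorded chain shows that it arrived via bob.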

Basically, my gut feeling is that a web of trust for anti-spam is an attractive concept, possible, but a lot harder than it looks. It's been suggested repeatedly ever since I started writing SpamAssassin, but nobody's yet come up with a working one... that's got to indicate something ;) (Mind you, the main barrier has probably been waiting for workable authentication, which is now in place with DK/SPF/DKIM.)

In the meantime, the concept of a trusted third party who publishes their concept of an identity's reputation -- like Dun and Bradstreet, or Spamhaus -- works very nicely indeed, and is pretty simple and easy to implement.

SpamArchive.org no more

Remember SpamArchive.org, the site that allowed random Internet users to upload their spam? It was set up back in 2002 by CipherTrust, one of the commercial anti-spam vendors, to offer a large, 'standard' database of known spam to be used for testing, developing, and benchmarking anti-spam tools, and for anti-spam researchers. It got a bit of coverage at Slashdot and Wired News at the time.

It never really was too useful for its supposed purposes, though, at least for us in SpamAssassin, since:

  1. it collected submissions from random internet users, without vetting, and therefore couldn't be guaranteed to be 100% valid;

  2. it 'anonymized' the headers too much for the spam to be useful in testing a filter like SpamAssassin, which requires correct header data for valid results;

  3. collecting spam has never been a problem; avoiding it is ;)

Anyway, it looks like CipherTrust/Secure Computing have lost interest, since they've allowed the domain to lapse. It has instead been picked up by a domain speculator:

Domain ID:D134033677-LROR
Domain Name:SPAMARCHIVE.ORG
Created On:30-Nov-2006 18:52:13 UTC
Last Updated On:01-Dec-2006 12:42:26 UTC
Expiration Date:30-Nov-2007 18:52:13 UTC
Sponsoring Registrar:PSI-USA, Inc. dba Domain Robot (R68-LROR)
Status:TRANSFER PROHIBITED
Registrant ID:ABM-9376887
Registrant Name:Robert Farris
Registrant Organization:Virtual Clicks
Registrant Street1:P.O. Box 232471
Registrant Street2:
Registrant Street3:
Registrant City:San Diego
Registrant State/Province:US
Registrant Postal Code:92023
Registrant Country:US
Registrant Phone:+1.7205968887
Registrant Phone Ext.:
Registrant FAX:
Registrant FAX Ext.:
Registrant Email:domain_whois@virtualclicks.com
Name Server:NS1.DIGITAL-DNS-SERVER.COM
Name Server:NS2.DIGITAL-DNS-SERVER.COM

A visit to http://www.spamarchive.org/ now reveals a parking page, which grabs the browser window, forces it to front, maximises it, attempts to bookmark it and add it to the Firefox sidebar -- and who knows what else ;)

apres-Barcamp!

Well, that was great fun -- well worth the trip down. Got to put a load of faces to names, meeting up with a fair few people I've been conversing with online -- and a few I hadn't met before, online or off. Plenty of thought-provoking and interesting chats, too!

My talk went down well, I think. Unfortunately, we didn't quite know how to operate the projector, so the attendees, while they got to hear me talk, didn't get to read the leftmost quarter or so of each slide ;)

To make up for it, here they are:

OpenOffice 2 source (234k), PDF (320k), HTML

(PS: Regarding GUIs for managing EC2 -- a question that came up in the Q&A -- here's one that looks pretty interesting...)

Barcamp!

I was wavering for a minute there, but I've decided to head down to Waterford for Barcamp Ireland - SouthEast -- a bit last-minute, but there you go! Tickets and hotel booked.

I'm hoping to give a quick, 20-minute intro to Amazon's EC2 and S3 web services -- what they are, how they're used, some interesting features and a few gotchas to watch out for.

Also, I'm up for dinner on the Saturday night, given there's a promise of free booze ;)

Any taint.org readers heading down?

Debunking the “cocaine on 100% of Irish banknotes” story

BBC: Cocaine on '100% of Irish euros':

One hundred percent of banknotes in the Republic of Ireland carry traces of cocaine, a new study has found.

Researchers used the latest forensic techniques that would detect even the tiniest fragments to study a batch of 45 used banknotes.

The scientists at Dublin's City University said they were "surprised by their findings".

Also at RTE, Irish Examiner, PhysOrg.com, Bloomberg.com, even at Kazakhstan's KazInform.

This story is (of course) being played widely in the media as "OMG Ireland must use more coke than anywhere else" -- in particular, in comparison with a previous study in the US:

The most recent survey carried out in the US showed 65% of dollar notes were contaminated with cocaine.

The DCU press-release has a few more details:

Using a technique involving chromatography/mass spectrometry, a sample of 45 bank notes were analysed to show the level of contamination by cocaine. ...

62% of notes were contaminated with levels of cocaine at concentrations greater than 2 nanograms/note, with 5% of the notes showing levels greater than 100 times higher, indicating suspected direct use of the note in either drug dealing or drug inhalation. ... The remainder of the notes which showed only ultra-trace quantities of cocaine was most probably the result of contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces.

However, looking at an abstract of what I think is the paper in question, Evaluation of monolithic and sub 2 µm particle packed columns for the rapid screening for illicit drugs -- application to the determination of drug contamination on Irish euro banknotes, Jonathan Bones, Mirek Macka and Brett Paull, Analyst, 2007, DOI: 10.1039/b615669j, that says:

A study comparing recently available 100 × 3 mm id, 200 × 3 mm id monolithic reversed-phase columns with a 50 × 2.1 mm id, 1.8 µm particle packed reversed-phase columns was carried out to determine the most efficient approach ... for the rapid screening of samples for 16 illicit drugs and associated metabolites. ... Method performance data showed that the new LC-MS/MS method was significantly more sensitive than previous GC-MS/MS based methods for this application.

My emphasis. I'd guess that that means that comparing this result to banknote-analysis experiments carried out elsewhere using different methods is probably invalid -- perhaps this method is more efficient at picking up 'contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces', as noted in the DCU release?

Email authentication is not anti-spam

There's a common misconception about spam, email, and email authentication; Matt Cutts has been the most recent promulgator, asking 'Where's my authenticated email?', and various commenters in that thread treat this as an anti-spam question.

Here's the thing -- email these days is authenticated. If you send a mail from GMail, it'll be authenticated using both SPF and DomainKeys. However, this alone will not help in the fight against spam.

Put simply -- knowing that a mail was sent by 'jm3485 at massiveisp.net' is not much better than knowing that it was sent by IP address 192.122.3.45, unless you know that you can trust 'jm3485 at massiveisp.net', too. Spammers can (and do) authenticate themselves.

Authentication is just a step along the road to reputation and accreditation, as Eric Allman notes:

Reputation is a critical part of an overall anti-spam, anti-phishing system but is intentionally outside the purview of the DKIM base specification because how you do reputation is fundamentally orthogonal to how you do authentication.

Conceptually, once you have established an identity of an accountable entity associated with a message you can start to apply a new class of identity-based algorithms, notably reputation. ... In the longer term reputation is likely to be based on community collaboration or third party accreditation.

As he says, in the long term, several vendors (such as Return Path and Habeas) are planning to act as accreditation bureaus and reputation databases, undoubtedly using these standards as a basis. Doubtless Spamhaus have similar plans, although they've not mentioned it.

But there's no need to wait -- in the short term, users of SpamAssassin and similar anti-spam systems can run their own personal accreditation list, by whitelisting frequent correspondents based on their DomainKeys/DKIM/SPF records, using whitelist_from_spf, whitelist_from_dkim, and whitelist_from_dk.

Hopefully more ISPs and companies will deploy outbound SPF, DK and DKIM as time goes on, making this easier. All three technologies are useful for this purpose (although I prefer DKIM, if pushed to it ;).

It's worth noting that the upcoming SpamAssassin 3.2.0 can be set up to run these checks upfront, "short-circuiting" mail from known-good sources with valid SPF/DK/DKIM records, so that it isn't put through the lengthy scanning process.

That's not to say Matt doesn't have a point, though. There are questions about deployment -- why can't I already run "apt-get install postfix-dkim-outbound-signer" to get all my outbound mail transparently signed using DKIM signatures? Why isn't DKIM signing commonplace by now?

How to deal with joe-jobs and massive bounce storms

As I've noted before, we still have a major problem with sites generating bounce/backscatter storms in response to forged mail -- whether deliberately targeted, as a "Joe-Job", or as a side-effect of attempts to evade over-simplistic sender address verification as seen in spam, viruses, and so on.

Sites sending these bounces have a broken mail configuration, but there are thousands remaining out there -- it's very hard to fix an old mail setup to avoid this issue. As a result, even if your mail server is set up correctly and can handle the incoming spam load just fine, a single spam run sent to other people can multiply the volume of bounces coming back at you, Smurf-attack-style, acting as a denial of service. I've regularly had serious load problems and backlogs on my MX, due solely to these bounces.

However, I think I've now solved it, with only a little loss of functionality. Here's how I did it, using Postfix and SpamAssassin.

(UPDATE: if you use the algorithm described below, you'll block mail from people using Sender Address Verification! Use this updated version instead.)

Firstly, note that if you adopt this, you will lose functionality. Third party sites will not be able to generate bounces which are sent back to senders via your MX -- except during the SMTP transaction.

However, if a message delivery attempt is run from your MX, and it is bounced by the host during that SMTP transaction, this bounce message will still be preserved. This is good, since this is basically the only bounce scenario that can be recommended, or expected to work, in modern SMTP.

Also, a small subset of third-party bounce messages will still get past, and be delivered -- the ones that are not in the RFC-3464 bounce format generated by modern MTAs, but that include your outbound relays in the quoted header. The idea here is that "good bounces", such as messages from mailing lists warning that your mails were moderated, will still be safe.

OK, the details:

In Postfix

Ideally, we could do this entirely outside Postfix -- but in my experience, the volume (amplified by the Smurf attack effects) is such that these need to be rejected as soon as possible, during the SMTP transaction.

Update: I've now changed this technique: see this blog post for the current details, and skip this section entirely!

(If you're curious, though, here's what I used to recommend:)

In my Postfix configuration, on the machine that acts as MX for my domains -- edit '/etc/postfix/header_checks', and add these lines:
/^Return-Path: <>/                              REJECT no third-party DSNs
/^From:.*MAILER-DAEMON/                         REJECT no third-party DSNs
Edit '/etc/postfix/null_sender', and add:
<>              550 no third-party DSNs
Edit '/etc/postfix/main.cf', and ensure it contains these lines:
header_checks = regexp:/etc/postfix/header_checks
smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/null_sender
(If you already have an 'smtpd_sender_restrictions' line, just add 'check_sender_access hash:/etc/postfix/null_sender' to the end.) Finally, run:
sudo postmap /etc/postfix/null_sender
sudo /etc/init.d/postfix restart
This catches most of the bounces -- RFC-3464-format Delivery-Status-Notification messages from other mail servers.
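To see which header lines the two header_checks patterns above would reject, here is a rough Python sketch. It only approximates Postfix's behaviour: Postfix regexp tables match case-insensitively by default, which the sketch imitates with re.IGNORECASE. The function name and the sample headers are made up for illustration.

```python
import re

# The two patterns from /etc/postfix/header_checks, as Python regexes.
# Postfix regexp tables are case-insensitive by default, hence IGNORECASE.
DSN_PATTERNS = [
    re.compile(r"^Return-Path: <>", re.IGNORECASE),
    re.compile(r"^From:.*MAILER-DAEMON", re.IGNORECASE),
]


def would_reject(header_line):
    """Return True if a single header line matches either pattern."""
    return any(p.search(header_line) for p in DSN_PATTERNS)


# Hypothetical sample headers, purely for illustration.
samples = [
    "Return-Path: <>",
    "From: Mail Delivery System <MAILER-DAEMON@example.com>",
    "From: Alice <alice@example.org>",
    "Return-Path: <alice@example.org>",
]
for line in samples:
    print("REJECT" if would_reject(line) else "ok    ", line)
```

Note that an ordinary sender address passes both checks; only the null return path and MAILER-DAEMON senders are caught.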

In SpamAssassin

Install the Virus-bounce ruleset. This will catch challenge-response mails, "out of office" noise, "virus scanner detected blah" crap, and bounce mails generated by really broken groupware MTAs -- the stuff that gets past the Postfix front-line.

Once you've done these two things, that deals with almost all the forged-bounce load, at what I think is a reasonable cost. Comments welcome...

Kernighan and Pike on debugging

While reading the log4j manual, I came across this excellent quote from Brian W. Kernighan and Rob Pike's "The Practice of Programming":

As personal choice, we tend not to use debuggers beyond getting a stack trace or the value of a variable or two. One reason is that it is easy to get lost in details of complicated data structures and control flow; we find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is. More important, debugging statements stay with the program; debugging sessions are transient.

+1 to that.

5 things revisited

Hey Danny! I've already filled out my "5 Things" list. Surprisingly (or thankfully) nobody has commented on #5 ;)

Great Things, btw. I might adopt #4, and see if it works.

It's great fun following the web of "5 Things" links as they percolate through the interwebs. Now if only the people I nominated would get on with their lists...

Script: knewtab

Here's a handy script for konsole users like myself:

knewtab -- create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line ...}

Creates a new tab in a "konsole" window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the "--script" switch.

Download 'knewtab.txt'

Spam zombies — we need to cure the disease, not suppress the symptoms

Here's a great presentation from Joe St Sauver, given at the London Action Plan meeting recently: Infected PCs Acting As Spam Zombies: We Need to Cure the Disease, Not Just Suppress the Symptoms

Some key points in brief:

Despite all our ongoing efforts: the spam problem continues to worsen, with nine out of every ten emails now spam; spam volume has increased by 80% over just the past few months and users face a constantly morphing flood of malware trying to take over their computers. Bottom line: we're losing the war on spam.

The root cause of today's spam problems is spam zombies, with 85% of all spam being delivered via spam zombies.

The spam zombie problem grows worse every day (with over ninety one million new spam zombies per year)

Users don't, won't, or can't clean up their infected PCs; and ISPs can't be expected to clean up their infected customers' PCs.

Filtering port 25 and doing rate limiting is like giving cough syrup to someone with lung cancer -- it may suppress some overt symptoms but it doesn't cure the underlying disease.

Filtered and rate-limited spam zombies CAN still be used for many, many OTHER bad things, and they represent a huge problem if left to languish in a live infected state.

Joe's take -- "we're in the middle of a worldwide cyber crisis". I agree. He suggests a new strategy:

It is common for universities to produce and distribute a one-click clean-up-and-secure CD for use by their students and faculty. It's now time for our governments to produce and distribute an equivalent disk for everyone to use.

I agree the existing schemes are clearly not working; this is an interesting suggestion. Read/listen to the presentation in full for more details; pick up PDF, PPT and video here.

Massive spam volumes causing ISP delays

Via Steve Champeon's daily links, the following spam-in-the-news stories illustrate a rising trend:

Huge amounts of spam are said to be responsible for delays in the email network of NZ ISP Xtra.

Several customers have vented their frustrations on an Xtra website message board saying some emails were days late, The New Zealand Herald reports.

... Record volumes of spam meant such problems would be "an unfortunate and on-going reality of the internet not specific to any provider", he said.

Mr Bowler said Telecom had invested "tens of millions of dollars" in email and anti-spam software and worked closely with two of the world's leading anti-spam vendors.

Holiday spam e-mails are to blame for slowing message delivery to faculty and staff in schools across Kentucky ...

"Some 123-reg customers may have experienced intermittent delays in their emails in the last two weeks. We had received a particularly high level of image-based spam attacks over a short period of time," the Pipex subsidiary said.

Small businesses are threatening legal action over continuing glitches with Xtra's email service and the Consumers' Institute says they may have a case.

Several people have contacted the Herald complaining that delays and non-deliveries of emails over the past three weeks on the Xtra network are severely affecting their businesses. ...

The institute's David Russell said home users could claim compensation for email delays if they had suffered "a real measurable loss".

Non-commercial customers were covered by the Consumer Guarantees Act and services they paid for had to be of a "reasonable quality".

Although it might be more difficult for small business owners, they could also have a case, Mr Russell said. "If there has been a considerable amount of money, they could consider legal action or, if the amount was smaller, they could go through the disputes tribunal."

In other words, the DDOS-like elements of the spam problem are becoming an increasing worry; even with working spam filtering in place, the record size of zombie botnets means that spammers can now destroy organisations' computing infrastructure, almost accidentally.

Spammers don't care if an organisation's infrastructure collapses while they're sending their spam to it -- they just want to maximise exposure of their spam, by any means necessary. If that requires knocking a company off the air entirely for a while, so be it.

I'm not sure what can be done about this, in terms of filtering. It may finally be time to fall back to a "side channel" of trusted, authenticated SMTP peers, and leave the spam-filled world of random email from people and organisations you don't know to one side, as a lower-priority system which can (and will, frequently) collapse, without affecting the 'important' stuff. What a mess. :(

Alternatively, maybe it's time for governments to start putting serious money into botnet-spam-related arrests and prosecution.

This has additional issues for ISPs, too, btw -- I wonder if Earthlink are taking note of that Xtra lawsuit story above....

Cliche-finder bookmarklet

Quinn posted a link to a nifty CGI by Aaron Swartz which detects uses of common cliches, with the list of cliches to avoid taken from the Associated Press Guide to News Writing. In addition, she also mentioned there's the Passivator, 'a passive verb and adverb flagger for Mozilla-derived browsers, Safari, and Opera 7.5'.

Combining the two, I've hacked together a bookmarklet version of the cliche finder -- it can be found on this page. (Couldn't place it inline into this post due to stupid over-aggressive Markdown, grr.)

Fun! Probably not IE-compatible, though.

5 things

Tagged by richi! drat. OK, here are 5 things you probably don't know about me:

  1. I'm a certified SCUBA diver, at PADI Advanced Open Water Diver level. (oh, look, so's Tom Raftery!)

  2. I generally try to avoid meeting my heroes, since I get quite tongue-tied in the presence of people I admire -- I once stammered "I think you're brilliant" at Alex Paterson, instead of anything more witty or interesting.

  3. I met my wife at a student occupation in university, where her knowledge of the science and nature questions in Trivial Pursuit, and amazing looks of course, got me hooked ;)

  4. I could listen to Brian Eno's Taking Tiger Mountain By Strategy and Here Come The Warm Jets on repeat for several weeks, if necessary.

  5. I was a child model, modelling (among other things) underpants for Dunnes Stores! It's all been downhill since then, really ;)

Passing it on: go for it, Brendan, Colm, Lisey, and Jason.

An anti-challenge-response Xmas linkfest

As all right-thinking people know by now, challenge-response spam filtering is broken and abusive, since it simply shifts the work of filtering spam out of your email, onto innocent third parties -- either your legitimate correspondents, people on mailing lists you read, or even random people you have never heard of (due to spam blowback).

I've ranted about this in the past, but I'm not alone in this opinion -- and frequently find myself explaining it. To avoid repeating myself, here's a canonical collection of postings from around the web on this topic.

Description: This "selfish" method of spam filtering replies to all email with a "challenge" - a message only a living person can (theoretically) respond to. There are several problems with this method which have been well known for many years.

  1. Does not scale: If everyone used this method, nobody would ever get any mail.
  2. Annoying: Many users refuse to reply to the challenge emails, don't know what they are or don't trust them.
  3. Ineffective: Because of confusion about these emails, many of them are confirmed by people who did not trigger them. This results in the original malicious email being delivered.
  4. Selfish: This is the problem we are mainly concerned with. By using challenge/response filtering, you are asking innumerable third parties to receive your challenge emails just so that a relatively few legitimate ones get through to the intended recipient.

C-R systems in practice achieve an unacceptably high false-positive rate (non-spam treated as spam), and may in fact be highly susceptible to false-negatives (spam treated as non-spam) via spoofing.

Effective spam management tools should place the burden either on the spammer, or, at the very least, on the person receiving the benefits of the filtering (the mail recipient). Instead, challenge-response puts the burden on, at best, a person not directly benefitting, and quite likely (read on) a completely innocent party. The one party who should be inconvenienced by spam consequences -- the spammer -- isn't affected at all.

Worse: C-R may place the burden on third parties either inadvertantly (via spoofed sender spam or virus mail), or deliberately (see Joe Job, below). Such intrusions may even result in subversion of the C-R system out of annoyance. Many recent e-mail viruses spoof the e-mail sender, including Klez, Sobig variants, and others.

The collateral damage from widely used C/R systems, even with implementations that avoid the stupid bugs, will destroy usable e-mail. [jm: in fairness, this was written in 2003.]

Challenge systems have effects a lot like spam. In both cases, if only a few people use them they're annoying because they unfairly offload the perpetrator's costs on other people, but in small quantities it's not a big hassle to deal with. As the amount of each goes up, the hassle factor rapidly escalates and it becomes harder and harder for everyone else to use e-mail at all.

I'm skeptical of CR as a response to email. If you're the first on your block to adopt CR, and if nobody else uses anti-spam technology, then CR might provide you some modest benefit. But it's hard to see how CR can be widely successful in a world where most people use some kind of spam defense.

If these systems are so brain-dead as to not bother adding my address to the whitelist when the user sends me e-mail, I have serious trouble understanding why anyone is using them.

Is it just me? Is this too hard to figure out?

Anyway, there's another 5 minutes I'll never get back. It's too bad there's no mail header to warn me that "this message is from a TDMA user", because then I'd be able to procmail 'em right to /dev/null where they belong.

Ugh.

This bullshit is not going to "solve" the spam problem, people. If that's your solution, please let me opt out. Forever.

C/R slows down and impedes communication by placing unwanted barriers between you and your clients/suppliers.

If you must insist on using some form of C/R please make sure that you whitelist my address before you contact me as I will not reply to challenges.

We will not answer any challenges generated in response to our mailing list postings. Thus, if you're using a challenge-response system and not receiving TidBITS, you'll need to figure that out on your own. Also, if you send us a personal note and we receive a challenge to our reply, we may or may not respond to it, depending on our workload at the time.

uol.com.br uses a very broken method of anti-spam. Everytime someone sends an email message to one of their members, they send back a verification message, asking the original sender to click a link before they will allow the message through. These messages are themselves a form of spam, and the resulting back-scatter of these messages is altogether bad for the Internet, the UOL member, and all of the UOL member's contacts. UOL is aware of the complaints against them, and they refuse to correct the issue, claiming that their members love the service.

I hate C/R systems. With a passion. I absolutely will not respond to them. They go in the trash. I don't get them very often but I get them more and more. I think they have the potential to seriously damage email communication as we know it. And I'm not alone in this opinion.

Phew.

Linux USB frequent reconnects – workaround

For several months now, I've been running into problems with USB hardware on my Thinkpad T40 running Ubuntu Dapper; in particular, every time I plug in my iPod or one of my USB hard disks nowadays, I get this:

[5008549.187000] usb 4-3: USB disconnect, address 14
[5008550.143000] usb 4-3: new high speed USB device using ehci_hcd and address 18
[5008552.643000] usb 4-3: new high speed USB device using ehci_hcd and address 27
[5008557.393000] usb 4-3: new high speed USB device using ehci_hcd and address 43
[5008557.893000] usb 4-3: new high speed USB device using ehci_hcd and address 44
[5008558.643000] usb 4-3: new high speed USB device using ehci_hcd and address 46
[5008558.895000] ehci_hcd 0000:00:1d.7: port 3 reset error -110
[5008558.896000] hub 4-0:1.0: hub_port_status failed (err = -32)
[5008559.893000] usb 4-3: new high speed USB device using ehci_hcd and address 48
[5008562.643000] usb 4-3: new high speed USB device using ehci_hcd and address 58
[5008563.143000] usb 4-3: new high speed USB device using ehci_hcd and address 59
[5008563.643000] usb 4-3: new high speed USB device using ehci_hcd and address 60
[5008570.143000] usb 4-3: new high speed USB device using ehci_hcd and address 85

This repeats ad infinitum until the USB device is disconnected.

I had this down as a hardware issue (since it started happening just after warranty expiration ;), but some accidental googling revealed several other cases -- and a workaround:

sudo modprobe -r ehci-hcd

Run that repeatedly, each time replugging the device and monitoring dmesg via watch -n 1 'dmesg | tail' in a window, until the device is finally recognised as a USB hard disk. It generally seems to take 3 or 4 attempts, in my experience.

This LKML thread suggests hardware changes can cause it, but this hardware hasn't changed in years. Annoying.

Anyway, this is ongoing. This tip seems to help, but it might be just treating a symptom, I don't know -- just posting for google and posterity... and to moan, of course :(

Threadless deals with plagiarism

(Updated since original posting; see end of post for details)

Paging boogah!

Interesting situation playing out at Threadless -- I think this may be the first time a stolen design made it through voting and so on, onto cotton, without being spotted. Here's the design, supposedly by someone called 'rocketrobyn':

And here's the (apparently original) stencil art by miso and ghostpatrol:

BTW, note the perspective being copied from the photo's odd angle, to the shirt design...

The Threadless design's submission page has some classic comments:

  • Boney_King_of_Nowhere: Wow. Are you by any chance a fan of Bansky? Because this is almost a rip off. Almost. Awsome though.
  • rocketrobyn (this is my design): Thank you for the positive comments. I really like this shirt too! [...] I'm not sure who Bansky [jm:sic] is, but I'll check it out!

Heh.

I heard about this via You Thought We Wouldn't Notice, a street-design plagiarism blog, where ghostpatrol (one of the stencil artists) posted a blog post about the situation. In the comments there, Jake from Threadless pipes up:

jake n on 12 Dec 2006 at 4:30 am

hey, jake here from threadless. i was just made aware of this situation and want to give you all my assurance that we will handle this properly.

the designer will not be paid and the design will either be removed or licensed from the original designer if they are willing.

give us a couple days to sort the details.

Not to appear whingy, 2 hours later "n." posts:

The original owners are not willing to license this design to Threadless, and want it removed from the site. Neither artist has yet been contacted by Threadless.

Bit of patience there ;)

More links:

It's an interesting situation, and so far Threadless is handling it very well as far as I can see -- the only people who aren't are some other graf and stencil artists in the reaction threads, vituperating about Threadless not using psychic powers to detect plagiarism:

i tell you, you aren't printing any of my subs, i know it as they score way too low to get noticed. but on the off chance that someone rips off a design i've done, as blatantly as this...i would definitely seek reparations from threadless and the offending subber. do a background check with the subbers available websites etc.

Background checks?! wtf.

Good reaction from miso though:

Once again, we own automatic copyright on these images,...

To clarify -- we are not blaming Threadless. They didn't take the design knowing that it was stolen [if they had done so witch such knowledge, we would be approaching this very differently].

This is the fault of the "designer", and hopefully this will sort itself out in the next few days. [Who, by the way, has claimed to have done these designs -- "This is a t-shirt I designed for Threadless."]

As yet, either GP nor I have yet been contacted by either the company or "designer" to fix this, but Jake from Threadless has left a very nice comment for us on "You Thought We Wouldn't Notice".

The Threadless blog reactions are worth watching if you want to follow the ongoing drama.

Update: reposted to preshrunk. In the comments there, someone notes that it's not the first Threadless tee to make it to production before plagiarism was spotted -- The Killing Tree was first. There are some oblique references to this in this blog post's comments.

Backscatter in InformationWeek

Yay! Kudos to Richi Jennings, who's been trumpeting the dangers of backscatter to InformationWeek recently. It's a great article. I particularly like how it digs up this impressively off-the-mark quote:

Tal Golan, CTO, president, and founder of Sendio, maker of a challenge/response e-mail appliance used by more than 150 enterprise consumers, disagrees strongly with Jennings's assertion that challenge-based filtering has problems. "Without question, the benefit to the whole community at large drastically outweighs that FUD [fear, uncertainty, and doubt] that's out there in the marketplace that somehow challenge/response makes the problem worse," he says. "The real issue is that filters don't work. From our perspective, challenge/response is the only solution. This whole concept of backscatter is just not true. Very, very rarely do spammers forge the e-mail addresses of legitimate companies anymore."

hahahaha. Well, since last Thursday, "very very rarely" translates as "214 MB of backscatter in my inbox". The facts aren't on Tal Golan's side here...

(PS: SpamAssassin 3.2.0 will include backscatter detection.)

An Post: 75% lost-parcels rate so far

I don't know what's going on with An Post, the Irish postal service, these days -- I've been having some pretty bad luck with them.

For my birthday, I was lucky enough to be given a Thingamagoop -- it took a while (hey, they're hand-made) but was shipped on Nov 7th from the US. Bleep Labs accidentally shipped me two, apparently, but only one has arrived -- on Nov 16th, 9 days after shipping. The other one's still AWOL nearly a month later.

I then ordered something from Sendit.com on Nov 17th, as a birthday gift for Nov 30th. It was shipped from their Belfast offices on Nov 18th, and still hasn't arrived to date. Sendit were champs, however, and refunded the purchase as soon as I rang them on the 30th (I'd recommend their services, no problem).

Finally, SpamAssassin was lucky enough to win a Linux New Media Award 2006 for 'Best Linux-based Anti-spam Solution' -- nifty! As part of this, a (physical) trophy is winging its way from Germany, and was apparently shipped on November 27th. Guess what: no sign.

In other words, in the past month, 75% of the parcels sent to me seem to have gone AWOL. All I can do is hope that they've just been delayed, rather than suffer a worse fate. In particular, I hope that trophy turns up -- it's the only physical award we've ever received :(

Can anyone think of a good avenue to track these down? The website seems pretty negative, and what I've heard seems to be along the lines of 'turn up at the sorting depot, cross your fingers, and see if they've been misdelivered'. Ick.

SpamAssassin as an EC2 service

I had a bit of an epiphany while chatting to Antoin about the qpsmtpd/EC2 idea. Craig had the same thoughts.

Here's the thing -- there's actually no need to offload the SMTP part at all. That stuff is tricky, since you've got to build in a lot of fault tolerance, quality-of-service, uptime, etc. to ensure that the MX really is reachable. Since an EC2 instance will lose its "disks" once rebooted/shut down, you need to store your queues in Amazon S3 -- which has differing filesystem semantics from good old POSIX -- so things get quite a bit hairier. On top of that, it requires a little RFC-breakage; there are issues with using CNAMEs in MX records, reportedly.

However, if we offload just the spamd part, it becomes a whole lot simpler. The SPAMD protocol will work fine across long distances, securely, with SSL encryption active, and SpamAssassin will work fine as a filtering system in an entirely stateless mode, with no persistent-across-reboots storage. (What about the persistent-storage aspects of spamd operation? There's just the auto-whitelist, which can be easily ignored, and I haven't trained a Bayes database in 2 years, so I doubt I'll need that either ;)

If the spamd server is down or uncontactable, spamc will handle this and retry with another server, or eventually give up and pass the message through, safely intact (though unscanned).
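That fall-back behaviour can be sketched roughly as follows. This is not spamc's actual code, just an illustration of the idea; `scan_with_failover` and the per-server callables are hypothetical stand-ins for a real network call to spamd.

```python
def scan_with_failover(message, servers):
    """Try each scanner in turn; fall back to passing the mail through.

    `servers` is a list of callables standing in for spamd hosts. Each
    takes the raw message and returns a verdict string, or raises
    OSError if the host is down or unreachable.
    Returns (message, verdict); verdict is None if nothing could scan it.
    """
    for scan in servers:
        try:
            return message, scan(message)
        except OSError:
            continue  # this host failed; try the next one
    # Every host failed: deliver the message intact but unscanned.
    return message, None


def dead_host(message):
    """Hypothetical spamd host that is always unreachable."""
    raise ConnectionRefusedError("spamd unreachable")


def live_host(message):
    """Hypothetical spamd host that always answers."""
    return "ham"


print(scan_with_failover("Subject: hi\n\nhello", [dead_host, live_host]))
print(scan_with_failover("Subject: hi\n\nhello", [dead_host]))
```

The important property is the last line of the function: a total outage costs you filtering, not mail.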

Given that there's a cool third-party ClamAV plugin now available for SpamAssassin, this system can offload the virus-scanning work, too.

So here's the new plan: run the MTA, MX, and the super-lean "spamc" client on the normal MX machine -- and offload the "spamd" work to one or more EC2 machines.

Basically, there would be a CNAME record in DNS, listing the dynamic DNS names of the EC2 spamd instances. Then, spamc is set to point at that CNAME as the spamd host to use. As EC2 instances are started/removed, they are added/removed from that CNAME list and spamc will automatically keep up.

Pricing is reasonably affordable -- don't send over-large messages to the EC2 spamd; rate-limit total incoming SMTP traffic in the MTA; and use the SPAMD protocol's REPORT verb to reduce the bandwidth consumption of mails in transit by ensuring that the mail messages are only transmitted one-way, MX-to-EC2, instead of both MX-to-EC2 and EC2-to-MX. That will keep the bandwidth pricing down.

Recent figures indicate that I got about 90MB of mail per day, at peak, over the past weekend (which nearly DOS'd my server and caused some firefighting) -- 68MB of spam, and 13MB of blowback. At 20 cents per GB, that's 1.8 cents per day for traffic. Plus the $0.10 per instance hour, that's $2.42 per day to run a single EC2 instance to handle DDOS spikes. Of course, that can be shut down when load is low.
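For anyone wanting to plug in their own numbers, the arithmetic above works out as in this small Python sketch. The volume and prices are the ones quoted in this post; the variable names are mine, and it assumes the instance runs for the full 24 hours.

```python
# Figures quoted in the post; everything else is plain arithmetic.
MAIL_MB_PER_DAY = 90          # peak daily mail volume
TRANSFER_USD_PER_GB = 0.20    # bandwidth price
INSTANCE_USD_PER_HOUR = 0.10  # one EC2 instance

# The post treats 90MB as 0.09GB, i.e. 1000MB to the GB.
transfer_usd = MAIL_MB_PER_DAY / 1000 * TRANSFER_USD_PER_GB
instance_usd = INSTANCE_USD_PER_HOUR * 24  # assumes it runs all day
total_usd = transfer_usd + instance_usd

print(f"traffic:  {transfer_usd * 100:.1f} cents/day")
print(f"instance: ${instance_usd:.2f}/day")
print(f"total:    ${total_usd:.2f}/day")
```

The instance-hours dominate; the bandwidth is close to a rounding error at this volume.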

Yep, this is looking very promising. Now when are Amazon going to let me onto the beta program for EC2?...

Using qpsmtpd and Amazon EC2 to provide SMTP-DDoS protection

Like a few other anti-spammers, I found myself under a hitherto-unprecedented level of spam blowback this weekend. Disappointingly, there are still thousands of SMTP servers configured to send bounce messages in response to spam.

Even with the anti-bounce ruleset for SpamAssassin, the volume was so great that our creaky old server had a lot of difficulty keeping up -- once the messages got to SpamAssassin, the load issues had already been created. Also, Postfix's anti-spam features really weren't designed to deal with blowback.

While attempting to take some shortcuts in the setup on our server to deal with this, a great idea occurred to me -- why not come up with an app that uses Amazon EC2 to flexibly provision enough server power and bandwidth to pre-filter the SMTP traffic for an MX under attack?

I'm basically thinking of qpsmtpd, with SpamAssassin and/or other antispam blobs active, running in an Amazon EC2 server image. Multiple images can be brought up, and added to the attacked domain's MX record at an equal priority, to take load off the main (overloaded) MX.

Now to cogitate a little -- details to follow...

Working out electricity costs for your appliances and hardware

This question came up on a forum I'm on. It turns out it's really quite easy to work out -- this page covers pretty much all the details.

In addition to what's there, it's worth noting that the current Irish price for a kilowatt-hour under the ESB's domestic rate is 12.73 cents per kWh, which works out as 14.41 cents per kWh once the 13.5% VAT is added in. So Irish users, pretend you live in New Hampshire (15 cents per kWh) to get realistic figures from the excellent cost calculator.

Using this, it looks like if I were to leave a 160W desktop computer on permanently in Ireland, I'd be spending 215 euros per year to power it. Wow, that's pricey! My strategy of using low-noise, low-power hardware for home servers has paid off already, in that case. ;)
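If you'd rather not use the calculator, the sum is simple enough to sketch in a few lines of Python. This is my own back-of-the-envelope version, not the calculator's method: it assumes the device draws its rated wattage around the clock, and the function name is made up. Done this way the 160W machine comes out a little over 200 euros a year, in the same ballpark as the calculator's figure.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_eur(watts, cents_per_kwh):
    """Yearly cost in euros of a device drawing `watts` continuously."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * cents_per_kwh / 100


# 160W desktop at the VAT-inclusive ESB rate quoted above.
print(f"{annual_cost_eur(160, 14.41):.0f} EUR/year at 14.41c/kWh")
# Same machine at the 15c/kWh New Hampshire rate used as a stand-in.
print(f"{annual_cost_eur(160, 15):.0f} EUR/year at 15c/kWh")
```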

For what it's worth, if you're worrying about the power consumption of an NTL Pace digital TV set-top box -- if this Pace presentation is anything to go by, it appears the standby power consumption is on the order of 1-2 watts -- about 2 euros per year. Grand.

Labour’s flat-rate bus tickets

Well, that was quick!

Right after posting this, I hear about Labour's new transport strategy for Dublin. Here are the top 3 items:

  • Labour will increase the Dublin Bus fleet by 50% (500 buses), significantly increasing frequency and reducing waiting times.

  • Will complete the Quality Bus Corridors, and greatly reduce journey times.

  • Will introduce a EUR 1 per-trip fare for adults and a 50c per-trip fare for children.

The flat-rate fee structure makes a lot more sense than the confusing and rip-off-ish current model, whereby if you don't know in advance how much a particular journey is going to cost, you're given a useless receipt instead of change. This weird policy has certainly stopped me from catching buses in the past. In general, flat-rate pricing models appear to encourage use in other fields. And the increase in the fleet is obviously a fantastic idea. Fantastic stuff!

Read the full policy paper here (as a PDF).

Dublin transport survey

Via Lean comes this, I think from the Irish Times:

One-half of Dublin drivers would never use bus - survey

One-half of all car drivers in the greater Dublin area say they would not switch to travelling by bus, even if services were improved, according to a new survey.

Unreliability, long waiting times and poor connections were cited as the main reasons for not taking the bus in the survey carried out for the Dublin Transportation Office (DTO).

As many as four out of five people expressed dissatisfaction with traffic congestion and access to the Luas.

Just over 35 per cent of those surveyed were satisfied with the quality and upkeep of roads, and with facilities for cycling. Over one-half said they were happy with the reliability, frequency and cost of buses.

Almost 2,500 people were interviewed for the survey and a similar number of travel diaries were compiled. The car is the main form of transport in the region, used by 45 per cent of respondents. Some 18 per cent relied on the bus and 16 per cent said walking was their main form of transport. Just 2 per cent used the Luas more often than other modes of transport, and 3 per cent used the DART or local train. Two per cent cycled and 1 per cent relied on taxis.

Of those who said they might switch to the bus, over 60 per cent said more frequent services was the main change needed. Accurate timetables and stops closer to destinations were also called for.

Respondents linked transport by car to comfort, convenience and reliability. In contrast, buses were viewed as being for older people and people with no other choice. Bus transport was favourably viewed for going out socially and for being reasonably priced.

The Luas was seen as modern, while DART and train services were viewed as fast and safe. Cycling and walking were viewed as healthy and environmentally friendly, but for young people.

Great figures -- they sound pretty accurate.

The novelty of being home in a (relatively) bike- and public-transport-friendly city has worn off for me by now -- I'm now more familiar with buses that aren't a dumping ground for the homeless and mentally ill, and that do actually tend to pass both your origin and destination in a single journey. But that was in Orange County, possibly one of the most public-transit-hostile societies in the developed world, and compared to a more sane standard, Dublin still has a major problem.

By the way, it's interesting to note Ireland's move OC-wards on many fronts. When I got back, I was shocked to see tubby children being driven to school by mobile-phone-wielding, SUV-driving parents -- the very worst aspects of US suburban-sprawl life being happily parroted over here. :(

Spam filter evasion self-defeating?

Donncha asks, is spam self-defeating?

has anyone else noticed that the new generation of gif based stock-trading spams are getting really hard to read? In the last one I had to squint and look really carefully to find out what stock was hot and a sure-buy today!

I've been wondering about this, too. We continually push spammers further and further from comprehensibility, since comprehensible spam is easily-filtered spam, but the spam flood doesn't stop. In fact, spam volumes have shot up higher than ever.

My theory is that it's a symptom of the spam side of things being a market in itself (and an inefficient, scam-heavy one at that).

IMO, the people providing the underlying products advertised in "high-end" spam -- the pill-peddlers and stock pumpers -- no longer control the technical details of how or where the spam is sent. Instead, they are the customers of professional spam gangs who do that, and take care of the obfuscation, filter-evasion, etc.

In other words, the pill-peddlers and scam operators are getting ripped off, too. They think their products or scams will be advertised in a comprehensible manner, in readable emails; but instead, odd, opaque 3-word messages with "cut and paste this" lines, hidden inside filter-evasion text and bits of Project Gutenberg, are what gets delivered to the victims.

I can't imagine the clickthrough rates are exactly stellar on that. So I'd guess the spammers are responding by pushing up volumes to attempt to increase clickthrough/sales volumes. Wonder if it's working or not?

Planet Antispam Update

Hey, some Planet Antispam updates. I've upgraded to Planet 2.0, and that seems to have solved some of the weirdness with consuming Atom feeds.

Also, there are two new antispam weblogs added to the subscription list:

Welcome guys!

(btw, if you're wondering what happened to the music post -- I moved it over here, to the mp3 blog where it was supposed to be posted in the first place, duh ;)

The nightmare that is Ryanair

It's interesting reading US weblogs when they wax enthusiastic about Ryanair, typically on the foot of this BusinessWeek article.

Here's the thing -- flying Ryanair is a deeply unpleasant experience. I've heard a rumour that their staff are paid commission based on how many discretionary charges they can pile onto the basic fare -- leaving you feeling nickel-and-dimed at every turn -- and that certainly matches my experience. I mean, I've had better service in train stations in Uttar Pradesh.

In our case, our "no more" moment was after a trip to Spain earlier this year, where we were humiliated for attempting to shift around luggage instead of immediately paying the charges payable once you exceed 15 kilos (33 pounds). (Naturally, there's no weighing scales until you get right in front of the check-in desk...) Once it became clear we didn't want to pay the fee, the check-in person screamed at us, and sent us to the back of the check-in queue -- like bold schoolchildren!

This level of service is pretty standard, going by local word of mouth. Several of my friends have, like me, vowed never to fly them again, even picking more expensive flights to more distant airports to avoid it.

It's certainly not comparable to JetBlue, or any other low-fare airline I've had the pleasure of dealing with -- this is a level below. The BusinessWeek article ends with:

American long-haul discounters aren't likely to go to the extremes Ryanair has gone to sell basic services, but they're paying more attention to Ryanair these days. "They're on the cutting edge," says Tad Hutcheson, vice-president for marketing at AirTran, which recently assigned two marketing staffers to spend a week flying on Ryanair. "Charging for Cokes or snacks, blankets or pillows--I'm not sure Americans are ready for that."

Well, I certainly hope not, for their sakes!

Bleadperl regexp optimization vs SA

I've been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here's the state of play.

These are the "base strings" extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it's suffering from cut-and-paste breakage). A "base string" is a simplified subset of a rule's regular expression; specifically, it covers the cases where that subset is simple enough (plain strings, rather than the full perl regular expression language) to be amenable to fast parallel string matching algorithms.

The base strings appear in that file as "r" lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after "r" and before the ":"; after that, the rule names appear.
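To make the format concrete, here's a small sketch of a parser for those lines, in Python rather than the Perl the real code uses. The function name is made up for illustration, and splitting on the last ":" assumes rule names never contain a colon.

```python
from collections import defaultdict


def parse_r_line(line):
    """Split one 'r' line into (base_string, [rule_names])."""
    if not line.startswith("r "):
        raise ValueError("not an 'r' line: %r" % line)
    # Assumption: rule names contain no ':', so split on the last one.
    # That keeps any ':' which happens to sit inside the base string.
    base, _, rules = line[2:].rpartition(":")
    return base, rules.split()


# The four sample lines quoted above.
sample = """\
r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE
"""

base_to_rules = {}                  # one base string -> its rule names
rule_to_bases = defaultdict(list)   # one rule name -> its base strings
for line in sample.splitlines():
    base, rules = parse_r_line(line)
    base_to_rules[base] = rules
    for rule in rules:
        rule_to_bases[rule].append(base)

print(base_to_rules["I drive a"])
print(rule_to_bases["__DOS_COMING_TO_YOUR_PLACE"])
```

Even in this tiny sample, one rule name turns up against three different base strings, so the mapping has to be kept in both directions.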

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it's not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string "food" against the strings "food" and "foo" should just fire on "food", not on "foo".

  • Overlapping is permitted: on the other hand, overlapping is fine; "foobar" matched against "foo" and "oobar" should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars -- so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most rules in the ruleset -- as you can see from the 'orig' lines in that file -- are more complex than their base strings alone. So a base string match often needs to be followed by a "verification" match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) "body text" of a mail message, with each paragraph appearing as a single "line", and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the pattern itself as a key into a lookup table to determine the rule that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  if ($line =~ m{(hello|hi|foo)}) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

However, that will fail in the face of the string "hi foo!", since only one of the bases will be returned as $1, whereas we want to know about both "RULE_HI" and "RULE_FOO".

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  while ($line =~ m{(hello|hi|foo)}gc) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string "abcd", for example, will fire only on "abc", and miss the "bcd" hit.

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn't provide much of a speedup -- in fact, so far, I've been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs is outweighed by the addition of logic OPs needed to do this, resulting in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?
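One candidate, offered as an untested idea rather than a result: wrap the alternation in a zero-width lookahead, so the scan advances one character at a time and overlapping hits are reported. The sketch is in Python; Perl accepts the same `(?=(...))` syntax, but whether bleadperl still applies the trie optimisation inside a lookahead is an open question this doesn't answer.

```python
import re

bases = ["abc", "bcd"]
alternation = "|".join(re.escape(b) for b in bases)

# Plain global match: "abc" is consumed, so "bcd" is never seen.
consuming = re.findall(alternation, "abcd")

# Lookahead version: nothing is consumed, so each position is tried in turn.
overlapping = re.findall("(?=(%s))" % alternation, "abcd")

print(consuming)    # ['abc']
print(overlapping)  # ['abc', 'bcd']
```

A lookahead reports only one alternative per start position. That should be acceptable here: two base strings matching at the same position would mean one is a prefix of the other, which the no-subsumption rule above already excludes.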

(by the way, if you're curious, the current code is here in SVN.)

A Guinness 419 scam!

I may be a bit hungover this Sunday morning due mainly to the effects of the subject of this post, but -- Guinness National Lottery? Is anyone going to fall for that?

From: hamilton jones 
Subject: GUINNESS. CUSTORMERS PROMOTION

GUINNESS. CUSTORMERS PROMOTION
dv-2006 program
guinness plc, West Africa.
st christo road (ecowas)

                                    FINAL_ NOTIFICATION.

We happily inform you about our (guinness. national lottery
program)held on the 10th of november 2006, which you enterd as a
dependent client and finally took the 1st position in our second
(2nd) category winners, that falls within  the europe region Manchester Uk.
Your email was attached to the ticket number(44-40-23-777-01) which
made you a winner of (us$500,000.00) and your name being recorded in
our guinness world book of record as the 1st lucky winner of the year
2006. You have been approved the sum of US$500,000.00 which will be
sent accross to you immediately.

All emails are selected randomly through a computer ballot which
subsequently won you the sweepstakes of Guinness internet web
lottery.

CONGRATULATIONS YOU EMERGED OUR WINNER!!!
= = = = = = = = = = = = = = = = = = = = = = = = = = =
This is part of our security measures put in place to avoid double
claiming or a situation where unwanted person(s) would be taking
Negative advantage of these promotions, thereby impersonating in
order to claim another persons winning prize.
Here is our fiduciary agent responsible for your the processing /
Release of winnings for all Second Category winners where your
winning Falls into:
MR HAMILTON JONES
EMAIL: hamilton_jones2006@yahoo.it

GUINNESS. CLAIMING SECURITY AGENT.
= = = = = = = = = = = = = = = = = = = = = = = = = = =
You are required to forward the following details to help facilitate
the processing of your GUINNESS. CLAIMS OF CERTIFICATE.

Full names / Residential address / Phone number / Occupation / Sex /
Age / Present country / Marital status.

ONCE AGAIN CONGRATULATION!!!!
Yours sincerely

ANDERSON HEGLAND

Irish Blogs top 100 — should old blogs be trimmed?

Over on the Technorati Top 100 of Irish Blogs list, I've noticed something; quite a few of the listings have stopped publishing, such as number 5, Tom Murphy's Natterjackpr.com.

I'm wondering -- should no-longer-publishing blogs be listed? Technorati still keeps their ranking high -- clearly old data is not expired from the Technorati database for at least a year. But maybe my scripts should use last-post-published time, from planet.journals.ie where available, and discard blogs that haven't put anything up in something like 4 months.
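As a rough sketch of what that filter might look like (in Python, not the actual Perl scripts): the URLs and dates are hypothetical, and "4 months" is approximated as 120 days. Blogs with no last-post data are kept, to match the "where available" caveat.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=120)  # "something like 4 months", approximated


def still_publishing(blogs, last_post, now):
    """Keep blogs whose last post is recent, or whose last post is unknown."""
    keep = []
    for url in blogs:
        when = last_post.get(url)  # None when no last-post time is available
        if when is None or now - when <= MAX_AGE:
            keep.append(url)
    return keep


# Hypothetical data, for illustration only.
now = datetime(2006, 11, 20)
last_post = {
    "http://active.example.org/": datetime(2006, 11, 1),
    "http://dormant.example.org/": datetime(2006, 3, 15),
}
blogs = list(last_post) + ["http://unknown.example.org/"]
print(still_publishing(blogs, last_post, now))
```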

What do you think?

Top 100 Irish Blogs, pt 2

The previous post was pretty popular, and one of the requests was for a regularly-updated listing. So here it is: http://taint.org/technorati/

Since Technorati limit queries to about 500 per day (iirc), and there are quite a few more blogs in the Irish blogs list, I plan to update it on a nightly basis, with each set of blogs updating on different days. This should result in the figures staying more-or-less up to date without hammering T'rati too much.
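One simple way to split the list into nightly sets, sketched in Python (not the real script). It assumes one API query per blog, takes the 500-per-day figure at face value, and uses placeholder URLs.

```python
import math

DAILY_QUERY_BUDGET = 500  # the "about 500 per day" recollection above


def days_in_cycle(n_blogs, budget=DAILY_QUERY_BUDGET):
    """Number of nights needed to refresh every blog once."""
    return max(1, math.ceil(n_blogs / budget))


def todays_batch(blogs, day_number, budget=DAILY_QUERY_BUDGET):
    """Blogs to refresh on a given day, assuming one query per blog."""
    ordered = sorted(blogs)  # stable order, so batches don't shuffle nightly
    cycle = days_in_cycle(len(ordered), budget)
    return ordered[day_number % cycle::cycle]


# Placeholder URLs; in practice day_number could be date.today().toordinal().
blogs = ["http://blog%03d.example.org/" % i for i in range(569)]
print(days_in_cycle(len(blogs)))
print(len(todays_batch(blogs, 0)), len(todays_batch(blogs, 1)))
```

Striding through the sorted list keeps the batches the same size to within one blog, so no night gets close to the limit unless the list grows a lot.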

Gastric woes

Observant taint.org readers might recall me complaining about a bout of food poisoning back in June during ApacheCon week, which, along with a poorly-timed work trip, unfortunately managed to stop me attending ApacheCon altogether.

Turns out that that "food poisoning" never went away -- four months later, I'm still having digestive troubles. However, I've been lucky enough to figure out a way to minimise it, which I'll mention here for posterity (and Google).

So, basically, the symptoms were general stomach unsettledness, nausea, cramping, a sharp pain in the right side, and heartburn -- all waxing and waning intermittently. (There were issues at "the other end" I'll leave out, in the interests of good taste.) On top of that, my level of stomach "calmness" was way off -- nausea from travelling in cars, buses, taxis etc. became an issue.

Thankfully, it didn't interfere with work much at all -- since I work from home, it was pretty easy to deal with. But it certainly put a damper on trips like ApacheCon, or BarCamp Ireland... it became quite difficult, in particular, to travel any kind of distance during the daytime. (Luckily my ability to partake in pints of Guinness during the evening was not affected, however. ;)

I did the usual thing of visiting my local G.P., and was referred to a gastro-intestinal specialist -- that's all still going on, slowly. But fortunately, in the meantime, I had a breakthrough in terms of dealing with the symptoms.

Initially, the waxing and waning of symptoms seemed pretty random, but after a week or two, a pattern emerged -- on a normal day, it'd typically be worst at about 11am, then ease off before lunch, then get worse again after lunch. During and after dinner, it'd be fine, and the evenings were almost symptom-free. On an empty stomach, there were similarly virtually no problems whatsoever.

Of course, having a link with quantities of food makes sense for a GI illness. But it eventually occurred to me that the symptoms were in fact waxing and waning in time with specific types of food. The pattern of symptoms was tracking my drinking of milk, in cereal, and in tea or coffee, delayed by about 2 hours. Now, I've always been a total omnivore -- I've never suffered from allergies, had any issues digesting food, or suffered travel illness. My sea legs were rock solid; one trip to the Great Barrier Reef saw myself and C being the only tourists not to vom over the sides despite some heavy waves. Also, as an Irishman, tea is the core component of my diet, and tea with milk at that; and dairy is similarly at the heart of Irish cuisine in many ways, with plenty of milk, cheese, and butter. I was raised on the stuff, and love it!

But the signs were pretty solid, so I gave up dairy for a week or two to try it out. It took a week to "clear out" initially, but since then, the results have been fantastic; some of the symptoms (the sharp pain, cramps, heartburn) are almost gone, and levels of the others (nausea, stomach 'unsettledness') are way down most of the time. If I eat something that contains milk, cheese or whey -- such as a packet of crisps recently -- I can tell within 10 minutes, since the pain in my right side "twinges" noticeably. It really is astounding.

The weird thing is, this came out of nowhere. A week before that bbq, I was glugging milk without a single issue, and feeling perfectly fine; I've never had issues with dairy. Then all of a sudden, it just hit me, seemingly after a short bout of food poisoning, and it still hasn't gone away.

Talking to people, though, it appears this is more common than one might think; I now know of several people who've become lactose intolerant, suddenly, in their 30s.

Anyway, the core issue is still there, but while the wheels of medical science grind on, I at least have pretty good control of the nastier symptoms again. yay.

Technorati-ranked Irish Blogs Top 100

So, I was thinking about the various Irish blog aggregators, Planet.journals.ie, IrishBlogs.ie, and IrishBlogs.info. Michele's Irishblogs.info attempts to "rank" the blogs by hits, but many of the Irish webloggers don't include that hit-counting HTML snippet in their web pages, so quite a few are probably missing; on top of that, RSS readers don't count. It lists me as #3, which I knew was definitely wrong, anyway ;)

However, it occurred to me that an alternative way to compute a "top 100" would be to use the Technorati rank of each blog, and make a table based on that; that'd measure the blogs by Technorati's readership-estimation algorithm, which may still be faulty, of course, but worth a try... I was curious, so I gave it a go, and here's the results. Enjoy!

Update: This table is no longer up-to-date -- a much fresher version is now available over here, and will be updated regularly.

Top 100 by rank / inbound blog links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 8231 315 625   http://twentymajor.blogspot.com/
4 10984 249 512   http://www.natterjackpr.com/
5 15720 181 409   http://www.avalon5.com/
6 18897 151 315   http://irish.typepad.com/irisheyes/
7 19364 148 472   http://www.gavinsblog.com/
8 21214 136 385   http://www.blather.net/
9 21715 133 968   http://ocaoimh.ie/
10 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
11 22258 130 323   http://thetorturegarden.blogspot.com/
12 23921 122 351   http://www.dehora.net/journal/
13 24143 121 199   http://www.atlanticblog.com/
14 24828 118 174   http://freestater.blogspot.com/
15 25570 115 260   http://arseblog.com/WP
16 25570 115 246   http://tcal.net/
17 27174 109 252   http://www.digitalrights.ie/
18 27189 110 169   http://cork2toronto.blogspot.com/
19 28004 106 731   http://taint.org/
20 29008 103 286   http://unitedirelander.blogspot.com/
21 29008 103 232   http://www.nialler9.com/blog
22 29008 103 175   http://clickhere.blogs.ie/
23 29978 100 270   http://www.mneylon.com/blog
24 31954 95 901   http://www.irishelection.com/
25 33397 91 231   http://memex.naughtons.org/
26 34121 89 370   http://siciliannotes.blogspot.com/
27 35022 86 285   http://www.sineadgleeson.com/blog
28 35022 86 146   http://www.cfdan.com/
29 35858 84 904   http://www.pkellypr.com/blog
30 36223 84 255   http://www.thinkingoutloud.biz/
31 37735 80 175   http://www.dervala.net/
32 39719 76 207   http://backseatdrivers.blogspot.com/
33 40078 76 229   http://fdelondras.blogspot.com/
34 40276 75 203   http://www.mediangler.com/
35 40821 74 128   http://www.thinkinghomebusiness.com/blog
36 44148 69 122   http://outofambit.blogspot.com/
37 45075 67 147   http://www.podleaders.com/
38 45075 67 87   http://www.aidanf.net/
39 45729 66 238   http://www.argolon.com/
40 46477 65 201   http://www.sarahcarey.ie/
41 46477 65 191   http://disillusionedlefty.blogspot.com/
42 47586 64 141   http://www.johnbreslin.com/blog
43 48011 63 66   http://www.branedy.net/
44 52278 58 398   http://dossing.blogspot.com/
45 54710 56 155   http://redmum.blogspot.com/
46 55758 55 103   http://richarddelevan.blogspot.com/
47 56390 54 148   http://donal.wordpress.com/
48 56390 54 129   http://prettycunning.net/blog
49 57527 53 104   http://www.dublinblog.ie/
50 58724 52 167   http://www.tuppenceworth.ie/blog
51 58724 52 102   http://www.inter-actions.biz/blog/
52 59920 51 101   http://seanmcgrath.blogspot.com/
53 60315 51 76   http://www.blackphoebe.com/msjen/
54 62483 49 112   http://www.infactah.com/
55 62885 49 118   http://mamanpoulet.blogspot.com/
56 63869 48 229   http://icecreamireland.com/
57 68503 45 93   http://www.web2ireland.org/
58 68503 45 75   http://www.davidmcwilliams.ie/
59 68503 45 73   http://vipglamour.net/
60 68824 45 193   http://imeall.blogspot.com/
61 72248 43 81   http://planetpotato.blogs.com/planet_potato_an_irish_bl/
62 73843 42 149   http://lettertoamerica.blogs.com/
63 73843 42 119   http://www.kenmc.com/
64 73843 42 102   http://www.pmooney.net/blogsphe.nsf
65 73843 42 70   http://bohanna.typepad.com/pureplay/
66 75725 41 107   http://bonhom.ie/
67 75725 41 93   http://www.bibliocook.com/
68 75725 41 78   http://shittyfirstdraft.blogspot.com/
69 77680 40 225   http://bestofbothworlds.blogspot.com/
70 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
71 77957 40 82   http://davesrants.com/
72 79732 39 103   http://ricksbreakfastblog.blogspot.com/
73 80012 39 92   http://manuel-estimulo.blogspot.com/
74 81970 38 91   http://gingerpixel.com/
75 82240 38 248   http://www.linksheaven.com/
76 84304 37 726   http://thelimerick.blogspot.com/
77 84304 37 127   http://www.ryderdiary.com/
78 84304 37 83   http://morgspace.net/
79 84304 37 64   http://talideon.com/weblog/
80 86729 36 140   http://www.damienblake.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 86729 36 102   http://blog.rymus.net/
83 86729 36 65   http://www.adammaguire.com/blog
84 87068 36 272   http://progressiveireland.blogspot.com/
85 89814 35 145   http://www.windsandbreezes.org/
86 92646 34 43   http://football-corner.blogspot.com/
87 95258 33 207   http://www.fustar.org/
88 95258 33 171   http://www.iced-coffee.com/
89 95258 33 82   http://www.bytesurgery.com/gearedup
90 101881 31 90   http://phoblacht.blogspot.com/
91 101881 31 70   http://counago-and-spaves.blogspot.com/
92 101881 31 58   http://www.firstpartners.net/blog
93 105668 30 82   http://realitycheckdotie.blogspot.com/
94 109643 29 142   http://bifsniff.com/cartoons/
95 109643 29 75   http://dave.antidisinformation.com/
96 109643 29 60   http://conoroneill.com/
97 109643 29 55   http://www.minds.may.ie/%7Edez/serendipity/
98 109643 29 51   http://dublin.metblogs.com/
99 110005 29 78   http://www.janinedalton.com/blog
100 110005 29 54   http://www.runningwithbulls.com/blog

List by inbound links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 21715 133 968   http://ocaoimh.ie/
4 35858 84 904   http://www.pkellypr.com/blog
5 31954 95 901   http://www.irishelection.com/
6 28004 106 731   http://taint.org/
7 84304 37 726   http://thelimerick.blogspot.com/
8 8231 315 625   http://twentymajor.blogspot.com/
9 258886 13 519   http://newswire99.blogspot.com/
10 10984 249 512   http://www.natterjackpr.com/
11 19364 148 472   http://www.gavinsblog.com/
12 164780 20 451   http://inao.blogspot.com/
13 15720 181 409   http://www.avalon5.com/
14 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
15 52278 58 398   http://dossing.blogspot.com/
16 21214 136 385   http://www.blather.net/
17 34121 89 370   http://siciliannotes.blogspot.com/
18 23921 122 351   http://www.dehora.net/journal/
19 156276 21 336   http://www.ebbybrett.co.uk/blog
20 22258 130 323   http://thetorturegarden.blogspot.com/
21 18897 151 315   http://irish.typepad.com/irisheyes/
22 29008 103 286   http://unitedirelander.blogspot.com/
23 35022 86 285   http://www.sineadgleeson.com/blog
24 87068 36 272   http://progressiveireland.blogspot.com/
25 239963 14 271   http://www.thehealthtechblog.com/
26 29978 100 270   http://www.mneylon.com/blog
27 25570 115 260   http://arseblog.com/WP
28 36223 84 255   http://www.thinkingoutloud.biz/
29 27174 109 252   http://www.digitalrights.ie/
30 82240 38 248   http://www.linksheaven.com/
31 977738 3 248   http://www.tomgriffin.org/the_green_ribbon/
32 25570 115 246   http://tcal.net/
33 45729 66 238   http://www.argolon.com/
34 29008 103 232   http://www.nialler9.com/blog
35 33397 91 231   http://memex.naughtons.org/
36 40078 76 229   http://fdelondras.blogspot.com/
37 63869 48 229   http://icecreamireland.com/
38 77680 40 225   http://bestofbothworlds.blogspot.com/
39 208904 16 210   http://www.anlionra.com/
40 471327 7 208   http://www.ravenfamily.org/sam/
41 39719 76 207   http://backseatdrivers.blogspot.com/
42 95258 33 207   http://www.fustar.org/
43 40276 75 203   http://www.mediangler.com/
44 46477 65 201   http://www.sarahcarey.ie/
45 637233 5 200   http://armchaircelts.co.uk/
46 24143 121 199   http://www.atlanticblog.com/
47 280786 12 199   http://conann.com/
48 68824 45 193   http://imeall.blogspot.com/
49 46477 65 191   http://disillusionedlefty.blogspot.com/
50 637233 5 182   http://www.everysecondpaycheck.com/blog
51 164524 20 181   http://irishlinks.blogspot.com/
52 542250 6 176   http://www.dublinka.com/
53 29008 103 175   http://clickhere.blogs.ie/
54 37735 80 175   http://www.dervala.net/
55 24828 118 174   http://freestater.blogspot.com/
56 155943 21 172   http://www.jamesgalvin.com/
57 95258 33 171   http://www.iced-coffee.com/
58 164524 20 171   http://irishcraftworker.typepad.com/an_irish_craftworkers_goo/
59 27189 110 169   http://cork2toronto.blogspot.com/
60 58724 52 167   http://www.tuppenceworth.ie/blog
61 141242 23 164   http://atp.datagate.net.uk/blog
62 148304 22 159   http://www.lifewithouttoast.com/
63 184241 18 158   http://funferal.org/
64 54710 56 155   http://redmum.blogspot.com/
65 73843 42 149   http://lettertoamerica.blogs.com/
66 56390 54 148   http://donal.wordpress.com/
67 45075 67 147   http://www.podleaders.com/
68 155943 21 147   http://dublinopinion.com/
69 35022 86 146   http://www.cfdan.com/
70 89814 35 145   http://www.windsandbreezes.org/
71 109643 29 142   http://bifsniff.com/cartoons/
72 195745 17 142   http://podcasting.ie/podcast
73 47586 64 141   http://www.johnbreslin.com/blog
74 86729 36 140   http://www.damienblake.com/
75 223280 15 137   http://thegurrier.com/
76 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
77 980795 3 131   http://www.sineadcochrane.com/
78 56390 54 129   http://prettycunning.net/blog
79 40821 74 128   http://www.thinkinghomebusiness.com/blog
80 84304 37 127   http://www.ryderdiary.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 44148 69 122   http://outofambit.blogspot.com/
83 73843 42 119   http://www.kenmc.com/
84 62885 49 118   http://mamanpoulet.blogspot.com/
85 135121 24 117   http://nellysgarden.blogspot.com/
86 195745 17 115   http://blog.infurious.com/
87 542250 6 114   http://ainelivia.typepad.com/aine_livia_at_the_midnigh/
88 62483 49 112   http://www.infactah.com/
89 75725 41 107   http://bonhom.ie/
90 57527 53 104   http://www.dublinblog.ie/
91 55758 55 103   http://richarddelevan.blogspot.com/
92 79732 39 103   http://ricksbreakfastblog.blogspot.com/
93 58724 52 102   http://www.inter-actions.biz/blog/
94 73843 42 102   http://www.pmooney.net/blogsphe.nsf
95 86729 36 102   http://blog.rymus.net/
96 59920 51 101   http://seanmcgrath.blogspot.com/
97 173857 19 99   http://www.ofoghlu.net/log/
98 118678 27 96   http://irishkc.com/
99 68503 45 93   http://www.web2ireland.org/
100 75725 41 93   http://www.bibliocook.com/

Update: Here's a full list of all 569 tested blogs. Also, there's been a minor change to the rankings here; I've just realised that there was a bug in how the script handled evenly-matched blogs, so (for example) #15 and #16 were reversed in order; that's now fixed.

If you find a blog missing, it's possible that (a) it's not pinging Planet.journals.ie or (b) is not registered with Technorati; this method requires both of those. Most Irish blogs do, but some (Old Rotten Hat, for example) don't...

Methodology

I found this more-or-less full list of Irish weblogs at Planet.journals.ie, and selected the blogs that had pinged their site in the past 6 months, then cut that down to just the blog main-page URLs, removing duplicates.

Given that list, I then looked up each blog URL using the Technorati API, and got its rank, inbound link count, and inbound linking blogs count.
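The ordering of the two tables can be sketched like this, in Python. This is not the code in the tarball, and the tie-breaking is inferred from the tables themselves: equal ranks appear ordered by inbound links, and equal link counts by rank.

```python
from collections import namedtuple

Blog = namedtuple("Blog", "url rank inbound_blogs inbound_links")

# A few rows copied from the tables above, deliberately out of order.
blogs = [
    Blog("http://tcal.net/", 25570, 115, 246),
    Blog("http://www.tomrafteryit.net/", 2940, 638, 1931),
    Blog("http://arseblog.com/WP", 25570, 115, 260),
    Blog("http://ocaoimh.ie/", 21715, 133, 968),
    Blog("http://www.mulley.net/", 6636, 371, 1280),
]

# Lower Technorati rank is better; ties go to the blog with more links.
by_rank = sorted(blogs, key=lambda b: (b.rank, -b.inbound_links))

# More inbound links is better; ties go to the better (lower) rank.
by_links = sorted(blogs, key=lambda b: (-b.inbound_links, b.rank))

for position, blog in enumerate(by_rank, start=1):
    print(position, blog.rank, blog.inbound_blogs, blog.inbound_links, blog.url)
```

Sorting on a tuple key gives evenly-matched blogs a defined order, which is the sort of case the #15/#16 fix mentioned above was about.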

top100code.tgz is a tarball of the perl code I wrote to do this, if you fancy doing it yourself on whichever set of blogs you fancy...

Maximise value, not protection (fwd)

Here's an excellent quote from the OpenGeoData weblog, really worth reproducing:

"We think the natural tendency is for producers to worry too much about protecting their intellectual property. The important thing is to maximise the value of your intellectual property, not to protect it for the sake of protection. If you lose a little of your property when you sell it or rent it, that’s just a cost of doing business, along with depreciation, inventory losses, and obsolescence." -- Information Rules, Carl Shapiro and Hal Varian, page 97.

Words to live by!