Justin's Linklog Posts

An Post: 75% lost-parcels rate so far

I don’t know what’s going on with An Post, the Irish postal service, these days — I’ve been having some pretty bad luck with them.

For my birthday, I was lucky enough to be given a Thingamagoop — it took a while (hey, they’re hand-made) but was shipped on Nov 7th from the US. Bleep Labs accidentally shipped me two, apparently, but only one has arrived — on Nov 16th, 9 days after shipping. The other one’s still AWOL nearly a month later.

I then ordered something from Sendit.com on Nov 17th, as a birthday gift for Nov 30th. It was shipped from their Belfast offices on Nov 18th, and still hasn’t arrived to date. Sendit were champs, however, and refunded the purchase as soon as I rang them on the 30th (I’d recommend their services, no problem).

Finally, SpamAssassin was lucky enough to win a Linux New Media Award 2006 for ‘Best Linux-based Anti-spam Solution’ (http://www.linuxnewmedia.com/Press/Press_Releases/Awards_2006). Nifty! As part of this, a (physical) trophy is apparently winging its way from Germany, having been shipped on November 27th. Guess what: no sign.

In other words, in the past month, 75% of the parcels sent to me seem to have gone AWOL. All I can do is hope that they’ve just been delayed, rather than having suffered a worse fate. In particular, I hope that trophy turns up; it’s the only physical award we’ve ever received :(

Can anyone think of a good avenue to track these down? The website seems pretty negative, and what I’ve heard seems to be along the lines of ‘turn up at the sorting depot, cross your fingers, and see if they’ve been misdelivered’. Ick.

SpamAssassin as an EC2 service

I had a bit of an epiphany while chatting to Antoin about the qpsmtpd/EC2 idea. Craig had the same thoughts.

Here’s the thing — there’s actually no need to offload the SMTP part at all. That stuff is tricky, since you’ve got to build in a lot of fault tolerance, quality-of-service, uptime, etc. to ensure that the MX really is reachable. Since an EC2 instance will lose its "disks" once rebooted/shut down, you need to store your queues in Amazon S3 — which has differing filesystem semantics from good old POSIX — so things get quite a bit hairier. On top of that, it requires a little RFC-breakage; there are issues with using CNAMEs in MX records, reportedly.

However, if we offload just the spamd part, it becomes a whole lot simpler. The SPAMD protocol will work fine across long distances, securely, with SSL encryption active, and SpamAssassin will work fine as a filtering system in an entirely stateless mode, with no persistent-across-reboots storage. (What about the persistent-storage aspects of spamd operation? There’s just the auto-whitelist, which can be easily ignored, and I haven’t trained a Bayes database in 2 years, so I doubt I’ll need that either ;)

If the spamd server is down or uncontactable, spamc will handle this and retry with another server, or eventually give up and pass the message through, safely intact (though unscanned).
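The fallback behaviour described above can be sketched roughly as follows. This is Python, not spamc's own code, and the names `scan_with_failover` and `scan_fn` are made up for illustration; it only mirrors the "try the next server, then pass the message through unscanned" logic and makes no claim about how spamc implements it.

```python
from typing import Callable, Iterable, Optional, Tuple


def scan_with_failover(
    message: bytes,
    servers: Iterable[str],
    scan_fn: Callable[[str, bytes], bytes],
) -> Tuple[bytes, Optional[str]]:
    """Try each server in turn; pass the message through unscanned if all fail.

    scan_fn is a hypothetical callable that sends the message to one server
    and returns the scanned result, raising OSError when that server is
    down or unreachable.
    """
    for server in servers:
        try:
            return scan_fn(server, message), server
        except OSError:
            continue  # this server failed, so try the next one
    # Every server failed: hand the message back intact, with no server name
    return message, None
```

The second element of the result tells the caller which server did the scan, or `None` if the message went through unscanned.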

Given that there’s a cool third-party ClamAV plugin now available for SpamAssassin, this system can offload the virus-scanning work, too.

So here’s the new plan: run the MTA, MX, and the super-lean "spamc" client on the normal MX machine — and offload the "spamd" work to one or more EC2 machines.

Basically, there would be a CNAME record in DNS, listing the dynamic DNS names of the EC2 spamd instances. Then, spamc is set to point at that CNAME as the spamd host to use. As EC2 instances are started/removed, they are added/removed from that CNAME list and spamc will automatically keep up.

Pricing is reasonably affordable, as long as a few precautions are taken: don’t send over-large messages to the EC2 spamd; rate-limit total incoming SMTP traffic in the MTA; and use the SPAMD protocol’s REPORT verb so that mail messages are only transmitted one-way, MX-to-EC2, instead of both MX-to-EC2 and EC2-to-MX. That will keep the bandwidth pricing down.

Recent figures indicate that I got about 90MB of mail per day, at peak, over the past weekend (which nearly DOS’d my server and caused some firefighting) — 68MB of spam, and 13MB of blowback. At 20 cents per GB, that’s 1.8 cents per day for traffic. Plus the $0.10 per instance hour, that’s $2.42 per day to run a single EC2 instance to handle DDOS spikes. Of course, that can be shut down when load is low.
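For anyone who wants to plug in their own numbers, the arithmetic above can be written out as a few lines of Python. The prices are the ones quoted in this post; the function name `daily_cost` and the 1 GB = 1000 MB convention are assumptions made for the sketch.

```python
MB_PER_DAY = 90                 # peak daily mail volume quoted above
PRICE_PER_GB = 0.20             # USD per GB of traffic
PRICE_PER_INSTANCE_HOUR = 0.10  # USD per EC2 instance-hour


def daily_cost(mb_per_day: float, instances: int = 1) -> float:
    """Return the estimated USD cost of one day of traffic plus instance time."""
    traffic = (mb_per_day / 1000) * PRICE_PER_GB  # treats 1 GB as 1000 MB
    compute = instances * 24 * PRICE_PER_INSTANCE_HOUR
    return traffic + compute


print(f"${daily_cost(MB_PER_DAY):.2f} per day")
```

With the figures above this gives 1.8 cents of traffic plus $2.40 of instance time, matching the $2.42 total.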

Yep, this is looking very promising. Now when are Amazon going to let me onto the beta program for EC2?…

Using qpsmtpd and Amazon EC2 to provide SMTP-DDoS protection

Like a few other anti-spammers, I found myself under a hitherto-unprecedented level of spam blowback this weekend. Disappointingly, there are still thousands of SMTP servers configured to send bounce messages in response to spam.

Even with the anti-bounce ruleset for SpamAssassin, the volume was so great that our creaky old server had a lot of difficulty keeping up — once the messages got to SpamAssassin, the load issues had already been created. Also, Postfix’s anti-spam features really weren’t designed to deal with blowback.

While attempting to take some shortcuts in the setup on our server to deal with this, a great idea occurred to me: why not come up with an app that uses Amazon EC2 (http://aws.amazon.com/) to flexibly provision enough server power and bandwidth to pre-filter the SMTP traffic for an MX under attack?

I’m basically thinking of qpsmtpd, with SpamAssassin and/or other antispam blobs active, running in an Amazon EC2 server image. Multiple images can be brought up, and added to the attacked domain’s MX record at an equal priority, to take load off the main (overloaded) MX.

Now to cogitate a little — details to follow…

Working out electricity costs for your appliances and hardware

This question came up on a forum I’m on. It turns out it’s really quite easy to work out — this page covers pretty much all the details.

In addition to what’s there, it’s worth noting that the current Irish price for a kilowatt-hour under the ESB’s domestic rate (http://www.esb.ie/main/energy_home/ef9.jsp) is 12.73 cents per kWh, which works out as roughly 14.45 cents per kWh once the 13.5% VAT is added in. So Irish users, pretend you live in New Hampshire (15 cents per kWh) to get realistic figures from the excellent cost calculator (http://michaelbluejay.com/electricity/howmuch.html#4).

Using this, it looks like if I were to leave a 160W desktop computer on permanently in Ireland, I’d be spending 215 euros per year to power it. Wow, that’s pricey! My strategy of using low-noise, low-power hardware for home servers has paid off already, in that case. ;)

For what it’s worth, if you’re worrying about the power consumption of an NTL Pace digital TV set-top box: if this Pace presentation (http://energyefficiency.jrc.cec.eu.int/pdf/Presentation_EC_meeting_11-04-03KD.pdf) is anything to go by, the standby power consumption appears to be on the order of 1-2 watts, or about 2 euros per year. Grand.
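If you’d rather not use the calculator, the sum is simple enough to do by hand. Here is a rough Python sketch, assuming the device draws a constant wattage all year and using the rounded 15 cents per kWh suggested above; the function name `annual_cost` is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly cost of a device drawing `watts` continuously."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * price_per_kwh


# 0.15 per kWh is the rounded "New Hampshire" rate suggested above
print(round(annual_cost(160, 0.15), 2))  # always-on 160W desktop
print(round(annual_cost(2, 0.15), 2))    # set-top box on standby, upper estimate
```

This comes out at roughly 210 per year for the desktop, in the same ballpark as the calculator’s figure, and somewhere between 1.3 and 2.6 per year for a 1-2 watt standby load.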

Labour’s flat-rate bus tickets

Well, that was quick!

Right after posting this, I hear about Labour’s new transport strategy for Dublin. Here are the top 3 items:

  • Labour will increase the Dublin Bus fleet by 50% (500 buses), significantly increasing frequency and reducing waiting times.

  • Will complete the Quality Bus Corridors, and greatly reduce journey times.

  • Will introduce a EUR 1 per-trip fare for adults and a 50c per-trip fare for children.

The flat-rate fee structure makes a lot more sense than the confusing and rip-off-ish current model, whereby if you don’t know in advance how much a particular journey is going to cost, you’re given a useless receipt instead of change. This weird policy has certainly stopped me from catching buses in the past. In general, flat-rate pricing models appear to encourage use in other fields. And the increase in the fleet is obviously a great idea. Fantastic stuff!

Read the full policy paper here (as a PDF).

Dublin transport survey

Via Lean comes this, I think from the Irish Times:

One-half of Dublin drivers would never use bus – survey

One-half of all car drivers in the greater Dublin area say they would not switch to travelling by bus, even if services were improved, according to a new survey.

Unreliability, long waiting times and poor connections were cited as the main reasons for not taking the bus in the survey carried out for the Dublin Transportation Office (DTO).

As many as four out of five people expressed dissatisfaction with traffic congestion and access to the Luas.

Just over 35 per cent of those surveyed were satisfied with the quality and upkeep of roads, and with facilities for cycling. Over one-half said they were happy with the reliability, frequency and cost of buses.

Almost 2,500 people were interviewed for the survey and a similar number of travel diaries were compiled. The car is the main form of transport in the region, used by 45 per cent of respondents. Some 18 per cent relied on the bus and 16 per cent said walking was their main form of transport. Just 2 per cent used the Luas more often than other modes of transport, and 3 per cent used the DART or local train. Two per cent cycled and 1 per cent relied on taxis.

Of those who said they might switch to the bus, over 60 per cent said more frequent services was the main change needed. Accurate timetables and stops closer to destinations were also called for.

Respondents linked transport by car to comfort, convenience and reliability. In contrast, buses were viewed as being for older people and people with no other choice. Bus transport was favourably viewed for going out socially and for being reasonably priced.

The Luas was seen as modern, while DART and train services were viewed as fast and safe. Cycling and walking were viewed as healthy and environmentally friendly, but for young people.

Great figures — they sound pretty accurate.

The novelty of being home in a (relatively) bike- and public-transport-friendly city has worn off for me by now. I’m now more familiar with buses that aren’t a dumping ground for the homeless and mentally ill, and that do actually tend to pass both your origin and destination in a single journey. But that was in Orange County (http://www.octa.net/), possibly one of the most public-transit-hostile societies in the developed world, and compared to a more sane standard, Dublin still has a major problem.

By the way, it’s interesting to note Ireland’s move OC-wards on many fronts. When I got back, I was shocked to see tubby children being driven to school by mobile-phone-wielding, SUV-driving parents: the very worst aspects of US suburban-sprawl life being happily parroted over here. :(

Spam filter evasion self-defeating?

Donncha asks, is spam self-defeating?

has anyone else noticed that the new generation of gif based stock-trading spams are getting really hard to read? In the last one I had to squint and look really carefully to find out what stock was hot and a sure-buy today!

I’ve been wondering about this, too. We continually push spammers further and further from comprehensibility, since comprehensible spam is easily-filtered spam, but the spam flood doesn’t stop. In fact, spam volumes have shot up higher than ever.

My theory is that it’s a symptom of the spam side of things being a market in itself (and an inefficient, scam-heavy one at that).

IMO, the people providing the underlying products advertised in "high-end" spam — the pill-peddlers and stock pumpers — no longer control the technical details of how or where the spam is sent. Instead, they are the customers of professional spam gangs who do that, and take care of the obfuscation, filter-evasion, etc.

In other words, the pill-peddlers and scam operators are getting ripped off, too. They think their products or scams will be advertised in a comprehensible manner, in readable emails; but instead, odd, opaque 3-word messages with "cut and paste this" lines, hidden inside filter-evasion text and bits of Project Gutenberg, are what gets delivered to the victims.

I can’t imagine the clickthrough rates are exactly stellar on that. So I’d guess the spammers are responding by pushing up volumes to attempt to increase clickthrough/sales volumes. Wonder if it’s working or not?

Planet Antispam Update

Hey, some Planet Antispam updates. I’ve upgraded to Planet 2.0, and that seems to have solved some of the weirdness with consuming Atom feeds.

Also, there are two new antispam weblogs added to the subscription list:

Welcome guys!

(btw, if you’re wondering what happened to the music post — I moved it over here, to the mp3 blog where it was supposed to be posted in the first place, duh ;)

The nightmare that is Ryanair

It’s interesting reading US weblogs when they wax enthusiastic about Ryanair (http://radar.oreilly.com/archives/2006/11/two_great_follo.html), typically on the foot of this BusinessWeek article (http://www.businessweek.com/magazine/content/06_48/b4011064.htm?campaign_id=rss_magzn).

Here’s the thing: flying Ryanair is a deeply unpleasant experience. I’ve heard rumour that their staff are paid commission based on how many discretionary charges they can pile onto the basic fare, leaving you feeling nickel-and-dimed at every turn, and that certainly matches my experience. I mean, I’ve had better service in train stations in Uttar Pradesh.

In our case, our "no more" moment was after a trip to Spain earlier this year, where we were humiliated for attempting to shift around luggage instead of immediately paying the charges due once you exceed 15 kilos (33 pounds). (Naturally, there are no weighing scales until you get right in front of the check-in desk…) Once it became clear we didn’t want to pay the fee, the check-in person screamed at us, and sent us to the back of the check-in queue, like bold schoolchildren!

This level of service is pretty standard, going by local word of mouth. Several of my friends have, like me, vowed never to fly them again, even picking more expensive flights to more distant airports to avoid it.

It’s certainly not comparable to JetBlue, or any other low-fare airline I’ve had the pleasure of dealing with; this is a level below. The BusinessWeek article ends with:

American long-haul discounters aren’t likely to go to the extremes Ryanair has gone to sell basic services, but they’re paying more attention to Ryanair these days. "They’re on the cutting edge," says Tad Hutcheson, vice-president for marketing at AirTran, which recently assigned two marketing staffers to spend a week flying on Ryanair. "Charging for Cokes or snacks, blankets or pillows–I’m not sure Americans are ready for that."

Well, I certainly hope not, for their sakes!

Bleadperl regexp optimization vs SA

I’ve been looking some more into recent new features added to bleadperl by demerphq (http://use.perl.org/~demerphq/), such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here’s the state of play.

These are the "base strings" extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage). A "base string" is a simplified subset of the regular expression; specifically, these are the cases where the "base strings" of the rule are simpler than the full perl regular expression language, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as "r" lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after "r" and before the ":"; after that, the rule names appear.
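To make the format concrete, here is a small sketch that reads "r" lines into a base-string-to-rules table. It is written in Python rather than the Perl used elsewhere in this post, and it assumes that rule names are separated by whitespace and never contain a ":"; `parse_base_strings` is a made-up name, not part of SpamAssassin.

```python
from collections import defaultdict


def parse_base_strings(lines):
    """Map each base string to the list of rule names it can trigger."""
    base_to_rules = defaultdict(list)
    for line in lines:
        if not line.startswith("r "):
            continue  # only "r" lines carry base strings
        # Split on the last ":" in case the base string itself contains one
        base, _, names = line[2:].rpartition(":")
        base_to_rules[base].extend(names.split())
    return dict(base_to_rules)


sample = [
    "r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE",
    "r I drive a:__DOS_I_DRIVE_A",
]
print(parse_base_strings(sample))
```

The first sample line yields one base string mapped to two rule names, which is the "one string to many rules" case.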

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string "food" against the strings "food" and "foo" should just fire on "food", not on "foo".

  • Overlapping is permitted: on the other hand, overlapping is fine; "foobar" matched against "foo" and "oobar" should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars — so that this is possible, since re2c is limited in this way.)

  • Most rules are more complex: most of the rules in the ruleset, as you can see from the ‘orig’ lines in that file, are more complex than the base string alone. So this means that a base string match often needs to be followed by a "verification" match using the full regexp.
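The last point, a cheap base-string hit followed by a "verification" match, can be sketched like this. It is Python rather than Perl, both rules and their regexps are hypothetical and not taken from the real ruleset, and the loop checks one base string at a time, so it shows only the two-stage idea and not the parallel matching this post is after.

```python
import re

# Hypothetical rules: a cheap literal base string plus the full regexp it was
# extracted from. Neither rule is taken from the real SpamAssassin ruleset.
RULES = {
    "RULE_OOO": ("out of the office",
                 re.compile(r"\bI am (?:currently )?out of the office\b")),
    "RULE_DRIVE": ("I drive a",
                   re.compile(r"\bI drive a \w+")),
}


def rules_hit(line: str) -> list:
    """Return the names of rules whose base string and full regexp both match."""
    hits = []
    for name, (base, full_re) in RULES.items():
        # Stage 1: cheap substring test; stage 2: verify with the full regexp
        if base in line and full_re.search(line):
            hits.append(name)
    return hits


print(rules_hit("Hi, I am out of the office until Monday"))
```

A line such as "Yes I drive a." contains the base string "I drive a" but fails the full regexp, so nothing fires for it.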

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) "body text" of a mail message, with each paragraph appearing as a single "line", and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the pattern itself as a key into a lookup table to determine the pattern that fired:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  if ($line =~ m{(hello|hi|foo)}) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

However, that will fail in the face of the string "hi foo!", since only one of the bases will be returned as $1, whereas we want to know about both "RULE_HI" and "RULE_FOO".

m//gc might help:

  %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  while ($line =~ m{(hello|hi|foo)}gc) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string "abcd", for example, will fire only on "abc", and miss the "bcd" hit.
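The same behaviour is easy to reproduce outside Perl. Here is a minimal Python demonstration, using `re.finditer` as a stand-in for the m//gc loop; like Perl, it resumes scanning after the end of each match.

```python
import re

pattern = re.compile(r"abc|bcd")

# finditer() resumes scanning after the end of each match, so the "bcd"
# that overlaps the first hit is never reported.
hits = [m.group(0) for m in pattern.finditer("abcd")]
print(hits)  # ['abc']
```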

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn’t provide much of a speedup — in fact, so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs vs the addition of logic OPs to do this, result in an overall slowdown, even given the faster trie-based REs.
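One technique sometimes used to report overlapping hits in a single pass is to wrap the alternation in a zero-width lookahead, so that the scan advances one character at a time instead of jumping past each match. The sketch below shows the idea in Python; Perl has the same `(?=(...))` construct, but whether it would still benefit from the bleadperl trie optimization is not something this sketch establishes. The rule names are hypothetical.

```python
import re

BASE_TO_RULES = {
    "abc": ["RULE_ABC"],
    "bcd": ["RULE_BCD"],
}

# The lookahead consumes no text, so every start offset in the line is tried
pattern = re.compile("(?=(" + "|".join(map(re.escape, BASE_TO_RULES)) + "))")


def rules_for(line: str) -> list:
    """Collect rule names for every base string found, including overlaps."""
    fired = []
    for match in pattern.finditer(line):
        fired.extend(BASE_TO_RULES[match.group(1)])
    return fired


print(rules_for("abcd"))  # ['RULE_ABC', 'RULE_BCD']
```

One caveat: at any given offset only the first matching alternative is captured, so two base strings that start at the same position produce a single hit, and which one wins depends on the order of the alternatives.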

Suggestions, anyone?

(by the way, if you’re curious, the current code is here in SVN.)

A Guinness 419 scam!

I may be a bit hungover this Sunday morning due mainly to the effects of the subject of this post, but — Guinness National Lottery? is anyone going to fall for that?

From: hamilton jones 
Subject: GUINNESS. CUSTORMERS PROMOTION

GUINNESS. CUSTORMERS PROMOTION
dv-2006 program
guinness plc, West Africa.
st christo road (ecowas)

                                    FINAL_ NOTIFICATION.

We happily inform you about our (guinness. national lottery
program)held on the 10th of november 2006, which you enterd as a
dependent client and finally took the 1st position in our second
(2nd) category winners, that falls within  the europe region Manchester Uk.
Your email was attached to the ticket number(44-40-23-777-01) which
made you a winner of (us$500,000.00) and your name being recorded in
our guinness world book of record as the 1st lucky winner of the year
2006. You have been approved the sum of US$500,000.00 which will be
sent accross to you immediately.

All emails are selected randomly through a computer ballot which
subsequently won you the sweepstakes of Guinness internet web
lottery.

CONGRATULATIONS YOU EMERGED OUR WINNER!!!
= = = = = = = = = = = = = = = = = = = = = = = = = = =
This is part of our security measures put in place to avoid double
claiming or a situation where unwanted person(s) would be taking
Negative advantage of these promotions, thereby impersonating in
order to claim another persons winning prize.
Here is our fiduciary agent responsible for your the processing /
Release of winnings for all Second Category winners where your
winning Falls into:
MR HAMILTON JONES
EMAIL: hamilton_jones2006@yahoo.it

GUINNESS. CLAIMING SECURITY AGENT.
= = = = = = = = = = = = = = = = = = = = = = = = = = =
You are required to forward the following details to help facilitate
the processing of your GUINNESS. CLAIMS OF CERTIFICATE.

Full names / Residential address / Phone number / Occupation / Sex /
Age / Present country / Marital status.

ONCE AGAIN CONGRATULATION!!!!
Yours sincerely

ANDERSON HEGLAND

Irish Blogs top 100 — should old blogs be trimmed?

Over on the Technorati Top 100 of Irish Blogs list, I’ve noticed something; quite a few of the listings have stopped publishing, such as number 5, Tom Murphy’s Natterjackpr.com.

I’m wondering — should no-longer-publishing blogs be listed? Technorati still keeps their ranking high — clearly old data is not expired from the Technorati database for at least a year. But maybe my scripts should use last-post-published time, from planet.journals.ie where available, and discard blogs that haven’t put anything up in something like 4 months.
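As a rough idea of what that filter might look like, here is a Python sketch. The 120-day cutoff stands in for "something like 4 months", the function name is made up, and blogs with no known last-post time are kept, since the timestamp would only be used where available.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=120)  # roughly 4 months


def active_blogs(last_post_times: dict, now: datetime) -> list:
    """Return blogs that posted within the cutoff, plus those with no timestamp.

    last_post_times maps a blog URL to the datetime of its latest post, or to
    None when no last-post time is available for that blog.
    """
    return [
        url
        for url, last_post in last_post_times.items()
        if last_post is None or now - last_post <= MAX_AGE
    ]


# Placeholder URLs and dates, purely for illustration
example = {
    "http://active.example.org/": datetime(2006, 11, 20),
    "http://dormant.example.org/": datetime(2006, 3, 1),
    "http://unknown.example.org/": None,
}
print(active_blogs(example, now=datetime(2006, 12, 1)))
```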

What do you think?

Top 100 Irish Blogs, pt 2

The previous post was pretty popular, and one of the requests was for a regularly-updated listing. So here it is: http://taint.org/technorati/

Since Technorati limit daily queries to about 500 per day (iirc), and there are quite a few more blogs in the Irish blogs list, I plan to update it on a nightly basis, with each set of blogs updating on different days. This should result in the figures staying more-or-less up to date without hammering T’rati too much.
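The rotation could be as simple as the following Python sketch, which assumes one Technorati query per blog and cuts the list into consecutive batches of at most 500. The function name and the way the day number is chosen are assumptions, not a description of the actual scripts.

```python
DAILY_QUERY_LIMIT = 500  # approximate Technorati limit mentioned above


def blogs_for_day(blogs: list, day_number: int,
                  limit: int = DAILY_QUERY_LIMIT) -> list:
    """Return the slice of blogs to refresh on the given day.

    The list is cut into consecutive batches of at most `limit` blogs, and
    the batches are visited in rotation, one per day.
    """
    if not blogs:
        return []
    batch_count = -(-len(blogs) // limit)  # ceiling division
    batch = day_number % batch_count
    return blogs[batch * limit:(batch + 1) * limit]
```

Passing something like `datetime.date.today().toordinal()` as `day_number` would make each nightly run pick up the next batch automatically.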

Gastric woes

Observant taint.org readers might recall me complaining about a bout of food poisoning back in June during ApacheCon week, which, along with a poorly-timed work trip, unfortunately managed to stop me attending ApacheCon altogether.

Turns out that that "food poisoning" never went away — four months later, I’m still having digestive troubles. However, I’ve been lucky enough to figure out a way to minimise it, which I’ll mention here for posterity (and Google).

So, basically, the symptoms were general stomach unsettledness, nausea, cramping, a sharp pain in the right side, and heartburn — all waxing and waning intermittently. (There were issues at "the other end" I’ll leave out, in the interests of good taste.) On top of that, my level of stomach "calmness" was way off — nausea from travelling in cars, buses, taxis etc. became an issue.

Thankfully, it didn’t interfere with work much at all — since I work from home, it was pretty easy to deal with. But it certainly put a damper on trips like ApacheCon, or BarCamp Ireland… it became quite difficult, in particular, to travel any kind of distance during the daytime. (Luckily my ability to partake in pints of Guinness during the evening was not affected, however. ;)

I did the usual thing of visiting my local G.P., and was referred to a gastro-intestinal specialist — that’s all still going on, slowly. But fortunately, in the meantime, I had a breakthrough in terms of dealing with the symptoms.

Initially, the waxing and waning of symptoms seemed pretty random, but after a week or two, a pattern emerged: on a normal day, it’d typically be worst at about 11am, then ease off before lunch, then get worse again after lunch. During and after dinner, it’d be fine, and the evenings were almost symptom-free. On an empty stomach, there were similarly virtually no problems whatsoever.

Of course, having a link with quantities of food makes sense for a GI illness. But it eventually occurred to me that the symptoms were in fact waxing and waning in time with specific types of food. The pattern of symptoms was tracking my drinking of milk, in cereal, and in tea or coffee, delayed by about 2 hours. Now, I’ve always been a total omnivore: I’ve never suffered from allergies, had any issues digesting food, or suffered travel illness. My sea legs were rock solid; one trip to the Great Barrier Reef saw myself and C being the only tourists not to vom over the sides despite some heavy waves. Also, as an Irishman, tea is the core component of my diet, and tea with milk at that; and dairy is similarly at the heart of Irish cuisine in many ways, with plenty of milk, cheese, and butter. I was raised on the stuff, and love it!

But the signs were pretty solid, so I gave up dairy for a week or two to try it out. It took a week to "clear out" initially, but since then, the results have been fantastic; some of the symptoms (the sharp pain, cramps, heartburn) are almost gone, and levels of the others (nausea, stomach ‘unsettledness’) are way down most of the time. If I eat something that contains milk, cheese or whey — such as a packet of crisps recently — I can tell within 10 minutes, since the pain in my right side "twinges" noticeably. It really is astounding.

The weird thing is, this came out of nowhere. A week before that bbq, I was glugging milk without a single issue, and feeling perfectly fine; I’ve never had issues with dairy. Then all of a sudden, it just hit me, seemingly after a short bout of food poisoning, and it still hasn’t gone away.

Talking to people, though, it appears this is more common than one might think; I now know of several people who’ve become lactose intolerant, suddenly, in their 30s.

Anyway, the core issue is still there, but while the wheels of medical science grind on, I at least have pretty good control of the nastier symptoms again. yay.

Technorati-ranked Irish Blogs Top 100

So, I was thinking about the various Irish blog aggregators, Planet.journals.ie, IrishBlogs.ie, and IrishBlogs.info. Michele’s Irishblogs.info attempts to "rank" the blogs by hits, but many of the Irish webloggers don’t include that hit-counting HTML snippet in their web pages, so quite a few are probably missing; on top of that, RSS readers don’t count. It lists me as #3, which I knew was definitely wrong, anyway ;)

However, it occurred to me that an alternative way to compute a "top 100" would be to use the Technorati rank of each blog, and make a table based on that; that’d measure the blogs by Technorati’s readership-estimation algorithm, which may still be faulty, of course, but worth a try… I was curious, so I gave it a go, and here are the results. Enjoy!

Update: This table is no longer up-to-date — a much fresher version is now available over here, and will be updated regularly.
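The two orderings used in the tables below come down to sorting the same rows on different columns. Here is a Python sketch using three rows taken from those tables; the tie-break on inbound links in the first sort is an assumption based on how equal-rank rows appear to be ordered there.

```python
# (rank, inbound blogs, inbound links, url): three rows from the tables below
BLOGS = [
    (2940, 638, 1931, "http://www.tomrafteryit.net/"),
    (8231, 315, 625, "http://twentymajor.blogspot.com/"),
    (21715, 133, 968, "http://ocaoimh.ie/"),
]

# A lower Technorati rank number is better, so sort ascending by rank;
# ties are assumed to be broken by inbound links, highest first
by_rank = sorted(BLOGS, key=lambda row: (row[0], -row[2]))

# More inbound links is better, so sort descending by link count
by_links = sorted(BLOGS, key=lambda row: row[2], reverse=True)

for position, (rank, blogs, links, url) in enumerate(by_links, start=1):
    print(position, rank, blogs, links, url)
```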

Top 100 by rank / inbound blog links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 8231 315 625   http://twentymajor.blogspot.com/
4 10984 249 512   http://www.natterjackpr.com/
5 15720 181 409   http://www.avalon5.com/
6 18897 151 315   http://irish.typepad.com/irisheyes/
7 19364 148 472   http://www.gavinsblog.com/
8 21214 136 385   http://www.blather.net/
9 21715 133 968   http://ocaoimh.ie/
10 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
11 22258 130 323   http://thetorturegarden.blogspot.com/
12 23921 122 351   http://www.dehora.net/journal/
13 24143 121 199   http://www.atlanticblog.com/
14 24828 118 174   http://freestater.blogspot.com/
15 25570 115 260   http://arseblog.com/WP
16 25570 115 246   http://tcal.net/
17 27174 109 252   http://www.digitalrights.ie/
18 27189 110 169   http://cork2toronto.blogspot.com/
19 28004 106 731   http://taint.org/
20 29008 103 286   http://unitedirelander.blogspot.com/
21 29008 103 232   http://www.nialler9.com/blog
22 29008 103 175   http://clickhere.blogs.ie/
23 29978 100 270   http://www.mneylon.com/blog
24 31954 95 901   http://www.irishelection.com/
25 33397 91 231   http://memex.naughtons.org/
26 34121 89 370   http://siciliannotes.blogspot.com/
27 35022 86 285   http://www.sineadgleeson.com/blog
28 35022 86 146   http://www.cfdan.com/
29 35858 84 904   http://www.pkellypr.com/blog
30 36223 84 255   http://www.thinkingoutloud.biz/
31 37735 80 175   http://www.dervala.net/
32 39719 76 207   http://backseatdrivers.blogspot.com/
33 40078 76 229   http://fdelondras.blogspot.com/
34 40276 75 203   http://www.mediangler.com/
35 40821 74 128   http://www.thinkinghomebusiness.com/blog
36 44148 69 122   http://outofambit.blogspot.com/
37 45075 67 147   http://www.podleaders.com/
38 45075 67 87   http://www.aidanf.net/
39 45729 66 238   http://www.argolon.com/
40 46477 65 201   http://www.sarahcarey.ie/
41 46477 65 191   http://disillusionedlefty.blogspot.com/
42 47586 64 141   http://www.johnbreslin.com/blog
43 48011 63 66   http://www.branedy.net/
44 52278 58 398   http://dossing.blogspot.com/
45 54710 56 155   http://redmum.blogspot.com/
46 55758 55 103   http://richarddelevan.blogspot.com/
47 56390 54 148   http://donal.wordpress.com/
48 56390 54 129   http://prettycunning.net/blog
49 57527 53 104   http://www.dublinblog.ie/
50 58724 52 167   http://www.tuppenceworth.ie/blog
51 58724 52 102   http://www.inter-actions.biz/blog/
52 59920 51 101   http://seanmcgrath.blogspot.com/
53 60315 51 76   http://www.blackphoebe.com/msjen/
54 62483 49 112   http://www.infactah.com/
55 62885 49 118   http://mamanpoulet.blogspot.com/
56 63869 48 229   http://icecreamireland.com/
57 68503 45 93   http://www.web2ireland.org/
58 68503 45 75   http://www.davidmcwilliams.ie/
59 68503 45 73   http://vipglamour.net/
60 68824 45 193   http://imeall.blogspot.com/
61 72248 43 81   http://planetpotato.blogs.com/planet_potato_an_irish_bl/
62 73843 42 149   http://lettertoamerica.blogs.com/
63 73843 42 119   http://www.kenmc.com/
64 73843 42 102   http://www.pmooney.net/blogsphe.nsf
65 73843 42 70   http://bohanna.typepad.com/pureplay/
66 75725 41 107   http://bonhom.ie/
67 75725 41 93   http://www.bibliocook.com/
68 75725 41 78   http://shittyfirstdraft.blogspot.com/
69 77680 40 225   http://bestofbothworlds.blogspot.com/
70 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
71 77957 40 82   http://davesrants.com/
72 79732 39 103   http://ricksbreakfastblog.blogspot.com/
73 80012 39 92   http://manuel-estimulo.blogspot.com/
74 81970 38 91   http://gingerpixel.com/
75 82240 38 248   http://www.linksheaven.com/
76 84304 37 726   http://thelimerick.blogspot.com/
77 84304 37 127   http://www.ryderdiary.com/
78 84304 37 83   http://morgspace.net/
79 84304 37 64   http://talideon.com/weblog/
80 86729 36 140   http://www.damienblake.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 86729 36 102   http://blog.rymus.net/
83 86729 36 65   http://www.adammaguire.com/blog
84 87068 36 272   http://progressiveireland.blogspot.com/
85 89814 35 145   http://www.windsandbreezes.org/
86 92646 34 43   http://football-corner.blogspot.com/
87 95258 33 207   http://www.fustar.org/
88 95258 33 171   http://www.iced-coffee.com/
89 95258 33 82   http://www.bytesurgery.com/gearedup
90 101881 31 90   http://phoblacht.blogspot.com/
91 101881 31 70   http://counago-and-spaves.blogspot.com/
92 101881 31 58   http://www.firstpartners.net/blog
93 105668 30 82   http://realitycheckdotie.blogspot.com/
94 109643 29 142   http://bifsniff.com/cartoons/
95 109643 29 75   http://dave.antidisinformation.com/
96 109643 29 60   http://conoroneill.com/
97 109643 29 55   http://www.minds.may.ie/%7Edez/serendipity/
98 109643 29 51   http://dublin.metblogs.com/
99 110005 29 78   http://www.janinedalton.com/blog
100 110005 29 54   http://www.runningwithbulls.com/blog

List by inbound links:

Position Rank Inbound blogs Inbound links Blog
1 2940 638 1931   http://www.tomrafteryit.net/
2 6636 371 1280   http://www.mulley.net/
3 21715 133 968   http://ocaoimh.ie/
4 35858 84 904   http://www.pkellypr.com/blog
5 31954 95 901   http://www.irishelection.com/
6 28004 106 731   http://taint.org/
7 84304 37 726   http://thelimerick.blogspot.com/
8 8231 315 625   http://twentymajor.blogspot.com/
9 258886 13 519   http://newswire99.blogspot.com/
10 10984 249 512   http://www.natterjackpr.com/
11 19364 148 472   http://www.gavinsblog.com/
12 164780 20 451   http://inao.blogspot.com/
13 15720 181 409   http://www.avalon5.com/
14 22210 132 399   http://eirepreneur.blogs.com/eirepreneur/
15 52278 58 398   http://dossing.blogspot.com/
16 21214 136 385   http://www.blather.net/
17 34121 89 370   http://siciliannotes.blogspot.com/
18 23921 122 351   http://www.dehora.net/journal/
19 156276 21 336   http://www.ebbybrett.co.uk/blog
20 22258 130 323   http://thetorturegarden.blogspot.com/
21 18897 151 315   http://irish.typepad.com/irisheyes/
22 29008 103 286   http://unitedirelander.blogspot.com/
23 35022 86 285   http://www.sineadgleeson.com/blog
24 87068 36 272   http://progressiveireland.blogspot.com/
25 239963 14 271   http://www.thehealthtechblog.com/
26 29978 100 270   http://www.mneylon.com/blog
27 25570 115 260   http://arseblog.com/WP
28 36223 84 255   http://www.thinkingoutloud.biz/
29 27174 109 252   http://www.digitalrights.ie/
30 82240 38 248   http://www.linksheaven.com/
31 977738 3 248   http://www.tomgriffin.org/the_green_ribbon/
32 25570 115 246   http://tcal.net/
33 45729 66 238   http://www.argolon.com/
34 29008 103 232   http://www.nialler9.com/blog
35 33397 91 231   http://memex.naughtons.org/
36 40078 76 229   http://fdelondras.blogspot.com/
37 63869 48 229   http://icecreamireland.com/
38 77680 40 225   http://bestofbothworlds.blogspot.com/
39 208904 16 210   http://www.anlionra.com/
40 471327 7 208   http://www.ravenfamily.org/sam/
41 39719 76 207   http://backseatdrivers.blogspot.com/
42 95258 33 207   http://www.fustar.org/
43 40276 75 203   http://www.mediangler.com/
44 46477 65 201   http://www.sarahcarey.ie/
45 637233 5 200   http://armchaircelts.co.uk/
46 24143 121 199   http://www.atlanticblog.com/
47 280786 12 199   http://conann.com/
48 68824 45 193   http://imeall.blogspot.com/
49 46477 65 191   http://disillusionedlefty.blogspot.com/
50 637233 5 182   http://www.everysecondpaycheck.com/blog
51 164524 20 181   http://irishlinks.blogspot.com/
52 542250 6 176   http://www.dublinka.com/
53 29008 103 175   http://clickhere.blogs.ie/
54 37735 80 175   http://www.dervala.net/
55 24828 118 174   http://freestater.blogspot.com/
56 155943 21 172   http://www.jamesgalvin.com/
57 95258 33 171   http://www.iced-coffee.com/
58 164524 20 171   http://irishcraftworker.typepad.com/an_irish_craftworkers_goo/
59 27189 110 169   http://cork2toronto.blogspot.com/
60 58724 52 167   http://www.tuppenceworth.ie/blog
61 141242 23 164   http://atp.datagate.net.uk/blog
62 148304 22 159   http://www.lifewithouttoast.com/
63 184241 18 158   http://funferal.org/
64 54710 56 155   http://redmum.blogspot.com/
65 73843 42 149   http://lettertoamerica.blogs.com/
66 56390 54 148   http://donal.wordpress.com/
67 45075 67 147   http://www.podleaders.com/
68 155943 21 147   http://dublinopinion.com/
69 35022 86 146   http://www.cfdan.com/
70 89814 35 145   http://www.windsandbreezes.org/
71 109643 29 142   http://bifsniff.com/cartoons/
72 195745 17 142   http://podcasting.ie/podcast
73 47586 64 141   http://www.johnbreslin.com/blog
74 86729 36 140   http://www.damienblake.com/
75 223280 15 137   http://thegurrier.com/
76 77680 40 134   http://www.stdlib.net/%7Ecolmmacc
77 980795 3 131   http://www.sineadcochrane.com/
78 56390 54 129   http://prettycunning.net/blog
79 40821 74 128   http://www.thinkinghomebusiness.com/blog
80 84304 37 127   http://www.ryderdiary.com/
81 86729 36 124   http://irisheagle.blogspot.com/
82 44148 69 122   http://outofambit.blogspot.com/
83 73843 42 119   http://www.kenmc.com/
84 62885 49 118   http://mamanpoulet.blogspot.com/
85 135121 24 117   http://nellysgarden.blogspot.com/
86 195745 17 115   http://blog.infurious.com/
87 542250 6 114   http://ainelivia.typepad.com/aine_livia_at_the_midnigh/
88 62483 49 112   http://www.infactah.com/
89 75725 41 107   http://bonhom.ie/
90 57527 53 104   http://www.dublinblog.ie/
91 55758 55 103   http://richarddelevan.blogspot.com/
92 79732 39 103   http://ricksbreakfastblog.blogspot.com/
93 58724 52 102   http://www.inter-actions.biz/blog/
94 73843 42 102   http://www.pmooney.net/blogsphe.nsf
95 86729 36 102   http://blog.rymus.net/
96 59920 51 101   http://seanmcgrath.blogspot.com/
97 173857 19 99   http://www.ofoghlu.net/log/
98 118678 27 96   http://irishkc.com/
99 68503 45 93   http://www.web2ireland.org/
100 75725 41 93   http://www.bibliocook.com/

Update: Here’s a full list of all 569 tested blogs. Also, there’s been a minor change to the rankings here; I’ve just realised that there was a bug in how the script handled evenly-matched blogs, so (for example) #15 and #16 were reversed in order; that’s now fixed.

If you find a blog missing, it’s possible that it (a) isn’t pinging Planet.journals.ie or (b) isn’t registered with Technorati; this method requires both. Most Irish blogs do both, but some (Old Rotten Hat, for example) don’t…

Methodology

I found this more-or-less full list of Irish weblogs at Planet.journals.ie, and selected the blogs that had pinged their site in the past 6 months, then cut that down to just the blog main-page URLs, removing duplicates.

Given that list, I then looked up each blog URL using the Technorati API, and got its rank, inbound link count, and inbound linking blogs count.

top100code.tgz is a tarball of the Perl code I wrote to do this, if you’d like to run it yourself on whichever set of blogs you fancy…
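For anyone who'd rather not unpack the tarball, the ranking step (including the tie handling mentioned in the update above) can be sketched in a few lines of Python. This is not the original Perl. The `Blog` type, the function names, and the exact tie-break order are assumptions for illustration, inferred from how tied rows appear in the two tables; the sample rows are copied from those tables.

```python
from typing import NamedTuple


class Blog(NamedTuple):
    url: str
    rank: int            # Technorati rank (lower is better)
    inbound_blogs: int
    inbound_links: int


def by_inbound_links(blogs):
    # Composite key: most links first; ties fall back to the better
    # (lower) rank, then the URL, so the order is stable between runs.
    return sorted(blogs, key=lambda b: (-b.inbound_links, b.rank, b.url))


def by_rank(blogs):
    # Best rank first; ties fall back to the higher link count, then URL.
    return sorted(blogs, key=lambda b: (b.rank, -b.inbound_links, b.url))


SAMPLE = [
    Blog("http://icecreamireland.com/", 63869, 48, 229),
    Blog("http://fdelondras.blogspot.com/", 40078, 76, 229),
    Blog("http://prettycunning.net/blog", 56390, 54, 129),
    Blog("http://donal.wordpress.com/", 56390, 54, 148),
]

if __name__ == "__main__":
    for pos, b in enumerate(by_inbound_links(SAMPLE), start=1):
        print(pos, b.rank, b.inbound_blogs, b.inbound_links, b.url)
```

Sorting on a tuple key rather than a single field is what keeps evenly-matched blogs from swapping places.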

Maximise value, not protection (fwd)

Here’s an excellent quote from the OpenGeoData weblog, really worth reproducing:

”We think the natural tendency is for producers to worry too much about protecting their intellectual property. The important thing is to maximise the value of your intellectual property, not to protect it for the sake of protection. If you lose a little of your property when you sell it or rent it, that’s just a cost of doing business, along with depreciation, inventory losses, and obsolescence.” — Information Rules, Carl Shapiro and Hal Varian, page 97.

Words to live by!

The vagaries of Google Image Search

Remember the C=64-izer, the quick hack to display an image in the style of the Commodore 64?

Recently, I’ve started getting hits to this demo image of the "O RLY?" owl — lots of ’em.

It turns out that the C=64-ized rendition of this image is now the top hit for "O RLY" on Google Image Search; pretty bizarre, since there are obviously better images on the first search page, one result along in fact. What’s more, the page listed as the ‘origin page’, http://taint.org/tag/today, doesn’t even use that text.

This has resulted in lots of Myspace kiddies etc. obliviously using the C=64 rendering. Yay for Commodore ;)

VISA and priorities

A couple of years ago, various anti-spammers discussed how the credit-card payment processing companies were perfectly placed to disrupt the spam economy, by tracking down spammers through "poison pill" transactions. Nothing happened from that, though, and spam is now a bigger problem than ever.

Today, I hear that the Russian MP3 site, AllOfMP3, have lost their account with Visa to process credit-card payments.

In other words, it sounds like the banks are happy enough to close off filesharing, but couldn’t be bothered dealing with spam…

Ireland now has RFID passports

Back in February, I wrote about some Dutch hackers remotely reading Dutch RFID passports, and my email to the Irish Passport Office enquiring about their plans.

They never bothered writing back; I guess they were too busy implementing the damn things :( Their new ‘ePassports’ are now mandatory for new Irish passports:

The chip technology allows the information stored in an Electronic Passport to be read by special chip readers at a close distance.

"special chip readers at a close distance" and/or "random criminals looking for Irish victims at a distance of 30 feet", I guess.

Here are the slides for Riscure’s attack on the Dutch passports. Irish passports similarly use "Basic Access Control". I wonder if Irish passport numbers are sequential, since that seems to be a key part of their attack?

DIY Glory

It’s been a while since I’ve embarked on a DIY job around the house with quite as much success as the most recent one — laying and tacking down some new carpet in the front hall. The last job was a bike rack, which had to be abandoned after the 4-inch screws proved too loose and threatened to fall out of the wall, leaving gigantic plugs of Polyfilla in their place (I’m sure bad drilling had nothing to do with it).

This has all now been forgotten in the glory of the freshly-laid carpet. Now, every time I walk past the front hall, I have to stick my head in and check out the perfectly-fitted carpet with pride. This can only last so long before my next botch job, of course…

Anti-spam group under attack — via ICANN

[This is a copy of an article I submitted to ICANNWatch.]

Spamhaus, the UK-based non-profit that runs the SBL and XBL anti-spam DNS blocklists, is reportedly facing serious legal trouble in the US.

A US-based spam gang has started legal action to have Spamhaus’ domain name confiscated by ICANN, and reportedly, Spamhaus may have been advised badly by their US legal people; so there is now a danger that they *may* indeed lose their domain, and possibly worse.

Note that Spamhaus is entirely UK-based, bar some mirrors; however, the proposed order is aimed at ICANN, which is US-based. This is the really tricky part; can a US company kill the domain of a non-US group?

According to anti-spam lawyer Matthew Prince, ‘there may be some time before ICANN is formally ordered to shut down the Spamhaus domain, but make no mistake that ICANN’s lawyers will be considering their options beginning first thing Monday, if they haven’t already begun the conference calls tonight’ … ‘In the end, [ICANN’s] decision is likely to be much more about setting a general policy than the specific details of who Spamhaus is or why they are critical for the Internet. ICANN will desperately want to stay out of this dispute, but they are subject to U.S. law and they will probably have attorneys who will argue they need to follow it. All it will take for this to end badly for Spamhaus is one lawyer at ICANN getting a little bit spooked and Spamhaus could lose not only it’s .org but potentially any other TLD that ICANN controls.’

This is interesting — if Spamhaus is forced to close down its domains and US-based mirrors, that will mean that the SBL and XBL blocklists will be down for a while, too. Typically those are used for up-front blocking, and if my servers are any indication, they take care of 75% of incoming spam before it hits any more CPU-intensive filtering.

Without those, there’ll be a lot of sites around the net suddenly dealing with quadrupled spam volumes hitting their MTAs.
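The "quadrupled" figure follows directly from the 75% number: if the blocklists stop three quarters of the spam up front, only a quarter currently reaches the later filters, so losing the blocklists multiplies that load by four. A tiny sketch of the arithmetic (the function name is mine, and 75% is just what my own servers show):

```python
def volume_multiplier(blocked_fraction: float) -> float:
    """How much more spam reaches later filters once up-front blocking stops."""
    if not 0.0 <= blocked_fraction < 1.0:
        raise ValueError("blocked_fraction must be in [0, 1)")
    # Today only (1 - blocked_fraction) of spam gets through; without the
    # blocklist all of it does, so the load grows by the reciprocal.
    return 1.0 / (1.0 - blocked_fraction)


if __name__ == "__main__":
    print(volume_multiplier(0.75))
```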

NEDAP voting machines hacked

Here’s a press release from ICTE that’s well worth a read if you still trust voting machines:

Concerns expressed by many IT professionals about the security of the e-voting system chosen for use in Ireland were today shown to be well-founded when a group of Dutch IT Specialists, using documentation obtained from the Irish Department of the Environment, demonstrated that the NEDAP e-voting machines could be secretly hacked, made to record inaccurate voting preferences, and could even be secretly reprogrammed to run a chess program.

The recently formed Dutch anti e-voting group, "Wij vertrouwen stemcomputers niet" (We don’t trust voting computers), has revealed on national Dutch television program "EenVandaag" on Nederland 1, that they have successfully hacked the Nedap machines — identical to the machines purchased for use in Ireland in all important respects.

ICTE representative Colm MacCarthaigh, who has seen and examined the compromised Nedap machine in action in Amsterdam, notes "The attack presented by the Dutch group would not need significant modification to run on the Irish systems. The machines use the same construction and components, and differ only in relatively minor aspects such as the presence of extra LEDs to assist voters with the Irish voting system. The machines are so similar that the Dutch group has been using only the technical reference manuals and materials relevant to the Irish machines as a guide, as those are the only materials publicly available."

Maurice Wessling, of Wij vertrouwen stemcomputers niet, adds "Compromising the system requires replacing only a single component, roughly the size of a stamp, and is impossible to detect just by looking at the machine".

Both ICTE and Wij vertrouwen stemcomputers niet view this as yet another demonstration that no voting system which lacks a voter-verified audit trail can be trusted. According to ICTE spokesperson Margaret McGaley "Any system which lacks a means for the voter to verify that their vote has been correctly recorded is fundamentally and irreparably flawed".

Margaret McGaley highlighted that it is the machines themselves that are at risk. "This particular issue is not about the vote counting software, which we already know must be replaced, this is about the machines that the Taoiseach has claimed were ‘validated beyond any question’. We now have proof that these machines can be made to lie about the votes that have been cast on them. It is abundantly clear that these machines would pose a genuine risk to our democracy if used in elections in Ireland."

ICTE is repeating its call, which reflects the opinions shared by IT expert groups, including the E-voting group of the Irish Computing Society, that any voting system implemented must include a voter-verified audit-trail.

This is a major exploit. Colm’s earlier mail noted

As we knew already, the machines run on m68k processors, and it’s relatively easy to reverse engineer what all of the registers and inputs correspond to. The Dutch group were able to successfully assemble code to run on the machine, and even burn it on the very EEPROM that comes in the machine.

Since the NEDAP design does not include XBox-style boot-time cryptographic verification of the EEPROM’s contents, undetectable replacement of the operating system is a 2-minute matter of unsticking the trivial ‘seals’ on the voting machine’s access panels, popping out an EEPROM chip, and replacing with a modified one, then closing it up again.
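To make "boot-time cryptographic verification" a bit more concrete, here's a rough sketch of the idea in Python. It is purely illustrative: the function names are made up, nothing here reflects how the NEDAP hardware or the XBox actually work, and a real design would check a digital signature from boot code the attacker can't replace, rather than compare a hash in a script.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    # SHA-256 over the full EEPROM contents.
    return hashlib.sha256(image).hexdigest()


def verify_firmware(image: bytes, trusted_digest: str) -> bool:
    # trusted_digest is assumed to live somewhere an attacker can't swap
    # out along with the chip (e.g. a boot ROM); otherwise the check is moot.
    return hmac.compare_digest(firmware_digest(image), trusted_digest)


if __name__ == "__main__":
    official = b"official vote-recording firmware"
    trusted = firmware_digest(official)
    print(verify_firmware(official, trusted))
    print(verify_firmware(b"modified firmware", trusted))
```

The point is simply that a swapped chip would fail the check before any votes were recorded, which is exactly the step the NEDAP design lacks.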

Once that’s done, the election is rigged, as WVSN have demonstrated.

Update: here’s their paper describing the attack in detail — well worth a read.

a plug for Map24

Nat at O’Reilly Radar mentions that Multimap have added a public API. It’s great to see more sites adding public APIs, but sadly, as I note in a comment there, Multimap isn’t any use for me: they, along with Google and Yahoo!, have really crappy Irish mapping. Their geocoders (the part that turns an English-language address into a GIS coordinate pair) are pretty much non-functional for Ireland.

I moved from the US to Ireland earlier this year and found this pretty frustrating, after the joys of using the US mapping sites to get driving directions etc.

Thankfully, another contender has emerged recently — Map24.

They have a great geocoder for Ireland, and very reliable directions, which are even accurate for some of the more baroque one-way-system traffic-management changes that Dublin’s city planning department have come up with recently. The look and feel of the website is a little clunky in Firefox — not as smooth as Google’s — but it has some nice AJAXy touches now and seems to be heading in the right direction.

Interestingly, they now offer a public API for third-party mashups, and even offer an API for their geocoder — so someone preferring the Google look and feel could mash that up, using Map24 to find the coordinates and Google to display an area map! (Actually, I think that may be how John Handelaar’s earlier hack worked — I note in the comments that he mentions Map24 provide Lycos’ mapping backend. aha.)

Anyway — Map24 — if you’re looking for a good Irish mapping/driving-directions site, it’ll do the trick.

Some p0f Data From Craig

Regarding the use of p0f, passive OS fingerprinting, as an anti-spam measure — on top of this analysis which I linked to a few weeks back, one of the emeritus SA guys, Craig Hughes, sends over some p0f experiences. Handily, this includes a more detailed breakdown by OS release:

I’ve been using the SA p0f plugin for nearly a month or so now both on gumstix’s web server and my hughes-family.org server, and it actually looks like it could be pretty useful. So far I’ve just been scoring 0.001 for each OS to collect data, but here’s the results amavis has logged:

This breakdown shows what %age of the stuff coming in via OS xyz is spam or ham. ie 84.6% of all mail received from Windows-2000 is spam, 14.9% is ham (the rest is viruses). The first numeric column is number of messages of each type. Statistics are only since the last time amavis restarted:

On his home machine (Comcast cable modem connection):

spam.byOS.Windows-2000 438 1/h 84.6 %
spam.byOS.Linux 417 1/h 18.3 %
spam.byOS.Windows-XP 265 1/h 97.8 %
spam.byOS.UNKNOWN 135 0/h 55.1 %
spam.byOS.Windows-XP/2000 24 0/h 100.0 %
spam.byOS.Novell 5 0/h 100.0 %
spam.byOS.Windows-98 3 0/h 60.0 %
spam.byOS.Windows-2003 2 0/h 66.7 %
spam.byOS.FreeBSD 2 0/h 1.3 %
spam.byOS.Solaris 1 0/h 1.8 %
spam.byOS.Windows-SP3 1 0/h 100.0 %
ham.byOS.Linux 1851 6/h 81.2 %
ham.byOS.FreeBSD 143 0/h 96.0 %
ham.byOS.UNKNOWN 102 0/h 41.6 %
ham.byOS.Windows-2000 77 0/h 14.9 %
ham.byOS.Solaris 56 0/h 98.2 %
ham.byOS.NetCache 6 0/h 100.0 %
ham.byOS.Windows-XP 6 0/h 2.2 %
ham.byOS.Tru64 2 0/h 100.0 %
ham.byOS.AIX 2 0/h 100.0 %
ham.byOS.Windows-98 2 0/h 40.0 %
ham.byOS.Windows-2003 1 0/h 33.3 %

On gumstix.com (hosted at some provider in Texas):

spam.byOS.Windows-2000 401 1/h 58.4 %
spam.byOS.Windows-XP 131 0/h 92.9 %
spam.byOS.UNKNOWN 64 0/h 18.7 %
spam.byOS.Windows-XP/2000 29 0/h 96.7 %
spam.byOS.FreeBSD 11 0/h 4.1 %
spam.byOS.Linux 11 0/h 0.5 %
spam.byOS.Windows-98 6 0/h 85.7 %
spam.byOS.Solaris 4 0/h 3.3 %
spam.byOS.Windows-SP3 2 0/h 100.0 %
ham.byOS.Linux 1983 4/h 97.6 %
ham.byOS.UNKNOWN 277 0/h 80.8 %
ham.byOS.Windows-2000 271 0/h 39.4 %
ham.byOS.FreeBSD 253 0/h 93.7 %
ham.byOS.Solaris 116 0/h 96.7 %
ham.byOS.NetCache 40 0/h 100.0 %
ham.byOS.Windows-XP 9 0/h 6.4 %
ham.byOS.Windows-NT 7 0/h 70.0 %
ham.byOS.Novell 3 0/h 100.0 %
ham.byOS.Windows-XP/2000 1 0/h 3.3 %
ham.byOS.Windows-98 1 0/h 14.3 %
ham.byOS.Windows-2003 1 0/h 100.0 %

my home machine has a lot more relayed mail coming to it (all my various craig@* email addresses forward into there) which is probably why the linux spam rate is higher there — the relaying machines are probably running linux and forwarding spam through.

Interesting figures, but I’m still not convinced that the correlation is quite high enough to form a good enough basis for solid anti-spam rules; reliable rules in the SpamAssassin core typically have over 95% accuracy at differentiating ham from spam (at least when we first check them in).
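To put numbers on that, here's a quick Python sketch that recomputes the spam share per OS from a few of the gumstix.com counts above and compares it against a 95% bar. Two assumptions to flag: it ignores viruses, so the percentages come out slightly different from the amavis figures, and "accuracy" is simplified to "fraction of mail from that OS which is spam", which is not how SpamAssassin rules are really evaluated.

```python
# Message counts copied from the gumstix.com table.
SPAM = {"Windows-2000": 401, "Windows-XP": 131, "FreeBSD": 11, "Linux": 11, "Solaris": 4}
HAM = {"Windows-2000": 271, "Windows-XP": 9, "FreeBSD": 253, "Linux": 1983, "Solaris": 116}


def spam_fraction(os_name):
    """Share of spam among spam+ham from this OS (viruses ignored)."""
    spam = SPAM.get(os_name, 0)
    ham = HAM.get(os_name, 0)
    total = spam + ham
    return spam / total if total else None


def is_rule_candidate(os_name, threshold=0.95):
    # Assumed bar: a spam rule should hit spam at least 95% of the time.
    frac = spam_fraction(os_name)
    return frac is not None and frac >= threshold


if __name__ == "__main__":
    for name in sorted(SPAM):
        print(f"{name:14s} {spam_fraction(name):6.1%} candidate={is_rule_candidate(name)}")
```

On these counts even Windows-XP falls a little short of the bar, which is why I'd be wary of using the OS alone as a rule.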

Update: it’s a natural for use as a Bayes token, though. The way amavisd-new implements p0f support is perfect for this use.

BTW, my guess is that many of the spam hits for "linux" are due to things like Netgear/Linksys routers, running embedded linuces. No evidence, just guessing ;)

Linus on Bayesian filtering

Linus Torvalds, in a post to linux-kernel today:

I’m sorry, but spam-filtering is simply harder than the bayesian word-count weenies think it is. I even used to know something about bayesian filtering, since it was one of the projects I worked on at uni, and dammit, it’s not a good approach, as shown by the fact that it’s trivial to get around.

I don’t know why people got so excited about the whole bayesian thing. It’s fine as one small clause in a bigger framework of deciding spam, but it’s totally inappropriate for a "yes/no" kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not accepting email from machines that aren’t registered MX gateways. Sure, that will mean that people who just set up their local sendmail thing and connect directly to port 25 will just not be able to email, but let’s face it, that’s why we have ISP’s and DNS in the first place.

But don’t do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it – with some Bayesian rule perhaps adding a few points to the score. That’s entirely appropriate. But running bogo-filter instead of spamassassin is just asinine.

Me, I like bogofilter — those guys are cool, and it’s a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases is a bigger picture than just Bayes alone, a la SpamAssassin ;)

Back in one piece

Well, I’m back in Dublin in one piece, after a great honeymoon in Corsica. Lots of stuff to catch up on, so if you’re waiting on a response, sorry, it might take a little longer…

Hitched! Pt. 2

Well, the second half of the wedding — the fun part, with dinner, dancing, friends, and family — went off without a hitch. Our hippy-crap-laden humanist ceremony, celebrated with the aid of our friend Gerry, was a great success; the pianist and various DJs provided fantastic aural accompaniment; and the venue, Markree Castle in County Sligo, was fantastic, taking care of the entire party in every way we hadn’t foreseen and putting up with us far into the early hours of the next day.

That was the most fun I’ve had in yonks, and thanks to everyone who came. (And those who didn’t, due to the whims of US visa conditions — you were much missed.)

Photos will follow once we’re back from the honeymoon, which starts tomorrow morning. later ;)

BarCamp Ireland

Wow, BarCamp Ireland is really shaping up!

Unfortunately, it’s very unlikely that I’ll be able to make it, due to all the wedding/honeymoon activity around that time (and it being down in Cork, which is a bit of a nontrivial journey at the moment). Pity, it looks like it’ll be great — and could probably do with some more talks about open source, to go with all the web2.0/startup content ;)

SpicyLinks and del.icio.us Network Summarization

Ross Mayfield:

Every time I see Gabe Rivera of TechMeme, I ask for the same thing — MeMeme. Give me TechMeme where the core index is based on who I read, about 150 people at any given time, to show me what my friends are interested in.

Funnily enough, that is exactly why I wrote SpicyLinks!

It works pretty well — in fact, nowadays I don’t really bother reading slashdot, Digg, Reddit, et al, particularly frequently, because I know that all the really interesting stuff will be at the top of my newsreader in the SpicyLinks feed.

Anyway, I’ve been calling SpicyLinks a ‘summarizing aggregator’, but the discussion that arose from Ross’ posting inspired me. A little bit of hacking has come up with an interesting twist: take a del.icio.us social network, a CGI script called deliciousnetwork2opml.cgi, and 15 minutes hacking on SpicyLinks to support inclusion of OPML via a remote URI, and hey presto — it’s now a social-network summarising aggregator. ;)
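The OPML side of that is simple enough to sketch. The following is not deliciousnetwork2opml.cgi itself, just a rough Python illustration of the idea: take a list of usernames from a del.icio.us network and emit an OPML file with one feed per person. The feed URL pattern, the function name, and the usernames are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed per-user feed URL pattern; adjust to whatever the service provides.
FEED_URL_TEMPLATE = "http://del.icio.us/rss/{user}"


def network_to_opml(users, title="del.icio.us network"):
    """Build an OPML document with one RSS outline per network member."""
    opml = ET.Element("opml", version="1.1")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for user in users:
        ET.SubElement(
            body,
            "outline",
            type="rss",
            text=user,
            xmlUrl=FEED_URL_TEMPLATE.format(user=user),
        )
    return ET.tostring(opml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical usernames, for illustration only.
    print(network_to_opml(["alice", "bob"]))
```

An aggregator that can load OPML from a remote URI can then treat the output as its subscription list.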

Unblocked

I just found an error in an Apache config file for taint.org, resulting in some of the legacy RSS feed URLs producing invalid data — this meant that anyone subscribed to the Feedburner feed, for example, had been missing out on my witterings. Fixed now — apologies!

Flickr’s Lousy US-Only Maps

Update: This is now fixed. See here for details…

Here’s the 2lmc boys getting rightly annoyed about Flickr’s new mapping feature, which displays geotagged photos overlaid on a mapping UI — as they note, it’s basically a steaming pile of crap outside the US:

However, because Flickr are owned by Yahoo, they’re using their maps. And, like all Yahoo! products, if you’re not American, it sucks.

Compare this lovely data-rich map of SF:

[image: Flickr map of San Francisco]

With this featureless grey blob:

[image: Flickr map of Dublin]

That’s just pathetic — there isn’t a single place name visible, and even the Phoenix Park, the biggest urban park in Europe, is simply displayed just as a light-coloured splat with a road going through it.

It appears the Yahoo! mapping data for the UK and Ireland just isn’t really there. What someone needs to do, is take the geotagging data from Flickr, and overlay it on the far more informative Google map data instead ;):

[image: Google map of Dublin]

It’s a real shame — I used to rely on Y! Maps to get directions everywhere while in the US. They’re missing out on so many customers here…

Update: good news — the Flickr maps are now things of beauty to match Google’s:

[image: flickr-fixed.gif]

Hitched!

Yesterday was spent in the beautiful surrounds of Naas Leisure Centre, attending the Kildare Registry Office for a brief ceremony and some putting of pen to paper — and hey presto, myself and the lovely C are now husband and wife ;) About time, really — we’ve been going out for 13 years, after all.

This is just the legal preliminaries; the big party is two weeks from now, in a castle in Sligo, and it’s shaping up to be a great one. But still, legally, she’s my wife now.

By the way, one bonus of getting the legal stuff out of the way in advance is that we now don’t have to have all the fun marred by legal requirements on the big day. As a result, our mate Gerry, who a few taint.org readers will know, will be presiding over the real wedding ceremony. ;)

The EHIC and Irish government websites

The European Health Insurance Card is dead handy, providing access to healthcare for EU residents while travelling in Europe — it’s definitely worth having one.

There were a few reports in the Irish newspapers last week of an announcement by the Health Service Executive, warning of "a bogus website" which charges a fee of EUR22 to process applications for this:

The HSE also warned that the site is asking applicants to submit detailed financial information. "It has come to the attention of the Health Service Executive that Irish residents are being targeted by a website which is unnecessarily charging people to apply for EHIC cards. The bogus site concerned — http://www.ehic-card.eu/ — is not connected to the HSE," said the HSE in a statement.

I’d link to the HSE’s press release on the topic, but it’s down, apparently — and that’s pretty indicative of the problem. You see, I’ve been trying to apply for one of these recently.

The HSE has been announcing that there’s no need to use this "bogus site", since we can just use the "real" site at http://www.ehic.ie/ to apply for one. Here’s what they neglect to mention:

  • (a) that unless you’re a pensioner you can’t apply for one online — you have to print out a form, fill it in, and post it to your local health office.
  • (b) there’s no indication on the site as to what exactly your "Local Health Office" may be, just a long list of mysterious locations.
  • (c) in order to apply, the form demands that you supply all that ‘detailed financial information’ — namely your name, address, date of birth, proof of residency, and PPS number — anyway.
  • (d) the "bogus site" isn’t really all that bogus after all.

If they had a simple and usable online application process, perhaps they wouldn’t be plagued by other sites attempting to offer that service for what is really a quite reasonable EUR22 fee?

This is a pretty frequent phenomenon on Irish governmental websites; a half-assed attempt to bring governmental services online, resulting in shiny informational sites, full of clip-art of smiling people talking on the phone, which all come down to a bottom line of "print this out and post it in" or "call this number" — business as usual. Having said that, at least I can generally still get a human on the phone, which still beats dealing with US government agencies, I guess!

BTW, I notice the HSE claim that it only takes 10 working days for an EHIC to arrive using their system. I applied for mine 3 weeks ago, and there’s been no word yet…

Don’t use bl.spamcop.net as a blocklist

Update: as of Oct 2007, this advice is obsolete. The Spamcop algorithms have been greatly improved, as far as I and others can tell.

I’ve been hearing increasing reports of false positives using bl.spamcop.net.

One today spurred me to check exactly how often I’m seeing it misfire on nonspam in my own mail collection. The results have been pretty astonishing.

In my nonspam collection, it fired on 1043 messages out of 8415 in July; 12.4% of the mail. It gets worse for August, though — 884 messages out of 3729 since the start of August. That’s a staggering 23% of my nonspam mail this month. ;)
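For the record, here's how those percentages fall out of the raw counts, as a small Python sketch (the function name is mine; the counts are the ones quoted above):

```python
def false_positive_rate(hits: int, total: int) -> float:
    """Percentage of nonspam messages the blocklist fired on."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


if __name__ == "__main__":
    for month, hits, total in [("July", 1043, 8415), ("August", 884, 3729)]:
        print(f"{month}: {false_positive_rate(hits, total):.1f}%")
```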

Most of that is due to the listings of GMail and Yahoo! Groups, both of which seem to have been listed for large swathes of the past month and a half.

Now, an important point — it can work pretty well as a single input to a scoring system, like Spamcop itself or SpamAssassin. In fact, I didn’t lose any mail as a result of those listings; SpamAssassin assigns only 1.5 points to the RCVD_IN_BL_SPAMCOP_NET rule, so it’s easily corrected by other rules.

However, people using it to block or reject spam outright, or who’ve changed the score of the RCVD_IN_BL_SPAMCOP_NET rule, need to turn that off ASAP — as they are losing mail.
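The difference between the two approaches is easy to show with a toy model. In the Python sketch below, the 1.5 points for RCVD_IN_BL_SPAMCOP_NET comes from the paragraph above and 5.0 is SpamAssassin's default threshold; every other rule name and score is hypothetical, and the code is only a cartoon of a scoring filter, not SpamAssassin's actual logic.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default spam threshold

SCORES = {
    "RCVD_IN_BL_SPAMCOP_NET": 1.5,        # score quoted in the post
    "HYPOTHETICAL_BAYES_HAM": -1.9,       # made-up rule and score
    "HYPOTHETICAL_SPAMMY_SUBJECT": 2.5,   # made-up rule and score
    "HYPOTHETICAL_FORGED_HEADERS": 2.0,   # made-up rule and score
}


def total_score(hits):
    return sum(SCORES[name] for name in hits)


def is_spam_scored(hits):
    # Scoring approach: the blocklist is one input among several.
    return total_score(hits) >= REQUIRED_SCORE


def is_spam_blocklist_only(hits):
    # Outright blocking: a single listing rejects the message.
    return "RCVD_IN_BL_SPAMCOP_NET" in hits


if __name__ == "__main__":
    listed_ham = ["RCVD_IN_BL_SPAMCOP_NET", "HYPOTHETICAL_BAYES_HAM"]
    print(is_spam_scored(listed_ham), is_spam_blocklist_only(listed_ham))
```

A legitimate message relayed through a listed server survives the scoring approach because other evidence outweighs the 1.5 points, but it is lost under outright blocking.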