Barcamp!

Published January 17, 2007

I was wavering for a minute there, but I've decided to head down to Waterford for Barcamp Ireland - SouthEast -- a bit last-minute, but there you go! Tickets and hotel booked.

I'm hoping to give a quick, 20-minute intro to Amazon's EC2 and S3 web services -- what they are, how they're used, some interesting features and a few gotchas to watch out for.

Also, I'm up for dinner on the Saturday night, given there's a promise of free booze ;)

Any taint.org readers heading down?

Debunking the “cocaine on 100% of Irish banknotes” story

Published January 11, 2007

BBC: Cocaine on '100% of Irish euros':

One hundred percent of banknotes in the Republic of Ireland carry traces of cocaine, a new study has found.

Researchers used the latest forensic techniques that would detect even the tiniest fragments to study a batch of 45 used banknotes.

The scientists at Dublin's City University said they were "surprised by their findings".

Also at RTE, Irish Examiner, PhysOrg.com, Bloomberg.com, even at Kazakhstan's KazInform.

This story is (of course) being played widely in the media as "OMG Ireland must use more coke than anywhere else" -- in particular, in comparison with a previous study in the US:

The most recent survey carried out in the US showed 65% of dollar notes were contaminated with cocaine.

The DCU press-release has a few more details:

Using a technique involving chromatography/mass spectrometry, a sample of 45 bank notes were analysed to show the level of contamination by cocaine. ...

62% of notes were contaminated with levels of cocaine at concentrations greater than 2 nanograms/note, with 5% of the notes showing levels greater than 100 times higher, indicating suspected direct use of the note in either drug dealing or drug inhalation. ... The remainder of the notes which showed only ultra-trace quantities of cocaine was most probably the result of contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces.

However, looking at an abstract of what I think is the paper in question, Evaluation of monolithic and sub 2 µm particle packed columns for the rapid screening for illicit drugs -- application to the determination of drug contamination on Irish euro banknotes, Jonathan Bones, Mirek Macka and Brett Paull, Analyst, 2007, DOI: 10.1039/b615669j, that says:

A study comparing recently available 100 × 3 mm id, 200 × 3 mm id monolithic reversed-phase columns with a 50 × 2.1 mm id, 1.8 µm particle packed reversed-phase columns was carried out to determine the most efficient approach ... for the rapid screening of samples for 16 illicit drugs and associated metabolites. ... Method performance data showed that the new LC-MS/MS method was significantly more sensitive than previous GC-MS/MS based methods for this application.

My emphasis. I'd guess this means that comparing this result to banknote-analysis experiments carried out elsewhere, using different methods, is probably invalid -- perhaps this method is simply better at picking up 'contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces', as noted in the DCU release?

Email authentication is not anti-spam

Published January 10, 2007

There's a common misconception about spam, email, and email authentication; Matt Cutts is the most recent promulgator, asking 'Where's my authenticated email?', with various commenters in the thread treating this as an anti-spam question.

Here's the thing -- email these days is authenticated. If you send a mail from GMail, it'll be authenticated using both SPF and DomainKeys. However, this alone will not help in the fight against spam.

Put simply -- knowing that a mail was sent by 'jm3485 at massiveisp.net', is not much better than knowing that it was sent by IP address 192.122.3.45, unless you know that you can trust 'jm3485 at massiveisp.net', too. Spammers can (and do) authenticate themselves.

Authentication is just a step along the road to reputation and accreditation, as Eric Allman notes:

Reputation is a critical part of an overall anti-spam, anti-phishing system but is intentionally outside the purview of the DKIM base specification because how you do reputation is fundamentally orthogonal to how you do authentication.

Conceptually, once you have established an identity of an accountable entity associated with a message you can start to apply a new class of identity-based algorithms, notably reputation. ... In the longer term reputation is likely to be based on community collaboration or third party accreditation.

As he says, in the long term, several vendors (such as Return Path and Habeas) are planning to act as accreditation bureaus and reputation databases, undoubtedly using these standards as a basis. Doubtless Spamhaus have similar plans, although they've not mentioned it.

But there's no need to wait -- in the short term, users of SpamAssassin and similar anti-spam systems can run their own personal accreditation list, by whitelisting frequent correspondents based on their DomainKeys/DKIM/SPF records, using whitelist_from_spf, whitelist_from_dkim, and whitelist_from_dk.
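To make the "personal accreditation list" idea concrete, here's a minimal sketch in Python. It is not how SpamAssassin implements those whitelist settings; the `AuthResult` type, the `TRUSTED_DOMAINS` set and the domain names are all made up for illustration. The point is just that a sender is only trusted when the mail is authenticated and the authenticated domain is one I already know.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthResult:
    """Outcome of the authentication checks for one message (hypothetical type)."""
    domain: str      # domain the message claims to come from
    spf_pass: bool   # did the SPF check pass?
    dkim_pass: bool  # did the DKIM/DomainKeys signature verify?


# Hypothetical personal accreditation list: domains of frequent correspondents.
TRUSTED_DOMAINS = {"example.org", "friend.example.net"}


def is_accredited(result: AuthResult, trusted=TRUSTED_DOMAINS) -> bool:
    """Trust a sender only if the mail is authenticated AND the domain is known.

    Authentication alone proves who sent the mail, not that they are
    trustworthy; the whitelist supplies the (very local) reputation.
    """
    authenticated = result.spf_pass or result.dkim_pass
    return authenticated and result.domain.lower() in trusted
```

Note that a fully authenticated message from an unknown domain still comes out as "not accredited", which is the whole argument above in a few lines of code.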

Hopefully more ISPs and companies will deploy outbound SPF, DK and DKIM as time goes on, making this easier. All three technologies are useful for this purpose (although I prefer DKIM, if pushed to it ;).

It's worth noting that the upcoming SpamAssassin 3.2.0 can be set up to run these checks upfront, "short-circuiting" mail from known-good sources with valid SPF/DK/DKIM records, so that it isn't put through the lengthy scanning process.
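The general shape of short-circuiting looks something like the following sketch. This is an illustration of the idea only, not SpamAssassin's actual code or configuration; `classify`, `full_scan` and the threshold value are names and numbers I've picked for the example.

```python
from typing import Callable


def classify(message: str,
             known_good: bool,
             full_scan: Callable[[str], float],
             threshold: float = 5.0) -> str:
    """Return "ham" or "spam", skipping the expensive scan for known-good mail.

    known_good -- True if the sender passed authentication and is whitelisted
    full_scan  -- the slow scoring function (hypothetical stand-in for a scan)
    threshold  -- illustrative score at or above which mail counts as spam
    """
    if known_good:
        # Short-circuit: trusted, authenticated mail never reaches the scanner.
        return "ham"
    return "spam" if full_scan(message) >= threshold else "ham"
```

The saving comes from `full_scan` never being called on the first branch, so the more of your regular correspondents authenticate their mail, the less scanning work there is to do.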

That's not to say Matt doesn't have a point, though. There are questions about deployment -- why can't I already run "apt-get install postfix-dkim-outbound-signer" to get all my outbound mail transparently signed using DKIM signatures? Why isn't DKIM signing commonplace by now?

How to deal with joe-jobs and massive bounce storms

Published January 10, 2007

As I've noted before, we still have a major problem with sites generating bounce/backscatter storms in response to forged mail -- whether deliberately targeted, as a "Joe-Job", or as a side-effect of attempts to evade over-simplistic sender address verification as seen in spam, viruses, and so on.

Sites sending these bounces have a broken mail configuration, but there are thousands remaining out there -- it's very hard to fix an old mail setup to avoid this issue. As a result, even if your mail server is set up correctly and can handle the incoming spam load just fine, a single spam run sent to other people can amplify the volume of response bounces in a Smurf-attack-style volume multiplication, acting as a denial of service. I've regularly had serious load problems and backlogs on my MX, due solely to these bounces.

However, I think I've now solved it, with only a little loss of functionality. Here's how I did it, using Postfix and SpamAssassin.

(UPDATE: if you use the algorithm described below, you'll block mail from people using Sender Address Verification! Use this updated version instead.)

Firstly, note that if you adopt this, you will lose functionality. Third party sites will not be able to generate bounces which are sent back to senders via your MX -- except during the SMTP transaction.

However, if a message delivery attempt is run from your MX, and it is bounced by the host during that SMTP transaction, this bounce message will still be preserved. This is good, since this is basically the only bounce scenario that can be recommended, or expected to work, in modern SMTP.

Also, a small subset of third-party bounce messages will still get past, and be delivered -- the ones that are not in the RFC-3464 bounce format generated by modern MTAs, but that include your outbound relays in the quoted header. The idea here is that "good bounces", such as messages from mailing lists warning that your mails were moderated, will still be safe.
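Roughly, the decision for third-party bounces can be sketched like this in Python. It's a simplification for illustration, not what Postfix or SpamAssassin actually do: the relay hostname is a placeholder, and checking the whole message text for the relay name is a cruder test than looking at quoted headers properly.

```python
from email import message_from_string
from email.message import Message

# Hypothetical hostname of my outbound relay; substitute your own.
OUTBOUND_RELAYS = ("mail.example.org",)


def is_rfc3464_dsn(msg: Message) -> bool:
    """True if the message is a multipart/report delivery-status notification."""
    return (msg.get_content_type() == "multipart/report"
            and str(msg.get_param("report-type", "")).lower() == "delivery-status")


def quotes_my_relay(msg: Message, relays=OUTBOUND_RELAYS) -> bool:
    """True if any of my relay hostnames appears anywhere in the message text."""
    text = msg.as_string().lower()
    return any(relay.lower() in text for relay in relays)


def accept_third_party_bounce(raw: str) -> bool:
    """Drop RFC-3464 DSNs; keep other bounces only if they quote my relays."""
    msg = message_from_string(raw)
    if is_rfc3464_dsn(msg):
        return False
    return quotes_my_relay(msg)
```

A mailing-list moderation notice that quotes the original Received headers would pass this check, while a forged-sender bounce that never touched my relays would not.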

OK, the details:

In Postfix

Ideally, we could do this entirely outside Postfix -- but in my experience, the volume (amplified by the Smurf attack effects) is such that these need to be rejected as soon as possible, during the SMTP transaction.

Update: I've now changed this technique: see this blog post for the current details, and skip this section entirely!

(If you're curious, though, here's what I used to recommend:)

In my Postfix configuration, on the machine that acts as MX for my domains -- edit '/etc/postfix/header_checks', and add these lines:

/^Return-Path: <>/                              REJECT no third-party DSNs
/^From:.*MAILER-DAEMON/                         REJECT no third-party DSNs

Edit '/etc/postfix/null_sender', and add:

<>              550 no third-party DSNs

Edit '/etc/postfix/main.cf', and ensure it contains these lines:

header_checks = regexp:/etc/postfix/header_checks
smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/null_sender

(If you already have an 'smtpd_sender_restrictions' line, just add 'check_sender_access hash:/etc/postfix/null_sender' to the end.) Finally, run:

sudo postmap /etc/postfix/null_sender
sudo /etc/init.d/postfix restart

This catches most of the bounces -- RFC-3464-format Delivery-Status-Notification messages from other mail servers.
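If you want to see which header lines those two header_checks patterns would hit before trying them on a live MX, here's a small Python sketch that applies the same regular expressions. It only approximates Postfix's behaviour: I'm assuming case-insensitive matching (which I believe is the default for regexp tables), and the `would_reject` helper is my own invention, not anything Postfix provides.

```python
import re

# The two header_checks patterns, compiled case-insensitively on the
# assumption that this mirrors Postfix's default for regexp tables.
DSN_PATTERNS = [
    re.compile(r"^Return-Path: <>", re.IGNORECASE),
    re.compile(r"^From:.*MAILER-DAEMON", re.IGNORECASE),
]


def would_reject(header_line: str) -> bool:
    """True if either pattern matches this single header line."""
    return any(p.search(header_line) for p in DSN_PATTERNS)


if __name__ == "__main__":
    for line in ("Return-Path: <>",
                 "From: Mail Delivery System <MAILER-DAEMON@mx.example.net>",
                 "From: alice@example.org"):
        print(would_reject(line), line)
```

Running a few headers from real bounces in your own mailbox through this is a cheap way to see how much legitimate mail the rules might also catch.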

In SpamAssassin

Install the Virus-bounce ruleset. This will catch challenge-response mails, "out of office" noise, "virus scanner detected blah" crap, and bounce mails generated by really broken groupware MTAs -- the stuff that gets past the Postfix front-line.

Once you've done these two things, that deals with almost all the forged-bounce load, at what I think is a reasonable cost. Comments welcome...

Kernighan and Pike on debugging

Published January 8, 2007

While reading the log4j manual, I came across this excellent quote from Brian W. Kernighan and Rob Pike's "The Practice of Programming":

As personal choice, we tend not to use debuggers beyond getting a stack trace or the value of a variable or two. One reason is that it is easy to get lost in details of complicated data structures and control flow; we find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is. More important, debugging statements stay with the program; debugging sessions are transient.

+1 to that.

5 things revisited

Published January 5, 2007

Hey Danny! I've already filled out my "5 Things" list. Surprisingly (or thankfully) nobody has commented on #5 ;)

Great Things, btw. I might adopt #4, and see if it works.

It's great fun following the web of "5 Things" links as they percolate through the interwebs. Now if only the people I nominated would get on with their lists...

Script: knewtab

Published January 5, 2007

Here's a handy script for konsole users like myself:

knewtab -- create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line ...}

Creates a new tab in a "konsole" window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the "--script" switch.

Download 'knewtab.txt'

Spam zombies — we need to cure the disease, not suppress the symptoms

Published December 28, 2006

Here's a great presentation from Joe St Sauver presented at the London Action Plan meeting recently: Infected PCs Acting As Spam Zombies: We Need to Cure the Disease, Not Just Suppress the Symptoms

Some key points in brief:

Despite all our ongoing efforts: the spam problem continues to worsen, with nine out of every ten emails now spam; spam volume has increased by 80% over just the past few months and users face a constantly morphing flood of malware trying to take over their computers. Bottom line: we're losing the war on spam.

The root cause of today's spam problems is spam zombies, with 85% of all spam being delivered via spam zombies.

The spam zombie problem grows worse every day (with over ninety one million new spam zombies per year)

Users don't, won't, or can't clean up their infected PCs; and ISPs can't be expected to clean up their infected customers' PCs.

Filtering port 25 and doing rate limiting is like giving cough syrup to someone with lung cancer -- it may suppress some overt symptoms but it doesn't cure the underlying disease.

Filtered and rate-limited spam zombies CAN still be used for many, many OTHER bad things, and they represent a huge problem if left to languish in a live infected state.

Joe's take -- "we're in the middle of a worldwide cyber crisis". I agree. He suggests a new strategy:

It is common for universities to produce and distribute a one-click clean-up-and-secure CD for use by their students and faculty. It's now time for our governments to produce and distribute an equivalent disk for everyone to use.

I agree the existing schemes are clearly not working; this is an interesting suggestion. Read/listen to the presentation in full for more details; pick up PDF, PPT and video here.

Massive spam volumes causing ISP delays

Published December 27, 2006

Via Steve Champeon's daily links, the following spam-in-the-news stories illustrate a rising trend:

Spam causing email delays: NZ Telecom

Huge amounts of spam are said to be responsible for delays in the email network of NZ ISP Xtra.

Several customers have vented their frustrations on an Xtra website message board saying some emails were days late, The New Zealand Herald reports.

... Record volumes of spam meant such problems would be "an unfortunate and on-going reality of the internet not specific to any provider", he said.

Mr Bowler said Telecom had invested "tens of millions of dollars" in email and anti-spam software and worked closely with two of the world's leading anti-spam vendors.

Holiday spam turns e-mail to snail mail for states schools

Holiday spam e-mails are to blame for slowing message delivery to faculty and staff in schools across Kentucky ...

Channel Register: 123-Reg says sorry

"Some 123-reg customers may have experienced intermittent delays in their emails in the last two weeks. We had received a particularly high level of image-based spam attacks over a short period of time," the Pipex subsidiary said.

Xtra faces lawsuits over email delays

Small businesses are threatening legal action over continuing glitches with Xtra's email service and the Consumers' Institute says they may have a case.

Several people have contacted the Herald complaining that delays and non-deliveries of emails over the past three weeks on the Xtra network are severely affecting their businesses. ...

The institute's David Russell said home users could claim compensation for email delays if they had suffered "a real measurable loss".

Non-commercial customers were covered by the Consumer Guarantees Act and services they paid for had to be of a "reasonable quality".

Although it might be more difficult for small business owners, they could also have a case, Mr Russell said. "If there has been a considerable amount of money, they could consider legal action or, if the amount was smaller, they could go through the disputes tribunal."

In other words, the DDOS-like elements of the spam problem are becoming an increasing worry; even with working spam filtering in place, the record size of zombie botnets means that spammers can now destroy organisations' computing infrastructure, almost accidentally.

Spammers don't care if an organisation's infrastructure collapses while they're sending their spam to it -- they just want to maximise exposure of their spam, by any means necessary. If that requires knocking a company off the air entirely for a while, so be it.

I'm not sure what can be done about this, in terms of filtering. It may finally be time to fall back to a "side channel" of trusted, authenticated SMTP peers, and leave the spam-filled world of random email from people and organisations you don't know to one side, as a lower-priority system which can (and will, frequently) collapse, without affecting the 'important' stuff. What a mess. :(

Alternatively, maybe it's time for governments to start putting serious money into botnet-spam-related arrests and prosecution.

This has additional issues for ISPs, too, btw -- I wonder if Earthlink are taking note of that Xtra lawsuit story above....

Cliche-finder bookmarklet

Published December 23, 2006

Quinn posted a link to a nifty CGI by Aaron Swartz which detects uses of common cliches, with the list of cliches to avoid taken from the Associated Press Guide to News Writing. In addition, she also mentioned there's the Passivator, 'a passive verb and adverb flagger for Mozilla-derived browsers, Safari, and Opera 7.5'.

Combining the two, I've hacked together a bookmarklet version of the cliche finder -- it can be found on this page. (Couldn't place it inline into this post due to stupid over-aggressive Markdown, grr.)

Fun! Probably not IE-compatible, though.

5 things

Published December 20, 2006

Tagged by richi! drat. OK, here are 5 things you probably don't know about me:

1. I'm a certified SCUBA diver, at PADI Advanced Open Water Diver level. (oh, look, so's Tom Raftery!)
2. I generally try to avoid meeting my heroes, since I get quite tongue-tied in the presence of people I admire -- I once stammered "I think you're brilliant" at Alex Paterson, instead of anything more witty or interesting.
3. I met my wife at a student occupation in university, where her knowledge of the science and nature questions in Trivial Pursuit, and amazing looks of course, got me hooked ;)
4. I could listen to Brian Eno's Taking Tiger Mountain By Strategy and Here Come The Warm Jets on repeat for several weeks, if necessary.
5. I was a child model, modelling (among other things) underpants for Dunnes Stores! It's all been downhill since then, really ;)

Passing it on: go for it, Brendan, Colm, Lisey, and Jason.

An anti-challenge-response Xmas linkfest

Published December 14, 2006

As all right-thinking people know by now, Challenge-response spam filtering is broken and abusive, since it simply shifts the work of filtering spam out of your email and onto innocent third parties -- either your legitimate correspondents, people on mailing lists you read, or even random people you have never heard of (due to spam blowback).

I've ranted about this in the past, but I'm not alone in this opinion -- and frequently find myself explaining it. To avoid repeating myself, here's a canonical collection of postings from around the web on this topic.

Spamcop FAQ: Why are auto responders bad?:

Description: This "selfish" method of spam filtering replies to all email with a "challenge" - a message only a living person can (theoretically) respond to. There are several problems with this method which have been well known for many years.

Does not scale: If everyone used this method, nobody would ever get any mail.

Annoying: Many users refuse to reply to the challenge emails, don't know what they are or don't trust them.

Ineffective: Because of confusion about these emails, many of them are confirmed by people who did not trigger them. This results in the original malicious email being delivered.

Selfish: This is the problem we are mainly concerned with. By using challenge/response filtering, you are asking innumerable third parties to receive your challenge emails just so that a relatively few legitimate ones get through to the intended recipient.

Karsten M. Self: Challenge-Response Anti-Spam Systems Considered Harmful:

C-R systems in practice achieve an unacceptably high false-positive rate (non-spam treated as spam), and may in fact be highly susceptible to false-negatives (spam treated as non-spam) via spoofing.

Effective spam management tools should place the burden either on the spammer, or, at the very least, on the person receiving the benefits of the filtering (the mail recipient). Instead, challenge-response puts the burden on, at best, a person not directly benefitting, and quite likely (read on) a completely innocent party. The one party who should be inconvenienced by spam consequences -- the spammer -- isn't affected at all.

Worse: C-R may place the burden on third parties either inadvertantly (via spoofed sender spam or virus mail), or deliberately (see Joe Job, below). Such intrusions may even result in subversion of the C-R system out of annoyance. Many recent e-mail viruses spoof the e-mail sender, including Klez, Sobig variants, and others.

John Levine: Challenge-response systems are as harmful as spam:

The collateral damage from widely used C/R systems, even with implementations that avoid the stupid bugs, will destroy usable e-mail. [jm: in fairness, this was written in 2003.]

Challenge systems have effects a lot like spam. In both cases, if only a few people use them they're annoying because they unfairly offload the perpetrator's costs on other people, but in small quantities it's not a big hassle to deal with. As the amount of each goes up, the hassle factor rapidly escalates and it becomes harder and harder for everyone else to use e-mail at all.

Ed Felten: A Challenging Response to Challenge-Response:

I'm skeptical of CR as a response to email. If you're the first on your block to adopt CR, and if nobody else uses anti-spam technology, then CR might provide you some modest benefit. But it's hard to see how CR can be widely successful in a world where most people use some kind of spam defense.

Jeremy Zawodny: TMDA Users Can Blow Me (heh):

If these systems are so brain-dead as to not bother adding my address to the whitelist when the user sends me e-mail, I have serious trouble understanding why anyone is using them.

Is it just me? Is this too hard to figure out?

Anyway, there's another 5 minutes I'll never get back. It's too bad there's no mail header to warn me that "this message is from a TDMA user", because then I'd be able to procmail 'em right to /dev/null where they belong.

Ugh.

This bullshit is not going to "solve" the spam problem, people. If that's your solution, please let me opt out. Forever.

Michele Neylon: Why C/R is not a good idea:

C/R slows down and impedes communication by placing unwanted barriers between you and your clients/suppliers.

If you must insist on using some form of C/R please make sure that you whitelist my address before you contact me as I will not reply to challenges.

TidBITS policy on Challenge-Response:

We will not answer any challenges generated in response to our mailing list postings. Thus, if you're using a challenge-response system and not receiving TidBITS, you'll need to figure that out on your own. Also, if you send us a personal note and we receive a challenge to our reply, we may or may not respond to it, depending on our workload at the time.

Fedora Project policy on UOL -- a Brazilian ISP that uses C-R extensively:

uol.com.br uses a very broken method of anti-spam. Everytime someone sends an email message to one of their members, they send back a verification message, asking the original sender to click a link before they will allow the message through. These messages are themselves a form of spam, and the resulting back-scatter of these messages is altogether bad for the Internet, the UOL member, and all of the UOL member's contacts. UOL is aware of the complaints against them, and they refuse to correct the issue, claiming that their members love the service.

Matt Sergeant: C-R spam solutions:

I hate C/R systems. With a passion. I absolutely will not respond to them. They go in the trash. I don't get them very often but I get them more and more. I think they have the potential to seriously damage email communication as we know it. And I'm not alone in this opinion.

Richi Jennings: 'Challenge/Response makes you a spammer.'
BusinessWeek: Stephen Wildstrom: A Spam-Fighter More Noxious Than Spam: 'Challenge-response filtering systems are likely to wipe out e-mails you want, too.'
and lots more at Spamlinks.net

Phew.

Linux USB frequent reconnects – workaround

Published December 13, 2006

I've been running into problems recently (since several months ago at least), with USB hardware on my Thinkpad T40 running Ubuntu ~~Hoary~~ Dapper; in particular, every time I plug in my iPod or one of my USB hard disks nowadays, I get this:

[5008549.187000] usb 4-3: USB disconnect, address 14
[5008550.143000] usb 4-3: new high speed USB device using ehci_hcd and address 18
[5008552.643000] usb 4-3: new high speed USB device using ehci_hcd and address 27
[5008557.393000] usb 4-3: new high speed USB device using ehci_hcd and address 43
[5008557.893000] usb 4-3: new high speed USB device using ehci_hcd and address 44
[5008558.643000] usb 4-3: new high speed USB device using ehci_hcd and address 46
[5008558.895000] ehci_hcd 0000:00:1d.7: port 3 reset error -110
[5008558.896000] hub 4-0:1.0: hub_port_status failed (err = -32)
[5008559.893000] usb 4-3: new high speed USB device using ehci_hcd and address 48
[5008562.643000] usb 4-3: new high speed USB device using ehci_hcd and address 58
[5008563.143000] usb 4-3: new high speed USB device using ehci_hcd and address 59
[5008563.643000] usb 4-3: new high speed USB device using ehci_hcd and address 60
[5008570.143000] usb 4-3: new high speed USB device using ehci_hcd and address 85

This repeats ad infinitum until the USB device is disconnected.

I had this down as a hardware issue (since it started happening just after warranty expiration ;), but some accidental googling revealed several other cases -- and a workaround:

sudo modprobe -r ehci-hcd

Run that repeatedly, each time replugging the device and monitoring dmesg via watch -n 1 'dmesg | tail' in a window, until the device is finally recognised as a USB hard disk. It generally seems to take 3 or 4 attempts, in my experience.
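If you'd rather not eyeball dmesg each time, something like this Python sketch can spot the reconnect loop from the log lines. It's only an illustration: the regex is written against the exact message format shown above, and the threshold of 3 repeats is an arbitrary number I've picked, not anything the kernel defines.

```python
import re
from collections import Counter

# Matches lines like:
#   usb 4-3: new high speed USB device using ehci_hcd and address 18
NEW_DEVICE = re.compile(
    r"usb (\S+): new high speed USB device using ehci_hcd and address \d+")


def reconnect_counts(dmesg_lines):
    """Count 'new high speed USB device' messages per USB port."""
    counts = Counter()
    for line in dmesg_lines:
        match = NEW_DEVICE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_ports(dmesg_lines, threshold=3):
    """Ports that re-enumerated at least `threshold` times (arbitrary cut-off)."""
    counts = reconnect_counts(dmesg_lines)
    return sorted(port for port, n in counts.items() if n >= threshold)
```

Feeding it the tail of dmesg after plugging in the device would tell you whether to bother unloading ehci-hcd again, or whether the device enumerated cleanly this time.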

This LKML thread suggests hardware changes can cause it, but this hardware hasn't changed in years. Annoying.

Anyway, this is ongoing. This tip seems to help, but it might be just treating a symptom, I don't know -- just posting for google and posterity... and to moan, of course :(

Threadless deals with plagiarism

Published December 12, 2006

(Updated since original posting; see end of post for details)

Paging boogah!

Interesting situation playing out at Threadless -- ~~I think this may be the first time~~ a stolen design made it through voting and so on, onto cotton, without being spotted. Here's the design, supposedly by someone called 'rocketrobyn':

And here's the (apparently original) stencil art by miso and ghostpatrol:

BTW, note the perspective being copied from the photo's odd angle, to the shirt design...

The Threadless design's submission page has some classic comments:

Boney_King_of_Nowhere: Wow. Are you by any chance a fan of Bansky? Because this is almost a rip off. Almost. Awsome though.
rocketrobyn (this is my design): Thank you for the positive comments. I really like this shirt too! [...] I'm not sure who Bansky [jm:sic] is, but I'll check it out!

Heh.

I heard about this via You Thought We Wouldn't Notice, a street-design plagiarism blog, where ghostpatrol (one of the stencil artists) posted a blog post about the situation. In the comments there, Jake from Threadless pipes up:

jake n on 12 Dec 2006 at 4:30 am

hey, jake here from threadless. i was just made aware of this situation and want to give you all my assurance that we will handle this properly.

the designer will not be paid and the design will either be removed or licensed from the original designer if they are willing.

give us a couple days to sort the details.

Not to appear whingy, 2 hours later "n." posts:

The original owners are not willing to license this design to Threadless, and want it removed from the site. Neither artist has yet been contacted by Threadless.

Bit of patience there ;)

Our first physical award

Published December 7, 2006

W00t!

Backscatter in InformationWeek

Published December 5, 2006

Yay! Kudos to Richi Jennings, who's been trumpeting the dangers of backscatter to InformationWeek recently. It's a great article. I particularly like how it digs up this impressively off-the-mark quote:

Tal Golan, CTO, president, and founder of Sendio, maker of a challenge/response e-mail appliance used by more than 150 enterprise consumers, disagrees strongly with Jennings's assertion that challenge-based filtering has problems. "Without question, the benefit to the whole community at large drastically outweighs that FUD [fear, uncertainty, and doubt] that's out there in the marketplace that somehow challenge/response makes the problem worse," he says. "The real issue is that filters don't work. From our perspective, challenge/response is the only solution. This whole concept of backscatter is just not true. Very, very rarely do spammers forge the e-mail addresses of legitimate companies anymore."

hahahaha. Well, since last Thursday, "very very rarely" translates as "214 MB of backscatter in my inbox". The facts aren't on Tal Golan's side here...

(PS: SpamAssassin 3.2.0 will include backscatter detection.)

An Post: 75% lost-parcels rate so far

Published December 4, 2006

I don't know what's going on with An Post, the Irish postal service, these days -- I've been having some pretty bad luck with them.

For my birthday, I was lucky enough to be given a Thingamagoop -- it took a while (hey, they're hand-made) but was shipped on Nov 7th from the US. Bleep Labs accidentally shipped me two, apparently, but only one has arrived -- on Nov 16th, 9 days after shipping. The other one's still AWOL nearly a month later.

I then ordered something from Sendit.com on Nov 17th, as a birthday gift for Nov 30th. It was shipped from their Belfast offices on Nov 18th, and still hasn't arrived to date. Sendit were champs, however, and refunded the purchase as soon as I rang them on the 30th (I'd recommend their services, no problem).

Finally, SpamAssassin was lucky enough to win a Linux New Media Award 2006 for 'Best Linux-based Anti-spam Solution' -- nifty! As part of this, a (physical) trophy is apparently winging its way from Germany, and was apparently shipped on November 27th. Guess what: no sign.

In other words, in the past month, 75% of the parcels sent to me seem to have gone AWOL. All I can do is hope that they've just been delayed, rather than suffer a worse fate. In particular, I hope that trophy turns up -- it's the only physical award we've ever received :(

Can anyone think of a good avenue to track these down? The website seems pretty negative, and what I've heard seems to be along the lines of 'turn up at the sorting depot, cross your fingers, and see if they've been misdelivered'. Ick.

SpamAssassin as an EC2 service

Published November 30, 2006

I had a bit of an epiphany while chatting to Antoin about the qpsmtpd/EC2 idea. Craig had the same thoughts.

Here's the thing -- there's actually no need to offload the SMTP part at all. That stuff is tricky, since you've got to build in a lot of fault tolerance, quality-of-service, uptime, etc. to ensure that the MX really is reachable. Since an EC2 instance will lose its "disks" once rebooted/shut down, you need to store your queues in Amazon S3 -- which has differing filesystem semantics from good old POSIX -- so things get quite a bit hairier. On top of that, it requires a little RFC-breakage; there are issues with using CNAMEs in MX records, reportedly.

However, if we offload just the spamd part, it becomes a whole lot simpler. The SPAMD protocol will work fine across long distances, securely, with SSL encryption active, and SpamAssassin will work fine as a filtering system in an entirely stateless mode, with no persistent-across-reboots storage. (What about the persistent-storage aspects of spamd operation? There's just the auto-whitelist, which can be easily ignored, and I haven't trained a Bayes database in 2 years, so I doubt I'll need that either ;)

If the spamd server is down or uncontactable, spamc will handle this and retry with another server, or eventually give up and pass the message through, safely intact (though unscanned).
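That fallback behaviour can be sketched roughly as follows. This is only an illustration in Python, not spamc's actual code: `scan_with_failover` and the `scan` callable are hypothetical names made up for this example.

```python
from typing import Callable, Iterable, Optional, Tuple


def scan_with_failover(
    message: bytes,
    servers: Iterable[str],
    scan: Callable[[str, bytes], str],
) -> Tuple[bytes, Optional[str]]:
    """Try each spamd host in turn; if all fail, pass the mail through unscanned.

    `scan` is a hypothetical callable that sends `message` to one host and
    returns its verdict, raising OSError if the host is down or unreachable.
    """
    for host in servers:
        try:
            return message, scan(host, message)
        except OSError:
            continue  # this host is down or uncontactable: try the next one
    return message, None  # every host failed: deliver intact, verdict unknown
```

The important property is the last line: the message itself is never modified or dropped just because no scanner could be reached.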

Given that there's a cool third-party ClamAV plugin now available for SpamAssassin, this system can offload the virus-scanning work, too.

So here's the new plan: run the MTA, MX, and the super-lean "spamc" client on the normal MX machine -- and offload the "spamd" work to one or more EC2 machines.

Basically, there would be a CNAME record in DNS, listing the dynamic DNS names of the EC2 spamd instances. Then, spamc is set to point at that CNAME as the spamd host to use. As EC2 instances are started/removed, they are added/removed from that CNAME list and spamc will automatically keep up.

Pricing is reasonably affordable, as long as a few precautions are taken: don't send over-large messages to the EC2 spamd; rate-limit total incoming SMTP traffic in the MTA; and use the SPAMD protocol's REPORT verb, so that mail messages are only transmitted one-way, MX-to-EC2, instead of both MX-to-EC2 and EC2-to-MX. That will keep the bandwidth pricing down.
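To make the one-way idea concrete, here is a rough Python sketch of building a REPORT request with a size cap. The wire format shown (a `REPORT SPAMC/1.2` line followed by a `Content-length` header) is my reading of the spamd protocol document and should be checked against it; the function name and the 500KB cut-off are assumptions for illustration only.

```python
from typing import Optional

MAX_SCAN_BYTES = 500 * 1024  # assumed cut-off for "over-large"; tune to taste


def build_report_request(message: bytes, max_bytes: int = MAX_SCAN_BYTES) -> Optional[bytes]:
    """Build a spamd REPORT request for one raw mail message.

    Returns None for over-large messages, which should skip the remote scan.
    """
    if len(message) > max_bytes:
        return None
    # Request line, Content-length header, blank line, then the message.
    header = b"REPORT SPAMC/1.2\r\nContent-length: %d\r\n\r\n" % len(message)
    return header + message
```

With REPORT, only the verdict and report text come back from spamd, so the full message crosses the (billed) link once rather than twice.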

Recent figures indicate that I got about 90MB of mail per day, at peak, over the past weekend (which nearly DOS'd my server and caused some firefighting) -- 68MB of spam, and 13MB of blowback. At 20 cents per GB, that's 1.8 cents per day for traffic. Add the $0.10 per instance-hour ($2.40 for 24 hours), and that's $2.42 per day to run a single EC2 instance to handle DDOS spikes. Of course, that can be shut down when load is low.
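For anyone who wants to plug in their own traffic figures, the sum above works out like this. It's a small Python sketch using the prices quoted in this post as defaults; the function name is made up, and it treats 1GB as 1000MB.

```python
def daily_cost_usd(traffic_mb_per_day, usd_per_gb=0.20,
                   usd_per_instance_hour=0.10, instances=1, hours=24):
    """Return (traffic, compute, total) cost in dollars for one day."""
    traffic = traffic_mb_per_day / 1000 * usd_per_gb  # 1GB taken as 1000MB
    compute = instances * hours * usd_per_instance_hour
    return traffic, compute, traffic + compute


traffic, compute, total = daily_cost_usd(90)
print(f"traffic ${traffic:.3f} + compute ${compute:.2f} = ${total:.2f} per day")
```

The instance-hours dominate; the bandwidth charge is close to a rounding error at this volume.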

Yep, this is looking very promising. Now when are Amazon going to let me onto the beta program for EC2?...

Using qpsmtpd and Amazon EC2 to provide SMTP-DDoS protection

Published November 29, 2006

Like a few other anti-spammers, I found myself under a hitherto-unprecedented level of spam blowback this weekend. Disappointingly, there are still thousands of SMTP servers configured to send bounce messages in response to spam.

Even with the anti-bounce ruleset for SpamAssassin, the volume was so great that our creaky old server had a lot of difficulty keeping up -- once the messages got to SpamAssassin, the load issues had already been created. Also, Postfix's anti-spam features really weren't designed to deal with blowback.

While attempting to take some shortcuts in the setup on our server to deal with this, a great idea occurred to me -- why not come up with an app that uses Amazon EC2 to flexibly provision enough server power and bandwidth to pre-filter the SMTP traffic for an MX under attack?

I'm basically thinking of qpsmtpd, with SpamAssassin and/or other antispam blobs active, running in an Amazon EC2 server image. Multiple images can be brought up, and added to the attacked domain's MX record at an equal priority, to take load off the main (overloaded) MX.

Now to cogitate a little -- details to follow...

Working out electricity costs for your appliances and hardware

Published November 28, 2006

This question came up on a forum I'm on. It turns out it's really quite easy to work out -- this page covers pretty much all the details.

In addition to what's there, it's worth noting that the current Irish price under the ESB's domestic rate is 12.73 cents per kWh, which works out as 14.45 cents per kWh once the 13.5% VAT is added in. So Irish users, pretend you live in New Hampshire (15 cents per kWh) to get realistic figures from the excellent cost calculator.

Using this, it looks like if I were to leave a 160W desktop computer on permanently in Ireland, I'd be spending 215 euros per year to power it. Wow, that's pricey! My strategy of using low-noise, low-power hardware for home servers has paid off already, in that case. ;)

For what it's worth, if you're worrying about the power consumption of an NTL Pace digital TV set-top box -- if this Pace presentation is anything to go by, it appears the standby power consumption is on the order of 1-2 watts -- about 2 euros per year. Grand.
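If you'd rather do the sums yourself than use the calculator, the arithmetic is just watts, hours and price per kWh. Here's a minimal Python sketch; the function name is made up, and the 15 cents per kWh is the rounded New Hampshire stand-in from above, so the calculator's own figures may differ a little depending on the rate and rounding it uses.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(watts, price_per_kwh, hours=HOURS_PER_YEAR):
    """Cost of running a device drawing `watts` for `hours` hours."""
    kwh = watts / 1000 * hours  # watts -> kilowatts, times hours in use
    return kwh * price_per_kwh


# A set-top box idling at roughly 1.5W all year, at about 15 cents per kWh:
print(f"{annual_cost(1.5, 0.15):.2f} per year")
```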

Labour’s flat-rate bus tickets

Published November 27, 2006

Well, that was quick!

Right after posting this, I hear about Labour's new transport strategy for Dublin. Here are the top 3 items:

Labour will increase the Dublin Bus fleet by 50% (500 buses), significantly increasing frequency and reducing waiting times.
Will complete the Quality Bus Corridors, and greatly reduce journey times.
Will introduce a EUR 1 per-trip fare for adults and a 50c per-trip fare for children.

The flat-rate fee structure makes a lot more sense than the confusing and rip-off-ish current model, whereby if you don't know in advance how much a particular journey is going to cost, you're given a useless receipt instead of change. This weird policy has certainly stopped me from catching buses in the past. In general, flat-rate pricing models appear to encourage use in other fields. And the increase in the fleet is obviously a fantastic idea. Great stuff!

Read the full policy paper here (as a PDF).

Dublin transport survey

Published November 27, 2006

Via Lean comes this, I think from the Irish Times:

One-half of Dublin drivers would never use bus - survey

One-half of all car drivers in the greater Dublin area say they would not switch to travelling by bus, even if services were improved, according to a new survey.

Unreliability, long waiting times and poor connections were cited as the main reasons for not taking the bus in the survey carried out for the Dublin Transportation Office (DTO).

As many as four out of five people expressed dissatisfaction with traffic congestion and access to the Luas.

Just over 35 per cent of those surveyed were satisfied with the quality and upkeep of roads, and with facilities for cycling. Over one-half said they were happy with the reliability, frequency and cost of buses.

Almost 2,500 people were interviewed for the survey and a similar number of travel diaries were compiled. The car is the main form of transport in the region, used by 45 per cent of respondents. Some 18 per cent relied on the bus and 16 per cent said walking was their main form of transport. Just 2 per cent used the Luas more often than other modes of transport, and 3 per cent used the DART or local train. Two per cent cycled and 1 per cent relied on taxis.

Of those who said they might switch to the bus, over 60 per cent said more frequent services was the main change needed. Accurate timetables and stops closer to destinations were also called for.

Respondents linked transport by car to comfort, convenience and reliability. In contrast, buses were viewed as being for older people and people with no other choice. Bus transport was favourably viewed for going out socially and for being reasonably priced.

The Luas was seen as modern, while DART and train services were viewed as fast and safe. Cycling and walking were viewed as healthy and environmentally friendly, but for young people.

Great figures -- they sound pretty accurate.

The novelty of being home in a (relatively) bike- and public-transport-friendly city has worn off for me by now -- I'm now more familiar with buses that aren't a dumping ground for the homeless and mentally ill, and that do actually tend to pass both your origin and destination in a single journey. But that was in Orange County, possibly one of the most public-transit-hostile societies in the developed world, and compared to a more sane standard, Dublin still has a major problem.

By the way, it's interesting to note Ireland's move OC-wards on many fronts. When I got back, I was shocked to see tubby children being driven to school by mobile-phone-wielding, SUV-driving parents -- the very worst aspects of US suburban-sprawl life being happily parroted over here. :(

Spam filter evasion self-defeating?

Published November 24, 2006

Donncha asks, is spam self-defeating?

has anyone else noticed that the new generation of gif based stock-trading spams are getting really hard to read? In the last one I had to squint and look really carefully to find out what stock was hot and a sure-buy today!

I've been wondering about this, too. We continually push spammers further and further from comprehensibility, since comprehensible spam is easily-filtered spam, but the spam flood doesn't stop. In fact, spam volumes have shot up higher than ever.

My theory is that it's a symptom of the spam side of things being a market in itself (and an inefficient, scam-heavy one at that).

IMO, the people providing the underlying products advertised in "high-end" spam -- the pill-peddlers and stock pumpers -- no longer control the technical details of how or where the spam is sent. Instead, they are the customers of professional spam gangs who do that, and take care of the obfuscation, filter-evasion, etc.

In other words, the pill-peddlers and scam operators are getting ripped off, too. They think their products or scams will be advertised in a comprehensible manner, in readable emails; but instead, odd, opaque 3-word messages with "cut and paste this" lines, hidden inside filter-evasion text and bits of Project Gutenberg, are what gets delivered to the victims.

I can't imagine the clickthrough rates are exactly stellar on that. So I'd guess the spammers are responding by pushing up volumes to attempt to increase clickthrough/sales volumes. Wonder if it's working or not?

Planet Antispam Update

Published November 24, 2006

Hey, some Planet Antispam updates. I've upgraded to Planet 2.0, and that seems to have solved some of the weirdness with consuming Atom feeds.

Also, there are two new antispam weblogs added to the subscription list:

Terry Zink from Microsoft
Nik Clayton, who blogs about mail and, occasionally, spam

Welcome guys!

(btw, if you're wondering what happened to the music post -- I moved it over here, to the mp3 blog where it was supposed to be posted in the first place, duh ;)

The nightmare that is Ryanair

Published November 22, 2006

It's interesting reading US weblogs when they wax enthusiastic about Ryanair, typically on the foot of this BusinessWeek article.

Here's the thing -- flying Ryanair is a deeply unpleasant experience. I've heard rumour that their staff are paid commission based on how many discretionary charges they can pile onto the basic fare -- leaving you feeling nickel-and-dimed at every turn -- and that certainly matches with my experience. I mean, I've had better service in train stations in Uttar Pradesh.

In our case, our "no more" moment was after a trip to Spain earlier this year, where we were humiliated for attempting to shift around luggage instead of immediately paying the charges due once you exceed 15 kilos (33 pounds). (Naturally, there are no weighing scales until you get right in front of the check-in desk...) Once it became clear we didn't want to pay the fee, the check-in person screamed at us, and sent us to the back of the check-in queue -- like bold schoolchildren!

This level of service is pretty standard, going by local word of mouth. Several of my friends have, like me, vowed never to fly them again, even picking more expensive flights to more distant airports to avoid it.

It's certainly not comparable to JetBlue, or any other low-fare airline I've had the pleasure of dealing with -- this is a level below. The BusinessWeek article ends with:

American long-haul discounters aren't likely to go to the extremes Ryanair has gone to sell basic services, but they're paying more attention to Ryanair these days. "They're on the cutting edge," says Tad Hutcheson, vice-president for marketing at AirTran, which recently assigned two marketing staffers to spend a week flying on Ryanair. "Charging for Cokes or snacks, blankets or pillows--I'm not sure Americans are ready for that."

Well, I certainly hope not, for their sakes!

Bleadperl regexp optimization vs SA

Published November 16, 2006

I've been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie matching, and how we can effectively support this in SpamAssassin. Here's the state of play.

These are the "base strings" extracted from the SpamAssassin SVN trunk body ruleset (ignore the odd mangled UTF-8 char in here, it's suffering from cut-and-paste breakage). A "base string" is a simplified subset of a rule's regular expression: a plain string, with no RE metachars, and therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as "r" lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after "r" and before the ":"; after that, the rule names appear.

Now, here are some limitations that make this less easy:

One string to many rules: each one of those strings corresponds to one or more SpamAssassin rules.
One rule to many strings: each rule may correspond to one or more of those strings. So it's not a one-to-one correspondence either way.
No anchors: the strings may match anywhere inside the line, similar to ("foo bar baz" =~ /bar/).
Multiple rules can fire on the same line: each line can cause multiple rules to fire on different parts of its text.
Subsumption is not permitted: the base-string extractor plugin has already established cases where subsumption takes place. Each string will not subsume another string; so a match of the string "food" against the strings "food" and "foo" should just fire on "food", not on "foo".
Overlapping is permitted: on the other hand, overlapping is fine; "foobar" matched against "foo" and "oobar" should fire on both base strings. (The above two are basically for re2c compatibility. This is the main reason the strings are so simple, with no RE metachars -- so that this is possible, since re2c is limited in this way.)
Most rules are more complex: most of the ruleset -- as you can see from the 'orig' lines in that file -- are more complex than the base string alone. So this means that a base string match often needs to be followed by a "verification" match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded, encoding-decoded, HTML-decoded, whitespace-simplified) "body text" of a mail message, with each paragraph appearing as a single "line", and run all those base strings in parallel, identifying the rule names that then need to be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the matched string itself as a key into a lookup table to determine the rule that fired:

  my %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  my $rule_fired;
  if ($line =~ m{(hello|hi|foo)}) {
    # $1 is the base string that matched; look up its list of rule names
    $rule_fired = $base_to_rulename_lookup{$1};
  }

However, that will fail in the face of the string "hi foo!", since only one of the bases will be returned as $1, whereas we want to know about both "RULE_HI" and "RULE_FOO".

m//gc might help:

  my %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  my @rules_fired;
  while ($line =~ m{(hello|hi|foo)}gc) {
    # collect the rule names for every base found, not just the last one
    push @rules_fired, @{ $base_to_rulename_lookup{$1} };
  }

That works pretty well, but not if two patterns overlap: /abc/ and /bcd/, matching on the string "abcd", for example, will fire only on "abc", and miss the "bcd" hit.
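The overlap problem is easy to reproduce outside Perl, too. Here's a small sketch, in Python rather than Perl purely for illustration (the names are made up for this example), comparing a left-to-right scan over one alternation with testing each base string independently:

```python
import re

BASES = ["abc", "bcd"]
ALTERNATION = re.compile("|".join(re.escape(b) for b in BASES))


def sequential_hits(line):
    """Scan left to right, like m//gc: each match resumes after the last."""
    return [m.group(0) for m in ALTERNATION.finditer(line)]


def independent_hits(line):
    """Test every base string on its own, so overlapping hits are all seen."""
    return [b for b in BASES if b in line]
```

On the string "abcd", `sequential_hits` reports only "abc", because the scan resumes after the "c" and never sees "bcd"; `independent_hits` reports both, at the cost of one test per base string.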

Given this, it appears the only option is to run the trie match, and then iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn't provide much of a speedup -- in fact, so far, I've been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs is outweighed by the addition of logic OPs to do this, resulting in an overall slowdown, even given the faster trie-based REs.

Suggestions, anyone?

(by the way, if you're curious, the current code is here in SVN.)

Bye bye beard

Published November 15, 2006

I shaved my beard, and made an animated GIF of the process! Enjoy...

RFID chip in a UK passport

Published November 12, 2006

rfid
Originally uploaded by Xoger.

Found over at this UK blogger's site. So RFID passports are now deployed in both Ireland and the UK...

A Guinness 419 scam!

Published November 12, 2006

I may be a bit hungover this Sunday morning due mainly to the effects of the subject of this post, but -- Guinness National Lottery? Is anyone going to fall for that?

From: hamilton jones 
Subject: GUINNESS. CUSTORMERS PROMOTION

GUINNESS. CUSTORMERS PROMOTION
dv-2006 program
guinness plc, West Africa.
st christo road (ecowas)

                                    FINAL_ NOTIFICATION.

We happily inform you about our (guinness. national lottery
program)held on the 10th of november 2006, which you enterd as a
dependent client and finally took the 1st position in our second
(2nd) category winners, that falls within  the europe region Manchester Uk.
Your email was attached to the ticket number(44-40-23-777-01) which
made you a winner of (us$500,000.00) and your name being recorded in
our guinness world book of record as the 1st lucky winner of the year
2006. You have been approved the sum of US$500,000.00 which will be
sent accross to you immediately.

All emails are selected randomly through a computer ballot which
subsequently won you the sweepstakes of Guinness internet web
lottery.

CONGRATULATIONS YOU EMERGED OUR WINNER!!!
= = = = = = = = = = = = = = = = = = = = = = = = = = =
This is part of our security measures put in place to avoid double
claiming or a situation where unwanted person(s) would be taking
Negative advantage of these promotions, thereby impersonating in
order to claim another persons winning prize.
Here is our fiduciary agent responsible for your the processing /
Release of winnings for all Second Category winners where your
winning Falls into:
MR HAMILTON JONES
EMAIL: hamilton_jones2006@yahoo.it

GUINNESS. CLAIMING SECURITY AGENT.
= = = = = = = = = = = = = = = = = = = = = = = = = = =
You are required to forward the following details to help facilitate
the processing of your GUINNESS. CLAIMS OF CERTIFICATE.

Full names / Residential address / Phone number / Occupation / Sex /
Age / Present country / Marital status.

ONCE AGAIN CONGRATULATION!!!!
Yours sincerely

ANDERSON HEGLAND

Irish Blogs top 100 — should old blogs be trimmed?

Published November 9, 2006

Over on the Technorati Top 100 of Irish Blogs list, I've noticed something; quite a few of the listings have stopped publishing, such as number 5, Tom Murphy's Natterjackpr.com.

I'm wondering -- should no-longer-publishing blogs be listed? Technorati still keeps their ranking high -- clearly old data is not expired from the Technorati database for at least a year. But maybe my scripts should use last-post-published time, from planet.journals.ie where available, and discard blogs that haven't put anything up in something like 4 months.
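The trimming rule I have in mind would look something like this. It's a Python sketch only: the function name, the input shape (a dict of blog URL to last-post time, with None where no time is available) and the 120-day reading of "4 months" are all assumptions.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=120)  # "something like 4 months"


def drop_stale(last_post_by_blog, now, stale_after=STALE_AFTER):
    """Keep blogs with a recent post, or with no known last-post time."""
    return [
        url
        for url, last_post in last_post_by_blog.items()
        if last_post is None or now - last_post <= stale_after
    ]
```

Blogs with no last-post time are kept rather than dropped, since the absence of data says nothing about whether they're still publishing.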

What do you think?

Top 100 Irish Blogs, pt 2

Published November 6, 2006

The previous post was pretty popular, and one of the requests was for a regularly-updated listing. So here it is: http://taint.org/technorati/

Since Technorati limit queries to about 500 per day (iirc), and there are quite a few more blogs than that in the Irish blogs list, I plan to update it on a nightly basis, with each set of blogs updating on different days. This should result in the figures staying more-or-less up to date without hammering T'rati too much.
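Roughly speaking, the rotation works like the following Python sketch. This isn't the actual script; the function name and the day-counter argument are made up for illustration, and the 500 limit is the from-memory figure above.

```python
def refresh_batch(blog_urls, day_number, daily_limit=500):
    """Pick the blogs to re-query on a given day, staying within the limit."""
    urls = sorted(blog_urls)
    # Number of days needed to get through every blog once (ceiling division).
    batches = max(1, -(-len(urls) // daily_limit))
    today = day_number % batches
    return [url for i, url in enumerate(urls) if i % batches == today]
```

With 1,200 blogs and a limit of 500, that gives three batches of 400, so every blog is refreshed once every three nights.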

Gastric woes

Published November 6, 2006

Observant taint.org readers might recall me complaining about a bout of food poisoning back in June during ApacheCon week, which, along with a poorly-timed work trip, unfortunately managed to stop me attending ApacheCon altogether.

Turns out that that "food poisoning" never went away -- four months later, I'm still having digestive troubles. However, I've been lucky enough to figure out a way to minimise it, which I'll mention here for posterity (and Google).

So, basically, the symptoms were general stomach unsettledness, nausea, cramping, a sharp pain in the right side, and heartburn -- all waxing and waning intermittently. (There were issues at "the other end" I'll leave out, in the interests of good taste.) On top of that, my level of stomach "calmness" was way off -- nausea from travelling in cars, buses, taxis etc. became an issue.

Thankfully, it didn't interfere with work much at all -- since I work from home, it was pretty easy to deal with. But it certainly put a damper on trips like ApacheCon, or BarCamp Ireland... it became quite difficult, in particular, to travel any kind of distance during the daytime. (Luckily my ability to partake in pints of Guinness during the evening was not affected, however. ;)

I did the usual thing of visiting my local G.P., and was referred to a gastro-intestinal specialist -- that's all still going on, slowly. But fortunately, in the meantime, I had a breakthrough in terms of dealing with the symptoms.

Initially, the waxing and waning of symptoms seemed pretty random, but after a week or two, a pattern emerged -- on a normal day, it'd typically be worst at about 11am, then ease off before lunch, then get worse again after lunch. During and after dinner, it'd be fine, and the evenings were almost symptom-free. On an empty stomach, there were similarly virtually no problems whatsoever.

Of course, having a link with quantities of food makes sense for a GI illness. But it eventually occurred to me that the symptoms were waxing and waning in time with specific types of food, in fact. The pattern of symptoms was tracking my drinking of milk, in cereal, and in tea or coffee, delayed by about 2 hours. Now, I've always been a total omnivore -- I've never suffered from allergies, had any issues digesting food, or suffered travel illness. My sea legs were rock solid; one trip to the Great Barrier Reef saw myself and C being the only tourists not to vom over the sides despite some heavy waves. Also, as an Irishman, tea is the core component of my diet, and tea with milk at that; and dairy is similarly at the heart of Irish cuisine in many ways, plenty of milk, cheese, and butter. I was raised on the stuff, and love it!

But the signs were pretty solid, so I gave up dairy for a week or two to try it out. It took a week to "clear out" initially, but since then, the results have been fantastic; some of the symptoms (the sharp pain, cramps, heartburn) are almost gone, and levels of the others (nausea, stomach 'unsettledness') are way down most of the time. If I eat something that contains milk, cheese or whey -- such as a packet of crisps recently -- I can tell within 10 minutes, since the pain in my right side "twinges" noticeably. It really is astounding.

The weird thing is, this came out of nowhere. A week before that bbq, I was glugging milk without a single issue, and feeling perfectly fine; I've never had issues with dairy. Then all of a sudden, it just hit me, seemingly after a short bout of food poisoning, and it still hasn't gone away.

Talking to people, though, it appears this is more common than one might think; I now know of several people who've become lactose intolerant, suddenly, in their 30s.

Anyway, the core issue is still there, but while the wheels of medical science grind on, I at least have pretty good control of the nastier symptoms again. yay.

Technorati-ranked Irish Blogs Top 100

Published November 2, 2006

So, I was thinking about the various Irish blog aggregators, Planet.journals.ie, IrishBlogs.ie, and IrishBlogs.info. Michele's Irishblogs.info attempts to "rank" the blogs by hits, but many of the Irish webloggers don't include that hit-counting HTML snippet in their web pages, so quite a few are probably missing; on top of that, RSS readers don't count. It lists me as #3, which I knew was definitely wrong, anyway ;)

However, it occurred to me that an alternative way to compute a "top 100" would be to use the Technorati rank of each blog, and make a table based on that; that'd measure the blogs by Technorati's readership-estimation algorithm, which may still be faulty, of course, but worth a try... I was curious, so I gave it a go, and here are the results. Enjoy!
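The table-building step amounts to two sorts over the per-blog figures. Here's a minimal Python sketch of that step, not the script actually used; the type and function names are made up, and the tie-breaking rules are inferred from the order of the tables in this post.

```python
from typing import NamedTuple


class BlogStats(NamedTuple):
    url: str
    rank: int           # Technorati rank: a lower number is better
    inbound_blogs: int
    inbound_links: int


def top_by_rank(stats, n=100):
    """Best (lowest) rank first; ties go to the blog with more inbound links."""
    return sorted(stats, key=lambda b: (b.rank, -b.inbound_links))[:n]


def top_by_inbound_links(stats, n=100):
    """Most inbound links first; ties go to the better-ranked blog."""
    return sorted(stats, key=lambda b: (-b.inbound_links, b.rank))[:n]
```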

Update: This table is no longer up-to-date -- a much fresher version is now available over here, and will be updated regularly.

Top 100 by rank / inbound blog links
Top 100 by inbound links

Top 100 by rank / inbound blog links:

Position	Rank	Inbound blogs	Inbound links	Blog
1	2940	638	1931	http://www.tomrafteryit.net/
2	6636	371	1280	http://www.mulley.net/
3	8231	315	625	http://twentymajor.blogspot.com/
4	10984	249	512	http://www.natterjackpr.com/
5	15720	181	409	http://www.avalon5.com/
6	18897	151	315	http://irish.typepad.com/irisheyes/
7	19364	148	472	http://www.gavinsblog.com/
8	21214	136	385	http://www.blather.net/
9	21715	133	968	http://ocaoimh.ie/
10	22210	132	399	http://eirepreneur.blogs.com/eirepreneur/
11	22258	130	323	http://thetorturegarden.blogspot.com/
12	23921	122	351	http://www.dehora.net/journal/
13	24143	121	199	http://www.atlanticblog.com/
14	24828	118	174	http://freestater.blogspot.com/
15	25570	115	260	http://arseblog.com/WP
16	25570	115	246	http://tcal.net/
17	27174	109	252	http://www.digitalrights.ie/
18	27189	110	169	http://cork2toronto.blogspot.com/
19	28004	106	731	http://taint.org/
20	29008	103	286	http://unitedirelander.blogspot.com/
21	29008	103	232	http://www.nialler9.com/blog
22	29008	103	175	http://clickhere.blogs.ie/
23	29978	100	270	http://www.mneylon.com/blog
24	31954	95	901	http://www.irishelection.com/
25	33397	91	231	http://memex.naughtons.org/
26	34121	89	370	http://siciliannotes.blogspot.com/
27	35022	86	285	http://www.sineadgleeson.com/blog
28	35022	86	146	http://www.cfdan.com/
29	35858	84	904	http://www.pkellypr.com/blog
30	36223	84	255	http://www.thinkingoutloud.biz/
31	37735	80	175	http://www.dervala.net/
32	39719	76	207	http://backseatdrivers.blogspot.com/
33	40078	76	229	http://fdelondras.blogspot.com/
34	40276	75	203	http://www.mediangler.com/
35	40821	74	128	http://www.thinkinghomebusiness.com/blog
36	44148	69	122	http://outofambit.blogspot.com/
37	45075	67	147	http://www.podleaders.com/
38	45075	67	87	http://www.aidanf.net/
39	45729	66	238	http://www.argolon.com/
40	46477	65	201	http://www.sarahcarey.ie/
41	46477	65	191	http://disillusionedlefty.blogspot.com/
42	47586	64	141	http://www.johnbreslin.com/blog
43	48011	63	66	http://www.branedy.net/
44	52278	58	398	http://dossing.blogspot.com/
45	54710	56	155	http://redmum.blogspot.com/
46	55758	55	103	http://richarddelevan.blogspot.com/
47	56390	54	148	http://donal.wordpress.com/
48	56390	54	129	http://prettycunning.net/blog
49	57527	53	104	http://www.dublinblog.ie/
50	58724	52	167	http://www.tuppenceworth.ie/blog
51	58724	52	102	http://www.inter-actions.biz/blog/
52	59920	51	101	http://seanmcgrath.blogspot.com/
53	60315	51	76	http://www.blackphoebe.com/msjen/
54	62483	49	112	http://www.infactah.com/
55	62885	49	118	http://mamanpoulet.blogspot.com/
56	63869	48	229	http://icecreamireland.com/
57	68503	45	93	http://www.web2ireland.org/
58	68503	45	75	http://www.davidmcwilliams.ie/
59	68503	45	73	http://vipglamour.net/
60	68824	45	193	http://imeall.blogspot.com/
61	72248	43	81	http://planetpotato.blogs.com/planet_potato_an_irish_bl/
62	73843	42	149	http://lettertoamerica.blogs.com/
63	73843	42	119	http://www.kenmc.com/
64	73843	42	102	http://www.pmooney.net/blogsphe.nsf
65	73843	42	70	http://bohanna.typepad.com/pureplay/
66	75725	41	107	http://bonhom.ie/
67	75725	41	93	http://www.bibliocook.com/
68	75725	41	78	http://shittyfirstdraft.blogspot.com/
69	77680	40	225	http://bestofbothworlds.blogspot.com/
70	77680	40	134	http://www.stdlib.net/%7Ecolmmacc
71	77957	40	82	http://davesrants.com/
72	79732	39	103	http://ricksbreakfastblog.blogspot.com/
73	80012	39	92	http://manuel-estimulo.blogspot.com/
74	81970	38	91	http://gingerpixel.com/
75	82240	38	248	http://www.linksheaven.com/
76	84304	37	726	http://thelimerick.blogspot.com/
77	84304	37	127	http://www.ryderdiary.com/
78	84304	37	83	http://morgspace.net/
79	84304	37	64	http://talideon.com/weblog/
80	86729	36	140	http://www.damienblake.com/
81	86729	36	124	http://irisheagle.blogspot.com/
82	86729	36	102	http://blog.rymus.net/
83	86729	36	65	http://www.adammaguire.com/blog
84	87068	36	272	http://progressiveireland.blogspot.com/
85	89814	35	145	http://www.windsandbreezes.org/
86	92646	34	43	http://football-corner.blogspot.com/
87	95258	33	207	http://www.fustar.org/
88	95258	33	171	http://www.iced-coffee.com/
89	95258	33	82	http://www.bytesurgery.com/gearedup
90	101881	31	90	http://phoblacht.blogspot.com/
91	101881	31	70	http://counago-and-spaves.blogspot.com/
92	101881	31	58	http://www.firstpartners.net/blog
93	105668	30	82	http://realitycheckdotie.blogspot.com/
94	109643	29	142	http://bifsniff.com/cartoons/
95	109643	29	75	http://dave.antidisinformation.com/
96	109643	29	60	http://conoroneill.com/
97	109643	29	55	http://www.minds.may.ie/%7Edez/serendipity/
98	109643	29	51	http://dublin.metblogs.com/
99	110005	29	78	http://www.janinedalton.com/blog
100	110005	29	54	http://www.runningwithbulls.com/blog

List by inbound links:

Position	Rank	Inbound blogs	Inbound links	Blog
1	2940	638	1931	http://www.tomrafteryit.net/
2	6636	371	1280	http://www.mulley.net/
3	21715	133	968	http://ocaoimh.ie/
4	35858	84	904	http://www.pkellypr.com/blog
5	31954	95	901	http://www.irishelection.com/
6	28004	106	731	http://taint.org/
7	84304	37	726	http://thelimerick.blogspot.com/
8	8231	315	625	http://twentymajor.blogspot.com/
9	258886	13	519	http://newswire99.blogspot.com/
10	10984	249	512	http://www.natterjackpr.com/
11	19364	148	472	http://www.gavinsblog.com/
12	164780	20	451	http://inao.blogspot.com/
13	15720	181	409	http://www.avalon5.com/
14	22210	132	399	http://eirepreneur.blogs.com/eirepreneur/
15	52278	58	398	http://dossing.blogspot.com/
16	21214	136	385	http://www.blather.net/
17	34121	89	370	http://siciliannotes.blogspot.com/
18	23921	122	351	http://www.dehora.net/journal/
19	156276	21	336	http://www.ebbybrett.co.uk/blog
20	22258	130	323	http://thetorturegarden.blogspot.com/
21	18897	151	315	http://irish.typepad.com/irisheyes/
22	29008	103	286	http://unitedirelander.blogspot.com/
23	35022	86	285	http://www.sineadgleeson.com/blog
24	87068	36	272	http://progressiveireland.blogspot.com/
25	239963	14	271	http://www.thehealthtechblog.com/
26	29978	100	270	http://www.mneylon.com/blog
27	25570	115	260	http://arseblog.com/WP
28	36223	84	255	http://www.thinkingoutloud.biz/
29	27174	109	252	http://www.digitalrights.ie/
30	82240	38	248	http://www.linksheaven.com/
31	977738	3	248	http://www.tomgriffin.org/the_green_ribbon/
32	25570	115	246	http://tcal.net/
33	45729	66	238	http://www.argolon.com/
34	29008	103	232	http://www.nialler9.com/blog
35	33397	91	231	http://memex.naughtons.org/
36	40078	76	229	http://fdelondras.blogspot.com/
37	63869	48	229	http://icecreamireland.com/
38	77680	40	225	http://bestofbothworlds.blogspot.com/
39	208904	16	210	http://www.anlionra.com/
40	471327	7	208	http://www.ravenfamily.org/sam/
41	39719	76	207	http://backseatdrivers.blogspot.com/
42	95258	33	207	http://www.fustar.org/
43	40276	75	203	http://www.mediangler.com/
44	46477	65	201	http://www.sarahcarey.ie/
45	637233	5	200	http://armchaircelts.co.uk/
46	24143	121	199	http://www.atlanticblog.com/
47	280786	12	199	http://conann.com/
48	68824	45	193	http://imeall.blogspot.com/
49	46477	65	191	http://disillusionedlefty.blogspot.com/
50	637233	5	182	http://www.everysecondpaycheck.com/blog
51	164524	20	181	http://irishlinks.blogspot.com/
52	542250	6	176	http://www.dublinka.com/
53	29008	103	175	http://clickhere.blogs.ie/
54	37735	80	175	http://www.dervala.net/
55	24828	118	174	http://freestater.blogspot.com/
56	155943	21	172	http://www.jamesgalvin.com/
57	95258	33	171	http://www.iced-coffee.com/
58	164524	20	171	http://irishcraftworker.typepad.com/an_irish_craftworkers_goo/
59	27189	110	169	http://cork2toronto.blogspot.com/
60	58724	52	167	http://www.tuppenceworth.ie/blog
61	141242	23	164	http://atp.datagate.net.uk/blog
62	148304	22	159	http://www.lifewithouttoast.com/
63	184241	18	158	http://funferal.org/
64	54710	56	155	http://redmum.blogspot.com/
65	73843	42	149	http://lettertoamerica.blogs.com/
66	56390	54	148	http://donal.wordpress.com/
67	45075	67	147	http://www.podleaders.com/
68	155943	21	147	http://dublinopinion.com/
69	35022	86	146	http://www.cfdan.com/
70	89814	35	145	http://www.windsandbreezes.org/
71	109643	29	142	http://bifsniff.com/cartoons/
72	195745	17	142	http://podcasting.ie/podcast
73	47586	64	141	http://www.johnbreslin.com/blog
74	86729	36	140	http://www.damienblake.com/
75	223280	15	137	http://thegurrier.com/
76	77680	40	134	http://www.stdlib.net/%7Ecolmmacc
77	980795	3	131	http://www.sineadcochrane.com/
78	56390	54	129	http://prettycunning.net/blog
79	40821	74	128	http://www.thinkinghomebusiness.com/blog
80	84304	37	127	http://www.ryderdiary.com/
81	86729	36	124	http://irisheagle.blogspot.com/
82	44148	69	122	http://outofambit.blogspot.com/
83	73843	42	119	http://www.kenmc.com/
84	62885	49	118	http://mamanpoulet.blogspot.com/
85	135121	24	117	http://nellysgarden.blogspot.com/
86	195745	17	115	http://blog.infurious.com/
87	542250	6	114	http://ainelivia.typepad.com/aine_livia_at_the_midnigh/
88	62483	49	112	http://www.infactah.com/
89	75725	41	107	http://bonhom.ie/
90	57527	53	104	http://www.dublinblog.ie/
91	55758	55	103	http://richarddelevan.blogspot.com/
92	79732	39	103	http://ricksbreakfastblog.blogspot.com/
93	58724	52	102	http://www.inter-actions.biz/blog/
94	73843	42	102	http://www.pmooney.net/blogsphe.nsf
95	86729	36	102	http://blog.rymus.net/
96	59920	51	101	http://seanmcgrath.blogspot.com/
97	173857	19	99	http://www.ofoghlu.net/log/
98	118678	27	96	http://irishkc.com/
99	68503	45	93	http://www.web2ireland.org/
100	75725	41	93	http://www.bibliocook.com/

Update: Here's a full list of all 569 tested blogs. Also, there's been a minor change to the rankings here; I've just realised that there was a bug in how the script handled evenly-matched blogs, so (for example) #15 and #16 were reversed in order; that's now fixed.

If you find a blog missing, it's possible that (a) it's not pinging Planet.journals.ie or (b) it's not registered with Technorati; this method requires both of those. Most Irish blogs do both, but some (Old Rotten Hat, for example) don't...

Methodology

I found this more-or-less full list of Irish weblogs at Planet.journals.ie, and selected the blogs that had pinged their site in the past 6 months, then cut that down to just the blog main-page URLs, removing duplicates.

Given that list, I then looked up each blog URL using the Technorati API, and got its rank, inbound link count, and inbound linking blogs count.

top100code.tgz is a tarball of the perl code I wrote to do this, if you fancy doing it yourself on whichever set of blogs you fancy...
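For the curious, here's a rough sketch of the ordering step in Python rather than Perl. It is not the code from the tarball: the rows are just a few entries copied from the table above, and the tie-break (more inbound blogs wins when inbound links are equal) is my reading of how the table is ordered, so treat it as an assumption.

```python
# A few rows copied from the table: (rank, inbound_blogs, inbound_links, url).
rows = [
    (63869, 48, 229, "http://icecreamireland.com/"),
    (2940, 638, 1931, "http://www.tomrafteryit.net/"),
    (40078, 76, 229, "http://fdelondras.blogspot.com/"),
    (6636, 371, 1280, "http://www.mulley.net/"),
]

def order_blogs(rows):
    # Most inbound links first; evenly-matched blogs are ordered by
    # inbound blogs (assumed tie-break, inferred from the table).
    return sorted(rows, key=lambda r: (-r[2], -r[1]))

for position, (rank, blogs, links, url) in enumerate(order_blogs(rows), start=1):
    print(position, rank, blogs, links, url, sep="\t")
```

Sorting on a tuple key keeps the tie-break explicit, which is the sort of thing that goes wrong when evenly-matched entries are handled as an afterthought.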

Maximise value, not protection (fwd)

Published October 31, 2006

Here's an excellent quote from the OpenGeoData weblog, really worth reproducing:

''We think the natural tendency is for producers to worry too much about protecting their intellectual property. The important thing is to maximise the value of your intellectual property, not to protect it for the sake of protection. If you lose a little of your property when you sell it or rent it, that's just a cost of doing business, along with depreciation, inventory losses, and obsolescence.'' -- Information Rules, Carl Shapiro and Hal Varian, page 97.

Words to live by!

Hog helps with the painting

Published October 27, 2006

Here's some pics of Hog, our new kitten, from earlier this week: Hog helps with the painting. (Warning: sickeningly cute.)

The vagaries of Google Image Search

Published October 23, 2006

Remember the C=64-izer, the quick hack to display an image in the style of the Commodore 64?

Recently, I've started getting hits to this demo image of the "O RLY?" owl -- lots of 'em.

It turns out that the C=64-ized rendition of this image is now the top hit for "O RLY" on Google Image Search; pretty bizarre, since there are obviously better images on the first search page, one result along in fact. What's more, the page listed as the 'origin page', http://taint.org/tag/today, doesn't even use that text.

This has resulted in lots of Myspace kiddies etc. obliviously using the C=64 rendering. Yay for Commodore ;)

PhishTank now supported by SpamAssassin

Published October 19, 2006

Thanks to Jeff Chan of the SURBL project, data from PhishTank is now being included in the SURBL 'ph' anti-phishing list.

This means it's now supported by all existing versions of SpamAssassin from 3.0.0 onwards. Good news, and thanks to Jeff and the OpenDNS guys!

VISA and priorities

Published October 18, 2006

A couple of years ago, various anti-spammers discussed how the credit-card payment processing companies were perfectly placed to disrupt the spam economy, by tracking down spammers through "poison pill" transactions. Nothing happened from that, though, and spam is now a bigger problem than ever.

Today, I hear that the Russian MP3 site, AllOfMP3, have lost their account with Visa to process credit-card payments.

In other words, it sounds like the banks are happy enough to close off filesharing, but couldn't be bothered dealing with spam...

Ireland now has RFID passports

Published October 17, 2006

Back in February, I wrote about some Dutch hackers remotely reading Dutch RFID passports, and my email to the Irish Passport Office enquiring about their plans.

They never bothered writing back; I guess they were too busy implementing the damn things :( Their new 'ePassports' are now mandatory for new Irish passports:

The chip technology allows the information stored in an Electronic Passport to be read by special chip readers at a close distance.

"special chip readers at a close distance" and/or "random criminals looking for Irish victims at a distance of 30 feet", I guess.

Here's the slides for Riscure's attack on the Dutch passports. Irish passports are similarly using "Basic Access Control". I wonder if Irish passport numbers are sequential, since that seems to be a key part of their attack?

DIY Glory

Published October 16, 2006

It's been a while since I've embarked on a DIY job around the house with quite as much success as the most recent one -- laying and tacking down some new carpet in the front hall. The last job was a bike rack, which had to be abandoned after the 4-inch screws proved too loose and threatened to fall out of the wall, leaving gigantic plugs of Polyfilla in their place (I'm sure bad drilling had nothing to do with it).

This has all now been forgotten in the glory of the freshly-laid carpet. Now, every time I walk past the front hall, I have to stick my head in and check out the perfectly-fitted carpet with pride. This can only last so long before my next botch job, of course...

Anti-spam group under attack — via ICANN

Published October 9, 2006

[This is a copy of an article I submitted to ICANNWatch.]

Spamhaus, the UK-based non-profit that runs the SBL and XBL anti-spam DNS blocklists, is reportedly facing serious legal trouble in the US.

A US-based spam gang has started legal action to have Spamhaus' domain name confiscated by ICANN, and reportedly, Spamhaus may have been advised badly by their US legal people; so there is now a danger that they *may* indeed lose their domain, and possibly worse.

Note that Spamhaus is entirely UK-based, bar some mirrors; however, the proposed order is aimed at ICANN, which is US-based. This is the really tricky part; can a US company kill the domain of a non-US group?

According to anti-spam lawyer Matthew Prince, 'there may be some time before ICANN is formally ordered to shut down the Spamhaus domain, but make no mistake that ICANN's lawyers will be considering their options beginning first thing Monday, if they haven't already begun the conference calls tonight' ... 'In the end, [ICANN's] decision is likely to be much more about setting a general policy than the specific details of who Spamhaus is or why they are critical for the Internet. ICANN will desperately want to stay out of this dispute, but they are subject to U.S. law and they will probably have attorneys who will argue they need to follow it. All it will take for this to end badly for Spamhaus is one lawyer at ICANN getting a little bit spooked and Spamhaus could lose not only it's .org but potentially any other TLD that ICANN controls.'

This is interesting -- if Spamhaus is forced to close down its domains and US-based mirrors, that will mean that the SBL and XBL blocklists will be down for a while, too. Typically those are used for up-front blocking, and if my servers are any indication, they take care of 75% of incoming spam before it hits any more CPU-intensive filtering.

Without those, there'll be a lot of sites around the net suddenly dealing with quadrupled spam volumes hitting their MTAs.

NEDAP voting machines hacked

Published October 5, 2006

Here's a press release from ICTE that's well worth a read if you still trust voting machines:

Concerns expressed by many IT professionals about the security of the e-voting system chosen for use in Ireland were today shown to be well-founded when a group of Dutch IT Specialists, using documentation obtained from the Irish Department of the Environment, demonstrated that the NEDAP e-voting machines could be secretly hacked, made to record inaccurate voting preferences, and could even be secretly reprogrammed to run a chess program.

The recently formed Dutch anti e-voting group, "Wij vertrouwen stemcomputers niet" (We don't trust voting computers), has revealed on national Dutch television program "EenVandaag" on Nederland 1, that they have successfully hacked the Nedap machines -- identical to the machines purchased for use in Ireland in all important respects.

ICTE representative Colm MacCarthaigh, who has seen and examined the compromised Nedap machine in action in Amsterdam, notes "The attack presented by the Dutch group would not need significant modification to run on the Irish systems. The machines use the same construction and components, and differ only in relatively minor aspects such as the presence of extra LEDs to assist voters with the Irish voting system. The machines are so similar that the Dutch group has been using only the technical reference manuals and materials relevant to the Irish machines as a guide, as those are the only materials publicly available."

Maurice Wessling, of Wij vertrouwen stemcomputers niet, adds "Compromising the system requires replacing only a single component, roughly the size of a stamp, and is impossible to detect just by looking at the machine".

Both ICTE and Wij vertrouwen stemcomputers niet view this as yet another demonstration that no voting system which lacks a voter-verified audit trail can be trusted. According to ICTE spokesperson Margaret McGaley "Any system which lacks a means for the voter to verify that their vote has been correctly recorded is fundamentally and irreparably flawed".

Margaret McGaley highlighted that it is the machines themselves that are at risk. "This particular issue is not about the vote counting software, which we already know must be replaced, this is about the machines that the Taoiseach has claimed were 'validated beyond any question'. We now have proof that these machines can be made to lie about the votes that have been cast on them. It is abundantly clear that these machines would pose a genuine risk to our democracy if used in elections in Ireland."

ICTE is repeating its call, which reflects the opinions shared by IT expert groups, including the E-voting group of the Irish Computing Society, that any voting system implemented must include a voter-verified audit-trail.

This is a major exploit. Colm's earlier mail noted

As we knew already, the machines run on m68k processors, and it's relatively easy to reverse engineer what all of the registers and inputs correspond to. The Dutch group were able to successfully assemble code to run on the machine, and even burn it on the very eeprom that comes in the machine.

Since the NEDAP design does not include XBox-style boot-time cryptographic verification of the EEPROM's contents, undetectable replacement of the operating system is a 2-minute matter of unsticking the trivial 'seals' on the voting machine's access panels, popping out an EEPROM chip, and replacing with a modified one, then closing it up again.

Once that's done, the election is rigged, as WVSN have demonstrated.
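To make the "boot-time verification" idea concrete, here's a minimal Python sketch of the general principle: refuse to run an image whose digest doesn't match a known-good value. This is purely illustrative. It is not how the NEDAP machines, the Xbox, or any real boot ROM works; real systems check a signature from code the attacker can't swap out, and the function name and image bytes here are made up.

```python
import hashlib
import hmac

def image_is_trusted(image: bytes, expected_sha256_hex: str) -> bool:
    # Hash the firmware image and compare it with the known-good digest.
    actual = hashlib.sha256(image).hexdigest()
    # compare_digest avoids leaking where the two strings differ.
    return hmac.compare_digest(actual, expected_sha256_hex)

# Placeholder bytes standing in for EEPROM contents (hypothetical).
original = b"original firmware image"
modified = b"modified firmware image"
known_good = hashlib.sha256(original).hexdigest()

print(image_is_trusted(original, known_good))
print(image_is_trusted(modified, known_good))
```

The catch, of course, is that the check itself has to live somewhere that can't be popped out and replaced along with the EEPROM.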

Update: here's their paper describing the attack in detail -- well worth a read.

a plug for Map24

Published October 4, 2006

Nat at O'Reilly Radar mentions that Multimap have added a public API. It's great to see more sites adding public APIs, but sadly, as I note in a comment there, Multimap isn't any use for me -- they, along with Google and Yahoo!, have really crappy Irish mapping. Their geocoders (the part that turns an English-language address into a GIS coordinate pair) are pretty much non-functional for Ireland.

I moved from the US to Ireland earlier this year and found this pretty frustrating, after the joys of using the US mapping sites to get driving directions etc.

Thankfully, another contender has emerged recently -- Map24.

They have a great geocoder for Ireland, and very reliable directions, which are even accurate for some of the more baroque one-way-system traffic-management changes that Dublin's city planning department have come up with recently. The look and feel of the website is a little clunky in Firefox -- not as smooth as Google's -- but it has some nice AJAXy touches now and seems to be heading in the right direction.

Interestingly, they now offer a public API for third-party mashups, and even offer an API for their geocoder -- so someone preferring the Google look and feel could mash that up, using Map24 to find the coordinates and Google to display an area map! (Actually, I think that may be how John Handelaar's earlier hack worked -- I note in the comments that he mentions Map24 provide Lycos' mapping backend. aha.)

Anyway -- Map24 -- if you're looking for a good Irish mapping/driving-directions site, it'll do the trick.

Some p0f Data From Craig

Published October 3, 2006

Regarding the use of p0f, passive OS fingerprinting, as an anti-spam measure -- on top of this analysis which I linked to a few weeks back, one of the emeritus SA guys, Craig Hughes, sends over some p0f experiences. Handily, this includes a more detailed breakdown by OS release:

I've been using the SA p0f plugin for nearly a month or so now both on gumstix's web server and my hughes-family.org server, and it actually looks like it could be pretty useful. So far I've just been scoring 0.001 for each OS to collect data, but here's the results amavis has logged:

This breakdown shows what %age of the stuff coming in via OS xyz is spam or ham. ie 84.6% of all mail received from Windows-2000 is spam, 14.9% is ham (the rest is viruses). The first numeric column is number of messages of each type. Statistics are only since the last time amavis restarted:

On his home machine (Comcast cable modem connection):

spam.byOS.Windows-2000	438	1/h	84.6 %
spam.byOS.Linux	417	1/h	18.3 %
spam.byOS.Windows-XP	265	1/h	97.8 %
spam.byOS.UNKNOWN	135	0/h	55.1 %
spam.byOS.Windows-XP/2000	24	0/h	100.0 %
spam.byOS.Novell	5	0/h	100.0 %
spam.byOS.Windows-98	3	0/h	60.0 %
spam.byOS.Windows-2003	2	0/h	66.7 %
spam.byOS.FreeBSD	2	0/h	1.3 %
spam.byOS.Solaris	1	0/h	1.8 %
spam.byOS.Windows-SP3	1	0/h	100.0 %
ham.byOS.Linux	1851	6/h	81.2 %
ham.byOS.FreeBSD	143	0/h	96.0 %
ham.byOS.UNKNOWN	102	0/h	41.6 %
ham.byOS.Windows-2000	77	0/h	14.9 %
ham.byOS.Solaris	56	0/h	98.2 %
ham.byOS.NetCache	6	0/h	100.0 %
ham.byOS.Windows-XP	6	0/h	2.2 %
ham.byOS.Tru64	2	0/h	100.0 %
ham.byOS.AIX	2	0/h	100.0 %
ham.byOS.Windows-98	2	0/h	40.0 %
ham.byOS.Windows-2003	1	0/h	33.3 %

On gumstix.com (hosted at some provider in Texas):

spam.byOS.Windows-2000	401	1/h	58.4 %
spam.byOS.Windows-XP	131	0/h	92.9 %
spam.byOS.UNKNOWN	64	0/h	18.7 %
spam.byOS.Windows-XP/2000	29	0/h	96.7 %
spam.byOS.FreeBSD	11	0/h	4.1 %
spam.byOS.Linux	11	0/h	0.5 %
spam.byOS.Windows-98	6	0/h	85.7 %
spam.byOS.Solaris	4	0/h	3.3 %
spam.byOS.Windows-SP3	2	0/h	100.0 %
ham.byOS.Linux	1983	4/h	97.6 %
ham.byOS.UNKNOWN	277	0/h	80.8 %
ham.byOS.Windows-2000	271	0/h	39.4 %
ham.byOS.FreeBSD	253	0/h	93.7 %
ham.byOS.Solaris	116	0/h	96.7 %
ham.byOS.NetCache	40	0/h	100.0 %
ham.byOS.Windows-XP	9	0/h	6.4 %
ham.byOS.Windows-NT	7	0/h	70.0 %
ham.byOS.Novell	3	0/h	100.0 %
ham.byOS.Windows-XP/2000	1	0/h	3.3 %
ham.byOS.Windows-98	1	0/h	14.3 %
ham.byOS.Windows-2003	1	0/h	100.0 %

my home machine has a lot more relayed mail coming to it (all my various craig@* email addresses forward into there) which is probably why the linux spam rate is higher there -- the relaying machines are probably running linux and forwarding spam through.

Interesting figures -- but I'm still not convinced that the correlation is high enough to form a good basis for solid anti-spam rules; reliable rules in the SpamAssassin core typically have over 95% accuracy at differentiating ham from spam (at least when we first check them in).
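Here's a quick Python sketch of that comparison, using a few of the spam/ham counts from Craig's home-machine figures above. It ignores the virus counts (which aren't broken out in the data), so the percentages come out slightly different from the logged ones; the 95% threshold is the rule-of-thumb figure mentioned above, and the function names are my own.

```python
# (spam count, ham count) per OS, copied from the home-machine figures.
home = {
    "Windows-2000": (438, 77),
    "Linux": (417, 1851),
    "Windows-XP": (265, 6),
    "UNKNOWN": (135, 102),
}

def spam_fraction(spam, ham):
    # Fraction of spam+ham mail from this OS that was spam (viruses ignored).
    return spam / (spam + ham)

def usable_as_rule(counts, threshold=0.95):
    # OS fingerprints whose spam fraction clears the threshold.
    return sorted(
        os_name
        for os_name, (spam, ham) in counts.items()
        if spam_fraction(spam, ham) > threshold
    )

for os_name, (spam, ham) in home.items():
    print(f"{os_name}\t{100 * spam_fraction(spam, ham):.1f} %")
print(usable_as_rule(home))
```

On these numbers only the Windows-XP fingerprint clears the bar; Windows-2000 falls well short of it.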

Update: it's a natural for use as a Bayes token, though. The way amavisd-new implements p0f support is perfect for this use.

BTW, my guess is that many of the spam hits for "linux" are due to things like Netgear/Linksys routers, running embedded linuces. No evidence, just guessing ;)

Linus on Bayesian filtering

Published October 2, 2006

Linus Torvalds, in a post to linux-kernel today:

I'm sorry, but spam-filtering is simply harder than the bayesian word-count weenies think it is. I even used to know something about bayesian filtering, since it was one of the projects I worked on at uni, and dammit, it's not a good approach, as shown by the fact that it's trivial to get around.

I don't know why people got so excited about the whole bayesian thing. It's fine as one small clause in a bigger framework of deciding spam, but it's totally inappropriate for a "yes/no" kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not accepting email from machines that aren't registered MX gateways. Sure, that will mean that people who just set up their local sendmail thing and connect directly to port 25 will just not be able to email, but let's face it, that's why we have ISP's and DNS in the first place.

But don't do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it - with some Bayesian rule perhaps adding a few points to the score. That's entirely appropriate. But running bogo-filter instead of spamassassin is just asinine.

Me, I like bogofilter -- those guys are cool, and it's a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases is a bigger picture than just Bayes alone, a la SpamAssassin ;)

Back in one piece

Published September 28, 2006

Well, I'm back in Dublin in one piece, after a great honeymoon in Corsica. Lots of stuff to catch up on, so if you're waiting on a response, sorry, it might take a little longer...

Hitched! Pt. 2

Published September 13, 2006

Well, the second half of the wedding -- the fun part, with dinner, dancing, friends, and family -- went off without a hitch. Our hippy-crap-laden humanist ceremony, celebrated with the aid of our friend Gerry, was a great success; the pianist and various DJs provided fantastic aural accompaniment; and the venue, Markree Castle in County Sligo, was fantastic, taking care of the entire party in every way we hadn't foreseen and putting up with us far into the early hours of the next day.

That was the most fun I've had in yonks, and thanks to everyone who came. (And those who didn't, due to the whims of US visa conditions -- you were much missed.)

Photos will follow once we're back from the honeymoon, which starts tomorrow morning. later ;)

BarCamp Ireland

Published September 8, 2006

Wow, BarCamp Ireland is really shaping up!

Unfortunately, it's very unlikely that I'll be able to make it, due to all the wedding/honeymoon activity around that time (and it being down in Cork, which is a bit of a nontrivial journey at the moment). Pity, it looks like it'll be great -- and could probably do with some more talks about open source, to go with all the web2.0/startup content ;)

SpicyLinks and del.icio.us Network Summarization

Published September 6, 2006

Ross Mayfield:

Every time I see Gabe Rivera of TechMeme, I ask for the same thing -- MeMeme. Give me TechMeme where the core index is based on who I read, about 150 people at any given time, to show me what my friends are interested in.

Funnily enough, that is exactly why I wrote SpicyLinks!

It works pretty well -- in fact, nowadays I don't really bother reading slashdot, Digg, Reddit, et al, particularly frequently, because I know that all the really interesting stuff will be at the top of my newsreader in the SpicyLinks feed.

Anyway, I've been calling SpicyLinks a 'summarizing aggregator', but the discussion that arose from Ross' posting inspired me. A little bit of hacking has come up with an interesting twist: take a del.icio.us social network, a CGI script called deliciousnetwork2opml.cgi, and 15 minutes hacking on SpicyLinks to support inclusion of OPML via a remote URI, and hey presto -- it's now a social-network summarising aggregator. ;)
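For a flavour of the OPML side of this, here's a small Python sketch that turns a list of feeds into an OPML document an aggregator could pull in from a remote URI. It is not deliciousnetwork2opml.cgi (which isn't shown here); the function name, the people, and the example.com feed URLs are all hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

def feeds_to_opml(title, feeds):
    # feeds is a list of (name, feed_url) pairs.
    root = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(root, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(root, "body")
    for name, feed_url in feeds:
        # One <outline> per feed, in the usual subscription-list style.
        ET.SubElement(body, "outline",
                      {"type": "rss", "text": name, "xmlUrl": feed_url})
    return ET.tostring(root, encoding="unicode")

# Hypothetical network members and feed URLs.
network = [
    ("alice", "http://example.com/alice.rss"),
    ("bob", "http://example.com/bob.rss"),
]
print(feeds_to_opml("my network", network))
```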

Stretch-to-fit Textareas – Now A Firefox Extension

Published September 3, 2006

Since it's been turning out to be really quite useful, here's a Firefox extension version of the Stretch-to-fit Textareas Greasemonkey user-script I wrote a few weeks back. In other words, Greasemonkey not required!

Unblocked

Published August 30, 2006

I just found an error in an Apache config file for taint.org, resulting in some of the legacy RSS feed URLs producing invalid data -- this meant that anyone subscribed to the Feedburner feed, for example, had been missing out on my witterings. Fixed now -- apologies!

Flickr’s Lousy US-Only Maps

Published August 29, 2006

Update: This is now fixed. See here for details...

Here's the 2lmc boys getting rightly annoyed about Flickr's new mapping feature, which displays geotagged photos overlaid on a mapping UI -- as they note, it's basically a steaming pile of crap outside the US:

However, because Flickr are owned by Yahoo, they're using their maps. And, like all Yahoo! products, if you're not American, it sucks.

Compare this lovely data-rich map of SF:

With this featureless grey blob:

That's just pathetic -- there isn't a single place name visible, and even the Phoenix Park, the biggest urban park in Europe, is displayed as just a light-coloured splat with a road going through it.

It appears the Yahoo! mapping data for the UK and Ireland just isn't really there. What someone needs to do is take the geotagging data from Flickr, and overlay it on the far more informative Google map data instead ;):

It's a real shame -- I used to rely on Y! Maps to get directions everywhere while in the US. They're missing out on so many customers here...

Update: good news -- the Flickr maps are now things of beauty to match Google's:

Hitched!

Published August 25, 2006

Yesterday was spent in the beautiful surrounds of Naas Leisure Centre, attending the Kildare Registry Office for a brief ceremony and some putting of pen to paper -- and hey presto, myself and the lovely C are now husband and wife ;) About time, really -- we've been going out for 13 years, after all.

This is just the legal preliminaries -- the big party is two weeks from now, in a castle in Sligo, and it's shaping up to be a great party. But still, legally, she's my wife now...

By the way, one bonus of getting the legal stuff out of the way in advance is that we now don't have to have all the fun marred by legal requirements on the big day. As a result, our mate Gerry, who a few taint.org readers will know, will be presiding over the real wedding ceremony. ;)

The EHIC and Irish government websites

Published August 21, 2006

The European Health Insurance Card is dead handy, providing access to healthcare for EU residents while travelling in Europe -- it's definitely worth having one.

There were a few reports in the Irish newspapers last week of an announcement by the Health Service Executive, warning of "a bogus website" which charges a fee of EUR22 to process applications for this:

The HSE also warned that the site is asking applicants to submit detailed financial information. "It has come to the attention of the Health Service Executive that Irish residents are being targeted by a website which is unnecessarily charging people to apply for EHIC cards. The bogus site concerned -- http://www.ehic-card.eu/ -- is not connected to the HSE," said the HSE in a statement.

I'd link to the HSE's press release on the topic, but it's down, apparently -- and that's pretty indicative of the problem. You see, I've been trying to apply for one of these recently.

The HSE has been announcing that there's no need to use this "bogus site", since we can just use the "real" site at http://www.ehic.ie/ to apply for one. Here's what they neglect to mention:

(a) that unless you're a pensioner you can't apply for one online -- you have to print out a form, fill it in, and post it to your local health office.
(b) there's no indication on the site as to what exactly your "Local Health Office" may be, just a long list of mysterious locations.
(c) in order to apply, the form demands that you supply all that 'detailed financial information' -- namely your name, address, date of birth, proof of residency, and PPS number -- anyway.
(d) the "bogus site" isn't really all that bogus after all.

If they had a simple and usable online application process, perhaps they wouldn't be plagued by other sites attempting to offer that service for what is really a quite reasonable EUR22 fee?

This is a pretty frequent phenomenon on Irish governmental websites; a half-assed attempt to bring governmental services online, resulting in shiny informational sites, full of clip-art of smiling people talking on the phone, which all come down to a bottom line of "print this out and post it in" or "call this number" -- business as usual. Having said that, at least I can generally still get a human on the phone, which still beats dealing with US government agencies, I guess!

BTW, I notice the HSE claim that it only takes 10 working days for an EHIC to arrive using their system. I applied for mine 3 weeks ago, and there's been no word yet...

Searching GMail with a Firefox Smart Keyword

Published August 18, 2006

Here's a Firefox Smart Keyword to search your GMail:

https://mail.google.com/mail/?search=query&view=tl&q=%s

Usage example, assuming you use 'mail' as the keyword: (CTRL-L) mail whatever
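If you're wondering what the browser does with that, here's a rough Python sketch of the substitution: whatever follows the keyword is escaped and dropped in place of %s. Firefox's exact escaping rules may differ from quote_plus, so take this as an approximation; the function name is made up.

```python
from urllib.parse import quote_plus

TEMPLATE = "https://mail.google.com/mail/?search=query&view=tl&q=%s"

def expand_keyword(template, terms):
    # Escape the search terms and substitute them for the %s placeholder.
    return template.replace("%s", quote_plus(terms))

print(expand_keyword(TEMPLATE, "whatever"))
print(expand_keyword(TEMPLATE, "from:bob invoice"))
```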

Don’t use bl.spamcop.net as a blocklist

Published August 17, 2006

Update: as of Oct 2007, this advice is obsolete. The Spamcop algorithms have been greatly improved, as far as I and others can tell.

I've been hearing increasing reports of false positives using bl.spamcop.net.

One today spurred me to check out exactly how many times I'm seeing it misfire on nonspam in my own mail collection. The results have been pretty astonishing.

In my nonspam collection, it fired on 1043 messages out of 8415 in July; 12.4% of the mail. It gets worse for August, though -- 884 messages out of 3729 since the start of August. That's a staggering 23.7% of my nonspam mail this month. ;)

Most of that is due to the listings of GMail and Yahoo! Groups, both of which seem to have been listed for large swathes of the past month and a half.

Now, an important point -- it can work pretty well as a single input to a scoring system, like Spamcop itself or SpamAssassin. In fact, I didn't lose any mail as a result of those listings; SpamAssassin assigns only 1.5 points to the RCVD_IN_BL_SPAMCOP_NET rule, so it's easily corrected by other rules.

However, people using it to block or reject spam outright, or who've changed the score of the RCVD_IN_BL_SPAMCOP_NET rule, need to turn that off ASAP -- as they are losing mail.
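To spell out the difference, here's a toy Python sketch of "one input to a scoring system" versus "block outright". The 1.5 points for RCVD_IN_BL_SPAMCOP_NET is the figure above and 5.0 is SpamAssassin's default threshold; the second rule and its score are invented for illustration, and this is nothing like SpamAssassin's real code.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default spam threshold

SCORES = {
    "RCVD_IN_BL_SPAMCOP_NET": 1.5,    # score quoted in this post
    "HYPOTHETICAL_SPAMMY_RULE": 4.0,  # made-up rule, for illustration only
}

def is_spam_by_score(hits):
    # Scoring: a listing is one input among many.
    return sum(SCORES[rule] for rule in hits) >= REQUIRED_SCORE

def is_spam_by_blocklist(hits):
    # Outright blocking: a listing alone rejects the message.
    return "RCVD_IN_BL_SPAMCOP_NET" in hits

nonspam_via_listed_relay = ["RCVD_IN_BL_SPAMCOP_NET"]
print(is_spam_by_score(nonspam_via_listed_relay))
print(is_spam_by_blocklist(nonspam_via_listed_relay))
```

A nonspam message relayed through a listed host survives the scoring approach (1.5 is well under 5.0) but is lost under outright blocking.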

More parallel string-match algorithm hacking: re2xs

Published August 17, 2006

Last week, Matt Sergeant released a great little perl script, re2xs, which takes a set of simplified regexps, converts them to the subset of regular expression language supported by re2c, then uses that to build an XS module.

In other words, it offers the chance for SpamAssassin rules to be compiled into a trie structure in C code to match multiple patterns in parallel. Given that this is then compiled down to native machine code, it has the potential to be the fastest method possible, apart from using dedicated hardware co-processors.

Sure enough, Matt's results were pretty good -- he says, 'I managed to match 10k regexps against 10k strings in 0.3s with it, which I think is fairly good.' ;)

Unfortunately, turning this into something that works with SpamAssassin hasn't been quite so easy. SpamAssassin rules are free to use the full perl regular expression language -- and this language supports many features that re2c's subset does not. So we need to extract/translate the rule regexps to simplified subsets. This has generally been the case with all parallel matching systems, anyway, so that's not a massive problem.

More problematically, re2c itself does not support nested patterns -- if one token is contained within another, e.g. "FOO" within "FOOD", then the subsumed token will not be listed as a match. SpamAssassin rules, of course, are free to overlap or subsume each other, so an automated way to detect this is required.

For simple text patterns, this is easy enough to do using substring matching -- e.g. "FOOD" =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated regexp functionality is available, this is no longer the case: consider /FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .

The only way to do this is to either (a) fully parse the regexp, build the trie, and basically reimplement most of re2c to do this in advance; or (b) change the trie-generation code in re2c to support states returning multiple patterns, as Aho-Corasick does.

I requested support for this in re2c, but got a brush-off, unfortunately. So work continues...
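For reference, here's what "states returning multiple patterns" looks like in a bare-bones Aho-Corasick matcher, sketched in Python. It only handles literal strings, not regexps, so it sidesteps the hard part described above; it has nothing to do with re2c's or re2xs's actual code, and the function names are my own.

```python
from collections import deque

def build_automaton(patterns):
    # goto[s] maps a character to the next state; out[s] is the set of
    # patterns that end at state s (including subsumed ones).
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches from the failure state: this is what lets
            # one state report several patterns at once.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits

print(sorted(find_all("SEAFOOD", ["FOO", "FOOD", "OD"])))
```

Here "FOO" is still reported even though it's contained within "FOOD", which is exactly the behaviour the overlapping rules need.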

In other news, that food poisoning thing I had back at the end of June has lingered on. It's now pretty clear that it isn't food poisoning or a stomach bug... but I still have no idea what it actually is. No fun :(

“Stretch-to-fit Textareas” Greasemonkey User Script

Published August 10, 2006

Here's another quick-hack Greasemonkey user script I wrote recently.

Stretch-to-fit Textareas is a user script which improves the usability of editable textareas; it causes them to "stretch" vertically to fit their contents, as you type. This behaviour was inspired by that of textareas in FogBugz.

It can be inhibited by turning off the small checkbox to the right of each textarea.

Update: it's worth noting that this is different from the Resizeable Textareas Firefox extension. Whereas the latter allows the user to resize the textareas by hand, this user script does that action automatically, based on the contents of the field; no manual resize-handle-searching and dragging is required. On the other hand, this user script will only stretch textareas vertically, whereas the extension allows them to be dragged in both dimensions. In fact, the two are complementary -- I'm running both, and I suggest you do too ;)

Update 2: here's a Firefox extension version -- Greasemonkey not required!

LKML discusses anti-spam moderation

Published August 9, 2006

LKML: Alexey Zaytsev: Time to forbid non-subscribers from posting to the list? -- the linux-kernel mailing list discusses list moderation as an anti-spam strategy.

Spam really sucks; anything that deals with email now has to include some set of anti-spam features because of it. The LKML has important features that militate against simply closing the list partially, such as being a point where bug reports are submitted -- so this is a thorny issue for them.

For what it's worth, I have written a system to further automate moderation beyond the basic features provided by Mailman and ezmlm. http://taint.org/wk/ModerateList describes this in detail; in essence, it's a specialised mail user agent designed to moderate lists quickly and efficiently, with an outboard spam filter built in (SpamAssassin, of course, via its perl API).

I moderate about a thousand messages per week using this (last time I checked), and it takes about 30 seconds per day to do so, so it's pretty efficient.

In other news: wow, talking to a good accountant can really mitigate complicated tax issues... phew.

Wedding Poems

Published August 5, 2006

OK -- looks like I've found the perfect poem for our wedding ceremony; allow me to present "Gravity of Love":

One day, one day I asked myself
What is the right number or symbol?
What is the perfect equation?
What truly is LOGIC?
And who decides right reasoning?

In cause of no answer to my quest,
I traveled through the physical and metaphysical,
I traveled through the delusional and mystical
And at last back to the physical.

I made most important invention of my life career
That it's only in the mysterious equation; logic of love
Any logical; mystical and psychological reasoning can be found.
It's you in me I only believe that's true and real

All I can say is -- Wow.

Underwhelmed by ScreenClick

Published August 3, 2006

For the past few years, I've been a very happy user of Netflix, the innovative web site which lets US residents receive DVDs via the post for a flat monthly fee. When I got back to Dublin, I was very happy to see that there was a local equivalent, in the form of ScreenClick -- so I signed up.

However, I've become increasingly disillusioned with their service, for the same reasons as Adrian Weckler writes about here...

Turnaround time: this varies wildly; it can take nearly a week from dropping a DVD in the postbox to receiving the next one. Netflix was reliably two days for me, out in suburban Orange County, California; even this Kansas blogger noted that the longest they'd waited was 4 days.

This may seem to be an externality for ScreenClick -- but really, it shouldn't be. Their business is built on the postal service, and they have to have decent results for it to work.

The 'wishlist' model: Netflix uses a queue, operating on a first-in, first-out model, while ScreenClick uses something they call a 'wishlist', where the DVDs are delivered based both on position in the list and availability -- in other words, you can find you've been delivered the DVD at number 10 in your list, instead of whatever's at the top.

Again, superficially a minor point. However, one important factor is that these services are bought by households, not by individuals. Chez jm, that means that we operated a pretty strict alternating system in our Netflix queue -- one movie for me, one movie for the lovely C, repeat. This is now thoroughly scuppered with a random 'lucky dip' system. On top of that, forget about watching a serial in order. The end result is a mess.

The website: it's atrocious, a hodge-podge of ads for third-party sites, press coverage of ScreenClick, more ads for ScreenClick (hey, I'm already a customer!), and news clippings I couldn't care less about -- with finally a few tiny sidebar boxes containing the things I want (login, search box and wishlist). My impression: it's designed to sell the company to investors and advertisers, not for customer use.

On top of that, it's all squished into a tiny window -- Irish web designers need to buy bigger screens! That late-'90s Jakob Nielsen thing about users not knowing how to scroll? They've learned by now.

That's not even talking about the awful Javascript that's used to edit the wishlist ordering, where little buttons need to be clicked repetitively, one by one, to reorder the list. Surely someone took a look around at other sites first -- Amazon perhaps -- to see how other sites do it?

Anyway, on this count, I sent in a mail containing a batch of bug reports and unsolicited opinions, and got no reply. ;)

Less bang-for-buck: pretty simple. Netflix: 3 movies at a time, more movies in the collection, $17.99 per month; ScreenClick: 2 movies at a time, EUR 19.99 ($25.56, $10 more expensive than the equivalent Netflix service) per month. Surprisingly, this is actually a minor issue compared to the others, though, since it's made plain from the outset.

These may seem to be minor points, but when selling a disposable-income service to consumers, the difference between an essential leisure-time service and a waste of pocket money is a very fine line. Looks like Adrian eventually cancelled. I'm not at that point yet, but it's heading that way...

‘Bugzilla See Earlier Comments’ User Script

Published August 1, 2006

Here's a new Greasemonkey user script which fixes a minor annoyance in the Bugzilla user interface. When viewing the 'Create a New Attachment' page, this will transclude the previous comments onto the bottom of that page, for reference while editing: bz_see_earlier_comments.user.js

Thanks to Jesse Ruderman for the nifty AJAXish iframe-transclusion trick it uses.

What Jeff Killed

Published July 31, 2006

What Jeff Killed is a blog from Shadow Hills, CA, documenting the murderous antics of Jeff, a large ginger tomcat:

we provide Jeff with food and water; however, this does little to lessen his killer instinct. To humans, Jeff is an exceptionally good-tempered and friendly cat; to rodents and other small animals, he is death itself. It could be that Jeff likes to bring us gifts to repay our hospitality. Perhaps he is simply a hardwired killing machine. All we know for certain is that he hunts down a wide variety of small animals and disembowels, decapitates, and dines on them. Often.

This was passed on by the lovely C, who noted 'number of kills is about the same, cat for cat' -- indeed, Bubba, our cat, certainly had a similar career in Irvine, CA. However, I notice that as yet, there are no cases where Jeff has left the entrails and decapitated head of a rabbit lying up against the sandals of the neighbour's 6 year old daughter... that was fun.

Kick.ie

Published July 26, 2006

I just noticed an interesting new site on the Irish web -- kick.ie.

It's closely based on the model of Digg, with a community of contributors who post new stories, comment, and "kick" stories they like so that those stories are given top billing. The interesting twist is that it's not as general as Digg -- instead of having a very broad "news" site, covering all bases, there is instead a smaller set of topic-focused "kick" sites. Using this model for the relatively-small Irish weblogging scene works pretty well, I think.

It's nicely done -- fast, clean, and with nifty features like RSS feeds throughout, and reader-contributed tagging. Nice work by Gavin Joyce!

Well worth subscribing to.

(Also, it's cool to see that one of my posts discussing Irish road deaths managed to amass 7 'kicks' a couple of weeks back ;)

Year 2038 Bug Strikes Early

Published July 20, 2006

Noted previously in the link-blog -- here are more details on the first known instance of the Year 2038 UNIX epoch rollover bug, where AOLserver installs hung due to a 32-year timeout value hitting the end-of-epoch.

It appears that it was caused by an 'official workaround' for an Oracle driver bug, where an infinite timeout was desired. Instead of implementing true support for infinite timeouts, the developer just used a very large value -- one BILLION seconds, Dr. Evil-style. Unfortunately, this led to the overflow issue.

Here's some key snippets from the mailing list thread:

Bas Scheffers:

On 17 May 2006, at 21:34, Dossy Shiobara wrote:

Dave Siktberg seems to have narrowed it down to 2006-05-12 21:25.

In what timezone? It sounds like that could equate to "Sat May 13 02:27:28 BST 2006", or 1147483648 seconds since epoch, which makes it exactly 1,000,000,000 seconds until expiry of 32 bit time. Coincidence? Seems too strange as to a computer that is not a nice round number.
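The arithmetic in that message is easy to check. Here's a quick Python sketch, assuming a signed 32-bit time_t and a typical two's-complement wraparound (the helper name `wrap_int32` is just for this example):

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1      # largest value a signed 32-bit time_t can hold
TIMEOUT = 1_000_000_000    # the "infinite" timeout used by the workaround


def wrap_int32(n):
    """Simulate storing n in a signed 32-bit integer (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


# First moment at which now + TIMEOUT no longer fits in 32 bits.
first_bad = INT32_MAX + 1 - TIMEOUT
print(first_bad)                                            # 1147483648
print(datetime.fromtimestamp(first_bad, tz=timezone.utc))   # 2006-05-13 01:27:28+00:00
print(wrap_int32(first_bad + TIMEOUT))                      # -2147483648
```

01:27:28 UTC is 02:27:28 BST, matching the timestamp quoted above; one second earlier, the sum still fits.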

'Jesus' Jeff Rogers:

I had problems starting at the exact same time but on Solaris, where they manifested as an EINVAL return from pthread_cond_timedwait. After a day of tracing the problem with debug builds and working with my sysadmin to track what changed (of course, nothing had) I came to the same 1 billion second issue.

Which coincidentally is the expiry time (MaxOpen and MaxIdle) set on my database connections. My system is ACS-derived, so I wouldn't be surprised if these database settings are common in other ACS-derived systems.

The only bug is that Ns_CondTimedWait doesn't do any wraparound on the time parameter. All the same, I've been enjoying telling people that I hit my first y2038 bug.

Andrew Piskorski:

For those interested in ancient trivia, I think it was TWO bugs, one in the Oracle driver and/or OCI libraries (most likely OCI), and one in AOLserver. I think the workaround dates from before I ever used AOLserver, but I have these old comments in my AOLserver config file:

MaxIdle and MaxOpen:

Setting these to 1000000000 is a historical bug workaround. Could now probably set this to some normal number, or set to 0 to disable entirely. E.g., in this thread Rob Mayoff says:

http://www.arsdigita.com/bboard/q-and-a-fetch-msg?msg%5fid=000Ibq

It is a bug workaround. Many Linux users (including me) saw that when AOLserver tried to close a database connection, it would hang in the Oracle driver. So people started setting MaxOpen and MaxIdle to a very large number to keep connections from closing. You can also set them to zero, but at the time the bug was discovered, AOLserver had a bug that prevented you from setting them to zero.

I believe the bug was also seen, very rarely, on Solaris.

Curtis Galloway managed to get Oracle to investigate. They suggested two workarounds: use IPC or TCP to connect (which is what I do on my system), or set bequeath_detach=yes in sqlnet.ora.

2002/01/10 14:22 EST

Uselessly, the arsdigita thread URL is now a victim of needless website reorganisation, and redirects to their front page. Still, I think that's enough info.

This is certainly going to be one of the first widely-recorded Y2038 rollover bugs, I think...

A Little Downtime

Published July 19, 2006

Quick note: taint.org, and the other sites on the same host, will be down for somewhere between 30 minutes and an hour tomorrow, at 1000 UTC, as the host moves to a new datacenter (and a new IP address).

Handily, the host will also get a hefty RAM upgrade, which should improve matters the next time we get slashdotted ;)

(If you need to get in touch during the downtime, jmason at gmail dot com will be the best bet.)

Update: this is now complete.

‘Small Engine Repair’

Published July 19, 2006

Last Friday, I visited the Galway Film Fleadh to see the Irish premiere of a new feature-length movie called Small Engine Repair, which was directed by a mate of mine called Niall Heery.

I loved it -- funny, extremely black comedy, reminded me a lot of The Deer Hunter in visual style, but unmistakably Irish at the same time. (Blog movie reviews seem to be out of favour right now, so I'll leave it at that.)

Here's hoping it picks up wider distribution very soon -- it deserves to be big, I think. Nice one, Niall! Happily, the voters of the Fleadh agreed -- it went on to win the Best First Feature award.

Actually, it's been a good year for friends and family at the Fleadh -- I note that my cousin, Eoin Ryan, picked up first prize for Best Irish Short Animation with his excellent short, Demon. Cool!

Road Deaths in Ireland

Published July 12, 2006

Road deaths are a hot topic in Ireland. They're actually lower, per capita, than rates in other countries, but are given plenty of column inches and headlines here, and have become a government priority as a result.

Here's the latest headline:

[Gay Byrne, head of the Road Safety Authority] claimed young people were ignoring road safety campaigns and that all he could do was to warn people to reduce speed and not to drink and drive. "I don't know what else we can do. We have done all the horror ads, but there are obviously a great number of people who don't look at television, listen to radio, or read newspapers and don't get the message," he said.

Ads. Great. Well, one thing that could be done is fixing the unsafe roads, and building decent ones; Irish country roads, while picturesque, are unable to deal with the levels of traffic they're now facing. It's time to apply modern safety standards, instead of considering a 2-lane boreen to be adequate.

There's been a bit of improvement here; the roads from Dublin to Sligo, and from Dublin to Dundalk, for example, are both now fantastic, well-designed roads, and safe as a result. But try to get from Sligo to anywhere that isn't Dublin, and you're right back on those boreens again -- with maniacs overtaking on blind corners into oncoming traffic and so on.

But here's the real reason for the post. I have to reserve some special scorn for this idiot:

Hotelier Declan Corbett, who employed both siblings, yesterday called on Mr Byrne to resign following his comments.

"I am after coming down from the Frewen family house and if Gay Byrne or Michael McDowell were after witnessing what I saw he wouldn't be coming out this morning with this ranting and blaming the young people of Ireland," he said. [...]

"Gay Byrne was given this job and he shouldn't have been given this job. It's typical Dublin 4 job-for-the-boys. A job like this should be given to someone in rural Ireland - somebody like Sean Og O'hAilpin that young people look up to."

Sean Og O'hAilpin, eh? As Paul Moloney noted -- that'd be the same Sean Og who ~~ended his Gaelic football career when he~~ overtook a car on a bend, at speed, crashing head-on into oncoming traffic? A great example, indeed.

I think that might be the problem.

A Released Perl With Trie-based Regexps!

Published July 7, 2006

Good news! From the Perl 5.9.2 'perl592delta' change log:

The regexp engine now implements the trie optimization : it's able to factorize common prefixes and suffixes in regular expressions. A new special variable, ${^RE_TRIE_MAXBUF}, has been added to fine-tune this optimization.

In other words, the trie-optimization patch contributed by demerphq back in March 2005 is now in a released build of Perl. Yay!

Here's a writeup of what it does:

A trie is a way of storing keys in a tree structure where the branching logic is determined by the value of the digits of the key. Ie: if we have "car", "cart", "carp", "call", "cull" and "cars" we can build a trie like this:

        c + a + r + t
          |   |   |
          |   |   + p
          |   |   |
          |   |   + s
          |   | 
          |   + l - l
          |   
          + u - l - l

What the patch does is make /a | list | of | words/ into a trie that matches those words. This means that we can efficiently tell if any of the words are at a given location in a string by simply walking the string and trie at the same time. In many cases we can rule out the entire list by looking at only one character of the input. The current way perl handles this would require looking at N chars where N is the number of words involved. (BTW: That's the beauty of a trie, its lookup time is independent of the number of words it stores but rather on the key length of the word being looked up.)
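To make that concrete, here's a toy version of the idea in Python, using nested dicts for the trie. This is only a sketch of the data structure, not how perl's regexp engine implements it, and the names `build_trie` and `match_at` are invented for the example:

```python
END = object()  # sentinel key meaning "a word ends at this node"


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_at(trie, text, pos):
    """Return the stored words that start at text[pos], shortest first."""
    node, found = trie, []
    for i in range(pos, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no stored word continues with this character
        if END in node:
            found.append(text[pos:i + 1])
    return found


trie = build_trie(["car", "cart", "carp", "call", "cull", "cars"])
print(match_at(trie, "carts", 0))  # ['car', 'cart']
print(match_at(trie, "dog", 0))    # []
```

The "dog" case is the one-character rejection mentioned above: the walk stops at "d" without ever consulting the six words individually.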

SpamAssassin, of course, both (a) is very regular-expression-intensive and (b) searches a single block of text for a large number of independent patterns in parallel. I'd love to see someone coming up with a patch to SpamAssassin that uses trie-compatible regexps when the perl version is >= 5.9.2, and gets increased performance that way. hint ;)

BTW, the Regexp::Trie module on CPAN is related -- in that it, similar to Regexp::Optimizer, Regex::PreSuf, or Regexp::Assemble, will compile a list of words or regular expressions into a super-efficient trie-style regexp. However, without the trie patch to the regexp engine itself, this would be a minor efficiency tweak at best; although having said that, Regexp::Assemble's POD notes:

You should realise that large numbers of alternations are processed in perl's regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl's own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it's running on a trie'd perl. (At least in some version after this one).

(PS: interestingly, demerphq mentioned back in March 2005 that he was working on Aho-Corasick matching next. A-C is a great parallel-matching algorithm, and I would imagine it would increase performance yet more. I wonder what happened to that...)

Linksys NSLU2 Contemplation

Published July 7, 2006

These days, I shouldn't have time for after-hours hobby projects; I should be organising weddings and so on. But it's a compulsion. ;)

As a result, here's some notes I've been keeping on building a home NAS (network-attached storage) server, using the nifty little Linksys NSLU2: http://taint.org/wk/BuildingNasServer

Anyone done this? Care to leave a comment noting the results? I'm curious.

Smithfield’s Decay

Published July 5, 2006

I live in Dublin 7, on the north side of Dublin. Historically, the north side has been run-down and under-developed, always losing out to the more well-maintained, and well-funded, south side.

A few years ago, though, it looked like this was changing; the Spire in O'Connell St. was erected, new bars and shops opened, and the Luas line was installed. One site, Smithfield Square in Dublin 7, was radically overhauled; its derelict buildings were renovated or knocked down, new construction was going up, and fantastic architecture was being put in place. The future was looking bright.

That was back around 2000/2001; in fact, I remember walking past the avenue of braziers on Millennium night. Fast forward -- I've been back in Dublin 6 months now, and as far as I can tell, all that has petered out while I was away. This Frank McDonald article in the Irish Times sums it up perfectly:

The cafes, bars and restaurants that were meant to be part of [Smithfield] are nowhere to be seen. The promoters had promised residents "an entire lifestyle on your doorstep, extended by the possibilities of the city and beyond". There was to be an eclectic mix of restaurants and stylish bars - "a unique mix of offerings, ranging from food to culture to entertainment and leisure in a family-friendly development", according to Paddy Kelly.

In November 2003, his son Chris said: "We are hoping it will emulate the New York example where everything - from your launderette, hairdresser and your masseuse - is only a block away, and that people will live, work and socialise within the same area". On another occasion, London's Covent Garden was cited as the urban model.

Incredibly, the lower end of Smithfield - through which Luas runs - remains unfinished six years after the rest of it was re-paved in an award-winning scheme by McGarry Ni Eanaigh Architects. It also has a redundant stone-clad structure, which served briefly as a plug-in point for open-air concerts.

The only real entertainment available in the area is the annual Christmas ice rink or the seriously indigenous and pre-existing horse fair, still being held on the first Sunday of every month.

Otherwise, the plaza attracts an assortment of winos, or juvenile offenders on their way to the Children's Court, handcuffed to prison warders.

The little stage set up for open-air concerts is now covered in graffiti, and hosts a solid crew of junkies and winos; the braziers are no longer lit; the square boasts a permanent encrustation of construction fencing. The fruit and veg market that used to be held in one of the buildings has been bought out and moved on to somewhere on the outskirts of town, replaced by "Fresh", which -- while it sells the odd bit of interesting food, like the nice Bretzel bakery bread -- is really just an upscale Spar. Even the local Indian takeaway has dropped in quality, and is now shipping out generic dishes that aren't even made with Indian spices.

To be quite honest, Smithfield -- and much of the north side -- gives the impression it's been abandoned again, after only one or two years of short-term investment, and no long-term thinking.

What happened?

(PS: it's not over for Dublin 7, though -- about a half-mile from Smithfield, a flashy new restaurant is set to open this weekend. But who's to say that Capel St. won't find itself similarly forgotten in a year or two?)

Blogorrah

Published July 5, 2006

Blurred Keys: Blogorrah.com - the start of empire building with 'very few overheads'. Blurred Keys, "an Irish media blog", brings the revelation that Blogorrah "copies" Gawker.com.

Honestly, though, this is blatantly obvious -- and I'd consider it unfair to call this "copying". It's simply taking a successful format and adapting it to the local market, and doing so very well indeed if you ask me.

Blogorrah is a hilarious read. If you're Irish and you're not subscribed, you're really missing out... it's the funniest thing on the Irish web these days.

Daily Links Posting Off Again

Published July 5, 2006

I've turned this off again; even though it provides a nice way for people to comment and discuss link posts (which del.icio.us doesn't provide, unfortunately), it does tend to break up the flow of the "main" article part of the weblog, and isn't entirely popular I think.

If you're interested in the links, your best bet is to read either the main page itself in your browser, where the link-blog appears over there ---> , or one of these RSS feeds:

links for 2006-07-04

Published July 5, 2006

Richi'Blog: Hotmail Has Many, Many Spamtraps

Use old user accounts; reject with "550 user unknown" for 6 months; recycle into a spamtrap. This is the technique myself and Matt Sergeant have used for several years; I don't think I've ever noted it on a web-accessible URL though, so here it is

(tags: anti-spam spam hotmail spamtraps honeypots)
Janek Simon and his Carpet Invaders

'Janek Simon unites the old geometric designs of Caucasian and Armenian carpets with the low-resolution abstractness of the Space Invaders' (via deepdisco)

(tags: carpets space-invaders games art via:deepdisco janek-simon)

links for 2006-07-03

Published July 4, 2006

85363f-deathwind.gif (GIF Image, 250x100 pixels)

This GIF is both (a) an imitation-Apple ][-screenshot and (b) valid, compilable C code for Hunt The Wumpus. amazing! it reads: "COMPILE THIS FLAG: gcc -no-integrated-cpp -DGIF89a="char *s=\"" -x c -W flag.gif"

(tags: fyad-flag gcc gif hacks somethingawful apple-ii hunt-the-wumpus)
InternetNews coverage of Google's architecture, from Urs Hoelzle at EclipseCon 2005

covering MapReduce, GFS, and -- a new one to me -- Global Work Queue: 'like old-time batch processing .. schedules queries into batch jobs and places them on pools of machines. The setup is optimized for running random computations over tons of data.'

(tags: queueing batch-jobs google distribution massively-parallel ipc-dirqueue global-work-queue)
A Search Engine That's Becoming an Inventor - New York Times

more info on the Google backend systems

(tags: google backend distribution massively-parallel queueing)
Wooster Collective: Another Crate Piece from Melbourne

milk crates hold special status down under -- this is excellent

(tags: crates melbourne australia street-art art tetris)

links for 2006-07-02

Published July 3, 2006

AdamMaguire.com: The Government prepares itself for the stem cell debate

hmm; either the Irish government is hedging its bets regarding stem cells -- or the left hand doesn't know what the right is doing. Better, but still unclear....

(tags: stem-cells ireland science research forfas)
TechWire: Dublin City Council "not interested" in city-wide mesh/wi-fi broadband

'the Council (a) doesn't get it (b) isn't interested and (c) doesn't think anyone else would care enough for it to be worth its while'. pathetic -- that could do incredible things for Dublin

(tags: dublin wifi broadband ireland bureaucrats)

Ecch – that must have been poisonous! –more–

Published June 30, 2006

Since consuming a misjudged sossie at a BBQ last Saturday, I've been suffering from a stomach bug, causing nausea, sweating and the occasional vomit (never fun). On top of this, I spent Monday to Wednesday in Serbia on a work trip.

The result -- I've managed to miss the entirety of ApacheCon EU 2006 in Dublin. I considered dropping down to catch the end of it this morning, but had to abort the attempt due to a bout of in-transit nausea.

All in all, a pretty miserable week. :(

Update: here's something vaguely uplifting -- a cover of Europe's 'Final Countdown' in Khmer.

Update 2: wow, that little stomach bug has been wreaking havoc -- over the weekend 3 more people in our social group were laid low. sorry all...

links for 2006-06-29

Published June 30, 2006

The Daily WTF - One Version to Rule Them All

'I've noticed that in several places (most prominently, Help-About), there is the product version, build number, etc. ... We don't want the customers knowing this information and need it removed.' genius (via Donal)

(tags: daily-wtf via:donal funny versioning marketing cluetrain customers software)
BreakingNews.ie: Dermot Ahern makes stem cell research pledge to Pope

'The Government will ban any EU funding for stem cell research in Ireland, the Irish Foreign Affairs Minister told the Pope today.' Amazing. This is not the Ireland I was hoping to return to! Are we back in 1980 again? wtf...

(tags: dermot-ahern fundamentalism ireland pope research science progress eu)
A Shout Out to My Pepys - THE FUTURE LIES AHEAD

'Ladies and gentlemen, I'm in a select club of the first victims of the Year 2038 Bug.'

(tags: year-2038 bugs ouch software aolserver unix epoch time)
Emergent Chaos: Email Thread Visualization

fancy infoviz of discussion threads; as I comment on the page, I think this is overkill, and GMail's "conversation" view does just fine

(tags: threading email usenet infoviz discussion)
Google Account Authentication

"Google TypeKey" in other words

(tags: distributed-authentication authentication google web)

links for 2006-06-27

Published June 28, 2006

Light Blue Touchpaper: Ignoring the “Great Firewall of China”

just firewall out RSTs, and the Great Firewall's keyword blocker is defeated

(tags: china censorship half-baked richard-clayton tcp-ip firewalls great-firewall-of-china)
Essentials, 2006 edition

Mark Pilgrim's suggested apps for an Ubuntu desktop -- some quite good suggestions here, with lots of KDE goodness. I just wish amaroK was as user-friendly and usable as the amazing (but not well-maintained) JuK, though

(tags: linux kde unix desktops software applications)

links for 2006-06-26

Published June 27, 2006

Emergent Chaos: I'm Joining Microsoft

wow, Adam Shostack joins MS!

(tags: adam-shostack microsoft jobs work security software)

links for 2006-06-24

Published June 25, 2006

Defense Tech: Damn It! '24' Stars Meet Homeland Security Bigs

The set designer for '24' helped design the operations center at the National Counterterrorism Center, apparently. I bet that really helps (via substitute)

(tags: via:substitute funny absurd dhs government tax-dollars 24 tv-vs-reality)

links for 2006-06-22

Published June 23, 2006

separated by a common language

'Observations on British and American English by an American linguist in the UK', via Ben. I fall between these two stools on a regular basis, or three if you count Hiberno-English as well

(tags: english language speech transatlantic)

links for 2006-06-21

Published June 22, 2006

Amazon.com: Lance James' Blog

author of "Phishing Exposed", general smart guy where phishing attacks are concerned. (also: Amazon does blogs now?)

(tags: phishing lance-james anti-spam weblogs amazon)
ESV Bible Blog: Mechanical Turk Recap

Bible annotation using Amazon's "Mechanical Turk" HIT service; a success. However they did invite their blog readers to participate, which would have skewed results by providing willing participants

(tags: amazon mechanical-turk hit web bible esv)
Daring Fireball: Interoperability and DRM Are Mutually Exclusive

a great post from John Gruber, pointing out the key problem with DRM -- it forces vendor lock-in, and precludes interoperability, as a core design goal

(tags: interoperability drm vendor-lock-in apple itunes aac mp3 music bpi)
Conference on Email and Anti-Spam

this year's CEAS, July 27-28 2006. CEAS is reliably the best anti-spam conference; worth attending, although I won't be this year

(tags: ceas anti-spam)
Mark Jason Dominus: Higher-Order Perl

'[the book] is about functional programming techniques in Perl. It's about how to write functions that can modify and manufacture other functions.' wow, missed this -- sounds AWESOME

(tags: functional-programming perl eval mjd books toread wishlist)
Softguide Dublin City Centre Maps

it's pretty hard to find decent maps of Dublin online -- these are very good; although not quite Google-maps-shiny, they surpass GMaps' quality in terms of data (via Sander Temme)

(tags: maps dublin softguide)

links for 2006-06-20

Published June 21, 2006

The Rise and Fall of CORBA

Michi Henning (!) slates the history of CORBA extensively, blaming the OMG's process and praising the open source community. wow (via slashdot)

(tags: via:slashdot corba michi-henning distobj rpc networking oo omg)
BLDGBLOG: Your Concrete Utopia

for sale: one partially plutonium-contaminated Pacific atoll, 718 miles from its nearest neighbour; unfortunately the golf course is closed. see also Ballard's 'Terminal Beach'

(tags: land via:bldgblog islands utopias pacific cold-war nuclear-tests johnston-atoll terminal-beach ballard)

Winding stair, Aiguafreda

Published June 20, 2006

Taken last week in Aiguafreda on the Costa Brava, Catalunya, Spain.

links for 2006-06-19

Published June 20, 2006

Juggling oranges [dive into mark]

Mail.app proprietary crapness. 'I’m forced to migrate all my mail yet again from yet another proprietary format, and the best documentation I’ve found so far is on LiveJournal. .. somebody deserves to be fired for that.'

(tags: mail mbox fidelity mail.app apple mark-pilgrim open-data openness data future-proofing proprietary)
Micronomicon Abroad: MONEY!

Maya meticulously recorded almost every penny/baht/kip/ringgit spent over the course of her 6-month travels through SE Asia. about right, going by my own experience; I wish I'd bought more souvenirs

(tags: travel backpacking asia holidays vacation)

Vodafone Ireland’s flat rate mobile data card

Published June 19, 2006

Adrian Weckler posts details of Vodafone Ireland's new flat price datacard; costing 50 Euros per month, including VAT; fully flat rate (hooray, something useful at last!); and they claim that they'll be rolling out HSDPA, which offers 1.2Mbps to 11Mbps rates, 'starting in Dublin in October'.

Those are great numbers, but further info seems thin on the ground; they haven't bothered updating their own website yet, amazingly.

Anyone got further info? What rates does it offer right now? How would one order such a beast?

links for 2006-06-08

Published June 9, 2006

DoubleTwist Ventures

'focuses on the development of interoperability solutions for digital media, and the reverse engineering of proprietary systems for which licensing options are non-existent or impractical' -- and have hired Jon Lech Johansen

(tags: reverse-engineering drm copyright jon-lech-johansen)

Holidaze

Published June 8, 2006

Quick note -- I'm off on vacation next week -- so I probably won't read any email while I'm there ;) Talk to you after the 17th.

links for 2006-06-07

Published June 8, 2006

Haystack

C|Net's distributed filesystem, a la GFS, Mogilefs (via acme)

(tags: via:acme distributed filesystems gfs mogilefs cnet haystack)
Understanding the Network-Level Behavior of Spammers [PDF, slides]

good data on large-scale spammer behaviour as of 2006, presented at NANOG37. Relay-IP-based techniques not so good any more, but we knew that. Unfortunately doesn't analyze SURBL/URIBL content-oriented DNSBLs, which have picked up the slack nicely

(tags: dnsbls blocklists nanog anti-spam nick-feamster anirudh-ramachandran botnets bgp routes)

Running Dapper

Published June 7, 2006

I took the plunge over the weekend, and live-upgraded to the new 'Dapper Drake' Ubuntu release -- ouch. Here are the two key lessons I learned:

Don't run "grub-install" in a misremembered attempt to update the current GRUB boot menu 'menu.lst' file with the new kernel; sadly, this will quietly remove important details from your old menu.lst, such as "initrd" lines, rendering those kernels unbootable. Moral: ensure brain is in gear before meddling with MBRs!
If you're a Kubuntu user, watch out. Ensure you run apt-get install ubuntu-base ubuntu-desktop -- bringing the entirety of GNOME up to date -- as well as apt-get install kubuntu-desktop after the upgrade; it appears that some part of a new hotplugging subsystem is not included as a dependency of kubuntu-desktop. Failure to do this results in an inability to use USB/hotpluggable devices, including internal devices like the Synaptics touchpad. No pointer devices (mice or touchpads) means no X server at boot, which is always a little annoying.
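For the first lesson, a quick sanity check before rebooting would have saved me some grief. Here's a minimal sketch in Python, assuming a GRUB-legacy-style menu.lst where each boot entry starts with a "title" line followed by "kernel" and (usually) "initrd" lines. The function name is made up for illustration, and it only flags candidates; it doesn't know which kernels genuinely need an initrd.

```python
def stanzas_missing_initrd(menu_text):
    """Return the titles of boot entries that have a kernel line but no initrd line.

    Assumes GRUB-legacy menu.lst layout: a 'title' line opens each entry,
    and lines starting with '#' are comments.
    """
    stanzas = []
    for raw in menu_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        if keyword == "title":
            # A new boot entry begins here.
            title = parts[1] if len(parts) > 1 else ""
            stanzas.append({"title": title, "commands": set()})
        elif stanzas:
            # Record which commands appear inside the current entry.
            stanzas[-1]["commands"].add(keyword)
    return [
        s["title"]
        for s in stanzas
        if "kernel" in s["commands"] and "initrd" not in s["commands"]
    ]
```

Feed it the contents of your menu.lst and eyeball whatever comes back. Some entries (a memtest entry, for instance) legitimately have no initrd, so treat the output as "go and look at these", not as a list of broken kernels.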

Some day I'll just do things the right way, and do a fresh-from-CD install instead. Ah well. The good stuff: the new kernel, or possibly Xorg, is proving to be a lot speedier -- window updates are noticeably smoother; and the new Ubuntu GNOME theme is similarly tasty.

SpamAssassin advisory CVE-2006-2447

Published June 7, 2006

CVE 2006-2447, in which Radoslaw Zielinski spotted a nasty in spamd's 'vpopmail' support in pretty much all recent versions of Apache SpamAssassin.

If you use spamd with vpopmail, go read the advisory and determine if you need to take action. Not many people will need to, I think; it's a very rare setup. Still, it's important to get the warning out there anyway.

The irony is that the bug is triggered partly by the "--paranoid" switch. This was intended to increase security, by increasing paranoia when possibly-unsafe situations arose -- hence providing a great demonstration of how the addition of optional code paths, even with the best intentions, can reduce security by allowing bugs to creep in unnoticed.

links for 2006-06-06

Published June 7, 2006

Gallows humor from inside Enron

fake 419s, 'How to Explain Enron to Your Children', and 'we falsify commodity markets so that we can deliver physical commodities to our customers at a ridiculously unsustainable price' -- all scraped from the Enron mail corpus

(tags: enron funny mail corporate corruption gallows-humour)
Optimizing Javascript for Execution Speed

great notes on speeding up javascript; I have a Greasemonkey script this will be useful with, once I get some tuits (via yoz)

(tags: via:yoz javascript optimization speed coding toread greasemonkey userscripts)

links for 2006-06-05

Published June 6, 2006

ITworld.com - Even the Builders of Windows Find Tech Support a Challenge

Microsoft CEO Steve Ballmer attends wedding; a parent asks if he'd have a look at their PC; Ballmer spends _no less than two days_ attempting to rid it of encrusted malware infestations -- before giving up and shipping it back to Redmond. hilarious

(tags: malware steve-ballmer ceos spyware viruses ms-windows microsoft funny)

Web x, where x != 2.0

Published June 2, 2006

Regarding the O'Reilly/CMP "Web 2.0 (SM)" trademark shitstorm, Sean McGrath humorously suggested a workaround -- using a different revision number instead of "2.0", specifically e, 2.71....

However, it's not quite that simple in many jurisdictions, apparently. It seems that trademark law -- in the US, at least -- allows trademarks which include a number to also cover uses within roughly plus or minus 10 of that number. In other words, CMP's application will cover the range from Web -8.0 (SM) (assuming negative numbers are included?) to Web 12.0 (SM).

So much for "Web 3.0", "Web 2.1", "Web 2.71...", and so on. Back to the drawing board, Sean! ;)

(disclaimer: IANAL, of course. Credit to Craig for that tidbit.)

Update: doh, got the value of e wrong...

links for 2006-06-01

Published June 2, 2006

WP-Cache

I got slashdotted yesterday! Unfortunately, stock WordPress falls over pretty quickly. Once I managed to get this plugin installed, though, things were a lot better... thumbs up for WP-Cache

(tags: slashdot slashdotting load wordpress plugins caching weblogs)
how to reduce the size of an XP vmware image

I need to do this soon; damn copy-on-write disk images are chewing up my disk space

(tags: vmware disk-images emulation windows-xp xplite disks vmplayer)
Schneier on Security: Common Passwords

one large website's password list analysed; 1.4% of passwords were "123456", and 2.5% overall began with 1234

(tags: passwords security i-love-to-count)
Live-upgrading to Ubuntu 6.06 "Dapper Drake"

Dapper is now released -- and is live-upgradable via apt-get. am I stupid enough to do this? quite possibly; I've done it for the past 5 upgrades

(tags: ubuntu dapper linux upgrades debian apt)
Pingerati

a message router for pings, for web pages containing microformat data. Interesting to see that Upcoming.org is currently the only ping producer -- their pings are then consumed by evdb, the only third-party ping receiver listed

(tags: evdb upcoming open-apis apis pingerati technorati message-routers)
slashdotting.png (PNG Image, 1024x768 pixels)

graph of request frequency over the past few days at taint.org; that spike was pretty major

(tags: graphs weblog meta)

links for 2006-05-31

Published June 1, 2006

ongoing: On Grids

great article on current grid computing, featuring MPI, MapReduce, Hadoop, and promising a new UNIXy thing from tbray called Sigrid (ha!). Mind-boggling quote from Jim Gray: 'Memory is the new disk. Disk is the new tape.'

(tags: grid-computing parallel tim-bray mapreduce hadoop mpi sigrid jim-gray server-farms)
"patent goo" -- self-replicating Paxil

spontaneously converts the off-patent anhydrous form of the drug into the patented hemihydrate form, which then successively converts more and more of the anhydrous form, Ice-9-style. Never mind "viral" licenses, this takes the biscuit! (via substitute)

(tags: via:substitute viral-licenses gray-goo paxil drugs chemistry bizarre polymorph ice-9 patents)

Blog Spam, and a ‘nofollow’ Post-Mortem

Published May 31, 2006

An interesting article on blog-spam countermeasures -- Google's embarrassing mistake. Quote:

I think it's time we all agreed that the 'nofollow' tag has been a complete failure.

For those of you new to the concept, nofollow is a tag that blogs can add to hyperlinks in blog comments. The tag tells Google not to use that link in calculating the PageRank for the linked site. [...]

Since its enthusiastic adoption a year and a half ago, by Google, Six Apart, WordPress, and of course the eminent Dave Winer, I think we can all agree that nofollow has done -- nothing. Comment spam? Thicker than ever. It's had absolutely no effect on the volume of spam. That's probably because comment spammers don't give a crap, because the marginal cost of spamming is so low. Also, nofollow-tagged links are still links, which means that humans can still click on them -- and if humans can click, there's a chance somebody might visit the linked sites after all.

I agree. At the time, I pointed at this comment from Mark Pilgrim:

Spammers have it in their heads now that weblog comments are a vector to exploit. They don't look at individual results and tweak their software to stop bothering individuals. They write generic software that works with millions of sites and goes after them en masse. So you would end up with just as much spam, it would just be displayed with unlinked URLs.

Spammers don't read blogs; they just write to them.

I still think he was spot on.

However, one part of the 'Google's embarrassing mistake' article is a red herring -- I think the chilling effect on "nonspam links" is not to be worried about; as Jeremy Zawodny said, life's too short to worry about dropping links purely in the hopes of giving yourself Page Rank. I don't know if I really want links that people are leaving purely for that reason. ;)

In fact, I wouldn't be surprised to hear that Google's crawler starts treating "nofollow" links as mildly non-spammy in a future revision, due to their wide use in wikis, blogs etc.

To be honest, though -- I don't see the problem of blog-spam much anymore. As I said here:

[Weblog] comment spam should be a lot easier to deal with than SMTP spam. ... With weblog comments, you control the protocol entirely, whereas with SMTP you're stuck with an existing protocol and very little "wiggle room".

On my WordPress weblog [ie. here] -- which, admittedly, gets only about 1/4 of the traffic plasticbag.org does -- I've instituted a very simple check stolen from Jeremy Zawodny. I simply include a form field which asks the comment poster for my first name, and if they fail to supply that, the comment is dropped. In addition, I've removed the form fields to post directly, requiring that all comments are previewed; this has the nice bonus of increasing comment quality, too.

Those are the only antispam measures I'm using there, and as a result of those two I get about 1 successful spam posted per week, which is a one-click moderation task in my email. That's it.

The key is to not use the same measures as everyone else -- if every weblog has a different set of protocols, with different form fields asking different simple questions, the only spammers that can beat that are the ones that write custom code for your site -- or use human operators sitting down to an IE window.
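The two checks described above can be sketched in a few lines. This is a Python illustration rather than the actual WordPress/PHP code running here; the form field names ("owner_first_name", "previewed") and the function name are hypothetical, and a real setup would want the preview step tracked server-side rather than trusted from a form flag.

```python
def accept_comment(form, owner_first_name):
    """Return True if a submitted comment form passes both checks.

    `form` is a dict of submitted field values. The field names used
    here are hypothetical; pick your own, which is rather the point.
    """
    # Check 1: the poster must supply the weblog owner's first name.
    answer = form.get("owner_first_name", "").strip().lower()
    if answer != owner_first_name.strip().lower():
        return False  # wrong or missing answer: drop the comment

    # Check 2: only accept submissions that came via the preview page.
    return form.get("previewed") == "yes"
```

A post straight to the comment handler, with neither field filled in, gets dropped; a human who previews and answers the question gets through. Since the question and the field names differ from site to site, generic spam software has nothing consistent to target.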

Trackbacks, however -- turn that off. The protocol was designed poorly, with insufficient thought given to its abuse potential; there's no point keeping it around, now that it's a spam vector.

Finally, a "perfect" solution to blog spam, while allowing comments, is unachievable. There will always be one guy who's going to sit down at a real web browser to hand-type a comment extolling the virtues of some product or another. The goal is to get it to a level where you get one of those per week, and it's a one-click operation to discard them.

(Update: This story got Slashdotted! The poor server's been up and down repeatedly -- looks like it needs an upgrade. In the meantime, WP-Cache has proven worth its weight in gold; recommended...)

links for 2006-05-30

Published May 31, 2006

One year without nicotine!

yay me!

(tags: non-smoking nicotine addiction cigarettes life progress)

Retroactive Tagging With TagThe.Net

Published May 30, 2006

Hacky hack hack.

Ever since I enabled tags on taint.org, I've been mildly annoyed by the fact that there were thousands of older entries deprived of their folksonomic chunky goodness. A way to 'retroactively tag' those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to TagThe.net, a web service which offers a nifty REST API; simply upload a chunk of text, and it'll suggest a few tags for that text, like this:

echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>

This looked promising.
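For anyone who'd rather not do this with curl, here's a rough Python equivalent of the example above: build the request URL, then pull the suggested items out of the XML. The endpoint and the memes/meme/dim/item layout are taken from the sample output shown; the function names are my own, and no network call is made here, so fetching the URL is left to whatever HTTP client you prefer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

API_URL = "http://tagthe.net/api/"  # endpoint as used in the curl example


def build_request_url(text):
    """Return the GET URL for a chunk of text, URL-encoding it first."""
    return API_URL + "?text=" + quote(text, safe="")


def extract_items(xml_text, dim_type="topic"):
    """Return the <item> values inside every <dim> of the given type."""
    if isinstance(xml_text, str):
        # Parse as bytes so the encoding declaration in the XML is honoured.
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    items = []
    for dim in root.iter("dim"):
        if dim.get("type") == dim_type:
            items.extend(item.text for item in dim.findall("item") if item.text)
    return items
```

Calling extract_items() on the response shown above with the default dim_type gives the "topic" suggestions, which are the ones useful as tags; passing dim_type="language" gives the detected language instead.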

Anyway, I've now implemented this -- it worked great! If you're curious, here are the details of how I did it. It's a bit hacky, since I'm only going to be doing this once -- and very UNIXy and perlish, because that's how I do these things -- but maybe somebody will find it useful.

How I Retroactively Tagged taint.org

This weblog runs WordPress -- so all the entries are stored in a MySQL database. I took the MySQL dump of the tables, and a quick script figured out that out of 1600-ish posts, there were 1352 that came from the pre-tag era, requiring tag inference. A mail to the TagThe.Net team established that they were happy with this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format 'id=NNN text=SQLHTMLSTRING' (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.

That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the 'proto-tag' system I used in the early days, where the first word of the entry was a single tag-style category name.)
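The input-handling half of that step looks roughly like this when sketched in Python (the real script was Perl). It assumes one 'id=NNN text=...' record per line, treats "the first 2k" as the first 2048 characters, and guesses the proto-tag by stripping HTML tags and lowercasing the first word; all three are assumptions of this sketch, as are the function names.

```python
import re
from urllib.parse import quote

MAX_CHARS = 2048  # "the first 2k"; the exact cutoff is an assumption

LINE_RE = re.compile(r"id=(\d+) text=(.*)", re.DOTALL)
HTML_TAG_RE = re.compile(r"<[^>]+>")


def parse_line(line):
    """Split one 'id=NNN text=...' record into (post_id, text)."""
    match = LINE_RE.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError("unrecognised line: %r" % line[:40])
    return int(match.group(1)), match.group(2)


def encode_excerpt(text, limit=MAX_CHARS):
    """URL-encode the first `limit` characters, ready for the ?text= parameter."""
    return quote(text[:limit], safe="")


def proto_tag(text):
    """Guess the old-style category: the first word of the entry, lowercased."""
    plain = HTML_TAG_RE.sub(" ", text)  # crude tag stripping, fine for a one-off
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", plain)
    return words[0].lower() if words else None
```

Truncating before encoding keeps the request URL to a predictable length, at the cost of occasionally chopping a word or an HTML tag in half at the cutoff, which doesn't matter much for tag suggestion.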

Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:

cat suggestedtags | ./taglist-to-php.pl  > addtags.php
scp addtags.php my.server:taint.org/wp-admin/

The generated page 'addtags.php' looks like this:

<?php
  require_once('admin.php');
  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
      "faq","lunchbox","destination","download","premiere","quote"));
  [...]
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",
    "winsupersite","windows","amount","base","dog","preview","system"));
?>
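Generating that page is mostly string-bashing. Here's a sketch of the idea in Python (again, the real taglist-to-php.pl was Perl), taking a list of (post ID, tags) pairs and emitting the same $utw->SaveTags() calls as above. The function names and the in-memory input format are assumptions of this sketch; the one thing worth getting right is escaping the tag strings, so that a stray quote or dollar sign in a tag can't break (or inject into) the generated PHP.

```python
def php_string(value):
    """Render a Python string as a PHP double-quoted string literal."""
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first
        .replace('"', '\\"')         # then double quotes
        .replace("$", "\\$")         # stop PHP interpolating $variables
    )
    return '"' + escaped + '"'


def save_tags_line(post_id, tags):
    """Return one $utw->SaveTags(...) call for a post."""
    tag_list = ",".join(php_string(tag) for tag in tags)
    return "  $utw->SaveTags(%d, array(%s));" % (post_id, tag_list)


def render_page(entries):
    """Return the full text of the generated PHP page for (post_id, tags) pairs."""
    lines = ["<?php", "  require_once('admin.php');", "  global $utw;"]
    lines.extend(save_tags_line(post_id, tags) for post_id, tags in entries)
    lines.append("?>")
    return "\n".join(lines) + "\n"
```

Unlike the hand-wrapped listing above, this puts each call on a single line, which PHP doesn't care about.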

Once that page was in place, I just visited it in my (already logged in) web browser window, at http://taint.org/wp-admin/addtags.php, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn't so hackish, I might have put in a little UI text here -- but I didn't.)

The results are very good, I think.

A success: http://taint.org/tag/research has picked up a lot of the interesting older entries where I discussed things like IBM's Teiresias pattern-recognition algorithm. That's spot on.

A minor downside: it's not so good at proper nouns. This entry talks about Silicon Valley and geographical insularity, and mentions "Silicon Valley" prominently -- one or both of those words would seem to be a good thing to tag with, but it missed them.

Still, that's a minor issue -- the tags it has suggested are generally very appropriate and useful.

Next, I need to find a way to auto-generate titles for the really old entries ;)