Justin's Linklog Posts

Image Watermarking With ‘pamcomp’

Web: My Dad runs a couple of websites — his architectural photography business, and Andalucia Photo Gallery, a side project selling some lovely photos from the Andalusia region of Spain.

Needless to say, as the family geek, guess who coded all that up? Using WebMake, naturally ;) This was the main reason I wrote the ‘thumbnail_tag’ plugin.

You’ll note, however, that the image to the right is watermarked, quite small, and encoded with a low quality setting. It turned out, after a couple of years of operation, that the images were being downloaded and used in print all over the place, from both sites!

It seems photo piracy is rampant. Even with terms of use clearly linked on the sites, it’s still commonplace for print publications to swipe the images — and not just the little guys, either — some big commercial names have apparently used the images without asking (or paying licensing fees).

The Andalucia gallery site was a favourite; being a good hit for ‘travel photos spain’ meant lots of images being used for holiday pages in magazines, newspapers, and so on.

Needless to say, digital watermarking software doesn’t work — it’s trivial to load an image into Photoshop, resize or crop, and resave, apparently. Even if PS did respect the watermarks, netpbm doesn’t, and a watermarked image isn’t identifiable as such once it appears in print anyway! So we went for the blunt-tool approach, adding visible watermarks to the images.

It’s pretty easy — pamcomp allows you to overlay one image on top of another, using a third as an ‘alpha mask’ to control transparency. The results are pretty nice and not too intrusive.
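As a rough sketch of that step, the function below composites a watermark onto a single photo. The function name, the file names and the quality setting of 60 are all made up for illustration, not what the sites actually use; `jpegtopnm`, `pamcomp` and `pnmtojpeg` are the netpbm tools as I understand their manuals, so check the flags against your installed version.

```shell
#!/bin/sh
# Hypothetical helper: overlay a watermark on a photo using netpbm.
# Usage: watermark_image photo.jpg mark.ppm mask.pgm out.jpg
#   mark.ppm  the watermark artwork (the overlay image)
#   mask.pgm  greyscale alpha mask, same size as mark.ppm;
#             white = watermark fully opaque, black = photo shows through
watermark_image() {
    photo=$1 mark=$2 mask=$3 out=$4
    work=$(mktemp -d) || return 1

    # Decode the JPEG, composite the overlay in the bottom-right corner,
    # then re-encode at a deliberately low quality.
    jpegtopnm "$photo" > "$work/photo.ppm" &&
        pamcomp -align=right -valign=bottom -alpha="$mask" \
            "$mark" "$work/photo.ppm" > "$work/marked.ppm" &&
        pnmtojpeg -quality=60 "$work/marked.ppm" > "$out"
    status=$?

    rm -rf "$work"
    return $status
}
```

Looping that over a directory of originals is then a one-liner, and a mid-grey mask gives a semi-transparent mark rather than a solid one.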

It’s a shame it has to be done, though… :(

MS Patents sudo(8)

Patents: The varchars.com scraped RSS feeds now include new patent grants and applications by certain companies! Interesting, although given that most developers are advised not to look, not advisable ;)

However, I glanced at the MS one — and immediately spotted this gem: US Patent 6,775,781, filed by Microsoft, is a patent on the concept of ‘a process configured to run under an administrative privilege level’ which, based on authorization information ‘in a data store’, may perform actions at administrative privilege on behalf of a ‘user process’.

This, and the patent claims, perfectly describe the operation of sudo, fundamentally as it’s operated since running on a 4.1BSD VAX-11/750 in 1980.

20 years head start on a patent application — surely that must qualify as prior art ;)

RFID Security

Security: It looks like the security people are starting to take a look at RFID, and it’s not pretty.

I link-blogged this the other day — RFDump is a tool to display and modify data in RFID tags — including deployed ones, at least in some cases. (Think rewriting the price tags in a shop, scrambling the tracking numbers on a warehouse full of goods, or corrupting frequent-shopper data on a card.)

It looks like this was also discussed at USENIX Security ’04 in an RSA presentation (those notes are swarming with typos, but the content’s there ;)

That talk has some interesting stuff — ‘blocker’ tags which spoof readers with gibberish data, or crash the collision-detection network protocol; while that’s being discussed as a security tool here, if the protocol is that hackable, and the hardware is available, I could see that having additional interesting effects in a supermarket. Of course, range is an issue — but that hasn’t stopped Bluetooth hacking, wardriving, etc.

If you ask me, it looks an awful lot like RFID is chock-full of security holes, and the features that make it so attractive (low power use, low cost, tiny size) will be the very features that militate against adding security. We could be in for interesting times here…

A ‘Boulder Pledge scoreboard’ website

Spam: Ask Slashdot: How Powerful is the Turn-Off Power of Spam? The question is, ‘How often do you make the decision to NOT buy something from a company because you know they engage in spamming activities?’

This is an old idea — it goes back to a December 1996 column by Roger Ebert, of all people, who proposes the following pledge that all internet users should take:

Under no circumstances will I ever purchase anything offered to me as the result of an unsolicited e-mail message. Nor will I forward chain letters, petitions, mass mailings, or virus warnings to large numbers of others. This is my contribution to the survival of the online community.

8 years later, it’s more important than ever.

However, it’s complicated by one additional factor — not everyone knows which products and companies use spam to advertise. For example, did you know that Kraft routinely advertise their Gevalia coffee through spam?

My suggestion — a daring individual (that rules me out ;) should set up a website where samples of major-product-advertising spam are collected from (trusted) reporters. A quick scoreboard based on how many reports a particular company accumulates, and we have a Boulder Pledge reputation service.

Some simple rules should be applied:

  • Messages arriving at never-used spamtrap addresses, or at addresses scraped from USENET or the web, especially if the message hits several of those addresses (indicating a high volume), are the basis for a listing;
  • Failure to respect opt-outs, of course, would be a biggie;
  • Using a known spamhaus, or sending via open proxies in Shandong, would be a massive thumbs-down;
  • Failure to clean up its act after being made aware of the problem, oh dear.

It’d be essential to take an extremely careful approach to this; any hint of personal axe-grinding, and the site would be useless, written off as just the work of ‘another anti-spam kook’.

Essentially, this’d be a Fortune-500-oriented version of spamvertized.org.

Reportedly, many of the large companies using spam to advertise are fully aware at a management level that they are responsible for spamming. (That line about open proxies in Shandong is no joke — at least one Fortune 500 company has hired a spamhaus that does this.)

Doubtless, some spamvertisers may be victim to an overzealous but clueless marketing department, on the other hand — but either way, a public ‘name and shame’ forum gives a great impetus for them to avoid this problem, at least once they’ve been bitten the first time.

In some cases, it’s dodgy ‘affiliates’ that use spam to advertise their products — but a company that operates affiliates really should post a policy that says that affiliates found to be spamming will be terminated and have their commissions forfeited; reportedly, that has been found in other programs to quickly cut off the problem.

Spamusement rocks!

Spam: oh man, Spamusement started off well, and has just been getting better and better; * HEATH WARNING * had me laughing out loud, and the idea of linking the entries since August 8 as a series is genius.

Announcing IPC::DirQueue

Perl: So, I wrote a new CPAN module recently — IPC::DirQueue. It implements a nifty design pattern for slightly larger systems, ones where multiple processes, possibly on multiple machines, must collaborate to deal with incoming task submissions. To quote the POD:

This module implements a FIFO queueing infrastructure, using a directory as the communications and storage media. No daemon process is required to manage the queue; all communication takes place via the filesystem.

A common UNIX system design pattern is to use a tool like lpr as a task queueing system; for example, this article describes the use of lpr as an MP3 jukebox.

However, lpr isn’t as efficient as it could be. When used in this way, you have to restart each task processor for every new task. If you have a lot of startup overhead, this can be very inefficient. With IPC::DirQueue, a processing server can run persistently and cache data needed across multiple tasks efficiently; it will not be restarted unless you restart it.

Multiple enqueueing and dequeueing processes on multiple hosts (NFS-safe locking is used) can run simultaneously, and safely, on the same queue.

Since multiple dequeuers can run simultaneously, this provides a good way to process a variable level of incoming tasks using a pre-defined number of worker processes.

If you need more CPU power working on a queue, you can simply start another dequeuer to help out. If you need less, kill off a few dequeuers.

If you need to take down the server to perform some maintenance or upgrades, just kill the dequeuer processes, perform the work, and start up new ones. Since there’s no ‘socket’ or similar point of failure aside from the directory itself, the queue will just quietly fill with waiting jobs until the new dequeuer is ready.

Arbitrary ‘name = value’ metadata pairs can be transferred alongside data files. In fact, in some cases, you may find it easier to send unused and empty data files, and just use the ‘metadata’ fields to transfer the details of what will be worked on.
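To make the directory-as-queue idea concrete, here is a minimal shell sketch. It is not IPC::DirQueue's code, API or on-disk layout: the function names and the `queue`/`active`/`tmp` subdirectories are invented for illustration, and it relies on a rename being atomic on a local filesystem rather than attempting the NFS-safe locking the module provides.

```shell
#!/bin/sh
# Sketch of a directory-backed FIFO queue (hypothetical layout and names).

# queue_init QUEUEDIR: create the working subdirectories.
queue_init() {
    mkdir -p "$1/queue" "$1/active" "$1/tmp"
}

# enqueue QUEUEDIR PAYLOAD: write the job under tmp/, then rename it into
# queue/ so a dequeuer never sees a half-written file.
enqueue() {
    qdir=$1
    payload=$2
    seq_no=$((${seq_no:-0} + 1))
    # Time-ordered name; sequence number and PID keep it unique per process.
    name=$(date +%s).$(printf '%06d' "$seq_no").$$
    printf '%s\n' "$payload" > "$qdir/tmp/$name" &&
        mv "$qdir/tmp/$name" "$qdir/queue/$name"
}

# dequeue QUEUEDIR: claim the oldest job by renaming it into active/.
# If another worker got there first the mv fails and we try the next job.
# Prints the payload and returns 0, or returns 1 if the queue is empty.
dequeue() {
    qdir=$1
    for job in "$qdir"/queue/*; do
        [ -e "$job" ] || continue      # glob matched nothing: empty queue
        claimed="$qdir/active/${job##*/}"
        if mv "$job" "$claimed" 2>/dev/null; then
            cat "$claimed"
            rm -f "$claimed"
            return 0
        fi
    done
    return 1
}
```

A persistent worker is then just a loop around `dequeue` with a short sleep when it returns 1, which is where the "start another dequeuer if you need more CPU" property comes from.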

Sound interesting? Here’s the tarball.

CEAS Roundup

Spam: So, CEAS was great fun, and very educational:

  • Got to meet up with various antispammers, including Daniel and Theo from the SpamAssassin dev team, Jeff Chan from SURBL, Dan Kohn from Habeas, Catherine Hampton from The SpamBouncer, Miles Libbey, John Levine, Neil Schwartzman — lots of good chats.
  • MS really know how to feed a conference! I hear rumours there was an extra-special tinned-meat-product-based dish at the banquet…
  • But their firewalling tendencies put a serious damper on keeping in touch with the outside world, at least until we set up an SSH tunnel on port 443 ;)
  • During a lull, Dan Kohn fired off a hands-up census — a good 75% of the attendees (roughly) admitted to using SpamAssassin!

My highlight papers:

  • IBM’s Chung-Kwei pattern-discovery system — the one which Mark dug up. Very interesting stuff; it turns out that bioinformatics is full of large corpora of data (genomes) which you then need to find patterns in. Funnily enough, so is SpamAssassin: s/genomes/spam/, s/patterns/regular expressions/. The more advanced pattern-discovery algorithms even allow complex patterns to contain alternative blocks, ‘don’t-cares’ and similar regular-expression-like features.

    The really good bit of Chung-Kwei is the Teiresias algorithm (more pages, online demo). Of course, being IBM research, it’s probably patented to the hilt, and may be tricky to license; but it’s certainly pointed us in a whole new interesting direction — anyone know any bioinformaticians?

    IBM is really gearing up on anti-spam research. 4 of the 6 papers listed on their website were presented this year, at CEAS.

  • Another good paper was On Attacking Statistical Spam Filters, by Gregory L. Wittel and S. Felix Wu, which (similarly to Henry Stern’s submission, which I helped a little with) dealt with an attack on Bayesian filters.

    This is interesting stuff; we’re pretty sure it’s not as serious as it could possibly be, in SpamAssassin’s implementation, but it’s still a serious attack.

  • The Impact of Feature Selection on Signature-Driven Spam Detection was an interesting paper on AOL’s new signature schemes. (The conference was sponsored by Cloudmark, BTW, but those guys were nowhere to be seen — in which case they missed this presentation ;)
  • Reputation Network Analysis for Email Filtering was interesting, in that it mirrors to a degree the thinking behind web-o-trust.org, but in my opinion suffered due to a lack of thought about avoiding spoofing (by including IP address information in the FOAF file, it could do this now). However, once SPF becomes pervasive, this could be combined with that to generate personalised webs of trust usable for email whitelisting.
  • Resisting SPAM Delivery by TCP Damping was very nifty; plug a classifier into your MTA, and thereby detect connections from spam relays. Once you’ve found them, you then throttle down their connection as they attempt to deliver spam. Some other TCP-level tricks can do nifty stuff like massively increasing the bandwidth consumption of the spamming machines. Very very nice!

I took copious notes on the SpamAssassin wiki, if anyone’s curious.

Patents in an open source world

Patents: Newsforge: Patents in an open source world, by Lawrence Rosen (founding partner of Rosenlaw and Einschlag).

Interesting article, but I’m not sure summary point number 2 (‘continue to document our own “prior art” to prevent others from patenting things they weren’t the first to invent’) really helps, when the patent examiners clearly haven’t performed the simplest Google check. I’ve found obvious prior art in 30 seconds, by plugging 3 words from patent claims into Google in the past (and yes, I have a reasonable idea how to read patent claims by now).

Point number 3 is interesting, since it contradicts most other advice I’ve read regarding patent searches: ‘Conduct a reasonably diligent search for patents we might infringe. At least search the portfolios of our major competitors. (This, by the way, is also a great way to make sure we’re aware of important technology advances by our competitors.) Maintain a commercially reasonable balance between doing nothing about patents and being obsessed with reviewing every one of them.’

However, this comment really is interesting and raises something major that I’d never heard of before: users of proprietary software can also face a significant risk from the patent threat. In particular, according to the linked comment, Microsoft licensed some patented technology from a company called Timeline Inc., but the license was not sublicenseable; in other words, it did not grant their customers the rights to fully use the technology! (In fairness to MS, this was established later in court.) Result: MS SQL Server OEMs and ISVs are now being sued (http://trends.newsforge.com/comments.pl?sid=39443&cid=96153).

Post-Apocalyptic Fiction

Reading: Both jim winstead and Nelson Minar have praised Earth Abides, a 1949 post-apocalyptic novel where ‘all but a handful of people die from a mystery disease’, and the ensuing narrative ‘follows one man’s attempt to rebuild something like a society.’ It seems a tip from original happy mutant Mark Frauenfelder was the pointer for both of ’em.

I’m a huge fan of the genre; I think it’s something about our age group, growing up in the shadow of Reagan’s ‘Evil Empire’ speeches, Threads and (much less terrifying) The Day After.

Given that, it looks like Earth Abides goes straight into the wishlist. However, I should make another couple of reading tips while I’m at it, in the same genre:

First off, Jack London’s short story The Scarlet Plague (1912) is a clear antecedent to Earth Abides. In this story, too, a plague hits the planet and wipes out most of civilization; an old man talks to children who’ve known nothing but the post-apocalypse period. It’s pretty short and well worth a read.

But my main recommendation is Kim Stanley Robinson’s The Wild Shore (1984), first book of his Three Californias trilogy, and his debut novel.

It takes place in 2047, 60 years after a massive nuclear attack on the US, by Russian infiltrators (pretty dated, eh ;). The narrator is a teenager in a primitive agrarian community on the coast of southern Orange County. His group are farmers, living far away from the previously built-up areas; the people who live amongst those ruins are shunned, and the different tribes meet only occasionally to trade. Disposable butane lighters are a treasured commodity.

He gradually discovers that the US was once a superpower, and that they are now being kept in a virtually stone-age state by outside powers. The interesting factor here is that most sci-fi authors, at this point, would embark on a jingoistic, militaristic armed struggle; it initially seems that’s what’s happening, but Robinson takes a very interesting tack, in his own style, and this really makes the book something special.

(I won’t go too far into it, but if you really want to know and don’t mind spoilers, this site thoroughly spills the beans.)

Counterfeit Cops

IP: A funny IPR-enforcement-related story from New Scientist (sorry, subscriber-only link):

Just before delegates (to the 28 May ‘Global Congress on Combating Counterfeiting’) left Brussels to ponder their future anti-counterfeiting measures, a salutary tale started doing the rounds.

The WCO (World Customs Organization) produces a CD database of the codes needed to identify goods by type so that local customs authorities can collect the appropriate duties. The discs sell for EUR 1000 apiece, but WCO investigators have found that staff at some border posts, which are supposedly the front line in counterfeit detection, are not using the official CDs. Instead – you’ve guessed it – they are buying cut-price pirated copies, complete with crudely photocopied, plainly fake covers and sleeve notes.

Physician, heal thyself!

Kentucky sez ‘Opt-Out Still Doesn’t Work’

Spam: Some fantastic data in this paper from the Kentucky Long-Term Policy Research Center.

It’s a brief 2-pager detailing the effectiveness of the CAN-SPAM Act in reducing the spam load, using a set of test addresses. The methodology is pretty good.

One point in particular is very important: ‘opting out’ from spam Just Does Not Work. This graph tells the whole story:

After opting out from spams received, the amount of spam received at those ‘opted out’ test addresses actually rose. (This even after CAN-SPAM made such activity explicitly illegal.)

Some other data:

  • obfuscating addresses on web pages is still working; 7.7 times the spam is received if you don’t bother doing so.
  • e-mail harvesting also continues after CAN-SPAM made it illegal.

If anyone needed proof, this shows that spammers are quite happy to break the law; strong enforcement ‘teeth’ are needed for any anti-spam legislation. (UK, take note: the thoroughly useless system whereby spam complaints must be submitted on paper isn’t going to help!)

The Technical Details document also notes something interesting: one test address was set up to test ‘opting out’ of legitimate mass mail from some (unnamed) big websites, and continued to receive ads ‘sometimes months after opting-out’. For shame!

(thx to John Levine for forwarding the links.)

Spam: Michael Radwin on open HTTP redirectors, and in particular noting that Yahoo! have (finally) closed their main one down. One down, several hundred to go ;)

Good history of the exploitation techniques that spammers have been using, too.

BEST SONG EVER — identified!

Funny: Some of the taint.org readership (that’s you, Nishad) may be familiar with BEST SONG EVER.mp3 — it’s an insane, 10-minute workout: one guy ranting at a high pitch in some east-asian language at an incredible speed over some cheesy Casio, hardly taking a breath, punctuated by bizarre 7-Zark-7-style ribbits and squawks. By the end of it, he’s nearly hoarse. It is incredibly bizarre. Turkopop has nothing on this.

Well, its origin has been discovered: he’s called E Pak Sa, and the style is called ‘Pansori’. His version is a modern take on this ancient traditional style: ‘While singing, he would imitate the sound of all of the instruments used in the prelude and interlude, and even the sound of the whistle used to gather the tourists.’ From there, he grew in popularity, especially in Japan:

‘Sell-out concerts, myriad television appearances, riots at in-stores, and Japanese teens speaking Korean are all products of E Pak Sa’s impact in Japan. E had infiltrated the popular culture of Japan and paved the way for other Korean artist to do the same.’

And guess what — his Encyclopedia of Pon-Chak album can be listened to online! The YMCA cover — track 2 — is strongly recommended.

Ross Anderson not quite so cool anymore

Security: Ross Anderson, crypto and security guru extraordinaire, moonlights as — wait for it — a street bagpipe player:

I play the pipes (the Great Highland Bagpipe and the Scottish smallpipes). I played competitively as a teenager, and thereafter paid my way through university by working as a street musician in Germany, France, the Netherlands and Denmark.

NOOOOOO! ANYTHING BUT THE BAGPIPES!!!

Only joking. But yes, he really does play the bagpipes. And that submission to the EU’s consultation on the management of copyright and related rights is worth a read, to get an idea of how the new increased enforcement of music copyright has had chilling effects on the viability of the UK’s folk music scene. (found via Karl-Friedrich Lenz.)

Keira Knightley

Funny: Keira Knightley’s photoshop boobjob has been all over the place recently; it’s a pretty extensive reworking. But then, that’s standard practice nowadays…

However, best comment goes to stephendann:

In photo 2, she has the quad damage. The skin colour darkens, the chest expands, the stomach contracts and the character skin is obviously altered so the rest of the players know she’s supercharged. In POTC:King Arthur, it’s a more subtle damage modified than (the) UT2K4 glowing purple bow.

LOL!

Great SSH tip, and how to fix a KDE glitch

Unix: via Ted Leung, Adam Rosi-Kessel’s Linux Tips page has some very useful tips, and this one’s great: to avoid getting SSH connection resets, add the following to your .ssh/config:

    serveraliveinterval 300
    serveralivecountmax 10 

This will insure (jm: sic) that ssh will occasional send an ACK type request every 300 seconds so that the connection doesn’t die.
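For a sense of what those two numbers mean together, the sketch below writes the same settings as a per-host block and works out how long ssh will tolerate a silent server. `example-host` and the scratch file are placeholders, not anything from the original tip; the arithmetic follows my reading of ssh_config(5), where ssh disconnects once `ServerAliveCountMax` keepalives in a row go unanswered.

```shell
#!/bin/sh
# Write a sample client config to a scratch file (not your real ~/.ssh/config).
sample_config=$(mktemp)
cat > "$sample_config" <<'EOF'
Host example-host
    ServerAliveInterval 300
    ServerAliveCountMax 10
EOF

# One keepalive every 300s; ssh gives up after 10 unanswered ones.
interval=300
count_max=10
give_up_after=$((interval * count_max))
echo "ssh drops a dead connection after roughly ${give_up_after}s"
```

To try the scratch file without touching your own config, something like `ssh -F "$sample_config" example-host` should pick it up.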

A similar tip that took a while to track down: KDE users who’ve upgraded between KDE releases will probably by now have seen lots of messages like this:

  nameofapp (KIconLoader): WARNING: Icon directory /usr/share/icons/hicolor/
  group 48x48/stock/text not valid.

It took a bit of googling about to find the cure:

  • run in a shell (I cannot find this on any menu): kdebugdialog --fullmode
  • select: debug area: 264 kdecore (KIconLoader)
  • Change the Warning Output to ‘None’
  • select: OK

DVD pirate’s pitch ends in arrest

Funny: BBC: DVD pirate’s pitch ends in arrest:

A man has been arrested after trying to sell counterfeit DVDs – at a Trading Standards Office.

The man had apparently missed the sign on the office in Beehive Lane, Chelmsford, Essex, and asked if anyone would like to buy pirated films. Staff said they were very interested indeed in what he had to sell, but when he realised where he was he ran off, leaving his wares and £210 in cash.

Police later arrested the man in a supermarket in Chelmsford.

Hacking Netflix

Movies: Hacking Netflix, via torrez.

Jason Kottke points out a great quote on a Friendster cross-site scripting attack: ‘We have a policy that we are not being hacked.’

He also speculates that Google used the GMail invite-network data for whitelisting — but whitelisting based on email address alone is trivially exploitable, so I’d doubt it.

I’m just back from a trip over to Cape Cod to meet family (halfway between here and Ireland, y’see ;) — lots and lots of luvverly lobster and sundry shellfish — and after a 6 day trip, had 5000 spams and a couple of thousand nonspam mails to deal with. Thankfully SpamAssassin dealt with the spams (only about 5 false negatives, no false positives I could spot) — but I’m going to have to do something about that volume of mail. drowning in the stuff. argh.

Microsoft 0wnz ‘http’

Web: Back in 2002, it occurred to someone to check the Google search results for ‘http’, to figure out what the most popular sites were.

Looks like it’s changed — here’s the top five results from a Google search for ‘http’ now:

  • 1: Microsoft
  • 2: AltaVista (!!)
  • 3: Yahoo!
  • 4: My Excite
  • 5: Google

My guess: older links are getting good PageRank, using whatever new tweaked algorithm they’re using. But AltaVista beating Google? ;)

RTE’s Bush Interview

TV: RTE’s ‘Prime Time’ secured a fantastic interview with GWB, with Carole Coleman asking a few very pointed questions. Watch it with RealPlayer, or listen to the audio in MP3 (2.7Mb).

There’s a pretty accurate transcript here:

Let me finish! How many times do I have to tell you how to do your job? See, I gotta insult France at least once. Then I gotta claim ‘merica to be the most generous nation in the whole wide world, even though it’s not true. And listen, let me mention that democracy in Pakistan, too. And guess what? I’m the first president to ever call for a Palestinian state and I’m damn proud of it – just look at the size of my smirk now. Listen, as long as I keep repeating myself and mouthing empty platitudes, you won’t have a chance to call me on any of the bullshit coming out of my mouth.

OK, the official one is here.

It appears that the White House just dropped the ball on this one; reportedly, they had her list of questions three days in advance, but given that they suggested that she ‘ask him a question on the outfit that Taoiseach Bertie Ahern wore to the G8 summit’ (!!!), they weren’t paying attention, and expected some kind of giggling moronic schoolgirl, or something.

Hilariously, the White House has since complained to RTE, the Irish Embassy, the Irish Government, and the reporter herself. Probably God, too. I doubt Prime Time will ever get a White House interview again, but given what they clearly expect from the poodles in the White House press corps, that’s hardly much of a loss.

(I’d love to see what’d happen if he had to deal with Paxman ;)

Also, went to see Fahrenheit 9/11. Fantastic movie, and best of all, incredibly well-attended.

My favourite moment: the reminder of just how easily the US news media sold itself out during the war. Seeing Katie Couric blurting ‘Navy Seals rock!!’ like some kind of starstruck 5-year-old with an Action Man toy, was a classic. It’s good to see that this will be immortalized in celluloid, as it was truly shocking at the time. (Not much has changed; Judith Miller is still writing for the NYT.)

Samuel L. Jackson’s ‘Irish’ comment

Ireland: Here’s a hot UL (urban legend) that’s floating around the Irish web right now:

In a British program about Samuel L Jackson and Colin Farrell’s latest movie SWAT, presented by British presenter Kate Thornton, the following exchange occurred:
  • Thornton: What was it like working with Colin (Farrell), cos he is just so hot in the U.K. right now?
  • Jackson: He’s pretty hot in the U.S. too.
  • Thornton: Yeah, but he is one of our own.
  • Jackson: Isn’t he from Ireland?
  • Thornton: Yeah, but we can claim him cos Ireland is beside us.
  • Jackson: You see that’s your problem right there. You British keep claiming people that don’t belong to you. We had that problem here in America too, it was called slavery.

… yeah, right. ;)

(Update: Actually, believe it or not, that’s more or less how it really went. Here’s the transcript.)

Some commentary at TheReggaeBoyz.com (quote: ‘I NEARLY DEAD TO RASS!!!!’) and Kuro5hin.

It looks like the TV programme does exist; no scripts online, unfortunately, so we’ll never figure out if this one really happened, I think.

IMO, it’s made up for sure. That last line is just a little too harsh for a primetime schmooze-a-gram, at the very least. Plus, it’s the kind of thing only an Irishman would give a shit about — the perpetual adoption of Irish celebs and worthies by the UK media is a continual source of irritation for the Irish — as Dervala puts it:

‘No, Oscar Wilde was ours. You put him in jail, though. And Shaw was ours. And Yeats. And Johnny Rotten.’

Announcing a new script

Web: Minor software announcement — after some time using HTMLThumbnail, album, and even WebMake to build photo galleries, I finally got peeved enough, and gave in to the temptation of ‘not invented here’. ;)

Presenting Uffizi, a CSS- and template-driven, themable perl script to generate photo galleries. Quoting the POD:

  • it’s very self-contained, apart from dependencies on Image::Size and the ImageMagick convert command
  • fast, efficient incremental rebuilding
  • generates full CSS-styled, templated and valid HTML
  • every part of the generated HTML can be modified through the templates
  • generates reasonably-sized images as well as thumbnails, with a link to the full-sized image
  • secure — all pages are static HTML, so your webserver won’t get r00ted through a silly photo album script

I am, of course, using it on my own photo pages, and I’m very happy with it; it’s been a while since I had to hack it. (I need to get it to thumbnail MPEGs as well, but apart from that it’s teh nifty IMO.)

SpamAssassin now an Apache TLP!

Spam: SpamAssassin is now officially an Apache top-level project! InternetNews.com coverage:

The Apache Software Foundation is taking the spam fight to a new level — literally — with the promotion of its Spam Assassin project to top-level status.

Hooray ;)

The ‘humans are 99.84% accurate’ figure

Spam: ‘The spam-classifying accuracy of a human being is 99.84%’. This statement has passed into Slashdot lore as the gospel truth, so time for some debunking.

First off, that’s not what Bill Yerazunis said in the CRM114 paper, Sparse Binary Polynomial Hashing and the CRM114 Discriminator. Here’s the real quote:

the human author’s measured accuracy as an antispam filter is only 99.84% on the first pass

Here’s a copy of the original mail:

I manually classified the same set of 1900 messages twice, and found three errors in my own classifications, hence I have a 99.84% success rate.

(my emphasis). In other words, the author sat down and ran through 1900 messages manually, then ran through them again, and checked to see how many messages in the first batch disagreed with the second.

Let’s consider an alternative situation, where a user is presented with one message, and asked to take their time, give it a full examination and some thought, and then classify the message. I would consider that more likely to be classified correctly, since fatigue will not be an issue (after 1900 messages, I’m pretty tired of eyeballing), and neither will time pressure (taking 20 seconds on each of 1900 mails would require over 10.5 hours, and would be excruciatingly boring to boot).
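Both figures are easy to check. The two helper functions below are just names I've picked for the arithmetic: 3 disagreements out of 1900 messages gives the famous percentage, and 20 seconds per message gives the time estimate.

```shell
#!/bin/sh
# accuracy_percent ERRORS TOTAL: share of messages classified consistently.
accuracy_percent() {
    awk -v e="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (1 - e / n) * 100 }'
}

# hours_needed MESSAGES SECONDS_EACH: total time spent, in hours.
hours_needed() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 3600 }'
}

accuracy_percent 3 1900   # prints 99.84
hours_needed 1900 20      # prints 10.6 (38000 seconds)
```

Note how coarse the measurement is: a single extra disagreement would move the figure from 99.84% to 99.79%.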

In addition, the study wasn’t clear on exactly how much information from each mail was presented. Too little (just the subject line) or too much (every header and raw HTML), and a human will be more likely to make mistakes than if the mail is rendered fully, and the extraneous header info hidden. In my experience, I’ve never hand-classified 1900 messages purely through either method, because it’s just too tiring, and I know I’ll make quite a few mistakes. The UI for this work is important.

And finally, the figure is derived from a study with one user performing a task once. There’s no way you could use that figure in a serious setting — it’s not valid statistical science. Here’s Henry’s comment:

Yerazunis’ study of “human classification performance” is fundamentally flawed. He did a “user study” where he sat down and re-classified a few thousand of his personal e-mails and wrote down how many mistakes he made. He repeats this experiment once and calls his results “conclusive.” There are several reasons why this is not a sound methodology:
  • a) He has only one test subject (himself). You cannot infer much about the population from a sample size of 1.
  • b) He has already seen the messages before. We have very good associative memory. You will also notice that he makes fewer mistakes on the second run which indicates that a human’s classification accuracy (on the same messages) increases with experience. For this very reason, it is of the utmost importance to test classification performance on unseen data. After all, the problem tends towards “duplicate detection” when you’ve seen the data before hand.
  • c) He evaluates his own performance. When someone’s own ego is on the line, you would expect that it would be very difficult to remain objective.

So, to correct the statement:

‘The spam-classifying accuracy of this one guy, when classifying nearly two thousand mails by hand, was 99.84%, once.’

Cormack and Lynam’s study on supervised spam detection

Spam: or, ‘SlashDot spam drama’. So, a few days ago, I forwarded a link to a paper I’d been sent — it’s a great paper, and I’m not just saying that because SpamAssassin did well — it really tests some of the popular open-source spam filters comprehensively, and correctly. (The authors have 24 years of information retrieval research between them.)

The results have been pretty incendiary. ;) Here’s a timeline with links, in case you were wondering where we are right now:

A UNIX shell tip

UNIX: I’ve just made the first change to my core bash configuration in years, to add -b to the set command-line. It triggered some thinking about when the last one was.

It turns out that, apart from frequently writing scripts and aliases, I haven’t changed my command-line UI in any respect for about 2 years. By contrast, I’ve been hacking about with GUI settings continually: new desktop backgrounds, themes, colours, etc. Odd!

Anyway, here’s the tip — it’s very handy, I find.

I changed to using a 2-line prompt, with the first line containing the time and the full working directory, in a ‘magic’ cut-and-pasteable format:

        : exit=0 Thu Jun 24 17:55:29 PDT 2004; cd /home/jm/DL
        : jm 1203...; 

Note that the prompt starts with “:”, the shell’s no-op command, which means that bash/sh will discard everything up to the “;”. The end result is that the entire line evaluates to “cd /home/jm/DL” when pasted. Hey presto, cd’ing several terminals to the same dir just involves triple-clicking in one, and middle-button-pasting into the others. Nifty! Similarly, the second line has a little bit of prompt, but that snippet will be ignored when cut and pasted.

Having the exit status of the last command (bash var: $?) is useful too. The code:

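  do_prompt () {
    # $1 is the exit status of the last command, passed in by PROMPT_COMMAND
    echo ": exit=$1 $(date); cd $PWD"
  }
  PROMPT_COMMAND='do_prompt $?'   # executed before every prompt
  do_prompt 0                     # set up first prompt
  PS1=": $(whoami) \!...; "       # second prompt line: user and history number
  PS2="... >>; "                  # continuation prompt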
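
One caveat: the prompt line prints the directory unquoted, so a path containing spaces won't paste back as a single "cd" argument. Here's a rough sketch of a variant that should cope with that, assuming bash (the %q format of printf is a bash feature, not plain sh). The name do_prompt_quoted is made up for this illustration; it isn't part of my actual config.

```bash
# Sketch only: do_prompt_quoted is a hypothetical name for illustration.
# Assumes bash, since printf's %q format is not available in every sh.
do_prompt_quoted () {
  local last=$1   # exit status of the previous command, passed by the caller
  # %q escapes spaces and shell metacharacters in $PWD, so the pasted
  # line still evaluates to a single well-formed "cd <dir>" command.
  printf ': exit=%s %s; cd %q\n' "$last" "$(date)" "$PWD"
}

PROMPT_COMMAND='do_prompt_quoted $?'   # executed before every prompt
```

In a directory like "/home/jm/my files", this should print the path with the space backslash-escaped, so triple-click-and-paste still lands you in the right place.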

The Web-App generation

Software: Mark Twomey, in response to all the Win32 API stuff recently:

We now have a generation of computer users … who have never received or sent email from a so called ‘rich client’, never had to send a postal order off to order something from some distant vendor, and are not amazed by something like a search engine. ….

Those (‘rich client’) people remind me of minicomputer users who crapped on the ‘crummy little operating systems’ used on ‘crummy little desktop computers.’

He’s right, you know — for de yoot, Windows is generally just a way to access Hotmail.

Ahmad Chalabi and Iran’s encryption

Security: some crypto drama.

Ahmad Chalabi apparently told the Iranian government that the NSA had broken their secret code, according to ‘US intelligence officials’: NYTimes: Chalabi Reportedly Told Iran That U.S. Had Code. This story is still running — Bruce Schneier has just posted his expert opinion, as has Ross Anderson. As I noted on Eric Rescorla’s weblog, here’s my (non-expert) theory ;)

It’s known that the Iranians used Crypto AG equipment up until about 1992, and it’s been widely reported that Crypto AG’s systems were backdoored by the NSA and traffic routinely decrypted. (also, Baltimore Sun story, 1995)

Reportedly, the Anglo-Irish discussions of 1985 were a rather one-sided affair, because the Irish government used Crypto AG machines to communicate between their Embassy in London and Dublin, and intercepts of their reports were fed back to the UK government.

In addition, according to this article (backup), the NSA also provided Iraq with intercepts of Iranian secret traffic, while Iraq was a US ally — which could explain why Chalabi would have known about it.

It also speculates as to how it was done:

‘Knowledgeable sources indicate that the Crypto AG enciphering process, developed in cooperation with the NSA and the German company Siemans, involved secretly embedding the decryption key in the cipher text. Those who knew where to look could monitor the encrypted communication, then extract the decryption key that was also part of the transmission, and recover the plain text message. Decryption of a message by a knowledgeable third party was not any more difficult than it was for the intended receiver. (More than one method was used. Sometimes the algorithm was simply deficient, with built-in exploitable weaknesses.)’

So my opinion is that Chalabi’s claim was very old news from the 80’s and early 90’s — which pretty much fits in with the rest of his tip-offs to everyone else ;)

“Vice-President Hunter Thompson”

Politics: Kerry in Colorado:

“Just to put your minds all at ease, I have four words for you that I know will relieve you greatly,” Kerry told the fund-raiser. “How does this sound? Vice President Hunter Thompson.”

Travel: Great posting on culture shock and ‘going native’ at Yankee Fog.

Hacks: Dan Kaminsky’s LayerOne presentation hits Slashdot. Definitely one of the highlights of that conference.

Spam: confession for two: a spammer spills it all. Interesting — especially since the spammer winds up earning less than he would have working for Starbucks.

It’s also worth noting this posting from Gary Smith on the sa-users list, in which Gary filled out a spam form with some not-entirely-valid info — with hilarious results!

So I did talk to some of these lenders. Apparently they buy leads from www.lendergateway.com . One guy that I talked to was irritated because it costs him $100 per lead they sell him and it’s supposed to only be sold to him. He apologized quite a bit and was nice enough to give me the information on who sold him the names. The number he game me goes to voicemail which I’m going to try later. A couple other people told me what I can do with myself and one lady kept saying that she couldn’t give me information on who provided her with my information.

The stupid thing is each time I talk to them I tell them I’m on a cell and that I need their name and number and I’ll call them right back. They give it to me… So when they hang up I start calling again and again. I’ve been irritating the hell out of them…

Anyways, that’s the fun storing of what happens when these forms are filled out.

$100 per spurious ‘lead’ would make a serious dent, if enough spurious leads showed up… ;)

WINW

Net: WINW Is Not WASTE: ‘WINW is a small worlds networking utility. It was inspired by WASTE … (WINW) has diverged from its original mission to create a clean-room WASTE clone. Today, the WINW feature set is different from that of WASTE, and its protocol is incompatible with WASTE’s protocol. However, WINW and WASTE achieve similar goals: they allow people who trust each other to communicate securely.’

Not quite there yet — just a Windows version with no sharing — but actively under development. One to keep an eye on…

Great Economist article on UNIX

Software: Economist: Unix’s founding fathers (via sourcefrog.net). A very good article on Thompson, Kernighan and Ritchie’s amazing achievement, with some new details I hadn’t heard before:

AT&T was required under the terms of a 1958 court order in an antitrust case to license its non-telephone-related technology to anyone who asked. And so Unix and C were distributed, mostly to universities, for only a nominal fee. When one considers the ineptness of AT&T’s later attempts to commercialise Unix — after the court order ceased to be applicable because of another antitrust case which broke up AT&T in 1984 — this restriction, an accidental boost to what would later become known as the open-source movement, becomes even more crucial.

So that’s how that happened. Just think — if it wasn’t for that court case, we’d probably all be hacking on VMS. ;)

Also at sourcefrog, mbp points out that the Sulston reverse-engineering story is ‘remarkably similar to that of Richard Stallman several years earlier, when the frustration of closed-source printer software helped motivate him to start the GNU project’.

Patents: yet another sourcefrog link, this time to a CNet story with a hilarious quote regarding software patents and the GIF/PNG debacle:

But Unisys credited its exertion of the LZW patent with the creation of the PNG format, and whatever improvements the newer technology brought to bear.

‘We haven’t evaluated the new recommendation for PNG, and it remains to be seen whether the new version will have an effect on the use of GIF images,’ said Unisys representative Kristine Grow. ‘If so, the patent situation will have achieved its purpose, which is to advance technological innovation. So we applaud that.’

Wow. Presumably by the same logic, they applaud al-Qaeda for improving airline security innovation, too…