Justin's Linklog Posts

About the title change

The eagle-eyed may have spotted a change that took place a month or two ago in the taint.org configuration — I ditched the old weblog tagline.

Previously, this weblog was titled "taint.org: Happy Software Prole". This title had been in place since around October 2003, when Daniel Lyons wrote a particularly idiotic article for Forbes entitled "Linux’s Hit Men", which I took umbrage to:

Here we go again — the old ‘free software is communism’ line […] The article goes on to bemoan how software companies who write proprietary extensions into GPL-licensed software, have to comply with the terms of the license. It’s all a bit of an obvious dig — but I am looking forward to the follow-up article — that’s the one where the author bemoans how commercial software companies send out their ‘enforcers’ to extort money from companies who don’t bother paying the royalties and runtime license fees their licenses require.

As a free/open-source-software guy, I happily adopted ‘happy software prole’ as an absurd tagline, in the spirit of détournement. Fast-forward 3.5 years, however, and I’d say most people can’t even remember the Forbes article, or that Daniel Lyons guy! So that tagline was a bit old and busted, really.

On top of this, I’d noticed something I do in my weblog reading — I’ve started renaming blogs in the feed reader from their fancy title, to simply the name of the author.

I’ve found that when reading blogs, I’m interested in who’s writing. When skimming through the feeds of a morning, having to spend 5 seconds to recall that "ByteSurgery.com" is Robin Blandford is just a wee bit superfluous, sorry Robin. ;)

As a favour for readers, I’ve saved them the trouble, and renamed the blog to be quite explicit about who’s writing; the taint.org tagline is now just "taint.org: Justin Mason’s Weblog". Let’s face it — it’s a bit functional. Hopefully it’s helpful, though!

(And finally, it gives me the edge in the ongoing Google war against the non-me "Justin Masons" out there… and against a heart surgeon and a Texan basketball player, I need it. ;)

A recycling puzzle

Tom and I were in a taxi last night, stopped at a stop light, when I noticed something odd.

A girl, about 20 or so, walking along the path stopped beside a bag of recycled rubbish, and bent over as if she was tying her shoelace. Instead of fixing her lace, though, she quickly ripped a hole in the (transparent) plastic bag, grabbed a crumpled Fanta can, and walked off.

WTF? Anyone got any theories?

Coworking.IE

Coworking.ie is a new community-driven coworking group-blog and promotion site, set up by Jason Roe.

Coworking’s a pretty cool idea — ‘a movement to create a community of cafe-like collaboration spaces for developers, writers and independents.’ Great news for us teleworkers.

I’ve subscribed — it’ll be interesting to track development of this concept, in Ireland and elsewhere…

New list for Irish users of MythTV

MythTV is a pretty great product, once you get it working — however, it can be labour-intensive, involving lots of local knowledge to deal with the ins and outs of each area’s TV provider, cable service, etc.

To that end, we’ve recently set up a new mailing list: mythtv-ireland, a list for discussion of topics of interest for MythTV users in Ireland.

Particularly on-topic:

  • the NTL frequencies list for areas in Ireland

  • hacks to scrape the Channel 6 schedule from their website

  • dealing with the NTL Digital set-top box

Sign up, if you’re interested!

Twitter and del.icio.us

Walter Higgins says:

It’s just occurred to me why I don’t like twitter – It doesn’t fulfill any need that isn’t already fulfilled by del.icio.us. I usually post a note alongside each bookmark which lets me micro-blog (post short comments without having to think too much). If I want to signal to someone to take a look at the bookmarked item I just tag it for:[nameofperson] which I suppose you could loosely call ‘chat’. Since I gave up personal blogging, del.icio.us has fulfilled a need for short-hand blogging. Thinking about it – twitter is like del.icio.us but without the bookmarks – viewed in that light it really is hard to understand why anyone would use twitter.

To my mind, though, there’s a big difference:

  • My del.icio.us page is where I post things I’m reading, and things I think others may be interested in reading;

  • My twitter page is where I post things I’m doing, and chat.

There’s no way I’d try to hold a conversation in my del.icio.us bookmarks! ;) Different tools for different uses.

Geeking out on the ‘leccy bill

A good post from Lars Wirzenius on measuring the electricity consumption of his computer hardware. Here’s a previous post of mine on the subject.

With the rising cost of energy, a keenness to reduce consumption for green purposes, and an overweening nerdity in general, I did some more investigation around my house recently.

I have a pretty typical Irish electricity meter; it contains a visible disc with a red dot, which spins at a speed proportional to power usage. (There’s a good pic of something similar at the Wikipedia page).

The fuse-board works out as follows (discarding the boring ones like the house alarm etc.):

  • Fuse 7 – gas-fired central heating (on), fridge (on), kitchen power sockets

  • Fuse 8 – TV in standby, idle PVR, Wii in standby, digital cable set-top box, washing machine

  • Fuse 9 – telephone, DSL router, Linksys WRT54G AP/router

  • Fuse 10 – bedroom sockets, home office with laptop, printer, speakers, laptop-server etc.

The approach was simply to turn off the house fuses at the fuse board, one by one, and measure how long it took the disc to make a full revolution; then invert that (1/n) to convert from units of time over a static power value, to some notional unit of power consumption over a static time interval (I haven’t figured out how to convert to kWh or anything like that, they’re just makey-uppy units).
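The timing arithmetic can be sketched as follows. As a bonus: most meters print a constant on the faceplate (revolutions per kWh), and if yours does, the same timings yield real watts — the 375 revs/kWh figure below is only an illustrative assumption, check your own meter plate.

```python
# Convert disc-revolution timings into the "makey-uppy" relative power
# units used below, and (optionally) into real watts if the meter's
# constant is known. The 375 revs/kWh figure is only an example --
# check the plate on your own meter.

def relative_power(seconds_per_rev):
    """Notional power units: just the reciprocal of the revolution time."""
    return round(1.0 / seconds_per_rev, 4)

def watts(seconds_per_rev, revs_per_kwh=375):
    """Real power: one revolution = 1/revs_per_kwh kWh of energy,
    delivered over seconds_per_rev seconds."""
    kwh_per_rev = 1.0 / revs_per_kwh
    hours = seconds_per_rev / 3600.0
    return kwh_per_rev / hours * 1000.0   # kW -> W

print(relative_power(22.71))   # baseline, all fuses on
print(watts(22.71))            # ~423 W, if the meter really is 375 revs/kWh
```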

Fuses                       Seconds per revolution     Relative power (1/s)
Baseline (all fuses on)     22.71                      0.0440
Fuse 7 off                  43.03                      0.0232
Fuses 7 and 8 off           57.92                      0.0172
Fuses 7, 8 and 9 off        84.88                      0.0117
Fuses 7, 8, 9 and 10 off    ~20 minutes (I’d guess)    0.0008?

(I stopped measuring on the last one and just estimated; it was crawling around.)

Breaking out the individual fuses, that works out as:

Fuse                                             Relative power (1/s)
Fuse 7 (central heating, fridge, kitchen bits)   0.0208
Fuse 8 (TV, Wii, set-top box, washing machine)   0.0060
Fuse 9 (phones, routers)                         0.0055
Fuse 10 (home office, bedrooms)                  0.0109

Good results already: (a) it was pretty clear that fuse 7 was doing all the quotidian legwork, eating the majority of the power, and (b) the TV equipment and internet/wifi infrastructure was pretty good at low-power operation (yay). (c) The computer bits, however, aren’t so great — though still only half the power consumption of the kitchen bits.

Breaking down the kitchen consumption further:

Appliances                                         Seconds per revolution   Relative power (1/s)
Gas central heating on (rechecking the baseline)   20.46                    0.0488
Gas central heating off                            34.15                    0.0292
Washing machine on (40-degree wash)                13.65                    0.0732
Dishwasher on                                      2.53                     0.3952
Dishwasher and dehumidifier on                     2.53                     0.3952

Subtracting the baseline:

Appliance                     Relative power (1/s)
Gas central heating           0.0196
Washing machine               0.0244
Dishwasher                    0.3464
Dishwasher and dehumidifier   0.3464

So the central heating, despite being supposedly gas-fired, eats lots of power! I guess this is the electric pump, used to drive the heated water around the house to the radiators. Ah well, I’m not skimping on that ;)

More practically: the dishwasher result is incredible. That’s 30 times the power usage of the house’s computer hardware. This is a ~7-year-old standard dishwasher; obviously green power consumption wasn’t an issue back then! We’re running it less frequently now; the odd hand-wash of bulky and nearly-clean items helps. With any luck when we move in a few months, we can replace it with a greener model.

The washing machine is about what I would expect, so I’m OK with that.

Also interesting to note that our dehumidifier’s draw is lost in the noise next to the dishwasher; I could have measured it properly in isolation, but couldn’t be bothered by that stage ;)

Sender Address Verification considered harmful

(as an anti-spam technique, at least.)

Sender-address verification, also known as callback verification, is a technique to verify that mail is being sent with a valid envelope-sender return address. It is supported by Exim and Postfix, among others.
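In outline, callback verification boils down to a probe SMTP transaction against the sender domain’s mail exchanger, using a null envelope sender so the probe itself can’t generate loops. Here’s a minimal illustrative sketch — it omits the MX lookup, caching, and rate-limiting a real MTA would do, and ‘mx.example.org’ is purely a placeholder:

```python
import smtplib

def sender_domain(addr):
    """Extract the domain part of an envelope-sender address."""
    return addr.rsplit('@', 1)[-1].lower()

def callback_verify(sender, mx_host, timeout=10):
    """Probe the sender's MX: would it accept mail back to this address?
    Returns True if RCPT TO:<sender> gets a 2xx response."""
    try:
        s = smtplib.SMTP(mx_host, 25, timeout=timeout)
        try:
            s.helo()
            s.mail('')                 # MAIL FROM:<> -- null sender, no loops
            code, _ = s.rcpt(sender)   # RCPT TO:<sender>
            return 200 <= code < 300
        finally:
            s.quit()
    except (smtplib.SMTPException, OSError):
        return False   # treat network failures as "unverifiable"

# A real implementation would first resolve the MX records for
# sender_domain(sender); 'mx.example.org' below is a placeholder.
# callback_verify('alice@example.org', 'mx.example.org')
```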

Some view this as a useful anti-spam technique. In my opinion, it’s not.

Spam/anti-spam is an adversarial "game". Whenever you’re considering anti-spam techniques, it’s important to bear in mind game theory, and the possible countermeasures that spammers will respond with. Before SAV became prevalent, spam was often sent using entirely fake sender data; hence the initial attractiveness of SAV. Once SAV became worth evading, the spammers needed to find "real" sender addresses to evade it. And where’s the obvious place to find real addresses? On the list of target addresses they’re spamming!

Since the spam is now sent using forged sender addresses of "real" people, when a spam bounces (as much of it does), the bounce will be sent back not to an entirely fake address, but to a spam recipient’s address.

Hence, the spam recipients now get twice as much mail from each spam run — spam aimed at them, and bounce blowback from hundreds of spams aimed at others, forged to appear to be from them.

This is the obvious "next move" in response to SAV, which is one reason why we never implemented something like it in SpamAssassin.

On top of this — it doesn’t work well enough anymore. Verizon use SAV (http://atm.tut.fi/list-archive/nanog/msg37168.html). Have you ever heard anyone talk about how great Verizon’s spam filtering is? Didn’t think so.

(This post is a little late, given that SAV has been used for years now, but better late than never ;)

By the way, it’s worth noting that it’s still marginally acceptable to use SAV as a general email acceptance policy for your site — ie. as a way to assert that you won’t accept mail from people who won’t accept mail to the envelope sender address used to deliver it. Just don’t be fooled into thinking it’s helping the spam problem, or helping anyone other than yourself.

Finally, this Sender Address Verification is different from what Sendio calls Sender Address Verification. That’s just challenge-response, which is crap for an entirely different, and much worse, set of reasons.

Something in the oven

Check out what’s cooking chez Mason:

Thrills and spills! I may have to cut down on the extra-curricular activities for a while, so we’d better get SpamAssassin 3.2.0 released before August 21st ;)

Spam volumes at accidental-DoS levels

Both Jeremy Zawodny and Dale Dougherty at O’Reilly Radar are expressing some pretty serious frustration with the current state of SMTP. I have to say, I’ve been feeling it too.

A couple of months back, our little server came under massive load; this had happened before, and normally in those situations it was a joe-job attack. Switching off all filtering and just collecting the targeted domain’s mail in a buffer for later processing would work to ameliorate the problem, by allowing the load to "drain". Not this time, though.

Instead, when I turned off the filtering, the load was still too high — the massive volume of spam (and spam blowback / backscatter) was simply too much for the Postfix MTA. The MTA could not handle all the connections and SMTP traffic in time to simply collect all the data and store it in a file!

Looking into the "attack" afterwards, once the load was back under control, it looked likely that it wasn’t really an attack — it was just a volume spike. Massive SMTP load, caused by spammers increasing the volume of their output for no apparent reason. (Since then, spam volumes have been increasing still further on a nearly weekly basis.)

This is the effect of botnets — the number of compromised hosts is now big enough to amplify spam attacks to server-swamping levels. Our server is not a big one, but it serves fewer than 50 users’ email, so its user-to-CPU-power ratio is pretty good compared to most ISPs’ servers.

So here’s the thing. New SMTP-based methods of delivering nonspam email — whether based on DKIM, SPF, webs of trusted servers, or whatever — will not be able to operate if they have to compete for TCP connection slots with spammers, since spammers can now swamp the SMTP listener for port 25 with connections. In effect, spam will DDoS legitimate email, no matter what authentication system that legit mail uses to authenticate itself.

This, in my opinion, is a big problem.

What’s the fix? A "new SMTP" on a whole different port, where only authed email is permitted? How do you make that DoS-resistant? Ideas?

(Obviously, counting on spammers to notice or care is not a good approach.)

A SpamAssassin rule-discovery algorithm

Just to get a little techie again… here’s a short article on a new algorithm I’ve come up with.

Text-matching rule-based anti-spam systems are pretty common — SpamAssassin is probably the best-known, of course, and the proprietary apps built on SpamAssassin use the same approach. Other proprietary apps also seem to use similar techniques, such as Symantec’s Brightmail and MessageLabs’ scanner (hi Matt ;) — and doubtless there are others. As a result, ways to write rules quickly and effectively are valuable.

So far, most SpamAssassin text rules are manually developed; somebody looks at a few spam samples, spots common phrases, and writes a rule to match that. It’d be great to automate more of that work. Here’s an algorithm I’ve developed to perform this in a memory-efficient and time-efficient way. I’m quite proud of this, so thought it was worth a blog posting. ;)

Corpus collection

First, we collect a corpus of spam and "ham" (non-spam) mails. Standard enough, although in this case it helps to try to keep it to a specific type of mail (for example, a recent stock spam run, or a run from the OEM spammer).

Typically, a simple "grep" will work here, as long as the source corpus is all spam anyway; a small number of irrelevant messages can be left in, as long as the majority — 80% or so — are variations on the target message set. (The SpamAssassin mass-check tool can now perform this on the fly, using the new ‘GrepRenderedBody’ mass-check plugin, which is helpful.)

Rendering

Next, for each spam message, render the body. This involves:

  • decoding MIME structure
  • discarding non-textual parts, or parts that are not presented to the viewer by default in common end-user MUAs (such as attachments)
  • decoding quoted-printable and base64 encoding
  • rendering HTML, again based on the behaviour of the HTML renderers used in common end-user MUAs
  • normalising whitespace, "this is\na \ntest" -> "this is a test"

All pretty basic stuff, and performed by the SpamAssassin "body" rendering process during a "mass-check" operation. A SpamAssassin plugin outputs each message’s body string to a log file.

Next, we take the two log files, and process them using the following algorithm:

N-gram Extraction

Iterate through each mail message in the spam set. Each message is assigned a short message ID number. Cut off all but the first 32 kbytes of the text (for this algorithm, I think it’s safe to assume that anything past 32 KB will not be a useful place for spammers to place their spam text). Save a copy of this shortened text string for the later "collapse patterns" step.

Split the text into "words" — ie. space-separated chunks of non-whitespace chars. Compress each "word" into a shorter ID to save space:

"this is a test" => "a b c d"

(The compression dictionary used here is shared between all messages, and also needs to allow reverse lookups.)

Then tokenize the message into 2-word and 3-word phrase snippets (also known as N-grams):

"a b c d" => [ "a b", "b c", "c d", "a b c", "b c d" ]

Remove duplicate N-grams, so each N-gram only appears once per message.

For each N-gram token in this token set, increment a counter in a global "token count" hashtable, and add the message ID to the token’s entry in a "message subset hit" table.

Next, process the ham set. Perform the same algorithm, except: don’t keep the shortened text strings, don’t cut at 32KB, and instead of incrementing the "token count" hash entries, simply delete the entries in the "token count" and "message subset hit" tables for all N-grams that are found.
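The two passes above might look like this in outline — an illustrative Python sketch, not the actual mass-check code; the data structures match the description (word-ID compression, unique 2- and 3-grams, token counts, and the "message subset hit" table):

```python
# Sketch of the N-gram extraction passes. Illustrative only; the real
# implementation lives in the SpamAssassin tooling described above.

word_ids = {}          # word -> compact ID (shared across all messages)
id_words = []          # reverse lookup: ID -> word
token_count = {}       # ngram -> number of spam messages hit
subset_hit = {}        # ngram -> set of spam message IDs hit

def compress(word):
    """Map a word to a short numeric ID, allocating on first sight."""
    if word not in word_ids:
        word_ids[word] = len(id_words)
        id_words.append(word)
    return word_ids[word]

def ngrams(text):
    """The unique 2- and 3-word N-grams of a message body."""
    ids = [compress(w) for w in text.split()]
    grams = set()
    for n in (2, 3):
        for i in range(len(ids) - n + 1):
            grams.add(tuple(ids[i:i + n]))
    return grams           # a set, so each N-gram counts once per message

def add_spam(msg_id, text):
    for g in ngrams(text[:32 * 1024]):        # cut at 32 KB
        token_count[g] = token_count.get(g, 0) + 1
        subset_hit.setdefault(g, set()).add(msg_id)

def add_ham(text):
    for g in ngrams(text):                    # no 32 KB cut for ham
        token_count.pop(g, None)              # any ham hit disqualifies
        subset_hit.pop(g, None)

add_spam(0, "buy cheap OEM software now")
add_spam(1, "buy cheap OEM software today")
add_ham("I wrote some software today")
# "buy cheap", "cheap OEM" etc. survive with a count of 2; anything
# containing "software today" was deleted by the ham pass.
```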

By the end of this process, all ham and spam have been processed, and in a memory-efficient fashion. We now have:

  • a table of hit-counts for the message text N-grams, with all N-grams where P(spam) < 1.0 — ie. where even a single ham message was hit — already discarded
  • the "message subset hit" table, containing info about exactly which subset of messages contain a given N-gram
  • the token-to-word reverse-lookup table

To further reduce memory use, the word-to-token forward-lookup table can now be freed. In addition, the values in the "message subset hit" table can be replaced with their hashes; we don’t need to be able to tell exactly which messages are listed there, we just need a way to tell if one entry is equal to another.

Summarisation

Iterate through the hit-count table. Discard entries that occur too infrequently to be worth listing — in particular, entries that occur only once. (We’ve already discarded entries that hit any ham.)

Make a hash that maps the message subsets to the set of all N-gram patterns for that message-subset. For each subset, pick a single N-gram, and note the hit-count associated with it as the hit-count value for that entire message-subset. (Since those N-grams all appear in the exact same subset of messages, they will always have the same P(spam) — this is a safe shortcut.)
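The grouping step is a straightforward inversion of the "message subset hit" table — an illustrative sketch (real code would key on the subset hashes mentioned earlier, not the subsets themselves):

```python
# Group N-grams by the exact subset of messages they hit: N-grams with
# identical subsets always share the same hit-count, so one
# representative per subset is enough. Toy data for illustration.

subset_hit = {
    ('buy', 'cheap'):    frozenset({0, 1, 2}),
    ('cheap', 'OEM'):    frozenset({0, 1, 2}),
    ('OEM', 'software'): frozenset({0, 2}),
}

by_subset = {}
for ngram, subset in subset_hit.items():
    by_subset.setdefault(subset, []).append(ngram)

# Each subset now carries all its patterns and a single hit-count:
for subset, patterns in sorted(by_subset.items(), key=lambda kv: -len(kv[0])):
    print(len(subset), patterns)
```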

Iterate through the message subsets, in order of their hit-count. Take all of the message-subset’s patterns, decode the N-grams in all patterns using the token-to-word reverse-lookup table, and apply this algorithm to that pattern set:

Collapse patterns

So, input here is an array of N-gram patterns, which we know always occur in the same subset of messages. We also have the saved array of all spam messages’ shortened text strings, from the N-gram extraction step. With this, we can apply a form of the BLAST pattern-discovery algorithm, from bioinformatics.

Pop the first entry off the array of patterns. Find any one mail from the saved-mails array that hits this pattern. Find the single character before the pattern in this mail, and prepend it to the pattern. See if the hits for this new pattern are the same message set as hit the old pattern; if not, restore the old pattern and break. If you hit the start of the mail message’s text string, break. Then apply the same algorithm forward through the mail text.

By the end of that, you have expanded the pattern from the basic N-gram as far as it’s possible to go in both directions without losing a hit.
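A minimal sketch of this expansion step (illustrative Python, not the real SpamAssassin SVN code — which works on the compressed tokens rather than raw strings):

```python
# BLAST-style expansion: grow a pattern one character at a time,
# leftwards then rightwards, stopping as soon as the set of messages
# hit would shrink.

def hits(pattern, mails):
    """IDs of the mails whose text contains the pattern."""
    return frozenset(i for i, m in enumerate(mails) if pattern in m)

def expand(pattern, mails):
    target = hits(pattern, mails)
    anchor = next(m for m in mails if pattern in m)  # any one hitting mail
    i = anchor.find(pattern)
    j = i + len(pattern)
    # Grow leftwards while every target mail still matches.
    while i > 0 and hits(anchor[i - 1:j], mails) == target:
        i -= 1
    # Then grow rightwards the same way.
    while j < len(anchor) and hits(anchor[i:j + 1], mails) == target:
        j += 1
    return anchor[i:j]

mails = [
    "pay for software only and save 75-90% today",
    "pay for software only and save 75-90% now",
    "some unrelated ham-like message",
]
print(expand("software only", mails))
# -> "pay for software only and save 75-90% "  (note the trailing space:
# expansion stops where the two spams diverge, at "today" vs "now")
```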

Next, discard all patterns in the pattern array that are subsumed by (ie. appear in) this new expanded pattern. Add it to the output list of expanded patterns, unless it in turn is already subsumed by a pattern in that list; discard any patterns in the output list that are subsumed by this new pattern; and move onto the next pattern in the input list until they’re all exhausted.

(By the way, the "discard if subsumed" trick is the reason why we start off with 3-word N-grams — it gives faster results than just 2-word N-grams alone, presumably by reducing the amount of work that this collapse stage has to do, by doing more of it upfront at a relatively small RAM cost.)

Summarisation (continued)

Finally, output a line listing the percentage of the input spam messages hit (ie. (hit-count value / total number of spams) * 100) and the list of expanded patterns for that message-subset, then iterate on to the next message-subset.

Example

Here’s an example of some output from recent "OEM" stock spam:

$ ./seek-phrases-in-corpus --grep 'OEM' \
        spam:dir:/local/cor/recent/spam/*.2007022* \
        ham:dir:/local/cor/recent/ham/*.200702*
[mass-check progress noises omitted]
 RATIO   SPAM%    HAM%   DATA
 1.000  72.421   0.000  / OEM software - throw packing case, leave CD, use electronic manuals. Pay for software only and save 75-90%! /,
                         / TOP 1O ITEMS/
 1.000  73.745   0.000  / $99 Macromedia Studio 8 $59 Adobe Premiere 2.0 $59 Corel Grafix Suite X3 $59 Adobe Illustrator CS2 $129 Autodesk Autocad 2007 $149 Adobe Creative Suite 2 /,
                         /s: Adobe Acrobat PR0 7 $69 Adobe After Effects $49 Adobe Creative Suite 2 Premium $149 Ableton Live 5.0.1 $49 Adobe Photoshop CS $49 http:\/\//,
                         / Microsoft Office 2007 Enterprise Edition Regular price: $899.00 Our offer: $79.95 You save: $819.95 (89%) Availability: Pay and download instantly. http:\/\//,
                         / Adobe Acrobat 8.0 Professional Market price: $449.00 We propose: $79.95 Your profit: $369.05 (80%) Availability: Available for /,
                         / $49 Windows XP Pro w\/SP2 $/,
                         / Top-ranked item. (/,
                         /, use electronic manuals. Pay for software only and save 75-90%! /,
                         / Microsoft Windows Vista Ultimate Retail price: $399.00 Proposition: $79.95 Your benefit: $319.05 (80%) Availability: Can be downloaded /,
                         / $79 MS Office Enterprise 2007 $79 Adobe Acrobat 8 Pro $/,
                         / Best choice for home and professional. (/,
                         / OEM software - throw packing case, leave CD/,
                         / Sales Rank: #1 (/,
                         / $79 Microsoft Windows Vista /,
                         / manufacturers: Microsoft...Mac...Adobe...Borland...Macromedia http:\/\//
 1.000  73.855   0.000  / MS Office Enterprise 2007 /,
                         /9 Microsoft Windows Vista /,
                         / Microsoft Windows Vista Ultimate /,
                         /9 Macromedia Studio 8 /,
                         / Adobe Acrobat 8.0 /,
                         / $79 Adobe /
 1.000  74.242   0.000  / Windows XP Pro/
 1.000  74.297   0.000  / Adobe Acrobat /
 1.000  74.462   0.000  / Adobe Creative Suite /
 1.000  74.573   0.000  / Adobe After Effects /
 1.000  74.738   0.000  / Adobe Illustrator /
 1.000  74.959   0.000  / Adobe Photoshop CS/
 1.000  75.014   0.000  / Adobe Premiere /
 1.000  75.290   0.000  / Macromedia Studio /
 1.000  75.786   0.000  /OEM software/
 1.000  75.841   0.000  / Creative Suite /
 1.000  75.896   0.000  / Photoshop CS/
 1.000  75.951   0.000  / After Effects /
 1.000  76.062   0.000  /XP Pro/
 1.000  82.460   0.000  / $899.00 Our /,
                         / Microsoft Office 2007 Enterprise /,
                         / $79.95 You/

Immediately, that provides several useful rules; in particular, the final set of patterns can be combined with a SpamAssassin "meta" rule to hit 82% of the samples. Generating this took a quite reasonable 58MB of virtual memory, with a runtime of about 30 minutes, analyzing 1816 spam and 7481 ham mails on a 1.7GHz Pentium M laptop.

(Update:) here’s a sample message from that test set, demonstrating the top extracted snippets in bold:

  Return-Path: <tyokaluassa.com@ultradian.com>
  X-Spam-Status: Yes, score=38.2 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
          FH_HOST_EQ_D_D_D_D,FH_HOST_EQ_VERIZON_P,FH_MSGID_01C67,FUZZY_SOFTWARE,
          HELO_LOCALHOST,RCVD_IN_NJABL_DUL,RCVD_IN_PBL,RCVD_IN_SORBS_DUL,RDNS_DYNAMIC,
          URIBL_AB_SURBL,URIBL_BLACK,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_RHS_DOB,
          URIBL_SBL,URIBL_SC_SURBL shortcircuit=no autolearn=spam version=3.2.0-r492202
  Received: from localhost (pool-71-125-81-238.nwrknj.east.verizon.net [71.125.81.238])
          by dogma.boxhost.net (Postfix) with SMTP id E002F310055
          for <xxxxxxxxxxx@jmason.org>; Sun, 18 Feb 2007 08:58:20 +0000 (GMT)
  Message-ID: <000001c7533a$b1d3ba00$0100007f@localhost>
  From: "Kevin Morris" <tyokaluassa.com@ultradian.com>
  To: <xxxxxxxx@jmason.org>
  Subject: Need S0ftware?
  Date: Sun, 18 Feb 2007 03:57:56 -0500

  OEM software - throw packing case, leave CD, use electronic manuals.
  Pay for software only and save 75-90%!

  Discounts! Special offers! Software for home and office!
              TOP 1O ITEMS.

    $79 Microsoft Windows Vista Ultimate
    $79 MS Office Enterprise 2007
    $79 Adobe Acrobat 8 Pro
    $49 Windows XP Pro w/SP2
    $99 Macromedia Studio 8
    $59 Adobe Premiere 2.0
    $59 Corel Grafix Suite X3
    $59 Adobe Illustrator CS2
  $129 Autodesk Autocad 2007
  $149 Adobe Creative Suite 2
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t0

            Mac Specials:
  Adobe Acrobat PR0 7             $69
  Adobe After Effects             $49
  Adobe Creative Suite 2 Premium $149
  Ableton Live 5.0.1              $49
  Adobe Photoshop CS              $49
  http://ot.rezinkaoem.com/-software-for-mac-.php?0B85330BA896A9992D0561E08037493852CE
  6E1FAE&t6

  See more by this manufacturers:
  Microsoft...Mac...Adobe...Borland...Macromedia
  http://ot.rezinkaoem.com/?0B85330BA896A9992D0561E08037493852CE6E1FAE&t4

  Microsoft Windows Vista Ultimate
  Retail price:  $399.00
  Proposition:  $79.95
  Your benefit:  $319.05 (80%)
  Availability: Can be downloaded INSTANTLY.
  http://ot.rezinkaoem.com/2480.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t3
  Best choice for home and professional. (37268 reviews)

  Microsoft Office 2007 Enterprise Edition
  Regular price:  $899.00
  Our offer:  $79.95
  You save:  $819.95 (89%)
  Availability: Pay and download instantly.
  http://ot.rezinkaoem.com/2442.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t1
  Sales Rank: #1 (121329 reviews)

  Adobe Acrobat 8.0 Professional
  Market price:  $449.00
  We propose:  $79.95
  Your profit:  $369.05 (80%)
  Availability: Available for INSTANT download.
  http://ot.rezinkaoem.com/2441.php?0B85330BA896A9992D0561E08037493852CE6E1FAE&t2
  Top-ranked item. (31949 reviews)

Further work

Things that would be nice:

  • It’d be nice to extend this to support /.*/ and /.{0,10}/ — matching "anys", also known as "gapped alignment" searches in bioinformatics, using algorithms like the Smith-Waterman or Needleman-Wunsch algorithms. (Update: this has been implemented.)
  • A way to detect and reverse-engineer templates, e.g. "this is foo", "this is bar", "this is baz" => "this is (foo|bar|baz)", would be great.
  • Finally, heuristics to detect and discard likely-poor patterns are probably the biggest wishlist item.

Tuits are the problem, of course, since $dayjob is the one that pays the bills, not this work. :(

The code is being developed here, in SpamAssassin SVN. Feel free to comment/mail if you’re interested, have improvement ideas, or want more info on how to use it… I’d love to see more people trying it out!

Some credit: I should note that IBM’s Chung-Kwei system, presented at CEAS 2004, was the first time I’d heard of a pattern-discovery algorithm (namely, their proprietary Teiresias algorithm) being applied to spam.

Irish Blog Awards 2007

Well, that was fun! Taint.org didn’t make the shortlists, but I went along anyway just to hang out — and lots of chat was had accordingly. Got to finally meet up with a few people I’d chatted with online, like Nialler9 — and with a few old friends I don’t get to see often enough: Antoin, Elana, Brendan, Clare Dillon (ex-Iona!), and another ex-Ionian, Aisling Mackey. A good laugh.

Have to say though, it seems a vote from me was the kiss of death in many of the categories: Sarah Carey, Blogorrah, Ireland from a Polish perspective, and (the late lamented) TCAL all got my thumbs-up in the shortlist voting, and all wound up missing out on the chunk’o’lucite. Sorry about that guys. ;)

Thanks again to Damien for organising the whole do! It’s great to have an event like this to bring each of our disparate blogs physically together for a bit of community.

By the way I’d like to point out that, in contrast to the Blogorrah Bock the Robber mafiosi, I had a real moustache… ;)

BT’s daily disconnects, revisited

As I noted last year, BT, the ISP I use here in Ireland, disconnects broadband sessions on a daily basis, assigning a new IP address; this is really aggravating to anyone who uses a VPN, such as most telecommuters. Reportedly, this is done to work around deficiencies in their billing system.

A comment from Jeremy on that post suggested something interesting, though:

Just had a very helpful tech support guy on from BT. [… he] told me to restart the modem sometime that will make it convenient for the 24 hour IP change – i.e. restart it at 6am, and then it’ll change IP every day at 6am.

I’ve tested this, and it works. Much more convenient! Now the renumbering and VPN breakage can take place when I want it to — at the start of the workday, instead of some random point chosen by BT’s billing system. Quite an improvement.

To make this useful, here’s a script, "reboot-zyxel", which will reboot your Zyxel P-660RU router remotely over the LAN. (It requires perl and curl.)

MailScanner developer in hospital

According to this message, Julian Field, the main developer of MailScanner, was found collapsed at his home last Friday. More details via the SA list:

He is in ICU though stable condition. I’ll not go into any details, anyone interested and not on the MS list can read the thread on the MS archive.

Currently any plans for cards and such as are on hold until further instructions are given to the MS list. However Matt Hampton has setup a clustermap at this address.

Matt will also forward any well wishes left on the website along with the map. Visiting the page will show Julian and his family just how far reaching his software is and how many people appreciate his efforts.

Get well, Julian! :(

Script: mythsshimport

Here’s a useful script for users of a MythTV box equipped with a PVR-350 MPEG capture/playback card — mythsshimport (http://jmason.org/software/scripts/mythsshimport.txt):

NAME

mythsshimport – transcode and install video files onto a MythTV box

SYNOPSIS

mythsshimport file1 [file2 …]

DESCRIPTION

Transcodes video files (AVI, MPEG, MOV, WMV etc.) into MythTV-compatible and PVR-350-optimised MPEG-2 .nuv files, suitable for viewing on a 4:3 screen, then transfers them to the MythTV backend, inserts them into the "recorded programs" listings, and builds seek tables.

All this happens on-the-fly, at faster-than-real-time rates; with a recent CPU in the transcoding box, and over an 802.11b wifi home network, you can start the process and start watching the video within 20 seconds, while it is transcoded and transferred in the background.

SSH is used as the network transport. If you have the CPU power available on the MythTV backend itself, you can run this script there (as the mythtv user) and it will skip the SSH parts entirely.

REQUIREMENTS

  • ssh password-less key access from transcode box into mythtv@mythbox (this could be localhost, if you’re transcoding on the mythbox). Test using: "ssh mythtv@mythbox echo hi". If you run this script on the mythbox as the mythtv user, this is not required.

  • mencoder. Tested with 2:0.99+1.0pre7try2+cvs20060117-0ubuntu8 (I swear that’s a version string and not just me rolling my head around the keyboard)

  • MythTV. Tested with MythTV 0.20.

  • The "contrib/myth.rebuilddatabase.pl" script from the MythTV source tarball, installed on the mythbox in $PATH: download from svn.mythtv.org.

  • screen(1) installed on the transcoding box, used to keep the mencoder output readable

Download here.

Masonic spam

Wow, here’s a new one — and kind of appropriate, given my surname ;) Masonic spam!

To: xxxxxx at taint.org

Subject: Dear Benefactor Of 2007 Masory Grant,

From: Dr.Lavine Ferdon Ferdon

Date: Wed, 21 Feb 2007 15:40:26 +0100 (CET)

Dear Benefactor Of 2007 Masory Grant,

The Freemason society of Bournemout under the jurisdiction of the all Seeing Eye, Master Nicholas Brenner has after series of secret deliberations selected you to be a beneficiary of our 2007 foundation laying grants and also an optional opening at the round table of the Freemason society.

These grants are issued every year around the world in accordance with the objective of theFreemasons as stated by Thomas Paine in 1808 which is to ensure the continuous freedom of man and toenhance mans living conditions.

We will also advice that these funds which amount to USD2.5million be used to better the lot of man through your own initiative and also we will go further to inform that the open slot to become a Freemason is optional, you can decline the offer.

In order to claim your grant, contact the Grand Lodge Office co-secretary Dr.Lavine Ferdon Ferdon Grand Lodge Office Co-Secretary’s email: (lavin_ferd_law at excite.com)

Dr.Lavine Ferdon Ferdon,

Co-Secretary Freemason Society of Holdenhurst Road,

Bournemouth.

Sir David Hurley,

Secretary Freemason Society of Holdenhurst Road,

Brilliant. But why Bournemouth?

HOWTO block editing of pages in Moin Moin

A useful Moin Moin anti-spam tip, via Upayavira at the ASF: adding ACLs to pages so that only certain users can edit them. This is an easy way to interfere with the wiki spammers who get past the existing (quite good) Moin Moin anti-spam subsystems. They tend to aim for the common Wiki pages, such as WikiSandBox, RecentChanges, and FrontPage, so if you make those pages uneditable, that’ll cause them more trouble — and hopefully cause them to move on to easier targets, instead of defacing your wiki. Here’s how to do it (at least for Moin Moin >= 1.5.1).

Open a shell on the machine where the Moin Moin software is installed. Edit your "wikiconfig.py" file (in my case this is at /home/moinmoin/moin-1.5.1/share/moin/jmwiki/wikiconfig.py), and change the "acl_rights_before" line to read:

    acl_rights_before = u"JustinMason:read,write,delete,revert,admin"

Replace "JustinMason" with your wiki login name, of course.

Create an administrative group of trusted users. Do this by creating a page called "AdminGroup" containing

    #acl All:read
    These are the members of this group, who can edit certain restricted pages:
     * JustinMason

Now, for the sensitive pages (like FrontPage etc.), edit each one and add an access-control list line at the top of each page containing:

    #acl AdminGroup:read,write All:read

That’s it. Users who are not in the AdminGroup will no longer be able to edit those pages. That should help… at least for a while ;)

Update: you should also use this in wikiconfig.py:

    acl_rights_default = u'Known:read,write,revert All:read'

This blocks non-logged-in users from writing to pages.

Irish Blog Awards

A quick note: the Irish Blog Awards shortlisting votes end later today. I’ve been nominated on the long list (thanks!) for best technology blog — feel free to vote for me if you like ;)

Update: boo, no shortlisting. Still, probably my own fault, I was a bit too wishy-washy with the vote hustling! Maybe next year…

Odd legal mail

Last week, I received an odd-looking mail from "Claims Administration Center" ClaimsAdministrationCenter /at/ enotice.info, sent to my private email address — the one listed in an image on http://jmason.org/ (it never gets spam).

The mail reads:

Mittlholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang ("IMR Defendants"), aka Meco, et al. v. IMR, et al., case No. GIC846200.

We are requesting by order of the Court filed with the Superior Court for the County of San Diego, CA, that you post the attached Summary notice as a Public Service Announcement on your web-site.

Below is a link to the PDF Summary Notice (Note: The document is in the .PDF format. To view the documents you will need the Adobe Acrobat Reader)

http://echo.bluehornet.com/ct/ct.php?t=….

This message was intended for: webaddress@jmason.org You were added to the system January 17, 2007. For more information please follow the URL below: http://echo.bluehornet.com/subscribe/source.htm?c=…

Follow the URL below to update your preferences or opt-out: http://echo.bluehornet.com/phase2/survey1/survey.htm?CID=…

Googling for GIC846200, I find it on a cached "civil new filed cases index" page at sandiego.courts.ca.gov:

CASE NUMBER FILE DATE CATEGORY LOCATION

GIC846200 04/21/2005 A72120 – Personal Injury (Other) San Diego MECO vs INTERNATIONAL MEDICAL RESEARCH INCORPORATED

So the case exists. I have no idea who either of the parties are, however.

The URLs in the message were all web-bugged; but bluehornet seem legit in general.

The URL http://www.enotice.info/ times out. It seems to have no spam-related Google Groups hits, although there are a lot of discussions about some iffy-looking class-action suit over Google AdSense.

After quite a bit of discomfort and asking around about the reputation of both bluehornet.com and enotice.info, I eventually succumbed and clicked through. The Summary URL above, after logging my click, redirects to this PDF file, which reads:

This case, called Mittleholtz v . International Medical Research, Inc., Sophie Chen, John Chen, and Allan Wang (‘IMR Defendants’), et al., case No. GIC846200, is a class action lawsuit that alleges that the IMR Defendants unlawfully distributed a product containing synthetic chemicals, the presence of which was also concealed from the public as a result of the IMR Defendants’ alleged failure to conduct any testing for adulteration by synthetic chemicals, including but not limited to diethylstilbestrol (DES) and warfarin (or coumadin), which is the active chemical in bloodthinners. Defendants deny the allegations. The Court has not formed any opinions concerning the merits of the lawsuit nor has it ruled for or against the Plaintiffs as to any of their claims. The sole purpose of this notice is to inform you of the lawsuit so that you may make an informed decision as to whether you wish to remain in or opt out of this class action.

You have legal rights and choices in this case. You can:

  • Join the case. You do not have to do or pay anything to be part of this case. And, you have to accept the final result in the case.

  • Exclude yourself and file your own lawsuit. If you want your own lawyer, you will have to exclude yourself as set forth below and pay your lawyer’s fees and costs.

  • Exclude yourself and not sue. If you do not wish to be part of this case and do not want to bring your own lawsuit, please mail a first class letter stating that you want to be excluded from the Mittleholtz v IMR class action (Case No. GIC846200), or you may fill out the letter available at www.gilardi.com/mittleholtzsettlement. Make sure the letter has your full name, address and signature. Mail it to: PC-SPES Litigation, Class Administrator, c/o Gilardi & Co. LLC, P O Box 8060 San Rafael, CA 94912-8060 by March 23, 2007.

    *This is only a summary. For complete notice and further information go to: www.gilardi.com/mittleholtzsettlement or call the toll-free number 1-877-800-7853.

So in other words, it’s hand-targeted unsolicited, but probably not bulk, email, flogging a class-action suit about ‘synthetic chemicals’ (presumably as opposed to the ‘organic’ variety). I suspect, given the phrasing in the initial mail, they probably googled for a keyword or company name, and found a hit somewhere in taint.org’s 5 years of archives — hence the PSA request.

In fact, I bet this forwarded story is what they found through Googling. Pity they didn’t include a URL for that!

Doesn’t sending legal notices like this through email seem particularly risky, given the unreliability of the medium?

An odd situation, all told…

More ‘Small Engine Repair’

Plug plug plug: next week is the 2007 Jameson Dublin International Film Festival — some great movies being shown, I’m looking forward to it. Most of all, though, I want to recommend Small Engine Repair, which I’ve written about before (http://taint.org/2006/07/19/125234a.html). It’s being shown in the festival at 6:20 PM on Wed 21st Feb in IFI 1 (http://www.dubliniff.com/content/venues) — tickets can be booked online at http://www.dubliniff.com/bookings/choose_tickets/73, at EUR 9 apiece.

Writer and director Niall Heery won the Breakthrough Talent Award at this year’s Irish Film and Television Awards at the weekend (http://www.ifta.ie/winners0607/breakthrough.htm). Nice one Niall!

Go see it if you get a chance — it’s a fantastic movie, in my opinion. And be sure to vote for it for the festival’s Audience Award…

Wikipedia and rel=”nofollow”

Apparently, Wikipedia has (possibly temporarily) decided to re-add the rel="nofollow" attribute to outbound links from their encyclopedia pages.

There’s been a lot of heat and light generated about this, most of it missing one thing: there’s no reason why Google needs to pay attention.

Google, or any other search engine, can treat links in the Wikipedia pages any way they like — including ignoring ‘nofollow’, applying extra anti-spam heuristics of their own, or even trusting the links more highly.

‘Nofollow’ has had pretty much no effect on web-spam, and now is generally festooned all over weblog posts across the internet, both spammed and non-spammed posts, at that. It’d be interesting to see if it’s yet flipped to mean a higher correlation with nonspam than spam content…

Update: It appears Wikipedia used ‘nofollow’ before, so this is not exactly new, either.

more on social whitelisting with OpenID

An interesting post from Simon Willison, noting that he is now publishing a list of "non-spammy" OpenID identities (namely people who posted one or more non-spammy comments to his blog).

I attempted to comment, but my comments haven’t appeared — either they got moderated as irrelevant (I hope not!) or his new anti-comment-spam heuristics are wonky ;) Anyway, I’ll publish here instead.

It’s possible to publish a whitelist in a "secure" fashion — allowing third parties to verify against it, without explicitly listing the identities contained. One way is using Google’s enchash format. Another is using something like the algorithm in LOAF.
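A minimal sketch of the idea, using salted SHA-1 hashes — the salt and identity URLs here are hypothetical, and this is far simpler than either enchash or LOAF:

```shell
# Sketch of publishing a whitelist as salted hashes rather than cleartext
# identities. The salt and identity URLs are hypothetical.
set -e
salt="example-salt"

# Hash a candidate identity with the (published) salt.
hash_id () {
  printf '%s%s' "$salt" "$1" | sha1sum | cut -d' ' -f1
}

# The publisher emits only hashes; the raw identity list stays private.
published=$(hash_id "https://simon.example/openid")

# A verifier can test a candidate identity against the published hashes
# without ever seeing the cleartext list.
if [ "$(hash_id 'https://simon.example/openid')" = "$published" ]; then
  echo "whitelisted"   # prints "whitelisted"
fi
```

(Of course, with a fixed published salt a determined party can still dictionary-attack guessable identities — "secure" belongs in quotes, as above.)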

Also, a small group of people (myself included) tried social-network-driven whitelisting a few years back, with IP addresses and email, as the Web-o-Trust.

Social-network-driven whitelisting is not as simple as it first appears. Once someone in the web — a friend of a friend — trusts a marginally-spammy identity, and a spam is relayed via that identity, everyone will get the spam, and tracking down the culprit can be hard unless you’ve designed for that in the first place (this happened in our case, and pretty much killed the experiment). I think you need to use a more complex Advogato-style trust algorithm, and multiple "levels" of outbound trust, instead of the simplistic Web-o-Trust model, to avoid this danger.

Basically, my gut feeling is that a web of trust for anti-spam is an attractive concept, possible, but a lot harder than it looks. It’s been suggested repeatedly ever since I started writing SpamAssassin, but nobody’s yet come up with a working one… that’s got to indicate something ;) (Mind you, the main barrier has probably been waiting for workable authentication, which is now in place with DK/SPF/DKIM.)

In the meantime, the concept of a trusted third party who publishes their concept of an identity’s reputation — like Dun and Bradstreet, or Spamhaus — works very nicely indeed, and is pretty simple and easy to implement.

SpamArchive.org no more

Remember SpamArchive.org, the site that allowed random Internet users to upload their spam? It was set up back in 2002 by CipherTrust (http://www.ciphertrust.com), one of the commercial anti-spam vendors, to offer a large, ‘standard’ database of known spam to be used for testing, developing, and benchmarking anti-spam tools, and for anti-spam researchers. It got a bit of coverage at Slashdot (http://it.slashdot.org/article.pl?sid=02/11/21/028220) and Wired News at the time.

It never really was very useful for its supposed purposes, though, at least for us in SpamAssassin, since:

  1. it collected submissions from random internet users, without vetting, and therefore couldn’t be guaranteed to be 100% valid;

  2. it ‘anonymized’ the headers too much for the spam to be useful in testing a filter like SpamAssassin, which requires correct header data for valid results;

  3. collecting spam has never been a problem; avoiding it is ;)

Anyway, it looks like CipherTrust/Secure Computing have since lost interest: they’ve allowed the domain to lapse, and it has been picked up by a domain speculator:

Domain ID:D134033677-LROR
Domain Name:SPAMARCHIVE.ORG
Created On:30-Nov-2006 18:52:13 UTC
Last Updated On:01-Dec-2006 12:42:26 UTC
Expiration Date:30-Nov-2007 18:52:13 UTC
Sponsoring Registrar:PSI-USA, Inc. dba Domain Robot (R68-LROR)
Status:TRANSFER PROHIBITED
Registrant ID:ABM-9376887
Registrant Name:Robert Farris
Registrant Organization:Virtual Clicks
Registrant Street1:P.O. Box 232471
Registrant Street2:
Registrant Street3:
Registrant City:San Diego
Registrant State/Province:US
Registrant Postal Code:92023
Registrant Country:US
Registrant Phone:+1.7205968887
Registrant Phone Ext.:
Registrant FAX:
Registrant FAX Ext.:
Registrant Email:domain_whois@virtualclicks.com
Name Server:NS1.DIGITAL-DNS-SERVER.COM
Name Server:NS2.DIGITAL-DNS-SERVER.COM

A visit to http://www.spamarchive.org/ now reveals a parking page, which grabs the browser window, forces it to front, maximises it, attempts to bookmark it, add it to the Firefox sidebar — and who knows what else ;)

apres-Barcamp!

Well, that was great fun — well worth the trip down. Got to put a load of faces to names, meeting up with a fair few people I’ve been conversing with online — and a few I hadn’t met before, online or off. Plenty of thought-provoking and interesting chats, too!

My talk went down well, I think. Unfortunately, we didn’t quite know how to operate the projector, so the attendees, while they got to hear me talk, didn’t get to read the leftmost quarter or so of each slide ;)

To make up for it, here they are:

OpenOffice 2 source (234k), PDF (320k), HTML

(PS: Regarding GUI interfaces to managing EC2 — a question that came up in the Q&A — here’s one that looks pretty interesting…)

Barcamp!

I was wavering for a minute there, but I’ve decided to head down to Waterford for Barcamp Ireland – SouthEast — a bit last-minute, but there you go! Tickets and hotel booked.

I’m hoping to give a quick, 20-minute intro to Amazon’s EC2 and S3 web services — what they are, how they’re used, some interesting features and a few gotchas to watch out for.

Also, I’m up for dinner on the Saturday night, given there’s a promise of free booze ;)

Any taint.org readers heading down?

Debunking the “cocaine on 100% of Irish banknotes” story

BBC: Cocaine on ‘100% of Irish euros’:

One hundred percent of banknotes in the Republic of Ireland carry traces of cocaine, a new study has found.

Researchers used the latest forensic techniques that would detect even the tiniest fragments to study a batch of 45 used banknotes.

The scientists at Dublin’s City University said they were "surprised by their findings".

Also at RTE, the Irish Examiner (http://breaking.examiner.ie/2007/01/09/story292583.html), PhysOrg.com (http://www.physorg.com/news87664001.html), Bloomberg.com (http://www.bloomberg.com/apps/news?pid=20601100&sid=aF0.zayPZ0eo&refer=germany), even Kazakhstan’s KazInform (http://www.inform.kz/showarticle.php?lang=eng&id=147654).

This story is (of course) being played widely in the media as "OMG Ireland must use more coke than anywhere else" — in particular, in comparison with a previous study in the US:

The most recent survey carried out in the US showed 65% of dollar notes were contaminated with cocaine.

The DCU press-release has a few more details:

Using a technique involving chromatography/mass spectrometry, a sample of 45 bank notes were analysed to show the level of contamination by cocaine. …

62% of notes were contaminated with levels of cocaine at concentrations greater than 2 nanograms/note, with 5% of the notes showing levels greater than 100 times higher, indicating suspected direct use of the note in either drug dealing or drug inhalation. … The remainder of the notes which showed only ultra-trace quantities of cocaine was most probably the result of contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces.

However, looking at an abstract of what I think is the paper in question, Evaluation of monolithic and sub 2 µm particle packed columns for the rapid screening for illicit drugs — application to the determination of drug contamination on Irish euro banknotes, Jonathan Bones, Mirek Macka and Brett Paull, Analyst, 2007, DOI: 10.1039/b615669j, that says:

A study comparing recently available 100 × 3 mm id, 200 × 3 mm id monolithic reversed-phase columns with a 50 × 2.1 mm id, 1.8 µm particle packed reversed-phase columns was carried out to determine the most efficient approach … for the rapid screening of samples for 16 illicit drugs and associated metabolites. … Method performance data showed that the new LC-MS/MS method was significantly more sensitive than previous GC-MS/MS based methods for this application.

My emphasis. I’d guess that that means that comparing this result to banknote-analysis experiments carried out elsewhere using different methods is probably invalid — perhaps this method is more efficient at picking up ‘contact with other contaminated notes, which could have occurred within bank counting machines or from other contaminated surfaces’, as noted in the DCU release?

Email authentication is not anti-spam

There’s a common misconception about spam, email, and email authentication; Matt Cutts is the most recent promulgator, asking ‘Where’s my authenticated email?’ — and various commenters on that post treat it as an anti-spam question.

Here’s the thing — email these days is authenticated. If you send a mail from GMail, it’ll be authenticated using both SPF and DomainKeys. However, this alone will not help in the fight against spam.

Put simply — knowing that a mail was sent by ‘jm3485 at massiveisp.net’, is not much better than knowing that it was sent by IP address 192.122.3.45, unless you know that you can trust ‘jm3485 at massiveisp.net’, too. Spammers can (and do) authenticate themselves.

Authentication is just a step along the road to reputation and accreditation, as Eric Allman notes:

Reputation is a critical part of an overall anti-spam, anti-phishing system but is intentionally outside the purview of the DKIM base specification because how you do reputation is fundamentally orthogonal to how you do authentication.

Conceptually, once you have established an identity of an accountable entity associated with a message you can start to apply a new class of identity-based algorithms, notably reputation. … In the longer term reputation is likely to be based on community collaboration or third party accreditation.

As he says, in the long term, several vendors (such as Return Path (http://www.londonactionplan.org/files/reports/ReturnPathbeyond_authentication.pdf) and Habeas) are planning to act as accreditation bureaus and reputation databases, undoubtedly using these standards as a basis. Doubtless Spamhaus have similar plans, although they’ve not mentioned it.

But there’s no need to wait — in the short term, users of SpamAssassin and similar anti-spam systems can run their own personal accreditation list, by whitelisting frequent correspondents based on their DomainKeys/DKIM/SPF records, using whitelist_from_spf, whitelist_from_dkim, and whitelist_from_dk.
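For instance, the relevant lines in a SpamAssassin configuration might look like this — the addresses are placeholders; each directive whitelists a sender only when the mail carries a valid authentication record for that address:

```
# whitelist these correspondents only when the message passes the
# corresponding SPF / DKIM / DomainKeys check
whitelist_from_spf   *@example.com
whitelist_from_dkim  friend@example.org
whitelist_from_dk    friend@example.net
```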

Hopefully more ISPs and companies will deploy outbound SPF, DK and DKIM as time goes on, making this easier. All three technologies are useful for this purpose (although I prefer DKIM, if pushed to it ;).

It’s worth noting that the upcoming SpamAssassin 3.2.0 can be set up to run these checks upfront, "short-circuiting" mail from known-good sources with valid SPF/DK/DKIM records, so that it isn’t put through the lengthy scanning process.
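A sketch of what that might look like, using 3.2.0’s Shortcircuit plugin — treat the details as illustrative rather than a recommended config:

```
loadplugin Mail::SpamAssassin::Plugin::Shortcircuit
# stop scanning immediately once an authenticated whitelist rule fires
shortcircuit USER_IN_SPF_WHITELIST   ham
shortcircuit USER_IN_DKIM_WHITELIST  ham
```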

That’s not to say Matt doesn’t have a point, though. There are questions about deployment — why can’t I already run "apt-get install postfix-dkim-outbound-signer" to get all my outbound mail transparently signed using DKIM signatures? Why isn’t DKIM signing commonplace by now?

How to deal with joe-jobs and massive bounce storms

As I’ve noted before, we still have a major problem with sites generating bounce/backscatter storms in response to forged mail — whether deliberately targeted, as a "Joe-Job", or as a side-effect of attempts to evade over-simplistic sender address verification as seen in spam, viruses, and so on.

Sites sending these bounces have a broken mail configuration, but there are thousands remaining out there — it’s very hard to fix an old mail setup to avoid this issue. As a result, even if your mail server is set up correctly and can handle the incoming spam load just fine, a single spam run sent to other people can amplify the volume of response bounces in a Smurf-attack-style (http://en.wikipedia.org/wiki/Smurf_attack) volume multiplication, acting as a denial of service. I’ve regularly had serious load problems and backlogs on my MX, due solely to these bounces.

However, I think I’ve now solved it, with only a little loss of functionality. Here’s how I did it, using Postfix and SpamAssassin.

(UPDATE: if you use the algorithm described below, you’ll block mail from people using Sender Address Verification! Use this updated version instead.)

Firstly, note that if you adopt this, you will lose functionality. Third party sites will not be able to generate bounces which are sent back to senders via your MX — except during the SMTP transaction.

However, if a message delivery attempt is run from your MX, and it is bounced by the host during that SMTP transaction, this bounce message will still be preserved. This is good, since this is basically the only bounce scenario that can be recommended, or expected to work, in modern SMTP.

Also, a small subset of third-party bounce messages will still get past, and be delivered — the ones that are not in the RFC-3464 (http://www.faqs.org/rfcs/rfc3464.html) bounce format generated by modern MTAs, but that include your outbound relays in the quoted header. The idea here is that "good bounces", such as messages from mailing lists warning that your mails were moderated, will still be safe.

OK, the details:

In Postfix

Ideally, we could do this entirely outside Postfix — but in my experience, the volume (amplified by the Smurf attack effects) is such that these need to be rejected as soon as possible, during the SMTP transaction.

Update: I’ve now changed this technique: see this blog post for the current details, and skip this section entirely!

(If you’re curious, though, here’s what I used to recommend:)

In my Postfix configuration, on the machine that acts as MX for my domains — edit ‘/etc/postfix/header_checks’, and add these lines:

    /^Return-Path: <>/                              REJECT no third-party DSNs
    /^From:.*MAILER-DAEMON/                         REJECT no third-party DSNs

Edit ‘/etc/postfix/null_sender’, and add:

    <>              550 no third-party DSNs

Edit ‘/etc/postfix/main.cf’, and ensure it contains these lines:

    header_checks = regexp:/etc/postfix/header_checks
    smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/null_sender

(If you already have an ‘smtpd_sender_restrictions’ line, just add ‘check_sender_access hash:/etc/postfix/null_sender’ to the end.) Finally, run:

    sudo postmap /etc/postfix/null_sender
    sudo /etc/init.d/postfix restart

This catches most of the bounces — RFC-3464-format Delivery-Status-Notification messages from other mail servers.

In SpamAssassin

Install the Virus-bounce ruleset. This will catch challenge-response mails, "out of office" noise, "virus scanner detected blah" crap, and bounce mails generated by really broken groupware MTAs — the stuff that gets past the Postfix front-line.

Once you’ve done these two things, that deals with almost all the forged-bounce load, at what I think is a reasonable cost. Comments welcome…

Kernighan and Pike on debugging

While reading the log4j manual, I came across this excellent quote from Brian W. Kernighan and Rob Pike’s "The Practice of Programming":

As personal choice, we tend not to use debuggers beyond getting a stack trace or the value of a variable or two. One reason is that it is easy to get lost in details of complicated data structures and control flow; we find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is. More important, debugging statements stay with the program; debugging sessions are transient.

+1 to that.

5 things revisited

Hey Danny! I’ve already filled out my "5 Things" list. Surprisingly (or thankfully) nobody has commented on #5 ;)

Great Things, btw. I might adopt #4, and see if it works.

It’s great fun following the web of "5 Things" links as they percolate through the interwebs. Now if only the people I nominated would get on with their lists…

Script: knewtab

Here’s a handy script for konsole users like myself:

knewtab — create a new tab in a konsole window, from the commandline

usage: knewtab {tabname} {command line …}

Creates a new tab in a "konsole" window (the current window, or a new one if the command is not run from a konsole).

Requires that the konsole app be run with the "--script" switch.

Download ‘knewtab.txt’

Spam zombies — we need to cure the disease, not suppress the symptoms

Here’s a great presentation from Joe St Sauver presented at the London Action Plan meeting recently: Infected PCs Acting As Spam Zombies: We Need to Cure the Disease, Not Just Suppress the Symptoms

Some key points in brief:

Despite all our ongoing efforts: the spam problem continues to worsen, with nine out of every ten emails now spam; spam volume has increased by 80% over just the past few months and users face a constantly morphing flood of malware trying to take over their computers. Bottom line: we’re losing the war on spam.

The root cause of today’s spam problems is spam zombies, with 85% of all spam being delivered via spam zombies.

The spam zombie problem grows worse every day (with over ninety one million new spam zombies per year)

Users don’t, won’t, or can’t clean up their infected PCs; and ISPs can’t be expected to clean up their infected customers’ PCs.

Filtering port 25 and doing rate limiting is like giving cough syrup to someone with lung cancer — it may suppress some overt symptoms but it doesn’t cure the underlying disease.

Filtered and rate-limited spam zombies CAN still be used for many, many OTHER bad things, and they represent a huge problem if left to languish in a live infected state.

Joe’s take — "we’re in the middle of a worldwide cyber crisis". I agree. He suggests a new strategy:

It is common for universities to produce and distribute a one-click clean-up-and-secure CD for use by their students and faculty. It’s now time for our governments to produce and distribute an equivalent disk for everyone to use.

I agree the existing schemes are clearly not working; this is an interesting suggestion. Read/listen to the presentation in full for more details; pick up PDF, PPT and video at http://www.uoregon.edu/~joe/lapcnsa2/.

Massive spam volumes causing ISP delays

Via Steve Champeon’s daily links, the following spam-in-the-news stories illustrate a rising trend:

Huge amounts of spam are said to be responsible for delays in the email network of NZ ISP Xtra.

Several customers have vented their frustrations on an Xtra website message board saying some emails were days late, The New Zealand Herald reports.

… Record volumes of spam meant such problems would be "an unfortunate and on-going reality of the internet not specific to any provider", he said.

Mr Bowler said Telecom had invested "tens of millions of dollars" in email and anti-spam software and worked closely with two of the world’s leading anti-spam vendors.

Holiday spam e-mails are to blame for slowing message delivery to faculty and staff in schools across Kentucky …

"Some 123-reg customers may have experienced intermittent delays in their emails in the last two weeks. We had received a particularly high level of image-based spam attacks over a short period of time," the Pipex subsidiary said.

Small businesses are threatening legal action over continuing glitches with Xtra’s email service and the Consumers’ Institute says they may have a case.

Several people have contacted the Herald complaining that delays and non-deliveries of emails over the past three weeks on the Xtra network are severely affecting their businesses. …

The institute’s David Russell said home users could claim compensation for email delays if they had suffered "a real measurable loss".

Non-commercial customers were covered by the Consumer Guarantees Act and services they paid for had to be of a "reasonable quality".

Although it might be more difficult for small business owners, they could also have a case, Mr Russell said. "If there has been a considerable amount of money, they could consider legal action or, if the amount was smaller, they could go through the disputes tribunal."

In other words, the DDOS-like elements of the spam problem are becoming an increasing worry; even with working spam filtering in place, the record size of zombie botnets means that spammers can now destroy organisations’ computing infrastructure, almost accidentally.

Spammers don’t care if an organisation’s infrastructure collapses while they’re sending their spam to it — they just want to maximise exposure of their spam, by any means necessary. If that requires knocking a company off the air entirely for a while, so be it.

I’m not sure what can be done about this, in terms of filtering. It may finally be time to fall back to a "side channel" of trusted, authenticated SMTP peers, and leave the spam-filled world of random email from people and organisations you don’t know to one side, as a lower-priority system which can (and will, frequently) collapse, without affecting the ‘important’ stuff. What a mess. :(

Alternatively, maybe it’s time for governments to start putting serious money into botnet-spam-related arrests and prosecution.

This has additional issues for ISPs, too, btw — I wonder if Earthlink are taking note of that Xtra lawsuit story above….

Cliche-finder bookmarklet

Quinn posted a link to a nifty CGI by Aaron Swartz which detects uses of common cliches, with the list of cliches to avoid taken from the Associated Press Guide to News Writing. In addition, she also mentioned there’s the Passivator, ‘a passive verb and adverb flagger for Mozilla-derived browsers, Safari, and Opera 7.5’.

Combining the two, I’ve hacked together a bookmarklet version of the cliche finder — it can be found on this page. (Couldn’t place it inline into this post due to stupid over-aggressive Markdown, grr.)

Fun! Probably not IE-compatible, though.
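
For the curious, the underlying detection is essentially just phrase matching. Here's a minimal shell sketch of the idea — the function name and file names are made up for illustration, and this is not how the CGI or bookmarklet is actually implemented:

```shell
# cliche_hits: read text on stdin, print any phrases from the
# cliche list (one phrase per line) that appear in it.
# -i: case-insensitive, -F: fixed strings, -o: print only the match.
cliche_hits() {
    grep -o -i -F -f "$1"
}

# Hypothetical usage, with a made-up cliche list:
#   printf 'at the end of the day\nlow-hanging fruit\n' > cliches.txt
#   echo 'Well, at the end of the day, we ship.' | cliche_hits cliches.txt
```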

5 things

Tagged by richi! drat. OK, here are 5 things you probably don’t know about me:

  1. I’m a certified SCUBA diver, at PADI Advanced Open Water Diver level. (oh, look, so’s Tom Raftery!)

  2. I generally try to avoid meeting my heroes, since I get quite tongue-tied in the presence of people I admire — I once stammered "I think you’re brilliant" at Alex Paterson, instead of anything more witty or interesting.

  3. I met my wife at a student occupation in university, where her knowledge of the science and nature questions in Trivial Pursuit (and her amazing looks, of course) got me hooked ;)

  4. I could listen to Brian Eno’s Taking Tiger Mountain (By Strategy) and Here Come The Warm Jets on repeat for several weeks, if necessary.

  5. I was a child model, modelling (among other things) underpants for Dunnes Stores! It’s all been downhill since then, really ;)

Passing it on: go for it, Brendan, Colm, Lisey, and Jason.

An anti-challenge-response Xmas linkfest

As all right-thinking people know by now, challenge-response spam filtering is broken and abusive, since it simply shifts the work of filtering spam out of your email onto innocent third parties — either your legitimate correspondents, people on mailing lists you read, or even random people you have never heard of (due to spam blowback).

I’ve ranted about this in the past, but I’m not alone in this opinion — and frequently find myself explaining it. To avoid repeating myself, here’s a canonical collection of postings from around the web on this topic.

Description: This "selfish" method of spam filtering replies to all email with a "challenge" – a message only a living person can (theoretically) respond to. There are several problems with this method which have been well known for many years.

  1. Does not scale: If everyone used this method, nobody would ever get any mail.
  2. Annoying: Many users refuse to reply to the challenge emails, don’t know what they are or don’t trust them.
  3. Ineffective: Because of confusion about these emails, many of them are confirmed by people who did not trigger them. This results in the original malicious email being delivered.
  4. Selfish: This is the problem we are mainly concerned with. By using challenge/response filtering, you are asking innumerable third parties to receive your challenge emails just so that a relatively few legitimate ones get through to the intended recipient.

C-R systems in practice achieve an unacceptably high false-positive rate (non-spam treated as spam), and may in fact be highly susceptible to false-negatives (spam treated as non-spam) via spoofing.

Effective spam management tools should place the burden either on the spammer, or, at the very least, on the person receiving the benefits of the filtering (the mail recipient). Instead, challenge-response puts the burden on, at best, a person not directly benefitting, and quite likely (read on) a completely innocent party. The one party who should be inconvenienced by spam consequences, the spammer, isn’t affected at all.

Worse: C-R may place the burden on third parties either inadvertently (via spoofed-sender spam or virus mail), or deliberately (see Joe Job, below). Such intrusions may even result in subversion of the C-R system out of annoyance. Many recent e-mail viruses spoof the e-mail sender, including Klez, Sobig variants, and others.

The collateral damage from widely used C/R systems, even with implementations that avoid the stupid bugs, will destroy usable e-mail. [jm: in fairness, this was written in 2003.]

Challenge systems have effects a lot like spam. In both cases, if only a few people use them they’re annoying because they unfairly offload the perpetrator’s costs on other people, but in small quantities it’s not a big hassle to deal with. As the amount of each goes up, the hassle factor rapidly escalates and it becomes harder and harder for everyone else to use e-mail at all.

I’m skeptical of CR as a response to email. If you’re the first on your block to adopt CR, and if nobody else uses anti-spam technology, then CR might provide you some modest benefit. But it’s hard to see how CR can be widely successful in a world where most people use some kind of spam defense.

If these systems are so brain-dead as to not bother adding my address to the whitelist when the user sends me e-mail, I have serious trouble understanding why anyone is using them.

Is it just me? Is this too hard to figure out?

Anyway, there’s another 5 minutes I’ll never get back. It’s too bad there’s no mail header to warn me that "this message is from a TMDA user", because then I’d be able to procmail ’em right to /dev/null where they belong.

Ugh.

This bullshit is not going to "solve" the spam problem, people. If that’s your solution, please let me opt out. Forever.

C/R slows down and impedes communication by placing unwanted barriers between you and your clients/suppliers.

If you must insist on using some form of C/R please make sure that you whitelist my address before you contact me as I will not reply to challenges.

We will not answer any challenges generated in response to our mailing list postings. Thus, if you’re using a challenge-response system and not receiving TidBITS, you’ll need to figure that out on your own. Also, if you send us a personal note and we receive a challenge to our reply, we may or may not respond to it, depending on our workload at the time.

uol.com.br uses a very broken method of anti-spam. Every time someone sends an email message to one of their members, they send back a verification message, asking the original sender to click a link before they will allow the message through. These messages are themselves a form of spam, and the resulting back-scatter of these messages is altogether bad for the Internet, the UOL member, and all of the UOL member’s contacts. UOL is aware of the complaints against them, and they refuse to correct the issue, claiming that their members love the service.

I hate C/R systems. With a passion. I absolutely will not respond to them. They go in the trash. I don’t get them very often but I get them more and more. I think they have the potential to seriously damage email communication as we know it. And I’m not alone in this opinion.

Phew.

Linux USB frequent reconnects – workaround

I’ve been running into problems for several months at least with USB hardware on my Thinkpad T40 running Ubuntu Dapper (previously Hoary); in particular, every time I plug in my iPod or one of my USB hard disks nowadays, I get this:

[5008549.187000] usb 4-3: USB disconnect, address 14
[5008550.143000] usb 4-3: new high speed USB device using ehci_hcd and address 18
[5008552.643000] usb 4-3: new high speed USB device using ehci_hcd and address 27
[5008557.393000] usb 4-3: new high speed USB device using ehci_hcd and address 43
[5008557.893000] usb 4-3: new high speed USB device using ehci_hcd and address 44
[5008558.643000] usb 4-3: new high speed USB device using ehci_hcd and address 46
[5008558.895000] ehci_hcd 0000:00:1d.7: port 3 reset error -110
[5008558.896000] hub 4-0:1.0: hub_port_status failed (err = -32)
[5008559.893000] usb 4-3: new high speed USB device using ehci_hcd and address 48
[5008562.643000] usb 4-3: new high speed USB device using ehci_hcd and address 58
[5008563.143000] usb 4-3: new high speed USB device using ehci_hcd and address 59
[5008563.643000] usb 4-3: new high speed USB device using ehci_hcd and address 60
[5008570.143000] usb 4-3: new high speed USB device using ehci_hcd and address 85

This repeats ad infinitum until the USB device is disconnected.

I had this down as a hardware issue (since it started happening just after warranty expiration ;), but some accidental googling revealed several other cases — and a workaround:

sudo modprobe -r ehci-hcd

Run that repeatedly, each time replugging the device and monitoring dmesg via watch -n 1 'dmesg | tail' in a window, until the device is finally recognised as a USB hard disk. In my experience it generally takes 3 or 4 attempts.
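
To save some manual repetition, the retry procedure can be scripted. This is only a sketch of the steps above, not a tested tool: the function names are made up, the 'Attached SCSI disk' string is an assumption about what dmesg prints when the disk does come up (check your own logs), and it needs root for modprobe:

```shell
# device_recognised: succeed if the given chunk of kernel log shows
# the disk being attached. The match string is an assumption; adjust
# it to whatever your own dmesg prints when the device works.
device_recognised() {
    printf '%s\n' "$1" | grep -q 'Attached SCSI disk'
}

# Hypothetical retry loop for the manual procedure described above
# (interactive, so run it in a terminal):
retry_replug() {
    until device_recognised "$(dmesg | tail -20)"; do
        sudo modprobe -r ehci-hcd
        printf 'Replug the device, then press Enter... '
        read _
        sleep 2
    done
    echo 'Device recognised as a USB disk.'
}
```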

This LKML thread suggests hardware changes can cause it, but this hardware hasn’t changed in years. Annoying.

Anyway, this is ongoing. This tip seems to help, but it might be just treating a symptom, I don’t know — just posting for google and posterity… and to moan, of course :(

Threadless deals with plagiarism

(Updated since original posting; see end of post for details)

Paging boogah!

Interesting situation playing out at <a href='http://www.threadless.com/'>Threadless</a>. I think this may be the first time a stolen design made it through voting and so on, onto cotton, without being spotted. Here’s the design, supposedly by someone called ‘rocketrobyn’:

And here’s the (apparently original) stencil art by miso and ghostpatrol:

BTW, note the perspective being copied from the photo’s odd angle, to the shirt design…

The Threadless design’s submission page has some classic comments:

  • Boney_King_of_Nowhere: Wow. Are you by any chance a fan of Bansky? Because this is almost a rip off. Almost. Awsome though.
  • rocketrobyn (this is my design): Thank you for the positive comments. I really like this shirt too! […] I’m not sure who Bansky [jm:sic] is, but I’ll check it out!

Heh.

I heard about this via <a href="http://youthoughtwewouldntnotice.com/blog3/">You Thought We Wouldn’t Notice</a>, a street-design plagiarism blog, where ghostpatrol (one of the stencil artists) posted about the situation. In the comments there, Jake from Threadless pipes up:

jake n on 12 Dec 2006 at 4:30 am

hey, jake here from threadless. i was just made aware of this situation and want to give you all my assurance that we will handle this properly.

the designer will not be paid and the design will either be removed or licensed from the original designer if they are willing.

give us a couple days to sort the details.

Not to appear whingy, but 2 hours later "n." posts:

The original owners are not willing to license this design to Threadless, and want it removed from the site. Neither artist has yet been contacted by Threadless.

Bit of patience there ;)

More links:

It’s an interesting situation, and so far Threadless is handling it very well as far as I can see — the only people who aren’t are some other graf and stencil artists in the reaction threads, vituperating about Threadless not using psychic powers to detect plagiarism:

i tell you, you aren’t printing any of my subs, i know it as they score way too low to get noticed. but on the off chance that someone rips off a design i’ve done, as blatantly as this…i would definitely seek reparations from threadless and the offending subber. do a background check with the subbers available websites etc.

Background checks?! wtf.

Good reaction from miso though:

Once again, we own automatic copyright on these images,…

To clarify — we are not blaming Threadless. They didn’t take the design knowing that it was stolen [if they had done so with such knowledge, we would be approaching this very differently].

This is the fault of the "designer", and hopefully this will sort itself out in the next few days. [Who, by the way, has claimed to have done these designs — "This is a t-shirt I designed for Threadless."]

As yet, neither GP nor I have been contacted by either the company or "designer" to fix this, but Jake from Threadless has left a very nice comment for us on "You Thought We Wouldn’t Notice".

The Threadless blog reactions are worth watching if you want to follow the ongoing drama.

Update: reposted to preshrunk. In the comments there, someone notes that it’s not the first Threadless tee to make it to production before plagiarism was spotted — The Killing Tree was first. There are some oblique references to this in this blog post’s comments.

Backscatter in InformationWeek

Yay! Kudos to Richi Jennings, who’s been trumpeting the dangers of backscatter to InformationWeek recently. It’s a great article. I particularly like how it digs up this impressively off-the-mark quote:

Tal Golan, CTO, president, and founder of Sendio, maker of a challenge/response e-mail appliance used by more than 150 enterprise customers, disagrees strongly with Jennings’s assertion that challenge-based filtering has problems. "Without question, the benefit to the whole community at large drastically outweighs that FUD [fear, uncertainty, and doubt] that’s out there in the marketplace that somehow challenge/response makes the problem worse," he says. "The real issue is that filters don’t work. From our perspective, challenge/response is the only solution. This whole concept of backscatter is just not true. Very, very rarely do spammers forge the e-mail addresses of legitimate companies anymore."

hahahaha. Well, since last Thursday, "very very rarely" translates as "214 MB of backscatter in my inbox". The facts aren’t on Tal Golan’s side here…

(PS: SpamAssassin 3.2.0 will include backscatter detection.)