Hacky hack hack.
Ever since I enabled tags on taint.org, I’ve been mildly annoyed by the fact
that there were thousands of older entries deprived of their folksonomic chunky
goodness. A way to ‘retroactively tag’ those entries somehow would be cool.
Last week, Leonard posted a link on his linkblog to
TagThe.net, a web service which offers a nifty REST API;
simply upload a chunk of text, and it’ll suggest a few tags for that text, like
this:
echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
<meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
<dim type="topic">
<item>robot</item>
</dim>
<dim type="language">
<item>english</item>
</dim>
</meme>
</memes>
This looked promising.
Anyway, I’ve now implemented this, and it worked great! If you’re curious, here are the details of how I did it. It’s a bit hacky, since I’m only going to be doing this once, and very UNIXy
and perlish, because that’s how I do these things, but maybe somebody will
find it useful.
How I Retroactively Tagged taint.org
This weblog runs WordPress, so all the entries are stored in a MySQL database. I took a MySQL dump of
the tables, and a quick
script figured out that, out of roughly 1600 posts, there were 1352
that came from the pre-tag era, requiring tag inference. A mail to the
TagThe.net team established that they were happy with
this level of usage.
I grepped the post IDs and text out of the SQL dump, threw those into a text
file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING
was the nicely-escaped HTML text taken directly from the SQL dump), and ran
them through this script.
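For illustration, here is a rough sketch of reading that ‘id=NNN text=SQLHTMLSTRING’ format. It is in Python rather than the Perl actually used, and it is not the original script: `parse_line` and `sql_unescape` are made-up names, and the escape handling assumes the usual mysqldump backslash escapes.

```python
import re

# Assumption: the text uses the standard MySQL string escapes.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "0": "\0", "Z": "\x1a"}


def sql_unescape(s):
    """Undo mysqldump-style backslash escapes (hypothetical helper)."""
    # Known letters map to control characters; any other escaped
    # character (quote, backslash, ...) simply stands for itself.
    return re.sub(
        r"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), m.group(1)),
        s,
        flags=re.S,
    )


LINE_RE = re.compile(r"^id=(\d+) text=(.*)$", re.S)


def parse_line(line):
    """Split one 'id=NNN text=SQLHTMLSTRING' line into (post_id, html)."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("unrecognised line: %r" % line[:40])
    return int(m.group(1)), sql_unescape(m.group(2))
```

Each line of the text file would go through `parse_line`, giving the post ID and the plain HTML for the next step.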
That rendered the first 2k of each of those entries as a URL-encoded string,
invoked the REST API with that, got the XML output, and extracted the tags into
another UNIXy text-format output file. (It also added one tag for the
‘proto-tag’ system I used in the early days, where the first word of the entry
was a single tag-style category name.)
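The request-and-extract step might look something like the sketch below. Again this is Python, not the original Perl, and the details are assumptions: "2k" is taken to mean the first 2048 characters, only the ‘topic’ dimension is treated as tags by default, and the function names are invented. The endpoint and the XML layout come from the curl example above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "http://tagthe.net/api/"  # endpoint from the curl example
MAX_CHARS = 2048                     # assumption: "2k" means 2048 characters


def build_request_url(text):
    """Return the GET URL for the first 2k of text, URL-encoded."""
    return API_URL + "?text=" + urllib.parse.quote(text[:MAX_CHARS], safe="")


def extract_tags(xml_bytes, dim_type="topic"):
    """Pull the <item> values out of the <dim> with the given type."""
    root = ET.fromstring(xml_bytes)
    return [
        item.text
        for dim in root.iter("dim")
        if dim.get("type") == dim_type
        for item in dim.iter("item")
    ]


def fetch_tags(text):
    """Call the web service and return its suggested tags (needs network)."""
    with urllib.request.urlopen(build_request_url(text), timeout=30) as resp:
        return extract_tags(resp.read())
```

Splitting URL construction and XML parsing away from the network call means both can be checked offline against a saved response.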
Next, I ran this script, which
in turn took that intermediate output and converted it to valid PHP code, like
so:
cat suggestedtags | ./taglist-to-php.pl > addtags.php
scp addtags.php my.server:taint.org/wp-admin/
The generated page ‘addtags.php’ looks like this:
<?php
require_once('admin.php');
global $utw;
$utw->SaveTags(997, array("music","all","audio","drm-free",
"faq","lunchbox","destination","download","premiere","quote"));
[...]
$utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
$utw->SaveTags(999, array("oses","eek","longhorn","ram",
"winsupersite","windows","amount","base","dog","preview","system"));
?>
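A page like that can be generated with very little code. The sketch below is a Python stand-in for taglist-to-php.pl, not the script itself. The intermediate format it reads, one ‘id=NNN tags=a,b,c’ line per post, is purely an assumption, since the real format isn’t shown here. The function names are invented too. Only the `$utw->SaveTags(...)` call and the surrounding PHP lines are taken from the page above.

```python
import re


def php_string(s):
    """Quote s as a PHP double-quoted string literal."""
    # Escape the backslash first so later escapes are not doubled.
    for ch in ("\\", '"', "$"):
        s = s.replace(ch, "\\" + ch)
    return '"' + s + '"'


def taglist_to_php(lines):
    """Turn hypothetical 'id=NNN tags=a,b,c' lines into an addtags.php body."""
    out = ["<?php", "require_once('admin.php');", "global $utw;"]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = re.match(r"id=(\d+) tags=(.*)$", line)
        if m is None:
            raise ValueError("unrecognised line: %r" % line)
        tags = [t for t in m.group(2).split(",") if t]
        out.append(
            "$utw->SaveTags(%d, array(%s));"
            % (int(m.group(1)), ",".join(php_string(t) for t in tags))
        )
    out.append("?>")
    return "\n".join(out) + "\n"
```

Escaping the tags matters even for a one-off job: a stray quote or dollar sign in a suggested tag would otherwise break the generated page.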
Once that page was in place, I just visited it in my (already logged in) web
browser window, at
http://taint.org/wp-admin/addtags.php,
and watched as it gronked for a while. Eventually it stopped, and all those
entries had been tagged. (If I weren’t so hackish, I might have put in a little UI text here, but I didn’t.)
The results are very good, I think.
A success: http://taint.org/tag/research has picked up a lot of the
interesting older entries where I discussed things like IBM’s Teiresias
pattern-recognition algorithm. That’s spot on.
A minor downside: it’s not so good at proper nouns. This
entry talks about Silicon Valley and geographical
insularity, and mentions "Silicon Valley" prominently; one or both of those
words would seem to be a good thing to tag with, but it missed them.
Still, that’s a minor issue — the tags it has suggested are generally very
appropriate and useful.
Next, I need to find a way to auto-generate titles for the really
old entries ;)