Hacky hack hack.
Ever since I enabled tags on taint.org, I’ve been mildly annoyed by the fact that there were over a thousand older entries deprived of their folksonomic chunky goodness. A way to ‘retroactively tag’ those entries somehow would be cool.
Last week, Leonard posted a link on his linkblog to TagThe.net, a web service which offers a nifty REST API; simply upload a chunk of text, and it’ll suggest a few tags for that text, like this:
    echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"

    <?xml version="1.0" encoding="UTF-8"?>
    <memes>
      <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
        <dim type="topic">
          <item>robot</item>
        </dim>
        <dim type="language">
          <item>english</item>
        </dim>
      </meme>
    </memes>
This looked promising.
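To show how little work it takes to get tags out of a response like that, here is a minimal Python sketch. It is not the code I actually ran; the function name `extract_dims` is made up for illustration, and the only thing it assumes about the service is the XML shape shown in the sample response above.

```python
import xml.etree.ElementTree as ET

# Sample response, copied from the curl example in the post.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic"><item>robot</item></dim>
    <dim type="language"><item>english</item></dim>
  </meme>
</memes>"""


def extract_dims(xml_bytes):
    """Return {dim type: [item, ...]} for every <dim> element in the response."""
    root = ET.fromstring(xml_bytes)
    dims = {}
    for dim in root.iter("dim"):
        # Collect the text of each <item>, skipping empty ones.
        items = [item.text.strip() for item in dim.findall("item") if item.text]
        dims.setdefault(dim.get("type"), []).extend(items)
    return dims


if __name__ == "__main__":
    print(extract_dims(SAMPLE_RESPONSE))
```

The "topic" dimension is the one that holds the suggested tags; other dimensions, such as "language", can simply be ignored.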
Anyway, I’ve now implemented this, and it worked great! If you’re curious, here are the details of how I did it. It’s a bit hacky, since I’m only going to be doing this once, and very UNIXy and perlish, because that’s how I do these things, but maybe somebody will find it useful.
How I Retroactively Tagged taint.org
This weblog runs WordPress, so all the entries are stored in a MySQL database. I took a MySQL dump of the tables, and a quick script figured out that, out of 1600-odd posts, 1352 came from the pre-tag era and so needed tag inference. A mail to the TagThe.net team established that they were happy with this level of usage.
I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format ‘id=NNN text=SQLHTMLSTRING’ (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.
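For illustration, reading that ‘id=NNN text=SQLHTMLSTRING’ format back in could look roughly like the Python sketch below. It assumes one entry per line and the usual backslash escapes of a MySQL dump; both are guesses about the details, and `parse_entry` and `sql_unescape` are hypothetical names rather than anything from the real script.

```python
import re

# One entry per line: "id=NNN text=SQLHTMLSTRING" (the one-per-line layout is assumed).
LINE_RE = re.compile(r"^id=(\d+) text=(.*)$", re.DOTALL)

# Backslash escapes commonly found in MySQL dumps; any other escaped
# character (quote, backslash, ...) just stands for itself.
SQL_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0", "Z": "\x1a"}


def parse_entry(line):
    """Split one line into (post id, still-escaped text)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not an 'id=NNN text=...' line: %r" % line[:40])
    return int(match.group(1)), match.group(2)


def sql_unescape(text):
    """Undo the backslash escaping applied by the SQL dump."""
    return re.sub(
        r"\\(.)",
        lambda m: SQL_ESCAPES.get(m.group(1), m.group(1)),
        text,
        flags=re.DOTALL,
    )


if __name__ == "__main__":
    post_id, raw = parse_entry("id=997 text=<p>It\\'s a test</p>\n")
    print(post_id, sql_unescape(raw))
```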
That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the ‘proto-tag’ system I used in the early days, where the first word of the entry was a single tag-style category name.)
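The request-building half of that step can be sketched in a few lines of Python. This is a rough approximation rather than the script itself: it assumes ‘2k’ means 2048 bytes of UTF-8, reuses the endpoint from the curl example, and guesses that a proto-tag is simply the first word of the entry once the HTML tags are stripped. The names `build_request_url` and `proto_tag` are invented for the example.

```python
import re
from urllib.parse import quote

API_BASE = "http://tagthe.net/api/"  # endpoint taken from the curl example
MAX_BYTES = 2048                     # "first 2k", assumed to mean bytes


def build_request_url(text, max_bytes=MAX_BYTES):
    """Truncate the entry to max_bytes and URL-encode it into a request URL."""
    # Cut on bytes, then drop any multi-byte character split by the cut.
    chunk = text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    return API_BASE + "?text=" + quote(chunk, safe="")


def proto_tag(text):
    """Guess the old-style category: the first word of the entry, lowercased."""
    plain = re.sub(r"<[^>]+>", " ", text)  # crude HTML tag stripping
    match = re.search(r"[A-Za-z0-9][\w-]*", plain)
    return match.group(0).lower() if match else None


if __name__ == "__main__":
    print(build_request_url("Hi there, I am a tag-suggesting robot"))
    print(proto_tag("<p>Spam: yet more of it</p>"))
```

The resulting URL can then be fetched with any HTTP client, for instance `urllib.request.urlopen`, and the XML that comes back mined for its topic items.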
Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:
    cat suggestedtags | ./taglist-to-php.pl > addtags.php
    scp addtags.php my.server:taint.org/wp-admin/
The generated page ‘addtags.php’ looks like this:
    <?php
    require_once('admin.php');
    global $utw;
    $utw->SaveTags(997, array("music","all","audio","drm-free",
        "faq","lunchbox","destination","download","premiere","quote"));
    [...]
    $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
    $utw->SaveTags(999, array("oses","eek","longhorn","ram",
        "winsupersite","windows","amount","base","dog","preview","system"));
    ?>
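A generator for a page like that is short enough to sketch in Python. This is not the Perl script mentioned above, just an illustration of the idea: it takes (post ID, tag list) pairs, however they were parsed, and emits one `$utw->SaveTags(...)` call per post. The helper names are made up, and the escaping covers only what a double-quoted PHP string needs.

```python
def php_string(value):
    """Quote a tag as a double-quoted PHP string literal."""
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("$", "\\$")         # stop PHP interpolating "$foo"
    )
    return '"' + escaped + '"'


def save_tags_line(post_id, tags):
    """Render one SaveTags call, in the same shape as the generated page."""
    tag_list = ",".join(php_string(tag) for tag in tags)
    return "$utw->SaveTags(%d, array(%s));" % (int(post_id), tag_list)


def render_page(entries):
    """Render the whole PHP page from an iterable of (post_id, tags) pairs."""
    lines = ["<?php", "require_once('admin.php');", "global $utw;"]
    lines.extend(save_tags_line(post_id, tags) for post_id, tags in entries)
    lines.append("?>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_page([(998, ["software", "foo", "swf", "tin", "vnc"])]))
```

Escaping the tags matters even for a one-off hack, since they come back from a third-party service and end up inside executable PHP on the server.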
Once that page was in place, I just visited it in my (already logged in) web browser window, at http://taint.org/wp-admin/addtags.php, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn’t so hackish, I might have put in a little UI text here — but I didn’t.)
The results are very good, I think.
A success: http://taint.org/tag/research has picked up a lot of the interesting older entries where I discussed things like IBM’s Teiresias pattern-recognition algorithm. That’s spot on.
A minor downside: it’s not so good at proper nouns. This entry talks about Silicon Valley and geographical insularity, and mentions “Silicon Valley” prominently; one or both of those words would seem to be a good thing to tag with, but it missed them.
Still, that’s a minor issue — the tags it has suggested are generally very appropriate and useful.
Next, I need to find a way to auto-generate titles for the really old entries ;)