Justin's Linklog

Hacky hack hack.

Ever since I enabled tags on taint.org, I've been mildly annoyed by the fact that there were thousands of older entries deprived of their folksonomic chunky goodness. A way to 'retroactively tag' those entries somehow would be cool.

Last week, Leonard posted a link on his linkblog to TagThe.net, a web service which offers a nifty REST API; simply upload a chunk of text, and it'll suggest a few tags for that text, like this:

echo 'Hi there, I am a tag-suggesting robot' | curl "http://tagthe.net/api/?text=`urlencode`"
<?xml version="1.0" encoding="UTF-8"?>
<memes>
  <meme source="urn:memanage:BAD542FA4948D12800AA92A7FAD420A1" updated="Tue May 30 20:20:39 CEST 2006">
    <dim type="topic">
      <item>robot</item>
    </dim>
    <dim type="language">
      <item>english</item>
    </dim>
  </meme>
</memes>

This looked promising.

Anyway, I've now implemented this -- it worked great! If you're curious, here's details of how I did it. It's a bit hacky, since I'm only going to be doing this once -- and very UNIXy and perlish, because that's how I do these things -- but maybe somebody will find it useful.

How I Retroactively Tagged taint.org

This weblog runs WordPress -- so all the entries are stored in a MySQL database. I took the MySQL dump of the tables, and a quick script figured out that out of somewhere over 1600-ish posts, there were 1352 that came from the pre-tag era, requiring tag inference. A mail to the TagThe.Net team established that they were happy with this level of usage.

I grepped the post IDs and text out of the SQL dump, threw those into a text file using the simple format 'id=NNN text=SQLHTMLSTRING' (where SQLHTMLSTRING was the nicely-escaped HTML text taken directly from the SQL dump), and ran them through this script.

That rendered the first 2k of each of those entries as a URL-encoded string, invoked the REST API with that, got the XML output, and extracted the tags into another UNIXy text-format output file. (It also added one tag for the 'proto-tag' system I used in the early days, where the first word of the entry was a single tag-style category name.)

Next, I ran this script, which in turn took that intermediate output and converted it to valid PHP code, like so:

cat suggestedtags | ./taglist-to-php.pl  > addtags.php
scp addtags.php my.server:taint.org/wp-admin/

The generated page 'addtags.php' looks like this:

<?php
  require_once('admin.php');
  global $utw;
  $utw->SaveTags(997, array("music","all","audio","drm-free",
      "faq","lunchbox","destination","download","premiere","quote"));
  [...]
  $utw->SaveTags(998, array("software","foo","swf","tin","vnc"));
  $utw->SaveTags(999, array("oses","eek","longhorn","ram",
    "winsupersite","windows","amount","base","dog","preview","system"));
?>

Once that page was in place, I just visited it in my (already logged in) web browser window, at http://taint.org/wp-admin/addtags.php, and watched as it gronked for a while. Eventually it stopped, and all those entries had been tagged. (If I wasn't so hackish, I might have put in a little UI text here -- but I didn't.)

The results are very good, I think.

A success: http://taint.org/tag/research has picked up a lot of the interesting older entries where I discussed things like IBM's Tieresias pattern-recognition algorithm. That's spot on.

A minor downside: it's not so good at nouns. This entry talks about Silicon Valley and geographical insularity, and mentions "Silicon Valley" prominently -- one or both of those words would seem to be a good thing to tag with, but it missed them.

Still, that's a minor issue -- the tags it has suggested are generally very appropriate and useful.

Next, I need to find a way to auto-generate titles for the really old entries ;)

Archives

Retroactive Tagging With TagThe.Net