Justin's Linklog

Twitter has this "Trending Topics" sidebar now, which lists the following topics:

Trending Topics

TGIF

National Cleavage

G20

Easter

#grammarsongs

France

#rp09

French

Grand National

Report Says Deal

I'm not going to go into the topic of National Cleavage right now. 'Report Says Deal' is intriguing because it makes no sense until you click through to see:

Real-time results for "Report Says Deal"

dlloydsecret Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb half a minute ago from twitterfeed
dlloydthemlmpro Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb 1 minute ago from twitterfeed
techupdates [PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont 3 minutes ago from twitterfeed
icidade Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9 4 minutes ago from TweetDeck
chrisgraves Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works - PC World http://bitly.com/LhT4 6 minutes ago from twhirl

So I'd say that Twitter's "Trending Topics" uses N-grams of between 1 and 3 "words" for topic identification. In this case, rather than "Report Says Deal", a better topic string would be something like:

Google to Buy Twitter? Report Says Deal is in the Works - PC World

or even:

Google to Buy Twitter? Report Says Deal is in the Works - PC World http://bitly.com/LhT4
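To see how a window capped at 3 words could end up with a fragment like "Report Says Deal", here is a minimal Python sketch. It is a guess at the general approach, not Twitter's actual code; the names `ngrams` and `count_ngrams` are made up for illustration, and the sample tweets are trimmed from the results above.

```python
from collections import Counter

def ngrams(words, max_n=3):
    """Yield every run of 1..max_n consecutive words as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def count_ngrams(tweets, max_n=3):
    """Count, for each n-gram, the number of tweets it appears in."""
    counts = Counter()
    for text in tweets:
        # set() so a phrase repeated inside one tweet is counted once
        counts.update(set(ngrams(text.split(), max_n)))
    return counts

tweets = [
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "[PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
    "Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works - PC World",
]

counts = count_ngrams(tweets)
print(counts[("Report", "Says", "Deal")])  # appears in all three tweets
print(counts[("Google", "to", "Buy")])     # so does this one
```

Every 3-word slice of the shared headline scores the same here, so a counter like this has no way to know that they are all pieces of one longer phrase. Whichever slice gets picked is more or less arbitrary.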

Funnily enough, this is exactly the issue I ran into while developing this algorithm. The trick at this point is to apply a variant of the BLAST pattern-discovery algorithm: expand each pattern sideways for as long as it still matches the same subset of the corpus, and stop when it's maximal.
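The "expand sideways while the match set stays the same" idea can be sketched in a few lines of Python. This is a simplified illustration only: it is not BLAST itself and not a port of the SpamAssassin code, it works on whitespace-separated words rather than raw characters, and the helper names (`find`, `matching_docs`, `expand`) are hypothetical.

```python
def find(words, pattern):
    """Return every start index at which pattern occurs in words."""
    n = len(pattern)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == pattern]

def matching_docs(docs, pattern):
    """Return the set of indexes of documents that contain pattern."""
    return {d for d, words in enumerate(docs) if find(words, pattern)}

def expand(docs, seed):
    """Grow seed one word at a time, left or right, while it still
    matches exactly the same documents as the original seed."""
    pattern = list(seed)
    target = matching_docs(docs, pattern)
    if not target:
        return pattern
    # Any valid longer pattern must occur in every matching document,
    # so one of them is enough to propose candidate extensions.
    ref = docs[min(target)]
    grew = True
    while grew:
        grew = False
        candidates = []
        for i in find(ref, pattern):
            if i > 0:
                candidates.append([ref[i - 1]] + pattern)
            end = i + len(pattern)
            if end < len(ref):
                candidates.append(pattern + [ref[end]])
        for cand in candidates:
            if matching_docs(docs, cand) == target:
                pattern = cand
                grew = True
                break
    return pattern

docs = [t.split() for t in [
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "[PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
    "Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works - PC World http://bitly.com/LhT4",
]]

topic = " ".join(expand(docs, ["Report", "Says", "Deal"]))
print(topic)  # Google to Buy Twitter? Report Says Deal is in the Works
```

On these three tweets the expansion stops at the shared headline, because the "- PC World" suffix and the short URLs each appear in only some of the tweets, so adding them would shrink the match set.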

Twitter folks, if you can read Perl, "assemble_regexps()" in seek-phrases-in-log in SpamAssassin SVN does this pretty nicely and reasonably efficiently, and is licensed under the Apache License 2.0. ;)
