Twitter has this "Trending Topics" sidebar now, which lists the following topics:
Trending Topics
- TGIF
- National Cleavage
- G20
- Easter
- #grammarsongs
- France
- #rp09
- French
- Grand National
- Report Says Deal
I’m not going to go into the topic of National Cleavage right now. ‘Report Says Deal’ is intriguing, though, because it makes no sense until you click through to see:
Real-time results for “Report Says Deal”
- dlloydsecret Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb
- dlloydthemlmpro Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb
- techupdates [PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont
- icidade Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9
- chrisgraves Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works – PC World http://bitly.com/LhT4
So I’d say that Twitter’s "Trending Topics" uses N-grams of between 1 and 3 "words" for topic identification, which would explain why a headline gets cut down to a three-word fragment. In this case, rather than "Report Says Deal", a better topic string would be something like:
Google to Buy Twitter? Report Says Deal is in the Works – PC World
or even:
Google to Buy Twitter? Report Says Deal is in the Works – PC World http://bitly.com/LhT4
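To make that guess concrete, here is a rough Python sketch of how counting 1- to 3-word N-grams across tweets would surface a fragment like "Report Says Deal". This is purely my assumption about the general approach, not Twitter's actual code; the function names, the whitespace tokenisation, and the count-once-per-tweet rule are all made up for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every N-gram of 1..max_n consecutive tokens as a string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_ngrams(texts, max_n=3):
    """Count how many texts contain each N-gram (hypothetical scoring)."""
    counts = Counter()
    for text in texts:
        # set() so that an N-gram is counted at most once per tweet
        counts.update(set(ngrams(text.split(), max_n)))
    return counts


tweets = [
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "[PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
    "Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9",
]
counts = count_ngrams(tweets)
print(counts["Report Says Deal"])
```

With a 3-word cap, the full headline can never show up as a key in `counts`, only fragments of it can.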
Funnily enough, this is exactly the issue I ran into while developing this algorithm. The trick at this point is to apply a variant of the BLAST pattern-discovery algorithm: expand each pattern sideways for as long as it still matches the same subset of the corpus, and stop once it is maximal.
Twitter folks, if you can read Perl, "assemble_regexps()" in seek-phrases-in-log in SpamAssassin SVN does this pretty nicely, and reasonably efficiently, and is licensed under the ASL 2.0. ;)
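For those who'd rather not read Perl, the sideways-expansion idea described above can be sketched in a few lines of Python. To be clear, this is not a port of "assemble_regexps()" and not BLAST proper, just a simplified illustration under my own assumptions: tweets are split on whitespace with trailing periods stripped, and `find_all`, `matching_docs` and `expand_phrase` are hypothetical names.

```python
def find_all(tokens, phrase):
    """Return every start index at which phrase occurs in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def matching_docs(corpus, phrase):
    """Return the set of document indices that contain phrase."""
    return {d for d, tokens in enumerate(corpus) if find_all(tokens, phrase)}


def expand_phrase(corpus, seed):
    """Grow seed left and right while it still matches the same documents."""
    phrase = list(seed)
    target = matching_docs(corpus, phrase)
    if not target:
        return phrase
    # Any valid extension must also occur in this document, so it is
    # enough to take candidate neighbour tokens from it alone.
    first = corpus[min(target)]
    grew = True
    while grew:
        grew = False
        for side in ("left", "right"):
            candidates = set()
            for i in find_all(first, phrase):
                if side == "left" and i > 0:
                    candidates.add(first[i - 1])
                if side == "right" and i + len(phrase) < len(first):
                    candidates.add(first[i + len(phrase)])
            for tok in sorted(candidates):
                longer = [tok] + phrase if side == "left" else phrase + [tok]
                # Keep the extension only if no matching document is lost.
                if matching_docs(corpus, longer) == target:
                    phrase = longer
                    grew = True
                    break
    return phrase


tweets = [
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "[PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
    "Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9",
    "Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? "
    "Report Says Deal is in the Works – PC World http://bitly.com/LhT4",
]
# Assumed tokenisation: split on whitespace, drop trailing periods.
corpus = [[w.rstrip(".") for w in t.split()] for t in tweets]
topic = " ".join(expand_phrase(corpus, ["Report", "Says", "Deal"]))
print(topic)
```

On these five tweets the seed grows out to "Google to Buy Twitter? Report Says Deal is in the Works" and then stops, because the next token on either side (a URL, "[PCWrld]", "@ays:") only appears in some of the tweets. It rescans the corpus on every step, so it is fine for a demo and nothing more.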