“Report Says Deal”

Twitter has this “Trending Topics” sidebar now, which lists the following topics:

Trending Topics

  • TGIF
  • National Cleavage
  • G20
  • Easter
  • #grammarsongs
  • France
  • #rp09
  • French
  • Grand National
  • Report Says Deal

I’m not going to go into the topic of National Cleavage right now. “Report Says Deal” is intriguing because it makes no sense until you click through and see:

Real-time results for “Report Says Deal”

  1. dlloydsecret Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb half a minute ago from twitterfeed
  2. dlloydthemlmpro Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb 1 minute ago from twitterfeed
  3. techupdates [PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont 3 minutes ago from twitterfeed
  4. icidade Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9 4 minutes ago from TweetDeck
  5. chrisgraves Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? Report Says Deal is in the Works – PC World http://bitly.com/LhT4 6 minutes ago from twhirl

So I’d say that Twitter’s “Trending Topics” uses n-grams of one to three “words” for topic identification. In this case, rather than “Report Says Deal”, a better topic string would be something like:

Google to Buy Twitter? Report Says Deal is in the Works – PC World

or even:

Google to Buy Twitter? Report Says Deal is in the Works – PC World http://bitly.com/LhT4

Funnily enough, this is exactly the issue I ran into while developing this algorithm. The trick at this point is to apply a variant of the BLAST pattern-discovery algorithm: expand the patterns sideways, for as long as they still match the same subsets of the corpus, until they’re maximal.
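To make the "expand sideways" idea concrete, here is a rough Python sketch. It is not the SpamAssassin code and not Twitter's algorithm, just a minimal illustration under some assumptions of mine: tokens are whitespace-separated with trailing periods and commas stripped, only the first occurrence of the seed in each tweet is considered, and the names `tokenize`, `find_occurrences` and `expand_phrase` are made up for this example. The sample tweets are adapted from the search results above.

```python
def tokenize(text):
    # Assumption: whitespace tokens, with trailing "." and "," stripped.
    return [t.rstrip(".,") for t in text.split() if t.rstrip(".,")]


def find_occurrences(docs, phrase):
    """Return (doc_index, start) for the first match of phrase in each doc."""
    n = len(phrase)
    hits = []
    for d, tokens in enumerate(docs):
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase:
                hits.append((d, i))
                break
    return hits


def expand_phrase(docs, seed):
    """Grow seed left and right while every matching doc agrees on the
    neighbouring token, so the matched subset of docs never changes."""
    phrase = list(seed)
    hits = find_occurrences(docs, phrase)
    if not hits:
        return phrase
    while True:
        grew = False
        # Token just before the match in each doc (None at the doc start).
        left = {docs[d][i - 1] if i > 0 else None for d, i in hits}
        if len(left) == 1 and None not in left:
            phrase.insert(0, left.pop())
            hits = [(d, i - 1) for d, i in hits]
            grew = True
        # Token just after the match in each doc (None at the doc end).
        right = {
            docs[d][i + len(phrase)] if i + len(phrase) < len(docs[d]) else None
            for d, i in hits
        }
        if len(right) == 1 and None not in right:
            phrase.append(right.pop())
            grew = True
        if not grew:
            return phrase  # maximal: no side can grow any further


TWEETS = [
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "Google to Buy Twitter? Report Says Deal is in the Works http://bit.ly/Wt1Wb",
    "[PCWrld] Google to Buy Twitter? Report Says Deal is in the Works http://tinyurl.com/c63ont",
    "Google to Buy Twitter? Report Says Deal is in the Works. http://is.gd/quu9",
    "Retweeting @CinWomenBlogger: Retweeting @ays: Google to Buy Twitter? "
    "Report Says Deal is in the Works - PC World http://bitly.com/LhT4",
]

docs = [tokenize(t) for t in TWEETS]
topic = " ".join(expand_phrase(docs, ["Report", "Says", "Deal"]))
print(topic)
# Google to Buy Twitter? Report Says Deal is in the Works
```

The expansion stops at the point where the tweets diverge (different short URLs on the right, retweet prefixes on the left), which is roughly the topic string you would want to show.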

Twitter folks, if you can read Perl, “assemble_regexps()” in seek-phrases-in-log in SpamAssassin SVN does this pretty nicely, and reasonably efficiently, and is licensed under the ASL 2.0. ;)