How To Scan Thousands of Tweets Without Tears

Over the weekend I wrote a post about Twitter and cruft. While I love Twitter, I hate the cruft that I have to wade through when trying to follow Twitter lists for information.

After I wrote the post I found a JavaScript solution and included that in my writeup. I also heard from Hanan Cohen, who’s put together his own PHP solution. But the problem was really bothering me, so I spent this past weekend trying to figure out a way to get a single overview of the several Twitter lists I follow, with as much cruft removed as possible, so that I can easily scan through it and find the good stuff. And I think I’ve made a good start on a solution. Here’s what I did:

1) First, I ListiMonkey’d. ListiMonkey is a service that will e-mail you the contents of any Twitter list you specify. You can set how often you want to receive the tweets, and you’ll get up to 100 tweets per e-mail. You can do some preliminary filtering through ListiMonkey, though I found there was a limit to how many terms I could filter. Every e-mail I got (maybe 300-400 a day) went into a text file.

2) Next, I TextPipe’d. I took my one day’s worth of tweets from Twitter lists (a big text file) and fed it to a software program called TextPipe, which describes itself as an “industrial strength text transformation, conversion, cleansing and extraction workbench.” Using TextPipe I stripped out all the HTML, removed all duplicate lines (every tweet is on its own line), removed all lines that had cruft I didn’t want (filtering out two or three dozen keywords) and then output it to a nice, clean, much smaller text file.

3) Then, I TEA’d. Using the TEA Text Editor, I scanned through the list of remaining tweets, deleting the tweet-lines I didn’t want to review further. After I was done with that I used TEA’s HTML tools to convert the list of leftover, “to be looked at further” tweets into an HTML document.

4) At this point, I Converted. TEA can turn the list into an HTML file, but the problem remains that the links are unclickable. So my last step was to go to David Weinberger’s Convert URL’s to Hyperlinks utility and turn my basic HTML file into a basic HTML file with clickable URLs.

5) Finally, I Firefox’d. I opened this HTML file in Firefox and quickly opened and scanned through the tweets I had put aside for further review.

Going through these steps is going to let me review a lot of content from a lot of lists and save me a tremendous amount of time.
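
For anyone who would rather script the cleanup than run it through separate tools, here is a rough Python sketch of steps 2 and 4: strip the HTML, drop exact duplicate lines, drop lines containing cruft keywords, and write out a basic HTML page with clickable URLs. This is not how TextPipe or the Convert URL’s to Hyperlinks utility work internally; the function names, the keyword list, and the URL pattern are all assumptions made up for illustration.

```python
import html
import re

# Hypothetical cruft keywords; the actual filter terms are not listed in this post.
CRUFT_KEYWORDS = ["giveaway", "horoscope", "#ff"]

TAG_RE = re.compile(r"<[^>]+>")               # crude HTML tag matcher
URL_RE = re.compile(r"https?://[^\s<>\"']+")  # simple URL pattern, not exhaustive


def clean_tweets(lines, keywords=CRUFT_KEYWORDS):
    """Strip HTML, then drop blank lines, exact duplicates, and cruft lines."""
    lowered = [k.lower() for k in keywords]
    seen = set()
    kept = []
    for line in lines:
        # Remove tags first, then decode entities such as &amp;
        text = html.unescape(TAG_RE.sub("", line)).strip()
        if not text or text in seen:
            continue
        seen.add(text)
        if any(k in text.lower() for k in lowered):
            continue
        kept.append(text)
    return kept


def to_html(tweets):
    """Build a bare-bones HTML page: one tweet per paragraph, URLs clickable."""
    paragraphs = []
    for tweet in tweets:
        # Escape the text first so stray < or & cannot break the page,
        # then wrap each URL in an anchor tag.
        escaped = html.escape(tweet, quote=False)
        linked = URL_RE.sub(
            lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
        )
        paragraphs.append("<p>" + linked + "</p>")
    return "<html><body>\n" + "\n".join(paragraphs) + "\n</body></html>"
```

To use it, read the day's text file into a list of lines, pass it through clean_tweets, and write the result of to_html to a file you can open in Firefox. One known rough edge: the URL pattern will swallow punctuation that sits directly after a link, so a tweet ending in "http://example.com." would get the period included in the link.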

A few additional thoughts:

a) I can probably do this in Perl. I know, but I can experiment with and implement filters in TextPipe way faster than I can do it in Perl.

b) It’s not perfect. TextPipe doesn’t truly remove all the duplicates, as the same tweet can be posted three times with three different URLs. To eliminate those I’ll have to do some spadework with regular expressions.

c) TextPipe is expensive. TextPipe Standard is $199. For the amount of time this will save me in trying to keep up with all the tweetstreams that capture my interest, it’ll pay for itself.

d) This solution will tempt me to subscribe to even more Twitter lists. THIS is the problem I’m going to have to watch out for….
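
On point b), the regular-expression spadework for near-duplicates could start from something like the sketch below. It assumes the repeated tweets really do differ only in their URLs (plus maybe capitalization and spacing), so it removes the URLs before comparing. The function name and the URL pattern are hypothetical, and the approach will also merge two genuinely different tweets if their text happens to match once the links are gone.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # simple URL pattern, not exhaustive


def dedupe_ignoring_urls(tweets):
    """Keep the first tweet of each group that matches once URLs are removed."""
    seen = set()
    kept = []
    for tweet in tweets:
        # Comparison key: no URLs, lowercase, runs of whitespace collapsed
        key = " ".join(URL_RE.sub("", tweet).lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(tweet)  # keep the original text, URL included
    return kept
```

Only the comparison key is normalized; the tweet that survives keeps its original wording and link.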
