Blog Archives

How To Scan Thousands of Tweets Without Tears

Over the weekend I wrote a post about Twitter and cruft. While I love Twitter, I hated the cruft that I had to wade through when trying to follow Twitter lists for information.

After I wrote the post I found a JavaScript solution and included that in my writeup. I also heard from Hanan Cohen, who’s put together his own PHP solution. But the problem was really bothering me, so I spent this past weekend trying to figure out a way to get a single overview of the several Twitter lists I follow, with as much cruft removed as possible, so that I can easily scan through it and find the good stuff. And I think I’ve made a good start on a solution. Here’s what I did:

1) First, I ListiMonkey’d. ListiMonkey is a service that will e-mail you the contents of any Twitter list you specify. You can set how often you want to receive the tweets, and you’ll get up to 100 tweets per e-mail. You can do some preliminary filtering through ListiMonkey, though I found there was a limit to how many terms I could filter. Every e-mail I got (maybe 300-400 a day) went into a text file.

2) Next, I TextPipe’d. I took my one day’s worth of tweets from Twitter lists (a big text file) and fed it to a software program called TextPipe, which describes itself as an “industrial strength text transformation, conversion, cleansing and extraction workbench.” Using TextPipe I stripped out all the HTML, removed all duplicate lines (every tweet is on its own line), removed all lines that had cruft I didn’t want (filtering out two or three dozen keywords) and then output it to a nice, clean, much smaller text file.

3) Then, I TEA’d. Using the TEA Text Editor, I scanned through the list of remaining Tweets, deleting the tweet-lines I didn’t want to review further. After I was done with that I used TEA’s HTML tools to convert the list of leftover, “to be looked at further” tweets into an HTML document.

4) At this point, I Converted. TEA can turn the list into an HTML file, but the problem remains that the links are unclickable. So my last step was to go to David Weinberger’s Convert URL’s to Hyperlinks utility and turn my basic HTML file into a basic HTML file with clickable URLs.

5) Finally, I Firefox’d. I opened this HTML file in Firefox and quickly opened and scanned through the tweets I had put aside for further review.

Going through these steps is going to let me review a lot of content from a lot of lists and save me a tremendous amount of time.
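For readers without TextPipe or TEA, the cleanup steps above can be roughed out in a few lines of Python. This is only a sketch under assumptions, not the workflow described in this post: the cruft keywords are placeholders (the real list of two or three dozen isn’t given here), the function names are made up for illustration, and it assumes one tweet per line as in the text file above.

```python
import html
import re

# Placeholder cruft keywords (assumption): the real list is not given in the post.
CRUFT_TERMS = ["4sq", "gowalla", "twitpic"]

TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag stripper
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def clean_tweets(raw_text, cruft_terms=CRUFT_TERMS):
    """Strip HTML, drop exact duplicate lines, and drop lines containing cruft terms."""
    seen = set()
    kept = []
    for line in raw_text.splitlines():
        text = html.unescape(TAG_RE.sub("", line)).strip()
        if not text or text in seen:
            continue
        seen.add(text)
        lowered = text.lower()
        if any(term in lowered for term in cruft_terms):
            continue
        kept.append(text)
    return kept


def to_html(tweets):
    """Build a bare-bones HTML page: one tweet per paragraph, URLs made clickable."""
    paragraphs = []
    for tweet in tweets:
        escaped = html.escape(tweet, quote=False)
        linked = URL_RE.sub(
            lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
        )
        paragraphs.append("<p>" + linked + "</p>")
    return "<html><body>\n" + "\n".join(paragraphs) + "\n</body></html>"
```

The manual pass (reading the list and deleting lines by hand) is deliberately left out; it would happen between clean_tweets and to_html.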

A few additional thoughts:

a) I can probably do this in Perl. I know, but I can experiment with and implement filters in TextPipe way faster than I can do it in Perl.

b) It’s not perfect. TextPipe doesn’t truly remove all the duplicates, as the same tweet can be posted three times with three different bit.ly URLs. To eliminate those I’ll have to do some spadework with regular expressions.

c) TextPipe is expensive. TextPipe Standard is $199. For the amount of time this will save me in trying to keep up with all the tweetstreams that capture my interest, it’ll pay for itself.

d) This solution will tempt me to subscribe to even more Twitter lists. THIS is the problem I’m going to have to watch out for….
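On point b), one possible approach to the repeated-tweet problem is to compare tweets with their URLs masked out, so that the same text posted with three different bit.ly links counts as one tweet. The sketch below is an assumption about how that spadework might look, not something taken from TextPipe; the function name is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def dedupe_ignoring_urls(tweets):
    """Keep the first tweet for each distinct text once URLs are masked out."""
    seen = set()
    unique = []
    for tweet in tweets:
        # Remove URLs, lowercase, and collapse whitespace to build a comparison key.
        key = " ".join(URL_RE.sub("", tweet).lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(tweet)
    return unique
```

Two caveats: tweets that consist of nothing but a URL all collapse to the same empty key, and a retweet with an added "RT @someone:" prefix would still count as a different tweet.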

Google Reader’s New Page-Monitoring Feature — How’s it Working?

A couple weeks ago I covered Google’s new feature that allows you to monitor pages even when they don’t have RSS feeds. A few days ago reader LP e-mailed me and asked about the new feature, “Did it work?” And I realized I had completely forgotten to write a follow-up post. So yeah, about Google Reader’s new page-monitoring feature….

The first great thing about this feature is that it taught me how many Web pages do in fact have RSS feeds. I went to several places meaning to monitor the page for changes, only to discover that RSS feeds were available now. Yay!

I did find some places that did not have RSS feeds, though; the best example is probably the Twitter lists that use Tweets from ResearchBuzz. The URL for the list is http://twitter.com/ResearchBuzz/lists/memberships but I didn’t know of any way to track when new lists were added to this page. So that was my test case for Google Reader.

Every change to the page is a new entry in Google Reader. The screenshot above shows an example of an entry. There’s no context on the page, and if I weren’t familiar with the page content to start with, the entry wouldn’t be useful (in other words, I wouldn’t share it).

I also tried the Google Reader with http://www.ted.com/pages/view?id=348, which is a list of upcoming TEDx events all over the world. Again, I didn’t get any context, just the line that changed.

One Google Reader monitor I set up failed. I was trying to monitor a particular business in Google Maps because I wanted to see what kind of reviews they got. I think this might be my fault, however. I looked up the business in Google, and then used the extremely-long-and-awkward URL supplied by Google as my monitoring URL. Google never got an update for that page, and complained that the page didn’t exist. I’m going to try it again using the link supplied by Google on the business’ page.

For me, the gold standard for page monitoring remains WebSite-Watcher, a client-side application available at http://www.aignes.com/. However, it is for Windows only. Until it’s available for my operating system, I think I’ll keep using this new feature of Google Reader.

Google Reader Lets You Monitor Page Changes Without RSS

It’s not as nifty as a cell phone, or as amazing as street views of businesses all over the world, but to me it is big news, really big news. Google announced yesterday that Google Reader can now be used to monitor Web pages for changes, whether they have RSS feeds or not.

Ten years after I started using RSS, it’s pretty prevalent but not universal. Google’s announcement means it’s going to be a lot easier to follow those random pages that don’t have RSS feeds for update information.

Are you already using the Google Reader for RSS feeds? Adding non-RSS content is easy. Just click on the “Add a Subscription” button and you’ll get a form into which you can paste an RSS feed URL or a regular HTML page URL. Google will ask you to confirm that you would like to create a feed to monitor based on that page.

Now, HTML pages are not RSS feeds. Their information is harder to isolate and delineate. So while the idea is that Google is going to “provide short snippets of page changes,” it’s not clear what those snippets are going to look like. Is Google going to get hung up on a date change or counter change? (This has been a problem in the past with software like WebSite-Watcher.) Are the snippets going to be meaningful?
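To make the date-and-counter worry concrete, here is a small sketch of one way a page monitor could avoid that trap: compare two snapshots of a page’s text and ignore lines that differ only in their digits. This is purely an illustration; it is not how Google Reader or WebSite-Watcher actually work, and the function name is made up.

```python
import re

DIGITS_RE = re.compile(r"\d+")


def changed_lines(old_text, new_text):
    """Return lines that are new in new_text, ignoring digit-only differences."""
    def mask(line):
        # Replace every run of digits with a placeholder so that
        # "Visitors: 1041" and "Visitors: 1077" look identical.
        return DIGITS_RE.sub("#", line.strip())

    old_masked = {mask(line) for line in old_text.splitlines()}
    return [
        line.strip()
        for line in new_text.splitlines()
        if line.strip() and mask(line) not in old_masked
    ]
```

The trade-off is obvious once you see it: a real change that is purely numeric (an event moving from the 5th to the 12th, say) gets ignored too, and dates written with month names would still trigger a false alarm.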

I’ve added some pages to Google Reader and will revisit them in a week or so to see what kind of snippets I’m getting as results.

Monitor Twitter Lists with ListiMonkey

I have actually been using ListiMonkey for a few weeks, ever since I heard about it from Steve Rubel. First I loved it, then I hated it. After several e-mail conversations with the developers and some tweaks they’ve made to the tool I love it again. If you’re at all interested in trapping information via Twitter, I think you’ll love it too.

Have you ever tried to monitor Twitter via its search-results-as-RSS-feeds? It’s tough. For the kinds of keywords I’ve tried to use, I got a lot of spam. It got so I couldn’t use the feeds; they were too spammy.

Enter ListiMonkey at http://listimonkey.com/. ListiMonkey allows you to specify a Twitter list, enter the keywords for which you want to monitor that list, and then specify an e-mail address to which you want to get the results, and how often you want to get the results (hourly or daily). (It’s possible to follow a list and get all the tweets generated by not specifying any keywords, but I don’t recommend that; you’ll get lots of e-mail with lots of tweets unless you choose your lists very very carefully.) That’s it. There’s no registration involved. You WILL have to confirm your e-mail for each alert, of course.

Now, if you monitor a Twitter list, you’re obviously not getting as much as you’d get if you were monitoring the entire Twitter stream. On the other hand, if someone gets added to a Twitter list it’s because someone ELSE thinks they post stuff that’s worth reading. And you’ll cut down the spam level to almost nothing. You’re getting useful results, in other words.

ListiMonkey does have about 250 Twitter lists available, but I think you’ll have more luck finding lists using the TweetDeck Directory at http://tweetdeck.com/#directory. Once you’ve found a list you want to follow, the obvious next question is what kind of keywords do you want to monitor?

This is what was tough for me in figuring out how to use ListiMonkey, and it’s one thing that’s changed a lot thanks to the developers. I found a couple of lists where I just wanted to find out what kind of links people were putting out there. I didn’t necessarily want tweets without links. So my first keyword monitor on ListiMonkey was just http.

Naturally this found all tweets that had a URL in them, and none without. But it also found retweets, checkins using FourSquare/Gowalla, pictures people were posting, etc. I didn’t want any of that. (And making sure I didn’t get that was important, for two reasons: one, I didn’t want to get drowned in e-mail alerts, and two, ListiMonkey limits its monitors to 100 tweets per e-mail. If I didn’t filter as closely as I could I would miss stuff.)

Initially ListiMonkey did not allow me to do complex queries like that, where I specified one keyword that I was looking for and a bunch of keywords that I wasn’t. But that has been added. So I did a lot of experiments where I looked for links to resources but not to anything extraneous, and ended up with a ListiMonkey query with several keywords:

http -4sq -gowal -rt -twitpic

That gets me e-mails from ListiMonkey that are full of resource-y link goodness.
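If you want to try out a query like that against tweets you already have, the include/exclude logic can be sketched in Python. ListiMonkey’s actual matching rules aren’t documented in this post, so the rule used here is an assumption: a term matches when any word in the tweet starts with it, which lets gowal catch gowalla without rt knocking out words like "report". The function names are hypothetical.

```python
import re


def parse_query(query):
    """Split a query into required terms and excluded (minus-prefixed) terms."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            required.append(token)
    return required, excluded


def term_hits(term, tweet):
    """Assumed rule: a term matches if any alphanumeric word starts with it."""
    words = re.findall(r"[a-z0-9]+", tweet.lower())
    return any(word.startswith(term) for word in words)


def matches(tweet, query):
    """True if the tweet hits every required term and no excluded term."""
    required, excluded = parse_query(query)
    return (all(term_hits(t, tweet) for t in required)
            and not any(term_hits(t, tweet) for t in excluded))
```

Running a day’s worth of saved tweets through matches(tweet, "http -4sq -gowal -rt -twitpic") is a cheap way to experiment with exclusion terms before changing a live alert.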

When you get an e-mail from ListiMonkey, it’ll look like this:

You’ll get the tweet, of course, with the author and avatar, timestamp, and option to retweet or reply to the tweet (of course you’ll have to be logged in to your Twitter account to do that.) The e-mail also has links to edit your alert or delete your alert if it’s not working out for you.

One thing you should know: ListiMonkey is tracking the clicks on the links in its e-mail. You might think you’re clicking on a bit.ly link when actually you’re clicking on http://listimonkey.com/link/track?alert_id=1378&url=http://bit.ly/5gGhqm . Just a heads-up if you’re concerned about link tracking (I’m not.) If it really bothers you, you can always highlight the link in the tweet and then copy/paste it to your browser.
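If you’d rather not copy and paste, the original link can be pulled back out of a tracking URL shaped like the one above. This is a minimal sketch that assumes the destination always sits in a url= query parameter, as it does in that example; the function name is made up.

```python
from urllib.parse import urlsplit, parse_qs


def original_link(tracked_url):
    """Return the value of the url= parameter, or the input unchanged if absent."""
    query = urlsplit(tracked_url).query
    targets = parse_qs(query).get("url")
    return targets[0] if targets else tracked_url


tracked = "http://listimonkey.com/link/track?alert_id=1378&url=http://bit.ly/5gGhqm"
print(original_link(tracked))  # prints http://bit.ly/5gGhqm
```

One limitation: if the destination URL itself contains an unescaped & character, everything after it would be read as a separate parameter and the recovered link would be cut short.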

You can learn more about ListiMonkey via its FAQ. ListiMonkey was a small shop project, and while there’s no charge for the service the developer is accepting donations. I think they’ve put together a great tool here; if you agree with me how about slipping them a few bucks via the Donate tool on the FAQ page?

Serious Information Trapping on Twitter with RowFeeder

Do you need to do some serious information trapping on Twitter? Got some keywords you want to monitor and you don’t want to miss a thing? Check out this nifty application I heard about from Ed — RowFeeder. (Thanks Ed!) RowFeeder’s not cheap, but if you want to quickly gather materials from a Twitter flow and get them in a format that you can easily manipulate, it looks like a heck of a tool.

RowFeeder’s at http://rowfeeder.com/. Here’s how it works: you specify a term or hashtag you want to track. Then you pay — well, you’re supposed to pay but the pay button didn’t work when I tried it; instead I got an e-mail address to contact for making payments. This is where the ain’t cheap part comes in — it’s $2.49 to monitor a term/tag for up to 48 hours. (This would very quickly make me very broke.)

RowFeeder monitors the tweetstream and fetches tweets that match your tags, using them to populate a Google Spreadsheet like the one you see in the screenshot. Information is broken out into columns including username, Tweet, homepage, location, and date.

This is too expensive for me to use on a regular basis but I can easily see how a PR firm or company who wants to track comments about a release could get a lot of use out of RowFeeder. I’d have a little concern about monitoring the entire tweetstream — there’s a lot of spam out there. It’d be nice to monitor just specified lists.

Cool Tool!

Tracking the News Cycle in Blogs and “Traditional” News Media

This is a little far afield of search engines but humor me for a minute. Cornell University had an interesting story about the life cycle of a news story on blogs and on “traditional” media.

Three researchers tracked 1.6 million online news sites, both traditional media and blogs, over a three-month period leading up to the last presidential election. 90 million articles ended up in the analysis. According to their research, traditional media has a pattern of stories rising to prominence slowly then dying quickly, while in blogs stories would become popular quickly and then hang around longer. Of course in both cases stories eventually “cycle out” and new news comes in.

I care about this because it gives me information that might allow me to refine my strategy as an information trapper. There are of course ongoing topics that I pay attention to all the time. On the other hand there are topics that are current-event-based or based on research that I’m doing that I only want to follow temporarily.

Knowing that mainstream news sources ramp up stories slowly and then drop them off over a few days might make me decide that I only want to run temporary searches for a week or so before discarding the search. Or I might decide that I’ll only give a temporary search a week before I reevaluate my search results and decide to use a different set of keywords, or a different focus. On the other hand, I may decide that the life cycle of a story in a blog might mean that I should extend my search for far longer than I would otherwise.

You can get additional information on the research at http://memetracker.org/supp/. I don’t do a lot of temporary trapping, but I do some, and I’m looking at this research as a first step to come up with some standard approaches. At the very least it’s given me something to think about.

Google News, Getting on My Nerves a Bit

Many many moons ago, I set up a news alert on Google News to send me a ping whenever a news story mentioned ResearchBuzz. I always liked to see where news and mentions ended up. As you might imagine, that Google Alert went very silent last year when ResearchBuzz lapsed into a semi-coma.

So imagine my surprise when I started getting a fair number of Google Alerts within the last few weeks for ResearchBuzz. I’ve been posting more, but not enough that would warrant an avalanche of news mentions. And the summaries were mostly business-related stories. Where were all these new mentions coming from?

I finally went over to Google News (http://news.google.com) and ran a few searches. It didn’t take long to figure out that Google News was being a little more flexible with my searches than I intended. Fortunately there’s an easy remedy.

At the moment (a while after I first noticed this problem) there are three results for the query ResearchBuzz on Google News. Only one of them — a mention by the Search Engine Journal — is about this site. (Thanks, Search Engine Journal!) The other two are about business. Looking at them both I see they both mention publicly-traded companies with a set of research links that includes this: Research, Stock Buzz.

So Google News is taking my ResearchBuzz query and turning it into something like “Research * Buzz”. And while I appreciate some flexibility in my searches (especially as the data pool for news is much smaller than that for the Web), it doesn’t find me information I need.

I took a look at the Google News preferences, but while you can change whether or not you get suggestions, you can’t really change whether or not Google alters your search term. You CAN make sure that Google News searches for your query exactly as you enter it by using a + in front of your search term. Enclosing it in quotes will also do the trick.

This particular Google search quirk isn’t NEARLY as irritating as searching for words, looking in a page cache, and then discovering they’re not there. But it is leading me to think about “best practices” for setting information traps with Google Alerts. Quotes all the time? Plus marks? More test searches so I can make sure ahead of time that I’m getting only the results I want? I haven’t decided yet.
