News

Failed Experiment: MediaWiki as RSS Feed Aggregator / Archivist

Now that the book has gone to press and I’m not writing every spare second, I have some free time. Having free time on a weekend is an astounding concept that I should try more often. By Saturday evening I had completed everything on my short to-do list (do laundry, clean up the kitchen, write up some ResearchBuzz, win a domination victory in Civ IV Warlords) and decided it was time to do some playing-with-the-intention-of-gathering-knowledge-while-trying-really-weird-stuff — in other words, mad scientisting.

In the last few weeks I’ve been getting steadily more obsessed with the possibilities of a wiki at my company for purposes of knowledge management. But the more I thought about it (and dreamed about it, and drew all over the giant whiteboards at work), the more I decided that a plain knowledge-management wiki would be fine, but even better would be a wiki that not only managed company knowledge (that is, held it in one place in a nicely organized way) but also generated it via customized external RSS feeds.

In other words, if WidgetCo wanted to know what bloggers were saying about them, they could check a wiki page with an incorporated feed from Technorati or another blog search engine. If they wanted to get a list of the latest press releases from their competitor, they could again check the wiki with a keyword-based RSS feed from Google News, and so on. Not only would it be handy to have all that information in one place, but with MediaWiki and other wikis you could discuss the content, make notes, etc.

As I continued to ponder, I wondered if maybe you couldn’t go a couple of steps further and use the wiki as an archivist, capturing the output of an RSS feed over time. There are already services that do this for a regular RSS feed (Feedcatch, for example), but I thought it would be better to have the archived feed entries within a wiki for purposes of discussion, easy review, etc. (And yeah, I know that the contents of a feed would often go stale, but in some cases they wouldn’t. For example, say you had a Feed43 feed that captured the number of Bloglines subscribers to your feed. Having that number archived would let you watch how it changed over time.)

So I decided to see if you could set up a wiki with an RSS feed reader that archived the feed contents over time. I couldn’t manage it. But I’m going to write up here what I did, with the thought that a) somebody might find it interesting, and b) somebody might e-mail me and go, “Dummy, all you have to do is x your y and then z.”

By the way, expect lots of complete ignorance. I know very little about wikis. There were a couple of times when I thought, “Hmm, it seems to me that it’d be stupid to try this because x will happen,” but I tried it anyway because I wasn’t sure. So please keep the snickering down to a dull roar. Thank you.

Step 1. MediaWiki Installation

Of course, to kick things off, I installed MediaWiki. Not much to say here because my host offers it as a one-click installation. You can see the installed wiki at http://www.researchbuzz.org/wiki/. I call it POCwiki, Proof of Concept wiki, since I’m experimenting with it. I didn’t want folks playing with it, so I turned off new account creation and anonymous editing. So it’s a display model only. Once I had it up and running, I needed a way to put feeds on pages without driving myself insane. That was accomplished with SimplePie.

Step 2. Introducing SimplePie

SimplePie is an RSS parser with a plug-in for MediaWiki. I have a serious nerdcrush on SimplePie. It was incredibly easy to install, with excellent directions. And the syntax for adding a feed to a page couldn’t be simpler:

<feed>http://www.example.com/example.rss</feed>

There are some switches you can add to that, but that’s the default. And it worked the first time. Whoopee!

I made up a few pages with RSS feeds from Google News, including one I knew would generate a lot of entries (a Google News search for “the”) and pondered for a minute. I thought, “This is where I’m probably being stupid. The wiki is not going to recognize the new entry as new content and make a history entry for it, because the page itself, that single line of <feed>...</feed> markup, is not going to change. But let’s try it anyway.”

Once the feeds were set up I went to bed.

Step 3. SimplePie Troubles

When I got up this morning I checked the wiki. Sure enough, there were no history changes. Even more astonishing, the feed pages hadn’t changed at all. And you’re not going to tell me that a Google News search for “the” doesn’t change in seven hours. I went to the SimplePie support forum, and as it turned out I wasn’t the only one having this problem. It’s apparently a MediaWiki issue, not a SimplePie issue. After trying a couple of things I solved it by turning caching off in MediaWiki. POCwiki is very small and uninteresting, but the idea of turning the cache off made me uncomfortable. Caches are there for a reason, aren’t they?

Step 4. Two Problems

So now I have two problems.

FIRST PROBLEM — I have to turn the cache off to get the feeds to update on the wiki in a timely manner.

SECOND PROBLEM — As I figured, the RSS feed updates are not showing up in the page history, so the RSS entries are not being archived. I have a wiki that’s generating knowledge via updated RSS feeds, but it’s not archiving any of it.

Step 5. The Possibilities

After thinking about it a while I’ve come up with a few possibilities for each of the two problems, but I don’t know if I’m on the right track or not.

FIRST PROBLEM — The pages are getting stuck in a cache because the actual edited page, the one line of <feed>...</feed> markup, is not changing. One possibility would be to edit the page periodically using Pywikipediabot.

Say you had a page with a feed that looks like this:

http://www.researchbuzz.com/researchbuzz.rss

If you put a fake switch on the end of the URL, the feed will still work, like so:

http://www.researchbuzz.com/researchbuzz.rss?botnudge=1

So maybe you could set up a ‘bot to periodically change the number in the switch to the current Julian date?

http://www.researchbuzz.com/researchbuzz.rss?botnudge=2454064.06458

If you did that every hour (or two hours, or six hours, or whenever, depending on how busy you anticipate the feed to be), the page text would actually change, so you could leave page caching turned on and not worry about it. I know the pybot can edit pages; I don’t know whether it can insert numbers like Julian dates.
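Here’s roughly what I have in mind, sketched with the pywikipedia framework’s wikipedia module. Fair warning: the page title and the botnudge switch are just the examples from above, and the get()/put() calls are my best guess at how the pybot’s API hangs together, so treat this as a starting point rather than something that’s been tested:

# nudge_feed.py -- a rough sketch, not working code. It assumes the
# pywikipedia framework's "wikipedia" module; the page title and the
# botnudge parameter are just the examples from the post above.
import re
import time

import wikipedia  # the pywikipedia framework, not a standard library module

PAGE_TITLE = 'GoogleNewsWii'  # the wiki page that holds the feed tag

def julian_date():
    # The Unix epoch (1970-01-01 00:00 UTC) corresponds to Julian date 2440587.5
    return time.time() / 86400.0 + 2440587.5

def nudge(title):
    site = wikipedia.getSite()
    page = wikipedia.Page(site, title)
    text = page.get()
    # Swap whatever number currently follows botnudge= for the current Julian date
    new_text = re.sub(r'botnudge=[\d.]+', 'botnudge=%.5f' % julian_date(), text)
    if new_text != text:
        page.put(new_text, comment='Nudging the feed URL to defeat the cache')

if __name__ == '__main__':
    nudge(PAGE_TITLE)

Run something like that from cron every hour or so and the page text actually changes, which should be enough to make MediaWiki re-render it. The downside is that every nudge adds another entry to the page history, which could get cluttered fast.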

Another possibility would be to simply not worry about the fact that the caching is turned off.

(Or maybe you could use the pybot’s touch.py script, which I gather makes null edits to pages?)

SECOND PROBLEM — Even if you solve the cache problem you’re still generating only current feeds. You are not saving old feed entries for wikiposterity.

I looked at the pybot and saw that it could edit and change pages, but I didn’t see any options for scraping. If pybot could scrape, it seems to me that you could have it periodically scrape the contents of a page, and then save the scraped contents as a new page. So if I have a page called GoogleNewsWii, pybot could scrape it and either a) save it as GoogleNewsWiiArchive (which would overwrite an existing page and make a history entry) or b) save it as a new page with the Julian date appended, call it GoogleNewsWii2454066.18194. But here we’re bumping up against the limits of my Python knowledge, and I don’t know if the pybot is built to do this kind of thing.
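Just to make the idea concrete, here’s the sort of thing I’m imagining, again only a sketch. It pulls the rendered version of the page (MediaWiki’s action=render spits out just the parsed body, which should include the entries SimplePie filled in) and saves it to a new page with the Julian date tacked onto the name. The index.php path, the page names, and the pywikipedia calls are all my guesses at how the pieces would fit together:

# archive_feed_page.py -- a rough sketch; assumes Python 2, the pywikipedia
# framework's "wikipedia" module, and a wiki living under /wiki/ like POCwiki.
import time
import urllib

import wikipedia  # the pywikipedia framework

WIKI_INDEX = 'http://www.researchbuzz.org/wiki/index.php'  # guessing at the path
PAGE_TITLE = 'GoogleNewsWii'

def julian_date():
    # The Unix epoch corresponds to Julian date 2440587.5
    return time.time() / 86400.0 + 2440587.5

def archive(title):
    # action=render returns just the parsed body of the page, so the feed
    # entries SimplePie expanded should be in the HTML that comes back
    url = '%s?title=%s&action=render' % (WIKI_INDEX, urllib.quote(title))
    rendered = urllib.urlopen(url).read()

    site = wikipedia.getSite()
    archive_title = '%s%.5f' % (title, julian_date())
    archive_page = wikipedia.Page(site, archive_title)
    # Dumping raw HTML into a wiki page is crude -- MediaWiki keeps some tags
    # and strips or escapes others -- but it does preserve the feed entries
    archive_page.put(rendered, comment='Archiving a snapshot of the feed page')

if __name__ == '__main__':
    archive(PAGE_TITLE)

Whether raw HTML in a wiki page is good enough, or whether you’d want to boil it down to plain text first, I don’t know; and it still leaves open the question of whether the pybot is really meant to be used this way.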

Conclusion

Though what I wanted to happen didn’t happen, I’m still glad I tried this. I learned a lot about MediaWiki and a lot about SimplePie. Even if I can’t get archived RSS feeds, the possibilities of active feeds in wiki pages are interesting enough that I’m going to keep running experiments and try integrating more content. Stay tuned. Oh, and if this particular wheel has already been invented, or if there’s a simpler way to solve it and I’m going around my knee to get to my elbow, please drop me a note.
