Fri
20 Feb 2004
5:58 pm
Shimon's Recommendations Engine
Posted by shimon under computers/blogging/feed recommendations
I've been working a bit recently on a recommendations engine based on the data of Share Your OPML. The SYO database contains the RSS subscription lists of 733 people, who have 81,438 subscriptions to 24,748 distinct feeds. The engine uses your subscription list, and those of everyone else in the database, to help you find feeds you'd like but aren't yet subscribed to.
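To make the idea concrete, here is a minimal sketch of one common way to turn subscription lists into recommendations: weight every other person by how much their list overlaps yours (Jaccard similarity), then let them vote for the feeds you don't have. This is an illustration under assumed names, not necessarily how the engine described here works; the `subscriptions` mapping, the `recommend` function, and the example.com URLs are all hypothetical.

```python
from collections import Counter


def recommend(user, subscriptions, top_n=5):
    """Rank feeds `user` lacks by similarity-weighted votes from other users.

    `subscriptions` maps a user name to the set of feed URLs they subscribe to
    (a hypothetical in-memory stand-in for the SYO data).
    """
    mine = subscriptions[user]
    scores = Counter()
    for other, feeds in subscriptions.items():
        if other == user:
            continue
        overlap = len(mine & feeds)
        if overlap == 0:
            continue  # no feeds in common, so this person gets no vote
        # Jaccard similarity: shared feeds divided by all feeds either person has.
        similarity = overlap / len(mine | feeds)
        for feed in feeds - mine:
            scores[feed] += similarity
    return [feed for feed, _ in scores.most_common(top_n)]


subscriptions = {
    "alice": {"http://example.com/a.xml", "http://example.com/b.xml", "http://example.com/c.xml"},
    "bob": {"http://example.com/a.xml", "http://example.com/b.xml", "http://example.com/d.xml"},
    "carol": {"http://example.com/a.xml", "http://example.com/d.xml", "http://example.com/e.xml"},
    "dave": {"http://example.com/f.xml"},
}

print(recommend("alice", subscriptions))
```

With this toy data, bob shares two feeds with alice and carol shares one, so the feed they both carry ranks ahead of the one only carol has. Dave has nothing in common with alice and contributes nothing.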
This is an area Andrew Grumet, Kingsley Kerce, and I have been talking about for a few weeks. He had the data in convenient form in an RDBMS the day after Dave Winer released the SDK, but last weekend I took a few hours and got it flowing into my own RDBMS so I could play too.
As I've been working on recommendations, I've found that it's really hard to evaluate whether the recommendations are any good. First of all, because feeds.scripting.com ends in, and is advertised mostly on, scripting.com, the audience is biased to include people interested in blogging and technology, liberal politics, and so forth. That makes a baseline worth knowing: if you randomly pick a feed from SYO, how likely is it to be something interesting?
- Task: answer this question by building a feature that takes you to a random feed.
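A random-feed feature could be sketched roughly as follows. One design question is what "random" means: picking uniformly among distinct feeds treats an obscure feed the same as a popular one, while picking a random subscription favors feeds with many subscribers. The sketch supports both; the `random_feed` function and the `subscriptions` mapping (user name to set of feed URLs) are hypothetical names for illustration.

```python
import random


def random_feed(subscriptions, weighted=False, rng=random):
    """Pick a random feed URL from everyone's subscription lists.

    weighted=False: uniform over distinct feeds.
    weighted=True:  uniform over subscriptions, so a feed with ten
                    subscribers is ten times as likely as a feed with one.
    """
    if weighted:
        pool = [feed for feeds in subscriptions.values() for feed in sorted(feeds)]
    else:
        pool = sorted({feed for feeds in subscriptions.values() for feed in feeds})
    return rng.choice(pool)


subscriptions = {
    "alice": {"http://example.com/a.xml", "http://example.com/b.xml"},
    "bob": {"http://example.com/a.xml", "http://example.com/c.xml"},
}

print(random_feed(subscriptions))
print(random_feed(subscriptions, weighted=True))
```

Comparing how often the uniform pick and the recommended picks turn out to be interesting would give a rough baseline for judging the engine.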
Another problem is that it's rather cumbersome to actually look at any of the recommended feeds. SYO only includes XML links, and an obvious improvement would be to spider the linked XML and dig out a website URL. Indeed, Andrew's site very conveniently does this. I'd like to do it or something like it myself; I'm considering perhaps offering a frassle-ized interface to the RSS feed contents rather than a simple link. This would offer a consistent interface to externally-produced content, which is both good and bad.
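Digging the website URL out of a feed can be sketched in a few lines, assuming the feed is RSS 0.9x or 2.0, where the site address lives in the channel's `link` element. RSS 1.0 (RDF) feeds use XML namespaces and would need extra handling that this sketch leaves out. The function names here are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def site_url_from_rss(xml_text):
    """Return the website URL from an RSS 0.9x/2.0 document, or None if absent."""
    root = ET.fromstring(xml_text)
    # In these RSS versions the root is <rss> and the site URL is <channel><link>.
    link = root.find("channel/link")
    if link is None or not link.text:
        return None
    return link.text.strip()


def fetch_site_url(feed_url, timeout=10):
    """Download a feed and extract its website URL (no retries or error handling)."""
    with urlopen(feed_url, timeout=timeout) as response:
        return site_url_from_rss(response.read())


sample = """<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.com/</link>
    <item><title>Hello</title></item>
  </channel>
</rss>"""

print(site_url_from_rss(sample))
```

A real spider would also want to cope with malformed XML, redirects, and feeds that have gone away, all of which are skipped here.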
The next question is, how do you quickly look at a blog that some goofy software has recommended to you and decide whether you'll like it and what it's about? In a perfect world, you'd have enough time to carefully read everything on it, but in real life we're all in a hurry. Here are some easier metrics that I think can be correlated with quality and relevance, or can at least give you a sense of what the feed's about and what kind of burden it will impose on you, the reader:
- titles of posts
- what outside sites are linked from the posts
- category titles, if the feed has categories
- average time between posts - to help you determine how often new posts will knock on your door. Perhaps better presented as average number of new posts per day?
- average number of words in a post (and min, max?)
- what feeds point to the same website URL? Am I subscribed to any of those? (i.e. is this feed just a copy of something I already read?)
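The numeric items in the list above are straightforward to compute once posts have been parsed. Here is a minimal sketch, assuming each post has already been reduced to a `(published, body)` pair and each feed to a known website URL; `feed_stats`, `duplicate_feeds`, and the sample data are hypothetical, and the URL normalization (dropping a trailing slash) is deliberately naive.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def feed_stats(posts):
    """Summarize posting frequency and length for a list of (datetime, text) posts."""
    if not posts:
        return None
    times = sorted(published for published, _ in posts)
    # Hours between each pair of consecutive posts.
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    words = [len(body.split()) for _, body in posts]
    avg_gap = mean(gaps) if gaps else None
    return {
        "avg_hours_between_posts": avg_gap,
        "posts_per_day": 24 / avg_gap if avg_gap else None,
        "avg_words": mean(words),
        "min_words": min(words),
        "max_words": max(words),
    }


def duplicate_feeds(site_urls):
    """Group feed URLs that point at the same website URL; keep groups of 2 or more."""
    groups = defaultdict(list)
    for feed_url, site_url in site_urls.items():
        groups[site_url.rstrip("/")].append(feed_url)
    return {site: feeds for site, feeds in groups.items() if len(feeds) > 1}


posts = [
    (datetime(2004, 2, 18), "one two three"),
    (datetime(2004, 2, 19), "four five"),
    (datetime(2004, 2, 20), "six"),
]
site_urls = {
    "http://example.com/index.xml": "http://example.com/",
    "http://example.com/rss2.xml": "http://example.com",
    "http://example.org/feed.xml": "http://example.org/",
}

print(feed_stats(posts))
print(duplicate_feeds(site_urls))
```

Checking the duplicate groups against your own subscription list would answer the "is this just a copy of something I already read?" question.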
Currently, there is only a development version of my recommendations engine in an undisclosed location, but I'll try to get it up on this site, and public, real soon now.