If you've been using a desktop RSS aggregator, you know how great it feels to find an RSS feed for a site you frequent. Suppose there's a certain club you like, and you want to know who's playing there most weekends. You could check the site each week, but it'd be much easier if your aggregator showed you whenever a new event was published.

You could ask the club to publish a feed. That would be the ideal solution, but it's not likely to happen. Most people still don't know what a feed is good for. If the publisher won't provide a feed on their own, you can use a tool called a scraper to automatically turn their web pages into an RSS feed. One such scraper is MyRSS, which claims to allow non-programmers to easily scrape any web page. I used MyRSS once, and not only did it have an annoying advertising model, but the feed it produced was terrible. It contained only some of the titles, mistook links on the side of the page for new content, and failed to update when new content arrived. In short, it just didn't work.

It's hard to blame MyRSS for this. People don't format their pages to make them simple for a computer to scrape. People format pages for other people to read. A scraper that works only by automatic means, without guidance from a person, is doomed to fail somewhere.

I'd like to propose an alternative. Instead of hoping to make the software smart enough to figure out any page on its own, let's make software that's good at listening to people. Setting up a scraper for your favorite page should be an interactive process.

First, you enter a URL, and the scraper fetches that page. To make things challenging, let's suppose you want to scrape Boston.com's event listing for today, which includes event date and event location in addition to some standard RSS fields.

Second, the scraper shows you the page, but marked clearly to show what fields would be scraped. This gives you an opportunity to look at specific examples and correct the scraper before it begins the long cycle of producing beautiful scraped RSS feeds. It might look like this:

All the fields to be scraped are identified by colored borders. Pink indicates event date, green is title, and blue is the list of categories.

The yellow block at the left of each field indicates that this is a guess by the scraper. To confirm the scraper's guess, you click the yellow block and it turns into a green check-mark, as illustrated on the first date. To remove an incorrect guess, you click the red X at right.

At each correction, the scraper adjusts what it looks for before and after each field. The scraper may also want to let the user correct other values, like the total number of items currently on the page. That way the scraper could make sure it is being flexible enough to notice all the relevant stuff on the page, but not so flexible that it picks up garbage.
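To make that adjustment concrete, here's a minimal sketch in Python of one way it could work. It's an illustration under my own assumptions, not a finished design: the function names, the `max_len` cap, and the idea of storing each confirmed field as a pair of character offsets are all hypothetical. The scraper keeps only the text that every confirmed example has in common immediately before and after it, then looks for that same context elsewhere on the page.

```python
import re
from os.path import commonprefix


def learn_context(page, examples, max_len=40):
    """Return the (before, after) text shared by every confirmed example.

    `examples` is a list of (start, end) character offsets into `page`.
    `max_len` is an arbitrary cap so a single example doesn't pin the
    context to the entire rest of the page.
    """
    befores = [page[:start] for start, end in examples]
    afters = [page[end:] for start, end in examples]
    # The longest common suffix is the common prefix of the reversed strings.
    before = commonprefix([b[::-1] for b in befores])[::-1]
    after = commonprefix(afters)
    if max_len:
        before, after = before[-max_len:], after[:max_len]
    return before, after


def extract(page, before, after):
    """Find every stretch of text sitting between `before` and `after`."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return re.findall(pattern, page, flags=re.DOTALL)


if __name__ == "__main__":
    # A made-up page fragment; the user has confirmed the first two titles.
    page = ('<li><b class="t">Jazz Night</b> at Club A</li>'
            '<li><b class="t">Open Mic</b> at Club B</li>'
            '<li><b class="t">Trivia</b> at Club C</li>')
    confirmed = []
    for title in ("Jazz Night", "Open Mic"):
        start = page.index(title)
        confirmed.append((start, start + len(title)))
    before, after = learn_context(page, confirmed)
    print(extract(page, before, after))
```

Each new confirmation can only shorten the shared context, which makes the scraper more flexible. Comparing the number of matches against the item count the user supplied would be one way to notice when the context has become too short and is starting to pick up garbage.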

The best part about this design, in my not-so-humble opinion, is that it's doable. All you need to do is inject a few <span>s and images into the existing web page. If you're making a scraper anyway, you already need a parser and some way of guessing which parts of the web page correspond to fields in the feed.
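As a rough sketch of the injection step, here's how the wrapping might look in Python. The function name, the `scrape-guess` class names, and the offset-based field list are my own placeholders, and the sketch assumes each field lies entirely inside a single run of text rather than straddling a tag.

```python
def mark_fields(html, fields):
    """Wrap each guessed field in a <span> so a stylesheet can outline it.

    `fields` is a list of (start, end, name) character offsets into `html`.
    Assumes the fields don't overlap and don't cross tag boundaries.
    """
    # Work from the end of the page backwards so earlier offsets stay valid.
    for start, end, name in sorted(fields, reverse=True):
        span = '<span class="scrape-guess scrape-{}">{}</span>'.format(
            name, html[start:end])
        html = html[:start] + span + html[end:]
    return html
```

A stylesheet rule per class (for example, a pink border for `scrape-date`) would then produce the colored outlines, and the yellow block and red X could be added inside the same wrapper.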

I hope this motivates someone to build a better scraper. We could use one! And if you want my input or assistance on such a project, let me know.