Peer-to-peer RSS

For at least a year now, and probably longer, Bart and I have been occasionally chatting about how to build tools to make it easier to read and keep up-to-date on several webcomics. Our preferred scales differ – he reads fewer than 10, while I have more than 600 between my two lists – but we still want basically the same things. These come down to the ability to keep track of where we left off reading in the archives, and the ability to find out about new updates to our favorites. Our shared “it would be nice” features include getting good recommendations from our peers. I also would like to be able to publish my list.

When Bart pointed me at ComicAlert! (“CA!”), I thought this was going to be the answer. They provide customized per-user RSS feeds that include all your favorite comics. You specify which comics are, in fact, your favorites from a list of thousands that their software can automatically glean updates from. You can request the addition of any comic they don’t already support using a simple web form. You can find comics you hadn’t heard of before with the usual sort of recommendation system. On paper it sounds like it addresses pretty much everything we want.

Bart’s first sign of trouble was that the ComicAlert! people are a bit flaky about keeping up with their queues of add and repair requests. In fact I don’t see any sign that any of the requests I put in a month or two ago have been dealt with.

I first noticed trouble when I discovered I couldn’t add more than 40 comics to my favorites. This is an intentional limit in the ComicAlert! system; I exchanged some really interesting e-mail with the site admin when I wrote to ask what was up with that limit. They’re concerned about two major issues: people stealing the content (comic update information) that CA! worked so hard to collect, and people consuming tons of bandwidth and CPU on the CA! servers with frequent RSS queries to accounts with large numbers of comics. I can’t help much with the first point without destroying any dreams these people might have of making money on this someday, but I provided a couple of pages’ worth of technical suggestions addressing the second issue. Maybe some of them will get implemented eventually.

It seems obvious that if you’re working hard to collect information, and then working hard to disseminate that information, you just need to distribute that work onto the people who wanted the information in the first place. Peer-to-peer techniques seem perfect for this.

Hey, that was my conclusion. That’s all: P2P beats the pants off of centralized services like ComicAlert. You can stop reading now, unless you want to know what I think follows naturally from this conclusion.

First, why should we treat the daily Something Positive differently from the daily Slashdot or the Daily Vanguard? They’re all sites with regularly updating content, where a visitor might reasonably want to know where they left off reading and when new content is available. Recommendations might be nice too.

If all web news sources are indexed this way, how far away is a distributed version of Google News? It’s just a different kind of recommendation: instead of saying, “These content sources are similar,” it says, “These particular items have similar content.” I would find newspaper editorial cartoons much more interesting if they appeared alongside articles about whatever news story they happened to be about, for example. What if any webcomic about Sony’s rootkit DRM scheme appeared alongside blog posts and news articles on the same subject?

The best way to collect data for a P2P network like this is to have a small handful of the users poll the RSS feeds of sites they care about, and distribute any new items through the network. This substantially reduces the load on the origin servers, and since the network can use a protocol designed for just this purpose, updates can become available to users faster and with less total bandwidth consumption, on average. I think.
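
As a rough sketch of the polling side, here is how one of those users might pick out the items worth distributing. It assumes a plain RSS 2.0 document that has already been fetched, and it identifies items by their guid (falling back to the link). The function name new_items and the dictionary layout are made up for illustration, and the step that actually pushes items into the network is left out entirely.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen_ids):
    """Return the items in an RSS 2.0 document that have not been seen before.

    seen_ids is a set of item identifiers from earlier polls; it is updated
    in place so the next poll only reports genuinely new items.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer <guid>; fall back to <link> when the feed omits it.
        item_id = item.findtext("guid") or item.findtext("link")
        if not item_id or item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        fresh.append({
            "id": item_id,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return fresh


if __name__ == "__main__":
    # Hypothetical feed contents, standing in for a fetched document.
    feed = """<rss version="2.0"><channel><title>Example comic</title>
    <item><title>Strip 1</title><link>http://example.com/1</link><guid>strip-1</guid></item>
    </channel></rss>"""
    seen = set()
    for entry in new_items(feed, seen):
        print("would distribute:", entry["title"])
```

Only the handful of polling peers ever touch the origin server; everyone else would receive the returned entries through the network.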

But this plan introduces a new problem: how do you know that the update attributed to a site actually came from that site? You could check with the origin server and compare, but that raises the question of why you didn’t just skip the P2P network and go straight to the origin server in the first place. Besides, the data may legitimately have changed since your peer source last checked it. Ideally the origin server would digitally sign each item, but in the current normal usage of RSS that would be mostly silly, and as far as I know nobody does it. Therefore, whoever inserts the content into the P2P network should sign it, and the software should provide some tools to help users manage the reputation of each key, that is, whether the signer can be trusted to copy data into the network correctly. The software could randomly validate entries it receives against origin servers, and use that information to contribute to the reputation metric. If multiple signers agree on the contents of a particular item, then their reputations should be combined to determine the trustworthiness of those particular contents. Generally, I think a little statistics goes a long way toward reliable-enough data.
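
One possible way to turn "a little statistics" into code is sketched below. Everything in it is an assumption rather than a worked-out design: a key's reputation is a smoothed fraction of spot checks that matched the origin server, signers who agree are treated as independent, and their scores are combined as one minus the product of their individual doubts. Signature verification is assumed to have happened already, and the names Reputation and rank_versions are hypothetical.

```python
import hashlib
from collections import defaultdict


class Reputation:
    """Tracks how often each signing key's items matched the origin server."""

    def __init__(self):
        self.matched = defaultdict(int)
        self.checked = defaultdict(int)

    def record_check(self, key_id, matched):
        """Record one random spot check of an item signed by key_id."""
        self.checked[key_id] += 1
        if matched:
            self.matched[key_id] += 1

    def score(self, key_id):
        # Smoothed success rate: a key with no history scores 0.5,
        # and the score moves toward 0 or 1 as checks accumulate.
        return (self.matched[key_id] + 1) / (self.checked[key_id] + 2)


def rank_versions(claims, reputation):
    """Rank competing versions of one item by combined signer reputation.

    claims is an iterable of (key_id, content) pairs whose signatures are
    assumed to be verified already. Returns (content, trust) pairs, most
    trusted first.
    """
    signers = defaultdict(set)
    text = {}
    for key_id, content in claims:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        signers[digest].add(key_id)  # a set, so a key only counts once
        text[digest] = content
    ranked = []
    for digest, keys in signers.items():
        doubt = 1.0
        for key_id in keys:
            # Assumes signers err independently of one another.
            doubt *= 1.0 - reputation.score(key_id)
        ranked.append((text[digest], 1.0 - doubt))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The independence assumption is the weak point: colluding keys would inflate each other's trust, so a real system would probably want to cap or discount agreement among keys with little history.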

Not all comics, or all sites with periodic updates, have RSS feeds. In fact the majority of webcomics don’t. This is a key piece that ComicAlert! is filling in: they scrape the HTML of the non-RSS comics looking for clues that the comic has been updated. If people develop their own HTML scrapers for particular comics (or other sites) and insert the resulting data into the P2P network, then the same signature scheme should apply cleanly, except that now the trust question isn’t just, “Is this person trying to insert bad data?” but also, “Is this particular scraping script broken?” If the layout of a site changes and one person’s scraper doesn’t get updated, somebody else can write a better one and others can start relying on the second person’s key for the updates to that site.
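
A per-site scraper of the kind described here can be quite small. The sketch below assumes, purely for illustration, that the comic's image URL contains a recognizable marker such as "/comics/" and that a changed URL means a new update; real sites will need their own rules. It raises an error when nothing matches, which is the "is this scraper broken?" signal other users would want to know about. The names ComicImageFinder and check_for_update are hypothetical.

```python
from html.parser import HTMLParser


class ComicImageFinder(HTMLParser):
    """Collects the src of every <img> whose URL contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.marker in src:
            self.sources.append(src)


def check_for_update(page_html, marker, last_seen):
    """Return (updated, current_src) for a fetched comic page.

    Raises LookupError when no matching image is found, which usually
    means the site layout changed and this scraper needs repair.
    """
    finder = ComicImageFinder(marker)
    finder.feed(page_html)
    if not finder.sources:
        raise LookupError("no image matching %r; scraper may be broken" % marker)
    current = finder.sources[0]
    return (current != last_seen, current)
```

The result of a successful check is what would get signed and inserted into the network, so a broken scraper fails loudly instead of publishing stale data under its author's key.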

But once you’ve got digitally signed posts in the network, there’s no reason they have to be derived from a web site somewhere. People could use this system to make pseudonymous blog posts: generate a new key to use as your pseudonym identity, and you can post any content you want without in any way attaching your real identity to it. Other users can still choose to ignore you, or read what you’ve written first thing every morning, or anything in between, but they could be your next-door neighbor and not know it. And perhaps you’ve noticed that some people now post RSS feeds of new BitTorrent files, which you can configure your computer to automatically download as they’re posted: combine that with these pseudonymous RSS feeds and the distributed tracker and distributed DB features of modern BitTorrent clients, and you get a substantial amount of privacy in your choice of what you’re downloading or sharing.

That’s some big stuff for a simple comics update tracker, which is one reason I see open source as so much better than the alternative. Proprietary software developers never have enough time to follow all the logical extensions of their work, and the people who can and want to implement those extensions don’t have access to the foundation that the proprietary developers do, so they rarely get anywhere.