RSS via NNTP?

Posted 17 Apr 2003 at 06:55 UTC by garym

It's 2003 -- are we really building RSS aggregators that pull a feed a thousand times to feed a thousand customers? USENET and RSS have a common transmission pattern where small messages from diverse sites are often redistributed to locally clustered users, so why are we pulling so much XML when we already know how to fix it?

How many blogs are there? Obviously an open-ended question, but from what we know about bubbles and bridges, it's fair to say that from any particular topical vantage point there is a relatively small number of blogs. For argument's sake, let's say 20,000, or about the same number-per-geography as USENET newsgroups.

I just noticed in our company website's referrer logs that the top entry, at twice the volume of number two, is Radio Userland's NewsAggregator. If I understand Radio correctly, these aggregators pull RSS files for your own personal collection and merge the items into your own personal blog-update menu, and that is a Good Thing, but ... are they polling my website once for each and every Radio subscriber who asks for my feed, or are they polling me just once an hour and then handing their subscribers the items from their cached copy?

If Radio actually is caching my RSS to redistribute to Radio customers, then something is desperately wrong, because the number of hits on my site is way over 24 per day. If they are not sharing a cache, then let me get this straight: they are one of the world's largest concentrations of blogs, they have encouraged their membership to poll RSS (a technology they helped pioneer), and they multiply each feed's poll frequency by its number of subscribers? Naw ... it can't be.
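Do the arithmetic: one shared cache polling me hourly is 24 hits a day; a thousand subscribers each polling hourly is 24,000 hits a day, all fetching the same bytes.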

Isn't this situation reminiscent of USENET? The necessity that invented USENET was large, localized academic and corporate campuses where thousands of people were trying to read the same short messages shared across the Internet. Rather than send the same mailing-list message to all 1,500 employees at a plant or all 15,000 students at a university, one message was fetched and cached on the local NNTP server, and that one copy was shared by anyone interested in reading it.

NNTP is one of my favourite protocols, and one that I've always hoped would see a renaissance. I thought for sure somebody would have implemented these web-forum thingies in NNTP (I did, but my clients wouldn't deploy it; they chickened out and went for something "secure" running on an RDBMS that never quite worked right) ... NNTP just seems so, well, viral. Maybe this is NNTP's second coming, its second killer app ...

Maybe it's time we repurposed NNTP as our RSS distribution network. It seems a natural, miles ahead of what we have now. Today, people must constantly ask each server whether the ETag has changed since their last poll; NNTP/USENET would work like a blogroll ping, and aggregators could subscribe to whole topic groups if they wished. As you post to your portal or blog, you'd also be posting to your FeedGroups, and, like any good USENET reader (such as GNUS), reading behaviours could be tracked in score files, kill files and all the other filter tricks we honed over two decades of USENET. Plus we get a free taxonomy to classify RSS feeds, a push-listener event model, and we even get editorial controls to publish updates, corrections and followups!
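For concreteness, here is roughly what today's polite polling looks like - a minimal sketch in Python (the feed URL, interval and handler are placeholders), using HTTP conditional GET so that an unchanged feed costs only a 304:

    import time
    import urllib.request
    from urllib.error import HTTPError

    FEED_URL = "http://example.com/index.rss"   # placeholder feed
    POLL_SECONDS = 3600                         # hourly, per feed, per subscriber

    def handle(xml_bytes):
        pass                                    # hand the fresh XML to the aggregator

    def poll_forever():
        etag = None
        while True:
            req = urllib.request.Request(FEED_URL)
            if etag:
                # conditional GET: the server answers 304 Not Modified if unchanged
                req.add_header("If-None-Match", etag)
            try:
                with urllib.request.urlopen(req) as resp:
                    etag = resp.headers.get("ETag")
                    handle(resp.read())
            except HTTPError as e:
                if e.code != 304:               # 304 just means "no change"
                    raise
            time.sleep(POLL_SECONDS)

Multiply that loop by every subscriber of every aggregator and you get the referrer flood described above; the NNTP model replaces those N identical polls with one injection into a shared spool.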

OK, I'll admit USENET isn't instant, but maybe that can be fixed. There's also the waste of feeds rolling from NNTP hub to hub whether or not anyone really cares, and the lag would be far worse than what we have today with polling, but ... if the current escalating strain of RSS-seekers continues to build, it's only a matter of time before I, and probably many others, will have to turn these feed servers off because we cannot afford the bandwidth demand. The Radio Userlands of the world might implement even just a Squid proxy to ease the burden (Squid is easy enough that Radio could do this tomorrow!), but that's going to lead to proxy-busting for late-breaking news, negotiating new rules for expire magic, and we're still not dealing with threading, scoring and the soon-to-come explosion in RSS topic maps ...

What do you think: if RSS is not carried over NNTP, won't it become necessary to invent new protocols exactly like NNTP to cope with the scale problems?


NNTP is evil!, posted 17 Apr 2003 at 10:17 UTC by apenwarr » (Master)

Well, just to get the flamewar started: NNTP is the most horrendously evil pathetic rotten useless crappy protocol that ever there was. And I mean that in the nicest possible way, because all the NNTP *software* was much worse. (I say "was", not because it got better, but because nobody cares anymore.)

Some news *reader* programs were sort of passable, but as many people don't realize (I think garym does), they don't even use NNTP; we had to invent an entirely different protocol for that.

So, what sucks donkey gonads about NNTP? Well, I'm glad you asked!

- it's push-only. Push-only protocols are as stupid as pull-only protocols, except they waste more bandwidth sending stuff nobody wants.

- it manages to have huge lag even *though* it's push-only. If you think about it, this makes no sense whatsoever. The guy knows it's changed, and he knows who to tell about it, so he... waits, and does nothing for a while. Brilliant! (This is actually the software's fault, not the protocol's, but *all* the software does it.)

- you can't subscribe and unsubscribe to things in-band. That is, the guy pushing stuff will push whatever he wants, and you can *just try* to ignore it! Just try! You can either refuse the TCP connection entirely, or you can accept all the data he sends you and silently toss out anything you don't want. How do you make him actually not send you the stuff you don't want? Well, you email the server admin, of course. NNTP has no in-band subscription/unsubscription mechanism. What gets sent is in the sender's hand-edited config file.

- it has no protection whatsoever against spam, which is why usenet is totally overrun by spam. Plus see above about not being able to unsubscribe from the spammer when you want to. (Web discussion boards have no *good* protection from spam either, but the gratuitous variation in user interfaces and APIs makes widespread automated spam almost impossible. If you're going to make things more consistent, you *have* to deal with the spam problem nowadays.)

- you can arrange NNTP servers in a distribution hierarchy, but this has to be done completely by hand. And see above about the lack of in-band subscription/unsubscription. You'll have to email the admin of the NNTP server if you want to subscribe.

- if you have a small server at the end of the line (a "leaf node") and you're talking to a big server, and your disk fills up and you have to expire something from your disk, what should you do? Why, delete it from your disk and never see it again even if someone wants it, of course! Even though the big server upstream can keep it around ten times as long as you, there's no way to retrieve it later. Sorry! This is a *push-only* protocol, remember?

And specific to this particular case, NNTP:

- solves a non-problem. RSS files are tiny. Sending hundreds or thousands of them doesn't matter.

- is about ten times as complicated as HTTP to implement.

- has a much weaker caching and expiration mechanism than HTTP.

What you *actually* want is:

- HTTP

- a worldwide (or at least RSS-universe-wide) network of http caches that connect to each other in a random net. Ideally they'd connect automatically to each other using something like NetSelect.

- a basic kind of push cache-flushing, in which the guy who updates the page can inform people who've registered interest in the page (and they can inform their interested customers, etc.) to throw away their cached value. Then the client can choose to either refresh its cache immediately, or just wait until the next person actually wants the document (see the strawman sketch below).
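None of that push-flushing exists in stock HTTP, but as a strawman, the cache-side listener could be as dumb as this Python sketch (the /flush endpoint and the register/notify flow are invented for illustration):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    cache = {}   # url -> cached feed body; in real life this lives in the proxy

    class FlushHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # invented endpoint: the origin POSTs /flush?url=<feed> when it changes
            parsed = urlparse(self.path)
            if parsed.path == "/flush":
                for url in parse_qs(parsed.query).get("url", []):
                    cache.pop(url, None)   # drop the stale copy; refetch lazily
                self.send_response(204)
            else:
                self.send_response(404)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), FlushHandler).serve_forever()

Whether the cache then refetches immediately or waits for the next actual reader is a purely local policy decision, which is the whole point.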

By the way, I think just a squid cache network with 2-hour timeouts on your RSS files (reduced during "heavy news" periods) would actually do the job just fine.
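The squid half of that really is about one line of squid.conf - a sketch assuming feeds end in .rss, .rdf or .xml (adjust the pattern to taste); refresh_pattern takes minimum, percent and maximum ages in minutes, so 120 gives the 2-hour ceiling:

    # cache anything that looks like a feed for at most two hours
    refresh_pattern -i \.(rss|rdf|xml)$ 0 20% 120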

I need to write an essay about all this sometime. Ah, for more hours in the day...

nntp//rss, posted 17 Apr 2003 at 12:33 UTC by iand » (Journeyer)

Jason Brome has a project that uses NNTP as a transport for RSS called nntp//rss:

nntp//rss is a Java-based bridge between RSS feeds and NNTP clients, enabling you to read your favorite RSS syndicated content within your existing NNTP newsreader. RSS feeds are represented as NNTP newsgroups, providing a simple, straightforward and familiar environment for news reading.
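Reading a feed through the bridge is then plain NNTP client work. A quick sketch using Python's standard nntplib (the group name is made up here - check what nntp//rss actually assigns - and note that nntplib, long in the standard library, has been removed from the newest Pythons):

    import nntplib

    # nntp//rss runs locally; each subscribed feed appears as a newsgroup
    server = nntplib.NNTP("localhost", 119)
    resp, count, first, last, name = server.group("rss.advogato")  # hypothetical name

    # fetch overview lines for the last ten items in the feed
    resp, overviews = server.over((max(first, last - 9), last))
    for number, fields in overviews:
        print(number, fields.get("subject"))
    server.quit()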

NNTP feed sizes, posted 17 Apr 2003 at 12:46 UTC by k » (Journeyer)

When I was administering NNTP servers a few years ago, the _incoming feed_ we nabbed from different providers, after chopping out the repeats, topped out around 20 Mbit/s. That's without the spam.

Feeders, therefore, would simply store the data coming in, feed what each peer asked for, and eventually get rid of it. We'd need insane disk I/O to keep up with the 10 or so peers we had. Consequently we didn't keep articles on the _feeders_ around very long. If an article was missed, oops.

Now, readers. alt.binaries is _big_, even without the spam. Trying to keep it around for longer than a couple of days becomes quite difficult. You may be able to build a machine with a few hundred gigabytes of disk space, but a couple hundred DSL clients trawling alt.binaries for the 400 posts making up a movie quickly kill all but the beefiest servers.

Your "subscription" idea has merit. My NNTP-fu has disappeared but I think a "peer-subscribe" group membership method for feeders would be useful. You could then redelegate the existing peer-group information to "ACLs" (ie if I ask for alt.binaries.* but I'm not paying for a full feed, I don't get it.)

In any case, I _DO_ remember doing some fun maths one day. I was trying to calculate how many full feeds were crossing the US<->(Amsterdam/UK) fibre routes. I stopped counting at around 40. _40 full feeds_, because providers were buying their own DS3/STM1 across the pond to link their Europe<->US networks and peering their news feeds together. Think about that for a minute.

As a squid developer, I always wondered about integrating some form of cache 'discovery' with neighbouring ISP caches to build an NNTP reader/feeder network dynamically. Alas ...

Interesting idea, but how about integration with the existing weblogging tools, posted 18 Apr 2003 at 13:37 UTC by jpm » (Journeyer)

That idea has been on my mind for quite a while - to create a distributed network of simple NNTP servers that would replicate the available metadata about a weblog (from RSS or whatever other source) among peers.

See webseitz.fluxent.com/wiki/PaperCut for a little bit more on this.

Now, I already have a working project (Papercut - papercut.org) that is an NNTP server, and it is modular enough to serve content from a wide variety of sources - message board databases (such as Phorum or phpBB), standalone databases, XML files or whatever. And since Papercut is simple and flexible enough (it just needs Python to run), I'm sure we could create a storage module to get the contents from an RSS file or even a Berkeley DB file.

The real problem is how the integration with the existing weblogging tools will be done. I mean, how are you going to make the NNTP server update its database of articles when an RSS file is updated? The Berkeley DB option is pretty simple - it will simply cache the responses and get the information from the database file only when necessary.
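One cheap answer is to treat the RSS file as the master copy and re-parse it only when it looks stale. A sketch keyed on the file's mtime (the class and method names are illustrative, not Papercut's real storage API):

    import os
    import xml.etree.ElementTree as ET

    class RssStorage:
        """Illustrative backend: serves NNTP articles out of a single RSS file."""

        def __init__(self, path):
            self.path = path
            self.mtime = None
            self.items = []

        def refresh(self):
            # re-parse only when the feed file has actually changed on disk
            mtime = os.path.getmtime(self.path)
            if mtime != self.mtime:
                tree = ET.parse(self.path)
                self.items = [(item.findtext("title"), item.findtext("description"))
                              for item in tree.iter("item")]
                self.mtime = mtime

        def get_articles(self):
            self.refresh()
            return self.items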

Another thing that might be a problem is how to build a good configuration tool with which the user (probably the owner of a weblog) will categorize his weblog and, above all, choose which other weblogs he/she wants to replicate to his local NNTP server. And obviously, from where! I'm guessing we would need a sort of central NNTP server containing all of the categories and weblog articles, such as:

alt.weblog.weblog_name

or even:

alt.weblog.category.weblog_name

The problem with categories is that I couldn't think of exactly one category that my weblog would always fit into.

Anyway, just a few thoughts about your post from the implementation side of things. We should keep in touch - I would like to take this idea further.

Exactly which category, posted 23 Apr 2003 at 23:45 UTC by garym » (Master)

This issue of exactly one category recently reared its head for me in another context, and the solution there was just too obvious: per-category NNTP Trackback.

We already have blog software such as Movable Type using per-category trackback to ping central topical blogrolls such as the TopicExchange, and the emerging ENT standard (or one of the competing standards) would build up the taxonomy; this gets into another issue of taxonomic proliferation, but that's not, IMHO, a showstopper.

All that is needed is to extend the existing per-category pings to include an NNTP ping, or even to provide gateway trackback services that accept XML-RPC/REST messages for these ENT-tagged RSS feeds, then pull down the feed and inject it into the NNTP-borne RSS network.
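The injection half of such a gateway is mostly plumbing. A rough Python sketch (the news host and header layout are illustrative; ping handling and authentication are elided):

    import io
    import nntplib
    import urllib.request
    import xml.etree.ElementTree as ET

    def inject_feed(feed_url, group):
        # on a per-category ping, fetch the feed and post each item to its group
        with urllib.request.urlopen(feed_url) as resp:
            tree = ET.parse(resp)
        server = nntplib.NNTP("news.example.org")   # hypothetical injection host
        for item in tree.iter("item"):
            article = (f"From: {feed_url}\r\n"
                       f"Newsgroups: {group}\r\n"
                       f"Subject: {item.findtext('title')}\r\n"
                       f"\r\n"
                       f"{item.findtext('description')}\r\n")
            server.post(io.BytesIO(article.encode("utf-8")))
        server.quit()

A real gateway would also want stable Message-IDs derived from each item's GUID, so that re-injecting a feed deduplicates instead of reposting.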

I like this idea, posted 24 Apr 2003 at 08:15 UTC by chalst » (Master)

Some comments:

  1. apenwarr's lag point is not, I think, a real problem, since if I understand NNTP correctly, delays only build up over many hops: if you want up-to-date information, you just ask the NNTP server that the blog feeds into directly. If you don't care too much about up-to-dateness, then just use your friendly local NNTP server as usual.
  2. An obvious taxonomy: rss.<domain-name>. The body of each message would be the usual RSS XML text. Thus rss.* would be a hierarchy operating under rather more liberal rules about newsgroup creation, but I don't think this creates real difficulties (a sketch of the mapping follows after this list).
  3. I note that graydon's excellent monotone project uses NNTP for distributing diffs, so not everyone has given up on NNTP as a useful protocol...
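On point 2, the rss.<domain-name> convention is trivially mechanizable. A tiny sketch (reversing the domain labels, Java-package style, is my own embellishment, not part of the proposal above):

    from urllib.parse import urlparse

    def feed_to_group(feed_url):
        # map http://www.example.com/index.rss to rss.com.example.www
        host = urlparse(feed_url).hostname or ""
        return "rss." + ".".join(reversed(host.split(".")))

    assert feed_to_group("http://www.advogato.org/rss/articles.xml") \
           == "rss.org.advogato.www"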
