Three Systemic Problems with Open-Source Hosting Sites

Posted by Eric Raymond

I’ve been off the air for several days due to a hosting-site failure last Friday. After several months of deteriorating performance and various services being sporadically inaccessible, Berlios’s webspace went 404 and the Subversion repositories stopped working…taking my GPSD project down with them. I had every reason to fear this might be permanent, and spent the next two days reconstructing as much as possible of the project state so we could migrate to another site.

Berlios came back up on Monday. But I don’t trust it will stay that way. This weekend rubbed my nose in some systemic vulnerabilities in the open-source development infrastructure that we need to fix. Rant follows.

1. Hosting Sites Are Data Jails

The worst problem with almost all current hosting sites is that they’re data jails. You can put data (the source code revision history, mailing list address lists, bug reports) into them, but getting a complete snapshot of that data back out often ranges from painful to impossible.

Why is this an issue? Very practically, because hosting sites, even well-established ones, sometimes go off the air. Any prudent project lead should be thinking about how to recover if that happens, and how to take periodic backups of critical project data. But more generally, it’s your data. You should own it. If you can’t push a button and get a snapshot of your project state out of the site whenever you want, you don’t own it.

When berlios.de crashed on me, I was lucky: because I had been preparing to migrate GPSD off the site due to its deteriorating performance, I had a Subversion dump file that was less than two weeks old. I was able to bring that up to date by translating commits from an unofficial git mirror. I was doubly lucky in that the Mailman administrative pages remained accessible even while the project webspace and repositories had been 404 for two days.

But actually retrieving my mailing-list data was a hideous process that involved screen-scraping HTML by hand, and I had no hope at all of retrieving the bug tracker state.

This anecdote illustrates the most serious manifestations of the data-jail problem. Third-generation version-control systems (hg, git, bzr, etc.) pretty much solve it for code repositories; every checkout is a mirror. But most projects have two other critical data collections: their mailing-list state and their bug-tracker state. And on all sites I know of in late 2009, those are seriously jailed.

This is a problem that goes straight to the design of the software subsystems used by these sites. Some are generic: of these, the most frequent single offender is 2.x versions of Mailman, the most widely used mailing-list manager (the Mailman maintainers claim to have fixed this in 3.0). Bug-trackers tend to be tightly tied to individual hosting engines, and are even harder to dig data out of. They also illustrate the second major failing…

2. Hosting Sites have Poor Scriptability

All hosting-site suites are Web-centric, operated primarily or entirely through a browser. This solves many problems, but creates a few as well. One is that browsers, like GUIs in general, are badly suited for stereotyped and repetitive tasks. Another is that they have poor accessibility for people with visual or motor-control issues.

Here again the issues with version-control systems are relatively minor, because all those in common use are driven by CLI tools that are easy to script. Mailing lists don’t present serious issues either; the only operation on them that normally goes through the web is moderation of submissions, and the demands of that operation are fairly well matched to a browser-style interface.

But there are other common operations that need to be scriptable and are generally not. A representative one is getting a list of open bug reports to work on later – say, somewhere that your net connection is spotty. There is no reason this couldn’t be handled by an email autoresponder robot connected to the bug-tracker database, a feature which would also improve tracker accessibility for the blind.
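
That autoresponder is not hard to picture. Here is a minimal Python sketch of the tracker-side query such a robot would run before mailing the result back – the table name, columns, and status value are invented for illustration, since every tracker's schema differs:

```python
import sqlite3

def open_bug_report(db):
    """Render the open bugs as a plain-text body an email robot could send."""
    rows = db.execute(
        "SELECT id, summary FROM bugs WHERE status = 'open' ORDER BY id"
    ).fetchall()
    lines = ["#%d: %s" % (bug_id, summary) for bug_id, summary in rows]
    return "\n".join(lines) if lines else "No open bugs."

# Hypothetical schema and data, for demonstration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bugs (id INTEGER, summary TEXT, status TEXT)")
db.executemany("INSERT INTO bugs VALUES (?, ?, ?)",
               [(1, "fix NMEA parser", "open"), (2, "old bug", "closed")])
report = open_bug_report(db)
```

Reading the incoming request and sending the reply are solved problems (the stdlib `email` and `smtplib` modules); the missing piece is hosting sites exposing the query at all.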

Another is shipping a software release. This normally consists of uploading product files in various shipping formats (source tarballs, debs, RPMs, and the like) to the hosting site’s download area, and associating with them a bunch of metadata: a short-form release announcement, file-type or architecture tags for the binary packages, MD5 checksums, and so forth.

With the exception of the release announcement, there is really no reason a human being should be sitting at a web browser to type in this sort of thing. In fact there is an excellent reason a human shouldn’t do it by hand – it’s exactly the sort of fiddly, tedious, semi-mechanical chore at which humans tend to make (and then miss) finger errors because the brain is not fully engaged.

It would be better for the hosting system’s release-registration logic to accept a job card via email, said job card including all the release metadata and URLs pointing to the product files it should gather for the release. Each job card could be generated by a project-specific script that would take the parts that really need human attention from a human and mechanically fill in the rest. This would both minimize human error and improve accessibility.
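
As a sketch of what such a job card might look like – every field name and URL below is hypothetical, not any real hosting site's schema – RFC 822-style headers would do, and stock mail parsers already read them:

```python
from email import message_from_string

# Hypothetical job card; field names and URLs are invented examples.
CARD = """\
Project: gpsd
Version: 2.90
Announce: Bug-fix release; see the NEWS file for details.
File: http://example.org/downloads/gpsd-2.90.tar.gz
File: http://example.org/downloads/gpsd-2.90-1.i386.rpm
"""

def parse_job_card(text):
    """Parse an RFC 822-style job card into release metadata."""
    msg = message_from_string(text)
    return {
        "project": msg["Project"],
        "version": msg["Version"],
        "announce": msg["Announce"],
        "files": msg.get_all("File"),   # one File: header per product URL
    }

card = parse_job_card(CARD)
```

The project-specific script would emit a card like this; the hosting side would fetch the listed URLs and register the release.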

In general, a good question for hosting-system designers to be asking themselves about each operation of the system would be “Do I provide a way to remote-script this through an email robot or XML-RPC interface or the like?” When the answer is “no”, that’s a bug that needs to be fixed.
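
To make the question concrete, here is what a "yes" answer might look like for one tracker operation, using Python's stock XML-RPC machinery; the method name, port, and bug fields are all invented for illustration:

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def serve(bugs, port=8731):
    """Expose one (made-up) tracker operation over XML-RPC."""
    server = SimpleXMLRPCServer(("127.0.0.1", port), logRequests=False)
    server.register_function(lambda: bugs, "get_open_bugs")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = serve([{"id": 17, "summary": "example bug"}])
proxy = ServerProxy("http://127.0.0.1:8731/")
open_bugs = proxy.get_open_bugs()   # a script, not a browser, asks the tracker
server.shutdown()
```

The point is not this particular protocol – an email robot would serve as well – but that every GUI operation should have some scriptable twin.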

3. Hosting Sites Have Inadequate Support for Immigration

The first (and, in my opinion, most serious) failing I identified is poor support for snapshotting and, if necessary, out-migrating a project. But most hosting systems do almost as badly at in-migrating a project that already has a history, as opposed to one started from nothing on the site.

Even uploading an existing source-code repository at the start of a project (as opposed to starting with an empty one) is only spottily supported. Just try, for example, to find a site that will let you upload a mailbox full of archives from a pre-existing development list in order to re-home it at the project’s new development site.

This is the flip side of the data-jail problem. It has some of the same causes, and many of the same consequences too. Because it makes re-homing projects unnecessarily difficult, it means that project leads cannot respond effectively to hosting-site problems. This creates a systemic brittleness in our development infrastructure.

Addressing the Problems

I believe in underpromising and overperforming, so I’m not going to talk up any grand plans to fix this. Yet. But I will say that I intend to do more than talk. And two days ago the project leaders of Savane, the hosting system that powers gna.org and Savannah, read this and invited me to join their project team.

This entry was posted in Software by Eric Raymond.

50 thoughts on “Three Systemic Problems with Open-Source Hosting Sites”

  1. Mike Earl said:

    At risk of hopelessly broadening the scope, most social-networking sites, or even more generically, hosted applications have these issues. With social networking sites some of it may be intended lock-in, but…

  2. JessicaBoxer said:

    Mike Earl Says:
    > At risk of hopelessly broadening the scope,

    Jumping quickly on Mike’s bandwagon, ESR’s “data jail” expression (which is both new to me and very descriptive) reminds me of what I think is a very serious problem that is just not setting off enough alarm bells, namely the growing influence of Apple. Many on this blog, and in the places guys like ESR frequent, are constantly railing against the evils of Microsoft, and MS has plenty of shortcomings. But I am getting to the point where I think we need to start rooting for Windows Mobile 6.5. It is undoubtedly a terrible operating system, but it has one thing that is vital: the right to program it.

    What I mean by that is that I can create a program for WM 6.5 and sell it or give it away to anyone who also has WM 6.5. It is deeply disturbing to me that Apple, Google and RIM have locked things up so tight that you need their permission to install software on your own phone. Permission that, by all indications, is capricious at times, and deeply self-serving at other times.

    Those of you who hate Bill Gates and Steve Ballmer need to take a serious look at Steve Jobs. He is far more aggressive in controlling his platform and users than Microsoft ever was. Can you even imagine being in a situation where you needed Bill Gates’s permission to install software (including, for example, Linux) on your PC? Bill Gates was bad. Steve Jobs is much, MUCH worse.

    There is no doubt in my mind that a lot of personal computing is moving to the phone, and probably ultimately to the cloud. Google is a little better, but they have shown either ambivalence or incompetence in deploying their platforms and app store. What sort of world do we live in when Microsoft makes the most open and free operating system for a platform?

    What price, I would ask, for pretty icons?

  3. esr said:

    No, let’s not go down that rathole. Yes, the data-jail effect on social networking sites is bad, and the iPhone is worse. But I can’t solve those problems, so there is little point in trying to beat them to death in this comment thread. Don’t go there. Let’s stick to the open-source infrastructure issue, a real problem that we actually have some chance to address constructively.

  4. Jeff Read said:

    Jessica,

    A tightly controlled platform worked out enormously well for game consoles. Why not the iPhone?

    I think that this kind of thing is going to become more commonplace in the future: controlled platforms where the vendor serves a gatekeeper role of sorts. People are learning the harsh lesson that complete openness leads to secondary problems like profound lack of integration (Linux) or malware (Windows). In a more interconnected world, closed platforms will prevail. Sad to say, but I think the sort of freedoms the open source movement advocates are something 90% or more of end users neither want nor are prepared to handle.

    And that’s leaving aside the fact that Macintosh, and not Linux, is becoming the preferred development and personal-use platform of even open-source hackers…

  5. Jeff Read said:

    I’ve heard the term “Hotel California” used to describe certain “cloud” services like Gmail :)

    I think GitHub has an interesting approach. Almost by definition, your Git repo on GitHub is a copy of some other repo you pushed from, likely on your home box. Part of the problem comes from a cultural institution that stems from the “free hosting” of the 1990s, where access to your own stuff was limited in some way. That caused us to develop bad cultural habits. In an ideal world, we’d be able to ssh into our hosting accounts, and thereby scp or rsync down our entire repositories, data, etc. with a single command line; contrariwise, to maintain our sites we’d simply make local changes and rsync them up. I don’t see why, e.g., email or bug reports would be any different: you have the email archives or the SQLite database of bug reports in your home directory on the remote server, and both they and the scripts which prettyprint them for browsers are captured in the global snarfdown.

    But again, bad habits proliferated, and we began to see our sites as something to be maintained remotely rather than locally.

  6. JessicaBoxer said:

    Jeff Read Says:
    > A tightly controlled platform worked out enormously well for game consoles. Why not the iPhone?

    Because you don’t store your life on your PS2. However, I will certainly respect Eric’s wishes and leave this thread alone.

  7. Shenpen said:

    I’m both quite inexperienced in such issues and a bit drunk (it’s late night here), so forgive me if I’m saying something stupid, but isn’t the deeper problem behind this whole problem set that hosting sites have forgotten the basic principles of the Unix philosophy?

    And while I know hackers tend to be sceptical about the currently trendy stuff, if we are looking for a way to implement the Unix philosophy on the Internet and solve the problems you mentioned, isn’t the current trend of SOA (Service-Oriented Architecture) a good way of thinking about it? Every major GUI function of the hosting site would be exposed as (for example – there are other options) an XML-RPC function – not hand-coded, but using a framework that automatically makes it so – and, in the Open Source tradition, they could leave it to other people to write utilities for migrating in, migrating out, etc.?

  8. Aaron Traas said:

    @Shenpen

    Absolutely — open web-service interfaces solve most of the problems Eric is talking about.

    I’ve actually been working with XML-RPC over HTTP a bit — it’s a much better way to do stuff. Makes life a breeze, especially since it’s plain text going over the wire that I can see in Firebug. Big step up from the annoying AMF-based remoting I was doing before. Better still, it is easily parsed and consumed by ANYTHING, from a fancy Flash RIA to grep.

    ESR: Have you looked at Google for mailing-list serving? They focus on making it easy to migrate away from their services. Bug tracking… I guess run Trac or something on a generic web hosting company that you contract cheaply, and make regular backups…

  9. Christopher Mahan said:

    So essentially, if you consider the big ball of wax that is the aggregate of project files – repository snapshots and histories, mailing lists (archives and published HTML archives), wiki content, static content such as the general web site, bug trackers and their histories, plus some of the infrastructure to run them, such as MediaWiki, Trac, Bugzilla, svn, and git, with their site-specific config files…

    Whoa, hold on there, that is a big ball of wax. And yes, essentially, migrating from one hosted solution to another involves careful and skilled manual labor.

    So you want to take this big ball of wax and allow it to be moved much more seamlessly from one host to another.

    Oh, and while it is hosted, you want to be able to script it – assuming something like Perl, Python, Ruby, or the usual Unix sed, awk, even shell – so you can do things like autobackup, auto-notifications, getting data in and out, and regenerating static content.

    The way I would do it is to create something like a virtual machine image running debian with the stuff pre-installed, and moving from one host to another would simply entail: load this image on your VPS, and a script would then check and log into the dns service and fix all the entries so everything would point correctly to the new IP address.

    Thoughts?

  10. Christopher Mahan said:

    By the way, I know you wrote an XML-RPC tool to get bug info into Bugzilla. Is there, to your knowledge, an XML-RPC interface for getting the stuff out in a sane way?

  11. esr said:

    >I’m both quite inexperienced in such issues and a bit drunk (it’s late night here), so forgive me if I’m saying something stupid, but isn’t the deeper problem behind this whole problem set that hosting sites have forgotten the basic principles of the Unix philosophy?

    That’s one way to describe the design failure, yes. And it did occur to me when I was writing the rant, I just decided not to take the rhetoric in that direction.

  12. esr said:

    SOA is a good idea as far as it goes, but with that approach of automatically exposing service interfaces on a per-page basis you risk ending up with unwanted cohesions between the flow of your web GUI and the shape of your service API. IMO.

  13. esr said:

    >The way I would do it is to create something like a virtual machine image running debian with the stuff pre-installed, and moving from one host to another would simply entail: load this image on your VPS, and a script would then check and log into the dns service and fix all the entries so everything would point correctly to the new IP address.

    I’m looking at an approach on a completely different level – essentially, an object-broker daemon that speaks JSON to clients and has back ends to manipulate the host SQL database, a Mailman instance, and so forth. Client and daemon exchange JSON objects; some are interpreted as reports, others as requests to edit state.
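
    A toy illustration of that request/report exchange – the message shapes here are invented, since no schema is specified:

```python
import json

# Invented daemon-side state: one back end ("tracker"), one object.
STATE = {"tracker": {"42": {"summary": "example bug", "status": "open"}}}

def handle(raw):
    """Toy dispatcher: interpret a JSON request, emit a JSON report."""
    msg = json.loads(raw)
    if msg["op"] == "get":
        return json.dumps({"report": STATE[msg["resource"]][msg["id"]]})
    if msg["op"] == "edit":
        STATE[msg["resource"]][msg["id"]].update(msg["value"])
        return json.dumps({"report": "ok"})

reply = handle(json.dumps({"op": "get", "resource": "tracker", "id": "42"}))
```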

    >By the way, I know you wrote an XML-RPC tool to get bug info into Bugzilla. Is there, to your knowledge, an XML-RPC interface for getting the stuff out in a sane way?

    Not to my knowledge.

  14. The Monster said:

    Given the state of virtualization technology, a hosting company ought to be able to offer its customers virtual servers to which they can connect via ssh, rsync, or whatever, and run whatever scripts suit their needs locally. If they do a lousy job and hose their VM, it shouldn’t affect other customers.

    Another advantage of virtualization is the ability to offer high availability even when individual physical servers fail. If everything is in the SAN, the loss of a physical server or three from the farm shouldn’t even be noticeable to outsiders. And customers whose needs must be scaled up can be given bigger time slices and more bandwidth with no changes on the customers’ side.

    So hosting companies should be doing it anyway.

  15. Tom said:

    Apologies in advance but I wonder what RMS would think about this thread?

    What are the chances that the technical solutions being discussed here might be developed sufficiently to address RMS’s concerns about the cloud?

  16. David McCabe said:

    The suggestion that project hosting sites offer full-blown VPS misses the point of project hosting: Hackers don’t like to be admins; it’s boring.

  17. rasker said:

    A project called Bugs Everywhere (bugseverywhere.org) could solve (or work around) part of the problem. It incorporates the bug tracker into the VCS (multiple VCSes supported), so pulling the code also gets you the bug history and the outstanding bug list.

  18. xfer_rdy said:

    Given the state of virtualization technology and the rapid commoditization of computing resources driven by multiple economic factors, you are getting exactly the quality you are paying for. The issues here are not about technology, but about governance and trust. Our data, our businesses, and the other aspects of our lives we share in this electronic medium are now surrendered to others. Many facets of our lives are held digitally, and many times held hostage, by the partners we select to proxy our relationships with others. You are effectively “relying on the kindness of strangers” to promote and help manage your relationships. We rely on them to be trusted custodians of the data that represents parts of our lives. When something goes wrong, very wrong, most people feel violated. You should feel violated; they are violating your trust. How did we put ourselves in this position?

    We blindly relinquished our sense of responsibility to others, whether they be teachers, doctors, day-care centers, lawyers, clergy, political representatives and appointees, other civil servants, or salesmen. We find ourselves repeating these same patterns with hosting providers and cool-looking web presences. Shouldn’t we stop handing over the value of our relationships, and many times the ability to earn a living, to the untrustworthy? How do we ensure the once-trustworthy are still trusted? What should we look for? We entrust our partner/provider-proxy with many important aspects of our lives; why isn’t there transparency to ensure that what we share is cared for the way we expect it to be?

    More times than not, governance is not about making data available, but about denying access to it. It is often said that “possession is nine-tenths of the law”; the same is true for denying access to data. If you have a website that you are conducting business on, many providers will permit you to upload all the data you would like for free… Getting that data back, however, is another issue. Depending on the provider, you may have to run a gauntlet of unclear fees and other costs. Cloud computing, the supposed panacea of low-cost compute, is riddled with these practices. It’s like a child’s amusement park where you pay on the way out: they don’t tell you about the fees, and the fees can change while you’re in there. If you don’t pay, they keep your children.

    Does it really matter whether the tech is CORBA, RPC, XML, WebDAV, or the next grand poobah of tech? All the protocols and technology in the world won’t help you if the provider denies you access to your data. It doesn’t matter whether it’s because they can’t manage the scale of their business or because they are holding your data for ransom; you still can’t get your data, which can stall portions of your life or income.

  19. Christian Reis said:

    Some hosting providers actually do make a commitment to provide you with all your project data upon request; I think the issue in many cases is not malicious intent, but simply one of resourcing, because designing and maintaining a full import-and-export system is