Node and Scaling in the Small vs Scaling in the Large

Over the past few weeks, I’ve been taking whatever spare moments I can find to think about what technologies we’re going to use to build the initial release of BankSimple. Many people would probably assume that I’d immediately reach for Scala, what with having co-authored a book on the language, but that’s not how I approach engineering problems. Each and every problem has an appropriate set of applicable technologies, and it’s up to the engineer to justify their use.

(Incidentally, Scala may well be a good fit for BankSimple, in no small part due to a bunch of third-party Java code that we need to integrate with, but that’s a whole different blog post, probably for a whole different blog.)

One of the most talked-about technologies amongst the Hacker News crowd is Node, a framework for writing and running event-driven JavaScript code on the V8 virtual machine. As part of evaluating what tech to work with, I considered Node. Yesterday, I expressed some general skepticism about Node, and the framework’s author, Ryan Dahl, requested that I write up my thoughts in more detail. So here we are.

I’m certainly not here to disparage Ryan, who’s a nice guy and a superb programmer; he knows more about low-level C than most of us could ever hope to, and without so much as a neckbeard to show for it. Nor am I here to question the enthusiastic community that’s quickly grown up around Node; if you’ve found a tool you enjoy working with and are committed to growing with it, more power to you.

Rather, the purpose of this post is to question two of the Node project’s stated goals, goals which seem to me to be at cross purposes.

What’s Node For?

The About section of Node’s homepage states:

“Node’s goal is to provide an easy way to build scalable network programs.”

Then, a couple paragraphs on, we’re told:

“Because nothing blocks, less-than-expert programmers are able to develop fast systems [with Node].”

So, is the purpose of Node to provide an easy way to build scalable network programs, or is it to allow less-than-expert programmers to develop “fast systems”?

While these goals may appear related, they’re very different in practice. In order to better understand why, we need to make a distinction between what I’m calling “scaling in the small” and “scaling in the large”.

Scaling In the Small

In a system of no significant scale, basically anything works.

The power of today’s hardware is such that, for example, you can build a web application that supports thousands of users using one of the slowest available programming languages, brutally inefficient datastore access and storage patterns, zero caching, no sensible distribution of work, no attention to locality, etc. etc. Basically, you can apply every available anti-pattern and still come out the other end with a workable system, simply because the hardware can move faster than your bad decision-making.

This is a great thing, actually. It means we can prototype haphazardly using whatever technologies we hold dear, and those prototypes will often take us farther than we expected. Better still, when we hit a roadblock, getting around it is trivial. Moving forward just means spending several minutes thinking about your problem and picking an implementation technology with slightly better performance characteristics than whatever you were using before.

This is where I believe Node fits in.

If you look at who’s flocking to Node, it’s largely web developers who have been working in dynamic languages with what we could politely call limited performance characteristics. Adding Node to their architectures means that these developers have gone from having essentially no concurrency story and very constrained runtime performance to having some semi-sane concurrency story – one rigidly enforced by the Node framework – running on a virtual machine with comparatively respectable performance. They slice off a painful bit of their application that’s suited to asynchrony, rewrite it in Node, and move on.

That’s awesome. That kind of outcome definitely meets Node’s secondary stated goal of “less-than-expert programmers” being “able to develop fast systems”. However, it has very little to do with scaling in the larger, more widely-understood sense of the term.

Scaling In the Large

In a system of significant scale, there is no magic bullet.

When your system is faced with a deluge of work to do, no one technology is going to make it all better. When you’re operating at scale, pushing the needle means a complex, coordinated dance of well-applied technologies, development techniques, statistical analyses, intra-organizational communication, judicious engineering management, speedy and reliable operationalization of hardware and software, vigilant monitoring, and so forth. Scaling is hard. So hard, in fact, that the ability to scale is a deep competitive advantage of the sort that you can’t simply go out and download, copy, purchase, or steal.

Herein lies my criticism of Node’s primary stated goal: “to provide an easy way to build scalable network programs”. I fundamentally do not believe that there is an easy way to build scalable anything. What’s happening is that people are confusing easy problems for easy solutions.

If you have an easy problem that was handily solved by moving from one piece of extremely limiting technology to a bleeding-edge piece of slightly less limiting technology, consider yourself lucky, but it doesn’t mean you’re operating at scale. Twitter certainly had such an easy win when, while at a fraction of the scale the service now operates at, one of their engineers rewrote their in-house Ruby-based message queue in Scala. That was great, but it was scaling in the small. Twitter is still fighting an uphill battle to scale in the large, because doing so is about much, much more than which technology you choose.

Growing Up Node

What concerns me about Node is that it’s going to be difficult to grow with as engineers move from scaling in the small to scaling in the large. (No, I’m not making the “callbacks turn into a pile of spaghetti code” argument, although I think you hear that time and again because it’s an actual developer pain point in async systems.)

The bold thing about Node is that everything is asynchronous, right down to file I/O; here, I admire Ryan’s commitment to a consistent and clear thesis for his software. Engineers who deeply understand their system’s workload may find places where Node’s model is a good fit, and may be a good fit effectively indefinitely; we won’t know that until Node sees long-term mature deployment. Most systems I’ve worked on change, though. Workloads change. The data you’re moving around changes. What was once well-handed by an asynchronous solution is suddenly now better served by a threaded solution, or vice versa, or you’re faced with some other out-of-the-blue change entirely.

If you’re deeply invested in Node, you’re stuck with one way of approaching concurrency; one way of modeling your problems and solutions. If it doesn’t fit into an event-based model, you’re hosed. If, on the other hand, you’re working with a system that allows for a variety of concurrency approaches (the JVM, the CLR, C, C++, GHC, etc.), you have the flexibility to change your concurrency model as your system evolves.

At the moment, Node’s core premise – that events necessarily yield performance – is still in question. Researchers at UC Berkeley found that “threads can achieve all of the strengths of events, including support for high concurrency, low overhead, and a simple concurrency model”. A later study that builds on that work shows that events and a pipelining approach perform equally well, and that blocking sockets can actually increase performance. In the industrial Java world, it comes up periodically that non-blocking I/O isn’t necessarily a better fit than threads. Even one of the most-cited documents on the subject with the blatant title of Why Threads Are A Bad Idea ends up concluding that you shouldn’t abandon threads for high-end servers. There just isn’t a one-size-fits-all concurrency solution.

In fact, taking a hybrid approach to concurrency seems to be the way forward if the academy is any indication. Computer scientists at the University of Pennsylvania found that a combination of threads and events offers the best of both worlds. The Scala team at EPFL has argued that Actors unify thread- and event-based programming into one tidy, easily comprehensible abstraction. Russ Cox, formerly of Bell Labs and now on the Go programming language project at Google goes so far as to argue that threads vs events is a nonsensical question. (Note that none of this even begins to touch on the distribution aspect of scaling a system; threads are constructs for a single computer, and events are constructs for a single CPU. We’re not even talking about distributing work across machines in a straightforward way, as you can in Erlang, but that’s worth thinking about if you’re babysitting a rapidly growing system.)

The point being: seasoned engineers are using a mix of threaded, event-based, and alternative concurrency approaches like Actors and, experimentally, STM. To them, the idea that “non-blocking means it’s fast” sounds, well, a bit silly; it’s scalability urban mythology. The guys that are getting paid the big bucks to deliver scalable solutions aren’t up at night feverishly rewriting their systems in Node. They’re doing what they’ve always done: measuring, testing, benchmarking, thinking hard, keeping up with the academic literature that pertains to their problems. That’s what scaling in the large necessitates.

Conclusion

For my investment of engineering time, I’d rather build on a system that allows me the flexibility of mixing an async approach with other ways of modeling concurrency. A hybrid concurrency model may not be as straightforward and pure as Node’s approach, but it’s going to be more adaptable. While BankSimple is in its infancy, we’re going to be faced with the happy problems of scaling in the small, and Node might be a reasonable choice for us at this early stage. But when we do have to scale in the large, I’d rather have a variety of options open to me, and I’d rather not be faced with the prospect of a big rewrite under pressure.

Node is a lovely bit of code with an enthusiastic community, a whip-smart maintainer, and a bright future. As a “bridge technology” that offers immediate solutions to early scaling problems in a way that’s particularly accessible to a generation of web developers who largely come from dynamic language backgrounds, it makes sense. Node more than meets its secondary stated goal of bringing reasonable performance to developers of intermediate experience who need to solve network-oriented problems. Node is, for a certain breed of programmer, familiar and fun, and it’s undeniably easy to get started with. People in the Node community are having a good time reinventing the familiar wheels of web frameworks, package management, testing libraries, etc., and I don’t begrudge them that. Every programming community reinvents those things to their norms.

Once we’ve made clear what Node is a good fit for, though, it’s important to remember that there are no panaceas for problems of significant scale. Node and its strictly asynchronous event-driven approach should be viewed as a very early point on a continuum of technologies and methodologies that comprise scaling in the large.

Approach popular solutions cautiously. Everybody may be talking about a hot new technology, but very few people are actually working at a scale at which those technologies are going to be put through their paces. Those who are tend to be heads-down in numbers and research, working with tools and techniques that have been maturing over time. If you invest your time in a new technology, be ready to learn and grow with it, and maybe to jump ship when you find yourself constrained.

This stuff isn’t easy.

—Jul 27, 2010

Alex Payne writes online here.

See also the archive, books & talks.

An individual post follows.