Mike Daisey and Globalization

People may have seen the episode of This American Life that retracted the previous piece with Mike Daisey. It is a pretty uncomfortable thing to listen to. The relevant point is that Mike Daisey fabricated many of the details of his first-hand account to make Chinese labor conditions look much harsher than they are, and then lied to the show’s fact checkers to keep them from uncovering this.

The question is, does it matter that he lied? Everyone agrees Chinese labor conditions are harsh by western standards. His defense is that although, yes, he did make up much of his story, it could have been true. His defenders say this doesn’t really undermine his main point; his critics say that it does.

Personally I think it is mostly an irrelevant distraction. Globalization is literally remaking the lives of billions of people, and whether you believe the change is good or bad should be primarily an empirical discussion about whether things are getting better or worse in these countries. It shouldn’t be carried out in overly emotional, under-reasoned stage performances. Whether or not Mike Daisey’s anecdotes about his visit to China actually took place shouldn’t really matter to us because we shouldn’t make decisions about global trade policy on the basis of theatrical anecdotes.

The important debate we should be having is about whether globalization—particularly shipping low-skill work to other countries with lower wages—is a good or bad thing for these countries. Popular opinion is that it is bad because it enriches large corporations by exploiting workers. However, there is an argument to be made that it is good. That argument is best laid out in this 1997 article by Paul Krugman. I won’t repeat the content of the article because I think it’s short and worth reading.

I read that article in the late nineties, and it greatly changed my perception of economics and of the debate around globalization. The reason I think everyone should read it is that it phrases the debate in terms of what is good for third-world countries rather than in terms of emotional appeals. Economists see globalization as a huge transfer of money, technology, and skills from the first world to the third world. They see this as permanently transforming the standard of living in these countries in a way that a century of poverty relief and charitable aid hasn’t. I think the focus of the discussion should be less on whether Mike Daisey lied and more on whether these beliefs are true.

I am not an economist and I haven’t studied this, so I don’t know if this is correct. But my values are pretty straightforward. I think people born in the US won a geographical lottery and have opportunities that are either completely out of reach or exceedingly unlikely for people in other parts of the world. We have a moral obligation to strive for equalization; I would support this process even if it were bad for the US (which it doesn’t seem to be). I come from a quite liberal family, so it particularly galls me that the left seems willing to decide an issue that one way or another deeply impacts the lives of billions of people purely on the basis of a kneejerk belief that large corporations are evil.

Look, we could learn that this doesn’t work the way we thought. Maybe the manufacturing jobs do not lead to the development of industry the way they are supposed to. Or maybe increasing economic vitality doesn’t lead to greater political freedom in repressive regimes. Or maybe there are other unforeseen consequences. Economists are, after all, the people most likely to fuck up a country on an unproven but intellectually stimulating theory. But let’s talk about the evidence of these things happening or not happening.

For people who oppose globalization, I think the question is: how will you accomplish this global equalization? It isn’t enough to have a vague, icky aesthetic feeling about buying a phone made by people who earn less than you; you have to have a plausible way to fix this inequality. The people who have thought the most about this think that buying the phone is the best way to fix it, so it is really important to be sure they are wrong before you try to stop people from buying phones.

Posted at 3:22 AM on March 20, 2012

Getting Real About Distributed System Reliability

There is a lot of hype around distributed data systems, some of it justified. It’s true that the internet has centralized a lot of computation onto services like Google, Facebook, Twitter, LinkedIn (my own employer), and other large web sites. It’s true that this centralization puts a lot of pressure on system scalability for these companies. It’s true that incremental, horizontal scalability is a deep property that requires designing a system from the ground up and can’t be bolted onto existing products. It’s true that, if properly designed, these systems can be run with no planned downtime or maintenance intervals in a way that is much harder for traditional storage systems. It’s also true that software explicitly designed to deal with machine failures is a very different thing from traditional infrastructure. All of these properties are critical to large web companies, and they are what drove the adoption of horizontally scalable systems like Hadoop, Cassandra, Voldemort, etc. I was the original author of Voldemort and have worked on distributed infrastructure for the last four years or so, so insofar as there is a “big data” debate, I am firmly in the “pro” camp. But one thing you often hear is that this kind of software is more reliable than the traditional alternatives it replaces, and that just isn’t true. It is time people talked honestly about this.

You hear this assumption of reliability everywhere. Now that scalable data infrastructure has a marketing presence, it has really gotten bad. If Hadoop or Cassandra or what-have-you can tolerate machine failures, then they must be unbreakable, right? Wrong.

Where does this idea come from? Distributed systems need to partition data or state over lots of machines to scale. Adding machines increases the probability that some machine will fail, and to address this these systems typically keep replicas or some other form of redundancy to tolerate failures. The core argument that gets used for these systems is that if a single machine has probability P of failing, if the software replicates data N times to survive N-1 failures, and if machines fail independently, then the probability of losing a particular piece of data must be P^N. So for any tolerable probability of data loss R and any single-node failure probability P you can pick some replication factor N so that P^N < R. This argument is the core motivation behind most variations on replication and fault tolerance in distributed systems. It is true that without this property the system would be hopelessly unreliable as it grew. But it also leads people to believe that distributed software is somehow innately reliable, which unfortunately is utter hogwash.
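To make the arithmetic concrete, here is a minimal sketch of that calculation in Scala, with purely hypothetical numbers (a 1% per-node failure probability and three replicas). The point is how far the answer moves once failures are correlated rather than independent.

    object ReplicationMath {
      def main(args: Array[String]): Unit = {
        val p = 0.01  // assumed probability that a single node fails in some window (hypothetical)
        val n = 3     // number of replicas (hypothetical)
        // Under the independence assumption, data is lost only if all n replicas fail:
        val lossIfIndependent = math.pow(p, n)  // 1e-6
        // If failures are perfectly correlated (bad config push, shared software bug),
        // losing one replica means losing them all:
        val lossIfCorrelated = p                // 0.01 -- four orders of magnitude worse
        println("loss probability, independent failures: " + lossIfIndependent)
        println("loss probability, perfectly correlated failures: " + lossIfCorrelated)
      }
    }

Real clusters sit somewhere between these two extremes, but as argued below they are usually much closer to the correlated end than the independence assumption admits.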

Where is the flaw in the reasoning? Is it the dreaded Hadoop single point of failure? No, it is far more fundamental than that: the problem is the assumption that failures are independent. Surely no belief could possibly be more counter to our own experience or just common sense than believing that there is no correlation between failures of machines in a cluster. You take a bunch of pieces of identical hardware, run them on the same network gear and power systems, have the same people run and manage and configure them, and run the same (buggy) software on all of them. It would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.

The reliability implied by P^N is an upper bound, and one that you could never, never approach in practice. For example, Google has a fantastic paper that reports empirical numbers on system failures in Bigtable and GFS, including groups of correlated failures occurring at rates several orders of magnitude higher than the independence assumption would predict. This is what one of the best systems and operations teams in the world can get: your numbers may be far worse.

The actual reliability of your system depends largely on how bug-free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn’t any different from traditional systems, except that the new software is far less mature. I don’t mean this disparagingly; I work in this area, and it is just a fact. Maturity comes with time and usage and effort. This software hasn’t been around for as long as MySQL or Oracle, and worse, the expertise to run it reliably is much less common. MySQL and Oracle administrators are plentiful, but folks with, say, serious production ZooKeeper operations experience are much rarer.

Kernel filesystem engineers say it takes about a decade for a new filesystem to go from concept to maturity. I am not sure these systems will be mature much faster—they are not easier systems to build and the fundamental design space is much less well explored. This doesn’t mean they won’t be useful sooner, especially in domains where they solve a pressing need and are approached with an appropriate amount of caution, but they are not yet by any means a mature technology.

Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. Code that deals with failure cases and is “cluster aware” is extremely tricky to get right. The root of the problem is that dealing with failures effectively explodes the possible state space that needs testing and validation. For example, it doesn’t even make sense to expect a single-node database to be fast if its disk system suddenly gets really slow (how could it be?), but a distributed system does need to carry on in the presence of a single degraded machine, because with so many machines one is sure to be degraded. These kinds of “semi-failures” are common and very hard to deal with. Correctly testing these kinds of issues in a realistic setting is brutally hard, and the newer generation of software doesn’t have anything like the QA processes of its more mature predecessors. (If you get a chance, get someone who has worked at Oracle to describe the kind of testing a line of code goes through before it ships to customers in their database.) As a result there are a lot of bugs. And of course these bugs exist on all the machines, so they absolutely happen together.

Likewise distributed systems typically require more configuration and more complex configuration because they need to be cluster aware, deal with timeouts, etc. This configuration is, of course, shared; and this creates yet another opportunity to bring everything to its knees.

And finally these systems usually want lots of machines. And no matter how good you are, some part of operational difficulty always scales with the number of machines.

Let’s discuss some real issues. We had a bug in Kafka recently that led to the server incorrectly interpreting a corrupt request as a corrupt log and shutting itself down to avoid appending to a corrupt log. Single-machine log corruption is the kind of thing that should only happen due to a disk error, and bringing down the corrupt node is the right behavior—it shouldn’t happen on all the machines at the same time unless all the disks fail at once. But since the corruption was due to corrupt requests, and since we had one client that sent corrupt requests, that client was able to sequentially bring down all the servers. Oops. Another example is this Linux bug, which causes the system to crash after ~200 days of uptime. Since machines are commonly restarted sequentially, this led to a situation where a large percentage of machines went hard down one after another. Likewise any memory management problems—either leaks or GC problems—tend to happen everywhere at once or not at all. Some companies publish public post-mortems for major failures, and these are a wealth of examples of failures in systems that aren’t supposed to fail. This paper has an excellent summary of HDFS availability at Yahoo—they note how few of the problems are of the kind that high availability for the namenode would solve. This list could go on, but you get the idea.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important, but good operations can often work around the limitations of bad (or incomplete) software, whereas good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.

These difficulties are one of the core barriers to adoption for distributed data infrastructure. LinkedIn and other companies that have a deep interest in doing creative things with data have taken on the burden of building this kind of expertise in-house—we employ committers on Hadoop and other open source projects on both our engineering and operations teams, and have done a lot of from-scratch development in this space where there were gaps. This makes it feasible to take full advantage of an admittedly valuable but immature set of technologies, and lets us build products we couldn’t otherwise—but this kind of investment only makes sense at a certain size and scale. It may be too high a cost for small startups or for companies outside the web space trying to bootstrap this kind of knowledge inside a more traditional IT organization.

This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.

I think there is really only one thing to talk about with respect to reliability: continuous hours of successful production operation. That’s it. In some ways this is the most obvious thing, but it is not typically what you hear when people talk about these systems. I will believe a system can tolerate (some) failures when I see it tolerate those failures; I will believe it can run for a year without downtime when I see it run for a year without downtime. I call this empirical reliability (as opposed to theoretical reliability). And getting good empirical reliability is really, really hard. These systems end up being large hunks of monitoring, tests, and operational procedures with a teeny little distributed system strapped to the back.

You see this showing up in discussions of the CAP theorem all the time. The CAP theorem is a useful thing, but it applies to system designs more than to system implementations. A design is simple enough that you can maybe prove it provides consistency or tolerates partitions under some assumptions, and this is a useful lens through which to look at system designs. But you can’t hope to do this kind of proof for an actual system implementation, the thing you actually run. The difficulty of building these things means it is really unthinkable that these systems are, in actual reality, consistent, available, or partition tolerant—they certainly all have numerous bugs that will break each of these guarantees. I really like this paper, which takes the approach of actually trying to calculate the observed consistency of eventually consistent systems—they seem to do it via simulation rather than measurement, which is unfortunate, but the idea is great.

This isn’t to say that system design is meaningless; it is worth discussing, since the design acts as a kind of limiting factor on certain aspects of reliability and performance as the implementation matures and improves. Just don’t take it too seriously as a guarantee of anything.

So why isn’t this kind of empirical measurement more talked about? I don’t know. My pet theory is that it has to do with the somewhat rigid and deductive mindset of classical computer science. This mindset is inherited from pure math and conflicts with the attitude of the empirical sciences. It leads to a preference for starting with axioms and then proving various properties that follow from them. This world view doesn’t embrace the kind of empirical measurement you would expect to justify claims about reality (for another example, see this great blog post on productivity claims in programming language research). But that is getting off topic. Suffice it to say, when making predictions about how a system will work in the real world, I believe in measurements of reality a lot more than arguments from first principles.

I think we should insist on a little more rigor and empiricism in this area.

I would love to see claims about practicality or reliability in academic publications justified in the same way we justify performance claims—by doing it. I would be a lot more likely to believe an academic distributed system was practically feasible if it had been run continuously under load for a year and if information on its failures and outages was reported. Maybe that isn’t feasible for an academic project, but few other allegedly scientific disciplines get away with making claims about reality without evidence.

More broadly, I would like to see more systems whose operation is verifiable. Many systems can log information about their state in a way that makes programmatic checking of various invariants and guarantees (such as consistency or availability) possible. An example of such an invariant for a messaging system is that all the messages sent to it are received by all the subscribers. We actually measure the truth of this statement in real time in production for our Kafka data pipeline, for all 450 topics and all their subscribers. The number of corner cases one uncovers with this kind of check, run against a few hundred use cases and a few billion messages per day, is truly astounding. I think this is a special case of a broad class of verification that can be done on the running system and that goes far, far deeper than what is traditionally considered either monitoring or testing. Call it unit testing in production, or self-proving systems, or next-generation monitoring, or whatever, but this kind of deep verification is what turns the theoretical claims a design makes into measured properties of the real, running system.
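As a rough illustration only (this is not how our Kafka audit is actually implemented), here is a minimal sketch of such an invariant check; the count-fetching functions are hypothetical hooks into whatever metrics your own pipeline exposes.

    object PipelineAudit {
      /** Returns a description of every (topic, subscriber) pair whose counts do
        * not line up, i.e. every violation of the invariant that all messages
        * sent to a topic are received by all of its subscribers. */
      def check(
          topics: Seq[String],
          subscribers: Seq[String],
          producedCount: String => Long,            // messages produced per topic (hypothetical hook)
          consumedCount: (String, String) => Long   // messages consumed per (topic, subscriber) (hypothetical hook)
      ): Seq[String] =
        for {
          topic <- topics
          produced = producedCount(topic)
          sub <- subscribers
          consumed = consumedCount(topic, sub)
          if consumed != produced
        } yield s"topic=$topic subscriber=$sub produced=$produced consumed=$consumed"
    }

In practice you would also need a completeness window, since messages produced a moment ago may legitimately not have reached every subscriber yet.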

Likewise, if you have a “NoSQL vendor” I think it is reasonable to ask them to provide hard information on customer outages. They don’t need to tell you who the customer is, but they should let you know the observed real-life distribution of MTTF and MTTR they are able to achieve, not just highlight one or two happy cases. Make sure you understand how they measure this: do they run an automated test load, or do they just wait for people to complain? This is a reasonable thing for people paying for a service to ask for. To a certain extent, if they provide this kind of empirical data it isn’t clear why you should even care what their architecture is, beyond intellectual curiosity.
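If a vendor does hand over such numbers, the standard steady-state relation availability ≈ MTTF / (MTTF + MTTR) turns them into an availability estimate. The sketch below uses made-up figures purely for illustration.

    object AvailabilityEstimate {
      // Steady-state approximation: the fraction of time the service is up.
      def availability(mttfHours: Double, mttrHours: Double): Double =
        mttfHours / (mttfHours + mttrHours)

      def main(args: Array[String]): Unit = {
        // Hypothetical numbers: a failure roughly every 90 days, two hours to recover.
        val a = availability(mttfHours = 2160.0, mttrHours = 2.0)
        println("estimated availability: " + a)  // ~0.99907, i.e. roughly "three nines"
      }
    }

The distribution matters as much as the average: an MTTR that is two hours on average but occasionally a full day describes a very different service from one that is two hours every time.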

Distributed systems solve a number of key problems at the heart of scaling large websites. I absolutely think this is going to become the default approach to handling state in internet application development. But no one benefits from the kind of irrational exuberance that currently surrounds the “big data” and NoSQL systems communities. This area is experiencing a boom—but nothing takes us from boom to bust like unrealistic hype and broken promises. We should be a little more honest about where these systems already shine and where they are still maturing.

Posted at 11:58 AM on March 19, 2012

Scala Macros: “Oh God Why?”

Since this is now being read by people I don’t know, let me add this preface. Scala is one of the better things happening in programming. Any complaints come from the fact that I want it to work as more than a playground for innovation in language design, and I think there is always a tension between innovation and refinement. Refutations from Martin Odersky, surely one of the world’s calmer and more reasonable people, and from Havoc Pennington provide a good counter-balance.

I saw this conversation on Twitter:

[Screenshot of the Twitter exchange]

This was my reaction to the Scala macros proposal too. Not because there is anything necessarily bad about macros or the proposal, but just because—is this really the most critical thing? For context, Coda Hale works at Yammer, Stu Hood works at Twitter, Ted Nyman works at (Bank) Simple, and I work at LinkedIn. Each of these companies is trying to write stuff in Scala, and so we have a big investment in Scala working out as a language that is good for writing production code (not just as a language that tests out new ideas, or that shows solidarity with the functional programming movement, or that acts as a signaling mechanism to tell potential applicants you are hipper than the average Java programmer).

I think the Scala folks have done a good job of putting together something that is significantly better than Java as a language. It’s fast, concise, statically type-checked, and inherits a lot of the good operational properties of the JVM. But languages aren’t just, well, languages; they aren’t even just runtime environments, for that matter. They are platforms and tool chains and libraries and bodies of documentation, and cultures, and practices. And here Scala does not yet do so well.

My personal experience with Scala came from working on Apache Kafka, which is a distributed messaging system that serves as LinkedIn’s data pipeline, so I have been working in Scala for the last few years. I still think using Scala was a good decision for a hard, stand-alone piece of software like this, and I am glad we did. Scala fixes all kinds of nastiness in the Java language. But it isn’t all roses: Scala has all kinds of gaps everywhere else. You end up with very pretty code, but gobs of practical annoyances in trying to write that code. My strong hope is that the Scala folks will focus on fixing these bread-and-butter issues, so that those of us who just want to use programming languages will be left with a better all-around developer experience.

Here is a list of basic issues that hurt actual use of Scala for writing programs. These are pretty well known and much griped about elsewhere, but they bear repeating.

  • The compiler is slow. C++ slow. Waiting on a compiler is not a good way to spend your day.
  • This would be less of an issue except that IDE support is also weak. A lot of the advantages of static typing and rich libraries really surface when you have an IDE to help navigate, auto-complete, and continuously compile, and neither Eclipse nor IntelliJ is quite there yet. Progress has been made on Eclipse over the last few years, much of it recently, but there is a long way to go before it is the kind of tool you can work with seamlessly without the semi-regular pauses and crashes and spurious errors that are just really annoying in the kind of tool you stare at all day long.
  • The generated jar files are comically large.
  • There does seem to be some interest in release-to-release compatibility, but it is somewhat haphazard. Upgrades have been nightmarish for us because transitive dependencies between internal libraries, combined with the lack of compatibility between Scala releases, lead to a kind of “stop the company and everyone upgrade” fire drill each time we do it.
  • The error messages from the compiler are often deeply cryptic.
  • Comprehensible runtime stack traces and profiling are defeated by the many layers of Scala compiler magic.
  • There isn’t much in the way of a standard library, which means always relying on Java’s standard libraries. What is there often has some questionable parts (e.g. virtually any attempt to just instantiate a DoubleLinkedList in the shipped version of 2.8.0 seemed to throw a NullPointerException, which kind of implies no one tried running it). Java’s standard libraries are pretty awkward to use even from Java, and much more awkward when mixed in with all your pretty Scala code.
  • The documentation is pretty impenetrable. Unlike Java, type signatures in Scala don’t really explain idiomatic usage to mortals very well. Needless to say, the docs don’t have anything so pedestrian as examples. (Of course I would kill for the kind of user-annotated documentation that MySQL and PHP have).

Note how deeply unfancy these problems are. I think most programming time is taken up by this kind of deeply unfancy problem: renaming things, waiting for compiles, stupid errors, debugging, understanding libraries, etc. This is true whether you are working on the dreariest enterprise workflow thingy or the sexiest mobile social distributed bigdata cloud framework.

When it comes to Scala language development or the community, I am barely even a bystander. I don’t follow the mailing list or contribute code, so I’m not really in a position to describe the values or focus of that community, and even this kind of commentary is a bit whiny. But from what I can see, the focus of Scala development seems to be on innovation. I am sympathetic; innovation is fun. Scala folks seem to want to do things like macros, or actors, or parallel collections. Unfortunately my experience with actors left me convinced we should stick with Doug Lea’s code for any production usage. Nor, frankly, is the lack of a parallel collections library the thing holding back Scala adoption by any stretch of the imagination. A parallel collections library is way up at the top of the Maslow hierarchy of programming needs; a reasonably complete, well-thought-out, bug-free, Scala-friendly I/O package in the standard library is much closer to “food, clothing, and shelter”.

Back to macros. Does Scala need them? I don’t know, maybe—I’m not knowledgeable enough to say. In most languages, I understand macros to provide two things: (1) forced code inlining and (2) a kind of non-strict semantics, or lazy evaluation of arguments. Scala can already do lazy evaluation of arguments with by-name parameters and functions, and the JVM makes a reasonable attempt at on-the-fly inlining, so I don’t see the compelling need. But that is not my point. The Scala folks seem super smart, so I assume Scala will be better off with macros and that surely I will understand why once they have done it. Rather, what I am asking is: why this, and why now? Given all the other issues, why keep adding deep, hard language features that must certainly complicate fixing all the dreary, unsexy problems we poor Scala users already have?
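For what it’s worth, here is a tiny sketch of that point: a by-name parameter already gives lazy evaluation of an argument, which is one of the classic motivations for macros. The logDebug helper is made up for illustration and doesn’t come from any real library.

    object ByNameExample {
      var debugEnabled = false

      // `message` is a by-name parameter: the argument expression is only
      // evaluated if and when it is actually used inside the method body.
      def logDebug(message: => String): Unit =
        if (debugEnabled) println(message)

      def main(args: Array[String]): Unit = {
        // The (deliberately expensive) string construction below never runs,
        // because debugging is off -- no macro needed for this.
        logDebug("state dump: " + (1 to 1000000).map(_.toString).mkString(","))
      }
    }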

I understand why no one wants to work on writing documentation or putting together a fast, rigorously regression-tested standard library. Those things are boring. It is reasonable that people who are smart enough to work on building a language like Scala don’t want to spend their time on that kind of thing.

My own experience is that you can only push the innovation envelope so far. You can club together maybe two or five or ten audacious new ideas, but at some point if you want to make something valuable you need to stop having new ideas and start fixing the ten thousand little things that prevent people from making use of your bold thoughts.

Posted at 1:22 PM on March 11, 2012

Does off-the-shelf machine learning need a benchmark?

I really like the blog post “Why Generic Machine Learning Fails” by Joe Reisinger of metamarkets. His point is that successful industrial applications of machine learning are usually not based on black box algorithms being blindly fed data; rather they come from humans thinking deeply about the nature of the problem and iteratively working with a data set and algorithm—fiddling with features, tuning parameters, loss functions, etc. Usually the results are dependent on domain-specific insights that are not transferable to any other problem. A related discussion on the specific importance of domain-specific insight versus modeling skill can be found here.

This take on the discipline of modeling isn’t algorithm-centric; instead, the person doing the modeling has access to a large bag of mathematical and algorithmic tricks and creatively applies them to solve the problem at hand. There may be little that is reusable between problems, because each trick has its pros and cons, and the tricks that work in one application are unlikely to be the tricks that work in another application. In school my adviser referred to this approach to prediction as “cooking”—namely, experimenting with lots of ingredients until you get something that tastes good. I think this was meant to be a little derogatory, in that cooking isn’t really based on a guiding theory, but also somewhat admiring, since cooking is a bit of an art (and he liked to cook).

The practical superiority of this “cooking” approach matches my experience as well. Most of the gains in the data and modeling problems I was involved in at LinkedIn came from creative feature engineering and from analyzing results to debug where our models did well and where they did poorly (and then brainstorming solutions). It’s easy to see why academia doesn’t focus on this: research should be broadly applicable, so domain-specific feature engineering isn’t very sexy. This isn’t only a matter of preference, either; in commercial settings there are certain advantages. Specifically, our data sets aren’t static. You often own the source of the data as well as the model, so you can always go and get new independent predictors by collecting new things or creating new feedback loops. This is generally not the case for academic research, which works against fixed, static data sets by necessity. For a fixed, static data set, of course, the algorithm is the only thing that can be changed.

Since, after all, organizations often only have a few dozen key data problems, it is not too burdensome to do custom modeling work on each problem. In addition to producing better results, this is generally cheaper and simpler than building out generic infrastructure. Even for products which are very prediction-centric (e.g. recommendations) the actual model building portion is a small fraction of the overall effort of a production system, so a custom approach to modeling doesn’t add much to the work involved.

But—and here is what I really want to argue—even if the “cooking” approach to data problems is currently the best path for folks in industry, that doesn’t make it a good paradigm for machine learning and statistics research. After all, one of the main motivations for machine learning as a discipline distinct from statistics was an interest in “black box” prediction without the bag of distribution-centric tricks and expertise that traditional statistical modeling favored. The end goal should be to automate the statistician as completely as possible, not just to add more tricks to her bag. I think this is an area where academia falls a little bit short.

A related point is discussed in a great opinion paper by Larry Wasserman (he of All of Statistics fame). The thesis of the paper is that although the primary distinction of the last decade or so in statistics has been frequentist versus Bayesian styles of modeling, a more important goal is achieving low-assumption high-dimensional modeling. “Assumptions” are, after all, what make a model fail to be “off-the-shelf” since each assumption requires checking and tweaking to make sure it is true and limits the approach to only certain applicable domains.

In some areas you can actually see things moving backwards. For example, the fiddliness of Bayesian statistics is something that has always bothered me. On the theoretical side, Bayesians have collapsed lots of disjoint heuristics into a core set of principles, but unfortunately the actual modeling practice advocated by most Bayesians seems to move further from off-the-shelf. Hierarchical modeling and priors and messing around with MCMC, and all of that, add to the assumptions and make the role of a good statistician conducting the analysis all the more central.
