On package management: Negating the downsides of bundling

Package management is the seemingly simple task of administering software installation, upgrade, and removal. Like many things, it’s only simple when you squint your eyes from half a mile away — as you get closer to the problem, it grows increasingly complex. This is part of the reason why, almost every week, we hear about a new implementation of package management, from Linux distributions to iOS apps to, more recently, JavaScript (e.g. npm, bower, jam).

Perhaps the largest fallacy in package management is the refusal to learn anything from history: every new package manager acts as if the problem had never existed before, then runs into (and only occasionally surmounts) the same issues that were solved years, or even decades, ago. It’s NIH syndrome at its finest, but of course that is simply a microcosm of the tech industry as a whole.

One of the overarching issues under discussion and experimentation in the package-management world is whether to bundle software dependencies or use global, system-level instances. Think Linux versus iOS: Linux distros install a single copy of libraries such as zlib or openssl, whereas iOS or OS X apps tend to bundle the same functionality into every installed package that uses it.

On Linux, therefore, shared libraries are king: a single copy of the shared library zlib.so.X.Y.Z is used by every binary on the system that calls out to zlib’s compression functions. This provides some distinct benefits over bundling (e.g. here, here, here, and here), such as increased security, the avoidance of the maintenance costs that come with patching every forked, bundled copy, and decreased size both in memory and on disk (read the previous links for details).

Global copies also have significant problems, for the vendor as well as the end user, in terms of robustness and reliability, primarily due to API or ABI changes in the library itself. For the vendor, bundling increases predictability: you know very accurately what the relevant environment looks like on every single system, and that decreases your support burden. For the end user, when a global library changes in an incompatible way, it breaks everything on the system that uses it, all at once. This can be somewhat alleviated by keeping a copy of the old library around, but then you can run into extremely interesting issues when a single binary ends up using two versions of the same library at once (via its various dependencies). That’s exactly why OS X uses bundling.

So what could we do to get the best of both worlds? The ideal scenario for distributors, app developers, and end users would be some kind of blend of the bundling and global approaches that optimizes as many of the above benefits and downsides as possible, for the best possible experience all around. My proposal is to provide bundled applications (but not statically linked, rather a directory containing the app and its dependencies), coupled with two things: (1) use of a linker and loader that prefer the bundled copy while falling back to the system copy, and (2) a package manager with a deep understanding of the bundle and what it contains.

This enables, by default, all the benefits of bundling. At the same time, it doesn’t block the most important advantages of global copies: the package manager can learn when bundled copies are vulnerable to security holes, it can upgrade them to compatible versions (or incompatible versions, if the app is open source), and it can even rebuild the bundled libraries as “vanilla” versions that are guaranteed to be untouched by the vendor. While this approach doesn’t provide the guaranteed predictability that vendors desire, it does place the priorities of the end user first; a usable, more secure system trumps all.
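To make the idea more concrete, here is a minimal sketch, in Python rather than in a real linker or package manager, of how the two pieces might fit together. The bundle layout, manifest format, and function names are all hypothetical illustrations, not an existing implementation.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical bundle layout (illustration only):
#   /apps/myapp/bin/myapp
#   /apps/myapp/lib/libz.so.1.2.7         <- bundled copy of a dependency
#   /apps/myapp/bundle-manifest.json      <- what the package manager reads
SYSTEM_LIB_DIRS = [Path("/usr/lib"), Path("/lib")]

def resolve_library(bundle_dir: Path, soname: str) -> Optional[Path]:
    """(1) Prefer the bundled copy of a library, falling back to the system copy."""
    bundled = bundle_dir / "lib" / soname
    if bundled.exists():
        return bundled                       # loader prefers the bundle...
    for libdir in SYSTEM_LIB_DIRS:
        candidate = libdir / soname
        if candidate.exists():
            return candidate                 # ...but the global copy still works
    return None

def vulnerable_bundled_libs(bundle_dir: Path, advisories: dict) -> list:
    """(2) Because the package manager knows what the bundle contains, it can
    flag bundled libraries whose versions appear in security advisories."""
    manifest = json.loads((bundle_dir / "bundle-manifest.json").read_text())
    return [
        f"{name} {version}"
        for name, version in manifest["libraries"].items()
        if version in advisories.get(name, [])
    ]
```

The point is simply that the loader’s search order gives you bundling’s predictability by default, while the manifest gives the package manager enough visibility to patch, upgrade, or rebuild bundled copies when security or compatibility demands it.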

Disclosure: Apple is not a client.

Categories: apps, open-source, operating-systems, packaging.

By dberkholz
November 12, 2012 at 5:12 pm

Splunk: On culture, developer platforms, and commoditization

Left: Splunk Chairman and CEO Godfrey Sullivan. Right: CTO and co-founder Erik Swan. Credits: SiliconAngle, under CC-BY license.

For those not in the know, Splunk is a company that’s been around for almost 10 years now, building log-analysis software. They’re now going after a larger market by targeting machine-generated data beyond logs — aiming to be a platform for more generic real-time, interactive analytics. With last week’s release of Splunk 5, I figured it might be a timely occasion to write up my experience at Splunk’s annual conference, suitably called “.conf” given the purpose of the software, as part of this fall’s insane travel schedule.

My overall impression of Splunk today is that it’s a maturing company (now 600 people, and post-IPO) that hasn’t gotten overly serious about itself. It’s got a sense of humor and that’s reflected in CTO and co-founder Erik Swan, who was up on stage cracking jokes about his penchant for flip-flops. In fact Erik Swan and a fellow co-founder, Rob Das, spoke about the importance of the company’s culture even as its size spiked (start at 08:50 in that video). On the other side of the picture from Erik are people like CEO Godfrey Sullivan and new SVP of Products Guido Schroeder (out of SAP), representing the serious side of the company.

Innovation comes from customers

CEO Godfrey Sullivan (left) hanging out in the hallway and talking to an attendee at .conf 2012. Credits: Me, low lights, and a cell phone.

Another thing I found very powerful about Splunk was its customer focus, which was exemplified at all levels of the company. Above is a picture I snapped while I was just hanging around in the hallway. It shows one example of the CEO doing everything he could to spend time with users and customers, rather than isolating himself into a bubble as often seems to happen. I also saw this focus in Splunk’s product direction, which is explicitly customer-driven — most of its changes are evolutionary, not revolutionary. Let’s take a few examples:

  • the addition of a SaaS-based offering in Splunk Storm,
  • the gradual transformation from handling only log data to handling any machine data,
  • the increase in reporting speed to real-time, interactive levels, and
  • its growing interoperability with Big Data offerings such as Hadoop, along with using OData to get at Splunk data from Excel, Tableau, etc.

None of these are massive changes from what Splunk had been doing, but they combine to create the (real) impression of a company that’s keeping up with many of the important industry trends, realizing that people are looking for ease of use and interoperability with the rest of their technologies. More generally, this customer-driven focus is an outgrowth of Splunk’s history as software that was adopted bottom-up by ops people, not top-down by the CIO.

I think this focus on customer-led innovation naturally led to the next step (again, evolution, not revolution): letting customers take that same approach deeper by making Splunk into an extensible platform rather than a piece of software that gets contorted into all sorts of less-than-perfect fits.

Going after developers and creating a platform

We’ve covered the good; now for the challenges. Splunk wants to turn itself into a platform rather than a product (but really, doesn’t everybody?), which is going to be particularly tricky for a company that’s historically catered to sysadmins rather than developers. It not only has to figure out how to reach out to developers but also how to make a legitimate platform play.

It was a pleasure to see that the folks running the efforts were aware, at the highest level, of some of the critical aspects of getting developer traction, chief among them the barrier to entry (my writeup) and the community (Steve’s writeup), both of which were specifically highlighted during the keynote in the form of documentation, support, community, and ease of use. I had an outstanding chat with these folks at .conf and seem to recall that Brad Lovering (a former technical fellow at Microsoft) and I actually got kicked out of the room after running long over our time slot. So where are the difficulties going to lie? In the tactics:

  • how do you attack the barrier to entry from a product strategy and marketing point of view,
  • how do you get developers to care about and start running Splunk,
  • what are they looking for, and
  • how do you find the intersection between that and what you can offer?

It seems like they’re generally going after a “more is better” approach for now, with SDKs already available for Python, Java, JavaScript, and PHP, new ones coming for Ruby and C#, and a REST API that those SDKs, as well as any unsupported languages, can hit. Given constrained resources, though, it could make more sense to go deep on just a few of the likeliest candidate languages rather than something skin-deep that “works” everywhere.
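As a rough sketch of the REST-API fallback for languages without an SDK, something like the following should work against a local Splunk instance. The host, credentials, and the exact endpoint behavior here are assumptions from memory rather than anything presented at .conf, so treat it as illustrative only; the official SDKs are the supported route.

```python
import json
import requests  # third-party HTTP client; Splunk's own SDKs wrap this kind of call

# Host, port, and credentials below are illustrative assumptions, not real values.
SPLUNK_HOST = "https://localhost:8089"   # assumed default management port
AUTH = ("admin", "changeme")             # replace with real credentials

def oneshot_search(query: str) -> list:
    """Run a blocking search through the REST API and return the JSON results."""
    resp = requests.post(
        f"{SPLUNK_HOST}/services/search/jobs/export",
        auth=AUTH,
        data={"search": f"search {query}", "output_mode": "json"},
        verify=False,   # the management port typically uses a self-signed certificate
        stream=True,
    )
    resp.raise_for_status()
    # The export endpoint streams one JSON object per line of the response.
    return [json.loads(line) for line in resp.iter_lines() if line]

if __name__ == "__main__":
    for event in oneshot_search("index=_internal | head 5"):
        print(event.get("result", {}))
```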

Entering a broader market, and commoditization from below

Splunk as analytics software

As Splunk tries to compete in a new market for analysis of machine data, it gains a new set of competitors with varying levels of capability and, often, a much deeper knowledge of that market. SVP of Products Guido Schroeder is well aware of this, as the ongoing themes for the next generation of Splunk that he mentioned in the keynote had clear tie-ins to the broader analytical market:

  • Big Data
  • Mobile
  • Collaboration
  • Visual Analysis
  • Predictive

If you’re reading that list, you’ll see a heavy overlap with the existing and future feature sets of analytical software, be it from big guns like SAP, IBM, and SAS, or newer/smaller entrants such as Datameer, Karmasphere, Alpine Data Labs and Platfora.

Look out below!

A perpetual concern for every company should be disruption from the low end of the market, particularly as price seems to be a significant factor in the growing success of alternatives to Splunk. Whether it’s a SaaS startup like Sumo Logic or an open-source alternative like Kibana+Logstash (an increasingly popular combination in the DevOps community), you can’t rest on your laurels and assume you’ve won because there will always be new competitors at the bottom waiting to move up.

If I were Splunk, I would be far more concerned about competition and commoditization via open-source software than any of the other companies I’ve mentioned here. Why? Because particularly in the DevOps community, there’s an increasing focus on usability — as you can see if you simply click through to the above links. This is an area where open source has historically faltered, and if it manages to sort that out, it’s going to cancel out one of the major advantages of corporate software.

Disclosure: IBM and SAP are clients. Splunk is not a client but covered registration, travel, and meals for the conference. SAS, Datameer, Karmasphere, Alpine Data Labs, Platfora, and Sumo Logic are not clients.

Categories: adoption, big-data, community, data-science, devops, mobile, open-source, Uncategorized.

By dberkholz
November 9, 2012 at 11:49 am

What can data scientists learn from DevOps?

I was at Strata NY the week before last, and fortunately I got out just in time to beat Sandy. Both at Strata and since, I’ve been thinking about how the relatively new discipline of data science could learn from the gradually maturing concept of DevOps, which seems to be about 3-5 years ahead of it. In my experience, many data scientists resemble the ops side of the DevOps equation: they devote a great deal of effort to the statistical analysis without backing it up with solid software-engineering techniques, in the same way that many ops people need to be led to the joys of maintainable, reproducible, collaborative approaches to infrastructure. So how could we create a culture around what I’ll call Devalytics, for lack of a better term?

Build a culture of “Analysis as code”

In the same way the DevOps mantra is “Infrastructure as code,” today’s data scientists need to think of all their scripts as actual software that will require ongoing maintenance, enhancement, and support. To paint with a broad brush, there is no such thing as a one-off script. As soon as anyone else has access to it, or if it even sticks around on your local filesystem, it will almost inevitably be reused and applied to different situations in the future.

Rather than continuing to pretend analysis is a one-time, ad hoc action, automate it. In DevOps, the goal is to avoid logging into individual servers and typing at a prompt, because doing so greatly increases effort and decreases maintainability and reproducibility. Once something is automated, you save a huge amount of time that would otherwise be spent repeating identical steps over and over. The tradeoff is that you need to maintain the automation machinery, but a cost-benefit analysis will show the effort rapidly pays off, particularly for complex actions such as analysis that are nontrivial to get right.
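To make that concrete, here is a minimal sketch of “analysis as code”: the same steps you might otherwise type interactively, captured as a small, rerunnable script. The file names, column name, and summary statistics are placeholders, not a prescription.

```python
#!/usr/bin/env python3
"""A rerunnable analysis: same input, same steps, same output, every time."""
import argparse
import csv
import json
import statistics

def load_values(path: str, column: str) -> list:
    with open(path, newline="") as fh:
        return [float(row[column]) for row in csv.DictReader(fh)]

def summarize(values: list) -> dict:
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Summarize one column of a CSV.")
    parser.add_argument("csv_file")                      # e.g. measurements.csv (placeholder)
    parser.add_argument("--column", default="value")     # placeholder column name
    parser.add_argument("--out", default="summary.json")
    args = parser.parse_args()

    summary = summarize(load_values(args.csv_file, args.column))
    with open(args.out, "w") as fh:
        json.dump(summary, fh, indent=2)
    print(f"Wrote {args.out}: {summary}")
```

Because the steps live in a script rather than in someone’s shell history, the analysis can be versioned, reviewed, tested, and rerun on new data without anyone retyping commands.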

Teach software engineering to data scientists

Software engineering is not just writing code. Efficient automation requires that you apply modern methods in software development to your analysis. Many of today’s data scientists come from backgrounds in either statistics or other hard sciences (physics and biology are surprisingly common). They may have learned how to code, but they never learned modern development techniques such as continuous integration, collaborative development tools such as real-time chat and mailing lists, or even use of modern (a.k.a. fast, distributed) version control a la GitHub.

  • Test your code and your data. Continuous integration and unit testing let you know at all times whether your code meets the standards needed for a successful analysis. Data scientists often simply spot-check a few results, or possibly verify the final data by hand or with a script, but they rarely automate tests for either parts of the code or the output data itself (a minimal sketch follows this list). The value of appropriate control data sets and analyses is vast, yet they are relatively uncommon even in good software engineering. Data scientists have the opportunity to bring the best of both worlds, the science and the programming, together; far too often, it’s instead the worst of each.
  • Use version control. The benefits of maintaining code in version control are numerous, from the ability to see what changed over time, to easier discovery of bugs, to enabling others to work on the code simultaneously. And yet many people’s idea of version control is .bak files, with dates appended if you’re lucky. Even if you’re creating pipelines with visual programming, dragging building blocks around in a GUI, that has no bearing on the potential for version control on the backend; being able to see changes over time is critical.
  • Catalyze collaboration. When working with others, the toolset in use has a major impact upon the success and pace of progress. The best practice in leading-edge companies like GitHub today is to work asynchronously (Monktoberfest video), interrupting others only when they are willing to be interrupted, rather than when you want to interrupt them. This requires tools for real-time chat (whether it looks more like IRC, Salesforce Chatter, or IBM Connections); long-format discussion and decision-making, where the best entrants are options like Google Groups and StackExchange, while the old standby is mailing lists; and issue tracking, such as GitHub Issues or Atlassian JIRA. In essence, the goal is to bring the types of collaboration tools that have been popularized by open-source software into other styles of social business.
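As promised in the first bullet above, here is a minimal sketch of what testing both the code and the data can look like, using pytest. The summarize function, the loading step, and the expectations about the output (no missing values, everything within a plausible range) are placeholders for whatever your analysis actually requires.

```python
import math
import pytest  # run with: pytest test_analysis.py

# Placeholder analysis function; in practice you would import it from your own module.
def summarize(values):
    return {"n": len(values), "mean": sum(values) / len(values)}

# --- Test the code ----------------------------------------------------------
def test_summarize_known_input():
    result = summarize([1.0, 2.0, 3.0])
    assert result["n"] == 3
    assert math.isclose(result["mean"], 2.0)

# --- Test the data ----------------------------------------------------------
@pytest.fixture
def output_values():
    # In a real pipeline this would load the analysis output, e.g. from a CSV file.
    return [0.2, 0.5, 0.9, 0.4]

def test_no_missing_values(output_values):
    assert all(v is not None and not math.isnan(v) for v in output_values)

def test_values_in_plausible_range(output_values):
    # Domain knowledge encoded as a test; here we assume proportions in [0, 1].
    assert all(0.0 <= v <= 1.0 for v in output_values)
```

Run under continuous integration, tests like these turn “spot-checking a few results” into a check that happens on every single change.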

Apply agile development and continuous delivery

I’ve experienced, time and time again, the significant benefits you accrue by developing iteratively. Leading-edge version control such as Git (in combination with GitHub) encourages agile development simply by making it incredibly fast and easy to make many small commits instead of huge, monolithic ones. This greatly eases testing and debugging, even making much of it automatable using tools like git-bisect.

Continuous delivery is a method of bringing these small commits all the way to production services in a frequent, iterative fashion, rather than combining a series of small commits into a daily or weekly push to production. Etsy, for example, deploys to production 30 times a day (Monktoberfest video). The equivalent of continuous delivery in Devalytics is always ensuring a production-ready analysis is available on a rapid, regular basis — however incomplete it may be in terms of features. You can add features over time in an agile fashion, which prevents waterfall-style failures where nothing is ever ready for production.

Keep scaling in mind, but don’t optimize prematurely

When you need Big Data solutions, take advantage of the above methods, including lower-level DevOps techniques like configuration management for the underlying machines (virtual or bare metal). This makes it much easier to scale both the data and the algorithms. Scalability is one of the big selling points for Revolution Analytics, which parallelizes many of the core algorithms in R.

Although you should architect code with the potential for scaling later on, it often doesn’t make sense to actually incorporate scalability if you don’t anticipate any need for it. As Donald Knuth has said, “Premature optimization is the root of all evil.” And yet, the key word in that statement is premature — some percentage of the time, you actually will need to optimize. Just don’t do it when you have no need for it; that’s wasted time and effort.

Monitor the output

As a data scientist, you should be familiar with the concept that the unusual results — the anomalies — often comprise the most interesting data. They tend to drive many of the follow-up questions that result in truly unexpected discoveries rather than simply confirming a prediction.
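As a toy illustration of flagging those anomalies automatically, here is a sketch of a rolling z-score check over a metric stream. The window size, threshold, and example series are arbitrary placeholders; a real monitoring stack would feed this from its own data source.

```python
import statistics
from collections import deque

def flag_anomalies(stream, window: int = 30, threshold: float = 3.0):
    """Yield (index, value, zscore) for points far outside the recent trend."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) >= 5:  # need a little history before judging anything
            mean = statistics.mean(recent)
            stdev = statistics.stdev(recent) or 1e-9  # avoid division by zero
            z = (value - mean) / stdev
            if abs(z) > threshold:
                yield i, value, z
        recent.append(value)

# Example: a flat-ish series with one spike that should be flagged.
series = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 55.0, 10.0, 9.7]
for idx, val, z in flag_anomalies(series, window=5, threshold=3.0):
    print(f"anomaly at index {idx}: value={val}, z={z:.1f}")
```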

Besides anomalies, the other value that monitoring provides is a view of the trends over time, which you can integrate into a live dashboard with the results of various algorithms for predictive analytics. Of course these analytics are maintained in version control and brought to production using continuous delivery, just like everything else you’re now doing correctly.

People like Jason Dixon see the future of monitoring as composable open-source components. This provides an opportunity for data science to integrate into monitoring as another building block for advanced machine learning and predictive analytics.

Conclusions

The basic tenets of “Devalytics” are making your analytical work easy to replicate, build upon, and scale, while saving significant amounts of time in the process. Applying the above lessons will transform your one-off statistics or machine-learning runs into live scientific metrics that can provide significant and ongoing value.

Disclosure: Salesforce.com, IBM, and Atlassian are clients. GitHub has been a client. Revolution Analytics and Twitter are not clients (although they should be), and neither is Etsy.

Categories: big-data, cloud, data-science, devops, distributed-development, ibm, open-source, social.

By dberkholz
November 6, 2012 at 10:15 am

On developer success, GitHub, and low barriers to entry

I recently came across an intriguing post describing academic research on how GitHub has affected developer culture. This is a specific case study of something I spend a lot of time talking and thinking about — how toolsets strongly influence the cultural norms, governance, and processes of the teams using them.

In this instance, here’s what the researchers found:

  • Transparency improves on-ramps to a project. Put simply, it’s easier to learn how the culture works when it’s possible to lurk. When you can’t see how people interact and develop collaboratively, whether from the outside or as a new contributor, it takes longer to become a productive developer.
  • Continuous integration, and more broadly testing as a whole, must be easy to use if it’s going to be used at all. Making it a pain to submit things for testing means developers will skip it. If tools like Jenkins are so cleanly integrated into your ALM toolchain that contributors don’t even need to think about them unless the build/tests break, then their adoption will vastly increase.
  • One-off commits are a frequent occurrence when contributing is trivial. A user who is otherwise uninvolved in the project as a developer might make a one-time contribution because it’s so easy to do so. When you require users to fill out forms, register for barely usable software, etc., then you lose a lot of these potential contributions.
  • More negatively, a focus on behavior- and test-driven development has resulted in fewer tests for bad input. Many newer contributors have never learned to write test suites, but senior developers assume the opposite. Using BDD/TDD without teaching “safe testing” leads to a lack of tests for invalid input and failure cases; the only tests confirm that the intended results occur for the intended input (see the sketch after this list).
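To make that last point concrete, here is a minimal sketch of the difference. The parse_port function is a made-up example, not something from the cited research; the point is the second and third tests, which cover the bad input that happy-path habits tend to skip.

```python
import pytest

def parse_port(text: str) -> int:
    """Parse a TCP port number, rejecting anything outside 1-65535."""
    port = int(text)                     # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

def test_happy_path():
    # The kind of test that BDD/TDD habits reliably produce.
    assert parse_port("8080") == 8080

def test_rejects_non_numeric_input():
    # The kind of test that is often missing: what should happen on bad input?
    with pytest.raises(ValueError):
        parse_port("eighty-eighty")

def test_rejects_out_of_range_input():
    with pytest.raises(ValueError):
        parse_port("70000")
```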

The changes involving barriers to entry mirror what we’ve been saying at RedMonk for years. I would add that the potential for one-off commits dramatically widens the opening of your recruitment funnel, which gives you many more opportunities to grow your developer community. The existence of a whole new class of first-time contributors in what has become the standard model for contribution, GitHub, means this:

Some projects that perceived themselves as inevitably shrinking are, in reality, only failing to keep up with the pace of expectations in open-source development.

Disclosure: GitHub and CloudBees (which employs Jenkins founder Kohsuke Kawaguchi) are former clients.
