Measure Anything, Measure Everything

Posted by Ian Malpass | Filed under data, engineering, infrastructure

If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it. In general, we tend to measure at three levels: network, machine, and application. (You can read more about our graphs in Mike’s Tracking Every Release post.)

Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot). Instead of trying to plan out everything we wanted to measure and putting it in a classical configuration management system, we decided to make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort. (And, because we can push code anytime, anywhere, it’s easy to deploy the code too, so we can go from “how often does X happen?” to a graph of X happening in about half an hour, if we want to.)

Meet StatsD

StatsD is a simple NodeJS daemon (and by “simple” I really mean simple — NodeJS makes event-based systems like this ridiculously easy to write) that listens for messages on a UDP port. (See Flickr’s “Counting & Timing” for a previous description and implementation of this idea, and check out the open-sourced code on github to see our version.) It parses the messages, extracts metrics data, and periodically flushes the data to graphite.
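As a rough illustration of the parsing step, here is a minimal Python sketch (not Etsy's NodeJS code). It assumes the `name:value|type|@rate` message format used by the open-sourced StatsD, where `c` marks a counter and `ms` a timer. The `Metric` type and `parse_metric` function are hypothetical names invented for this example.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    name: str
    value: float
    kind: str           # "c" for counters, "ms" for timers
    sample_rate: float  # 1.0 when the message carries no "@rate" field


def parse_metric(line: str) -> Metric:
    """Parse one 'name:value|type[|@rate]' message into a Metric."""
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"malformed metric: {line!r}")
    value = float(fields[0])
    kind = fields[1]
    sample_rate = 1.0
    if len(fields) > 2 and fields[2].startswith("@"):
        sample_rate = float(fields[2][1:])
    return Metric(name, value, kind, sample_rate)


print(parse_metric("grue.dinners:1|c"))
print(parse_metric("adventurer.heartbeat:1|c|@0.1"))
```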

We like graphite for a number of reasons: it’s very easy to use, and has very powerful graphing and data manipulation capabilities. We can combine data from StatsD with data from our other metrics-gathering systems. Most importantly for StatsD, you can create new metrics in graphite just by sending it data for that metric. That means there’s no management overhead for engineers to start tracking something new: simply tell StatsD you want to track “grue.dinners” and it’ll automagically appear in graphite. (By the way, because we flush data to graphite every 10 seconds, our StatsD metrics are near-realtime.)
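To make the "just send it data" idea concrete, here is a small Python sketch of the aggregate-then-flush cycle for counters. It assumes Graphite's plaintext protocol, in which each data point is a line of the form `path value timestamp`. The `CounterBucket` class is a hypothetical name, and the metric paths are simplified: this is not how the real StatsD names its output.

```python
import time
from collections import defaultdict


class CounterBucket:
    """Sum counter increments and emit one Graphite line per metric on flush."""

    def __init__(self):
        self.counts = defaultdict(float)

    def increment(self, name, value=1.0):
        self.counts[name] += value

    def flush(self, timestamp=None):
        """Return plaintext lines ('path value timestamp') and reset the bucket."""
        ts = int(time.time()) if timestamp is None else int(timestamp)
        lines = [f"{name} {total:g} {ts}"
                 for name, total in sorted(self.counts.items())]
        self.counts.clear()
        return lines


bucket = CounterBucket()
for _ in range(3):
    bucket.increment("grue.dinners")
print(bucket.flush(timestamp=1297800000))  # ['grue.dinners 3 1297800000']
```

A real daemon would call `flush()` on a timer (every 10 seconds in the setup described here) and write the resulting lines to Graphite over a socket.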

Not only is it super easy to start capturing the rate or speed of something, it’s also very easy to view, share, and brag about the resulting graphs.

Why UDP?

So, why do we use UDP to send data to StatsD? Well, it’s fast — you don’t want to slow your application down in order to track its performance — but also sending a UDP packet is fire-and-forget. Either StatsD gets the data, or it doesn’t. The application doesn’t care if StatsD is up, down, or on fire; it simply trusts that things will work. If they don’t, our stats go a bit wonky, but the site stays up. Because we also worship at the Church of Uptime, this is quite alright. (The Church of Graphs makes sure we graph UDP packet receipt failures though, which the kernel usefully provides.)
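A fire-and-forget send might look like the following Python sketch. The `send_metric` helper is hypothetical, and the host and port (`127.0.0.1:8125`) are assumptions for illustration. The point is that the caller never waits for a reply and never sees an exception if the send fails.

```python
import socket


def send_metric(payload: str, host: str = "127.0.0.1", port: int = 8125) -> bool:
    """Send one metric over UDP; never raise, never wait for a reply."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload.encode("utf-8"), (host, port))
        return True
    except OSError:
        # Losing a stat is acceptable; breaking the application is not.
        return False


# Works the same whether or not anything is listening on the other end.
send_metric("grue.dinners:1|c")
```

Note that a `True` result only means the packet was handed to the operating system, not that StatsD received it.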

Measure Anything

Here’s how we do it using our PHP StatsD library:

StatsD::increment("grue.dinners");

That’s it. That line of code will create a new counter on the fly and increment it every time it’s executed. You can then go look at your graph and bask in the awesomeness, or for that matter, spot someone up to no good in the middle of the night:

[Image: graph of the counter over time]

We can use graphite’s data-processing tools to take the data above and make a graph that highlights deviations from the norm:

[Image: graph highlighting deviations from the norm]

(We sometimes use the “rawData=true” option in graphite to get a stream of numbers that can feed into automatic monitoring systems. Graphs like this are very “monitorable.”)
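To illustrate how such a stream of numbers could feed a monitor, here is a Python sketch that parses one line of raw output and flags points far from the mean. It assumes the raw format is `target,start,end,step|v1,v2,...` with `None` for missing points. The function names and the two-standard-deviation threshold are illustrative choices, not a description of Etsy's monitoring rules.

```python
import statistics


def parse_raw_line(line):
    """Split one raw-data line into (target, start, end, step, values)."""
    header, _, data = line.strip().partition("|")
    # rsplit keeps any commas inside the target name intact.
    target, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in data.split(",")]
    return target, int(start), int(end), int(step), values


def find_outliers(values, threshold=2.0):
    """Return indexes of points more than `threshold` std devs from the mean."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return []
    mean = statistics.mean(present)
    spread = statistics.pstdev(present)
    if spread == 0:
        return []
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - mean) > threshold * spread]


line = ("grue.dinners,1297800000,1297800090,10|"
        "2.0,3.0,2.0,None,40.0,3.0,2.0,3.0,2.0")
target, start, end, step, values = parse_raw_line(line)
print(target, find_outliers(values))  # grue.dinners [4]
```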

We don’t just track trivial things like how many people are signing into the site — we also track really important stuff, like how much coffee is left in the kitchen:

[Image: graph of coffee remaining in the kitchen]

Time Anything Too

In addition to plain counters, we can track times too:

$start = microtime(true);
eat_adventurer();
StatsD::timing("grue.dinners", (microtime(true) - $start) * 1000);

StatsD automatically tracks the count, mean, maximum, minimum, and 90th percentile times (which is a good measure of “normal” maximum values, ignoring outliers). Here, we’re measuring the execution times of part of our search infrastructure:

[Image: graph of search execution times]
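The timer statistics can be sketched in a few lines of Python. This example uses the nearest-rank definition of a percentile, which is one common choice and an assumption here. The real StatsD may compute its threshold differently, and the `summarize_timings` name and result keys are hypothetical.

```python
import math


def summarize_timings(timings_ms, percentile=90):
    """Return count, mean, min, max and an upper percentile for one interval."""
    if not timings_ms:
        raise ValueError("no timings recorded")
    ordered = sorted(timings_ms)
    count = len(ordered)
    # Nearest rank: the smallest value with at least `percentile` percent
    # of the samples at or below it.
    rank = max(1, math.ceil(percentile * count / 100))
    return {
        "count": count,
        "mean": sum(ordered) / count,
        "min": ordered[0],
        "max": ordered[-1],
        f"p{percentile}": ordered[rank - 1],
    }


# One slow outlier (250 ms) dominates the max but not the 90th percentile.
timings = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]
print(summarize_timings(timings))
```

With these ten samples the maximum is 250 ms while the 90th percentile is 16 ms, which is why the percentile is the better guide to "normal" worst-case behaviour.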

Sampling Your Data

One thing we found early on is that if we want to track something that happens really, really frequently, we can start to overwhelm StatsD with UDP packets. To cope with that, we added the option to sample data, i.e. to only send packets a certain percentage of the time. For very frequent events, this still gives you a statistically accurate view of activity.

To record only one in ten events:

StatsD::increment("adventurer.heartbeat", 0.1);

What’s important here is that the packet sent to StatsD includes the sample rate, so StatsD can scale the counts back up (dividing by the sample rate) to estimate what a 100% sample rate would have recorded before it sends the data on to graphite. This means we can adjust the sample rate at will without having to deal with rescaling the y-axis of the resulting graph.
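The arithmetic can be sketched in Python as follows. Both halves are simplified assumptions rather than Etsy's code: `sampled_payload` stands in for the client deciding whether to send, and `scale_count` stands in for the server-side correction. The `|@0.1` suffix assumes the message format used by the open-sourced StatsD.

```python
import random


def sampled_payload(name, sample_rate, rng=random.random):
    """Return a counter payload for roughly `sample_rate` of calls, else None."""
    if sample_rate < 1 and rng() >= sample_rate:
        return None  # skip this event entirely
    suffix = "" if sample_rate >= 1 else f"|@{sample_rate}"
    return f"{name}:1|c{suffix}"


def scale_count(value, sample_rate):
    """Server side: estimate the true count from one sampled increment."""
    return value * (1 / sample_rate)


# 10,000 real events sampled at 10% should scale back to roughly 10,000.
random.seed(42)
sent = [p for p in (sampled_payload("adventurer.heartbeat", 0.1)
                    for _ in range(10_000)) if p]
estimate = sum(scale_count(1, 0.1) for _ in sent)
print(len(sent), round(estimate))
```

Because each received packet is counted as ten events, the estimate stays close to the true total even though only about a tenth of the packets were sent.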

Measure Everything

We’ve found that tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy. Using StatsD, we enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.

Try StatsD for yourself: grab the open-sourced code from github and start measuring. We’d love to hear what you think of it.

144 responses to Measure Anything, Measure Everything

  • Julien says:
    February 15, 2011 at 10:04 pm

    Brilliant! We use a similar approach with collectd and try to track anything relevant! The funny thing is that something can be relevant only for a specific release, so we track, and then forget!

  • Manas says:
    February 15, 2011 at 10:43 pm

    Why not have the traditional SNMP traps?

  • Ian Malpass says:
    February 15, 2011 at 11:43 pm

    JULIEN: Well, the good news is that these UDP pings are so lightweight that it’s generally not a problem to keep them around for a while, and it’s surprising how often you find that “just for this release” metrics turn out to have interesting information in them weeks later. But yes, it’s good to clean house every so often.

    MANAS: Those would have worked, I’m sure. I think there are lots of ways to solve this particular problem. As long as a given solution has next to no management overhead, and is trivially easy for engineers to use, you’ve got something useful.

  • Daniel says:
    February 15, 2011 at 11:44 pm

    I’m not familiar with the concept of negative coffee. Also, I see you guys sit around the coffee pot at 17:00 just waiting for the fresh pot to finish brewing, and then immediately chug away.

  • Eric says:
    February 16, 2011 at 1:33 am

    Will you be releasing the StatsD PHP client library?

  • Ian Malpass says:
    February 16, 2011 at 5:31 am

    Eric: Yep, it’s already in with the statsd code on github – https://github.com/etsy/statsd/blob/master/php-example.php

  • Ian Malpass says:
    February 16, 2011 at 6:02 am

    Daniel: I see you’ve spotted that our coffee monitoring system doesn’t cope well with people leaving the pot off the scale, demonstrating the importance of tracking metrics in software development.

  • Steve Ivy says:
    February 16, 2011 at 2:34 pm

    Ian, this is great stuff. I’ve already got a project this is going to get stood up for. I ported the PHP sample to Python, since that’s my environment. You can find it on my statsd fork:

    https://github.com/sivy/statsd/blob/master/python_example.py

    (I sent a pull request just in case you guys find it useful)

    Thanks again for sharing your tools!

  • Steve Ivy says:
    February 16, 2011 at 4:24 pm

    A stand-alone Python Statsd client is now at:

    https://github.com/sivy/py-statsd

    Cheers!

  • slakin.net | mattsn0w.com » New Project/Goal: Learn new shit! says:
    February 16, 2011 at 8:36 pm

    [...] stumbled across this recent posting by one of the etsy.com engineers. I am “like WOW!”. I am jumping on the web 3.0 [...]

  • efkastner says:
    February 17, 2011 at 4:11 am

    Steve: I applied your patches, good stuff.

    Someone needs to make a ruby gem or example client library for us to include!

  • Steve Ivy says:
    February 17, 2011 at 4:37 am

    Erik,

    Thanks! I noticed that a bit earlier.

    I also managed to get the standalone client into pypi tonight (pypi.python.org/pypi/pystatsd/), and got it to install via pip on my server. Now to get cairo, pixman, and pycairo working… *grumble*.

  • Tom Taylor says:
    February 18, 2011 at 12:48 pm

    I wrote a little Ruby client (basically a port of the Python example), over here:

    https://github.com/tomtaylor/statsd-client

  • Steve Ivy says:
    February 18, 2011 at 1:25 pm

    See also, perl client:

    https://github.com/sivy/statsd-client

  • Tim Spence says:
    February 19, 2011 at 8:00 pm

    Ian,
    I get that the fire-and-forget power of UDP allows your apps to track anything/everything without compromising responsiveness. I have a question about the Why behind StatsD. Before you wrote StatsD, did you find that you were saturating Graphite’s agent (carbon-agent), or was this more of a preemptive strike? I’m curious about carbon-agent’s capacity under variable load.

    Great blog, btw–it’s inspiring to see a whole crew of developers so proud of the tools they build!

  • A Smattering of Selenium #42 « Official Selenium Blog says:
    February 21, 2011 at 12:51 pm

    [...] Measure Anything, Measure Everything seems pretty cool. Suspect you could do something in your scripts to ping the counter so you could get visualizations of your runs. [...]

  • Steve Ivy says:
    February 21, 2011 at 2:49 pm

    As I mentioned to Erik (Kastner) the other day, it would be cool if there was a wiki or other public repository of stats/graphite recipes. I know how to shove data into graphite with statsd, but I don’t feel like I have a good grasp of how to best tease out the interesting graphs.

  • Mark Bainter says:
    February 23, 2011 at 6:56 pm

    Tim – I think the issue is in your first sentence. To do what they’re doing with carbon directly you’d have to have the additional overhead of building a tcp connection.

    If carbon had the ability to receive data via UDP messages like this I think it would be fine in terms of load. But this code also abstracts some of the work. As simple as it is to get data into graphite, this lets you easily add certain kinds of graphs without the developers using it having to know much about how it works.

    It also lets you force them into a given hierarchy – so they can’t clutter the root with tons of new graph paths, which is a nice touch as well.

  • Ian Malpass says:
    February 23, 2011 at 7:02 pm

    Tim – Mark’s point is a good one, but really, the key feature of StatsD is that it aggregates metrics into time buckets (10 seconds in our case). When you send data to graphite, you say “store value N for metric M at time T”. If you have multiple, separate M events happening at time T, you need a central aggregator to sum these and then send a single value to graphite. This central aggregation also allows us to do the statistical work for the timing functions – high/low/mean/90th-percentile.

  • Webdis is Full of Awesome | Jeremy Zawodny's blog says:
    February 23, 2011 at 7:12 pm

    [...] simple performance metrics without a lot of centralized processing. I could use something like StatsD from the Etsy folks but got inspired by reading about Redis at Disqus the other [...]

  • Steve Ivy says:
    February 25, 2011 at 3:48 am

    Aaaand, one more client – in node.js this time:

    https://github.com/sivy/node-statsd

  • Steve Ivy says:
    February 27, 2011 at 7:08 pm

    Joshua Frederick (jfred on github) contributed a python implementation of the statsd server. I don’t know how it compares to the node version for speed (it’s not async) but it’s pretty cool to have another implementation of the server.

  • Steve Ivy says:
    February 27, 2011 at 7:11 pm

    oh, link: https://github.com/sivy/py-statsd

  • Closet Stats Junkie – Michael Grace says:
    February 28, 2011 at 10:23 pm

    [...] This blog post by the etsy engineering team about tracking everything made me drool codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/ [...]

  • Phillip Winn says:
    March 2, 2011 at 6:21 pm

    We implemented this in my office, and found frequent 5 second delays (or delays in increments of 5 seconds, since we have multiple statsd calls per transaction). We had to turn it off.

    We don’t seem to see as many (or any) delays when the (PHP) client and server are on the same host, but as soon as they’re separated by a network, the delays are awful.

    Since Etsy is clearly using more than one server (ha!), presumably you either deal with this problem, or have worked around it, or, I suppose, have a better network than we do. There doesn’t seem to be any way to “fire and forget” an async StatsD::increment, for example.

    Any thoughts from either Etsy or other PHP users?

  • Adam says:
    March 4, 2011 at 9:31 pm

    How are you guys doing the coffee graph? I can’t seem to find any documentation in graphite about actual counting on the graphs. Nothing in the StatsD library makes me think I can control that either.

  • Andrew Gwozdziewycz says:
    March 4, 2011 at 11:16 pm

    And, I’ve added a java client–https://github.com/apgwoz/statsd/blob/master/StatsdClient.java

  • Wil Tan says:
    March 6, 2011 at 7:02 am

    We just created an erlang implementation. It’s only been tested in a rudimentary way against Joshua’s / Steve’s pystatsd.server.
