The Hand of FuManChu

logging.statistics

11/19/10

01:08:45 am, by fumanchu, 1007 words
Categories: Python, CherryPy

Statistics about program operation are an invaluable monitoring and debugging tool. How many requests are being handled per second, how much of various resources are in use, how long we've been up. Unfortunately, the gathering and reporting of these critical values is usually ad-hoc. It would be nice if we had 1) a centralized place for gathering statistical performance data, 2) a system for extrapolating that data into more useful information, and 3) a method of serving that information to both human investigators and monitoring software. I've got a proposal. Let's examine each of those points in more detail.

Data Gathering

Just as Python's logging module provides a common importable for gathering and sending messages, statistics need a similar mechanism, and one that does not require each package which wishes to collect stats to import a third-party module. Therefore, we choose to re-use the logging module by adding a statistics object to it.

That logging.statistics object is a nested dict:

import logging
if not hasattr(logging, 'statistics'): logging.statistics = {}

It is not a custom class, because that would 1) require apps to import a third-party module in order to participate, 2) inhibit innovation in extrapolation approaches and in reporting tools, and 3) be slow. There are, however, some specifications regarding the structure of the dict.

    {
   +----"SQLAlchemy": {
   |        "Inserts": 4389745,
   |        "Inserts per Second":
   |            lambda s: s["Inserts"] / (time() - s["Start"]),
   |  C +---"Table Statistics": {
   |  o |        "widgets": {-----------+
 N |  l |            "Rows": 1.3M,      | Record
 a |  l |            "Inserts": 400,    |
 m |  e |        },---------------------+
 e |  c |        "froobles": {
 s |  t |            "Rows": 7845,
 p |  i |            "Inserts": 0,
 a |  o |        },
 c |  n +---},
 e |        "Slow Queries":
   |            [{"Query": "SELECT * FROM widgets;",
   |              "Processing Time": 47.840923343,
   |              },
   |             ],
   +----},
    }

The logging.statistics dict has strictly 4 levels. The topmost level is nothing more than a set of names to introduce modularity. If SQLAlchemy wanted to participate, it might populate the item logging.statistics['SQLAlchemy'], whose value would be a second-layer dict we call a "namespace". Namespaces help multiple emitters to avoid collisions over key names, and make reports easier to read, to boot. The maintainers of SQLAlchemy should feel free to use more than one namespace if needed (such as 'SQLAlchemy ORM').

Each namespace, then, is a dict of named statistical values, such as 'Requests/sec' or 'Uptime'. You should choose names which will look good on a report: spaces and capitalization are just fine.

In addition to scalars, values in a namespace MAY be a (third-layer) dict, or a list, called a "collection". For example, the CherryPy StatsTool keeps track of what each worker thread is doing (or has most recently done) in a 'Worker Threads' collection, where each key is a thread ID; each value in the subdict MUST be a fourth-layer dict (whew!) of statistical data about each thread. We call each subdict in the collection a "record". Similarly, the StatsTool also keeps a list of slow queries, where each record contains data about each slow query, in order.
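To make the shape concrete, here is a sketch of what such a namespace might hold; the 'My WSGI App' name, the thread IDs, and all the numbers are invented for illustration, not what the CherryPy StatsTool actually records:

import logging
if not hasattr(logging, 'statistics'): logging.statistics = {}

# Hypothetical namespace with a 'Worker Threads' collection;
# each key is a thread ID, each value is a record (a dict of scalars).
appstats = logging.statistics.setdefault('My WSGI App', {})
appstats['Worker Threads'] = {
    '140532089870080': {'Requests': 132, 'Bytes Written': 48211},
    '140532081477376': {'Requests': 129, 'Bytes Written': 50102},
    }
# A collection may also be a list of records, as with slow queries.
appstats['Slow Queries'] = [
    {'Query': 'SELECT * FROM widgets;', 'Processing Time': 47.84},
    ]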

Values in a namespace or record may also be functions, which brings us to:

Extrapolation

def extrapolate_statistics(scope):
    """Return an extrapolated copy of the given scope."""
    c = {}
    for k, v in scope.items():
        if isinstance(v, dict):
            v = extrapolate_statistics(v)
        elif isinstance(v, (list, tuple)):
            v = [extrapolate_statistics(record) for record in v]
        elif callable(v):
            v = v(scope)
        c[k] = v
    return c

The collection of statistical data needs to be fast, as close to unnoticeable as possible to the host program. That requires us to minimize I/O, for example, but in Python it also means we need to minimize function calls. So when you are designing your namespace and record values, try to insert the most basic scalar values you already have on hand.

When it comes time to report on the gathered data, however, we usually have much more freedom in what we can calculate. Therefore, whenever reporting tools fetch the contents of logging.statistics for reporting, they first call extrapolate_statistics (passing the whole statistics dict as the only argument). This makes a deep copy of the statistics dict so that the reporting tool can both iterate over it and even change it without harming the original. But it also expands any functions in the dict by calling them. For example, you might have a 'Current Time' entry in the namespace with the value "lambda scope: time.time()". The "scope" parameter is the current namespace dict (or record, if we're currently expanding one of those instead), allowing you access to existing static entries. If you're truly evil, you can even modify more than one entry at a time.

However, don't try to calculate an entry and then use its value in further extrapolations; the order in which the functions are called is not guaranteed. This can lead to a certain amount of duplicated work (or a redesign of your schema), but that's better than complicating the spec.
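As a small illustration (invented names, assuming the extrapolate_statistics function above), extrapolation replaces each callable with the value it returns while leaving the live dict alone:

import time

ns = {
    'Start Time': time.time() - 10.0,
    'Important Events': 42,
    # 's' is this namespace dict, so static entries are available.
    'Events/Second': lambda s: s['Important Events'] / (time.time() - s['Start Time']),
    'Current Time': lambda s: time.time(),
    }

snapshot = extrapolate_statistics(ns)
# snapshot['Events/Second'] is now a plain float (about 4.2),
# while the original 'ns' dict still holds the callables.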

After the whole thing has been extrapolated, it's time for:

Reporting

A reporting tool would grab the logging.statistics dict, extrapolate it all, and then transform it to (for example) HTML for easy viewing, or JSON for processing by Nagios etc (and because JSON will be a popular output format, you should seriously consider using Python's time module for datetimes and arithmetic, not the datetime module). Each namespace might get its own header and attribute table, plus an extra table for each collection. This is NOT part of the statistics specification; other tools can format how they like.
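A bare-bones JSON endpoint, for instance, might do little more than this sketch (not part of the spec, and it assumes the extrapolate_statistics function above):

import json
import logging

def statistics_as_json():
    """Return a JSON snapshot of everything gathered so far."""
    stats = getattr(logging, 'statistics', {})
    return json.dumps(extrapolate_statistics(stats), default=str)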

Turning Collection Off

It is recommended each namespace have an "Enabled" item which, if False, stops collection (but not reporting) of statistical data. Applications SHOULD provide controls to pause and resume collection by setting these entries to False or True, if present.
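A control panel might pause and resume a namespace like this (using the 'My Stuff' name from the Usage example below):

# Pause collection for one namespace (reporting still works on what's there)
logging.statistics['My Stuff']['Enabled'] = False

# ...later, resume it
logging.statistics['My Stuff']['Enabled'] = True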

Usage

    import logging
    import time
    # Initialize the repository
    if not hasattr(logging, 'statistics'): logging.statistics = {}
    # Initialize my namespace
    mystats = logging.statistics.setdefault('My Stuff', {})
    # Initialize my namespace's scalars and collections
    mystats.update({
        'Enabled': True,
        'Start Time': time.time(),
        'Important Events': 0,
        'Events/Second': lambda s: (
            (s['Important Events'] / (time.time() - s['Start Time']))),
        })
    ...
    for event in events:
        ...
        # Collect stats
        if mystats.get('Enabled', False):
            mystats['Important Events'] += 1

4 comments

Comment from: Pete Fein [Visitor]

I like the idea (I've written something similar though somewhat less flexible in the past), but -1 on multiple nested dicts. The code/description gets too hard to follow - four nested dicts is at least two too many. If you need ASCII art to explain your code, you've done something wrong (see SocketServer). I find the arguments about 3rd party modules & speed somewhat specious. The class overhead compared to the basic data gathering/extrapolating operations is minimal. Classes would also allow stats gathering to be turned into a total no-op at runtime. Imagine something like:

mystats = get_stat_keeper("SQLAlchemy")
mystats.increment("Inserts")

and in your app setup code, injecting a dummy object that does nothing for increment() - no need to check a flag each time (keeps the code cleaner too). We haven't even gotten to thread safety or testing...
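A sketch of that dummy-object idea (class names invented here):

class StatKeeper:
    """Real keeper: counters in a plain dict."""
    def __init__(self):
        self.data = {}
    def increment(self, name, amount=1):
        self.data[name] = self.data.get(name, 0) + amount

class NullStatKeeper(StatKeeper):
    """Injected at setup time when stats are turned off; increment() is free."""
    def increment(self, name, amount=1):
        pass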

I agree that flexibility in extrapolation is important & using callables the way you do seems like a good way to go. Don't see how a class approach precludes that.

Similarly -1 on just stuffing this on the logging module. Yes, it's somewhat a reasonable place for this (a top-level statskeeper might be better), but it introduces potential for collisions (multiple modules defining extrapolate_stats). Come on, just use a module. ;-)

Got more thoughts but not enough time... I like tho.

11/19/10 @ 11:18
Comment from: Will Maier [Visitor] · will.m.aier.us/

Hi-

I like the idea of combining the aggregation of statistical and plain records in the logging package. In fact, I liked it so much that I did it after I read your post:

packages.python.org/statzlogger/

My implementation is slightly different: statzlogger uses custom logging.Handler subclasses with aggregation properties inspired by Google's Sawzall language to track statistics. It doesn't have any reporting or presentation bits yet, but I may add those if I need them.

You can find the code on bitbucket/github (and the documentation at p.p.o, above):

github.com/wcmaier/statzlogger
bitbucket.org/wcmaier/statzlogger

Thanks for your post!

11/19/10 @ 16:09
Comment from: Marius Gedminas [Visitor] · gedmin.as

mystats['Important Events'] += 1 is not thread-safe. See the comments at bit.ly/h9YCCT (full URL: effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm )

11/22/10 @ 15:47
Comment from: fumanchu [Member]

@Marius, yes I'm well aware. But for many many projects, it doesn't matter: either the concurrency isn't high enough to exhibit the problem, or you don't care if you experience a couple of lost updates. The example Josiah gave in his comment on effbot's page is true but idealized; in the real world, one has thousands, even millions of opcodes between successive increments, not six. Throw those in there and you quickly get over 99% accuracy. In general, that's a better tradeoff than serializing access via threading Locks. You can also trade memory for speed by appending to a list (which is generally atomic) instead of incrementing a counter.
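Roughly, that trade looks like this sketch (invented names, not what the CherryPy StatsTool actually does):

import logging, time

mystats = logging.statistics.setdefault('My Stuff', {})

# Collect: append a record; list.append is effectively atomic,
# so no lock is needed and lost updates aren't a concern.
mystats.setdefault('Important Event Log', []).append({'Time': time.time()})

# Report: derive the counter lazily, at extrapolation time.
mystats['Important Events'] = lambda s: len(s['Important Event Log'])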

11/23/10 @ 11:55
