Intermediate and Advanced Software Carpentry in Python

Author:	C Titus Brown
Date:	June 18, 2007

Welcome! You have stumbled upon the class handouts for a course I taught at Lawrence Livermore National Lab, June 12-June 14, 2007.

These notes are intended to accompany my lecture, which was a demonstration of a variety of "intermediate" Python features and packages. Because the demonstration was interactive, these notes are not complete notes of what went on in the course. (Sorry about that; they have been updated from my actual handouts to be more complete...)

However, all 70 pages are free to view and print, so enjoy.

All errors are, of course, my own. Note that almost all of the examples starting with '>>>' are doctests, so you can take the source and run doctest on it to make sure I'm being honest. But do me a favor and run the doctests with Python 2.5 ;).

Note that Day 1 of the course ran through the end of "Testing Your Software"; Day 2 ran through the end of "Online Resources for Python"; and Day 3 finished it off.

Example code (mostly from the C extension sections) is available here; see the README for more information.

Contents

Idiomatic Python
- Some basic data types
- List comprehensions
- Building your own types
- Iterators
- Generators
- assert
- Conclusions
Structuring, Testing, and Maintaining Python Programs
- Programming for reusability
- Modules and scripts
- Packages
- A short digression: naming and formatting
- Another short digression: docstrings
- Sharing data between code
- Scoping: a digression
- Back to sharing data
- How modules are loaded (and when code is executed)
- PYTHONPATH, and finding packages & modules during development
- setup.py and distutils: the old fashioned way of installing Python packages
- setup.py, eggs, and easy_install: the new fangled way of installing Python packages
Testing Your Software
- An introduction to testing concepts
- The doctest module
- Unit tests with unittest
- Testing with nose
- Code coverage analysis
- Adding tests to an existing project
- Concluding thoughts on automated testing
An Extended Introduction to the nose Unit Testing Framework
- What are unit tests?
- Why use a framework? (and why nose?)
- A few simple examples
  - Test fixtures
  - Examples are included!
- A somewhat more complete guide to test discovery and execution
  - Running tests
  - Debugging test discovery
- The nose command line
  - -w: Specifying the working directory
  - -s: Not capturing stdout
  - -v: Info and debugging output
  - Specifying a list of tests to run
- Running doctests in nose
- The 'attrib' plug-in -- selectively running subsets of tests
- Running nose programmatically
- Writing plug-ins -- a simple guide
- nose caveats -- let the buyer beware, occasionally
- Credits
Idiomatic Python revisited
- sets
- any and all
- Exceptions and exception hierarchies
- Function Decorators
- try/finally
- Function arguments, and wrapping functions
Measuring and Increasing Performance
- Which profiler should you use?
- Measuring code snippets with timeit
Speeding Up Python
- psyco
  - Installing psyco
  - Using psyco
- pyrex
Tools to Help You Work
- IPython
- screen and VNC
- Trac
Online Resources for Python
Wrapping C/C++ for Python
- Manual wrapping
- Wrapping Python code with SWIG
- Wrapping C code with pyrex
- ctypes
- SIP
- Boost.Python
- Recommendations
- One or two more notes on wrapping
Packages for Multiprocessing
- threading
- Writing (and indicating) threadsafe C extensions
- parallelpython
- Rpyc
- pyMPI
- multitask
Useful Packages
- subprocess
- rpy
- matplotlib
Idiomatic Python Take 3: new-style classes
- Managed attributes
- Descriptors
GUI Gossip
Python 3.0

Idiomatic Python

Extracts from The Zen of Python by Tim Peters:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Readability counts.

(The whole Zen is worth reading...)

The first step in programming is getting stuff to work at all.

The next step in programming is getting stuff to work regularly.

The step after that is reusing code and designing for reuse.

Somewhere in there you will start writing idiomatic Python.

Idiomatic Python is what you write when the only thing you're struggling with is the right way to solve your problem, and you're not struggling with the programming language or some weird library error or a nasty data retrieval issue or something else extraneous to your real problem. The idioms you prefer may differ from the idioms I prefer, but with Python there will be a fair amount of overlap, because there is usually at most one obvious way to do every task. (A caveat: "obvious" is unfortunately in the eye of the beholder, to some extent.)

For example, let's consider the right way to keep track of the item number while iterating over a list. So, given a list z,

>>> z = [ 'a', 'b', 'c', 'd' ]

let's try printing out each item along with its index.

You could use a while loop:

>>> i = 0
>>> while i < len(z):
...    print i, z[i]
...    i += 1
0 a
1 b
2 c
3 d

or a for loop:

>>> for i in range(0, len(z)):
...    print i, z[i]
0 a
1 b
2 c
3 d

but I think the clearest option is to use enumerate:

>>> for i, item in enumerate(z):
...    print i, item
0 a
1 b
2 c
3 d

Why is this the clearest option? Well, look at the Zen of Python extract above: it's explicit (we used enumerate); it's simple; it's readable; and I would even argue that it's prettier than the while loop, if not exactly "beautiful".

Python provides this kind of simplicity in as many places as possible, too. Consider file handles; did you know that they were iterable?

>>> for line in file('data/listfile.txt'):
...    print line.rstrip()
a
b
c
d

Where Python really shines is that this kind of simple idiom -- in this case, iterables -- is very very easy not only to use but to construct in your own code. This will make your own code much more reusable, while improving code readability dramatically. And that's the sort of benefit you will get from writing idiomatic Python.

Some basic data types

I'm sure you're all familiar with tuples, lists, and dictionaries, right? Let's do a quick tour nonetheless.

'tuples' are all over the place. For example, this code for swapping two numbers implicitly uses tuples:

>>> a = 5
>>> b = 6
>>> a, b = b, a
>>> print a == 6, b == 5
True True

That's about all I have to say about tuples.

I use lists and dictionaries all the time. They're the two greatest inventions of mankind, at least as far as Python goes. With lists, it's just easy to keep track of stuff:

>>> x = []
>>> x.append(5)
>>> x.extend([6, 7, 8])
>>> x
[5, 6, 7, 8]
>>> x.reverse()
>>> x
[8, 7, 6, 5]

It's also easy to sort. Consider this set of data:

>>> y = [ ('IBM', 5), ('Zil', 3), ('DEC', 18) ]

The sort method will run cmp on pairs of tuples, and cmp compares tuples element by element, so the list ends up sorted on the first element of each tuple:

>>> y.sort()
>>> y
[('DEC', 18), ('IBM', 5), ('Zil', 3)]

Often it's handy to sort tuples on a different tuple element, and there are several ways to do that. I prefer to provide my own comparison function:

>>> def sort_on_second(a, b):
...   return cmp(a[1], b[1])

>>> y.sort(sort_on_second)
>>> y
[('Zil', 3), ('IBM', 5), ('DEC', 18)]

Note that here I'm using the built-in cmp function (which is what sort uses by default: y.sort() is equivalent to y.sort(cmp)) to do the comparison of the second part of the tuple.

This kind of function is really handy for sorting dictionaries by value, as I'll show you below.

(For a more in-depth discussion of sorting options, check out the Sorting HowTo.)
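One of the other ways to sort on a different element is the key argument, which sort and sorted have accepted since Python 2.4. The sketch below is only an illustration (the names by_count and by_count_desc are made up for it); it is also the form that keeps working in Python 3, where the comparison-function argument to sort was removed.

```python
from operator import itemgetter

y = [('IBM', 5), ('Zil', 3), ('DEC', 18)]

# key= is called once per element; itemgetter(1) extracts the second tuple item.
by_count = sorted(y, key=itemgetter(1))

# reverse=True flips the order without writing a second comparison function.
by_count_desc = sorted(y, key=itemgetter(1), reverse=True)
```

Note that sorted returns a new list and leaves y untouched, whereas y.sort(key=...) would sort in place.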

On to dictionaries!

Your basic dictionary is just a hash table that takes keys and returns values:

>>> d = {}
>>> d['a'] = 5
>>> d['b'] = 4
>>> d['c'] = 18
>>> d
{'a': 5, 'c': 18, 'b': 4}
>>> d['a']
5

You can also initialize a dictionary using the dict type to create a dict object:

>>> e = dict(a=5, b=4, c=18)
>>> e
{'a': 5, 'c': 18, 'b': 4}

Dictionaries have a few really neat features that I use pretty frequently. For example, let's collect (key, value) pairs where we potentially have multiple values for each key. That is, given a file containing this data,

a 5
b 6
d 7
a 2
c 1

suppose we want to keep all the values? If we just did it the simple way,

>>> d = {}
>>> for line in file('data/keyvalue.txt'):
...   key, value = line.split()
...   d[key] = int(value)

we would lose all but the last value for each key:

>>> d
{'a': 2, 'c': 1, 'b': 6, 'd': 7}

You can collect all the values by using get:

>>> d = {}
>>> for line in file('data/keyvalue.txt'):
...   key, value = line.split()
...   l = d.get(key, [])
...   l.append(int(value))
...   d[key] = l
>>> d
{'a': [5, 2], 'c': [1], 'b': [6], 'd': [7]}

The key point here is that d.get(k, default) is equivalent to d[k] if d[k] already exists; otherwise, it returns default. So, the first time each key is used, l is set to an empty list; the value is appended to this list, and then the value is set for that key.

(There are tons of little tricks like the ones above, but these are the ones I use the most; see the Python Cookbook for an endless supply!)
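Two closely related ways to collect multiple values per key are dict.setdefault and collections.defaultdict (the latter is new in Python 2.5). The sketch below assumes the file contents shown earlier and uses an in-memory list, lines, as a stand-in for data/keyvalue.txt so that it runs without the data file.

```python
from collections import defaultdict

# Stand-in for the lines of data/keyvalue.txt shown earlier.
lines = ['a 5', 'b 6', 'd 7', 'a 2', 'c 1']

# setdefault returns d[key] if it exists; otherwise it stores the default
# under that key and returns it, so the append always hits the stored list.
d = {}
for line in lines:
    key, value = line.split()
    d.setdefault(key, []).append(int(value))

# defaultdict(list) creates the empty list automatically on first access.
dd = defaultdict(list)
for line in lines:
    key, value = line.split()
    dd[key].append(int(value))
```

Both loops build the same mapping as the get-based version; which one reads best is largely a matter of taste.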

Now let's try combining some of the sorting stuff above with dictionaries. This time, our contrived problem is that we'd like to sort the keys in the dictionary d that we just loaded, but rather than sorting by key we want to sort by the sum of the values for each key.

First, let's define a sort function:

>>> def sort_by_sum_value(a, b):
...    sum_a = sum(a[1])
...    sum_b = sum(b[1])
...    return cmp(sum_a, sum_b)

Now apply it to the dictionary items:

>>> items = d.items()
>>> items
[('a', [5, 2]), ('c', [1]), ('b', [6]), ('d', [7])]
>>> items.sort(sort_by_sum_value)
>>> items
[('c', [1]), ('b', [6]), ('a', [5, 2]), ('d', [7])]

and voila, you have your list of (key, values) pairs sorted by summed values; the keys are simply the first element of each pair!
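The same ordering can be sketched with a key function instead of a comparison function. This is an alternative, not the approach used in the course; the dictionary is written out literally here so the example stands alone.

```python
d = {'a': [5, 2], 'c': [1], 'b': [6], 'd': [7]}

# Sort the (key, values) pairs by the sum of each values list.
# The sort is stable, so pairs with equal sums keep their relative order.
items = sorted(d.items(), key=lambda pair: sum(pair[1]))

# Keep only the keys, in sorted order.
keys = [key for key, values in items]
```

Because 'a' and 'd' both sum to 7, their relative order depends on the order in which the dictionary yields its items.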

As I said, there are tons and tons of cute little tricks that you can do with dictionaries. I think they're incredibly powerful.

List comprehensions

List comprehensions are neat little constructs that will shorten your lines of code considerably. Here's an example that constructs a list of the squares of the numbers 0 through 4:

>>> z = [ i**2 for i in range(0, 5) ]
>>> z
[0, 1, 4, 9, 16]

You can also add in conditionals, like requiring only even numbers:

>>> z = [ i**2 for i in range(0, 10) if i % 2 == 0 ]
>>> z
[0, 4, 16, 36, 64]

The general form is

[ expression for var in list if conditional ]

so pretty much anything you want can go in expression and conditional.
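To make the general form concrete, here is the even-squares comprehension next to the loop it abbreviates. The name z_loop is introduced only for this comparison.

```python
# Comprehension form, as in the example above.
z = [i**2 for i in range(0, 10) if i % 2 == 0]

# The same computation written out as a loop.
z_loop = []
for i in range(0, 10):         # for var in list
    if i % 2 == 0:             # conditional
        z_loop.append(i**2)    # expression
```

The two lists are identical; the comprehension just folds the loop, the test, and the append into one expression.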

I find list comprehensions to be very useful for both file parsing and for simple math. Consider a file containing data and comments:

# this is a comment or a header
1
# another comment
2

where you want to read in the numbers only:

>>> data = [ int(x) for x in open('data/commented-data.txt') if x[0] != '#' ]
>>> data
[1, 2]

This is short, simple, and very explicit!

For simple math, suppose you need to calculate the average and stddev of some numbers. Just use a list comprehension:

>>> import math
>>> data = [ 1, 2, 3, 4, 5 ]
>>> average = sum(data) / float(len(data))
>>> stddev = sum([ (x - average)**2 for x in data ]) / float(len(data))
>>> stddev = math.sqrt(stddev)
>>> print average, '+/-', stddev
3.0 +/- 1.41421356237

Oh, and one rule of thumb: if your list comprehension is longer than one line, change it to a for loop; it will be easier to read, and easier to understand.

Building your own types

Most people should be pretty familiar with basic classes.

>>> class A:
...   def __init__(self, item):
...      self.item = item
...   def hello(self):
...      print 'hello,', self.item

>>> x = A('world')
>>> x.hello()
hello, world

There are a bunch of neat things you can do with classes, but one of the neatest is building new types that can be used with standard Python list/dictionary idioms.

For example, let's consider a basic binning class.

>>> class Binner:
...   def __init__(self, binwidth, binmax):
...     self.binwidth, self.binmax = binwidth, binmax
...     nbins = int(binmax / float(binwidth) + 1)
...     self.bins = [0] * nbins
...
...   def add(self, value):
...     bin = value / self.binwidth
...     self.bins[bin] += 1

This behaves as you'd expect:

>>> binner = Binner(5, 20)
>>> for i in range(0,20):
...   binner.add(i)
>>> binner.bins
[5, 5, 5, 5, 0]

...but wouldn't it be nice to be able to write this?

for i in range(0, len(binner)):
   print i, binner[i]

or even this?

for i, bin in enumerate(binner):
   print i, bin

This is actually quite easy, if you make the Binner class look like a list by adding two special functions:

>>> class Binner:
...   def __init__(self, binwidth, binmax):
...     self.binwidth, self.binmax = binwidth, binmax
...     nbins = int(binmax / float(binwidth) + 1)
...     self.bins = [0] * nbins
...
...   def add(self, value):
...     bin = value / self.binwidth
...     self.bins[bin] += 1
...
...   def __getitem__(self, index):
...     return self.bins[index]
...
...   def __len__(self):
...     return len(self.bins)

>>> binner = Binner(5, 20)
>>> for i in range(0,20):
...   binner.add(i)

and now we can treat Binner objects as normal lists:

>>> for i in range(0, len(binner)):
...   print i, binner[i]
0 5
1 5
2 5
3 5
4 0

>>> for n in binner:
...   print n
5
5
5
5
0

In the case of len(binner), Python knows to use the special method __len__, and likewise binner[i] just calls __getitem__(i).

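The second case involves a bit more implicit magic. Here, Python sees that Binner has no __iter__ method but does define __getitem__, so it falls back on the sequence protocol: it calls __getitem__ with 0, 1, 2, ... until an IndexError is raised (in this case by the underlying list), and that ends the loop.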
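The role of __getitem__ in a for loop can be seen in isolation with a small class that defines nothing else. Countdown is a hypothetical class invented for this sketch; the point is only that raising IndexError is what ends the iteration.

```python
class Countdown:
    """Hypothetical example: yields n, n-1, ..., 1 using __getitem__ only."""
    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # The for loop passes 0, 1, 2, ... and stops at the first IndexError.
        if index >= self.n:
            raise IndexError(index)
        return self.n - index

values = [v for v in Countdown(3)]
```

Note that Countdown has neither __len__ nor __iter__, yet it can still be looped over.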

Note that making your own read-only dictionaries is pretty simple, too: just provide the __getitem__ function, which is called for non-integer values as well:

>>> class SillyDict:
...    def __getitem__(self, key):
...       print 'key is', key
...       return key
>>> sd = SillyDict()
>>> x = sd['hello, world']
key is hello, world
>>> x
'hello, world'

You can also write your own mutable types, e.g.

>>> class SillyDict:
...   def __setitem__(self, key, value):
...      print 'setting', key, 'to', value
>>> sd = SillyDict()
>>> sd[5] = 'world'
setting 5 to world

but I have found this to be less useful in my own code, where I'm usually writing special objects like the Binner type above: I prefer to specify my own methods for putting information into the object type, because it reminds me that it is not a generic Python list or dictionary. However, the use of __getitem__ (and some of the iterator and generator features I discuss below) can make code much more readable, and so I use them whenever I think the meaning will be unambiguous. For example, with the Binner type, the purpose of __getitem__ and __len__ is not very ambiguous, while the purpose of a __setitem__ function (to support binner[x] = y) would be unclear.

Overall, the creation of your own custom list and dict types is one way to make reusable code that will fit nicely into Python's natural idioms. In turn, this can make your code look much simpler and feel much cleaner. The risk, of course, is that you will also make your code harder to understand and (if you're not careful) harder to debug. Mediating between these options is mostly a matter of experience.

Iterators

Iterators are another built-in Python feature; unlike the list and dict types we discussed above, an iterator isn't really a type, but a protocol. This just means that Python agrees to respect anything that supports a particular set of methods as if it were an iterator. (These protocols appear everywhere in Python; we were taking advantage of the mapping and sequence protocols above, when we defined __getitem__ and __len__, respectively.)

The iterator protocol is a more general version of the sequence protocol; here's an example:

>>> class SillyIter:
...   i = 0
...   n = 5
...   def __iter__(self):
...      return self
...   def next(self):
...      self.i += 1
...      if self.i > self.n:
...         raise StopIteration
...      return self.i

>>> si = SillyIter()
>>> for i in si:
...   print i
1
2
3
4
5

Here, __iter__ just returns self, an object that has the function next(), which (when called) either returns a value or raises a StopIteration exception.
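The examples in these notes use the Python 2 method name, next. In Python 3 the same method is spelled __next__, so a class that should work under both can define one and alias the other. The sketch below is a variant of SillyIter written on that assumption (SillyIter3 is a made-up name).

```python
class SillyIter3:
    """Same idea as SillyIter, usable from both Python 2 and Python 3."""
    def __init__(self, n=5):
        self.i = 0
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):          # the name Python 3 looks for
        self.i += 1
        if self.i > self.n:
            raise StopIteration
        return self.i

    next = __next__              # the name Python 2 looks for

collected = list(SillyIter3())
```

Here the counters are set in __init__ rather than as class attributes, which makes it explicit that each instance carries its own state.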

We've actually already met several iterators in disguise; in particular, enumerate is an iterator. To drive home the point, here's a simple reimplementation of enumerate:

>>> class my_enumerate:
...   def __init__(self, some_iter):
...      self.some_iter = iter(some_iter)
...      self.count = -1
...
...   def __iter__(self):
...      return self
...
...   def next(self):
...      val = self.some_iter.next()
...      self.count += 1
...      return self.count, val
>>> for n, val in my_enumerate(['a', 'b', 'c']):
...   print n, val
0 a
1 b
2 c

You can also iterate through an iterator the "old-fashioned" way:

>>> some_iter = iter(['a', 'b', 'c'])
>>> while 1:
...   try:
...      print some_iter.next()
...   except StopIteration:
...      break
a
b
c

but that would be silly in most situations! I use this if I just want to get the first value or two from an iterator.
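For the "first value or two" case, itertools.islice is another option that avoids the explicit try/except. A minimal sketch, with first_two and remaining as illustrative names:

```python
from itertools import islice

some_iter = iter(['a', 'b', 'c'])

# islice stops after two items and leaves the rest of the iterator alone.
first_two = list(islice(some_iter, 2))

# The underlying iterator has only been advanced past what was taken.
remaining = list(some_iter)
```

If the iterator holds fewer items than requested, islice simply returns what is there instead of raising StopIteration.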

With iterators, one thing to watch out for is the return of self from the __iter__ function. You can all too easily write an iterator that isn't as re-usable as you think it is. For example, suppose you had the following class:

>>> class MyTrickyIter:
...   def __init__(self, thelist):
...      self.thelist = thelist
...      self.index = -1
...
...   def __iter__(self):
...      return self
...
...   def next(self):
...      self.index += 1
...      if self.index < len(self.thelist):
...         return self.thelist[self.index]
...      raise StopIteration

This works just like you'd expect as long as you create a new object each time:

>>> for i in MyTrickyIter(['a', 'b']):
...   for j in MyTrickyIter(['a', 'b']):
...      print i, j
a a
a b
b a
b b

but it will break if you create the object just once:

>>> mi = MyTrickyIter(['a', 'b'])
>>> for i in mi:
...   for j in mi:
...      print i, j
a b

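because both loops share the same object, and therefore the same self.index: the inner loop runs the index past the end of the list, so the outer loop has nothing left to iterate over.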
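One common way around this problem is to separate the container from the iterator, so that __iter__ hands out a fresh iterator object each time instead of returning self. The sketch below is one possible arrangement, with hypothetical class names; it defines both __next__ and next so that it runs under Python 3 as well as Python 2.

```python
class MyListIter:
    """Iterator: holds the position for a single pass over the list."""
    def __init__(self, thelist):
        self.thelist = thelist
        self.index = -1

    def __iter__(self):
        return self

    def __next__(self):
        self.index += 1
        if self.index < len(self.thelist):
            return self.thelist[self.index]
        raise StopIteration

    next = __next__  # Python 2 spelling


class MyReusableIterable:
    """Iterable: returns a new MyListIter on every call to __iter__."""
    def __init__(self, thelist):
        self.thelist = thelist

    def __iter__(self):
        return MyListIter(self.thelist)


mi = MyReusableIterable(['a', 'b'])
pairs = [(i, j) for i in mi for j in mi]
```

Each for loop now gets its own index, so nesting two loops over the same mi object produces all four combinations.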

Generators

Generators are Python's (somewhat limited) take on coroutines. Essentially, they're functions that let you suspend execution and return a result, then pick up again where they left off:

>>> def g():
...   for i in range(0, 5):
...      yield i**2
>>> for i in g():
...    print i
0
1
4
9
16

You could do this with a list just as easily, of course:

>>> def h():
...   return [ x ** 2 for x in range(0, 5) ]
>>> for i in h():
...    print i
0
1
4
9
16

But you can do things with generators that you couldn't do with finite lists. Consider two full implementations of Eratosthenes' Sieve for finding prime numbers, below.

First, let's define some boilerplate code that can be used by either implementation:

>>> def divides(primes, n):
...   for trial in primes:
...      if n % trial == 0: return True
...   return False

Now, let's write a simple sieve with a generator:

>>> def prime_sieve():
...    p, current = [], 1
...    while 1:
...        current += 1
...        if not divides(p, current): # if any previous primes divide, cancel
...            p.append(current)           # this is prime! save & return
...            yield current

This implementation will find (within the limitations of Python's math functions) all prime numbers; the programmer has to stop it herself:

>>> for i in prime_sieve():
...    print i
...    if i > 10:
...        break
2
3
5
7
11

So, here we're using a generator to implement the generation of an infinite series with a single function definition. To do the equivalent with an iterator would require a class, so that the object instance can hold the variables:

>>> class iterator_sieve:
...    def __init__(self):
...       self.p, self.current = [], 1
...    def __iter__(self):
...       return self
...    def next(self):
...       while 1:
...          self.current = self.current + 1
...          if not divides(self.p, self.current):
...             self.p.append(self.current)
...             return self.current

>>> for i in iterator_sieve():
...    print i
...    if i > 10:
...        break
2
3
5
7
11
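Instead of breaking out of the loop by hand, an infinite generator can also be bounded from the outside with itertools. The sketch below repeats divides and prime_sieve from above so that it is self-contained; first_five and below_twenty are illustrative names.

```python
from itertools import islice, takewhile

def divides(primes, n):
    for trial in primes:
        if n % trial == 0:
            return True
    return False

def prime_sieve():
    p, current = [], 1
    while True:
        current += 1
        if not divides(p, current):
            p.append(current)
            yield current

# Take a fixed number of primes, however large they turn out to be.
first_five = list(islice(prime_sieve(), 5))

# Take primes for as long as they stay below a bound.
below_twenty = list(takewhile(lambda n: n < 20, prime_sieve()))
```

Both helpers are lazy: the generator only does as much work as is needed to produce the requested values.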

It is also much easier to write routines like enumerate as a generator than as an iterator:

>>> def gen_enumerate(some_iter):
...   count = 0
...   for val in some_iter:
...      yield count, val
...      count += 1

>>> for n, val in gen_enumerate(['a', 'b', 'c']):
...   print n, val
0 a
1 b
2 c

Abstruse note: we don't even have to catch StopIteration here, because the for loop simply ends when some_iter is done!

assert

One of the most underused keywords in Python is assert. Assert is pretty simple: it takes a boolean, and if the boolean evaluates to False, it fails (by raising an AssertionError exception). assert True is a no-op.

>>> assert True
>>> assert False
Traceback (most recent call last):
   ...
AssertionError

You can also put an optional message in:

>>> assert False, "you can't do that here!"
Traceback (most recent call last):
   ...
AssertionError: you can't do that here!

assert is very, very useful for making sure that code is behaving according to your expectations during development. Worried that you're getting an empty list? assert len(x). Want to make sure that a particular return value is not None? assert retval is not None.

Also note that 'assert' statements are removed from optimized code (that is, when Python is run with the -O flag), so only use them to check conditions related to actual development, and make sure that the statement you're evaluating has no side effects. For example,

>>> a = 1
>>> def check_something():
...   global a
...   a = 5
...   return True
>>> assert check_something()

will behave differently when run under optimization than when run without optimization, because the assert line will be removed completely from optimized code.
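A simple way to keep the behavior identical with and without optimization is to do the work outside the assert and only assert on the stored result. The sketch below uses a dictionary named state in place of the global variable; that substitution is just for illustration.

```python
state = {'a': 1}

def check_something():
    state['a'] = 5     # the side effect we want to happen in every mode
    return True

# Call the function unconditionally, then assert on the result.
# Under -O only the assert line disappears; the call above still runs.
ok = check_something()
assert ok
```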

If you need to raise an exception in production code, see below. The quickest and dirtiest way is to just "raise Exception", but that's kind of non-specific ;).
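As a rough illustration of the difference, a check that must survive optimization can be written as an explicit test that raises a specific built-in exception. The function average below is a hypothetical helper, and the except ... as syntax assumes Python 2.6 or later.

```python
def average(data):
    """Hypothetical helper: mean of a non-empty sequence of numbers."""
    # An explicit check is still executed under python -O, unlike an assert.
    if len(data) == 0:
        raise ValueError("average() needs at least one value")
    return sum(data) / float(len(data))

try:
    average([])
except ValueError as exc:
    message = str(exc)
```

ValueError is more specific than a bare Exception, so callers can catch this one failure without swallowing unrelated errors.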

Conclusions

Use of common Python idioms -- both in your Python code and for your new types -- leads to short, sweet programs.

Structuring, Testing, and Maintaining Python Programs

Python is really the first programming language in which I started re-using code significantly. In part, this is because it is rather easy to compartmentalize functions and classes in Python. Something else that Python makes relatively easy is building testing into your program structure. Combined, reusability and testing can have a huge effect on maintenance.

Programming for reusability

It's difficult to come up with any hard and fast rules for programming for reusability, but my main rules of thumb are: don't plan too much, and don't hesitate to refactor your code [1].

In any project, you will write code that you want to re-use in a slightly different context. It will often be easiest to cut and paste this code rather than to copy the module it's in -- but try to resist this temptation a bit, and see if you can make the code work for both uses, and then use it in both places.

[1]	If you haven't read Martin Fowler's Refactoring, do so -- it describes how to incrementally make your code better. I'll discuss it some more in the context of testing, below.

Modules and scripts

The organization of your code source files can help or hurt you with code re-use.

Most people start their Python programming out by putting everything in a script:

calc-squares.py:
  #! /usr/bin/env python
  for i in range(0, 10):
     print i**2

This is great for experimenting, but you can't re-use this code at all!

(UNIX folk: note the use of #! /usr/bin/env python, which tells UNIX to execute this script using whatever python program is first in your path. This is more portable than putting #! /usr/local/bin/python or #! /usr/bin/python in your code, because not everyone puts python in the same place.)

Back to reuse. What about this?

calc-squares.py:
  #! /usr/bin/env python
  def squares(start, stop):
     for i in range(start, stop):
        print i**2

  squares(0, 10)

I think that's a bit better for re-use -- you've made squares flexible and re-usable -- but there are two mechanistic problems. First, it's named calc-squares.py, which means it can't readily be imported. (Import filenames have to be valid Python names, of course!) And, second, were it importable, it would execute squares(0, 10) on import - hardly what you want!

To fix the first, just change the name:

calc_squares.py:
  #! /usr/bin/env python
  def squares(start, stop):
    for i in range(start, stop):
        print i**2

  squares(0, 10)

Good, but now if you do import calc_squares, the squares(0, 10) code will still get run! There are a couple of ways to deal with this. The first is to look at the module name: if it's calc_squares, then the module is being imported, while if it's __main__, then the module is being run as a script: