Longest common subsequence

2009-03-11 • Algorithms, Python, C++, Lcs, CLRS • Comments

Retracing our steps
Simple example
A recursive approach
Problems with the recursive solution
Dynamic programming
Memoize
Recursion limit
Bottom up calculation
Extracting an LCS from the grid
Animation
Performance
Reducing the space for LCS lengths
Hirschberg’s Linear Space Algorithm
Applied to the New York Marathon Results
Hirschberg’s linear space algorithm in C++
Applications: edit distance
Applications: genetic mutations
Applications: version control
Conclusions
Next time
Thanks
Python 3 Notes

←

↑

↖

←

↑

←

↑

←

↖

←

↑

←

↑

←

↑

←

↑

↖

←

↑

←

↑

↖

↑

←

↑

←

↑

←

↑

↖

↑

↖

←

↑

↖

I pinched the idea for this animation from Ned Batchelder’s clever visualisation of Román Cortés’ brilliant CSS Homer Simpson. It depends on Javascript, CSS and a font containing upwards, leftwards and north west arrows. If it doesn’t work in your user agent, my apologies. If you’d like to find out why I’m comparing humans with chimpanzees, read on!

Retracing our steps

Taking a brief step back, this article is the third of a series. In the first episode we posed a puzzle:

Starting with a list of runners ordered by finishing time, select a sublist of runners who are getting younger. What is the longest such sublist?

In the second episode we coded up a brute force solution which searched all possible sublists to find an optimal solution. Although the code was simple and succinct, its exponential complexity made it unsuitable for practical use.

In this episode we’ll discuss an elegant algorithm which solves our particular problem as a special case. On the way we’ll visit dynamic programming, Python decorators, version control and genetics.

An outline of the approach is:

forget about races, runners and ages
the essence of the problem is to determine the longest ordered subsequence found within a given sequence R
which is a measure of how close R is to being ordered
so, take a copy of R and sort it using our ordering criterion to form the sequence S
the subsequences of R which are also subsequences of S are precisely the ordered subsequences of R
so any longest subsequence common to both R and S is actually a longest ordered subsequence of R
which is exactly what we’re looking for!

Copying and sorting finite sequences is straightforward (and efficient implementations can be found in any standard library), so the steps above reduce our problem to an application of a more general problem: how to find the longest common subsequence of two sequences.

Simple example

As a simple example I’ve highlighted “HMAN”, the longest subsequence common to both “HUMAN” and “CHIMPANZEE”

H U M A N

C H I M P A N Z E E

Let’s note up front that there may not be a unique LCS of two sequences (as a trivial example, “A” and “B” are both LCSes of “AB”, “BA”), which is why we’ll talk about a longest common subsequence rather than the longest common subsequence.

We can also rule out a simple divide and conquer scheme. We can’t just split both inputs in half and join together the LCSes of the beginnings and endings to form an answer. Consider finding the LCS of “123ABC”, “DEF123” for example — concatenating the LCSes of “123”, “DEF” and “ABC”, “123” gives the empty string, which is clearly wrong. (We’ll see later that a more sophisticated divide and conquer scheme does exist, but it comes at a cost.)

A recursive approach

An elegant recursive approach to solving this problem is based around the observation that the LCS of two sequences can be built from the LCSes of prefixes of these sequences. In algorithm speak, the longest common subsequence problem has optimal substructure.

def lcs(xs, ys):
    '''Return a longest common subsequence of xs and ys.
    
    Example
    >>> lcs("HUMAN", "CHIMPANZEE")
    ['H', 'M', 'A', 'N']
    '''
    if xs and ys:
        *xb, xe = xs
        *yb, ye = ys
        if xe == ye:
            return lcs(xb, yb) + [xe]
        else:
            return max(lcs(xs, yb), lcs(xb, ys), key=len)
    else:
        return []

A formal explanation of why this works would be much longer than the code. I’ll sketch out the reasoning, but for a more rigorous approach I recommend Cormen, Leiserson, Rivest and Stein’s classic reference, “Introduction to Algorithms” (CLRS).

Essentially, if either input is empty, then the answer is just the empty sequence. Otherwise, split xs into its beginning, xb, and its final element xe; and similarly for ys. If the value xe equals ye, then lcs(xs, ys) must end with this value, and the remainder of lcs(xs, ys) must solve lcs(xb, yb). Otherwise, the longer of lcs(xs, yb) and lcs(xb, ys) must be a solution to lcs(xs, ys).

By the way, the assignment *xb, xe = xs is a nice example of Python 3.0’s extended unpacking syntax. In Python 2.6 and earlier you might write xb, xe = xs[:-1], xs[-1].

The recursion here is interesting. At each step, we compare two elements from the original inputs. Depending on the result, we must evaluate either one or two smaller LCSes.

Note also the call to max in the recursion. At this point, the algorithm chooses between two prefix pairs of the original inputs, (xs, yb) and (xb, ys), based on the length of their LCSes. Here’s the source of multiple solutions: max() returns the larger of its two arguments, and if lcs(xs, yb) and lcs(xb, ys) have equal lengths, I suspect the first will always be selected; but if we wanted all LCSes rather than any one LCS, we would have to continue down both routes.

The other important point to make here is that, in any case, both routes must be fully evaluated before a choice can be made. Loosely speaking, every additional step can double the workload. This is a problem!

Problems with the recursive solution

Work is needed to turn the recursive solution into a practical one. The key issue is that the recursive algorithm calculates the LCS of the same prefix pair multiple times. Let’s run a quick code experiment to show this. Starting with the version of lcs() shown above, we can adapt it to keep track of call counts^[1].

from collections import defaultdict
concat = ''.join

def lcs(xs, ys):
    ....

def count_lcs_calls(lcs):
    '''Return a pair (lcs, calls)
    
    Where:
    lcs - a wrapped version of lcs, which counts up calls
    calls - a dict mapping arg pairs to the number of times lcs
    has been called with these args.
    '''
    calls = defaultdict(int)
    def wrapped(xs, ys):
        calls[(concat(xs), concat(ys))] += 1
        return lcs(xs, ys)
    return wrapped, calls

lcs, calls = count_lcs_calls(lcs)

If count_lcs_calls() reminds you of a Python decorator, good! That’s pretty much how it works. We can now call the newly decorated version of lcs and examine the calls dict to see what’s been going on.

With two completely different inputs, the recursive algorithm never passes the xe == ye test, and must always consider another pair of prefixes of the inputs.

>>> lcs('MAN', 'PIG')
[]
>>> print(*sorted((v, k) for k, v in calls.items()), sep='\n')
(1, ('', 'PIG'))
(1, ('M', 'PIG'))
(1, ('MA', 'PIG'))
(1, ('MAN', ''))
(1, ('MAN', 'P'))
(1, ('MAN', 'PI'))
(1, ('MAN', 'PIG'))
(2, ('MA', 'PI'))
(3, ('', 'PI'))
(3, ('M', 'PI'))
(3, ('MA', ''))
(3, ('MA', 'P'))
(6, ('', 'P'))
(6, ('M', ''))
(6, ('M', 'P'))

Here, we show the LCS of “MAN” and “PIG is the empty sequence. The next command prints the call counts to standard output, showing that e.g. we calculate lcs('MA', 'PI') twice and lcs('M', 'P') six times. It should be no surprise to see these call counts appearing in Pascal’s triangle.

At the other extreme, given identical inputs the recursion zips straight through the identical prefixes, and no call is repeated.

>>> calls.clear()
>>> lcs('CHIMP', 'CHIMP')
['C', 'H', 'I', 'M', 'P']
>>> print(*sorted((v, k) for k, v in calls.items()), sep='\n')
(1, ('', ''))
(1, ('C', 'C'))
(1, ('CH', 'CH'))
(1, ('CHI', 'CHI'))
(1, ('CHIM', 'CHIM'))
(1, ('CHIMP', 'CHIMP'))

Generally, though, we won’t be so lucky. By the time I interrupted a call to find the LCS of two short strands of DNA, there had already been a total of over 200 million recursive calls to the function, and we’d compared 'A' and 'G' alone over 24 million times. The algorithm, as presented, is of exponential complexity, hence suitable for only the briefest of inputs.

>>> lcs('ACCGGTCGAGTGCGCGGAAGCCGGCCGAA', 'GTCGTTCGGAATGCCGTTGCTCTGTAAA')
....
KeyboardInterrupt
>>> sum(calls.values())
227603392
>>> max(calls.values())
24853152
>>> import operator
>>> value = operator.itemgetter(1)
>>> mostcalled = list(sorted(calls.items(), key=value))[-5:]
>>> print(*mostcalled, sep='\n')
(('', 'GT'), 13182769)
(('A', 'GT'), 13182769)
(('A', 'G'), 24853152)
(('', 'G'), 24853152)
(('A', ''), 24853152)

Dynamic programming

We’ve already seen the LCS of two sequences can be built from the LCSes of prefixes of these subsequences; that is, an optimal solution to the problem can be built from optimal solutions to subproblems, a property known as optimal substructure. Our code experiment above shows that the subproblems repeat, possibly many times. This second property, of overlapping subproblems, means we have a classic example of a problem we can solve using dynamic programming. There are a couple of standard ways to progress:

memoization
converting from top-down to bottom-up

Let’s try these out.

Memoize

Memoization is a simple idea. Rather than repeatedly recalculate a function for the same input values, we cache the result of each call, keyed by the function call arguments. If, as in this case, full evaluation is expensive compared to the look-up, and we’ve got space for the cache, then the technique makes sense.

In Python, we can write an adaptor function which converts our original function into a memoized version of itself. Here’s just such an adaptor for functions which take an arbitrary number of positional parameters.

def memoize(fn):
    '''Return a memoized version of the input function.
    
    The returned function caches the results of previous calls.
    Useful if a function call is expensive, and the function 
    is called repeatedly with the same arguments.
    '''
    cache = dict()
    def wrapped(*v):
        key = tuple(v) # tuples are hashable, and can be used as dict keys
        if key not in cache:
            cache[key] = fn(*v)
        return cache[key]
    return wrapped

We can use this to decorate our longest common subsequence function. I’ve also taken the opportunity to bypass some of the list splitting, creating an inner function which uses indices into the original xs and ys inputs.

def lcs(xs, ys):
    '''Return the longest subsequence common to xs and ys.
    
    Example
    >>> lcs("HUMAN", "CHIMPANZEE")
    ['H', 'M', 'A', 'N']
    '''
    @memoize
    def lcs_(i, j):
        if i and j:
            xe, ye = xs[i-1], ys[j-1]
            if xe == ye:
                return lcs_(i-1, j-1) + [xe]
            else:
                return max(lcs_(i, j-1), lcs_(i-1, j), key=len)
        else:
            return []
    return lcs_(len(xs), len(ys))

This is a big improvement. If xs and ys have nx and ny elements respectively, then the values of i and j passed to the inner lcs_(i, j) function must fall into the inclusive ranges [0, nx], [0, ny], and the cache will grow to contain at most (nx + 1)×(ny + 1) elements. If the input sequences are of similar size, it’s easy to see this version of lcs operates in quadratic time and space.

>>> ss = lcs('ACCGGTCGAGTGCGCGGAAGCCGGCCGAA', 'GTCGTTCGGAATGCCGTTGCTCTGTAAA')
>>> ''.join(ss)
'GTCGTCGGAAGCCGGCCGAA'

Recursion limit

The memoized version of lcs() remembers previous results, but it’s still highly recursive, and that can cause problems. Try comparing a couple of longer synthesised strands of DNA.

>>> from random import choice
>>> def base(): return choice('ACGT')
>>> dna1 = [base() for _ in range(10000)]
>>> dna2 = [base() for _ in range(10000)]
>>> lcs(dna1, dna2)
....
RuntimeError: maximum recursion depth exceeded

We might try bumping up the recursion limit but this approach is neither clean nor portable. As the sys.setrecursionlimit() documentation says:

The highest possible limit is platform-dependent. A user may need to set the limit higher when she has a program that requires deep recursion and a platform that supports a higher limit. This should be done with care, because a too-high limit can lead to a crash.

We might also try using a different language or a different Python implementation, but instead let’s stick with CPython and explore a bottom-up approach.

Bottom up calculation

Rather than continue with our top-down recursive algorithm, we can employ another standard dynamic programming technique and transform the algorithm into one which works from the bottom up. Recall that, in the general case, to calculate lcs(xs, ys) we needed to solve three sub-problems, lcs(xb, yb), lcs(xb, ys), lcs(xs, yb), where xb is the result of removing the end element from xs, and similarly for yb.

    *xb, xe = xs
    *yb, ye = ys
     if xe == ye:
         return lcs(xb, yb) + [xe]
     else:
         return max(lcs(xs, yb), lcs(xb, ys), key=len)

These subproblems in turn depend on the LCSes of yet shorter prefix pairs of xs and ys. A bottom up approach reverses the direction and calculates the LCSes of all prefix pairs of xs and ys, from shortest to longest. We iterate through these prefix pairs filling a rectangular grid with the information needed to continue and to eventually reconstruct an LCS. Once again, the code is easier to read than an explanation.

from collections import defaultdict, namedtuple
from itertools import product

def lcs_grid(xs, ys):
    '''Create a grid for longest common subsequence calculations.
    
    Returns a grid where grid[(j, i)] is a pair (n, move) such that
    - n is the length of the LCS of prefixes xs[:i], ys[:j]
    - move is \, ^, <, or e, depending on whether the best move
      to (j, i) was diagonal, downwards, or rightwards, or None.
    
    Example:
       T  A  R  O  T
    A 0< 1\ 1< 1< 1<
    R 0< 1^ 2\ 2< 2<
    T 1\ 1< 2^ 2< 3\
    '''
    Cell = namedtuple('Cell', 'length move')
    grid = defaultdict(lambda: Cell(0, 'e'))
    sqs = product(enumerate(ys), enumerate(xs))
    for (j, y), (i, x) in sqs:
        if x == y:
            cell = Cell(grid[(j-1, i-1)].length + 1, '\\')
        else:
            left = grid[(j, i-1)].length
            over = grid[(j-1, i)].length
            if left < over:
                cell = Cell(over, '^')
            else:
                cell = Cell(left, '<')
        grid[(j, i)] = cell
    return grid

The implementation here makes a couple of unusual choices. It models a 2 dimensional grid of values using a dict keyed by (i, j) pairs, rather than a more conventional array representation. Furthermore, it uses a defaultdict, which returns a value even for keys which haven’t been filled in. I’ve used this data structure primarily to avoid special case code at the edges of the grid. For example, a grid cell can be accessed at position (j-1, i-1) even if i or j is zero, and that cell automatically takes the desired (0, 'e') value.

I’ve come across the suggestion that defaultdict’s should be converted into regular dict’s before being passed back to callers, which makes a lot of sense, but in this case I’m choosing to return a defaultdict since its default value property will be useful in subsequent calculations — to find an actual LCS for example.

Extracting an LCS from the grid

Cells in the LCS grid hold moves and LCS lengths. To extract an actual LCS, we need to retrace the moves. Here’s an iterative algorithm (a recursive one would also be possible) which does this.

def lcs(xs, ys):
    '''Return a longest common subsequence of xs, ys.'''
    # Create the LCS grid, then walk back from the bottom right corner
    grid = lcs_grid(xs, ys)
    i, j = len(xs) - 1, len(ys) - 1
    lcs = list()
    for move in iter(lambda: grid[(j, i)].move, 'e'):
        if move == '\\':
            lcs.append(xs[i])
            i -= 1
            j -= 1
        elif move == '^':
            j -= 1
        elif move == '<':
            i -= 1
    lcs.reverse()
    return lcs

If you haven’t come across the 2-argument version of iter, it’s a neat way to generate a sequence terminated by a sentinel value. In this case we move back through the grid, finishing when we hit an 'e' end move. Note that we only require the move fields of the grid cells to reconstruct an LCS. The length fields were used during calculation of the grid itself.

Animation

I’ve reproduced here the animation shown at the start of the article. It demonstrates the LCS grid being calculated, one cell at a time. The numbers in the cells correspond to the lengths of the LCSes ending at various prefix pairs of the inputs and the arrows indicate the moves: as an example, the 2 in the 4th row, 4th column shows that lcs("CHIM", "HUMA") has length 2, and the leftwards arrow shows the best previous LCS was lcs("CHIM", "HUM").

Once the grid has been filled, the red and orange arrows track back through it to find an LCS. The orange arrows indicate matches — elements of an LCS — and the red arrows indicate a route which threads through in the greatest possible number of these.

Word Aligned

space sensitive programming