<<< Back to blog Archives

Top-Down operator precedence parsing

January 2nd, 2010 at 5:08 pm

Introduction

Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:

  • Recursive descent, LL and predictive parsers
  • Some problems of recursive descent parsers
  • A recursive descent parser with an infix expression evaluator

The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real-world, for example in GCC and Parrot (source).

An alternative parsing algorithm was discovered by Vaughan Pratt in 1973. Called Top Down Operator Precedence, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in his article, and employed by him in JSLint to parse Javascript.

I encountered Crockford’s article in the Beautiful Code book, but found it hard to understand. I could follow the code, but had a hard time grasping why the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratt’s original paper and Frederick Lundh’s excellent Python-based piece [1], I understood the algorithm.

So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.

The fundamentals

Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:

  • A "binding power" mechanism to handle precedence levels
  • A means of implementing different functionality of tokens depending on their position relative to their neighbors – prefix or infix.
  • As opposed to classic RD, where semantic actions are associated with grammar rules (BNF), TDOP associates them with tokens.

Binding power

Operator precedence and associativity is a fundamental topic to be handled by parsing techniques. TDOP handles this issue by assigning a "binding power" to each token it parses.

Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric binding power for each operator. The operator with the higher binding power "wins" – gets E associated with it. Let’s examine the expression:

1 + 2 * 4

Here it is once again with A, E, B identified:

1 + 2 * 4
  ^ ^ ^
  A E B

If we want to express the convention of multiplication having a higher precedence than addition, let’s define the binding power (bp) of * to be 20 and that of + to be 10 (the numbers are arbitrary, what’s important is that bp(*) > bp(+)). Thus, by the definition we’ve made above, the 2 will be associated with *, since its binding power is higher than that of +.

Prefix and infix operators

To parse the traditional infix-notation expression languages [2], we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator (-). In its infix form it is subtraction:

a = b - c  # a is b minus c

In its prefix form, it is negation:

a = -b   # b has a's magnitude but an opposite sign

To accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called nud (for "null denotation") and the handler of a token as infix is called led (for "left denotation").

The TDOP algorithm

Here’s a basic TDOP parser:

def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)

    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def led(self, left):
        right = expression(10)
        return left + right

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class end_token(object):
    lbp = 0

We only have to augment it with some support code consisting of a simple tokenizer [3] and the parser driver:

import re
token_pat = re.compile("\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "*":
            yield operator_mul_token()
        else:
            raise SyntaxError('unknown operator: %s', operator)
    yield end_token()

def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()

And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.

Now let’s figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):

  • lbp – the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.
  • nud – this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)
  • led – the infix handler.

The key to enlightenment here is to notice how the expression function works, and how the operator handlers call it, passing in a binding power.

When expression is called, it is provided the rbp – right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than rbp. Specifically, it means that it collects all tokens that bind together before returning to the operator that called it.

Handlers of operators call expression to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.

Let’s see, for example, how this parser handles the expression:

3 + 1 * 2 * 4 + 5

Here’s the call trace of the parser’s functions when parsing this expression:

<<expression with rbp 0>>
    <<literal nud = 3>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 1>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 2>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 4>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 5>>

The following diagram shows the calls made to expression on various recursion levels:

spacer

The arrows show the tokens on which each execution of expression works, and the number above them is the rbp given to expression for this call.

Apart from the initial call (with rbp=0) which spans the whole input, expression is called after each operator (by its led handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure expression doesn’t go "too far" – only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first +, that collects all the multiplication terms (they must be grouped together due to the higher precedence of *) and returns before the second + is encountered (when the precedence ceases being higher than its rbp).

Another way to look at it: at any point in the execution of the parser, there’s an instance of expression for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.

If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.

Enhancing the parser

The parser presented so far is very rudimentary, so let’s enhance it to be more realistic. First of all, what about unary operators?

As I’ve mentioned in the section on prefix and infix operators, TDOP makes an explicit distinction between the two, encoding it in the difference between the nud and led methods. Adding the subtraction operator handler [4]:

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

nud handles the unary (prefix) form of minus. It has no left argument (since it’s prefix), and it negates its right argument. The binding power passed into expression is high, since infix minus has a high precedence (higher than multiplication). led handles the infix case similarly to the handlers of + and *.

Now we can handle expressions like:

3 - 2 + 4 * -5

And get a correct result (-19).

How about right-associative operators? Let’s implement exponentiation (using the caret sign ^). To make the operation right-associative, we want the parser to treat subsequent exponentiation operators as sub-expressions of the first one. We can do that by calling expression in the handler of exponentiation with a rbp lower than the lbp of exponentiation:

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

When expression gets to the next ^ in its loop, it will find that still rbp < token.lbp and won’t return the result right away, but will collect the value of the sub-expression first.

And how about grouping with parentheses? Since each token can execute actions in TDOP, this can be easily handled by adding actions to the ( token.

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

Where match is the usual RD primitive:

def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()

Note that ( has lbp=0, meaning that it doesn’t bind to any token on its left. It is treated as a prefix, and its nud collects an expression after the (, right until ) is encountered (which stops the expression parser since it also has lbp=0). Then it consumes the ) itself and returns the result of the expression [5].

Here’s the code for the complete parser, handling addition, subtraction, multiplication, division, exponentiation and grouping by parentheses:

import re

token_pat = re.compile("\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "-":
            yield operator_sub_token()
        elif operator == "*":
            yield operator_mul_token()
        elif operator == "/":
            yield operator_div_token()
        elif operator == "^":
            yield operator_pow_token()
        elif operator == '(':
            yield operator_lparen_token()
        elif operator == ')':
            yield operator_rparen_token()
        else:
            raise SyntaxError('unknown operator: %s', operator)
    yield end_token()

def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()

def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()

def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)
    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def nud(self):
        return expression(100)
    def led(self, left):
        right = expression(10)
        return left + right

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class operator_div_token(object):
    lbp = 20
    def led(self, left):
        return left / expression(20)

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

class end_token(object):
    lbp = 0

Sample usage:

>>> parse('3 * (2 + -4) ^ 4')
48

Closing words

When people consider parsing methods to implement, the debate usually goes between hand-coded RD parsers, auto-generated LL(k) parsers, or auto-generated LR parsers. TDOP is another alternative [6]. It’s an original and unusual parsing method that can handle complex grammars (not limited to expressions), relatively easy to code, and is quite fast.

What makes TDOP fast is that it doesn’t need deep recursive descents to parse expressions – only a couple of calls per token are required, no matter how the grammar looks. If you trace the token actions in the example parser I presented in this article, you’ll notice that on average, expression and one nud or led method are called per token, and that’s about it. Frederik Lundh compares the performance of TDOP with several other methods in his article, and gets very favorable results.

spacer
[1] Which is also the source for most of the code in this article – so the copyright is Fredrik Lundh’s
[2] Like C, Java, Python. An example of a language that doesn’t have infix notation is Lisp, which has prefix notation for expressions.
[3] This tokenizer just recognizes numbers and single-character operators.
[4] Note that to allow our parser actually recognize -, an appropriate dispatcher should be added to the tokenize function – this is left as an exercise to the reader.
[5] Quiz: is it useful having a led handler for a left paren as well? Hint: how would you implement function calls?
[6] By the way, I have no idea where to categorize it on the LL/LR scale? Any ideas?

Related posts:

  1. A recursive descent parser with an infix expression evaluator
  2. considerations in MIXAL parsing
  3. The many faces of operator new in C++
  4. Some problems of recursive descent parsers
  5. Adventures in parsing C: ASTs for switch statements

This entry was posted on Saturday, January 2nd, 2010 at 17:08 and is filed under Articles, Compilation. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

5 Responses to “Top-Down operator precedence parsing”

  1. philhasseyspacer Says:
    January 2nd, 2010 at 20:18

    I used TDOP for www.tinypy.org .. which parses most python code. I found the TDOP algorithm worked really well spacer

  2. elibenspacer Says:
    January 3rd, 2010 at 06:36

    @Phil, I really like the work you did with tinypy. The bootstrapped parser written in tinypy itself is really cool. I wonder if that is because TDOP isn’t suitable for C, or just because you wanted the code to be smaller. Crockford mentions TDOP is most suitable from dynamic languages, although I’m sure it can be implemented in C too.

  3. philhasseyspacer Says:
    January 4th, 2010 at 01:50

    Beats me. I think I heard about TDOP on planet.python a few years ago, and read those javascript articles. It seemed magical, yet simple enough that maybe I could comprehend it. Which was the case after reading over the code and implementing it.

    I think the reason I really liked it was that the code you write ends up ‘feeling’ like the code you are parsing. Which makes it seem much easier to understand. Even a couple years later, I was able to add in function annotations to tinypy without it taking me too long to remember how everything worked.

    -Phil

  4. Tom Lynnspacer Says:
    January 7th, 2010 at 13:25

    This algorithm would really benefit from some renaming. I think the code below is equivalent to your version. token.prefix() (“nud”) and token.infix() (“led”) should both be presumed to construct an incomplete AST node with a hole on the right, as I understand it.

    def expression(bind_right=0):
        global token
        left = token.prefix()
        token = next()
        while token.bind_left > bind_right:
            left = t.infix(left)
            token = next()
        return left

    In this version it’s clearer to me that it uses the previous value of token — it’s originally initialized in parse().

  5. Mitsuaki Kuwaharaspacer Says:
    January 23rd, 2010 at 02:23

    I read Beautiful Code too. I couldn’t catch the heart of algorithm, TDOP.
    This article really helps my understanding.
    It is good translation, nud to prefix handler and led to infix handler.
    The figure of the recursion levels is very clear. I like it.
    Thanks!

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)


gipoco.com is neither affiliated with the authors of this page nor responsible for its contents. This is a safe-cache copy of the original web site.