Beware Regular Expressions

August 18, 2003 11:46 PM

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. — Jamie Zawinski in comp.lang.emacs.

Regular expressions are a very powerful tool. They're also very easy to misuse. The biggest problem with regexps arises when you try to use a series of regular expressions as a substitute for an integrated parser.

I recently upgraded Movable Type, and in the process I installed Brad Choate's excellent MT-Textile plugin. MT-Textile gives you a vaguely wiki-like syntax for blog entries that rescues you from writing a mess of angle-brackets every time you want to write a post.

I love MT-Textile, but sadly, the more comfortable I get with it, the more I realise its limitations. MT-Textile is built on top of a series of regular expressions, and as such, the more you try to combine different Textile markups, the more likely you are to confuse the parser and end up with something different to what you intended. Any parser built on top of multiple regular expressions is easily confused, because the result depends on the order in which the regexps are applied.
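To make that concrete, here's a rough sketch of the failure mode. The rules and the function name are invented for illustration, not MT-Textile's actual regexps: _foo_ marks emphasis, and @foo@ marks a literal code span.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $text = 'Call @my_function_name@ to start.';

    # Pass 1: emphasis. It fires on the underscores inside the code
    # span, because nothing tells it that the code rule exists.
    (my $out = $text) =~ s/_(.+?)_/<em>$1<\/em>/g;
    # Pass 2: code spans, wrapping the already-mangled text.
    $out =~ s/\@(.+?)\@/<code>$1<\/code>/g;
    print "$out\n"; # Call <code>my<em>function</em>name</code> to start.

    # Swapping the passes doesn't help: the emphasis regexp happily
    # matches the underscores inside the <code> element pass 1 built.
    ($out = $text) =~ s/\@(.+?)\@/<code>$1<\/code>/g;
    $out =~ s/_(.+?)_/<em>$1<\/em>/g;
    print "$out\n"; # same mangled output

Neither ordering is right, because each substitution is blind to the structure the other one cares about.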

I ran into the same problem when I was running my own wiki. I started with a Perl wiki, which (like all Perl code) was highly dependent on regular expressions. I quickly found that the effort required to add new markup to the wiki increased exponentially with the complexity of the markup language, because each new regexp had to be written with an eye to how it would interact with every previous and subsequent expression.

After a certain point, diminishing returns will kill you.

I'd like to propose the following rule:

Every regexp that you apply to a particular block of text reduces the applicability of regular expressions by an order of magnitude.

I won't pretend to be a great expert in writing parsers—I dropped out of University before we got to the compiler-design course—but after a point, multiple regular expressions will hurt you, and you're much better off rolling your own parser.
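By "rolling your own parser" I don't mean anything grand. Even a single left-to-right scanning pass avoids the ordering trap, because each construct is consumed as one token, and later rules never get to look inside it. A rough sketch, using the same invented markup as above:

    # One anchored pass over the text; /gc resumes each match where
    # the previous one stopped, so nothing is scanned twice.
    sub render {
        my ($text) = @_;
        my $out = '';
        while ($text =~ /\G(?:\@(.+?)\@|_(.+?)_|(.))/gcs) {
            if    (defined $1) { $out .= "<code>$1</code>" }
            elsif (defined $2) { $out .= "<em>$2</em>" }
            else               { $out .= $3 }
        }
        return $out;
    }

    print render('Call @my_function_name@ to start.'), "\n";
    # Call <code>my_function_name</code> to start.

The underscores inside the code span survive, because the code rule claims the whole span before the emphasis rule ever sees it.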

2 TrackBacks

Listed below are links to blogs that reference this entry: Beware Regular Expressions.

TrackBack URL for this entry: fishbowl.pastiche.org/mt-tb.cgi/334

MT-Textile from Gerhard Froehlich on August 19, 2003 8:44 PM

MT-Textile is an MT plugin which enables Wiki-style text formatting for post editing. Very kool.... Read More

www.hopstudios.com/nep/overheard/001422.html from Overheard on January 26, 2005 5:24 PM

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."... Read More

2 Comments

By Stephan Schmidt on August 19, 2003 6:55 PM

All you wrote about parsers is fine, from a theoretical point of view. We use a lot of regexes in SnipSnap. Two people tried to replace them with a parser and failed.

The parser
- should be easy for developers to extend (with filters), without developers needing to know about the other filters
- should be changeable at runtime

We would gladly use a parser that satisfies these needs, but parsers are usually Yacc-style, which means hard to extend, hard to understand, and bound to compile time.

By Stewart Johnson on August 22, 2003 1:40 PM

I feel the need to come to the defence of REs. :)

While I agree with your post in principle, what I think you're really advocating is the *structured* application of REs. If you go down the route of writing a proper parser (e.g. using flex/bison or ANTLR), the first step is to write an ordered list of REs that will be used to tokenise the source text. The important point is that the list of REs is _clearly ordered_, and there are explicit rules on whether to keep matching subsequent REs depending on which REs have matched already. In this context, the application of further REs increases the complexity of your parser in something closer to a linear fashion.

I don't really follow Stephan Schmidt's comments about parsers being hard to extend and compile-time bound. The easiest way to parse a language is to specify the grammar in terms of tokens, and that's exactly what yacc/bison do with the tokenised output from [f]lex. The separation of tokenising and grammar building is precisely what makes these tools useful and powerful. And if you don't want to use C/C++, there's perl-bison and ANTLR (for Java).
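As a concrete sketch of that tokenising step (the rule names here are hypothetical, and this is not SnipSnap's API): the REs live in one clearly ordered table, and at each position the first rule that matches wins, [f]lex-style.

    my @rules = (
        [ CODE => qr/\G\@(.+?)\@/s ],
        [ EM   => qr/\G_(.+?)_/s   ],
        [ TEXT => qr/\G(.)/s       ], # catch-all, keep it last
    );

    sub tokenise {
        my ($text) = @_;
        my @tokens;
        my $pos = 0;
        while ($pos < length $text) {
            pos($text) = $pos;
            for my $rule (@rules) {
                my ($type, $re) = @$rule;
                # /gc leaves pos() untouched on failure, so the
                # next rule tries from the same spot.
                if ($text =~ /$re/gc) {
                    push @tokens, [ $type, $1 ];
                    $pos = pos($text);
                    last;
                }
            }
        }
        return @tokens;
    }

A grammar pass then consumes the token list, which is the separation Stewart describes. And since the table is just data, pushing a new rule onto it at runtime is a single statement, which goes some way towards the extensibility Stephan asks for.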

Previously: RedHat is Even More Brain-Damaged Than I Thought

Next: Sendmail Configuration for Mummies
