October 28, 2012
Anyone who has taken a class on computability theory, compilers, or finite automata knows that regular expressions can be parsed in linear time by generating a nondeterministic finite automaton (NFA) from the regular expression and converting it into a deterministic one (DFA).
Unfortunately, regular expression libraries like Python’s
remodule frequently generate exponential-time parsers. Why? The reason is that regex libraries like
reproduce backtracking parsers that can do more than just parse regular languages. They can parse some non-regular languages and support extra features like backreferences, which finite automata alone cannot do.
To combat this, I’ve built the backbone of a true regular expressions library: a finite automata engine. It can be used to build a regex parser generator that takes worst-case linear time, as a true regex parser should do.