How Regexes Work: Notes and Errata

The third illustration on page 50 is missing a letter. There should be an a on the arrow from V to W, like this:

These diffs show how to add the . metacharacter to Regex.pm. We only need to add eight lines! If you want to try it yourself, don't peek!

Caution: This . isn't exactly like Perl's, because Perl's . doesn't match \n. It's a simple change to give it Perl's behavior.

NP-Completeness

The demo program illustrates an situation where Perl's built-in regular expressions take longer than Regex.pm to find that there's no match---a lot longer. How can this be? Is Regex.pm simply better?

No. The answer is that Regex.pm is worse, and that's why it's faster.

Huh?

Simple. Perl's built-in regexes have backreferences. Regex.pm doesn't. That lack of functionality enables Regex.pm to use a really super optimization that can speed things up a lot. However, the optimization makes it impossible to support backreferences. So Regex.pm is faster because it's less powerful.

This tradeoff is not a superficial thing, either. Regex.pm's algorithm is well-known and efficient. You might wonder if there is perhaps some way to extend it or modify it so that it works for regexes with backreferences too. The answer is that there probably isn't such an extension, for strong theoretical reasons.

Regexes with backreferences can be shown to be NP-Complete. This means, in practice, that there is no efficient algorithm known for computing whether or not such a regex matches, and that if there were such an algorithm, it would immediately translate into efficient algorithms for solving a huge class of other well-studied problems for which no efficient algorithms are known. This class of problems includes the famous Travelling Salesman problem, integer linear programming, the problem of recognizing whether or not a number is prime, and, in fact, any problem where you can efficiently check a proposed solution to see whather or not it is actually a correct solution.

Conversely, regular expression matching without backreferences is not NP-complete; there is a well-known, efficient algorithm for it, which is implemented by Regex.pm.

For more details, including a proof, see Regular Expression Matching is NP-complete.

Found an error, or have a remark? Send me mail.

Return to: Universe of Discourse main page | What's new page | Perl Paraphernalia | How Regexes Work

mjd-tpj-regex@plover.com