Perl Lines of the Day from March, 1999

8 March, 1999

  my @tokens = split /(\\\\(?=\\*[{}])|\\[{}]|[{}\n])/, $self->{SOURCE};

This showed up in Text::Template v1.11. It's basically another lexer.

Here's the problem I had: Text::Template parses a file, looking for Perl code that's delimited by { and }. It then evals the code. Now, you might want to have some braces in your file that don't delimit code. That's simple: you just put backslashes in front of them.

Now there's an interesting question, which is: What does a backslash mean when it's not in front of a brace? Note that Perl has two different answers to the analogous question: "\m" is the same as "m", but in '\m' the backslash is taken literally---it's only special before another quote or backslash. I decided to use the '' semantics, because if someone wrote this:

   ...{ print "Ouch\n"; }...

I didn't want it to print Ouchn.
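To see the two Perl behaviors side by side, here is a small sketch in plain Perl. Nothing in it is specific to Text::Template, and the variable names are only for illustration.

```perl
use strict;
use warnings;

my $double = "Ouch\n";   # double quotes: backslash-n becomes one newline character
my $single = 'Ouch\n';   # single quotes: the backslash and the n both survive

print length($double), "\n";   # 5
print length($single), "\n";   # 6

# In single quotes the backslash is special only before a quote or a backslash.
my $escaped = 'it\'s a \\ here';
print "$escaped\n";            # it's a \ here
```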

Now there's the question of how you get an actual backslash in your program, and the obvious answer is that you can put in \\ and Text::Template will replace the two backslashes with one. But I didn't like this, because to get Perl to include a backslash in a string you have to double it anyway, and that means that to include a backslash in a string in a template you'd have to quadruple it. I wanted to find a solution that didn't require this.

Here's the solution I found: Since \ is only really useful for escaping braces, it only needs to be escaped itself when it appears before a brace. This means that the only special sequences have the form \\\\\{ or \\\\\}. In such sequences, \\ is replaced by \. Elsewhere, \ is passed through unchanged.
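The rule can be tried out on plain strings. The following is only a sketch of the rule as stated above, not the code Text::Template uses; collapse_backslashes is a hypothetical helper, and it relies on a lookahead assertion of the kind the lexer uses.

```perl
use strict;
use warnings;

# Hypothetical helper: replace \\ with \ only inside a run of
# backslashes that ends at a brace; leave everything else alone.
sub collapse_backslashes {
    my ($text) = @_;
    $text =~ s/\\\\(?=\\*[{}])/\\/g;
    return $text;
}

my $bs = "\\";                               # one literal backslash

my $before_brace = ($bs x 5) . '{';          # five backslashes, then a brace
my $elsewhere    = 'a' . ($bs x 2) . 'b';    # two backslashes, no brace after

print collapse_backslashes($before_brace), "\n";   # three backslashes and a brace
print collapse_backslashes($elsewhere), "\n";      # unchanged
```

Building the test strings with the x operator avoids having to count backslashes inside quotes, which is exactly the nuisance the rule is meant to spare template authors.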

Now, to the lexer. First, let's see it in /x form:

  my @tokens = split /(
                       \\\\        # Double backslash
                         (?=\\*    # Must be followed by more backslashes
                            [{}])  # and then a curly brace
                      |
                       \\[{}]      # Escaped brace
                      |
                       [{}\n]      # Brace or newline
                      )/x, $self->{SOURCE};

The lexer's job is to divide the input into tokens. Some substrings of the input, like { and }, have a special meaning. We're going to locate these special substrings and break up the input wherever we see them. Then we'll process the special sequences and the parts in between. The first version of the lexer just looked like this:

  @tokens = split /([{}])/, $self->{SOURCE};

This splits up the input at the braces and returns a list of strings, some of which are the braces, and some of which are the things that were between the braces. But we want to return \{ and \} as single tokens, so we add:

  @tokens = split /(\\[{}]|[{}])/, $self->{SOURCE};
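Here is a small sketch of what these two versions return, next to a split with no capturing parentheses. The sample string is made up for the demonstration.

```perl
use strict;
use warnings;

my $source = 'a \{ b {c}';   # single quotes keep the backslash before the brace

# Without capturing parentheses, split throws the delimiters away.
my @plain = split /[{}]/, $source;

# With capturing parentheses, the delimiters come back as list elements.
my @first = split /([{}])/, $source;

# Trying \\[{}] first lets a backslashed brace match as one two-character token.
my @second = split /(\\[{}]|[{}])/, $source;

print join('|', @plain), "\n";    # a \| b |c
print join('|', @first), "\n";    # a \|{| b |{|c|}
print join('|', @second), "\n";   # a |\{| b |{|c|}
```

The order of the alternatives does not actually matter here, since only one of them can match at a backslash, but putting the longer token first makes the intent easier to read.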

Finally, we'd like to have \\ returned as a single token, but only in contexts where it's meaningful, as part of a \\\\\{ sequence. To do that, we use the somewhat obscure (?=PAT) operator. (?=PAT) says to look forward for something that matches PAT, and to match successfully only if PAT is found, but not to include PAT in the match itself. Here we use \\\\(?=\\*[{}]) to match \\ only when it is followed by more backslashes and then a brace. However, the token is just the \\, and not the other backslashes or the brace; they will be included in later tokens.
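The effect of the lookahead is easiest to see in isolation. This sketch splits two made-up strings on the double-backslash pattern alone; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $bs = "\\";                        # one literal backslash
my $double = qr/\\\\(?=\\*[{}])/;     # \\ only when a brace follows the run

# Two backslashes directly before a brace: the \\ is split out as a token,
# and the brace stays at the front of the next piece.
my @before_brace = split /($double)/, 'x' . ($bs x 2) . '{y';

# Two backslashes with no brace after them: nothing matches, so no split.
my @elsewhere = split /($double)/, 'x' . ($bs x 2) . 'y';

print scalar(@before_brace), "\n";   # 3
print scalar(@elsewhere), "\n";      # 1
```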

The Line of The Day thus pulls out as single tokens the following items: unbackslashed braces, backslashed braces, backslashed backslashes (but only in the right context), newlines, and the other sequences in between these things.
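To check that list against the regex, here is a sketch that runs the same split on a made-up string held in a plain variable rather than $self->{SOURCE}. One thing it assumes: split returns an empty string between two adjacent delimiters, and the grep that drops those is my addition for display, not something taken from Text::Template.

```perl
use strict;
use warnings;

my $bs = "\\";   # one literal backslash

# a, three backslashes, {, b, {, c, }, newline, d
my $source = 'a' . ($bs x 3) . '{b{c}' . "\n" . 'd';

my @tokens = grep { length }   # drop empty strings between adjacent delimiters
             split /(\\\\(?=\\*[{}])|\\[{}]|[{}\n])/, $source;

for my $t (@tokens) {
    my $shown = $t eq "\n" ? '\n' : $t;   # make the newline token visible
    print "[$shown]";
}
print "\n";   # [a][\\][\{][b][{][c][}][\n][d]
```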

Why is this interesting? For two reasons. First, it shows how to use regular expressions to construct lexers with quite a lot of lookahead, more than is usually allowed by programs like lex. In this case the tokenization of the input stream may depend on characters arbitrarily far forward. Second, zero-width lookahead assertions don't get much use, and when they do it's usually in only a few ways, which suggests that people don't really understand how to use them. This article is a contribution to the list of possible uses for zero-width assertions.

See 18 April 1998 and 8 April 1998 for previous lexers. See 3 September 1998 for a previous (somewhat similar) use of zero-width assertions.
