Regular Expression Mastery

Tokenizing

The tokenizer:

        sub tokens {
          my @tokens = 
            split m{(
                     \*\* | :=  # ** or := operator 
                    |
                     [-+*/^()=]  # some other operator
                    |
                     [A-Za-z]\w* # Identifier
                    |
                     \d*\.\d+(?:[Ee]\d+)?  # Decimal number
                    |
                     \d+         # Integer
                    )}x, shift();
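          # The capturing parentheses make split return the tokens too;
          # drop the empty and whitespace-only pieces found between them.
          grep /\S/, @tokens;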
        }
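
A quick way to see what the tokenizer returns is to run it on a sample expression. The sketch below repeats the sub so that it runs on its own; the input string is made up for illustration and is not from the original material.

```perl
use strict;
use warnings;

sub tokens {
  my @tokens =
    split m{(
             \*\* | :=              # ** or := operator
            |
             [-+*/^()=]             # some other operator
            |
             [A-Za-z]\w*            # identifier
            |
             \d*\.\d+(?:[Ee]\d+)?   # decimal number
            |
             \d+                    # integer
            )}x, shift();
  # split returns the captured tokens plus the text between them;
  # keep only the pieces that contain a non-whitespace character.
  return grep /\S/, @tokens;
}

# Illustrative input (an assumption, chosen to exercise several token types)
my @t = tokens("x := 3 + 12.23 ** y");
print join(" | ", @t), "\n";   # prints: x | := | 3 | + | 12.23 | ** | y
```

Note that any text the pattern does not recognize lands in the pieces between tokens, so a stray character such as @ would come back as a token of its own rather than raise an error.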

Easy to understand and to change, efficient, predictable.
Behaves very much like a similar lex-generated scanner.
This is why we need /x. Without it, the same pattern looks like this:

     split 
       m{(\*\*|:=|[-+*/^()=]|[A-Za-z]\w*|\d*\.\d+(?:[Ee]\d+)?|\d+)}, 
       shift();

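Note that the order of the | alternatives is important:
Perl tries them left to right and takes the first one that matches, not the longest.
- Is ** one token or two? What about 12.23?
- With the order above, ** is one token and 12.23 is one decimal number.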
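
To see the effect of ordering, here is a small sketch that runs the same input through two cut-down patterns that differ only in the order of their alternatives. The helper tokens_with, the variables $good and $bad, and the input string are all hypothetical names introduced for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on $pattern (keeping the captured
# separators) and drop the empty and whitespace-only pieces.
sub tokens_with {
  my ($pattern, $text) = @_;
  return grep /\S/, split $pattern, $text;
}

# Longer alternatives first: ** before *, decimal before integer.
my $good = qr{( \*\* | [-+*/^()=] | \d*\.\d+ | \d+ )}x;

# Same alternatives, shorter ones first.
my $bad  = qr{( [-+*/^()=] | \*\* | \d+ | \d*\.\d+ )}x;

print join(" | ", tokens_with($good, "2 ** 12.23")), "\n";
# prints: 2 | ** | 12.23

print join(" | ", tokens_with($bad,  "2 ** 12.23")), "\n";
# prints: 2 | * | * | 12 | .23
```

In the second pattern the single-character operator matches before ** is ever tried, and \d+ consumes 12 before the decimal alternative gets a chance, so both tokens are split in two.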