Perl Lines of the Day from April, 1998

27 April, 1998

Maybe it's been too long since we had a real horror.


  $setuid = $setuid * 2 + ($perm =~ s/([st])$/($1 eq lc $1)?'x':'-'/ie);

This is part of my Stat::lsMode module. It's part of a function that takes a permission string like drwxr-xr-x such as comes out of the Unix ls -l command, and converts it back into a permission number, such as would come out of stat, or suitable for passing to chmod.

What does this do? In addition to the 0777 and 0644 modes that everyone knows about, Unix has three extra bits on top of those for special purposes. The low bit is called the `sticky bit' and has various meanings. The other two bits are the famous `setuid bit' and the slightly less famous `setgid bit'. These bits are all typically set on executable files, and control various details of what happens when the file is executed. For example, a file is normally executed with the permissions of the person who invoked it. But if its setuid bit is on, it is instead invoked with the permissions of the file's owner. The setgid bit is analogous.

If a file's setuid bit is on, ls -l displays s instead of x for its user-execute permission. Similarly, if the file's setgid bit is on, ls -l displays s instead of x for its group-execute permission, and if the sticky bit is on, it displays t (not s) for its world-execute permission.

A file with mode 00751 has permissions rwxr-x--x, but a file with mode 07751 has permissions rwsr-s--t. The extra 7 on top of the permission contains the three supplemental bits: 4 for the setuid bit, 2 for the setgid bit, and 1 for the sticky bit.

The line of the day is called in a loop in which $perm is set first to the user permissions, then to the group permissions, and finally to the other permissions. Each time, it looks at the last character of this three-character permission string to see if it's s or t, and if so, it adds the appropriate value to the $setuid variable, which will hold this extra information at the end. The technique is familiar to C programmers, who write things like this all the time:

  /* Convert string to integer */
  int atoi(const char *s) {
    int i = 0;
    while (*s) {
      i = i * 10 + (*s - '0');
      s++;
    }
    return i;
  }

My $setuid = $setuid * 2 + (binary digit value) is like the i = i * 10 + (decimal digit value) in the C code above.

The pattern match is a little funny too; it's translating s and t back to the appropriate letter x so that a later part of the function will notice the x's and compute the execute permission appropriately. It doesn't often happen that a special bit is on while the corresponding execute bit is off, but when it does, ls -l displays a capital S or T instead of lowercase. $1 eq lc $1 checks for lowercase; if so, the s/// replaces the letter with the usual x, and if not, it replaces it with the - that indicates no execute permission.
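
To show where the line lives, here is a stripped-down sketch of the surrounding function. It is not the actual Stat::lsMode code, and the name mode_from_string is made up, but it follows the scheme described above:

  sub mode_from_string {
    my $str = shift;                    # e.g. 'rwsr-s--t', with any leading type character removed
    my @triple = $str =~ /(...)/g;      # user, group, and other permission triples
    my ($setuid, $mode) = (0, 0);
    for my $perm (@triple) {
      $setuid = $setuid * 2 + ($perm =~ s/([st])$/($1 eq lc $1)?'x':'-'/ie);
      for my $char (split //, $perm) {
        $mode = $mode * 2 + ($char ne '-');   # accumulate the ordinary rwx bits
      }
    }
    return $mode | ($setuid << 9);            # put the three special bits on top
  }

Given 'rwsr-s--t' this returns the number 07751, and given 'rwxr-x--x' it returns 00751.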

Why do it this way? I dunno, that's just the way it came out. I think it's fun.

25 April, 1998


  <<EOM;

I'm writing a program that assembles an HTML page; various functions construct different parts of it and return the text to be integrated into the page. It turns out that one of these functions doesn't have to do any computing; the text it's going to return is the same every time. It just needs to return a literal string.

Clearly, I could write this:

  sub top_boilerplate {
    qq{
  ...
    };
  }

But for some reason I felt that a here-document would be better. I was momentarily confused about how to get what I wanted with a here document. Of course, I've done print <<EOM; a million times, and $var = <<EOM; about a hundred thousand, but for some reason it took me a minute to see what to do here. I even considered using return <<EOM; before settling on the Line of the Day.

It looks peculiar, but it's just what I wanted. The function now looks like this:

  sub top_boilerplate {
    <<EOM;
  ...
  EOM
  }

23 April, 1998


  unless (x = new(ELEMENT)) die("Out of memory");

Big deal! What's interesting about that?

Well, the braces around the body of the unless are missing. So is the $ on the variable name. That should kindle an awful suspicion in your mind...

Oh, yes. This is a line from a C program.

I actually had to write a C program today, and wow, I sure was suffering until I did this:

  #define unless(c) if (!(c))

I was a lot happier after that. Not completely happy. I missed die. That wasn't a big problem:

  #include <stdarg.h>
  #include <stdio.h>
  #include <stdlib.h>

  void
  die(const char *format, ...)
  {
    va_list ap;
    va_start(ap, format);

    vfprintf(stderr, format, ap);

    va_end(ap);
    exit(1);
  }

After that, things went beautifully. Programming in C felt good again. I got 250 lines written and they work great. Fast, too! Every time I program in C again I'm shocked at how fast it is once you get it written.

Now the only thing wrong is that I have to break myself of the habit of writing the condition after the statement. I can't make die if ... work in C.

I had to show you this stuff today. I wrote 500 lines of code, half Perl and half C, and a really sordid grammar for Parse::RecDescent, and this was the only thing in the whole mess that was noteworthy. What could I do?

21 April, 1998


  @ftype = split /[. ]+/, '. p c ? d ? b ? - ? l ? s ? ? ?';

This array maps the mode numbers that come out of UNIX stat to file types. For example, if a file has a mode that begins with 4, it is a directory; if the mode begins with 6 it is a block special file. ? means that the value is invalid.
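
For instance, here is one way the table might be used; this is my illustration, not a line from the module. The file type lives in the top four bits of the mode word, so shifting them down gives an index into @ftype:

  my $mode = (stat $filename)[2];
  my $type = $ftype[($mode & 0170000) >> 12];   # 'd' for a directory, '-' for a plain file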

I originally wrote this:

  @ftype = qw(? p c ? d ? b ? - ? l ? s ? ? ?);

But there's a small difficulty: For various reasons, I didn't want that first element to be ?---I wanted it to be empty. There's no way to get this with qw(). I had had a second line, $ftype[0] = '', but I didn't like this; it didn't seem declarative enough.

However, qw() is short for split, and by writing the split explicitly, we can get better control over how the string is split up. In this case we don't even need much control---all we want is a null initial field.

The . isn't really necessary, but I put it in for documentation purposes.

23 April: James Wetterau points out that this would probably be easier to understand:

  @ftype = ('', qw(p c ? d ? b ? - ? l ? s ? ? ?));

I agree, and I'll probably end up using this. It's fortunate that James' trick applies here.

19 April, 1998


  local $" = ')(';  

This useful item has been in my bag of tricks for some time. Oftentimes, when you are debugging a program, you print out data items to see what they are, either from inside the program, or in the debugger. One of the first tricks you learn is to use something like this:

  print ".$x.";

Because then if $x has leading or trailing white space, you can see where and how much and what kind. The Line of the Day is a similar trick for arrays.

If you have an array which might contain null or whitespace items in it, and you want to print it out, you can be deceived. For example, consider the token array from the 18 April item:

  ('x', ' ', '=', ' ', '3.4', '& ', 'y', '', '=', '', '5')

If you were to print this out with print "@tokens", you would get this:

  x   =   3.4 &  y  =  5

This is not very easy to read. Some of the tokens are empty, and some are white space, and it's hard to see where these tokens are. In fact, the output is actually ambiguous, so there is no question of simply getting good at reading such output.

Normally, when an array is interpolated into a string, its elements are separated by spaces. This choice of separator is controlled by the Perl special variable $". You can get more readable output by redefining it temporarily. After the redefinition in the Line of the Day, the statement print "(@tokens)" produces this output:

  (x)( )(=)( )(3.4)(& )(y)()(=)()(5)

Now the empty tokens and the tokens with spaces are easy to see.
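
Here is the whole trick in miniature; the bare block keeps the change to $" from leaking out into the rest of the program:

  my @tokens = ('x', ' ', '=', ' ', '3.4', '& ', 'y', '', '=', '', '5');
  {
    local $" = ')(';
    print "(@tokens)\n";    # prints (x)( )(=)( )(3.4)(& )(y)()(=)()(5)
  }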

In this program, which parsed arithmetic expressions, ( and ) were common tokens, so I used $" = '][' instead.

This technique is especially useful in the interactive debugger. You can set $" once at the beginning of your debugging session, and use it for displaying all the arrays you need to look at.

18 April, 1998


  @tokens = split(/(:=|\*\*|[-()+*\/;=]|[A-Za-z_]\w*|(?:\d+(?:\.\d+)?))/, $str);

This monster is another exercise in parsing techniques in Perl. (See 8 April for an earlier one.) This particular technique is a real winner. It is a complete input tokenizer (`lexer') in one line.

In a C program, you write a lexer as a function that reads its input, one character at a time, and which implements a state machine to decide when it has seen enough characters to make a complete token. This involves looking ahead to see what is coming up, which in turn means you sometimes have to `unread' characters. For example, suppose you're trying to parse Perl. You are about to read a new token. The first character is *. Is that a complete token? Well, if the next character is also *, then no, because this * is part of an exponentiation operator token. But if the next character is, say, a 3, then the * is a token all by itself, and is a multiplication operator token, or maybe a typeglob indicator. So you must read the next character to find out for sure. Suppose it is a 3. Now your lexer has read one character too many, and has to put the 3 back, for the next time it is called.

Writing lexers can be a pain. Even a simple lexer runs to many lines of C code. A special tool, lex, was invented just for writing lexers: You give it a description of the tokens you want to recognize, and it takes care of figuring out the state machine and writing the C code to implement it.

I used to think that lexical analysis was one of the few general tasks that Perl was not good at. That was because I knew that reading (with getc) and analyzing one character at a time was rather slow in Perl. That is true, but the conclusion does not follow. It just shows that I did not have enough imagination. Writing a C-style character-by-character input-driven state machine in Perl is a waste of time, because Perl has support for input-driven state machines already built into it: Regexes are input-driven state machines. The Line of the Day is a complete lexer for a calculator program similar to the Unix bc, written as one regex.

To better see what it is doing, here is the lexer spread out and commented:

  @tokens = split(/                  # A token is one of the following:
                   (
                     :=              # 1. the assignment operator
                   |
                     \*\*            # 2. the exponentiation operator
                   |
                     [-()+*\/;=]     # 3. some other single-character operator
                   |
                     [A-Za-z_]\w*    # 4. an identifier
                   |
                     (?:\d+          # 5. or a number
                        (?:\.\d+)?   #    (with an optional decimal part)
                     )
                   )
                  /x, $str);

(The /x form is the one that made it into the final version of the program.)

When a Perl regex has two alternatives, like A|B, it tries to match A first, and if it can't, it tries B instead. This means that if we list the longer tokens first, the regex match will look for them before it looks for shorter tokens. In the example above, ** (line 2) appears before * (line 3), so if there is a ** in the input, it will be recognized as a single token, not as two *'s.

The list returned from this split contains all the tokens that could be found. Each token is separated from the next by a string of whatever non-token characters (such as whitespace) separated it from the next token. For example, the string x = 3.4& y=5 turns into the list ('x', ' ', '=', ' ', '3.4', '& ', 'y', '', '=', '', '5'). A higher-level subroutine can analyze each token in turn to see what it means, ignoring the innocuous non-tokens that are all whitespace, and signaling a syntax error on the one that says '& '.
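
Here is a sketch of what that higher-level pass might look like; handle_token is a stand-in for whatever the parser does with a real token:

  for my $item (@tokens) {
    next if $item =~ /^\s*$/;          # empty or all-whitespace separator: harmless
    if ($item =~ /^(?::=|\*\*|[-()+*\/;=]|[A-Za-z_]\w*|\d+(?:\.\d+)?)$/) {
      handle_token($item);             # a real token
    } else {
      die "Syntax error near '$item'\n";   # e.g. the '& ' above
    }
  }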

This technique is extremely powerful and simple, and everyone should know it.

16 April, 1998


  length($_[0] = $out);

Jeff Hakner asked about tied filehandles on the perl5-porters mailing list, and in particular how to tie the read function to one. A call to read looks like read($fh, $buf, $len); it's supposed to read at most $len characters from the handle into $buf and return the number of characters read. $buf and $len are passed to READ, the tied function. Jeff wanted to know how READ could modify $buf, since presumably READ only gets a copy of $buf's contents. Shouldn't the tie interface specify that a reference to $buf be passed to READ instead?

This is a sure sign of too much C programming. In C, the tie interface would have had to pass a pointer to buf. In Perl, that's not necessary, because Perl doesn't have call-by-value semantics. (Why do people think it does?) Since Perl has call-by-reference semantics, $_[0] is actually an alias for $buf, and assigning to it or otherwise modifying its value (with s/// or whatever) actually modifies $buf.

My tied READ function computes the appropriate output string of $len characters or fewer, and places it in $out. The Line of the Day is the last line in the READ function. It's the Line of the Day because I can imagine that this line might be the last line in almost any READ function: It stores the read string into $buf and then returns the number of characters so stored.
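
Here is a sketch of how an entire READ method might be built around that line. The package name and the way $out gets computed are just for illustration; they are not from my actual function:

  package ExampleRead;

  sub TIEHANDLE {
    my ($class, $data) = @_;
    return bless { data => $data }, $class;
  }

  sub READ {
    my $self = shift;        # now $_[0] is the caller's buffer, $_[1] the length
    my $len  = $_[1];
    my $out  = substr($self->{data}, 0, $len);     # up to $len characters
    substr($self->{data}, 0, length $out) = '';    # consume them
    length($_[0] = $out);    # fill the caller's $buf and return the count
  }

  # tie *FH, 'ExampleRead', "some data";
  # read(FH, $buf, 4);       # calls READ; $buf becomes "some" and read returns 4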

11 April, 1998


  $suffix++ while $userdb->get_data("$u$suffix");

A program that needs to choose a unique, unused username from a database uses this line to find one. The user has asked for $u. The get_data call accepts a username and (in scalar context) returns true if that username is already in use, false if it is available.
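
A toy version, with a hash standing in for the database, shows how the suffix gets tacked on:

  my %taken = map { $_ => 1 } qw(fred fred1 fred2);   # pretend these are in the database
  my ($u, $suffix) = ('fred', '');
  $suffix++ while $taken{"$u$suffix"};
  print "$u$suffix\n";                                # prints "fred3"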

There is a race condition here, of course: If two people run two instances of this program at the same time, asking for a certain username that is in use, both processes might simultaneously decide to append the same suffix. However, semaphores elsewhere in the program prevent this.

9 April, 1998


  sprintf(split(/$;/o, shift()));

I find myself endlessly delighted with $;, which is perhaps one of Perl's least-used features. The index argument in the expression $hash{INDEX} is a very peculiar context, one which appears nowhere else in Perl. It's a scalar context, with an exception: If it looks like a comma separated list like a,b,c, then instead of getting the last list item as you usually would, you instead get join $;, a,b,c.

This was used in perl 4 to emulate multidimensional arrays, and is still useful for such purposes, especially in lightweight applications. You could write $matrix{3,5} to get or set the element in the third row and fifth column of a notional matrix. You were really operating on a hash value keyed by 3$;5, but that didn't matter.
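
In other words, the two subscripts in this little example name exactly the same hash entry:

  $matrix{3,5} = 'occupied';
  print "same entry\n" if $matrix{join($;, 3, 5)} eq 'occupied';   # it is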

$; is a special variable with no special magic about it; the only property it really has is that it's initialized to contain an uncommon character, which happens to be control-\. The hope is that this character won't appear in the indices to your faux-multidimensional array.

Today's Line of the Day appears in my Interpolation.pm module. Interpolation plays a simple but useful trick that lets you embed function calls into double-quoted strings with a clean syntax: "Our total receipts were $S{'%4.2f',$AMOUNT}" essentially inserts a call to sprintf into the string, formatting the $AMOUNT appropriately. The function call is accomplished by making %S a tied hash.

The FETCH function for the tied hash receives the desired hash key, which, thanks to Perl's strange convention about the comma operator in hash keys, is the two items, joined with $;. The line of the day separates the items again and passes them to sprintf to come up with a formatted string; when it returns this string, the string is substituted into the original string as the `value' of the hash.

For some reason I originally wrote it like this:

  my @args = split /$;/o, shift();
  my $format = shift @args;
  sprintf($format, @args);

Shortly afterwards I realized that I could inline the whole thing.

8 April, 1998


  @tokens = split(/([\(\[\{<>\}\]\)\s])/, $input);

Someone in comp.lang.perl.misc once asked how to parse items with nested parentheses, like (this [list has] some {nested (sublists)}). The first part of any parsing task is always to tokenize the input, which means to turn it from a sequence of characters into a sequence of tokens, where the tokens can be treated by the rest of the parser as atomic.

In this case, the tokens are words and parentheses. The line of the day takes a string like the one I showed, and breaks it into a sequence of tokens:

  (    this     [  list     has  ]  some   
  {    nested   (  sublists )    }  )

The resulting list of tokens can be processed one at a time by a parsing procedure, which can call itself recursively when it sees an open-parenthesis, return when it sees the matching close-parenthesis, and signal an error if there is a mismatching close-parenthesis.
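
Here is a sketch of such a parsing procedure; it is my illustration, not the code from the original discussion. It builds nested array references and dies on a mismatch:

  my %close = ('(' => ')', '[' => ']', '{' => '}', '<' => '>');

  sub parse_group {
    my ($tokens, $want) = @_;    # $want is the closing bracket we expect, if any
    my @result;
    while (@$tokens) {
      my $tok = shift @$tokens;
      next if $tok =~ /^\s*$/;                      # skip whitespace and empty tokens
      if (exists $close{$tok}) {                    # open bracket: recurse
        push @result, parse_group($tokens, $close{$tok});
      } elsif ($tok =~ /^[)\]}>]$/) {               # close bracket
        die "Mismatched '$tok'\n" unless defined $want && $tok eq $want;
        return \@result;
      } else {
        push @result, $tok;                         # an ordinary word
      }
    }
    die "Missing '$want'\n" if defined $want;       # input ran out inside a group
    return \@result;
  }

  # my $tree = parse_group(\@tokens);   # nested array refs that mirror the nesting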

This split technique is quite general and can often be used to tokenize simple text streams; it is more efficient in Perl than the usual character-by-character pattern-matching approach.

5 April, 1998


  my (@q, $i) = @_;       # This line is solemnly dedicated to \mjd.

This isn't my line; it comes from Net::IRC::Event.pm. Someone happened to mention it in the IRC #perl channel, and I said that that probably wasn't doing what the authors intended, and even if it was, it was a lousy way to write it.

Beginners often write lines like this, and they think that the last element of @_ is going to go into $i, and the rest will go into @q, but that's not what happens: instead, everything in @_ goes into @q, and $i remains undefined.
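
To see it concretely:

  my (@q, $i) = (1, 2, 3);
  # @q is now (1, 2, 3), and $i is undef.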

In this case, that's exactly what the authors wanted, so I was wrong about that. But I still think it's a funny way to write it!

The following week I was teaching a class, and I was discussing this point, about how an array on the left-hand side of an assignment gobbles up all the rest of the items from the right-hand side, leaving any variables that follow it undefined. I mentioned that I had seen this done in Net::IRC::Event.pm the previous week, for the first time ever. But then a funny thing happened. I went home that night, and I happened to be reading the standard vars.pm module that comes bundled with perl 5.004_04, and what did I see right at the top but:

    my ($pack, @imports, $sym, $ch) = @_;

(Line 14.) Which is even worse.

I guess I'll start doing it myself soon.

4 April, 1998


  join $;,@_;

There's nothing really special or noteworthy about this; it joins the arguments to the current subroutine into a single string, with items separated by whatever was in $;. $; is a special variable that contains a separator that is unlikely to occur in data---by default it is character #28, control-\.

This line appears in my Memoize module. Memoize takes a function that you wrote and attaches a front-end function to it. Each time you call your function, the front end checks the arguments to see if you've called the function with those arguments before. If not, the front-end calls the real function and saves the return value in a hash; if you have, it just returns the saved value from the hash.

The front-end uses the line above to generate the hash key.
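
In miniature, the idea looks something like this. The real Memoize module does rather more than this, and memoize_sketch is just a name I made up for the illustration:

  sub memoize_sketch {
    my $func = shift;
    my %cache;
    return sub {
      my $key = join $;, @_;                    # the Line of the Day
      $cache{$key} = $func->(@_) unless exists $cache{$key};
      return $cache{$key};
    };
  }

  # my $fast_fib = memoize_sketch(\&fib);       # $fast_fib->(30) is computed only once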

When I wrote it, I was again charmed by Perl and by its power, concision, and expressiveness.

3 April, 1998


  my @s = (&senders(), &forwarders(), @H{'Reply-to'}); 

Why is this noteworthy? Because 99 times out of a hundred, when you see something like @H{'Reply-to'}, it was an error, and should have been $H{'Reply-to'} instead.

What's the difference? $H{'Reply-to'} is a scalar value, undef if there was no Reply-to key in %H. But @H{'Reply-to'} is a list, a slice of all the elements of %H with the appropriate keys. If there is no Reply-to key in %H, you get the empty list.

If I'd used $H{'Reply-to'} here, and there was no Reply-to, then @s would have gotten an undef on the end. As it is, it gets nothing.

1 April, 1998


  $top *= $top > 0;

Well, for three weeks I didn't write much code, and what I did write was all straightforward. What can you do?

This line is simple, but rather funny. It's for the image rendering program I mentioned in the 17 February item. The program computes the top and the bottom of a line segment that it's going to draw, and then it needs to clip them to fit within the bounds of the canvas: if $top is negative, it must be adjusted to 0, and if $bot is bigger than $YMAX, it must be adjusted to $YMAX; then the program draws a line segment from $top to $bot. The Line of the Day handles the first adjustment: the comparison $top > 0 yields 1 when $top is positive and 0 when it isn't, so the multiplication either leaves $top alone or zeroes it out.

Originally I thought I'd only need to adjust $top in this way, and I used the line above to do it. Later, I decided that I'd have to adjust both $top and $bot, but there's no analogous compact expression that makes the adjustment for $bot. (Well, actually $bot = $YMAX - (($YMAX - $bot) * ($bot < $YMAX)) is analogous, but that just seemed stupid.) So I took it out and replaced it with the much less interesting

  $top = 0 if $top < 0;
  $bot = $YMAX if $bot > $YMAX;


mjd-perl-lotd@plover.com