MEET PERL

M-J. Dominus
mjd-ieee@plover.com

-- What's Perl? --

Perl is a programming language. It was invented about ten years ago by a guy named Larry Wall. Larry is a computer programmer and a linguist. When he designed Perl, he used his computer programmer knowledge to make it a good language for telling the computer what to do, and he used his linguist knowledge to make it an easy language to learn and use.

-- What's Perl good for? --

Anything. Everything. Perl is a general-purpose programming language. Perl has many strengths, and the greatest of them is string and data processing. Since most programs involve some sort of string and data processing, most programs work well in Perl.

Perl is also good for system programming; it gives you access to C library functions for fine-grained control of files, processes, and network sockets, while at the same time handling the onerous details of low-level memory management and providing you with powerful built-in data structures.

Perl is much faster to write in than C, which makes it good for rapid prototyping. Perl is a good replacement for the Unix shell scripting languages on Unix systems, for when you need to write a program quickly to do a small task. And it's faster than most other scripting languages because it compiles your whole program into an internal form before running it, instead of interpreting it a line at a time.

Perl is not a small and simple language. It's a big language with a lot of stuff. That's good, because it means it will have the stuff you want, and you can ignore the stuff you don't want. Nobody is expected to know all of Perl; it is Officially Okay to program in a subset of Perl. In this way, Perl is like a natural language: Nobody is expected to know all of English before they start using it; you speak a subset of English first, and pick up more of it as you go. Perl is like that.

Some languages pride themselves on `minimality': There's exactly one way to perform any task.
Perl isn't like that either; minimality is fine for computer science research and not so fine for getting things done. Perl's official motto is ``There's More Than One Way To Do It.'' That's good, because that means you can do it *your* way.

-- A Subset of Perl --

We'll take a quick trip through Perl Land and see some tiny but powerful example programs.

A lot of Perl's power comes from its `regular expressions', or `regexps', which are patterns that you can use to check and disassemble your data. For example, suppose you'd like to check the user's input to see if it contains a number. A typical number might look like this:

    143.27

You can write this:

    unless (/\d+\.\d+/) {
      warn "I could not find a number!\n";
    }

The /\d+\.\d+/ is a pattern; by default Perl checks the most recent line of input against the pattern to see if it matches. \d means a digit; \d+ means a sequence of one or more digits. \. means a decimal point, so \d+\.\d+ means two sequences of one or more digits each, separated by a decimal point. `Unless' the most recent line of input contains two sequences of digits separated by a decimal point, Perl will print the warning message ``I could not find a number!'' to the standard error output.

Maybe you'd like to make the decimal part optional, so that 143 could also be a number. That's easy; you just put a `?' after the optional part, like this:

    unless (/\d+(\.\d+)?/) {
      warn "That's not a number!\n";
    }

The parentheses around \.\d+ tell Perl which part is optional.

Now maybe you'd like to check to see if there is a number, and if so, you want to do something with it. How can you extract the number from the input line? One way is like this:

    if (/(\d+(\.\d+)?)/) {
      print "The number you entered was $1.  The decimal part: `$2'.\n";
    } else {
      warn "That's not a number!\n";
    }

When a pattern matches, and it contains parentheses, Perl saves whatever matched the parenthesized part of the pattern into a variable.
$1 holds whatever was matched by the part in the first pair of parentheses; $2 holds whatever matched the part of the pattern in the second pair of parentheses, and so on. Here we have two pairs of parentheses; the outside pair encloses the whole pattern, which matches the entire number; if the pattern matches, Perl will store the entire number into $1. The second pair of parentheses encloses just the optional decimal point and following digits; if the pattern matches, Perl will put the decimal point and those digits into $2. The `print' command will print something like:

    The number you entered was 143.27.  The decimal part: `.27'.

You can interpolate variables into the middle of a Perl string without using printf or sprintf. (But printf and sprintf are there also if you want them.)

-- How do I turn a string into a number? --

Strings and numbers are interchangeable in Perl; if you use a number like a string, Perl turns it into a string for you. And if you use a string like a number, Perl turns it into a number and saves the result internally as a number, so that arithmetic is always fast.

Here's a program that reads numbers from the input, adds them up, and prints a total at the end:

    while (<>) {
      $total = $total + $_;
    }
    print "The total was $total.\n";

The <> symbol is magical. It reads a line of data from the input and puts the line into the default variable, which is $_. <> terminates the while loop at the end of the input. There's more magic to <>, but we'll ignore it for a while. When you see `while (<>)', imagine that it means `while you can read a line of input...'.

Perl runs the `while' loop once each time it reads a line of input, and when it sees the `+' it knows that you want it to interpret the line of input in $_ as a number, so that's what it does. It adds that number to the variable $total. At the end of the input, the `while' loop terminates, and Perl prints out the total of the numbers in the input.

-- Magical <> --

Where did the input come from? <> has more magic.
<> means to emulate the behavior of most Unix commands. Consider a command like `cat': If you wrote just

    cat

with no arguments, then `cat' would read from the standard input, usually the terminal. But if you wrote

    cat file1 file2 file3

then `cat' would read from the three files in succession and concatenate their contents.

<> does the same thing. If you ran the script with no arguments, <> would read from the standard input, usually the terminal; if you supplied arguments to the script, <> would invisibly and seamlessly read from each file in succession, opening a new file each time the previous file was exhausted.

Just for fun, here's the Perl version of the `cat' command:

    print while <>;

This is part of Perl's policy of making common tasks easy. Reading from a bunch of files specified on the command line is a common task, so Perl gives you a short cut for it.

-- Funny Punctuation --

By now you've noticed that variables in Perl begin with dollar signs. That's so that Perl can evolve without breaking your programs: You never have to worry that a future version of Perl will introduce a new reserved word that will clash with your variables, because the variables all begin with funny punctuation. It's also so that you can interpolate variables into strings without having to use a lot of verbiage like `sprintf'.

Only Perl's `scalar' variables begin with $. Scalar variables can hold one number or one string each. Perl also has built-in array variables that begin with @, and a host of functions for operating on arrays. For example

    reverse @array

reverses an array, and

    @array = split(/,/, $line);

interprets the data in $line as a sequence of fields separated by commas, extracts the contents of the fields, and stores them into the array @array. If you're a C programmer, think of the times you've written half a page of C code to do exactly this task, and all the calls to strtok that you'll never have to make again if you use Perl.
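
Here's a small sketch of those two array operations working together. It is only an illustration: the sample line and the names @fields and @backward are made up for this example, and it uses `use strict' and `my' declarations, which the short fragments in this article leave out.

```perl
use strict;
use warnings;

# Hypothetical sample record; the field values are made up for illustration.
my $line = "apple,banana,cherry";

# Break the line into fields wherever a comma appears.
my @fields = split(/,/, $line);

# reverse returns a new list with the elements in the opposite order;
# @fields itself is left unchanged.
my @backward = reverse @fields;

# scalar(@fields) is the number of elements in the array;
# join glues a list back together into a single string.
my $summary = scalar(@fields) . " fields, reversed: " . join(",", @backward);
print "$summary\n";
```

Run as it stands, this should print `3 fields, reversed: cherry,banana,apple'. The pattern given to `split' is an ordinary regexp, so the same call would handle other separators if you changed /,/ to something else.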

Perl has a built-in array variable called @ARGV that holds a list of the command-line arguments, so here's a little program that deletes the files that you name:

    foreach $file (@ARGV) {
      unlink $file or warn "Could not delete file `$file': $!; skipping.\n";
    }

`foreach' makes a loop; it sets the scalar variable $file to each of the elements of @ARGV in turn. For each argument, Perl tries to `unlink' (delete) the file with that name. If (and only if) it can't, it emits a warning message like

    Could not delete file `/vmunix': permission denied; skipping.

Whenever there's an error, Perl's magical variable $! will contain an explanation, such as `permission denied' or `no such file'. That makes error reporting convenient and easy in Perl.

Perl has a third kind of variable, called a `hash'. A hash is like an array, except that its indices are strings instead of numbers. %x is a hash variable named x, and $x{something} is the value associated with `something' in %x. If %x were an ordinary array, `something' would have to be a number, but because %x is a hash, `something' can be any string at all.

There are too many useful uses of hashes to discuss them all here. But we'll see one example just so you can get the feel of real Perl programming.

-- Sample Perl Program --

This program reads the input, counts the number of occurrences of each word in the input, and prints out a report of the number of times each word occurred:

    # Lines in Perl programs that begin with # signs are comments.
    # Comments are ignored by Perl.

    # Read the input and count the words
    while (<>) {
      foreach $word (split) {
        $count{$word}++;
      }
    }

    # Print out the report
    foreach $word (keys %count) {
      print "The word `$word' occurs $count{$word} time(s).\n";
    }

When I run this program on the text file containing this article, the output begins like this:

    The word `files' occurs 2 time(s).
    The word `scripting' occurs 2 time(s).
    The word `interpreter' occurs 1 time(s).

How does this work? The program has two parts.
The top part reads the input and tabulates the words; the bottom part prints the report.

We've seen `while (<>)' already; it loops over the input, reading a line of input into $_, and running the block once for every line. `foreach $word' is another loop; it iterates over the results of `split', assigning the variable $word to each word in the input line in turn. `split' we've seen; when used without arguments, as it is here, it happens to split the contents of $_ into words, separated by white space.

$count{$word} is the number of times the word $word has been seen so far; it's one of the elements in the hash %count. %count is like an array whose indices are words instead of numbers; $count{IEEE} is the number of occurrences of the word `IEEE'; $count{the} is the number of occurrences of the word `the'. ++ means to increment the appropriate element of %count.

When Perl finishes reading the input, it exits the `while' loop. The hash %count now contains a lot of words, and associated with each word, the number of times that word appeared in the input. The bottom `foreach' loop will print this information out. `keys %count' retrieves a list of all the words stored in %count; `foreach $word' iterates over this list, setting $word to each word in turn. Each time, the `print' command prints out the word ($word) and the count associated with that word ($count{$word}).

Let's extend this program a little to better count the words in this article. Program fragments in this article are indented, and I'd like to skip them, so that the program doesn't include program fragments in its word counts. To do this, we just add one line to the top loop. It becomes:

    while (<>) {
      next if /^\s+/;
      foreach $word (split) {
        $count{$word}++;
      }
    }

`next' means to start the next iteration of the `while' loop right away, without executing the following lines; the `next' happens only if the condition /^\s+/ is true.
/^\s+/ is a pattern match; by default the string matched against the pattern is the current line. The \s in the pattern means a whitespace character, maybe a space or a TAB. \s+ means one or more whitespace characters. The ^ means that this white space must occur at the beginning of the line for the pattern to match. If the current line has white space at the beginning, the `next' command will make Perl skip it entirely and move on to the next line.

Let's also have the program skip over `words' like `/^\s+/' and ``++'' by having it ignore words that contain funny characters. To do that, we'll add one more line to the program:

    while (<>) {
      next if /^\s+/;
      foreach $word (split) {
        next if $word =~ /[^A-Za-z0-9'-]/;
        $count{$word}++;
      }
    }

The new line is

    next if $word =~ /[^A-Za-z0-9'-]/;

and here the `next' means to skip to the next iteration of the `foreach' loop, effectively skipping the current word and moving on to the next word---if. If what? If $word =~ /[^A-Za-z0-9'-]/, that's what. The =~ is the pattern matching operator; it says here that instead of checking the current line to see if it matches the pattern, Perl should check $word instead. [^A-Za-z0-9'-] is a list of characters that $word might contain; the ^ at the beginning means `Every character *except*...'. `A-Za-z0-9'-' means any letter, any digit, an apostrophe, or a hyphen; the pattern matches if $word contains any character *except* one of these. If $word contains any character other than a letter, digit, apostrophe, or hyphen, the pattern match succeeds, the condition is true, the `next' command is invoked, and Perl skips the rest of the loop for that word; in particular it *doesn't* increment $count{$word}.

-- What Else? --

Perl has a lot of exciting features, and I had to leave most of them out. Here are some of the most exciting:

* Perl has a simple, flexible, and powerful object model for object-oriented programming.
Each package has its own namespace so that variables in one package won't collide with variables in another package. Perl supports multiple inheritance.

* Perl has hundreds of `modules', libraries of Perl code that do all sorts of interesting things. There are Perl modules already written for you that will

  * parse and generate HTML,
  * talk the FTP protocol on the network,
  * compute with arbitrarily large numbers,
  * make queries to any popular SQL database,
  * encrypt data,
  * (insert several hundred of your favorite useful things here)

* It's easy to extend Perl. You can write your own functions in C and link them in and use them just as if they were built in.

* You can embed a Perl interpreter into a program written in another language.

* Perl has a built-in symbolic debugger.

* Perl comes with an extensive and complete manual, to the tune of a quarter-million words. Some of the manual is for reference; some is tutorials; some is lists of frequently asked questions. It's easy to find what you want. One manual section is `perldiag'; it contains an explanation of every warning or error message that Perl can emit. Every Perl built-in function is described in the `perlfunc' document. `perlipc' is a tutorial for interprocess communication in Perl.

* Perl is portable. It's easy to write programs that work without change even on very different systems. Perl is available under Unix, Microsoft Windows, Macintosh, Amiga, and many others.

* Perl is binary-safe. You don't have to worry if your data contains ASCII NUL characters, or eight-bit characters, or anything else. You don't have to worry about the strings in a file being too long; Perl can suck your entire file into a single scalar if you want. You don't have to worry about your data having too many fields to fit into an array; Perl can make an array as big as you want, and then grow it automatically if you change your mind later.

* Perl has a built-in data flow analysis mode that lets it decide whether data have come from a trustworthy or an untrustworthy source, and refuse to perform dangerous operations with untrustworthy data. Turning this on eliminates many common security problems. This is especially important in CGI programs, which are run on your computers by unknown people all over the world, without any supervision.

* Perl comes with utilities that will translate your old `awk' and `sed' scripts into Perl programs that will run much faster.

* Perl has a `warning' mode that will detect misspelled variable and function names, use of uninitialized variables, and other common errors.

* The Perl community is large but closely-knit. Experts regularly appear in the comp.lang.perl.misc Usenet newsgroup and on the #perl IRC channel. The Perl development mailing list, perl5-porters, is large, but still small enough that everyone seems to know everyone else; there isn't any of the factionalism that you see on many such lists. Larry Wall, the creator of Perl, is still very much involved in its ongoing development. Perl has a quarterly journal, _The Perl Journal_. Perl had its first conference in August in San Jose; hundreds of Perl experts attended from all over the world. As a result of all this friendly discussion, Perl is well supported. It's easy to find an expert to help you when you need it.

-- Downsides of Perl --

Perl isn't perfect; there are some problems with Perl. The biggest problems: Perl can be slow, it can use a lot of memory, and it can be hard to read.

Perl is faster than most interpreted languages, often much faster, because it parses your program all at once at startup, and remembers the result; this lets it do the same sort of speed optimizations that compilers normally perform, such as constant folding. It's faster than the shell, because it doesn't spawn a new process every time it wants to look at a file or do an arithmetic calculation.
It's faster than Tcl because it doesn't re-parse every line as it comes to it, and because it has real lists and numbers, not just strings. But it's not as fast as C. For best results, prototype your application in Perl, and then if it's too slow, rewrite in C or some other compiled language.

Perl programs can take longer to start up than programs written in other languages, because Perl has to compile your program every time it's run, before it can start executing it. An experimental compiler module promises some relief here by allowing you to save the compiled version of your program to the disk.

Perl data structures are easy to use, and they're flexible, but they take up more memory than the corresponding structures would in C. The Perl compiler/interpreter itself takes up a lot of memory.

Perl is easy to write, but if the programmer isn't careful, it can be hard to read. The Perl community is working hard on this, and recent versions of Perl have tried hard not to exacerbate the funny punctuation situation. There's now an optional format for regexps that lets you include comments and white space, which goes a long way toward making these powerful patterns less cryptic.

Perl is not (yet) available absolutely everywhere.

These downsides aren't too down. Even when Perl isn't the Right Tool to do your job, it's often the right tool for the first draft of the tool that will do your job.

-- Where to get it? --

Perl comes pre-installed with many operating systems. You may have it already! But if not, you can get the most recent version of Perl from

    http://www.perl.com/CPAN/src/latest.tar.gz

if you have a Unix machine, and from

    http://www.perl.com/CPAN/ports/

if not.

-- Bibliography --

_Learning Perl_, Randal Schwartz and Tom Christiansen. O'Reilly and Associates, 302 pp. ISBN 1-56592-284-0. A tutorial introduction to Perl. Be sure to get the second edition of this book, since the first edition is several years out of date.

_Programming Perl_, Larry Wall, Tom Christiansen, and Randal L. Schwartz. O'Reilly and Associates, 646+xxi pp. ISBN 1-56592-149-6. The complete reference and guide to Perl for the programmer. Again, be sure you get the second edition.

_Advanced Perl Programming_, Sriram Srinivasan. O'Reilly and Associates, 434 pp. ISBN 1-56592-220-4.

_The Perl Journal_. Quarterly journal devoted to Perl. See http://www.tpj.com/ for details or write to The Perl Journal, POB 54, Boston MA 02101.