MEET PERL

			     M-J. Dominus
			 mjd-ieee@plover.com


-- What's Perl? --

Perl is a programming language.  It was invented about ten years ago
by a guy named Larry Wall.  

Larry is a computer programmer and a linguist.  When he designed Perl,
he used his computer programmer knowledge to make it a good language
for telling the computer what to do, and he used his linguist
knowledge to make it an easy language to learn and use.

-- What's Perl good for? --

Anything.  Everything.  Perl is a general-purpose programming
language.  Perl has many strengths, and the strongest strength is in
the area of string and data processing.  Since most programs involve
some sort of string and data processing, most programs work well in
Perl.  Perl is also good for system programming; it gives you access
to C library functions for fine-grained control of files, processes,
and network sockets, while at the same time handling the onerous
details of low-level memory management and providing you with powerful
built-in data structures.

Perl is much faster to write in than C, which makes it good for rapid
prototyping.  Perl is a good replacement for the Unix shell scripting
languages on Unix systems, for when you need to write a program
quickly to do a small task.  And it's faster than most other scripting
languages because it compiles your whole program into an internal form
before running it, instead of interpreting it a line at a time.

Perl is not a small and simple language.  It's a big language with a
lot of stuff.  That's good, because it means it will have the stuff
you want, and you can ignore the stuff you don't want.  Nobody is
expected to know all of Perl; it is Officially Okay to program in a
subset of Perl.  In this way, Perl is like a natural language: Nobody
is expected to know all of English before they start using it; you
speak a subset of English first, and pick up more of it as you go.
Perl is like that.

Some languages pride themselves on `minimality': There's exactly one
way to perform any task.  Perl isn't like that either; minimality is
fine for computer science research and not so fine for getting things
done.  Perl's official motto is ``There's More Than One Way To Do It.''
That's good, because that means you can do it *your* way.


-- A Subset of Perl --


We'll take a quick trip through Perl Land and see some tiny but
powerful example programs.

A lot of Perl's power comes from its `regular expressions', or
`regexps', which are patterns that you can use to check and disassemble
your data.  For example, suppose you'd like to check the user's input
to see if it contains a number.  A typical number might look like
this:

	143.27

You can write this:

	unless (/\d+\.\d+/) {
	  warn "I could not find a number!\n";
	}

The /\d+\.\d+/ is a pattern; by default Perl checks the most recent
line of input against the pattern to see if it matches.  \d means a
digit; \d+ means a sequence of one or more digits. \. means a decimal
point, so \d+\.\d+ means two sequences of one or more digits each,
separated by a decimal point.  `Unless' the most recent line of input
contains two sequences of digits separated by a decimal point, Perl
will print the warning message ``I could not find a number!''  to the
standard error output.

Maybe you'd like to make the decimal part optional, so that 143 could
also be a number.  That's easy; you just put a `?' after the optional
part, like this:

	unless (/\d+(\.\d+)?/) {
	  warn "That's not a number!\n";
	}

The parentheses around \.\d+ tell Perl which part is optional.

Now maybe you'd like to check to see if there is a number, and if so,
you want to do something with it.  How can you extract the number from
the input line?

One way is like this:

	if (/(\d+(\.\d+)?)/) {
	  print "The number you entered was $1.  The decimal part: `$2'.\n";
	} else {
	  warn "That's not a number!\n";
	}

When a pattern matches, and it contains parentheses, Perl saves
whatever matched the parenthesized part of the pattern into a
variable.  $1 holds whatever was matched by the part in the first pair
of parentheses; $2 holds whatever matched the part of the pattern in
the second pair of parentheses, and so on.  Here we have two pairs of
parentheses; the outside pair encloses the whole pattern, which
matches the entire number; if the pattern matches, Perl will store the
entire number into $1.  The second pair of parentheses encloses just
the optional decimal point and following digits; if that part
matches, Perl will put the decimal point and those digits into $2.

The `print' command will print something like:

	The number you entered was 143.27.  The decimal part: `.27'.

You can interpolate variables into the middle of a Perl string without
using printf or sprintf.  (But printf and sprintf are there also if
you want them.)
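
Here is a small sketch of the same idea, wrapped up so that you can
try it on several strings at once.  The subroutine name find_number
and the sample strings are made up for this illustration.  It uses the
=~ operator to match against a named variable instead of the most
recent line of input.

```perl
use strict;
use warnings;

# find_number is a hypothetical helper, not something built into Perl.
# It returns the whole number and the decimal part found in its
# argument, or an empty list if the argument contains no number.
sub find_number {
    my ($text) = @_;
    if ($text =~ /(\d+(\.\d+)?)/) {
        # $2 is undefined when the optional part did not match
        my $decimal = defined $2 ? $2 : '';
        return ($1, $decimal);
    }
    return;
}

foreach my $input ('The price is 143.27 today', 'I have 143 apples', 'none') {
    my ($number, $decimal) = find_number($input);
    if (defined $number) {
        print "The number was $number.  The decimal part: `$decimal'.\n";
    } else {
        print "No number in `$input'.\n";
    }
}
```

For the second string the decimal part comes out empty, because the
optional part of the pattern matched nothing.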


-- How do I turn a string into a number? --


Strings and numbers are interchangeable in Perl; if you use a number
like a string, Perl turns it into a string for you.  And if you use a
string like a number, Perl turns it into a number and saves the result
internally as a number, so that arithmetic is always fast.  Here's a
program that reads numbers from the input, adds them up, and prints a
total at the end:

	while (<>) {
	  $total = $total + $_;
	}

	print "The total was $total.\n";

The <> symbol is magical.  It reads a line of data from the input and
puts the line into the default variable, which is $_.  <> terminates
the while loop at the end of the input.  There's more magic to <>, but
we'll ignore it for a while.  When you see `while (<>)', imagine that
it means `while you can read a line of input...'.

Perl runs the `while' loop once each time it reads a line of input,
and when it sees the `+' it knows that you want it to interpret the
line of input in $_ as a number, so that's what it does.  It adds that
number to the variable $total.  

At the end of the input, the `while' loop terminates, and Perl prints
out the total of the numbers in the input.
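
If you want to try the conversion without typing any input, here is a
minimal sketch.  The subroutine name add_up and the sample lines are
assumptions made for this example; the strings stand in for lines that
<> would have read, complete with their trailing newlines.

```perl
use strict;
use warnings;

# add_up is a hypothetical name used only for this sketch.
# Each element of @lines is a string, just as it would be if <> had
# read it; the + operator makes Perl treat it as a number.
sub add_up {
    my @lines = @_;
    my $total = 0;
    foreach my $line (@lines) {
        $total = $total + $line;
    }
    return $total;
}

my $total = add_up("10\n", "2.5\n", "30\n");

# Going the other way, the number in $total is turned back into a
# string when it is interpolated into the message.
print "The total was $total.\n";
```

This prints `The total was 42.5.'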

-- Magical <> --

Where did the input come from?  <> has more magic.  <> means to
emulate the behavior of most Unix commands.  Consider a command like
`cat': If you wrote just

	cat

with no arguments, then `cat' would read from the standard input,
usually the terminal.  But if you wrote

	cat file1 file2 file3

then cat would read from the three files in succession and
concatenate their contents.  <> does the same thing.  If you ran the
script with no arguments, <> would read from the standard input,
usually the terminal; if you supplied arguments to the script, <>
would invisibly and seamlessly read from each file in succession,
opening a new file each time the previous file was exhausted.  Just
for fun, here's the Perl version of the `cat' command:

	print while <>;

This is part of Perl's policy of making common tasks easy.  Reading
from a bunch of files specified on the command line is a common task,
so Perl gives you a short cut for it.


-- Funny Punctuation --

By now you've noticed that variables in Perl begin with dollar signs.
That's so that Perl can evolve without breaking your programs: You
never have to worry that a future version of Perl will introduce a new
reserved word that will clash with your variables, because the
variables all begin with funny punctuation.  It's also so that you can
interpolate variables into strings without having to use a lot of
verbiage like `sprintf'.

Only Perl's `scalar' variables begin with $.  Scalar variables can
hold one number or one string each.  Perl also has built-in array
variables that begin with @, and a host of functions for operating on
arrays.  For example

	reverse @array

returns the elements of the array in reverse order, and

	@array = split(/,/, $line);

interprets the data in $line as a sequence of fields separated by
commas, extracts the contents of the fields, and stores them into the
array @array.  If you're a C programmer, think of the times you've
written half a page of C code to do exactly this task, and all the
calls to strtok that you'll never have to make again if you use Perl.
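
Here is a short sketch that puts `split' and `reverse' together.  The
sample line and the variable names are made up for the illustration;
`join', which glues a list back into one string, is a Perl built-in
that this article hasn't mentioned.

```perl
use strict;
use warnings;

my $line = "apple,banana,cherry";    # made-up sample data

# Break the line into fields at the commas
my @fields = split(/,/, $line);

# reverse returns a new list; @fields itself is left alone
my @backwards = reverse @fields;

print "There are ", scalar(@fields), " fields.\n";
print "The second field is $fields[1].\n";
print "Backwards: ", join(",", @backwards), "\n";
```

Notice that one element of @fields is written $fields[1], with a
dollar sign, because a single element is a scalar.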

Perl has a built-in array variable called @ARGV that holds a list of
the command-line arguments, so here's a little program that deletes
the files that you name:

	foreach $file (@ARGV) {
	  unlink $file
	    or warn "Could not delete file `$file': $!; skipping.\n";
	}


`foreach' makes a loop; it sets the scalar variable $file to each of
the elements of @ARGV in turn.  For each argument, Perl tries to
`unlink' (delete) the file with that name.  If (and only if) it can't,
it emits a warning message like

	Could not delete file `/vmunix': permission denied; skipping.

Whenever there's an error, Perl's magical variable $! will contain an
explanation, such as `permission denied' or `no such file'.  That
makes error reporting convenient and easy in Perl.

Perl has a third kind of variable, called a `hash'.  A hash is like an
array, except that its indices are strings instead of numbers.  %x is
a hash variable named x, and $x{something} is the value associated
with `something' in %x.  If %x were an ordinary array, `something'
would have to be a number, but because %x is a hash, `something' can
be any string at all.
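
A minimal sketch of a hash in action follows.  The hash %price and
everything stored in it are invented for this example; `exists' and
`keys' are Perl built-ins.

```perl
use strict;
use warnings;

# %price and its contents are made up for this illustration.
my %price;
$price{apple}        = 0.25;
$price{banana}       = 0.75;
$price{'dried figs'} = 3.5;    # the index can be any string

print "An apple costs $price{apple}.\n";

# exists asks whether anything is stored under an index
if (exists $price{cherry}) {
    print "A cherry costs $price{cherry}.\n";
} else {
    print "There is no price for a cherry.\n";
}

# keys returns a list of all the indices in the hash
foreach my $item (sort keys %price) {
    print "$item: $price{$item}\n";
}
```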

There are too many useful uses of hashes to discuss them all here.
But we'll see one example just so you can get the feel of real Perl
programming.

-- Sample Perl Program --

This program reads the input, counts the number of occurrences of each
word in the input, and prints out a report of the number of times each
word occurred:

	# Lines in Perl programs that begin with # signs are comments.
	# Comments are ignored by Perl.

	# Read the input and count the words
	while (<>) {
	  foreach $word (split) {
	    $count{$word}++;
	  }
	}

	# Print out the report
	foreach $word (keys %count) {  
	  print "The word `$word' occurs $count{$word} time(s).\n"; 
	}

When I run this program on the text file containing this article, the
output begins like this:

	The word `files' occurs 2 time(s).
	The word `scripting' occurs 2 time(s).
	The word `interpreter' occurs 1 time(s).

How does this work?

The program has two parts.  The top part reads the input and tabulates
the words; the bottom block prints the report.  We've seen `while
(<>)' already; it loops over the input, reading a line of input into
$_, and running the block once for every line.  `foreach $word' is
another loop; it iterates over the results of `split', assigning the
variable $word to each word in the input line in turn.  `split' we've
seen; when used without arguments, as it is here, it happens to split
the contents of $_ into words, separated by white space.

$count{$word} is the number of times the word $word has been seen so
far; it's one of the elements in the hash %count.  %count is like an
array whose indices are words instead of numbers; $count{IEEE} is the
number of occurrences of the word `IEEE'; $count{the} is the number of
occurrences of the word `the'.  ++ means to increment the appropriate
element of %count.

When Perl finishes reading the input, it exits the `while' loop.  The
hash %count now contains a lot of words, and associated with each
word, the number of times that word appeared in the input.  The bottom
`foreach' loop will print this information out.  `keys %count'
retrieves a list of all the words stored in %count; `foreach $word'
iterates over this list, setting $word to each word in turn.  Each
time, the `print' command prints out the word ($word) and the count
associated with that word ($count{$word}).

Let's extend this program a little to better count the words in this
article.  Program fragments in this article are indented, and I'd like
to skip them, so that the program doesn't include program fragments in
its word counts.  To do this, we just add one line to the top loop.
It becomes:

	while (<>) {
	  next if /^\s+/;
	  foreach $word (split) {
	    $count{$word}++;
	  }
	}

`next' means to start the next iteration of the `while' loop right
away, without executing the following lines; the `next' happens only
if the condition /^\s+/ is true.  /^\s+/ is a pattern match; by
default the string matched against the pattern is the current line.
The \s in the pattern means a whitespace character, maybe a space or a
TAB.  \s+ means one or more whitespace characters.  The ^ means that
this white space must occur at the beginning of the line for the
pattern to match.  If the current line has white space at the
beginning, the `next' command will make Perl skip it entirely and move
on to the next line.

Let's also have the program skip over `words' like `/^\t/' and ``++''
by having it ignore words that contain funny characters.  To do that,
we'll add one more line to the program:

	while (<>) {
	  next if /^\s+/;
	  foreach $word (split) {
	    next if $word =~ /[^A-Za-z0-9'-]/;
	    $count{$word}++;
	  }
	}

The new line is

	    next if $word =~ /[^A-Za-z0-9'-]/;

and here the `next' means to skip to the next iteration of the
`foreach' loop, effectively skipping the current word and moving on to
the next word---if.  If what?    If

	$word =~ /[^A-Za-z0-9'-]/

, that's what.  The =~ is the pattern matching operator; it says here
that instead of checking the current line to see if it matches the
pattern, Perl should check $word instead.  [^A-Za-z0-9'-] is a list of
characters that $word might contain; the ^ at the beginning means
`Every character *except*...'.  `A-Za-z0-9'-' means any letter, any digit,
an apostrophe, or a hyphen; the pattern matches if $word contains any
character *except* one of these.

If $word contains any character other than a letter, digit, apostrophe,
or hyphen, the pattern match succeeds, the condition is true, the
`next' command is invoked, and Perl skips the rest of the loop for
that word; in particular it *doesn't* increment $count{$word}.
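
To see both `next' lines at work without preparing an input file, here
is a sketch that wraps the counting loop in a subroutine and feeds it
a few lines directly.  The name count_words and the sample lines are
assumptions made for this example; the two patterns are the ones
discussed above.

```perl
use strict;
use warnings;

# count_words is a hypothetical wrapper around the counting loop, so
# that it can be tried on a list of lines instead of on input files.
sub count_words {
    my @lines = @_;
    my %count;
    foreach my $line (@lines) {
        next if $line =~ /^\s+/;                  # skip indented lines
        foreach my $word (split ' ', $line) {     # split on white space
            next if $word =~ /[^A-Za-z0-9'-]/;    # skip funny `words'
            $count{$word}++;
        }
    }
    return %count;
}

my %count = count_words(
    "the cat saw the dog\n",
    "\tthe indented line is skipped\n",
    "the ++ is not counted\n",
);

foreach my $word (sort keys %count) {
    print "The word `$word' occurs $count{$word} time(s).\n";
}
```

The word `the' is reported three times, not four, because the indented
line is skipped entirely; `++' never appears in the report at all.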


-- What Else? --


Perl has a lot of exciting features, and I had to leave most of them
out.  Here are some of the most exciting:

* Perl has a simple, flexible, and powerful object model for
  object-oriented programming.  Each package has its own namespace so
  that variables in one package won't collide with variables in
  another package.  Perl supports multiple inheritance.

* Perl has hundreds of `modules', libraries of Perl code that do all
  sorts of interesting things.  There are Perl modules already written
  for you that will

	* parse and generate HTML, 
	* talk the FTP protocol on the network, 
	* compute with arbitrarily large numbers,
	* make queries to any popular SQL database,
	* encrypt data,
	* (insert several hundred of your favorite useful things here)

* It's easy to extend Perl.  You can write your own functions in
  C and link them in and use them just as if they were built in.

* You can embed a Perl interpreter into a program written in another
  language. 

* Perl has a built-in symbolic debugger.

* Perl comes with an extensive and complete manual, to the tune of a
  quarter-million words.  Some of the manual is for reference; some is
  tutorials; some is lists of frequently asked questions.  It's easy
  to find what you want.  One manual section is `perldiag'; it contains
  an explanation of every warning or error message that Perl can emit.
  Every Perl built-in function is described in the `perlfunc'
  document.  `perlipc' is a tutorial for interprocess communication in
  Perl.

* Perl is portable.  It's easy to write programs that work without
  change even on very different systems.  Perl is available under
  Unix, Microsoft Windows, Macintosh, Amiga, and many others.

* Perl is binary-safe.  You don't have to worry if your data contains
  ASCII NUL characters, or eight-bit characters, or anything else.
  You don't have to worry about the strings in a file being too long;
  Perl can suck your entire file into a single scalar if you want.
  You don't have to worry about your data having too many fields to
  fit into an array; Perl can make an array as big as you want, and
  then grow it automatically if you change your mind later.

* Perl has a built-in data flow analysis mode that lets it decide
  whether data have come from a trustworthy or an untrustworthy
  source, and refuse to perform dangerous operations with
  untrustworthy data.  Turning this on eliminates many common security
  problems.  This is especially important in CGI programs, which are
  run on your computers by unknown people all over the world, without
  any supervision.

* Perl comes with utilities that will translate your old `awk' and
  `sed' scripts into Perl programs that will run much faster.

* Perl has a `warning' mode that will detect misspelled variable and
  function names, use of uninitialized variables, and other common
  errors.

* The Perl community is large but closely-knit.  Experts regularly appear
  in the comp.lang.perl.misc Usenet newsgroup and on the #perl IRC
  channel.  The Perl development mailing list, perl5-porters, is
  large, but still small enough that everyone seems to know everyone
  else; there isn't any of the factionalism that you see on many such
  lists.  Larry Wall, the creator of Perl, is still very much involved
  in its ongoing development.  Perl has a quarterly journal, _The Perl
  Journal_.  Perl had its first conference in August in San Jose;
  hundreds of Perl experts attended from all over the world.

  As a result of all this friendly discussion, Perl is well supported.
  It's easy to find an expert to help you when you need it.


-- Downsides of Perl --

Perl isn't perfect; there are some problems with Perl.  The biggest
problems:  Perl can be slow, it can use a lot of memory, and it can be
hard to read.

Perl is faster than most interpreted languages, often much faster,
because it parses your program all at once at startup, and remembers
the result; this lets it do the same sort of speed optimizations that
compilers normally perform, such as constant folding.  It's faster
than the shell, because it doesn't spawn a new process every time it
wants to look at a file or do an arithmetic calculation.  It's faster
than Tcl because it doesn't re-parse every line as it comes to it, and
because it has real lists and numbers, not just strings.  But it's not
as fast as C.  For best results, prototype your application in Perl,
and then if it's too slow, rewrite in C or some other compiled
language.

Perl programs can take longer to start up than programs written in
other languages, because perl has to compile your program every time
it's run, before it can start executing it.  An experimental compiler
module promises some relief here by allowing you to save the compiled
version of your program to the disk.

Perl data structures are easy to use, and they're flexible, but they
take up more memory than the corresponding structures would in C.  The
Perl compiler/interpreter itself takes up a lot of memory.

Perl is easy to write, but if the programmer isn't careful, it can be
hard to read.  The Perl community is working hard on this, and recent
versions of Perl have tried hard not to exacerbate the funny
punctuation situation.  There's now an optional format for regexps
that lets you include comments and white space, which goes a long way
toward making these powerful patterns less cryptic.
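
As a sketch of what that looks like, here is the number pattern from
earlier in this article, spread out and commented.  The /x at the end
is what tells Perl to ignore the white space and comments inside the
pattern.  The subroutine name extract_number is made up for this
example.

```perl
use strict;
use warnings;

# extract_number is a hypothetical helper: it returns the first
# number found in its argument, or nothing if there isn't one.
sub extract_number {
    my ($text) = @_;
    if ($text =~ /
            (           # first group: the whole number
              \d+       # one or more digits
              (         # second group: the decimal part
                \.      # a decimal point
                \d+     # one or more digits
              )?        # the question mark makes this group optional
            )
        /x) {
        return $1;
    }
    return;
}

my $number = extract_number("It costs 143.27 dollars");
print "The number was $number.\n";
```

One thing to watch for: Perl still looks for variables and for the
closing slash inside the comments, so it is safest to keep dollar
signs and slashes out of them.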

Perl is not (yet) available absolutely everywhere.


These downsides aren't too down.  Even when Perl isn't the Right Tool
to do your job, it's often the right tool for the first draft of the
tool that will do your job.


-- Where to get it? --

Perl comes pre-installed with many operating systems.  You may have it
already! But if not, you can get the most recent version of Perl from

	http://www.perl.com/CPAN/src/latest.tar.gz

if you have a Unix machine, and from

	http://www.perl.com/CPAN/ports/

if not.


-- Bibliography --


_Learning Perl_, Randal Schwartz and Tom Christiansen.  O'Reilly and
    Associates, 302 pp.   ISBN 1-56592-284-0.

    A tutorial introduction to Perl.  Be sure to get the second
    edition of this book, since the first edition is several years out
    of date.

_Programming Perl_, Larry Wall, Tom Christiansen, and 
    Randal L. Schwartz.  O'Reilly and Associates, 646+xxi pp. 
    ISBN 1-56592-149-6.

    The complete reference and guide to Perl for the programmer.
    Again, be sure you get the second edition.

_Advanced Perl Programming_, Sriram Srinivasan.  O'Reilly and
    Associates, 434 pp. ISBN 1-56592-220-4.

_The Perl Journal_.  Quarterly journal devoted to Perl.  See
    http://www.tpj.com/ for details or write to The Perl Journal, 
    POB 54, Boston MA 02101.