Date: 02 Oct 2001 16:52:35 -0400
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: Efficient code?
Message-Id: <m33d51n4rw.fsf@mumonkan.sunstarsys.com>

"S Warhurst" <s.warhurst@rl.ac.uk> writes:

> 1) If one wants to go through each line of a 100,000 line text file,
> looking for one of 50 different strings, is the quickest way to do it
> (in terms of processor time) to use 50 IF... ELSIF... commands? (some
> of these do require regexp matching.. eg: "I like cornflakes for
> breakfast" where cornflakes could be any breakfast cereal).

If you have enough RAM (which you probably do), you might try slurping 
the whole file in and study()ing the scalar before doing any pattern
matches:

  # untested code

  read TEXTFILE, $_, -s TEXTFILE;
  study;        # be careful, this might actually slow things down

  # offsets of each "\n" in $_; the -1 is a sentinel before line 1
  my @line_break = -1;
  push @line_break, pos() - 1 while /\n/g;

  # just in case the last char in $_ isn't a newline:
  push @line_break, length $_ unless $line_break[-1] == length($_) - 1;
  
  my @re = map qr/$_/, qw/list of 50 regexps/;

  foreach my $r (@re) {

    my $n = 1;  # line number of matching text

    while (/$r/g) {

        # first get $line: assumes $r won't match across lines,
        # i.e. be careful with "\s", which also matches "\n"

        # XXX: might be faster to invert @line_break 
        ++$n until $line_break[$n] >= pos;

        my $line = substr $_, $line_break[$n-1] + 1,
                              $line_break[$n] - $line_break[$n-1];

        # avoids multiple matches on the same line by advancing pos
        pos = $line_break[$n];

        # now do something

    }


  }
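
For what it's worth, here is a self-contained sketch of the same loop run
against a small in-memory string rather than a file. The sample text, the
two patterns, and the @hits array are made up for illustration, and
@line_break is assumed to hold the offset of each newline character (with
-1 as a sentinel before the first line):

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the slurped file.
my $text = "I like cornflakes for breakfast\n"
         . "nothing to see here\n"
         . "I like muesli for breakfast\n"
         . "cornflakes and more cornflakes";   # no trailing newline

# Offsets of every newline, with -1 as a sentinel before line 1.
my @line_break = (-1);
push @line_break, pos($text) - 1 while $text =~ /\n/g;
push @line_break, length($text)
    unless $line_break[-1] == length($text) - 1;

# Two illustrative patterns in place of the real list of 50.
my @re = map { qr/$_/ } ('cornflakes', 'I like \w+ for breakfast');

my @hits;    # each entry: [pattern index, line number, line text]
for my $i (0 .. $#re) {
    my $n = 1;
    while ($text =~ /$re[$i]/g) {
        # walk forward to the first line break at or after the match end
        ++$n until $line_break[$n] >= pos($text);

        my $line = substr $text, $line_break[$n-1] + 1,
                                 $line_break[$n] - $line_break[$n-1];

        pos($text) = $line_break[$n];   # skip the rest of this line
        chomp $line;
        push @hits, [$i, $n, $line];
    }
}

printf "pattern %d, line %d: %s\n", @$_ for @hits;
```

Because pos is moved to the end of the line after each hit, the last sample
line is reported once for 'cornflakes' even though the word occurs twice.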
  
The line marked XXX might be accelerated by first inverting @line_break
outside the loops:

  my %line_num = map {$line_break[$_] => $_} 0..$#line_break;

  # just in case last char isn't "\n":
  $line_num{ -1 } = $line_num{ $line_break[-1] };

and replacing the line below the XXX with

            $n = $line_num{ index($_, "\n", pos) };

but I haven't tested this either, so YMMV.
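
A small sketch of that lookup on its own, assuming @line_break holds the
offset of each newline character. The three-line sample string and the
$line_of helper are hypothetical, only there to make the idea checkable:

```perl
use strict;
use warnings;

# Hypothetical three-line sample; the last line has no trailing newline.
my $text = "alpha\nbeta\ngamma";

my @line_break = (-1);
push @line_break, pos($text) - 1 while $text =~ /\n/g;
push @line_break, length($text)
    unless $line_break[-1] == length($text) - 1;

# Map each newline offset to the number of the line it terminates.
my %line_num = map { ($line_break[$_] => $_) } 0 .. $#line_break;

# index() returns -1 when no newline follows, i.e. on the last line.
$line_num{-1} = $line_num{ $line_break[-1] };

# Line number for an arbitrary offset into $text (illustrative helper).
my $line_of = sub {
    my ($offset) = @_;
    return $line_num{ index($text, "\n", $offset) };
};

print "offset $_ is on line ", $line_of->($_), "\n" for 2, 7, 12;
```

This trades the linear walk through @line_break for one index() call and
one hash lookup per match; whether that wins in practice depends on how
many lines and matches there are, so it is worth benchmarking.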

> 2) If one has an array of email addresses like:
> 
> @emails = ('john@here.com', 'tim@there.com', 'bill@anisp.com',
> 'tim@there.com', 'john@here.com')
> 
> and wanted to make them unique I would use the line:
> 
> @emails = do {my %h; grep {!$h {$_} ++} @emails}
> 
> Is there an easy way to modify this line so that it gives a count of email
> addresses aswell, resulting in an array like this:
> 
> @emails = (['bill@anisp.com', '1'],
>                   ['john@here.com', '2'],
>                   ['tim@there.com', '2']);
> 

Since you already have %h, I don't know why you'd want to do this.
Here's one way:

  my %h;
  $h{$_}++ for @emails;
  @emails = map [$_ => $h{$_}], sort keys %h; # redundant given %h
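
As a runnable sketch using the sample addresses from the question (the
@counted name is mine, chosen so the original @emails stays intact):

```perl
use strict;
use warnings;

my @emails = ('john@here.com', 'tim@there.com', 'bill@anisp.com',
              'tim@there.com', 'john@here.com');

# Count occurrences of each address.
my %h;
$h{$_}++ for @emails;

# Sorted so the result order is predictable; keys() order is not.
my @counted = map { [ $_ => $h{$_} ] } sort keys %h;

print "$_->[0]: $_->[1]\n" for @counted;
```

If all you need is the count for a given address, $h{'john@here.com'} is
already the answer and the array of pairs can be skipped.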
  

> ----------------------------
> 
> 3) If I have an array like the following (there would actually be several
> hundred of rows):
> 
> @array = (['bath.ac.uk', '46'],
>                  ['blackpool.ac.uk', '22'],
>                  ['hull.ac.uk', '13'],
>                  ['sussex.ac.uk', '36'],
>                  ['hull.ac.uk', '31'],
>                  ['blackpool.ac.uk', '2']);
> 
> and I want to find the unique domains in column 1 and total the values in
> column 2 so it looks like this:
> 
> @array = (['bath.ac.uk', '46'],
>                  ['blackpool.ac.uk', '24'],
>                  ['hull.ac.uk', '44'],
>                  ['sussex.ac.uk', '36']);


Again, use a hash:

  my %h;
  $h{ $_->[0] } += $_->[1]  for @array;
  @array = map [ $_ => $h{$_} ], sort keys %h;   # redundant given %h
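
The same thing as a runnable sketch with the six sample rows from the
question (the %total and @summed names are mine, for illustration):

```perl
use strict;
use warnings;

my @array = (['bath.ac.uk',      '46'],
             ['blackpool.ac.uk', '22'],
             ['hull.ac.uk',      '13'],
             ['sussex.ac.uk',    '36'],
             ['hull.ac.uk',      '31'],
             ['blackpool.ac.uk', '2']);

# Sum column 2 per domain; the string values are used as numbers by +=.
my %total;
$total{ $_->[0] } += $_->[1] for @array;

# Rebuild the array of [domain, total] pairs in sorted domain order.
my @summed = map { [ $_ => $total{$_} ] } sort keys %total;

print "$_->[0]: $_->[1]\n" for @summed;
```

With several hundred rows this is a single pass over @array plus one sort
of the unique domains.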

-- 
Joe Schaefer     "There is something fascinating about science. One gets such
                    wholesale returns of conjecture out of such a trifling
                                     investment of fact."
                                               --Mark Twain