Sample solutions and discussion
Perl Expert Quiz of The Week #14 (20030611)


        Manufacture a function, 'repeated_substring'.  The argument to
        the function is a string.  The function should return the
        longest substring of the argument that appears at least twice.
        For example, given "123 1234", the function should return
        "123".

        If there is more than one repeated substring of maximum
        length, the function may return any of the substrings.  If
        there is no repeated substring, the function should return
        either undef or an empty string.

        The repetitions may *not* overlap.  For example, "ana" is
        *not* a repeated substring of "banana".  The longest repeated
        substrings in "banana" are "an" and "na".

        The function should be efficient, even for very large input
        strings.

----------------------------------------------------------------

Response to this quiz on the -discuss list was excellent.  33 programs
were submitted by 16 different programmers.  I ran a test suite for
each one.  The test suite contained 19 tests, and each program got up
to 30 seconds to complete each test.  This was a lot of time, since
the fastest submissions were able to complete all 19 tests in 30
seconds total.  Here are the preliminary results:

                           Tests passed
                           |   Total time
                           |   |        Length (lines)
                           v   v        v
        hek2.pl        ##  19  ( 29.3)  22
        scott6.pl      ##  19  ( 29.9)  22
        scott4.pl      ##  19  ( 29.9)  25
        scott5.pl      ##  19  ( 31.1)  25
        parseval4.pl   ##  19  ( 31.9)  26
        hek1.pl        ##  19  ( 36.7)  21
        laarhoven1.pl  ##  19  ( 71.2)  19
        okopnik3.pl    ##  19  (107.8)  18

        okopnik1.pl    ##  18  (101.8)
        okopnik2.pl    ##  17  ( 80.6)
        georgep1.pl    ##  17  ( 71.7)

        scott3.pl      ##  16  ( 69.8)
        scott2.pl      ##  16  ( 69.8)
        parseval3.pl   ##  16  ( 36.8)
        kimball1.pl    ##  16  ( 96.0)
        baringer1.pl   ##  16  (124.6)

        wilson2.pl     ##  15  (128.7)
        scott1.pl      ##  15  ( 68.5)
        schmidt1.pl    ##  15  ( 17.6)
        palmer1.pl     ##  15  (161.0)
        webb2.pl       ##  14  ( 70.7)
        parseval2.pl   ##  14  ( 71.6)
        gray3.pl       ##  14  (156.0)

        wilson1.pl     ##  13  (123.6)
        gray2.pl       ##  13  ( 51.8)
        gray1.pl       ##  13  (163.1)
        webb3.pl       ##  12  (117.5)
        kimball0.pl    ##  12  (210.1)
        parseval1.pl   ##  11  ( 72.3)
        lerner1.pl     ##  11  (184.0)
        webb1.pl       ##  10  ( 74.0)
        triv.pl        ##   1  (  0.035)

All these results should be taken with a grain of salt, since my
computer system might have been doing something else while the test
was running; this might affect the times directly, and if it caused a
program to take more than 30 seconds for a test, that would look like
a test failure.

There didn't seem to be a good way to choose between the top
submissions from the overview data above, so I chose hek2.pl more or
less arbitrarily. Here it is; I've changed the formatting a little
bit.

        sub repeated_substring {
          my $size = length $_[0];
          my $maxsize = int ($size / 2);
          my ($f_index, $f_offset) = (0,0);

        OUTER:
          for (my $index=0; $index < $size; $index++) {

            # The big optimizer 
            # (added a bounds check to keep us from going beyond the
            # end of the data) 
            if ($f_offset > 20 && $index + $f_offset <= $size) {
              for (my $t_index = $index + $f_offset - 15; 
                   $t_index >= $index; 
                   $t_index = $t_index - 15) {
                if (index($_[0], substr($_[0], $t_index, 
                                        $index+$f_offset-$t_index),
                          $index+$f_offset) == -1) {
                  $index = $t_index;
                  next OUTER;
                }
              }
            }

            for (my $offset=$f_offset+1; 
                 $offset <= $maxsize && $index + $offset <= $size; 
                 $offset++) {
              if (index($_[0], substr($_[0], $index, $offset), 
                        $index+$offset) >= 0) {
                $f_index  = $index;
                $f_offset = $offset;
              } else {
                last;
              }
            }
          }
          return substr($_[0], $f_index, $f_offset);
        }

Pr. Hek explains this program in some detail at

        http://perl.plover.com/~alias/list.cgi?1:mss:1438

Here's my understanding of it.  Let's first ignore the block labeled
"the big optimizer".   Without this block, the program is:

        sub repeated_substring {
          my $size = length $_[0];
          my $maxsize = int ($size / 2);
          my ($f_index, $f_offset) = (0,0);

        OUTER:
          for (my $index=0; $index < $size; $index++) {

            for (my $offset=$f_offset+1; 
                 $offset <= $maxsize && $index + $offset <= $size; 
                 $offset++) {
              if (index($_[0], substr($_[0], $index, $offset), 
                        $index+$offset) >= 0) {
                $f_index  = $index;
                $f_offset = $offset;
              } else {
                last;
              }
            }
          }
          return substr($_[0], $f_index, $f_offset);
        }

This reduced version of the program is still correct, although
considerably slower.  

The target string is $_[0].  The longest repeated substring will occur
in the target at position $f_index, and it will have length $f_offset.
The main loop of the function, 'OUTER' scans the target from left to
right, looking at substrings at position $index.  The inner loop
searches for repeated substrings of increasing length; $offset holds
the length.  The innermost block looks at the substring of length $offset
at position $index; if it is a repeated substring, $f_index is
replaced with $index and $f_offset with $offset.  

If there's a long repeated substring at position $index, then there
must be a shorter one.  Since the inner loop looks for shorter
substrings first, it can quit as soon as the shorter substring it's
looking for isn't repeated.

Looking for repeated substrings is performed by 

              if (index($_[0], substr($_[0], $index, $offset), 
                        $index+$offset) >= 0) {
                ...
               }

substr($_[0], $index, $offset) is the first appearance of the
substring the expression is looking for.  
index($_[0], SUBSTRING, $index+$offset) asks Perl to look for an
appearance of the same substring but at a position $index+$offset or
later in the string.  If index returns a nonnegative value, this
second apearance was found.  To allow the function to locate
overlapping repeated substrings, change ($index+$offset) to
($index+1). 

Now let's consider the optimization block:

            if ($f_offset > 20 && $index + $f_offset <= $size) {
              for (my $t_index = $index + $f_offset - 15; 
                   $t_index >= $index; 
                   $t_index = $t_index - 15) {
                if (index($_[0], substr($_[0], $t_index, 
                                        $index+$f_offset-$t_index),
                          $index+$f_offset) == -1) {
                  $index = $t_index;
                  next OUTER;
                }
              }
            }

Here's the idea.  Let's suppose that the target string is 800
characters long, that the function is now searching at position 400,
and that the function has already located a repeated substring of
length 40.  The inner loop would begin by looking for repeated
substrings of length 41.  Let's suppose that the part of the string
the function will examine begins like this:

        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

The next test will look to see if the string

        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNO

is repeated after position 441.  It probably isn't, and if it isn't,
the next thing the function will do is to look to see if the string

        bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP

is repeated after position 442.  If not, the next step will be to look
for the string 

        cdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP

after position 443, and so on.  Pr. Hek's insight was that these tests
all have information in common.  There's a good chance that
ABCDEFGHIJKLMNO won't be repeated in the rest of the file, and if it
isn't, then none of these three 41-character strings can be repeated
either, since they all contain ABCDEFGHIJKLMNO. In fact, if
ABCDEFGHIJKLMNO isn't a repeated string in the rest of the file, then
any 41-character-long repeated substring must start no further left
than position 400 + 41 - 14 = 427, which is the "B".  A length-41
repeated string starting to the left of that position would have to
contain ABCDEFGHIJKLMNO, which the function knows doesn't appear
again.

The optimization loop only kicks in when the longest repeated
substring so far is at least 20 characters long.  If it does kick in,
then the 'for' loop examines the last 15 characters of the string.  If
the last 15 characters don't appear again, the $index = $t_index
assignment skips to these last 15 characters and continues from there,
short-circuiting all the tests that would normally be required to move
that far to the right.  If the last 15 characters in the substring
*do* appear again, and the substring is long enough, the loop tries
again with the its last 30 (then 45, 60, etc.) characters instead.

The choice of the constants 20 and 15 is essentially arbitrary; They
seems to have been chosen because they produced good performance for
Pr. Hek's test data.

Careful readers will have noticed a small discrepancy in the
explanation above.  In the

        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

example, Pr. Hek's code will skip to the capital 'A' and continue
searching for repeated substrings at that position.  But since the
function knows that "ABCDEFGHIJKLMNO" is not repeated, there can't be
any long repeated substring at that position.  It could have skipped
to the capital "B" instead.  Replacing

          $index = $t_index;

with

          $index = $t_index+1;

implements this; all the tests still pass, and performance on the test
suite gets a few percent better.  (Of course, I might have overlooked
something here, and broken the program.)

This optimization suggests further avenues for investigation.  As I
noted, the choice of 15 and 20 is essentially arbitrary.  On some test
cases, such as the 32kib random file, the optimization is never
enabled at all, because there is no repeated substring of 20
characters.  There's a tradeoff in the size of the $t_index value the
function would like to find; if $t_index is small, the function can
skip more characters, but if it's small, the skipping test is less
likely to be successful.  I wonder, for example, if some sort of
binary search for the smallest successful $t_index would be worth
doing. 

I also wonder if some kind of dynamic programming approach might work,
first accumulating a table of short non-repeated substrings, and then
using this information to eliminate large parts of the search.

For example, suppose the target string is 10000 characters long, and
one knows that the string at positions 4950-5050 is not repeated.
This effectively cuts the target string in two, since repeated
substring of length >= 100 must end to the left of 5050 or start to
the right of 4950, and this prunes out a large part of the search
space; one might be able to repeat the reasoning in regard to the
length-100 strings at positions 99, 198, 297, etc., quickly ruling out
the possibility of a repeated substring of length >= 100.

----------------------------------------------------------------

More notes:

1. This question was motivated by a problem I've been puzzling over
   for a long time: Given a large collection of source code, how can
   you find sections of the code that are identical or similar?  These
   are targets for cleanup.  For example, similar or identical bocks
   in two parts of a program can often be replaced by calls to a
   single function.

   Locating repeated substrings in a string isn't exactly what you
   want for this, of course.  But it appears to be the hardest part of
   what you want.  The two obvious problems that come to mind are that
   you want

        if ( keys %{ $h{$a} } ) {
          print "Finished\n";
        }

   to be recognized as identical to

        if(keys%{$h{$a}}){print"Finished\n";}

   and that you want

        for (my $i=0; $i < @array; $i++) { print "$i: $array[$i]\n" }

   to be recognized as identical to

        for (my $j=0; $j < @item; $j++) { print "$j: $item[$j]\n" }

   Both of these seem easy to solve, by tokenizing the source code and
   then applying the repeated-substring algorithm to token sequences
   instead of to character sequences.  There are, of course, a lot of
   other interesting subproblems here.

   The problem *has* been solved successfully; see

        Brenda Baker, Bell Labs 
        http://cm.bell-labs.com/cm/cs/who/bsb/index.html 
        Parameterized Duplications in Strings: Algorithms and an
          Application to Sotware Maintenance  
        Siam J. Computing, October 1997. 

   Unfortunately, Baker's source code is unavailable, and her
   algorithm is patented.

2. One idea that people often suggest in response to this problem is
   to split the string in half and then use a longest-common-substring
   algorithm such as the one in Algorithm::Diff.  This doesn't work,
   of course, because both appearances of a repeated substring might
   be in the same half of the argument string, or one occurrence might
   span the midpoint of the argument.  There may be a way to apply LCS
   techniques to this problem, but I don't know them.

3. The easy answer to the problem is to use the regex engine.  For
   example, one might say

        sub repeated_substring { 
          my ($lrs) = $_[0] =~ /(.*).*\1/s;
          $lrs;
        }

   Unfortunately, this doesn't work correctly.  For the target
   "aa-bcbc" it reports that the longest repeated substring is "a".
   It's rather difficult to get the engine to produce the longest
   anything.  Itzik Lerner used:

        sub repeated_string {
          my ($string,$result)=shift;
          while ($string=~/(.+)(?=.*?\1)/sg) {
            $result=$1 if length($1) > length($result);
          }
          return $result;
        }

   This produces the wrong answer for the string "acBBaccBB"; Here's
   what goes wrong.  The first match locates the repeated "ac".  Each
   /g match picks up where the previous one left off, so the regex
   engine then continues searching just before the first 'B'.  It does
   locate the repeated substring 'BB', but this is the same length as
   'ac', so it's ignored.  Unfortunately, the longest repeated
   substring is 'cBB'.  The 'cBB' overlaps the 'ac' that was found on
   the fist pass, so the engine starts in the middle of the 'cBB' on
   the second pass and never sees it.

   Even if this could be fixed, the key to solving this problem
   efficiently seemed to be using clever heuristics to prune the
   search space; the regex engine doesn't do this.

4. Marco Baringer posted test results and timings for many of the
   programs posted to the perl-qotw-discuss list; Pr. Baringer's
   results are available at

        http://www.bese.it/~segv/perl-qotw/14/results.html

   Tassilo Parseval posted an analysis of the dependence of running
   time on target string length and result string length, complete
   with graphs:

        http://www-users.rwth-aachen.de/tassilo.parseval/qotw/

   My own test and timing results are available from

        http://perl.plover.com/qotw/misc/e014/

   The test harness is 'TestDupl.pm'; some test data is in this file,
   and the rest is in the 't' subdirectory.  The programs I tested are
   in 'progs' and the results are in 'results'.

5. I used an assortment of test data, including several short strings;
   the US Constitution; a 32 kiB file of random bytes (it happened to
   contain a repeated substring of length 3); the same random file
   with 64 x's appended to the beginning and the end; the longest
   possible file with 97 different characters but no repeated
   substring of length 2 or more; a 2 kb SQL statement supplied by
   James Edward Gray; and strings of 4, 64, and 16384 "a"s.

6. The usual approach to this sort of problem is to use a data
   structure called a "suffix tree".  The Baker method I mentioned
   above is based on suffix trees.  An explanation of suffix trees is
   available at

        http://perl.plover.com/qotw/misc/e014/papers/suffix-trees/

7. Only Ton Hospel submitted a program that uses the suffix tree
   approach.  Pr. Hospel's program caused Perl to dump core on my
   machine.

8. To abort programs early if they were taking a long time to run, my
   test harness calls alarm() before running each test.  The idea is
   that when the alarm clock goes off, the incoming signal will
   interrupt the test.

   This works with Perl 5.6.1, but it didn't work well with 5.8.0.
   5.8.0 introduced a new regime for handling signals.  Instead of
   interrupting Perl and invoking the signal handler right away, an
   incoming signal in 5.8.0 just sets a flag.  Perl then resumes what
   it was doing.  After finishing each op, it checks the flag and
   dispatches the signal handler if the flag is set.

    Unfortunately, some Perl ops can take a very long time to finish.
    For example, Ronald Kimball sent in the following program:

        sub repeated_substring {
          my($str) = @_;
          my $substr = '';

          my $pos = 0;  while ($pos <= length($str) - 2) {
            pos($str) = $pos;

            if ($str =~ /((.{@{[ 1 + length $substr ]},}).*\2)/sgc) {
              $substr = $2;
              $pos = pos($str) - length($1) + 1;
            } else {
              $pos++;
            }
          }

          return $substr;
        }

   Pr. Kimball says "Of course, that gets ridiculously slow, because
   of all the backtracking."  It is the regex match that is
   ridiculously slow.  But if the alarm clock goes off while the regex
   match is in progress, Perl 5.8.0 sets the flag and resumes the
   match op.  Control doesn't return to my test harness code until the
   ridiculously slow regex match completes.  it took me quite a while
   to figure out why Pr. Kimball's program never finished.  When I did
   figure it out, I reran all the tests with Perl 5.6.1.


Thanks once again to everyone who participated.  I will send a new
quiz fairly late on Wednesday, since I'll be visiting New York all day.