Sample solutions and discussion Perl Expert Quiz of The Week #14 (20030611) Manufacture a function, 'repeated_substring'. The argument to the function is a string. The function should return the longest substring of the argument that appears at least twice. For example, given "123 1234", the function should return "123". If there is more than one repeated substring of maximum length, the function may return any of the substrings. If there is no repeated substring, the function should return either undef or an empty string. The repetitions may *not* overlap. For example, "ana" is *not* a repeated substring of "banana". The longest repeated substrings in "banana" are "an" and "na". The function should be efficient, even for very large input strings. ---------------------------------------------------------------- Response to this quiz on the -discuss list was excellent. 33 programs were submitted by 16 different programmers. I ran a test suite for each one. The test suite contained 19 tests, and each program got up to 30 seconds to complete each test. This was a lot of time, since the fastest submissions were able to complete all 19 tests in 30 seconds total. Here are the preliminary results: Tests passed | Total time | | Length (lines) v v v hek2.pl ## 19 ( 29.3) 22 scott6.pl ## 19 ( 29.9) 22 scott4.pl ## 19 ( 29.9) 25 scott5.pl ## 19 ( 31.1) 25 parseval4.pl ## 19 ( 31.9) 26 hek1.pl ## 19 ( 36.7) 21 laarhoven1.pl ## 19 ( 71.2) 19 okopnik3.pl ## 19 (107.8) 18 okopnik1.pl ## 18 (101.8) okopnik2.pl ## 17 ( 80.6) georgep1.pl ## 17 ( 71.7) scott3.pl ## 16 ( 69.8) scott2.pl ## 16 ( 69.8) parseval3.pl ## 16 ( 36.8) kimball1.pl ## 16 ( 96.0) baringer1.pl ## 16 (124.6) wilson2.pl ## 15 (128.7) scott1.pl ## 15 ( 68.5) schmidt1.pl ## 15 ( 17.6) palmer1.pl ## 15 (161.0) webb2.pl ## 14 ( 70.7) parseval2.pl ## 14 ( 71.6) gray3.pl ## 14 (156.0) wilson1.pl ## 13 (123.6) gray2.pl ## 13 ( 51.8) gray1.pl ## 13 (163.1) webb3.pl ## 12 (117.5) kimball0.pl ## 12 (210.1) parseval1.pl ## 11 ( 72.3) lerner1.pl ## 11 (184.0) webb1.pl ## 10 ( 74.0) triv.pl ## 1 ( 0.035) All these results should be taken with a grain of salt, since my computer system might have been doing something else while the test was running; this might affect the times directly, and if it caused a program to take more than 30 seconds for a test, that would look like a test failure. There didn't seem to be a good way to choose between the top submissions from the overview data above, so I chose hek2.pl more or less arbitrarily. Here it is; I've changed the formatting a little bit. sub repeated_substring { my $size = length $_[0]; my $maxsize = int ($size / 2); my ($f_index, $f_offset) = (0,0); OUTER: for (my $index=0; $index < $size; $index++) { # The big optimizer # (added a bounds check to keep us from going beyond the # end of the data) if ($f_offset > 20 && $index + $f_offset <= $size) { for (my $t_index = $index + $f_offset - 15; $t_index >= $index; $t_index = $t_index - 15) { if (index($_[0], substr($_[0], $t_index, $index+$f_offset-$t_index), $index+$f_offset) == -1) { $index = $t_index; next OUTER; } } } for (my $offset=$f_offset+1; $offset <= $maxsize && $index + $offset <= $size; $offset++) { if (index($_[0], substr($_[0], $index, $offset), $index+$offset) >= 0) { $f_index = $index; $f_offset = $offset; } else { last; } } } return substr($_[0], $f_index, $f_offset); } Pr. Hek explains this program in some detail at http://perl.plover.com/~alias/list.cgi?1:mss:1438 Here's my understanding of it. Let's first ignore the block labeled "the big optimizer". Without this block, the program is: sub repeated_substring { my $size = length $_[0]; my $maxsize = int ($size / 2); my ($f_index, $f_offset) = (0,0); OUTER: for (my $index=0; $index < $size; $index++) { for (my $offset=$f_offset+1; $offset <= $maxsize && $index + $offset <= $size; $offset++) { if (index($_[0], substr($_[0], $index, $offset), $index+$offset) >= 0) { $f_index = $index; $f_offset = $offset; } else { last; } } } return substr($_[0], $f_index, $f_offset); } This reduced version of the program is still correct, although considerably slower. The target string is $_[0]. The longest repeated substring will occur in the target at position $f_index, and it will have length $f_offset. The main loop of the function, 'OUTER' scans the target from left to right, looking at substrings at position $index. The inner loop searches for repeated substrings of increasing length; $offset holds the length. The innermost block looks at the substring of length $offset at position $index; if it is a repeated substring, $f_index is replaced with $index and $f_offset with $offset. If there's a long repeated substring at position $index, then there must be a shorter one. Since the inner loop looks for shorter substrings first, it can quit as soon as the shorter substring it's looking for isn't repeated. Looking for repeated substrings is performed by if (index($_[0], substr($_[0], $index, $offset), $index+$offset) >= 0) { ... } substr($_[0], $index, $offset) is the first appearance of the substring the expression is looking for. index($_[0], SUBSTRING, $index+$offset) asks Perl to look for an appearance of the same substring but at a position $index+$offset or later in the string. If index returns a nonnegative value, this second apearance was found. To allow the function to locate overlapping repeated substrings, change ($index+$offset) to ($index+1). Now let's consider the optimization block: if ($f_offset > 20 && $index + $f_offset <= $size) { for (my $t_index = $index + $f_offset - 15; $t_index >= $index; $t_index = $t_index - 15) { if (index($_[0], substr($_[0], $t_index, $index+$f_offset-$t_index), $index+$f_offset) == -1) { $index = $t_index; next OUTER; } } } Here's the idea. Let's suppose that the target string is 800 characters long, that the function is now searching at position 400, and that the function has already located a repeated substring of length 40. The inner loop would begin by looking for repeated substrings of length 41. Let's suppose that the part of the string the function will examine begins like this: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ The next test will look to see if the string abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNO is repeated after position 441. It probably isn't, and if it isn't, the next thing the function will do is to look to see if the string bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP is repeated after position 442. If not, the next step will be to look for the string cdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP after position 443, and so on. Pr. Hek's insight was that these tests all have information in common. There's a good chance that ABCDEFGHIJKLMNO won't be repeated in the rest of the file, and if it isn't, then none of these three 41-character strings can be repeated either, since they all contain ABCDEFGHIJKLMNO. In fact, if ABCDEFGHIJKLMNO isn't a repeated string in the rest of the file, then any 41-character-long repeated substring must start no further left than position 400 + 41 - 14 = 427, which is the "B". A length-41 repeated string starting to the left of that position would have to contain ABCDEFGHIJKLMNO, which the function knows doesn't appear again. The optimization loop only kicks in when the longest repeated substring so far is at least 20 characters long. If it does kick in, then the 'for' loop examines the last 15 characters of the string. If the last 15 characters don't appear again, the $index = $t_index assignment skips to these last 15 characters and continues from there, short-circuiting all the tests that would normally be required to move that far to the right. If the last 15 characters in the substring *do* appear again, and the substring is long enough, the loop tries again with the its last 30 (then 45, 60, etc.) characters instead. The choice of the constants 20 and 15 is essentially arbitrary; They seems to have been chosen because they produced good performance for Pr. Hek's test data. Careful readers will have noticed a small discrepancy in the explanation above. In the abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ example, Pr. Hek's code will skip to the capital 'A' and continue searching for repeated substrings at that position. But since the function knows that "ABCDEFGHIJKLMNO" is not repeated, there can't be any long repeated substring at that position. It could have skipped to the capital "B" instead. Replacing $index = $t_index; with $index = $t_index+1; implements this; all the tests still pass, and performance on the test suite gets a few percent better. (Of course, I might have overlooked something here, and broken the program.) This optimization suggests further avenues for investigation. As I noted, the choice of 15 and 20 is essentially arbitrary. On some test cases, such as the 32kib random file, the optimization is never enabled at all, because there is no repeated substring of 20 characters. There's a tradeoff in the size of the $t_index value the function would like to find; if $t_index is small, the function can skip more characters, but if it's small, the skipping test is less likely to be successful. I wonder, for example, if some sort of binary search for the smallest successful $t_index would be worth doing. I also wonder if some kind of dynamic programming approach might work, first accumulating a table of short non-repeated substrings, and then using this information to eliminate large parts of the search. For example, suppose the target string is 10000 characters long, and one knows that the string at positions 4950-5050 is not repeated. This effectively cuts the target string in two, since repeated substring of length >= 100 must end to the left of 5050 or start to the right of 4950, and this prunes out a large part of the search space; one might be able to repeat the reasoning in regard to the length-100 strings at positions 99, 198, 297, etc., quickly ruling out the possibility of a repeated substring of length >= 100. ---------------------------------------------------------------- More notes: 1. This question was motivated by a problem I've been puzzling over for a long time: Given a large collection of source code, how can you find sections of the code that are identical or similar? These are targets for cleanup. For example, similar or identical bocks in two parts of a program can often be replaced by calls to a single function. Locating repeated substrings in a string isn't exactly what you want for this, of course. But it appears to be the hardest part of what you want. The two obvious problems that come to mind are that you want if ( keys %{ $h{$a} } ) { print "Finished\n"; } to be recognized as identical to if(keys%{$h{$a}}){print"Finished\n";} and that you want for (my $i=0; $i < @array; $i++) { print "$i: $array[$i]\n" } to be recognized as identical to for (my $j=0; $j < @item; $j++) { print "$j: $item[$j]\n" } Both of these seem easy to solve, by tokenizing the source code and then applying the repeated-substring algorithm to token sequences instead of to character sequences. There are, of course, a lot of other interesting subproblems here. The problem *has* been solved successfully; see Brenda Baker, Bell Labs http://cm.bell-labs.com/cm/cs/who/bsb/index.html Parameterized Duplications in Strings: Algorithms and an Application to Sotware Maintenance Siam J. Computing, October 1997. Unfortunately, Baker's source code is unavailable, and her algorithm is patented. 2. One idea that people often suggest in response to this problem is to split the string in half and then use a longest-common-substring algorithm such as the one in Algorithm::Diff. This doesn't work, of course, because both appearances of a repeated substring might be in the same half of the argument string, or one occurrence might span the midpoint of the argument. There may be a way to apply LCS techniques to this problem, but I don't know them. 3. The easy answer to the problem is to use the regex engine. For example, one might say sub repeated_substring { my ($lrs) = $_[0] =~ /(.*).*\1/s; $lrs; } Unfortunately, this doesn't work correctly. For the target "aa-bcbc" it reports that the longest repeated substring is "a". It's rather difficult to get the engine to produce the longest anything. Itzik Lerner used: sub repeated_string { my ($string,$result)=shift; while ($string=~/(.+)(?=.*?\1)/sg) { $result=$1 if length($1) > length($result); } return $result; } This produces the wrong answer for the string "acBBaccBB"; Here's what goes wrong. The first match locates the repeated "ac". Each /g match picks up where the previous one left off, so the regex engine then continues searching just before the first 'B'. It does locate the repeated substring 'BB', but this is the same length as 'ac', so it's ignored. Unfortunately, the longest repeated substring is 'cBB'. The 'cBB' overlaps the 'ac' that was found on the fist pass, so the engine starts in the middle of the 'cBB' on the second pass and never sees it. Even if this could be fixed, the key to solving this problem efficiently seemed to be using clever heuristics to prune the search space; the regex engine doesn't do this. 4. Marco Baringer posted test results and timings for many of the programs posted to the perl-qotw-discuss list; Pr. Baringer's results are available at http://www.bese.it/~segv/perl-qotw/14/results.html Tassilo Parseval posted an analysis of the dependence of running time on target string length and result string length, complete with graphs: http://www-users.rwth-aachen.de/tassilo.parseval/qotw/ My own test and timing results are available from http://perl.plover.com/qotw/misc/e014/ The test harness is 'TestDupl.pm'; some test data is in this file, and the rest is in the 't' subdirectory. The programs I tested are in 'progs' and the results are in 'results'. 5. I used an assortment of test data, including several short strings; the US Constitution; a 32 kiB file of random bytes (it happened to contain a repeated substring of length 3); the same random file with 64 x's appended to the beginning and the end; the longest possible file with 97 different characters but no repeated substring of length 2 or more; a 2 kb SQL statement supplied by James Edward Gray; and strings of 4, 64, and 16384 "a"s. 6. The usual approach to this sort of problem is to use a data structure called a "suffix tree". The Baker method I mentioned above is based on suffix trees. An explanation of suffix trees is available at http://perl.plover.com/qotw/misc/e014/papers/suffix-trees/ 7. Only Ton Hospel submitted a program that uses the suffix tree approach. Pr. Hospel's program caused Perl to dump core on my machine. 8. To abort programs early if they were taking a long time to run, my test harness calls alarm() before running each test. The idea is that when the alarm clock goes off, the incoming signal will interrupt the test. This works with Perl 5.6.1, but it didn't work well with 5.8.0. 5.8.0 introduced a new regime for handling signals. Instead of interrupting Perl and invoking the signal handler right away, an incoming signal in 5.8.0 just sets a flag. Perl then resumes what it was doing. After finishing each op, it checks the flag and dispatches the signal handler if the flag is set. Unfortunately, some Perl ops can take a very long time to finish. For example, Ronald Kimball sent in the following program: sub repeated_substring { my($str) = @_; my $substr = ''; my $pos = 0; while ($pos <= length($str) - 2) { pos($str) = $pos; if ($str =~ /((.{@{[ 1 + length $substr ]},}).*\2)/sgc) { $substr = $2; $pos = pos($str) - length($1) + 1; } else { $pos++; } } return $substr; } Pr. Kimball says "Of course, that gets ridiculously slow, because of all the backtracking." It is the regex match that is ridiculously slow. But if the alarm clock goes off while the regex match is in progress, Perl 5.8.0 sets the flag and resumes the match op. Control doesn't return to my test harness code until the ridiculously slow regex match completes. it took me quite a while to figure out why Pr. Kimball's program never finished. When I did figure it out, I reran all the tests with Perl 5.6.1. Thanks once again to everyone who participated. I will send a new quiz fairly late on Wednesday, since I'll be visiting New York all day.