(Message perl/qotw/discuss:468)
/home/mjd/bin/mailpager 468
Received: (qmail 22173 invoked by alias); 9 Nov 2002 20:11:02 -0000
Mailing-List: contact perl-qotw-discuss-help@plover.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:perl-qotw-discuss@plover.com>
List-Help: <mailto:perl-qotw-discuss-help@plover.com>
List-Unsubscribe: <mailto:perl-qotw-discuss-unsubscribe@plover.com>
List-Subscribe: <mailto:perl-qotw-discuss-subscribe@plover.com>
List-Digest-Subscribe: <mailto:perl-qotw-discuss-digest-subscribe@plover.com>
List-URL: <URL:http://perl.plover.com/qotw/>
Delivered-To: mailing list perl-qotw-discuss@plover.com
Received: (qmail 22165 invoked from network); 9 Nov 2002 20:11:01 -0000
Date: Sat, 09 Nov 2002 19:25:14 -0500
From: Ben Kennedy <bkennedy@hmsonline.com>
Subject: Re: [SPOILER] Perl 'Expert' Quiz-of-the-Week #4 Solution
To: perl-qotw-discuss@plover.com
Message-id: <005101c2884f$a388f150$9437fea9@panda>
MIME-version: 1.0
X-MIMEOLE: Produced By Microsoft MimeOLE V5.50.4522.1200
X-Mailer: Microsoft Outlook Express 5.50.4522.1200
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 7BIT
X-Priority: 3
X-MSMail-priority: Normal
References: <3DCD5B2D.30100@finkel.org>


From: "Avi" <avi@finkel.org>
To: <perl-qotw-discuss@plover.com>
Sent: Saturday, November 09, 2002 1:59 PM
Subject: [SPOILER] Perl 'Expert' Quiz-of-the-Week #4 Solution


> I haven't seen any solutions like this out there, so I thought I'd post
> mine.  It seems to run faster than many I've seen (takes about 4 minutes
> to run over Web2), but it does underperform Ben K's recursive solution
> (which I don't yet fully understand. :)

This solution I came up with is relatively simple (code repeated at bottom
of this message) - it's based on the observation that if the string is truly
an anagram, the first chunk of the string must go somewhere into the second
string.  This provides the basis for a simple recursive algorithm based
around placing the first chunk in the first string.  Since we already
established that a greedy approach may not work, it has to look at *every*
possible first chunk going into *every* possible location in the second
word.  For example, "aaabbb" going into "aabbba" has the following first
chunks:

a
aa
aaa
aaab
aaabb
aaabbb

Each chunk is then scanned for in the second string:

a can be matched to (a)abbba or a(a)bbba or aabbb(a)
aa can be matched to (aa)bbba
aaa, aaab, aaabb, aaabbb have no matches

Thus we now have four potential branches, so the function decends with each
of the following word pairs:

aabbb to -abbba
aabbb to a-bbba
aabbb to aabbb-
abbb to -bbba

Note the presense of dashed to indicate removal of the letters in
parenthesis.  If the chunk were simply removed, it would bring together
letters that weren't together initially, and the algoritm would not work.
These word pairs are then sent to the next level of recustion.  Note the
pairs after processing the third entry which was:

aabbb to aabbb-

We have the following new pairs:

Chunk is a, pairs are abbb to -abbb-, abbb to a-bbb-
Chunk is aa, pairs are bbb to -bbb-
Chunk is aab, pairs are bb to -bb-
Chunk is aabb, pairs are b to -b-
Chunk is aabbb, pairs are (null) to --

That last conditions means we have a complete anagram - the number of dashes
(2) is equal to the depth of recursion (2), which means we eliminated the
second word in only 2 chunks:

(a)(aabbb) to (aabbb)(a)

Once a solution is found, its marked as the "best" solution and the code
does not descend to levels beyond this depth.

Here are the functions again, with one optimization.  The previous version
tried all chunk from shortest to longest, e.g.

1
12
123
1234

The newer version tries them longest to shortest:

1234
123
12
1

While a purely greedy approach does guarantee an optimal solution, I think
processing the possible chunks in a greedy order will yield faster average
case performance.  Also if a two words aren't anagrams, it will go through
every possible set of chunks before failing.  But since the quiz states that
only anagrams will get passed to the function, I didn't worry about it.
Also the code uses spaces instead of dashes to replace chunks in the second
word.

--Ben Kennedy

sub min_chunks {
  my($w1, $w2) = @_;

  # this hash will be sent through the recursive function
  #  to short-circuit non-optimal solutions
  my(%data_collection) = (
                      BEST => length($w1) + 1,
                     );

  # perform the comparision
  recursive_compare(lc $w1, lc $w2, 1, [ ], \%data_collection);

  # indicates that we found at least one anagram
  if ($data_collection{BEST} <= length($w1)) {
    # print "Smallest decomposition: $data_collection{BEST}\n";
    # print join('', map { "($_)" } @{$data_collection{CHUNKS}}), "\n";
    return $data_collection{BEST};
  } else {
    # print "Hey, they weren't anagrams at all!!\n";
    return undef;
  }
}

sub recursive_compare {
  my($w1, $w2, $depth, $words, $shared_ref) = @_;

  for (my $i = length($w1) ; $i >= 1 ; $i--) {
    # fragment we are looking for in second string
    my $test_word = substr($w1, 0, $i);

    # rest of fragment to send recursively
    my $next_w1 = substr($w1, $i);

    # go through all instances of word fragment in second word - this
    while($w2 =~ /$test_word/g) {
      # form the next word - the space is required so that further
      #  chunks can't be made over the gap of the chunk we are removing
      my $next_w2= $` . " " . $';

      # indicates that entire second word has been replaced
      if ((" " x $depth eq $next_w2) and (not $next_w1)) {
        # note the new best score, and save the words for later
        $shared_ref->{BEST} = $depth;
        $shared_ref->{CHUNKS} = [ @$words, $test_word ]
      } elsif (($depth + 1) < $shared_ref->{BEST}) {
        # go down a level only if its a potential optimal solution
        #  - keep track of seen word list and depth
        recursive_compare($next_w1, $next_w2, $depth + 1,
                          [ @$words, $test_word ], $shared_ref);
      }
    }
  }
}