Perl weekly challenge 96

This week we had contrasting challenges.

Challenge 1 - Reverse Words

Take a string of words {with arbitrary white space around the words} and reverse the order of the words in the string and removing any redundant white space.

This is a classic example of a 1-liner....

   join q( ), reverse grep {$_} split m{\s+}, $_[0];

Challenge 2 - Edit Distance

I will provide 2 solutions here... one a less optimal solution which at the same time gives us a nice way of rendering the alignment - and then an more efficient "boiling down" of the first algorithm to just return the distance...

I'm just going to add "Another day job challenge!"

To be able to make "nicer" output - rather than just keeping track of the edit distance of substrings - we will actually keep the alignment of the two words as a string of "operations" whether they be Indels or SNPs.

One of my background is working with genomic data and this can be thought of as a simple alignment algorithm - and so I think of the three operations as Indels {inserts/deletes - remembering an insert into one sequence is just the same as a delete from the other} and SNPs - or single nucleotide polymorphisms.

The simple alignment string representation we will use consists of:
  '|' - the letters are the same;
  'v' - insert
  '^' - delete
  ' ' - SNP/modify

We can convert this to an edit distance by counting all the non-"|"s In perl we do this with tr/^v /^v / which returns the number of matches in scalar form. See {_edit_dist - function}

Finally we include a nice way to render the alignment {edits} By showing the two words with appropriate inserts in each word and indicate where the letters match in each word via a the alignment string in the middle. See {render_alighnment function}

  kitten-    sunday    boat rid-ing
   ||| |v      ||||    |^||||||v|||
  sitting    monday    b-at ridding

Additional note - we "memoise" the alignment function - as it will be called with the same subseq of letters following different paths through the two sequences. This increases performance...

From a "genomic" point of view this is known as the basis of the Smith-Waterman local alignment algorithm. Although Smith-Waterman has other features - including variable "penalties" for each type of edit {inserts, deletes, modifications}. Even having different penalties for certain changes {this is also similar to how typing correction software works - with assuming adjacent key typos are more likely.



We solve the recursively (stripping letters from one or both words each time). We have a number of options.

* Either of the words has no-letters - so the alignment is either a set of inserts/deletes from the other string.
* If the first character of each word is the same - we continue to the next letters {and an alignment is marked between the two words}
* If they are not the same - we look to see which of the options insert, delete or snp makes has the lowest score...

The other two helper functions render this string (given the two sequences) showing the gaps and alignments; and work out the edit distance from the alignment.

sub alignment_string {
  my( $s, $t ) = @_;
  my $key = "$s\t$t";
  return $cache{$key} if exists $cache{$key};
  ## Both strings are empty so reached end!
  return $cache{$key}||=''              if $t eq q() && $s eq q();
  ## Exhausted t so all edits are now deletes...
  return $cache{$key}||='^' x length $s if $t eq q();
  ## Exhausted s so all edits are now inserts...
  return $cache{$key}||='v' x length $t if $s eq q();
  ## First letters are the same so we just prepend the
  ## match symbol (|) and continue...
  return $cache{$key}||='|'.alignment_string(substr($s,1),substr($t,1))
                                        if ord $s == ord $t;

  ## We now have three choices - "insert", "delete" or "SNP"
  my($d,$i,$m) = (
    '^'.alignment_string( substr($s,1), $t           ),
    'v'.alignment_string( $s,           substr($t,1) ),
    ' '.alignment_string( substr($s,1), substr($t,1) ),
  return  $cache{$key}||=
        _edit_dist( $d ) < _edit_dist( $i )
    ? ( _edit_dist( $d ) < _edit_dist( $m ) ? $d : $m )
    : ( _edit_dist( $i ) < _edit_dist( $m ) ? $i : $m );

sub edit_distance {
  return _edit_dist( alignment_string( @_ ) );

sub _edit_dist { ## Count inserts(v), deletes(^) & mis-matches( )
  return $_[0] =~ tr/^v /^v /;

sub render_alignment {
  my( $s, $t ) = @_;
  my $a = alignment_string( $s, $t );
  my( $top, $bot ) = ( '','' );
  foreach ( split m{}, $a ) {
    $top .= $_ eq 'v' ? '-' : substr $s, 0, 1, '';
    $bot .= $_ eq '^' ? '-' : substr $t, 0, 1, '';
  return sprintf "%s\n%s (%d)\n%s\n",
    $top, $a, _edit_dist($a), $bot;

If we are not interested in the "alignment" diagram we can simplify the code:

sub edit_distance_simple {
  my( $s, $t ) = @_;
  return $cache_x{"$s\t$t"}||=
     $t eq q()          ? length $s
   : $s eq q()          ? length $t
   : (ord $s == ord $t) ? edit_distance(substr($s,1),substr($t,1))
   :                      1+(sort { $a <=> $b }
                            edit_distance(substr($s,1),substr $t,1)

Note re-caches - these memoize the function - from trials the approximate hit is 50% - this matches up with the non recursive solution.

Leave a comment

About James Curtis-Smith

user-pic Perl developer for nearly 30 years now, mainly in maintenance scripts and web pages, using mod_perl.