is_almost()

I struggled with a problem where a given method would return an array of array refs of data, but the order (and sometimes presence) of array ref elements were sometimes slightly different. This is because this code needed to test real data and I could not mock the results. After giving this some thought, I realized I wanted something like the Levenshtein edit distance for data structures. Marcel GrĂ¼nauer suggested that each element get assgined a unicode character. This solves my problem nicely with the following code ...

(Fair warning, this is a hack)

use strict;

use Test::More 'no_plan';

use Test::Differences;
use Data::Dumper;
use Text::WagnerFischer 'distance';

sub is_almost($$$;$) {
    my ( $have, $want, $threshhold, $message ) = @_;

    $message ||= 'The two arrays should be close enough';

    unless ( 'ARRAY' eq ref $have and 'ARRAY' eq ref $want ) {
        require Carp;
        Carp::confess(
            "First two arguments to is_almost() must be array refs");
    }

    local $Test::Builder::Level = $Test::Builder::Level + 1;
    if ( !@$want ) {
        if ( !@$have ) {
            pass $message;
        }
        else {
            eq_or_diff $have, $want, $message;
        }
        return;
    }

    my %char_for;
    my $index = 1;
    my ( $have_str, $want_str ) = ( '', '' );
    local $Data::Dumper::Indent   = 0;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Terse    = 1;

    foreach my $element (@$have) {
        $have_str .= $char_for{ Dumper($element) } ||= chr( $index++ );
    }
    foreach my $element (@$want) {
        $want_str .= $char_for{ Dumper($element) } ||= chr( $index++ );
    }
    my $distance = distance( $have_str, $want_str ) / @$want;
    if ( $distance <= $threshhold ) {
        pass $message;
        if ($distance) {
            diag "Distance is $distance";
        }
    }
    else {
        eq_or_diff $have, $want, $message;
        diag "Distance is $distance";
    }
}
my $want = [
    [ 1, 'North Beach',       'au', 'city' ],
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
my $have = [
    [ 1, 'North Beach',       'au', 'city' ],
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
is_almost $have, $want, .20;

$have = [
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
is_almost $have, $want, .20;
$have = [
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
    [ 1, 'North Beach',       'au', 'city' ],
];
is_almost $have, $want, .20;
__END__
ok 1 - The two arrays should be close enough
ok 2 - The two arrays should be close enough
# Distance is 0.166666666666667
not ok 3 - The two arrays should be close enough
#   Failed test 'The two arrays should be close enough'
#   at almost.pl line 90.
# +----+------------------------------+----+------------------------------+
# | Elt|Got                           | Elt|Expected                      |
# +----+------------------------------+----+------------------------------+
# |    |                              *   0|1,North Beach,au,city         *
# |   0|2,North Beach,us,city         |   1|2,North Beach,us,city         |
# |   1|3,North Beach,us,city         |   2|3,North Beach,us,city         |
# |   2|4,North Beach Hotel,us,hotel  |   3|4,North Beach Hotel,us,hotel  |
# |   3|5,North Beach,us,city         |   4|5,North Beach,us,city         |
# |   4|6,North Beach,us,city         |   5|6,North Beach,us,city         |
# *   5|1,North Beach,au,city         *    |                              |
# +----+------------------------------+----+------------------------------+
# Distance is 0.333333333333333

It seems many of the "edit distance" modules struggle with unicode, so I played with different ones until I had one which led to results I considered vaguely satisfactory.

This is the first pass at a rough, rough hack. Suggestions welcome.

1 Comment

Interesting :) But you seem to use control chars with $index = 1

Leave a comment

About Ovid

user-pic Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.