is_almost()

I struggled with a problem where a given method would return an array of array refs of data, but the order (and sometimes presence) of array ref elements were sometimes slightly different. This is because this code needed to test real data and I could not mock the results. After giving this some thought, I realized I wanted something like the Levenshtein edit distance for data structures. Marcel Grünauer suggested that each element get assgined a unicode character. This solves my problem nicely with the following code ...

(Fair warning, this is a hack)

use strict;

use Test::More 'no_plan';

use Test::Differences;
use Data::Dumper;
use Text::WagnerFischer 'distance';

sub is_almost($$$;$) {
    my ( $have, $want, $threshhold, $message ) = @_;

    $message ||= 'The two arrays should be close enough';

    unless ( 'ARRAY' eq ref $have and 'ARRAY' eq ref $want ) {
        require Carp;
        Carp::confess(
            "First two arguments to is_almost() must be array refs");
    }

    local $Test::Builder::Level = $Test::Builder::Level + 1;
    if ( !@$want ) {
        if ( !@$have ) {
            pass $message;
        }
        else {
            eq_or_diff $have, $want, $message;
        }
        return;
    }

    my %char_for;
    my $index = 1;
    my ( $have_str, $want_str ) = ( '', '' );
    local $Data::Dumper::Indent   = 0;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Terse    = 1;

    foreach my $element (@$have) {
        $have_str .= $char_for{ Dumper($element) } ||= chr( $index++ );
    }
    foreach my $element (@$want) {
        $want_str .= $char_for{ Dumper($element) } ||= chr( $index++ );
    }
    my $distance = distance( $have_str, $want_str ) / @$want;
    if ( $distance <= $threshhold ) {
        pass $message;
        if ($distance) {
            diag "Distance is $distance";
        }
    }
    else {
        eq_or_diff $have, $want, $message;
        diag "Distance is $distance";
    }
}
my $want = [
    [ 1, 'North Beach',       'au', 'city' ],
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
my $have = [
    [ 1, 'North Beach',       'au', 'city' ],
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
is_almost $have, $want, .20;

$have = [
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
];
is_almost $have, $want, .20;
$have = [
    [ 2, 'North Beach',       'us', 'city' ],
    [ 3, 'North Beach',       'us', 'city' ],
    [ 4, 'North Beach Hotel', 'us', 'hotel' ],
    [ 5, 'North Beach',       'us', 'city' ],
    [ 6, 'North Beach',       'us', 'city' ],
    [ 1, 'North Beach',       'au', 'city' ],
];
is_almost $have, $want, .20;
__END__
ok 1 - The two arrays should be close enough
ok 2 - The two arrays should be close enough
# Distance is 0.166666666666667
not ok 3 - The two arrays should be close enough
#   Failed test 'The two arrays should be close enough'
#   at almost.pl line 90.
# +----+------------------------------+----+------------------------------+
# | Elt|Got                           | Elt|Expected                      |
# +----+------------------------------+----+------------------------------+
# |    |                              *   0|1,North Beach,au,city         *
# |   0|2,North Beach,us,city         |   1|2,North Beach,us,city         |
# |   1|3,North Beach,us,city         |   2|3,North Beach,us,city         |
# |   2|4,North Beach Hotel,us,hotel  |   3|4,North Beach Hotel,us,hotel  |
# |   3|5,North Beach,us,city         |   4|5,North Beach,us,city         |
# |   4|6,North Beach,us,city         |   5|6,North Beach,us,city         |
# *   5|1,North Beach,au,city         *    |                              |
# +----+------------------------------+----+------------------------------+
# Distance is 0.333333333333333

It seems many of the "edit distance" modules struggle with unicode, so I played with different ones until I had one which led to results I considered vaguely satisfactory.

This is the first pass at a rough, rough hack. Suggestions welcome.

1 Comment

Interesting :) But you seem to use control chars with $index = 1

Leave a comment

About Ovid

user-pic Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/