Finding Common Ground ... in Both Directions

Last week I formulated an interesting problem in text processing while working on one of my hobbies.  Since I was only able to devote an hour or two here and there, it took me a few days to get the solution up and running, which indicated to me that it wasn’t as simple as I’d thought it would be at first.  And since “not simple” often means “interesting,” I thought I’d share it here with you.  (Note that I don’t claim this is the best solution, or the most efficient, or the most elegant.  I’m perfectly happy to receive suggestions for improvement if you’re so inclined.)

The exact application isn’t important, 1 so let’s just look at the general parameters.  There’s a series of cards, and each card has one or more powers on it (where a “power” in this case just means a block of text).  I have a file with all the powers in it, and a little script to help me search for patterns within and across the powers.  Let’s assume the powers come in, one per line, like so:

Name of Card / Name of Power : Description text of power.

(They’re not actually formatted like that in the file, but I have another script that transforms them into this.)  So my analyze script lets me search for a given regex (Perl-style, natch) and prints a nice summary of the results, like so:2

analyze 'after attacking'
 
5 times in 5 powers on 5 cards
  After attacking : 5

Okay, that one isn’t very exciting.  But suppose we search for something a little more flexible?

analyze 'add\w* \w+ dic?e'
 
43 times in 39 powers on 36 cards
  add 1 die : 6
  add 2 dice : 3
  add 3 dice : 1
  add 4 dice : 1
  additional attack dice : 6
  additional attack die : 16
  additional defense dice : 1
  additional defense die : 7
  adds 1 die : 1
  adds 4 dice : 1

That’s pretty useful, and it doesn’t take a lot of code to get us this—hey, this is Perl, after all.

my $times = 0;
my (@powers, %cards, %instances);
while ( <> )
{
    my $fullpower = $_;
    if ( /$pattern/io )
    {
        $times +=()= m/$pattern/gio;
        $fullpower =~ m{^(.*?) / } or die("can't figure out which card this is: $_");
        $cards{$1} = 1;
        ++$instances{$1} while m/(\w*$pattern\w*)/gio;
 
        $fullpower =~ s/($pattern)/ colored($1, qw< bold green >) /egio;
        push @powers, $fullpower;
    }
}
 
say "$times times in " . scalar(@powers) . " powers on " . (keys %cards) . " cards";
say "  $_ : $instances{$_}" foreach keys %instances;
if ($opts->{'v'})
{
    say '';
    say foreach @powers;
}

(The real code has a bit more to it, but this is all we need to look at for today’s purposes.)  So we’re going through our input, one line at a time.3  First we copy the power, then we search for our pattern (the /o modifier helps with efficiency, since the pattern won’t change throughout the program).  If we find it, we count how many times the pattern appears and add that to our overall counter.  Then we pull the card name out so we can keep track of the total number of cards.  Then we compile the instance and its count—this is for the breakdown that appears underneath our totals.  Notice that this second time we search, we surround our pattern with \w*: that allows our instances to be composed of whole words only.  Finally, we mark the text we found in green and push that onto an array just in case the user wants to see the full text of everything that matches.  Print out the summary and voilà: we’re all set.

Now, this was a fairly basic, first attempt.  Not to imply that I got this close my first time out of the gate, but this was my first attempt which actually worked, and I ran with this for a long while.

When I started fiddling again, I came up with a couple of very mild improvements:

my $times = 0;
my (@powers, %cards, %instances);
while ( <> )
{
    my $fullpower = $_;
    if ( /$pattern/io )
    {
        $fullpower =~ m{^(.*?) / } or die("can't figure out which card this is: $_");
        $cards{$1} = 1;
        ++$times and ++$instances{$1} while m/(\w*$pattern\w*)/gio;
 
        $fullpower =~ s/($pattern)/ colored($1, qw< bold green >) /egio;
        push @powers, $fullpower;
    }
}
 
say "$times times in " . scalar(@powers) . " powers on " . (keys %cards) . " cards";
say "  $_ : $instances{$_}" foreach sort keys %instances;
if ($opts->{'v'})
{
    say '';
    say foreach @powers;
}

There are only two differences here.  First of all, I ditched my oh-so-clever use of the goatse operator, tortured into something even more unspeakable via mutation with the += operator.  Although, honestly, it wasn’t the fact that I was writing something that very few readers would understand that bugged me: it was the fact that I was searching for the same pattern in the same text 3 times in a row.  At least this way I got it down to 2.  And, yes: it’s true that /$pattern/gio and /(\w*$pattern\w*)/gio aren’t technically the same pattern.  But, if you think about it, they’re both going to match the same number of times, which is all we care about in this case.

The second difference is just that I sorted the summary instances.  At first I thought I wouldn’t care, which is why I didn’t bother in the first place.  But, in practice, I found that sorting helped group similar instances together, which made the output more useful.

Now here’s the search I did recently which led me to ponder my improvement:

analyze '\w+ space'
 
226 times in 180 powers on 130 cards
  1 space : 2
  10 spaces : 1
  2 spaces : 3
  3 spaces : 2
  4 spaces : 9
  5 spaces : 8
  7 spaces : 1
  8 spaces : 2
  The space : 1
  This space : 2
  X spaces : 1
  a space : 7
  additional space : 3
  additional spaces : 3
  adjacent space : 1
  chosen space : 1
  counting spaces : 48
  dungeon space : 3
  empty space : 9
  empty spaces : 2
  ice space : 4
  ice spaces : 2
  land space : 1
  lava space : 1
  lava spaces : 6
  level space : 5
  of spaces : 2
  one space : 2
  road space : 1
  shadow space : 7
  sight spaces : 49
  snow space : 2
  switching spaces : 1
  that space : 3
  the space : 3
  those spaces : 3
  unoccupied space : 2
  water space : 16
  water spaces : 7

My first thought was that it would be nice to “collapse” some of those, like turn “2 spaces” and “3 spaces” and so forth into, say, ”# spaces.” So I did that; it was pretty simple, and therefore uninteresting, so I’m not going to show that here.  But my next thought was far more intriguing.

See, I happen to know that every power that has the words “sight spaces” has them in the phrase “clear sight spaces.” So I started to wonder if there was a way to have the program show me that: sort of a way to “expand” the instances.4  This turns out to be one of those problems that’s easy to grasp as a human, but difficult to explain to a computer.  So let’s try to specify exactly what “expanding” means in this context.

Given a set of words that appear in several texts, find all contiguous words that appear in identical positions in all the texts.  So, basically: take the first instance in a set, find the word immediately previous to the found phrase, then check to see if that same word appears previous to the found phrase in all other instances in the set.  If it does, check the word before that; repeat until you find a word that isn’t the same in all instances or you hit the beginning of the string.  Then do the same for the word immediately following the phrase, repeating until it isn’t found in all instances or you hit the end of the string.  So, not particularly simple.  But not that hard either.

One thing we’d better decide up front is what to do when the number of instances in a set is small.  Doesn’t matter if the number is large or not—it’ll work well for any number of instances from, say, ten to a thousand.  But what if there’s only a single instance?  What if there are only 2 or 3, but they’re all really the same text?  We would “expand” our few words into the whole string.  That’s not very helpful.

Let’s make some arbitrary decisions here (sometimes your decisions have to be somewhat arbitrary, but that doesn’t mean you shouldn’t make ’em).  If we have one instance, we won’t expand at all.  To avoid expanding out to the whole string for small numbers of instances which might be exact duplicates, let’s say we never cross a sentence boundary.  (In general practice, determining sentence boundaries is a non-trivial task.  But in this specific case we’re blessed with a domain of sentences that are all declarative and contain no abbreviations.  So we’ll take the simplistic route of just never crossing a period.)

So let’s expand our basic loop a bit:

my $times = 0;
my (@powers, %cards, %instances, %instance_refs);
while ( <> )
{
    my $fullpower = $_;
    if ( /$pattern/io )
    {
        $fullpower =~ m{^(.*?) / } or die("can't figure out which card this is: $_");
        $cards{$1} = 1;
        while ( m/(\w*$pattern\w*)/gio )
        {
            ++$times;
            ++$instances{$1};
            if ($opts->{'e'})
            {
                my $key = $1;
                my $ref =
                {
                    source  =>  $_,
                    start   =>  $-[1],
                    end     =>  $+[1],
                };
                push @{ $instance_refs{$key} }, $ref;
            }
        }
 
        $fullpower =~ s/($pattern)/ colored($1, qw< bold green >) /egio;
        push @powers, $fullpower;
    }
}
%instances = expand_instances(%instances) if $opts->{'e'};

Now, assuming we’re doing expansion, we won’t just count our instances: we’ll save a ref to the source that matched, along with where in the string the match begins and ends (happily, @- and @+ exist for this very reason5).

Next we need a way to find the previous word and the next word.  We should be able to do that with a single function taking an argument indicating which direction you want to search in.  That turned out to be a bit trickier than it seemed, but then I hit on the trick of reversing the string when searching backward for previous words.  Here’s what I came up with:

func find_contiguous_word ($direction, $ref, $adj)
{
    my $string = $ref->{'source'};
    my $origin;
    if ($direction eq 'prev')
    {
        $string = reverse $string;
        $origin = length($string) - $ref->{'start'} + $adj;
    }
    else
    {
        $origin = $ref->{'end'} + $adj;
    }
 
    substr($string, $origin) =~ /([^.]+?)(\W|$)/;
    return $direction eq 'prev' ? reverse $1 : $1;
}

The “adjustment” ($adj) starts out at 0 but increases as we find more and more words in common.  If we’re looking backwards, we reverse the source string and set our origin to the length minus the start (because it’s backwards now) and then add the adjustment (which is positive, so you might think we need to subtract it, except everything is backwards now).  If we’re looking forwards, we just need to add the adjustment to the end.  Then we find any number of characters, as long as it’s not a period, up to the first non-word character, or the end of the string (which will be the beginning of the string if we’ve reversed it).  Then we return what we found, reversing it again to make it forwards if we reversed it in the first place.  There’s possibly a more elegant way to do this, but this definitely works.

Now we just need the main event: the actual expansion calculation:

func expand_instances (%instances)
{
    my %expanded;
    foreach (keys %instances)
    {
        my $refs = $instance_refs{$_} or die("can't find ref for $_");
        if (@$refs == 1)
        {
            $expanded{$_} = $instances{$_};
        }
        else
        {
            my $first = shift @$refs;
            my ($move_back, $move_fwd) = (0,0);
            PREV_WORD: while (my $p = find_contiguous_word(prev => $first, $move_back))
            {
                foreach (@$refs)
                {
                    last PREV_WORD unless substr($_->{'source'}, $_->{'start'} - $move_back - length($p), length($p)) eq $p;
                }
                $move_back += length($p);
            }
            NEXT_WORD: while (my $n = find_contiguous_word(next => $first, $move_fwd))
            {
                foreach (@$refs)
                {
                    last NEXT_WORD unless substr($_->{'source'}, $_->{'end'} + $move_fwd, length($n)) eq $n;
                }
                $move_fwd += length($n);
            }
            my $start = $first->{'start'} - $move_back;
            my $end = $first->{'end'} + $move_fwd;
            my $key = substr($first->{'source'}, $start, $end - $start);
            $key =~ s/^\s+//;
            $expanded{$key} = $instances{$_};
        }
    }
    return %expanded;
}

Whew! that’s a mouthful.  Let’s break it down.

First we make a new hash to hold our instance counts—remember, the values of our hash aren’t going to change, just the keys.  Rather than trying to modify the original hash, it’s just easier to make a new one.  For each key in the original hash, we get the corresponding ref hash.6  If there’s only one instance in it, just copy the count.  If there’s more than one, then things get interesting.

First, just shift off the first one.  Doesn’t really matter which one we pick, so I’m just grabbing the first one.  I shift it off to save having to compare it against itself, which wouldn’t tell us much.  Set our forwards and backwards adjustments to 0.  Now, while we find a previous word (which we won’t, if we either hit a period or the front of the string), compare the word we found against the previous word for each of the other refs.  If we don’t find a match, break out of both loops.  But if we have the same previous words in all the other refs, tweak our backwards adjustment and move backwards another word.  The loops for finding the words forwards are much the same.  It took me a few tries to get the math right for calculating the substrings, but what you see there is correct.  I really wanted to find some clever way to combine the two loop pairs, but this is workable.

Now that we know how far to adjust backwards and forwards, we can figure out what our expanded key should be.  We’ll use the first ref ... again, it doesn’t actually matter which one we use, since they’ll all give us the same answer, but $first is already in a separate variable, so it’s convenient.  We adjust the start and end by the adjustments we’ve calcualted (which might well be 0), grab the substring, and trim any spaces off the front.  (Because of the way I’m getting words, I’ll have leading space if we adjusted backwards; I could probably fiddle with my regex to eliminate that, but it seemed simple enough to just trim it here.)  Then we copy over the count into the new hash.  And that’s pretty much all there is to it.

So let’s see our new algorithm in action:

analyze -e '\w+ space'
 
226 times in 180 powers on 130 cards
  1 space : 2
  10 spaces : 1
  2 spaces : 3
  3 spaces : 2
  4 spaces : 9
  5 spaces : 8
  7 spaces : 1
  8 spaces : 2
  All figures on those spaces : 3
  Count the minimum number of spaces between the attacker and : 2
  The space : 1
  This space may be up to : 2
  When counting spaces for : 48
  X spaces : 1
  a space : 7
  additional spaces : 3
  adjacent space : 1
  chosen space : 1
  clear sight spaces : 49
  empty space : 9
  empty spaces : 2
  is on a dungeon space, add 3 to your die roll : 3
  land space : 1
  lava space : 1
  molten lava spaces : 6
  move one additional space : 3
  normal ice spaces : 2
  one space : 2
  or ice space, : 4
  road space : 1
  same-level space adjacent to : 5
  shadow space : 7
  snow space : 2
  switching spaces : 1
  that space : 3
  the space : 3
  unoccupied space : 2
  water space : 16
  water spaces : 7

Our totals are the same, and we get the same number of instances with the same subtotals.  But the instances themselves are expanded to include all common surrounding words.  This allows me to see the maximum amount of context without causing me to end up with more groupings.

Hopefully you found that interesting in some small way.  Or perhaps you’ve spotted some glaring inefficiency, or just plain silliness that could be rewritten much more sensibly.  As I said up at the top of the post, feel free to point those out in the comments.


1 Although if you’re super interested, you could always read about it on my Other Blog.


2 Note that my search is case-insensitve.  I’m searching English, not code.


3 Remember that there’s a separate transform script which makes each power a single line, and a wrapper script which feeds that output into the STDIN of this script.


4 Of course, with this terminology, it could actually make sense to both “expand” and “collapse” the instances at the same time.  Which probably indicates I need better terminology.


5 Check out their entries in man perlvar for full details.


6 Yes, I’m “cheating” a bit here by taking advantage of the fact that %instance_refs is file-scoped.  It saves me having to pass %instances by reference.


Leave a comment

About Buddy Burden

user-pic 14 years in California, 25 years in Perl, 34 years in computers, 55 years in bare feet.