Perl Weekly Challenge 057: Invert Tree and Shortest Unique Prefix

Shortest Unique Prefix

Write a script to find the shortest unique prefix for each each word in the given list. The prefixes will not necessarily be of the same length.

Sample Input

[ "alphabet", "book", "carpet", "cadmium", "cadeau", "alpine" ]

Expected Output

[ "alph", "b", "car", "cadm", "cade", "alpi" ]

Let me start with the second task as it was definitely simpler (at least for me).

We iterate over all the input words. For each word, we try to find the shortest prefix possible. To know what prefixes have already been used, we keep two hashes: one stores the abandoned prefixes (i.e. those that were not unique anymore), the second one stores the “current” prefixes (the prefix is the key, the actual word is the value). We start from length 1 and add 1 in each step. If the prefix isn’t used and hasn’t been used, we assign it to the word and proceed to the next word. If the prefix is currently used for a different word, we store the prefix as “used” and prolong the prefix for the old word by one—but we continue the loop for the current word, in case their common prefix is longer.

At the end, we map the words to the prefixes and return them. Here’s the code with some tests:

#!/usr/bin/perl
use warnings;
use strict;

sub shortest_unique_prefixes {
    my @words = @_;
    my (%prefixes, %used);
    for my $word (@words) {
        my $i = 1;
        while ($i <= length $word) {
            my $prefix = substr $word, 0, $i++;
            if (exists $prefixes{$prefix}) {
                undef $used{$prefix};
                my $old = $prefixes{$prefix};
                $prefixes{ substr $old, 0, $i }
                    = delete $prefixes{$prefix};
                die "Duplicate prefix: $prefix ($old:$word)"
                    if $i > length $word || $i > length $old;

            } elsif (! exists $used{$prefix}) {
                $prefixes{$prefix} = $word;
                last
            }
        }
    }
    my %to_prefixes = reverse %prefixes;
    return [@to_prefixes{@words}]
}

use Test::More tests => 4;
use Test::Exception;

is_deeply
    shortest_unique_prefixes(
        qw( alphabet book carpet cadmium cadeau alpine )),
    [qw[ alph b car cadm cade alpi ]],
    'sample input';

throws_ok {
    shortest_unique_prefixes(qw( perl perl ))
} qr/Duplicate prefix: perl /, 'detect duplicate';

throws_ok {
    shortest_unique_prefixes(qw( A AA AAA ))
} qr/Duplicate prefix: A /, 'common prefix asc';

throws_ok {
    shortest_unique_prefixes(qw( BBB BB B ))
} qr/Duplicate prefix: B+ /, 'common prefix desc';

Note that for “perl” and „perlaceous”, the subroutine dies, as there’s no unique prefix to distinguish the two words.

Invert Tree

You are given a full binary tree of any height, similar to the one below:
       1
      /  \
     2    3
    / \  / \
   4   5 6  7

Write a script to invert the tree, by mirroring the children of every node, from left to right. The expected output from the tree above would be:

       1
      /  \
     3    2
    / \  / \
   7   6 5  4

The input can be any sensible machine-readable binary tree format of your choosing, and the output should be the same format.

BONUS

In addition to the above, you may wish to pretty-print your binary tree in a human readable text-based format similar to the following:

       1
      /  \
     3    2
    / \  / \
   7   6 5  4

There are two basic ways to represent a tree.

  1. List of the edges. In our case, we might need to add the information whether the edge goes to the left or to the right, but we can also imagine that the first edge described for a given parent is always the left one. Example: (child direction parent)
    2 L 1
    3 R 1
  2. Nest the values. Follow the parent’s value with the children’s values enclosed in parentheses. Example:
    1(2,3)

The first serialisation way is easy to invert. We don’t remove nor add any edges, we just change their orientation. So, to invert a tree, we just need to replace all L’s with R’s and vice versa.

#!/usr/bin/perl
use warnings;
use strict;

print tr/LR/RL/r while <>;

If we hadn’t indicated the direction of an edge but assumed the first edge for a given parent goes to the left, the solution would have been to print the edges in the reversed order.

print for reverse <>;

The second notation is a bit more complex. We need to build some kind of a structure, invert it, and serialise it back. We could use a recursive structure to reflect the input, but it’s easier to just store a list of children for each parent (and it’s easier to implement the inversion for such a representation, too).

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

sub invert {
    my ($tree) = @_;
    $_ = [reverse @$_] for values %$tree;
}

sub serialise {
    my ($node, $tree) = @_;
    return $node unless exists $tree->{$node};
    return "$node("
           . join(',', map serialise($_, $tree), @{ $tree->{$node} })
           . ')'
}

chomp( my $structure = <> );

my %tree;
while ($structure =~ s/([0-9]+) \( ([0-9]+) , ([0-9]+) \) /$1/x) {
    my ($parent, $left, $right) = ($1, $2, $3);
    $tree{$parent} = [$left, $right];
}

invert(\%tree);
say serialise($structure, \%tree);

If you’re interested how the recursive representation works, here’s the solution, too:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use Marpa::R2;

sub invert {
    my ($tree) = @_;
    return $tree unless ref $tree;

    my ($root) = keys %$tree;
    $tree->{$root} =  [ reverse @{ $tree->{$root} } ];
    invert($_) for @{ $tree->{$root} };
}

sub serialise {
    my ($tree) = @_;
    return $tree unless ref $tree;

    my ($root) = keys %$tree;
    return "$root(" . join(',', map serialise($_), @{ $tree->{$root} })
           . ')'
}


my $dsl = << '__DSL__';

lexeme default = latm => 1

Tree ::= node                              action => ::first
       | node ('(') Tree (',') Tree (')')  action => subtree
node ~ [0-9]+

__DSL__

sub subtree { +{ $_[1] => [ $_[2], $_[3] ] } }

my $grammar = 'Marpa::R2::Scanless::G'->new({ source => \$dsl });

chomp( my $input = <> );
my $tree = ${ $grammar->parse(\$input, {semantics_package => 'main'}) };

invert($tree);

say serialise($tree);

Last thing to solve is the bonus. I originally tried to print the tree in the way shown in the task assignment, but it was too hard for my limited time. So, I decided to only print the tree rotated by 90°, i.e. like this:

1-+-2-+-4
  |   \-5
  \-3-+-6
      \-7

Note that the real fun starts when you use longer numbers.

I implemented a subroutine to_graph that takes two arguments, $root and $tree, where $tree corresponds to the structure shown above: it is a hash reference where the keys are the parents and values are the arrays of children.

sub to_graph {
    my ($root, $tree) = @_;
    _to_graph($root, $tree, my $output = []);
    return @$output
}

sub _to_graph {
    my ($root, $tree, $output, @lines) = @_;
    push @$output, $root;

    if (my @children = @{ $tree->{$root} // [] }) {
        push @$output, '-+-';
        _to_graph($children[0], $tree, $output, @lines,
                  (' ' x length($root)) . ' | ');
        push @$output, @lines, ' ' x length $root, ' \\-';
        _to_graph($children[1], $tree, $output, @lines,
                  (' ' x (2 + length($root))) . ' ');

    } else {
        push @$output, "\n";
    }
}

It’s a recursive solution. It prints the parent, and then it serialises the children in a way they are aligned and keep all the needed higher parts of the tree printed. All the vertical lines and spaces to be printed before the node are kept in the @lines array. For added beauty, we can use the Unicode box drawing characters to get the following output:

10─┬─200─┬─4
   │     └─50000
   └─300000─┬─6000000
            └─700

To verify it works I also implemented the conversion from the “edges” format to the same structure, a test suite included. You can find the whole code in GitHub.

What’s missing? Maybe you’ve noticed we haven’t validated the input. There’s almost no error checking, the code can fail in various ways if the input doesn’t follow the specification. In fact, writing a validator for both the “edges” and “structure” formats (for any trees, not just full binary ones) was one of the homework assignments I gave to the students of my Introduction to Natural Language Processing course back at my postdoc days. Maybe we can reuse this task in a future challenge? Which of the two formats is easier to validate?

Leave a comment

About E. Choroba

user-pic I blog about Perl.