Extracting values from a list of (key, value) pairs

By Diab Jerius on March 5, 2014 9:44 PM

If you need ordered key, value pairs, you can either use something like Tie::IxHash or a simple array of key, value pairs. I found myself in the situation where I needed to extract just the keys from such an array.

There are a number of ways to do it, but which is the fastest? I tried a few pure Perl approaches, as well as List::Util::pairkeys (which as of this writing isn't yet in a stable release of Perl, assuming that List::Util remains in the core). The pure Perl approaches either use various means of flipping a binary toggle, or splice()ing through a sacrificial copy of the array.

Here we go:

#!/usr/bin/env perl

use 5.10.1;

use strict;
use warnings;

use List::Util qw[ pairkeys ];

use Benchmark qw[ timethese cmpthese ];

my $N = 10000;

my @N = 1 .. $N;

cmpthese(
    timethese(
        10000,
        {

            'splce' => sub {
                my @v;
                my ( $k, $v );
                my @N = @N;
                push @v, $k while ( ( $k, $v ) = splice( @N, 0, 2 ) );

            },

            '%m' => sub {
                use integer;
                my $flip = 0;
                my @v    = map { ++$flip % 2 ? $_ : () } @N;
            },

            '%g' => sub {
                use integer;
                my $flip = 0;
                my @v    = grep { ++$flip % 2 } @N;
            },

            '1-m' => sub {
                use integer;
                my $flip = 0;
                my @v
                  = map { ( $flip = 1 - $flip ) ? $_ : () } @N;
            },

            '1-g' => sub {
                use integer;
                my $flip = 0;
                my @v    = grep { $flip = 1 - $flip } @N;
            },

            'ff' => sub {
                use integer;
                my $flip = 0;
                my @v    = grep { $flip = !( $flip .. $flip ) } @N;
            },

            'xor' => sub {
                use integer;
                my $flip = 0;
                my @v    = grep { $flip ^= 1 } @N;
            },

            'for' => sub {

                use integer;
                my @v;
                for( my $i = 0 ; $i < @N ; $i+=2 ) {
                    push @v, $N[$i];
                }
            },

            'idxp' => sub {

                use integer;
                my @v;
                push @v, $N[2 * $_] for 0..(@N/2)-1;
            },

            'idxm' => sub {

                use integer;
                my @v;
                @v = map { $N[2 * $_] } 0..(@N/2)-1;
            },

            'idxa' => sub {
                use integer;
                my @v;
                $v[$_] = $N[2 * $_] for 0..(@N/2)-1;
            },

            'pkey' => sub {
                my @v = pairkeys @N;
            },

        } ) );

And here are the results:

        Rate splce   ff   %m  1-m idxm   %g  for  xor idxa  1-g idxp pkey
splce  497/s    --  -3%  -5% -27% -30% -33% -40% -52% -53% -54% -59% -74%
ff     514/s    3%   --  -2% -25% -28% -30% -38% -50% -52% -53% -58% -74%
%m     524/s    6%   2%   -- -23% -26% -29% -37% -49% -51% -52% -57% -73%
1-m    681/s   37%  33%  30%   --  -4%  -8% -18% -34% -36% -37% -44% -65%
idxm   712/s   43%  39%  36%   5%   --  -3% -14% -31% -33% -34% -41% -63%
%g     738/s   49%  44%  41%   8%   4%   -- -11% -29% -31% -32% -39% -62%
for    831/s   67%  62%  58%  22%  17%  13%   -- -20% -22% -24% -32% -57%
xor   1037/s  109% 102%  98%  52%  46%  41%  25%   --  -3%  -4% -15% -47%
idxa  1067/s  115% 108% 104%  57%  50%  45%  28%   3%   --  -2% -12% -45%
1-g   1086/s  119% 111% 107%  59%  52%  47%  31%   5%   2%   -- -11% -44%
idxp  1214/s  144% 136% 131%  78%  70%  64%  46%  17%  14%  12%   -- -37%
pkey  1942/s  291% 278% 270% 185% 173% 163% 134%  87%  82%  79%  60%   --

If you've got List::Util::pairkeys, use it.
Array indexing is surprisingly fast.
Pushing onto the end of an array is significantly faster than direct assignment into it.
grep is faster than map; not too surprising
Remainders (%) are slower then subtraction; not too surprising.
Binary exclusive-or is comparable to subtraction, which is a bit surprising
In this application use integer provided a significant boost in speed.
The flipflop operator is surprisingly slow
splice() and flipflop are essentially tied. splice() is slowed down significantly by needing to make a copy of the array. (If I added the copy cost to the others it moved up two slots in the rankings). It would be even slower if the array (or its elements) were larger. When Copy on Write (COW) makes it into Perl, that difference should diminish.

11 comments

11 Comments

Diab Jerius | March 6, 2014 1:52 AM | Reply

A colleague pointed out I'd left out toggling using exclusive or; I've added that.

Paul "LeoNerd" Evans | March 6, 2014 2:03 AM | Reply

It's nice to see that pairkeys() is fastest there. As I'd hope it would be, because it's written in XS. It's available on CPAN for current perls and will be in 5.20, so it should be available to anyone other than those needing to remain pure-perl on pre-5.20.

Diab Jerius | March 6, 2014 2:28 AM | Reply

Thanks for adding that to List::Util.

Peeking at your code led me to realize I'd forgotten to add straightforward array indexing. It's surprisingly fast. XS still wins, though.

Diab Jerius | March 6, 2014 4:24 PM | Reply

And just for completeness, I added direct assignment into the target array (idxa). As might be expected, it's significantly slower than pushing onto the end of the array (idxp).

QM | March 6, 2014 8:08 PM | Reply

Nice work. pairkey is 4x faster than splice.

Why do the keys get assigned into a variable called @v? That seems contrary, when it should be @k. (Also, just write out @values or @keys. I don't know how many times I've cursed people in modern languages who program like Basic was still the rage, with single letter variables.)

The barrier to entry is rather high for the novice Perler here. Perhaps you can explain why you need a flat array, instead of some other structure? Or why speed trumps readability and maintainability?

There are no comments in the code, so I have to keep re-reading the useful bits of the code, such as '1-m' (and realize that it's a number one and not a letter el). There are 12 cases, otherwise I could keep them straight in my head for a minute or two.

I had to look up a number of things to figure out what was happening. pairkey is neat, but not immediately obvious. Maybe it's just not in my cultural subconcious that it should only return the keys from a flat list of key/value pairs (again, assigned into @v for obscurity).

Why isn't a two dimensional array a better choice here?

my $key = 0;
my $val = 1;
my @N;
push @N, ['somekey','someval'];
...
my @keys;
for my $n (0..$#N) {
    push @keys, $N[$n][$key]
}

Or a bit more obscurant, transposed, and a hash:

my $key = 'key';
my $val = 'val';
my %N;
push @{$N{$key}},'somekey';
push @{$N{$val}},'someval';
...
my @keys = @{$N{$key}};
}

I think you've missed a chance to tell a much deeper story. I'd be interested in the followup.

QM | March 6, 2014 8:09 PM | Reply

I noticed that I can't submit from preview, I have to back up and submit there.

Diab Jerius | March 7, 2014 4:00 AM | Reply

@QM: You've raised some good points.

Yes, @k would have been a better choice than @v. My mistake. I'd fix the code, but then your comment would have no context.

Why a short variable name? The focus here is on the inner parts of the algorithm. Longer variable names would add visual clutter and detract from the comparison. Brevity is not always a curse.

The short and bizarre tags are simply because if they were longer the comparison table at the bottom would be too wide and would wrap due to the formatting in the blog.

Why a flat array? One reason is that constructing it is analogous to constructing a hash. Compare:

 { a => 1, b => 2, c => 3 }

 [ a => 1, b => 2, c => 4 ]

It's idiomatic; it's obvious from looking at the code that a set of key-value pairs is intended. The hash is intrinsically unordered, the array is intrinsically ordered. For small amounts of data I find this style to be much more readable then inserting nested data structures.

The other benefit is that after I extract the list of keys I can easily collapse the data into a hash:

  my @k = pairkeys $arrayref;
  my %h = %{$arrayref};

Now I have the best of both worlds (albeit with some duplicated data).

The counter-argument is that obviously if you're constructing these structures by hand you're not going to care much about performance (there just won't be that many elements in them), and if you've got a ton of data you don't really care how nice it would look if you actually coded it by hand.

Consider though, if you've got to churn through millions of pairs. The cost of constructing and storing an extra arrayref per pair as well as that due to dereferencing might be too high to bear.

You'll need to profile your code and find out if it matters. That's what I'm doing here.

As to which is more readable and maintainable? Compare constructing the initial structures

my @pairs;
push @pairs, [ $key, $value ];
...

my @pairs;
push @pairs, $key, $value;

They look pretty similar. Using appropriately named variables(!) it's obvious what they do. If you're writing a function which takes @pairs as an input, either way you'll need to document what's in it. No difference in maintainance there.

As for the code extracting keys from the data:

my @keys;
push @keys, $_->[0] foreach @pairs;

vs.

my @keys = pairkeys @pairs;

I'd give the edge to the latter.

It also comes down to your application and your taste. In some applications it makes perfect sense to iterate via a traditional for loop if you are doing a lot in the loop:

for my $pair ( @pairs ) {
    my ($key, $value ) = @$pair;
    ...
}

As a matter of style, the above code is equivalent to your example, but I think it's more idiomatic Perl and much more readable.

Further on maintainable code. There are some Perl constructs and idioms which, if you're not familiar with them, don't appear to lend themselves well to maintainability. That can be said about all languages. (C++ STL is an unfathomable mess until you embrace it's inner functional nature. Then it's a slightly more fathomable mess.) In general (as exhibited in the benchmarking code above) I prefer to use the statement modifier versions of foreach, while, etc. and grep, and map where I think they provide greater clarity into what's happening at that point. I hope. There can be a fine line between succinct and inscrutable code.

You make a good point about the balance between performance and maintainabiltiy. I think you can have both, and Perl's idioms can give you that. (If you benchmark them.)

You state that you had to look a few things up to understand them. I don't think that's a bad thing. I enjoy reading code which I don't understand. I've learned more about programming by puzzling over other people's code then I ever did taking that PL/I course.

Graham Knop | March 7, 2014 6:40 AM | Reply

There is a small mistake in your code. @N has an odd number of elements. pairkeys will warn about this if warnings are on.

Another option for this is doing an array slice:

@N[ map $_*2, 0..(@N/2)-1 ]

It's slower than several other options though.

Aristotle | March 7, 2014 8:32 PM | Reply

I wonder how idxa performs if you add a $#v = $#N / 2; before the loop.

Diab Jerius | March 8, 2014 3:03 PM | Reply

@Graham Knop: I've fixed the code; thanks.

@Aristotle: It actually seems to slow it down by a bit. Sequential runs with with your suggestion implemented, followed by the original code give:

with: 1002/s 1016/s
original: 1049/s 1075/s

The difference between the two is probably significant within the noise of the measurements.

mascip | March 14, 2014 7:59 PM | Reply

Have you tried Var::Pairs, written by Damian Conway? It doesn't have a subroutine to extract all keys, but it's very god for loops:

for my $next (pairs @array) {
say $next->index, ' has the value ', $next->value;
}

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Diab Jerius

I blog about Perl.

More info »

Diab Jerius