Hash of Arrays Deathmatch: Native Perl vs. DBM::Deep vs. Redis

Sometimes one has to trade speed of execution for memory; other times one may not have to. While working towards a fairly complete (at least in my mind) solution to map Nanopore sequencing files, I ran up against the need to create and access fairly large hashes of arrays (think 1M to 100M keys), with each array itself consisting of a fixed number of elements.
The hash of arrays is a fairly straightforward and fast data structure to create in Perl, but the memory overhead can become substantial as the number of keys and values scales upward. In my application (at least as envisioned now!) the hash will be created, aggregated over ("group by", for those speaking python-pandas or r-data.table vernacular) and then discarded, but there are use cases in which the data should be preserved, to avoid the computationally expensive part of generating them via approximate string matching against biological databases.

So what are the options for generating online, storing within Perl, and then accessing hashes of arrays that may also require serialization and disk storage?

  1. Your standard, native hash of arrays is one option (persistence available via Storable)
  2. The hash/array DBM DBM::Deep, which uses a file to store the database (the file can live on a RAM disk, though we will not be using that option in the benchmarks below).
  3. The NoSQL key-value store Redis
  4. A variety of SQL-based solutions for analytics (I use MariaDB ColumnStore in real life), which are not going to be considered here, because a 3NF schema for a single-table database is kind of an overkill
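For option 1, persistence amounts to a couple of function calls with Storable. A minimal sketch of the round trip (the file name is arbitrary, and the hash shape mirrors the benchmark below):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# build a small hash of arrays (same shape as in the benchmark)
my %hoa;
$hoa{"key$_"} = [ "key$_", 'perl', $_ ] for 1 .. 3;

store( \%hoa, 'hoa.storable' );             # serialize to disk
my $restored = retrieve('hoa.storable');    # read it back as a hash reference

die "round trip failed" unless $restored->{key2}[2] == 2;
unlink 'hoa.storable';
```

`store`/`retrieve` handle nested structures such as a hash of arrays natively, so nothing special is needed for the array values.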
I tested the first 3 solutions with the following script, which uses the Memory::Usage module to track both memory usage and execution time through a very intuitive interface, supplemented with file size stats (for the size of the DBM::Deep file) and queries to the Redis server about its memory use.

First, the self-explanatory (I hope) profiling script. It creates a long random string that is used both as the key and as an element of the array that the key points to:
#!/home/chrisarg/perl5/perlbrew/perls/current/bin/perl
use v5.38;

######################################################################
use DBM::Deep;
use Redis;
use Memory::Usage;
use File::stat;
no warnings 'void';    ## stop pestering re: accessing arrays in void context

my @chars = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9', '_' );
my $length_of_randomstring = 200;
my $num_of_randomstrings = 100_000;

tie my %db, 'DBM::Deep', 'foo.db';

my $redisDB = Redis->new();
$redisDB->client_setname('perlfoo');

my $mu = Memory::Usage->new();

$mu->record('Before populating DBM::Deep db');
foreach my $key ( 1 .. $num_of_randomstrings ) {
    my $random_string;
    $random_string .= $chars[ rand @chars ] for 1 .. $length_of_randomstring;
    $db{$random_string} = [
        $random_string, 'perl',
        'rules',        $random_string,
        $random_string . $key, $key
    ];
}
$mu->record('After populating DBM::Deep db');

$mu->record('Before accessing DBM::Deep records');
while ( my ( $key, $val ) = each %db ) {
    $val->@*;
}
$mu->record('After accessing DBM::Deep records');

my %in_memory_hash;
## create hash of random strings in memory now
$mu->record('Before populating hash in memory');
foreach my $key ( 1 .. $num_of_randomstrings ) {
    my $random_string;
    $random_string .= $chars[ rand @chars ] for 1 .. $length_of_randomstring;
    $in_memory_hash{$random_string} = [
        $random_string, 'perl',
        'rules',        $random_string,
        $random_string . $key, $key
    ];
}
$mu->record('After populating hash in memory');

$mu->record('Before accessing in memory hash records');
while ( my ( $key, $val ) = each %in_memory_hash ) {
    $val->@*;
}
$mu->record('After accessing in memory hash records');

## now do Redis
$mu->record('Before populating Redis db');
foreach my $key ( 1 .. $num_of_randomstrings ) {
    my $random_string;
    $random_string .= $chars[ rand @chars ] for 1 .. $length_of_randomstring;
    $redisDB->sadd(
        $random_string,
        $random_string, 'perl',
        'rules',        $random_string,
        $random_string . $key, $key
    );
}
$mu->record('After populating Redis db');
$mu->record('Before accessing Redis records');
## iterate over the keys actually stored; looking up 1 .. N here would
## query keys that were never inserted in this (random string key) version
for my $key ( $redisDB->keys('*') ) {
    $redisDB->smembers($key);
}
$mu->record('After accessing Redis records');

say "Statistics about memory usage and timing for "
. "$num_of_randomstrings records";
$mu->dump;
my %h = $redisDB->info('memory')->%*;
printf( "Redis dataset memory : %.2f KB\n", $h{used_memory_dataset} / 1024 );
say "Redis memory usage : ", $h{used_memory_human};
printf( "DBM::Deep disk file size : %.2f KB\n", stat('foo.db')->size / 1024 );

unlink 'foo.db' if -e 'foo.db';
$redisDB->flushall;
$redisDB->quit;

In the script above we create the entries of the data structures on the fly, store them and then access them, but without doing any processing. I also tested a version of the script in which the keys were replaced by a (stringified) number, to contrast with the first case of a very long key and see how well the 3 storage approaches handle each.
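The numeric-key variant only changes how the key is formed; a self-contained sketch of that change (using a plain in-memory hash and smaller sizes for brevity):

```perl
use strict;
use warnings;

my @chars = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9', '_' );
my %hoa;
foreach my $key ( 1 .. 1000 ) {
    my $random_string;
    $random_string .= $chars[ rand @chars ] for 1 .. 200;
    # the stringified counter, not the random string, is the key now;
    # the payload array still carries the long random strings
    $hoa{"$key"} = [
        $random_string, 'perl',
        'rules',        $random_string,
        $random_string . $key, $key
    ];
}
```

The values stay identical between the two versions, so any difference in the measurements is attributable to key length alone.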

First, let's look at the case of numerical keys. Here is the output of the script for 1,000, 10,000 and 100,000 numerical keys:


chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 1000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31348 ( 31348)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12744 ( 12744) Before populating DBM::Deep db
     4  32788 (  1440)  23040 (  1152)   9216 (     0)   1552 (     0)  14184 (  1440) After populating DBM::Deep db
     4  32788 (     0)  23040 (     0)   9216 (     0)   1552 (     0)  14184 (     0) Before accessing DBM::Deep records
     5  33188 (   400)  23616 (   576)   9216 (     0)   1552 (     0)  14584 (   400) After accessing DBM::Deep records
     5  33188 (     0)  23616 (     0)   9216 (     0)   1552 (     0)  14584 (     0) Before populating hash in memory
     5  34244 (  1056)  24192 (   576)   9216 (     0)   1552 (     0)  15640 (  1056) After populating hash in memory
     5  34244 (     0)  24192 (     0)   9216 (     0)   1552 (     0)  15640 (     0) Before accessing in memory hash records
     5  34244 (     0)  24192 (     0)   9216 (     0)   1552 (     0)  15640 (     0) After accessing in memory hash records
     5  34244 (     0)  24192 (     0)   9216 (     0)   1552 (     0)  15640 (     0) Before populating Redis db
     5  34244 (     0)  24192 (     0)   9216 (     0)   1552 (     0)  15640 (     0) After populating Redis db
     5  34244 (     0)  24192 (     0)   9216 (     0)   1552 (     0)  15640 (     0) Before accessing Redis records
     5  34244 (     0)  24768 (   576)   9216 (     0)   1552 (     0)  15640 (     0) After accessing Redis records
Redis dataset memory : 886.84 KB
Redis memory usage : 1.78M
DBM::Deep disk file size : 2037.20 KB

chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 10000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31356 ( 31356)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12752 ( 12752) Before populating DBM::Deep db
    42  42776 ( 11420)  32832 ( 10944)   9216 (     0)   1552 (     0)  24172 ( 11420) After populating DBM::Deep db
    42  42776 (     0)  32832 (     0)   9216 (     0)   1552 (     0)  24172 (     0) Before accessing DBM::Deep records
    46  46824 (  4048)  36864 (  4032)   9216 (     0)   1552 (     0)  28220 (  4048) After accessing DBM::Deep records
    46  46824 (     0)  36864 (     0)   9216 (     0)   1552 (     0)  28220 (     0) Before populating hash in memory
    47  57624 ( 10800)  47808 ( 10944)   9216 (     0)   1552 (     0)  39020 ( 10800) After populating hash in memory
    47  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) Before accessing in memory hash records
    47  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) After accessing in memory hash records
    47  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) Before populating Redis db
    48  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) After populating Redis db
    48  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) Before accessing Redis records
    49  57624 (     0)  47808 (     0)   9216 (     0)   1552 (     0)  39020 (     0) After accessing Redis records
Redis dataset memory : 6548.09 KB
Redis memory usage : 7.77M
DBM::Deep disk file size : 23175.82 KB

chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 100000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31348 ( 31348)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12744 ( 12744) Before populating DBM::Deep db
   402  155856 ( 124508)  144576 ( 122688)   9216 (     0)   1552 (     0)  137252 ( 124508) After populating DBM::Deep db
   402  155856 (     0)  144576 (     0)   9216 (     0)   1552 (     0)  137252 (     0) Before accessing DBM::Deep records
   438  187016 ( 31160)  176256 ( 31680)   9216 (     0)   1552 (     0)  168412 ( 31160) After accessing DBM::Deep records
   438  187016 (     0)  176256 (     0)   9216 (     0)   1552 (     0)  168412 (     0) Before populating hash in memory
   441  296580 ( 109564)  285696 ( 109440)   9216 (     0)   1552 (     0)  277976 ( 109564) After populating hash in memory
   441  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) Before accessing in memory hash records
   441  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) After accessing in memory hash records
   441  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) Before populating Redis db
   450  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) After populating Redis db
   450  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) Before accessing Redis records
   458  296580 (     0)  285696 (     0)   9216 (     0)   1552 (     0)  277976 (     0) After accessing Redis records
Redis dataset memory : 63629.84 KB
Redis memory usage : 67.82M
DBM::Deep disk file size : 213705.30 KB


We see that execution time and memory usage scale linearly for all 3 storage engines, with DBM::Deep being the slowest and the most storage hungry once one counts both the size of the disk file and the memory consumed within Perl (as reported by Memory::Usage; all values are in kilobytes). The humble hash of arrays was by far the fastest (only 3 seconds to populate 100,000 keys, under 1 second to access them) but consumed lots and lots of memory (about 110 MB), while Redis gave an intermediate performance (9 seconds to populate, ~64 MB memory consumption and 8 seconds to read through the database of keys/values).
The relevant figures for the case of "long-string" keys are shown below; the conclusions are not materially different.

chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 1000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31356 ( 31356)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12752 ( 12752) Before populating DBM::Deep db
     4  32792 (  1436)  23040 (  1152)   9216 (     0)   1552 (     0)  14188 (  1436) After populating DBM::Deep db
     4  32792 (     0)  23040 (     0)   9216 (     0)   1552 (     0)  14188 (     0) Before accessing DBM::Deep records
     4  33188 (   396)  23616 (   576)   9216 (     0)   1552 (     0)  14584 (   396) After accessing DBM::Deep records
     4  33188 (     0)  23616 (     0)   9216 (     0)   1552 (     0)  14584 (     0) Before populating hash in memory
     5  34516 (  1328)  24768 (  1152)   9216 (     0)   1552 (     0)  15912 (  1328) After populating hash in memory
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) Before accessing in memory hash records
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) After accessing in memory hash records
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) Before populating Redis db
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) After populating Redis db
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) Before accessing Redis records
     5  34516 (     0)  24768 (     0)   9216 (     0)   1552 (     0)  15912 (     0) After accessing Redis records
Redis dataset memory : 1098.80 KB
Redis memory usage : 1.97M
DBM::Deep disk file size : 2224.70 KB

chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 10000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31352 ( 31352)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12748 ( 12748) Before populating DBM::Deep db
    41  42780 ( 11428)  32256 ( 10368)   9216 (     0)   1552 (     0)  24176 ( 11428) After populating DBM::Deep db
    41  42780 (     0)  32256 (     0)   9216 (     0)   1552 (     0)  24176 (     0) Before accessing DBM::Deep records
    47  46860 (  4080)  36288 (  4032)   9216 (     0)   1552 (     0)  28256 (  4080) After accessing DBM::Deep records
    47  46860 (     0)  36288 (     0)   9216 (     0)   1552 (     0)  28256 (     0) Before populating hash in memory
    47  59652 ( 12792)  48960 ( 12672)   9216 (     0)   1552 (     0)  41048 ( 12792) After populating hash in memory
    47  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) Before accessing in memory hash records
    47  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) After accessing in memory hash records
    47  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) Before populating Redis db
    48  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) After populating Redis db
    48  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) Before accessing Redis records
    49  59652 (     0)  48960 (     0)   9216 (     0)   1552 (     0)  41048 (     0) After accessing Redis records
Redis dataset memory : 8648.31 KB
Redis memory usage : 9.80M
DBM::Deep disk file size : 25048.55 KB

chrisarg@chrisarg-HP-Z840-Workstation:~/software-dev$ ./dbmdeep.pl
Statistics about memory usage and timing for 100000 records
  time    vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
     0  31356 ( 31356)  21888 ( 21888)   9216 (  9216)   1552 (  1552)  12752 ( 12752) Before populating DBM::Deep db
   403  156004 ( 124648)  146304 ( 124416)   9216 (     0)   1552 (     0)  137400 ( 124648) After populating DBM::Deep db
   403  156004 (     0)  146304 (     0)   9216 (     0)   1552 (     0)  137400 (     0) Before accessing DBM::Deep records
   447  187384 ( 31380)  177408 ( 31104)   9216 (     0)   1552 (     0)  168780 ( 31380) After accessing DBM::Deep records
   447  187384 (     0)  177408 (     0)   9216 (     0)   1552 (     0)  168780 (     0) Before populating hash in memory
   450  316832 ( 129448)  306432 ( 129024)   9216 (     0)   1552 (     0)  298228 ( 129448) After populating hash in memory
   450  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) Before accessing in memory hash records
   450  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) After accessing in memory hash records
   450  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) Before populating Redis db
   460  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) After populating Redis db
   460  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) Before accessing Redis records
   466  316832 (     0)  306432 (     0)   9216 (     0)   1552 (     0)  298228 (     0) After accessing Redis records
Redis dataset memory : 84624.53 KB
Redis memory usage : 88.30M
DBM::Deep disk file size : 232460.97 KB

Summary and conclusions:
  • If your application does not need persistence and you can spare the memory, go for the humble hash of arrays
  • If you are really strapped for memory, then Redis could be brought into the code as an optimization
  • If you want to keep the data around, then Redis may offer the best balance of execution time and space
  • If you need transactions, you can (obviously) forget about hash of arrays
  • DBM::Deep is probably the way forward if you can't use Redis (especially if a RAM disk can be used). The author of the package plans to add DBI and NoSQL storage engines as alternatives to the flat-file format. Imagine the performance if an in-memory SQL engine such as DuckDB or ClickHouse could be used (though these are OLAP engines), or if MySQL started cooperating with the maintainer of DBM::Deep!
  • I think that wrapping DBM::Deep around Redis may offer a much simpler way (with a very elegant interface!) to satisfy the use cases of DBM::Deep in a performant (memory and execution time) manner.

1 Comment

calling rand() a lot is slow, so you might want to minimize its impact by using this technique: https://stackoverflow.com/a/63178400

you can then also "rotate" each string for a further reduction of a factor of length($string).
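One way to read that advice (my own sketch of the linked technique, not the commenter's code): pay the per-character rand() cost only once by pre-building a single large random pool, then carve fixed-length substrings out of it at random offsets, turning ~200 rand() calls per string into one:

```perl
use strict;
use warnings;

my @chars  = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9', '_' );
my $length = 200;

# pay the per-character rand() cost only once, for a single large pool
my $pool = join '', map { $chars[ rand @chars ] } 1 .. 100_000;

# each "random" string is then a single rand() call plus a substr
sub fast_random_string {
    return substr( $pool, int( rand( length($pool) - $length ) ), $length );
}

my $s = fast_random_string();
die "wrong length" unless length($s) == $length;
```

The "rotate" suggestion would further diversify the output by cyclically shifting each extracted string, at essentially no extra cost.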


About chrisarg

I like to use Perl for material other than text.