July 2023 Archives

SQL::Inserter for faster/multi-row inserting

By Dimitrios Kechagias on July 1, 2023 3:55 PM

SQL::Inserter is new CPAN module based on what we are using at SpareRoom to easily manage buffered inserts, as well as to replace SQL::Abstract's slow insert method.

Inserter OO interface

The idea is that if you want to insert many rows of data (e.g. part of an ETL pipeline, or writing logs etc), you'd want to be doing that with multi-row INSERT statements. With SQL::Inserter you create a very lightweight object to handle inserting for you, just pass it a DBI db handle, the table name and the columns you'll be inserting (optional if you use hashes for inserting, see next section):

use SQL::Inserter;

my $inserter = SQL::Inserter->new(
    dbh   => $dbh,
    table => 'tablename',
    cols  => [qw/col1 col2.../],
);

The inserter by default has a 100-row buffer (feel free to increase for "small" rows) and will be inserting whenever that buffer gets full.

You have 2 options for how to insert, the fastest method is with a flat array containing the values for the row in the order specified in the constructor (cols) - they will be used as bind variables. You can pass multiple rows as well, with an array of size N x number_of_cols. You can pass all your rows in a single call even if they can't fit the buffer - they will be inserted in segments.

The second method is with a hash, which allows you to pass references to SQL statements instead of simple bind values, but one row at a time:

# Fastest: pass single or multiple rows of data as an array
$sql->insert($col1_val1, $col2_val1, $col1_val2...);

# Alt: pass a single row as a hash, allows SQL code passed as
$sql->insert({
    column1 => $data1,
    column2 => \'NOW()',
    ...
});

# Force flush the buffer at any time with no argument on insert
$sql->insert();

Just call insert as needed the inserts will happen whenever the buffer fills, and the buffer will be flushed automatically when the inserter object is destroyed/out of scope. You can always force a flush manually by calling $inserter->insert() if you want to.

SQL building functions

In the past I have complained about the slowness of SQL::Abstract (the time it takes it to create an SQL string can be more than the execution itself!). SQL::Maker is a fast alternative, however it seems our developers in the past had favoured the SQL::Abstract syntax, so we have no replaced it internally with our own fast functions that take similar syntax. SQL::Inserter includes the function we use for insert:

# Similar to SQL::Abstract's insert, but with much less overhead:
my ($sql, @bind) = simple_insert($table, {col1=>$val...});

# Multi-row possible:
my ($sql, @bind) = simple_insert(
    $table,
    [{col1=>$val1...},{col1=>$val2...},...]
);

Since we use MySQL, it also gives the convenience of INSERT IGNORE or evenON DUPLICATE KEY UPDATE variants:

my ($sql, @bind) = simple_insert(
    $table, {col1=>$val1,col1=>$val2}, {duplicates => 'update'}
);

## INSERT INTO table_name (col1,col2)
## VALUES (?,?),(?,?)
## ON DUPLICATE KEY UPDATE col1=VALUES(col1),col2=VALUES(col2)

And lastly, if you simply want an SQL INSERT statement with placeholders for $num_of_rows (perhaps with IGNORE or ON DUPLICATE KEY variants) to do your own inserting, there's an even simpler function:

my $sql = multi_insert_sql(
    'table', [qw/col1 col2.../], $num_of_rows
);

Performance

The multi-row vs single-row performance advantage can be immense (we saved twenty minutes of processing time when switching our logging to multi-row from a single 500k mailshot), but it is very application-specific to quantify. As an indication I can give you the time it takes to insert 100k small rows (3 columns) either as a single or multi-row insert with 100 and 1000 row buffer in our test environment (GCP VM connecting to Cloud SQL):

Single row insert: 87.1s
   100-row insert: 1.36s
  1000-row insert: 0.62s

A speedup of over 100x here. You get the idea - the single row insert includes the network round-trip time which is most of the delay, you avoid that with multi-row inserts. Depending on the size of your row, there's a limit to how many rows you can insert at a time.

The performance of the SQL building part of the process is a bit easier to quantify. SQL::Inserter takes a minimal amount of time to build SQL - as you'd expect it to. There's a benchmark script included with the module that can compare the speed of the functions that gives me (on an M1 Pro):

 Compare SQL::Abstract, SQL::Maker, simple_insert:
                     Rate Abstract Abstract cached Maker Maker cached simple_insert
 Abstract          4207/s       --             -6%  -90%         -91%          -98%
 Abstract cached   4482/s       7%              --  -90%         -90%          -98%
 Maker            44245/s     952%            887%    --          -4%          -76%
 Maker cached     46205/s     998%            931%    4%           --          -75%
 simple_insert   187398/s    4355%           4081%  324%         306%            --
 
 Compare simple_insert, multi_insert_sql for single row:
                      Rate    simple_insert multi_insert_sql
 simple_insert    190037/s               --             -76%
 multi_insert_sql 797596/s             320%               --

SQL::Inserter's simple_insert is over 40x faster than SQL::Abstract and over 3x faster than SQL::Maker. The latter is not slow at all, so you don't have to swap it for performance reasons, but simple_insert gives some extra options for MySQL/MariaDB users.

0 comments

« June 2023 | Main Index | Archives | October 2023 »

About Dimitrios Kechagias

Computer scientist, physicist, amateur astronomer.

More info »

Dimitrios Kechagias