SQL::Inserter for faster/multi-row inserting

SQL::Inserter is a new CPAN module based on what we use at SpareRoom to easily manage buffered inserts, as well as to replace SQL::Abstract's slow insert method.

Inserter OO interface

The idea is that if you want to insert many rows of data (e.g. as part of an ETL pipeline, or for writing logs etc.), you'll want to be doing that with multi-row INSERT statements. With SQL::Inserter you create a very lightweight object to handle the inserting for you: just pass it a DBI database handle, the table name and the columns you'll be inserting (the columns are optional if you insert with hashes, see the next section):

use SQL::Inserter;

my $inserter = SQL::Inserter->new(
    dbh   => $dbh,
    table => 'tablename',
    cols  => [qw/col1 col2.../],
);

The inserter has a 100-row buffer by default (feel free to increase it for "small" rows) and will issue an insert whenever that buffer fills.

You have two options for how to insert. The fastest method is with a flat array containing the row values in the order specified in the constructor (cols); they will be used as bind variables. You can pass multiple rows as well, as an array of size N x number_of_cols. You can even pass all your rows in a single call if they don't fit the buffer; they will be inserted in segments.

The second method is with a hash, which allows you to pass references to SQL statements instead of plain bind values, but only one row at a time:

# Fastest: pass single or multiple rows of data as an array
$sql->insert($col1_val1, $col2_val1, $col1_val2...);

# Alt: pass a single row as a hash, allows SQL code passed as refs
$sql->insert({
    column1 => $data1,
    column2 => \'NOW()',
    ...
});

# Force flush the buffer at any time with no argument on insert
$sql->insert();

Just call insert as needed: the inserts will happen whenever the buffer fills, and the buffer is flushed automatically when the inserter object is destroyed / goes out of scope. You can always force a flush manually by calling $inserter->insert() with no arguments.
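Putting the above together, a typical ETL-style loop might look like this sketch (the fetch_next_row function, the table/column names and the buffer size option are illustrative, not part of the module's documented synopsis):

```perl
use SQL::Inserter;

# Hypothetical ETL-style loop; $dbh is a connected DBI handle.
my $inserter = SQL::Inserter->new(
    dbh    => $dbh,
    table  => 'event_log',
    cols   => [qw/user_id event created/],
    buffer => 1000,           # illustrative: bigger buffer for small rows
);

while (my @row = fetch_next_row()) {
    $inserter->insert(@row);  # a real INSERT runs only when the buffer fills
}

$inserter->insert();          # flush whatever is left in the buffer
```

The point of the pattern is that the calling code stays as simple as single-row inserting, while the network round-trips happen once per buffer-load of rows.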

SQL building functions

In the past I have complained about the slowness of SQL::Abstract (the time it takes to create an SQL string can exceed the query's execution time!). SQL::Maker is a fast alternative, however our developers had favoured the SQL::Abstract syntax, so we have now replaced it internally with our own fast functions that accept similar syntax. SQL::Inserter includes the function we use for inserts:

# Similar to SQL::Abstract's insert, but with much less overhead:
my ($sql, @bind) = simple_insert($table, {col1=>$val...});

# Multi-row possible:
my ($sql, @bind) = simple_insert(
    $table,
    [{col1=>$val1...},{col1=>$val2...},...]
);

Since we use MySQL, it also offers the convenience of the INSERT IGNORE and ON DUPLICATE KEY UPDATE variants:

my ($sql, @bind) = simple_insert(
    $table,
    [{col1=>$val1,col2=>$val2},{col1=>$val3,col2=>$val4}],
    {duplicates => 'update'}
);

## INSERT INTO table_name (col1,col2)
## VALUES (?,?),(?,?)
## ON DUPLICATE KEY UPDATE col1=VALUES(col1),col2=VALUES(col2)

And lastly, if you simply want an SQL INSERT statement with placeholders for a given number of rows (perhaps with the IGNORE or ON DUPLICATE KEY variants) to do your own inserting, there's an even simpler function:

my $sql = multi_insert_sql(
    'table', [qw/col1 col2.../], $num_of_rows
);
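As a sketch of how you might use the returned statement with DBI (the table/column names are illustrative, and the SQL shape shown in the comment follows the multi-row format from the earlier example):

```perl
use SQL::Inserter qw(multi_insert_sql);

# Illustrative: build one multi-row statement, prepare it once, reuse it.
my $sql = multi_insert_sql('logs', [qw/user_id event/], 100);

# $sql is of the form:
# INSERT INTO logs (user_id,event) VALUES (?,?),(?,?),...  (100 tuples)

my $sth = $dbh->prepare($sql);
$sth->execute(@bind_values);  # 200 bind values: 100 rows x 2 columns
```

Preparing once and executing repeatedly with fresh bind values is where this function pays off, since the SQL string never has to be rebuilt.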

Performance

The multi-row vs single-row performance advantage can be immense (we saved twenty minutes of processing time on a single 500k-email mailshot when we switched its logging to multi-row inserts), but quantifying it is very application-specific. As an indication, here is the time it takes to insert 100k small rows (3 columns) either as single-row inserts or as multi-row inserts with 100- and 1000-row buffers in our test environment (GCP VM connecting to Cloud SQL):

Single row insert: 87.1s
   100-row insert: 1.36s
  1000-row insert: 0.62s

A speedup of over 100x here. You get the idea: single-row inserts pay the network round-trip time on every row, which is most of the delay, and multi-row inserts avoid that. Depending on the size of your rows, there's a limit to how many you can insert at a time.

The performance of the SQL building part of the process is a bit easier to quantify. SQL::Inserter takes a minimal amount of time to build SQL, as you'd expect it to. There's a benchmark script included with the module that compares the speed of these functions; on an M1 Pro it gives me:

 Compare SQL::Abstract, SQL::Maker, simple_insert:
                     Rate Abstract Abstract cached Maker Maker cached simple_insert
 Abstract          4207/s       --             -6%  -90%         -91%          -98%
 Abstract cached   4482/s       7%              --  -90%         -90%          -98%
 Maker            44245/s     952%            887%    --          -4%          -76%
 Maker cached     46205/s     998%            931%    4%           --          -75%
 simple_insert   187398/s    4355%           4081%  324%         306%            --
 
 Compare simple_insert, multi_insert_sql for single row:
                      Rate    simple_insert multi_insert_sql
 simple_insert    190037/s               --             -76%
 multi_insert_sql 797596/s             320%               --

SQL::Inserter's simple_insert is over 40x faster than SQL::Abstract and over 3x faster than SQL::Maker. The latter is not slow at all, so you don't have to replace it for performance reasons, but simple_insert offers some extra options for MySQL/MariaDB users.

Cloud Provider Price Performance Comparison: Spot VMs

In my recent Cloud Comparison, I mentioned that I'd look at Spot VM pricing in an update. This is that update: 6 of the 10 providers tested offer Spot/Preemptible instance pricing.

At SpareRoom we make good use of Spot VMs. E.g. our Perl test suite gets to run on fast VM types at very low cost: currently we use c3-highcpu-22 instances, which normally cost $0.95/hour each. The spot price for them is more than 10x lower, at just $0.086/h. At these prices, if our test suite needed it (it's already fast), we could launch c3-highcpu-176 instances (176 vCPUs) at well under $1/h!

Profiling Perl under Apache

At SpareRoom we still use Apache/mod_perl and, despite Devel::NYTProf::Apache's warnings about not being maintained, it is still the best way to profile Perl in that setup.

All you need to do really, for a mod_perl app, is to include this in your Apache config:

# PerlPassEnv is only needed for non-default NYTPROF settings
PerlPassEnv NYTPROF
PerlModule Devel::NYTProf::Apache

You should also allow only a single worker/child, and make sure it terminates cleanly before processing the output (with nytprofhtml).
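If you do need non-default profiler settings, you can set the NYTPROF environment variable before starting the single worker. The options below are standard Devel::NYTProf options, shown purely as an illustration (the output path is hypothetical):

```shell
# Illustrative Devel::NYTProf settings (see the module's docs for the full list):
# write the profile to a custom location, start profiling after init phase
export NYTPROF=file=/tmp/nytprof.out:start=init
httpd -X                          # single attached worker; Ctrl+C when done
nytprofhtml -f /tmp/nytprof.out   # point the report generator at the custom file
```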

Since I normally want my dev/test environments to run with multiple workers and no profiler attached, I have a bash alias that stops apache and restarts it set up for profiling. I thought I'd share an example of it, as I find it very useful.

First, a basic example pulled from what I used on our older CentOS/RedHat VM:

profile_apache() {
  systemctl stop httpd.service
  grep -q -F 'PerlModule Devel::NYTProf::Apache' /etc/httpd/conf/httpd.conf \
    || echo 'PerlModule Devel::NYTProf::Apache' >> /etc/httpd/conf/httpd.conf
  sed -i 's/^#PerlModule Devel/PerlModule Devel/g' /etc/httpd/conf/httpd.conf
  echo "Starting apache to collect profile data. Ctrl+c once to end."
  httpd -X
  echo "Restarting apache and creating html report."
  sed -i 's/^PerlModule Devel/#PerlModule Devel/g' /etc/httpd/conf/httpd.conf
  systemctl start httpd.service &
  nytprofhtml
  echo "Profile report in nytprof/index.html"
}

It's mostly self-explanatory: the profile_apache command will shut down apache, add the PerlModule directive to the config if it's not there (or uncomment it), and launch an attached single-worker instance of apache. It waits until you send a SIGINT (e.g. press Ctrl+C), then restores the original configuration, relaunches apache and produces the profiler report.

Now, if you are on Debian/Ubuntu, you'd have to modify it a bit. And if you are running under docker and sed -i gives you Device or resource busy errors (for trying to change a mounted file's inode from within the container), you can use GNU sed's -c option, which copies instead of modifying in-place: sed -ci.

The alias becomes:

profile_apache() {
  apache2ctl stop && sleep 1
  grep -q -F 'PerlModule Devel::NYTProf::Apache' /etc/apache2/mods-available/perl.conf \
    || echo -e "\nPerlModule Devel::NYTProf::Apache" >> /etc/apache2/mods-available/perl.conf
  sed -ci 's/^#PerlModule Devel/PerlModule Devel/g' /etc/apache2/mods-available/perl.conf
  source /etc/apache2/envvars
  echo "Starting apache to collect profile data. Ctrl+c once to end."
  apache2 -X
  echo "Restarting apache and creating html report."
  sed -ci 's/^PerlModule Devel/#PerlModule Devel/g' /etc/apache2/mods-available/perl.conf
  apache2ctl start &
  nytprofhtml
  echo "Profile report in nytprof/index.html"
}

This is the basic idea, although my actual alias does some extra things, like putting the report in an apache-served directory and printing a browser link, removing the PerlRequire that preloads our .pm modules (it clashes with NYTProf), etc.

PS: <code> does not work for the bash code above (it adds some colours, but renders everything on a single wrapping line in the regular font), so I had to use <pre>. If someone knows how to fix it (I tried markdown ``` as well), I can add syntax highlighting; for now I've cross-posted to dev.to to make it look right.

Cloud Provider Performance & Price Comparison 2023


Last year I compared the various VM types of 7 popular cloud providers mainly for Perl performance, as I was looking at workloads that we'd be using at SpareRoom - you can see that comparison here.

This year I expanded the comparison to 10 providers, and while I posted it on dev.to, I thought I'd give a heads-up to blogs.perl readers, especially the commenters on last year's post whose suggestions I incorporated.

Weather::WeatherKit and Weather::Astro7Timer

Today, the popular Dark Sky weather API is shutting down. I did a little write-up for non-Perl devs on DEV.to, but I thought I'd post here a couple of potentially useful modules I recently released to CPAN.

Weather::WeatherKit accesses the WeatherKit REST API, which is Apple's official Dark Sky replacement. The module uses Crypt::JWT to create the required tokens, so accessing WeatherKit is as simple as:

use Weather::WeatherKit;

my $wk = Weather::WeatherKit->new(
    team_id    => $apple_team_id,
    service_id => $weatherkit_service_id,
    key_id     => $key_id,
    key        => $private_key
);

my %report = $wk->get(
    lat      => 51.2,
    lon      => -1.8,
    dataSets => 'currentWeather'
);
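As a sketch of consuming the result: the response follows Apple's WeatherKit REST API layout, where the currentWeather data set holds the conditions at the requested point (the field names below come from that API, used here as an assumption about what %report contains):

```perl
# Assumes %report mirrors WeatherKit's REST response, with a
# currentWeather hashref for the requested data set.
if (my $current = $report{currentWeather}) {
    printf "Temperature: %.1f C, condition: %s\n",
        $current->{temperature},    # degrees Celsius
        $current->{conditionCode};  # e.g. 'Clear', 'Rain'
}
```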

Of course, this API is only sort of free, as it requires an Apple developer account. If you don't have one and don't want to get one, there are some alternative APIs, but for the purposes of this post I'll stick to 7Timer, via Weather::Astro7Timer. It's even simpler, as it does not require authentication:

use Weather::Astro7Timer;

my $w7t = Weather::Astro7Timer->new();

my %report = $w7t->get(
    product => 'astro',
    lat     => 51.2,
    lon     => -1.8,
);

7Timer is well known in the amateur astronomy community because it offers the "ASTRO" forecast, which includes astronomical seeing and atmospheric transparency, important aspects of a stargazing forecast. I actually donate the servers the service runs on, and I develop a free iOS client (Xasteria) for it. It has some other forecast products too; e.g. passing product => 'civil' will get you a more traditional type of forecast.
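A sketch of reading the ASTRO product's response: 7Timer returns a dataseries array of 3-hourly timepoints, and the field names below are from its JSON format (used here as an assumption about what %report contains):

```perl
# Assumes %report mirrors 7Timer's astro JSON: a 'dataseries' array
# where cloudcover/seeing/transparency are small integer scales.
for my $point (@{ $report{dataseries} || [] }) {
    printf "+%dh: cloud %d, seeing %d, transparency %d\n",
        $point->{timepoint},    # hours from forecast init time
        $point->{cloudcover},   # 1 (clearest) .. 9 (overcast)
        $point->{seeing},
        $point->{transparency};
}
```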