Setting up a free Oracle Database for Perl development

I recently added Oracle Database support to SQL::Inserter (check it out if you'd like simple-to-use, high-performance inserting to SQL databases). I had not used an Oracle database since my uni days 20 years ago, so I had to set one up to test it.

Even though Oracle provides a free development DB, the process is not as simple as Postgres/MySQL etc., so I thought I'd document it for future reference.

There are basically two ways you can go, with Oracle providing instructions either for a VirtualBox VM, or Docker. For the purposes of this article, we'll use VirtualBox. If you prefer Docker, you can follow Oracle's instructions and skip the next section.

Setting up the Oracle VM

Oracle provides instructions for setting up a VM with their latest 23c Database.

To sum up, you download and install VirtualBox, as well as the 23c VM image (.ova).

Launch VirtualBox, go to File->Import Appliance and select the .ova file that you just downloaded. You can leave the defaults for the import.

Start the Oracle Linux based VM.

Setting up Perl DBI/DBD::Oracle

Open a terminal on the Oracle VM. You have sudo powers (the password is oracle), but first we'll need to set up the ORACLE_SID and ORACLE_HOME environment variables. To determine what they should be, run:

> cat /etc/oratab

You are looking for an uncommented line with three parts separated by colons; on this VM it is FREE:/opt/oracle/product/23c/dbhomeFree:Y, with the first part being the SID and the second the HOME. We can add these to our .bashrc:

> echo 'export ORACLE_SID=FREE' >> ~/.bashrc
> echo 'export ORACLE_HOME=/opt/oracle/product/23c/dbhomeFree' >> ~/.bashrc
> echo 'export LD_LIBRARY_PATH=$ORACLE_HOME/lib' >> ~/.bashrc
> source ~/.bashrc
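If you'd rather not copy those values by hand, the SID and HOME can be cut straight out of the oratab line. A minimal sketch, with the sample line from this VM hard-coded (on the real system you'd read the first uncommented line of /etc/oratab instead):

```shell
# Split an oratab-style line (SID:HOME:startup_flag) on colons.
line='FREE:/opt/oracle/product/23c/dbhomeFree:Y'
ORACLE_SID=$(echo "$line" | cut -d: -f1)
ORACLE_HOME=$(echo "$line" | cut -d: -f2)
echo "$ORACLE_SID $ORACLE_HOME"
```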

We can now install the required libaio development package and cpanm, using the latter to install the CPAN packages:

> sudo yum install -y libaio-devel perl-App-cpanminus
> sudo -E bash -c 'cpanm -n DBI DBD::Oracle'

The -E option preserves the environment variables we set up above, which DBD::Oracle needs in order to locate the Oracle libraries when building. We are all set to start development.

Note that this Oracle Linux VM comes with Perl 5.26; if you want a more recent version, you could use Perlbrew.

Connecting to the DB using Perl/DBI

The HTML page that opens on the first launch of the Oracle VM gives you the database and user names that are already set up. Using those, you can connect to the DB through Perl/DBI with the DBD::Oracle driver:


#!/usr/bin/env perl
use strict;
use warnings;

use DBI;

# Connect to the pre-created PDB 'freepdb1'
my $dbname = "freepdb1";
my $user   = "hr";
my $pass   = "oracle";
my $dbh    = DBI->connect("dbi:Oracle:$dbname", $user, $pass)
    or die $DBI::errstr;

# Or connect to the CDB 'free' as the system user
my $sid  = "free";
$user    = "system";
$pass    = "oracle";
my $cdbh = DBI->connect("dbi:Oracle:host=localhost;sid=$sid", $user, $pass)
    or die $DBI::errstr;

SQL::Inserter for faster/multi-row inserting

SQL::Inserter is a new CPAN module based on what we use at SpareRoom to easily manage buffered inserts, as well as to replace SQL::Abstract's slow insert method.

Inserter OO interface

The idea is that if you want to insert many rows of data (e.g. as part of an ETL pipeline, or for writing logs), you'd want to be doing that with multi-row INSERT statements. With SQL::Inserter you create a very lightweight object to handle the inserting for you; just pass it a DBI db handle, the table name and the columns you'll be inserting (optional if you use hashes for inserting, see next section):

use SQL::Inserter;

my $inserter = SQL::Inserter->new(
    dbh   => $dbh,
    table => 'tablename',
    cols  => [qw/col1 col2.../],
);

The inserter has a 100-row buffer by default (feel free to increase it for "small" rows) and will insert whenever that buffer gets full.

You have two options for how to insert. The fastest method is with a flat array containing the values for the rows in the order specified in the constructor (cols); they will be used as bind variables. You can pass multiple rows at once as an array of size N x number_of_cols, and even pass all your rows in a single call if they can't fit the buffer - they will be inserted in segments.

The second method is with a hash, which allows you to pass references to SQL statements instead of simple bind values, but one row at a time:

# Fastest: pass single or multiple rows of data as a flat array
$inserter->insert($col1_val1, $col2_val1, $col1_val2...);

# Alt: pass a single row as a hash; scalar refs let you pass raw SQL
$inserter->insert({
    column1 => $data1,
    column2 => \'NOW()',
    ...
});

# Force flush the buffer at any time by calling insert with no arguments
$inserter->insert();

Just call insert as needed; the inserts will happen whenever the buffer fills, and the buffer will be flushed automatically when the inserter object is destroyed / goes out of scope. You can always force a flush manually by calling $inserter->insert() with no arguments.

SQL building functions

In the past I have complained about the slowness of SQL::Abstract (the time it takes to create an SQL string can be more than the execution time itself!). SQL::Maker is a fast alternative, but our developers had favoured the SQL::Abstract syntax in the past, so we have now replaced it internally with our own fast functions that take similar syntax. SQL::Inserter includes the function we use for inserts:

# Similar to SQL::Abstract's insert, but with much less overhead:
my ($sql, @bind) = simple_insert($table, {col1=>$val...});

# Multi-row possible:
my ($sql, @bind) = simple_insert(
    $table,
    [{col1=>$val1...},{col1=>$val2...},...]
);

Since we use MySQL, it also gives the convenience of INSERT IGNORE or even ON DUPLICATE KEY UPDATE variants:

my ($sql, @bind) = simple_insert(
    $table, [{col1=>$val1,col2=>$val2},{col1=>$val3,col2=>$val4}],
    {duplicates => 'update'}
);

## INSERT INTO table_name (col1,col2)
## VALUES (?,?),(?,?)
## ON DUPLICATE KEY UPDATE col1=VALUES(col1),col2=VALUES(col2)

And lastly, if you simply want an SQL INSERT statement with placeholders for $num_of_rows (perhaps with IGNORE or ON DUPLICATE KEY variants) to do your own inserting, there's an even simpler function:

my $sql = multi_insert_sql(
    'table', [qw/col1 col2.../], $num_of_rows
);

Performance

The multi-row vs single-row performance advantage can be immense (we saved twenty minutes of processing time on a single 500k-email mailshot when we switched its logging to multi-row inserts), but it is very application-specific. As an indication, here is the time it takes to insert 100k small rows (3 columns) either as single-row inserts, or as multi-row inserts with a 100 and 1000 row buffer, in our test environment (GCP VM connecting to Cloud SQL):

Single row insert: 87.1s
   100-row insert: 1.36s
  1000-row insert: 0.62s

A speedup of over 100x here. You get the idea: the single-row insert includes a network round trip per row, which is most of the delay, and multi-row inserts avoid that. Depending on the size of your rows, there's a limit to how many you can insert at a time.
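A quick back-of-envelope check on the timings above makes the round-trip cost visible; the per-row cost collapses once it is amortised over a batch (awk used purely as a calculator here):

```shell
# Per-row cost for 100k rows at each of the timings quoted above
awk 'BEGIN {
    rows = 100000
    printf "single row: %.2f ms/row\n", 87.1 * 1000 / rows
    printf "  100-row: %.4f ms/row\n",  1.36 * 1000 / rows
    printf " 1000-row: %.4f ms/row\n",  0.62 * 1000 / rows
}'
```

That is roughly 0.87ms per row single-row vs 0.006ms per row with a 1000-row buffer.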

The performance of the SQL-building part of the process is a bit easier to quantify. SQL::Inserter takes a minimal amount of time to build SQL - as you'd expect it to. There's a benchmark script included with the module that compares the speed of the functions; on an M1 Pro it gives me:

 Compare SQL::Abstract, SQL::Maker, simple_insert:
                     Rate Abstract Abstract cached Maker Maker cached simple_insert
 Abstract          4207/s       --             -6%  -90%         -91%          -98%
 Abstract cached   4482/s       7%              --  -90%         -90%          -98%
 Maker            44245/s     952%            887%    --          -4%          -76%
 Maker cached     46205/s     998%            931%    4%           --          -75%
 simple_insert   187398/s    4355%           4081%  324%         306%            --
 
 Compare simple_insert, multi_insert_sql for single row:
                      Rate    simple_insert multi_insert_sql
 simple_insert    190037/s               --             -76%
 multi_insert_sql 797596/s             320%               --

SQL::Inserter's simple_insert is over 40x faster than SQL::Abstract and over 4x faster than SQL::Maker. The latter is not slow at all, so you don't have to swap it out for performance reasons, but simple_insert gives some extra options for MySQL/MariaDB users.

Cloud Provider Price Performance Comparison: Spot VMs

In my recent Cloud Comparison, I mentioned that I'd look at Spot VM pricing in an update. This is the update - 6 out of the 10 providers tested offer Spot/Preemptible instance pricing.

At SpareRoom we make good use of Spot VMs. E.g. our Perl test suite gets to run on fast VM types at very low cost: currently we are using c3-highcpu-22 instances, which normally come at $0.95/hour each. The spot pricing for them is more than 10x lower, at just $0.086/h. At these prices, if our test suite needed it (it's already fast), we'd be able to launch c3-highcpu-176 instances (176 vCPUs) at well under $1/h!
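To put a number on "more than 10x lower", a quick check using the two prices quoted above:

```shell
# On-demand vs spot hourly price ratio for c3-highcpu-22
awk 'BEGIN { printf "%.1fx\n", 0.95 / 0.086 }'
```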

Profiling Perl under Apache

At SpareRoom we still use Apache/mod_perl and, despite Devel::NYTProf::Apache's warnings about it not being maintained, it is still the best way for profiling Perl.

For a mod_perl app, all you really need to do is include this in your Apache config:

# PerlPassEnv is only needed if you want non-default NYTPROF settings:
PerlPassEnv NYTPROF
PerlModule Devel::NYTProf::Apache

You should also allow only a single worker/child, and make sure it terminates cleanly before processing the output (with nytprofhtml).

Since I want my dev/test environments to normally run with multiple workers and no profiler attached, I have a bash alias that stops apache and restarts it set up for profiling. I thought I'd share an example of that, as I find it very useful.

First, a basic example pulled from what I used on our older CentOS/RedHat VM:

profile_apache() {
  systemctl stop httpd.service
  grep -q -F 'PerlModule Devel::NYTProf::Apache' /etc/httpd/conf/httpd.conf \
    || echo 'PerlModule Devel::NYTProf::Apache' >> /etc/httpd/conf/httpd.conf
  sed -i 's/^#PerlModule Devel/PerlModule Devel/g' /etc/httpd/conf/httpd.conf
  echo "Starting apache to collect profile data. Ctrl+c once to end."
  httpd -X
  echo "Restarting apache and creating html report."
  sed -i 's/^PerlModule Devel/#PerlModule Devel/g' /etc/httpd/conf/httpd.conf
  systemctl start httpd.service &
  nytprofhtml
  echo "Profile report in nytprof/index.html"
}

It's mostly self-explanatory: the profile_apache command will shut down apache, add the PerlModule directive to the config if it's not there (or uncomment it), and launch an attached single-worker instance of apache. It will wait until you send a SIGINT (e.g. press Ctrl+C) before restoring the original configuration, relaunching apache and producing the profiler report.
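The config toggle at the heart of it is just a pair of sed substitutions. Here's a standalone sketch of that part, run against a temporary file rather than the real Apache config, so it's safe to try anywhere:

```shell
# Enable/disable the NYTProf directive the way profile_apache does,
# but on a temporary copy instead of the real Apache config
conf=$(mktemp)
echo '#PerlModule Devel::NYTProf::Apache' > "$conf"
sed -i 's/^#PerlModule Devel/PerlModule Devel/' "$conf"   # uncomment: enable
grep '^PerlModule' "$conf"
sed -i 's/^PerlModule Devel/#PerlModule Devel/' "$conf"   # comment: disable
grep '^#PerlModule' "$conf"
rm -f "$conf"
```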

Now, if you are on Debian/Ubuntu, you'll have to modify things a bit. And if you are running under docker and sed -i gives you Device or resource busy errors (from trying to change a mounted file's inode from within the container), you can use a method that copies instead of modifying in-place, such as sed -ci.

The alias becomes:

profile_apache() {
  apache2ctl stop && sleep 1
  grep -q -F 'PerlModule Devel::NYTProf::Apache' /etc/apache2/mods-available/perl.conf \
    || echo -e "\nPerlModule Devel::NYTProf::Apache" >> /etc/apache2/mods-available/perl.conf
  sed -ci 's/^#PerlModule Devel/PerlModule Devel/g' /etc/apache2/mods-available/perl.conf
  source /etc/apache2/envvars
  echo "Starting apache to collect profile data. Ctrl+c once to end."
  apache2 -X
  echo "Restarting apache and creating html report."
  sed -ci 's/^PerlModule Devel/#PerlModule Devel/g' /etc/apache2/mods-available/perl.conf
  apache2ctl start &
  nytprofhtml
  echo "Profile report in nytprof/index.html"
}

This is the basic idea, although my alias does some extra things like putting the report in an apache-mounted directory and providing a browser link, removing the PerlRequire that preloads our .pm modules (it clashes with NYTProf) etc.

PS, <code> does not work on the bash code above (adds some colours, but renders everything in a single wrapping line with the regular font), so I had to use <pre>. If someone knows how to fix it (tried markdown ``` as well), I can add syntax highlighting - for now I cross-posted to dev.to to make it look right.

Cloud Provider Performance & Price Comparison 2023


Last year I compared the various VM types of 7 popular cloud providers mainly for Perl performance, as I was looking at workloads that we'd be using at SpareRoom - you can see that comparison here.

This year I expanded the comparison to 10 providers, and while I posted it on dev.to, I thought I'd give a heads-up to blogs.perl readers, especially the commenters on last year's post, whose suggestions I incorporated.