Moving data around with Yertl over ODBC (to HANA)

Probably the first Perl blog post that I read this year — or even the first thing I read this year ;) — was "Managing SQL Data with Yertl" by Doug Bell, about a nice ETL framework that he wrote. The framework comes with a set of command-line tools that let you extract, munge and move data between different SQL databases.

As I was struggling at that time with getting some reports running against an old OLTP database, which I was very reluctant to touch, I thought of writing a tool to extract, let's say, the last quarter of data, re-model the schema and move the data to a HANA DB, which exposes an MDX interface. A simple extract-load task can be easily implemented by chaining two ysql processes:


ysql query mysql-local "select * from EntryExternal" | ysql write hana-trial "insert into Entries (caption,category,finalPrice,finalized,resolved) VALUES ($.caption, $.category, $.finalPrice, $.finalized, $.resolved)"

In this example the data is first extracted from a local MySQL server, written to stdout in YAML format and then fed to the second ysql process, which throws the data into a remote HANA database. Yertl lets you store, in a config file, not only the connection details of the instances you often work with but also a number of statements you run frequently. My config file looks as follows:

---
hana-trial:
  DSN: i079489trial-quickmove
  driver: ODBC
  password: *******
  user: DEV_1XH8VPAXXXXXXXXXXXXXXXX
mysql-local:
  database: bkxdb
  driver: mysql
  password: *******
  user: bkx
mysql-remote-md:
  database: m13088_bkidx1
  driver: mysql
  host: localhost
  password: *******
  port: 13306
  query:
    select-dual: select * from dual
  user: m13088_bkidx1

As you can see, I am frequently reading the one dummy row in the dual table ;) That's because I had some really lengthy queries that I wanted to store as separate SQL files, with different extensions for the MySQL/HDB dialects. So I made a small patch to Yertl (which I haven't checked into a git fork yet) to allow reading queries from files, in the same way as you can pass input to curl, wget etc., i.e. by passing @filepath instead of the query name.
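The patch boils down to something along these lines (a minimal sketch with a hypothetical resolve_query helper, not the actual Yertl code): if a query argument starts with @, treat the rest as a file path and slurp the SQL from it; otherwise pass the argument through as a literal query or a saved query name.

# Hypothetical helper: resolve a query argument the way curl/wget treat @file
sub resolve_query {
    my ( $query_arg ) = @_;
    # '@path/to/query.sql' means: read the SQL text from that file
    if ( $query_arg =~ s/^@// ) {
        open my $fh, '<', $query_arg
            or die "Cannot read query file '$query_arg': $!";
        local $/;             # slurp the whole file
        return scalar <$fh>;
    }
    # otherwise it is a literal query or the name of a saved query
    return $query_arg;
}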

In my previous post, I explained in detail how to set up an ODBC connection to HANA over a TLS tunnel and connect to a DB with the DBD::ODBC driver. Let me quickly highlight the main features of Yertl that were useful in this OLTP-to-OLAP ETL task and then we'll dive into some technical details. I'm leaving the reporting part for another article.

We normally need some filtering and transformation when we move data between different databases and, especially, different data models. Part of the Yertl framework is also a utility called yq, which offers a simplistic syntax that lets us filter and modify the YAML documents read from ysql output:


    ysql mysql-remote-md query get_recent_auctions | yq "if length(.caption) > 0 then . else empty"  | ysql write hana-trial insert_auctions

In this example we're getting a set of recent auctions from a remote MySQL database (accessible via an SSH tunnel), filtering out those auctions whose caption is, for some reason, empty, and pushing the remaining records to a remote HANA DB (over a TLS tunnel).
Unfortunately, I found the syntax too simplistic to implement the filtering I needed, as there is, for example, no switch statement and no support for regex matching. Luckily, though, the author of the framework kindly gave us an easy way of plugging in our own filter implementation, which takes a YAML doc object as input and returns the filtered output. Here's an excerpt from the yq source code:


$ENV{YQ_CLASS} ||= 'ETL::Yertl::Command::yq::Regex';
use_module( $ENV{YQ_CLASS} );
{
    no strict 'refs';
    no warnings 'once';
    *filter = *{ $ENV{YQ_CLASS} . "::filter" };
}

Hence, we can sketch out a more "advanced" filter with regex matching and much more:


package My::ETL::SimpleEntryFilter;
# ABSTRACT: A simple custom yq filter with regex-based clean-up
our $VERSION = '0.017';
use ETL::Yertl;
use boolean qw( :all );
use Regexp::Common;

# yq represents "no output for this document" as an object blessed into 'empty'
sub empty() {
    bless {}, 'empty';
}
*diag = *yertl::diag;

# Filter MUST NOT mutate $doc!
sub filter {
    my ( $class, $filter, $doc, $scope ) = @_;
    my $year = $doc->{year};
    if ( defined $year && $year !~ /^\s*\d{4}\s*$/ ) {
        $doc->{year} = undef;
    }
    # drop documents without a caption entirely
    return empty() unless $doc->{caption};
    my $ISBN = $doc->{ISBN};
    $doc->{ISBN} = undef if $ISBN && $ISBN =~ /^-/; # that must be wrong
    return $doc;
}
1;
 

... and then just set the environment variable YQ_CLASS to your new filter class. For example:

set PERL5LIB=C:\Users\rkotowicz\Perl\mylib

set YQ_CLASS=My::ETL::SimpleEntryFilter

type data\autions_with_attrs_May2013.yaml00 | yq " " | ysql write hana-trial @"queries\insert_auctions.hsql"

What about performance of such loads?

If you kick off a data load with a single ysql process into a trial HANA instance, you probably won't get a speed above 20-60k rows per hour. But we need to note a few things at this point:


  1. Yertl runs the DML statements one by one
  2. Each statement is auto-committed
  3. You're loading within one ODBC connection that is routed through one TLS tunnel
  4. There are some constraints imposed on the connections in the trial instance

Therefore, we could instead split the data into smaller chunks, run multiple concurrent instances of ysql (ideally routed through multiple tunnels) and make Yertl insert rows in batches, committing only after a batch has been inserted. Batch import isn't supported out of the box but, again, I'll put that little modification in my git fork. If we go down that path, starting 9 instances of ysql over 3 tunnels gives a speed of circa 300k rows per hour. That still isn't much if you need to move large volumes around quickly, but for ad-hoc applications and incremental loads it should be fine.
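For the chunking itself a tiny helper script is enough. The sketch below is my own illustration (not part of Yertl): it splits a multi-document YAML stream coming from ysql into N files, round-robin, so that each file can be piped into a separate ysql process:

#!/usr/bin/env perl
# split_yaml.pl - split a multi-document YAML stream into N chunk files
# (round-robin), so each chunk can be loaded by a separate ysql process.
use strict;
use warnings;

my $chunks = shift @ARGV || 3;
my @fh;
for my $i ( 0 .. $chunks - 1 ) {
    open $fh[$i], '>', "chunk_$i.yaml" or die "chunk_$i.yaml: $!";
}

my $n   = 0;
my $doc = '';
while ( my $line = <STDIN> ) {
    if ( $line =~ /^---/ && length $doc ) {   # start of the next document
        print { $fh[ $n++ % $chunks ] } $doc;
        $doc = '';
    }
    $doc .= $line;
}
print { $fh[ $n++ % $chunks ] } $doc if length $doc;
close $_ for @fh;

Feeding it with the ysql output, e.g. ysql query mysql-local "select * from EntryExternal" | perl split_yaml.pl 3, leaves chunk_0.yaml to chunk_2.yaml ready to be loaded by three concurrent ysql write processes.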

There's one pitfall when constructing batch inserts for HANA. Unlike in MySQL, Oracle and possibly a couple of other DBMSes, you can't insert multiple rows with a single INSERT statement. What we can do, though, is bind an array of values for each parameter and run execute_array.

Actually, the way it works is also a bit confusing, because the DBI manual authoritatively states that execute_array "executes the prepared statement once for each parameter tuple (group of values) provided".
However, the ODBC implementation overrides that behaviour and there will only be one round-trip to the database.
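Roughly, a batched insert then looks like the following. This is a minimal sketch, not the exact code of my Yertl modification; the DSN matches the hana-trial entry from the config above, the credentials are assumed to come from the environment, and the column list is trimmed down from the earlier examples:

use strict;
use warnings;
use DBI;

# Connect over ODBC; AutoCommit is off so we can commit per batch
my $dbh = DBI->connect(
    'dbi:ODBC:DSN=i079489trial-quickmove',
    $ENV{HANA_USER}, $ENV{HANA_PASS},
    { RaiseError => 1, AutoCommit => 0 },
);

# Column-wise data for one batch (normally collected from the YAML documents)
my @captions   = ( 'First entry', 'Second entry' );
my @categories = ( 'books',       'books'        );
my @prices     = ( 12.50,         8.99           );

my $sth = $dbh->prepare(
    'insert into Entries (caption, category, finalPrice) values (?, ?, ?)'
);

# Bind one array per placeholder...
$sth->bind_param_array( 1, \@captions );
$sth->bind_param_array( 2, \@categories );
$sth->bind_param_array( 3, \@prices );

# ...and send the whole batch in one execute_array call
my @tuple_status;
$sth->execute_array( { ArrayTupleStatus => \@tuple_status } );

$dbh->commit;   # commit once per batch, not per row
$dbh->disconnect;

Committing once per batch, instead of relying on auto-commit for every row, is what addresses points 1 and 2 from the list above.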

At this point, it would be good to get a sense of how quick the loads would be if we used the data import feature of HANA Studio. Well, those are much faster... In my tests I was getting 60k rows per minute (!), which leads us to two conclusions. Firstly, if we need to transfer larger volumes we could convert the YAML output to CSV (and Yertl lends a hand here, because there's a tool just for that, called yto). Secondly, the ODBC import is far slower because of how long the single execute_array call itself takes... I'm not positive about the root cause yet, but it's certainly not the cost of the round-trip. I'll update the post once I find out...

Hopefully, this short article gives you an idea of how data can be moved to HANA through an ODBC channel on the fly. In a follow-up post I am going to share some ideas about building ad-hoc OLAP databases in this way, and the options for presenting that data with MDX and non-MDX tools.


2 Comments

I'm glad you're finding it useful! The project has been stalled at a very early stage (I've been working on Statocles), but I'll be happy to accept any contributions you want to make.

You may want to upgrade your fork: I removed the "query" and "write" commands from ysql. Now you only have to write the query (or give the saved query's name).

The @filepath thing sounds awesome! Since queries are stored per-database, it seems like there needs to be a way to also have a common store of queries, or share queries between databases, or get a query from another database...

The YQ_CLASS thing I'm not proud of ;), but it'll stay for a number of reasons. That said, it seems like there might be ways to make what you're doing easier: yq could accept a --filter_class option, and the class could then do more of the work, removing the need for the " " empty filter you have. Otherwise, you could make your own utility, and Yertl could provide helpers and tools to make it a lot easier to build small, custom utilities like that.

If there's a way to make the execute_array option available to ysql, I'm all for it. Perhaps something like --batch 50 to enable execute_array with a batch size of 50 documents? I'm interested in any performance improvements. I know that if you install the right underlying YAML module, you'll get improved performance (I think YAML::XS is the winner in that case). For CSV, if you have Text::CSV_XS you'll get better results.

I just got GNU parallel to work with YAML using the -d flag: parallel --env _ -d "\n---" --pipe yq ".foo" . That might help with some embarrassingly-parallel tasks, though I would also love to see some input/output messaging patterns so one could build microbatch processing workers.

Long-term, version 1.0 of Yertl will be built on an async I/O library (likely IO::Async), and allow for better use of the shell utilities in any other Perl code. If you're interested, I'd love some help! Yertl is the project that I'm most excited about, but I've been distracting myself with more comfortable things like web development...


About Radek Kotowicz

I blog about Perl.