Lazy Database Columns and Virtual Vertical Partitioning

Recently we were trying to fix a database performance issue and I had to hit the DBIx::Class IRC channel for some advice. Specifically, MySQL has an issue where a query with TEXT or BLOB columns forces a sort on disk under certain conditions. Suffice it to say, we hit those conditions with several TEXT columns, and benchmarking showed that removing those columns from a query significantly sped it up. That, unfortunately, is a bit tricky to do cleanly under DBIx::Class, but Matt Trout made an excellent suggestion which I later implemented: why not do a virtual vertical partition of our tables?
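To be clear, DBIx::Class does let you trim the selected columns on a per-query basis via the columns attribute; a rough sketch (the schema and resultset names are assumed):

my $artists = $schema->resultset('Artist')->search(
    undef,
    { columns => [qw/artistid name/] },    # leave the TEXT columns behind
);

The trouble is that you have to remember to do that for every query which touches the table, and that gets old fast.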

Here's a simple package pulled straight from the DBIx::Class docs:

package MyDB::Schema::Result::Artist;
use base qw/DBIx::Class::Core/;

__PACKAGE__->table('artist');
__PACKAGE__->add_columns(qw/ artistid name /);
__PACKAGE__->set_primary_key('artistid');
__PACKAGE__->has_many(cds => 'MyDB::Schema::Result::CD');

1;

Now imagine that the artist table also has huge "biography" and "artistic_vision" columns defined as "TEXT" (in the database, but not shown in the package):

package MyDB::Schema::Result::Artist;
use base qw/DBIx::Class::Core/;

__PACKAGE__->table('artist');
__PACKAGE__->add_columns(qw/ artistid name biography artistic_vision /);
__PACKAGE__->set_primary_key('artistid');
__PACKAGE__->has_many(cds => 'MyDB::Schema::Result::CD');

1;

In MySQL, a query that would otherwise use the MEMORY storage engine for its implicit temporary table is forced to sort on disk when TEXT columns are involved, no matter how little data you pull. Thus, you want to avoid pulling those TEXT columns, but it's a bit tricky. Class::DBI allowed you to have lazy columns and they looked like this:

__PACKAGE__->columns(Primary   => qw/artistid/);
__PACKAGE__->columns(Essential => qw/name/);
__PACKAGE__->columns(Others    => qw/biography artistic_vision/);

Now if you fetch an artist, you only get the large columns when you request them on an individual result. This means that complex queries can be much faster. However, it also means that if you return 100 artists with one query and you need those extra fields, you make at least 101 SQL calls, because the lazy columns are fetched once per object.
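To make that cost concrete, here's a hedged sketch of the access pattern (MyDB::Artist and the search are invented for illustration):

my @artists = MyDB::Artist->search_like( name => 'The %' );    # 1 query

for my $artist (@artists) {
    # name is "Essential", so it's already loaded; biography is in
    # "Others", so Class::DBI issues a separate SELECT per object
    print $artist->name, ': ', $artist->biography, "\n";
}
# 100 artists => 1 query for the list plus 100 queries for biographies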

We tried a different approach. It's perfectly OK to have more than one result class point to the same table, so we created a "Lazy" result class:

package MyDB::Schema::Result::Artist::Lazy;
use base qw/DBIx::Class::Core/;

__PACKAGE__->table('artist');
__PACKAGE__->add_columns(qw/ artistid biography artistic_vision /);
__PACKAGE__->set_primary_key('artistid');

1;

And we remove the original columns and proxy across:

package MyDB::Schema::Result::Artist;
use base qw/DBIx::Class::Core/;

__PACKAGE__->table('artist');
__PACKAGE__->add_columns(qw/ artistid name /);
__PACKAGE__->set_primary_key('artistid');
__PACKAGE__->has_many(cds => 'MyDB::Schema::Result::CD');
__PACKAGE__->has_one( lazy_columns => 'MyDB::Schema::Result::Artist::Lazy',
    undef,
    { proxy => [qw/biography artistic_vision/] }
);

1;

Queries against Result::Artist now simply skip the TEXT columns. The proxy ensures that we have transparent read behaviour (we overrode new() and insert() for the write behaviour).
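Our actual overrides aren't shown here, but a minimal sketch of the idea might look something like the following, added to the main Artist class. The @LAZY_COLUMNS variable, the stash key and the 'Artist::Lazy' moniker are assumptions for illustration, not our real code:

my @LAZY_COLUMNS = qw/biography artistic_vision/;

sub new {
    my ( $class, $attrs ) = @_;

    # Hold back the lazy columns so they aren't treated as columns of
    # the main artist result
    my %lazy;
    for my $col (@LAZY_COLUMNS) {
        $lazy{$col} = delete $attrs->{$col} if exists $attrs->{$col};
    }

    my $self = $class->next::method($attrs);
    $self->{_lazy_data} = \%lazy if %lazy;    # quick-and-dirty stash
    return $self;
}

sub insert {
    my $self   = shift;
    my $result = $self->next::method(@_);

    # Create the companion Artist::Lazy row, carrying any held-back data
    my $lazy = delete $self->{_lazy_data} || {};
    $self->result_source->schema->resultset('Artist::Lazy')->create({
        artistid => $self->artistid,
        %$lazy,
    });

    return $result;
}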

At this point you're saying "but Ovid, that's just as bad as the Class::DBI model!", and you'd be right. Except there's a curious artifact to our code which lets us work around that.

We do not pass result or resultset objects to our views. Instead, we extract the data into something we call blocklists, which are effectively a read-only view of that data. This was done in the good ol' days when we didn't want to risk our front-end people being able to trigger additional SQL queries from the view. Thus, I modified the blocks in the blocklists to not fetch any "lazy" data. Then, before we pass the blocklist to the view, we call "finalize" on it. I wrote something which looks like this:

use Scalar::Util 'blessed';

sub finalize {
  my $self = shift;

  my %blocks_to_complete;
  my @fifo_blocks = $self->blocks;

  # Walk the blocklists and collect all incomplete blocks
  while ( my $block = shift @fifo_blocks ) {

    # Nested blocklists get pushed onto the queue so we walk them, too
    for my $attr ( $block->block_attributes ) {
      $block->can($attr) or next;

      my $value = $block->$attr;
      if ( blessed($value) && $value->isa('BlockList') ) {
        push @fifo_blocks => $value->blocks;
      }
    }
    next if $block->is_complete;    # don't reprocess
    my $lazy_model = $block->lazy_model;

    push @{ $blocks_to_complete{$lazy_model}{ $block->id } } => $block;
  }

  # Walk the lazy classes, fetch the resultsets and populate the blocks
  for my $lazy_model ( keys %blocks_to_complete ) {
    my $blocks_for_id = $blocks_to_complete{$lazy_model};

    my @ids            = keys %$blocks_for_id;
    my @lazy_dbic_rows = Dynamite->model($lazy_model)->search(
      { id => { -in => [@ids] } },
    )->all;

    for my $lazy_dbic_row (@lazy_dbic_rows) {
      # add the lazy data to the blocks
    }
  }

  return 1;
}

Note how we collect all of the ids for a given block type and, rather than making a separate SQL call for each block, fetch all of the lazy data with one "id IN" query.
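For completeness, here's a hedged sketch of what the elided "add the lazy data to the blocks" step might look like. The set_lazy_data() and mark_complete() methods are hypothetical stand-ins for our real block API:

sub _populate_blocks {
    my ( $blocks_for_id, @lazy_dbic_rows ) = @_;

    for my $row (@lazy_dbic_rows) {
        my $blocks = $blocks_for_id->{ $row->id } or next;
        for my $block (@$blocks) {
            # get_columns hands back every lazy column as name/value pairs
            $block->set_lazy_data( $row->get_columns );
            $block->mark_complete;    # so finalize() won't pick it up again
        }
    }
    return;
}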

I call this "virtual vertical partitioning". We've effectively split each table with problematic data in two, and if we ever physically partition these tables (there are compelling reasons for us not to), we merely need to switch the table name in the "Lazy" class and everything should just work.
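In other words, if those TEXT columns were ever moved out into a real table (call it "artist_lazy"; the name is hypothetical), the only code change needed would be the table declaration in the Lazy class:

__PACKAGE__->table('artist_lazy');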

Next, I want to push our "created" and "modified" columns into the Lazy classes. They're almost useless, but I've abstracted things out enough that this should be an almost transparent change.
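A hedged sketch of what that change would look like, reusing the classes above:

# In MyDB::Schema::Result::Artist::Lazy
__PACKAGE__->add_columns(qw/ artistid biography artistic_vision created modified /);

# In MyDB::Schema::Result::Artist
__PACKAGE__->has_one( lazy_columns => 'MyDB::Schema::Result::Artist::Lazy',
    undef,
    { proxy => [qw/biography artistic_vision created modified/] }
);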
