ElasticSearch.pm gets big performance boost

ElasticSearch version 0.12 is out today along with some nice new features.

However, the thing I'm most excited about is that ElasticSearch.pm v 0.26 is also out and has support for bulk indexing and pluggable backends, both of which add a significant performance boost.

Pluggable backends

I've factored out the parts which actually talk to the ElasticSearch server into the ElasticSearch::Transport module, which acts as a base class for three backends: ElasticSearch::Transport::HTTP (which uses LWP), ::HTTPLite (which uses, not surprisingly, HTTP::Lite) and ::Thrift (which uses the Thrift protocol).
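To make the base-class idea concrete, here is a minimal sketch of the pattern in plain Perl. It is not the code from ElasticSearch::Transport: the My::Transport packages, the send_request method and the make_transport helper are hypothetical names invented for illustration, and the backends only return a string instead of talking to a server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical base class: shared logic lives here, the wire call is deferred
package My::Transport;

sub new {
    my ( $class, %args ) = @_;
    return bless { servers => $args{servers} }, $class;
}

# Common entry point, so callers never care which backend is in use
sub request {
    my ( $self, %params ) = @_;
    return $self->send_request(%params);
}

# Subclasses must override this
sub send_request { die ref(shift) . " must implement send_request\n" }

# Hypothetical backends, standing in for the LWP and HTTP::Lite variants
package My::Transport::HTTP;
our @ISA = ('My::Transport');
sub send_request { my ( $self, %p ) = @_; return "http: $p{method} $p{path}" }

package My::Transport::HTTPLite;
our @ISA = ('My::Transport');
sub send_request { my ( $self, %p ) = @_; return "httplite: $p{method} $p{path}" }

package main;

# Map a short backend name to its class and construct it
sub make_transport {
    my ( $name, %args ) = @_;
    my %backends = (
        http     => 'My::Transport::HTTP',
        httplite => 'My::Transport::HTTPLite',
    );
    my $class = $backends{$name} or die "Unknown transport '$name'\n";
    return $class->new(%args);
}

my $transport = make_transport( 'httplite', servers => '127.0.0.1:9200' );
print $transport->request( method => 'GET', path => '/' ), "\n";
```

The point of the design is that swapping backends changes one name at construction time, while the calling code stays the same.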

I expected Thrift to be the big winner, but it turns out that the generated code is dog-slow. However, HTTP::Lite completes the same run in about 20% less time than LWP:

   httplite   :  63 seconds, 951 tps
   http       :  79 seconds, 759 tps
   thrift     :  690 seconds, 87 tps

Bulk indexing

Since version 0.11, ElasticSearch has had a bulk operation, which can take a stream of index, create and delete statements in a single request.

For instance, you could do:

    $es->bulk(
        { index => {
            index => 'foo', type => 'bar', id => 1, data => { foo => 'bar' }
        }},
        { create => {
            index => 'foo', type => 'bar', id => 2, data => { foo => 'bar' }
        }},
        { delete => {
            index => 'foo', type => 'bar', id => 1
        }}
    );

The number of actions you can pass in depends on how much memory you have, both on the client and the server, and how big your documents are.

I tried tranches of 1,000, 5,000 and 10,000 documents at a time, and the results were very similar.
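Splitting a large list of actions into tranches can be sketched in a few lines of plain Perl. The batches helper below is a hypothetical name, not part of ElasticSearch.pm, and the call to bulk is left as a comment so the snippet runs without a server.

```perl
use strict;
use warnings;

# Split a list of bulk actions into batches of at most $size items
sub batches {
    my ( $size, @actions ) = @_;
    die "Batch size must be positive\n" unless $size > 0;
    my @batches;
    push @batches, [ splice @actions, 0, $size ] while @actions;
    return @batches;
}

# Build 2,500 placeholder 'create' actions in the same shape as the example above
my @docs = map {
    +{ create => { index => 'foo', type => 'bar', id => $_, data => { foo => 'bar' } } }
} 1 .. 2500;

for my $batch ( batches( 1000, @docs ) ) {
    printf "batch of %d actions\n", scalar @$batch;

    # With a live server, each batch would go out as a single request:
    # my $res = $es->bulk($batch);
}
```

The batch size is the knob to turn if either the client or the server starts running short of memory.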

All tranches and all transports averaged about 7.5 seconds, or roughly 8,000 transactions per second! These are small documents, so I would be surprised to achieve this rate in the real world, but an improvement of roughly 10x over indexing one document at a time is phenomenal.

(These benchmarks were run on my laptop with a single ElasticSearch node, over 59,950 documents of the form { text => $string }, whose string values averaged 310 characters in length and consisted of real-world text, not randomly generated gibberish.)
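The throughput figures follow from dividing the document count by the elapsed time. Here is a small sketch of that arithmetic, using only the numbers quoted in this post. The tps helper is a hypothetical name, and because the elapsed times are rounded, the computed rates can differ by a transaction or so from the ones listed earlier.

```perl
use strict;
use warnings;

my $docs = 59_950;    # documents indexed in each run

# Elapsed seconds per run, as quoted in this post
my %seconds = (
    httplite => 63,
    http     => 79,
    thrift   => 690,
    bulk     => 7.5,
);

# Transactions per second = documents / elapsed seconds
sub tps {
    my ( $count, $secs ) = @_;
    return $count / $secs;
}

for my $name ( sort { $seconds{$a} <=> $seconds{$b} } keys %seconds ) {
    printf "%-9s: %6.1f seconds, %5.0f tps\n",
        $name, $seconds{$name}, tps( $docs, $seconds{$name} );
}

# Speed-up of bulk indexing relative to the two HTTP backends
printf "bulk vs httplite: %.1fx\n", $seconds{httplite} / $seconds{bulk};
printf "bulk vs http:     %.1fx\n", $seconds{http} / $seconds{bulk};
```

On these figures the bulk run is between about 8x and 10x faster, depending on which HTTP backend it is compared with.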

Example script

(This is now included in the examples directory of ElasticSearch.pm)

Finally, here is a simple example script which downloads all of the issues open against ElasticSearch from GitHub, indexes them, and provides a simple command-line interface for searching them:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use JSON::XS();
    use ElasticSearch();
    use ElasticSearch::Util qw(filter_keywords);
    use HTTP::Lite();
    use v5.12.0;

    my $url = 'http://github.com/api'
        . '/v2/json/issues/list/elasticsearch/elasticsearch/open';

    my $json = JSON::XS->new->utf8(1)->pretty(1);
    my $es = ElasticSearch->new( servers => '127.0.0.1:9200' );

    # Download issues list from github
    my $http = HTTP::Lite->new();
    my $req  = $http->request($url);
    die "couldn't retrieve issues list" unless $req && $req == 200;

    my $issues = $json->decode( $http->body )->{issues};

    # delete index in case it already exists, then create the index
    eval { $es->delete_index( index => 'issues' ) };
    $es->create_index( index => 'issues' );

    # prepare issues for indexing
    my $id = 1;
    my @docs;
    for (@$issues) {

        # each doc needs an index, a type, an ID and data
        my $doc
            = { index => 'issues', type => 'entry', id => $id++, data => $_ };

        # we want to 'create' each doc (as opposed to 'index' or 'delete')
        push @docs, { create => $doc };
    }

    # bulk index docs
    my $res = $es->bulk( \@docs );
    if ( $res->{errors} ) {
        die "Bulk index had issues: " . $json->encode( $res->{errors} );
    }

    # force all changes to be refreshed immediately
    $es->refresh_index();

    say "Total issues indexed: " . $es->count( match_all => {} )->{count};

    # search for issues
    while (1) {
        print "\nEnter keywords to search for, or an issue ID:\n  > ";
        my $keywords = <>;
        chomp $keywords;
        last unless $keywords;

        # if an issue ID, retrieve the doc and display it
        if ( $keywords =~ /^\d+$/ ) {
            my $doc = $es->get(
                index => 'issues',
                type  => 'entry',
                id    => $keywords
            )->{_source};
            for my $key ( sort keys %$doc ) {
                my $val = $doc->{$key} // '';
                say "$key: $val";
            }
            say '-' x 60;
            next;
        }

        # otherwise, we're searching for keywords, so filter
        # them to make sure the keywords don't include special chars
        $keywords = filter_keywords($keywords);

        my $result
            = $es->search( query => { field => { _all => $keywords } } );
        say "Total results found: " . $result->{hits}{total};
        printf( " - %02d: %s\n", $_->{_id}, $_->{_source}{title} )
            for @{ $result->{hits}{hits} };
    }
 

About Clinton Gormley

The doctor will see you now...