Renaming Perl client for Elasticsearch

Dear Perl'ers

I need your help to choose a new name for the official Perl client for Elasticsearch.

Read more here: http://www.elasticsearch.org/blog/renaming-perl-client/


RFC: Single or multiple instances of ORM objects?

In our homegrown ORM we have an in-memory cache, which enables us to ensure that only one instance of any object is live in memory at any one time.

In other words:


    $one = MyObject->get(123);
    $two = MyObject->get(123);

    refaddr($one) == refaddr($two)

I find this setup useful because:

  • if you update one copy of the object, all other copies automatically update
  • get’ing the object again is cheap
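A minimal sketch of the single-instance idea, assuming a plain hash as the in-memory cache. The `MyObject` internals and the `clear_cache` method are hypothetical stand-ins, not the real ORM:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

package MyObject;

my %cache;    # id => the single live instance (hypothetical in-memory cache)

sub get {
    my ( $class, $id ) = @_;

    # Reuse the live instance if there is one, otherwise build and remember it
    return $cache{$id} ||= bless( { id => $id, name => "object $id" }, $class );
}

# Would be called at the end of each web request
sub clear_cache { %cache = () }

package main;

my $one = MyObject->get(123);
my $two = MyObject->get(123);

# An update through one variable is visible through the other
$one->{name} = 'renamed';

print "same instance\n" if refaddr($one) == refaddr($two);
print "$two->{name}\n";    # prints "renamed"
```

Because both variables hold the same reference, there is nothing to synchronise: the "other copy" is the object itself.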

When I do a search against the DB, it returns a list of objects, which I can then retrieve (in bulk) from:

-> the in memory cache
  -> memcached
    -> the DB
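The tiered bulk lookup could be sketched roughly like this. Plain hashes stand in for memcached and the DB, and `get_many` is a hypothetical name, so treat it as an illustration of the fall-through rather than the real code:

```perl
use strict;
use warnings;

# Stand-ins for the three tiers; real code would talk to memcached and the DB
my %memory;
my %memcached = ( 2 => { id => 2, name => 'from memcached' } );
my %db        = (
    1 => { id => 1, name => 'from db' },
    2 => { id => 2, name => 'from db' },
    3 => { id => 3, name => 'from db' },
);

sub get_many {
    my @ids     = @_;
    my @missing = @ids;
    my %found;

    # Ask each tier only for the IDs that the previous tiers could not supply
    for my $tier ( \%memory, \%memcached, \%db ) {
        last unless @missing;
        my @still_missing;
        for my $id (@missing) {
            if ( exists $tier->{$id} ) { $found{$id} = $tier->{$id} }
            else                       { push @still_missing, $id }
        }
        @missing = @still_missing;
    }

    # Promote everything we found into the in-memory cache
    $memory{$_} = $found{$_} for keys %found;

    # Return the objects in the order they were requested
    return map { $memory{$_} } grep { exists $memory{$_} } @ids;
}

my @objects = get_many( 1, 2, 3 );
print "$_->{id}: $_->{name}\n" for @objects;
```

A second call for the same IDs is answered entirely from `%memory`, which is what makes get'ing an object again cheap.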

No DB-based object contains another DB-based object, to avoid circular references. Instead, it just contains the ID of the object. Retrieving the actual object is cheap (assuming it has already been loaded) because we can just request the single instance of that object from the in-memory cache.

The in-memory cache is cleared at the end of each web-request.

The above is pretty similar to how KiokuDB works.

THE FUTURE AND BEYOND:

I’m currently working on an “ORM” that uses ElasticSearch as its backend. (“ORM” is in quotes because ES functions as a Lucene-powered document store, rather than being a relational DB).

I’d like to replicate the current functionality, because I think it has merits, but there is a complication:

Time doesn’t necessarily flow forwards

To explain:

  • ES has real-time GET. In other words, as soon as a document has been indexed (saved), it is available to be retrieved by its unique ID
  • When searching for documents, the full document is returned (by default), which means that you don’t have to do a second request to GET the document, but:
  • ES has NEAR-real-time SEARCH. Once a second (by default), the search view is refreshed to include changes that have occurred during the last second

What this means is that I could:


    GET doc 123        -> returns version 6
    SEARCH for doc 123 -> returns version 5

This would normally never happen in a traditional DB, because updates are atomic, and indexes are updated as the document is indexed. But it could happen in a master-slave setup where there is replication lag.

Also, I’m guessing this is a common scenario in NoSQL datastores.

Note:

This is an issue just for the current request, not for writes to ES. Every doc in ES has a _version number, and if you try to update the wrong version, it will throw a Conflict error, in which case you can:

  • get the latest version, reapply your changes and save, or
  • instruct ES to ignore the version and to update the doc regardless
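To make the two options concrete, here is a toy in-memory store that mimics the version check. It is NOT the ElasticSearch.pm API: `store_get`, `store_save` and `update_with_retry` are hypothetical helpers, and omitting the version argument stands in for "ignore the version":

```perl
use strict;
use warnings;

# Toy store: id => { version, data }. Only mimics the version check.
my %store = ( 123 => { version => 5, data => { name => 'Joe' } } );

sub store_get {
    my $id  = shift;
    my $doc = $store{$id};
    return { version => $doc->{version}, data => { %{ $doc->{data} } } };
}

sub store_save {
    my ( $id, $data, $version ) = @_;
    my $doc = $store{$id};

    # A defined version must match the stored one; undef means "save regardless"
    die "Conflict\n" if defined $version && $version != $doc->{version};
    $store{$id} = { version => $doc->{version} + 1, data => { %$data } };
    return $store{$id}{version};
}

# Option 1: on conflict, get the latest version, reapply the change and save
sub update_with_retry {
    my ( $id, $change, $doc ) = @_;
    $doc ||= store_get($id);
    for ( 1 .. 3 ) {
        $change->( $doc->{data} );
        my $new = eval { store_save( $id, $doc->{data}, $doc->{version} ) };
        return $new if defined $new;
        die $@ unless $@ eq "Conflict\n";
        $doc = store_get($id);
    }
    die "Gave up after repeated conflicts\n";
}

my $stale = store_get(123);                          # we hold version 5
store_save( 123, { name => 'Joe', age => 30 } );     # option 2: no version check
my $version = update_with_retry( 123, sub { $_[0]{name} = 'Bob' }, $stale );
print "saved as version $version\n";                 # prints "saved as version 7"
```

The first attempt fails because `$stale` is still at version 5; the retry reloads version 6, reapplies the change and succeeds, so the other writer's `age` field survives.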

So where might this be a problem:

Scenarios:


    $a = get     -> version 1
    $b = search  -> version 1

This one is easy. $b can just reuse the object in $a.


    $a = get     -> version 1
    $b = search  -> version 1
    $a->change()
    $a->save()   -> version 2

Potentially, the object no longer matches the search that you did, so you may be displaying incorrect results. (eg you search for name == ‘Joe’, then change name to ‘Bob’). But this looks like a reasonable process to me.


    $a = get     -> version 2
    $b = search  -> version 1

Our search has returned an older version of the object. The newer version might or might not match the search parameters. Do we display the old results, or the new results?


    $a = get     -> version 1
    $a->change()
    $b = search  -> version 1

We have a changed (but as yet unsaved) object in the cache. Should $b contain the changed object, or the pristine object?


    $a = get     -> version 1
    $a->change()
    $b = search  -> version 2

We have an old (and changed) version in $a. We know that a newer version already exists in the DB, so we’ll get a conflict error if we try to save $a. What do we do?

Proposal:

I think my logic will look something like this:


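    my ($class,$id,$version,$data) = @_;

    if (my $cached = $cache->{$id}) {

        # the cached instance is already at least as new
        return $cached
            if $version <= $cached->{version};

        # refresh the cached instance in place, unless it has unsaved edits
        return $cached->re_new($data)
            unless $cached->has_changed;

    }
    return $cache->{$id} = $class->new($data);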

In other words, all instances of the object are always updated to the latest version, EXCEPT if the current instance has been edited and not yet saved. (Saving will throw a conflict error later on anyway).
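Here is a self-contained version of that logic to experiment with. The `Doc` class, its `set` method and the `from_result` name are hypothetical, and I am assuming that `re_new` overwrites the existing instance in place:

```perl
use strict;
use warnings;

package Doc;

my %cache;    # id => live instance; would be cleared at the end of each request

sub new {
    my ( $class, $id, $version, $data ) = @_;
    return bless(
        { id => $id, version => $version, data => { %$data }, changed => 0 },
        $class
    );
}

# Overwrite an existing instance in place, so that every holder sees the update
sub re_new {
    my ( $self, $version, $data ) = @_;
    $self->{version} = $version;
    $self->{data}    = { %$data };
    $self->{changed} = 0;
    return $self;
}

sub set {
    my ( $self, $key, $value ) = @_;
    $self->{data}{$key} = $value;
    $self->{changed} = 1;
    return $self;
}

sub has_changed { return $_[0]{changed} }

# Called for every doc that comes back from a get or a search
sub from_result {
    my ( $class, $id, $version, $data ) = @_;

    if ( my $cached = $cache{$id} ) {
        return $cached if $version <= $cached->{version};
        return $cached->re_new( $version, $data ) unless $cached->has_changed;
    }

    # Nothing cached, or the cached copy has unsaved edits: cache a fresh instance
    return $cache{$id} = $class->new( $id, $version, $data );
}

package main;

my $got   = Doc->from_result( 123, 1, { name => 'Joe' } );    # get    -> version 1
my $found = Doc->from_result( 123, 2, { name => 'Bob' } );    # search -> version 2
print "shared instance, now at version $got->{version}\n" if $got == $found;

$got->set( name => 'Jim' );                                   # edited, not saved
my $newer = Doc->from_result( 123, 3, { name => 'Ann' } );    # search -> version 3
print "edited copy kept: $got->{data}{name} (version $got->{version})\n";
print "search result:    $newer->{data}{name} (version $newer->{version})\n";
```

Note that in the last step the edited instance is left alone and the cache slot is taken over by a fresh instance, so from then on there are two live copies of doc 123.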

Also, if you wanted to “detach” an object, then you could clone it and update it independently.
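Detaching could be as simple as a deep copy. A small sketch using the core Storable module; the object layout here is a made-up example:

```perl
use strict;
use warnings;
use Storable qw(dclone);

# A made-up cached object: a blessed hash with nested data
my $live = bless( { id => 123, version => 1, data => { name => 'Joe' } }, 'Doc' );

# dclone copies the whole structure and keeps the blessing
my $detached = dclone($live);
$detached->{data}{name} = 'Bob';

print "$live->{data}{name} / $detached->{data}{name}\n";    # prints "Joe / Bob"
```

The clone never enters the cache, so later gets and searches keep returning the live instance.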

The only issue is that search results may contain a newer object which no longer matches the search parameters. Personally, I’m probably happy to live with this, but I probably need (a) a default setting and (b) a dynamic flag which the user can use to control this behaviour.

Thanks for reading to the bottom of this.

What do you think? See any obvious (or not-so-obvious) flaws?

(Also posted to PerlMonks)

ElasticSearch::Sequence - a blazing fast ticket server

I'm considering ditching my RDBMS for my next application and using ElasticSearch as my only data store.

My home-grown framework uses unique IDs for all objects, which currently come from a MySQL auto-increment column, and my framework expects the unique ID to be an integer.

ElasticSearch has its own unique auto-generated IDs, but:

  1. they look like this 'KpSb_Jd_R56dH5Qx6TtxVA' and I'd say are less human-readable than an integer
  2. I would need to change a fair bit of legacy code to…

Perlish concise query syntax for ElasticSearch

Announcing ElasticSearch::SearchBuilder

In Perl, we like to put important things first, so the ElasticSearch query language has always felt a bit wrong to me. For instance, to find docs where the content field contains the text keywords:

    # op        field       value
    { text => { content => 'keywords' } }

To me, the important part of this is the field that we’re operating on, so this feels more natural:

    # field        op       value
    { content => { text => 'keywords' }}
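For this simplest case the translation is just a swap of the two outer keys. A toy illustration follows; it is not how SearchBuilder is implemented, and `field_first_to_op_first` is a hypothetical helper that handles exactly one field and one operator:

```perl
use strict;
use warnings;

# Turn { field => { op => value } } into { op => { field => value } }
sub field_first_to_op_first {
    my ($clause) = @_;

    my ( $field, @extra_fields ) = keys %$clause;
    die "expected exactly one field\n"
        if @extra_fields || !defined $field;

    my ( $op, @extra_ops ) = keys %{ $clause->{$field} };
    die "expected exactly one operator\n"
        if @extra_ops || !defined $op;

    return { $op => { $field => $clause->{$field}{$op} } };
}

my $native = field_first_to_op_first( { content => { text => 'keywords' } } );

# $native is now { text => { content => 'keywords' } }
print join( ' => ', 'text', 'content', $native->{text}{content} ), "\n";
```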

So, in the spirit of SQL::Abstract I am proud to announce ElasticSearch::SearchBuilder, which is tightly integrated into the latest ElasticSearch.pm version 0.38.

Any method which takes a query or filter param (eg search()) now also accepts a queryb or filterb parameter instead, whose value will be parsed via SearchBuilder:

Do a full text search of the _all field for 'my keywords':

    $es->search( queryb=> 'my keywords' );

Find docs whose title field contains the text apple but not orange, and whose status field contains the value active:

    $es->search(
        queryb => {
            title => {
                '='  => 'apple',
                '!=' => 'orange'
            },
            -filter => {
                status => 'active'
            }
        }
    )

If you have suggestions to improve the API or the documentation, please get in touch.

You can try out ElasticSearch::SearchBuilder here.

And finally, a more complex example, to demonstrate how much more concisely you can write queries:

Out of all docs published in 2010 and tagged with either “perl” or “ruby”, find those whose title field contains “my keywords” (in which case consider the doc to be particularly relevant: boost 2), or whose title field is missing but whose body field contains “my keywords”:

    $es->search(
        queryb => {
            -or => [
                {
                    title => {
                        '=' => {
                            query => 'my keywords',
                            boost => 2
                    }}
                },
                {
                    body    => 'my keywords',
                    -filter => {
                        -missing => 'title'
                    }
                },
            ],
            -filter => {
                tags => [ 'perl','ruby' ],
                date => {
                    '>=' => '2010-01-01',
                    '<'  => '2011-01-01'
                },
            }
        }
    )

is the equivalent of:

    $es->search(
        query => {
            filtered => {
                filter => {
                    and => [
                        { 
                            terms => { 
                                tags => ["perl", "ruby"] 
                            } 
                        },
                        { 
                            numeric_range => { 
                                date => { 
                                    gte => "2010-01-01", 
                                    lt => "2011-01-01" 
                                }
                            }
                        }
                    ],
                },
                query  => {
                    bool => {
                        should => [
                            { 
                                text => { 
                                    title => { 
                                        boost => 2, 
                                        query => "my keywords" 
                                    } 
                                } 
                            },
                            { 
                                filtered => {
                                    filter => { 
                                        missing => { 
                                            field => "title" 
                                        } 
                                    },
                                    query  => { 
                                        text => { 
                                            body => "my keywords" 
                                        } 
                                    },
                                }
                            }
                        ],
                    }
                }
            }
        }
    )

Which looks better to you?

ElasticSearch.pm v0.37 released, with a small breaking change

Just released ElasticSearch.pm v0.37, which has a small breaking change.

In version 0.36, $scrolled_search->next() returned the next $size results. Now, by default, it returns just the next single result, which makes it easier to write:



    while ( my $result = $scroller->next ) {...}
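To show why one-result-at-a-time suits that loop, here is a toy iterator that pulls results in batches behind the scenes but hands them out singly. `ToyScroller` is a hypothetical stand-in, not the ElasticSearch.pm scroller:

```perl
use strict;
use warnings;

package ToyScroller;

sub new {
    my ( $class, @batches ) = @_;
    return bless( { batches => [@batches], buffer => [] }, $class );
}

sub next {
    my $self = shift;

    # Refill the buffer from the next batch when it runs dry
    while ( !@{ $self->{buffer} } && @{ $self->{batches} } ) {
        $self->{buffer} = [ @{ shift @{ $self->{batches} } } ];
    }

    # Returns undef once everything has been handed out, which ends the loop
    return shift @{ $self->{buffer} };
}

package main;

my $scroller = ToyScroller->new(
    [ { id => 1 }, { id => 2 } ],    # first batch of results
    [ { id => 3 } ],                 # second batch
);

my @seen;
while ( my $result = $scroller->next ) {
    push @seen, $result->{id};
}
print "@seen\n";    # prints "1 2 3"
```

The loop works because each result is a hash ref, which is always true, so only the final undef stops it.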