ElasticSearch.pm v0.36, now with extra sugar
ElasticSearch v 0.16.0 was released yesterday with a long list of new features, enhancements and bug fixes.
ElasticSearch.pm v 0.36 is on its way to CPAN as we speak.
Besides adding support for the new stuff in v 0.16, I've also added a few features:
scrolled_search()
It has always been possible to scroll through a long list of results in ElasticSearch, but doing so required a bit of repetitive code, which is now nicely packaged up in scrolled_search(). So you can do:
    my $scroll = $es->scrolled_search(
        search_type => 'scan',    # efficient search type for scrolling
        scroll      => '2m',      # cache search results for the next 2 minutes
    );
    while ( my $doc = $scroll->next(1) ) {
        # do something
    }
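To give a feel for the kind of bookkeeping this hides, here is a minimal sketch of a paging iterator in plain Perl. It runs entirely in memory: fetch_page() and make_scroller() are hypothetical names invented for this illustration and are not part of ElasticSearch.pm, and the real module talks to the cluster rather than to an array.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a cluster holding five docs.
my @all_docs = map { +{ _id => $_ } } 1 .. 5;

# Hypothetical helper: return up to $size docs starting at $offset.
sub fetch_page {
    my ( $offset, $size ) = @_;
    return () if $offset > $#all_docs;
    my $last = $offset + $size - 1;
    $last = $#all_docs if $last > $#all_docs;
    return @all_docs[ $offset .. $last ];
}

# Build an iterator that buffers one page at a time and hands out
# one doc per call, in the spirit of $scroll->next.
sub make_scroller {
    my ($page_size) = @_;
    my ( $offset, @buffer ) = (0);
    return sub {
        unless (@buffer) {
            @buffer = fetch_page( $offset, $page_size );
            $offset += @buffer;
        }
        return shift @buffer;    # undef once everything has been consumed
    };
}

my $next = make_scroller(2);
my @seen;
while ( my $doc = $next->() ) {
    push @seen, $doc->{_id};
}
print join( ',', @seen ), "\n";
```

The calling loop never sees the page boundaries, which is the convenience scrolled_search() offers for real searches.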
reindex()
Users on the mailing list are always asking how to reindex their data, either from one index to another on the same cluster, or from one cluster to another.
Now, scrolled_search() and reindex() make it easy to do this in a single command.
For example, to copy the ElasticSearch website index locally, you could do:
    my $local = ElasticSearch->new(
        servers => 'localhost:9200'
    );
    my $remote = ElasticSearch->new(
        servers    => 'search.elasticsearch.org:80',
        no_refresh => 1
    );
    my $source = $remote->scrolled_search(
        search_type => 'scan',
        scroll      => '5m'
    );
    $local->reindex( source => $source );
To copy one local index to another, make the title upper case, exclude docs of type boring, and preserve the version numbers from the original index:
    my $source = $es->scrolled_search(
        index       => 'old_index',
        search_type => 'scan',
        scroll      => '5m',
        version     => 1
    );
    $es->reindex(
        source     => $source,
        dest_index => 'new_index',
        transform  => sub {
            my $doc = shift;
            return if $doc->{_type} eq 'boring';
            $doc->{_source}{title} = uc( $doc->{_source}{title} );
            return $doc;
        }
    );
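If you want to check a transform callback before pointing it at real data, you can run it over a couple of hand-made docs. This is only a sketch: the sample docs and their field values are made up, and it exercises the callback on its own rather than showing how reindex() calls it internally.

```perl
use strict;
use warnings;

# Same callback as in the example: skip 'boring' docs, upper-case the title.
my $transform = sub {
    my $doc = shift;
    return if $doc->{_type} eq 'boring';
    $doc->{_source}{title} = uc( $doc->{_source}{title} );
    return $doc;
};

# Made-up docs in the shape the callback expects.
my @docs = (
    {   _type    => 'post',
        _id      => 1,
        _version => 3,
        _source  => { title => 'hello world' },
    },
    {   _type    => 'boring',
        _id      => 2,
        _version => 1,
        _source  => { title => 'skip me' },
    },
);

# A bare return yields an empty list here, so skipped docs simply vanish.
my @kept = map { $transform->($_) } @docs;

printf "%s v%d: %s\n", $_->{_id}, $_->{_version}, $_->{_source}{title}
    for @kept;
```

Only the first doc survives, with its title upper-cased and its _version field untouched by the callback.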
no_refresh
By default, ElasticSearch.pm retrieves a list of live nodes from the ElasticSearch cluster, and round-robins around them.
However, if you are talking to a remote ES cluster, or a cluster behind a proxy, this may not be desirable behaviour. The no_refresh parameter turns off the discovery of live nodes. Instead, ES.pm round-robins through the list of servers passed to new(), and can fail over between the servers in that list:
    my $es = ElasticSearch->new(
        servers    => [ 'es1.search.com:80', 'es2.search.com:80' ],
        no_refresh => 1
    );
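The idea of round-robining over a fixed list with failover can be sketched in a few lines of plain Perl. This is an illustration of the general technique only, not ES.pm's actual code: pick_server() and the %down hash are hypothetical, and a real client would detect failures from the network rather than from a lookup table.

```perl
use strict;
use warnings;

my @servers = ( 'es1.search.com:80', 'es2.search.com:80' );

# Hypothetical: pretend the first node is currently unreachable.
my %down = ( 'es1.search.com:80' => 1 );

my $position = 0;

# Try each server at most once per call, starting from where the
# previous call left off, and skip any that are marked as down.
sub pick_server {
    for ( 1 .. @servers ) {
        my $server = $servers[ $position++ % @servers ];
        return $server unless $down{$server};
    }
    die "No servers available\n";
}

my @picked = map { pick_server() } 1 .. 3;
print "@picked\n";
```

With es1 marked as down, every request ends up on es2; clear %down and the picks alternate between the two again.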
Music to my ears. :)