ElasticSearchX::Sequence - a blazing fast ticket server
I'm considering ditching my RDBMS for my next application and using ElasticSearch as my only data store.
My home-grown framework uses unique IDs for all objects, which currently come from a MySQL auto-increment column, and it expects every unique ID to be an integer.
ElasticSearch has its own unique auto-generated IDs, but:
- they look like this: 'KpSb_Jd_R56dH5Qx6TtxVA', which I'd say is less human-readable than an integer
- I would need to change a fair bit of legacy code to migrate to non-integer IDs
Initially I thought I could keep MySQL around as a ticket server, as described by Flickr. But then I wondered if I could achieve the same thing by abusing ElasticSearch's built-in versioning, allowing me to ditch MySQL completely and giving me a distributed, highly available ticket server into the bargain.
The logic is simple: when you index a document in ElasticSearch, it returns a new version number for the document, which is always incrementing and is guaranteed to be unique across the cluster.
# MAIL ID
curl -XPUT 'http://127.0.0.1:9200/sequence/sequence/mail_id?pretty=1' -d '{}'
# {
# "ok" : true,
# "_index" : "sequence",
# "_id" : "mail_id",
# "_type" : "sequence",
# "_version" : 1 # note: version number
# }
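In client code, the next ID in the sequence is simply the _version field of that response. A minimal Python sketch of the extraction step, using the sample response above (the function name is my own illustration, not part of any module):

```python
import json

def version_from_response(body):
    """Return the _version field -- the next sequence value --
    from an ElasticSearch index response body."""
    return json.loads(body)["_version"]

# The response body returned by the PUT above:
response = '''{
  "ok" : true,
  "_index" : "sequence",
  "_id" : "mail_id",
  "_type" : "sequence",
  "_version" : 1
}'''

print(version_from_response(response))  # → 1
```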
We can have multiple distinct sequences by storing a document with a different ID for each sequence.
curl -XPUT 'http://127.0.0.1:9200/sequence/sequence/other_id?pretty=1' -d '{}'
# {
# "ok" : true,
# "_index" : "sequence",
# "_id" : "other_id", # note: different ID
# "_type" : "sequence",
# "_version" : 1
# }
ElasticSearch enables a number of features by default which are very useful when using it as a document store and full-text search server, but which aren't relevant here and will just slow things down.
The amount of data will be tiny, so our index needs only one primary shard, not the five that ElasticSearch creates by default. But for high-availability purposes, we'd like this shard to be replicated across all nodes in our cluster. So the index settings look like this:
"settings" : {
"number_of_shards" : 1,
"auto_expand_replicas" : "0-all"
},
For the type mapping (like a schema in a database), we want to turn off the _all and _source fields, disable indexing of the _type field, and disable indexing of the document contents themselves:
"sequence" : {
"_source" : { "enabled" : 0 },
"_all" : { "enabled" : 0 },
"_type" : { "index" : "no" },
"enabled" : 0
}
So the full command to create the index and set the type mapping looks like this:
curl -XPUT 'http://127.0.0.1:9200/sequence/?pretty=1' -d '
{
"settings" : {
"number_of_shards" : 1,
"auto_expand_replicas" : "0-all"
},
"mappings" : {
"sequence" : {
"_source" : { "enabled" : 0 },
"_all" : { "enabled" : 0 },
"_type" : { "index" : "no" },
"enabled" : 0
}
}
}
'
Requesting a single ID (indexing the doc to get a new version) at a time is going to be relatively slow, as there is a fair bit of HTTP latency per request. This is fine for normal use, but our ticket server has to be super fast.
So instead, I'm going to request several new version numbers at once using the bulk API, and buffer them.
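The bulk request body itself is mechanical to build: one action line plus one empty source document per ID requested. A rough Python sketch of that step (the helper name and defaults are my own illustration):

```python
def bulk_body(n, index="sequence", doc_type="sequence", doc_id="mail_id"):
    """Build a _bulk request body that re-indexes the same document
    n times; each action/source pair yields one new version number."""
    action = ('{"index":{"_index":"%s","_type":"%s","_id":"%s"}}'
              % (index, doc_type, doc_id))
    # Bulk format: an action line, then a source line, each newline-terminated
    return "".join(action + "\n{}\n" for _ in range(n))

print(bulk_body(2))
```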
curl -XPOST 'http://127.0.0.1:9200/_bulk?pretty=1' -d '
{"index":{"_index":"sequence","_type":"sequence","_id":"mail_id"}}
{}
{"index":{"_index":"sequence","_type":"sequence","_id":"mail_id"}}
{}
[*** SNIP ***]
'
# {
# "items" : [
# {
# "index" : {
# "ok" : true,
# "_index" : "sequence",
# "_id" : "mail_id",
# "_type" : "sequence",
# "_version" : 1
# }
# },
# {
# "index" : {
# "ok" : true,
# "_index" : "sequence",
# "_id" : "mail_id",
# "_type" : "sequence",
# "_version" : 2
# }
# },
[*** SNIP ***]
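Client-side, the version numbers from a bulk response can be queued up and handed out one at a time. A hypothetical Python sketch of that buffering, fed with a response like the one above (the class and method names are mine; a real implementation would refill by POSTing to /_bulk whenever the buffer runs dry):

```python
import json
from collections import deque

class SequenceBuffer:
    """Hand out IDs from memory, refilled from a _bulk response."""

    def __init__(self):
        self._buf = deque()

    def refill(self, bulk_response_body):
        # Each item in the bulk response carries one new _version
        items = json.loads(bulk_response_body)["items"]
        self._buf.extend(item["index"]["_version"] for item in items)

    def next(self):
        return self._buf.popleft()

bulk_response = '''{"items":[
  {"index":{"ok":true,"_index":"sequence","_id":"mail_id","_type":"sequence","_version":1}},
  {"index":{"ok":true,"_index":"sequence","_id":"mail_id","_type":"sequence","_version":2}}
]}'''

buf = SequenceBuffer()
buf.refill(bulk_response)
print(buf.next(), buf.next())  # → 1 2
```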
ElasticSearchX::Sequence
I've wrapped up all of the above and released it as ElasticSearchX::Sequence:
use ElasticSearch();
use ElasticSearchX::Sequence();
my $es = ElasticSearch->new();
my $seq = ElasticSearchX::Sequence->new( es => $es );
$seq->bootstrap(); # setup the index and type mapping
my $it = $seq->sequence('mail_id');
my $mail_id = $it->next;
Benchmarks
I wrote a small benchmark script which compares:
- MySQL, using the ticket method described by Flickr
- this module, using the httptiny (HTTP::Tiny) backend
- this module, using the curl backend
- this module, using the curl backend but only requesting blocks of 10 IDs at a time
The results (run on my laptop) are pretty startling:
                Rate  es_curl_10  db_ticket  es_tiny  es_curl
es_curl_10   38760/s          --       -48%     -55%     -72%
db_ticket    74627/s         93%         --     -13%     -47%
es_tiny      85470/s        121%        15%       --     -39%
es_curl     140845/s        263%        89%      65%       --
If you are already using ElasticSearch as your search server (and if you're not, you should be - it's fantastic), and you're currently using your DB as a ticket server, then it's worth considering moving this function over to ElasticSearch instead.
But if you're going to go this route with the non-int keys, why not just use GUIDs and skip the ticket server?
Hi JT
While I will probably use ES's non-int keys for my new application, my existing code expects integer keys (and they're easier to read).
So migration would be made easier by having this option available.
Regardless, I wanted to see how well ES could perform this task.