ElasticSearch::Sequence - a blazing fast ticket server

I'm considering ditching my RDBMS for my next application and using ElasticSearch as my only data store.

My home-grown framework uses unique IDs for all objects; these currently come from a MySQL auto-increment column, and the framework expects each ID to be an integer.

ElasticSearch has its own unique auto-generated IDs, but:

  1. they look like this: 'KpSb_Jd_R56dH5Qx6TtxVA', which I'd say is less human-readable than an integer
  2. I would need to change a fair bit of legacy code to migrate to non-integer IDs

Initially I thought I could keep MySQL around as a ticket server, as described by Flickr, but then I wondered whether I could achieve the same thing by abusing ElasticSearch's built-in versioning, allowing me to ditch MySQL completely and giving me a highly available, distributed ticket server into the bargain.

The logic is simple: when you index a document in ElasticSearch, it returns a new version number for the document. The version number always increments, and is guaranteed to be unique across the cluster for that document.


# MAIL ID
curl -XPUT 'http://127.0.0.1:9200/sequence/sequence/mail_id?pretty=1' -d '{}'

# {
#    "ok" : true,
#    "_index" : "sequence",
#    "_id" : "mail_id",
#    "_type" : "sequence",
#    "_version" : 1         # note: version number
# }

We can have multiple distinct sequences by storing a document with a different ID for each sequence.


curl -XPUT 'http://127.0.0.1:9200/sequence/sequence/other_id?pretty=1' -d '{}'

# {
#    "ok" : true,
#    "_index" : "sequence",
#    "_id" : "other_id",    # note: different ID
#    "_type" : "sequence",
#    "_version" : 1
# }

ElasticSearch enables a number of features by default which are very useful when using it as a document store and full-text search server, but which aren't relevant here and will just slow it down.

The amount of data will be tiny, so our index needs only one primary shard, not the five that ElasticSearch creates by default. But for high availability, we'd like this shard to be replicated to every node in our cluster. So the index settings look like this:


   "settings" : {
      "number_of_shards" : 1,           
      "auto_expand_replicas" : "0-all"  
   },

For the type mapping (like a schema in a database), we want to turn off the _all and _source fields, disable indexing of the _type field, and disable indexing of the document itself (which is only ever going to be an empty hashref):


   "sequence" : {
      "_source" : { "enabled" : 0 },
      "_all"    : { "enabled" : 0 },
      "_type"   : { "index" : "no" },
      "enabled" : 0
   }

So the full command to create the index and set the type mapping looks like this:


curl -XPUT 'http://127.0.0.1:9200/sequence/?pretty=1'  -d '
{
   "settings" : {
      "number_of_shards"     : 1,           
      "auto_expand_replicas" : "0-all"  
   },
   "mappings" : {
      "sequence" : {
         "_source" : { "enabled" : 0 },
         "_all"    : { "enabled" : 0 },
         "_type"   : { "index" : "no" },
         "enabled" : 0
      }
   }
}
'

Requesting a single ID at a time (indexing the doc just to get a new version) is going to be relatively slow, as there is a fair bit of HTTP latency per request. This is fine for normal use, but our ticket server has to be super fast.

So instead, I'm going to request several new version numbers at once using the bulk API, and buffer them.


curl -XPOST 'http://127.0.0.1:9200/_bulk?pretty=1'  -d '
{"index":{"_index":"sequence","_type":"sequence","_id":"mail_id"}}
{}
{"index":{"_index":"sequence","_type":"sequence","_id":"mail_id"}}
{}
[*** SNIP ***]
'

# {
#    "items" : [
#       {
#          "index" : {
#             "ok" : true,
#             "_index" : "sequence",
#             "_id" : "mail_id",
#             "_type" : "sequence",
#             "_version" : 1
#          }
#       },
#       {
#          "index" : {
#             "ok" : true,
#             "_index" : "sequence",
#             "_id" : "mail_id",
#             "_type" : "sequence",
#             "_version" : 2
#          }
#       },
[*** SNIP ***]
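The client side of this scheme is straightforward. As a rough sketch (in Python rather than Perl, with the actual _bulk HTTP call stubbed out as a hypothetical fetch_block callback), the sequence just refills a local buffer from a bulk request whenever it runs dry:

```python
import json
from collections import deque

def make_bulk_body(index, type_, id_, n):
    # Build the newline-delimited JSON body for n identical index
    # requests, as in the curl example above: one action line plus
    # one empty-document source line per request.
    action = json.dumps({"index": {"_index": index, "_type": type_, "_id": id_}})
    return (action + "\n{}\n") * n

class BufferedSequence:
    """Hand out IDs one at a time, refilling from a bulk request
    (fetch_block) whenever the local buffer runs dry."""

    def __init__(self, fetch_block, block_size=100):
        # fetch_block(n) -> list of n new version numbers
        self._fetch_block = fetch_block
        self._block_size = block_size
        self._buffer = deque()

    def next(self):
        if not self._buffer:
            self._buffer.extend(self._fetch_block(self._block_size))
        return self._buffer.popleft()
```

In the real module, fetch_block would POST the output of make_bulk_body() to /_bulk and pull the _version field out of each item in the response; the names here are illustrative, not the module's actual API.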

ElasticSearchX::Sequence

I've wrapped up all of the above and released it as ElasticSearchX::Sequence:


use ElasticSearch();
use ElasticSearchX::Sequence();
 
my $es  = ElasticSearch->new();
my $seq = ElasticSearchX::Sequence->new( es => $es );
 
$seq->bootstrap();   # setup the index and type mapping
 
my $it  = $seq->sequence('mail_id');
 
my $mail_id = $it->next;

Benchmarks

I wrote a small benchmark script which compares:

  1. MySQL, using the ticket method described by Flickr
  2. this module, using the httptiny backend
  3. this module, using the curl backend
  4. this module, using the curl backend but only requesting blocks of 10 IDs at a time

The results (run on my laptop) are pretty startling:


               Rate es_curl_10  db_ticket    es_tiny    es_curl
es_curl_10  38760/s         --       -48%       -55%       -72%
db_ticket   74627/s        93%         --       -13%       -47%
es_tiny     85470/s       121%        15%         --       -39%
es_curl    140845/s       263%        89%        65%         --
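Each percentage cell in that table is just the row's rate relative to the column's rate. A quick sanity check (plain Python, with the rates copied from the benchmark output above):

```python
# Rates (IDs/sec) from the benchmark table above.
rates = {
    "es_curl_10": 38760,
    "db_ticket":  74627,
    "es_tiny":    85470,
    "es_curl":   140845,
}

def relative(row, col):
    # How much faster (positive) or slower (negative) row is than col,
    # as a whole percentage.
    return round((rates[row] / rates[col] - 1) * 100)
```

For example, relative("es_curl", "db_ticket") gives 89, matching the 89% cell: the curl backend is almost twice as fast as the MySQL ticket server.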

If you are already using ElasticSearch as your search server (and if you're not, you should be - it's fantastic) and you're currently using your DB as a ticket server, I'd consider moving this function over to ElasticSearch instead.

2 Comments

But if you're going to go this route with the non-int keys, why not just use GUIDs and skip the ticket server?


About Clinton Gormley

The doctor will see you now...