Elasticsearch Templates

When working with Elasticsearch, one has to decide how the ingested content gets analyzed. Templates ease the burden of managing analyzer settings: the settings are defined once and applied automatically to every matching index. Let's learn by example...

Catalan Stemmer

Here's a template that defines an analyzer, cat_stems, which uses the built-in Catalan stemmer. For example, both the singular porro and the plural porros are reduced to porr when analyzed by the stemmer. Moreover, this template is applied to any index created with a name that starts with cat.

Template

{
  "template": "cat*",
  "settings": {
    "analysis": {
      "filter": {
        "catalan_stemmer": {
          "type":       "stemmer",
          "language":   "catalan"
        }
      },
      "analyzer": {
        "cat_stems": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "catalan_stemmer"
          ]
        }
      }
    }
  }
}
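
To get a feel for which index names the cat* pattern picks up, here is a minimal Python sketch. It assumes * is the only wildcard in the pattern and that it matches any run of characters; the function name matches_template and the sample index names are made up for illustration, and this is not how Elasticsearch implements the check internally.

```python
import re

def matches_template(pattern: str, index_name: str) -> bool:
    """Return True if index_name matches pattern, treating '*' as the only wildcard."""
    # Escape the literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, index_name) is not None

# Hypothetical index names, checked against the pattern from the template above.
for name in ["catalan", "cat", "cats-2014", "spanish", "concat"]:
    print(name, matches_template("cat*", name))
```

Only the names that begin with cat match; concat contains cat but does not start with it, so the template would not apply.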

Register Template

Here's how we register the new template with Elasticsearch so that any future index whose name starts with cat will include the analyzer cat_stems in its settings. The command assumes the template above has been saved as cat_stems.template.json.


curl -H "Content-Type: application/json" --data-binary @cat_stems.template.json -XPUT http://localhost:9200/_template/catalan_stemmer

Get Template

To see what we just created:

 curl -XGET 'http://localhost:9200/_template/catalan_stemmer?pretty'

which results in:

{
  "catalan_stemmer" : {
    "order" : 0,
    "template" : "cat*",
    "settings" : {
      "index.analysis.filter.catalan_stemmer.type" : "stemmer",
      "index.analysis.analyzer.cat_stems.tokenizer" : "standard",
      "index.analysis.analyzer.cat_stems.filter.0" : "lowercase",
      "index.analysis.filter.catalan_stemmer.language" : "catalan",
      "index.analysis.analyzer.cat_stems.filter.1" : "catalan_stemmer"
    },
    "mappings" : { },
    "aliases" : { }
  }
}
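
The stored template shows the same settings as the file we uploaded, only flattened into dotted keys with an index. prefix, and with list entries numbered .0, .1. The following Python sketch reproduces that flattening so the correspondence is easy to see. The helper name flatten is hypothetical, and this is an illustration of the key layout, not Elasticsearch's own code.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a single dict with dotted keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become .0, .1, ...
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        child_prefix = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, child_prefix))
    return flat

# The "settings" section of the template from this post.
settings = {
    "analysis": {
        "filter": {
            "catalan_stemmer": {"type": "stemmer", "language": "catalan"}
        },
        "analyzer": {
            "cat_stems": {
                "tokenizer": "standard",
                "filter": ["lowercase", "catalan_stemmer"],
            }
        },
    }
}

flat = flatten(settings, "index")
for key in sorted(flat):
    print(key, "=", flat[key])
```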

Create an Index

Now when we create the following index it will include the analyzer cat_stems, since the name of the index starts with cat.

  curl -XPUT http://localhost:9200/catalan

One can verify the new index has the analyzer defined as desired:

  curl -XGET 'http://localhost:9200/catalan/_settings?pretty'

which returns:

{
  "catalan" : {
    "settings" : {
      "index" : {
        "uuid" : "kD_fllqCSZ60ZdoqpAuSEw",
        "analysis" : {
          "analyzer" : {
            "cat_stems" : {
              "filter" : [ "lowercase", "catalan_stemmer" ],
              "tokenizer" : "standard"
            }
          },
          "filter" : {
            "catalan_stemmer" : {
              "type" : "stemmer",
              "language" : "catalan"
            }
          }
        },
        "number_of_replicas" : "0",
        "number_of_shards" : "1",
        "version" : {
          "created" : "1030299"
        }
      }
    }
  }
}

Index against the Template

Now that we have created an index that uses the custom analyzer cat_stems, we can run some text through it with the _analyze endpoint to see what would be indexed.


curl -XGET 'http://localhost:9200/catalan/_analyze?analyzer=cat_stems&pretty' -d "porros amb balsàmic"

which outputs:

{
  "tokens" : [ {
    "token" : "porr",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "amb",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "balsamic",
    "start_offset" : 10,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

Notice the stemmed value porr, which is what both porro and porros are reduced to. This allows us to match documents that contain either the singular or plural form of porro.
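
Why stemming makes both forms match can be shown with a toy inverted index in Python. This is only a sketch: the STEMS table is copied from the analyzer output in this post rather than computed by a real stemmer, document 2 is made up, and the helper names (analyze, build_index, search) are hypothetical.

```python
from collections import defaultdict

# Stems copied from the analyzer output in this post; a real stemmer computes them.
STEMS = {"porro": "porr", "porros": "porr", "balsàmic": "balsamic"}

def analyze(text):
    """Lowercase, split on whitespace, and look up each token's stem."""
    return [STEMS.get(token, token) for token in text.lower().split()]

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

docs = {1: "porros amb balsàmic", 2: "porro"}  # document 2 is a made-up example
index = build_index(docs)
print(sorted(search(index, "porro")))   # [1, 2]
print(sorted(search(index, "porros")))  # [1, 2]
```

Because the same analysis runs on both the documents and the query, singular and plural land on the same term, porr, and either query finds both documents.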

As a final note, the built-in Catalan stemmer also does the asciifolding we saw before: balsàmic comes out as balsamic.

About mateu

I blog about Perl.