Elasticsearch Templates
When working with Elasticsearch, you have to decide how to manage the analysis of the content you ingest. Index templates ease the burden of managing analyzer settings. Let's learn by example.
Catalan Stemmer
Here's a template that defines an analyzer, cat_stems, which uses the built-in Catalan stemmer. For example, both the singular porro and the plural porros are reduced to porr when analyzed by the stemmer. The template is applied to any index created with a name that starts with cat.
Template
{
  "template": "cat*",
  "settings": {
    "analysis": {
      "filter": {
        "catalan_stemmer": {
          "type": "stemmer",
          "language": "catalan"
        }
      },
      "analyzer": {
        "cat_stems": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "catalan_stemmer" ]
        }
      }
    }
  }
}
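To get a feel for which index names the pattern cat* covers, here is a minimal Python sketch. It is only an approximation: it uses the standard library's fnmatchcase as a stand-in for Elasticsearch's own matching, and the index names in the loop are made up for illustration.

```python
from fnmatch import fnmatchcase

# Pattern taken from the "template" field of the template above.
TEMPLATE_PATTERN = "cat*"


def template_applies(index_name, pattern=TEMPLATE_PATTERN):
    """Return True if index_name matches the template's wildcard pattern.

    Assumption: fnmatchcase is a stand-in for Elasticsearch's matching.
    The two agree for a simple trailing-* pattern such as "cat*".
    """
    return fnmatchcase(index_name, pattern)


# Hypothetical index names, used only for illustration.
for name in ["catalan", "cat", "catalog-2014", "spanish", "scatter"]:
    print(name, template_applies(name))
```

Note that a name like catalog-2014 also matches, so a short prefix such as cat can pick up indices that have nothing to do with Catalan. A more specific pattern may be the safer choice on a shared cluster.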
Register Template
Here's how we register the new template with Elasticsearch so that any future index whose name starts with cat will include the analyzer cat_stems in its settings.
curl -H "Content-Type: application/json" --data-binary @cat_stems.template.json -XPUT http://localhost:9200/_template/catalan_stemmer
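If you would rather register the template from a script, the same PUT request can be sketched in Python with only the standard library. This is an illustrative sketch, not an official client: the helper name build_put_template_request is hypothetical, and the URL assumes the same local node as the curl command. The request is built and printed but not sent.

```python
import json
import urllib.request

# Assumption: the same local node used by the curl examples.
ES_URL = "http://localhost:9200"


def build_put_template_request(name, template, base_url=ES_URL):
    """Build (but do not send) a PUT request that registers a template."""
    body = json.dumps(template).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_template/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Same content as cat_stems.template.json.
template = {
    "template": "cat*",
    "settings": {
        "analysis": {
            "filter": {
                "catalan_stemmer": {"type": "stemmer", "language": "catalan"}
            },
            "analyzer": {
                "cat_stems": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "catalan_stemmer"],
                }
            },
        }
    },
}

request = build_put_template_request("catalan_stemmer", template)
print(request.get_method(), request.full_url)

# To actually send it to a running node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```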
Get Template
To see what we just created:
curl -XGET 'http://localhost:9200/_template/catalan_stemmer?pretty'
which results in:
{
  "catalan_stemmer" : {
    "order" : 0,
    "template" : "cat*",
    "settings" : {
      "index.analysis.filter.catalan_stemmer.type" : "stemmer",
      "index.analysis.analyzer.cat_stems.tokenizer" : "standard",
      "index.analysis.analyzer.cat_stems.filter.0" : "lowercase",
      "index.analysis.filter.catalan_stemmer.language" : "catalan",
      "index.analysis.analyzer.cat_stems.filter.1" : "catalan_stemmer"
    },
    "mappings" : { },
    "aliases" : { }
  }
}
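The settings in this response are the nested settings from the template, flattened into dotted keys under an index. prefix, with list entries numbered by position. The Python sketch below reproduces that mapping to make it easier to read. It is not Elasticsearch's own code, and the function name flatten_settings is made up for illustration.

```python
def flatten_settings(settings, prefix="index"):
    """Flatten nested settings into dotted keys, as seen in the response.

    Dicts extend the key path; list items get their position appended
    (for example "...filter.0"). Only lists of plain values are handled.
    """
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten_settings(value, path))
        elif isinstance(value, list):
            for position, item in enumerate(value):
                flat[f"{path}.{position}"] = item
        else:
            flat[path] = value
    return flat


# The "settings" part of the template registered earlier.
settings = {
    "analysis": {
        "filter": {
            "catalan_stemmer": {"type": "stemmer", "language": "catalan"}
        },
        "analyzer": {
            "cat_stems": {
                "tokenizer": "standard",
                "filter": ["lowercase", "catalan_stemmer"],
            }
        },
    }
}

for key, value in sorted(flatten_settings(settings).items()):
    print(key, "=", value)
```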
Create an Index
Now when we create the following index it will include the analyzer cat_stems, since the name of the index starts with cat.
curl -XPUT http://localhost:9200/catalan
One can verify the new index has the analyzer defined as desired:
curl -XGET 'http://localhost:9200/catalan/_settings?pretty'
which returns:
{
  "catalan" : {
    "settings" : {
      "index" : {
        "uuid" : "kD_fllqCSZ60ZdoqpAuSEw",
        "analysis" : {
          "analyzer" : {
            "cat_stems" : {
              "filter" : [ "lowercase", "catalan_stemmer" ],
              "tokenizer" : "standard"
            }
          },
          "filter" : {
            "catalan_stemmer" : {
              "type" : "stemmer",
              "language" : "catalan"
            }
          }
        },
        "number_of_replicas" : "0",
        "number_of_shards" : "1",
        "version" : {
          "created" : "1030299"
        }
      }
    }
  }
}
Index against the Template
Now that we have created an index that uses the custom analyzer, cat_stems, we can test it against some text using the _analyze API.
curl -XGET 'http://localhost:9200/catalan/_analyze?analyzer=cat_stems&pretty' -d "porros amb balsàmic"
which outputs:
{
  "tokens" : [ {
    "token" : "porr",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "amb",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "balsamic",
    "start_offset" : 10,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
Notice the stemmed value porr, which is what both porro and porros are reduced to. This allows us to match documents that contain either the singular or the plural form of porro.
As a final note, the built-in Catalan stemmer also does the asciifolding we saw before: in the output above, balsàmic comes out as balsamic.