Elasticsearch Token Filters

We recently saw an example of an Elasticsearch token filter called catalan_stemmer. Other token filters are available for the Catalan language:

  • catalan_stop
  • catalan_elision
  • catalan_keywords

Let's see what they do.

Stop

The catalan_stop filter removes a list of common words (stop words). Given the example search:

  porros amb balsàmic

applying the catalan_stop filter removes the word amb (with) from the token stream, so it is never indexed.

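This stop filter is defined as:

  "catalan_stop": {
    "type":       "stop",
    "stopwords":  "_catalan_"
  }

and it is customizable: the stopwords setting also accepts an explicit list of words in place of the predefined _catalan_ list.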
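
To make the behaviour concrete, here is a minimal Python sketch of what a stop filter does to a stream of tokens. It is only an illustration, not how Elasticsearch implements it: the stop_filter function and the three-word STOPWORDS set are made up for this example, and the real _catalan_ list is much longer.

```python
# Toy stop filter: drops any token found in a stop word set.
# STOPWORDS is a tiny hypothetical sample, not the real _catalan_ list.
STOPWORDS = {"amb", "i", "de"}


def stop_filter(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stopwords]


print(stop_filter(["porros", "amb", "balsàmic"]))
# ['porros', 'balsàmic']
```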

Elision

The catalan_elision filter removes elisions. Given the example search:

  Amanida d'escalivada

applying the catalan_elision filter would strip the d' from the token, leaving escalivada to be indexed.

The catalan_elision filter is defined as:

  "catalan_elision": {
    "type":       "elision",
    "articles":   ["d", "l", "m", "n", "s", "t"]
  }
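
The effect of the articles list can be sketched in a few lines of Python. This is a toy stand-in rather than the real filter: the elision_filter function is hypothetical, and matching the article case-insensitively and accepting the typographic apostrophe are assumptions made for the example.

```python
# Toy elision filter: strips a leading article followed by an apostrophe.
# ARTICLES mirrors the "articles" list in the definition above.
ARTICLES = {"d", "l", "m", "n", "s", "t"}


def elision_filter(tokens, articles=ARTICLES):
    """Remove an elided prefix such as d' or l' from each token."""
    result = []
    for token in tokens:
        # Try the straight apostrophe first, then the typographic one.
        for apostrophe in ("'", "\u2019"):
            prefix, sep, rest = token.partition(apostrophe)
            if sep and rest and prefix.lower() in articles:
                token = rest
                break
        result.append(token)
    return result


print(elision_filter(["Amanida", "d'escalivada"]))
# ['Amanida', 'escalivada']
```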


Keywords

The catalan_keywords filter allows one to exclude certain words from being stemmed. An example definition of the keywords filter is:

  "catalan_keywords": {
    "type":       "keyword_marker",
    "keywords":   ["porró"] 
  }

In this example the word porró would not be stemmed to porr.
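
Here is a rough Python sketch of how a keyword marker and a stemmer cooperate. Everything in it is a stand-in: keyword_marker and toy_stemmer are hypothetical functions, and the stemmer is just a lookup table holding the stems the catalan analyzer produces for porros and porró, where a real stemmer would compute them.

```python
# Stems for the two example words; a real stemmer computes these
# algorithmically instead of looking them up.
TOY_STEMS = {"porros": "porr", "porró": "porr"}
KEYWORDS = {"porró"}


def keyword_marker(tokens, keywords=KEYWORDS):
    """Pair each token with a flag saying whether it is protected."""
    return [(token, token in keywords) for token in tokens]


def toy_stemmer(marked_tokens, stems=TOY_STEMS):
    """Stem unprotected tokens; pass protected ones through unchanged."""
    return [
        token if is_keyword else stems.get(token, token)
        for token, is_keyword in marked_tokens
    ]


print(toy_stemmer(keyword_marker(["porros", "porró"])))
# ['porr', 'porró']
```

The point of the two-step design is that the marker itself changes nothing; it only flags tokens, and it is the stemmer further down the chain that honours the flag.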

Catalan Analyzer - Sum the Parts

By definition, an Elasticsearch analyzer consists of one tokenizer and zero or more token filters. As it turns out, Elasticsearch has a built-in analyzer called catalan that combines the standard tokenizer with the following token filters, applied in this order:
  • catalan_elision
  • lowercase
  • catalan_stop
  • catalan_stemmer
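
If you want to tinker with the parts, the built-in analyzer can be approximated as a custom one. The Python sketch below only builds and prints the index settings as JSON; it does not talk to a server. The analyzer name rebuilt_catalan is a placeholder, and the filter order (lowercasing before stop word removal and stemming) is my reading of the Elasticsearch reference documentation, so check it against your version.

```python
import json

# Index settings that rebuild the catalan analyzer from its parts.
# "rebuilt_catalan" is a hypothetical name chosen for this example.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "catalan_elision": {
                    "type": "elision",
                    "articles": ["d", "l", "m", "n", "s", "t"],
                },
                "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
                "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
            },
            "analyzer": {
                "rebuilt_catalan": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase is built in and needs no definition above.
                    "filter": [
                        "catalan_elision",
                        "lowercase",
                        "catalan_stop",
                        "catalan_stemmer",
                    ],
                }
            },
        }
    }
}

# Print the JSON body one would send when creating an index.
print(json.dumps(index_body, indent=2, ensure_ascii=False))
```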

Analysis example


curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "Amanida d'escalivada amb bacallà i balsàmic"

indexes as:


aman
escaliv
bacall
balsamic

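Note that the elision, stop, lowercase and stemmer filters are all applied: the d' is stripped, amb and i are dropped, and the remaining words are lowercased and stemmed.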

Stem collision example


curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "porros i porró"

indexes into:


porr
porr

Both porros (leeks) and porró (a wine pitcher) collapse to the same stem. In order to distinguish between the two meanings one could exclude porró from stemming using the catalan_keywords filter. However, the unstemmed porró would then no longer match its own plural, so search matches would depend on whether the singular or the plural form was used. Fortunately, we can customize the stemming further using the stemmer_override filter to define our own stemming rules.

Customized Stemming

Let's say we want to singularize porron. We could then add the filter:

  "custom_stem": {
    "type": "stemmer_override",
    "rules": [ 
      "porron=>porró"
    ]
  }

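The order in which filters are applied matters. A stemmer_override filter applies its own mapping and then protects the mapped terms from later stemmers, so it has to run before any stemming filter. We want to apply the stemmer override before catalan_keywords and catalan_stemmer.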
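
Putting the pieces together, a custom analyzer along these lines might look like the sketch below. It is written in Python only to assemble and sanity-check the settings before printing them as JSON. The analyzer name custom_catalan is hypothetical, and the position of lowercase in the chain is an assumption on my part; the filter definitions are the ones shown in this post.

```python
import json

# Filter definitions used in this post.
filters = {
    "catalan_elision": {
        "type": "elision",
        "articles": ["d", "l", "m", "n", "s", "t"],
    },
    "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
    "custom_stem": {"type": "stemmer_override", "rules": ["porron=>porró"]},
    "catalan_keywords": {"type": "keyword_marker", "keywords": ["porró"]},
    "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
}

# The override comes before the keyword marker and the stemmer.
chain = [
    "catalan_elision",
    "lowercase",
    "catalan_stop",
    "custom_stem",
    "catalan_keywords",
    "catalan_stemmer",
]

# Every filter in the chain must be defined above or be built in.
undefined = [name for name in chain if name not in filters and name != "lowercase"]

# "custom_catalan" is a hypothetical analyzer name for this example.
index_body = {
    "settings": {
        "analysis": {
            "filter": filters,
            "analyzer": {
                "custom_catalan": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": chain,
                }
            },
        }
    }
}

print("undefined filters:", undefined)
print(json.dumps(index_body, indent=2, ensure_ascii=False))
```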

Go forth and analyze some corpus...

Analyzer Languages

Elasticsearch has built-in analyzers for over 30 languages:

 arabic, armenian, basque, brazilian, bulgarian,
 catalan, chinese, cjk, czech, danish, dutch,
 english, finnish, french, galician, german, greek,
 hindi, hungarian, indonesian, irish, italian,
 norwegian, persian, portuguese, romanian,
 russian, sorani, spanish, swedish, turkish, thai

About mateu

I blog about Perl.