Elasticsearch Token Filters
We recently saw an example of an Elasticsearch token filter called catalan_stemmer. Other token filters are available for the Catalan language:
- catalan_stop
- catalan_elision
- catalan_keywords
Let's see what they do.
Stop
The catalan_stop filter removes a list of common words (stop words). Given the example search:
porros amb balsàmic
applying the catalan_stop filter removes the word amb (with), so it is never indexed.
This stop filter is defined as:
"catalan_stop": {
"type": "stop",
"stopwords": "_catalan_"
}
and it is customizable: stopwords also accepts an explicit array of words in place of the predefined _catalan_ list.
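To make the behaviour concrete, here is a minimal Python sketch of what a stop filter does to a token stream. It is a simulation, not Elasticsearch code: the function name remove_stopwords is hypothetical, and SAMPLE_STOPWORDS holds only the two stop words this post shows being removed, whereas the real _catalan_ list is much longer.

```python
# Two stop words seen in this post's examples; the built-in _catalan_ list is longer.
SAMPLE_STOPWORDS = {"amb", "i"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stopwords]


tokens = "porros amb balsàmic".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['porros', 'balsàmic']
```

Passing a different set as the second argument mirrors what customizing the stopwords setting would do.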
Elision
The catalan_elision filter removes elided articles from the start of a token. Given the example search:
Amanida d'escalivada
applying the catalan_elision filter would strip the d' prefix, leaving the token escalivada.
The catalan_elision filter is defined as:
"catalan_elision": {
"type": "elision",
"articles": [ "d", "l", "m", "n", "s", "t"]
}
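As a rough illustration of the idea, the following Python sketch strips a leading article plus apostrophe from each token. It only approximates the real filter: strip_elision is a hypothetical helper, and for simplicity the comparison here is case-sensitive, which is an assumption of this sketch.

```python
# Articles from the catalan_elision definition above.
ARTICLES = {"d", "l", "m", "n", "s", "t"}
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophes


def strip_elision(token, articles=ARTICLES):
    """Return the token without a leading elided article such as d'."""
    for apostrophe in APOSTROPHES:
        head, sep, tail = token.partition(apostrophe)
        if sep and tail and head in articles:
            return tail
    return token


tokens = "Amanida d'escalivada".split()
stripped = [strip_elision(token) for token in tokens]
print(stripped)  # ['Amanida', 'escalivada']
```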
Keywords
The catalan_keywords filter allows one to exclude certain words from being stemmed. An example definition of the keywords filter is:
"catalan_keywords": { "type": "keyword_marker", "keywords": ["porró"] }
In this example, the word porró would not be stemmed to porr.
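The effect of a keyword marker can be sketched in a few lines of Python. The stemmer below is only a stand-in (an assumption): it is a lookup table holding the two stems reported in this post's stem collision example, not the real Catalan stemming algorithm, and stem_unless_keyword is a hypothetical name.

```python
# Stand-in for catalan_stemmer: only the two stems this post reports.
OBSERVED_STEMS = {"porros": "porr", "porró": "porr"}


def stem(token):
    """Look up a stem, leaving unknown tokens unchanged."""
    return OBSERVED_STEMS.get(token, token)


def stem_unless_keyword(tokens, keywords):
    """Stem every token except those marked as keywords."""
    return [token if token in keywords else stem(token) for token in tokens]


protected = stem_unless_keyword(["porros", "porró"], keywords={"porró"})
print(protected)  # ['porr', 'porró']
```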
Catalan Analyzer - Sum the Parts
By definition, an Elasticsearch analyzer consists of one tokenizer and zero or more token filters. As it turns out, Elasticsearch has a built-in analyzer called catalan that combines the standard tokenizer with the following token filters, applied in this order:
- catalan_elision
- lowercase
- catalan_stop
- catalan_stemmer
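If we wanted to rebuild this analyzer ourselves, the index settings might look roughly like the Python dictionary below. This is a sketch under assumptions: rebuilt_catalan is a hypothetical analyzer name, the catalan_stemmer definition (type stemmer, language catalan) is taken from the Elasticsearch reference as I understand it, and the filter order follows that reference, with lowercase right after elision.

```python
import json

# Hypothetical analyzer name; filter definitions reuse those shown in this post.
catalan_settings = {
    "analysis": {
        "filter": {
            "catalan_elision": {
                "type": "elision",
                "articles": ["d", "l", "m", "n", "s", "t"],
            },
            "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
            "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
        },
        "analyzer": {
            "rebuilt_catalan": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "catalan_elision",
                    "lowercase",
                    "catalan_stop",
                    "catalan_stemmer",
                ],
            }
        },
    }
}

# Print the JSON that would go under "settings" when creating an index.
print(json.dumps(catalan_settings, indent=2))
```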
Analysis example
curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "Amanida d'escalivada amb bacallà i balsàmic"
indexes as:
aman
escaliv
bacall
balsamic
Note that the elision, stop and stemmer filters are all applied: d' is stripped, amb and i are dropped, and the remaining words are stemmed.
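On newer Elasticsearch releases the query-string form of _analyze is, as far as I know, no longer accepted, and the analyzer and text go in a JSON body instead. The Python sketch below builds such a request without sending it; build_analyze_request is a hypothetical helper and the host is assumed to be the same local node as in the curl example.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="catalan", host="http://localhost:9200"):
    """Build (but do not send) a POST /_analyze request with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("Amanida d'escalivada amb bacallà i balsàmic")
print(request.get_method(), request.full_url)

# Sending it requires a running node, for example:
# with urllib.request.urlopen(request) as response:
#     print([t["token"] for t in json.load(response)["tokens"]])
```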
Stem collision example
curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "porros i porró"
indexes into:
porr
porr
To distinguish between the two words, one could exclude porró from stemming using the catalan_keywords filter. However, search matches would then depend on whether the word is singular or plural. Fortunately, we can customize stemming further with the stemmer_override filter, which lets us define our own stemming rules.
Customized Stemming
Let's say we want to singularize porron. We could then add the filter:
"custom_stem": { "type": "stemmer_override", "rules": [ "porron=>porró" ] }
The order in which filters are applied matters. We want to apply the stemmer override before catalan_keywords and catalan_stemmer.
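Putting the pieces together, a custom analyzer with that ordering could be sketched as the settings below. The analyzer name custom_catalan is hypothetical, the catalan_stemmer definition is an assumption based on the Elasticsearch reference, and the placement of lowercase right after elision is likewise assumed; the other filter definitions are copied from this post.

```python
import json

# Hypothetical analyzer name; custom_stem must come before the keyword and stemmer filters.
custom_settings = {
    "analysis": {
        "filter": {
            "catalan_elision": {
                "type": "elision",
                "articles": ["d", "l", "m", "n", "s", "t"],
            },
            "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
            "custom_stem": {"type": "stemmer_override", "rules": ["porron=>porró"]},
            "catalan_keywords": {"type": "keyword_marker", "keywords": ["porró"]},
            "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
        },
        "analyzer": {
            "custom_catalan": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "catalan_elision",
                    "lowercase",
                    "catalan_stop",
                    "custom_stem",
                    "catalan_keywords",
                    "catalan_stemmer",
                ],
            }
        },
    }
}

chain = custom_settings["analysis"]["analyzer"]["custom_catalan"]["filter"]
# Sanity check on the ordering described in the text.
assert chain.index("custom_stem") < chain.index("catalan_keywords") < chain.index("catalan_stemmer")
print(json.dumps(custom_settings, ensure_ascii=False, indent=2))
```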
Go forth and analyze some corpus...
Analyzer Languages
Elasticsearch has built-in analyzers for over 30 languages:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai