Elasticsearch Token Filters

By mateu on September 29, 2014 9:48 PM

We recently saw an example of an Elasticsearch token filter called catalan_stemmer. Other token filters are available for the Catalan language:

catalan_stop
catalan_elision
catalan_keywords

Let's see what they do.

Stop

The catalan_stop filter removes a list of common words (stop words). Given the example search:

  porros amb balsàmic

applying the catalan_stop filter removes the word amb (with), so it is not indexed.

This stop filter is defined as:

  "catalan_stop": {
    "type":       "stop",
    "stopwords":  "_catalan_"
  }

and it is customizable: the stopwords parameter also accepts an explicit list of words.
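To make the effect concrete, here is a minimal Python sketch of what a stop filter does to a token stream. The three-word stop list is a hypothetical stand-in, not the real _catalan_ list, and stop_filter is an illustrative name rather than an Elasticsearch API.

```python
# Hypothetical, tiny stand-in for the real _catalan_ stop word list.
CATALAN_STOPWORDS = {"amb", "i", "de"}

def stop_filter(tokens, stopwords=CATALAN_STOPWORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stopwords]

tokens = "porros amb balsàmic".split()
print(stop_filter(tokens))  # ['porros', 'balsàmic']
```

Passing a different set as stopwords mimics customizing the filter with your own word list.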

Elision

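The catalan_elision filter removes elisions, that is, contracted articles and prepositions joined to the following word by an apostrophe. Given the example search:

  Amanida d'escalivada

applying the catalan_elision filter removes the d', so only escalivada is indexed for that word.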

The catalan_elision filter is defined as:

  "catalan_elision": {
    "type": "elision",
    "articles": [ "d", "l", "m", "n", "s", "t"]
  }
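As a rough illustration, the elision step can be approximated in Python with a regular expression built from the same article list. This is only a sketch: elision_filter is a hypothetical name, and matching case-insensitively and accepting the typographic apostrophe are assumptions of this example, not a statement about how Elasticsearch implements the filter.

```python
import re

# Articles from the catalan_elision definition above.
ARTICLES = ["d", "l", "m", "n", "s", "t"]

# Match an article followed by an apostrophe at the start of a token.
# Both the ASCII apostrophe and the typographic one (U+2019) are accepted.
ELISION = re.compile(r"^(?:%s)['\u2019]" % "|".join(ARTICLES), re.IGNORECASE)

def elision_filter(tokens):
    """Strip a leading elided article (e.g. d') from each token."""
    return [ELISION.sub("", t) for t in tokens]

print(elision_filter(["Amanida", "d'escalivada"]))  # ['Amanida', 'escalivada']
```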

Keywords

The catalan_keywords filter allows one to exclude certain words from being stemmed. An example definition of the keywords filter is:

  "catalan_keywords": {
    "type":       "keyword_marker",
    "keywords":   ["porró"] 
  }

In this example, the word porró would not be stemmed to porr.
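The interplay between the keyword marker and the stemmer can be sketched in Python. The toy_stemmer below is a deliberately crude stand-in that only strips a final "os" or "ó"; it is not the real Catalan stemming algorithm, and both function names are hypothetical.

```python
# Keyword list from the catalan_keywords definition above.
KEYWORDS = {"porró"}

def keyword_marker(tokens, keywords=KEYWORDS):
    """Pair each token with a flag saying whether it is protected."""
    return [(t, t in keywords) for t in tokens]

def toy_stemmer(marked_tokens):
    """Toy stand-in for catalan_stemmer: strips a final 'os' or 'ó'.

    Tokens flagged as keywords pass through unchanged.
    """
    stems = []
    for text, is_keyword in marked_tokens:
        if not is_keyword:
            for suffix in ("os", "ó"):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    break
        stems.append(text)
    return stems

print(toy_stemmer(keyword_marker(["porros", "porró"])))  # ['porr', 'porró']
```

With an empty keyword set, both words fall through to the stemmer and come out as porr.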

Catalan Analyzer - Sum the Parts

By definition, an Elasticsearch analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters. As it turns out, Elasticsearch has a built-in analyzer called catalan that combines the standard tokenizer with the following token filters, applied in this order:

catalan_elision
lowercase
catalan_stop
catalan_stemmer
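Written out as index settings, that composition might look like the following Python dictionary, which serializes to the JSON body one would send when creating an index. The analyzer name my_catalan is hypothetical, and the filter order shown (elision, lowercase, stop, stemmer) reflects my reading of the Elasticsearch reference for the catalan analyzer, so treat it as an assumption to verify against your version.

```python
import json

# Sketch of a custom analyzer assembled from the parts described above.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "catalan_elision": {
                    "type": "elision",
                    "articles": ["d", "l", "m", "n", "s", "t"],
                },
                "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
                "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
            },
            "analyzer": {
                # "my_catalan" is a hypothetical name for this sketch.
                "my_catalan": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "catalan_elision",
                        "lowercase",
                        "catalan_stop",
                        "catalan_stemmer",
                    ],
                }
            },
        }
    }
}

print(json.dumps(settings, indent=2, ensure_ascii=False))
```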

Analysis example

curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "Amanida d'escalivada amb bacallà i balsàmic"

indexes as:

  aman
  escaliv
  bacall
  balsamic

Note that the elision, lowercase, stop and stemmer filters are all applied.
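The same request can be prepared from Python with the standard library. One caveat: to my knowledge, newer Elasticsearch releases expect the analyzer and text in a JSON body rather than as a query parameter with a raw body, so this sketch uses the JSON form. The helper name build_analyze_request is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

def build_analyze_request(text, analyzer="catalan", host="http://localhost:9200"):
    """Build (but do not send) a request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        host + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_analyze_request("Amanida d'escalivada amb bacallà i balsàmic")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running node and list the resulting tokens:
#   with urllib.request.urlopen(req) as resp:
#       print([t["token"] for t in json.load(resp)["tokens"]])
```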

Stem collision example

curl -XGET 'http://localhost:9200/_analyze?analyzer=catalan' -d "porros i porró"

indexes into:

  porr
  porr

To distinguish between the two meanings (porros, leeks, and porró, a traditional wine pitcher), one could exclude porró from stemming using the catalan_keywords filter. However, search matches would then depend on whether the word is singular or plural, because only the exact protected form is left unstemmed. Fortunately, we can customize the stemming further using the stemmer_override filter to define our own stemming rules.

Customized Stemming

Let's say we want to singularize porron. We could then add the filter:

  "custom_stem": {
    "type": "stemmer_override",
    "rules": [ 
      "porron=>porró"
    ]
  }

The order in which filters are applied matters. The stemmer override must come before catalan_keywords and catalan_stemmer, because it rewrites matching tokens and protects them from any stemming that follows.
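A small Python simulation shows why the override's protection matters once the stemmer runs. As before, toy_stemmer is a crude stand-in that strips a final "os" or "ó" and is not the real Catalan stemmer; stemmer_override here is a hypothetical function that imitates the filter's rewrite-and-protect behaviour.

```python
# Rule taken from the custom_stem definition above.
RULES = {"porron": "porró"}

def stemmer_override(tokens, rules=RULES):
    """Rewrite tokens that match a rule and flag them as protected."""
    return [(rules[t], True) if t in rules else (t, False) for t in tokens]

def toy_stemmer(marked_tokens):
    """Toy stand-in for catalan_stemmer: strips a final 'os' or 'ó'.

    Protected tokens pass through unchanged.
    """
    stems = []
    for text, protected in marked_tokens:
        if not protected:
            for suffix in ("os", "ó"):
                if text.endswith(suffix):
                    text = text[: -len(suffix)]
                    break
        stems.append(text)
    return stems

tokens = ["porros", "porron"]

# Override first, then stem: the rewritten token keeps its accent.
print(toy_stemmer(stemmer_override(tokens)))  # ['porr', 'porró']

# Same rewrite without protection: the stemmer collapses it again.
unprotected = [(text, False) for text, _ in stemmer_override(tokens)]
print(toy_stemmer(unprotected))  # ['porr', 'porr']
```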

Go forth and analyze some corpus...

Analyzer Languages

Elasticsearch has built-in analyzers for over 30 languages:

 arabic, armenian, basque, brazilian, bulgarian,
 catalan, chinese, cjk, czech, danish, dutch,
 english, finnish, french, galician, german, greek,
 hindi, hungarian, indonesian, irish, italian,
 norwegian, persian, portuguese, romanian,
 russian, sorani, spanish, swedish, turkish, thai
