Token - Elasticsearch Analyze API

Yesterday we looked at an example of how to both index and search using Elasticsearch. Today we'll talk a little about what takes place during indexing, particularly tokenization. For example, what happens when we tokenize this phrase:

porros amb balsàmic

To find out, we can pass the phrase to the Elasticsearch Analyze API like so:


curl -XGET 'localhost:9200/_analyze?tokenizer=standard' -d 'porros amb balsàmic'

Here we are using the standard (default) tokenizer, which results in the following output:

{
  "tokens" : [ {
    "token" : "porros",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "amb",
    "start_offset" : 7,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "balsàmic",
    "start_offset" : 11,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
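To make the fields in that output concrete, here is a rough Python sketch that produces the same shape of result. It is only an approximation and not how Elasticsearch does it: the real standard tokenizer follows Unicode text segmentation rules, while this sketch simply grabs runs of word characters. The function name standard_like_tokens is made up for illustration.

```python
import re


def standard_like_tokens(text):
    """Split text into word tokens with offsets and 1-based positions.

    Assumption: a run of word characters stands in for one token, which is
    far cruder than the real standard tokenizer.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "<ALPHANUM>",
            "position": position,
        })
    return tokens


# "porros amb balsàmic", with the accented a written as an escape
for tok in standard_like_tokens("porros amb bals\u00e0mic"):
    print(tok)
```

The offsets are character indexes into the original string, and the position is simply the token's ordinal within the phrase.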

Notice that we receive three tokens, which are the original individual words, unchanged. What if I'm some poor American who wants to search for balsàmic but doesn't know how to type an accented a, and just enters balsamic instead? With the index in its current state, this will not result in a match. ASCII folding to the rescue...

One of the things ASCII folding does is remove accents, so that balsamic is what gets put into the index. For example:


curl -XGET 'localhost:9200/_analyze?tokenizer=standard&token_filters=asciifolding' -d 'porros amb balsàmic'

This results in:

{
  "tokens" : [ {
    "token" : "porros",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "amb",
    "start_offset" : 7,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "balsamic",
    "start_offset" : 11,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

Notice that balsamic now appears as a token without the accented a. This kind of token filtering lets one search either with or without the accent, because the accented form is reduced to the non-accented form, as long as the search text goes through the same folding.
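The folding step itself can be imitated in a few lines of Python. This is a sketch under a simplifying assumption, namely that folding just means decomposing each character and dropping the combining accent marks. The real asciifolding filter works from a larger mapping table, so treat fold_accents (a name invented here) as an illustration rather than an equivalent.

```python
import unicodedata


def fold_accents(token):
    """Strip combining accents, e.g. an accented a becomes a plain a.

    Assumption: NFD decomposition plus removal of combining marks is a
    good enough stand-in for ASCII folding on this example.
    """
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Fold the phrase at "index time" ("balsàmic" written with an escape)
indexed = {fold_accents(t) for t in "porros amb bals\u00e0mic".split()}

# Fold each query the same way at "search time" and look it up
for query in ("balsamic", "bals\u00e0mic"):
    print(query, "->", fold_accents(query) in indexed)
```

Both queries find the folded token, which is the point of applying the same folding to the indexed text and to the search text.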



About mateu

I blog about Perl.