DISCOVERY

October 18th, 2019

Writing Elasticsearch Analyzers

Elasticsearch

ELK Stack

JSON

Bash

Search Engine

Text Search

One of the biggest strengths of Elasticsearch is text searching. Elasticsearch holds strings for text searching in text data types. A document can contain one or more fields of type text.

When strings are placed into a field of type text they are processed by an analyzer. Elasticsearch analyzers can be viewed as a pipeline that takes text as input, breaks it into terms, and returns the terms as output[1]. These terms are placed in an inverted index, which makes an index searchable.

Inverted Index

An inverted index is a data structure commonly used to perform quick text searching. Each text field in Elasticsearch has a corresponding inverted index[2]. The purpose of an inverted index is to map terms to the documents they appear in[3]. Terms are created from text with the help of an analyzer.

For example, let's take a field in a document with an ID of 1. If the field contains the text "Hello World" and an analyzer creates a term for each word, the corresponding inverted index would contain two terms - "Hello" and "World". Both these terms would be mapped to ID #1.

When fields are queried in a text search, document matches are determined by the terms in the inverted index. For example, if the text search is "Hello", the document with ID #1 will be returned. Importantly, text search matches aren't determined by the contents of a document's JSON, only by the terms in the inverted index.
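
As a rough end-to-end sketch of this behavior, the commands below index a "Hello World" document and then query it (the greetings index and its message field are hypothetical, and ES_ENDPOINT is assumed to point at a running cluster; refresh=true makes the document searchable immediately):

curl -XPOST ${ES_ENDPOINT}/greetings/_doc/1?refresh=true -H 'Content-Type: application/json' -d '{"message": "Hello World"}'
curl ${ES_ENDPOINT}/greetings/_search?pretty=true -H 'Content-Type: application/json' -d '{ "query": { "match": { "message": "Hello" } } }'

The search should return document #1 because the query term matches a term in the field's inverted index (with the default standard analyzer, both the stored terms and the query term are lowercased, so the match still succeeds).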

Analyzers are critical in determining whether a text search returns a document. The remainder of this article explores the components of analyzers and the impact they have on the inverted index.

Each field of type text is processed by an analyzer. There are three main components of analyzers - character filters, tokenizers, and token filters.

Elasticsearch Constructs

Character Filter

The first component of an analyzer is the character filter. Character filters operate on text passed into the analyzer. Analyzers have zero or more character filters. The purpose of a character filter is to transform text[4]. This transformed text is passed along to the next analyzer component, the tokenizer.
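
A character filter can be previewed on its own with the /_analyze endpoint, which is covered in more detail later in this article. The sketch below defines a transient mapping character filter that replaces an ampersand with the word "and"; the keyword tokenizer simply emits the filtered text as a single token (response abbreviated):

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "tokenizer": "keyword", "char_filter": [{ "type": "mapping", "mappings": ["& => and"] }], "text": "Rock & Roll" }'
{ "tokens": ["Rock and Roll"] }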

Tokenizer

The second component of an analyzer is the tokenizer. Tokenizers take the input text of the analyzer (or the transformed text from character filters) and create tokens[5]. Each analyzer must have exactly one tokenizer. Tokens created by the tokenizer are either returned from the analyzer or passed on to the token filter(s).
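
Tokenizers can be tested the same way. For example, the built-in letter tokenizer starts a new token whenever it encounters a non-letter character, discarding everything else (response abbreviated):

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "tokenizer": "letter", "text": "Hello-World 123" }'
{ "tokens": ["Hello", "World"] }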

Token Filter

The third and final component of an analyzer is the token filter. Token filters add, delete, or modify the tokens created by the tokenizer[6]. Analyzers have zero or more token filters.
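
Token filters can also be previewed in isolation by pairing them with a tokenizer. The sketch below runs the built-in uppercase token filter on the output of the standard tokenizer (response abbreviated):

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "tokenizer": "standard", "filter": ["uppercase"], "text": "Hello World" }'
{ "tokens": ["HELLO", "WORLD"] }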

When an analyzer executes, it runs the character filter(s), tokenizer, and token filter(s) in that order. Elasticsearch provides many built-in character filters, tokenizers, and token filters, and it also allows developers to create their own components. With the help of built-in and custom components, analyzers can be configured to suit most needs.

If you don't want to spend time customizing an analyzer, Elasticsearch provides multiple built-in analyzers that work for most requirements.

Elasticsearch exposes the /_analyze endpoint to work directly with an analyzer. This is great for testing the behavior of analyzers and exploring how they tokenize text. /_analyze can be called on an individual index or the entire Elasticsearch cluster.

The most basic usage of /_analyze is to run an analyzer on a string of text. The following API call executes the built-in standard analyzer on the text "Hello my name is Andy."

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "analyzer": "standard", "text": "Hello my name is Andy." }'
{ "tokens": ["hello", "my", "name", "is", "andy"] }

I abbreviated the API response to include only the resulting tokens. The same API call can also be executed in the Kibana UI.

The standard analyzer consists of zero character filters, the standard tokenizer, and a lowercase token filter[7]. The standard tokenizer removes punctuation and places each word in a separate token[8]. After the standard tokenizer runs, the tokens are in the following form:

{ "tokens": ["Hello", "my", "name", "is", "Andy"] }

Notice that the tokens above are different from the end result of the standard analyzer (the uppercase characters still exist). This is where the standard analyzer's token filter comes into play. As its name suggests, the lowercase token filter modifies tokens by converting uppercase characters to lowercase. After the lowercase token filter runs, the tokens become ["hello", "my", "name", "is", "andy"].
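
The intermediate, pre-filter tokens can be reproduced by calling /_analyze with only the standard tokenizer and no token filters (response abbreviated):

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "tokenizer": "standard", "text": "Hello my name is Andy." }'
{ "tokens": ["Hello", "my", "name", "is", "Andy"] }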

The standard analyzer is one of many built-in analyzers. Another example of a built-in analyzer is the whitespace analyzer.

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "analyzer": "whitespace", "text": "Hello my name is Andy." }'
{ "tokens": ["Hello", "my", "name", "is", "Andy."] }

The /_analyze endpoint also accepts character filters, token filters, and a tokenizer as input. The following example uses the standard tokenizer and a character filter that strips HTML tags from text.

curl -XPOST ${ES_ENDPOINT}/_analyze?pretty=true -H 'Content-Type: application/json' -d '{ "tokenizer": "standard", "char_filter": ["html_strip"], "text": "<h1>Title</h1>" }'
{ "tokens": ["Title"] }

All the examples so far called the /_analyze endpoint on the entire cluster. While this is beneficial for testing, it's less useful in practice. Custom analyzers become especially advantageous once they're attached to indexes and the text fields defined in their mappings.

Custom analyzers can be defined within the JSON configuration object of an index. The following custom analyzer consists of the whitespace tokenizer and a custom emoji_filter character filter.

{ "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 2 }, "analysis": { "analyzer": { "emoji_analyzer": { "tokenizer": "whitespace", "char_filter": [ "emoji_filter" ] } }, "char_filter": { "emoji_filter": { "type": "mapping", "mappings": [ "🙂 => :)", "🙁 => :(", "😀 => :D" ] } } } } }

If this JSON configuration is saved in index.json, the following API call creates the index and names it test:

curl -XPUT ${ES_ENDPOINT}/test -H 'Content-Type: application/json' -d @data/test/index.json
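
To confirm the custom analyzer was registered, the index settings can be retrieved; the analysis block defined above should appear in the response:

curl ${ES_ENDPOINT}/test/_settings?pretty=true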

The whitespace tokenizer creates a new token each time it encounters whitespace. The custom emoji_filter character filter turns Unicode emojis into ASCII-compliant emoticons. The following API call demonstrates how to use the custom analyzer.

curl -XPOST ${ES_ENDPOINT}/test/_analyze -H 'Content-Type: application/json' -d '{ "analyzer": "emoji_analyzer", "text": "😀" }'
{ "tokens": [":D"] }

Here are three more examples of custom analyzers:

{ "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 2 }, "analysis": { "analyzer": { "email_analyzer": { "tokenizer": "email" }, "short_words_analyzer": { "tokenizer": "short_words" }, "long_words_analyzer": { "tokenizer": "standard", "filter": ["english_stop"] } }, "filter": { "english_stop": { "type": "stop", "stopwords": "_english_" } }, "tokenizer": { "email": { "type": "pattern", "pattern": "([a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\\.[a-zA-Z]{2,})", "group": 1 }, "short_words": { "type": "classic", "max_token_length": 2 } } } } }
curl -XPOST ${ES_ENDPOINT}/test/_analyze -H 'Content-Type: application/json' -d '{ "analyzer": "email_analyzer", "text": "My emails are andrew@jarombek.com and ajarombek95@gmail.com." }'

curl -XPOST ${ES_ENDPOINT}/test/_analyze -H 'Content-Type: application/json' -d '{ "analyzer": "short_words_analyzer", "text": "Hi my name is Andy Jarombek." }'

curl -XPOST ${ES_ENDPOINT}/test/_analyze -H 'Content-Type: application/json' -d '{ "analyzer": "long_words_analyzer", "text": "Dotty is a good horse." }'

And here are the responses from the API calls:

{ "tokens": ["andrew@jarombek.com", "ajarombek95@gmail.com"] } { "tokens": ["Hi", "my", "is"] } { "tokens": ["Dotty", "good", "horse"] }

It's time to demonstrate how analyzers work end-to-end. Analyzers are executed on two occasions - when a document is created with a field of type text and when a text field is queried.

Before creating a document, an index is needed with a custom analyzer and a field mapping. Notice that the mapping contains a single field of type text called name. This field has two additional properties - analyzer and search_analyzer.

{ "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 2, "max_ngram_diff": 10 }, "analysis": { "analyzer": { "tech_analyzer": { "type": "custom", "tokenizer": "standard", "char_filter": [], "filter": ["tech_ngram"] } }, "filter": { "tech_ngram": { "type": "ngram", "min_gram": 4, "max_gram": 10 } } } }, "mappings": { "properties": { "name": { "type": "text", "analyzer": "tech_analyzer", "search_analyzer": "standard" } } } }

analyzer declares the analyzer used when documents are indexed. search_analyzer declares the analyzer used when text searches execute. The name field uses a custom analyzer with an n-gram token filter when indexed and the standard analyzer when queried. The n-gram filter creates tokens of length four through ten.

With the index definition created, it's time to populate it with some documents:

curl -XPOST ${ES_ENDPOINT}/tech/_doc/1 -H 'Content-Type: application/json' -d \
  '{"name": "Elasticsearch"}'

curl -XPOST ${ES_ENDPOINT}/tech/_doc/2 -H 'Content-Type: application/json' -d \
  '{"name": "Logstash"}'

curl -XPOST ${ES_ENDPOINT}/tech/_doc/3 -H 'Content-Type: application/json' -d \
  '{"name": "Kibana"}'
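
Before examining the inverted index, the index-scoped /_analyze endpoint can preview what tech_analyzer produces for the first document. The response (omitted here) should list every four-to-ten character gram of Elasticsearch - the same terms that end up in the inverted index:

curl -XPOST ${ES_ENDPOINT}/tech/_analyze -H 'Content-Type: application/json' -d '{ "analyzer": "tech_analyzer", "text": "Elasticsearch" }'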

If we could view the inverted index for the name field, it would look something like this[8,9]:

| Term       | Appearances (Document, Frequency) |
|------------|-----------------------------------|
| Elas       | (1,1)                             |
| Elast      | (1,1)                             |
| Elasti     | (1,1)                             |
| Elastic    | (1,1)                             |
| Elastics   | (1,1)                             |
| Elasticse  | (1,1)                             |
| Elasticsea | (1,1)                             |
| lasticsear | (1,1)                             |
| asticsearc | (1,1)                             |
| sticsearch | (1,1)                             |
| last       | (1,1)                             |
| lasti      | (1,1)                             |
| lastic     | (1,1)                             |
| Logs       | (2,1)                             |
| Kiba       | (3,1)                             |
| ...        | ...                               |
|------------|-----------------------------------|
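
While the inverted index can't be dumped directly, the _termvectors API comes close: it returns the terms stored for a single document's fields along with their frequencies. Assuming the three documents above were indexed on a recent Elasticsearch version, the following call lists the grams generated for the first document's name field (the response nests them under term_vectors.name.terms):

curl "${ES_ENDPOINT}/tech/_termvectors/1?fields=name&pretty=true"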

Viewing the inverted index is helpful because we can see the result of the n-gram token filter. Since the standard analyzer is used on the name field at query time, the following query returns the first document.

curl ${ES_ENDPOINT}/tech/_doc/_search?pretty=true -H 'Content-Type: application/json' -d '{ "query": { "match": { "name": "stic" } } }'

This query would not have returned the first document if Elasticsearch had been stored in the inverted index as a single term.

Learning how analyzers and inverted indexes work helps explain some of the "magic" in Elasticsearch. In my next Elasticsearch adventure, I will explore querying documents. All the code from this article is available on GitHub.

[1] Pranav Shukla & Sharath Kumar M N, Learning Elastic Stack 6.0 (Birmingham: Packt, 2017), 56

[2] "Understanding the Inverted Index in Elasticsearch", https://codingexplained.com/coding/elasticsearch/understanding-the-inverted-index-in-elasticsearch

[3] "Inverted index", https://en.wikipedia.org/wiki/Inverted_index

[4] Shukla, 57

[5] Shukla, 58

[6] Shukla, 60

[7] "Standard Analyzer", https://bit.ly/2MndoxA

[8] "Inverted Index", https://www.geeksforgeeks.org/inverted-index/

[9] "Writing a simple Inverted Index in Python", https://medium.com/@fro_g/writing-a-simple-inverted-index-in-python-3c8bcb52169a