What are tokenizers in Elasticsearch?

MohammadReza · Mar 28, 2021

Elasticsearch breaks each piece of text into tokens, and it ships with several different tokenizers; I’ll explain some of them here. For example, the whitespace tokenizer converts the text “Quick brown fox!” into the tokens [Quick, brown, fox!]. When you search for a keyword, for example “brown”, Elasticsearch looks it up among these tokens and finds where the keyword occurs in the text.
The most popular tokenizers in Elasticsearch are:
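You can see this for yourself with the _analyze API, which takes a tokenizer name and a piece of text and returns the resulting tokens. A minimal example, run from Kibana Dev Tools:

POST _analyze
{
    "tokenizer": "whitespace",
    "text": "Quick brown fox!"
}

The response lists each token along with its position and character offsets, so you can see exactly how the text was split.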
* Standard Tokenizer (standard)
* Letter Tokenizer (letter)
* Lowercase Tokenizer (lowercase)
* Whitespace Tokenizer (whitespace)
* UAX URL Email Tokenizer (uax_url_email)
* Classic Tokenizer (classic)
* Path Tokenizer (path_hierarchy)

Some of the tokenizers

Standard Tokenizer (standard):
The “standard” tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens: [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog’s, bone ]

Letter Tokenizer (letter):
The “letter” tokenizer breaks text into terms whenever it encounters a character which is not a letter. It does a reasonable job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens: [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

Lowercase Tokenizer (lowercase):
The “lowercase” tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. It is functionally equivalent to combining the letter tokenizer with the lowercase token filter, but it is more efficient because it performs both steps in a single pass (see the sketch after this example).
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
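You can check that equivalence yourself by combining the letter tokenizer with the lowercase token filter directly in the _analyze API; this is just an illustrative sketch, since the built-in lowercase tokenizer needs no configuration:

POST _analyze
{
    "tokenizer": "letter",
    "filter": [ "lowercase" ],
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Both approaches should return the same tokens shown above.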

Whitespace Tokenizer (whitespace):
The “whitespace” tokenizer breaks text into terms whenever it encounters a whitespace character.
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone. ]

UAX URL Email Tokenizer (uax_url_email):
The “uax_url_email” tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Text: “Email me at john.smith@global-international.com”
Tokens: [ Email, me, at, john.smith@global-international.com ]
Tokens with ‘Standard Tokenizer’: [ Email, me, at, john.smith, global, international.com ]
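To reproduce this comparison, run the same text through _analyze twice, changing only the tokenizer name:

POST _analyze
{
    "tokenizer": "uax_url_email",
    "text": "Email me at john.smith@global-international.com"
}

POST _analyze
{
    "tokenizer": "standard",
    "text": "Email me at john.smith@global-international.com"
}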

Classic Tokenizer (classic):
The “classic” tokenizer is a grammar-based tokenizer that is good for English-language documents. This tokenizer has heuristics for special treatment of acronyms, company names, email addresses, and internet host names. However, these rules don’t always work, and the tokenizer doesn’t work well for most languages other than English:
* It splits words at most punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
* It splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
* It recognizes email addresses and internet hostnames as one token.
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens: [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog’s, bone ]

Path Tokenizer (path_hierarchy):
The “path_hierarchy” tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree.
Text: “/one/two/three”
Tokens: [ /one, /one/two, /one/two/three ]
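The path_hierarchy tokenizer also accepts options such as delimiter (the separator character, which defaults to /) and reverse. As a sketch, using hypothetical index and tokenizer names (my-paths, my_path_tokenizer):

PUT my-paths
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_path_analyzer": {
                    "tokenizer": "my_path_tokenizer"
                }
            },
            "tokenizer": {
                "my_path_tokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": "-"
                }
            }
        }
    }
}

With delimiter set to “-”, a value like “one-two-three” should produce the tokens [ one, one-two, one-two-three ].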

Some options
All of these tokenizers have configuration options you can set. For example, the “standard” tokenizer has a “max_token_length” setting: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. It defaults to 255.
Note the following:
Text: “The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.”
Tokens with the default max_token_length: [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog’s, bone ]
Tokens with max_token_length=5: [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog’s, bone ] (it splits ‘jumped’ into ‘jumpe’ and ‘d’ because this keyword is longer than 5 characters)

Create an index with a specific tokenizer and test it:
To create an index with a specific tokenizer, you can use the code below:

Create index:

PUT my-index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            }
        }
    }
}

You can test the tokenizer with the code below:

POST my-index/_analyze
{
    "analyzer": "my_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
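The response should contain the same tokens as in the max_token_length example above: [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog’s, bone ].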
