What is an Elasticsearch Analyzer? What are the various types of analyzers in Elasticsearch?
Analyzers are used for text analysis. An analyzer can be either a built-in analyzer or a custom analyzer, and it consists of zero or more character filters, exactly one tokenizer, and zero or more token filters.
Character filters preprocess the incoming stream of characters before tokenization, for example by stripping out HTML tags, by searching the text for keys and replacing them with the corresponding values defined in a mapping character filter, or by replacing characters that match a specific pattern.
The tokenizer breaks the stream of characters into individual tokens. For example, the whitespace tokenizer emits a new token whenever it encounters whitespace. Token filters then modify these tokens: they can convert them to lower case, remove stop words such as 'a', 'an', and 'the', or replace tokens with equivalent synonyms defined in the filter.
Elasticsearch ships with built-in analyzers that are ready to use. You can also combine the built-in character filters, tokenizers, and token filters to create custom analyzers.
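For illustration, a custom analyzer combining these building blocks might be defined in the index settings as follows (the index name `my-index` and analyzer name `my_custom_analyzer` are placeholders):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

Here the `html_strip` character filter removes HTML tags, the `whitespace` tokenizer splits on whitespace, and the `lowercase` and `stop` token filters normalize case and drop stop words.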
Built-in analyzers are further classified as follows:
- Standard Analyzer: built from the standard tokenizer, which breaks the stream into tokens based on the configured maximum token length; the lowercase token filter, which converts tokens to lower case; and the stop token filter (disabled by default), which removes stop words such as 'a', 'an', and 'the'.
- Simple Analyzer: breaks the stream into tokens whenever it encounters a character that is not a letter, such as a number or special character, and converts all tokens to lower case.
- Whitespace Analyzer: breaks the stream into tokens whenever it encounters whitespace, and preserves the case of the tokens as they appeared in the input.
- Stop Analyzer: similar to the simple analyzer, but additionally removes stop words such as 'a', 'an', and 'the'. The complete list of English stop words is given in the Elasticsearch documentation.
- Keyword Analyzer: returns the entire input stream as a single, unmodified token. It can serve as the basis for a custom analyzer by adding filters to it.
- Pattern Analyzer: breaks the stream into tokens based on a configured regular expression. The regular expression is applied to the input stream, not to the tokens.
- Language Analyzer: analyzes text in a specific language. Plug-ins add support for further languages, for example Stempel (Polish), Ukrainian Analysis, Kuromoji (Japanese), Nori (Korean), and the Phonetic plug-in, along with additional plug-ins for Indian and other Asian languages (for example Japanese, Vietnamese, and Tibetan).
- Fingerprint Analyzer: converts the input to lower case, removes extended characters, then sorts, de-duplicates, and concatenates the tokens into a single token.
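The differences between the built-in analyzers can be inspected with the `_analyze` API. For example, the whitespace analyzer preserves case and punctuation (the sample text is arbitrary):

```json
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes."
}
```

This produces the tokens `The`, `2`, `QUICK`, and `Brown-Foxes.`; the same request with `"analyzer": "standard"` instead produces `the`, `2`, `quick`, `brown`, and `foxes`.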
Tokenizers accept a stream of characters, break it into individual tokens, and output an array of these tokens. Tokenizers are mainly grouped into word-oriented, partial-word, and structured-text tokenizers.
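As an example of a partial-word tokenizer, the built-in `ngram` tokenizer (with its default minimum and maximum gram lengths of 1 and 2) slices the input into overlapping fragments:

```json
POST /_analyze
{
  "tokenizer": "ngram",
  "text": "fox"
}
```

With those defaults this produces the tokens `f`, `fo`, `o`, `ox`, and `x`.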
How do token filters work in Elasticsearch?
Token filters receive the stream of tokens from the tokenizer and can modify, remove, or add tokens before they are indexed. At search time, the resulting tokens are compared against the search condition, yielding a Boolean result: a token value may equal the searched value or not, match one of a set of specified values or none of them, fall within a given range or outside it, or exist in the search condition or not.
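A token filter chain can also be tried out directly with the `_analyze` API; here the `lowercase` and `stop` filters are applied to the output of the standard tokenizer:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The QUICK Brown Foxes"
}
```

This produces the tokens `quick`, `brown`, and `foxes`: the `lowercase` filter normalizes case, after which the `stop` filter removes "the".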
How is a character filter in an Elasticsearch analyzer utilized?
A character filter in an Elasticsearch analyzer is optional. The mapping character filter manipulates the input stream by replacing every occurrence of a key with its corresponding mapped value.
The mapping character filter accepts either a mappings parameter, an array of key-to-value mappings, or a mappings_path parameter, the path (absolute or relative to the config directory) to a UTF-8 file containing one key-to-value mapping per line.
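A mapping character filter configured via the mappings parameter might look like this (the filter name `my_mappings_filter` and analyzer name `my_analyzer` are placeholders):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mappings_filter": {
          "type": "mapping",
          "mappings": ["& => and", ":) => happy"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mappings_filter"]
        }
      }
    }
  }
}
```

With this filter in place, the text "you & me :)" is rewritten to "you and me happy" before it reaches the tokenizer.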
-K Himaanshu Shuklaa..