Motivation
In today’s world of interconnected platforms, documents and other forms of digital media are published on the Internet in many languages every day. Once that content is online, we also need a way to search it in the language it was written in. That is where the difficulty lies, and it is exactly where Elasticsearch’s multi-language search capabilities come in.
How Elasticsearch fits in
Elasticsearch provides full-text search and, on top of that, multi-lingual search. The multi-lingual capability comes from language-specific analysis: breaking text into words while building the index so that the resulting terms stay meaningful, and analysing search queries the same way so they match appropriately.
Let’s move forward and take a detailed look at how this is implemented using the Elasticsearch Analyze API.
Two Words – Analyze and Tokenize
Searching is all about analysis, and the first analysis happens when we create an index. During indexing, each piece of text is broken into tokens, and each token becomes a term stored against the index. Once the index has been built, search queries run against it: the query text is broken into tokens in the same way and matched against the indexed terms, and the closer the query tokens are to the indexed values, the more relevant the result.
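To see what this analysis step actually produces, you can call the Analyze API directly. Below is a minimal sketch, assuming the 7.x Java high-level REST client and an already configured RestHighLevelClient instance named client (both names are illustrative); it runs the built-in standard analyzer over a sample sentence and prints the resulting terms.
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class AnalyzeExample {
    public static void printTokens(RestHighLevelClient client) throws Exception {
        // Ask Elasticsearch to analyze the text with the built-in "standard" analyzer
        AnalyzeRequest request = AnalyzeRequest.withGlobalAnalyzer("standard",
                "Multi-lingual text search using Elastic Search");
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        // Each token becomes a term in the index,
        // e.g. [multi, lingual, text, search, using, elastic, search]
        for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
            System.out.println(token.getTerm());
        }
    }
}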
Moving forward, we will look at the three basic components that help you achieve a multi-lingual approach for your system:
- Stop Filter
- Tokenizer
- Mapping
Stop Filter
Stop filters are the part of our index definition where we declare stop words: words that are filtered out before or after the data is processed. They are usually the most common words in a language, such as the English “a”, “the”, “an”, and “or”. Making the stop list language-specific matters; otherwise the tokens created from a word or text may not remain meaningful.
A stop filter accepts the following parameters:
stopwords – a predefined stop-word list such as "_thai_", which pulls in all the stop words for the Thai language, or an array listing the stop words explicitly. The default stop-word list is "_english_".
stopwords_path – the path to a file containing the stop words that should be ignored while processing the text.
For example, we can declare a predefined, language-specific stop filter while creating the index; we will use it below together with the Elasticsearch Analyze API:
{
  "filter": {
    "thai_stop": {
      "type": "stop",
      "stopwords": "_thai_"
    }
  }
}
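To see the effect of a stop filter, you can run some text through the built-in stop analyzer, which is simply a lowercasing, letter-based analyzer with the default _english_ stop-word list applied. A minimal sketch, again assuming a configured RestHighLevelClient named client:
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class StopFilterExample {
    public static void demo(RestHighLevelClient client) throws Exception {
        // The built-in "stop" analyzer applies the default _english_ stop-word list
        AnalyzeRequest request = AnalyzeRequest.withGlobalAnalyzer("stop",
                "the quick brown fox and the lazy dog");
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        // Common words such as "the" and "and" are dropped:
        // the expected terms are roughly [quick, brown, fox, lazy, dog]
        response.getTokens().forEach(token -> System.out.println(token.getTerm()));
    }
}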
Tokenizer
A tokenizer receives a stream of characters, breaks it into smaller words or tokens, and returns the list of tokens it generated. The most common example is the whitespace tokenizer, which splits the input string into tokens whenever it encounters whitespace.
For example, it would convert the text “Multi-lingual text search using Elastic Search” to the terms
[Multi-lingual, text, search, using, Elastic, Search]
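You can confirm that output through the Analyze API as well: the built-in whitespace analyzer is just a wrapper around the whitespace tokenizer, so it keeps “Multi-lingual” as a single token and preserves the original casing. A minimal sketch, assuming the same client instance as before:
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class WhitespaceTokenizerExample {
    public static void demo(RestHighLevelClient client) throws Exception {
        // The built-in "whitespace" analyzer splits only on whitespace
        AnalyzeRequest request = AnalyzeRequest.withGlobalAnalyzer("whitespace",
                "Multi-lingual text search using Elastic Search");
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        // Expected terms: [Multi-lingual, text, search, using, Elastic, Search]
        response.getTokens().forEach(token -> System.out.println(token.getTerm()));
    }
}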
The tokenizers can be grouped as follows:
- Word Oriented Tokenizers – standard, letter, lowercase, whitespace, classic, thai, uax_url_email
- Partial Word Tokenizers – ngram, edge_ngram
- Structured Text Tokenizers – keyword, pattern, simple_pattern, char_group, path_hierarchy
In the example below, the thai tokenizer divides the text into tokens according to Thai word boundaries, with additional help from the thai_stop filter defined earlier, which drops common Thai words during tokenization. We have combined it with the stop filter for Thai.
{
  "analyzer": {
    "autocomplete-thai": {
      "type": "custom",
      "tokenizer": "thai",
      "filter": [
        "lowercase",
        "decimal_digit",
        "thai_stop"
      ]
    }
  }
}
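Once an index has been created with these settings, the custom analyzer can be exercised against that index to check how a Thai phrase is tokenized. The sketch below is illustrative: the index name media-search is hypothetical, the sample Thai text is arbitrary, and it assumes the 7.x high-level client.
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class ThaiAnalyzerExample {
    public static void demo(RestHighLevelClient client) throws Exception {
        // "media-search" is a hypothetical index whose settings define the
        // "autocomplete-thai" analyzer shown above
        AnalyzeRequest request = AnalyzeRequest.withIndexAnalyzer(
                "media-search", "autocomplete-thai", "สวัสดีครับ ยินดีต้อนรับ");
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        // Prints the Thai terms produced by the thai tokenizer after the
        // lowercase, decimal_digit and thai_stop filters have run
        response.getTokens().forEach(token -> System.out.println(token.getTerm()));
    }
}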
Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. A mapping definition has:
- Metadata fields – used to customize how Elasticsearch treats the metadata associated with a document; some basic metadata fields are _index, _id, and _source.
- Fields or properties – the list of fields or properties associated with that particular document.
Field Data Types
Each field or property has a data type associated with it, which can be any of the following:
- Simple type like text, keyword, date, long, double, boolean, or ip.
- A type that supports the hierarchical nature of JSON, such as object or nested.
- Specialized types such as geo_point, geo_shape, or completion.
Furthermore, the mapping can be of two types:
- Dynamic mapping – fields do not need to be defined before they are used; whenever a document is indexed, any new field names are added to the mapping automatically.
- Static mapping – the traditional approach of defining the mapping in advance, when we already know what the document will store, by supplying it while creating the index.
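Whichever approach you use, a quick way to see the mapping an index has actually ended up with is to read it back from the cluster. The following is a minimal sketch, assuming a configured 7.x RestHighLevelClient named client and the hypothetical index media-search:
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.GetMappingsRequest;
import org.elasticsearch.client.indices.GetMappingsResponse;

public class InspectMappingExample {
    public static void printMapping(RestHighLevelClient client) throws Exception {
        GetMappingsRequest request = new GetMappingsRequest().indices("media-search");
        GetMappingsResponse response = client.indices().getMapping(request, RequestOptions.DEFAULT);
        // One entry per index: the field definitions as a nested map,
        // whether they were added dynamically or defined up front
        response.mappings().forEach((index, metadata) ->
                System.out.println(index + " -> " + metadata.sourceAsMap()));
    }
}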
Now that we have covered the stop filter, the tokenizer/analyzer, and the mapping, let’s look at how to use them from Java code. You can keep the analyzer settings and the mapping either as String constants or as methods; the method approach is shown below:
public static String analyzerSettings() {
    return "{\n" +
            "  \"analysis\": {\n" +
            "    \"filter\": {\n" +
            "      \"autocomplete_filter\": {\n" +
            "        \"type\": \"edge_ngram\",\n" +
            "        \"min_gram\": 3,\n" +
            "        \"max_gram\": 20\n" +
            "      }\n" +
            "    },\n" +
            "    \"analyzer\": {\n" +
            "      \"autocomplete\": {\n" +          // analyzer for English text
            "        \"type\": \"custom\",\n" +
            "        \"tokenizer\": \"standard\",\n" +
            "        \"filter\": [\n" +
            "          \"lowercase\",\n" +
            "          \"autocomplete_filter\"\n" +
            "        ]\n" +
            "      },\n" +
            "      \"autocomplete-thai\": {\n" +     // analyzer for Thai text
            "        \"type\": \"custom\",\n" +
            "        \"tokenizer\": \"thai\",\n" +
            "        \"filter\": [\n" +
            "          \"lowercase\",\n" +
            "          \"decimal_digit\",\n" +
            "          \"autocomplete_filter\"\n" +
            "        ]\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}
In the same way, keep the mapping as a method that lists all the fields you will be indexing in your code.
public static String toSearchMapping() {
    return "{\n" +
            "  \"properties\": {\n" +
            "    \"id\": {\n" +
            "      \"type\": \"integer\"\n" +
            "    },\n" +
            "    \"title\": {\n" +
            "      \"type\": \"text\",\n" +
            "      \"analyzer\": \"autocomplete\",\n" +
            "      \"search_analyzer\": \"standard\",\n" +
            "      \"fields\": {\n" +
            "        \"thai\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"analyzer\": \"autocomplete-thai\"\n" +
            "        }\n" +
            "      }\n" +
            "    },\n" +
            "    \"contentType\": {\n" +
            "      \"type\": \"keyword\"\n" +
            "    },\n" +
            "    \"longDescription\": {\n" +
            "      \"type\": \"text\"\n" +
            "    },\n" +
            "    \"description\": {\n" +
            "      \"type\": \"text\",\n" +
            "      \"analyzer\": \"autocomplete\",\n" +
            "      \"search_analyzer\": \"standard\",\n" +
            "      \"fields\": {\n" +
            "        \"thai\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"analyzer\": \"autocomplete-thai\"\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}
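Before wiring these settings into index creation, it can help to see what a document matching this mapping looks like when it is indexed. The sketch below is illustrative: the index name media-search and the field values are made up, and the field names simply mirror the mapping above; it assumes the 7.x high-level client.
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class IndexDocumentExample {
    public static void indexSample(RestHighLevelClient client) throws Exception {
        // Hypothetical document whose fields match toSearchMapping()
        String document = "{"
                + "\"id\": 1,"
                + "\"title\": \"สอนการใช้งาน Elastic Search\","
                + "\"contentType\": \"video\","
                + "\"longDescription\": \"A longer write-up of the episode\","
                + "\"description\": \"Multi-lingual text search using Elastic Search\""
                + "}";
        IndexRequest request = new IndexRequest("media-search")
                .id("1")
                .source(document, XContentType.JSON);
        // Both the English analyzer and the Thai sub-field analyzer run over title and description
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        System.out.println(response.getResult());
    }
}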
Now, using the above analyzer and mapping methods (kept in the SeriesDTO class referenced below), we create an ElasticIndexUtil class that creates the index if it is missing and updates the data based on that index-creation logic.
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.admin.indices.open.OpenIndexRequest;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CloseIndexRequest;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.client.indices.PutMappingRequest;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.IndexNotFoundException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ElasticIndexUtil {

    @Autowired
    private RestHighLevelClient elasticsearchClient;

    public void validateIndex(String idxName) throws Exception {
        boolean isIndexCreated = false;
        try {
            // Check whether the index already exists
            String[] indices = elasticsearchClient.indices()
                    .get(new GetIndexRequest(idxName), RequestOptions.DEFAULT).getIndices();
            if (indices.length == 0) {
                throw new IndexNotFoundException(idxName);
            }
        } catch (IndexNotFoundException | ElasticsearchStatusException e) {
            e.printStackTrace();
            // The index is missing, so bootstrap it with our analyzers and mapping
            bootstrapSeriesIndex(idxName);
            isIndexCreated = true;
        }
        if (isIndexCreated) {
            // update your data in Elasticsearch here
        }
    }

    public void bootstrapSeriesIndex(String idxName) throws Exception {
        elasticsearchClient.indices().create(new CreateIndexRequest(idxName), RequestOptions.DEFAULT);
        // The index must be closed before the analysis settings can be updated
        elasticsearchClient.indices().close(new CloseIndexRequest(idxName), RequestOptions.DEFAULT);
        // Using our analyzers here
        elasticsearchClient.indices().putSettings(new UpdateSettingsRequest(idxName)
                .settings(SeriesDTO.analyzerSettings(), XContentType.JSON), RequestOptions.DEFAULT);
        elasticsearchClient.indices().open(new OpenIndexRequest(idxName), RequestOptions.DEFAULT);
        // Setting the mapping for the index
        elasticsearchClient.indices().putMapping(new PutMappingRequest(idxName)
                .source(SeriesDTO.toSearchMapping(), XContentType.JSON), RequestOptions.DEFAULT);
    }
}
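With the index bootstrapped and documents in place, the multi-lingual fields can be queried together. The following is a minimal search sketch, assuming the same hypothetical media-search index and a configured client; the multi_match query targets both the English-analyzed base fields and the Thai sub-fields defined in the mapping.
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class MultiLingualSearchExample {
    public static void search(RestHighLevelClient client, String userInput) throws Exception {
        SearchRequest searchRequest = new SearchRequest("media-search");
        searchRequest.source(new SearchSourceBuilder()
                // Query the English-analyzed fields and the Thai sub-fields in one request
                .query(QueryBuilders.multiMatchQuery(userInput,
                        "title", "title.thai", "description", "description.thai")));
        SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
        response.getHits().forEach(hit -> System.out.println(hit.getSourceAsString()));
    }
}
Thai input should score best on the .thai sub-fields, while English input matches the autocomplete-analyzed base fields, so a single query path can serve both languages.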
Summary
It is becoming more and more important for every business, and for any content published on the Internet, to be available in multiple languages, and the mechanism to search that content matters just as much as its availability. The points above are only the fundamentals; there is much more to explore in the world of index creation and searching. Keep exploring, keep coding.
References
This blog draws on the following references:
- Understanding the concept of tokens and analyzers: https://www.elastic.co/blog/introducing-multi-language-engines-in-elastic-app-search
- Understanding types of analyzers: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
- Understanding tokenizers: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
- Understanding mapping: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html and https://logz.io/blog/elasticsearch-mapping/