Basics of Multi-Lingual Capabilities in Elasticsearch

Shalabh Agarwal Oct 17th, 2020

Motivation

In today's world of interconnected platforms, documents and other forms of digital media are published on the Internet in multiple languages every day. Making this content available online is only half the job; users also need a mechanism to search it in the language of their choice. That is where the difficulty lies, and where Elasticsearch's expertise as a multi-language search engine comes in.

How Elasticsearch fits in

Elasticsearch provides full-text search, with multi-lingual search as an added capability. The multi-lingual capability essentially comes from language-specific analysis: breaking text into words to build indexes, keeping those words meaningful for the language in question, and matching search queries against them appropriately.

Let's move forward and take a detailed look at how this is implemented using the Elasticsearch Analyze API.

Two Words – Analyze and Tokenize

Searching is all about analysis, and the first step of analysis happens when we create indexes. During indexing, the text is divided into tokens, and each token becomes a term in the index. Once the index has been built, search queries are run against it: the request text is broken down into tokens in the same way, those tokens are matched against the indexed terms, and the closer the match, the more relevant the search result.
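You can see this analysis step directly with the _analyze API, which returns the tokens Elasticsearch would produce for a given piece of text. A minimal sketch (the analyzer and the text are only illustrative):

POST _analyze
{
  "analyzer": "standard",
  "text": "Searching is all about analysis"
}

The response lists the tokens [searching, is, all, about, analysis] along with their positions and offsets.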

Moving forward, we will look at the three basic components that help achieve a multi-lingual approach in your system:

  • Stop Filter
  • Tokenizer
  • Mappings

Stop Filter

A stop filter is a component of index creation that lets us define stop words, which are words filtered out before or after the text is processed. They are usually the most common words in a language, such as "a", "the", "an", and "or" in English. Making the stop word list language-specific is important; otherwise the tokens created for a word or text may no longer be meaningful.

A stop filter can accept the following parameters:

stopwords – A predefined stop words list such as "_thai_", which pulls in all the stop words of the Thai language, or an array containing a custom list of stop words. The default stop words list is "_english_".
stopwords_path – The path to a file containing the stop words that should be ignored while processing text.
For example, we can define a predefined, language-specific stop filter while creating the index; we will use it below together with the Elasticsearch Analyze API:
{
  "filter": {
    "thai_stop": {
      "type": "stop",
      "stopwords": "_thai_"
    }
  }
}
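To see the effect of a stop filter in isolation, you can pass an inline filter definition to the _analyze API. A small sketch using the default English list (the text is only an example):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": "_english_" }
  ],
  "text": "the quick brown fox"
}

The response contains only the tokens [quick, brown, fox]; "the" has been filtered out.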

Tokenizer

A tokenizer receives a stream of characters, breaks a large chunk of text into smaller words, or tokens, and returns the list of tokens generated. The most common example is the whitespace tokenizer, which splits the provided string into tokens whenever it encounters whitespace.

For example, it would convert the text “Multi-lingual text search using Elastic Search” to the terms
[Multi-lingual, text, search, using, Elastic, Search]
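The same output can be reproduced with the _analyze API by naming the tokenizer explicitly:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Multi-lingual text search using Elastic Search"
}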

Tokenizers can be grouped into the following categories:

  • Word Oriented Tokenizers – standard, letter, lowercase, whitespace, classic, thai, uax_url_email
  • Partial Word Tokenizers – ngram, edge_ngram (see the edge_ngram sketch after this list)
  • Structured Text Tokenizers – keyword, pattern, simple_pattern, char_group, path_hierarchy
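Partial word tokenization is what later powers the autocomplete behaviour in this post. As a small sketch, an edge_ngram token filter (the min_gram/max_gram values are only illustrative) emits the leading fragments of each token:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 2, "max_gram": 5 }
  ],
  "text": "search"
}

This returns the tokens [se, sea, sear, searc], which is what lets a partially typed query match the full word.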

In the example below, the thai tokenizer is used to split text into tokens according to Thai word boundaries, with additional help from the thai_stop filter defined earlier, which ignores common Thai words while tokenizing.

{
  "analyzer": {
    "autocomplete-thai": {
      "type": "custom",
      "tokenizer": "thai",
      "filter": [
        "lowercase",
        "decimal_digit",
        "thai_stop"
      ]
    }
  }
}
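Once an index has been created with these settings, the custom analyzer can be tested through the same _analyze API. A minimal sketch, assuming the index is called my-index (the Thai text simply means "hello world"):

GET my-index/_analyze
{
  "analyzer": "autocomplete-thai",
  "text": "สวัสดีชาวโลก"
}

The response lists the Thai tokens produced by the thai tokenizer after the lowercase, decimal_digit, and thai_stop filters have been applied.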

Mapping

Mapping is the process of defining how a document and the fields it contains are stored and indexed. A mapping definition has:

  • Metadata fields – these customize how Elasticsearch treats the metadata associated with a document. Basic fields that hold a document's metadata include the _index, _id, and _source fields.
  • Fields or properties – the list of fields or properties associated with the particular document.

Field Data Types
Each field or property has a data type associated with it, which can be any of the following:

  • Simple type like text, keyword, date, long, double, boolean, or ip.
  • A type that supports the hierarchical nature of JSON, such as object or nested.
  • Specialized types such as geo_point, geo_shape, or completion.

Furthermore, the mapping can be defined in two ways:

  • Dynamic Mapping – fields and mappings need not be predefined before they are used; whenever a document is indexed, any new field names are added to the mapping automatically
  • Static (explicit) Mapping – the long-standing approach of creating the mapping well in advance, when we already know what the document will store, by defining the mapping while creating the index (see the sketch after this list)
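As a small illustration of a static (explicit) mapping, the request below creates an index with a few predefined fields; the index and field names are only examples:

PUT my-index
{
  "mappings": {
    "properties": {
      "title":       { "type": "text" },
      "publishedAt": { "type": "date" },
      "views":       { "type": "long" }
    }
  }
}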

Using the stop filter, tokenizer, and mapping described above, let's now look at how to implement this.

Moving on to using the above-mentioned concepts in Java code: maintain the analyzer and the mapping as String values, or expose them as methods in your code as shown below (the method approach is used here):

public static String analyzerSettings() {
    return "{\n" +
            "  \"analysis\": {\n" +
            "    \"filter\": {\n" +
            "      \"autocomplete_filter\": {\n" +
            "        \"type\": \"edge_ngram\",\n" +
            "        \"min_gram\": 3,\n" +
            "        \"max_gram\": 20\n" +
            "      }\n" +
            "    },\n" +
            "    \"analyzer\": {\n" + // Analyzer for English text
            "      \"autocomplete\": {\n" +
            "        \"type\": \"custom\",\n" +
            "        \"tokenizer\": \"standard\",\n" +
            "        \"filter\": [\n" +
            "          \"lowercase\",\n" +
            "          \"autocomplete_filter\"\n" +
            "        ]\n" +
            "      },\n" +
            "      \"autocomplete-thai\": {\n" + // Analyzer for Thai text
            "        \"type\": \"custom\",\n" +
            "        \"tokenizer\": \"thai\",\n" +
            "        \"filter\": [\n" +
            "          \"lowercase\",\n" +
            "          \"decimal_digit\",\n" +
            "          \"autocomplete_filter\"\n" +
            "        ]\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}

In the same way, maintain the mapping string as a method, listing all the fields you will be indexing in your code:

public static String toSearchMapping() {
    return "{\n" +
            "  \"properties\": {\n" +
            "    \"id\": {\n" +
            "      \"type\": \"integer\"\n" +
            "    },\n" +
            "    \"title\": {\n" +
            "      \"type\": \"text\",\n" +
            "      \"analyzer\": \"autocomplete\",\n" +
            "      \"search_analyzer\": \"standard\",\n" +
            "      \"fields\": {\n" +
            "        \"thai\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"analyzer\": \"autocomplete-thai\"\n" +
            "        }\n" +
            "      }\n" +
            "    },\n" +
            "    \"contentType\": {\n" +
            "      \"type\": \"keyword\"\n" +
            "    },\n" +
            "    \"longDescription\": {\n" +
            "      \"type\": \"text\"\n" +
            "    },\n" +
            "    \"description\": {\n" +
            "      \"type\": \"text\",\n" +
            "      \"analyzer\": \"autocomplete\",\n" +
            "      \"search_analyzer\": \"standard\",\n" +
            "      \"fields\": {\n" +
            "        \"thai\": {\n" +
            "          \"type\": \"text\",\n" +
            "          \"analyzer\": \"autocomplete-thai\"\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}
Now, using the above analyzer and mapping in code, an ElasticIndexUtil class is created that checks whether the index exists, creates it with these settings if it does not, and updates the data based on that index-creation logic.

// Request classes below are from the Elasticsearch 7.x high-level REST client;
// package names may differ slightly between client versions.
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.admin.indices.open.OpenIndexRequest;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CloseIndexRequest;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.client.indices.PutMappingRequest;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.IndexNotFoundException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ElasticIndexUtil {

    @Autowired
    private RestHighLevelClient elasticsearchClient;

    public void validateIndex(String idxName) throws Exception {
        // Declared outside the try block so it remains in scope afterwards
        boolean isIndexCreated = false;

        try {
            String[] indices = elasticsearchClient.indices()
                    .get(new GetIndexRequest(idxName), RequestOptions.DEFAULT)
                    .getIndices();

            if (indices.length == 0) {
                throw new IndexNotFoundException(idxName);
            }
        } catch (IndexNotFoundException | ElasticsearchStatusException e) {
            e.printStackTrace();
            bootstrapSeriesIndex(idxName);
            isIndexCreated = true;
        }

        if (isIndexCreated) {
            // update your data in Elasticsearch
        }
    }

    public void bootstrapSeriesIndex(String idxName) throws Exception {
        elasticsearchClient.indices().create(new CreateIndexRequest(idxName), RequestOptions.DEFAULT);
        elasticsearchClient.indices().close(new CloseIndexRequest(idxName), RequestOptions.DEFAULT);

        // Using our analyzers here (analyzerSettings() is the method shown above, kept in SeriesDTO)
        elasticsearchClient.indices().putSettings(new UpdateSettingsRequest(idxName)
                        .settings(SeriesDTO.analyzerSettings(), XContentType.JSON),
                RequestOptions.DEFAULT);
        elasticsearchClient.indices().open(new OpenIndexRequest(idxName), RequestOptions.DEFAULT);

        // Setting the mapping for the index (toSearchMapping() is the method shown above)
        elasticsearchClient.indices().putMapping(new PutMappingRequest(idxName)
                .source(SeriesDTO.toSearchMapping(), XContentType.JSON), RequestOptions.DEFAULT);
    }
}
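With the index bootstrapped, documents can be indexed as usual, and the title and description fields will be analyzed by both the English and Thai autocomplete analyzers. Below is a minimal, hypothetical usage sketch; the index name "series", the field values, and the injected beans are assumptions, not part of the original post:

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class SeriesIndexer {

    private final ElasticIndexUtil elasticIndexUtil;
    private final RestHighLevelClient elasticsearchClient;

    public SeriesIndexer(ElasticIndexUtil elasticIndexUtil, RestHighLevelClient elasticsearchClient) {
        this.elasticIndexUtil = elasticIndexUtil;
        this.elasticsearchClient = elasticsearchClient;
    }

    public void indexSampleDocument() throws Exception {
        // Ensure the index exists, creating it with our analyzers and mapping if needed
        elasticIndexUtil.validateIndex("series");

        // Index a document: "title" is analyzed with "autocomplete",
        // while the "title.thai" sub-field uses "autocomplete-thai"
        IndexRequest request = new IndexRequest("series")
                .id("1")
                .source("{ \"id\": 1, \"title\": \"Multi-lingual text search\", \"contentType\": \"blog\" }",
                        XContentType.JSON);
        elasticsearchClient.index(request, RequestOptions.DEFAULT);
    }
}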

Summary

These days it is increasingly important for every business, and for any content published on the Internet, to be available in multiple languages; and the mechanism to search that content matters just as much as its availability. The points above are only the fundamentals; there is much more to explore in the world of creating indices and searching them. Keep exploring, keep coding.

Shalabh Agarwal - Co-founder, Enveu
Shalabh Agarwal is the co-founder of Enveu, one of the fastest-growing App automation and OTT solutions providers. Shalabh oversees the global businesses for Enveu and has been working in the Technology and SaaS space for over 15 years.
