Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available on the official website. In the rest of this text, we will refer to Elasticsearch as ES.
Such engines are used for complex searches over a collection of documents, for example, searches that take into account the morphology of a language, or searches by geo coordinates.
In this article, I will cover the basics of ES using blog post indexing as an example. I'll show you how to filter, sort and search documents.
To stay independent of the operating system, I will make all requests to ES using cURL. There is also a plugin for Google Chrome called Sense.
The text contains links to documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in glossaries.
ES installation
To run ES, we first need Java. The developers recommend installing Java versions newer than Java 8 update 20 or Java 7 update 55.
ES automatically created the index blog and the type post. We can draw a rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:
# get the mapping of all types in the blog index
curl -XGET "$ES_URL/blog/_mapping?pretty"
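For reference, it is the very first indexing request that creates all of this. A sketch with made-up values; the field names title, content, and tags are the ones used in this article, while published_at is an assumed name for the date field:

```shell
# Indexing this document auto-creates the blog index, the post type,
# and the mapping (all field values here are illustrative).
DOC='{
  "title": "Veggie cheeseburgers",
  "content": "How to cook veggie cheeseburgers",
  "tags": ["food", "recipes"],
  "published_at": "2014-09-12T20:44:25+00:00"
}'
# Against a live cluster (assumes $ES_URL points at it):
# curl -XPUT "$ES_URL/blog/post/1" -d "$DOC"
echo "$DOC"
```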
In the server response, I added the values of the indexed document's fields in the comments:
It's worth noting that ES makes no distinction between a single value and an array of values. For example, the title field contains just one string and the tags field contains an array of strings, yet they are represented identically in the mapping.
We'll talk about mapping in more detail later.
Queries
Retrieve a document by its id:
# retrieve the document with id 1 of type post from the blog index
curl -XGET "$ES_URL/blog/post/1?pretty"
New keys appeared in the response: _version and _source. In general, all keys starting with _ are internal service fields.
The _version key shows the version of the document. It is needed for the optimistic locking mechanism. For example, we want to change a document that has version 1: we submit the modified document and indicate that we are editing the document with version 1. If someone else also edited version 1 and submitted their changes before us, ES will reject our changes, because it now stores the document with version 2.
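A sketch of how the version check might look, assuming a document with id 1 already exists at version 1; if the stored version differs, ES responds with a version conflict error:

```shell
# Resubmit the document, asserting that we are editing version 1.
# If the stored version is no longer 1, ES returns a 409 conflict.
UPDATED='{ "title": "New title" }'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPUT "$ES_URL/blog/post/1?version=1" -d "$UPDATED"
echo "$UPDATED"
```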
The _source key contains the document that we indexed. ES does not use this value for search operations, since searches run against the indexes. To save space, ES stores the original document compressed. If we only need the id and not the whole source document, we can disable _source storage.
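Disabling _source is done in the mapping; a minimal sketch. Note that it must be set when the type is first created, and that without _source you lose the ability to view, reindex, or update the original documents:

```shell
# Type mapping with source storage turned off (set at type creation).
MAPPING='{
  "post": {
    "_source": { "enabled": false }
  }
}'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPUT "$ES_URL/blog/_mapping/post" -d "$MAPPING"
echo "$MAPPING"
```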
If we do not need additional information, we can only get the contents of _source:
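The _source endpoint returns just the original document, without _version and the other service keys; a sketch, assuming $ES_URL points at the cluster:

```shell
# Fetch only the stored source of document 1.
URL_PATH="/blog/post/1/_source?pretty"
# curl -XGET "$ES_URL$URL_PATH"
echo "$URL_PATH"
```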
We selected the latest post. size limits the number of documents in the result set. total shows the total number of documents matching the query. sort in the output contains the array of integers by which sorting was performed; that is, the date was converted to an integer. You can read more about sorting in the documentation.
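The search request that produces such a result might look roughly like this; published_at is an assumed name for the post's date field:

```shell
# Return only the most recent post: sort by date descending, take one.
SEARCH='{
  "size": 1,
  "sort": [ { "published_at": "desc" } ]
}'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPOST "$ES_URL/blog/post/_search?pretty" -d "$SEARCH"
echo "$SEARCH"
```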
Filters and Queries
Since version 2, ES does not distinguish between filters and queries; instead, the concept of contexts is introduced.
The query context differs from the filter context in that a query generates a _score and is not cached. I will show what _score is later.
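A sketch of the same search in both contexts within one bool query: the clause under must runs in the query context and contributes to _score, while the clause under filter runs in the filter context and can be cached (the field names are this article's examples):

```shell
# "must" clauses score the document; "filter" clauses only include/exclude.
BOOL='{
  "query": {
    "bool": {
      "must":   [ { "match": { "content": "history" } } ],
      "filter": [ { "term": { "tags": "food" } } ]
    }
  }
}'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPOST "$ES_URL/blog/post/_search?pretty" -d "$BOOL"
echo "$BOOL"
```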
However, if we look for "stories" in the content field, we will not find anything, because the index contains only the original words, not their stems. In order to make a high-quality search, you need to configure the analyzer.
The _score field shows relevance. If the request is executed in the filter context, the value of _score will always be 1, which means the filter matched completely.
Analyzers
Analyzers are needed to convert the source text into a set of tokens.
Analyzers consist of one Tokenizer and a few optional TokenFilters; the Tokenizer may be preceded by several CharFilters. The Tokenizer breaks the source string into tokens, for example on whitespace and punctuation characters. A TokenFilter can change tokens, remove them, or add new ones, for example leaving only the stem of a word, removing prepositions, or adding synonyms. A CharFilter transforms the entire source string, for example stripping out HTML tags.
The standard analyzer split the string on whitespace and lowercased everything; the russian analyzer removed insignificant words, lowercased, and reduced the words to their stems.
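You can observe the tokens each analyzer emits with the _analyze API; a sketch, where the analyzer names standard and russian are built in and the sample phrase is my own:

```shell
# Sample Russian phrase to feed through both analyzers.
TEXT='Искусство программирования'
# Against a live cluster (assumes $ES_URL is set):
# curl -XGET "$ES_URL/_analyze?analyzer=standard&pretty" -d "$TEXT"
# curl -XGET "$ES_URL/_analyze?analyzer=russian&pretty" -d "$TEXT"
echo "$TEXT"
```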
Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:
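According to the Elasticsearch language-analyzers documentation, the russian analyzer is equivalent to the following custom definition (reproduced here as a sketch; no CharFilters, the standard Tokenizer, and three TokenFilters):

```shell
# The built-in russian analyzer, expressed as a custom analyzer.
RUSSIAN='{
  "analysis": {
    "filter": {
      "russian_stop":    { "type": "stop", "stopwords": "_russian_" },
      "russian_stemmer": { "type": "stemmer", "language": "russian" }
    },
    "analyzer": {
      "russian": {
        "tokenizer": "standard",
        "filter": ["lowercase", "russian_stop", "russian_stemmer"]
      }
    }
  }
}'
echo "$RUSSIAN"
```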
Let's define our own analyzer based on russian that will also strip HTML tags. Let's call it default, because an analyzer with this name will be used by default.
First, all HTML tags are removed from the source string, then the standard tokenizer splits it into tokens, the resulting tokens are lowercased, insignificant words are removed, and the remaining tokens are reduced to word stems.
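A sketch of such an analyzer definition; the filter names ru_stop and ru_stemmer are my own labels, while html_strip, stop, and stemmer are built-in:

```shell
# Custom "default" analyzer: strip HTML, tokenize, lowercase,
# drop Russian stop words, stem the remaining tokens.
SETTINGS='{
  "analysis": {
    "filter": {
      "ru_stop":    { "type": "stop", "stopwords": "_russian_" },
      "ru_stemmer": { "type": "stemmer", "language": "russian" }
    },
    "analyzer": {
      "default": {
        "char_filter": ["html_strip"],
        "tokenizer": "standard",
        "filter": ["lowercase", "ru_stop", "ru_stemmer"]
      }
    }
  }
}'
echo "$SETTINGS"
```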
Create an index
We described the default analyzer above. It will apply to all string fields. Our post contains an array of tags, so the tags would also be processed by the analyzer. Since we look up posts by exact match on a tag, we need to disable analysis for the tags field.
Let's create a blog2 index with the analyzer and a mapping in which analysis of the tags field is disabled:
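A sketch of the index-creation request; in ES 2.x an exact-match string field is declared with "index": "not_analyzed", and the filter names ru_stop and ru_stemmer are my own labels:

```shell
# blog2: custom default analyzer plus a mapping where tags is stored verbatim.
BODY='{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop":    { "type": "stop", "stopwords": "_russian_" },
        "ru_stemmer": { "type": "stemmer", "language": "russian" }
      },
      "analyzer": {
        "default": {
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "ru_stop", "ru_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "tags": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPOST "$ES_URL/blog2" -d "$BODY"
echo "$BODY"
```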
Since we are using an analyzer with Russian stemming, this query will return all the documents, even though they contain only the word 'history'.
The request may contain special characters, for example:
""fried eggs" +(eggplant | potato) -frittata"
Request syntax:
+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
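The syntax above is that of the simple_query_string query; the example string from earlier could be sent roughly like this (the field name content is this article's example):

```shell
# simple_query_string understands the +, |, -, "", *, (), and ~N operators.
SQS='{
  "query": {
    "simple_query_string": {
      "query": "\"fried eggs\" +(eggplant | potato) -frittata",
      "fields": ["content"]
    }
  }
}'
# Against a live cluster (assumes $ES_URL is set):
# curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d "$SQS"
echo "$SQS"
```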
If you are interested in such tutorial articles, have ideas for new ones, or have proposals for cooperation, I would be glad to receive a message in a PM or by email [email protected].