Elasticsearch Basics

Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all its advantages is available on the official website. From here on we will refer to Elasticsearch as ES.

Engines like this are used for complex searches over a document store, for example searching that takes the morphology of a language into account, or searching by geo coordinates.

In this article I will cover the basics of ES using the example of indexing blog posts. I will show how to filter, sort, and search documents.

To avoid depending on the operating system, I will make all requests to ES with curl. There is also a Google Chrome plugin called Sense.

The text contains links to the documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in the glossary.

Installation

First we need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.

The ES distribution is available on the developer's website. After unpacking the archive, run bin/elasticsearch. Packages for apt and yum are also available, and there is an official Docker image. More details are in the installation documentation.

After installing and starting ES, let's check that it works:

# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We will get something like this:

{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

Indexing

Let's add a post to ES:

# Add a document with id 1 of type post to the index blog.
# ?pretty asks for human-readable output.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешная история"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

Server response:

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

ES automatically created the index blog and the type post. A rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:

# Get the mapping of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server response, I have added the field values of the indexed document as comments:

{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>Смешная история про котят<p>", */ 
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["котята", "смешная история"] */
          "tags" : {
            "type" : "string"
          },
          /*  "title": "Веселые котята" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field contains a single title, while the tags field contains an array of strings, yet both are represented the same way in the mapping.
We will talk more about mapping later.
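
To make the point about arrays concrete, here is a rough Python imitation of automatic mapping. It is only a sketch: infer_type and infer_mapping are hypothetical helpers, and the rules are deliberately simplistic compared with what ES really does.

```python
from datetime import datetime


def infer_type(value):
    """Guess a mapping type for one JSON value (toy rules, not ES's real logic)."""
    if isinstance(value, list):
        if not value:
            raise ValueError("cannot infer a type from an empty array")
        # An array gets the type of its elements: there is no separate array type.
        return infer_type(value[0])
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Build a properties-like dict for every field of the document."""
    return {field: {"type": infer_type(value)} for field, value in doc.items()}


post = {
    "title": "Веселые котята",
    "content": "<p>Смешная история про котят<p>",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
}
mapping = infer_mapping(post)
print(mapping)
```

The single-valued title and the multi-valued tags end up with the same entry, which mirrors the mapping shown above.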

Queries

Retrieving a document by its id:

# retrieve the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята",
    "content" : "<p>Смешная история про котят<p>",
    "tags" : [ "котята", "смешная история" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys appeared in the response: _version and _source. In general, all keys that start with _ are reserved for internal use.

The _version key shows the document version. It is needed for the optimistic locking mechanism to work. For example, we want to change a document that has version 1. We submit the modified document and state that it is an edit of the document with version 1. If someone else was also editing version 1 and submitted their changes before us, ES will not accept our changes, because it now stores the document with version 2.
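
The scenario above can be sketched with a tiny in-memory store. This is an illustration of optimistic locking in general, not of ES internals: VersionedStore, VersionConflict, and expected_version are names I made up for the example. In ES 2.x itself, as far as I know, the expected version is passed as the version query parameter of the index request.

```python
class VersionConflict(Exception):
    """Raised when the caller edited a version that is no longer current."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # Reject the write if the document changed since the caller read it.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
store.index("1", {"title": "original"})                           # becomes version 1
store.index("1", {"title": "someone else"}, expected_version=1)   # becomes version 2

try:
    # Our edit was also based on version 1, which is now stale.
    store.index("1", {"title": "our edit"}, expected_version=1)
    conflict = False
except VersionConflict:
    conflict = True

print(conflict, store.get("1"))
```

After a conflict the usual remedy is to re-read the document, reapply the change to the fresh version, and try again.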

The _source key contains the document we indexed. ES does not use this value for search operations, because searching is done against the indexes. To save space, ES stores the source document compressed. If we only need the id and not the whole source document, we can disable source storage.

If we don't need the additional information, we can fetch only the contents of _source:

curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
  "title" : "Веселые котята",
  "content" : "<p>Смешная история про котят<p>",
  "tags" : [ "котята", "смешная история" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also select only certain fields:

# retrieve only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята"
  }
}

Let's index a few more posts and run more complex queries.

curl -XPUT "$ES_URL/blog/post/2" -d'
{
  "title": "Веселые щенки",
  "content": "<p>Смешная история про щенков<p>",
  "tags": [
    "щенки",
    "смешная история"
  ],
  "published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
  "title": "Как у меня появился котенок",
  "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
  "tags": [
    "котята"
  ],
  "published_at": "2014-07-21T20:44:42+00:00"
}'

Sorting

# find the latest post by publication date and retrieve the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Веселые котята",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We selected the latest post. size limits the number of documents returned. total shows the total number of documents that match the query. sort in the output contains an array of integers by which the sorting is performed; in other words, the date was converted to an integer. More information about sorting can be found in the documentation.
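
The integer in sort looks like the publication date expressed as milliseconds since the Unix epoch. A short Python check of that reading (to_epoch_millis is a hypothetical helper name):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(iso_timestamp):
    """Convert an ISO 8601 timestamp with an offset to whole milliseconds since the epoch."""
    moment = datetime.fromisoformat(iso_timestamp)
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


sort_value = to_epoch_millis("2014-09-12T20:44:42+00:00")
print(sort_value)
```

The result can be compared with the value in the sort array of the response above.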

Filters and queries

Since version 2, ES no longer distinguishes between filters and queries; instead, the concept of contexts was introduced.
Query context differs from filter context in that a query produces a _score and is not cached. I will show what _score is later.

Filtering by date

We use the range query in filter context:

# get the posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'

Filtering by tags

We use the term query to find the documents that contain a given term:

# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Веселые котята",
        "tags" : [ "котята", "смешная история" ]
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Как у меня появился котенок",
        "tags" : [ "котята" ]
      }
    } ]
  }
}

Full-text search

Our three documents contain the following in the content field:

  • <p>Смешная история про котят<p>
  • <p>Смешная история про щенков<p>
  • <p>Душераздирающая история про бедного котенка с улицы<p>

We use the match query to find the ids of the documents that contain a given word:

# _source: false means the _source of the found documents does not need to be retrieved
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "истории" in the content field, we will find nothing, because the index contains only the original words, not their stems. To get high-quality search, we need to configure an analyzer.
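
A toy inverted index in Python shows why the exact word form matters. This is only a sketch under simplifying assumptions: the tokenizer is a plain regular expression, and build_index is a hypothetical helper rather than anything ES exposes.

```python
import re
from collections import defaultdict

contents = {
    1: "<p>Смешная история про котят<p>",
    2: "<p>Смешная история про щенков<p>",
    3: "<p>Душераздирающая история про бедного котенка с улицы<p>",
}


def build_index(docs):
    """Map every lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


index = build_index(contents)

# The exact word form stored in the index is found in all three posts.
found_exact = index.get("история", set())
# A different form of the same word is simply not in the index.
found_other_form = index.get("истории", set())
print(found_exact, found_other_form)
```

Because nothing reduces the words to stems here, the lookup only succeeds for the form that was literally stored.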

The _score field shows relevance. If the request is executed in filter context, the _score value will always be 1, which means a complete match with the filter.

Analyzers

Analyzers are needed to convert the source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. A Tokenizer breaks the source string into tokens, for example on spaces and punctuation characters. A TokenFilter can change tokens, delete them, or add new ones; for example, it can keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the whole source string, for example by stripping HTML tags.
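
The way these three kinds of components chain together can be sketched in plain Python. Every function below is a simplified stand-in that I named for the example; the sketch omits stemming and does not reproduce the behaviour of the real ES components.

```python
import re


def html_strip(text):
    """CharFilter stand-in: rewrites the whole source string."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer stand-in: splits the string into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """TokenFilter stand-in: changes every token."""
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    """TokenFilter stand-in: removes tokens found in the stop list."""
    def stop(tokens):
        return [token for token in tokens if token not in stopwords]
    return stop


def analyze(text, char_filters, tokenizer, token_filters):
    """Run CharFilters first, then the single Tokenizer, then TokenFilters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Смешная история про котят<p>",
    char_filters=[html_strip],
    tokenizer=tokenize,
    token_filters=[lowercase, make_stop_filter({"про"})],
)
print(tokens)
```

The order of the TokenFilters matters: the stop list here is lowercase, so it has to run after the lowercase filter.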

ES ships with several standard analyzers, for example the russian analyzer.

Let's use the API and see how the standard and russian analyzers transform the string "Веселые истории про котят":

# use the standard analyzer
# non-ASCII characters must be URL-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "кот",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

The standard analyzer split the string on spaces and converted everything to lowercase. The russian analyzer removed the stop words, converted everything to lowercase, and kept only the stems of the words.
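
The percent-encoded text parameter in the _analyze requests above does not have to be typed by hand. One way to produce it, sketched with Python's standard library (the analyze_path variable is only for illustration):

```python
from urllib.parse import quote

text = "Веселые истории про котят"

# quote() percent-encodes the UTF-8 bytes and turns spaces into %20.
encoded = quote(text)
analyze_path = f"/_analyze?pretty&analyzer=russian&text={encoded}"
print(analyze_path)
```

The resulting path can be appended to $ES_URL in the curl call.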

Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:

{
  "filter": {
    "russian_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "russian_keywords": {
      "type":       "keyword_marker",
      "keywords":   []
    },
    "russian_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer":  "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's define our own analyzer based on russian that also strips HTML tags. We will call it default, because an analyzer with this name is used by default.

{
  "filter": {
    "ru_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "ru_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "default": {
      /* add stripping of html tags */
      "char_filter": ["html_strip"],
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, stop words are removed, and the remaining tokens are reduced to word stems.

Creating an index

Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would also be processed by the analyzer. Because we look up posts by exact match on a tag, we need to disable analysis for the tags field.
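
A small Python sketch of what is at stake for a multi-word tag. Both functions are hypothetical stand-ins: analyzed_terms imitates an analyzed field with a plain word split (no stemming), and not_analyzed_terms imitates a field stored as a single exact term.

```python
import re


def analyzed_terms(value):
    """Stand-in for an analyzed field: the value is split into lowercase words."""
    return set(re.findall(r"\w+", value.lower()))


def not_analyzed_terms(value):
    """Stand-in for a not_analyzed field: the whole value is one term."""
    return {value}


tags = ["котята", "смешная история"]

analyzed = set().union(*(analyzed_terms(tag) for tag in tags))
exact = set().union(*(not_analyzed_terms(tag) for tag in tags))

# An exact lookup of the full tag only succeeds against the not_analyzed terms.
wanted = "смешная история"
print(wanted in analyzed, wanted in exact)
```

With a stemming analyzer the gap would be even wider, since single-word tags would be stored as stems rather than in their original form.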

Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:

curl -XPOST "$ES_URL/blog2" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "ru_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "default": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_stop",
            "ru_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "content": {
          "type": "string"
        },
        "published_at": {
          "type": "date"
        },
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Let's add the same 3 posts to this index (blog2). I will omit this step because it is the same as adding documents to the blog index.

Full-text search with expression support

Let's look at another type of query:

# find the documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search request
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Because we use an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.
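
When such requests are generated by a program, the field priorities from the request above can be kept in one place. A minimal Python sketch (simple_query is a hypothetical helper, not part of any ES client):

```python
import json


def simple_query(query, boosts):
    """Build a simple_query_string body; a boost of 1 is written without the ^ suffix."""
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]
    return {"query": {"simple_query_string": {"query": query, "fields": fields}}}


body = simple_query("истории", {"title": 3, "tags": 2, "content": 1})

# ensure_ascii=False keeps the Cyrillic text readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be passed to curl with -d, exactly like the hand-written body.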

The query can contain special characters, for example:

""fried eggs" +(eggplant | potato) -frittata"

Query syntax:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount

# find the documents without the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# we get 2 posts about kittens

Links

PS

If you are interested in similar tutorial articles, have ideas for new articles, or have proposals for collaboration, I will be glad to hear from you by private message or by email.

Source: www.habr.com