Elasticsearch Basics

Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available on the official website. From here on we will refer to Elasticsearch as ES.

Engines like this are used for complex searches over a document database. For example, search that takes the morphology of a language into account, or search by geo coordinates.

In this article I will cover the basics of ES using the example of indexing blog posts. I will show how to filter, sort and search documents.

To stay independent of the operating system, I will make all requests to ES with curl. There is also a plugin for Google Chrome called sense.

The text contains links to documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in the glossary.

Installation

To install ES, we first need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.

The ES distribution is available on the developer's site. After unpacking the archive, run bin/elasticsearch. Packages for apt and yum are also available, and there is an official image for docker. More details are in the installation documentation.

After installing and starting ES, let's check that it works:

# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We will get something like this:

{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

Indexing

Let's add a post to ES:

# Add a document with id 1 of type post to the index blog.
# ?pretty makes the output human-readable.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешная история"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

Server response:

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

ES automatically created the index blog and the type post. A rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:

# Get the mapping of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server response, I have added the field values of the indexed document as comments:

{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>Смешная история про котят<p>", */ 
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["котята", "смешная история"] */
          "tags" : {
            "type" : "string"
          },
          /*  "title": "Веселые котята" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field holds just a title, and the tags field holds an array of strings, although they are represented the same way in the mapping.
We will talk more about mappings later.

Queries

Retrieving a document by its id:

# retrieve the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята",
    "content" : "<p>Смешная история про котят<p>",
    "tags" : [ "котята", "смешная история" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys appeared in the response: _version and _source. In general, all keys starting with _ are system keys.

The _version key shows the version of the document. It is needed for the optimistic locking mechanism to work. For example, we want to change a document that has version 1. We submit the changed document and indicate that this is an edit of the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it now stores the document with version 2.
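The version check described above can be sketched in a few lines of Python. This is a toy in-memory model, not ES code: the VersionedStore class, its index method and the VersionConflict exception are hypothetical names used only for illustration.

```python
class VersionConflict(Exception):
    """Raised when the caller edited a stale version of a document."""


class VersionedStore:
    """Toy in-memory store that mimics optimistic locking by version number."""

    def __init__(self):
        self._docs = {}  # id -> (version, document)

    def index(self, doc_id, doc, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller based its edit on an outdated version.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, stored version is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, doc)
        return new_version


store = VersionedStore()
store.index("1", {"title": "draft"})  # stored as version 1
# Someone else edits version 1 first; the store now holds version 2.
store.index("1", {"title": "edit by someone else"}, expected_version=1)

try:
    # Our edit was also based on version 1, which is now stale.
    store.index("1", {"title": "our edit"}, expected_version=1)
    conflict = False
except VersionConflict:
    conflict = True

print(conflict)  # True
```

The losing writer has to re-read the document, reapply the change on top of version 2 and submit again.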

The _source key contains the document we indexed. ES does not use this value for search operations, because indexes are used for searching. To save space, ES stores the source document compressed. If we only need the id and not the whole source document, we can disable storing the source.

If we don't need the additional information, we can get just the contents of _source:

curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
  "title" : "Веселые котята",
  "content" : "<p>Смешная история про котят<p>",
  "tags" : [ "котята", "смешная история" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also select only certain fields:

# retrieve only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята"
  }
}

Let's index a few more posts and run more complex queries.

curl -XPUT "$ES_URL/blog/post/2" -d'
{
  "title": "Веселые щенки",
  "content": "<p>Смешная история про щенков<p>",
  "tags": [
    "щенки",
    "смешная история"
  ],
  "published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
  "title": "Как у меня появился котенок",
  "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
  "tags": [
    "котята"
  ],
  "published_at": "2014-07-21T20:44:42+00:00"
}'

Sorting

# find the latest post by publication date and retrieve the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Веселые котята",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We selected the latest post. size limits the number of documents returned. total shows the total number of documents matching the query. sort in the output contains an array of integers by which the sorting is performed; that is, the date was converted to an integer. More information about sorting can be found in the documentation.
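The integer in the sort array is the publication date expressed in milliseconds since the Unix epoch. A minimal Python sketch that reproduces the value from the response above (the variable names are only illustrative):

```python
from datetime import datetime

# published_at of the post with id 1, as it was indexed
published_at = datetime.fromisoformat("2014-09-12T20:44:42+00:00")

# Milliseconds since the Unix epoch, the form shown in the "sort" array
sort_value = int(published_at.timestamp()) * 1000
print(sort_value)  # 1410554682000
```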

Filters and queries

Since version 2, ES does not distinguish between filters and queries; instead, the concept of contexts is introduced.
Query context differs from filter context in that a query generates _score and is not cached. I will show what _score is later.

Filter by date

We use the range query in filter context:

# get posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'

Filter by tags

We use the term query to find the ids of documents containing a given word:

# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Веселые котята",
        "tags" : [ "котята", "смешная история" ]
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Как у меня появился котенок",
        "tags" : [ "котята" ]
      }
    } ]
  }
}

Full-text search

Three of our documents have the following in the content field:

  • <p>Смешная история про котят<p>
  • <p>Смешная история про щенков<p>
  • <p>Душераздирающая история про бедного котенка с улицы<p>

We use the match query to find the ids of documents containing a given word:

# _source: false means the _source of the found documents is not retrieved
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "истории" in the content field, we will find nothing, because the index contains only the original words, not their stems. To get good search quality, you need to configure an analyzer.
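Why the inflected form finds nothing can be shown with a toy inverted index in Python. This is only a sketch of the idea, not how ES or Lucene is implemented; the tokenize and search helpers are hypothetical, and the tokenizer simply lowercases and splits on word characters without stemming.

```python
import re
from collections import defaultdict

# content fields of the three indexed posts
contents = {
    "1": "<p>Смешная история про котят<p>",
    "2": "<p>Смешная история про щенков<p>",
    "3": "<p>Душераздирающая история про бедного котенка с улицы<p>",
}


def tokenize(text):
    # Lowercase and split into runs of word characters; no stemming.
    return re.findall(r"\w+", text.lower())


# token -> set of ids of the documents that contain it
inverted_index = defaultdict(set)
for doc_id, text in contents.items():
    for token in tokenize(text):
        inverted_index[token].add(doc_id)


def search(word):
    # Exact lookup of a single token in the index.
    return sorted(inverted_index.get(word.lower(), set()))


print(search("история"))  # ['1', '2', '3']
print(search("истории"))  # []
```

The index holds the token "история" exactly as written, so a lookup for "истории" has nothing to match until both sides are reduced to a common stem.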

The _score field shows relevance. If the request is executed in filter context, the _score value will always be 1, which means a complete match with the filter.

Analyzers

Analyzers are needed to turn source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. A Tokenizer breaks the source string into tokens, for example on spaces and punctuation characters. A TokenFilter can change tokens, remove them or add new ones, for example keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the whole source string, for example by stripping HTML tags.
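The order of the stages can be sketched as a small Python pipeline. This is a toy model under stated assumptions, not ES code: the function names are hypothetical, the stop list contains just two words instead of the real Russian list, and no stemming is performed.

```python
import re

STOP_WORDS = {"про", "с"}  # tiny hypothetical stop list


def html_strip(text):
    # CharFilter: works on the whole source string before tokenization.
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    # Tokenizer: splits the string into word tokens.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # TokenFilter: changes tokens.
    return [t.lower() for t in tokens]


def stop(tokens):
    # TokenFilter: removes tokens.
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, char_filters, tokenize, token_filters):
    # CharFilters run first, then one Tokenizer, then TokenFilters in order.
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Смешная история про котят<p>",
    [html_strip],
    tokenizer,
    [lowercase, stop],
)
print(tokens)  # ['смешная', 'история', 'котят']
```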

ES has several standard analyzers. For example, the russian analyzer.

Let's use the _analyze API to see how the standard and russian analyzers transform the string "Веселые истории про котят":

# use the standard analyzer
# non-ASCII characters must be percent-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "кот",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

The standard analyzer split the string on spaces and converted everything to lowercase. The russian analyzer removed insignificant words, converted the rest to lowercase and kept only the stems of the words.
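The percent-encoded text parameter in the _analyze requests above does not have to be written by hand. A minimal Python sketch using the standard library (the variable names are illustrative):

```python
from urllib.parse import quote

text = "Веселые истории про котят"

# quote() encodes the string as UTF-8 and writes each non-ASCII byte as %XX;
# spaces become %20.
encoded = quote(text)
print(encoded)

# Query string in the same form as the curl requests above
query_string = f"pretty&analyzer=standard&text={encoded}"
print(query_string)
```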

Let's see which Tokenizer, TokenFilters and CharFilters the russian analyzer uses:

{
  "filter": {
    "russian_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "russian_keywords": {
      "type":       "keyword_marker",
      "keywords":   []
    },
    "russian_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer":  "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's define our own analyzer based on russian that also strips HTML tags. We will call it default, because an analyzer with that name is used by default.

{
  "filter": {
    "ru_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "ru_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "default": {
      /* add stripping of html tags */
      "char_filter": ["html_strip"],
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, insignificant words are removed, and the remaining tokens are reduced to their stems.

Creating an index

Above we defined the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would be processed by the analyzer as well. Since we look for posts by exact match on a tag, we need to disable analysis for the tags field.
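The effect of disabling analysis on a multi-word tag can be illustrated with a toy Python comparison. This is a sketch of the idea only: the two helper functions are hypothetical, and the analyzed variant just lowercases and splits on word characters without the stemming the real default analyzer would apply.

```python
import re


def analyzed_terms(value):
    # Analyzed field: the value is split into lowercase word tokens.
    return set(re.findall(r"\w+", value.lower()))


def not_analyzed_terms(value):
    # not_analyzed field: the whole value is kept as a single term.
    return {value}


tag = "смешная история"

print(analyzed_terms(tag) == {"смешная", "история"})  # True
print(tag in analyzed_terms(tag))                      # False
print(tag in not_analyzed_terms(tag))                  # True
```

An exact lookup for the whole tag only succeeds when the tag was stored as one term.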

Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:

curl -XPOST "$ES_URL/blog2" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "ru_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "default": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_stop",
            "ru_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "content": {
          "type": "string"
        },
        "published_at": {
          "type": "date"
        },
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Let's add the same 3 posts to this index (blog2). I will skip this step, since it is the same as adding documents to the index blog.

Full-text search with expression support

Let's look at another type of query:

# find documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Since we use an analyzer with Russian stemming, this query will return all the documents, even though they contain only the word 'история'.

The query can contain special characters, for example:

"\"fried eggs\" +(eggplant | potato) -frittata"

Query syntax:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
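When a query uses the phrase operator, the inner double quotes have to be escaped inside the JSON request body. A minimal Python sketch that lets the standard json module do the escaping; the fields list is taken from the requests in this section, and the variable names are illustrative:

```python
import json

# Phrase, required group with OR, and a negated term
query_text = '"fried eggs" +(eggplant | potato) -frittata'

body = json.dumps({
    "query": {
        "simple_query_string": {
            "query": query_text,
            "fields": ["title^3", "tags^2", "content"],
        }
    }
})

# The inner quotes come out as \" in the serialized body.
print(body)
```

The resulting string can be passed to curl with -d in the same way as the hand-written bodies.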

# find documents that do not contain the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# we get 2 posts about kittens

PS

If you are interested in similar tutorial articles, have ideas for new articles, or have proposals for collaboration, I would be glad to receive a message via private message or by email: [email protected].

Source: www.habr.com