Elasticsearch Basics

Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available on the official website. From here on we will refer to Elasticsearch as ES.

Engines like this are used for complex searches over a document store. Examples include search that takes the morphology of a language into account, or search by geo coordinates.

In this article I will cover the basics of ES using the example of indexing blog posts. I will show you how to filter, sort, and search documents.

To avoid depending on the operating system, I will make all requests to ES with curl. There is also a plugin for Google Chrome called Sense.

The text contains links to the documentation and other sources. Links for quick access to the documentation are collected at the end. Definitions of unfamiliar terms can be found in the glossary.

Installation

To do this, we first need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.

The ES distribution is available on the developers' website. After unpacking the archive, run bin/elasticsearch. Packages for apt and yum are also available, and there is an official image for Docker. More details on installation can be found in the documentation.

After installing and starting ES, let's check that it works:

# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We will get something like this:

{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

Indexing

Let's add a post to ES:

# Add a document with id 1 of type post to the index blog.
# ?pretty asks for human-readable output.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешная история"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

Server response:

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

ES automatically created the index blog and the type post. We can draw a rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:

# Get the mapping of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server response, I have added the field values of the indexed document as comments:

{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>Смешная история про котят<p>", */ 
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["котята", "смешная история"] */
          "tags" : {
            "type" : "string"
          },
          /*  "title": "Веселые котята" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field simply contains a title, while the tags field contains an array of strings, yet both are represented the same way in the mapping.
We will talk more about mappings later.
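
To make the point about single values and arrays concrete, here is a minimal Python sketch of type inference. The function infer_type and its rules are hypothetical and only mimic the mapping shown above; they are not ES's actual detection logic.

```python
from datetime import datetime


def infer_type(value):
    """Guess a mapping type for a JSON value (toy rules, not ES's real logic)."""
    if isinstance(value, list):
        # An array is typed by its elements; there is no separate array type.
        return infer_type(value[0]) if value else None
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


post = {
    "title": "Веселые котята",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
}

mapping = {field: infer_type(value) for field, value in post.items()}
print(mapping)
```

In this sketch title and tags both come out as string, which mirrors why the two fields look identical in the mapping.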

Queries

Retrieving a document by its id:

# retrieve the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята",
    "content" : "<p>Смешная история про котят<p>",
    "tags" : [ "котята", "смешная история" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys have appeared in the response: _version and _source. In general, all keys that start with _ are internal (service) keys.

The _version key shows the version of the document. It is needed for the optimistic locking mechanism. For example, suppose we want to change a document that has version 1. We send the changed document and state that this is an edit of the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it already stores the document with version 2.
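
The scenario above can be sketched with a small in-memory model. This is only an illustration of optimistic locking in general; the class VersionedStore, its method index, and the expected_version argument are hypothetical names, not part of ES.

```python
class VersionConflict(Exception):
    """Raised when the caller edited a stale version of a document."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def version(self, doc_id):
        # Version 0 means the document does not exist yet.
        return self._docs.get(doc_id, (0, None))[0]

    def index(self, doc_id, source, expected_version=None):
        current = self.version(doc_id)
        if expected_version is not None and expected_version != current:
            raise VersionConflict(
                f"expected version {expected_version}, stored version is {current}"
            )
        self._docs[doc_id] = (current + 1, source)
        return current + 1


store = VersionedStore()
store.index("1", {"title": "Веселые котята"})                            # version 1
store.index("1", {"title": "edited by someone else"}, expected_version=1)  # version 2

try:
    store.index("1", {"title": "our edit"}, expected_version=1)
    rejected = False
except VersionConflict:
    rejected = True

print(rejected, store.version("1"))
```

Our edit is rejected because it was based on version 1 while the store already holds version 2. No lock is held between reading and writing, which is what makes the scheme optimistic.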

The _source key contains the document that we indexed. ES does not use this value for search operations, because indexes are used for searching. To save space, ES stores the original document in compressed form. If we only need the id and not the whole source document, we can disable storing the source.

If we do not need the additional information, we can get just the contents of _source:

curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
  "title" : "Веселые котята",
  "content" : "<p>Смешная история про котят<p>",
  "tags" : [ "котята", "смешная история" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also select only certain fields:

# retrieve only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята"
  }
}

Let's index a few more posts and run more complex queries.

curl -XPUT "$ES_URL/blog/post/2" -d'
{
  "title": "Веселые щенки",
  "content": "<p>Смешная история про щенков<p>",
  "tags": [
    "щенки",
    "смешная история"
  ],
  "published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
  "title": "Как у меня появился котенок",
  "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
  "tags": [
    "котята"
  ],
  "published_at": "2014-07-21T20:44:42+00:00"
}'

Sorting

# find the latest post by publication date and retrieve the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Веселые котята",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We selected the latest post. size limits the number of documents returned. total shows the total number of documents that match the query. sort in the output contains an array of integers by which the sorting is performed, i.e. the date was converted to an integer. More about sorting can be found in the documentation.
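
The integer in sort is consistent with the publication date expressed as milliseconds since the Unix epoch. A short Python check, using only the standard library:

```python
from datetime import datetime

# published_at of the post with id 1, as indexed above
published_at = datetime.fromisoformat("2014-09-12T20:44:42+00:00")

# seconds since 1970-01-01T00:00:00Z, scaled to milliseconds
sort_value = int(published_at.timestamp() * 1000)
print(sort_value)  # 1410554682000
```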

Filters and queries

Since version 2, ES no longer distinguishes between filters and queries; instead, the concept of contexts has been introduced.
A query context differs from a filter context in that a query produces a _score and is not cached. I will show what _score is later.

Filtering by date

We use the range query in a filter context:

# get the posts published on September 1st or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'
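
As a rough model of what this filter selects, the sketch below applies the same gte condition to the three publication dates in plain Python. The helper range_gte is hypothetical, and treating the bare date "2014-09-01" as midnight UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone

# published_at of the three indexed posts, keyed by document id
published = {
    "1": "2014-09-12T20:44:42+00:00",
    "2": "2014-08-12T20:44:42+00:00",
    "3": "2014-07-21T20:44:42+00:00",
}


def range_gte(dates, lower_bound):
    """Return ids whose date is greater than or equal to lower_bound."""
    # Assumption: a date without a time zone is read as midnight UTC.
    bound = datetime.fromisoformat(lower_bound).replace(tzinfo=timezone.utc)
    return sorted(
        doc_id
        for doc_id, value in dates.items()
        if datetime.fromisoformat(value) >= bound
    )


print(range_gte(published, "2014-09-01"))  # ['1']
```

A filter only answers yes or no for each document, so there is nothing to rank.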

Filtering by tags

We use the term query to find the ids of documents that contain a given term:

# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Веселые котята",
        "tags" : [ "котята", "смешная история" ]
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Как у меня появился котенок",
        "tags" : [ "котята" ]
      }
    } ]
  }
}

Full-text search

Our three documents contain the following in the content field:

  • <p>Смешная история про котят<p>
  • <p>Смешная история про щенков<p>
  • <p>Душераздирающая история про бедного котенка с улицы<p>

We use the match query to find the ids of documents that contain a given word:

# _source: false means that the _source of the found documents does not need to be retrieved
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "истории" in the content field, we will find nothing, because the index contains only the original words, not their stems. To get a good-quality search, you need to configure an analyzer.

The _score field shows relevance. If the query is executed in a filter context, the _score value is always 1, which means a full match with the filter.

Analyzers

Analyzers are needed to convert source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. A Tokenizer splits the source string into tokens, for example on whitespace and punctuation. A TokenFilter can change tokens, remove them, or add new ones, for example keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the whole source string, for example by stripping HTML tags.
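
The order of the stages can be sketched in a few lines of Python. This is a toy pipeline, not Lucene's implementation: the function names, the regular expressions, and the one-word stop list are all assumptions made for illustration, and no stemming is performed.

```python
import re


def html_strip(text):
    """CharFilter: rewrite the whole source string (here: drop HTML tags)."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the string into runs of letters and digits."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """TokenFilter: change every token."""
    return [token.lower() for token in tokens]


STOP_WORDS = {"про"}  # tiny hypothetical stop list


def stop(tokens):
    """TokenFilter: remove tokens."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Смешная история про котят<p>", [html_strip], tokenize, [lowercase, stop]
)
print(tokens)  # ['смешная', 'история', 'котят']
```

CharFilters see the raw string, the Tokenizer runs exactly once, and TokenFilters are applied in the order they are listed.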

ES has several standard analyzers, for example the russian analyzer.

Let's use the API and see how the standard and russian analyzers transform the string "Веселые истории про котят":

# use the standard analyzer
# non-ASCII characters must be percent-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "кот",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

The standard analyzer split the string on whitespace and converted everything to lowercase. The russian analyzer removed insignificant words, converted the rest to lowercase, and kept only the stems of the words.
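
The percent-encoded text parameter used in the two requests above does not have to be typed by hand. One way to produce it is Python's standard urllib.parse.quote; the variable names here are only illustrative.

```python
from urllib.parse import quote, unquote

text = "Веселые истории про котят"

# quote() encodes the UTF-8 bytes of every non-ASCII character and the spaces
encoded = quote(text)
print(encoded)

# the encoding is reversible
assert unquote(encoded) == text
```

The printed value can be pasted after text= in the _analyze URL.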

Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:

{
  "filter": {
    "russian_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "russian_keywords": {
      "type":       "keyword_marker",
      "keywords":   []
    },
    "russian_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer":  "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's describe our own analyzer based on russian that also strips HTML tags. We will call it default, because an analyzer with this name is used by default.

{
  "filter": {
    "ru_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "ru_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "default": {
      /* add stripping of HTML tags */
      "char_filter": ["html_strip"],
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, insignificant words are removed, and the remaining tokens are reduced to the stem of the word.

Creating an index

Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would also be processed by the analyzer. Since we look up posts by exact match on a tag, we need to disable analysis for the tags field.
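
Why analysis gets in the way of exact tag matching can be seen on a toy inverted index. The sketch below is an illustration only: build_index is a hypothetical helper, and its "analysis" merely lowercases and splits on non-word characters, without the stop words and stemming of the real analyzer.

```python
import re
from collections import defaultdict


def build_index(docs, analyzed):
    """Map each stored term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            # Analyzed: the tag is broken into word tokens.
            # Not analyzed: the whole tag is stored as a single term.
            terms = re.findall(r"\w+", tag.lower()) if analyzed else [tag]
            for term in terms:
                index[term].add(doc_id)
    return index


tags_by_post = {
    "1": ["котята", "смешная история"],
    "2": ["щенки", "смешная история"],
    "3": ["котята"],
}

analyzed_index = build_index(tags_by_post, analyzed=True)
exact_index = build_index(tags_by_post, analyzed=False)

print(sorted(analyzed_index.get("смешная история", set())))  # []
print(sorted(exact_index.get("смешная история", set())))     # ['1', '2']
```

In the analyzed index the tag "смешная история" no longer exists as a term, so an exact lookup finds nothing. With stemming enabled the stored terms would drift even further from the original tags.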

Let's create the index blog2 with this analyzer and a mapping in which analysis of the tags field is disabled:

curl -XPOST "$ES_URL/blog2" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "ru_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "default": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_stop",
            "ru_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "content": {
          "type": "string"
        },
        "published_at": {
          "type": "date"
        },
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Let's add the same 3 posts to this index (blog2). I will skip this step because it is the same as adding documents to the blog index.

Full-text search with language support

Let's look at another type of query:

# find documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Since we use an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.

The query may contain special characters, for example:

"\"fried eggs\" +(eggplant | potato) -frittata"

Query syntax:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
# find documents that do not contain the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# we get 2 posts about kittens
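
When such requests are generated by a program, the body can be assembled from a field-to-priority table instead of being written by hand. A minimal Python sketch follows; the helper simple_query and its boosts argument are hypothetical, and only the resulting JSON shape is taken from the requests above.

```python
import json


def simple_query(query, boosts):
    """Build a simple_query_string body; a boost of 1 is left implicit."""
    fields = [
        name if boost == 1 else f"{name}^{boost}" for name, boost in boosts.items()
    ]
    return {"query": {"simple_query_string": {"query": query, "fields": fields}}}


body = simple_query("-щенки", {"title": 3, "tags": 2, "content": 1})

# ensure_ascii=False keeps the Cyrillic text readable in the output
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON matches the body passed to curl with -d in the last request.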

References

P.S.

If you are interested in similar tutorial articles, have ideas for new articles, or have proposals for collaboration, I will be glad to receive a private message or an email.

Source: www.habr.com