Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all its advantages is available on the official website. From here on, we will refer to Elasticsearch as ES.
Such engines are used for complex searches over a document database, for example searching with language morphology taken into account or searching by geo coordinates.
In this article, I will cover the basics of ES using the example of indexing blog posts. I will show how to filter, sort, and search documents.
To stay independent of the operating system, I will make all requests to ES with curl. There is also a Google Chrome plugin for sending such requests.
Links to the documentation and other sources are given throughout the text. Links for quick access to the documentation are collected at the end. Definitions of unfamiliar terms can be found in the glossary.
Installing ES
For this we first need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.
The ES distribution is available from the developers' website. After unpacking the archive, run bin/elasticsearch.
After installing and starting ES, let's check that it works:
# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200
curl -X GET $ES_URL
We will get a response similar to this:
{
"name" : "Heimdall",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.2.1",
"build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
"build_timestamp" : "2016-03-09T09:38:54Z",
"build_snapshot" : false,
"lucene_version" : "5.4.1"
},
"tagline" : "You Know, for Search"
}
Indexing
Let's add a post to ES:
# Add a document with id 1 and type post to the blog index.
# ?pretty tells ES to make the output human-readable.
curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
"title": "Веселые котята",
"content": "<p>Смешная история про котят<p>",
"tags": [
"котята",
"смешная история"
],
"published_at": "2014-09-12T20:44:42+00:00"
}'
Server response:
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : false
}
ES automatically created the index blog and the type post. A rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:
# get the mapping of all types in the blog index
curl -XGET "$ES_URL/blog/_mapping?pretty"
In the server response, I have added the field values of the indexed document as comments:
{
"blog" : {
"mappings" : {
"post" : {
"properties" : {
/* "content": "<p>Смешная история про котят<p>", */
"content" : {
"type" : "string"
},
/* "published_at": "2014-09-12T20:44:42+00:00" */
"published_at" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
/* "tags": ["котята", "смешная история"] */
"tags" : {
"type" : "string"
},
/* "title": "Веселые котята" */
"title" : {
"type" : "string"
}
}
}
}
}
}
It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field holds just a title, while the tags field holds an array of strings, yet both are represented the same way in the mapping.
We will talk about mappings in more detail later.
Queries
Retrieving a document by its ID:
# retrieve the document with id 1 and type post from the blog index
curl -XGET "$ES_URL/blog/post/1?pretty"
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title" : "Веселые котята",
"content" : "<p>Смешная история про котят<p>",
"tags" : [ "котята", "смешная история" ],
"published_at" : "2014-09-12T20:44:42+00:00"
}
}
New keys have appeared in the response: _version and _source. In general, all keys that start with _ are service keys.
The _version key shows the document version. It is needed for the optimistic locking mechanism to work. For example, suppose we want to change a document that has version 1. We submit the modified document and state that it is an edit of the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it now stores the document with version 2.
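The version check described above can be illustrated with a toy in-memory store. This is only a sketch of the idea, not how ES implements it; the class and method names (VersionedStore, put, expected_version) are hypothetical and exist only for this example.

```python
class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""


class VersionedStore:
    """Toy in-memory store that mimics version-checked writes."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else has already bumped the version.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, stored version is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
store.put("1", {"title": "Веселые котята"})  # stored as version 1

# Two editors both read version 1, then each tries to save an edit.
store.put("1", {"title": "edit A"}, expected_version=1)  # accepted, now version 2
try:
    store.put("1", {"title": "edit B"}, expected_version=1)
    conflict = False
except VersionConflict:
    conflict = True  # the second writer has to re-read the document and retry
```

The second writer is rejected because the stored version has moved on, which is exactly the situation the paragraph describes.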
The _source key contains the document that we indexed. ES does not use this value for search operations, because searching is done against the indexes. To save space, ES stores the source document in compressed form. If we only need the ID and not the whole source document, we can disable storing the source.
If we do not need the additional information, we can retrieve just the contents of _source:
curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
"title" : "Веселые котята",
"content" : "<p>Смешная история про котят<p>",
"tags" : [ "котята", "смешная история" ],
"published_at" : "2014-09-12T20:44:42+00:00"
}
You can also select only certain fields:
# retrieve only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title" : "Веселые котята"
}
}
Let's index a few more posts and run more complex queries.
curl -XPUT "$ES_URL/blog/post/2" -d'
{
"title": "Веселые щенки",
"content": "<p>Смешная история про щенков<p>",
"tags": [
"щенки",
"смешная история"
],
"published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
"title": "Как у меня появился котенок",
"content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
"tags": [
"котята"
],
"published_at": "2014-07-21T20:44:42+00:00"
}'
Sorting
# find the latest post by publication date and retrieve the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"size": 1,
"_source": ["title", "published_at"],
"sort": [{"published_at": "desc"}]
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : null,
"_source" : {
"title" : "Веселые котята",
"published_at" : "2014-09-12T20:44:42+00:00"
},
"sort" : [ 1410554682000 ]
} ]
}
}
We selected the latest post. size limits the number of documents returned. total shows the total number of documents that match the query. sort in the output contains an array of integers by which the sorting is performed; in other words, the date was converted to an integer. You can read more about sorting in the documentation.
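The integer in the sort array is the publication date expressed as milliseconds since the Unix epoch. A small check in Python, using only the standard library, shows the conversion for the post returned above:

```python
from datetime import datetime

# published_at of the post with id 1, exactly as it was indexed
published_at = datetime.fromisoformat("2014-09-12T20:44:42+00:00")

# seconds since the epoch, converted to whole milliseconds
sort_value = int(published_at.timestamp() * 1000)
print(sort_value)  # 1410554682000, the value shown in "sort"
```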
Filters and queries
Since version 2, ES does not distinguish between filters and queries; instead, it introduces the concept of contexts.
The query context differs from the filter context in that a query produces a _score and is not cached. I will explain what _score is later.
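One common way to place a clause in the filter context in ES 2.x is the filter clause of a bool query, while clauses under must run in the query context and contribute to _score. The sketch below only builds such a request body; the helper name filtered_search is hypothetical, and the range and match clauses are sample values chosen for illustration.

```python
import json


def filtered_search(filter_clause, query_clause=None):
    """Build a search body whose filter clause runs in the filter context."""
    bool_query = {"filter": filter_clause}
    if query_clause is not None:
        # Clauses under "must" run in the query context and affect _score.
        bool_query["must"] = query_clause
    return {"query": {"bool": bool_query}}


body = filtered_search(
    filter_clause={"range": {"published_at": {"gte": "2014-09-01"}}},
    query_clause={"match": {"content": "история"}},
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON could be passed to curl with -d in the same way as the other request bodies in this article.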
Filtering by date
We use the range query in the filter context:
# get the posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"filter": {
"range": {
"published_at": { "gte": "2014-09-01" }
}
}
}'
Filtering by tags
We use the term query to find the IDs of documents containing a given term:
# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"_source": [
"title",
"tags"
],
"filter": {
"term": {
"tags": "котята"
}
}
}'
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Веселые котята",
"tags" : [ "котята", "смешная история" ]
}
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"title" : "Как у меня появился котенок",
"tags" : [ "котята" ]
}
} ]
}
}
Full-text search
Our three documents contain the following in the content field:
<p>Смешная история про котят<p>
<p>Смешная история про щенков<p>
<p>Душераздирающая история про бедного котенка с улицы<p>
We use the match query to find the IDs of documents containing a given word:
# source: false means that the _source of the found documents does not need to be retrieved
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"_source": false,
"query": {
"match": {
"content": "история"
}
}
}'
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "2",
"_score" : 0.11506981
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : 0.11506981
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "3",
"_score" : 0.095891505
} ]
}
}
However, if we search for the word "истории" in the content field, we will find nothing, because the index contains only the original words and not their stems. To get good search results, we need to configure an analyzer.
The _score field shows relevance. If a query is executed in the filter context, _score is always 1, which means a full match with the filter.
Analyzers
Analyzers are needed to convert source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. Tokenizers split the source string into tokens, for example on whitespace and punctuation. TokenFilters can modify tokens, remove them, or add new ones, for example keeping only the word stem, removing prepositions, or adding synonyms. CharFilters modify the source string as a whole, for example by stripping HTML tags.
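The order of the three stages can be sketched in plain Python. This is a toy model for illustration only: the regular expressions and the two-word stop list are assumptions made for the example, not what ES or Lucene actually use, and no stemming is performed.

```python
import re


def html_strip(text):
    """CharFilter: rewrite the whole source string (here, drop HTML tags)."""
    return re.sub(r"<[^>]+>", " ", text)


def simple_tokenizer(text):
    """Tokenizer: split the string into tokens on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """TokenFilter: modify every token."""
    return [token.lower() for token in tokens]


def stop(tokens, stopwords=frozenset({"про", "с"})):
    """TokenFilter: remove tokens (a tiny, made-up stop list)."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run CharFilters, then the Tokenizer, then TokenFilters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Смешная история про котят<p>",
    char_filters=[html_strip],
    tokenizer=simple_tokenizer,
    token_filters=[lowercase, stop],
)
print(tokens)  # ['смешная', 'история', 'котят']
```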
ES ships with several standard analyzers, for example the russian analyzer.
Let's use the _analyze API and see how the standard and russian analyzers transform the string "Веселые истории про котят":
# use the standard analyzer
# non-ASCII characters must be percent-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
"tokens" : [ {
"token" : "веселые",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "истории",
"start_offset" : 8,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "про",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "котят",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
"tokens" : [ {
"token" : "весел",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "истор",
"start_offset" : 8,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "кот",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
The standard analyzer split the string on whitespace and converted everything to lowercase. The russian analyzer additionally removed insignificant words (stop words) and kept only the stem of each remaining word.
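The percent-encoded text parameter in the two _analyze requests above does not have to be produced by hand. As a sketch, Python's standard urllib.parse.quote yields the same encoding; the host in the URL simply repeats the localhost:9200 address used earlier.

```python
from urllib.parse import quote

text = "Веселые истории про котят"

# UTF-8 bytes are percent-encoded; spaces become %20
encoded = quote(text)

url = f"localhost:9200/_analyze?pretty&analyzer=russian&text={encoded}"
print(url)
```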
Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:
{
"filter": {
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"russian": {
"tokenizer": "standard",
/* TokenFilters */
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
/* there are no CharFilters */
}
}
}
Let's describe our own analyzer, based on russian, that also strips HTML tags. We will call it default, because an analyzer with that name is used by default.
{
"filter": {
"ru_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"ru_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"default": {
/* add stripping of HTML tags */
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": [
"lowercase",
"ru_stop",
"ru_stemmer"
]
}
}
}
First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, insignificant words are removed, and each remaining token is reduced to its stem.
Creating an index
Above, we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would be processed by the analyzer as well. Since we look up posts by an exact tag match, we need to disable analysis for the tags field.
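Why analysis gets in the way of exact tag matching can be shown with a toy inverted index. The two functions below are simplified stand-ins written for this example (the "analyzed" one only lowercases and splits into words, with no stemming); they are not ES internals.

```python
import re
from collections import defaultdict


def analyzed(value):
    """Stand-in for an analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", value.lower())


def not_analyzed(value):
    """Keep the whole value as a single term."""
    return [value]


def build_index(docs, to_terms):
    """Map each term to the set of document ids whose tags produce it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            for term in to_terms(tag):
                index[term].add(doc_id)
    return index


docs = {
    1: ["котята", "смешная история"],
    2: ["щенки", "смешная история"],
    3: ["котята"],
}

analyzed_index = build_index(docs, analyzed)
exact_index = build_index(docs, not_analyzed)

# Looking up the exact tag only works when the tag was stored as one term.
analyzed_hits = analyzed_index.get("смешная история", set())
exact_hits = exact_index.get("смешная история", set())
print(analyzed_hits, exact_hits)
```

In the analyzed index the tag was broken into the separate terms "смешная" and "история", so a lookup of the whole tag finds nothing, while the unanalyzed index returns posts 1 and 2.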
Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:
curl -XPOST "$ES_URL/blog2" -d'
{
"settings": {
"analysis": {
"filter": {
"ru_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"ru_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"default": {
"char_filter": [
"html_strip"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"ru_stop",
"ru_stemmer"
]
}
}
}
},
"mappings": {
"post": {
"properties": {
"content": {
"type": "string"
},
"published_at": {
"type": "date"
},
"tags": {
"type": "string",
"index": "not_analyzed"
},
"title": {
"type": "string"
}
}
}
}
}'
Let's add the same three posts to this index (blog2). I will skip this step, because it is the same as adding documents to the blog index.
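For readers who prefer to load the three posts in one request, one option might be the _bulk API, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that body; the helper name bulk_body is hypothetical, and the resulting text would still have to be sent to the _bulk endpoint (for example with curl and --data-binary).

```python
import json

posts = {
    "1": {
        "title": "Веселые котята",
        "content": "<p>Смешная история про котят<p>",
        "tags": ["котята", "смешная история"],
        "published_at": "2014-09-12T20:44:42+00:00",
    },
    "2": {
        "title": "Веселые щенки",
        "content": "<p>Смешная история про щенков<p>",
        "tags": ["щенки", "смешная история"],
        "published_at": "2014-08-12T20:44:42+00:00",
    },
    "3": {
        "title": "Как у меня появился котенок",
        "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
        "tags": ["котята"],
        "published_at": "2014-07-21T20:44:42+00:00",
    },
}


def bulk_body(index, doc_type, docs):
    """Build a _bulk request body: one action line plus one source line per document."""
    lines = []
    for doc_id, doc in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = bulk_body("blog2", "post", posts)
print(body)
```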
Full-text search with expression support
Let's get acquainted with another type of query:
# find the documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
"query": {
"simple_query_string": {
"query": "истории",
"fields": [
"title^3",
"tags^2",
"content"
]
}
}
}'
Since we are using an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.
The query may contain special characters, for example:
"\"fried eggs\" +(eggplant | potato) -frittata"
Query syntax:
+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
# find documents that do not contain the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
"query": {
"simple_query_string": {
"query": "-щенки",
"fields": [
"title^3",
"tags^2",
"content"
]
}
}
}'
# we will get 2 posts about kittens
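When a query string contains a quoted phrase, escaping the inner quotes by hand is easy to get wrong. A possible alternative is to build the request body in code and let the JSON serializer handle the escaping. In this sketch the helper name simple_query and the sample query are assumptions made for illustration; the field weights repeat the ones used above.

```python
import json


def simple_query(query, boosts):
    """Build a simple_query_string body; boosts maps field name -> weight."""
    fields = [
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in boosts.items()
    ]
    return {"query": {"simple_query_string": {"query": query, "fields": fields}}}


# A phrase in double quotes combined with a negated token.
body = simple_query('"смешная история" -щенки', {"title": 3, "tags": 2, "content": 1})
print(json.dumps(body, ensure_ascii=False, indent=2))
```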
PS
If you are interested in tutorial articles like this one, have ideas for new articles, or have proposals for collaboration, I will be glad to hear from you by private message or by email at m.kuzmin+habr@darkleaf.ru.
source: www.habr.com
