Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available at
Engines like this are used for complex searches over a document database, for example searching that takes the morphology of a language into account, or searching by geo coordinates.
In this article I will cover the basics of ES using the example of indexing blog posts. I will show how to filter, sort, and search documents.
To stay independent of the operating system, I will make all requests to ES with curl. There is also a plugin for Google Chrome called
The text contains links to documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in
Installing ES
To do this, we first need Java. The developers
The ES distribution is started by running bin/elasticsearch. There are also
After installing and starting it, let's check that it works:
# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200
curl -X GET $ES_URL
We get something like this:
{
"name" : "Heimdall",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.2.1",
"build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
"build_timestamp" : "2016-03-09T09:38:54Z",
"build_snapshot" : false,
"lucene_version" : "5.4.1"
},
"tagline" : "You Know, for Search"
}
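The same check can be scripted. Below is a minimal Python sketch that assumes the response body shown above has already been fetched (for example with urllib.request.urlopen against the address in ES_URL). The helper name es_version is hypothetical, and the sample body is abbreviated from the output above.

```python
import json

# Body of the root endpoint response, abbreviated from the output above.
SAMPLE_RESPONSE = """
{
  "name": "Heimdall",
  "cluster_name": "elasticsearch",
  "version": {"number": "2.2.1", "lucene_version": "5.4.1"},
  "tagline": "You Know, for Search"
}
"""


def es_version(body):
    """Return the ES version from the root response as a tuple of ints."""
    info = json.loads(body)
    return tuple(int(part) for part in info["version"]["number"].split("."))


print(es_version(SAMPLE_RESPONSE))
```

Comparing the tuple against (2, 0, 0) is a simple way to guard the examples in this article, which were written against ES 2.x.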
Indexing
Let's add a post to ES:
# Add a document with id 1 of type post to the index blog.
# ?pretty asks for human-readable output.
curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
"title": "Веселые котята",
"content": "<p>Смешная история про котят<p>",
"tags": [
"котята",
"смешная история"
],
"published_at": "2014-09-12T20:44:42+00:00"
}'
Server response:
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : true
}
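For readers who prefer a script to curl, here is a hedged Python sketch of the same indexing request. It only builds the request object and does not send it, so it runs without a server. The function name build_index_request is hypothetical, and the http:// scheme is an assumption about a local, unsecured ES.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = "http://{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    body = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


post = {
    "title": "Веселые котята",
    "content": "<p>Смешная история про котят<p>",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
}
request = build_index_request("localhost:9200", "blog", "post", 1, post)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; with a running ES it would be:
# urllib.request.urlopen(request)
```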
ES automatically created the index blog and the type post, together with a mapping for that type.
# Get the mapping of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"
In the server response, I added the values of the document's fields as comments:
{
"blog" : {
"mappings" : {
"post" : {
"properties" : {
/* "content": "<p>Смешная история про котят<p>", */
"content" : {
"type" : "string"
},
/* "published_at": "2014-09-12T20:44:42+00:00" */
"published_at" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
/* "tags": ["котята", "смешная история"] */
"tags" : {
"type" : "string"
},
/* "title": "Веселые котята" */
"title" : {
"type" : "string"
}
}
}
}
}
}
It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field holds just a title and the tags field holds an array of strings, yet they are represented identically in the mapping.
We will talk more about mapping later.
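To make the "single value vs array" point concrete, here is a toy Python sketch of type guessing. It is not how ES implements dynamic mapping; the functions guess_type and guess_mapping are hypothetical, and the sketch assumes arrays are non-empty and that dates arrive in ISO 8601 form.

```python
from datetime import datetime


def guess_type(value):
    """Rough stand-in for dynamic mapping: boolean, long, date, or string."""
    if isinstance(value, list):
        # An array is described by the type of its elements, not as "array".
        return guess_type(value[0])
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    raise TypeError("unsupported value: {!r}".format(value))


def guess_mapping(document):
    """Map every top-level field of the document to a guessed type."""
    return {name: {"type": guess_type(value)} for name, value in document.items()}


mapping = guess_mapping({
    "title": "Веселые котята",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
})
print(mapping)
```

In this sketch title and tags end up with the same entry, which mirrors what the mapping above shows.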
Queries
Retrieving a document by id:
# fetch the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title" : "Веселые котята",
"content" : "<p>Смешная история про котят<p>",
"tags" : [ "котята", "смешная история" ],
"published_at" : "2014-09-12T20:44:42+00:00"
}
}
New keys appeared in the response: _version and _source. In general, all keys starting with _ are service keys.
The _version key shows the document version. It is needed for the optimistic locking mechanism to work. For example, we want to change a document that has version 1. We submit the changed document and indicate that this is an edit of the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it already stores the document with version 2.
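The scenario above can be sketched with a toy in-memory store. This is only an illustration of optimistic locking, not ES code: ToyStore and VersionConflict are hypothetical names. In ES 2.x itself, as far as I know, the expected version is passed as the version query parameter of the index request, and a mismatch is reported as a version conflict.

```python
class VersionConflict(Exception):
    """Raised when the caller edited a stale version of the document."""


class ToyStore:
    def __init__(self):
        self._docs = {}  # id -> (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, document, expected_version=None):
        """Store the document; reject it if expected_version is stale."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version {}, stored version {}".format(
                    expected_version, current_version))
        self._docs[doc_id] = (current_version + 1, document)
        return current_version + 1


store = ToyStore()
store.put("1", {"title": "draft"})  # stored as version 1
store.put("1", {"title": "edit by someone else"}, expected_version=1)  # version 2
try:
    # Our edit was also based on version 1, which is now stale.
    store.put("1", {"title": "our edit"}, expected_version=1)
except VersionConflict as error:
    print("rejected:", error)
```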
The _source key contains the document that we indexed. ES does not use this value for search operations, because indexes are used for searching. To save space, ES stores the source document compressed. If we only need the id and not the whole source document, we can disable source storage.
If we don't need the additional information, we can get only the contents of _source:
curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
"title" : "Веселые котята",
"content" : "<p>Смешная история про котят<p>",
"tags" : [ "котята", "смешная история" ],
"published_at" : "2014-09-12T20:44:42+00:00"
}
You can also select only certain fields:
# fetch only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"title" : "Веселые котята"
}
}
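Conceptually, the _source parameter acts like a field whitelist applied to the stored document. Here is a tiny Python sketch of that idea. The function filter_source is hypothetical and handles only top-level field names, which is a simplification of what ES supports.

```python
def filter_source(source, fields):
    """Keep only the requested top-level fields, similar to ?_source=title."""
    return {name: value for name, value in source.items() if name in fields}


stored = {
    "title": "Веселые котята",
    "content": "<p>Смешная история про котят<p>",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
}
print(filter_source(stored, ["title"]))
```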
Let's index a few more posts and run more complex queries.
curl -XPUT "$ES_URL/blog/post/2" -d'
{
"title": "Веселые щенки",
"content": "<p>Смешная история про щенков<p>",
"tags": [
"щенки",
"смешная история"
],
"published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
"title": "Как у меня появился котенок",
"content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
"tags": [
"котята"
],
"published_at": "2014-07-21T20:44:42+00:00"
}'
Sorting
# find the latest post by publication date and fetch the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"size": 1,
"_source": ["title", "published_at"],
"sort": [{"published_at": "desc"}]
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : null,
"_source" : {
"title" : "Веселые котята",
"published_at" : "2014-09-12T20:44:42+00:00"
},
"sort" : [ 1410554682000 ]
} ]
}
}
We selected the latest post. size limits the number of documents returned. total shows the total number of documents matching the query. sort in the output contains an array of integers used for sorting, i.e. the date was converted to an integer. More about sorting can be found in
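The integer in sort is the publication date expressed as milliseconds since the Unix epoch. A short Python check of that conversion follows; the helper name to_epoch_millis is hypothetical.

```python
from datetime import datetime


def to_epoch_millis(iso_timestamp):
    """Convert an ISO 8601 timestamp with offset to epoch milliseconds."""
    moment = datetime.fromisoformat(iso_timestamp)
    return int(moment.timestamp() * 1000)


# published_at of post 1; the result matches the sort value in the response above
print(to_epoch_millis("2014-09-12T20:44:42+00:00"))  # 1410554682000
```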
Filters and queries
Since version 2, ES does not distinguish between filters and queries; instead it introduces the concept of contexts.
Query context differs from filter context in that a query generates a _score and is not cached. I will show what _score is later.
Filtering by date
We use a range query:
# get the posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"filter": {
"range": {
"published_at": { "gte": "2014-09-01" }
}
}
}'
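The request above uses a top-level filter key. A hedged alternative, assuming ES 2.x where the bool query accepts a filter clause, places the same range condition in filter context inside query. The Python sketch below only builds the request body. The function filtered_search is a hypothetical helper, and the body would be sent to the same _search endpoint.

```python
import json


def filtered_search(filter_clause, source_fields=None):
    """Wrap a filter clause in bool.filter so it runs in filter context."""
    body = {"query": {"bool": {"filter": filter_clause}}}
    if source_fields is not None:
        body["_source"] = source_fields
    return body


body = filtered_search({"range": {"published_at": {"gte": "2014-09-01"}}})
print(json.dumps(body, indent=2))
```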
Filtering by tags
We use a term query:
# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"_source": [
"title",
"tags"
],
"filter": {
"term": {
"tags": "котята"
}
}
}'
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"title" : "Веселые котята",
"tags" : [ "котята", "смешная история" ]
}
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"title" : "Как у меня появился котенок",
"tags" : [ "котята" ]
}
} ]
}
}
Full-text search
Three of our documents contain the following in the content field:
<p>Смешная история про котят<p>
<p>Смешная история про щенков<p>
<p>Душераздирающая история про бедного котенка с улицы<p>
We use a match query:
# _source: false means the _source of the found documents is not fetched
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
"_source": false,
"query": {
"match": {
"content": "история"
}
}
}'
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.11506981,
"hits" : [ {
"_index" : "blog",
"_type" : "post",
"_id" : "2",
"_score" : 0.11506981
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "1",
"_score" : 0.11506981
}, {
"_index" : "blog",
"_type" : "post",
"_id" : "3",
"_score" : 0.095891505
} ]
}
}
However, if we search for 'истории' in the content field, we will find nothing, because the index contains only the original words, not their stems. To get a quality search, we need to configure an analyzer.
The _score field shows relevance. In the filter examples above, every hit has a _score of 1.0.
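Why 'истории' finds nothing becomes clearer with a toy inverted index. The Python sketch below is a simplification, not Lucene: it lowercases the text and splits it into word tokens without any stemming, and all names in it are hypothetical.

```python
import re
from collections import defaultdict

POSTS = {
    "1": "<p>Смешная история про котят<p>",
    "2": "<p>Смешная история про щенков<p>",
    "3": "<p>Душераздирающая история про бедного котенка с улицы<p>",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens, no stemming."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


INDEX = build_index(POSTS)


def search(word):
    """Return the ids of documents whose tokens include the exact word."""
    return sorted(INDEX.get(word.lower(), set()))


print(search("история"))  # the exact indexed form is found in all three posts
print(search("истории"))  # a different word form is not in the index
```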
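Analyzers
An analyzer consists of one Tokenizer and, optionally, several TokenFilters; the Tokenizer may be preceded by several CharFilters.
ES has many built-in analyzers, for example the russian analyzer.
Let's use the _analyze API to see how the standard and russian analyzers transform the string 'Веселые истории про котят':
# use the standard analyzer
# non-ASCII characters must be URL-encoded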
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
"tokens" : [ {
"token" : "веселые",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "истории",
"start_offset" : 8,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "про",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "котят",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
"tokens" : [ {
"token" : "весел",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "истор",
"start_offset" : 8,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "кот",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
The standard analyzer split the string on whitespace and converted everything to lowercase. The russian analyzer removed the insignificant words, converted the rest to lowercase, and kept only the word stems.
Let's see which Tokenizer, TokenFilters, and CharFilters the russian analyzer uses:
{
"filter": {
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"russian": {
"tokenizer": "standard",
/* TokenFilters */
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
/* no CharFilters */
}
}
}
Let's define our own analyzer based on russian that also strips html tags. We will call it default, because an analyzer with that name is used by default.
{
"filter": {
"ru_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"ru_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"default": {
/* add html tag stripping */
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": [
"lowercase",
"ru_stop",
"ru_stemmer"
]
}
}
}
First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, insignificant words are removed, and the remaining tokens are reduced to word stems.
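The order of these steps can be imitated in a few lines of Python. This is a toy model of the pipeline, not the real analyzer: the stop list contains a single word, and the suffix list is a crude stand-in for the Snowball russian stemmer, chosen only so that the sample words from this article come out right. All names are hypothetical.

```python
import re

STOP_WORDS = {"про"}  # tiny stand-in for the _russian_ stop list
SUFFIXES = ("ые", "ии", "ия", "ят")  # crude stand-in for the russian stemmer


def strip_html(text):
    """CharFilter step: replace html tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run char filter, tokenizer, lowercase, stop and stem steps in order."""
    tokens = re.findall(r"\w+", strip_html(text))         # tokenizer
    tokens = [token.lower() for token in tokens]          # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    return [toy_stem(token) for token in tokens]          # stemmer


print(analyze("<p>Веселые истории про котят<p>"))
```

With this pipeline 'история' and 'истории' are reduced to the same token, which is what makes the later full-text search match both forms.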
Creating an index
Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would also be processed by the analyzer. Because we look up posts by exact tag match, we need to disable analysis for the tags field.
Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:
curl -XPOST "$ES_URL/blog2" -d'
{
"settings": {
"analysis": {
"filter": {
"ru_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"ru_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"default": {
"char_filter": [
"html_strip"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"ru_stop",
"ru_stemmer"
]
}
}
}
},
"mappings": {
"post": {
"properties": {
"content": {
"type": "string"
},
"published_at": {
"type": "date"
},
"tags": {
"type": "string",
"index": "not_analyzed"
},
"title": {
"type": "string"
}
}
}
}
}'
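To see why the mapping above marks tags as not_analyzed, compare how a multi-word tag would be indexed in the two modes. This Python sketch is a toy model with hypothetical names. The analyzed variant only lowercases and splits on whitespace, which is simpler than the default analyzer defined earlier.

```python
def analyzed_terms(value):
    """Analyzed field: the value is split into lowercase word tokens."""
    return set(value.lower().split())


def not_analyzed_terms(value):
    """not_analyzed field: the whole value is indexed as a single term."""
    return {value}


TAGS = ["котята", "смешная история"]


def term_matches(term, tags, to_terms):
    """Exact term lookup against the terms produced for each tag."""
    return any(term in to_terms(tag) for tag in tags)


print(term_matches("смешная история", TAGS, analyzed_terms))      # False
print(term_matches("смешная история", TAGS, not_analyzed_terms))  # True
```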
Let's add the same 3 posts to this index (blog2). I will omit this step because it is the same as adding documents to the blog index.
Full-text search with expression support
Let's look at another type of query:
# find documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
"query": {
"simple_query_string": {
"query": "истории",
"fields": [
"title^3",
"tags^2",
"content"
]
}
}
}'
Because we use an analyzer with Russian stemming, this query will return all the documents, even though they contain only the word 'история'.
A query may contain special characters, for example:
""fried eggs" +(eggplant | potato) -frittata"
Query syntax:
+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount
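A practical detail with such queries is that the phrase quotes must be escaped once the query is embedded in a JSON body. Here is a small Python sketch that lets json.dumps do the escaping. The helper simple_query is hypothetical, and the field list is borrowed from the request above.

```python
import json


def simple_query(query, fields):
    """Build a simple_query_string search body for the given fields."""
    return {"query": {"simple_query_string": {"query": query, "fields": fields}}}


body = simple_query('"fried eggs" +(eggplant | potato) -frittata',
                    ["title^3", "tags^2", "content"])
encoded = json.dumps(body)
print(encoded)
```

The printed string can be passed to curl with -d as is, without escaping the inner quotes by hand.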
# find documents that do not contain the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
"query": {
"simple_query_string": {
"query": "-щенки",
"fields": [
"title^3",
"tags^2",
"content"
]
}
}
}'
# we get 2 posts about kittens
References
PS
If you are interested in tutorial articles like this, have ideas for new articles, or have proposals for collaboration, I will be glad to receive a private message or an email.
source: www.habr.com