Elasticsearch Basics

Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available on the official website. From here on we will refer to Elasticsearch as ES.

Engines like this are used for complex searches over a document database, for example search that takes the morphology of the language into account, or search by geo coordinates.

In this article I will cover the basics of ES using the example of indexing blog posts. I will show how to filter, sort and search documents.

To stay independent of the operating system, I will make all requests to ES with curl. There is also a plugin for Google Chrome called Sense.

The text contains links to the documentation and other sources. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in the glossary.

Installing ES

First of all we need Java. The developers recommend installing Java versions newer than Java 8 update 20 or Java 7 update 55.

ES distributions are available on the developer's site. After unpacking the archive, run bin/elasticsearch. Packages for apt and yum are also available, and there is an official image for docker. The documentation has more about installation.

After installing and starting ES, let's check that it works:

# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We will get something like this:

{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

Indexing

Let's add a post to ES:

# Add a document with id 1 of type post to the index blog.
# ?pretty makes the output human-readable.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешная история"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

Server response:

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

ES automatically created the index blog and the type post. A rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:

# Get the mappings of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server response I have added the field values of the indexed document as comments:

{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>Смешная история про котят<p>", */ 
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["котята", "смешная история"] */
          "tags" : {
            "type" : "string"
          },
          /*  "title": "Веселые котята" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

Note that ES does not distinguish between a single value and an array of values. For example, the title field holds just a title, while the tags field holds an array of strings, yet both are represented the same way in the mapping.
We will talk about mappings in more detail later.

Queries

Retrieving a document by its id:

# retrieve the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята",
    "content" : "<p>Смешная история про котят<p>",
    "tags" : [ "котята", "смешная история" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys have appeared in the response: _version and _source. In general, all keys that start with _ are service keys.

The _version key shows the document version. It is needed for the optimistic locking mechanism to work. For example, suppose we want to change a document that has version 1. We submit the changed document and state that this is an edit of the document with version 1. If someone else also edited the document with version 1 and submitted their changes before us, ES will not accept our changes, because it already stores the document with version 2.
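The version check described above can be sketched without a running cluster. The snippet below is a minimal in-memory illustration of optimistic locking, not the Elasticsearch client: TinyStore, VersionConflict and expected_version are hypothetical names invented for this example. As far as I know, in ES 2.x the expected version is passed as a version parameter on the index request and a stale write is rejected with an HTTP 409 conflict, but treat that as an assumption to check against the documentation.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class TinyStore:
    """In-memory stand-in for an index; NOT the Elasticsearch client."""

    def __init__(self):
        self._docs = {}  # doc id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if somebody else changed the document first.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, stored {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, body)
        return current_version + 1


store = TinyStore()
store.put("1", {"title": "Веселые котята"})  # stored as version 1

# Two editors both read version 1, then submit their changes.
first = store.put("1", {"title": "edit A"}, expected_version=1)
try:
    store.put("1", {"title": "edit B"}, expected_version=1)
    second_accepted = True
except VersionConflict:
    second_accepted = False

print(first, second_accepted)  # 2 False
```

The second editor has to re-read the document, reapply the change on top of version 2 and submit again.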

The _source key contains the document we indexed. ES does not use this value for search operations, because searching is done against the indexes. To save space, ES stores the source document compressed. If we only need the id and not the whole source document, we can disable storing the source.

If we do not need the additional information, we can fetch just the contents of _source:

curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
  "title" : "Веселые котята",
  "content" : "<p>Смешная история про котят<p>",
  "tags" : [ "котята", "смешная история" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also select only certain fields:

# retrieve only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята"
  }
}

Let's index a few more posts and run some more complex queries.

curl -XPUT "$ES_URL/blog/post/2" -d'
{
  "title": "Веселые щенки",
  "content": "<p>Смешная история про щенков<p>",
  "tags": [
    "щенки",
    "смешная история"
  ],
  "published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
  "title": "Как у меня появился котенок",
  "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
  "tags": [
    "котята"
  ],
  "published_at": "2014-07-21T20:44:42+00:00"
}'

Sorting

# find the latest post by publication date and retrieve the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Веселые котята",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We selected the latest post. size limits the number of documents returned. total shows the total number of documents that match the query. sort in the output contains the array of integers by which the sorting is performed; that is, the date was converted to an integer. More information about sorting can be found in the documentation.
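To make the date-to-integer conversion concrete, here is a small sketch that reproduces the sort value from the response above. It assumes the integer is the number of milliseconds since the Unix epoch, which is consistent with the epoch_millis format shown in the mapping; to_epoch_millis is a helper name made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(iso_timestamp: str) -> int:
    """Convert an ISO 8601 timestamp with a UTC offset to epoch milliseconds."""
    moment = datetime.fromisoformat(iso_timestamp)
    # Integer division of two timedeltas gives an exact whole number.
    return (moment - EPOCH) // timedelta(milliseconds=1)


sort_value = to_epoch_millis("2014-09-12T20:44:42+00:00")
print(sort_value)  # 1410554682000, the value in the "sort" array above
```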

Filters and queries

Since version 2, ES no longer distinguishes between filters and queries; instead, the concept of contexts was introduced.
The query context differs from the filter context in that a query produces a _score and is not cached. I will show what _score is later.
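One way to see both contexts in a single request is the bool query, which accepts scored clauses under must and unscored ones under filter. The sketch below only builds and prints the request body; search_body is a hypothetical helper name, and the choice of bool with a filter clause is my reading of the ES 2.x query DSL rather than something this article uses.

```python
import json


def search_body(text, published_from):
    """Build a search body: `match` is scored, `range` only filters."""
    return {
        "query": {
            "bool": {
                # query context: contributes to _score
                "must": {"match": {"content": text}},
                # filter context: a yes/no check with no scoring
                "filter": {"range": {"published_at": {"gte": published_from}}},
            }
        }
    }


body = search_body("история", "2014-09-01")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent with curl in the same way as the other search requests in this article, for example via -d.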

Filtering by date

We use the range query in the filter context:

# get the posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'

Filtering by tags

We use the term query to find the ids of documents that contain a given word:

# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Веселые котята",
        "tags" : [ "котята", "смешная история" ]
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Как у меня появился котенок",
        "tags" : [ "котята" ]
      }
    } ]
  }
}

Full-text search

Our three documents contain the following in the content field:

  • <p>Смешная история про котят<p>
  • <p>Смешная история про щенков<p>
  • <p>Душераздирающая история про бедного котенка с улицы<p>

We use the match query to find the ids of documents that contain a given word:

# _source: false means the _source of the found documents should not be retrieved
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "истории" in the content field, we will find nothing, because the index contains only the original words, not their stems. To get a quality search, you need to configure an analyzer.

The _score field shows relevance. If a query is executed in the filter context, the _score value is always 1, which means a full match with the filter.

Analyzers

Analyzers are needed to convert the source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. A Tokenizer splits the source string into tokens, for example on spaces and punctuation characters. A TokenFilter can change tokens, remove them or add new ones; for example, it can keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter transforms the whole source string, for example by stripping HTML tags.
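The order of the stages described above (CharFilters, then the Tokenizer, then TokenFilters) can be sketched with a toy pipeline. This is an illustration only and not how Lucene implements analysis: the regular expressions, the tiny STOP_WORDS set and all function names are assumptions made up for the example, and stemming is left out.

```python
import re

# Tiny illustrative stop list, not the real _russian_ stop word list.
STOP_WORDS = {"про", "с", "и", "в", "на"}


def html_strip(text):
    """CharFilter stage: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer stage: split the string into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """TokenFilter stage: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """TokenFilter stage: drop tokens that are in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Run the stages in order: CharFilter -> Tokenizer -> TokenFilters."""
    tokens = tokenize(html_strip(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<p>Смешная история про котят<p>")
print(tokens)  # ['смешная', 'история', 'котят']
```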

ES has several standard analyzers, for example the russian analyzer.

Let's use the _analyze API and see how the standard and russian analyzers transform the string "Веселые истории про котят" ("Funny stories about kittens"):

# use the standard analyzer
# non-ASCII characters must be URL-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "кот",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
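The percent-encoded text parameter in the two requests above does not have to be typed by hand. A minimal sketch with the Python standard library follows; the variable names are illustrative, and the URL is only assembled, not requested.

```python
from urllib.parse import quote

text = "Веселые истории про котят"

# quote() percent-encodes the UTF-8 bytes and turns spaces into %20.
encoded = quote(text)
print(encoded)

# Assemble the same URL shape as in the curl calls above (no request is sent).
url = f"localhost:9200/_analyze?pretty&analyzer=russian&text={encoded}"
print(url)
```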

The standard analyzer split the string on spaces and converted everything to lowercase. The russian analyzer removed the insignificant words, converted the rest to lowercase and kept only the word stems.

Let's see which Tokenizer, TokenFilters and CharFilters the russian analyzer uses:

{
  "filter": {
    "russian_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "russian_keywords": {
      "type":       "keyword_marker",
      "keywords":   []
    },
    "russian_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer":  "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's define our own analyzer based on russian that also strips HTML tags. We will call it default, because an analyzer with that name is used by default.

{
  "filter": {
    "ru_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "ru_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "default": {
      /* add stripping of HTML tags */
      "char_filter": ["html_strip"],
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all HTML tags are removed from the source string. Then the standard tokenizer splits it into tokens, the resulting tokens are converted to lowercase, insignificant words are removed, and the remaining tokens are reduced to their stems.

Creating an index

Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags will be processed by the analyzer as well. Because we look up posts by exact tag match, we need to disable analysis for the tags field.
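A toy comparison shows why exact tag lookups need the analysis turned off. The two functions below are stand-ins invented for this example: the first mimics an analyzed string field by splitting the value into lowercase word tokens (the real default analyzer above would also stem them), the second mimics a not_analyzed field by keeping the value as a single term.

```python
import re


def analyzed_terms(value):
    """Toy stand-in for an analyzed string field: lowercase word tokens."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def not_analyzed_terms(value):
    """A not_analyzed field keeps the whole value as one term."""
    return [value]


tag = "смешная история"
analyzed = analyzed_terms(tag)
exact = not_analyzed_terms(tag)

print(analyzed)         # ['смешная', 'история']
print(tag in analyzed)  # False: the full tag is no longer a single term
print(tag in exact)     # True: the full tag can be matched exactly
```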

Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:

curl -XPOST "$ES_URL/blog2" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "ru_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "default": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_stop",
            "ru_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "content": {
          "type": "string"
        },
        "published_at": {
          "type": "date"
        },
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Let's add the same 3 posts to this index (blog2). I will skip this step because it is the same as adding documents to the index blog.
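For readers who would rather load the three posts in a single request, one option is the bulk endpoint. The sketch below only builds the newline-delimited body; it assumes the ES 2.x _bulk format of one action line followed by one source line per document, and bulk_body is a hypothetical helper name. The article itself adds the documents one by one with PUT.

```python
import json

# The same three posts that were indexed into the index blog above.
posts = {
    "1": {
        "title": "Веселые котята",
        "content": "<p>Смешная история про котят<p>",
        "tags": ["котята", "смешная история"],
        "published_at": "2014-09-12T20:44:42+00:00",
    },
    "2": {
        "title": "Веселые щенки",
        "content": "<p>Смешная история про щенков<p>",
        "tags": ["щенки", "смешная история"],
        "published_at": "2014-08-12T20:44:42+00:00",
    },
    "3": {
        "title": "Как у меня появился котенок",
        "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
        "tags": ["котята"],
        "published_at": "2014-07-21T20:44:42+00:00",
    },
}


def bulk_body(index, doc_type, docs):
    """Return an action line plus a source line per document, newline-terminated."""
    lines = []
    for doc_id, doc in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


body = bulk_body("blog2", "post", posts)
print(body)
```

If the output is saved to a file, it could be sent with something like curl -XPOST "$ES_URL/_bulk" --data-binary @file, where the file name is up to you.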

Full-text search with expression support

Let's look at another type of query:

# find the documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has priority 3
# the tags field has priority 2
# the content field has priority 1
# the priority is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Because we use an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.

A query can contain special characters, for example:

""fried eggs" +(eggplant | potato) -frittata"

Query syntax:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount

# find the documents that do not contain the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# we get 2 posts about kittens
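The two simple_query_string requests above differ only in the query text, so the body can be generated. A small sketch follows; simple_query and the boosts dictionary are names invented for this example, and the field^weight notation is the one already used in the requests above.

```python
import json


def simple_query(query, boosts):
    """Build a simple_query_string body; boosts maps field name to weight."""
    fields = [
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in boosts.items()
    ]
    return {"query": {"simple_query_string": {"query": query, "fields": fields}}}


body = simple_query("-щенки", {"title": 3, "tags": 2, "content": 1})
print(json.dumps(body, ensure_ascii=False, indent=2))
```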

References

PS

If you are interested in tutorial articles like this one, have ideas for new articles, or have proposals for collaboration, I will be glad to receive a private message or an email at [email protected].

source: www.habr.com