Elasticsearch Basics

Elasticsearch is a search engine with a JSON REST API, built on Lucene and written in Java. A description of all the advantages of this engine is available on the official website. In the rest of this article I will refer to Elasticsearch as ES.

Engines like this are used for complex searches over a document store, for example searches that take the morphology of the language into account, or searches by geographic coordinates.

In this article I will cover the basics of ES, using the indexing of blog posts as an example. I will show how to filter, sort and search documents.

To avoid depending on the operating system, I will make all requests to ES with curl. There is also a plugin for Google Chrome called Sense.

Links to the documentation and other sources are placed throughout the text. At the end there are links for quick access to the documentation. Definitions of unfamiliar terms can be found in the glossary.

Installing ES

For this we first need Java. The developers recommend installing a Java version newer than Java 8 update 20 or Java 7 update 55.

The ES distribution is available on the developer's website. After unpacking the archive, run bin/elasticsearch. Packages for apt and yum are also available, and there is an official Docker image. See the installation documentation for details.

After installing and starting ES, let's check that it works:

# for convenience, store the address in a variable
#export ES_URL=$(docker-machine ip dev):9200
export ES_URL=localhost:9200

curl -X GET $ES_URL

We will get a response similar to this:

{
  "name" : "Heimdall",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.1",
    "build_hash" : "d045fc29d1932bce18b2e65ab8b297fbf6cd41a1",
    "build_timestamp" : "2016-03-09T09:38:54Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}

Indexing

Let's add a post to ES:

# Add a document with id 1 of type post to the index blog.
# ?pretty asks for human-readable output.

curl -XPUT "$ES_URL/blog/post/1?pretty" -d'
{
  "title": "Веселые котята",
  "content": "<p>Смешная история про котят<p>",
  "tags": [
    "котята",
    "смешная история"
  ],
  "published_at": "2014-09-12T20:44:42+00:00"
}'

Server response:

{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

ES automatically created the index blog and the type post. A rough analogy: an index is a database, and a type is a table in that database. Each type has its own schema, called a mapping, just like a relational table. The mapping is generated automatically when a document is indexed:

# Get the mapping of all types in the index blog
curl -XGET "$ES_URL/blog/_mapping?pretty"

In the server response below I have added the field values of the indexed document as comments:

{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          /* "content": "<p>Смешная история про котят<p>", */ 
          "content" : {
            "type" : "string"
          },
          /* "published_at": "2014-09-12T20:44:42+00:00" */
          "published_at" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          /* "tags": ["котята", "смешная история"] */
          "tags" : {
            "type" : "string"
          },
          /*  "title": "Веселые котята" */
          "title" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

It is worth noting that ES does not distinguish between a single value and an array of values. For example, the title field holds a single title and the tags field holds an array of strings, yet they are represented identically in the mapping.
We will talk about mappings in more detail later.
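
To make the idea of an automatically generated mapping more concrete, here is a toy Python sketch of type inference. It is not how ES implements dynamic mapping: the function names infer_type and infer_mapping are made up for this illustration, and only strings and ISO 8601 dates are handled. It does show why tags and title end up with the same type: an array is mapped by its elements.

```python
from datetime import datetime


def infer_type(value):
    """Guess a mapping type for one JSON value (toy rules, not ES's own)."""
    if isinstance(value, list):
        # An array is mapped by the type of its elements.
        return infer_type(value[0])
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(document):
    """Build a field -> type dictionary for a whole document."""
    return {field: infer_type(value) for field, value in document.items()}


post = {
    "title": "Веселые котята",
    "content": "<p>Смешная история про котят<p>",
    "tags": ["котята", "смешная история"],
    "published_at": "2014-09-12T20:44:42+00:00",
}
print(infer_mapping(post))
```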

Queries

Retrieving a document by its id:

# fetch the document with id 1 of type post from the index blog
curl -XGET "$ES_URL/blog/post/1?pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята",
    "content" : "<p>Смешная история про котят<p>",
    "tags" : [ "котята", "смешная история" ],
    "published_at" : "2014-09-12T20:44:42+00:00"
  }
}

New keys have appeared in the response: _version and _source. In general, all keys that start with _ are metadata keys.

The _version key shows the version of the document. It is needed for the optimistic locking mechanism. Suppose we want to change a document that has version 1. We submit the changed document and state that it is an edit of the document at version 1. If someone else also edited version 1 and submitted their changes before us, ES will reject our changes, because it now stores the document at version 2.
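
The scenario above can be played through with a small in-memory model. This is only a sketch of the optimistic locking idea, not ES code: ToyStore, VersionConflict and the expected_version argument are names invented for this example. In ES 2.x the expected version is passed with the version request parameter.

```python
class VersionConflict(Exception):
    """Raised when the caller edited a version that is no longer current."""


class ToyStore:
    """In-memory stand-in for an index; hypothetical, for illustration only."""

    def __init__(self):
        self._docs = {}  # id -> (version, source)

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, "
                f"stored version is {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


store = ToyStore()
store.put("1", {"title": "Веселые котята"})  # stored as version 1
store.put("1", {"title": "edit by someone else"}, expected_version=1)  # version 2

try:
    # Our edit was also based on version 1, which is now stale.
    store.put("1", {"title": "our edit"}, expected_version=1)
except VersionConflict as error:
    print("rejected:", error)
```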

The _source key contains the document that we indexed. ES does not use this value for search operations, because searches are served from the indexes. To save space, ES stores the source document compressed. If we only need the id and not the whole source document, storing the source can be disabled.

If we do not need the additional information, we can get only the contents of _source:

curl -XGET "$ES_URL/blog/post/1/_source?pretty"
{
  "title" : "Веселые котята",
  "content" : "<p>Смешная история про котят<p>",
  "tags" : [ "котята", "смешная история" ],
  "published_at" : "2014-09-12T20:44:42+00:00"
}

You can also select only specific fields:

# fetch only the title field
curl -XGET "$ES_URL/blog/post/1?_source=title&pretty"
{
  "_index" : "blog",
  "_type" : "post",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Веселые котята"
  }
}

Let's index a few more posts and run some more complex queries.

curl -XPUT "$ES_URL/blog/post/2" -d'
{
  "title": "Веселые щенки",
  "content": "<p>Смешная история про щенков<p>",
  "tags": [
    "щенки",
    "смешная история"
  ],
  "published_at": "2014-08-12T20:44:42+00:00"
}'
curl -XPUT "$ES_URL/blog/post/3" -d'
{
  "title": "Как у меня появился котенок",
  "content": "<p>Душераздирающая история про бедного котенка с улицы<p>",
  "tags": [
    "котята"
  ],
  "published_at": "2014-07-21T20:44:42+00:00"
}'

Sorting

# find the latest post by publication date and fetch the title and published_at fields
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "size": 1,
  "_source": ["title", "published_at"],
  "sort": [{"published_at": "desc"}]
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "Веселые котята",
        "published_at" : "2014-09-12T20:44:42+00:00"
      },
      "sort" : [ 1410554682000 ]
    } ]
  }
}

We selected the latest post. size limits the number of documents returned. total shows the total number of documents that match the query. sort in the output contains an array of integers by which the sorting is performed; in other words, the date was converted to an integer. More about sorting can be found in the documentation.
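
The integer in the sort array is the publication date expressed as milliseconds since the Unix epoch. A short Python check, using only the standard library, reproduces the value from the response:

```python
from datetime import datetime

# published_at of the post with id 1
published_at = datetime.fromisoformat("2014-09-12T20:44:42+00:00")

# Seconds since 1970-01-01T00:00:00Z, converted to milliseconds.
epoch_millis = int(published_at.timestamp() * 1000)
print(epoch_millis)  # 1410554682000
```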

Filters and queries

Since version 2, ES does not distinguish between filters and queries; instead it introduces the notion of contexts.
Query context differs from filter context in that a query produces a _score and is not cached. I will show what _score is later.

Filtering by date

We use a range query in filter context:

# get the posts published on September 1 or later
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "filter": {
    "range": {
      "published_at": { "gte": "2014-09-01" }
    }
  }
}'
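
The request above puts filter at the top level of the body. Another way to run a clause in filter context in ES 2.x is to wrap it in a constant_score query. The Python sketch below only builds and prints such a request body; it does not contact a server, and the helper name filter_only_body is hypothetical.

```python
import json


def filter_only_body(filter_clause):
    """Wrap a clause so that it runs in filter context (no relevance scoring)."""
    return {"query": {"constant_score": {"filter": filter_clause}}}


body = filter_only_body({"range": {"published_at": {"gte": "2014-09-01"}}})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same _search request with curl -d.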

Filtering by tags

We use a term query to find the documents that contain a given word:

# find all documents whose tags field contains the element 'котята'
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": [
    "title",
    "tags"
  ],
  "filter": {
    "term": {
      "tags": "котята"
    }
  }
}'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Веселые котята",
        "tags" : [ "котята", "смешная история" ]
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 1.0,
      "_source" : {
        "title" : "Как у меня появился котенок",
        "tags" : [ "котята" ]
      }
    } ]
  }
}

Full-text search

Our three documents contain the following in the content field:

  • <p>Смешная история про котят<p>
  • <p>Смешная история про щенков<p>
  • <p>Душераздирающая история про бедного котенка с улицы<p>

We use a match query to find the ids of documents that contain a given word:

# "_source": false means that the _source of the matching documents is not returned
curl -XGET "$ES_URL/blog/post/_search?pretty" -d'
{
  "_source": false,
  "query": {
    "match": {
      "content": "история"
    }
  }
}'
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.11506981
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "3",
      "_score" : 0.095891505
    } ]
  }
}

However, if we search for "истории" in the content field, we will find nothing, because the index contains only the original words, not their stems. To get good search quality, the analyzer has to be configured.
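
Why the exact word form matters can be seen on a toy inverted index. The Python sketch below is a simplified model, not the Lucene data structure: it lowercases the three content values, splits them into words and records which documents contain each word. The search helper is invented for this example.

```python
import re
from collections import defaultdict

contents = {
    1: "<p>Смешная история про котят<p>",
    2: "<p>Смешная история про щенков<p>",
    3: "<p>Душераздирающая история про бедного котенка с улицы<p>",
}

# Toy inverted index: lowercased word -> ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in contents.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(doc_id)


def search(word):
    """Return the ids of documents that contain exactly this word form."""
    return sorted(index.get(word.lower(), set()))


print(search("история"))  # [1, 2, 3]
print(search("истории"))  # []
```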

The _score field shows relevance. If a request is executed in filter context, the value of _score is always 1, which means a complete match with the filter.

Analyzers

Analyzers are needed to convert the source text into a set of tokens.
An analyzer consists of one Tokenizer and several optional TokenFilters. The Tokenizer may be preceded by several CharFilters. A Tokenizer splits the source string into tokens, for example on whitespace and punctuation. A TokenFilter can change tokens, remove them or add new ones, for example keep only the stem of a word, remove prepositions, or add synonyms. A CharFilter changes the whole source string, for example by cutting out HTML tags.

ES has several built-in analyzers, for example the russian analyzer.

Let's use the API and see how the standard and russian analyzers transform the string "Веселые истории про котят":

# use the standard analyzer
# non-ASCII characters must be percent-encoded
curl -XGET "$ES_URL/_analyze?pretty&analyzer=standard&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "веселые",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истории",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "про",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "котят",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
# use the russian analyzer
curl -XGET "$ES_URL/_analyze?pretty&analyzer=russian&text=%D0%92%D0%B5%D1%81%D0%B5%D0%BB%D1%8B%D0%B5%20%D0%B8%D1%81%D1%82%D0%BE%D1%80%D0%B8%D0%B8%20%D0%BF%D1%80%D0%BE%20%D0%BA%D0%BE%D1%82%D1%8F%D1%82"
{
  "tokens" : [ {
    "token" : "весел",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "истор",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "кот",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

The standard analyzer split the string on whitespace and lowercased everything. The russian analyzer removed insignificant words, lowercased the tokens and left only the word stems.
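
The text parameter in the two requests above is the UTF-8 string "Веселые истории про котят" in percent-encoded form. One way to produce it, shown here with Python's standard library, is:

```python
from urllib.parse import quote

text = "Веселые истории про котят"

# quote() encodes the string as UTF-8 and escapes every byte that is not URL-safe.
encoded = quote(text)
print(encoded)
```

The printed value can be pasted after text= in the _analyze URL.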

Let's see which Tokenizer, TokenFilters and CharFilters the russian analyzer uses:

{
  "filter": {
    "russian_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "russian_keywords": {
      "type":       "keyword_marker",
      "keywords":   []
    },
    "russian_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "russian": {
      "tokenizer":  "standard",
      /* TokenFilters */
      "filter": [
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer"
      ]
      /* no CharFilters */
    }
  }
}

Let's define our own analyzer based on russian that will also strip HTML tags. We will call it default, because an analyzer with that name is used by default.

{
  "filter": {
    "ru_stop": {
      "type":       "stop",
      "stopwords":  "_russian_"
    },
    "ru_stemmer": {
      "type":       "stemmer",
      "language":   "russian"
    }
  },
  "analyzer": {
    "default": {
      /* add removal of html tags */
      "char_filter": ["html_strip"],
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "ru_stop",
        "ru_stemmer"
      ]
    }
  }
}

First, all HTML tags are removed from the source string; then the standard tokenizer splits it into tokens; the resulting tokens are lowercased, insignificant words are removed, and the remaining tokens are reduced to their stems.
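
The order of these steps can be imitated in a few lines of Python. This is a toy model under loud assumptions: the stop list holds a single word, and the suffix list is made up just to reproduce the tokens shown earlier; it is not the stemming algorithm that ES uses, and the function names are invented for this sketch.

```python
import re

STOP_WORDS = {"про"}                 # tiny stand-in for the _russian_ stop list
SUFFIXES = ("ые", "ии", "ия", "ят")  # made-up list, NOT the real Russian stemmer


def html_strip(text):
    """CharFilter step: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer step: keep runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)


def stem(token):
    """TokenFilter step: cut off the first matching suffix, if any."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the steps in the same order as the analyzer described above."""
    tokens = tokenize(html_strip(text))
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [stem(token) for token in tokens]


print(analyze("<p>Веселые истории про котят</p>"))  # ['весел', 'истор', 'кот']
print(analyze("история"))                            # ['истор']
```

Because "история" and "истории" are reduced to the same token, a search for either form can match documents that contain the other.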

Creating an index

Above we described the default analyzer. It will be applied to all string fields. Our post contains an array of tags, so the tags would be processed by the analyzer as well. Since we look up posts by an exact match on a tag, analysis has to be disabled for the tags field.
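
The effect of analysis on an exact tag lookup can be shown with a toy comparison in Python. The two helper functions are invented for this sketch and use a much simpler analysis (lowercasing and splitting into words) than the analyzer defined above; with stemming the indexed terms would differ from the original tags even more.

```python
import re

tags = ["котята", "смешная история"]


def analyzed_terms(values):
    """Toy analysis: every value is split into lowercased words."""
    return {
        token
        for value in values
        for token in re.findall(r"\w+", value.lower())
    }


def not_analyzed_terms(values):
    """No analysis: every value is indexed as one term, unchanged."""
    return set(values)


print("смешная история" in analyzed_terms(tags))      # False
print("смешная история" in not_analyzed_terms(tags))  # True
```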

Let's create the index blog2 with the analyzer and a mapping in which analysis of the tags field is disabled:

curl -XPUT "$ES_URL/blog2" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "ru_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        },
        "ru_stemmer": {
          "type": "stemmer",
          "language": "russian"
        }
      },
      "analyzer": {
        "default": {
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ru_stop",
            "ru_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "post": {
      "properties": {
        "content": {
          "type": "string"
        },
        "published_at": {
          "type": "date"
        },
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        },
        "title": {
          "type": "string"
        }
      }
    }
  }
}'

Let's add the same 3 posts to this index (blog2). I will skip this step, since it is the same as adding documents to the blog index.

Full-text search with expression support

Let's look at one more type of query:

# find documents that contain the word 'истории'
# query -> simple_query_string -> query holds the search query
# the title field has boost 3
# the tags field has boost 2
# the content field has boost 1
# the boost is used when ranking the results
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "истории",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

Since we use an analyzer with Russian stemming, this query returns all the documents, even though they contain only the word 'история'.

A query can contain special characters, for example:

"\"fried eggs\" +(eggplant | potato) -frittata"

Query syntax:

+ signifies AND operation
| signifies OR operation
- negates a single token
" wraps a number of tokens to signify a phrase for searching
* at the end of a term signifies a prefix query
( and ) signify precedence
~N after a word signifies edit distance (fuzziness)
~N after a phrase signifies slop amount

# find documents without the word 'щенки'
curl -XPOST "$ES_URL/blog2/post/_search?pretty" -d'
{
  "query": {
    "simple_query_string": {
      "query": "-щенки",
      "fields": [
        "title^3",
        "tags^2",
        "content"
      ]
    }
  }
}'

# this returns the 2 posts about kittens

Links

PS

If you are interested in tutorial articles like this one, have ideas for new articles, or have proposals for collaboration, I will be glad to receive a private message or an email at [email protected].

Source: www.habr.com