PostgreSQL Antipatterns: a tale of iteratively refining search by name, or "Optimization there and back again"

Thousands of managers in sales offices across the country record tens of thousands of contacts in our CRM system every day: facts of communication with potential or existing clients. To do that, you first have to find the client, preferably very quickly. Most often this is done by name.

So it is no surprise that, while once again reviewing the "heavy" queries on one of our most loaded databases, our own corporate SBIS account, I found "at the top" a query for "quick" search by name over organization cards.

Further investigation revealed an interesting example of first optimizing a query and then degrading its performance through successive refinements by several teams, each of which acted with the best of intentions.

0: what does the user want?


What does a user usually mean by a "quick" search by name? It almost never turns out to be an "honest" substring search like ... LIKE '%роза%', because then the results include not only 'Розалия' and 'Магазин Роза', but also 'Гроза' and even 'Дом Деда Мороза'.
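To make the difference concrete, here is a small self-contained sketch that compares a plain substring match with a "start of a word" match on a few sample names. The sample rows are illustrative, and the sketch assumes a UTF8 database whose locale lowercases Cyrillic correctly.

```sql
-- Compare "honest" substring search with search by the beginning of a word
SELECT
  name
, lower(name) LIKE '%роза%' AS substring_match -- matches anywhere inside the name
, lower(name) ~ ('(^|\s)' || 'роза') AS word_prefix_match -- matches only at the start of a word
FROM
  (VALUES
    ('Розалия')
  , ('Магазин Роза')
  , ('Гроза')
  , ('Дом Деда Мороза')
  ) AS t(name);
```

Under those assumptions all four names satisfy the substring condition, while only the first two satisfy the word-prefix one, which is what the user actually expects to see.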

At an everyday level, the user assumes that you will search by the beginning of a word in the name and rank as more relevant the names that start with what was typed. And that you will do it almost instantly, for search-as-you-type input.

1: narrowing down the task

Moreover, a person will not deliberately type 'роз магаз' so that you have to search every word by prefix. It is much easier for the user to react to a quick hint for the last word than to deliberately "under-type" the preceding ones; look at how any search engine handles this.

In general, correctly formulating the requirements for a problem is more than half of the solution. Sometimes a careful analysis of the use cases can significantly affect the result.

What does an abstract developer do?

1.0: External search engine

Oh, search is hard, I don't want to do anything at all, let's hand it off to someone else! Let them deploy a search engine external to the database: Sphinx, ElasticSearch, ...

A workable option, although labor-intensive in terms of synchronization and how quickly changes become visible. But not in our case, since the search for each client runs only within the data of their own account. And the data is highly volatile: if a manager has just created the card 'Магазин Роза', then 5-10 seconds later they may already remember that they forgot to enter the email there, and want to find the card and fix it.

Therefore, let's search "directly in the database". Fortunately, PostgreSQL lets us do this, and in more than one way; we will look at them in turn.

1.1: "Honest" substring

We latch onto the word "substring". For index search by substring (and even by regular expressions!) there is the excellent pg_trgm module. The only thing left will be to sort the results correctly.

Let's take the following table to keep the model simple:

CREATE TABLE firms(
  id
    serial
      PRIMARY KEY
, name
    text
);

We load 7.8 million records of real organizations into it and index them:

CREATE EXTENSION pg_trgm;
CREATE INDEX ON firms USING gin(lower(name) gin_trgm_ops);

Let's fetch the first 10 records for the search-as-you-type case:

SELECT
  *
FROM
  firms
WHERE
  lower(name) ~ ('(^|\s)' || 'роза')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC -- "starts with" matches first
, lower(name) -- the rest alphabetically
LIMIT 10;

[query plan: see explain.tensor.ru]

Well, so-so... 26 ms, 31 MB of data read, and more than 1.7K records filtered out, all for the 10 we asked for. The overhead is too high; is there nothing more efficient?
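Figures like these come from the query plan. One way to collect them yourself, sketched here on the assumption that the firms table and the trigram index from above already exist, is to run the same query under EXPLAIN with the ANALYZE and BUFFERS options.

```sql
-- ANALYZE actually executes the query and reports real timings and row counts;
-- BUFFERS adds the number of pages hit in cache or read from disk per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  *
FROM
  firms
WHERE
  lower(name) ~ ('(^|\s)' || 'роза')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC
, lower(name)
LIMIT 10;
```

Buffer counts are reported in pages (8 KB each with the default block size), so they need to be multiplied out to get megabytes. "Rows Removed by Filter" or "Rows Removed by Index Recheck" lines in the output show how many candidate rows were fetched only to be discarded.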

1.2: search by text? That's FTS!

Indeed, PostgreSQL provides a very powerful full-text search engine (Full Text Search), including the ability to search by prefix. A great option that does not even require installing extensions! Let's try:

CREATE INDEX ON firms USING gin(to_tsvector('simple'::regconfig, lower(name)));

SELECT
  *
FROM
  firms
WHERE
  to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC
, lower(name)
LIMIT 10;

[query plan: see explain.tensor.ru]

Here parallel query execution helped us a little, cutting the time roughly in half, to 11 ms. We also had to read 1.5 times less: 20 MB in total. And here less is better, because the more we read, the higher the chance of a cache miss, and every extra data page read from disk is a potential "brake" for the query.

1.3: Still LIKE?

The previous query is good in every way, but if you call it a hundred thousand times a day, it adds up to 2 TB of data read. In the best case from memory, but if you are unlucky, from disk. So let's try to make it smaller.
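The 2 TB figure is just the roughly 20 MB read per call multiplied by a hundred thousand calls a day. As a quick sanity check of that arithmetic:

```sql
-- 20 MB per call * 100 000 calls per day, expressed in bytes and pretty-printed
SELECT pg_size_pretty(20::bigint * 1024 * 1024 * 100000) AS read_per_day;
```

The result comes out just under 2 TB in binary units, which matches the rounded figure.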

Recall that the user wants to see the names "starting with..." first. That is prefix search in its purest form, done with text_pattern_ops! Only if we "come up short" of the 10 records we are looking for do we have to read the rest using FTS:

CREATE INDEX ON firms(lower(name) text_pattern_ops);

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
LIMIT 10;

[query plan: see explain.tensor.ru]

Excellent performance: 0.05 ms in total and a little over 100 KB read! Except that we forgot to sort by name, so that the user does not get lost in the results:

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name)
LIMIT 10;

[query plan: see explain.tensor.ru]

Oh, something is not so pretty anymore: there seems to be an index, but the sort goes right past it... It is, of course, already many times more efficient than the previous option, but...

1.4: "Finishing with a file"

But there is an index that lets you both search by range and still use the sort order normally: the ordinary btree!

CREATE INDEX ON firms(lower(name));

Only the query for it has to be "assembled by hand":

SELECT
  *
FROM
  firms
WHERE
  lower(name) >= 'роза' AND
  lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings use chr(255)
ORDER BY
   lower(name)
LIMIT 10;

[query plan: see explain.tensor.ru]

Excellent: the sort works, and resource consumption has stayed "microscopic", thousands of times more efficient than "pure" FTS! All that remains is to combine everything into a single query:

(
  SELECT
    *
  FROM
    firms
  WHERE
    lower(name) >= 'роза' AND
    lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings use chr(255)
  ORDER BY
     lower(name)
  LIMIT 10
)
UNION ALL
(
  SELECT
    *
  FROM
    firms
  WHERE
    to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*') AND
    lower(name) NOT LIKE ('роза' || '%') -- the "starts with" matches were already found above
  ORDER BY
    lower(name) ~ ('^' || 'роза') DESC -- keep the same sort expression so the planner does NOT walk the btree index
  , lower(name)
  LIMIT 10
)
LIMIT 10;

Note that the second subquery is executed only if the first one returned fewer rows than the final LIMIT expects. I have already written about this query optimization technique before.
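This short-circuit behaviour is easy to observe on a toy example. The sketch below uses generate_series instead of the real table, so the row counts are purely illustrative.

```sql
-- The first branch alone satisfies the outer LIMIT,
-- so the second branch should be reported as "(never executed)"
EXPLAIN (ANALYZE, COSTS OFF)
(
  SELECT i FROM generate_series(1, 100) i LIMIT 10
)
UNION ALL
(
  SELECT i FROM generate_series(1, 100) i LIMIT 10
)
LIMIT 10;
```

If the first branch is changed to produce fewer than 10 rows, for example generate_series(1, 3), the second branch starts showing real timings and row counts, because the outer LIMIT now has to pull rows from it.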

So yes, we now have both a btree and a gin index on the table, but statistically it turned out that fewer than 10% of queries ever reach the second block. That is, with such typical limitations of the task known in advance, we were able to reduce the total consumption of server resources almost a thousandfold!

1.5*: We can do without the file

Above, LIKE got in the way because we were using the "wrong" sort order. But the sort can be "set on the right path" by specifying an operator with USING:

"ASC is assumed by default. Alternatively, a specific ordering operator name can be specified in the USING clause. An ordering operator must be a less-than or greater-than member of some B-tree operator family. ASC is usually equivalent to USING < and DESC is usually equivalent to USING >."

In our case, "less than" is ~<~:

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name) USING ~<~
LIMIT 10;

[query plan: see explain.tensor.ru]

2: how queries go sour

Now we leave our query to "stew" for half a year or a year, and are surprised to find it "at the top" again, with a total daily memory "pumping" volume (buffers shared hit) of 5.5 TB, that is, even more than it was originally.

Of course, our business has grown and the load has increased, but not by that much! So something is fishy here; let's figure it out.

2.1: The birth of paging

At some point, another development team wanted to make it possible to "jump" from the quick search to a registry showing the same results, only expanded. And what is a registry without page navigation? Let's bolt it on!

( ... LIMIT <N> + 10)
UNION ALL
( ... LIMIT <N> + 10)
LIMIT 10 OFFSET <N>;

Now the registry of search results could be shown with "page-by-page" loading without any extra effort from the developer.

In reality, of course, each subsequent page reads more and more data (everything from the previous pages, which we throw away, plus the required "tail"), so this is a clear antipattern. It would be more correct to start the search on the next iteration from the key stored in the interface, but more on that another time.
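For reference, continuing "from the key" might look roughly like the sketch below for the prefix branch. This is only an illustration, not the approach from the original system: the composite index is an assumption, and the literals 'розалия' and 12345 are hypothetical stand-ins for the (lower(name), id) pair of the last row the interface has already shown.

```sql
-- Assumed helper index so that the (lower(name), id) ordering is index-backed
CREATE INDEX IF NOT EXISTS firms_lower_name_id_idx ON firms(lower(name), id);

SELECT
  *
FROM
  firms
WHERE
  lower(name) >= 'роза' AND
  lower(name) <= ('роза' || chr(65535)) AND
  (lower(name), id) > ('розалия', 12345) -- hypothetical key of the last row already shown
ORDER BY
  lower(name)
, id
LIMIT 10;
```

Adding id as a tie-breaker makes the ordering unambiguous for firms with identical names, and every page then reads only about one page worth of index entries instead of everything that precedes it.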

2.2: I want something exotic

At some point a developer wanted to enrich the resulting sample with data from another table, and for that the whole previous query was moved into a CTE:

WITH q AS (
  ...
  LIMIT <N> + 10
)
SELECT
  *
, (SELECT ...) sub_query -- some query against a related table
FROM
  q
LIMIT 10 OFFSET <N>;

Even so, this is not too bad, since the subquery is evaluated only for the 10 returned records, if it weren't for...

2.3: DISTINCT is senseless and merciless

Somewhere in the course of this evolution, the NOT LIKE condition was lost from the 2nd subquery. Clearly, after that UNION ALL began returning some records twice: first found by the beginning of the string, and then again by the beginning of the first word of that same string. In the limit, all records of the 2nd subquery could coincide with the records of the first.

What does a developer do instead of looking for the cause?.. No question!

  • double the size of the original samples
  • apply DISTINCT to get only a single instance of each row

WITH q AS (
  ( ... LIMIT <2 * N> + 10)
  UNION ALL
  ( ... LIMIT <2 * N> + 10)
  LIMIT <2 * N> + 10
)
SELECT DISTINCT
  *
, (SELECT ...) sub_query
FROM
  q
LIMIT 10 OFFSET <N>;

The result, in the end, is exactly the same, but the chance of "flying into" the 2nd subquery of the CTE has become noticeably higher, and even without that we clearly read more.

But that is not the saddest part. Since the developer asked for DISTINCT not over specific fields but over all fields of the record at once, the sub_query field (the result of the subquery) was automatically included as well. Now, to perform DISTINCT, the database had to execute not 10 subqueries, but all <2 * N> + 10!
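The effect can be reproduced on throwaway data. In the sketch below the tables demo_firms and demo_contacts and their contents are purely illustrative; the thing to compare is the loops counter on the SubPlan node in each plan.

```sql
-- Throwaway demo data; names and sizes are illustrative only
CREATE TEMP TABLE demo_firms AS
SELECT i AS id, 'firm ' || i AS name FROM generate_series(1, 1000) i;

CREATE TEMP TABLE demo_contacts AS
SELECT i AS firm_id FROM generate_series(1, 1000) i;

-- A: no DISTINCT, the scalar subquery runs only for the rows LIMIT lets through
EXPLAIN (ANALYZE, COSTS OFF)
SELECT
  *
, (SELECT count(*) FROM demo_contacts c WHERE c.firm_id = q.id) sub_query
FROM
  demo_firms q
LIMIT 10;

-- B: DISTINCT over all columns, sub_query must be computed for every input row
EXPLAIN (ANALYZE, COSTS OFF)
SELECT DISTINCT
  *
, (SELECT count(*) FROM demo_contacts c WHERE c.firm_id = q.id) sub_query
FROM
  demo_firms q
LIMIT 10;

-- C: one possible fix, deduplicate first and compute sub_query afterwards
EXPLAIN (ANALYZE, COSTS OFF)
SELECT
  *
, (SELECT count(*) FROM demo_contacts c WHERE c.firm_id = q.id) sub_query
FROM
  (SELECT DISTINCT * FROM demo_firms) q
LIMIT 10;
```

The exact plan shape depends on the PostgreSQL version and settings, but variant A should show the SubPlan looping about 10 times and variant B about 1000 times. Variant C should bring the count back down, since the deduplication no longer depends on the subquery result.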

2.4: Cooperation above all!

So the developers lived on without a care, because the user clearly did not have the patience to "scroll" the registry to significant values of N, given the chronic slowdown of every subsequent "page".

Until developers from another department came to them, wanting to use such a convenient method for iterative search: take a chunk of some sample, filter it by additional conditions, draw the result, then take the next chunk (which in our case is achieved by increasing N), and so on until the screen is filled.

All in all, in the specimen we caught, N reached values of almost 17K, and in a single day at least 4K such queries were executed "along the chain". The last of them were boldly scanning 1 GB of memory per iteration...

Summary


source: www.habr.com
