Thousands of managers from sales offices across the country...
So it is no surprise that we are once again analyzing "heavy" queries on one of the most heavily loaded databases: our own.
Moreover, further investigation revealed an interesting example of a query that was first optimized and then had its performance degraded as several teams reworked it one after another, each acting with only the best intentions.
0: what did the user want?
What does a user usually mean by a "quick" search by name? It almost never turns out to be an "honest" substring search like ... LIKE '%роза%', because then the result includes not only 'Розалия' and 'Магазин Роза' but also 'Гроза' and even 'Дом Деда Мороза'.
At an everyday level, the user assumes that you will provide a search by the beginning of a word in the name, and that you will rank higher the entries that begin with the entered text. And that you will do it almost instantly, fast enough for type-ahead input.
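To make the difference concrete, here is a small Python sketch (not PostgreSQL, just a model of the matching rules) that applies both interpretations to the four names above. The regular expression and the ranking key are illustrative assumptions about what "search by the beginning of a word" means.

```python
import re

names = ["Розалия", "Магазин Роза", "Гроза", "Дом Деда Мороза"]
query = "роза"

# "Honest" substring search, the equivalent of LIKE '%роза%': matches all four names
substring_hits = [n for n in names if query in n.lower()]

# Word-prefix search: the query must start the string or follow whitespace
word_prefix = re.compile(r"(^|\s)" + re.escape(query))
prefix_hits = [n for n in names if word_prefix.search(n.lower())]

# Names that start with the query go first, the rest alphabetically
ranked = sorted(prefix_hits, key=lambda n: (not n.lower().startswith(query), n.lower()))
```

Only 'Розалия' and 'Магазин Роза' survive the word-prefix rule, which is what the user actually expects to see.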
1: narrowing down the task
What is more, a person will not deliberately type 'роз магаз' so that you have to search for every word by its prefix. No, it is much easier for the user to respond to a quick suggestion for the last word than to deliberately "under-specify" the previous ones. Look at how any search engine handles this.
In general, correctly formulating the requirements for a problem is more than half of the solution. Sometimes a careful analysis of use cases can significantly affect the result.
What does an abstract developer do?
1.0: an external search engine
Oh, search is hard, I don't want to do anything at all, let's hand it to devops! Let them deploy a search engine external to the database: Sphinx, ElasticSearch, ...
This is a workable option, albeit a labor-intensive one in terms of synchronization and the speed at which changes propagate. But not in our case, because the search for each client runs only within the data of that client's own account. And the data is highly volatile: if a manager has just entered the card 'Магазин Роза', then 5-10 seconds later they may remember that they forgot to specify the email there and want to find the card and fix it.
So let's search "right in the database". Fortunately, PostgreSQL lets us do this, and in more than one way. We will look at them in turn.
1.1: an "honest" substring
We cling to the word "substring". And for indexed substring search (and even for regular expressions!) there is the excellent pg_trgm module.
Let's take the following table to keep the model simple:
CREATE TABLE firms(
  id
    serial
      PRIMARY KEY
, name
    text
);
We load 7.8 million records of real organizations into it and index them:
CREATE EXTENSION pg_trgm;
CREATE INDEX ON firms USING gin(lower(name) gin_trgm_ops);
Let's look at the first 10 records for the type-ahead search:
SELECT
  *
FROM
  firms
WHERE
  lower(name) ~ ('(^|\s)' || 'роза') -- match at the start of the string or after whitespace
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC -- names "starting with" the query first
, lower(name) -- the rest alphabetically
LIMIT 10;
Well, that is... 26ms, 31MB of data read and more than 1.7K filtered-out records, all for the 10 we were looking for. The overhead is too high. Is there anything more efficient?
1.2: text search? FTS!
Indeed, PostgreSQL provides a very powerful full-text search (FTS) mechanism, including prefix search:
CREATE INDEX ON firms USING gin(to_tsvector('simple'::regconfig, lower(name)));
SELECT
  *
FROM
  firms
WHERE
  to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC
, lower(name)
LIMIT 10;
Here parallel query execution helped us a little, cutting the time in half to 11ms. We also had to read 1.5 times less: 20MB in total. And here, less is better, because the more we read, the higher the chance of a cache miss, and every extra data page read from disk is a potential "brake" for the query.
1.3: LIKE after all?
The previous query is good in every way, except that if you run it a hundred thousand times a day, it adds up to 2TB of data read. In the best case that comes from memory, but if you are unlucky, from disk. So let's try to make it smaller.
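As a back-of-the-envelope check of that figure, assuming every call reads the same ~20MB and using decimal units:

```python
MB_PER_QUERY = 20          # data read by one FTS query, as measured above
QUERIES_PER_DAY = 100_000  # "a hundred thousand times a day"

total_mb = MB_PER_QUERY * QUERIES_PER_DAY
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB
```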
Let's remember that the user wants to see the "starting with ..." matches first. That is prefix search in its purest form, a job for text_pattern_ops! And only if we "come up short" of the 10 records we need will we have to read the rest using FTS:
CREATE INDEX ON firms(lower(name) text_pattern_ops);
SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
LIMIT 10;
Excellent performance: 0.05ms in total and slightly more than 100KB read! Only we forgot to sort by name, so that the user does not get lost in the results:
SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name)
LIMIT 10;
Oh, something is no longer so pretty: there seems to be an index, but the sort flies right past it... Of course, this is already many times more efficient than the previous option, but...
1.4: "finishing with a file"
But there is an index that lets you both search by range and use the sort normally: the ordinary btree!
CREATE INDEX ON firms(lower(name));
Only the query for it has to be "assembled by hand":
SELECT
  *
FROM
  firms
WHERE
  lower(name) >= 'роза' AND
  lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings use chr(255)
ORDER BY
  lower(name)
LIMIT 10;
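Why a range condition is equivalent to a prefix search can be seen in a small Python model of the query above. It assumes keys are compared code point by code point (similar to a "C" collation; a real database collation may order strings differently), and the sample names are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Invented sample of lower(name) values, kept sorted the way a btree keeps its keys
index = sorted(["гроза", "магазин роза", "роза", "роза ветров",
                "розалия", "розетка", "рой"])

def prefix_scan(sorted_keys, prefix, limit):
    """Return up to `limit` keys in the range [prefix, prefix + chr(65535)]."""
    lo = bisect_left(sorted_keys, prefix)                  # first key >= prefix
    hi = bisect_right(sorted_keys, prefix + chr(65535))    # just past the last key <= upper bound
    return sorted_keys[lo:hi][:limit]

hits = prefix_scan(index, "роза", 10)
```

Every key that starts with the prefix falls inside the range, and the keys come out already sorted, so no separate sort step is needed.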
Excellent: the sort works, and resource consumption stays "microscopic", thousands of times more efficient than "pure" FTS! All that remains is to combine everything into a single query:
(
  SELECT
    *
  FROM
    firms
  WHERE
    lower(name) >= 'роза' AND
    lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings use chr(255)
  ORDER BY
    lower(name)
  LIMIT 10
)
UNION ALL
(
  SELECT
    *
  FROM
    firms
  WHERE
    to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*') AND
    lower(name) NOT LIKE ('роза' || '%') -- the "starting with" rows were already found above
  ORDER BY
    lower(name) ~ ('^' || 'роза') DESC -- keep the same sort so that the btree index is NOT used
  , lower(name)
  LIMIT 10
)
LIMIT 10;
Note that the second subquery is executed only if the first one returned fewer rows than the final LIMIT expects. I have written about this method of query optimization before.
So yes, we now have both a btree and a gin index on the table, but statistically it turns out that fewer than 10% of requests ever reach the second block. That is, with such typical constraints of the task known in advance, we were able to reduce the total consumption of server resources almost a thousandfold!
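The "second block runs only when needed" behaviour described above can be modelled with lazy Python generators. This is only a sketch of the idea, not of PostgreSQL internals; the function names, the log, and the sample rows are hypothetical.

```python
from itertools import chain, islice

def search(rows, q, limit=10):
    log = []  # records which branches actually started executing

    def prefix_branch():  # stands in for the btree range scan
        log.append("btree")
        hits = (r for r in sorted(rows) if r.startswith(q))
        yield from islice(hits, limit)

    def word_branch():  # stands in for the FTS scan with the NOT LIKE filter
        log.append("fts")
        hits = (r for r in sorted(rows)
                if not r.startswith(q) and any(w.startswith(q) for w in r.split()))
        yield from islice(hits, limit)

    # chain() does not touch the second generator until the first is exhausted,
    # and the outer islice() stops pulling once `limit` rows have arrived
    results = list(islice(chain(prefix_branch(), word_branch()), limit))
    return results, log

rows = ["гроза", "магазин роза", "роза ветров", "розалия"]
enough, log_enough = search(rows, "роза", limit=2)
short, log_short = search(rows, "роза", limit=3)
```

With limit=2 the first branch fills the result and the log contains only "btree"; with limit=3 the first branch comes up short and the second one is started as well.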
1.5*: we can do without the file
Above, LIKE kept us from using the index for sorting, because the sort was the wrong one. But it can be "set on the right path" by specifying the USING operator:
By default, ASC is assumed. Alternatively, a specific ordering operator name can be specified in the USING clause. An ordering operator must be a less-than or greater-than member of some B-tree operator family. ASC is usually equivalent to USING < and DESC is usually equivalent to USING >.
In our case, "less than" is ~<~:
SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name) USING ~<~
LIMIT 10;
2: how queries turn sour
Now we leave our query to "simmer" for six months or a year, and are surprised to find it "at the top" again, with a total daily "pumping" of memory (shared buffers) of 5.5TB, that is, even more than it was originally.
No, of course, our business has grown and our load has increased, but not by that much! So something is fishy here. Let's figure it out.
2.1: the birth of paging
At some point, another development team wanted to make it possible to "jump" from the quick type-ahead search to the registry with the same, but expanded, results. And what is a registry without page navigation? Let's bolt it on!
( ... LIMIT <N> + 10)
UNION ALL
( ... LIMIT <N> + 10)
LIMIT 10 OFFSET <N>;
Now the registry of search results could be shown with "page by page" loading without any strain on the developer.
Of course, in reality every subsequent page reads more and more data (everything from the previous pages, which we throw away, plus the needed "tail"), so this is an unambiguous antipattern. It would be more correct to start the search on the next iteration from the key stored in the interface, but more on that another time.
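The cost difference between the two paging styles can be sketched in Python. This is a simplified model under stated assumptions: keys are unique and sorted, "cost" is just the number of rows the query has to produce, and the function names are hypothetical.

```python
from bisect import bisect_right

def offset_page(sorted_keys, offset, size):
    """LIMIT size OFFSET offset: produce offset + size rows, throw away the first offset."""
    fetched = sorted_keys[:offset + size]
    return fetched[offset:], len(fetched)

def keyset_page(sorted_keys, last_key, size):
    """Continue right after the last key the interface has already shown."""
    start = 0 if last_key is None else bisect_right(sorted_keys, last_key)
    page = sorted_keys[start:start + size]
    return page, len(page)

keys = [f"firm {i:03d}" for i in range(100)]  # invented, already sorted names
page_a, cost_a = offset_page(keys, 30, 10)       # fourth page via OFFSET
page_b, cost_b = keyset_page(keys, keys[29], 10) # fourth page via the stored key
```

Both calls return the same page, but the OFFSET variant produces 40 rows to show 10, and that overhead grows with every page.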
2.2: I want something exotic
At some point the developer wanted to enrich the resulting sample with data from another table, and for that the entire previous query was moved into a CTE:
WITH q AS (
  ...
  LIMIT <N> + 10
)
SELECT
  *
, (SELECT ...) sub_query -- some query against a related table
FROM
  q
LIMIT 10 OFFSET <N>;
And even so, it is not bad, since the subquery is evaluated only for the 10 returned records, if it were not for ...
2.3: DISTINCT, senseless and merciless
Somewhere in the course of this evolution, the NOT LIKE condition was lost from the 2nd subquery. Clearly, after this UNION ALL began to return some entries twice: first found at the beginning of the string, and then again at the beginning of the first word of that same string. In the limit, all records of the 2nd subquery could coincide with the records of the first.
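The duplication is easy to reproduce in a small Python model of the two branches. The helper names and the sample rows are invented for illustration; the `exclude_prefix_hits` flag plays the role of the NOT LIKE condition.

```python
def starts_with(rows, q):
    """First branch: names that begin with the query."""
    return [r for r in rows if r.startswith(q)]

def word_starts_with(rows, q, exclude_prefix_hits):
    """Second branch: names in which any word begins with the query."""
    hits = [r for r in rows if any(w.startswith(q) for w in r.split())]
    if exclude_prefix_hits:  # the NOT LIKE ('роза' || '%') condition
        hits = [r for r in hits if not r.startswith(q)]
    return hits

rows = ["роза ветров", "магазин роза"]  # invented lower(name) values
q = "роза"

with_condition = starts_with(rows, q) + word_starts_with(rows, q, True)
without_condition = starts_with(rows, q) + word_starts_with(rows, q, False)
```

Without the condition, 'роза ветров' is returned by both branches, because its first word is also a word that starts with the query.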
What does a developer do instead of looking for the cause?.. No question!
- double the size of the original samples
- apply DISTINCT to get only a single instance of each row
WITH q AS (
  ( ... LIMIT <2 * N> + 10)
  UNION ALL
  ( ... LIMIT <2 * N> + 10)
  LIMIT <2 * N> + 10
)
SELECT DISTINCT
  *
, (SELECT ...) sub_query
FROM
  q
LIMIT 10 OFFSET <N>;
That is, the result in the end is clearly exactly the same, but the chance of "flying into" the 2nd CTE subquery has become much higher, and even without that, noticeably more data is read.
But that is not the saddest part. Since the developer asked for DISTINCT not on specific fields but on all fields of the record at once, the sub_query field, the result of the subquery, was automatically included as well. Now, to perform DISTINCT, the database had to execute not 10 subqueries, but all <2 * N> + 10!
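A rough Python model shows where the extra subquery executions come from. It is a sketch under assumptions: `lookup` stands in for the correlated subquery, the counter is hypothetical instrumentation, and the model keeps first-seen order, which DISTINCT itself does not promise.

```python
def enrich_page(rows, offset, size, lookup, distinct):
    calls = 0

    def counted(r):
        nonlocal calls
        calls += 1
        return lookup(r)

    if distinct:
        # Every (row, sub_query) pair must exist before duplicates can be removed
        pairs = [(r, counted(r)) for r in rows]
        unique = list(dict.fromkeys(pairs))
        page = unique[offset:offset + size]
    else:
        # The subquery is needed only for the rows that are actually returned
        page = [(r, counted(r)) for r in rows[offset:offset + size]]
    return page, calls

N = 20
rows = list(range(2 * N + 10))  # stands in for the <2 * N> + 10 rows of the CTE
plain_page, plain_calls = enrich_page(rows, N, 10, lambda r: r * 2, distinct=False)
distinct_page, distinct_calls = enrich_page(rows, N, 10, lambda r: r * 2, distinct=True)
```

The returned page is the same in both cases, but the number of subquery evaluations grows from 10 to 2 * N + 10.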
2.4: cooperation above all!
So the developers lived on without a care, because users clearly did not have the patience to "scroll" the registry to significant values of N, given the steady slowdown in getting each next "page".
Until developers from another department came along and wanted to use such a convenient method for iterative search: we take a chunk of some sample, filter it by additional conditions, draw the result, then take the next chunk (which in our case is achieved by increasing N), and so on until the screen is filled.
In general, in the specimen we caught, N reached values of almost 17K, and in just one day at least 4K such queries were executed "along the chain". The last of them were cheerfully scanning as much as 1GB of memory per iteration...
Total
Source: www.habr.com