Indaba yophenyo olulodwa lwe-SQL

NgoDisemba odlule ngithole umbiko onephutha ovela ethimbeni lokusekela le-VWO . Isikhathi sokulayisha somunye wemibiko yezibalo yekhasimende elikhulu lebhizinisi sibonakale sisenqabela. Futhi njengoba lokhu kuyindawo yami yesibopho, ngagxila ngokushesha ekuxazululeni inkinga.

prehistory

Ukuze kucace ukuthi ngikhuluma ngani, ngizokutshela kancane nge-VWO. Lena inkundla ongaqalisa ngayo imikhankaso ehlosiwe ehlukahlukene kumawebhusayithi akho: yenza izivivinyo ze-A/B, ulandelele izivakashi nokuguqulwa, uhlaziye ifaneli yokuthengisa, ubonise amamephu okushisa nokudlala okurekhodiwe kokuvakashela.

Kodwa okubaluleke kakhulu ngeplatifomu ukubika. Yonke imisebenzi engenhla ixhumene. Futhi kumakhasimende ezinkampani, inani elikhulu lolwazi lingaba yize ngaphandle kwenkundla enamandla eyethula ngendlela yezibalo.

Usebenzisa inkundla, ungenza umbuzo ongahleliwe kusethi enkulu yedatha. Nasi isibonelo esilula:

Bonisa konke ukuchofoza ekhasini elithi "abc.com"
KUSUKA  KUYA 
kubantu aba
usebenzise i-Chrome NOMA
(babeseYurophu FUTHI besebenzisa i-iPhone)

Naka opharetha be-Boolean. Ayatholakala kumakhasimende kusixhumi esibonakalayo sombuzo ukuze enze imibuzo eyinkimbinkimbi ngokungafanele ukuze athole amasampula.

Isicelo esihamba kancane

Iklayenti okukhulunywa ngalo belizama ukwenza okuthile okufanele kusebenze ngokushesha:

Bonisa konke okurekhodiwe kweseshini
kubasebenzisi abavakashele noma yiliphi ikhasi
nge-URL lapho kukhona "/imisebenzi"

Leli sayithi libe nethrafikhi eningi futhi besigcina ama-URL ahlukile angaphezu kwesigidi ngenxa yalo. Futhi babefuna ukuthola isifanekiso se-URL esilula esihlobene nemodeli yebhizinisi labo.

Uphenyo lokuqala

Ake sibheke ukuthi kwenzakalani ku-database. Ngezansi umbuzo wokuqala we-SQL ongasheshi:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND sessions.referrer_id = recordings_urls.id 
    AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   ) 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0 ;

Futhi nazi izikhathi:

Isikhathi esihleliwe: 1.480 ms
Isikhathi sokwenza: 1431924.650 ms

Umbuzo ukhase imigqa eyizinkulungwane eziyi-150. Umhleli wemibuzo ubonise imininingwane embalwa ethokozisayo, kodwa azikho izithiyo ezisobala.

Ake sifunde kabanzi isicelo. Njengoba ubona, uyakwenza JOIN amatafula amathathu:

  1. sessions: ukubonisa ulwazi lweseshini: isiphequluli, i-ejenti yomsebenzisi, izwe, njalonjalo.
  2. idatha_yokurekhoda: ama-URL aqoshiwe, amakhasi, ubude besikhathi sokuvakasha
  3. ama-url: Ukugwema ukuphindaphinda ama-URL amakhulu kakhulu, siwagcina kuthebula elihlukile.

Futhi qaphela ukuthi wonke amatafula ethu asevele ahlukaniswe account_id. Ngale ndlela, isimo lapho i-akhawunti eyodwa enkulu ikakhulukazi idala izinkinga kwabanye ayifakwa.

Ifuna imikhondo

Lapho sihlolisisa, siyabona ukuthi kukhona okungalungile ngesicelo esithile. Kufanelekile ukubhekisisa lo mugqa:

urls && array(
	select id from acc_{account_id}.urls 
	where url  ILIKE  '%enterprise_customer.com/jobs%'
)::text[]

Umcabango wokuqala kwaba ukuthi mhlawumbe ngoba ILIKE kuwo wonke lawa ma-URL amade (sinezigidi ezingaphezu kwe-1,4 okuhlukile Ama-URL aqoqwe kule akhawunti) ukusebenza kungase kuphazamiseke.

Kodwa cha, lokho akulona iphuzu!

SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';
  id
--------
 ...
(198661 rows)

Time: 5231.765 ms

Isicelo sokusesha isifanekiso ngokwaso sithatha imizuzwana emi-5 kuphela. Ukusesha iphethini kuma-URL ayingqayizivele ayisigidi ngokusobala akuyona inkinga.

Umsolwa olandelayo ohlwini baningana JOIN. Mhlawumbe ukusetshenziswa kwabo ngokweqile kubangele ukwehla? Ngokuvamile JOINAmakhandidethi angamakhandidethi asobala kakhulu ezinkingeni zokusebenza, kodwa angikholwanga ukuthi udaba lwethu lwalujwayelekile.

analytics_db=# SELECT
    count(*)
FROM
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data_0 as recording_data,
    acc_{account_id}.sessions_0 as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND sessions.referrer_id = recordings_urls.id
    AND r_time > to_timestamp(1542585600)
    AND r_time < to_timestamp(1545177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
  8086
(1 row)

Time: 147.851 ms

Futhi lokhu kwakungelona icala lethu. JOIN'kuvele kwashesha kakhulu.

Ukunciphisa indilinga yabasolwa

Ngangilungele ukuqala ukushintsha umbuzo ukuze ngifinyelele noma yikuphi ukuthuthukiswa kokusebenza okungenzeka. Mina nethimba lami sakha imibono emi-2 eyinhloko:

  • Sebenzisa i-EXISTS ye-URL yombuzo omncane: Besifuna ukuphinda sihlole ukuthi ngabe zikhona yini izinkinga ngokubuza okungaphansi kwama-URL. Enye indlela yokufeza lokhu ukusebenzisa kalula EXISTS. EXISTS unga ithuthukise kakhulu ukusebenza njengoba iphetha ngokushesha lapho ithola intambo kuphela ehambisana nesimo.

SELECT
	count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data as recording_data,
    acc_{account_id}.sessions as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND  (  1 = 1  )
    AND sessions.referrer_id = recordings_urls.id
    AND  (exists(select id from acc_{account_id}.urls where url  ILIKE '%enterprise_customer.com/jobs%'))
    AND r_time > to_timestamp(1547585600)
    AND r_time < to_timestamp(1549177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
 32519
(1 row)
Time: 1636.637 ms

Yebo, yebo. I-Subquery uma isongwe EXISTS, yenza yonke into isheshe kakhulu. Umbuzo olandelayo onengqondo ukuthi kungani isicelo nge JOIN-ami kanye ne-subquery ngokwayo iyashesha ngakunye, kodwa ihamba kancane kakhulu ndawonye?

  • Ukuhambisa umbuzo ongaphansi ku-CTE : Uma umbuzo ushesha ngokwawo, singavele sibale umphumela osheshayo kuqala bese siwuhlinzeka embuzweni oyinhloko.

WITH matching_urls AS (
    select id::text from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%'
)

SELECT 
    count(*) FROM acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions,
    matching_urls
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id
    AND (urls && array(SELECT id from matching_urls)::text[])
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545107599)
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0;

Kodwa kwakusahamba kancane.

Ukuthola umenzi wecala

Sonke lesi sikhathi, into encane yayikhanya phambi kwamehlo ami, engangilokhu ngiyibhulakela eceleni. Kodwa njengoba kwakungasekho okunye, nganquma ukumbheka nami. Ngikhuluma ngani && opharetha. Sala kahle EXISTS ukusebenza okuthuthukisiwe nje && bekuwukuphela kwento evamile esele kuzo zonke izinguqulo zombuzo ongasheshi.

Bukela i imibhalo, siyabona lokho && isetshenziswe uma udinga ukuthola izakhi ezivamile phakathi kwamalungu afanayo amabili.

Esicelweni sokuqala yilesi:

AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   )

Okusho ukuthi senza usesho lwephethini kuma-URL ethu, bese sithola ukuhlangana kwawo wonke ama-URL anamaposi avamile. Lokhu kuyadida kancane ngoba "ama-url" lapha awabhekiseli kuthebula eliqukethe wonke ama-URL, kodwa kukholomu ethi "urls" kuthebula. recording_data.

Ngokukhula kwezinsolo mayelana &&, ngizamile ukubatholela isiqinisekiso kuhlelo lombuzo olwenziwe EXPLAIN ANALYZE (Bengivele nginecebo elilondoloziwe, kodwa ngivamise ukukhululeka kakhulu ukuhlola i-SQL kunokuzama ukuqonda ukufiphala kwabahleli bemibuzo).

Filter: ((urls && ($0)::text[]) AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone) AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone) AND (duration >= '5'::double precision) AND (num_of_pages > 0))
                           Rows Removed by Filter: 52710

Bekunemigqa eminingana yezihlungi ezivela kuphela &&. Okusho ukuthi lo msebenzi wawungabizi nje kuphela, kodwa futhi wenziwa izikhathi eziningana.

Ngikuvivinye lokhu ngokuhlukanisa isimo

SELECT 1
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data_30 as recording_data_30, 
    acc_{account_id}.sessions_30 as sessions_30 
WHERE 
	urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]

Lo mbuzo ubuhamba kancane. Ngoba i JOIN-s ayashesha futhi imibuzo engezansi iyashesha, okuwukuphela kwento eyayisele && opharetha.

Lona umsebenzi oyinhloko nje. Njalo sidinga ukusesha lonke ithebula elingaphansi lama-URL ukuze sifune iphethini, futhi sidinga njalo ukuthola izimpambanondlela. Asikwazi ukusesha ngamarekhodi e-URL ngokuqondile, ngoba lawa ama-ID abhekisela kuwo urls.

Endleleni eya esixazululweni

&& kancane ngoba womabili amasethi makhulu. Umsebenzi uzoshesha uma ngishintsha urls on { "http://google.com/", "http://wingify.com/" }.

Ngaqala ukufuna indlela yokwenza i-set intersection e-Postgres ngaphandle kokusebenzisa &&, kodwa ngaphandle kwempumelelo enkulu.

Ekugcineni, sinqume ukuxazulula inkinga sisodwa: nginike yonke into urls imigqa lapho i-URL ifana nephethini. Ngaphandle kwemibandela eyengeziwe kuyoba - 

SELECT urls.url
FROM 
	acc_{account_id}.urls as urls,
	(SELECT unnest(recording_data.urls) AS id) AS unrolled_urls
WHERE
	urls.id = unrolled_urls.id AND
	urls.url  ILIKE  '%jobs%'

Esikhundleni JOIN i-syntax ngisanda kusebenzisa i-subquery futhi nganwetshwa recording_data.urls array ukuze ukwazi ukusebenzisa ngokuqondile isimo ku WHERE.

Okubaluleke kakhulu lapha ukuthi && esetshenziselwa ukuhlola ukuthi okufakiwe okunikeziwe kuqukethe i-URL efanayo. Uma uthi nhla kancane, ungabona lokhu kusebenza kuhamba phakathi kwezakhi zohlelo (noma imigqa yethebula) futhi kuma lapho umbandela (ukufanisa) ufinyelelwa. Akukukhumbuzi lutho? Yebo, EXISTS.

Kusukela kuqhubeke recording_data.urls ingabhekiselwa ngaphandle kwengqikithi yombuzo omncane, uma lokhu kwenzeka singabuyela kumngane wethu wakudala EXISTS bese ugoqa i-subquery ngayo.

Ukuhlanganisa yonke into, sithola umbuzo wokugcina olungiselelwe:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0
    AND EXISTS(
        SELECT urls.url
        FROM 
            acc_{account_id}.urls as urls,
            (SELECT unnest(urls) AS rec_url_id FROM acc_{account_id}.recording_data) 
            AS unrolled_urls
        WHERE
            urls.id = unrolled_urls.rec_url_id AND
            urls.url  ILIKE  '%enterprise_customer.com/jobs%'
    );

Futhi isikhathi sokugcina sokuhola Time: 1898.717 ms Isikhathi sokugubha?!?

Hhayi ngokushesha! Okokuqala udinga ukuhlola ukunemba. Nganginokusola kakhulu EXISTS nokwenza kahle njengoba ishintsha ingqondo ukuze inqamule ngaphambi kwesikhathi. Kudingeka siqiniseke ukuthi asingezanga iphutha elingabonakali esicelweni.

Ukuhlola okulula kwakuwukugijima count(*) kuyo yomibili imibuzo enensayo nesheshayo yenani elikhulu lamasethi edatha ahlukene. Bese, kusethi encane yedatha, ngaqinisekisa mathupha ukuthi yonke imiphumela yayilungile.

Zonke izivivinyo zinikeze imiphumela engashintshile. Silungise konke!

Izifundo Ezifundiwe

Ziningi izifundo ezingafundwa kule ndaba:

  1. Izinhlelo zemibuzo aziyitsheli yonke indaba, kodwa zinganikeza imikhondo
  2. Abasolwa abaqavile akubona njalo abayizigebengu zangempela
  3. Imibuzo enensa ingahlukaniswa ukuze ihlukanise izingqinamba
  4. Akuwona wonke ukulungiselelwa okunciphisa ngokwemvelo
  5. Sebenzisa EXIST, lapho kungenzeka, kungaholela ekwandeni okumangalisayo kokukhiqiza

isiphetho

Sisuke esikhathini sombuzo esingamaminithi angu-~24 saya kumasekhondi angu-2 - ukukhuphuka okukhulu kokusebenza! Nakuba le ndatshana iphume inkulu, zonke izivivinyo esizenzile zenzeke ngosuku olulodwa, futhi kulinganiselwa ukuthi zithathe phakathi kwamahora angu-1,5 nangu-2 ukuze kulungiselelwe kahle futhi kuhlolwe.

I-SQL iwulimi oluhle kakhulu uma ungalwesabi, kodwa zama ukufunda nokuyisebenzisa. Ngokuqonda kahle ukuthi imibuzo ye-SQL yenziwa kanjani, ukuthi isizindalwazi sikhiqiza kanjani izinhlelo zemibuzo, ukuthi izinkomba zisebenza kanjani, kanye nosayizi wedatha osebenza nayo, ungaphumelela kakhulu ekulungiseleleni imibuzo. Kubalulekile ngokulinganayo, nokho, ukuqhubeka nokuzama izindlela ezehlukene bese udiliza kancane inkinga, uthole izingqinamba.

Ingxenye engcono kakhulu mayelana nokuzuza imiphumela efana nalena ukuthuthuka okubonakalayo, okubonakalayo kwesivinini - lapho umbiko obukade ubungeke ulayishe manje ulayisha cishe ngaso leso sikhathi.

Sibonga ngokukhethekile bakwethu ngomyalo ka-Aditya MishraAditya Gauru ΠΈ Varun Malhotra zokuhlanganisa ingqondo kanye I-Dinkar Pandir ngokuthola iphutha elibalulekile esicelweni sethu sokugcina ngaphambi kokuthi sivalelise ekugcineni!

Source: www.habr.com

Engeza amazwana