Crita saka siji penyelidikan SQL

Desember kepungkur aku nampa laporan bug sing menarik saka tim dhukungan VWO. Wektu loading kanggo salah sawijining laporan analytics kanggo klien perusahaan gedhe katon nglarang. Lan amarga iki minangka tanggung jawabku, aku langsung fokus kanggo ngrampungake masalah kasebut.

prasejarah

Kanggo nggawe cetha apa aku ngomong bab, Aku bakal pitutur marang kowe sethitik babagan VWO. Iki minangka platform sing bisa digunakake kanggo ngluncurake macem-macem kampanye sing ditargetake ing situs web sampeyan: nindakake eksperimen A/B, nglacak pengunjung lan konversi, nganalisa corong penjualan, nampilake peta panas lan muter rekaman kunjungan.

Nanging sing paling penting babagan platform yaiku nglaporake. Kabeh fungsi ing ndhuwur saling gegandhengan. Lan kanggo klien perusahaan, jumlah informasi sing akeh bakal dadi ora ana gunane tanpa platform kuat sing nyedhiyakake ing wangun analytics.

Nggunakake platform, sampeyan bisa nggawe pitakon acak ing set data gedhe. Punika conto prasaja:

Tampilake kabeh klik ing kaca "abc.com" Saka <tanggal d1> nganti <tanggal d2> kanggo wong sing nggunakake Chrome UTAWA (dumunung ing Eropa LAN nggunakake iPhone)

Pay manungsa waΓ© menyang operator Boolean. Padha kasedhiya kanggo klien ing antarmuka query kanggo nggawe pitakon rumit kanthi sewenang-wenang kanggo njupuk conto.

Panjaluk alon

Klien sing dimaksud nyoba nindakake apa wae sing kudu ditindakake kanthi intuisi kanthi cepet:

Tampilake kabeh rekaman sesi kanggo pangguna sing ngunjungi kaca apa wae kanthi URL sing ngemot "/jobs"

Situs iki duwe ton lalu lintas lan kita nyimpen luwih saka yuta URL unik mung kanggo iku. Lan dheweke pengin golek template URL sing cukup prasaja sing ana hubungane karo model bisnis.

Penyelidikan awal

Ayo ndeleng apa sing kedadeyan ing basis data. Ing ngisor iki query SQL alon asli:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND sessions.referrer_id = recordings_urls.id 
    AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   ) 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0 ;

Lan iki wektu:

Wektu sing direncanakake: 1.480 ms Wektu eksekusi: 1431924.650 ms

Pitakonan nyusup 150 ewu larik. Perencana pitakon nuduhake sawetara rincian sing menarik, nanging ora ana kemacetan sing jelas.

Ayo sinau panjaluk kasebut luwih lanjut. Kaya sing sampeyan ngerteni, dheweke nindakake JOIN telung tabel:

  1. mau: kanggo nampilake informasi sesi: browser, agen pangguna, negara, lan liya-liyane.
  2. rekaman_data: URL sing direkam, kaca, durasi kunjungan
  3. URL: Kanggo ngindhari duplikasi URL sing gedhe banget, kita simpen ing tabel sing kapisah.

Uga elinga yen kabeh tabel kita wis dipisahake dening account_id. Kanthi cara iki, kahanan ing ngendi siji akun gedhe banget nyebabake masalah kanggo wong liya ora kalebu.

Nggolek pitunjuk

Sawise dipriksa kanthi tliti, kita weruh ana sing salah karo panjaluk tartamtu. Iku worth njupuk dipikir nyedhaki baris iki:

urls && array(
	select id from acc_{account_id}.urls 
	where url  ILIKE  '%enterprise_customer.com/jobs%'
)::text[]

Pikiran pisanan sing mbok menawa amarga ILIKE ing kabeh URL sing dawa iki (kita duwe luwih saka 1,4 yuta unik URL sing diklumpukake kanggo akun iki) kinerja bisa nandhang sangsara.

Nanging ora, iku dudu titik!

SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';
  id
--------
 ...
(198661 rows)

Time: 5231.765 ms

Panjaluk telusuran cithakan mung butuh 5 detik. Nggoleki pola ing yuta URL unik jelas ora dadi masalah.

Suspect sabanjurΓ© ing dhaftar sawetara JOIN. Mbok sing overuse wis nyebabake slowdown? Biasane JOIN'S minangka calon sing paling jelas kanggo masalah kinerja, nanging aku ora ngandel yen kasus iki khas.

analytics_db=# SELECT
    count(*)
FROM
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data_0 as recording_data,
    acc_{account_id}.sessions_0 as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND sessions.referrer_id = recordings_urls.id
    AND r_time > to_timestamp(1542585600)
    AND r_time < to_timestamp(1545177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
  8086
(1 row)

Time: 147.851 ms

Lan iki uga dudu kasus kita. JOIN's ternyata cukup cepet.

Narrowing mudhun bunder saka tersangka

Aku wis siyap miwiti ngganti pitakon kanggo entuk perbaikan kinerja sing bisa ditindakake. Aku lan tim ngembangake 2 gagasan utama:

  • Gunakake EXISTS kanggo URL subquery: Kita pengin mriksa maneh yen ana masalah karo subquery kanggo URL. Salah siji cara kanggo entuk iki mung nggunakake EXISTS. EXISTS bisa nemen nambah kinerja wiwit ends sanalika ketemu mung senar sing cocog kondisi.

SELECT
	count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data as recording_data,
    acc_{account_id}.sessions as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND  (  1 = 1  )
    AND sessions.referrer_id = recordings_urls.id
    AND  (exists(select id from acc_{account_id}.urls where url  ILIKE '%enterprise_customer.com/jobs%'))
    AND r_time > to_timestamp(1547585600)
    AND r_time < to_timestamp(1549177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
 32519
(1 row)
Time: 1636.637 ms

Inggih, inggih. Subquery nalika dibungkus EXISTS, ndadekake kabeh super cepet. Pitakonan logis sabanjurΓ© kok request karo JOIN-ami lan subquery dhewe cepet individu, nanging banget alon bebarengan?

  • Pindhah subquery menyang CTE : Yen pitakon cepet dhewe, kita mung bisa ngetung asil cepet dhisik banjur menehi menyang pitakon utama

WITH matching_urls AS (
    select id::text from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%'
)

SELECT 
    count(*) FROM acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions,
    matching_urls
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id
    AND (urls && array(SELECT id from matching_urls)::text[])
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545107599)
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0;

Nanging isih alon banget.

Nggoleki pelakune

Kabeh wektu iki, siji bab cilik flashed sadurunge mripatku, kang terus-terusan disikat aside. Nanging amarga ora ana apa-apa liyane, aku mutusake kanggo ndeleng dheweke uga. Aku ngomong babagan && operator. Bye EXISTS mung apik kinerja && mung siji-sijine faktor umum sing isih ana ing kabeh versi pitakon alon.

nyawang dokumentasi, kita weruh sing && digunakake nalika sampeyan kudu nemokake unsur umum antarane rong susunan.

Ing panjalukan asli iki:

AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   )

Tegese kita nindakake telusuran pola ing URL kita, banjur temokake persimpangan karo kabeh URL kanthi kiriman umum. Iki rada mbingungake amarga "url" ing kene ora nuduhake tabel sing ngemot kabeh URL, nanging menyang kolom "url" ing tabel. recording_data.

Kanthi akeh anggepan babagan &&, Aku nyoba golek konfirmasi kanggo wong-wong mau ing rencana query kui EXPLAIN ANALYZE (Aku wis duwe rencana sing disimpen, nanging aku biasane luwih seneng eksperimen ing SQL tinimbang nyoba ngerti opacity saka perancang pitakon).

Filter: ((urls && ($0)::text[]) AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone) AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone) AND (duration >= '5'::double precision) AND (num_of_pages > 0))
                           Rows Removed by Filter: 52710

Ana sawetara baris saringan mung saka &&. Sing tegese operasi iki ora mung larang, nanging uga ditindakake kaping pirang-pirang.

Aku nyoba iki kanthi ngisolasi kondisi kasebut

SELECT 1
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data_30 as recording_data_30, 
    acc_{account_id}.sessions_30 as sessions_30 
WHERE 
	urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]

Pitakon iki alon. Amarga ing JOIN-s cepet lan subqueries cepet, mung bab kiwa ana && operator.

Iki mung operasi kunci. Kita mesthi kudu nggoleki kabeh tabel dhasar URL kanggo nggoleki pola, lan kita kudu tansah golek persimpangan. Kita ora bisa nelusuri kanthi cathetan URL langsung, amarga iki mung ID sing dirujuk urls.

Ing dalan kanggo solusi

&& alon amarga loro mranata ageng. Operasi bakal relatif cepet yen aku ngganti urls ing { "http://google.com/", "http://wingify.com/" }.

Aku miwiti nggoleki cara kanggo nggawe persimpangan ing Postgres tanpa nggunakake &&, nanging tanpa akeh sukses.

Pungkasane, kita mutusake kanggo ngrampungake masalah kasebut kanthi mandiri: wenehi kabeh urls baris sing URL cocog karo pola. Tanpa syarat tambahan iku bakal - 

SELECT urls.url
FROM 
	acc_{account_id}.urls as urls,
	(SELECT unnest(recording_data.urls) AS id) AS unrolled_urls
WHERE
	urls.id = unrolled_urls.id AND
	urls.url  ILIKE  '%jobs%'

Tinimbang JOIN sintaks aku mung nggunakake subquery lan ditambahi recording_data.urls array supaya sampeyan bisa langsung ngetrapake kondisi kasebut ing WHERE.

Sing paling penting ing kene yaiku && digunakake kanggo mriksa apa entri diwenehi ngemot URL sing cocog. Yen sampeyan squint sethitik, sampeyan bisa ndeleng operasi iki gerakane liwat unsur Uploaded (utawa larik saka meja) lan mandheg nalika kondisi (cocog) ketemu. Apa ora ngelingake sampeyan apa-apa? ya, EXISTS.

Wiwit ing recording_data.urls bisa dirujuk saka njaba konteks subquery, nalika kedadeyan kasebut, kita bisa bali menyang kanca lawas EXISTS lan mbungkus subquery karo.

Nggabungake kabeh, kita entuk pitakon sing dioptimalake pungkasan:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0
    AND EXISTS(
        SELECT urls.url
        FROM 
            acc_{account_id}.urls as urls,
            (SELECT unnest(urls) AS rec_url_id FROM acc_{account_id}.recording_data) 
            AS unrolled_urls
        WHERE
            urls.id = unrolled_urls.rec_url_id AND
            urls.url  ILIKE  '%enterprise_customer.com/jobs%'
    );

Lan wektu timbal pungkasan Time: 1898.717 ms Wektu kanggo ngrayakake?!?

Ora cepet banget! Pisanan sampeyan kudu mriksa bener. Aku iki arang banget curiga babagan EXISTS optimasi minangka ngganti logika kanggo ngrampungake sadurungΓ©. Kita kudu mesthekake yen kita durung nambahake kesalahan sing ora jelas ing panyuwunan kasebut.

A test prasaja kanggo mbukak count(*) ing pitakon alon lan cepet kanggo pirang-pirang set data sing beda. Banjur, kanggo subset cilik data, aku verifikasi kanthi manual yen kabeh asil bener.

Kabeh tes menehi asil positif kanthi konsisten. Kita ndandani kabeh!

Piwulang sing Disinaoni

Kathah piwulang ingkang saged dipunpendhet saking cariyos punika:

  1. Rencana pitakon ora nyritakake kabeh, nanging bisa menehi pitunjuk
  2. Tersangka utama ora mesthi dadi pelaku nyata
  3. Pitakonan alon bisa dipecah kanggo ngisolasi bottlenecks
  4. Ora kabeh optimasi sifate reduktif
  5. Gunakake EXIST, yen bisa, bisa nyebabake peningkatan produktivitas sing dramatis

kesimpulan

Kita pindhah saka wektu pitakon ~ 24 menit dadi 2 detik - peningkatan kinerja sing cukup signifikan! Sanajan artikel iki metu gedhe, kabeh eksperimen sing ditindakake sajrone sedina, lan kira-kira butuh 1,5 nganti 2 jam kanggo optimasi lan tes.

SQL minangka basa sing apik yen sampeyan ora wedi, nanging coba sinau lan gunakake. Kanthi duwe pangerten sing apik babagan carane query SQL dieksekusi, carane database nggawe rencana query, carane indeks bisa digunakake, lan mung ukuran data sing sampeyan tindakake, sampeyan bisa sukses banget ing ngoptimalake pitakon. Nanging, penting banget kanggo terus nyoba pendekatan sing beda-beda lan alon-alon ngilangi masalah kasebut, nemokake kemacetan.

Sisih paling apik babagan entuk asil kaya iki yaiku paningkatan kacepetan sing katon lan katon - ing ngendi laporan sing sadurunge ora bakal dimuat saiki meh langsung dimuat.

Matur nuwun khusus kanggo kancaku ing dhawuhe Aditya MishraAditya Gauru ΠΈ Varun Malhotra kanggo brainstorming lan Dinkar Pandir kanggo nemokake kesalahan penting ing panjalukan pungkasan kita sadurunge pungkasanipun pamit!

Source: www.habr.com

Add a comment