Labarin binciken SQL daya

Disambar da ya gabata na sami rahoton bugu mai ban sha'awa daga ƙungiyar tallafin VWO. Lokacin lodawa ɗaya daga cikin rahotannin nazari na babban abokin ciniki ya zama kamar haramun ne. Kuma tunda wannan yanki ne na alhakin, nan da nan na mai da hankali kan magance matsalar.

prehistory

Don bayyana abin da nake magana akai, zan gaya muku kadan game da VWO. Wannan dandali ne wanda zaku iya ƙaddamar da yaƙin neman zaɓe daban-daban akan gidajen yanar gizonku: gudanar da gwaje-gwajen A/B, bin diddigin baƙi da jujjuyawar, nazarin mazuyin tallace-tallace, nuna taswirorin zafi da kunna rikodin ziyarar.

Amma abu mafi mahimmanci game da dandamali shine bayar da rahoto. Duk ayyukan da ke sama suna haɗe. Kuma ga abokan ciniki na kamfanoni, babban adadin bayanai ba zai zama mara amfani kawai ba tare da dandamali mai ƙarfi wanda ke gabatar da shi cikin sigar nazari ba.

Amfani da dandamali, zaku iya yin tambaya bazuwar akan babban saitin bayanai. Ga misali mai sauƙi:

Nuna duk dannawa akan shafi "abc.com" DAGA <kwanan wata d1> ZUWA <kwanan wata d2> ga mutanen da suka yi amfani da Chrome OR (wanda ke cikin Turai kuma suka yi amfani da iPhone)

Kula da ma'aikatan Boolean. Suna samuwa ga abokan ciniki a cikin hanyar tambaya don yin hadaddun tambayoyin gayyata don samun samfurori.

A hankali bukatar

Abokin ciniki da ake tambaya yana ƙoƙarin yin wani abu da ya kamata a hankali ya yi aiki da sauri:

Nuna duk bayanan zaman don masu amfani waɗanda suka ziyarci kowane shafi tare da URL mai ɗauke da "/ ayyuka"

Wannan rukunin yanar gizon yana da tarin zirga-zirgar ababen hawa kuma muna adana sama da URLs na musamman don shi kawai. Kuma sun so nemo samfuri mai sauƙi na URL wanda ke da alaƙa da tsarin kasuwancin su.

Binciken farko

Bari mu dubi abin da ke faruwa a cikin ma'ajin bayanai. A ƙasa akwai ainihin jinkirin tambayar SQL:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND sessions.referrer_id = recordings_urls.id 
    AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   ) 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0 ;

Kuma a nan ne lokutan:

Lokacin da aka tsara: 1.480 ms Lokacin aiwatarwa: 1431924.650 ms

Tambayar ta zazzage layuka dubu 150. Mai tsara shirin tambayar ya nuna wasu bayanai masu ban sha'awa guda biyu, amma ba a bayyane yake ba.

Bari mu kara nazarin bukatar. Kamar yadda kake gani, yana yi JOIN teburi uku:

  1. zaman: don nuna bayanan zaman: mai bincike, wakilin mai amfani, ƙasa, da sauransu.
  2. rikodin_dataURLs da aka yi rikodin, shafuka, tsawon lokacin ziyara
  3. amsoshi: Don guje wa kwafin manyan URLs, muna adana su a cikin wani tebur daban.

Hakanan lura cewa duk teburin mu an riga an raba su account_id. Ta wannan hanyar, yanayin da wani babban asusun musamman ke haifar da matsala ga wasu ba a cire shi ba.

Neman alamu

Bayan dubawa na kusa, za mu ga cewa wani abu ba daidai ba ne tare da takamaiman buƙata. Yana da kyau a kalli wannan layin:

urls && array(
	select id from acc_{account_id}.urls 
	where url  ILIKE  '%enterprise_customer.com/jobs%'
)::text[]

Tunani na farko shine watakila saboda ILIKE akan duk waɗannan dogayen URLs (muna da sama da miliyan 1,4 na musamman URLs da aka tattara don wannan asusun) aikin na iya wahala.

Amma a'a, wannan ba batun ba ne!

SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';
  id
--------
 ...
(198661 rows)

Time: 5231.765 ms

Neman samfurin samfuri kanta yana ɗaukar daƙiƙa 5 kawai. Neman tsari a cikin URL na musamman na miliyan ba matsala ba ne.

Wanda ake zargi na gaba a jerin suna da yawa JOIN. Wataƙila yawan amfani da su ya haifar da raguwar? Yawancin lokaci JOIN's sune mafi bayyanannen 'yan takara don matsalolin aiki, amma ban yi imani da yanayin mu ba.

analytics_db=# SELECT
    count(*)
FROM
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data_0 as recording_data,
    acc_{account_id}.sessions_0 as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND sessions.referrer_id = recordings_urls.id
    AND r_time > to_timestamp(1542585600)
    AND r_time < to_timestamp(1545177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
  8086
(1 row)

Time: 147.851 ms

Kuma wannan ma ba lamarinmu bane. JOIN' ya juya ya zama da sauri sosai.

Yanke da'irar wadanda ake zargi

Na shirya don fara canza tambayar don cimma duk wani ingantaccen aiki mai yiwuwa. Ni da ƙungiyara mun samar da manyan dabaru guda biyu:

  • Yi amfani da EXISTS don adireshin URL: Muna son sake bincika ko akwai wasu matsaloli tare da subquery na URLs. Hanya ɗaya don cimma wannan ita ce amfani kawai EXISTS. EXISTS iya yana haɓaka aiki sosai tunda yana ƙarewa nan da nan da zaran ya sami igiya ɗaya tilo da ta dace da yanayin.

SELECT
	count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data as recording_data,
    acc_{account_id}.sessions as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND  (  1 = 1  )
    AND sessions.referrer_id = recordings_urls.id
    AND  (exists(select id from acc_{account_id}.urls where url  ILIKE '%enterprise_customer.com/jobs%'))
    AND r_time > to_timestamp(1547585600)
    AND r_time < to_timestamp(1549177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
 32519
(1 row)
Time: 1636.637 ms

To, eh. Subquery lokacin nannade ciki EXISTS, yana sa komai yayi sauri sosai. Tambaya mai ma'ana ta gaba shine dalilin da yasa buƙatar tare da JOIN-ami da subquery kanta suna sauri daban-daban, amma suna jinkirin tare?

  • Matsar da subquery zuwa CTE : Idan tambaya ta yi sauri da kanta, za mu iya ƙididdige sakamakon azumi da farko sannan mu samar da shi ga babbar tambaya.

WITH matching_urls AS (
    select id::text from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%'
)

SELECT 
    count(*) FROM acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions,
    matching_urls
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id
    AND (urls && array(SELECT id from matching_urls)::text[])
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545107599)
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0;

Amma har yanzu ya kasance a hankali.

Gano mai laifi

A duk tsawon wannan lokacin, wani ɗan ƙaramin abu ya haskaka a idanuna, wanda a koyaushe ina gogewa. Amma tunda babu sauran abin da ya rage sai na yanke shawarar in kalle ta. Ina magana akai && ma'aikaci. Wallahi EXISTS kawai ingantaccen aiki && shine kawai abin da ya rage na gama gari a duk nau'ikan binciken jinkirin.

Kallon takardun shaida, muna ganin haka && amfani dashi lokacin da kake buƙatar nemo abubuwan gama gari tsakanin tsararraki biyu.

A cikin ainihin buƙatar wannan shine:

AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   )

Wanne yana nufin muna yin binciken tsari akan URLs ɗinmu, sannan nemo hanyar haɗin gwiwa tare da duk URLs masu rubutun gama gari. Wannan ɗan ruɗani ne saboda "urls" a nan baya nufin teburin da ke ɗauke da duk URLs, amma ga ginshiƙin "urls" a cikin tebur. recording_data.

Tare da karuwar tuhuma game da &&, Na yi ƙoƙarin nemo musu tabbaci a cikin shirin tambaya da aka samar EXPLAIN ANALYZE (Na riga na sami tanadin tsari, amma yawanci na fi jin daɗin yin gwaji a cikin SQL fiye da ƙoƙarin fahimtar fahintar masu tsara tambaya).

Filter: ((urls && ($0)::text[]) AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone) AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone) AND (duration >= '5'::double precision) AND (num_of_pages > 0))
                           Rows Removed by Filter: 52710

Akwai layukan tacewa da yawa daga kawai &&. Wanda ke nufin cewa wannan aiki ba tsada kawai ba ne, amma kuma an yi shi sau da yawa.

Na gwada wannan ta hanyar ware yanayin

SELECT 1
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data_30 as recording_data_30, 
    acc_{account_id}.sessions_30 as sessions_30 
WHERE 
	urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]

Wannan tambayar ta kasance a hankali. Domin da JOIN-s suna da sauri kuma subqueries suna da sauri, abin da ya rage shi ne && ma'aikaci.

Wannan babban aiki ne kawai. A koyaushe muna buƙatar bincika gabaɗayan tebur na URLs don bincika tsari, kuma koyaushe muna buƙatar nemo mahadar. Ba za mu iya bincika ta bayanan URL kai tsaye ba, saboda waɗannan ID ne kawai ke magana urls.

A kan hanyar zuwa mafita

&& sannu a hankali domin duka saitin suna da girma. Aikin zai yi sauri idan na maye gurbinsa urls a kan { "http://google.com/", "http://wingify.com/" }.

Na fara nemo hanyar da zan yi saita tsaka-tsaki a Postgres ba tare da amfani ba &&, amma ba tare da nasara da yawa ba.

A ƙarshe, mun yanke shawarar kawai magance matsalar a ware: ba ni komai urls Layukan da URL ɗin ya dace da tsarin. Ba tare da ƙarin sharuɗɗa ba zai kasance - 

SELECT urls.url
FROM 
	acc_{account_id}.urls as urls,
	(SELECT unnest(recording_data.urls) AS id) AS unrolled_urls
WHERE
	urls.id = unrolled_urls.id AND
	urls.url  ILIKE  '%jobs%'

Madadin haka JOIN syntax Na yi amfani da subquery kuma na faɗaɗa recording_data.urls tsararru ta yadda zaku iya amfani da yanayin kai tsaye WHERE.

Abu mafi mahimmanci a nan shi ne && ana amfani dashi don bincika ko shigarwar da aka bayar ta ƙunshi URL ɗin da ya dace. Idan ka ɗan leƙa kaɗan, za ka ga wannan aikin yana motsawa ta cikin abubuwan da ke cikin tsararru (ko layuka na tebur) kuma yana tsayawa lokacin da yanayin (match) ya cika. Ba ya tunatar da ku komai? iya, EXISTS.

Tun daga recording_data.urls za a iya isar da shi daga wajen mahallin subquery, lokacin da wannan ya faru za mu iya komawa kan tsohon abokinmu EXISTS kuma kunsa subquery da shi.

Hada komai tare, muna samun ingantaccen tambaya ta ƙarshe:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0
    AND EXISTS(
        SELECT urls.url
        FROM 
            acc_{account_id}.urls as urls,
            (SELECT unnest(urls) AS rec_url_id FROM acc_{account_id}.recording_data) 
            AS unrolled_urls
        WHERE
            urls.id = unrolled_urls.rec_url_id AND
            urls.url  ILIKE  '%enterprise_customer.com/jobs%'
    );

Kuma lokacin jagora na ƙarshe Time: 1898.717 ms Lokacin bikin?!?

Ba da sauri ba! Da farko kuna buƙatar duba daidai. Na yi matukar shakku akai EXISTS ingantawa yayin da yake canza dabaru don kammalawa a baya. Muna buƙatar tabbatar da cewa ba mu ƙara kuskuren da ba a bayyane ga buƙatar ba.

Gwaji mai sauƙi ya gudana count(*) akan duka a hankali da tambayoyin gaggawa don adadi mai yawa na saitin bayanai daban-daban. Bayan haka, don ƙaramin ɓangaren bayanan, na tabbatar da hannu da hannu cewa duk sakamakon daidai ne.

Duk gwaje-gwajen sun ba da sakamako mai kyau akai-akai. Mun gyara komai!

Darussan Da Aka Koyi

Akwai darussa da yawa da za a koya daga wannan labarin:

  1. Shirye-shiryen tambaya ba su ba da cikakken labarin ba, amma suna iya ba da alamu
  2. Babban wadanda ake zargi ba koyaushe ne ainihin masu laifi ba
  3. Ana iya wargaza tambayoyin a hankali don ware ƙulla
  4. Ba duk ingantawa ba ne mai raguwa a yanayi
  5. Amfani EXIST, inda zai yiwu, zai iya haifar da karuwa mai yawa a cikin yawan aiki

ƙarshe

Mun tafi daga lokacin tambaya na ~ 24 minutes zuwa 2 seconds - babban haɓakar aiki sosai! Kodayake wannan labarin ya fito da girma, duk gwaje-gwajen da muka yi sun faru a rana ɗaya, kuma an kiyasta cewa sun ɗauki tsakanin 1,5 zuwa 2 hours don ingantawa da gwaji.

SQL harshe ne mai ban sha'awa idan ba ku ji tsoronsa ba, amma kuyi ƙoƙarin koyo da amfani da shi. Ta hanyar samun kyakkyawar fahimtar yadda ake aiwatar da tambayoyin SQL, yadda ma'ajin bayanai ke samar da tsare-tsaren tambaya, yadda fihirisa ke aiki, da girman girman bayanan da kuke mu'amala da su, za ku iya samun nasara sosai wajen inganta tambayoyin. Yana da mahimmanci, duk da haka, don ci gaba da gwada hanyoyi daban-daban kuma a hankali a hankali warware matsalar, gano matsalolin.

Mafi kyawun sashi game da samun sakamako irin waɗannan shine sananne, ingantaccen saurin gani - inda rahoton da a baya ma ba zai yi lodi ba yanzu kusan nan take.

Godiya ta musamman ga 'yan uwana a umurnin Aditya MishraAditya Gauru и Varun Malhotra domin tada hankali da kuma Dinkar Pandir don gano wani muhimmin kuskure a cikin buƙatarmu ta ƙarshe kafin mu yi bankwana da shi!

source: www.habr.com

Add a comment