Operational analytics in a microservice architecture: helping and hinting Postgres FDW

A microservice architecture, like everything in this world, has its pros and cons. Some processes become easier, others harder. And for the sake of faster change and better scalability, you have to make sacrifices. One of them is that analytics becomes more complicated. In a monolith, all operational analytics can be reduced to SQL queries against an analytical replica; in a microservice architecture every service has its own database, and it seems a single query will no longer do (or will it?). If you are curious how we solved the problem of operational analytics at our company and how we learned to live with that solution, welcome.

My name is Pavel Sivash, and at DomClick I work on the team responsible for maintaining the analytical data warehouse. Nominally our work counts as data engineering, but in practice the range of tasks is much wider. There is standard data-engineering ETL/ELT, support and adaptation of tools for data analysis, and development of our own tools. In particular, for operational reporting we decided to "pretend" that we have a monolith and give analysts a single database containing all the data they need.

In general, we considered various options. We could have built a full-fledged warehouse, and we even tried, but, to be honest, we could not reconcile fairly frequent changes in logic with the rather slow process of building a warehouse and changing it (if someone has managed this, tell us how in the comments). We could have told the analysts: "Guys, learn Python and go to the analytical replicas," but that is an extra hiring requirement, and it seemed worth avoiding if possible. We decided to try FDW (Foreign Data Wrapper) technology: in essence it is a standard dblink, one that is part of the SQL standard, but with a much more convenient interface. On top of it we built a solution that eventually took root, and we settled on it. Its details are a topic for a separate article, perhaps more than one, because there is a lot to talk about: from synchronizing database schemas to access control and depersonalization of personal data. It is also worth noting that this solution is not a replacement for real analytical databases and warehouses; it only solves a specific problem.

At the top level it looks like this:

There is a PostgreSQL database where users can store their working data, and, most importantly, the analytical replicas of all services are connected to this database via FDW. This makes it possible to write a query against several databases, and it does not matter which ones: PostgreSQL, MySQL, MongoDB or something else (a file, an API; if there happens to be no suitable wrapper, you can write your own). Well, everything looks great! Shall we wrap up?

If everything had ended that quickly and simply, there would probably be no article.
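
For orientation, connecting one service replica to the shared database with postgres_fdw could look roughly like the sketch below. This is not our actual configuration: the server name, host, database, roles, password and the remote schema name are all hypothetical placeholders.

```sql
-- Hypothetical names throughout: orders_replica, the host, analyst,
-- readonly_analytics and the remote schema "public" are placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server per service replica.
CREATE SERVER orders_replica
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'orders-replica.internal', port '5432', dbname 'orders');

-- Map a local role to a read-only role on the replica.
CREATE USER MAPPING FOR analyst
    SERVER orders_replica
    OPTIONS (user 'readonly_analytics', password 'change-me');

-- Expose the remote tables as foreign tables in a local schema.
CREATE SCHEMA IF NOT EXISTS fdw_schema;

IMPORT FOREIGN SCHEMA public
    FROM SERVER orders_replica
    INTO fdw_schema;
```

After this, an analyst connected as the mapped role can query fdw_schema tables like local ones, which is exactly where the planning issues discussed next begin.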

It is important to understand how Postgres processes queries to remote servers. It seems logical, yet people often overlook it: Postgres splits the query into parts that are executed independently on the remote servers, collects the data, and performs the final computation itself, so the execution speed depends heavily on how the query is written. Also note that once the data arrives from a remote server it no longer has indexes, and nothing helps the planner, so only we can help it and give it hints. That is exactly what I want to discuss in more detail.

A simple query and its plan

To show how Postgres queries a 6-million-row table on a remote server, let's look at a simple plan.

explain analyze verbose  
SELECT count(1)
FROM fdw_schema.table;

Aggregate  (cost=418383.23..418383.24 rows=1 width=8) (actual time=3857.198..3857.198 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..402376.14 rows=6402838 width=0) (actual time=4.874..3256.511 rows=6406868 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table
Planning time: 0.986 ms
Execution time: 3857.436 ms

Using the VERBOSE option lets us see the query that will be sent to the remote server and whose result we will receive for further processing (the Remote SQL line).

Let's go a bit further and add several filters to our query: one on a boolean field, one on a timestamp falling within an interval, and one on jsonb.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=577487.69..577487.70 rows=1 width=8) (actual time=27473.818..27473.819 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..577469.21 rows=7390 width=0) (actual time=31.369..25372.466 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 5046843
        Remote SQL: SELECT created_dt, is_active, meta FROM fdw_schema.table
Planning time: 0.665 ms
Execution time: 27474.118 ms

This is the point to pay attention to when writing queries. The filters were not passed to the remote server, which means that to execute the query Postgres pulls all 6 million rows and only then filters locally (the Filter line) and aggregates. The key to success is to write the query so that the filters are passed to the remote machine and we receive and aggregate only the rows we need.

A bit of booleanshit

With boolean fields everything is simple. In the original query the problem was caused by the is operator. If we replace it with =, we get the following result:

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table
WHERE is_active = True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=508010.14..508010.15 rows=1 width=8) (actual time=19064.314..19064.314 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..507988.44 rows=8679 width=0) (actual time=33.035..18951.278 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: ((("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 3567989
        Remote SQL: SELECT created_dt, meta FROM fdw_schema.table WHERE (is_active)
Planning time: 0.834 ms
Execution time: 19064.534 ms

As you can see, the filter flew off to the remote server, and the execution time dropped from 27 to 19 seconds.

Note that the is operator differs from the = operator in that it can work with Null values. This means that is not True keeps both False and Null values in the filter, whereas != True keeps only False values. Therefore, when replacing is not, you should pass two conditions joined with OR to the filter, for example, WHERE (col != True) OR (col is null).
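
The Null caveat can be checked on a throwaway three-row set without any remote server. A minimal sketch; the column name v and the inline VALUES list are made up for illustration:

```sql
-- Three hypothetical values: true, false and NULL.
SELECT
    count(*) FILTER (WHERE v IS NOT TRUE)          AS is_not_true,  -- false and NULL pass
    count(*) FILTER (WHERE v != TRUE)              AS ne_true,      -- only false passes
    count(*) FILTER (WHERE v != TRUE OR v IS NULL) AS rewritten     -- false and NULL pass again
FROM (VALUES (TRUE), (FALSE), (NULL::boolean)) AS t(v);
-- is_not_true | ne_true | rewritten
--           2 |       1 |         2
```

The third column matches the first, which is why the rewritten filter needs the extra OR branch.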

We have dealt with the boolean, let's move on. For now, let's return the boolean filter to its original form so that we can consider the effect of the other changes independently.

timestamptz? No idea

In general, you often have to find out by experiment how to correctly write a query that involves remote servers, and only afterwards look for an explanation of why it behaves that way. Very little information about this can be found on the Internet. So, in our experiments we found that a filter on a fixed date flies off to the remote server with a bang, but when we want to set the date dynamically, for example with now() or CURRENT_DATE, this does not happen. In our example we added a filter so that the created_dt column contains data for exactly one month in the past (BETWEEN CURRENT_DATE - INTERVAL '7 month' AND CURRENT_DATE - INTERVAL '6 month'). What did we do in this case?

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt <(SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta->>'source' = 'test';

Aggregate  (cost=306875.17..306875.18 rows=1 width=8) (actual time=4789.114..4789.115 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..306874.86 rows=105 width=0) (actual time=23.475..4681.419 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text))
        Rows Removed by Filter: 76934
        Remote SQL: SELECT is_active, meta FROM fdw_schema.table WHERE ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone))
Planning time: 0.703 ms
Execution time: 4789.379 ms

We hinted to the planner to compute the dates in subqueries in advance and pass the ready-made values to the filter. And this hint gave an excellent result: the query became almost 6 times faster!

Again, it is important to be careful here: the data type in the subquery must be the same as that of the field we filter on; otherwise the planner will decide that since the types differ, it must first fetch all the data and filter it locally.
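
One way to see which type a subquery will produce is pg_typeof. The sketch below assumes, as the casts in the plans above suggest, that created_dt is timestamp with time zone; it only illustrates why the explicit ::timestamptz cast appears in the rewritten query and does not by itself prove what the planner will push down.

```sql
-- date - interval yields a plain timestamp; casting the date first
-- keeps the result in the same type as the (assumed) timestamptz column.
SELECT
    pg_typeof(CURRENT_DATE - INTERVAL '7 month')              AS without_cast,
    pg_typeof(CURRENT_DATE::timestamptz - INTERVAL '7 month') AS with_cast;
-- without_cast: timestamp without time zone
-- with_cast:    timestamp with time zone
```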

Let's return the date filter to its original form.

Freddy vs. Jsonb

Overall, the boolean fields and dates have already sped up our query considerably, but one more data type remains. The battle with filtering on it is, to be honest, not over yet, although there are successes here too. So, here is how we managed to pass a filter on a jsonb field to the remote server.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=245463.60..245463.61 rows=1 width=8) (actual time=6727.589..6727.590 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..245459.90 rows=1478 width=0) (actual time=16.213..6634.794 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 619961
        Remote SQL: SELECT created_dt, is_active FROM fdw_schema.table WHERE ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.747 ms
Execution time: 6727.815 ms

Instead of the filtering operators, you need to use the operator that checks whether one jsonb contains another. 7 seconds instead of the original 27. So far this is the only successful way of passing jsonb filters to the remote server, but one limitation matters here: we use version 9.6 of the database, although by the end of April we plan to finish the final tests and move to version 12. Once we upgrade, we will write about how it affected things, because there are a great many changes we have high hopes for: json_path, the new CTE behavior, push down (which has existed since version 10). I really want to try it out soon.
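
The two jsonb filters are not always interchangeable, so it is worth checking them on a sample document before rewriting a query. A minimal sketch on a made-up document (the id key is hypothetical): ->> compares the value as text, while @> compares typed JSON values.

```sql
SELECT
    doc ->> 'source' = 'test'           AS by_text,        -- true
    doc @> '{"source": "test"}'::jsonb  AS by_containment, -- true
    doc ->> 'id' = '1'                  AS id_as_text,     -- true: the number is rendered as text
    doc @> '{"id": "1"}'::jsonb         AS id_as_string,   -- false: the string "1" is not the number 1
    doc @> '{"id": 1}'::jsonb           AS id_as_number    -- true
FROM (VALUES ('{"source": "test", "id": 1}'::jsonb)) AS t(doc);
```

For string values such as source the two forms return the same rows; for numbers and booleans the containment literal has to use the matching JSON type.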

Finish him

We have checked how each change affects the query speed individually. Now let's see what happens when all three filters are written correctly.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active = True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt <(SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=322041.51..322041.52 rows=1 width=8) (actual time=2278.867..2278.867 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..322041.41 rows=25 width=0) (actual time=8.597..2153.809 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table WHERE (is_active) AND ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone)) AND ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.820 ms
Execution time: 2279.087 ms

Yes, the query looks more complicated, and that is a forced price, but the execution time is 2 seconds, which is more than 10 times faster! And we are talking about a simple query against a relatively small data set. On real queries we got speedups of up to several hundred times.

To sum up: if you use PostgreSQL with FDW, always check that all filters are sent to the remote server, and you will be happy... At least until you get to joins between tables from different servers. But that is a story for another article.
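
The check itself can be partly automated. The helper below is a hypothetical sketch, not something from our toolbox: it runs EXPLAIN (VERBOSE) on a query and returns every Filter line of the plan. It does not tell foreign scans from local ones, so any output is only a prompt to read the full plan, and because the query text is concatenated into the EXPLAIN statement it should only be given trusted input.

```sql
-- Hypothetical helper: lists the Filter lines of a query's plan.
-- EXPLAIN without ANALYZE only plans the query; it does not run it.
CREATE OR REPLACE FUNCTION local_filters(query text)
RETURNS SETOF text
LANGUAGE plpgsql
AS $$
DECLARE
    plan_line record;
BEGIN
    FOR plan_line IN EXECUTE 'EXPLAIN (VERBOSE) ' || query LOOP
        -- A Filter line means the condition is evaluated locally.
        IF plan_line."QUERY PLAN" ~ '^\s*Filter:' THEN
            RETURN NEXT plan_line."QUERY PLAN";
        END IF;
    END LOOP;
    RETURN;
END;
$$;

-- Usage: an empty result means no condition is left for local filtering.
SELECT * FROM local_filters($q$
    SELECT count(1) FROM fdw_schema.table WHERE is_active IS TRUE
$q$);
```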

Thank you for your attention! I will be glad to hear questions, comments and stories about your experience in the comments.

Source: www.habr.com
