Data Build Tool, or what a Data Warehouse and a Smoothie have in common
What principles is an ideal data warehouse built on?
A focus on business value and analytics, with no boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modular, extensible, open source, and community-driven. User-friendly documentation and dependency visualization (Data Lineage).
More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below.
Hello everyone
This is Artemy Kozyr. For more than 5 years I have been working with data warehouses, building ETL / ELT pipelines, and doing data analytics and visualization. I currently work at Wheely and teach the Data Engineer course at OTUS, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.
Brief overview
The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.
With the arrival of fast, scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is no longer any reason to run transformations outside the Data Warehouse.
DBT does not extract data from sources, but it offers powerful capabilities for working with data that has already been loaded into the Warehouse (in Internal or External Storage).
The main purpose of DBT is to take the code, compile it into SQL, and execute the resulting commands in the correct order in the Warehouse.
DBT Project Structure
A project consists of directories and files of only two types:
Model (.sql) - a unit of transformation expressed as a SELECT query
The compiled SQL code is executed in the Warehouse in the required order (a dependency graph)
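The ordering step can be pictured as a topological sort over the references between models. The following is a minimal Python sketch of that idea, not DBT's actual implementation: the model names and SQL bodies are made up for illustration, and the regular expression only recognizes the simplest `{{ ref('...') }}` form.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model sources keyed by model name (illustrative only).
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_payments": "select * from raw.payments",
    "order_payments": "select * from {{ ref('stg_payments') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} "
        "left join {{ ref('order_payments') }} using (order_id)"
    ),
}

# Matches the simplest form of a reference: {{ ref('model_name') }}
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def execution_order(models):
    """Return model names ordered so that each model comes after the models it refs."""
    # Map every model to the set of models it depends on.
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    # static_order() yields dependencies before their dependents.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(MODELS)
print(order)
```

The exact position of independent models (for example the two staging models) is not fixed; only the relative order of a model and its dependencies is guaranteed.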
This is what a run from the CLI might look like:
Everything is a SELECT
This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code involved in materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
What interesting things do we see here?
First: the use of CTEs (Common Table Expressions) to organize code that contains many transformations and a lot of business logic, and to make it easier to understand
Second: the model code is a mix of SQL and Jinja (a templating language).
The example uses a for loop to generate the amount column for each payment method listed in the set expression. It also uses the ref function, which lets a model refer to other models from within its code.
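To make the two Jinja features more concrete, here is a rough Python sketch of what they expand to. It mimics the result with plain string formatting and is not how DBT renders templates; the `resolve_ref` helper and the `analytics` schema name are assumptions made for the example.

```python
PAYMENT_METHODS = ["credit_card", "coupon", "bank_transfer", "gift_card"]


def resolve_ref(model_name, schema="analytics"):
    """Stand-in for ref(): map a model name to a schema-qualified relation."""
    # The schema name is hypothetical; in a real project it comes from the target.
    return f"{schema}.{model_name}"


def payment_columns(methods):
    """Mimic the Jinja for loop: one <method>_amount column per payment method."""
    return [f"order_payments.{method}_amount," for method in methods]


for line in payment_columns(PAYMENT_METHODS):
    print(line)
print(resolve_ref("stg_orders"))
```

Adding a new payment method then means changing one list rather than editing the select clause by hand.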
Reference Integrity - referential integrity (for example, customer_id in the orders table matches id in the customers table)
Matching against a list of accepted values
You can also add your own tests (custom data tests), such as the % deviation of revenue from the figures of a day, a week, or a month ago. Any hypothesis that can be formulated as an SQL query can become a test.
This way you can catch unwanted deviations and errors in the data in the Warehouse marts.
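As a hedged illustration of the "any hypothesis as an SQL query" idea, the sketch below runs a revenue-deviation check against an in-memory SQLite database. The convention assumed here is that the query selects the rows that violate the expectation, so an empty result means the test passed. The `daily_revenue` table, the integer `day` column, and the 50% threshold are all invented for the example.

```python
import sqlite3

# The test selects days whose revenue deviates from the previous day by more than 50%.
# Table name, column names, and the threshold are hypothetical.
REVENUE_DEVIATION_TEST = """
select cur.day, cur.revenue, prev.revenue
from daily_revenue cur
join daily_revenue prev on prev.day = cur.day - 1
where abs(cur.revenue - prev.revenue) > 0.5 * prev.revenue
"""


def run_data_test(conn, test_sql):
    """Return the failing rows; an empty list means the test passed."""
    return conn.execute(test_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table daily_revenue (day integer, revenue real)")
conn.executemany(
    "insert into daily_revenue values (?, ?)",
    [(1, 100.0), (2, 110.0), (3, 40.0)],
)

failures = run_data_test(conn, REVENUE_DEVIATION_TEST)
print("PASS" if not failures else f"FAIL: {failures}")
```

With the sample data, day 3 drops from 110.0 to 40.0, so the check reports one failing row.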
As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the level of models and even individual attributes (columns).
This is what adding tests and documentation looks like at the file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of accepted values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
And this is how that documentation looks on the generated website:
Macros and modules
The purpose of DBT is not so much to be a set of SQL scripts as to give users powerful, feature-rich means of building their own transformations and distributing them as modules.
Macros are sets of constructs and expressions that can be called as functions within models. Macros let you reuse SQL across models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
when {{ column_name }} ilike '%osx%' then 'osx'
when {{ column_name }} ilike '%android%' then 'android'
when {{ column_name }} ilike '%ios%' then 'ios'
else 'other'
end as renamed_product
{% endmacro %}
And its usage:
{% set column_name = 'product' %}
select
product,
{{ rename_category(column_name) }} -- macro call
from my_table
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
This means being able to download and use libraries such as:
dbt_utils: working with Date/Time, Surrogate Keys, Schema tests, Pivot/Unpivot and others
A full list of packages can be found on dbt hub.
Even more features
Here I will describe a few more interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.
Separation of runtime environments DEV - TEST - PROD
This works even within the same DWH cluster (in different schemas). For example, using the following expression:
with source as (
select * from {{ source('salesforce', 'users') }}
where 1=1
{% if target.name in ['dev', 'test', 'ci'] %}
and timestamp >= dateadd(day, -3, current_date)
{% endif %}
)
This code says: for the dev, test, and ci environments, take only the data for the last 3 days and no more. Runs in these environments are therefore much faster and need fewer resources. When running in the prod environment, the filter condition is ignored.
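The branching can be sketched in Python as a function that builds the query text for a given target name. This only mirrors the logic described above; `salesforce.users` stands in for whatever relation the source() call resolves to, which is an assumption of the example.

```python
# Targets that should only see a recent slice of the data.
SAMPLED_TARGETS = {"dev", "test", "ci"}


def source_query(target_name):
    """Build the source query, adding a 3-day filter only for non-production targets."""
    # "salesforce.users" is a hypothetical resolved relation name.
    sql = "select * from salesforce.users\nwhere 1=1"
    if target_name in SAMPLED_TARGETS:
        sql += "\n  and timestamp >= dateadd(day, -3, current_date)"
    return sql


print(source_query("dev"))
print(source_query("prod"))
```

The constant `where 1=1` keeps the query valid whether or not the extra condition is appended.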
Materialization with alternate column encoding
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk space usage by 20-50%.
The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution key (dist_key) and sort key (sort_key), transfers the data into it, and, if necessary, drops the old copy.
This is what a dashboard based on logging data looks like in Looker:
Warehouse Maintenance Automation
If you use extensions to the functionality of your Warehouse, such as UDFs (User Defined Functions), then versioning these functions, managing access to them, and automating the rollout of new releases is very convenient to do in DBT.
We use Python UDFs to compute hashes, extract email domains, and decode bitmasks.
{% macro create_udf() -%}
{% set sql %}
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
RETURNS varchar
LANGUAGE plpythonu
STABLE
AS $$
import hashlib
return hashlib.sha256(mes).hexdigest()
$$
;
{% endset %}
{% set table = run_query(sql) %}
{%- endmacro %}
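For readers who want to try the kind of logic such UDFs contain outside the Warehouse, here is a small standalone Python 3 sketch. Only f_sha256 corresponds to the macro above; email_domain and decode_bitmask are hypothetical helpers written for this example, not the functions used at Wheely.

```python
import hashlib


def f_sha256(mes):
    """Hex SHA-256 digest of a string (Python 3 needs the explicit encode)."""
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


def email_domain(email):
    """Part of the address after the last '@', lower-cased; None if there is no '@'."""
    if "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


def decode_bitmask(mask, flag_names):
    """Names of the flags whose bit is set; flag_names[0] is the lowest bit."""
    return [name for i, name in enumerate(flag_names) if mask & (1 << i)]


print(f_sha256("abc"))
print(email_domain("Rider@Example.com"))
print(decode_bitmask(5, ["email", "sms", "push"]))
```

The flag names passed to decode_bitmask are placeholders; a real mask would need its own agreed bit layout.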
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift, it is important to regularly collect statistics on tables and free up disk space, using the ANALYZE and VACUUM commands, respectively.
Preparing and consuming a DWH becomes as pleasant and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, a runner, and a package manager. By putting these elements together you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.
The beliefs followed by the developers of DBT are formulated as follows:
For those interested, there is a video of an open lesson I gave at OTUS a few months ago: Data Build Tool for the Amazon Redshift Warehouse.
In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:
Architectural Concepts of Big Data Applications
Practice with Spark and Spark Streaming
Exploring methods and tools for loading data sources
Building analytical marts in the DWH
NoSQL concepts: HBase, Cassandra, ElasticSearch
Principles of monitoring and orchestration
Final Project: Putting all the skills together with mentoring support