On what principles is an ideal Data Warehouse built?
A focus on business value and analytics, with no boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modularity, extensibility, open source, and community. User-friendly documentation and dependency visualization (Data Lineage).
Read on for more about all of this and about the role of DBT in the Big Data & Analytics ecosystem.
Hello everyone
Artemy Kozyr here. For more than 5 years I have been working with data warehouses, building ETL/ELT, and doing data analytics and visualization. I currently work at
Overview
The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.
With the arrival of productive and scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is no longer any reason to perform transformations outside the Data Warehouse.
DBT does not extract data from sources, but it provides great capabilities for working with data that has already been loaded into the Warehouse (in internal or external storage).
The main purpose of DBT is to take code, compile it into SQL, and execute the commands in the correct sequence in the Warehouse.
DBT Project Structure
The project consists of directories and files of just two types:
- Model (.sql) - a unit of transformation expressed as a SELECT query
- Configuration file (.yml) - parameters, settings, tests, documentation
At a basic level, the work is organized as follows:
- The user prepares model code in any convenient IDE
- The models are run from the CLI, and DBT compiles the model code into SQL
- The compiled SQL code is executed in the Warehouse in the given sequence (graph)
Here is what a run from the CLI might look like:
[Figure: a DBT run started from the CLI]
Everything is SELECT
This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code associated with materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
Every model involves writing a single SELECT query that defines the resulting data set.
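To make the idea of "wrapping a SELECT" more concrete, here is a minimal Python sketch. It is not how DBT is implemented: the `materialize` helper and the table names are hypothetical, and the standard-library `sqlite3` module stands in for the Warehouse.

```python
import sqlite3


def materialize(conn, name, select_sql, materialized="table"):
    """Persist the result of a SELECT as a table or a view (hypothetical helper)."""
    if materialized not in ("table", "view"):
        raise ValueError(f"unsupported materialization: {materialized}")
    # Find out whether a relation with this name already exists, and of which kind.
    existing = conn.execute(
        "SELECT type FROM sqlite_master WHERE name = ?", (name,)
    ).fetchall()
    if existing:
        conn.execute(f"DROP {existing[0][0].upper()} {name}")
    # The model author only wrote select_sql; the DDL wrapper is generated.
    conn.execute(f"CREATE {materialized.upper()} {name} AS {select_sql}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?)", [(1, "placed"), (2, "shipped")]
)
materialize(
    conn, "shipped_orders", "SELECT * FROM stg_orders WHERE status = 'shipped'"
)
shipped = conn.execute("SELECT order_id, status FROM shipped_orders").fetchall()
```

The author of the model only supplies the SELECT text; switching the `materialized` argument is enough to change how the result is stored.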
The transformation logic can be multi-level and can consolidate data from several other models. An example of a model that builds an orders data mart (f_order):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
What interesting things can we see here?
First: CTEs (Common Table Expressions) are used to organize and make sense of code that contains many transformations and a lot of business logic.
Second: model code is a mixture of SQL and the Jinja templating language.
The example uses a for loop to generate the amount for each payment method listed in the set expression. The ref function is also used: it lets you reference other models inside the code:
- At compile time, ref is converted into a target pointer to a table or view in the Warehouse
- ref allows you to build a model dependency graph
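The two points about ref can be illustrated with a small Python sketch. It only imitates the idea with a regular expression. The `resolve_refs` function and the `analytics` schema name are assumptions made for the example, not part of DBT.

```python
import re

# Matches expressions such as {{ ref('stg_orders') }}.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def resolve_refs(model_sql, schema):
    """Return (compiled_sql, dependencies) for one model (illustrative only)."""
    dependencies = []

    def replace(match):
        model_name = match.group(1)
        if model_name not in dependencies:
            dependencies.append(model_name)
        # The reference becomes a concrete relation name in the target schema.
        return f"{schema}.{model_name}"

    compiled = REF_PATTERN.sub(replace, model_sql)
    return compiled, dependencies


model_sql = """
select * from {{ ref('stg_orders') }}
left join {{ ref('order_payments') }} using (order_id)
"""
compiled_sql, deps = resolve_refs(model_sql, schema="analytics")
```

One pass over the model text yields both results: SQL with concrete relation names, and the list of upstream models that becomes this model's edges in the dependency graph.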
Jinja brings the following capabilities to DBT:
- If / else statements - branching
- For loops - iteration
- Variables
- Macro - creating macros
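For readers less familiar with templating, the for loop in the orders model does roughly what the following plain-Python sketch does. The function name `render_amount_columns` is made up for this illustration; Jinja itself is not involved here.

```python
payment_methods = ["credit_card", "coupon", "bank_transfer", "gift_card"]


def render_amount_columns(methods):
    """Mimic the template loop: emit one column expression per payment method."""
    return "\n".join(f"order_payments.{method}_amount," for method in methods)


rendered = render_amount_columns(payment_methods)
```

Adding a payment method to the list adds a column to the compiled SELECT without touching the rest of the query.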
Materialization: Table, View, Incremental
A materialization strategy is the approach by which the resulting data set of a model is stored in the Warehouse.
The basic strategies are:
- Table - a physical table in the Warehouse
- View - a view, that is, a virtual table in the Warehouse
There are also more complex materialization strategies:
- Incremental - incremental loading (for large fact tables); new rows are added, changed rows are updated, deleted rows are cleaned up
- Ephemeral - the model is not materialized directly, but participates as a CTE in other models
- Any other strategies that you can add yourself
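The incremental strategy can be pictured as an upsert keyed on a unique column. The sketch below uses delete-then-insert on `sqlite3`. This is one possible way to apply an increment, chosen here for brevity, and the `incremental_load` helper is hypothetical. It covers new and changed rows only, not the cleanup of deleted rows.

```python
import sqlite3


def incremental_load(conn, target, unique_key, new_rows):
    """Apply an increment: replace rows that share a key, append the rest."""
    if not new_rows:
        return
    # Remove the old versions of rows that arrive again in the increment.
    keys = [(row[unique_key],) for row in new_rows]
    conn.executemany(f"DELETE FROM {target} WHERE {unique_key} = ?", keys)
    columns = list(new_rows[0])
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {target} ({', '.join(columns)}) VALUES ({placeholders})",
        [tuple(row[column] for column in columns) for row in new_rows],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_order (order_id INTEGER, status TEXT)")
conn.execute("INSERT INTO f_order VALUES (1, 'placed'), (2, 'placed')")
incremental_load(
    conn,
    "f_order",
    "order_id",
    [{"order_id": 2, "status": "shipped"}, {"order_id": 3, "status": "placed"}],
)
result = conn.execute(
    "SELECT order_id, status FROM f_order ORDER BY order_id"
).fetchall()
```

Order 1 is untouched, order 2 is updated in place, and order 3 is appended, so only the increment is processed rather than the whole table.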
In addition to materialization strategies, there are optimization options for specific Warehouses, for example:
- Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
- Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
- BigQuery: Table partitioning & clustering, Merge behavior, KMS Encryption, Labels & Tags
- Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy
The following Warehouses are currently supported:
- Postgres
- Redshift
- BigQuery
- Snowflake
- Presto (partial)
- Spark (partial)
- Microsoft SQL Server (community adapter)
Let's improve our model:
- Let's make its loading incremental (Incremental)
- Let's add distribution and sort keys for Redshift
-- Model configuration:
-- Incremental loading, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
config(
materialized='incremental',
unique_key='order_id',
dist="customer_id",
sort="order_date"
)
}}
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
where 1=1
{% if is_incremental() -%}
-- This filter is applied only on incremental runs
and order_date >= (select max(order_date) from {{ this }})
{%- endif %}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
Model dependency graph
Also known as the dependency tree, or DAG (Directed Acyclic Graph).
DBT builds the graph from the configuration of all the project's models, or more precisely from the ref() links inside models to other models. Having the graph lets you do the following:
- Run models in the correct sequence
- Parallelize the building of data marts
- Run an arbitrary subgraph
An example of the graph visualization:
[Figure: visualization of the model dependency graph]
Each node of the graph is a model; the edges of the graph are defined by ref expressions.
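The three uses of the graph listed above can be sketched with the standard-library `graphlib` module (Python 3.9+). The project below is hypothetical: `stg_payments` and the exact edges are assumptions made for the example, and DBT's own scheduler may work differently.

```python
from graphlib import TopologicalSorter

# Hypothetical project: each model maps to the models it references via ref().
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),
    "order_payments": {"stg_payments"},
    "f_order": {"stg_orders", "order_payments"},
}


def execution_batches(graph):
    """Group models into batches; models inside one batch can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def upstream_subgraph(graph, model):
    """Select a model together with everything it depends on."""
    selected, stack = {}, [model]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected[node] = graph[node]
            stack.extend(graph[node])
    return selected


batches = execution_batches(dependencies)
subgraph = upstream_subgraph(dependencies, "order_payments")
```

Each batch contains only models whose dependencies are already built, which gives both a correct order and a natural unit of parallelism. `upstream_subgraph` shows one way to cut out a part of the graph and run it on its own.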
Data Quality and Documentation
In addition to building the models themselves, DBT lets you test a number of assumptions about the resulting data set, such as:
- Not Null
- Unique
- Referential Integrity (for example, customer_id in the orders table corresponds to id in the customers table)
- Matching a list of accepted values
You can also add your own tests (custom data tests), for example the % deviation of revenue from the figures of a day, a week, or a month ago. Any assumption formulated as an SQL query can become a test.
This way you can catch unwanted deviations and errors in the data marts of the Warehouse.
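The statement that any assumption expressed as an SQL query can become a test is easy to try out. In the sketch below each check is a query that returns the offending rows, and an empty result means the check passed. That convention, the sample data, and the `sqlite3` stand-in for the Warehouse are all assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_id INTEGER);
CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, status TEXT);
INSERT INTO dim_customers VALUES (10), (11);
INSERT INTO fct_orders VALUES
    (1, 10, 'placed'),
    (2, 11, 'shipped'),
    (2, 99, 'lost');
""")

# The last order row is deliberately bad: duplicate id, unknown customer,
# unexpected status. Each test selects the rows that violate one assumption.
tests = {
    "not_null(order_id)": "SELECT * FROM fct_orders WHERE order_id IS NULL",
    "unique(order_id)": """
        SELECT order_id FROM fct_orders
        GROUP BY order_id HAVING COUNT(*) > 1""",
    "relationships(customer_id)": """
        SELECT o.customer_id FROM fct_orders AS o
        LEFT JOIN dim_customers AS c ON c.customer_id = o.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL""",
    "accepted_values(status)": """
        SELECT status FROM fct_orders
        WHERE status NOT IN
            ('placed', 'shipped', 'completed', 'return_pending', 'returned')""",
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in tests.items()}
```

A custom test, such as a revenue deviation threshold, would simply be one more entry in the same dictionary.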
As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the attribute level.
Here is what adding tests and documentation looks like at the configuration file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of accepted values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
And here is what this documentation looks like on the generated website:
[Figure: the generated documentation website]
Macros and Modules
The purpose of DBT is not so much to become a set of SQL scripts as to give users a powerful and feature-rich means of building their own transformations and distributing them as modules.
Macros are sets of constructs and expressions that can be called as functions inside models. Macros let you reuse SQL between models and projects in accordance with the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
when {{ column_name }} ilike '%osx%' then 'osx'
when {{ column_name }} ilike '%android%' then 'android'
when {{ column_name }} ilike '%ios%' then 'ios'
else 'other'
end as renamed_product
{% endmacro %}
And its usage:
{% set column_name = 'product' %}
select
product,
{{ rename_category(column_name) }} -- macro call
from my_table
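A macro behaves much like a function that returns a piece of SQL text. The Python analogue below makes that explicit and runs the generated query on `sqlite3`. Since SQLite has no `ilike`, the sketch swaps in `like`, which is case-insensitive for ASCII characters in SQLite by default. The sample rows are invented.

```python
import sqlite3


def rename_category(column_name):
    """Return a reusable CASE expression for the given column (macro analogue)."""
    return (
        "case "
        f"when {column_name} like '%osx%' then 'osx' "
        f"when {column_name} like '%android%' then 'android' "
        f"when {column_name} like '%ios%' then 'ios' "
        "else 'other' end as renamed_product"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (product TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("Mac OSX app",), ("Android app",), ("web",)],
)
# The generated expression is spliced into the query text, as a macro call would be.
query = f"select product, {rename_category('product')} from my_table order by rowid"
rows = conn.execute(query).fetchall()
```

The mapping rules live in one place, so every model that needs them gets the same CASE expression.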
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
This means you can download and use libraries such as:
- dbt_utils: working with Date/Time, Surrogate Keys, Schema tests, Pivot/Unpivot, and others
- Ready-made data mart templates for services such as Snowplow and Stripe
- Libraries for specific Data Warehouses, e.g. Redshift
- Logging: a module for logging DBT operations
The full list of packages can be found at
More features
Here I will describe a few more interesting features and implementations that my team and I use to build the Data Warehouse at
Separation of runtime environments DEV - TEST - PROD
Even within the same DWH cluster (in different schemas). For example, using the following expression:
with source as (
select * from {{ source('salesforce', 'users') }}
where 1=1
{% if target.name in ['dev', 'test', 'ci'] %}
and timestamp >= dateadd(day, -3, current_date)
{% endif %}
)
This code literally says: for the dev, test, and ci environments, take only the data for the last 3 days and no more. Runs in these environments will therefore be much faster and require fewer resources. When running in the production environment, the filter condition is ignored.
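The same branching can be written as a small Python function to show what the compiled SQL looks like for each target. This is a sketch only: the function name is hypothetical, and `salesforce.users` is an assumed stand-in for whatever relation the source expression resolves to. Note that the extra condition is appended with `and`, so the result stays valid after `where 1=1`.

```python
NON_PROD_TARGETS = ("dev", "test", "ci")


def build_source_query(target_name):
    """Render the source query for a given target environment."""
    lines = ["select * from salesforce.users", "where 1=1"]
    if target_name in NON_PROD_TARGETS:
        # Non-production targets only read the last three days of data.
        lines.append("and timestamp >= dateadd(day, -3, current_date)")
    return "\n".join(lines)


dev_sql = build_source_query("dev")
prod_sql = build_source_query("prod")
```

The model text is the same for every environment; only the target name decides whether the filter appears in the compiled query.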
Materialization with alternate column encoding
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk space usage by 20-50%.
Macro
Macro signature:
{{ compress_table(schema, table,
drop_backup=False,
comprows=none|Integer,
sort_style=none|compound|interleaved,
sort_keys=none|List<String>,
dist_style=none|all|even,
dist_key=none|String) }}
Logging model runs
You can attach hooks to each model run; they are executed before the run starts or immediately after the model has been built:
pre-hook: "{{ logging.log_model_start_event() }}"
post-hook: "{{ logging.log_model_end_event() }}"
The logging module lets you record all the necessary metadata in a separate table, which can later be used for auditing and for analyzing bottlenecks.
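As a rough picture of what such hooks accomplish, the Python sketch below writes a start event and an end event around a model build into an audit table. The `model_log` table layout and both helper functions are assumptions for the example, not the actual logging package.

```python
import sqlite3
from datetime import datetime, timezone


def log_event(conn, model, event):
    """Append one audit row; the model_log layout is an assumption."""
    conn.execute(
        "INSERT INTO model_log (model, event, logged_at) VALUES (?, ?, ?)",
        (model, event, datetime.now(timezone.utc).isoformat()),
    )


def run_with_hooks(conn, model, build):
    """Run build(conn) between a pre-hook and a post-hook."""
    log_event(conn, model, "start")  # plays the role of the pre-hook
    build(conn)
    log_event(conn, model, "end")  # plays the role of the post-hook


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_log (model TEXT, event TEXT, logged_at TEXT)")
run_with_hooks(
    conn,
    "f_order",
    lambda c: c.execute("CREATE TABLE f_order AS SELECT 1 AS order_id"),
)
events = conn.execute(
    "SELECT model, event FROM model_log ORDER BY rowid"
).fetchall()
```

With a start and an end timestamp per model, run durations can be derived with a simple query over the audit table.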
This is what a dashboard built on the logging data looks like in Looker:
[Figure: Looker dashboard built on the logging data]
Automation of Warehouse maintenance
If you use extensions of the Warehouse's functionality, such as UDFs (User Defined Functions), then versioning these functions, controlling access to them, and automating the rollout of new releases is very convenient to do in DBT.
We use UDFs in Python to compute hashes, extract email domains, and decode bitmasks.
An example of a macro that creates a UDF in any execution environment (dev, test, prod):
{% macro create_udf() -%}
{% set sql %}
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
RETURNS varchar
LANGUAGE plpythonu
STABLE
AS $$
import hashlib
return hashlib.sha256(mes).hexdigest()
$$
;
{% endset %}
{% set table = run_query(sql) %}
{%- endmacro %}
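For experimenting outside the Warehouse, the three kinds of UDF mentioned above can be sketched as ordinary Python 3 functions. Only the SHA-256 function appears in the macro. `email_domain` and `decode_bitmask`, including their names and exact behavior, are hypothetical illustrations. Unlike the body of the UDF above, Python 3 requires the string to be encoded to bytes before hashing.

```python
import hashlib


def f_sha256(mes):
    """SHA-256 hex digest of a string (encoded as UTF-8 first)."""
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


def email_domain(email):
    """Part after the last '@', lower-cased; None if there is no '@'."""
    _, separator, domain = email.rpartition("@")
    return domain.lower() if separator else None


def decode_bitmask(mask, flags):
    """Names of the flags whose bit is set; flags[0] is the lowest bit."""
    return [name for position, name in enumerate(flags) if (mask >> position) & 1]
```

Keeping such logic in small pure functions makes it easy to unit-test before it is wrapped in a CREATE FUNCTION statement and rolled out by a macro.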
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift, it is important to regularly collect statistics on tables and to free up disk space, using the ANALYZE and VACUUM commands respectively.
To do this, the commands from the redshift_maintenance macro are executed every night:
{% macro redshift_maintenance() %}
{% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
{% for row in vacuumable_tables %}
{% set message_prefix=loop.index ~ " of " ~ loop.length %}
{%- set relation_to_vacuum = adapter.get_relation(
database=row['table_database'],
schema=row['table_schema'],
identifier=row['table_name']
) -%}
{% do run_query("commit") %}
{% if relation_to_vacuum %}
{% set start=modules.datetime.datetime.now() %}
{{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
{% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
{{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
{% do run_query("ANALYZE " ~ relation_to_vacuum) %}
{% set end=modules.datetime.datetime.now() %}
{% set total_seconds = (end - start).total_seconds() | round(2) %}
{{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
{% else %}
{{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
{% endif %}
{% endfor %}
{% endmacro %}
DBT Cloud
It is possible to use DBT as a service (Managed Service). It includes:
- A web IDE for developing projects and models
- Job configuration and scheduling
- Simple and convenient access to logs
- A website with your project's documentation
- CI (Continuous Integration) integration
Conclusion
Preparing and consuming a DWH becomes as enjoyable and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. By putting these elements together you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.
The beliefs followed by the developers of DBT are formulated as follows:
- Code, not a GUI, is the best abstraction for expressing complex analytical logic
- Working with data should adopt the best practices of software engineering
- Critical data infrastructure should be controlled by its user community as open source software
- Not only analytics tools, but also code will increasingly become the property of the Open Source community
These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis of many interesting extensions that will be created in the future.
For those interested, there is a video of an open lesson that I gave a few months ago at OTUS -
In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:
- Architectural Concepts for Big Data Applications
- Practice with Spark and Spark Streaming
- Exploring methods and tools for loading data from sources
- Building analytical data marts in the DWH
- NoSQL concepts: HBase, Cassandra, ElasticSearch
- Principles of monitoring and orchestration
- Final Project: putting all the skills together with mentoring support
Source: www.habr.com