Data Build Tool, or What a Data Warehouse and a Smoothie Have in Common

On what principles is an ideal Data Warehouse built?

Focus on business value and analytics, with no boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modular, extensible, open source, and community-driven. User-friendly documentation and dependency visualization (Data Lineage).

More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below.

Hello everyone!

Artemy Kozyr here. For more than 5 years I have been working with data warehouses, building ETL/ELT, and doing data analytics and visualization. I currently work at Wheely, teach on the Data Engineer course at OTUS, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.

Overview

The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.

With the arrival of productive and scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is no longer any reason to perform transformations outside the Data Warehouse.

DBT does not extract data from sources, but it offers great capabilities for working with data that has already been loaded into the Warehouse (in Internal or External Storage).

The main purpose of DBT is to take the code, compile it into SQL, and execute the commands in the correct sequence in the Warehouse.

DBT Project Structure

A project consists of directories and files of only two types:

  • Model (.sql) - a unit of transformation expressed as a SELECT query
  • Configuration file (.yml) - parameters, settings, tests, documentation

At a basic level, the work is organized as follows:

  • The user prepares the model code in any convenient IDE
  • The models are run from the CLI, and DBT compiles the model code into SQL
  • The compiled SQL code is executed in the Warehouse in a given sequence (graph)

Here is what a run from the CLI might look like:

[Screenshot: a DBT run in the CLI]

Everything is SELECT

This is a killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code involved in materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).

Every model involves writing a single SELECT query that defines the resulting data set.
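
To make the idea concrete, here is a minimal sketch of that wrapping step, using Python's built-in sqlite3 module in place of a real Warehouse. The materialize helper and the raw_orders and stg_orders names are hypothetical and chosen only for illustration; this is an analogy for how a bare SELECT can be turned into a table or a view, not a description of DBT's internals.

```python
import sqlite3


def materialize(conn, name, select_sql, materialized="table"):
    """Wrap a model's SELECT in the DDL needed to persist it (hypothetical helper)."""
    if materialized not in ("table", "view"):
        raise ValueError(f"unsupported materialization: {materialized}")
    # Drop whatever object a previous run left behind under this name.
    existing = conn.execute(
        "select type from sqlite_master where name = ?", (name,)
    ).fetchone()
    if existing is not None:
        conn.execute(f"drop {existing[0]} {name}")
    # The model author only ever writes select_sql; the DDL is added here.
    conn.execute(f"create {materialized} {name} as {select_sql}")


conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (order_id integer, status text)")
conn.executemany(
    "insert into raw_orders values (?, ?)", [(1, "placed"), (2, "returned")]
)

select_sql = "select order_id, status from raw_orders where status <> 'returned'"
materialize(conn, "stg_orders", select_sql)
rows = conn.execute("select * from stg_orders").fetchall()

# Re-running with a different materialization replaces the old object.
materialize(conn, "stg_orders", select_sql, materialized="view")
view_rows = conn.execute("select * from stg_orders").fetchall()
```

The model author writes only the SELECT; switching between a table and a view is a one-word configuration change.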

The transformation logic can be multi-level and consolidate data from several other models. An example of a model that builds an orders mart (f_order):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

What interesting things can we see here?

First: CTEs (Common Table Expressions) are used to organize and make sense of code that contains many transformations and a lot of business logic.

Second: the model code is a mix of SQL and Jinja (a templating language).

The example uses a for loop to generate the amount for each payment method listed in the set expression. The ref function is also used; it lets you reference other models within the code:

  • At compile time, ref is converted into a target pointer to a table or view in the Warehouse
  • ref lets you build a model dependency graph
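
A rough sketch of both roles of ref in plain Python may help. The compile_refs function, the regular expression, and the analytics schema name are assumptions made for this illustration; the real compiler is considerably more involved.

```python
import re

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def compile_refs(model_sql, schema):
    """Return the compiled SQL and the list of models it references."""
    dependencies = []

    def replace(match):
        model_name = match.group(1)
        dependencies.append(model_name)      # edge in the dependency graph
        return f"{schema}.{model_name}"      # pointer to the physical relation

    return REF_PATTERN.sub(replace, model_sql), dependencies


model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
compiled_sql, dependencies = compile_refs(model_sql, schema="analytics")
```

One pass over the model text yields both the executable SQL and the list of upstream models, which is why ref can serve both purposes.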

It is Jinja that adds almost limitless possibilities to DBT. The most commonly used are:

  • If / else statements - branching statements
  • For loops
  • Variables
  • Macro - creating macros
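
The for loop in the model above simply expands to one column per payment method. As a rough illustration in plain Python (not Jinja itself), the same expansion could be written as follows; the variable names amount_columns and select_list are made up for this sketch.

```python
payment_methods = ["credit_card", "coupon", "bank_transfer", "gift_card"]

# One "<method>_amount" column per payment method, as the Jinja loop produces.
amount_columns = [f"order_payments.{method}_amount" for method in payment_methods]

# The loop output is followed by the fixed total_amount column.
select_list = ",\n    ".join(amount_columns + ["order_payments.total_amount as amount"])
print(select_list)
```

Adding a new payment method then means changing one list rather than editing the column list by hand.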

Materialization: Table, View, Incremental

The materialization strategy is the approach by which the resulting data set of a model is stored in the Warehouse.

In basic terms, these are:

  • Table - a physical table in the Warehouse
  • View - a view, that is, a virtual table in the Warehouse

There are also more complex materialization strategies:

  • Incremental - incremental loading (of large fact tables); new rows are added, changed rows are updated, deleted rows are cleared
  • Ephemeral - the model is not materialized directly, but participates as a CTE in other models
  • Any other strategies that you can add yourself

In addition to materialization strategies, there are optimization options for specific Warehouses, for example:

  • Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
  • Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
  • BigQuery: Table partitioning & clustering, Merge behavior, KMS Encryption, Labels & Tags
  • Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy

The following Warehouses are currently supported:

  • Postgres
  • Redshift
  • BigQuery
  • Snowflake
  • Presto (partially)
  • Spark (partially)
  • Microsoft SQL Server (community adapter)

Let's improve our model:

  • Make its loading incremental (Incremental)
  • Add distribution and sort keys for Redshift

-- Model configuration:
-- Incremental loading, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
       materialized='incremental',
       unique_key='order_id',
       dist="customer_id",
       sort="order_date"
   )
}}
 
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
   where 1=1
   {% if is_incremental() -%}
       -- This filter is applied only on an incremental run
       and order_date >= (select max(order_date) from {{ this }})
   {%- endif %} 
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final
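
The behavior configured above (full build on the first run, then a filtered load with an upsert by unique_key) can be imitated with a few statements. The following is a minimal sketch under stated assumptions: SQLite stands in for Redshift, run_incremental and the new_rows temp table are hypothetical names, and the delete-then-insert upsert is one possible strategy rather than necessarily the one DBT uses for a given Warehouse.

```python
import sqlite3


def run_incremental(conn, target, source, unique_key, date_column):
    """Build `target` fully on the first run and incrementally afterwards."""
    exists = conn.execute(
        "select 1 from sqlite_master where type = 'table' and name = ?", (target,)
    ).fetchone()
    if not exists:
        # First run: there is nothing to compare against, so load everything.
        conn.execute(f"create table {target} as select * from {source}")
        return
    # Incremental run: same filter as in the model's is_incremental() branch.
    conn.execute(
        f"create temp table new_rows as select * from {source} "
        f"where {date_column} >= (select max({date_column}) from {target})"
    )
    # Upsert by unique_key: remove stale versions, then insert fresh rows.
    conn.execute(
        f"delete from {target} where {unique_key} in (select {unique_key} from new_rows)"
    )
    conn.execute(f"insert into {target} select * from new_rows")
    conn.execute("drop table new_rows")


conn = sqlite3.connect(":memory:")
conn.execute("create table stg_orders (order_id integer, order_date text, status text)")
conn.executemany(
    "insert into stg_orders values (?, ?, ?)",
    [(1, "2020-05-01", "placed"), (2, "2020-05-02", "placed")],
)
run_incremental(conn, "f_orders", "stg_orders", "order_id", "order_date")

# Between runs: order 2 changes status and a new order 3 arrives.
conn.execute("update stg_orders set status = 'shipped' where order_id = 2")
conn.execute("insert into stg_orders values (3, '2020-05-03', 'placed')")
run_incremental(conn, "f_orders", "stg_orders", "order_id", "order_date")

result = conn.execute(
    "select order_id, status from f_orders order by order_id"
).fetchall()
```

Note that a date filter like this picks up a changed row only if its order_date falls inside the reloaded window, and rows deleted at the source are not removed by this sketch.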

Model dependency graph

It is also called a dependency tree, and is known as a DAG (Directed Acyclic Graph).

DBT builds the graph based on the configuration of all project models, or rather, on the ref() links within models to other models. Having the graph lets you do the following:

  • Run models in the correct sequence
  • Parallelize the building of marts
  • Run an arbitrary subgraph

Example of graph visualization:

[Image: example model dependency graph]
Each node of the graph is a model; the edges of the graph are defined by the ref expression.
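
All three capabilities follow directly from a topological ordering of the graph. Below is a small sketch using graphlib from the Python standard library (Python 3.9+); the model names and the helper functions execution_batches and upstream_subgraph are illustrative assumptions, not part of DBT.

```python
from graphlib import TopologicalSorter

# model -> set of models it references via ref(); names are illustrative
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}


def execution_batches(deps):
    """Group models into ordered batches; a batch's models can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def upstream_subgraph(deps, model):
    """Return `model` plus everything it depends on, directly or transitively."""
    selected, stack = set(), [model]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected.add(node)
            stack.extend(deps[node])
    return {name: deps[name] for name in selected}


batches = execution_batches(dependencies)
subgraph_batches = execution_batches(upstream_subgraph(dependencies, "order_payments"))
```

Each batch contains only models whose upstream models are already built, so batches run in order while the models inside one batch can run concurrently.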

Data Quality and Documentation

In addition to building the models themselves, DBT lets you test a number of assumptions about the resulting data set, such as:

  • Not Null
  • Unique
  • Referential Integrity (for example, customer_id in the orders table corresponds to id in the customers table)
  • Matching a list of accepted values

You can add your own tests (custom data tests), such as the % deviation of revenue against the figures from a day, a week, or a month ago. Any assumption formulated as a SQL query can become a test.

In this way, you can catch unwanted deviations and errors in the data in the Warehouse marts.
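
The phrase "any assumption formulated as a SQL query" can be read as: a test is a query that selects the rows violating the assumption, and it passes when that query returns nothing. Here is a hedged sketch of the four built-in checks in that style, run against an in-memory SQLite database; the sample data and the test names in the dictionary are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_customers (customer_id integer);
    create table fct_orders (order_id integer, customer_id integer, status text);
    insert into dim_customers values (10), (11);
    insert into fct_orders values
        (1, 10, 'placed'),
        (2, 11, 'shipped'),
        (2, 12, 'lost');   -- duplicate id, unknown customer, unexpected status
""")

# Each test selects the rows that violate one assumption.
tests = {
    "unique_order_id": """
        select order_id from fct_orders group by order_id having count(*) > 1""",
    "not_null_order_id": """
        select * from fct_orders where order_id is null""",
    "relationships_customer_id": """
        select o.customer_id from fct_orders o
        left join dim_customers c on c.customer_id = o.customer_id
        where c.customer_id is null""",
    "accepted_values_status": """
        select status from fct_orders
        where status not in
            ('placed', 'shipped', 'completed', 'return_pending', 'returned')""",
}

failures = {name: conn.execute(sql).fetchall() for name, sql in tests.items()}
failed_tests = sorted(name for name, rows in failures.items() if rows)
```

The deliberately bad third row trips three of the four checks, and the offending values are available in failures for debugging.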

As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the attribute level.

Here is what adding tests and documentation looks like at the configuration file level:

 - name: fct_orders
   description: This table has basic information about orders, as well as some derived facts based on payments
   columns:
     - name: order_id
       tests:
         - unique # check that values are unique
         - not_null # check that there are no nulls
       description: This is a unique identifier for an order
     - name: customer_id
       description: Foreign key to the customers table
       tests:
         - not_null
         - relationships: # referential integrity check
             to: ref('dim_customers')
             field: customer_id
     - name: order_date
       description: Date (UTC) that the order was placed
     - name: status
       description: '{{ doc("orders_status") }}'
       tests:
         - accepted_values: # check against the list of accepted values
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

And here is what this documentation looks like on the generated website:

[Screenshot: generated documentation website]

Macros and Modules

The purpose of DBT is not so much to become a set of SQL scripts as to give users a powerful, feature-rich means of building their own transformations and distributing them as modules.

Macros are sets of constructs and expressions that can be called as functions within models. Macros let you reuse SQL between models and projects in line with the DRY (Don't Repeat Yourself) engineering principle.

Macro example:

{% macro rename_category(column_name) %}
case
 when {{ column_name }} ilike  '%osx%' then 'osx'
 when {{ column_name }} ilike  '%android%' then 'android'
 when {{ column_name }} ilike  '%ios%' then 'ios'
 else 'other'
end as renamed_product
{% endmacro %}

And its usage:

{% set column_name = 'product' %}
select
 product,
 {{ rename_category(column_name) }} -- macro call
from my_table
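
For readers more at home in Python, a macro behaves much like a function that returns a fragment of SQL text. The sketch below mirrors rename_category as a Python function and runs the generated query against an in-memory SQLite database. It rests on two assumptions: SQLite's LIKE (case-insensitive for ASCII) stands in for ILIKE, and the sample rows are invented.

```python
import sqlite3


def rename_category(column_name):
    """Return a SQL CASE expression, mirroring the rename_category macro."""
    # SQLite's LIKE is case-insensitive for ASCII, standing in for ILIKE.
    return f"""case
        when {column_name} like '%osx%' then 'osx'
        when {column_name} like '%android%' then 'android'
        when {column_name} like '%ios%' then 'ios'
        else 'other'
    end as renamed_product"""


conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (product text)")
conn.executemany(
    "insert into my_table values (?)",
    [("Mac OSX app",), ("Android app",), ("Web",)],
)

# The "macro call" is plain string substitution into the model's SQL.
query = f"select product, {rename_category('product')} from my_table order by rowid"
renamed = conn.execute(query).fetchall()
```

The CASE logic lives in one place, and every query that needs it calls the function with its own column name.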

DBT comes with a package manager that lets users publish and reuse individual modules and macros.

This means being able to load and use libraries such as:

  • dbt_utils: working with Date/Time, Surrogate Keys, Schema tests, Pivot/Unpivot, and others
  • Ready-made mart templates for services such as Snowplow and Stripe
  • Libraries for specific Data Warehouses, e.g. Redshift
  • Logging - a module for logging DBT operation

A complete list of packages can be found on dbt hub.

Even more features

Here I will describe a few more interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.

Separation of runtime environments DEV - TEST - PROD

Even within a single DWH cluster (across different schemas). For example, using the following expression:

with source as (
 
   select * from {{ source('salesforce', 'users') }}
   where 1=1
   {% if target.name in ['dev', 'test', 'ci'] %}
       and timestamp >= dateadd(day, -3, current_date)
   {% endif %}
 
)

This code says: for the dev, test, and ci environments, take data only for the last 3 days and no more. That is, runs in these environments will be much faster and require fewer resources. When running in the production environment, the filter condition is ignored.

Materialization with alternate column encoding

Redshift is a columnar DBMS that lets you set data compression algorithms for each individual column. Choosing optimal algorithms can reduce disk space usage by 20-50%.

The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encoding algorithms and the specified distribution key (dist_key) and sort keys (sort_keys), transfers the data to it, and, if necessary, drops the old copy.

Macro signature:

{{ compress_table(schema, table,
                 drop_backup=False,
                 comprows=none|Integer,
                 sort_style=none|compound|interleaved,
                 sort_keys=none|List<String>,
                 dist_style=none|all|even,
                 dist_key=none|String) }}

Logging model runs

You can attach hooks to each model run; they are executed before the run starts or immediately after the model has been built:

   pre-hook: "{{ logging.log_model_start_event() }}"
   post-hook: "{{ logging.log_model_end_event() }}"

The logging module lets you record all the necessary metadata in a separate table, which can later be used for auditing and analyzing bottlenecks.
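
The mechanics of such hooks are simple: one statement runs before the model is built, another right after, and both write to an audit table. A minimal sketch follows, with SQLite standing in for the Warehouse; the model_log table, log_event, and run_model are hypothetical names, and the real logging package records more metadata than this.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("create table model_log (model text, event text, logged_at text)")


def log_event(model, event):
    """Append one audit row with a UTC timestamp."""
    conn.execute(
        "insert into model_log values (?, ?, ?)",
        (model, event, datetime.now(timezone.utc).isoformat()),
    )


def run_model(name, select_sql):
    log_event(name, "start")                              # pre-hook
    conn.execute(f"create table {name} as {select_sql}")  # build the model
    log_event(name, "end")                                # post-hook


run_model("f_orders", "select 1 as order_id")
events = conn.execute(
    "select model, event from model_log order by rowid"
).fetchall()
```

With start and end timestamps per model in one table, run durations and bottlenecks can be found with an ordinary query.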

This is what a dashboard built on the logging data looks like in Looker:

[Screenshot: Looker dashboard built on the logging data]

Automating Warehouse Maintenance

If you use extensions to the functionality of your Warehouse, such as UDFs (User Defined Functions), then versioning these functions, controlling access to them, and automating the rollout of new releases are all convenient to do in DBT.

We use UDFs in Python to compute hashes, extract email domains, and decode bitmasks.

An example of a macro that creates a UDF in any runtime environment (dev, test, prod):

{% macro create_udf() -%}
 
 {% set sql %}
       CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
           RETURNS varchar
           LANGUAGE plpythonu
           STABLE
       AS $$  
           import hashlib
           return hashlib.sha256(mes).hexdigest()
       $$
       ;
 {% endset %}
  
 {% set table = run_query(sql) %}
 
{%- endmacro %}
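
To try the same idea locally without a Redshift cluster, a Python function can be registered as a SQL function in SQLite. This is only an analogy for the UDF above, written under two assumptions: the in-memory database is a stand-in for the Warehouse, and because this sketch runs on Python 3, where hashlib requires bytes, the text is encoded first.

```python
import hashlib
import sqlite3


def f_sha256(mes):
    """Return the SHA-256 hex digest of a text value."""
    # Python 3 hashes bytes, so the text is encoded before hashing.
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# Register the Python function under the same name as the UDF above.
conn.create_function("f_sha256", 1, f_sha256)

digest = conn.execute("select f_sha256('abc')").fetchone()[0]
```

Keeping the function body in version control and creating it from a macro means every environment gets the same definition.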

At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift, it is important to regularly collect statistics on tables and free up disk space - the ANALYZE and VACUUM commands, respectively.

To do this, the commands from the redshift_maintenance macro are executed every night:

{% macro redshift_maintenance() %}
 
   {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
 
   {% for row in vacuumable_tables %}
       {% set message_prefix=loop.index ~ " of " ~ loop.length %}
 
       {%- set relation_to_vacuum = adapter.get_relation(
                                               database=row['table_database'],
                                               schema=row['table_schema'],
                                               identifier=row['table_name']
                                   ) -%}
       {% do run_query("commit") %}
 
       {% if relation_to_vacuum %}
           {% set start=modules.datetime.datetime.now() %}
           {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
           {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
           {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
           {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
           {% set end=modules.datetime.datetime.now() %}
           {% set total_seconds = (end - start).total_seconds() | round(2)  %}
           {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
       {% else %}
           {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
       {% endif %}
 
   {% endfor %}
 
{% endmacro %}

DBT Cloud

It is possible to use DBT as a service (Managed Service). It includes:

  • Web IDE for developing projects and models
  • Job configuration and scheduling
  • Simple and convenient access to logs
  • A website with your project's documentation
  • CI (Continuous Integration) integration


Conclusion

Preparing and consuming a DWH becomes as enjoyable and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. By putting these elements together you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.


The beliefs followed by the developers of DBT are formulated as follows:

  • Code, not a GUI, is the best abstraction for expressing complex analytical logic
  • Working with data should adopt best practices from software engineering (Software Engineering)
  • Critical data infrastructure should be controlled by the user community as open source software
  • Not only analytics tools, but also code will increasingly become the property of the Open Source community

These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis of many exciting extensions that will be built in the future.

For those interested, there is a video of an open lesson I gave at OTUS a few months ago - Data Build Tool for Amazon Redshift Storage.

In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:

  • Architectural Concepts for Big Data Applications
  • Practice with Spark and Spark Streaming
  • Exploring methods and tools for loading data sources
  • Building analytical marts in the DWH
  • NoSQL concepts: HBase, Cassandra, ElasticSearch
  • Principles of monitoring and orchestration
  • Final Project: putting all the skills together with mentoring support

References:

  1. DBT documentation - Introduction - official documentation
  2. What, exactly, is dbt? - overview article by one of the authors of DBT
  3. Data Build Tool for Amazon Redshift Storage - YouTube, recording of an OTUS open lesson
  4. Getting to know Greenplum - the next open lesson is on May 15, 2020
  5. Data Engineering course - OTUS
  6. Building a Mature Analytics Workflow - a look at the future of working with data and analytics
  7. It's time for open source analytics - the evolution of analytics and the influence of Open Source
  8. Continuous Integration and Automated Build Testing with dbtCloud - principles of building CI using DBT
  9. Getting started with DBT tutorial - practice, step-by-step instructions for independent work
  10. Jaffle shop - Github DBT Tutorial - Github, code of the tutorial project

Learn more about the course.

Source: www.habr.com
