Data Build Tool, or what a Data Warehouse and a Smoothie have in common

On what principles is an ideal Data Warehouse built?

Focus on business value and analytics, with no boilerplate code. Manage the DWH as a codebase: versioning, review, automated testing and CI. Modular, extensible, open source and community-driven. User-friendly documentation and dependency visualization (Data Lineage).

More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below the cut.

Hello everyone

Artemy Kozyr here. For more than 5 years I have been working with data warehouses, building ETL / ELT, as well as data analytics and visualization. I currently work at Wheely and teach at OTUS on the Data Engineer course, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.

Overview

The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.

With the advent of analytical databases such as BigQuery, Redshift and Snowflake, there is little point in doing transformations outside the Data Warehouse.

DBT does not extract data from sources, but it offers rich capabilities for working with data that has already been loaded into the warehouse (in Internal or External Storage).

The main purpose of DBT is to take the code, compile it into SQL, and execute the commands in the correct order in the warehouse.

DBT Project Structure

A project consists of directories and files of only 2 types:

  • Model (.sql) - a unit of transformation expressed as a SELECT query
  • Configuration file (.yml) - parameters, settings, tests, documentation

At a basic level, the work is organized as follows:

  • The user prepares model code in any convenient IDE
  • Using the CLI, the models are launched, and DBT compiles the model code into SQL
  • The compiled SQL code is executed in the warehouse in the given order (graph)

Here is what a run from the CLI might look like:

[Image: output of a DBT run in the CLI]

Everything is SELECT

This is a killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code associated with materializing your queries in the warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).

Each model involves writing a single SELECT query that defines the resulting dataset.

The transformation logic can be multi-level and can consolidate data from several other models. An example of a model that builds an orders data mart (f_orders):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

What interesting things can we see here?

First: CTEs (Common Table Expressions) are used to organize and make sense of code that contains many transformations and a lot of business logic.

Second: model code is a mix of SQL and Jinja (a templating language).

The example uses a for loop to generate the amount column for each payment method listed in the set expression. The ref function is also used, which lets you reference other models within the code:

  • At compile time, ref is converted into a target reference to a table or view in the warehouse
  • ref lets you build a model dependency graph
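
As a rough illustration of these two points, the Python sketch below mimics what happens to ref at compile time: each call is replaced with a schema-qualified relation name, and the referenced model names are collected as dependencies. The regular expression, the `analytics` schema and the function name `compile_refs` are assumptions made for this example; DBT's real compiler renders Jinja and does not work this way internally.

```python
import re

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def compile_refs(model_sql, schema="analytics"):
    """Return (compiled_sql, dependencies) for one model's source code."""
    # Every referenced model becomes an edge in the dependency graph.
    dependencies = REF_PATTERN.findall(model_sql)
    # Every ref() call is swapped for a concrete relation name.
    compiled_sql = REF_PATTERN.sub(lambda m: f"{schema}.{m.group(1)}", model_sql)
    return compiled_sql, dependencies


source = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
compiled, deps = compile_refs(source)
print(compiled)
print(deps)
```

The list of dependencies gathered per model is the raw material for the dependency graph.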

It is Jinja that adds almost limitless possibilities to DBT. The most commonly used are:

  • If / else statements - branching
  • For loops
  • Variables
  • Macros - creating macros
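
To make the loop construct concrete, here is a plain-Python sketch of what the for loop in the f_orders model renders to. It is not Jinja itself, only an equivalent loop, and the helper name `render_payment_columns` is made up for this example.

```python
payment_methods = ["credit_card", "coupon", "bank_transfer", "gift_card"]


def render_payment_columns(methods):
    """Build one 'order_payments.<method>_amount,' line per payment method."""
    return [f"order_payments.{method}_amount," for method in methods]


rendered_columns = render_payment_columns(payment_methods)
for line in rendered_columns:
    print(line)
```

Adding a new payment method then means changing one list rather than editing the SELECT by hand.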

Materialization: Table, View, Incremental

The materialization strategy is the approach by which a model's resulting dataset is stored in the warehouse.

In basic terms these are:

  • Table - a physical table in the warehouse
  • View - a view, a virtual table in the warehouse

There are also more complex materialization strategies:

  • Incremental - incremental loading (of large fact tables); new rows are added, changed rows are updated, deleted rows are cleared
  • Ephemeral - the model is not materialized directly, but participates as a CTE in other models
  • Any other strategies you can add yourself
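
One simplified way to picture the two basic strategies is that the same SELECT gets wrapped in different DDL. The Python sketch below is only an illustration under that assumption: the function `wrap_select` and the exact statements are hypothetical, and the SQL that DBT actually generates differs per warehouse and is more involved.

```python
def wrap_select(relation, select_sql, materialized):
    """Wrap a model's SELECT in DDL according to a materialization name."""
    if materialized == "table":
        # Physical table: the query result is stored.
        return f"create table {relation} as (\n{select_sql}\n)"
    if materialized == "view":
        # View: only the query definition is stored.
        return f"create view {relation} as (\n{select_sql}\n)"
    raise ValueError(f"unsupported materialization: {materialized}")


table_ddl = wrap_select("analytics.f_orders", "select 1 as order_id", "table")
view_ddl = wrap_select("analytics.f_orders", "select 1 as order_id", "view")
print(table_ddl)
print(view_ddl)
```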

In addition to materialization strategies, there are optimization options for specific warehouses, for example:

  • Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
  • Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
  • BigQuery: Table partitioning & clustering, Merge behavior, KMS encryption, Labels & Tags
  • Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy

The following warehouses are currently supported:

  • Postgres
  • Redshift
  • BigQuery
  • Snowflake
  • Presto (partial)
  • Spark (partial)
  • Microsoft SQL Server (community adapter)

Let's improve our model:

  • Make its loading incremental (Incremental)
  • Add distribution and sort keys for Redshift

-- Model configuration:
-- Incremental loading, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
       materialized='incremental',
       unique_key='order_id',
       dist="customer_id",
       sort="order_date"
   )
}}
 
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
   where 1=1
   {% if is_incremental() -%}
       -- This filter is applied only on incremental runs
       and order_date >= (select max(order_date) from {{ this }})
   {%- endif %} 
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final
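
The effect of materialized='incremental' combined with a unique_key can be pictured as an upsert: rows whose key already exists in the target are replaced, and all other rows are appended. The Python sketch below models this with plain dictionaries. The function name and sample rows are illustrative assumptions, and in practice the merge is carried out by SQL inside the warehouse.

```python
def incremental_upsert(existing_rows, new_rows, unique_key):
    """Merge new_rows into existing_rows: update on key match, otherwise append."""
    merged = {row[unique_key]: row for row in existing_rows}
    for row in new_rows:
        merged[row[unique_key]] = row  # replaces an old version or adds a new row
    return list(merged.values())


target = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
]
batch = [
    {"order_id": 2, "status": "shipped"},  # changed row
    {"order_id": 3, "status": "placed"},   # new row
]
result = incremental_upsert(target, batch, "order_id")
print(result)
```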

Model dependency graph

It is also a dependency tree, also known as a DAG (Directed Acyclic Graph).

DBT builds the graph from the configuration of all project models, or more precisely, from the ref() links inside models to other models. Having a graph lets you do the following:

  • Run models in the correct order
  • Parallelize the building of data marts
  • Run an arbitrary subgraph

An example of graph visualization:

[Image: model dependency graph]
Each node of the graph is a model; the edges of the graph are defined by ref expressions.
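
How a dependency graph yields both a valid run order and opportunities for parallelism can be sketched with Python's standard graphlib module (Python 3.9+). The model name `stg_payments` and the exact dependency sets are assumptions for this example, and DBT's own scheduler is not implemented this way.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it references via ref().
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),  # hypothetical upstream model
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Models inside one batch do not depend on each other,
# so they could be built in parallel.
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
```

Running an arbitrary subgraph then amounts to passing only the selected models and their upstream dependencies to the sorter.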

Data quality and documentation

In addition to building the models themselves, DBT lets you test a number of assumptions about the resulting dataset, such as:

  • Not Null
  • Unique
  • Referential Integrity (for example, customer_id in the orders table corresponds to id in the customers table)
  • Matches a list of accepted values

You can also add your own tests (custom data tests), for example the % deviation of revenue from the figures of a day, a week or a month ago. Any assumption formulated as a SQL query can become a test.

This way, you can catch unwanted deviations and errors in the data marts of the warehouse.
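
The idea that any assumption formulated as a SQL query can become a test is easy to try locally. The sketch below uses Python's built-in sqlite3 module with a made-up orders table: each check is a query that selects the offending rows, and the check passes when nothing comes back. The table contents and check names are illustrative, and the queries are simplified stand-ins rather than the SQL that DBT generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table orders (order_id integer, customer_id integer, status text);
    insert into orders values (1, 10, 'placed'), (2, 11, 'shipped'),
                              (2, 12, 'lost'), (3, null, 'returned');
    """
)

# Each check is a query that returns the rows violating an assumption.
checks = {
    "unique_order_id": (
        "select order_id from orders group by order_id having count(*) > 1"
    ),
    "not_null_customer_id": (
        "select order_id from orders where customer_id is null"
    ),
    "accepted_status": (
        "select order_id from orders where status not in "
        "('placed', 'shipped', 'completed', 'return_pending', 'returned')"
    ),
}

failures = {name: conn.execute(sql).fetchall() for name, sql in checks.items()}
for name, rows in failures.items():
    print(name, "PASS" if not rows else f"FAIL {rows}")
```

A custom check, such as a revenue deviation threshold, would simply be one more entry with its own query.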

As for documentation, DBT provides mechanisms for adding, versioning and distributing metadata and comments at the model level and even at the attribute level.

Here is what adding tests and documentation looks like at the configuration file level:

 - name: fct_orders
   description: This table has basic information about orders, as well as some derived facts based on payments
   columns:
     - name: order_id
       tests:
         - unique # checks that values are unique
         - not_null # checks that there are no nulls
       description: This is a unique identifier for an order
     - name: customer_id
       description: Foreign key to the customers table
       tests:
         - not_null
         - relationships: # referential integrity check
             to: ref('dim_customers')
             field: customer_id
     - name: order_date
       description: Date (UTC) that the order was placed
     - name: status
       description: '{{ doc("orders_status") }}'
       tests:
         - accepted_values: # checks against the list of allowed values
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

And here is how this documentation looks on the generated website:

[Image: generated documentation website]

Macros and modules

The purpose of DBT is not so much to be a set of SQL scripts as to give users powerful, feature-rich tools for building their own transformations and distributing them as modules.

Macros are sets of constructs and expressions that can be called as functions within models. Macros let you reuse SQL between models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.

Macro example:

{% macro rename_category(column_name) %}
case
 when {{ column_name }} ilike  '%osx%' then 'osx'
 when {{ column_name }} ilike  '%android%' then 'android'
 when {{ column_name }} ilike  '%ios%' then 'ios'
 else 'other'
end as renamed_product
{% endmacro %}

And its use:

{% set column_name = 'product' %}
select
 product,
 {{ rename_category(column_name) }} -- macro call
from my_table
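
For readers less familiar with Jinja, a macro behaves much like a function that returns a piece of SQL text. The Python analogue below renders the same CASE expression as rename_category for a given column name. It is a sketch for intuition only and says nothing about how DBT evaluates macros internally.

```python
def rename_category(column_name):
    """Return the CASE expression that the macro would render for column_name."""
    return (
        "case\n"
        f" when {column_name} ilike '%osx%' then 'osx'\n"
        f" when {column_name} ilike '%android%' then 'android'\n"
        f" when {column_name} ilike '%ios%' then 'ios'\n"
        " else 'other'\n"
        "end as renamed_product"
    )


# The call site only supplies the column name; the CASE logic lives in one place.
query = f"select\n product,\n {rename_category('product')}\nfrom my_table"
print(query)
```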

DBT comes with a package manager that lets users publish and reuse individual modules and macros.

This means being able to load and use libraries such as:

  • dbt_utils: working with Date/Time, Surrogate Keys, Schema tests, Pivot/Unpivot and others
  • Ready-made data mart templates for services such as Snowplow and Stripe
  • Libraries for specific data warehouses, e.g. Redshift
  • Logging - a module for logging DBT runs

A full list of packages can be found on dbt hub.

Even more features

Here I will describe a few other interesting features and implementations that the team and I use to build the Data Warehouse at Wheely.

Separation of runtime environments DEV - TEST - PROD

Even within a single DWH cluster (in different schemas). For example, using the following expression:

with source as (
 
   select * from {{ source('salesforce', 'users') }}
   where 1=1
   {% if target.name in ['dev', 'test', 'ci'] -%}
       and timestamp >= dateadd(day, -3, current_date)
   {%- endif %}
 
)

This code literally says: for the dev, test and ci environments, take data only for the last 3 days and no more. Runs in these environments will therefore be much faster and require fewer resources. When run in the prod environment, the filter condition is ignored.
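
The same branching can be written as a small Python function, which makes the behaviour per environment easy to check. The function name `source_filter` and the `salesforce.users` relation name used in the demo are assumptions for this sketch; in DBT the target name comes from the active profile.

```python
def source_filter(target_name, days=3):
    """Return the extra WHERE condition for a given target environment name."""
    if target_name in ("dev", "test", "ci"):
        return f"and timestamp >= dateadd(day, -{days}, current_date)"
    return ""  # prod and any other target read the full history


for name in ("dev", "prod"):
    sql = f"select * from salesforce.users where 1=1 {source_filter(name)}".strip()
    print(name, "->", sql)
```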

Materialization with alternate column encoding

Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk space usage by 20-50%.

The redshift.compress_table macro executes the ANALYZE COMPRESSION command, creates a new table with the recommended column encoding algorithms, the specified distribution key (dist_key) and sort key (sort_key), transfers the data to it and, if necessary, deletes the old copy.

Macro signature:

{{ compress_table(schema, table,
                 drop_backup=False,
                 comprows=none|Integer,
                 sort_style=none|compound|interleaved,
                 sort_keys=none|List<String>,
                 dist_style=none|all|even,
                 dist_key=none|String) }}

Logging model runs

You can attach hooks to each model execution, which run before the model starts or immediately after it has been built:

   pre-hook: "{{ logging.log_model_start_event() }}"
   post-hook: "{{ logging.log_model_end_event() }}"

The logging module lets you record all the necessary metadata in a separate table, which can later be used for auditing and for analyzing bottlenecks.
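
The pre-hook / post-hook pattern resembles a context manager that records an event before and after a unit of work. The Python sketch below illustrates that analogy only: `run_log` stands in for the logging table, the event fields are invented for this example, and the real logging package writes to the warehouse instead.

```python
import time
from contextlib import contextmanager

run_log = []  # stands in for the separate logging table


@contextmanager
def logged_model_run(model_name):
    """Record a start event before the body runs and an end event after it."""
    started = time.time()
    run_log.append({"model": model_name, "event": "start", "at": started})
    try:
        yield
    finally:
        finished = time.time()
        run_log.append({
            "model": model_name,
            "event": "end",
            "at": finished,
            "duration_s": round(finished - started, 3),
        })


with logged_model_run("f_orders"):
    time.sleep(0.01)  # placeholder for executing the model's SQL

print(run_log)
```

With start and end events stored per model, slow models show up as the rows with the largest durations.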

Here is what a dashboard based on the logging data looks like in Looker:

[Image: Looker dashboard built on the logging data]

Automating warehouse maintenance

If you use extensions to the functionality of your warehouse, such as UDFs (User Defined Functions), then versioning these functions, managing access to them and automating the rollout of new releases is very convenient to do in DBT.

We use UDFs in Python to compute hashes, extract email domains and decode bitmasks.

An example of a macro that creates a UDF in any execution environment (dev, test, prod):

{% macro create_udf() -%}
 
 {% set sql %}
       CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
           RETURNS varchar
           LANGUAGE plpythonu
           STABLE
       AS $$  
           import hashlib
           return hashlib.sha256(mes).hexdigest()
       $$
       ;
 {% endset %}
  
 {% set table = run_query(sql) %}
 
{%- endmacro %}
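
The body of such a UDF can be tried out locally before it is deployed. The sketch below is a Python 3 version of the hash function plus a hypothetical `email_domain` helper of the kind mentioned above. Note the explicit encode step: Python 3 hashes bytes, whereas the Python 2 runtime behind plpythonu accepts a str directly. The helper and its behaviour on malformed input are assumptions, not Wheely's actual code.

```python
import hashlib


def f_sha256(mes):
    """Return the hex SHA-256 digest of a text value."""
    # Python 3 requires bytes, hence the explicit encode step.
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


def email_domain(email):
    """Return the lower-cased part after the last '@', or None if there is none."""
    _, separator, domain = email.rpartition("@")
    return domain.lower() if separator else None


print(f_sha256("abc"))
print(email_domain("User@Example.com"))
```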

At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift, it is important to regularly collect statistics on tables and to free up disk space: the ANALYZE and VACUUM commands, respectively.

To do this, the commands from the redshift_maintenance macro are executed every night:

{% macro redshift_maintenance() %}
 
   {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
 
   {% for row in vacuumable_tables %}
       {% set message_prefix=loop.index ~ " of " ~ loop.length %}
 
       {%- set relation_to_vacuum = adapter.get_relation(
                                               database=row['table_database'],
                                               schema=row['table_schema'],
                                               identifier=row['table_name']
                                   ) -%}
       {% do run_query("commit") %}
 
       {% if relation_to_vacuum %}
           {% set start=modules.datetime.datetime.now() %}
           {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
           {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
           {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
           {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
           {% set end=modules.datetime.datetime.now() %}
           {% set total_seconds = (end - start).total_seconds() | round(2)  %}
           {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
       {% else %}
           {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
       {% endif %}
 
   {% endfor %}
 
{% endmacro %}

DBT Cloud

DBT can also be used as a service (Managed Service). It includes:

  • A web IDE for developing projects and models
  • Job configuration and scheduling
  • Simple and convenient access to logs
  • A website with your project's documentation
  • CI (Continuous Integration) hookup

Conclusion

Preparing and consuming a DWH becomes as pleasant and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor and a package manager. Put these elements together and you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.

The beliefs followed by the developers of DBT are formulated as follows:

  • Code, not GUI, is the best abstraction for expressing complex analytical logic
  • Working with data should adopt best practices from software engineering (Software Engineering)
  • Critical data infrastructure should be controlled by the user community as open source software
  • Not only analytics tools, but also code will increasingly become the property of the open source community

These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis of many exciting extensions that will be created in the future.

For those interested, there is a video of an open lesson I gave at OTUS a few months ago: Data Build Tool for Amazon Redshift storage.

In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:

  • Architectural concepts for Big Data applications
  • Practice with Spark and Spark Streaming
  • Exploring methods and tools for loading data from sources
  • Building analytical data marts in the DWH
  • NoSQL concepts: HBase, Cassandra, ElasticSearch
  • Principles of monitoring and orchestration
  • Final project: bringing all the skills together with mentoring support

References:

  1. DBT documentation - Introduction - official documentation
  2. What, exactly, is dbt? - an overview article by one of the authors of DBT
  3. Data Build Tool for Amazon Redshift storage - YouTube, recording of an OTUS open lesson
  4. Getting to know Greenplum - the next open lesson is on May 15, 2020
  5. Data Engineering course - OTUS
  6. Building a Mature Analytics Workflow - a look at the future of working with data and analytics
  7. It's time for open source analytics - the evolution of analytics and the impact of Open Source
  8. Continuous Integration and Automated Build Testing with dbtCloud - principles of building CI using DBT
  9. Getting started with DBT tutorial - hands-on, step-by-step instructions for independent work
  10. Jaffle shop - Github DBT Tutorial - Github, code of the educational project

Learn more about the course.

Source: www.habr.com
