Data Build Tool, or what a Data Warehouse and a Smoothie have in common

What principles is an ideal Data Warehouse built on?

Focus on business value and analytics, with no boilerplate code. Manage the DWH as a codebase: versioning, review, automated testing, and CI. Modular, extensible, open source, and community-driven. User-friendly documentation and dependency visualization (Data Lineage).

More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below.

Hi everyone

Artemy Kozyr here. For more than 5 years I have worked with data warehouses, building ETL / ELT, as well as data analytics and visualization. I currently work at Wheely, I teach at OTUS on the Data Engineer course, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.

Brief overview

The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.

With the arrival of productive and scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is little point in doing transformations outside the Data Warehouse.

DBT does not extract data from sources, but it offers great capabilities for working with data that has already been loaded into the Warehouse (Internal or External Storage).

The main purpose of DBT is to take code, compile it to SQL, and execute the commands in the correct order in the Warehouse.

DBT project structure

A project consists of directories and files of only 2 types:

  • Model (.sql) - a unit of transformation expressed as a SELECT query
  • Configuration file (.yml) - parameters, settings, tests, documentation

At a basic level, the work is organized as follows:

  • The user writes model code in any convenient IDE
  • Models are launched from the CLI, and DBT compiles the model code into SQL
  • The compiled SQL code is executed in the Warehouse in a given order (the graph)

Here is what a run from the CLI might look like:

[Screenshot: DBT run output in the CLI]

Everything is SELECT

This is a killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code associated with materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).

Every model consists of a single SELECT query that defines the resulting dataset.

The transformation logic can be multi-level and can consolidate data from several other models. Here is an example of a model that builds an orders mart (f_orders):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

What interesting things can we see here?

First: CTEs (Common Table Expressions) are used to organize and make sense of code that contains many transformations and a lot of business logic.

Second: the model code is a mix of SQL and Jinja (a templating language).

The example uses a for loop to generate an amount column for each payment method listed in the set expression. It also uses the ref function, which references other models from inside the code:

  • At compile time, ref is converted into a target pointer to a table or view in the Warehouse
  • ref lets you build the model dependency graph
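
To make the two points about ref more concrete, here is a minimal Python sketch of the idea, not DBT's actual implementation. It replaces each ref call with a relation name and records the dependency at the same time. The schema name analytics and the helper compile_refs are assumptions made up for this illustration.

```python
import re

# Hypothetical mapping of model names to relations in the Warehouse.
# The schema name "analytics" is an assumption made for this sketch.
RELATIONS = {
    "stg_orders": "analytics.stg_orders",
    "order_payments": "analytics.order_payments",
}

# Matches {{ ref('model_name') }} with optional whitespace.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def compile_refs(model_sql, relations):
    """Return (compiled_sql, dependencies) for one model's source text."""
    dependencies = []

    def substitute(match):
        name = match.group(1)
        dependencies.append(name)   # remember the edge for the graph
        return relations[name]      # point at the concrete relation

    compiled = REF_PATTERN.sub(substitute, model_sql)
    return compiled, dependencies


model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
compiled_sql, deps = compile_refs(model_sql, RELATIONS)
print(compiled_sql)
print(deps)
```

The same pass yields both the executable SQL and the list of parent models, which is why a single ref call is enough to serve both purposes.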

It is Jinja that gives DBT almost unlimited possibilities. The most commonly used are:

  • If / else statements - branching
  • For loops
  • Variables
  • Macro - creating macros

Materialization: Table, View, Incremental

A materialization strategy is the way the resulting dataset of a model is stored in the Warehouse.

The basic ones are:

  • Table - a physical table in the Warehouse
  • View - a view, a virtual table in the Warehouse

There are also more sophisticated materialization strategies:

  • Incremental - incremental loading (of large fact tables); new rows are added, changed rows are updated, deleted rows are cleaned out
  • Ephemeral - the model is not materialized directly, but takes part as a CTE in other models
  • Any other strategies that you can add yourself

In addition to materialization strategies, there are optimization options for specific Warehouses, for example:

  • Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
  • Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
  • BigQuery: Table partitioning & clustering, Merge behavior, KMS Encryption, Labels & Tags
  • Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy

The following Warehouses are currently supported:

  • Postgres
  • Redshift
  • BigQuery
  • Snowflake
  • Presto (partially)
  • Spark (partially)
  • Microsoft SQL Server (community adapter)

Let's improve our model:

  • Make its loading incremental (Incremental)
  • Add distribution and sort keys for Redshift

-- Model configuration:
-- Incremental loading, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
       materialized='incremental',
       unique_key='order_id',
       dist="customer_id",
       sort="order_date"
   )
}}
 
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
   where 1=1
   {% if is_incremental() -%}
       -- This filter is applied only on incremental runs
       and order_date >= (select max(order_date) from {{ this }})
   {%- endif %} 
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final
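
As a rough mental model of what an incremental run with unique_key does, here is a small Python sketch. It is a conceptual illustration only: the SQL that DBT actually generates depends on the adapter, and the function incremental_merge and the sample rows are invented for this example.

```python
def incremental_merge(target_rows, new_rows, unique_key):
    """Upsert new_rows into target_rows, matching on unique_key."""
    merged = {row[unique_key]: row for row in target_rows}
    for row in new_rows:
        merged[row[unique_key]] = row  # insert a new key or overwrite an existing one
    return list(merged.values())


# Rows already present in the target table (the role of {{ this }}).
existing = [
    {"order_id": 1, "order_date": "2020-05-01", "status": "placed"},
    {"order_id": 2, "order_date": "2020-05-02", "status": "placed"},
]

# Current contents of the upstream model.
source = [
    {"order_id": 1, "order_date": "2020-05-01", "status": "shipped"},
    {"order_id": 2, "order_date": "2020-05-02", "status": "shipped"},
    {"order_id": 3, "order_date": "2020-05-03", "status": "placed"},
]

# Same idea as the is_incremental() filter: only rows at or after the
# latest order_date already loaded are selected (ISO dates sort as text).
max_loaded = max(row["order_date"] for row in existing)
batch = [row for row in source if row["order_date"] >= max_loaded]

result = incremental_merge(existing, batch, "order_id")
for row in result:
    print(row)
```

Note that order 1 keeps its old status: its order_date is earlier than the latest loaded date, so the filter never selects it. The choice of the filter column therefore determines which late changes an incremental run can see.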

Model dependency graph

It is also a dependency tree. It is also known as a DAG (Directed Acyclic Graph).

DBT builds the graph from the configuration of all the project's models, or more precisely from the ref() links inside models to other models. Having the graph lets you do the following:

  • Run models in the correct order
  • Build data marts in parallel
  • Run an arbitrary subgraph

An example of the graph visualization:

[Image: model dependency graph]
Each node of the graph is a model; the edges of the graph are defined by ref expressions.
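
Here is a short Python sketch of how such a graph yields both a correct run order and groups of models that could be built in parallel. It uses the standard library's graphlib module (Python 3.9+) and is not DBT's own scheduler. The model stg_payments is an invented name added so that the example has more than one level.

```python
from graphlib import TopologicalSorter

# Hypothetical project: each model maps to the set of models it references
# via ref(). stg_payments is an invented name used only for illustration.
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # also raises CycleError if the graph is not acyclic

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # models whose parents are all built
    batches.append(ready)               # these could run in parallel
    sorter.done(*ready)

print(batches)
```

Every batch depends only on earlier batches, so the models inside one batch are independent of each other.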

Data Quality and Documentation

Besides building the models themselves, DBT lets you test a number of assumptions about the resulting dataset, such as:

  • Not Null
  • Unique
  • Reference Integrity - referential integrity (for example, customer_id in the orders table corresponds to id in the customers table)
  • Matching a list of accepted values

You can also add your own tests (custom data tests), for example the % deviation of revenue from the figures of a day, a week, or a month ago. Any assumption formulated as a SQL query can become a test.

This way you can catch unwanted deviations and data errors in the Warehouse marts.
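
The idea that any assumption written as a SQL query can become a test can be sketched in a few lines of Python with an in-memory SQLite database. Each query selects the rows that violate the assumption, and the test passes when the query returns nothing. The table contents are invented, and the queries are my own approximations rather than the SQL DBT generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_customers (customer_id integer);
    create table fct_orders (order_id integer, customer_id integer, status text);
    insert into dim_customers values (1), (2);
    insert into fct_orders values
        (10, 1, 'placed'),
        (11, 2, 'shipped'),
        (11, 3, 'lost');
""")

# Each test is a query that selects the rows violating the assumption.
TESTS = {
    "unique_order_id": """
        select order_id from fct_orders
        group by order_id having count(*) > 1
    """,
    "not_null_order_id": "select * from fct_orders where order_id is null",
    "relationships_customer_id": """
        select o.customer_id from fct_orders o
        left join dim_customers c on c.customer_id = o.customer_id
        where c.customer_id is null
    """,
    "accepted_values_status": """
        select status from fct_orders
        where status not in ('placed', 'shipped', 'completed',
                             'return_pending', 'returned')
    """,
}

# Number of violating rows per test; zero means the test passes.
results = {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}
for name, failures in results.items():
    print(name, "PASS" if failures == 0 else f"FAIL ({failures} rows)")
```

The sample data deliberately contains a duplicated order_id, an unknown customer, and an unexpected status, so three of the four checks report failures.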

As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the attribute level.

Here is what adding tests and documentation looks like at the configuration file level:

 - name: fct_orders
   description: This table has basic information about orders, as well as some derived facts based on payments
   columns:
     - name: order_id
       tests:
         - unique # uniqueness check
         - not_null # null check
       description: This is a unique identifier for an order
     - name: customer_id
       description: Foreign key to the customers table
       tests:
         - not_null
         - relationships: # referential integrity check
             to: ref('dim_customers')
             field: customer_id
     - name: order_date
       description: Date (UTC) that the order was placed
     - name: status
       description: '{{ doc("orders_status") }}'
       tests:
         - accepted_values: # accepted values check
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

Here is what this documentation looks like on the generated website:

[Screenshot: generated documentation website]

Macros and modules

The purpose of DBT is not so much to be a set of SQL scripts as to give users powerful, feature-rich means for building their own transformations and distributing them as modules.

Macros are sets of constructs and expressions that can be called as functions inside models. Macros let you reuse SQL between models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.

Macro example:

{% macro rename_category(column_name) %}
case
 when {{ column_name }} ilike  '%osx%' then 'osx'
 when {{ column_name }} ilike  '%android%' then 'android'
 when {{ column_name }} ilike  '%ios%' then 'ios'
 else 'other'
end as renamed_product
{% endmacro %}

And its usage:

{% set column_name = 'product' %}
select
 product,
 {{ rename_category(column_name) }} -- macro call
from my_table
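
To show roughly what the macro call above expands to, here is a small Python stand-in that renders the same CASE expression as text. It only mimics the templating step and is not how DBT evaluates macros. The Python function and its exact whitespace are my own approximation.

```python
def rename_category(column_name):
    """Build the CASE expression that the macro above would render."""
    patterns = ["osx", "android", "ios"]
    branches = "\n".join(
        f" when {column_name} ilike '%{p}%' then '{p}'" for p in patterns
    )
    return f"case\n{branches}\n else 'other'\nend as renamed_product"


# The model text after the macro call has been expanded.
query = f"select\n product,\n {rename_category('product')}\nfrom my_table"
print(query)
```

The calling model stays three lines long, while the repeated CASE logic lives in one place and can be reused with any column name.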

DBT comes with a package manager that lets users publish and reuse individual modules and macros.

This means you can install and use libraries such as:

  • dbt_utils: working with Date / Time, Surrogate Keys, Schema tests, Pivot / Unpivot and more
  • Ready-made mart templates for services such as Snowplow and Stripe
  • Libraries for specific Data Warehouses, e.g. Redshift
  • Logging - a module for logging DBT operations

A complete list of packages can be found on dbt hub.

Even more features

Here I will describe a few other interesting features and implementations that the team and I use to build the Data Warehouse at Wheely.

Separation of runtime environments DEV - TEST - PROD

This works even within a single DWH cluster (in different schemas). For example, with the following expression:

with source as (
 
   select * from {{ source('salesforce', 'users') }}
   where 1=1
   {% if target.name in ['dev', 'test', 'ci'] -%}
       and timestamp >= dateadd(day, -3, current_date)
   {%- endif %}
 
)

This code literally says: for the dev, test, and ci environments, take data only for the last 3 days and no more. As a result, runs in these environments are much faster and require fewer resources. When running in the prod environment, the filter condition is ignored.

Materialization with alternate column encoding

Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk space usage by 20-50%.

The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encoding algorithms and the specified distribution keys (dist_key) and sort keys (sort_key), transfers the data into it and, if necessary, drops the old copy.

Macro signature:

{{ compress_table(schema, table,
                 drop_backup=False,
                 comprows=none|Integer,
                 sort_style=none|compound|interleaved,
                 sort_keys=none|List<String>,
                 dist_style=none|all|even,
                 dist_key=none|String) }}

Logging model runs

You can attach hooks to each model execution; they run before the model starts or immediately after it has finished building:

   pre-hook: "{{ logging.log_model_start_event() }}"
   post-hook: "{{ logging.log_model_end_event() }}"

The logging module lets you record all the necessary metadata in a separate table, which can later be used for auditing and for analyzing problem areas.

Here is what a dashboard based on the logging data looks like in Looker:

[Screenshot: Looker dashboard built on the logging data]

Automating Warehouse maintenance

If you use extensions of the Warehouse's functionality, such as UDFs (User Defined Functions), then versioning these functions, managing access to them, and automating the rollout of new releases is very convenient to do in DBT.

We use Python UDFs to compute hashes, extract email domains, and decode bitmasks.

An example of a macro that creates a UDF in any runtime environment (dev, test, prod):

{% macro create_udf() -%}
 
 {% set sql %}
       CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
           RETURNS varchar
           LANGUAGE plpythonu
           STABLE
       AS $$  
           import hashlib
           return hashlib.sha256(mes).hexdigest()
       $$
       ;
 {% endset %}
  
 {% set table = run_query(sql) %}
 
{%- endmacro %}

At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift it is important to regularly collect table statistics and free up disk space, using the ANALYZE and VACUUM commands respectively.

To do this, the commands from the redshift_maintenance macro are executed every night:

{% macro redshift_maintenance() %}
 
   {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
 
   {% for row in vacuumable_tables %}
       {% set message_prefix=loop.index ~ " of " ~ loop.length %}
 
       {%- set relation_to_vacuum = adapter.get_relation(
                                               database=row['table_database'],
                                               schema=row['table_schema'],
                                               identifier=row['table_name']
                                   ) -%}
       {% do run_query("commit") %}
 
       {% if relation_to_vacuum %}
           {% set start=modules.datetime.datetime.now() %}
           {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
           {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
           {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
           {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
           {% set end=modules.datetime.datetime.now() %}
           {% set total_seconds = (end - start).total_seconds() | round(2)  %}
           {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
       {% else %}
           {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
       {% endif %}
 
   {% endfor %}
 
{% endmacro %}

DBT Cloud

It is possible to use DBT as a service (Managed Service). It includes:

  • Web IDE for developing projects and models
  • Job configuration and scheduling
  • Simple and convenient access to logs
  • A website with your project's documentation
  • Connecting CI (Continuous Integration)


Conclusion

Preparing and consuming a DWH becomes as pleasant and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. By putting these elements together you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations within a DWH today.


The beliefs followed by the developers of DBT are formulated as follows:

  • Code, not a GUI, is the best abstraction for expressing complex analytical logic
  • Working with data should adopt the best practices of software engineering
  • Critical data infrastructure should be controlled by the user community as open source software
  • Not only analytics tools, but also code will increasingly become the property of the Open Source community

These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis of many exciting extensions that will be created in the future.

For those interested, there is a video of an open lesson I gave at OTUS a few months ago: Data Build Tool for Amazon Redshift Storage.

Besides DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:

  • Architectural concepts for Big Data applications
  • Practice with Spark and Spark Streaming
  • Exploring methods and tools for loading data from sources
  • Building analytical data marts in a DWH
  • NoSQL concepts: HBase, Cassandra, ElasticSearch
  • Principles of monitoring and orchestration
  • Final project: putting all the skills together with mentoring support

References:

  1. DBT documentation - Introduction - Official documentation
  2. What, exactly, is dbt? - Overview article by one of the authors of DBT
  3. Data Build Tool for Amazon Redshift Storage - YouTube, recording of an OTUS open lesson
  4. Getting to know Greenplum - The next open lesson is on May 15, 2020
  5. Data Engineering course - OTUS
  6. Building a Mature Analytics Workflow - A look at the future of working with data and analytics
  7. It's time for open source analytics - The evolution of analytics and the influence of Open Source
  8. Continuous Integration and Automated Build Testing with dbtCloud - Principles of building CI using DBT
  9. Getting started with DBT tutorial - Practice, step-by-step instructions for independent work
  10. Jaffle shop - Github DBT Tutorial - Github, code of the tutorial project

Learn more about the course.

Source: www.habr.com
