
On what principles is a Data Warehouse built?
Focus on business value and analytics, with no boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modular, extensible, open source, and community-driven. User-friendly documentation and dependency visualization (Data Lineage).
Read on for more about all of this and about the role of DBT in the Big Data & Analytics ecosystem.
Hello everyone
My name is Artemy Kozyr. For more than 5 years I have been working with data warehouses, building ETL/ELT, and doing data analytics and visualization. I currently work at Wheely and teach the Data Engineer course at OTUS, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.
Overview
The DBT framework is all about the T in the ELT acronym (Extract - Load - Transform).
With the arrival of productive, scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is little reason to perform transformations outside the Data Warehouse.
DBT does not extract data from sources, but it offers rich capabilities for working with data that has already been loaded into the warehouse (internal or external storage).

The main purpose of DBT is to take the code, compile it to SQL, and execute the commands in the correct order in the warehouse.
DBT project structure
A project consists of directories and files of just 2 types:
- Model (.sql): a unit of transformation expressed as a SELECT query
- Configuration file (.yml): parameters, settings, tests, documentation
At a basic level, the work is organized as follows:
- The user prepares the model code in any convenient IDE
- The models are launched from the CLI, and DBT compiles the model code into SQL
- The compiled SQL code is executed in the warehouse in a given order (graph)
[Figure: running DBT from the CLI]

Everything is a SELECT
This is a killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code involved in materializing your queries in the warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
Every model comes down to writing a single SELECT query that defines the resulting data set.
The transformation logic can still combine data from several other models. Here is an example of a model that builds an orders mart (f_orders):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
    select * from {{ ref('stg_orders') }}
),
order_payments as (
    select * from {{ ref('order_payments') }}
),
final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)
select * from final
What interesting things can we see here?
First: CTEs (Common Table Expressions) are used to organize and clarify code that contains many transformations and a lot of business logic.
Second: the model code is a mix of SQL and the Jinja language (a templating language).
The example uses a for loop to generate an amount column for each payment method listed in the set expression. It also uses the ref function, which lets you reference other models from within the code:
- At compile time, ref is replaced with a pointer to the target table or view in the warehouse
- ref lets you build the model dependency graph
It is Jinja that gives DBT almost unlimited possibilities. The most commonly used features are:
- If / else statements: branching
- For loops: iteration
- Variables
- Macros: creating reusable macros
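To make the loop expansion concrete, here is a minimal Python sketch of the text the for loop in the model above produces at compile time. It only imitates the result; it is not how DBT renders templates, and the helper name render_payment_columns is hypothetical.

```python
# Illustration only: imitates the output of the Jinja for-loop in the model.
# This is not DBT's rendering code; the function name is hypothetical.

def render_payment_columns(payment_methods):
    """Return one 'order_payments.<method>_amount,' line per payment method."""
    return "\n".join(
        f"order_payments.{method}_amount," for method in payment_methods
    )


payment_methods = ["credit_card", "coupon", "bank_transfer", "gift_card"]
print(render_payment_columns(payment_methods))
```

Adding a new payment method then means touching only the list, not the SELECT itself.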
Materialization: Table, View, Incremental
A materialization strategy is the way the resulting data set of a model is stored in the warehouse.
The basic strategies are:
- Table: a physical table in the warehouse
- View: a view, that is, a virtual table in the warehouse
There are also more complex materialization strategies:
- Incremental: incremental loading (of large fact tables); new rows are added, changed rows are updated, deleted rows are removed
- Ephemeral: the model is not materialized directly, but is included as a CTE in other models
- Any other strategies that you add yourself
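As a rough illustration of what "materialization" hides from the model author, the sketch below wraps a model's SELECT in the statement implied by the chosen strategy. It is a deliberately simplified assumption: the function wrap_select is hypothetical, and real adapters do considerably more (temporary relations, swaps, transactions, grants).

```python
# Simplified sketch; wrap_select is a hypothetical helper, not a DBT API.

def wrap_select(relation, select_sql, materialized="view"):
    """Wrap a model's SELECT in the DDL implied by its materialization."""
    if materialized == "table":
        return f"create table {relation} as (\n{select_sql}\n)"
    if materialized == "view":
        return f"create view {relation} as (\n{select_sql}\n)"
    if materialized == "ephemeral":
        # Nothing is created; dependent models inline the SELECT as a CTE.
        return None
    raise ValueError(f"unsupported materialization: {materialized}")


print(wrap_select("analytics.f_orders", "select * from final", "table"))
```

The model author writes only the SELECT; switching between a table and a view is a one-word configuration change.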
In addition to materialization strategies, there are optimization options for specific warehouses, for example:
- Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
- Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
- BigQuery: Table partitioning and clustering, Merge behavior, KMS Encryption, Labels and Tags
- Spark: File format (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy
The following warehouses are currently supported:
- Postgres
- Redshift
- BigQuery
- Snowflake
- Presto (partially)
- Spark (partially)
- Microsoft SQL Server (community adapter)
Let's improve our model:
- Make its loading incremental (Incremental)
- Add distribution and sort keys for Redshift
-- Model configuration:
-- Incremental loading, unique key used to update existing records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    dist="customer_id",
    sort="order_date"
  )
}}
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
    select * from {{ ref('stg_orders') }}
    where 1=1
    {% if is_incremental() -%}
    -- This filter is applied only on an incremental run
    and order_date >= (select max(order_date) from {{ this }})
    {%- endif %}
),
order_payments as (
    select * from {{ ref('order_payments') }}
),
final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)
select * from final
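The effect of unique_key on an incremental run can be imitated with a small delete-and-insert sketch on an in-memory SQLite database. This is only one possible way to apply a batch, assumed here for illustration; the actual statements depend on the warehouse adapter, and the names incremental_load and new_batch are hypothetical. The sketch covers new and changed rows only, not deletions.

```python
import sqlite3


def incremental_load(conn, target, source, unique_key):
    """Delete-and-insert: replace rows whose key reappears, append the rest."""
    # Identifiers are interpolated for brevity; never do this with untrusted input.
    conn.execute(
        f"delete from {target} "
        f"where {unique_key} in (select {unique_key} from {source})"
    )
    conn.execute(f"insert into {target} select * from {source}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("create table f_orders (order_id integer, status text)")
conn.execute("create table new_batch (order_id integer, status text)")
conn.executemany("insert into f_orders values (?, ?)", [(1, "placed"), (2, "placed")])
conn.executemany("insert into new_batch values (?, ?)", [(2, "shipped"), (3, "placed")])

incremental_load(conn, "f_orders", "new_batch", "order_id")
rows = conn.execute(
    "select order_id, status from f_orders order by order_id"
).fetchall()
print(rows)
```

Order 2 is updated in place and order 3 is appended, while order 1 is left untouched.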
Dependency graph
It is also called a dependency tree, or a DAG (Directed Acyclic Graph).
DBT builds the graph from the configuration of all the models in the project, or more precisely, from the ref() links that models contain to other models. Having the graph lets you do the following:
- Run models in the correct order
- Build marts in parallel
- Run an arbitrary subgraph
[Figure: visualization of the model dependency graph]

Each node of the graph is a model; the edges of the graph are defined by the ref expressions.
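The idea of deriving a run order from ref() links can be sketched with the Python standard library. This is an illustrative assumption, not DBT's parser: the regular expression only handles the simple single-argument ref('name') form, and the model bodies below are made up.

```python
import re
from graphlib import TopologicalSorter  # available in Python 3.9+

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_graph(models):
    """Map each model name to the set of models it references via ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(models):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(build_graph(models)).static_order())


# Hypothetical model bodies, loosely following the f_orders example.
models = {
    "stg_orders": "select * from raw.orders",
    "order_payments": "select * from raw.payments",
    "f_orders": (
        "select * from {{ ref('stg_orders') }} "
        "left join {{ ref('order_payments') }} using (order_id)"
    ),
}
order = run_order(models)
print(order)
```

Models that do not depend on each other (here the two upstream ones) have no fixed relative order, which is exactly what makes parallel execution possible.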
Data quality and documentation
In addition to building the models themselves, DBT lets you test a number of assumptions (assertions) about the resulting data set, such as:
- Not Null
- Unique
- Reference Integrity: referential integrity (for example, customer_id in the orders table corresponds to id in the customers table)
- Matching against a list of accepted values
You can also add your own tests (custom data tests), for example the % deviation of revenue from the figures of a day, a week, or a month ago. Any assumption that can be formulated as an SQL query can become a test.
This way you can catch unwanted deviations and errors in the data of the warehouse marts.
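The "any SQL query can become a test" idea can be sketched as follows, assuming the convention that a test query selects the offending rows and passes when none come back. The table contents and the helper failing_rows are hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3


def failing_rows(conn, test_sql):
    """Run a test query; the rows it returns are the violations."""
    return conn.execute(test_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("create table fct_orders (order_id integer, customer_id integer)")
conn.executemany(
    "insert into fct_orders values (?, ?)",
    [(1, 10), (2, 11), (2, 12), (3, None)],  # a duplicate id and a null on purpose
)

tests = {
    "order_id is unique": (
        "select order_id from fct_orders group by order_id having count(*) > 1"
    ),
    "customer_id is not null": (
        "select order_id from fct_orders where customer_id is null"
    ),
}
results = {name: failing_rows(conn, sql) for name, sql in tests.items()}
for name, rows in results.items():
    print("PASS" if not rows else "FAIL", name, rows)
```

A revenue-deviation check fits the same shape: a query that returns the days on which the deviation exceeds the allowed threshold.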
As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the level of models and even individual attributes.
Here is what adding tests and documentation looks like at the configuration file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of accepted values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
[Figure: the same documentation as it appears on the generated website]

Macros and Modules
The purpose of DBT is not so much to be a set of SQL scripts as to give users a powerful, feature-rich way to build their own transformations and to distribute them as modules.
Macros are sets of constructs and expressions that can be called as functions within models. Macros let you reuse SQL across models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
    when {{ column_name }} ilike '%osx%' then 'osx'
    when {{ column_name }} ilike '%android%' then 'android'
    when {{ column_name }} ilike '%ios%' then 'ios'
    else 'other'
end as renamed_product
{% endmacro %}
And its usage:
{% set column_name = 'product' %}
select
    product,
    {{ rename_category(column_name) }} -- macro call
from my_table
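For readers less familiar with Jinja, the same macro can be read as an ordinary function that returns SQL text. The Python sketch below mirrors it under that assumption; it reproduces the expansion, not DBT's macro machinery.

```python
# Illustration only: a plain-Python mirror of the rename_category macro.

def rename_category(column_name):
    """Return the CASE expression that the macro call expands to."""
    platforms = ["osx", "android", "ios"]
    branches = "\n".join(
        f"    when {column_name} ilike '%{p}%' then '{p}'" for p in platforms
    )
    return f"case\n{branches}\n    else 'other'\nend as renamed_product"


print(f"select\n    product,\n    {rename_category('product')}\nfrom my_table")
```

The calling model stays short, and a change to the category rules is made in exactly one place.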
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
This means you can load and use libraries such as:
- dbt_utils: working with date/time, surrogate keys, schema tests, Pivot/Unpivot, and more
- Ready-made mart templates for third-party services
- Libraries for specific data warehouses
- logging: a module for logging DBT runs
A full list of packages is published on the dbt package hub.
Even more features
Here I will describe a few other interesting features and implementations that the team and I use to build the Data Warehouse at Wheely.
Separation of runtime environments DEV - TEST - PROD
This works even within a single DWH cluster (in different schemas). For example, using the following expression:
with source as (
    select * from {{ source('salesforce', 'users') }}
    where 1=1
    {% if target.name in ['dev', 'test', 'ci'] -%}
    and timestamp >= dateadd(day, -3, current_date)
    {%- endif %}
)
This code literally says: for the dev, test, and ci environments, take data for the last 3 days only. As a result, runs in these environments are much faster and require fewer resources. When running in the prod environment, the filter condition is skipped.
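The branching on the target name boils down to the following logic, shown here as a minimal Python sketch. The function source_filter is hypothetical; it returns the fragment that would follow where 1=1 in the compiled query.

```python
# Illustration only: mirrors the target.name branch of the template.

def source_filter(target_name, days=3):
    """Return the extra condition appended after 'where 1=1' for a target."""
    if target_name in ("dev", "test", "ci"):
        return f"and timestamp >= dateadd(day, -{days}, current_date)"
    return ""  # prod: no additional filter, the full history is read


for name in ("dev", "prod"):
    print(name, "->", repr(source_filter(name)))
```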
Materialization with alternate column encoding
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk usage by 20-50%.
The compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution key (dist_key) and sort keys (sort_key), moves the data into it, and, if needed, drops the old copy.
Macro signature:
{{ compress_table(schema, table,
drop_backup=False,
comprows=none|Integer,
sort_style=none|compound|interleaved,
sort_keys=none|List<String>,
dist_style=none|all|even,
dist_key=none|String) }}
Logging model runs
You can attach hooks to every model execution; they run before the model starts or right after it has finished building:
pre-hook: "{{ logging.log_model_start_event() }}"
post-hook: "{{ logging.log_model_end_event() }}"
The logging module lets you record all the necessary metadata in a separate table, which can later be used to audit runs and analyze bottlenecks.
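The pre-hook and post-hook pair behaves much like a wrapper around the model build. Below is a minimal sketch of that idea with an in-memory SQLite database; the audit table dbt_audit_log and the helper logged_run are hypothetical and do not reflect the schema of the actual logging module.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def logged_run(conn, model_name):
    """Record a start event before the body runs and an end event after it."""
    def log(event):
        conn.execute(
            "insert into dbt_audit_log values (?, ?, ?)",
            (model_name, event, datetime.now(timezone.utc).isoformat()),
        )

    log("model_start")  # plays the role of the pre-hook
    yield
    log("model_end")    # plays the role of the post-hook


conn = sqlite3.connect(":memory:")
conn.execute("create table dbt_audit_log (model text, event text, logged_at text)")

with logged_run(conn, "f_orders"):
    conn.execute("create table f_orders as select 1 as order_id")

events = [row[0] for row in conn.execute("select event from dbt_audit_log order by rowid")]
print(events)
```

Subtracting the two timestamps per model gives the build duration, which is the kind of figure a bottleneck dashboard needs.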
[Figure: a Looker dashboard built on the logging data]

Automating warehouse maintenance
If you use extensions of the warehouse's functionality, such as UDFs (User Defined Functions), then versioning these functions, managing access to them, and automating the rollout of new releases is very convenient to do in DBT.
We use Python UDFs to compute hashes, parse email addresses, and decode bitmasks.
Here is an example of a macro that creates a UDF in any runtime environment (dev, test, prod):
{% macro create_udf() -%}
{% set sql %}
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
RETURNS varchar
LANGUAGE plpythonu
STABLE
AS $$
import hashlib
return hashlib.sha256(mes).hexdigest()
$$
;
{% endset %}
{% set table = run_query(sql) %}
{%- endmacro %}
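The body of this UDF is easy to try locally. One caveat, stated as an assumption: the plpythonu runtime in Redshift is, as far as I know, Python 2.7, where a string can be hashed directly, whereas Python 3 requires encoding the text to bytes first. A minimal Python 3 counterpart might look like this:

```python
import hashlib


def f_sha256(mes):
    """Python 3 counterpart of the UDF body: hex SHA-256 of a text value."""
    # Python 3 hashes bytes, so the text is encoded explicitly (UTF-8 assumed).
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


digest = f_sha256("abc")
print(digest)
```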
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift it is important to regularly collect statistics on tables and free up disk space, using the ANALYZE and VACUUM commands respectively.
To do this, the commands from the redshift_maintenance macro are executed every night:
{% macro redshift_maintenance() %}
    {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
    {% for row in vacuumable_tables %}
        {% set message_prefix=loop.index ~ " of " ~ loop.length %}
        {%- set relation_to_vacuum = adapter.get_relation(
            database=row['table_database'],
            schema=row['table_schema'],
            identifier=row['table_name']
        ) -%}
        {% do run_query("commit") %}
        {% if relation_to_vacuum %}
            {% set start=modules.datetime.datetime.now() %}
            {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
            {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
            {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
            {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
            {% set end=modules.datetime.datetime.now() %}
            {% set total_seconds = (end - start).total_seconds() | round(2) %}
            {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
        {% else %}
            {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
        {% endif %}
    {% endfor %}
{% endmacro %}
DBT Cloud
DBT can also be used as a service (Managed Service). It includes:
- A web IDE for developing projects and models
- Job configuration and scheduling
- Simple and convenient access to logs
- A website with your project's documentation
- CI (Continuous Integration) support

Conclusion
Preparing and consuming a DWH becomes as pleasant and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. Put these elements together and you get a complete working environment for your Data Warehouse. It is hard to name a better way to manage transformations inside a DWH today.
The beliefs followed by the developers of DBT are formulated as follows:
- Code, not a GUI, is the best abstraction for expressing complex analytical logic
- Work with data should adopt the best practices of software engineering (Software Engineering)
- Critical data infrastructure should be controlled by its user community as open source software
- Not only analytics tools but also analytics code will increasingly become the property of the Open Source community
These core beliefs have produced a product that is used by 850 companies today, and they form the basis for many interesting extensions still to come.
For those interested, a video of the open lesson I gave at OTUS a few months ago is available on YouTube.
Besides DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:
- Architectural concepts for Big Data applications
- Practice with Spark and Spark Streaming
- Exploring methods and tools for loading data from sources
- Building analytical marts in the DWH
- NoSQL concepts: HBase, Cassandra, ElasticSearch
- Principles of monitoring and orchestration
- Final project: bringing all the skills together with mentoring support
Source: www.habr.com

