Data Build Tool, or what a Data Warehouse and a Smoothie have in common
On what principles is an ideal Data Warehouse built?
Focus on business value and analytics, with no boilerplate code. Manage the DWH as a codebase: versioning, review, automated testing, and CI. Modular, extensible, open source, and community-driven. User-friendly documentation and dependency visualization (Data Lineage).
More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below.
Hello everyone
Artemy Kozyr here. For more than 5 years I have been working with data warehouses, building ETL/ELT, and doing data analytics and visualization. I currently work at Wheely and teach at OTUS on the Data Engineer course, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.
Overview
The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.
With the arrival of analytical databases such as BigQuery, Redshift, and Snowflake, there is little point in doing transformations outside the Data Warehouse.
DBT does not load data from sources, but it offers a lot for working with data that has already been loaded into the Warehouse (in Internal or External Storage).
The main purpose of DBT is to take the code, compile it into SQL, and execute the commands in the correct order in the Warehouse.
DBT Project Structure
A project consists of directories and files of only 2 types:
Model (.sql) - a unit of transformation expressed as a SELECT query
Configuration file (.yml) - settings, tests, and documentation
The basic workflow is as follows:
The user prepares model code in any convenient IDE
A model run is started from the CLI, and DBT compiles the model code into SQL
The compiled SQL code is executed in the Warehouse in the given order (graph)
Here is what a run from the CLI may look like:
Everything is SELECT
This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code involved in materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
Every model consists of a single SELECT query that defines the resulting data set.
The transformation logic can still be multi-level and consolidate data from several other models. An example of a model that builds an orders data mart (f_orders):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
    select * from {{ ref('stg_orders') }}
),
order_payments as (
    select * from {{ ref('order_payments') }}
),
final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)
select * from final
What interesting things can we see here?
First: CTEs (Common Table Expressions) are used to organize code that contains many transformations and a lot of business logic, and to keep it readable.
Second: model code is a mixture of SQL and Jinja (a templating language).
The example uses a for loop to generate an amount column for each payment method listed in the set expression. The ref function is also used; it lets a model refer to other models in its code:
At compile time, ref is replaced with a reference to the target table or view in the Warehouse
ref lets DBT build the model dependency graph
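To make the two roles of ref concrete, here is a minimal Python sketch of the idea. It is not how DBT is implemented: the `compile_refs` function, the regular expression, and the `analytics` schema name are assumptions made up for this illustration. It replaces each `{{ ref('...') }}` call with a schema-qualified name and, at the same time, collects the referenced models as dependencies.

```python
from __future__ import annotations

import re

# Matches {{ ref('model_name') }} written with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def compile_refs(model_sql: str, schema: str) -> tuple[str, list[str]]:
    """Replace every ref() call with `schema.model` and list the referenced models."""
    dependencies: list[str] = []

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in dependencies:
            dependencies.append(name)
        return f"{schema}.{name}"

    return REF_PATTERN.sub(replace, model_sql), dependencies


model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
# "analytics" is a hypothetical target schema.
compiled_sql, dependencies = compile_refs(model_sql, schema="analytics")
print(compiled_sql)
print(dependencies)
```

The list of dependencies collected this way is exactly the information needed to draw an edge from each referenced model to the model that uses it.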
It is Jinja that gives DBT almost limitless possibilities. The most commonly used features are:
If / else statements - branching
For loops
Variables
Macro - defining macros
Materialization: Table, View, Incremental
The materialization strategy is the way the resulting data set of a model is stored in the Warehouse.
The basic options are:
Table - a physical table in the Warehouse
View - a view, a virtual table in the Warehouse
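The difference between the two basic strategies can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the `materialize` helper and the table names are invented for the example, SQLite stands in for the Warehouse, and the real DDL that DBT generates differs per database. The point is that the model stays a plain SELECT and the tool wraps it in the appropriate CREATE statement.

```python
import sqlite3


def materialize(name: str, select_sql: str, materialized: str = "view") -> str:
    """Wrap a model's SELECT in the DDL for the chosen materialization."""
    if materialized not in ("table", "view"):
        raise ValueError(f"unsupported materialization: {materialized}")
    return f"create {materialized} {name} as {select_sql}"


conn = sqlite3.connect(":memory:")
conn.execute("create table stg_orders (order_id integer, status text)")
conn.executemany(
    "insert into stg_orders values (?, ?)", [(1, "placed"), (2, "shipped")]
)

select_sql = "select order_id, status from stg_orders"
conn.execute(materialize("orders_table", select_sql, "table"))
conn.execute(materialize("orders_view", select_sql, "view"))

# A table is a stored snapshot; a view re-runs the SELECT on every read.
conn.execute("insert into stg_orders values (3, 'completed')")
table_rows = conn.execute("select count(*) from orders_table").fetchone()[0]
view_rows = conn.execute("select count(*) from orders_view").fetchone()[0]
print(table_rows, view_rows)
```

After the extra row is inserted into the source, the table still holds the rows from build time, while the view already sees the new one.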
There are also more complex materialization strategies, for example incremental:
-- Model configuration:
-- Incremental load, unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    dist="customer_id",
    sort="order_date"
  )
}}
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
    select * from {{ ref('stg_orders') }}
    where 1=1
    {% if is_incremental() -%}
    -- This filter is applied only on an incremental run
    and order_date >= (select max(order_date) from {{ this }})
    {%- endif %}
),
order_payments as (
    select * from {{ ref('order_payments') }}
),
final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)
select * from final
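The behaviour of an incremental model with a unique key can be approximated in a few lines of Python and sqlite3. This is a rough sketch, not the SQL that DBT actually issues: the `run_incremental` function is hypothetical, SQLite's `insert or replace` stands in for the merge by `unique_key`, and an empty target table stands in for the first, non-incremental run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_orders (order_id integer, order_date text, status text)")
conn.execute(
    "create table f_orders (order_id integer primary key, order_date text, status text)"
)


def run_incremental(conn: sqlite3.Connection) -> int:
    """Load rows at or after the target's latest order_date, replacing by order_id."""
    max_date = conn.execute("select max(order_date) from f_orders").fetchone()[0]
    if max_date is None:
        # First run: the target is empty, so no incremental filter is applied.
        rows = conn.execute(
            "select order_id, order_date, status from stg_orders"
        ).fetchall()
    else:
        rows = conn.execute(
            "select order_id, order_date, status from stg_orders where order_date >= ?",
            (max_date,),
        ).fetchall()
    # An existing row with the same order_id is replaced rather than duplicated.
    conn.executemany("insert or replace into f_orders values (?, ?, ?)", rows)
    return len(rows)


conn.executemany(
    "insert into stg_orders values (?, ?, ?)",
    [(1, "2020-01-01", "placed"), (2, "2020-01-02", "placed")],
)
first_run = run_incremental(conn)

# Order 2 changes status and order 3 arrives before the next run.
conn.execute("update stg_orders set status = 'shipped' where order_id = 2")
conn.execute("insert into stg_orders values (3, '2020-01-03', 'placed')")
second_run = run_incremental(conn)

final_rows = conn.execute(
    "select order_id, status from f_orders order by order_id"
).fetchall()
print(first_run, second_run, final_rows)
```

The second run touches only orders 2 and 3: order 1 is older than the latest date already loaded, so it is neither re-read nor rewritten.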
Model dependency graph
It is also called the dependency tree, and is known as a DAG (Directed Acyclic Graph).
DBT builds the graph from the configuration of all the models in the project, or more precisely, from the ref() links inside models to other models. Having the graph lets you do the following:
Run models in the correct order
Parallelize the building of data marts
Run an arbitrary subgraph
An example of a graph visualization:
Each node of the graph is a model; the edges of the graph are defined by ref expressions.
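The three uses of the graph listed above can be sketched with Python's standard graphlib module (Python 3.9+). This is an illustration of the idea rather than DBT's own scheduler; the helper functions are made up for the example, and `stg_payments` is a hypothetical model added so that the graph has more than one level.

```python
from graphlib import TopologicalSorter

# model -> set of models it references through ref()
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),  # hypothetical staging model
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}


def run_order(graph):
    """One valid execution order: every model comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def parallel_batches(graph):
    """Group models into batches whose members can be built at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def upstream(graph, model):
    """Sub-graph containing `model` and everything it depends on."""
    selected = {}
    stack = [model]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected[node] = graph[node]
            stack.extend(graph[node])
    return selected


order = run_order(dependencies)
batches = parallel_batches(dependencies)
sub_order = run_order(upstream(dependencies, "order_payments"))
print(order)
print(batches)
print(sub_order)
```

Models in the same batch do not depend on each other, which is what makes parallel execution safe.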
Data Quality and Documentation
In addition to building the models themselves, DBT lets you test a number of assumptions about the resulting data set, such as:
Not Null
Unique
Referential integrity (for example, customer_id in the orders table matches id in the customers table)
Matching a list of accepted values
You can also add your own tests (custom data tests), for example the % deviation of revenue compared with the figures from a day, a week, or a month ago. Any assumption that can be expressed as a SQL query can become a test.
This way you can catch unwanted deviations and errors in the data of the Warehouse marts.
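The idea that any assumption expressed as a SQL query can become a test can be sketched as follows: each test is a query that selects the rows violating the assumption, and the test passes when the query returns nothing. The sketch below uses Python's sqlite3 with made-up sample data; the test names and queries are written by hand for illustration and are not the SQL that DBT generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table dim_customers (customer_id integer);
    create table fct_orders (order_id integer, customer_id integer, status text);
    insert into dim_customers values (1), (2);
    -- The last row has a duplicate order_id, an unknown customer, and a bad status.
    insert into fct_orders values
        (10, 1, 'placed'),
        (11, 2, 'shipped'),
        (11, 3, 'lost');
    """
)

# Each test selects the rows that violate an assumption.
tests = {
    "not_null_order_id": "select * from fct_orders where order_id is null",
    "unique_order_id": """
        select order_id from fct_orders
        group by order_id having count(*) > 1
    """,
    "relationships_customer_id": """
        select o.customer_id
        from fct_orders o
        left join dim_customers c on c.customer_id = o.customer_id
        where o.customer_id is not null and c.customer_id is null
    """,
    "accepted_values_status": """
        select status from fct_orders
        where status not in
            ('placed', 'shipped', 'completed', 'return_pending', 'returned')
    """,
}

results = {name: len(conn.execute(sql).fetchall()) for name, sql in tests.items()}
failed = sorted(name for name, bad_rows in results.items() if bad_rows > 0)
print(results)
print(failed)
```

A custom test follows the same shape: write a SELECT that returns the offending rows, and treat an empty result as a pass.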
As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the attribute level.
Here is what adding tests and documentation looks like at the configuration file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of accepted values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
And here is what this documentation looks like on the generated website:
Macros and Modules
The purpose of DBT is not so much to be a set of SQL scripts as to give users powerful, feature-rich means of building their own transformations and distributing them as modules.
Macros are sets of constructs and expressions that can be called as functions inside models. Macros let you reuse SQL between models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
    when {{ column_name }} ilike '%osx%' then 'osx'
    when {{ column_name }} ilike '%android%' then 'android'
    when {{ column_name }} ilike '%ios%' then 'ios'
    else 'other'
end as renamed_product
{% endmacro %}
And its use:
{% set column_name = 'product' %}
select
    product,
    {{ rename_category(column_name) }} -- macro call
from my_table
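To see what the macro call expands to, here is a small Python sketch that plays the role of the macro: a function that returns the CASE expression as text, which is then spliced into a query. It is an analogy only. The sample rows are invented, and because SQLite (used here so the example can run) has no ILIKE, the sketch uses `lower(...) like ...` instead.

```python
import sqlite3


def rename_category(column_name: str) -> str:
    """Return the CASE expression the macro would emit for `column_name`."""
    return (
        "case "
        f"when lower({column_name}) like '%osx%' then 'osx' "
        f"when lower({column_name}) like '%android%' then 'android' "
        f"when lower({column_name}) like '%ios%' then 'ios' "
        "else 'other' "
        "end as renamed_product"
    )


conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (product text)")
conn.executemany(
    "insert into my_table values (?)",
    [("Mac OSX app",), ("Android SDK",), ("web",)],
)

# The same function can be reused in any query that needs this mapping.
query = f"select product, {rename_category('product')} from my_table order by product"
rows = conn.execute(query).fetchall()
print(rows)
```

The mapping logic lives in one place, so changing a category rule means editing one definition instead of every query that uses it.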
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
This means you can load and use libraries such as:
dbt_utils: working with Date/Time, Surrogate Keys, Schema tests, Pivot/Unpivot, and more
The full list of packages can be found on dbt hub.
Even more features
Here I will describe a few other interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.
Separating the DEV - TEST - PROD runtime environments
Even within a single DWH cluster (in different schemas). For example, using the following expression:
with source as (
    select * from {{ source('salesforce', 'users') }}
    where 1=1
    {% if target.name in ['dev', 'test', 'ci'] -%}
    and timestamp >= dateadd(day, -3, current_date)
    {%- endif %}
)
This code literally says: for the dev, test, and ci environments, take data for the last 3 days only and no more. Runs in these environments will therefore be much faster and need fewer resources. When running in the prod environment, the filter condition is ignored.
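The same branching can be written as a plain Python function, which makes it easy to check what each environment ends up with. This is only a sketch of the logic: `source_filter` is a hypothetical helper, and the generated text follows the Redshift syntax used in the snippet above.

```python
def source_filter(target_name: str, days: int = 3) -> str:
    """Build the WHERE clause of the source CTE for a given target environment."""
    # Starting from "1=1" lets every optional condition be appended with "and".
    clauses = ["1=1"]
    if target_name in ("dev", "test", "ci"):
        clauses.append(f"timestamp >= dateadd(day, -{days}, current_date)")
    return "where " + " and ".join(clauses)


dev_clause = source_filter("dev")
prod_clause = source_filter("prod")
print(dev_clause)
print(prod_clause)
```

Only the non-production targets get the date filter; prod keeps the always-true condition and reads the full source.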
Materialization with alternate column encoding
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing good algorithms can reduce disk usage by 20-50%.
The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution key (dist_key) and sort key (sort_key), transfers the data into it, and, if necessary, deletes the old copy.
The logging module lets you record all the necessary metadata in a separate table, which can later be used for auditing and for analyzing bottlenecks.
Here is what a dashboard built on the logging data looks like in Looker:
Automating Warehouse Maintenance
If you use extensions of the Warehouse's functionality, such as UDFs (User Defined Functions), then DBT is a very convenient place to version these functions, manage access to them, and automate the rollout of new releases.
We use Python UDFs to compute hashes, extract email domains, and decode bitmasks.
An example of a macro that creates a UDF in any execution environment (dev, test, prod):
{% macro create_udf() -%}
  {% set sql %}
    CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
      RETURNS varchar
      LANGUAGE plpythonu
      STABLE
    AS $$
      import hashlib
      return hashlib.sha256(mes).hexdigest()
    $$
    ;
  {% endset %}
  {% set table = run_query(sql) %}
{%- endmacro %}
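Because the body of such a UDF is ordinary Python, its logic can be checked locally before it is deployed. The sketch below is a Python 3 version written for that purpose, so unlike the plpythonu body above it has to encode the string to bytes before hashing. `f_email_domain` is a hypothetical example of the email-domain kind of function mentioned earlier, not the one used at Wheely.

```python
import hashlib


def f_sha256(mes: str) -> str:
    """Python 3 equivalent of the UDF body: hex SHA-256 digest of a string."""
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


def f_email_domain(email: str) -> str:
    """Hypothetical helper: lower-cased part after the last '@', or '' if absent."""
    _, separator, domain = email.rpartition("@")
    return domain.lower() if separator else ""


print(f_sha256("abc"))
print(f_email_domain("User@Example.com"))
```

Keeping the function bodies this small makes it practical to unit test them in CI and let the macro handle only the deployment.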
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift it is important to regularly collect statistics on tables and to free up disk space, using the ANALYZE and VACUUM commands respectively.
To do this, the commands from the redshift_maintenance macro are executed every night:
{% macro redshift_maintenance() %}
    {# vacuumable_tables_sql (the query listing the tables to process) is not shown in this excerpt #}
    {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
    {% for row in vacuumable_tables %}
        {% set message_prefix=loop.index ~ " of " ~ loop.length %}
        {%- set relation_to_vacuum = adapter.get_relation(
            database=row['table_database'],
            schema=row['table_schema'],
            identifier=row['table_name']
        ) -%}
        {% do run_query("commit") %}
        {% if relation_to_vacuum %}
            {% set start=modules.datetime.datetime.now() %}
            {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
            {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
            {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
            {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
            {% set end=modules.datetime.datetime.now() %}
            {% set total_seconds = (end - start).total_seconds() | round(2) %}
            {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
        {% else %}
            {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
        {% endif %}
    {% endfor %}
{% endmacro %}
DBT Cloud
DBT can also be used as a service (Managed Service). It includes:
A web IDE for developing projects and models
Job configuration and scheduling
Simple and convenient access to logs
A website with your project's documentation
CI (Continuous Integration) hookup
Conclusion
Preparing and consuming a DWH becomes as enjoyable and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. Put these elements together and you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations inside a DWH today.
The beliefs that the developers of DBT follow are formulated as follows:
Code, not a GUI, is the best abstraction for expressing complex analytical logic
Working with data should adopt best practices from software engineering
Critical data infrastructure should be controlled by the user community as open source software
Not only analytics tools but also code will increasingly become the property of the open source community
These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis for many exciting extensions to be built in the future.
For those interested, there is a video of an open lesson I gave at OTUS a few months ago: Data Build Tool for Amazon Redshift Storage.
In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant, up-to-date topics:
Architectural concepts for Big Data applications
Practice with Spark and Spark Streaming
Exploring methods and tools for loading data from sources