Data Build Tool, or What a Data Warehouse and a Smoothie Have in Common
What principles is an ideal Data Warehouse built on?
A focus on business value and analytics with no boilerplate code. Managing the DWH as a codebase: versioning, review, automated testing, and CI. Modularity, extensibility, open source, and community. User-friendly documentation and dependency visualization (Data Lineage).
Read on for more about all of this and about the role of DBT in the Big Data & Analytics ecosystem.
Hello everyone
This is Artemy Kozyr. For more than 5 years I have worked with data warehouses, building ETL/ELT pipelines as well as data analytics and visualization. I currently work at Wheely and teach the Data Engineer course at OTUS, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.
Brief overview
The DBT framework is all about the T in the ELT acronym (Extract - Load - Transform).
With the arrival of fast, scalable analytical databases such as BigQuery, Redshift, and Snowflake, there is little point in doing transformations outside the Data Warehouse.
DBT does not load data from sources, but it provides powerful tools for working with data that has already been loaded into the Warehouse (in internal or external storage).
The main purpose of DBT is to take code, compile it into SQL, and execute the commands in the correct order in the Warehouse.
DBT project structure
A project consists of directories and files of only 2 types:
Model (.sql) - a unit of transformation expressed as a SELECT query
Configuration file (.yml) - parameters, settings, tests, documentation
At a basic level, the work is organized as follows:
The user prepares the model code in any convenient IDE
Models are run from the CLI, and DBT compiles the model code into SQL
The compiled SQL is executed in the Warehouse in a defined order (a graph)
[Screenshot: a DBT run launched from the CLI]
Everything is a SELECT
This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code involved in materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
Every model consists of a single SELECT query that defines the resulting data set.
The transformation logic itself can be multi-level and can consolidate data from several other models. Here is an example of a model that builds an orders data mart (f_orders):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
What is worth noticing here?
First: CTEs (Common Table Expressions) are used to organize code that contains many transformations and a lot of business logic, and to make it easier to read.
Second: the model code is a mix of SQL and Jinja (a templating language).
The example uses a for loop to generate an amount column for each payment method listed in the set expression. It also uses the ref function, which lets a model reference other models in its code:
At compile time, ref is resolved to the target table or view in the Warehouse
ref lets DBT build the model dependency graph
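To make the compile step more concrete, here is a minimal Python sketch of what happens to the two Jinja constructs in the model above. It is not DBT's actual implementation: the helper names `expand_payment_columns` and `resolve_refs` and the `analytics` schema are hypothetical, and a regular expression stands in for real Jinja rendering.

```python
import re

# Hypothetical stand-in for the compile step; real DBT renders Jinja properly.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def expand_payment_columns(payment_methods):
    """Mimic the for loop: emit one <method>_amount column per payment method."""
    return [f"order_payments.{method}_amount" for method in payment_methods]


def resolve_refs(model_sql, schema):
    """Replace each ref('name') with schema.name and collect the dependencies."""
    dependencies = []

    def substitute(match):
        name = match.group(1)
        dependencies.append(name)
        return f"{schema}.{name}"

    return REF_PATTERN.sub(substitute, model_sql), dependencies


model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
compiled_sql, dependencies = resolve_refs(model_sql, schema="analytics")
columns = expand_payment_columns(["credit_card", "coupon"])

print(compiled_sql)
print(dependencies)
print(columns)
```

The list of names collected while resolving ref calls is exactly the information needed to build the dependency graph between models.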
Jinja adds almost unlimited possibilities to DBT. The most commonly used are:
If / else statements - branching
For loops
Variables
Macros - defining reusable macros
Materialization: Table, View, Incremental
A materialization strategy is the approach by which a model's resulting data set is stored in the Warehouse.
The basic strategies are:
Table - a physical table in the Warehouse
View - a view, that is, a virtual table in the Warehouse
There are also more complex materialization strategies:
Incremental - incremental loading (for large fact tables); new rows are added, changed rows are updated, deleted rows are removed
Ephemeral - the model is not materialized directly, but is included as a CTE in other models
Any other strategy that you add yourself
Beyond materialization strategies, there are optimization options for specific warehouses, for example:
Snowflake: transient tables, merge behavior, table clustering, copying grants, secure views
Let's make our model incremental and add distribution and sort keys for Redshift:
-- Model configuration:
-- Incremental load, with a unique key for updating records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
config(
materialized='incremental',
unique_key='order_id',
dist="customer_id",
sort="order_date"
)
}}
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
where 1=1
{% if is_incremental() -%}
-- This filter is applied only on incremental runs
and order_date >= (select max(order_date) from {{ this }})
{%- endif %}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
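The effect of an incremental run with a unique key can be sketched in plain Python with an in-memory SQLite database. This is only an illustration under stated assumptions: the sample rows are invented, the tables are reduced to three columns, and the delete-then-insert step is one common way to emulate `unique_key`; the statements DBT actually generates depend on the warehouse adapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table stg_orders (order_id integer, order_date text, status text);
    create table f_orders   (order_id integer, order_date text, status text);
    -- State of the target table after the previous run (invented sample data).
    insert into f_orders values (1, '2020-01-01', 'placed'), (2, '2020-01-02', 'placed');
    insert into stg_orders values
        (1, '2020-01-01', 'placed'),
        (2, '2020-01-02', 'shipped'),
        (3, '2020-01-03', 'placed');
""")


def incremental_run(conn):
    """One incremental run: select only recent rows, then upsert by order_id."""
    # Same filter as in the model: rows at or after the latest loaded order_date.
    new_rows = conn.execute(
        "select order_id, order_date, status from stg_orders "
        "where order_date >= (select max(order_date) from f_orders)"
    ).fetchall()
    # Emulate unique_key='order_id': delete matching rows, then insert fresh ones.
    conn.executemany("delete from f_orders where order_id = ?", [(row[0],) for row in new_rows])
    conn.executemany("insert into f_orders values (?, ?, ?)", new_rows)
    return len(new_rows)


processed = incremental_run(conn)
result = conn.execute("select * from f_orders order by order_id").fetchall()
print(processed, result)
```

Only the two rows dated on or after the last loaded date are processed: order 2 is replaced with its new status and order 3 is appended, while order 1 is left untouched.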
Model dependency graph
Also called the dependency tree, or the DAG (Directed Acyclic Graph).
DBT builds the graph from the configuration of all the models in the project, or more precisely, from the ref() links that models make to other models. Having a graph makes the following possible:
Running models in the correct order
Building data marts in parallel
Running an arbitrary subgraph
[Figure: example visualization of the model graph]
Each node of the graph is a model; the edges of the graph are defined by ref expressions.
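The three capabilities listed above follow directly from having a DAG. Below is a small sketch using Python's standard `graphlib` module (Python 3.9+). The graph mirrors the models in this article, except `stg_payments`, which is a hypothetical upstream model added for illustration; the function names are also my own and are not DBT APIs.

```python
from graphlib import TopologicalSorter

# model -> set of models it references via ref(); stg_payments is hypothetical.
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}


def run_order(graph):
    """Return one valid execution order: every model after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def parallel_batches(graph):
    """Group models into batches whose members can be built at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def upstream_subgraph(graph, model):
    """Return the selected model together with everything it depends on."""
    selected, stack = {}, [model]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected[node] = graph[node]
            stack.extend(graph[node])
    return selected


print(run_order(dependencies))
print(parallel_batches(dependencies))
print(run_order(upstream_subgraph(dependencies, "order_payments")))
```

Each batch returned by `parallel_batches` contains only models whose dependencies were built in earlier batches, which is what makes parallel execution safe.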
Data quality and documentation
Besides building the models themselves, DBT lets you test a number of assumptions about the resulting data set, such as:
Not Null
Unique
Referential Integrity (for example, customer_id in the orders table matches id in the customers table)
Matching a list of accepted values
You can also add your own tests (custom data tests), such as the percentage deviation of revenue from the figures of a day, a week, or a month ago. Any assumption that can be expressed as a SQL query can become a test.
This way you can catch unwanted deviations and errors in the data marts of the Warehouse.
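The idea that any assumption expressed as a SQL query can become a test is easy to demonstrate outside DBT. The sketch below uses Python's built-in `sqlite3` module with invented sample data; each query returns the rows that violate an assumption, so zero rows means the test passes. The queries are my own illustrations of the four checks above, not the SQL that DBT generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (id integer);
    create table orders (order_id integer, customer_id integer, status text);
    insert into customers values (1), (2);
    -- Invented rows: a duplicate order_id, a null customer_id,
    -- an unknown customer and an unexpected status.
    insert into orders values
        (10, 1, 'placed'),
        (10, 2, 'shipped'),
        (11, null, 'placed'),
        (12, 3, 'lost');
""")

# Each test is a query that returns the offending rows; zero rows means it passes.
TESTS = {
    "not_null": "select order_id from orders where customer_id is null",
    "unique": "select order_id from orders group by order_id having count(*) > 1",
    "relationships": """
        select o.order_id from orders o
        left join customers c on c.id = o.customer_id
        where o.customer_id is not null and c.id is null
    """,
    "accepted_values": (
        "select order_id from orders "
        "where status not in ('placed', 'shipped', 'completed')"
    ),
}


def run_tests(conn):
    """Run every test query and report how many failing rows each one found."""
    return {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}


failures = run_tests(conn)
print(failures)
```

A custom test, such as a revenue deviation check, would simply be one more entry in the dictionary.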
As for documentation, DBT provides mechanisms for adding, versioning, and distributing metadata and comments at the model level and even at the attribute level.
Here is how adding tests and documentation looks at the configuration file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of allowed values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
[Screenshot: this documentation as it appears on the generated website]
Macros and modules
The goal of DBT is not so much to be a set of SQL scripts as to give users a powerful, feature-rich way to build their own transformations and to distribute them as modules.
Macros are sets of constructs and expressions that can be called like functions inside models. Macros let you reuse SQL across models and projects, following the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
when {{ column_name }} ilike '%osx%' then 'osx'
when {{ column_name }} ilike '%android%' then 'android'
when {{ column_name }} ilike '%ios%' then 'ios'
else 'other'
end as renamed_product
{% endmacro %}
And its usage:
{% set column_name = 'product' %}
select
product,
{{ rename_category(column_name) }} -- macro call
from my_table
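Conceptually, a macro is just a function that returns a piece of SQL text, which is then spliced into the model at compile time. A rough Python analogue of the macro and its call site above might look like this (a sketch only; Jinja handles the real rendering in DBT):

```python
def rename_category(column_name):
    """Return the CASE expression that the Jinja macro above would render."""
    return (
        "case\n"
        f"when {column_name} ilike '%osx%' then 'osx'\n"
        f"when {column_name} ilike '%android%' then 'android'\n"
        f"when {column_name} ilike '%ios%' then 'ios'\n"
        "else 'other'\n"
        "end as renamed_product"
    )


# Equivalent of the model that calls the macro with column_name = 'product'.
column_name = "product"
query = f"select\nproduct,\n{rename_category(column_name)}\nfrom my_table"
print(query)
```

Because the function takes the column name as a parameter, the same CASE logic can be reused for any column in any model.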
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
This means you can install and use libraries such as:
dbt_utils: working with dates and times, surrogate keys, schema tests, pivot/unpivot, and more
Ready-made data mart templates for services such as Snowplow and Stripe
Libraries for specific warehouses, e.g. Redshift
The full list of packages is available on dbt hub.
Even more features
Here I will describe a few other interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.
Separating the DEV - TEST - PROD runtime environments
This works even within a single DWH cluster (across different schemas). For example, using the following expression:
with source as (
select * from {{ source('salesforce', 'users') }}
where 1=1
{% if target.name in ['dev', 'test', 'ci'] -%}
-- limit non-production runs to the last 3 days
and timestamp >= dateadd(day, -3, current_date)
{%- endif %}
)
This code literally says: for the dev, test, and ci environments, take only the last 3 days of data and nothing more. As a result, runs in these environments are much faster and need fewer resources. When running in the production environment, the filter condition is ignored.
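The branching on the target name can be mirrored in a few lines of Python to show which SQL each environment ends up with. This is a sketch under assumptions: `build_source_query` is a hypothetical helper, and `salesforce.users` is simply what I assume the source() call resolves to. Note that the extra condition has to be appended with `and`, since the query already contains a `where` clause.

```python
NON_PROD_TARGETS = ("dev", "test", "ci")


def build_source_query(target_name):
    """Return the source query, limited to the last 3 days outside production."""
    query = "select * from salesforce.users\nwhere 1=1"
    if target_name in NON_PROD_TARGETS:
        # Appended with "and" because the query already has a where clause.
        query += "\nand timestamp >= dateadd(day, -3, current_date)"
    return query


for target_name in ("dev", "prod"):
    print(f"-- target: {target_name}")
    print(build_source_query(target_name))
```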
Materialization with alternative column encodings
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk usage by 20-50%.
The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution key (dist_key) and sort key (sort_key), transfers the data into it, and, if needed, drops the old copy.
You can attach hooks to every model run; they are executed before the run starts or right after the model has been built.
The logging module lets you record all the necessary metadata in a separate table, which can later be used to audit runs and analyze bottlenecks.
[Screenshot: a Looker dashboard built on the logging data]
Automating Warehouse maintenance
If you use extensions to the functionality of your Warehouse, such as UDFs (User Defined Functions), then DBT makes it very convenient to version these functions, control access to them, and automate the rollout of new releases.
We use Python UDFs to compute hashes, extract email domains, and decode bitmasks.
An example of a macro that creates a UDF in any runtime environment (dev, test, prod):
{% macro create_udf() -%}
{% set sql %}
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
RETURNS varchar
LANGUAGE plpythonu
STABLE
AS $$
import hashlib
return hashlib.sha256(mes).hexdigest()
$$
;
{% endset %}
{% set table = run_query(sql) %}
{%- endmacro %}
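For reference, the logic inside such UDFs is ordinary Python and can be prototyped locally before it is wrapped in CREATE FUNCTION. The sketch below targets Python 3, so unlike the plpythonu body above it encodes the text to bytes before hashing. Only `f_sha256` comes from the macro above; `email_domain`, `decode_bitmask`, and the flag names are hypothetical illustrations of the other two use cases mentioned.

```python
import hashlib


def f_sha256(message):
    """Return the SHA-256 hex digest of a text value."""
    # Python 3 hashes bytes, so the text is encoded first.
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def email_domain(email):
    """Return the lower-cased part of an email address after the last '@'."""
    return email.rsplit("@", 1)[-1].lower()


def decode_bitmask(mask, flags):
    """Return the names of the flags whose bit is set in mask (bit 0 first)."""
    return [flag for bit, flag in enumerate(flags) if mask & (1 << bit)]


print(f_sha256("abc"))
print(email_domain("User@Example.COM"))
print(decode_bitmask(5, ["is_active", "is_blocked", "is_corporate"]))
```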
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift it is important to regularly collect table statistics and free up disk space, using the ANALYZE and VACUUM commands respectively.
To do this, the commands in the redshift_maintenance macro are executed every night:
{% macro redshift_maintenance() %}
{% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
{% for row in vacuumable_tables %}
{% set message_prefix=loop.index ~ " of " ~ loop.length %}
{%- set relation_to_vacuum = adapter.get_relation(
database=row['table_database'],
schema=row['table_schema'],
identifier=row['table_name']
) -%}
{% do run_query("commit") %}
{% if relation_to_vacuum %}
{% set start=modules.datetime.datetime.now() %}
{{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
{% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
{{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
{% do run_query("ANALYZE " ~ relation_to_vacuum) %}
{% set end=modules.datetime.datetime.now() %}
{% set total_seconds = (end - start).total_seconds() | round(2) %}
{{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
{% else %}
{{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
{% endif %}
{% endfor %}
{% endmacro %}
DBT Cloud
DBT can also be used as a managed service. It includes:
A web IDE for developing projects and models
Job configuration and scheduling
Simple, convenient access to logs
A website with your project's documentation
CI (Continuous Integration) hookup
Conclusion
Building and consuming a DWH becomes as enjoyable and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor, and a package manager. Put these elements together and you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations inside a DWH today.
The beliefs that the developers of DBT follow are formulated as follows:
Code, not a GUI, is the best abstraction for expressing complex analytical logic
Working with data should adopt the best practices of software engineering
Critical data infrastructure should be controlled by its user community as open source software
Not only analytics tools but also analytics code will increasingly become the property of the open source community
These core beliefs have produced a product that is used by more than 850 companies today, and they form the basis of many exciting extensions that will be built in the future.
Besides DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:
Architectural concepts for Big Data applications
Hands-on practice with Spark and Spark Streaming
Methods and tools for loading data from sources
Building analytical data marts in a DWH
NoSQL concepts: HBase, Cassandra, ElasticSearch
Principles of monitoring and orchestration
Final project: bringing all the skills together with mentor support