Data Build Tool, or what a Data Warehouse and a smoothie have in common
On what principles is an ideal Data Warehouse built?
Focus on business value and analytics, with no boilerplate code. Manage the DWH as a codebase: versioning, review, automated testing and CI. Modular, extensible, open source and community-driven. User-friendly documentation and dependency visualization (Data Lineage).
More on all of this, and on the role of DBT in the Big Data & Analytics ecosystem, below the cut.
Hello everyone
Artemy Kozyr here. For more than 5 years I have been working with data warehouses, building ETL/ELT, and doing data analytics and visualization. I currently work at Wheely and teach the Data Engineer course at OTUS, and today I want to share an article I wrote ahead of the start of a new enrollment for the course.
Brief overview
The DBT framework is all about the T in the ELT (Extract - Load - Transform) acronym.
With the arrival of performant and scalable analytical databases such as BigQuery, Redshift and Snowflake, there is little point in running transformations outside the Data Warehouse.
DBT does not extract data from sources, but it offers rich capabilities for working with data that has already been loaded into the Warehouse (in internal or external storage).
The main purpose of DBT is to take the code, compile it into SQL, and execute the commands in the correct order in the Warehouse.
DBT Project Structure
A project consists of directories and files of only 2 types:
Model (.sql) - a unit of transformation expressed as a SELECT query
Configuration file (.yml) - parameters, settings, tests, documentation
At a basic level, the work is organized as follows:
The user prepares model code in any convenient IDE
The models are launched from the CLI, and DBT compiles the model code into SQL
The compiled SQL is executed in the Warehouse in the order given by the dependency graph
Here is what a run from the CLI can look like:
Everything is a SELECT
This is the killer feature of the Data Build Tool framework. In other words, DBT abstracts away all the code associated with materializing your queries in the Warehouse (variations of the CREATE, INSERT, UPDATE, DELETE, ALTER, GRANT, ... commands).
Every model is written as a single SELECT query that defines the resulting data set.
The transformation logic can still be multi-level and can combine data from several other models. An example of a model that builds an orders mart (f_orders):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (
    select * from {{ ref('stg_orders') }}
),

order_payments as (
    select * from {{ ref('order_payments') }}
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)

select * from final
What is interesting here?
First: CTEs (Common Table Expressions) are used to organize code that contains many transformations and a lot of business logic, and to keep it readable.
Second: model code is a mix of SQL and Jinja (a templating language).
The example uses a for loop to generate an amount column for each payment method listed in the set expression. It also uses the ref function, which lets a model refer to other models in the code:
During compilation, ref is replaced with a reference to the target table or view in the Warehouse
ref makes it possible to build the model dependency graph
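To make the two roles of ref more concrete, here is a minimal Python sketch of the idea. It is not how DBT is implemented: the regular expression, the compile_refs helper and the analytics schema name are all assumptions made for illustration. It only shows that one pass over the model text can both substitute relation names and collect dependencies.

```python
import re

# Hypothetical mapping from model names to relations in the Warehouse.
RELATIONS = {
    "stg_orders": "analytics.stg_orders",
    "order_payments": "analytics.order_payments",
}

# Matches {{ ref('model_name') }} with optional spaces inside the braces.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def compile_refs(model_sql, relations):
    """Replace each ref() call with its relation and return (sql, dependencies)."""
    dependencies = []

    def substitute(match):
        name = match.group(1)
        dependencies.append(name)  # every ref() is an edge in the graph
        return relations[name]

    return REF_PATTERN.sub(substitute, model_sql), dependencies


model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "left join {{ ref('order_payments') }} using (order_id)"
)
compiled_sql, deps = compile_refs(model_sql, RELATIONS)
print(compiled_sql)
print(deps)
```

The collected dependency names are exactly the information needed to order the models later on.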
It is Jinja that gives DBT its almost unlimited flexibility. The most commonly used features are:
If / else statements - branching
For loops
Variables
Macro - creating macros
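As a rough illustration of what the for loop in the f_orders model produces, the following Python sketch generates the same select-list lines. The render_amount_columns helper is hypothetical; in DBT the expansion is done by Jinja at compile time.

```python
payment_methods = ["credit_card", "coupon", "bank_transfer", "gift_card"]


def render_amount_columns(methods):
    """Emit one select-list line per payment method, as the Jinja loop does."""
    return "\n".join(f"order_payments.{method}_amount," for method in methods)


rendered = render_amount_columns(payment_methods)
print(rendered)
```

Adding a new payment method then means changing one list rather than editing the query by hand.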
Materialization: Table, View, Incremental
A materialization strategy is the way in which the resulting data set of a model is stored in the Warehouse.
The basic options are:
Table - a physical table in the Warehouse
View - a view, i.e. a virtual table in the Warehouse
There are also more sophisticated materialization strategies:
Incremental - incremental loading (of large fact tables); new rows are added, changed rows are updated, deleted rows are removed
Ephemeral - the model is not materialized directly, but takes part in other models as a CTE
Any other strategy that you add yourself
In addition to materialization strategies, there are optimization options for specific Warehouses, for example:
Snowflake: Transient tables, Merge behavior, Table clustering, Copying grants, Secure views
Redshift: Distkey, Sortkey (interleaved, compound), Late Binding Views
Microsoft SQL Server (community adapter)
Let's improve our model:
Make its loading incremental (Incremental)
Add distribution and sort keys for Redshift
-- Model configuration:
-- Incremental loading, unique key used to update records (unique_key)
-- Distribution key (dist), sort key (sort)
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    dist="customer_id",
    sort="order_date"
  )
}}

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (
    select * from {{ ref('stg_orders') }}
    where 1=1
    {% if is_incremental() -%}
    -- This filter is applied only on an incremental run
    and order_date >= (select max(order_date) from {{ this }})
    {%- endif %}
),

order_payments as (
    select * from {{ ref('order_payments') }}
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{payment_method}}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments using (order_id)
)

select * from final
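The behaviour of the incremental filter combined with unique_key can be sketched in plain Python. This is a simplified simulation under stated assumptions, not DBT's implementation: the target table is a dict keyed by order_id, a non-empty target stands in for is_incremental(), and the incremental_run helper and the sample rows are invented for the example.

```python
from datetime import date


def incremental_run(target, source, unique_key="order_id"):
    """Upsert source rows into target, mimicking the filter and unique_key above."""
    if target:  # simplified stand-in for is_incremental()
        high_water_mark = max(row["order_date"] for row in target.values())
        # Same condition as: order_date >= (select max(order_date) from this)
        source = [row for row in source if row["order_date"] >= high_water_mark]
    for row in source:
        target[row[unique_key]] = row  # insert new keys, overwrite existing ones
    return target


first_load = [
    {"order_id": 1, "order_date": date(2020, 1, 1), "status": "placed"},
    {"order_id": 2, "order_date": date(2020, 1, 2), "status": "placed"},
]
target = incremental_run({}, first_load)

second_load = [
    {"order_id": 1, "order_date": date(2020, 1, 1), "status": "completed"},
    {"order_id": 2, "order_date": date(2020, 1, 2), "status": "shipped"},
    {"order_id": 3, "order_date": date(2020, 1, 3), "status": "placed"},
]
target = incremental_run(target, second_load)
print({key: row["status"] for key, row in target.items()})
```

Note that in this sketch order 1 keeps its old status: its order_date is below the high-water mark, so the filter skips it. A date-based filter like this only picks up changes to recent rows.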
Model dependency graph
It is also called a dependency tree, and is also known as a DAG (Directed Acyclic Graph).
DBT builds the graph from the configuration of all models in the project, or more precisely, from the ref() links that models make to other models. Having a graph lets you do the following:
Run models in the correct order
Parallelize the building of data marts
Run an arbitrary subgraph
An example of graph visualization:
Each node of the graph is a model; the edges of the graph are defined by the ref expressions.
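The first two points can be illustrated with Python's standard graphlib module (Python 3.9+). This is a sketch of the general technique, not DBT's scheduler; the stg_payments model and the run_batches helper are assumptions added to make the graph slightly more interesting.

```python
from graphlib import TopologicalSorter

# model -> set of models it references via ref(); stg_payments is hypothetical
dependencies = {
    "stg_orders": set(),
    "stg_payments": set(),
    "order_payments": {"stg_payments"},
    "f_orders": {"stg_orders", "order_payments"},
}


def run_batches(graph):
    """Return lists of models that can run in parallel, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all models whose parents are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = run_batches(dependencies)
print(batches)
```

Every batch contains models with no unfinished parents, so the models inside one batch could be sent to the Warehouse concurrently.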
Data Quality and Documentation
Besides building the models themselves, DBT lets you test a number of assumptions about the resulting data set, such as:
Not Null
Unique
Referential Integrity (for example, customer_id in the orders table corresponds to id in the customers table)
Matching a list of accepted values
You can also add your own tests (custom data tests), for example the % deviation of revenue from the figures of a day, a week or a month ago. Any assumption that can be expressed as a SQL query can become a test.
This way you can catch unwanted deviations and errors in the data of the Warehouse marts.
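The idea that an assumption expressed as a SQL query can serve as a test is easy to try out with an in-memory SQLite database. In this sketch each query selects the rows that violate an assumption, and a check passes when the query returns no rows. The tables, the sample data and the checks dictionary are invented for the example and are not DBT's generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (customer_id integer);
    create table orders (order_id integer, customer_id integer, status text);
    insert into customers values (1), (2);
    insert into orders values
        (10, 1, 'placed'),
        (11, 2, 'shipped'),
        (11, 3, 'lost');
""")

# Each check selects the rows that violate the assumption.
checks = {
    "not_null": "select * from orders where order_id is null",
    "unique": "select order_id from orders group by order_id having count(*) > 1",
    "relationships": """
        select o.customer_id
        from orders o
        left join customers c on c.customer_id = o.customer_id
        where o.customer_id is not null and c.customer_id is null
    """,
    "accepted_values": """
        select status from orders
        where status not in ('placed', 'shipped', 'completed')
    """,
}

results = {name: len(conn.execute(sql).fetchall()) for name, sql in checks.items()}
failed = sorted(name for name, count in results.items() if count > 0)
print(failed)
```

The sample data deliberately contains a duplicated order_id, an unknown customer and an unexpected status, so three of the four checks report failures.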
As for documentation, DBT provides mechanisms for adding, versioning and distributing metadata and comments at the model level and even at the attribute level.
Here is what adding tests and documentation looks like at the configuration file level:
- name: fct_orders
  description: This table has basic information about orders, as well as some derived facts based on payments
  columns:
    - name: order_id
      tests:
        - unique # check that values are unique
        - not_null # check that there are no nulls
      description: This is a unique identifier for an order
    - name: customer_id
      description: Foreign key to the customers table
      tests:
        - not_null
        - relationships: # referential integrity check
            to: ref('dim_customers')
            field: customer_id
    - name: order_date
      description: Date (UTC) that the order was placed
    - name: status
      description: '{{ doc("orders_status") }}'
      tests:
        - accepted_values: # check against the list of allowed values
            values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
And this is what the documentation looks like on the generated website:
Macros and modules
The purpose of DBT is not so much to be a collection of SQL scripts as to give users a powerful, feature-rich way to build their own transformations and to distribute them as modules.
Macros are sets of constructs and expressions that can be called like functions inside models. Macros let you reuse SQL across models and projects, in line with the DRY (Don't Repeat Yourself) engineering principle.
Macro example:
{% macro rename_category(column_name) %}
case
    when {{ column_name }} ilike '%osx%' then 'osx'
    when {{ column_name }} ilike '%android%' then 'android'
    when {{ column_name }} ilike '%ios%' then 'ios'
    else 'other'
end as renamed_product
{% endmacro %}
And its usage:
{% set column_name = 'product' %}
select
    product,
    {{ rename_category(column_name) }} -- macro call
from my_table
DBT comes with a package manager that lets users publish and reuse individual modules and macros.
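This means you can pull in and use ready-made libraries; the examples later in this article rely on dbt_utils, a Redshift package (redshift.compress_table) and a logging module.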
Here I will describe a few interesting features and implementations that my team and I use to build the Data Warehouse at Wheely.
Separation of runtime environments DEV - TEST - PROD
This works even within the same DWH cluster (across different schemas). For example, using the following expression:
with source as (
    select * from {{ source('salesforce', 'users') }}
    where 1=1
    {% if target.name in ['dev', 'test', 'ci'] -%}
    and timestamp >= dateadd(day, -3, current_date)
    {%- endif %}
)
This code literally says: for the dev, test and ci environments, take only the data for the last 3 days and nothing more. Runs in these environments are therefore much faster and need fewer resources. When running in the prod environment, the filter condition is skipped.
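The effect of the target check can be mimicked with a few lines of Python. This is only a sketch of the branching logic: the build_source_query helper is hypothetical, and salesforce.users is an assumed relation name standing in for whatever the source() call resolves to.

```python
LIMITED_TARGETS = {"dev", "test", "ci"}


def build_source_query(target_name):
    """Return the source query, adding the 3-day filter only for limited targets."""
    lines = ["select * from salesforce.users", "where 1=1"]
    if target_name in LIMITED_TARGETS:
        lines.append("and timestamp >= dateadd(day, -3, current_date)")
    return "\n".join(lines)


dev_query = build_source_query("dev")
prod_query = build_source_query("prod")
print(dev_query)
print(prod_query)
```

Keeping the unconditional where 1=1 means the optional condition can always be appended with and, regardless of the environment.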
Materialization with alternative column encoding
Redshift is a columnar DBMS that lets you set a data compression algorithm for each individual column. Choosing optimal algorithms can reduce disk usage by 20-50%.
The redshift.compress_table macro runs the ANALYZE COMPRESSION command, creates a new table with the recommended column encodings and the specified distribution keys (dist_key) and sort keys (sort_key), moves the data into it and, if needed, deletes the old copy.
The logging module lets you record all the necessary metadata in a separate table, which can then be used for auditing and for analyzing bottlenecks.
Here is what a dashboard built on the logging data looks like in Looker:
Automation of Warehouse Maintenance
If you use extensions of the Warehouse functionality, such as UDFs (User Defined Functions), then versioning these functions, managing access to them and automating the rollout of new releases is very convenient to do in DBT.
We use Python UDFs to compute hashes, extract email domains and decode bitmasks.
An example of a macro that creates a UDF in any execution environment (dev, test, prod):
{% macro create_udf() -%}

  {% set sql %}
    CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
      RETURNS varchar
      LANGUAGE plpythonu
      STABLE
    AS $$
      import hashlib
      return hashlib.sha256(mes).hexdigest()
    $$
    ;
  {% endset %}

  {% set table = run_query(sql) %}

{%- endmacro %}
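To check locally what such a function should return, the UDF body can be reproduced in ordinary Python. This is a sketch for Python 3, where the input has to be encoded to bytes first; the UTF-8 encoding is an assumption, since the macro above passes the value to hashlib directly.

```python
import hashlib


def f_sha256(mes):
    """Python 3 counterpart of the UDF body; UTF-8 encoding is an assumption."""
    return hashlib.sha256(mes.encode("utf-8")).hexdigest()


digest = f_sha256("abc")
print(digest)
```

The result is a 64-character hexadecimal string, which fits the varchar return type declared in the macro.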
At Wheely we use Amazon Redshift, which is based on PostgreSQL. For Redshift it is important to regularly collect table statistics and free up disk space, using the ANALYZE and VACUUM commands respectively.
To do this, the commands from the redshift_maintenance macro are run every night:
{% macro redshift_maintenance() %}

    {# vacuumable_tables_sql is not shown in this excerpt; it is expected to hold the query that lists the tables to process #}
    {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}

    {% for row in vacuumable_tables %}
        {% set message_prefix=loop.index ~ " of " ~ loop.length %}

        {%- set relation_to_vacuum = adapter.get_relation(
                                                database=row['table_database'],
                                                schema=row['table_schema'],
                                                identifier=row['table_name']
                                    ) -%}
        {% do run_query("commit") %}

        {% if relation_to_vacuum %}
            {% set start=modules.datetime.datetime.now() %}
            {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
            {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
            {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
            {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
            {% set end=modules.datetime.datetime.now() %}
            {% set total_seconds = (end - start).total_seconds() | round(2) %}
            {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
        {% else %}
            {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
        {% endif %}

    {% endfor %}

{% endmacro %}
DBT Cloud
DBT can also be used as a managed service. It includes:
Web IDE for developing projects and models
Job configuration and scheduling
Simple and convenient access to logs
Website with your project's documentation
CI (Continuous Integration) hookup
Conclusion
Preparing and consuming a DWH becomes as pleasant and beneficial as drinking a smoothie. DBT consists of Jinja, user extensions (modules), a compiler, an executor and a package manager. Put these elements together and you get a complete working environment for your Data Warehouse. There is hardly a better way to manage transformations inside a DWH today.
The beliefs held by the developers of DBT are formulated as follows:
Code, not a GUI, is the best abstraction for expressing complex analytical logic
Working with data should adopt the best practices of software engineering
Critical data infrastructure should be controlled by its user community as open source software
Not only analytics tools but also analytics code will increasingly become the property of the Open Source community
These core beliefs have produced a product that is used by 850 companies today, and they are the basis for many exciting extensions that will be created in the future.
In addition to DBT and Data Warehousing, as part of the Data Engineer course on the OTUS platform, my colleagues and I teach classes on a number of other relevant and modern topics:
Architectural Concepts for Big Data Applications
Practice with Spark and Spark Streaming
Exploring methods and tools for loading data sources