ProHoster > Блог > Gudanarwa > Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie
Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie
A kan waɗanne ƙa'idodi aka gina ingantaccen Warehouse Data?
Mayar da hankali kan ƙimar kasuwanci da nazari idan babu lambar tukunyar jirgi. Sarrafa DWH azaman codebase: sigar, bita, gwaji mai sarrafa kansa da CI. Modular, extensible, bude tushen da al'umma. Takaddun abokantaka na mai amfani da hangen nesa na dogaro (Layin Bayani).
Karin bayani game da duk wannan kuma game da rawar DBT a cikin Babban Bayanai & yanayin muhalli - maraba ga cat.
Сем привет
Artemy Kozyr yana tuntuɓar. Fiye da shekaru 5 ina aiki tare da ɗakunan ajiya na bayanai, gina ETL/ELT, da kuma nazarin bayanai da hangen nesa. A halin yanzu ina aiki a ciki wheely, Ina koyar da kwas a OTUS Injiniyan Bayanai, kuma a yau ina so in raba tare da ku labarin da na rubuta a cikin tsammanin farawa sabon shiga don kwas.
Short review
Tsarin DBT duk game da T ne a cikin ELT (Extract - Transform - Load) gajarta.
Tare da zuwan irin waɗannan mahimman bayanai masu inganci da ƙima kamar BigQuery, Redshift, Snowflake, babu wata ma'ana a yin canje-canje a wajen Warehouse Data.
DBT baya sauke bayanai daga tushe, amma yana ba da dama mai girma don aiki tare da bayanan da aka riga aka ɗora a cikin Ma'ajiyar (a cikin Ma'ajiyar Ciki ko Na waje).
Babban maƙasudin DBT shine ɗaukar lambar, haɗa shi cikin SQL, aiwatar da umarni a daidai jeri a cikin Ma'ajiyar.
Tsarin Ayyukan DBT
Aikin ya ƙunshi kundayen adireshi da fayiloli na nau'ikan 2 kawai:
Model (.sql) - rukunin canji da aka bayyana ta hanyar tambayar SELECT
Fayil na daidaitawa (.yml) - sigogi, saituna, gwaje-gwaje, takardu
A matakin asali, an tsara aikin kamar haka:
Mai amfani yana shirya lambar ƙira a kowane IDE mai dacewa
Yin amfani da CLI, ana ƙaddamar da samfura, DBT tana tattara lambar ƙirar cikin SQL
Ana aiwatar da lambar SQL da aka haɗa a cikin Ma'ajiyar a cikin wani jeri da aka bayar (jadawalin)
Ga abin da ke gudana daga CLI zai yi kama:
Komai shine SELECT
Wannan siffa ce ta kisa na tsarin Kayan Gina Bayanai. A takaice dai, DBT yana ƙaddamar da duk lambar da ke da alaƙa da samar da tambayoyinku cikin Store (sabani daga umarni CREATE, INSERT, UPDATE, DELETE ALTER, GRANT, ...).
Duk wani samfuri ya ƙunshi rubuta tambayar SELECT guda ɗaya wanda ke bayyana saitin bayanan da aka samu.
A wannan yanayin, dabaru na canji na iya zama matakai da yawa kuma yana ƙarfafa bayanai daga wasu samfura da yawa. Misalin samfurin da zai gina nunin oda (f_orders):
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
Waɗanne abubuwa masu ban sha'awa za mu iya gani a nan?
Na farko: CTE da aka yi amfani da shi (Maganin Tebu na gama gari) - don tsarawa da fahimtar lambar da ta ƙunshi sauye-sauye da yawa da dabaru na kasuwanci
Na biyu: Model code cakude ne na SQL da harshe Jinja (harshen samfuri).
Misali yana amfani da madauki domin don samar da adadin don kowane hanyar biyan kuɗi da aka ƙayyade a cikin magana sa. Hakanan ana amfani da aikin ref - ikon yin la'akari da wasu samfura a cikin lambar:
Yayin hadawa ref za a canza zuwa maƙasudin manufa zuwa tebur ko gani a cikin Ma'aji
ref yana ba ku damar gina jadawali abin dogaro
Ya kasance Jinja yana ƙara kusan dama mara iyaka zuwa DBT. Wadanda akafi amfani dasu sune:
Idan / wani bayani - maganganun reshe
Don madaukai
Masu canji
Macro - ƙirƙirar macros
Kayan aiki: Teburi, Dubawa, Ƙaruwa
Dabarun ƙera kayan aiki hanya ce ta yadda za a adana sakamakon bayanan samfurin a cikin Ma'ajiya.
A ka’ida shi ne:
Tebur - tebur na jiki a cikin Ma'ajiyar
Duba- duba, tebur mai kama-da-wane a cikin Ma'aji
Hakanan akwai ƙarin dabarun ƙirƙira kayan aiki:
Ƙarfafawa - haɓaka haɓaka (na manyan tebur na gaskiya); ana ƙara sabbin layukan, an sabunta layukan da aka canza, an share layukan da aka goge
Ephemeral - samfurin ba ya samuwa kai tsaye, amma yana shiga a matsayin CTE a wasu samfurori
Duk wasu dabarun da zaku iya ƙarawa da kanku
Baya ga dabarun ƙirƙira kayan aiki, akwai damar ingantawa don takamaiman Ma'ajiya, misali:
Bari mu ƙara rarrabuwa da maɓallan rarrabawa don Redshift
-- Конфигурация модели:
-- Инкрементальное наполнение, уникальный ключ для обновления записей (unique_key)
-- Ключ сегментации (dist), ключ сортировки (sort)
{{
config(
materialized='incremental',
unique_key='order_id',
dist="customer_id",
sort="order_date"
)
}}
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
with orders as (
select * from {{ ref('stg_orders') }}
where 1=1
{% if is_incremental() -%}
-- Этот фильтр будет применен только для инкрементального запуска
and order_date >= (select max(order_date) from {{ this }})
{%- endif %}
),
order_payments as (
select * from {{ ref('order_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
{% for payment_method in payment_methods -%}
order_payments.{{payment_method}}_amount,
{% endfor -%}
order_payments.total_amount as amount
from orders
left join order_payments using (order_id)
)
select * from final
jadawali dogaro da samfur
Ita ma itace abin dogaro. Hakanan ana kiranta DAG (Directed Acyclic Graph).
DBT tana gina jadawali dangane da daidaita duk samfuran aikin, ko kuma a maimakon haka, hanyoyin haɗin gwiwa () a cikin ƙira zuwa wasu ƙira. Samun jadawali yana ba ku damar yin abubuwa masu zuwa:
Samfura masu gudana a cikin madaidaicin jeri
Daidaitawar samuwar gaban kantin
Gudanar da ƙaramin ra'ayi na sabani
Misali na gani jadawali:
Kowane kumburin jadawali abin ƙira ne; an ƙayyade gefuna na jadawali ta hanyar magana ref.
Ingancin Bayanai da Takardu
Baya ga samar da samfuran kansu, DBT yana ba ku damar gwada zato da yawa (tabbatacce) game da saitin bayanan da aka samu, kamar:
Ba Null ba
Musamman
Mutuncin Magana - Mutuncin Magana (misali, abokin ciniki_id a cikin tebur na oda yayi daidai da id a teburin abokan ciniki)
Daidaita jerin ƙimar karɓuwa
Yana yiwuwa a ƙara naku gwaje-gwaje (gwajin bayanan al'ada), kamar, alal misali, % karkatar da kudaden shiga tare da alamomi daga rana, mako, wata daya da suka wuce. Duk wani zato da aka tsara azaman tambayar SQL na iya zama gwaji.
Ta wannan hanyar, zaku iya kama ɓarna da kurakurai da ba'a so a cikin bayanai a cikin tagogin Warehouse.
Dangane da takaddun shaida, DBT yana ba da hanyoyin ƙarawa, siffa, da rarraba metadata da sharhi a ƙirar har ma da matakan sifa.
Ga abin da ƙara gwaje-gwaje da takardu yayi kama da matakin fayil ɗin sanyi:
- name: fct_orders
description: This table has basic information about orders, as well as some derived facts based on payments
columns:
- name: order_id
tests:
- unique # проверка на уникальность значений
- not_null # проверка на наличие null
description: This is a unique identifier for an order
- name: customer_id
description: Foreign key to the customers table
tests:
- not_null
- relationships: # проверка ссылочной целостности
to: ref('dim_customers')
field: customer_id
- name: order_date
description: Date (UTC) that the order was placed
- name: status
description: '{{ doc("orders_status") }}'
tests:
- accepted_values: # проверка на допустимые значения
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
Kuma ga yadda wannan takaddun yayi kama da gidan yanar gizon da aka samar:
Macros da Modules
Manufar DBT ba ta da yawa don zama saitin rubutun SQL, amma don samar wa masu amfani da hanyoyi masu ƙarfi da wadata don gina nasu canje-canje da rarraba waɗannan kayayyaki.
Macros saitin gine-gine ne da maganganu waɗanda za a iya kiran su azaman ayyuka a cikin ƙira. Macros suna ba ku damar sake amfani da SQL tsakanin samfura da ayyuka daidai da ƙa'idar injiniyan DRY (Kada ku Maimaita Kanku).
Misali:
{% macro rename_category(column_name) %}
case
when {{ column_name }} ilike '%osx%' then 'osx'
when {{ column_name }} ilike '%android%' then 'android'
when {{ column_name }} ilike '%ios%' then 'ios'
else 'other'
end as renamed_product
{% endmacro %}
Kuma amfaninsa:
{% set column_name = 'product' %}
select
product,
{{ rename_category(column_name) }} -- вызов макроса
from my_table
DBT ya zo tare da mai sarrafa fakiti wanda ke ba masu amfani damar bugawa da sake amfani da kowane nau'i da macro.
Wannan yana nufin iya lodawa da amfani da dakunan karatu kamar:
dbt_utils: Yin aiki tare da Kwanan wata/Lokaci, Maɓallin Maɓalli, Gwajin Tsari, Pivot/Unpivot da sauransu
Anan zan bayyana ƴan wasu abubuwa masu ban sha'awa da aiwatarwa waɗanda ni da ƙungiyar muke amfani da su don gina Warehouse Data a ciki wheely.
Rarraba mahallin lokacin aiki DEV - TEST - PROD
Ko da a cikin gungu na DWH iri ɗaya (a cikin tsari daban-daban). Misali, ta amfani da furci mai zuwa:
with source as (
select * from {{ source('salesforce', 'users') }}
where 1=1
{%- if target.name in ['dev', 'test', 'ci'] -%}
where timestamp >= dateadd(day, -3, current_date)
{%- endif -%}
)
Wannan lambar a zahiri tana cewa: don muhalli dev, test, ci ɗauki bayanai kawai na kwanaki 3 na ƙarshe kuma babu ƙari. Wato, gudana a cikin waɗannan mahalli zai yi sauri da sauri kuma yana buƙatar ƙarancin albarkatu. Lokacin gudanar da yanayi prod za a yi watsi da yanayin tace.
Ƙirƙira kayan aiki tare da madadin ginshiƙi
Redshift shine DBMS mai shafi wanda ke ba ku damar ƙididdige algorithms na matsa bayanai don kowane ginshiƙi. Zaɓin mafi kyawun algorithms na iya rage sararin diski ta 20-50%.
Macro redshift.compress_table zai aiwatar da umarni na ANALYZE COMPRESSION, ƙirƙirar sabon tebur tare da shawarar ginshiƙi da aka ba da shawarar algorithms, ƙayyadaddun maɓallan ɓangarori (dist_key) da maɓallai masu rarraba (sort_key), canja wurin bayanai zuwa gare shi, kuma, idan ya cancanta, share tsohon kwafin.
Tsarin shiga zai ba ku damar yin rikodin duk mahimman bayanan metadata a cikin wani tebur daban, wanda daga baya za'a iya amfani da shi don tantancewa da tantance ƙuƙumma.
Wannan shine abin da dashboard ɗin yayi kama dangane da bayanan shiga cikin Looker:
Mai sarrafa Ma'ajiya ta atomatik
Idan kun yi amfani da wasu kari na ayyukan ma'ajiyar da aka yi amfani da su, kamar UDF (Ayyukan Ƙayyadaddun Mai amfani), to, sigar waɗannan ayyukan, ikon samun dama, da mirginawa ta atomatik na sabbin abubuwan fitarwa ya dace sosai a yi a cikin DBT.
Muna amfani da UDF a cikin Python don ƙididdige hashes, yankin imel, da ƙaddamar da bitmask.
Misali na macro wanda ke ƙirƙirar UDF akan kowane yanayi na kisa (dev, test, prod):
{% macro create_udf() -%}
{% set sql %}
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
RETURNS varchar
LANGUAGE plpythonu
STABLE
AS $$
import hashlib
return hashlib.sha256(mes).hexdigest()
$$
;
{% endset %}
{% set table = run_query(sql) %}
{%- endmacro %}
A Wheely muna amfani da Amazon Redshift, wanda ya dogara da PostgreSQL. Don Redshift, yana da mahimmanci don tattara ƙididdiga akai-akai akan teburi da kuma 'yantar da sarari diski - umarnin ANALYZE da VACUUM, bi da bi.
Don yin wannan, ana aiwatar da umarni daga redshift_maintenance macro kowane dare:
{% macro redshift_maintenance() %}
{% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
{% for row in vacuumable_tables %}
{% set message_prefix=loop.index ~ " of " ~ loop.length %}
{%- set relation_to_vacuum = adapter.get_relation(
database=row['table_database'],
schema=row['table_schema'],
identifier=row['table_name']
) -%}
{% do run_query("commit") %}
{% if relation_to_vacuum %}
{% set start=modules.datetime.datetime.now() %}
{{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
{% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
{{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
{% do run_query("ANALYZE " ~ relation_to_vacuum) %}
{% set end=modules.datetime.datetime.now() %}
{% set total_seconds = (end - start).total_seconds() | round(2) %}
{{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
{% else %}
{{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
{% endif %}
{% endfor %}
{% endmacro %}
DBT Cloud
Yana yiwuwa a yi amfani da DBT azaman sabis (Sabis ɗin Gudanarwa). Hada da:
IDE na Yanar Gizo don haɓaka ayyuka da ƙira
Tsarin aiki da tsarawa
Sauƙaƙan kuma dacewa ga rajistan ayyukan
Yanar Gizo tare da takaddun aikin ku
Haɗa CI (Ci gaba da Haɗuwa)
ƙarshe
Shiri da cinye DWH ya zama mai daɗi da fa'ida kamar shan santsi. DBT ya ƙunshi Jinja, haɓaka mai amfani (modules), mai tarawa, mai zartarwa, da manajan fakiti. Ta hanyar haɗa waɗannan abubuwan tare zaku sami cikakkiyar yanayin aiki don Warehouse ɗin ku. Babu wata hanya mafi kyau don gudanar da canji a cikin DWH a yau.
An tsara imanin da masu haɓaka DBT ke bi kamar haka:
Lambar, ba GUI ba, ita ce mafi kyawun abstraction don bayyana hadaddun dabaru na nazari
Yin aiki tare da bayanai ya kamata ya daidaita mafi kyawun ayyuka a aikin injiniyan software (Injiniya Software)
Ya kamata al'ummar masu amfani su sarrafa muhimman abubuwan more rayuwa na bayanai azaman buɗaɗɗen software
Ba kayan aikin nazari kawai ba, har ma da lambar za ta ƙara zama mallakar al'ummar Buɗewa
Waɗannan ainihin gaskatawar sun haifar da samfur wanda sama da kamfanoni 850 ke amfani da su a yau, kuma sun zama tushen abubuwan haɓaka da yawa masu ban sha'awa waɗanda za a ƙirƙira a nan gaba.
Baya ga DBT da Warehousing na Data, a matsayin wani ɓangare na kwas ɗin Injiniyan Bayanai akan dandalin OTUS, ni da abokan aikina muna koyar da darussa kan wasu batutuwa masu dacewa da na zamani:
Ka'idodin Gine-gine don Babban Aikace-aikacen Bayanai
Yi aiki tare da Spark da Spark Streaming
Bincika hanyoyin da kayan aiki don loda tushen bayanai
Gine-gine na nazari a cikin DWH
Ra'ayoyin NoSQL: HBase, Cassandra, Neman Elastic
Ka'idodin sa ido da tsarawa
Aikin Karshe: Haɗa duk ƙwarewa tare a ƙarƙashin tallafin jagoranci