Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie
A kan waɗanne ƙa'idodi aka gina ingantaccen Warehouse Data?

Mayar da hankali kan ƙimar kasuwanci da nazari idan babu lambar tukunyar jirgi. Sarrafa DWH azaman codebase: sigar, bita, gwaji mai sarrafa kansa da CI. Modular, extensible, bude tushen da al'umma. Takaddun abokantaka na mai amfani da hangen nesa na dogaro (Layin Bayani).

Karin bayani game da duk wannan kuma game da rawar DBT a cikin Babban Bayanai & yanayin muhalli - maraba ga cat.

Сем привет

Artemy Kozyr yana tuntuɓar. Fiye da shekaru 5 ina aiki tare da ɗakunan ajiya na bayanai, gina ETL/ELT, da kuma nazarin bayanai da hangen nesa. A halin yanzu ina aiki a ciki wheely, Ina koyar da kwas a OTUS Injiniyan Bayanai, kuma a yau ina so in raba tare da ku labarin da na rubuta a cikin tsammanin farawa sabon shiga don kwas.

Short review

Tsarin DBT duk game da T ne a cikin ELT (Extract - Transform - Load) gajarta.

Tare da zuwan irin waɗannan mahimman bayanai masu inganci da ƙima kamar BigQuery, Redshift, Snowflake, babu wata ma'ana a yin canje-canje a wajen Warehouse Data. 

DBT baya sauke bayanai daga tushe, amma yana ba da dama mai girma don aiki tare da bayanan da aka riga aka ɗora a cikin Ma'ajiyar (a cikin Ma'ajiyar Ciki ko Na waje).

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie
Babban maƙasudin DBT shine ɗaukar lambar, haɗa shi cikin SQL, aiwatar da umarni a daidai jeri a cikin Ma'ajiyar.

Tsarin Ayyukan DBT

Aikin ya ƙunshi kundayen adireshi da fayiloli na nau'ikan 2 kawai:

  • Model (.sql) - rukunin canji da aka bayyana ta hanyar tambayar SELECT
  • Fayil na daidaitawa (.yml) - sigogi, saituna, gwaje-gwaje, takardu

A matakin asali, an tsara aikin kamar haka:

  • Mai amfani yana shirya lambar ƙira a kowane IDE mai dacewa
  • Yin amfani da CLI, ana ƙaddamar da samfura, DBT tana tattara lambar ƙirar cikin SQL
  • Ana aiwatar da lambar SQL da aka haɗa a cikin Ma'ajiyar a cikin wani jeri da aka bayar (jadawalin)

Ga abin da ke gudana daga CLI zai yi kama:

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

Komai shine SELECT

Wannan siffa ce ta kisa na tsarin Kayan Gina Bayanai. A takaice dai, DBT yana ƙaddamar da duk lambar da ke da alaƙa da samar da tambayoyinku cikin Store (sabani daga umarni CREATE, INSERT, UPDATE, DELETE ALTER, GRANT, ...).

Duk wani samfuri ya ƙunshi rubuta tambayar SELECT guda ɗaya wanda ke bayyana saitin bayanan da aka samu.

A wannan yanayin, dabaru na canji na iya zama matakai da yawa kuma yana ƙarfafa bayanai daga wasu samfura da yawa. Misalin samfurin da zai gina nunin oda (f_orders):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

Waɗanne abubuwa masu ban sha'awa za mu iya gani a nan?

Na farko: CTE da aka yi amfani da shi (Maganin Tebu na gama gari) - don tsarawa da fahimtar lambar da ta ƙunshi sauye-sauye da yawa da dabaru na kasuwanci

Na biyu: Model code cakude ne na SQL da harshe Jinja (harshen samfuri).

Misali yana amfani da madauki domin don samar da adadin don kowane hanyar biyan kuɗi da aka ƙayyade a cikin magana sa. Hakanan ana amfani da aikin ref - ikon yin la'akari da wasu samfura a cikin lambar:

  • Yayin hadawa ref za a canza zuwa maƙasudin manufa zuwa tebur ko gani a cikin Ma'aji
  • ref yana ba ku damar gina jadawali abin dogaro

Ya kasance Jinja yana ƙara kusan dama mara iyaka zuwa DBT. Wadanda akafi amfani dasu sune:

  • Idan / wani bayani - maganganun reshe
  • Don madaukai
  • Masu canji
  • Macro - ƙirƙirar macros

Kayan aiki: Teburi, Dubawa, Ƙaruwa

Dabarun ƙera kayan aiki hanya ce ta yadda za a adana sakamakon bayanan samfurin a cikin Ma'ajiya.

A ka’ida shi ne:

  • Tebur - tebur na jiki a cikin Ma'ajiyar
  • Duba- duba, tebur mai kama-da-wane a cikin Ma'aji

Hakanan akwai ƙarin dabarun ƙirƙira kayan aiki:

  • Ƙarfafawa - haɓaka haɓaka (na manyan tebur na gaskiya); ana ƙara sabbin layukan, an sabunta layukan da aka canza, an share layukan da aka goge 
  • Ephemeral - samfurin ba ya samuwa kai tsaye, amma yana shiga a matsayin CTE a wasu samfurori
  • Duk wasu dabarun da zaku iya ƙarawa da kanku

Baya ga dabarun ƙirƙira kayan aiki, akwai damar ingantawa don takamaiman Ma'ajiya, misali:

  • Snowflake: Tebura na wucin gadi, Haɗuwa Halayyar, Tarin tebur, Kwafin tallafi, Amintattun ra'ayoyi
  • Redshift: Distkey, Key Key (maɓalli, fili), Late Binding Views
  • BabbanKanya: Rarraba tebur & tari, Haɗa halayen, KMS boye-boye, Lakabi & Tags
  • walƙiya: Tsarin fayil (parquet, csv, json, orc, delta), partition_by, clustered_by, buckets, incremental_strategy

A halin yanzu ana tallafawa Ma'ajiyoyi masu zuwa:

  • Postgreshi
  • Redshift
  • BabbanKanya
  • Snowflake
  • Presto (wani bangare)
  • Spark (wani bangare)
  • Microsoft SQL Server ( adaftar al'umma)

Mu inganta tsarin mu:

  • Mu sanya cikonsa ya karu (Ƙara)
  • Bari mu ƙara rarrabuwa da maɓallan rarrabawa don Redshift

-- Конфигурация модели: 
-- Инкрементальное наполнение, уникальный ключ для обновления записей (unique_key)
-- Ключ сегментации (dist), ключ сортировки (sort)
{{
  config(
       materialized='incremental',
       unique_key='order_id',
       dist="customer_id",
       sort="order_date"
   )
}}
 
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}
 
with orders as (
 
   select * from {{ ref('stg_orders') }}
   where 1=1
   {% if is_incremental() -%}
       -- Этот фильтр будет применен только для инкрементального запуска
       and order_date >= (select max(order_date) from {{ this }})
   {%- endif %} 
 
),
 
order_payments as (
 
   select * from {{ ref('order_payments') }}
 
),
 
final as (
 
   select
       orders.order_id,
       orders.customer_id,
       orders.order_date,
       orders.status,
       {% for payment_method in payment_methods -%}
       order_payments.{{payment_method}}_amount,
       {% endfor -%}
       order_payments.total_amount as amount
   from orders
       left join order_payments using (order_id)
 
)
 
select * from final

jadawali dogaro da samfur

Ita ma itace abin dogaro. Hakanan ana kiranta DAG (Directed Acyclic Graph).

DBT tana gina jadawali dangane da daidaita duk samfuran aikin, ko kuma a maimakon haka, hanyoyin haɗin gwiwa () a cikin ƙira zuwa wasu ƙira. Samun jadawali yana ba ku damar yin abubuwa masu zuwa:

  • Samfura masu gudana a cikin madaidaicin jeri
  • Daidaitawar samuwar gaban kantin
  • Gudanar da ƙaramin ra'ayi na sabani 

Misali na gani jadawali:

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie
Kowane kumburin jadawali abin ƙira ne; an ƙayyade gefuna na jadawali ta hanyar magana ref.

Ingancin Bayanai da Takardu

Baya ga samar da samfuran kansu, DBT yana ba ku damar gwada zato da yawa (tabbatacce) game da saitin bayanan da aka samu, kamar:

  • Ba Null ba
  • Musamman
  • Mutuncin Magana - Mutuncin Magana (misali, abokin ciniki_id a cikin tebur na oda yayi daidai da id a teburin abokan ciniki)
  • Daidaita jerin ƙimar karɓuwa

Yana yiwuwa a ƙara naku gwaje-gwaje (gwajin bayanan al'ada), kamar, alal misali, % karkatar da kudaden shiga tare da alamomi daga rana, mako, wata daya da suka wuce. Duk wani zato da aka tsara azaman tambayar SQL na iya zama gwaji.

Ta wannan hanyar, zaku iya kama ɓarna da kurakurai da ba'a so a cikin bayanai a cikin tagogin Warehouse.

Dangane da takaddun shaida, DBT yana ba da hanyoyin ƙarawa, siffa, da rarraba metadata da sharhi a ƙirar har ma da matakan sifa. 

Ga abin da ƙara gwaje-gwaje da takardu yayi kama da matakin fayil ɗin sanyi:

 - name: fct_orders
   description: This table has basic information about orders, as well as some derived facts based on payments
   columns:
     - name: order_id
       tests:
         - unique # проверка на уникальность значений
         - not_null # проверка на наличие null
       description: This is a unique identifier for an order
     - name: customer_id
       description: Foreign key to the customers table
       tests:
         - not_null
         - relationships: # проверка ссылочной целостности
             to: ref('dim_customers')
             field: customer_id
     - name: order_date
       description: Date (UTC) that the order was placed
     - name: status
       description: '{{ doc("orders_status") }}'
       tests:
         - accepted_values: # проверка на допустимые значения
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

Kuma ga yadda wannan takaddun yayi kama da gidan yanar gizon da aka samar:

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

Macros da Modules

Manufar DBT ba ta da yawa don zama saitin rubutun SQL, amma don samar wa masu amfani da hanyoyi masu ƙarfi da wadata don gina nasu canje-canje da rarraba waɗannan kayayyaki.

Macros saitin gine-gine ne da maganganu waɗanda za a iya kiran su azaman ayyuka a cikin ƙira. Macros suna ba ku damar sake amfani da SQL tsakanin samfura da ayyuka daidai da ƙa'idar injiniyan DRY (Kada ku Maimaita Kanku).

Misali:

{% macro rename_category(column_name) %}
case
 when {{ column_name }} ilike  '%osx%' then 'osx'
 when {{ column_name }} ilike  '%android%' then 'android'
 when {{ column_name }} ilike  '%ios%' then 'ios'
 else 'other'
end as renamed_product
{% endmacro %}

Kuma amfaninsa:

{% set column_name = 'product' %}
select
 product,
 {{ rename_category(column_name) }} -- вызов макроса
from my_table

DBT ya zo tare da mai sarrafa fakiti wanda ke ba masu amfani damar bugawa da sake amfani da kowane nau'i da macro.

Wannan yana nufin iya lodawa da amfani da dakunan karatu kamar:

  • dbt_utils: Yin aiki tare da Kwanan wata/Lokaci, Maɓallin Maɓalli, Gwajin Tsari, Pivot/Unpivot da sauransu
  • Samfuran nuni da aka yi don ayyuka kamar Dusar kankara и stripe 
  • Dakunan karatu don takamaiman Stores na Bayanai, misali. Redshift 
  • shiga - Module don shiga aikin DBT

Ana iya samun cikakken jerin fakiti a dbt ku.

Har ma da ƙarin fasali

Anan zan bayyana ƴan wasu abubuwa masu ban sha'awa da aiwatarwa waɗanda ni da ƙungiyar muke amfani da su don gina Warehouse Data a ciki wheely.

Rarraba mahallin lokacin aiki DEV - TEST - PROD

Ko da a cikin gungu na DWH iri ɗaya (a cikin tsari daban-daban). Misali, ta amfani da furci mai zuwa:

with source as (
 
   select * from {{ source('salesforce', 'users') }}
   where 1=1
   {%- if target.name in ['dev', 'test', 'ci'] -%}           
       where timestamp >= dateadd(day, -3, current_date)   
   {%- endif -%}
 
)

Wannan lambar a zahiri tana cewa: don muhalli dev, test, ci ɗauki bayanai kawai na kwanaki 3 na ƙarshe kuma babu ƙari. Wato, gudana a cikin waɗannan mahalli zai yi sauri da sauri kuma yana buƙatar ƙarancin albarkatu. Lokacin gudanar da yanayi prod za a yi watsi da yanayin tace.

Ƙirƙira kayan aiki tare da madadin ginshiƙi

Redshift shine DBMS mai shafi wanda ke ba ku damar ƙididdige algorithms na matsa bayanai don kowane ginshiƙi. Zaɓin mafi kyawun algorithms na iya rage sararin diski ta 20-50%.

Macro redshift.compress_table zai aiwatar da umarni na ANALYZE COMPRESSION, ƙirƙirar sabon tebur tare da shawarar ginshiƙi da aka ba da shawarar algorithms, ƙayyadaddun maɓallan ɓangarori (dist_key) da maɓallai masu rarraba (sort_key), canja wurin bayanai zuwa gare shi, kuma, idan ya cancanta, share tsohon kwafin.

Sa hannun macro:

{{ compress_table(schema, table,
                 drop_backup=False,
                 comprows=none|Integer,
                 sort_style=none|compound|interleaved,
                 sort_keys=none|List<String>,
                 dist_style=none|all|even,
                 dist_key=none|String) }}

Samfurin shiga yana gudana

Kuna iya haɗa ƙugiya zuwa kowane kisa na ƙirar, wanda za a aiwatar da shi kafin ƙaddamarwa ko nan da nan bayan an gama ƙirƙirar samfurin:

   pre-hook: "{{ logging.log_model_start_event() }}"
   post-hook: "{{ logging.log_model_end_event() }}"

Tsarin shiga zai ba ku damar yin rikodin duk mahimman bayanan metadata a cikin wani tebur daban, wanda daga baya za'a iya amfani da shi don tantancewa da tantance ƙuƙumma.

Wannan shine abin da dashboard ɗin yayi kama dangane da bayanan shiga cikin Looker:

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

Mai sarrafa Ma'ajiya ta atomatik

Idan kun yi amfani da wasu kari na ayyukan ma'ajiyar da aka yi amfani da su, kamar UDF (Ayyukan Ƙayyadaddun Mai amfani), to, sigar waɗannan ayyukan, ikon samun dama, da mirginawa ta atomatik na sabbin abubuwan fitarwa ya dace sosai a yi a cikin DBT.

Muna amfani da UDF a cikin Python don ƙididdige hashes, yankin imel, da ƙaddamar da bitmask.

Misali na macro wanda ke ƙirƙirar UDF akan kowane yanayi na kisa (dev, test, prod):

{% macro create_udf() -%}
 
 {% set sql %}
       CREATE OR REPLACE FUNCTION {{ target.schema }}.f_sha256(mes "varchar")
           RETURNS varchar
           LANGUAGE plpythonu
           STABLE
       AS $$  
           import hashlib
           return hashlib.sha256(mes).hexdigest()
       $$
       ;
 {% endset %}
  
 {% set table = run_query(sql) %}
 
{%- endmacro %}

A Wheely muna amfani da Amazon Redshift, wanda ya dogara da PostgreSQL. Don Redshift, yana da mahimmanci don tattara ƙididdiga akai-akai akan teburi da kuma 'yantar da sarari diski - umarnin ANALYZE da VACUUM, bi da bi.

Don yin wannan, ana aiwatar da umarni daga redshift_maintenance macro kowane dare:

{% macro redshift_maintenance() %}
 
   {% set vacuumable_tables=run_query(vacuumable_tables_sql) %}
 
   {% for row in vacuumable_tables %}
       {% set message_prefix=loop.index ~ " of " ~ loop.length %}
 
       {%- set relation_to_vacuum = adapter.get_relation(
                                               database=row['table_database'],
                                               schema=row['table_schema'],
                                               identifier=row['table_name']
                                   ) -%}
       {% do run_query("commit") %}
 
       {% if relation_to_vacuum %}
           {% set start=modules.datetime.datetime.now() %}
           {{ dbt_utils.log_info(message_prefix ~ " Vacuuming " ~ relation_to_vacuum) }}
           {% do run_query("VACUUM " ~ relation_to_vacuum ~ " BOOST") %}
           {{ dbt_utils.log_info(message_prefix ~ " Analyzing " ~ relation_to_vacuum) }}
           {% do run_query("ANALYZE " ~ relation_to_vacuum) %}
           {% set end=modules.datetime.datetime.now() %}
           {% set total_seconds = (end - start).total_seconds() | round(2)  %}
           {{ dbt_utils.log_info(message_prefix ~ " Finished " ~ relation_to_vacuum ~ " in " ~ total_seconds ~ "s") }}
       {% else %}
           {{ dbt_utils.log_info(message_prefix ~ ' Skipping relation "' ~ row.values() | join ('"."') ~ '" as it does not exist') }}
       {% endif %}
 
   {% endfor %}
 
{% endmacro %}

DBT Cloud

Yana yiwuwa a yi amfani da DBT azaman sabis (Sabis ɗin Gudanarwa). Hada da:

  • IDE na Yanar Gizo don haɓaka ayyuka da ƙira
  • Tsarin aiki da tsarawa
  • Sauƙaƙan kuma dacewa ga rajistan ayyukan
  • Yanar Gizo tare da takaddun aikin ku
  • Haɗa CI (Ci gaba da Haɗuwa)

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

ƙarshe

Shiri da cinye DWH ya zama mai daɗi da fa'ida kamar shan santsi. DBT ya ƙunshi Jinja, haɓaka mai amfani (modules), mai tarawa, mai zartarwa, da manajan fakiti. Ta hanyar haɗa waɗannan abubuwan tare zaku sami cikakkiyar yanayin aiki don Warehouse ɗin ku. Babu wata hanya mafi kyau don gudanar da canji a cikin DWH a yau.

Kayan aikin Gina Data ko abin da ya zama ruwan dare tsakanin Data Warehouse da Smoothie

An tsara imanin da masu haɓaka DBT ke bi kamar haka:

  • Lambar, ba GUI ba, ita ce mafi kyawun abstraction don bayyana hadaddun dabaru na nazari
  • Yin aiki tare da bayanai ya kamata ya daidaita mafi kyawun ayyuka a aikin injiniyan software (Injiniya Software)

  • Ya kamata al'ummar masu amfani su sarrafa muhimman abubuwan more rayuwa na bayanai azaman buɗaɗɗen software
  • Ba kayan aikin nazari kawai ba, har ma da lambar za ta ƙara zama mallakar al'ummar Buɗewa

Waɗannan ainihin gaskatawar sun haifar da samfur wanda sama da kamfanoni 850 ke amfani da su a yau, kuma sun zama tushen abubuwan haɓaka da yawa masu ban sha'awa waɗanda za a ƙirƙira a nan gaba.

Ga masu sha'awar, akwai bidiyon budaddiyar darasi da na bayar a watannin baya a wani bangare na budaddiyar darasi a OTUS - Kayan aikin Gina Data don Ma'ajiyar Redshift ta Amazon.

Baya ga DBT da Warehousing na Data, a matsayin wani ɓangare na kwas ɗin Injiniyan Bayanai akan dandalin OTUS, ni da abokan aikina muna koyar da darussa kan wasu batutuwa masu dacewa da na zamani:

  • Ka'idodin Gine-gine don Babban Aikace-aikacen Bayanai
  • Yi aiki tare da Spark da Spark Streaming
  • Bincika hanyoyin da kayan aiki don loda tushen bayanai
  • Gine-gine na nazari a cikin DWH
  • Ra'ayoyin NoSQL: HBase, Cassandra, Neman Elastic
  • Ka'idodin sa ido da tsarawa 
  • Aikin Karshe: Haɗa duk ƙwarewa tare a ƙarƙashin tallafin jagoranci

Tunani:

  1. Takardun DBT - Gabatarwa - Takardun hukuma
  2. Menene, daidai, dbt? - Bitar labarin daya daga cikin marubutan DBT 
  3. Kayan aikin Gina Data don Ma'ajiyar Redshift ta Amazon - YouTube, Yin rikodin darasi na OTUS
  4. Samun cikakken bincike na sunan Greenplum - Babban darasi na gaba shine Mayu 15, 2020
  5. Darasi Injiniya - OTUS
  6. Gina Balagurowar Ayyukan Nazari - Duban makomar bayanai da nazari
  7. Lokaci ya yi don nazarin tushen buɗe ido - Juyin nazari da tasirin Buɗe tushen
  8. Ci gaba da Haɗuwa da Gwajin Gina Kai tsaye tare da dbtCloud - Ka'idodin gina CI ta amfani da DBT
  9. Farawa tare da koyawa DBT - Kwarewa, umarnin mataki-mataki don aiki mai zaman kansa
  10. Shagon Jaffle - Github DBT Koyawa - Github, lambar aikin ilimi

Koyi game da kwas.

source: www.habr.com

Add a comment