Airflow is a tool for conveniently and quickly developing and maintaining batch data processing workflows


Hello, Habr! In this article I want to talk about one great tool for developing batch data processing workflows, for example in the infrastructure of a corporate DWH or your DataLake. We will talk about Apache Airflow (hereafter Airflow). It is undeservedly neglected on Habr, and in the main part I will try to convince you that Airflow is at least worth a look when you choose a scheduler for your ETL/ELT processes.

Previously I wrote a series of articles about DWH while working at Tinkoff Bank. Now I have joined the Mail.Ru Group team and am building a data analysis platform in the gaming division. As news and interesting solutions appear, my team and I will write here about our data analytics platform.

Prologue

So, let's begin. What is Airflow? It is a library (or a set of libraries) for developing, scheduling and monitoring workflows. The main feature of Airflow is that workflows are described (developed) in Python code. This has many advantages for organizing your project and development: in essence, your (for example) ETL project is just a Python project, and you can organize it however you like, taking into account your infrastructure, team size and other requirements. The tooling is simple: use, for example, PyCharm + Git. It is wonderful and very convenient!

Now let's look at the main entities of Airflow. Once you understand their essence and purpose, you can organize your process architecture well. Perhaps the main entity is the Directed Acyclic Graph (hereafter DAG).

DAG

A DAG is a meaningful grouping of tasks that you want to run in a strictly defined order on a given schedule. Airflow provides a convenient web interface for working with DAGs and other entities:

[Image: Airflow web interface]

A DAG might look like this:

[Image: example DAG graph]
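To make the idea of a DAG a little more concrete, here is a minimal sketch that uses only the Python standard library rather than Airflow itself. The task names are hypothetical; the point is only that dependencies between tasks determine a valid execution order, and that a cycle would make such an order impossible.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "truncate_stg": set(),
    "export_profiles": {"truncate_stg"},
    "export_orders": {"truncate_stg"},
    "load_dwh": {"export_profiles", "export_orders"},
}

# static_order() raises graphlib.CycleError if the graph contains a cycle,
# which is the "acyclic" requirement of a DAG.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two export tasks do not depend on each other, so a scheduler is free to run them in parallel once truncate_stg has finished.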

When designing a DAG, the developer lays out the set of operators from which the tasks inside the DAG will be built. This brings us to another important entity: the Airflow Operator.

Operators

An operator is an entity from which task instances are created; it describes what happens while a task instance runs. Airflow releases from GitHub already include a set of ready-to-use operators. Examples:

  • BashOperator - an operator for running a bash command.
  • PythonOperator - an operator for calling Python code.
  • EmailOperator - an operator for sending email.
  • HTTPOperator - an operator for working with HTTP requests.
  • SqlOperator - an operator for executing SQL code.
  • Sensor - an operator for waiting for an event (the arrival of a required time, the appearance of a required file, a row in a database, a response from an API, and so on).

There are also more specific operators: DockerOperator, HiveOperator, S3FileTransferOperator, PrestoToMysqlOperator, SlackOperator.

You can also develop operators tailored to your own needs and use them in your project. For example, we created MongoDBToHiveViaHdfsTransfer, an operator for exporting documents from MongoDB to Hive, and several operators for working with ClickHouse: CLoadFromHiveOperator and CHTableLoaderOperator. In essence, as soon as a project repeatedly uses the same code built on the basic operators, it is worth thinking about packaging it into a new operator. This simplifies further development and grows your project's library of operators.
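The shape of a custom operator can be sketched roughly as follows. This is a toy model, not Airflow code: ToyOperator stands in for a base class, and RowCountCheckOperator, its parameters and the fake row counts are all hypothetical. In Airflow itself you would subclass BaseOperator and override execute(self, context) in much the same way.

```python
class ToyOperator:
    """Stand-in for an operator base class: a task template with an execute() hook."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class RowCountCheckOperator(ToyOperator):
    """Hypothetical custom operator: fails the task when a table looks too small."""

    def __init__(self, task_id, table, min_rows, count_rows):
        super().__init__(task_id)
        self.table = table
        self.min_rows = min_rows
        self.count_rows = count_rows  # callable(table) -> int, injected to keep the sketch runnable

    def execute(self, context):
        rows = self.count_rows(self.table)
        if rows < self.min_rows:
            raise ValueError(
                f"{self.table}: {rows} rows, expected at least {self.min_rows}"
            )
        return rows


# A dictionary lookup plays the role of a real database query.
fake_counts = {"stg.profiles": 1200}
check = RowCountCheckOperator(
    task_id="check_profiles",
    table="stg.profiles",
    min_rows=1000,
    count_rows=fake_counts.__getitem__,
)
result = check.execute(context={"execution_date": "2018-01-01"})
print(result)
```

Once a check like this lives in its own class, every DAG that needs it declares one more task instead of copying the same code again.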

Next, all these task instances need to be executed, so now let's talk about the scheduler.

Scheduler

Airflow's task scheduler is built on Celery. Celery is a Python library that lets you organize a queue plus asynchronous and distributed execution of tasks. On the Airflow side, all tasks are divided into pools. Pools are created manually. Typically their purpose is to limit the load on a source or to group tasks of one type inside the DWH. Pools can be managed through the web interface:

[Image: pool management in the Airflow web interface]

Each pool has a limit on the number of slots. When a DAG is created, it is assigned a pool:

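from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable

# Settings are read from Airflow Variables so they can be changed without editing code
ALERT_MAILS = Variable.get("gv_mail_admin_dwh")
DAG_NAME = 'dma_load'
OWNER = 'Vasya Pupkin'
DEPENDS_ON_PAST = True
EMAIL_ON_FAILURE = True
EMAIL_ON_RETRY = True
RETRIES = int(Variable.get('gv_dag_retries'))
POOL = 'dma_pool'
PRIORITY_WEIGHT = 10

# Start date: midnight of the previous day
start_dt = datetime.today() - timedelta(1)
start_dt = datetime(start_dt.year, start_dt.month, start_dt.day)

# Defaults applied to every task in the DAG, including the pool and the priority
default_args = {
    'owner': OWNER,
    'depends_on_past': DEPENDS_ON_PAST,
    'start_date': start_dt,
    'email': ALERT_MAILS,
    'email_on_failure': EMAIL_ON_FAILURE,
    'email_on_retry': EMAIL_ON_RETRY,
    'retries': RETRIES,
    'pool': POOL,
    'priority_weight': PRIORITY_WEIGHT
}
dag = DAG(DAG_NAME, default_args=default_args)
# Use the module docstring as the DAG documentation
dag.doc_md = __doc__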

A pool defined at the DAG level can be overridden at the task level.
A separate process, the Scheduler, is responsible for scheduling all tasks in Airflow. In effect, the Scheduler handles all the mechanics of queuing tasks for execution. A task goes through several stages before it is executed:

  1. The upstream tasks in the DAG have completed; the new task can be queued.
  2. The queue is ordered by task priority (priorities can also be controlled), and if there is a free slot in the pool, the task can be started.
  3. If there is a free Celery worker, the task is sent to it; the work you programmed in the task begins, using one operator or another.

Simple enough.
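The first two stages above can be sketched as a toy function. This is not how the Airflow Scheduler is actually implemented; the task names, the dictionary layout and pick_runnable are all hypothetical, and the third stage (handing the task to a free Celery worker) is left out.

```python
import heapq


def pick_runnable(tasks, done, free_slots):
    """Return task ids to start now: upstream finished, highest priority first, within pool slots."""
    ready = []
    for task_id, spec in tasks.items():
        if task_id in done:
            continue
        if all(up in done for up in spec["upstream"]):
            # heapq is a min-heap, so negate the weight to pop the highest priority first.
            heapq.heappush(ready, (-spec["priority_weight"], task_id))

    started = []
    slots = dict(free_slots)  # copy so the caller's dictionary is not modified
    while ready:
        _, task_id = heapq.heappop(ready)
        pool = tasks[task_id]["pool"]
        if slots.get(pool, 0) > 0:
            slots[pool] -= 1
            started.append(task_id)
    return started


tasks = {
    "extract": {"upstream": [], "pool": "src_pool", "priority_weight": 5},
    "load_a": {"upstream": ["extract"], "pool": "dma_pool", "priority_weight": 10},
    "load_b": {"upstream": ["extract"], "pool": "dma_pool", "priority_weight": 1},
    "report": {"upstream": ["load_a", "load_b"], "pool": "dma_pool", "priority_weight": 3},
}
pools = {"src_pool": 1, "dma_pool": 1}

first_wave = pick_runnable(tasks, done=set(), free_slots=pools)
second_wave = pick_runnable(tasks, done={"extract"}, free_slots=pools)
print(first_wave, second_wave)
```

With a single slot in dma_pool, only the higher-priority load_a starts in the second wave; load_b waits until the slot is released.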

The Scheduler works over the set of all DAGs and all tasks inside those DAGs.

For the Scheduler to start working with a DAG, the DAG needs a schedule:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='@hourly')

There is a set of ready-made presets: @once, @hourly, @daily, @weekly, @monthly, @yearly.

You can also use cron expressions:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='*/10 * * * *')
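For readers less familiar with cron, here is a small standard-library sketch of what such schedules mean. The PRESETS dictionary lists the cron equivalents of the presets as I understand them, and next_minute_step is a hypothetical helper that handles only the '*/N * * * *' form; Airflow delegates real cron parsing to a dedicated library.

```python
from datetime import datetime, timedelta

# Cron equivalents of the presets; "@once" has no cron form because it runs a single time.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def next_minute_step(after, step):
    """Next moment strictly after `after` whose minute is a multiple of `step` (cron '*/step * * * *')."""
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute % step != 0:
        candidate += timedelta(minutes=1)
    return candidate


tick = next_minute_step(datetime(2018, 1, 1, 12, 3, 30), step=10)
print(tick)
```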

Execution Date

To understand how Airflow works, it is important to understand what the Execution Date of a DAG is. In Airflow a DAG has an Execution Date dimension: depending on the DAG's schedule, task instances are created for each Execution Date. For each Execution Date the tasks can be re-run, and, for example, a DAG can run for several Execution Dates at the same time. This is shown clearly here:

[Image: DAG runs for several Execution Dates]
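A rough way to picture the Execution Date dimension is to list the dates for which runs would exist. The sketch below is plain Python, and the function name is hypothetical. It assumes the classic Airflow convention that a run stamped with Execution Date D covers the interval starting at D and is created only once that interval has ended; check this against the version you use.

```python
from datetime import datetime, timedelta


def execution_dates(start_date, now, interval):
    """Execution dates whose schedule interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    # A run for execution date D covers [D, D + interval) and appears once that window has closed.
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


runs = execution_dates(
    start_date=datetime(2018, 1, 1),
    now=datetime(2018, 1, 4, 6, 0),
    interval=timedelta(days=1),
)
print(runs)
```

Each date in the list gets its own set of task instances, which is why one day can be re-run without touching the others.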

Unfortunately (or perhaps fortunately, depending on the situation), if the implementation of a task in a DAG is changed, a run for an earlier Execution Date will use the changed code. This is good if you need to recompute data for past periods with a new algorithm, but bad because the result is no longer reproducible. (Of course, nothing stops you from checking out the required version of the source code from Git and computing what you need once, the way you need it.)

Generating tasks

A DAG is implemented as Python code, so we have a very convenient way to reduce the amount of code when working with, for example, sharded sources. Suppose you have three MySQL shards as a source, and you need to visit each one and pull some data, independently and in parallel. The Python code in the DAG might look like this:

connection_list = lv.get('connection_list')

export_profiles_sql = '''
SELECT
  id,
  user_id,
  nickname,
  gender,
  {{params.shard_id}} as shard_id
FROM profiles
'''

for conn_id in connection_list:
    export_profiles = SqlToHiveViaHdfsTransfer(
        task_id='export_profiles_from_' + conn_id,
        sql=export_profiles_sql,
        hive_table='stg.profiles',
        overwrite=False,
        tmpdir='/data/tmp',
        conn_id=conn_id,
        params={'shard_id': conn_id[-1:], },
        compress=None,
        dag=dag
    )
    export_profiles.set_upstream(exec_truncate_stg)
    export_profiles.set_downstream(load_profiles)

The DAG looks like this:

[Image: DAG with one export task per shard]

With this approach you can add or remove a shard simply by adjusting the setting and updating the DAG. Convenient!
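To see what the loop produces without running Airflow, here is a stripped-down sketch. The connection ids are hypothetical, and render is a toy stand-in for the template engine that handles only {{params.<name>}} placeholders.

```python
# Hypothetical connection ids; in the DAG above they come from a setting.
connection_list = ["mysql_shard_1", "mysql_shard_2", "mysql_shard_3"]

sql_template = "SELECT id, {{params.shard_id}} AS shard_id FROM profiles"


def render(template, params):
    """Toy stand-in for template rendering: substitutes {{params.<name>}} placeholders."""
    for name, value in params.items():
        template = template.replace("{{params." + name + "}}", str(value))
    return template


# One entry per shard: task id -> SQL that the task would run.
generated = {
    "export_profiles_from_" + conn_id: render(sql_template, {"shard_id": conn_id[-1:]})
    for conn_id in connection_list
}

for task_id, sql in generated.items():
    print(task_id, "->", sql)
```

Note that conn_id[-1:] takes only the last character, so this naming scheme distinguishes at most ten shards.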

You can also use more complex code generation: for example, work with sources described in a database, or describe a table structure and an algorithm for processing the table and, taking the DWH infrastructure into account, generate a process that loads N tables into your storage. Or, for example, when working with an API that does not accept a list as a parameter, you can generate N tasks in the DAG from that list, limit the parallelism of requests to the API with a pool, and pull the required data from the API. Flexible!
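The effect of limiting API parallelism with a pool can be imitated in plain Python. In this sketch a semaphore plays the role of the pool; fetch_one, the slot count and the fake response are hypothetical, and no real API is called.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical slot limit of the pool guarding the API
slots = threading.Semaphore(POOL_SLOTS)
lock = threading.Lock()
stats = {"active": 0, "max_active": 0}


def fetch_one(item_id):
    """Pretend API call that accepts a single id; the semaphore limits concurrent calls."""
    with slots:
        with lock:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
        response = {"id": item_id, "value": item_id * 10}  # fake response
        with lock:
            stats["active"] -= 1
    return response


item_ids = [1, 2, 3, 4, 5, 6]
# More workers than slots: the semaphore, not the executor, enforces the limit.
with ThreadPoolExecutor(max_workers=6) as executor:
    responses = list(executor.map(fetch_one, item_ids))

print(len(responses), "responses, at most", stats["max_active"], "concurrent calls")
```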

Repository

Airflow has its own backend repository, a database (MySQL or Postgres; we use Postgres), which stores task states, DAGs, connection settings, global variables, and so on. I would like to note that the repository in Airflow is very simple (about 20 tables) and convenient if you want to build your own processes on top of it. I remember the proverbial 100500 tables in the Informatica repository, which had to be studied for a long time before you understood how to build a query.

Monitoring

Given the simplicity of the repository, you can build a task monitoring process that suits you. We use a notebook in Zeppelin, where we look at the state of tasks:

[Image: task status notebook in Zeppelin]
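A monitoring query over the repository might look roughly like the sketch below. It runs against an in-memory SQLite table so that it is self-contained; the table and column names are modeled loosely on Airflow's task_instance table and should be checked against the schema of your version, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("dma_load", "export_profiles", "2018-01-01", "success"),
        ("dma_load", "load_profiles", "2018-01-01", "failed"),
        ("dma_load", "export_orders", "2018-01-01", "success"),
        ("stg_load", "truncate", "2018-01-01", "running"),
    ],
)

# Count task instances per DAG and state for one execution date.
rows = conn.execute(
    """
    SELECT dag_id, state, COUNT(*)
    FROM task_instance
    WHERE execution_date = ?
    GROUP BY dag_id, state
    ORDER BY dag_id, state
    """,
    ("2018-01-01",),
).fetchall()

for dag_id, state, count in rows:
    print(dag_id, state, count)
```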

It can also be the Airflow web interface itself:

[Image: task status in the Airflow web interface]

The Airflow code is open source, so we added alerting to Telegram. Every running task instance, if an error occurs, posts to a Telegram group that includes the whole development and support team.

We get a rapid response via Telegram (when needed), and through Zeppelin we get an overall picture of the tasks in Airflow.
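The article does not show how the Telegram alerting was wired in, so the following is only one possible sketch. It formats an alert from a hypothetical context dictionary and prepares, but does not send, a request to the Telegram Bot API sendMessage method. The token, chat id and context keys are placeholders; in Airflow such a function could be attached through the on_failure_callback parameter.

```python
import json
import urllib.request


def format_alert(context):
    """Build the alert text from a dictionary describing the failed task (hypothetical keys)."""
    return "Task failed: {dag_id}.{task_id} for {execution_date}\n{error}".format(**context)


def build_telegram_request(bot_token, chat_id, text):
    """Prepare (but do not send) a Bot API sendMessage request."""
    url = "https://api.telegram.org/bot{}/sendMessage".format(bot_token)
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


context = {
    "dag_id": "dma_load",
    "task_id": "load_profiles",
    "execution_date": "2018-01-01",
    "error": "ValueError: stg.profiles is empty",
}
request = build_telegram_request("TOKEN_PLACEHOLDER", "CHAT_ID_PLACEHOLDER", format_alert(context))
print(request.get_method(), request.full_url)
# Sending would be urllib.request.urlopen(request); it is omitted so the sketch runs offline.
```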

Summary

Airflow is first of all open source, and you should not expect miracles from it. Be prepared to invest time and effort to build a working solution. The goal is achievable, and believe me, it is worth it: development speed, flexibility, ease of adding new processes. You will like it. Of course, you need to pay close attention to the organization of the project and to the stability of Airflow itself: miracles do not happen.

Right now our Airflow runs about 6.5 thousand tasks every day. They vary a lot in nature. There are tasks that load data into the main DWH from many different and very specific sources, tasks that compute data marts inside the main DWH, tasks that publish data to the fast DWH, and many, many other tasks, and Airflow chews through them all day after day. In numbers, this is 2.3 thousand ELT tasks of varying complexity inside the DWH (Hadoop), about 2.5 hundred source databases, and a team of 4 ETL developers, split between ETL data processing into the DWH and ELT data processing inside the DWH, plus one admin who looks after the service infrastructure.

Plans for the future

The number of processes is inevitably growing, and the main thing we will be doing with the Airflow infrastructure is scaling. We want to build an Airflow cluster, dedicate a couple of nodes to Celery workers, and set up a redundant head node with the job scheduling processes and the repository.

Epilogue

This is, of course, far from everything I would like to tell you about Airflow, but I have tried to highlight the main points. Appetite comes with eating: try it and you will like it :)

Source: www.habr.com
