Hi, Habr! In this article I want to talk about one great tool for developing batch data processing workflows, for example in the infrastructure of a corporate DWH or your DataLake. The tool is Apache Airflow (hereafter Airflow). It gets undeservedly little attention on Habr, and in the main part I will try to convince you that Airflow is at least worth a look when you choose a scheduler for your ETL/ELT processes.
Previously, I wrote a series of articles on DWH topics while working at Tinkoff Bank. Now I have joined the Mail.Ru Group team and am building a data analytics platform in the games division. As news and interesting solutions appear, my team and I will write here about our data analytics platform.
Prologue
So, let's begin. What is Airflow? It is a library (or
Now let's look at the main entities of Airflow. Once you understand their essence and purpose, you can design your process architecture well. Perhaps the key entity is the Directed Acyclic Graph (hereafter DAG).
DAG
A DAG is a meaningful grouping of your tasks that you want to run in a strictly defined order according to a specific schedule. Airflow provides a convenient web interface for working with DAGs and other entities:
A DAG might look like this:
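In code terms, the "strictly defined order" of a DAG is a dependency graph without cycles. The following is a minimal sketch using only the Python standard library, not Airflow itself; the task names are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "truncate_stg": set(),
    "export_profiles": {"truncate_stg"},
    "export_payments": {"truncate_stg"},
    "load_dwh": {"export_profiles", "export_payments"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two export tasks have no dependency on each other, so their relative order is not fixed; that independence is exactly what lets a scheduler run them in parallel.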
When designing a DAG, the developer chooses the set of operators on which the tasks inside the DAG will be built. This brings us to another important entity: the Airflow Operator.
Operators
An operator is the entity from which task instances are created; it describes what will happen when a task instance runs.
- BashOperator - an operator for executing a bash command.
- PythonOperator - an operator for calling Python code.
- EmailOperator - an operator for sending email.
- HTTPOperator - an operator for working with HTTP requests.
- SqlOperator - an operator for executing SQL code.
- Sensor - an operator that waits for an event (the arrival of a required time, the appearance of a required file, a row in a database, a response from an API, and so on).
There are also more specific operators: DockerOperator, HiveOperator, S3FileTransferOperator, PrestoToMysqlOperator, SlackOperator.
You can also develop operators for your own needs and use them in your project. For example, we created MongoDBToHiveViaHdfsTransfer, an operator for exporting documents from MongoDB to Hive, and several more operators of our own.
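The relationship between an operator and its task instances can be sketched in plain Python. This is a toy model, not Airflow's classes: EchoOperator and this TaskInstance are hypothetical stand-ins, although real Airflow operators do implement an execute(context) method in a similar spirit.

```python
from dataclasses import dataclass
from datetime import date


class EchoOperator:
    """Toy operator: a reusable description of WHAT to do, not a single run."""

    def __init__(self, task_id, message):
        self.task_id = task_id
        self.message = message

    def execute(self, context):
        return f"{self.message} for {context['execution_date']}"


@dataclass
class TaskInstance:
    """One concrete run of an operator for one date."""

    operator: EchoOperator
    execution_date: date
    state: str = "scheduled"

    def run(self):
        context = {"execution_date": self.execution_date.isoformat()}
        result = self.operator.execute(context)
        self.state = "success"
        return result


op = EchoOperator(task_id="say_hello", message="hello")
instances = [TaskInstance(op, date(2018, 1, day)) for day in (1, 2)]
results = [ti.run() for ti in instances]
print(results)
```

One operator object yields as many task instances as there are dates to process, and each instance carries its own state.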
Next, all these task instances need to be executed, so now we will talk about the scheduler.
Scheduler
The Airflow task scheduler is built on
Each pool has a limit on the number of slots. When you create a DAG, it is assigned a pool:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable

# Settings stored as Airflow Variables
ALERT_MAILS = Variable.get("gv_mail_admin_dwh")
DAG_NAME = 'dma_load'
OWNER = 'Vasya Pupkin'
DEPENDS_ON_PAST = True
EMAIL_ON_FAILURE = True
EMAIL_ON_RETRY = True
RETRIES = int(Variable.get('gv_dag_retries'))
POOL = 'dma_pool'
PRIORITY_WEIGHT = 10

# Start date: yesterday at midnight
start_dt = datetime.today() - timedelta(1)
start_dt = datetime(start_dt.year, start_dt.month, start_dt.day)

# Defaults applied to every task in the DAG
default_args = {
    'owner': OWNER,
    'depends_on_past': DEPENDS_ON_PAST,
    'start_date': start_dt,
    'email': ALERT_MAILS,
    'email_on_failure': EMAIL_ON_FAILURE,
    'email_on_retry': EMAIL_ON_RETRY,
    'retries': RETRIES,
    'pool': POOL,
    'priority_weight': PRIORITY_WEIGHT
}

dag = DAG(DAG_NAME, default_args=default_args)
dag.doc_md = __doc__
The pool defined at the DAG level can be overridden at the task level.
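The override rule can be pictured as a dictionary merge in which task-level arguments win over the DAG-level defaults. This is only an illustration of the idea, not Airflow's internal code; the helper name, the 'api_pool' pool and the retries value are hypothetical.

```python
def effective_task_args(default_args, **task_args):
    """Merge DAG-level defaults with task-level arguments; the task level wins."""
    merged = dict(default_args)
    merged.update(task_args)
    return merged


default_args = {"pool": "dma_pool", "priority_weight": 10, "retries": 3}

# A task that talks to a rate-limited API gets its own (hypothetical) pool.
api_task = effective_task_args(default_args, pool="api_pool")
# A task without overrides simply inherits the DAG-level pool.
plain_task = effective_task_args(default_args)
print(api_task["pool"], plain_task["pool"])
```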
A separate process, the Scheduler, is responsible for scheduling all tasks in Airflow. In fact, the Scheduler handles all the mechanics of putting tasks up for execution. A task goes through several stages before it is executed:
- The previous tasks in the DAG have completed, so the new one can be queued.
- The queue is ordered by task priority (priorities can also be controlled), and if there is a free slot in the pool, the task can be picked up.
- If there is a free Celery worker, the task is sent to it, and the work you programmed in the task begins, using one operator or another.
Simple enough.
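The three stages above can be sketched as a toy function. This is a deliberately simplified model under stated assumptions, not the real Scheduler: the schedule function, its arguments and the task names are all hypothetical, and workers are assumed to be available.

```python
import heapq


def schedule(tasks, upstream, pool_slots, done=()):
    """Pick the tasks to start now.

    tasks: {task_id: priority_weight}
    upstream: {task_id: set of task_ids that must finish first}
    pool_slots: number of free slots in the pool
    done: task_ids that have already finished
    """
    done = set(done)
    # Stage 1: a task may be queued only when all of its upstream tasks are done.
    queue = [
        (-weight, task_id)
        for task_id, weight in tasks.items()
        if task_id not in done and upstream.get(task_id, set()) <= done
    ]
    heapq.heapify(queue)
    # Stage 2: highest priority first, limited by the free pool slots.
    started = []
    while queue and len(started) < pool_slots:
        _, task_id = heapq.heappop(queue)
        # Stage 3: here the task would be handed to a free worker.
        started.append(task_id)
    return started


tasks = {"a": 1, "b": 10, "c": 5, "d": 7}
upstream = {"d": {"a"}}
first_wave = schedule(tasks, upstream, pool_slots=2)
after_a = schedule(tasks, upstream, pool_slots=2, done={"a"})
print(first_wave, after_a)
```

Task "d" has a higher weight than "c" but cannot start until "a" is done, which is why it only appears in the second call.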
The Scheduler works on the set of all DAGs and all tasks inside those DAGs.
For the Scheduler to start working with a DAG, the DAG needs a schedule:
dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='@hourly')
There is a set of ready-made presets: @once, @hourly, @daily, @weekly, @monthly, @yearly.
You can also use cron expressions:
dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='*/10 * * * *')
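To make the two notations easier to compare, here is a small standard-library sketch. The preset-to-cron mapping follows the Airflow documentation as far as I know; the minutes_matched helper is hypothetical and understands only the minute field in the forms '*', '*/N' and a single number. @once is left out because it has no cron equivalent.

```python
# Cron equivalents of the schedule presets.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def normalize(schedule):
    """Translate a preset into its cron expression; pass cron strings through."""
    return PRESET_TO_CRON.get(schedule, schedule)


def minutes_matched(cron_expression):
    """Expand the minute field (the first of five) into concrete minutes."""
    minute_field = cron_expression.split()[0]
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        return list(range(0, 60, int(minute_field[2:])))
    return [int(minute_field)]


every_ten = minutes_matched(normalize("*/10 * * * *"))
hourly = minutes_matched(normalize("@hourly"))
print(every_ten, hourly)
```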
Execution Date
To understand how Airflow works, it is important to understand what the Execution Date means for a DAG. In Airflow, a DAG has an Execution Date dimension: depending on the DAG's schedule, task instances are created for each Execution Date. For each Execution Date, tasks can be re-run, and a DAG can, for example, run for several Execution Dates at the same time. This is clearly shown here:
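The Execution Date dimension can be modelled as a grid of (task, date) pairs. The sketch below uses only the standard library and assumes a fixed daily interval; the task names and the state strings are illustrative rather than taken from Airflow.

```python
from datetime import datetime, timedelta


def execution_dates(start, end, interval):
    """Yield every Execution Date in [start, end) for a fixed interval."""
    current = start
    while current < end:
        yield current
        current += interval


task_ids = ["extract", "load"]
dates = list(
    execution_dates(datetime(2018, 1, 1), datetime(2018, 1, 4), timedelta(days=1))
)

# One task instance per (task, Execution Date) pair.
task_instances = {(task_id, d): "success" for d in dates for task_id in task_ids}

# Re-running one Execution Date resets only that date's instances.
rerun_date = datetime(2018, 1, 2)
for task_id in task_ids:
    task_instances[(task_id, rerun_date)] = "scheduled"

pending = sorted(key for key, state in task_instances.items() if state == "scheduled")
print(len(task_instances), pending)
```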
Unfortunately (or perhaps fortunately, depending on the situation), if the implementation of a task in a DAG is changed, runs for earlier Execution Dates will use the changed code. This is good if you need to recompute data for past periods with a new algorithm, but bad because the reproducibility of the result is lost (of course, nobody stops you from checking out the required version of the source code from Git and computing what you need once, the way you need it).
Generating tasks
The DAG implementation is Python code, so we have a very convenient way to reduce the amount of code when working, for example, with sharded sources. Suppose you have three MySQL shards as a source, and you need to visit each one and pick up some data, independently and in parallel. The Python code in the DAG might look like this:
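# lv, exec_truncate_stg and load_profiles are defined elsewhere in the DAG file (not shown)
connection_list = lv.get('connection_list')

# The query is templated: shard_id is filled in from params for each task
export_profiles_sql = '''
SELECT
    id,
    user_id,
    nickname,
    gender,
    {{params.shard_id}} as shard_id
FROM profiles
'''

# One export task per shard connection
for conn_id in connection_list:
    export_profiles = SqlToHiveViaHdfsTransfer(
        task_id='export_profiles_from_' + conn_id,
        sql=export_profiles_sql,
        hive_table='stg.profiles',
        overwrite=False,
        tmpdir='/data/tmp',
        conn_id=conn_id,
        params={'shard_id': conn_id[-1:], },
        compress=None,
        dag=dag
    )
    # Each export runs after the staging truncate and before the load
    export_profiles.set_upstream(exec_truncate_stg)
    export_profiles.set_downstream(load_profiles)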
The resulting DAG looks like this:
In this case, you can add or remove a shard simply by adjusting the setting and updating the DAG. Convenient!
You can also use more complex code generation. For example, you can describe sources in a database, or describe a table structure and the algorithm for working with that table, and then, taking the specifics of your DWH infrastructure into account, generate the process that loads N tables into your storage. Or, when working with an API that does not accept a list as a parameter, you can generate N tasks in a DAG from that list, limit the parallelism of API requests with a pool, and pull the required data from the API. Flexible!
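The API case can be sketched with the standard library: one unit of work per list item, with a cap on how many run at once. Here a thread pool stands in for an Airflow pool, which is an assumption made only for illustration; fetch_item, the item IDs and the slot count are hypothetical, and no real API is called.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical limit on parallel API requests
item_ids = [101, 102, 103, 104, 105]  # hypothetical list the API cannot take at once

_lock = threading.Lock()
_running = 0
peak_running = 0


def fetch_item(item_id):
    """Stand-in for one generated task that requests a single item."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        return {"id": item_id, "payload": f"data-{item_id}"}
    finally:
        with _lock:
            _running -= 1


# One "task" per list item, never more than POOL_SLOTS at the same time.
with ThreadPoolExecutor(max_workers=POOL_SLOTS) as pool:
    results = list(pool.map(fetch_item, item_ids))

print(len(results), peak_running <= POOL_SLOTS)
```

Adding an item to the list adds a task without touching the rest of the code, which mirrors the shard example above.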
Repository
Airflow has its own backend repository, a database (it can be MySQL or Postgres; we use Postgres), which stores task states, DAGs, connection settings, global variables, and so on. I would like to point out that the Airflow repository is very simple (about 20 tables) and convenient if you want to build your own processes on top of it. I still remember the 100500 tables in the Informatica repository, which had to be studied for a long time before you understood how to build a query.
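As an illustration of building something on top of the repository, the sketch below counts task instances by state. It runs against an in-memory SQLite table that is a heavily simplified stand-in: the table and column names are assumptions modelled on Airflow's task_instance table, whose real schema has more columns and differs between versions, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the repository table (assumed columns only).
conn.execute(
    """
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT
    )
    """
)
# Made-up sample rows for illustration.
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("dma_load", "export_profiles", "2018-01-01", "success"),
        ("dma_load", "load_profiles", "2018-01-01", "failed"),
        ("dma_load", "export_profiles", "2018-01-02", "success"),
        ("dma_load", "load_profiles", "2018-01-02", "running"),
    ],
)


def state_counts(connection, dag_id):
    """Return {state: number of task instances} for one DAG."""
    rows = connection.execute(
        "SELECT state, COUNT(*) FROM task_instance "
        "WHERE dag_id = ? GROUP BY state ORDER BY state",
        (dag_id,),
    )
    return dict(rows.fetchall())


counts = state_counts(conn, "dma_load")
print(counts)
```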
Monitoring
Given the simplicity of the repository, you can build a task monitoring process that suits you. We use a notebook in Zeppelin, where we look at the status of tasks:
It can also be the web interface of Airflow itself:
The Airflow code is open source, so we added alerting to Telegram. Every running task instance, if an error occurs, posts to a Telegram group that includes the entire development and support team.
Through Telegram we get a prompt reaction (when one is needed), and through Zeppelin we get the overall picture of tasks in Airflow.
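One possible shape for such an alert is sketched below. This is not the author's implementation (the article says they changed Airflow's own code); it is a standalone illustration that prepares a request to the Telegram Bot API sendMessage method. The context dictionary with dag_id, task_id and execution_date keys is a simplifying assumption, and the token and chat ID are placeholders.

```python
from urllib import parse, request


def format_alert(context):
    """Build the alert text from a failure context (a plain dict in this sketch)."""
    return "Task failed: {dag_id}.{task_id} for {execution_date}".format(**context)


def build_telegram_request(token, chat_id, text):
    """Prepare (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return request.Request(url, data=data, method="POST")


def notify_failure(context, token, chat_id, send=request.urlopen):
    """Send the alert; `send` is injectable so the sketch can be tried offline."""
    return send(build_telegram_request(token, chat_id, format_alert(context)))


# Offline usage: capture the request instead of sending it.
captured = []
notify_failure(
    {"dag_id": "dma_load", "task_id": "load_profiles", "execution_date": "2018-01-01"},
    token="PLACEHOLDER_TOKEN",
    chat_id="PLACEHOLDER_CHAT",
    send=captured.append,
)
print(captured[0].full_url)
```

In Airflow, a function like notify_failure could plausibly be wired in through a task's failure callback, with the real context object adapted to the keys used here.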
Summary
Airflow is first and foremost open source, and you should not expect miracles from it. Be prepared to invest time and effort to build a working solution. The goal is achievable and, believe me, it is worth it. Speed of development, flexibility, ease of adding new processes: you will like it. Of course, you need to pay close attention to the organization of the project and to the stability of Airflow itself: miracles do not happen.
Right now our Airflow runs about 6.5 thousand tasks every day. They are quite different in nature. There are tasks that load data into the main DWH from many different and very specific sources, tasks that compute data marts inside the main DWH, tasks that publish data to a fast DWH, and many, many other tasks, and Airflow chews through them all day after day. In numbers, that is 2.3 thousand ELT tasks of varying complexity inside the DWH (Hadoop), about 250 source databases, and a team of 4 ETL developers, split between ETL data processing into the DWH and ELT data processing inside the DWH, plus one admin who takes care of the service infrastructure.
Plans for the future
The number of processes is inevitably growing, and the main thing we will be doing with the Airflow infrastructure is scaling it. We want to build an Airflow cluster, allocate a pair of nodes to the Celery workers, and set up a redundant head node with the job scheduling processes and the repository.
Epilogue
This, of course, is not everything I would like to tell you about Airflow, but I have tried to highlight the main points. Appetite comes with eating, so try it and you will like it :)
Source: www.habr.com