Molo, Habr! Kule nqaku ndifuna ukuthetha ngesixhobo esinye esikhulu sokuphuhlisa iinkqubo zokucwangcisa idatha ye-batch, umzekelo, kwisiseko se-DWH yenkampani okanye i-DataLake yakho. Siza kuthetha ngeApache Airflow (emva koku kuthiwa yiAirflow). Akunabulungisa ukuvinjwa ingqalelo kwi-Habré, kwaye inxalenye ephambili ndiya kuzama ukukuqinisekisa ukuba ubuncinci i-Airflow ifanelekile ukujonga xa ukhetha umcwangcisi weenkqubo zakho ze-ETL / ELT.
Ngaphambili, ndibhale uthotho lwamanqaku ngesihloko se-DWH xa ndandisebenza kwiBhanki yaseTinkoff. Ngoku ndibe yinxalenye yeqela le-Mail.Ru yeQela kwaye ndiphuhlisa iqonga lokuhlalutya idatha kwindawo yokudlala. Ngokwenyani, njengoko iindaba kunye nezisombululo ezinomdla zivela, iqela lam kunye nam siza kuthetha apha malunga neqonga lethu lohlalutyo lwedatha.
Isingeniso
Ngoko, masiqale. Yintoni iAirflow? Eli lithala leencwadi (okanye
Ngoku makhe sijonge amaziko aphambili eAirflow. Ngokuqonda undoqo kunye nenjongo yabo, unokucwangcisa ngokufanelekileyo inkqubo yakho yokwakha. Mhlawumbi elona qumrhu liziko yiDirected Acyclic Graph (emva koku kubhekiselwa kuyo njengeDAG).
Dag
I-DAG lunxulumano olunentsingiselo lwemisebenzi yakho ofuna ukuyigqiba ngolandelelwano oluchazwe ngokungqongqo ngokweshedyuli ethile. Ukuhamba komoya kubonelela ngojongano lwewebhu olufanelekileyo lokusebenza neeDAGs kunye namanye amaziko:
I-DAG inokujongeka ngolu hlobo:
Umphuhlisi, xa eyila iDAG, ubeka phantsi uluhlu lwabaqhubi apho imisebenzi ngaphakathi kweDAG iya kwakhiwa. Apha siza kwelinye iqumrhu elibalulekileyo: I-Airflow Operator.
Ba sebenzi
Umsebenzisi liqumrhu elisekelwe ekubeni iimeko zemisebenzi zidalwe, nto leyo echaza okuya kwenzeka ngexesha lokwenziwa komsebenzi.
- I-BashOperator - umqhubi wokuphumeza umyalelo we-bash.
- I-PythonOperator-umqhubi wokufowunela ikhowudi yePython.
- EmailOperator — umsebenzisi ngokuthumela i-imeyile.
- HTTPOperator - umqhubi ukusebenza kunye http izicelo.
- SqlOperator - umqhubi wokuphumeza ikhowudi yeSQL.
- I-Sensor ngumqhubi wokulinda isiganeko (ukufika kwexesha elifunekayo, ukubonakala kwefayile efunekayo, umgca kwi-database, impendulo evela kwi-API, njl. njl.).
Kukho abaqhubi abathile ngakumbi: i-DockerOperator, i-HiveOperator, i-S3FileTransferOperator, i-PrestoToMysqlOperator, i-SlackOperator.
Unokuphuhlisa abaqhubi ngokusekelwe kwiimpawu zakho kwaye uzisebenzise kwiprojekthi yakho. Umzekelo, senze i-MongoDBToHiveViaHdfsTransfer, umqhubi wokuthumela ngaphandle amaxwebhu ukusuka kwi-MongoDB ukuya kwi-Hive, kunye nabaqhubi abaninzi ekusebenzeni nabo.
Emva koko, zonke ezi ziganeko zemisebenzi kufuneka zenziwe, kwaye ngoku siza kuthetha malunga nomcwangcisi.
Umcwangcisi
Umcwangcisi womsebenzi weAirflow yakhelwe phezu kwayo
Idama ngalinye linomda kwinani leendawo zokubeka. Xa usenza iDAG, inikwa ichibi:
ALERT_MAILS = Variable.get("gv_mail_admin_dwh")
DAG_NAME = 'dma_load'
OWNER = 'Vasya Pupkin'
DEPENDS_ON_PAST = True
EMAIL_ON_FAILURE = True
EMAIL_ON_RETRY = True
RETRIES = int(Variable.get('gv_dag_retries'))
POOL = 'dma_pool'
PRIORITY_WEIGHT = 10
start_dt = datetime.today() - timedelta(1)
start_dt = datetime(start_dt.year, start_dt.month, start_dt.day)
default_args = {
'owner': OWNER,
'depends_on_past': DEPENDS_ON_PAST,
'start_date': start_dt,
'email': ALERT_MAILS,
'email_on_failure': EMAIL_ON_FAILURE,
'email_on_retry': EMAIL_ON_RETRY,
'retries': RETRIES,
'pool': POOL,
'priority_weight': PRIORITY_WEIGHT
}
dag = DAG(DAG_NAME, default_args=default_args)
dag.doc_md = __doc__
I-pool echazwe kumgangatho we-DAG inokukhutshwa kwinqanaba lomsebenzi.
Inkqubo eyahlukileyo, uMcwangcisi, unoxanduva lokucwangcisa yonke imisebenzi kwi-Airflow. Ngokwenyani, uMcwangcisi ujongana nabo bonke oomatshini bokuseta imisebenzi ukuze yenziwe. Umsebenzi uhamba ngamanqanaba amaninzi ngaphambi kokuba wenziwe:
- Imisebenzi yangaphambili igqityiwe kwi-DAG, omnye omtsha unokufoliswa.
- Umgca uhlelwe ngokuxhomekeke kwizinto eziphambili zemisebenzi (izinto eziphambili nazo zingalawulwa), kwaye ukuba kukho i-slot yamahhala echibini, umsebenzi ungathathwa usebenze.
- Ukuba kukho umsebenzi wamahhala we-celery, umsebenzi uthunyelwa kuwo; umsebenzi owucwangcise kwingxaki uqala, usebenzisa omnye okanye omnye umsebenzisi.
Elula ngokwaneleyo.
Umcwangcisi uqhuba kwiseti yazo zonke iiDAGs kunye nayo yonke imisebenzi ngaphakathi kweeDAG.
Ukuze uMcwangcisi aqalise ukusebenza neDAG, iDAG kufuneka imisele ishedyuli:
dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='@hourly')
Kukho iiseti esele zilungisiwe: @once
, @hourly
, @daily
, @weekly
, @monthly
, @yearly
.
Ungasebenzisa kwakhona amagama e-cron:
dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='*/10 * * * *')
Umhla woKwenziwa
Ukuqonda indlela ukuqukuqela komoya okusebenza ngayo, kubalulekile ukuqonda ukuba ngowuphi uMhla woKuphumeza weDAG. Kwi-Airflow, i-DAG inoMhla wokuSebenzisa umlinganiselo, oko kukuthi, ngokuxhomekeke kwishedyuli yomsebenzi we-DAG, iimeko zemisebenzi zidalwe kuMhla woKwenziwa ngalunye. Kwaye kuMhla woKwenziwa ngalunye, imisebenzi inokuphinda iphunyezwe-okanye, umzekelo, iDAG inokusebenza ngaxeshanye kwiMihla yokuPhunyezwa emininzi. Oku kuboniswe ngokucacileyo apha:
Ngelishwa (okanye mhlawumbi ngethamsanqa: kuxhomekeke kwimeko), ukuba ukuphunyezwa komsebenzi kwi-DAG kulungiswa, ngoko ukuphunyezwa kuMhla woKuphunyezwa kwangaphambili kuya kuqhubeka kuthathelwa ingqalelo uhlengahlengiso. Oku kuhle ukuba ufuna ukubala kwakhona idata kumaxesha adlulileyo usebenzisa i-algorithm entsha, kodwa kubi kuba ukuveliswa kwakhona kwesiphumo kulahlekile (kakade, akukho mntu ukukhathazayo ukubuyisela uguqulelo olufunekayo lwekhowudi yemvelaphi esuka kwiGit kwaye ubale ukuba yintoni udinga ixesha elinye, ngendlela olifuna ngayo).
Ukuvelisa imisebenzi
Ukuphunyezwa kweDAG yikhowudi kwiPython, ngoko sinendlela elula kakhulu yokunciphisa inani lekhowudi xa usebenza, umzekelo, kunye nemithombo edibeneyo. Masithi unama-shards amathathu e-MySQL njengomthombo, kufuneka ukhwele kwindawo nganye kwaye uthathe idatha ethile. Ngaphezu koko, ngokuzimeleyo nangokuhambelanayo. Ikhowudi yePython kwiDAG inokujongeka ngolu hlobo:
connection_list = lv.get('connection_list')
export_profiles_sql = '''
SELECT
id,
user_id,
nickname,
gender,
{{params.shard_id}} as shard_id
FROM profiles
'''
for conn_id in connection_list:
export_profiles = SqlToHiveViaHdfsTransfer(
task_id='export_profiles_from_' + conn_id,
sql=export_profiles_sql,
hive_table='stg.profiles',
overwrite=False,
tmpdir='/data/tmp',
conn_id=conn_id,
params={'shard_id': conn_id[-1:], },
compress=None,
dag=dag
)
export_profiles.set_upstream(exec_truncate_stg)
export_profiles.set_downstream(load_profiles)
I-DAG ibonakala ngolu hlobo:
Kule meko, unokongeza okanye ukususa i-shard ngokulungisa ngokulula izicwangciso kunye nokuhlaziya i-DAG. Ukhululekile!
Ungasebenzisa kwakhona ukuveliswa kwekhowudi eyinkimbinkimbi, umzekelo, usebenze kunye nemithombo ngendlela yesiseko sedatha okanye uchaze isakhiwo setafile, i-algorithm yokusebenza kunye netafile, kwaye, ngokuqwalasela iimpawu zesiseko se-DWH, ukuvelisa inkqubo. yokulayisha iitafile ze-N kwindawo yakho yokugcina. Okanye, umzekelo, ukusebenza kunye ne-API engaxhasi ukusebenza ngeparameter ngendlela yoluhlu, unokuvelisa imisebenzi ye-N kwi-DAG kolu luhlu, ukunciphisa ukuhambelana kwezicelo kwi-API ukuya echibini, kwaye ukhuhle. idatha efunekayo kwi-API. Ubhetyebhetye!
indawo yokugcina
I-Airflow ine-backend repository yayo, i-database (ingaba yi-MySQL okanye i-Postgres, sine-Postgres), egcina iindawo zemisebenzi, ii-DAG, izicwangciso zoqhagamshelwano, izinto eziguquguqukayo zehlabathi, njl. njl. Apha ndingathanda ukuthi indawo yokugcina kwiAirflow ilula kakhulu (malunga neetafile ezingama-20) kwaye ifanelekile ukuba ufuna ukwakha naziphi na iinkqubo zakho ngaphezulu kwayo. Ndikhumbula iitafile ze-100500 kwindawo yokugcina i-Informatica, ekwakufuneka ifundwe ixesha elide ngaphambi kokuqonda indlela yokwakha umbuzo.
Ukubeka iliso
Ngenxa yokulula kwendawo yokugcina, unokwakha inkqubo yokubeka iliso yomsebenzi ekulungeleyo. Sisebenzisa i-notepad eZeppelin, apho sijonga ubume bemisebenzi:
Oku kunokuba lujongano lwewebhu lwe-Airflow ngokwayo:
Ikhowudi ye-Airflow ngumthombo ovulekileyo, ngoko songeze isilumkiso kwiTelegram. Umzekelo ngamnye osebenzayo womsebenzi, ukuba impazamo iyenzeka, i-spams iqela kwiTelegram, apho lonke uphuhliso kunye neqela lenkxaso liqukethe.
Sifumana impendulo ekhawulezileyo ngeTelegram (ukuba iyafuneka), kwaye ngeZeppelin sifumana umfanekiso opheleleyo wemisebenzi kwi-Airflow.
Iyonke
Ukuhamba komoya kuqala ngumthombo ovulekileyo, kwaye akufanele ulindele imimangaliso kuwo. Zilungiselele ukubeka ixesha kunye nomgudu wokwakha isisombululo esisebenzayo. Injongo iyafezekiswa, ndikholelwe, ifanelekile. Isantya sophuhliso, ukuguquguquka, lula ukongeza iinkqubo ezintsha - uya kuyithanda. Ngokuqinisekileyo, kufuneka uhlawule ingqwalasela enkulu kwintlangano yeprojekthi, ukuzinza kwe-Airflow ngokwayo: imimangaliso ayenzeki.
Ngoku sineAirflow esebenza yonke imihla malunga 6,5 amawaka imisebenzi. Bahluke kakhulu ngesimilo. Kukho imisebenzi yokulayisha idatha kwi-DWH ephambili evela kwimithombo emininzi eyahlukeneyo kunye necacileyo kakhulu, kukho imisebenzi yokubala ii-storefronts ngaphakathi kwe-DWH engundoqo, kukho imisebenzi yokupapasha idatha kwi-DWH ekhawulezayo, mininzi, imisebenzi emininzi eyahlukeneyo - kunye ne-Airflow. uwahlafuna yonke imihla. Ukuthetha ngamanani, oku 2,3 amawaka Imisebenzi ye-ELT yobunzima obahlukeneyo ngaphakathi kwe-DWH (Hadoop), malunga. 2,5 amakhulu ogcino-lwazi imithombo, eli liqela elivela 4 abaphuhlisi be-ETL, ezahlulahlulwe kwi-ETL data processing kwi-DWH kunye ne-ELT yokucubungula idatha ngaphakathi kwe-DWH kwaye kunjalo ngakumbi admin omnye, ojongene nezibonelelo zenkonzo.
Izicwangciso zekamva
Inani leenkqubo likhula ngokungenakuthintelwa, kwaye eyona nto iphambili esiya kube siyenza ngokwesiseko se-Airflow sinyuka. Sifuna ukwakha i-Airflow cluster, sabele imilenze yabasebenzi beCelery, kwaye senze intloko eziziphindaphindayo ngeenkqubo zokucwangcisa umsebenzi kunye nendawo yokugcina.
Ipilo
Oku, ewe, ayisiyiyo yonke into endifuna ukuyithetha malunga nokuhamba kwe-Airflow, kodwa ndizamile ukuqaqambisa amanqaku aphambili. Ukuthanda ukutya kuza nokutya, yizame kwaye uya kuyithanda :)
umthombo: www.habr.com