Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Molo, Habr! Kule nqaku ndifuna ukuthetha ngesixhobo esinye esikhulu sokuphuhlisa iinkqubo zokucwangcisa idatha ye-batch, umzekelo, kwisiseko se-DWH yenkampani okanye i-DataLake yakho. Siza kuthetha ngeApache Airflow (emva koku kuthiwa yiAirflow). Akunabulungisa ukuvinjwa ingqalelo kwi-Habré, kwaye inxalenye ephambili ndiya kuzama ukukuqinisekisa ukuba ubuncinci i-Airflow ifanelekile ukujonga xa ukhetha umcwangcisi weenkqubo zakho ze-ETL / ELT.

Ngaphambili, ndibhale uthotho lwamanqaku ngesihloko se-DWH xa ndandisebenza kwiBhanki yaseTinkoff. Ngoku ndibe yinxalenye yeqela le-Mail.Ru yeQela kwaye ndiphuhlisa iqonga lokuhlalutya idatha kwindawo yokudlala. Ngokwenyani, njengoko iindaba kunye nezisombululo ezinomdla zivela, iqela lam kunye nam siza kuthetha apha malunga neqonga lethu lohlalutyo lwedatha.

Isingeniso

Ngoko, masiqale. Yintoni iAirflow? Eli lithala leencwadi (okanye isethi yamathala eencwadi) ukuphuhlisa, ukucwangcisa nokubeka iliso kwiinkqubo zomsebenzi. Inqaku eliphambili lokuhamba komoya: Ikhowudi yePython isetyenziselwa ukuchaza (ukuphuhlisa) iinkqubo. Oku kuneenzuzo ezininzi zokulungelelanisa iprojekthi yakho kunye nophuhliso: ngokwenene, iprojekthi yakho (umzekelo) iprojekthi ye-ETL yiprojekthi yePython nje, kwaye unokuyiququzelela njengoko unqwenela, ngokuqwalasela iinkcukacha zesiseko, ubungakanani beqela kunye ezinye iimfuno. Ngokwe-instrumentally yonke into ilula. Sebenzisa umzekelo iPyCharm + Git. Kuyamangalisa kwaye kulula kakhulu!

Ngoku makhe sijonge amaziko aphambili eAirflow. Ngokuqonda undoqo kunye nenjongo yabo, unokucwangcisa ngokufanelekileyo inkqubo yakho yokwakha. Mhlawumbi elona qumrhu liziko yiDirected Acyclic Graph (emva koku kubhekiselwa kuyo njengeDAG).

Dag

I-DAG lunxulumano olunentsingiselo lwemisebenzi yakho ofuna ukuyigqiba ngolandelelwano oluchazwe ngokungqongqo ngokweshedyuli ethile. Ukuhamba komoya kubonelela ngojongano lwewebhu olufanelekileyo lokusebenza neeDAGs kunye namanye amaziko:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

I-DAG inokujongeka ngolu hlobo:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Umphuhlisi, xa eyila iDAG, ubeka phantsi uluhlu lwabaqhubi apho imisebenzi ngaphakathi kweDAG iya kwakhiwa. Apha siza kwelinye iqumrhu elibalulekileyo: I-Airflow Operator.

Ba sebenzi

Umsebenzisi liqumrhu elisekelwe ekubeni iimeko zemisebenzi zidalwe, nto leyo echaza okuya kwenzeka ngexesha lokwenziwa komsebenzi. Ukukhutshwa komoya kwi-GitHub sele iqulathe iqela labasebenzisi abalungele ukusetyenziswa. Imizekelo:

  • I-BashOperator - umqhubi wokuphumeza umyalelo we-bash.
  • I-PythonOperator-umqhubi wokufowunela ikhowudi yePython.
  • EmailOperator — umsebenzisi ngokuthumela i-imeyile.
  • HTTPOperator - umqhubi ukusebenza kunye http izicelo.
  • SqlOperator - umqhubi wokuphumeza ikhowudi yeSQL.
  • I-Sensor ngumqhubi wokulinda isiganeko (ukufika kwexesha elifunekayo, ukubonakala kwefayile efunekayo, umgca kwi-database, impendulo evela kwi-API, njl. njl.).

Kukho abaqhubi abathile ngakumbi: i-DockerOperator, i-HiveOperator, i-S3FileTransferOperator, i-PrestoToMysqlOperator, i-SlackOperator.

Unokuphuhlisa abaqhubi ngokusekelwe kwiimpawu zakho kwaye uzisebenzise kwiprojekthi yakho. Umzekelo, senze i-MongoDBToHiveViaHdfsTransfer, umqhubi wokuthumela ngaphandle amaxwebhu ukusuka kwi-MongoDB ukuya kwi-Hive, kunye nabaqhubi abaninzi ekusebenzeni nabo. Cofa indlu: CLoadFromHiveOperator kunye neCHTableLoaderOperator. Ngokusisiseko, ngokukhawuleza ukuba iprojekthi isoloko isebenzisa ikhowudi eyakhelwe kwiingxelo ezisisiseko, unokucinga ngokuyakha kwingxelo entsha. Oku kuya kwenza lula uphuhliso olongezelelweyo, kwaye uya kwandisa ilayibrari yakho yabasebenzi kwiprojekthi.

Emva koko, zonke ezi ziganeko zemisebenzi kufuneka zenziwe, kwaye ngoku siza kuthetha malunga nomcwangcisi.

Umcwangcisi

Umcwangcisi womsebenzi weAirflow yakhelwe phezu kwayo Isileri. I-Celery lithala leencwadi lePython elikuvumela ukuba ulungelelanise umgca kunye nokuhanjiswa kwe-asynchronous kunye nokuhanjiswa kwemisebenzi. Kwicala lokuhamba komoya, yonke imisebenzi yahlulwe yangamanzi. Amachibi enziwa ngesandla. Ngokuqhelekileyo, injongo yabo kukunciphisa umthwalo wokusebenza kunye nomthombo okanye ukuchwetheza imisebenzi ngaphakathi kwe-DWH. Amachibi anokulawulwa ngojongano lwewebhu:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Idama ngalinye linomda kwinani leendawo zokubeka. Xa usenza iDAG, inikwa ichibi:

ALERT_MAILS =  Variable.get("gv_mail_admin_dwh")
DAG_NAME = 'dma_load'
OWNER = 'Vasya Pupkin'
DEPENDS_ON_PAST = True
EMAIL_ON_FAILURE = True
EMAIL_ON_RETRY = True
RETRIES = int(Variable.get('gv_dag_retries'))
POOL = 'dma_pool'
PRIORITY_WEIGHT = 10

start_dt = datetime.today() - timedelta(1)
start_dt = datetime(start_dt.year, start_dt.month, start_dt.day)

default_args = {
    'owner': OWNER,
    'depends_on_past': DEPENDS_ON_PAST,
    'start_date': start_dt,
    'email': ALERT_MAILS,
    'email_on_failure': EMAIL_ON_FAILURE,
    'email_on_retry': EMAIL_ON_RETRY,
    'retries': RETRIES,
    'pool': POOL,
    'priority_weight': PRIORITY_WEIGHT
}
dag = DAG(DAG_NAME, default_args=default_args)
dag.doc_md = __doc__

I-pool echazwe kumgangatho we-DAG inokukhutshwa kwinqanaba lomsebenzi.
Inkqubo eyahlukileyo, uMcwangcisi, unoxanduva lokucwangcisa yonke imisebenzi kwi-Airflow. Ngokwenyani, uMcwangcisi ujongana nabo bonke oomatshini bokuseta imisebenzi ukuze yenziwe. Umsebenzi uhamba ngamanqanaba amaninzi ngaphambi kokuba wenziwe:

  1. Imisebenzi yangaphambili igqityiwe kwi-DAG, omnye omtsha unokufoliswa.
  2. Umgca uhlelwe ngokuxhomekeke kwizinto eziphambili zemisebenzi (izinto eziphambili nazo zingalawulwa), kwaye ukuba kukho i-slot yamahhala echibini, umsebenzi ungathathwa usebenze.
  3. Ukuba kukho umsebenzi wamahhala we-celery, umsebenzi uthunyelwa kuwo; umsebenzi owucwangcise kwingxaki uqala, usebenzisa omnye okanye omnye umsebenzisi.

Elula ngokwaneleyo.

Umcwangcisi uqhuba kwiseti yazo zonke iiDAGs kunye nayo yonke imisebenzi ngaphakathi kweeDAG.

Ukuze uMcwangcisi aqalise ukusebenza neDAG, iDAG kufuneka imisele ishedyuli:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='@hourly')

Kukho iiseti esele zilungisiwe: @once, @hourly, @daily, @weekly, @monthly, @yearly.

Ungasebenzisa kwakhona amagama e-cron:

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval='*/10 * * * *')

Umhla woKwenziwa

Ukuqonda indlela ukuqukuqela komoya okusebenza ngayo, kubalulekile ukuqonda ukuba ngowuphi uMhla woKuphumeza weDAG. Kwi-Airflow, i-DAG inoMhla wokuSebenzisa umlinganiselo, oko kukuthi, ngokuxhomekeke kwishedyuli yomsebenzi we-DAG, iimeko zemisebenzi zidalwe kuMhla woKwenziwa ngalunye. Kwaye kuMhla woKwenziwa ngalunye, imisebenzi inokuphinda iphunyezwe-okanye, umzekelo, iDAG inokusebenza ngaxeshanye kwiMihla yokuPhunyezwa emininzi. Oku kuboniswe ngokucacileyo apha:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Ngelishwa (okanye mhlawumbi ngethamsanqa: kuxhomekeke kwimeko), ukuba ukuphunyezwa komsebenzi kwi-DAG kulungiswa, ngoko ukuphunyezwa kuMhla woKuphunyezwa kwangaphambili kuya kuqhubeka kuthathelwa ingqalelo uhlengahlengiso. Oku kuhle ukuba ufuna ukubala kwakhona idata kumaxesha adlulileyo usebenzisa i-algorithm entsha, kodwa kubi kuba ukuveliswa kwakhona kwesiphumo kulahlekile (kakade, akukho mntu ukukhathazayo ukubuyisela uguqulelo olufunekayo lwekhowudi yemvelaphi esuka kwiGit kwaye ubale ukuba yintoni udinga ixesha elinye, ngendlela olifuna ngayo).

Ukuvelisa imisebenzi

Ukuphunyezwa kweDAG yikhowudi kwiPython, ngoko sinendlela elula kakhulu yokunciphisa inani lekhowudi xa usebenza, umzekelo, kunye nemithombo edibeneyo. Masithi unama-shards amathathu e-MySQL njengomthombo, kufuneka ukhwele kwindawo nganye kwaye uthathe idatha ethile. Ngaphezu koko, ngokuzimeleyo nangokuhambelanayo. Ikhowudi yePython kwiDAG inokujongeka ngolu hlobo:

connection_list = lv.get('connection_list')

export_profiles_sql = '''
SELECT
  id,
  user_id,
  nickname,
  gender,
  {{params.shard_id}} as shard_id
FROM profiles
'''

for conn_id in connection_list:
    export_profiles = SqlToHiveViaHdfsTransfer(
        task_id='export_profiles_from_' + conn_id,
        sql=export_profiles_sql,
        hive_table='stg.profiles',
        overwrite=False,
        tmpdir='/data/tmp',
        conn_id=conn_id,
        params={'shard_id': conn_id[-1:], },
        compress=None,
        dag=dag
    )
    export_profiles.set_upstream(exec_truncate_stg)
    export_profiles.set_downstream(load_profiles)

I-DAG ibonakala ngolu hlobo:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Kule meko, unokongeza okanye ukususa i-shard ngokulungisa ngokulula izicwangciso kunye nokuhlaziya i-DAG. Ukhululekile!

Ungasebenzisa kwakhona ukuveliswa kwekhowudi eyinkimbinkimbi, umzekelo, usebenze kunye nemithombo ngendlela yesiseko sedatha okanye uchaze isakhiwo setafile, i-algorithm yokusebenza kunye netafile, kwaye, ngokuqwalasela iimpawu zesiseko se-DWH, ukuvelisa inkqubo. yokulayisha iitafile ze-N kwindawo yakho yokugcina. Okanye, umzekelo, ukusebenza kunye ne-API engaxhasi ukusebenza ngeparameter ngendlela yoluhlu, unokuvelisa imisebenzi ye-N kwi-DAG kolu luhlu, ukunciphisa ukuhambelana kwezicelo kwi-API ukuya echibini, kwaye ukhuhle. idatha efunekayo kwi-API. Ubhetyebhetye!

indawo yokugcina

I-Airflow ine-backend repository yayo, i-database (ingaba yi-MySQL okanye i-Postgres, sine-Postgres), egcina iindawo zemisebenzi, ii-DAG, izicwangciso zoqhagamshelwano, izinto eziguquguqukayo zehlabathi, njl. njl. Apha ndingathanda ukuthi indawo yokugcina kwiAirflow ilula kakhulu (malunga neetafile ezingama-20) kwaye ifanelekile ukuba ufuna ukwakha naziphi na iinkqubo zakho ngaphezulu kwayo. Ndikhumbula iitafile ze-100500 kwindawo yokugcina i-Informatica, ekwakufuneka ifundwe ixesha elide ngaphambi kokuqonda indlela yokwakha umbuzo.

Ukubeka iliso

Ngenxa yokulula kwendawo yokugcina, unokwakha inkqubo yokubeka iliso yomsebenzi ekulungeleyo. Sisebenzisa i-notepad eZeppelin, apho sijonga ubume bemisebenzi:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Oku kunokuba lujongano lwewebhu lwe-Airflow ngokwayo:

Ukuhamba komoya sisixhobo sokwenza lula nangokukhawuleza ukuphuhlisa nokugcina iinkqubo zokusetyenzwa kwedatha yebhetshi

Ikhowudi ye-Airflow ngumthombo ovulekileyo, ngoko songeze isilumkiso kwiTelegram. Umzekelo ngamnye osebenzayo womsebenzi, ukuba impazamo iyenzeka, i-spams iqela kwiTelegram, apho lonke uphuhliso kunye neqela lenkxaso liqukethe.

Sifumana impendulo ekhawulezileyo ngeTelegram (ukuba iyafuneka), kwaye ngeZeppelin sifumana umfanekiso opheleleyo wemisebenzi kwi-Airflow.

Iyonke

Ukuhamba komoya kuqala ngumthombo ovulekileyo, kwaye akufanele ulindele imimangaliso kuwo. Zilungiselele ukubeka ixesha kunye nomgudu wokwakha isisombululo esisebenzayo. Injongo iyafezekiswa, ndikholelwe, ifanelekile. Isantya sophuhliso, ukuguquguquka, lula ukongeza iinkqubo ezintsha - uya kuyithanda. Ngokuqinisekileyo, kufuneka uhlawule ingqwalasela enkulu kwintlangano yeprojekthi, ukuzinza kwe-Airflow ngokwayo: imimangaliso ayenzeki.

Ngoku sineAirflow esebenza yonke imihla malunga 6,5 amawaka imisebenzi. Bahluke kakhulu ngesimilo. Kukho imisebenzi yokulayisha idatha kwi-DWH ephambili evela kwimithombo emininzi eyahlukeneyo kunye necacileyo kakhulu, kukho imisebenzi yokubala ii-storefronts ngaphakathi kwe-DWH engundoqo, kukho imisebenzi yokupapasha idatha kwi-DWH ekhawulezayo, mininzi, imisebenzi emininzi eyahlukeneyo - kunye ne-Airflow. uwahlafuna yonke imihla. Ukuthetha ngamanani, oku 2,3 amawaka Imisebenzi ye-ELT yobunzima obahlukeneyo ngaphakathi kwe-DWH (Hadoop), malunga. 2,5 amakhulu ogcino-lwazi imithombo, eli liqela elivela 4 abaphuhlisi be-ETL, ezahlulahlulwe kwi-ETL data processing kwi-DWH kunye ne-ELT yokucubungula idatha ngaphakathi kwe-DWH kwaye kunjalo ngakumbi admin omnye, ojongene nezibonelelo zenkonzo.

Izicwangciso zekamva

Inani leenkqubo likhula ngokungenakuthintelwa, kwaye eyona nto iphambili esiya kube siyenza ngokwesiseko se-Airflow sinyuka. Sifuna ukwakha i-Airflow cluster, sabele imilenze yabasebenzi beCelery, kwaye senze intloko eziziphindaphindayo ngeenkqubo zokucwangcisa umsebenzi kunye nendawo yokugcina.

Ipilo

Oku, ewe, ayisiyiyo yonke into endifuna ukuyithetha malunga nokuhamba kwe-Airflow, kodwa ndizamile ukuqaqambisa amanqaku aphambili. Ukuthanda ukutya kuza nokutya, yizame kwaye uya kuyithanda :)

umthombo: www.habr.com

Yongeza izimvo