File formats in big data: a brief primer


The Mail.ru Cloud Solutions team presents a translation of an article by Rahul Bhatia, an engineer at Clairvoyant, on which file formats exist in big data, what the most common features of Hadoop formats are, and which format is best to use.

Why are different file formats needed?

A major performance bottleneck for HDFS-based applications such as MapReduce and Spark is the time it takes to search, read, and write data. These problems are compounded by the difficulty of managing large datasets when the schema evolves rather than staying fixed, or when there are storage constraints.

Processing big data increases the load on the storage subsystem, because Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the processor, the network, the input/output system, and so on are also loaded. As the volume of data grows, so does the cost of processing and storing it.

The various file formats in Hadoop were designed to solve exactly these problems. Choosing the right file format can bring some significant benefits:

  1. Faster reads.
  2. Faster writes.
  3. Splittable files.
  4. Support for schema evolution.
  5. Advanced compression support.

Some file formats are intended for general use, others for more specific use cases, and some are designed around particular data characteristics. So the choice really is quite large.

Avro file format

Avro is widely used for data serialization. It is a row-based format, that is, a format that stores data row by row in Hadoop. It stores the schema in JSON, which makes it easy for any program to read and interpret. The data itself is stored in a binary format that is compact and efficient.

Avro's serialization system is language-neutral. Files can be processed in a variety of languages, currently C, C++, C#, Java, Python, and Ruby.

A key feature of Avro is its robust support for data schemas that change over time, that is, evolve. Avro understands schema changes: removing, adding, or modifying fields.
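A minimal sketch of the idea, written in plain Python rather than against an Avro library: a record written under an older schema is read through a newer reader schema. The field names, the `resolve` helper, and the default value are hypothetical, and real Avro schema resolution covers more cases (type promotion, aliases, unions).

```python
# Simplified illustration of schema evolution; this is not the Avro library's API.
# The writer's (old) schema had: id, name, fax.
# The reader's (new) schema drops "fax" and adds "department" with a default.
reader_fields = [
    {"name": "id"},
    {"name": "name"},
    {"name": "department", "default": "unknown"},
]

def resolve(record, fields):
    """Project a decoded record onto the reader's fields."""
    result = {}
    for field in fields:
        name = field["name"]
        if name in record:
            result[name] = record[name]        # field present in both schemas
        elif "default" in field:
            result[name] = field["default"]    # field added later: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result                              # fields removed by the reader are dropped

old_record = {"id": 1, "name": "emp1", "fax": "555-0100"}
new_view = resolve(old_record, reader_fields)
print(new_view)
```

The point of the sketch is that old files stay readable after the schema changes, as long as every added field carries a default.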

Avro supports a wide range of data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
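As an illustration, here is what such a schema might look like. The `Employee`, `Status`, and `Address` names and their fields are made up for this example; only the standard `json` module is used, to show that the schema is ordinary JSON that any program can inspect.

```python
import json

# Hypothetical Avro schema: a record with an array, an enum, and a nested record.
schema_json = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "skills", "type": {"type": "array", "items": "string"}},
    {"name": "status",
     "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "ON_LEAVE"]}},
    {"name": "address",
     "type": {"type": "record", "name": "Address",
              "fields": [{"name": "city", "type": "string"}]}}
  ]
}
"""

schema = json.loads(schema_json)

# Map each field name to its top-level Avro type (primitive name or complex kind).
field_types = {
    f["name"]: f["type"] if isinstance(f["type"], str) else f["type"]["type"]
    for f in schema["fields"]
}
print(field_types)
```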

This format is well suited to writing to the landing (transition) zone of a data lake. (A data lake is a collection of instances for storing various types of data, in addition to the data sources themselves.)

The reasons are as follows:

  1. Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in that case.
  2. Downstream systems can easily retrieve table schemas from the files, so there is no need to store schemas separately in an external metastore.
  3. Any change to the original schema is handled easily (schema evolution).

Parquet file format

Parquet is an open-source file format for Hadoop that stores nested data structures in a flat columnar format.

Compared with the traditional row-based approach, Parquet is more efficient in terms of both storage and performance.

This is especially useful for queries that read specific columns from a wide table (one with many columns). Thanks to the file format, only the required columns are read, so I/O is kept to a minimum.

A short digression and explanation: to understand the Parquet file format in Hadoop better, let's look at what a column-based (columnar) format is. Such a format stores the values of each column together.

For example, a record consists of the fields ID, Name, and Department. In this case, all the values of the ID column are stored together, as are the values of the Name column, and so on. The table looks like this:

ID | Name | Department
1 | emp1 | d1
2 | emp2 | d2
3 | emp3 | d3

In a row-based format, the data is stored like this:

1 | emp1 | d1 | 2 | emp2 | d2 | 3 | emp3 | d3

In a columnar file format, the same data is stored like this:

1 | 2 | 3 | emp1 | emp2 | emp3 | d1 | d2 | d3
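The two layouts above can be reproduced with a few lines of Python. This is only a sketch of the value ordering; real formats add encoding, compression, and metadata on top.

```python
# The example table from the text, as a list of (ID, Name, Department) rows.
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]

# Row-based layout: all fields of record 1, then all fields of record 2, and so on.
row_layout = [value for row in rows for value in row]

# Columnar layout: all IDs, then all names, then all departments.
# zip(*rows) transposes the rows into columns.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)
print(column_layout)
```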

A columnar format is more efficient when you need to query only a few columns of a table. It reads just the required columns, because their values are stored next to each other. This keeps I/O operations to a minimum.

For example, suppose you need only the Name column. In a row-based format, every record in the dataset has to be loaded and parsed field by field before the Name value can be extracted. A columnar format lets you go directly to the Name column, because all the values of that column are stored together. You do not need to scan the whole record.

So a columnar format improves query performance: it takes less seek time to reach the required columns, and it reduces the number of I/O operations because only the required columns are read.
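To make the difference concrete, the sketch below counts how many values have to be touched to fetch the Name column from each layout. The counters are a rough stand-in for I/O cost, and the function names are invented for this example.

```python
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]
columns = {
    "ID": [1, 2, 3],
    "Name": ["emp1", "emp2", "emp3"],
    "Department": ["d1", "d2", "d3"],
}

def names_from_rows(rows):
    """Row layout: every record is loaded and parsed to pull out one field."""
    values_touched = 0
    names = []
    for record in rows:
        values_touched += len(record)   # the whole record is read
        names.append(record[1])         # index 1 is the Name field
    return names, values_touched

def names_from_columns(columns):
    """Columnar layout: go straight to the Name column."""
    names = list(columns["Name"])
    return names, len(names)            # only that column is read

row_names, row_cost = names_from_rows(rows)
col_names, col_cost = names_from_columns(columns)
print(row_cost, col_cost)
```

Both calls return the same names, but the row layout touches all nine values while the columnar layout touches three. The gap widens as the table gets wider.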

One of Parquet's distinctive features is that it can store data with nested structures. This means that in a Parquet file even nested fields can be read individually, without reading every field of the nested structure. Parquet uses a record shredding and assembly algorithm to store nested structures.
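The sketch below shows only the first part of that idea: each leaf of a nested record becomes its own column, addressed by a dotted path, so one nested field can be read without its siblings. It is deliberately simplified. The actual shredding and assembly algorithm also records repetition and definition levels so that repeated and optional fields can be reassembled, which is omitted here. The `column_paths` helper and the sample record are hypothetical.

```python
def column_paths(record, prefix=""):
    """Flatten nested dicts into {dotted.path: value} leaf columns."""
    columns = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested structure, extending the path.
            columns.update(column_paths(value, prefix=path + "."))
        else:
            columns[path] = value
    return columns

record = {"id": 1, "name": "emp1", "address": {"city": "Pune", "zip": "411001"}}
flat = column_paths(record)
print(sorted(flat))
print(flat["address.city"])
```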

To understand the Parquet file format in Hadoop, you need to know the following terms:

  1. Row group: a logical horizontal partitioning of the data into rows. A row group consists of one column chunk for each column in the dataset.
  2. Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  3. Page: column chunks are divided into pages, written one after another. Pages share a common header, so a reader can skip the ones it does not need.
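The hierarchy of these three terms can be sketched as nested Python lists. The `layout` and `chunked` helpers are hypothetical, and the sizes are given as row and value counts purely for readability; real Parquet writers size row groups and pages in bytes.

```python
def chunked(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def layout(rows, column_names, rows_per_group, values_per_page):
    """Return row groups -> column chunks -> pages for a small table."""
    row_groups = []
    for group_rows in chunked(rows, rows_per_group):
        column_chunks = {}
        for index, name in enumerate(column_names):
            # One column chunk per column, holding only this group's values.
            column_values = [row[index] for row in group_rows]
            column_chunks[name] = chunked(column_values, values_per_page)
        row_groups.append(column_chunks)
    return row_groups

rows = [(i, f"emp{i}", f"d{i}") for i in range(1, 7)]
groups = layout(rows, ["ID", "Name", "Department"], rows_per_group=4, values_per_page=2)
for number, group in enumerate(groups):
    print(number, group)
```

With six rows, four rows per group, and two values per page, this gives two row groups. The first holds two pages per column chunk and the second holds one.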

The file header contains only the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.

The footer contains the following:

  1. File metadata, which holds the starting positions of each column's metadata. When reading, you first read the file metadata to find all the column chunks you are interested in. The column chunks should then be read sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs.
  2. The length of the metadata (4 bytes).
  3. The magic number PAR1 (4 bytes).
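The header and footer described above can be sketched by building a mock file in memory and reading it back from the end. The metadata here is a placeholder byte string: real Parquet encodes its file metadata with Thrift, which this example does not attempt to parse. The helper names are hypothetical, and the length is read as a little-endian 32-bit integer.

```python
import struct

MAGIC = b"PAR1"

def build_mock_file(column_data: bytes, metadata: bytes) -> bytes:
    """Assemble header magic, data, metadata, 4-byte length, footer magic."""
    return MAGIC + column_data + metadata + struct.pack("<I", len(metadata)) + MAGIC

def read_metadata(blob: bytes) -> bytes:
    """Locate the file metadata by reading the file from its end."""
    if blob[:4] != MAGIC or blob[-4:] != MAGIC:
        raise ValueError("not a Parquet-style file: magic number missing")
    (length,) = struct.unpack("<I", blob[-8:-4])   # little-endian metadata length
    return blob[-8 - length:-8]                    # metadata sits just before it

mock = build_mock_file(b"<column chunks>", b"<file metadata placeholder>")
print(read_metadata(mock))
```

Keeping the metadata at the end lets a writer emit the file in a single pass, since the positions of the column chunks are known only after they have been written.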

ORC file format

The Optimized Row Columnar (ORC) file format offers a highly efficient way to store data and was designed to overcome the limitations of other formats. It stores data in a very compact form and lets you skip details you do not need, without requiring large, complex, or manually maintained indexes to be built.

Advantages of the ORC format:

  1. Each task produces a single file as its output, which reduces the load on the NameNode.
  2. Support for Hive data types, including DateTime, decimal, and complex data types (struct, list, map, and union).
  3. Concurrent reads of the same file by separate RecordReader processes.
  4. The ability to split files without scanning for markers.
  5. An upper bound on the heap memory needed by read and write processes, estimated from the information in the file footer.
  6. Metadata is stored in the binary Protocol Buffers serialization format, which allows fields to be added and removed.

ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.

An ORC file stores groups of rows, called stripes, together with auxiliary information in the file footer. The postscript at the end of the file contains the compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Stripes this large make reads from HDFS more efficient, because data is read in large contiguous blocks.

The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. Column-level aggregates are written there as well: count, min, max, and sum for each column.
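The per-column aggregates mentioned above are cheap to compute at write time. A minimal sketch follows, with a hypothetical two-column table and a hypothetical `column_statistics` helper. It mirrors the four statistics named in the text, not ORC's actual footer encoding.

```python
def column_statistics(values):
    """Summarise one numeric column: count, min, max, sum."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }

# Hypothetical numeric columns of a small table.
table = {"id": [1, 2, 3], "salary": [4200, 3900, 5100]}
footer_stats = {name: column_statistics(values) for name, values in table.items()}
print(footer_stats["salary"])
```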

The stripe footer contains a directory of stream locations.

Row data is used when scanning tables.

Index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only to select stripes and row groups, not to answer queries.
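How min/max values help select stripes can be sketched as follows. The stripe statistics and the `stripes_to_read` helper are invented for this example. The index only narrows down where to look; the rows in the selected stripes still have to be read and filtered.

```python
# Hypothetical per-stripe statistics for one numeric column.
stripes = [
    {"stripe": 0, "min": 1, "max": 100},
    {"stripe": 1, "min": 101, "max": 200},
    {"stripe": 2, "min": 201, "max": 300},
]

def stripes_to_read(stripes, low, high):
    """Keep only stripes whose [min, max] range can overlap [low, high]."""
    return [s["stripe"] for s in stripes if s["max"] >= low and s["min"] <= high]

# Query: WHERE value BETWEEN 150 AND 220
selected = stripes_to_read(stripes, 150, 220)
print(selected)
```

Stripe 0 is skipped entirely, because its maximum is below the lower bound of the query.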

Comparison of the file formats

Avro compared with Parquet

  1. Avro is a row-based storage format, while Parquet stores data in columns.
  2. Parquet is better suited to analytical queries: reads and data queries are far more efficient than writes.
  3. Write operations are performed more efficiently in Avro than in Parquet.
  4. Avro handles schema evolution more maturely. Parquet supports only appending to the schema, while Avro supports richer evolution, that is, adding or modifying columns.
  5. Parquet is ideal for querying a subset of columns in a table with many columns. Avro is suited to ETL operations where all columns are queried.

ORC vs Parquet

  1. Parquet stores nested data better.
  2. ORC is better suited to predicate pushdown.
  3. ORC supports ACID properties.
  4. ORC compresses data better.


Source: www.habr.com
