File formats in big data: a short tutorial

The Mail.ru Cloud Solutions team presents a translation of an article by engineer Rahul Bhatia of Clairvoyant on which file formats exist in big data, which features the Hadoop formats have in common, and which format is best to use.

Why are different file formats needed?

The main performance bottleneck for HDFS-enabled applications such as MapReduce and Spark is the time it takes to find, read, and write data. These problems are compounded by the difficulty of managing large datasets when the schema evolves rather than staying fixed, or when there are storage constraints.

Processing big data increases the load on the storage subsystem, because Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the processor, the network, the input/output system, and so on are also loaded. As the volume of data grows, so does the cost of processing and storing it.

The various file formats in Hadoop were designed to solve exactly these problems. Choosing the right file format can bring significant benefits:

  1. Faster read times.
  2. Faster write times.
  3. Splittable files.
  4. Support for schema evolution.
  5. Advanced compression support.

Some file formats are intended for general use, others for more specific use cases, and some are designed around particular data characteristics. So the choice is really quite wide.

Avro file format

Avro is widely used for data serialization. It is a row-based format, that is, a row-oriented data storage format in Hadoop. It stores the schema in JSON, which makes it easy for any program to read and interpret. The data itself is stored in a binary format that is compact and efficient.

Avro's serialization system is language-neutral. Files can be processed in a variety of languages, currently C, C++, C#, Java, Python, and Ruby.

A key feature of Avro is its robust support for data schemas that change over time, that is, schema evolution. Avro handles schema changes such as deleting, adding, or modifying fields.
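
The idea behind schema evolution can be sketched in a few lines of Python. This is a simplified illustration, not the Avro library: the `resolve` helper and all field names are hypothetical, and real Avro schema resolution also covers type promotion, aliases, and nested types.

```python
# Simplified illustration of schema resolution: data written with an old
# schema is read with a newer one. This is NOT the real Avro implementation.

# Hypothetical reader schema: "email" was added (with a default) and
# "fax" was removed after the data had been written.
READER_FIELDS = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": "unknown"},
]


def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]       # field known to both schemas
        elif "default" in field:
            resolved[name] = field["default"]   # added field: fall back to default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                             # writer-only fields are dropped


old_record = {"id": 1, "name": "emp1", "fax": "555-0100"}
new_view = resolve(old_record, READER_FIELDS)
print(new_view)
```

The removed field disappears from the result and the added field is filled from its default, which is why old files stay readable after the schema changes.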

Avro supports a variety of data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
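
As an illustration, a schema of that shape might look like the sketch below. The record, field, and symbol names are made up for this example; only the schema keywords (`record`, `array`, `enum`, `fields`, `items`, `symbols`) come from the Avro schema syntax. The schema is built as a Python dictionary and serialized with the standard `json` module, since Avro schemas are JSON documents.

```python
import json

# Hypothetical Avro schema: a record with an array, an enum and a sub-record.
EMPLOYEE_SCHEMA = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "long"},
        # array of strings
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # enumerated type
        {
            "name": "status",
            "type": {
                "type": "enum",
                "name": "Status",
                "symbols": ["ACTIVE", "ON_LEAVE"],
            },
        },
        # nested sub-record
        {
            "name": "address",
            "type": {
                "type": "record",
                "name": "Address",
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "zip", "type": "string"},
                ],
            },
        },
    ],
}

# The JSON text is what would be stored alongside the binary data.
schema_json = json.dumps(EMPLOYEE_SCHEMA, indent=2)
print(schema_json)
```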

This format is ideal for writing to the landing (transition) zone of a data lake (a data lake is a collection of instances for storing various types of data alongside the data sources themselves).

So, this format is best suited for writing to the landing zone of a data lake for the following reasons:

  1. Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in that case.
  2. Downstream systems can easily retrieve table schemas from the files, so there is no need to store schemas separately in an external metastore.
  3. Any change to the original schema is easily handled (schema evolution).

Parquet file format

Parquet is an open-source file format for Hadoop that stores nested data structures in a flat columnar format.

Compared to the traditional row-based approach, Parquet is more efficient in terms of both storage and performance.

This is especially useful for queries that read specific columns from a wide (many-column) table. Thanks to the file format, only the necessary columns are read, so I/O is kept to a minimum.

A small digression and explanation: to better understand the Parquet file format in Hadoop, let's look at what a column-based, that is, columnar, format is. Such a format stores the values of each column together.

For example, a record includes the ID, Name, and Department fields. In this case, all the values of the ID column are stored together, as are the values of the Name column, and so on. The table looks like this:

ID    Name    Department
1     emp1    d1
2     emp2    d2
3     emp3    d3

In a row-based format, the data is stored as follows:

1  emp1  d1  2  emp2  d2  3  emp3  d3

In a columnar file format, the same data is stored like this:

1  2  3  emp1  emp2  emp3  d1  d2  d3
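
The two layouts can be reproduced with a short Python sketch. It only illustrates the ordering of values, not any real on-disk encoding; the variable names are hypothetical, and the Name values are written as employee-style labels (`emp1`, `emp2`, `emp3`), which is an assumption about the example data.

```python
# The example table as a list of (ID, Name, Department) rows.
rows = [
    (1, "emp1", "d1"),
    (2, "emp2", "d2"),
    (3, "emp3", "d3"),
]

# Row-based layout: all fields of record 1, then all fields of record 2, ...
row_layout = [value for row in rows for value in row]

# Columnar layout: all IDs, then all names, then all departments.
# zip(*rows) transposes the table from rows into columns.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)
print(column_layout)
```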

A columnar format is more efficient when you need to query only some of the columns in a table. It reads just the required columns, because their values are stored next to each other. This keeps I/O operations to a minimum.

For example, suppose you need only the Name column. In a row-based format, every record in the dataset has to be loaded and parsed field by field before the Name value can be extracted. A columnar format lets you go straight to the Name column, because all the values of that column are stored together. You don't have to scan the entire record.

Thus, a columnar format improves query performance: it takes less seek time to reach the required columns, and the number of I/O operations drops because only the needed columns are read.
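
A rough way to see the difference is to count how many stored values have to be touched to read one column from each layout. The sketch below is a toy model under simplifying assumptions (fixed-width records, an in-memory list instead of a file, employee-style sample values); the function names are hypothetical.

```python
COLUMNS = ["ID", "Name", "Department"]
NUM_ROWS = 3
ROW_LAYOUT = [1, "emp1", "d1", 2, "emp2", "d2", 3, "emp3", "d3"]
COLUMN_LAYOUT = [1, 2, 3, "emp1", "emp2", "emp3", "d1", "d2", "d3"]


def read_column_from_rows(data, column):
    """Row layout: every record is loaded and one field is picked from it."""
    width = len(COLUMNS)
    index = COLUMNS.index(column)
    values, touched = [], 0
    for start in range(0, len(data), width):
        record = data[start:start + width]   # the whole record is loaded
        touched += len(record)
        values.append(record[index])
    return values, touched


def read_column_from_columns(data, column):
    """Columnar layout: jump straight to the column's contiguous block."""
    start = COLUMNS.index(column) * NUM_ROWS
    values = data[start:start + NUM_ROWS]
    return values, len(values)


row_values, row_touched = read_column_from_rows(ROW_LAYOUT, "Name")
col_values, col_touched = read_column_from_columns(COLUMN_LAYOUT, "Name")
print(row_values, row_touched)
print(col_values, col_touched)
```

Both calls return the same three names, but the row layout touches all nine stored values while the columnar layout touches only three.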

One of the distinctive features of Parquet is that it can store data with nested structures. This means that in a Parquet file even nested fields can be read individually, without reading all the fields of the nested structure. Parquet uses a record shredding and assembly algorithm to store nested structures.
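
The general idea of shredding can be sketched as flattening each nested record into one column per leaf field. This is a heavily simplified illustration, not Parquet's actual algorithm: it assumes every record has every field and contains no lists, whereas the real format also records repetition and definition levels to handle optional and repeated fields. The helper names and sample records are hypothetical.

```python
def shred(record, prefix=""):
    """Flatten nested dicts into {dotted.column.path: value}."""
    columns = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(shred(value, path))   # descend into the sub-record
        else:
            columns[path] = value                # leaf field becomes a column
    return columns


def to_columns(records):
    """Store the values of each leaf field together, one list per column."""
    store = {}
    for record in records:
        for path, value in shred(record).items():
            store.setdefault(path, []).append(value)
    return store


records = [
    {"id": 1, "address": {"city": "Pune", "geo": {"lat": 18.5, "lon": 73.9}}},
    {"id": 2, "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.8}}},
]

store = to_columns(records)
cities = store["address.city"]   # read one nested field without the others
print(sorted(store))
print(cities)
```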

To understand the Parquet file format in Hadoop, you need to know the following terms:

  1. Row group: a logical horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.
  2. Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  3. Page: column chunks are divided into pages written one after another. Pages share a common header, so unneeded ones can be skipped when reading.
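
The hierarchy of the three terms above can be modeled with nested Python containers. This is a toy sketch: the `layout` helper and its `rows_per_group` and `values_per_page` parameters are hypothetical, and real Parquet writers size row groups and pages in bytes rather than in numbers of values.

```python
def chunked(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def layout(rows, column_names, rows_per_group, values_per_page):
    """Return row groups -> column chunks -> pages for a list of rows."""
    row_groups = []
    for group_rows in chunked(rows, rows_per_group):       # horizontal split
        column_chunks = {}
        for index, name in enumerate(column_names):
            column_values = [row[index] for row in group_rows]
            # one column chunk per column, itself split into pages
            column_chunks[name] = chunked(column_values, values_per_page)
        row_groups.append(column_chunks)
    return row_groups


rows = [(i, f"emp{i}", f"d{i % 2}") for i in range(1, 7)]   # six sample rows
groups = layout(rows, ["ID", "Name", "Department"],
                rows_per_group=4, values_per_page=2)

for number, group in enumerate(groups):
    print("row group", number, group)
```

With six rows and four rows per group, the sketch produces two row groups; each holds one column chunk per column, and each chunk is split into pages of at most two values.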

Here the header contains just the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.

The footer contains the following:

  1. File metadata, which holds the starting offsets of each column's metadata. When reading, you first have to read the file metadata to find all the column chunks you are interested in. The column chunks should then be read sequentially. Other metadata includes the format version, the schema, and any additional key-value pairs.
  2. The length of the metadata (4 bytes).
  3. The magic number PAR1 (4 bytes).
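
Reading a file from its tail, as described above, can be sketched with the standard `struct` module. The sketch assumes the 4-byte length is a little-endian unsigned integer, as in the Parquet format specification. The toy file built here uses placeholder bytes for the metadata (real Parquet metadata is Thrift-encoded and is not parsed here), and the function names are hypothetical.

```python
import struct

MAGIC = b"PAR1"


def build_toy_file(body, metadata):
    """header magic + data + file metadata + 4-byte length + footer magic."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_file_metadata(data):
    """Locate the raw file metadata by reading the file from its end."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic number PAR1 missing")
    # The 4 bytes before the trailing magic hold the metadata length.
    (length,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - length
    if start < 4:
        raise ValueError("corrupt footer: metadata length is too large")
    return data[start:end]


# Placeholder bytes only; not real column data or real Thrift metadata.
toy = build_toy_file(b"column-chunks...", b"placeholder-metadata")
metadata = read_file_metadata(toy)
print(metadata)
```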

ORC file format

The Optimized Row Columnar (ORC) file format offers a highly efficient way to store data and was designed to overcome the limitations of other formats. It stores data in a very compact form and lets you skip unneeded parts, without requiring large, complex, or manually maintained indexes to be built.

Advantages of the ORC format:

  1. A single file is the output of each task, which reduces the load on the NameNode.
  2. Support for Hive data types, including DateTime, decimal, and complex data types (struct, list, map, and union).
  3. Concurrent reads of the same file by different RecordReader processes.
  4. The ability to split files without scanning for markers.
  5. An estimate of the upper bound on heap memory allocation for the read/write process, based on information in the file footer.
  6. Metadata is stored in the Protocol Buffers binary serialization format, which allows fields to be added and removed.

ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.

An ORC file stores groups of rows called stripes, along with auxiliary information in the file footer. The postscript at the end of the file contains the compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Because the stripes are so large, reads from HDFS are more efficient: they are performed in large contiguous blocks.

The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. The count, min, max, and sum aggregates for each column are also written there.

The stripe footer contains a directory of stream locations.

Row data is used when scanning tables.

Index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only to select stripes and row groups, not to answer queries.
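
How min/max statistics let a reader skip whole stripes can be sketched as follows. This is a toy model of the idea, not the ORC reader: stripes are plain Python lists holding a single numeric column, and the `stripe_stats` and `select_stripes` helpers are hypothetical.

```python
def stripe_stats(values):
    """Aggregates of the kind the text lists: count, min, max and sum."""
    return {"count": len(values), "min": min(values),
            "max": max(values), "sum": sum(values)}


def select_stripes(stats, low, high):
    """Indexes of stripes whose [min, max] range overlaps [low, high]."""
    return [i for i, s in enumerate(stats)
            if s["max"] >= low and s["min"] <= high]


# Three toy stripes of one numeric column.
stripes = [[1, 5, 9], [12, 15, 19], [21, 25, 30]]
stats = [stripe_stats(values) for values in stripes]

# Predicate: 13 <= value <= 22. The statistics only narrow down the stripes.
candidates = select_stripes(stats, low=13, high=22)

# The selected stripes still have to be read and filtered row by row.
matches = [v for i in candidates for v in stripes[i] if 13 <= v <= 22]
print(candidates, matches)
```

The first stripe is skipped without being read, while the other two are scanned. This mirrors the point above: the index selects stripes, and the query itself is still answered from the row data.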

Comparison of different file formats

Avro vs Parquet

  1. Avro is a row-based storage format, whereas Parquet stores data in columns.
  2. Parquet is better suited for analytical queries: read operations and data queries are much more efficient than writes.
  3. Write operations in Avro are more efficient than in Parquet.
  4. Avro handles schema evolution more maturely. Parquet supports only schema append, whereas Avro supports full-featured evolution, that is, adding or modifying columns.
  5. Parquet is ideal for querying a subset of columns in a multi-column table. Avro is suited to ETL operations in which all columns are queried.

ORC vs Parquet

  1. Parquet is better at storing nested data.
  2. ORC is better suited for predicate pushdown.
  3. ORC supports ACID properties.
  4. ORC compresses data better.

What else to read on the topic:

  1. Big data analytics in the cloud: how a company can become data-driven.
  2. A humble guide to database schemas.
  3. Our Telegram channel about digital transformation.

Source: www.habr.com
