File formats in big data: a short primer

Illustration: Weather Deity by Remarin

The Mail.ru Cloud Solutions team presents a translation of an article by engineer Rahul Bhatia of Clairvoyant about which file formats exist in big data, which Hadoop formats are the most popular, and which format is best to use.

Why are different file formats needed?

A major performance bottleneck for HDFS-enabled applications such as MapReduce and Spark is the time it takes to find, read, and write data. These problems are compounded by the difficulty of managing large data sets when the schema evolves rather than staying fixed, or when there are storage constraints.

Processing big data increases the load on the storage subsystem: Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the processor, the network, the I/O system, and so on are also loaded. As the volume of data grows, so does the cost of processing and storing it.

The various file formats in Hadoop were designed to solve exactly these problems. Choosing the appropriate file format can bring significant benefits:

  1. Faster read times.
  2. Faster write times.
  3. Splittable files.
  4. Support for schema evolution.
  5. Advanced compression support.

Some file formats are intended for general use, others for more specific uses, and some are designed to match particular data characteristics. So the choice is quite wide.

Avro File Format

Avro is widely used for data serialization. It is a row-based data storage format in Hadoop. It stores the schema in JSON format, which makes it easy for any program to read and interpret. The data itself is stored in a binary format that is compact and efficient.
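To make the "compact binary" point concrete, here is a minimal hand-rolled sketch of how Avro's binary encoding writes a long (zigzag variable-length integer) and a string (length followed by UTF-8 bytes). It does not use the Avro library, and the two-field record (`id`, `name`) and the function names are assumptions made only for illustration.

```python
import json


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag varint (Avro int/long encoding)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Avro strings: the byte length as a long, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(record_id: int, name: str) -> bytes:
    """Hypothetical record {id: long, name: string}; fields are written in schema order."""
    return zigzag_varint(record_id) + encode_string(name)


binary = encode_record(1, "emp1")                      # b'\x02\x08emp1'
as_json = json.dumps({"id": 1, "name": "emp1"}).encode("utf-8")
print(len(binary), len(as_json))                       # 6 25
```

Field names never appear in the encoded bytes, which is why a reader needs the schema to decode the data.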

Avro's serialization system is language-neutral. Files can be processed in a variety of languages, currently C, C++, C#, Java, Python, and Ruby.

A key feature of Avro is its robust support for data schemas that change over time, that is, schema evolution. Avro handles schema changes such as removing, adding, or modifying fields.
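The idea behind schema evolution can be sketched in a few lines. This is a simplified illustration, not the Avro library: a record written with an older schema is projected onto the reader's newer schema, fields the reader no longer knows are dropped, and fields the writer never had are filled from defaults. The field names and the `resolve` helper are hypothetical.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a record written with an older schema onto the reader's schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]        # field exists in both schemas
        elif "default" in field:
            resolved[name] = field["default"]    # field added later: use its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved                              # fields unknown to the reader are dropped


# Written with the old schema: has "fax", has no "department".
old_record = {"id": 1, "name": "emp1", "fax": "555-0100"}

# The reader's newer schema: "fax" removed, "department" added with a default.
reader_fields = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string", "default": "unknown"},
]

new_view = resolve(old_record, reader_fields)
print(new_view)  # {'id': 1, 'name': 'emp1', 'department': 'unknown'}
```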

Avro supports a variety of data structures. For example, you can create a record that contains an array, an enumerated type, and a subrecord.
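As an illustration, a schema combining these three structures might look like the sketch below. The record and field names (`Employee`, `skills`, `status`, `address`) are invented for this example. The schema is built as a Python dictionary and serialized with the standard `json` module, since Avro schemas are plain JSON.

```python
import json

# Hypothetical Avro schema: a record containing an array, an enum, and a subrecord.
employee_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # Array of strings.
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # Enumerated type with a fixed set of symbols.
        {"name": "status", "type": {
            "type": "enum", "name": "Status", "symbols": ["ACTIVE", "ON_LEAVE"]}},
        # Nested record (subrecord).
        {"name": "address", "type": {
            "type": "record", "name": "Address",
            "fields": [{"name": "city", "type": "string"}]}},
    ],
}

schema_json = json.dumps(employee_schema, indent=2)
print(schema_json)
```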

This format is ideal for writing to the landing (transition) zone of a data lake. A data lake is a collection of instances for storing various types of data, in addition to the data sources themselves.

So, this format is best suited for writing to the landing zone of a data lake, for the following reasons:

  1. Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in that case.
  2. Downstream systems can easily retrieve table schemas from the files, so there is no need to store schemas separately in an external metastore.
  3. Any change to the original schema is handled easily (schema evolution).

Parquet File Format

Parquet is an open source file format for Hadoop that stores nested data structures in a flat columnar format.

Compared to the traditional row-based approach, Parquet is more efficient in terms of storage and performance.

This is especially useful for queries that read specific columns from a wide (many-column) table. Thanks to the file format, only the required columns are read, so I/O is kept to a minimum.

A short digression and explanation: to better understand the Parquet file format in Hadoop, let's look at what a column-based, that is, columnar, format is. This format stores the values of each column together.

For example, a record includes the fields ID, Name, and Department. In this case, all the values of the ID column are stored together, as are the values of the Name column, and so on. The table looks like this:

ID   Name   Department
1    emp1   d1
2    emp2   d2
3    emp3   d3

In a row format, the data is stored as follows:

1  emp1  d1  2  emp2  d2  3  emp3  d3

In a columnar file format, the same data is stored like this:

1  2  3  emp1  emp2  emp3  d1  d2  d3
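The two layouts can be reproduced with a short sketch. It assumes the three-record table from the example and only models the order of values, not any real file encoding.

```python
# The example table: (ID, Name, Department).
records = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]

# Row layout: all fields of record 1, then all fields of record 2, and so on.
row_layout = [value for record in records for value in record]

# Columnar layout: all IDs, then all names, then all departments.
column_layout = [value for column in zip(*records) for value in column]

print(row_layout)     # [1, 'emp1', 'd1', 2, 'emp2', 'd2', 3, 'emp3', 'd3']
print(column_layout)  # [1, 2, 3, 'emp1', 'emp2', 'emp3', 'd1', 'd2', 'd3']
```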

The columnar format is more efficient when you need to query only a few columns from a table. It reads only the required columns, because their values are stored next to each other. This keeps I/O operations to a minimum.

For example, suppose you need only the Name column. In a row format, every record in the data set has to be loaded and parsed field by field before the Name data can be extracted. The columnar format lets you go directly to the Name column, because all the values of that column are stored together. You do not have to scan the whole record.

So, the columnar format improves query performance: it takes less seek time to reach the required columns, and it reduces the number of I/O operations because only the needed columns are read.
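A rough way to see the difference is to count how many values each layout has to touch to return one column. The sketch below uses in-memory lists as a stand-in for storage. The table size, column names, and helper functions are assumptions for illustration, and real savings depend on encoding, compression, and block sizes.

```python
NUM_ROWS = 1000
COLUMNS = ("id", "name", "department")
rows = [(i, f"emp{i}", f"d{i % 10}") for i in range(1, NUM_ROWS + 1)]

# Row layout: one flat sequence, record after record.
row_store = [value for row in rows for value in row]

# Columnar layout: one contiguous sequence per column.
column_store = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def names_from_rows(store, width, index):
    """Load every record in full, then pick the wanted field out of it."""
    values_read, names = 0, []
    for start in range(0, len(store), width):
        record = store[start:start + width]
        values_read += len(record)
        names.append(record[index])
    return names, values_read


def names_from_columns(store, column):
    """Read only the contiguous run of values for the wanted column."""
    values = store[column]
    return list(values), len(values)


names_a, read_a = names_from_rows(row_store, len(COLUMNS), COLUMNS.index("name"))
names_b, read_b = names_from_columns(column_store, "name")
print(read_a, read_b)  # 3000 1000
```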

One of Parquet's distinctive features is that it can store data with nested structures in this columnar fashion. This means that in a Parquet file even nested fields can be read individually, without reading all the fields of the nested structure. Parquet uses the record shredding and assembly algorithm to store nested structures.
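The sketch below gives a deliberately simplified picture of shredding: each nested leaf field becomes its own column, addressed by a dotted path. It assumes every record contains every field. The real algorithm also tracks repetition and definition levels to handle optional and repeated fields, and those are omitted here. The record contents and function names are made up for the example.

```python
def shred(record: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into {dotted.path: value} for its leaf fields."""
    leaves = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            leaves.update(shred(value, prefix=f"{path}."))  # descend into the subrecord
        else:
            leaves[path] = value
    return leaves


def to_columns(records: list) -> dict:
    """Collect the leaf values of all records into one list per column path."""
    columns = {}
    for record in records:
        for path, value in shred(record).items():
            columns.setdefault(path, []).append(value)
    return columns


records = [
    {"id": 1, "name": "emp1", "address": {"city": "Pune", "zip": "411001"}},
    {"id": 2, "name": "emp2", "address": {"city": "Delhi", "zip": "110001"}},
]

columns = to_columns(records)
# A nested field can now be read on its own, without touching its siblings.
print(columns["address.city"])  # ['Pune', 'Delhi']
```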

To understand the Parquet file format in Hadoop, you need to know the following terms:

  1. Row group: a logical horizontal partitioning of the data into rows. A row group consists of one column chunk for each column in the data set.
  2. Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  3. Page: column chunks are divided into pages written one after another. Pages share a common header, so a reader can skip the pages it does not need.
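The relationship between these three terms can be sketched as nested partitioning. This toy example splits by value counts purely for readability, whereas real Parquet writers size row groups and pages in bytes. The `layout` helper and its parameters are hypothetical.

```python
def chunked(values: list, size: int) -> list:
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def layout(table: dict, rows_per_group: int, values_per_page: int) -> list:
    """table maps column name -> list of values; all columns have equal length."""
    num_rows = len(next(iter(table.values())))
    row_groups = []
    for start in range(0, num_rows, rows_per_group):
        group = {}
        for name, values in table.items():
            column_chunk = values[start:start + rows_per_group]  # one chunk per column
            group[name] = chunked(column_chunk, values_per_page)  # chunk -> pages
        row_groups.append(group)
    return row_groups


table = {
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["emp1", "emp2", "emp3", "emp4", "emp5", "emp6"],
}
groups = layout(table, rows_per_group=4, values_per_page=2)

print(len(groups))       # 2
print(groups[0]["id"])   # [[1, 2], [3, 4]]
print(groups[1]["name"]) # [['emp5', 'emp6']]
```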

In a Parquet file, the header contains only the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.

The footer contains the following:

  1. File metadata, which holds the start locations of each column's metadata. When reading, you first read the file metadata to find all the column chunks of interest. The column chunks are then read sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs.
  2. The length of the metadata (4 bytes).
  3. The magic number PAR1 (4 bytes).
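Because the metadata length and the magic number sit at the very end, a reader can locate the metadata by seeking backwards from the end of the file. The sketch below builds a toy byte string with that framing and reads it back. The metadata here is placeholder bytes (real Parquet metadata is Thrift-encoded), the body is dummy content, and the helper names are invented. The length is treated as a little-endian 32-bit integer.

```python
import io
import struct

MAGIC = b"PAR1"


def build_toy_file(body: bytes, metadata: bytes) -> bytes:
    """Header magic + body + metadata + 4-byte metadata length + footer magic."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_metadata(stream) -> bytes:
    """Locate the metadata by reading the last 8 bytes and seeking backwards."""
    stream.seek(0)
    if stream.read(4) != MAGIC:
        raise ValueError("missing header magic")
    stream.seek(-8, io.SEEK_END)
    length_bytes = stream.read(4)
    if stream.read(4) != MAGIC:
        raise ValueError("missing footer magic")
    (length,) = struct.unpack("<I", length_bytes)
    stream.seek(-(8 + length), io.SEEK_END)
    return stream.read(length)


toy = build_toy_file(b"column chunks go here", b"placeholder-metadata")
metadata = read_metadata(io.BytesIO(toy))
print(metadata)  # b'placeholder-metadata'
```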

ORC File Format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data and was designed to overcome the limitations of other formats. It stores data in a very compact form and lets you skip unneeded data, without requiring large, complex, or manually maintained indexes.

Advantages of the ORC format:

  1. A single file is the output of each task, which reduces the load on the NameNode.
  2. Support for Hive data types, including DateTime, decimal, and complex types (struct, list, map, and union).
  3. Concurrent reads of the same file by different RecordReader processes.
  4. The ability to split files without scanning for markers.
  5. An upper-bound estimate of heap memory allocation for read/write processes, based on information in the file footer.
  6. Metadata is stored in the Protocol Buffers binary serialization format, which allows fields to be added and removed.

ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.

An ORC file stores groups of rows called stripes, along with auxiliary information in the file footer. The postscript at the end of the file contains the compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Thanks to such large stripes, reads from HDFS are more efficient, because they happen in large contiguous blocks.

The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also holds the column-level aggregates count, min, max, and sum.

The stripe footer contains a directory of stream locations.

Row data is used when scanning tables.

Index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only to select stripes and row groups, not to answer queries.
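How min/max statistics help select stripes can be sketched as follows. This is an illustration of the idea, not the ORC reader: the stripes are tiny in-memory lists of a single numeric column, and the helper names are hypothetical. The statistics only decide which stripes to open; the matching values are still found by scanning the selected stripes.

```python
def build_stats(stripes: list) -> list:
    """Per-stripe aggregates of the kind kept alongside the data."""
    return [
        {"count": len(s), "min": min(s), "max": max(s), "sum": sum(s)}
        for s in stripes
    ]


def stripes_to_read(stats: list, low: int, high: int) -> list:
    """Indexes of stripes whose [min, max] range overlaps [low, high]."""
    return [i for i, s in enumerate(stats) if s["max"] >= low and s["min"] <= high]


# One numeric column, split into three toy stripes.
stripes = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]
stats = build_stats(stripes)

# Query: values between 12 and 21 inclusive.
selected = stripes_to_read(stats, low=12, high=21)
matches = [v for i in selected for v in stripes[i] if 12 <= v <= 21]

print(selected)  # [1, 2]  (stripe 0 is skipped without being read)
print(matches)   # [14, 19, 20]
```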

Comparison of different file formats

Avro vs Parquet

  1. Avro is a row-based storage format, whereas Parquet stores data in columns.
  2. Parquet is better suited to analytical queries, meaning that read operations and data queries are much more efficient than writes.
  3. Write operations in Avro are more efficient than in Parquet.
  4. Avro handles schema evolution more maturely. Parquet supports only schema append, whereas Avro supports fuller schema evolution, that is, adding or modifying columns.
  5. Parquet is ideal for querying a subset of columns in a multi-column table. Avro is suited to ETL operations where all the columns are queried.

ORC vs Parquet

  1. Parquet stores nested data better.
  2. ORC is better suited to predicate pushdown.
  3. ORC supports ACID properties.
  4. ORC compresses data better.


Source: www.habr.com
