Why are different file formats needed?
A major performance bottleneck for HDFS-enabled applications such as MapReduce and Spark is the time it takes to find, read, and write data. These problems are compounded by the difficulty of managing large data sets when the schema evolves rather than staying fixed, or when there are storage constraints.
Processing big data increases the load on the storage subsystem: Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the processor, the network, the input/output system, and so on are also loaded. As the volume of data grows, so does the cost of processing and storing it.
Choosing an appropriate file format in Hadoop can bring several benefits:
- Faster read times.
- Faster write times.
- Splittable files.
- Support for schema evolution.
- Advanced compression support.
Some file formats are designed for general use, others for more specific use cases, and some are designed around specific data characteristics. So the choice is quite wide.
Avro File Format
Avro is widely used for data serialization. It is a row-based data storage format in Hadoop. It stores the schema in JSON format, which makes the schema easy to read and interpret by any program. The data itself is stored in a binary format that is compact and efficient.
Avro's serialization system is language neutral. Files can be processed in a variety of languages, currently C, C++, C#, Java, Python, and Ruby.
A key feature of Avro is its robust support for data schemas that change over time, that is, schema evolution. Avro understands schema changes: deleting, adding, or changing fields.
Avro supports a variety of data structures. For example, you can create a record that contains an array, an enumerated type, and a subrecord.
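To make the previous two points concrete, here is a minimal sketch of what such a schema could look like. It uses only Python's standard json module to build and print the JSON text of the schema; it does not call any Avro library. The record name Employee and all field names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: record and field names are illustrative only.
employee_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        # An array of strings
        {"name": "skills", "type": {"type": "array", "items": "string"}},
        # An enumerated type
        {
            "name": "status",
            "type": {
                "type": "enum",
                "name": "Status",
                "symbols": ["ACTIVE", "ON_LEAVE", "LEFT"],
            },
        },
        # A subrecord (a record nested inside the record)
        {
            "name": "department",
            "type": {
                "type": "record",
                "name": "Department",
                "fields": [{"name": "code", "type": "string"}],
            },
        },
    ],
}

# The schema is plain JSON, so any program can read and interpret it.
schema_json = json.dumps(employee_schema, indent=2)
field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
print(schema_json)
print(field_names)
```

Because the schema is ordinary JSON text, a downstream tool written in any language can parse it without knowing anything about the program that wrote the data.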
This format is therefore well suited to writing to the landing (transition) zone of a data lake, for the following reasons:
- Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in this case.
- Downstream systems can easily retrieve table schemas from the files, so there is no need to store schemas separately in an external metadata store.
- Any change to the original schema is handled easily (schema evolution).
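The idea behind schema evolution can be sketched in a few lines. The function below is a deliberately simplified, hypothetical model of how a reader's schema is applied to a record written with an older schema: fields the reader does not know are dropped, and fields missing from the data take the reader's default. It is not the Avro library's implementation, and the names resolve, old_record, and reader_fields are assumptions made for this example.

```python
def resolve(record, reader_fields):
    """Project a decoded record onto the reader's fields (simplified model)."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Record written with an older schema that still had "legacy_code".
old_record = {"id": 1, "name": "emp1", "legacy_code": "x9"}

# Newer reader schema: "legacy_code" removed, "department" added with a default.
reader_fields = [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": "string", "default": "unknown"},
]

new_view = resolve(old_record, reader_fields)
print(new_view)  # {'id': 1, 'name': 'emp1', 'department': 'unknown'}
```

The design point is that old files stay readable after the schema changes, as long as newly added fields carry a default.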
Parquet File Format
Parquet is an open-source file format for Hadoop that stores nested data structures in a flat columnar format.
Compared to the traditional row-based approach, Parquet is more efficient in terms of both storage and performance.
This is especially useful for queries that read specific columns from a wide table (one with many columns). Thanks to the file format, only the required columns are read, so I/O is kept to a minimum.
A small digression and explanation: to better understand the Parquet file format in Hadoop, let's look at what a column-based, i.e. columnar, format is. This format stores the values of each column together.
ID    Name    Department
1     emp1    d1
2     emp2    d2
3     emp3    d3
In a row-based format, the data would be stored as follows:
1  emp1  d1  2  emp2  d2  3  emp3  d3
In a columnar file format, the same data would be stored like this:
1  2  3  emp1  emp2  emp3  d1  d2  d3
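The two layouts can be reproduced with a short sketch. It flattens the same three-row example table first row by row and then column by column. This only illustrates the ordering of values; real file formats add encoding, compression, and metadata on top.

```python
# The example table: (ID, Name, Department)
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]

# Row-based layout: all fields of record 1, then all fields of record 2, ...
row_layout = [value for row in rows for value in row]

# Columnar layout: all IDs, then all names, then all departments.
column_layout = [value for column in zip(*rows) for value in column]

print(row_layout)     # [1, 'emp1', 'd1', 2, 'emp2', 'd2', 3, 'emp3', 'd3']
print(column_layout)  # [1, 2, 3, 'emp1', 'emp2', 'emp3', 'd1', 'd2', 'd3']
```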
The columnar format is more efficient when you need to query only some of the columns of a table. It reads just the required columns, because their values are stored next to each other. In this way, I/O operations are kept to a minimum.
For example, suppose you only need the NAME column. In a row-based format, every record has to be read and parsed to extract that one field, whereas in a columnar format the NAME values are stored together and can be read directly.
Thus, the columnar format improves query performance: it needs less seek time to reach the required columns, and it reduces the number of I/O operations because only the required columns are read.
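A rough way to see the difference is to count how many stored values have to be touched to answer "give me the NAME column" in each layout. The sketch below is a toy model with in-memory lists, not a measurement of real disk I/O. The helper names names_from_row_layout and names_from_column_layout are hypothetical.

```python
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]
columns = ("ID", "NAME", "DEPARTMENT")
name_index = columns.index("NAME")

row_flat = [value for row in rows for value in row]
column_flat = [value for column in zip(*rows) for value in column]


def names_from_row_layout(flat, width, index):
    """Walk every stored value, keeping only those at the wanted position."""
    names, touched = [], 0
    for position, value in enumerate(flat):
        touched += 1
        if position % width == index:
            names.append(value)
    return names, touched


def names_from_column_layout(flat, row_count, index):
    """Jump straight to the contiguous run of values for one column."""
    start = index * row_count
    chunk = flat[start:start + row_count]
    return chunk, len(chunk)


row_result = names_from_row_layout(row_flat, len(columns), name_index)
col_result = names_from_column_layout(column_flat, len(rows), name_index)
print(row_result)  # (['emp1', 'emp2', 'emp3'], 9)
print(col_result)  # (['emp1', 'emp2', 'emp3'], 3)
```

Both calls return the same names, but the row layout touches all nine values while the columnar layout touches only three. The gap grows with the number of columns in the table.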
One of Parquet's distinctive features is that it can store nested data structures in this columnar format.
To understand the Parquet file format in Hadoop, you need to know the following terms:
- Row group: a logical horizontal partitioning of the data into rows. A row group consists of one column chunk for each column in the data set.
- Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
- Page: column chunks are divided into pages that are written one after another. Pages share a common header, so readers can skip the pages they do not need.
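The relationship between the three terms can be sketched as a small in-memory data structure. This is only a conceptual model: the class names RowGroup and ColumnChunk, the function build_row_groups, and the tiny sizes (4 rows per group, 2 values per page) are assumptions made for the example and say nothing about how a real Parquet writer sizes or encodes these units.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    column: str
    pages: list  # each page is a list of consecutive values


@dataclass
class RowGroup:
    chunks: dict  # column name -> ColumnChunk


def build_row_groups(rows, columns, rows_per_group, values_per_page):
    """Split rows into row groups, each holding one chunk per column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        chunks = {}
        for i, column in enumerate(columns):
            values = [row[i] for row in batch]
            # A column chunk is written as a sequence of pages.
            pages = [
                values[p:p + values_per_page]
                for p in range(0, len(values), values_per_page)
            ]
            chunks[column] = ColumnChunk(column, pages)
        groups.append(RowGroup(chunks))
    return groups


rows = [(n, f"emp{n}", f"d{n}") for n in range(1, 6)]
groups = build_row_groups(
    rows, ("id", "name", "department"), rows_per_group=4, values_per_page=2
)
print(len(groups))                      # 2
print(groups[0].chunks["name"].pages)   # [['emp1', 'emp2'], ['emp3', 'emp4']]
print(groups[1].chunks["name"].pages)   # [['emp5']]
```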
The file header contains only the magic number PAR1 (4 bytes), which identifies the file as a Parquet file.
The footer contains the following:
- File metadata, which holds the start locations of the metadata for every column. When reading, you first read the file metadata to find all the column chunks you are interested in. The column chunks should then be read sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs.
- Length of the metadata (4 bytes).
- Magic number PAR1 (4 bytes).
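The layout described above (magic number, data, metadata, 4-byte metadata length, magic number) suggests how a reader finds the metadata: it starts from the end of the file. The sketch below builds a fake byte string with that shape and reads it back. The metadata here is a placeholder byte string, whereas real Parquet metadata is a serialized structure that needs a proper decoder. The function names build_file and read_footer_metadata are hypothetical, and the little-endian byte order of the length field is stated here as an assumption about the format rather than something the text above spells out.

```python
import struct

MAGIC = b"PAR1"


def build_file(body, metadata):
    """Assemble header magic + body + metadata + length + footer magic."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_footer_metadata(data):
    """Return the raw metadata bytes by walking backwards from the end."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic number missing")
    # The 4 bytes before the trailing magic hold the metadata length.
    (length,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - length:-8]


# Placeholder contents, for illustration only.
data = build_file(b"<column chunks>", b"<file metadata>")
metadata = read_footer_metadata(data)
print(metadata)  # b'<file metadata>'
```

Putting the metadata at the end lets a writer stream the column chunks first and record their locations once everything has been written.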
ORC File Format
ORC (Optimized Row Columnar) is an optimized row-columnar file format.
Advantages of the ORC format:
- Each task produces a single file as its output, which reduces the load on the NameNode.
- Support for Hive data types, including DateTime, decimal, and complex data types (struct, list, map, and union).
- Concurrent reads of the same file by different RecordReader processes.
- The ability to split files without scanning for markers.
- An estimate of the maximum heap memory needed by read/write processes, based on information in the file footer.
- Metadata is stored in the Protocol Buffers binary serialization format, which allows fields to be added and removed.
ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.
An ORC file stores groups of rows called stripes, along with auxiliary information in the file footer. The Postscript at the end of the file contains the compression parameters and the size of the compressed footer.
The default stripe size is 250 MB. Thanks to such large stripes, reads from HDFS are more efficient, because they are done in large contiguous blocks.
The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also stores the resulting count, min, max, and sum values for each column.
The stripe footer contains a directory of stream locations.
Row data is used in table scans.
Index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only for selecting stripes and row groups, not for answering queries.
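The role of the min/max index can be illustrated with a toy model. In the sketch below, each "stripe" is just a small Python list with precomputed min and max values, and a filter uses those statistics to decide which stripes to skip. The stripe size of 3 values, the salaries column, and the function names are assumptions made for the example and are unrelated to real ORC sizes or APIs.

```python
def build_stripes(values, stripe_size):
    """Group values into stripes and record min/max statistics for each."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"values": chunk, "min": min(chunk), "max": max(chunk)})
    return stripes


def scan_greater_than(stripes, threshold):
    """Return values above threshold, skipping stripes that cannot match."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe["max"] <= threshold:
            continue  # statistics show no value here can match: skip it
        stripes_read += 1
        # The index only selected the stripe; the values are still checked.
        matches.extend(v for v in stripe["values"] if v > threshold)
    return matches, stripes_read


salaries = [10, 12, 15, 40, 42, 45, 70, 75, 80]
stripes = build_stripes(salaries, stripe_size=3)
matches, stripes_read = scan_greater_than(stripes, 44)
print(matches)       # [45, 70, 75, 80]
print(stripes_read)  # 2
```

The first stripe is never read because its maximum is 15. The second stripe is read even though only one of its values matches, which shows why the index narrows down what to read rather than answering the query by itself.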
Comparison of the different file formats
Avro vs Parquet
- Avro is a row-based storage format, while Parquet stores data in columns.
- Parquet is better suited to analytical queries, meaning that reading and querying data is much more efficient than writing.
- Write operations in Avro are more efficient than in Parquet.
- Avro handles schema evolution more maturely. Parquet supports only appending to the schema, while Avro supports richer evolution, that is, adding or modifying columns.
- Parquet is ideal for querying a subset of columns in a table with many columns. Avro is suitable for ETL operations in which all columns are queried.
ORC vs Parquet
- Parquet stores nested data better.
- ORC is better suited to predicate pushdown.
- ORC supports ACID properties.
- ORC compresses data better.
Source: www.habr.com