File formats in big data: a brief primer


The Mail.ru Cloud Solutions team has translated an article by Rahul Bhatia, an engineer at Clairvoyant, about which file formats exist in big data, what the most common features of Hadoop formats are, and which format is best to use.

Why are different file formats needed?

A major performance bottleneck for HDFS-enabled applications such as MapReduce and Spark is the time it takes to find, read, and write data. These problems are compounded by the difficulty of managing large datasets when the schema evolves rather than stays fixed, or when there are storage constraints.

Processing big data increases the load on the storage subsystem: Hadoop stores data redundantly to achieve fault tolerance. Besides the disks, the CPU, the network, the I/O system, and other resources are also loaded. As the volume of data grows, so does the cost of processing and storing it.

The various file formats in Hadoop were created to solve exactly these problems. Choosing the right file format can bring significant benefits:

  1. Faster reads.
  2. Faster writes.
  3. Splittable files.
  4. Support for schema evolution.
  5. Advanced compression support.

Some file formats are intended for general use, others for more specific use cases, and some were designed around particular data characteristics. So the choice is quite wide.

Avro file format

Avro is widely used for data serialization. It is a row-based format, that is, a row-oriented way of storing data in Hadoop. It stores the schema as JSON, which makes the schema easy for any program to read and interpret. The data itself is stored in a compact and efficient binary format.

Avro's serialization system is language-neutral. Files can be processed in a variety of languages, currently C, C++, C#, Java, Python, and Ruby.

A key feature of Avro is its robust support for data schemas that change over time, that is, schema evolution. Avro handles schema changes such as deleting, adding, or modifying fields.
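The idea of schema evolution can be illustrated with a small plain-Python sketch. It does not use the Avro library: the `resolve` helper, the `NO_DEFAULT` marker, and the field names are all hypothetical, and the logic is only a simplified imitation of how a newer reader schema with default values can be applied to a record written with an older schema.

```python
# Simplified illustration of schema evolution (NOT the real Avro API).
# The reader schema is a list of (field_name, default) pairs;
# NO_DEFAULT marks a field that must be present in the stored record.
NO_DEFAULT = object()

# Newer schema: "fax" was removed, "email" was added with a default value.
reader_schema = [("id", NO_DEFAULT), ("name", NO_DEFAULT), ("email", "unknown")]


def resolve(record, reader_fields):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for name, default in reader_fields:
        if name in record:
            resolved[name] = record[name]   # field exists in both schemas
        elif default is not NO_DEFAULT:
            resolved[name] = default        # field added later: use its default
        else:
            raise ValueError(f"missing required field: {name}")
    return resolved                         # fields absent from the reader schema are dropped


# A record that was written before the schema changed.
old_record = {"id": 1, "name": "emp1", "fax": "555-0100"}
new_view = resolve(old_record, reader_schema)
print(new_view)  # {'id': 1, 'name': 'emp1', 'email': 'unknown'}
```

The point of the sketch is that old files stay readable: removed fields are ignored and added fields are filled from defaults, so data does not have to be rewritten when the schema changes.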

Avro supports a variety of data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
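As an illustration, a schema for such a record might look like the JSON below. The record name `Employee` and all field names are made up for this example; the snippet uses only Python's standard `json` module to show that the schema is ordinary JSON that any program can inspect.

```python
import json

# Hypothetical Avro schema: a record holding an array, an enum, and a sub-record.
schema_json = """
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "skills", "type": {"type": "array", "items": "string"}},
    {"name": "status", "type": {"type": "enum", "name": "Status",
                                "symbols": ["ACTIVE", "INACTIVE"]}},
    {"name": "address", "type": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "zip", "type": "string"}
      ]
    }}
  ]
}
"""

schema = json.loads(schema_json)

# Walk the top-level fields and print the kind of each one.
for field in schema["fields"]:
    field_type = field["type"]
    kind = field_type["type"] if isinstance(field_type, dict) else field_type
    print(f'{field["name"]}: {kind}')
```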

This format is well suited for writing to the landing (transition) zone of a data lake (a data lake is a set of storage instances that hold various types of data in addition to the data sources themselves), for the following reasons:

  1. Data from this zone is usually read in its entirety for further processing by downstream systems, and a row-based format is more efficient in that case.
  2. Downstream systems can easily retrieve table schemas from the files, so there is no need to keep the schema separately in an external metastore.
  3. Any change to the original schema is easy to handle (schema evolution).

Parquet file format

Parquet is an open-source file format for Hadoop that stores nested data structures in a flat columnar format.

Compared to the traditional row-based approach, Parquet is more efficient in terms of both storage and performance.

This is especially useful for queries that read specific columns from a wide table (one with many columns). Thanks to the file layout, only the required columns are read, so I/O is kept to a minimum.

A short digression and explanation: to better understand the Parquet file format in Hadoop, let's look at what a column-based, or columnar, format is. Such a format stores the values of each column together.

For example, suppose a record includes the fields ID, Name, and Department. In this case, all the values of the ID column are stored together, as are the values of the Name column, and so on. The table looks something like this:

ID   Name   Department
1    emp1   d1
2    emp2   d2
3    emp3   d3

In a row-based format, the data is stored like this:

1 emp1 d1 2 emp2 d2 3 emp3 d3

In a columnar file format, the same data is stored like this:

1 2 3 emp1 emp2 emp3 d1 d2 d3

The columnar format is more efficient when you need to query only a few columns from a table. It reads just the required columns, because their values are stored next to each other. This keeps I/O operations to a minimum.

For example, suppose you only need the Name column. In a row-based format, every record in the dataset has to be loaded and parsed field by field before the Name value can be extracted. The columnar format lets you go straight to the Name column, because all the values of that column are stored together. You do not have to scan the whole record.

Thus, the columnar format improves query performance: it takes less seek time to reach the required columns, and it reduces the number of I/O operations because only the desired columns are read.
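The difference between the two layouts can be sketched in a few lines of plain Python. This is only an in-memory model of the three-row example table: the helper names are hypothetical, and real formats work with bytes on disk rather than Python lists.

```python
# The example table as a list of (ID, Name, Department) rows.
rows = [(1, "emp1", "d1"), (2, "emp2", "d2"), (3, "emp3", "d3")]
columns = ("ID", "Name", "Department")

# Row-based layout: the values of one record sit next to each other.
row_layout = [value for row in rows for value in row]

# Columnar layout: the values of one column sit next to each other.
column_layout = [value for column in zip(*rows) for value in column]


def read_column_from_rows(layout, column_index, width):
    """Row layout: every record has to be visited to pick out one field."""
    return [layout[i + column_index] for i in range(0, len(layout), width)]


def read_column_from_columns(layout, column_index, row_count):
    """Columnar layout: one contiguous slice holds the whole column."""
    start = column_index * row_count
    return layout[start:start + row_count]


name_index = columns.index("Name")
print(row_layout)     # [1, 'emp1', 'd1', 2, 'emp2', 'd2', 3, 'emp3', 'd3']
print(column_layout)  # [1, 2, 3, 'emp1', 'emp2', 'emp3', 'd1', 'd2', 'd3']
names_from_rows = read_column_from_rows(row_layout, name_index, len(columns))
names_from_columns = read_column_from_columns(column_layout, name_index, len(rows))
print(names_from_rows, names_from_columns)
```

Both helpers return the same names, but the row-based one touches every record, while the columnar one reads a single contiguous range.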

One of Parquet's distinctive features is that it can store data with nested structures. This means that in a Parquet file even nested fields can be read individually, without reading all the fields of the nested structure. Parquet uses the record shredding and assembly algorithm to store nested structures.

To understand the Parquet file format in Hadoop, you need to know the following terms:

  1. Row group: a logical horizontal partitioning of the data into rows. A row group consists of one column chunk for each column in the dataset.
  2. Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  3. Page: column chunks are divided into pages that are written one after another. Pages share a common header, so unneeded pages can be skipped when reading.
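The hierarchy of these three terms can be modelled with nested Python lists. This is a toy sketch under simplifying assumptions: the `layout` and `chunked` helpers are hypothetical, and the sizes here are counted in rows and values, whereas a real Parquet writer sizes row groups and pages in bytes.

```python
def chunked(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def layout(rows, column_names, rows_per_group, values_per_page):
    """Arrange rows as row groups -> column chunks -> pages (toy model)."""
    row_groups = []
    for group_rows in chunked(rows, rows_per_group):
        column_chunks = {}
        for index, name in enumerate(column_names):
            # One column chunk per column inside each row group.
            column_values = [row[index] for row in group_rows]
            # Each column chunk is split into pages written one after another.
            column_chunks[name] = chunked(column_values, values_per_page)
        row_groups.append(column_chunks)
    return row_groups


rows = [(i, f"emp{i}", f"d{i}") for i in range(1, 7)]
groups = layout(rows, ("ID", "Name", "Department"),
                rows_per_group=4, values_per_page=2)

print(len(groups))        # 2
print(groups[0]["ID"])    # [[1, 2], [3, 4]]
print(groups[1]["Name"])  # [['emp5', 'emp6']]
```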

The file header contains only the 4-byte magic number PAR1, which identifies the file as a Parquet file.

The footer contains the following:

  1. The file metadata, which contains the starting offsets of the metadata for each column. When reading, you must first read the file metadata to find all the column chunks of interest. The column chunks should then be read sequentially. The metadata also includes the format version, the schema, and any additional key-value pairs.
  2. The length of the metadata (4 bytes).
  3. The magic number PAR1 (4 bytes).
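Because the metadata length and the magic number sit at the very end, a reader locates the metadata by working backwards from the end of the file. The sketch below imitates that with a toy byte string. The helper names are hypothetical and the metadata is a placeholder: in a real Parquet file the metadata is a Thrift-serialized structure, which this example does not attempt to parse. The length is assumed to be a little-endian 32-bit integer, as in the Parquet format.

```python
import struct

MAGIC = b"PAR1"


def build_toy_file(body: bytes, metadata: bytes) -> bytes:
    """Assemble: header magic + body + metadata + 4-byte length + footer magic."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_footer_metadata(data: bytes) -> bytes:
    """Locate the metadata by reading the file from its end."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("magic number PAR1 is missing")
    # The 4 bytes before the trailing magic hold the metadata length.
    (metadata_length,) = struct.unpack("<I", data[-8:-4])
    metadata_end = len(data) - 8
    return data[metadata_end - metadata_length:metadata_end]


# Placeholder contents, for illustration only.
toy = build_toy_file(b"column-chunks...", b"placeholder-metadata")
footer = read_footer_metadata(toy)
print(footer)  # b'placeholder-metadata'
```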

ORC file format

The Optimized Row Columnar (ORC) file format offers a highly efficient way to store data and was designed to overcome the limitations of other formats. It stores data in a compact form and lets you skip data you do not need, without having to build large, complex, or manually maintained indexes.

Advantages of the ORC format:

  1. A single file is produced as the output of each task, which reduces the load on the NameNode.
  2. Support for Hive data types, including DateTime, decimal, and complex types (struct, list, map, and union).
  3. Concurrent reads of the same file by several RecordReader processes.
  4. The ability to split files without scanning for markers.
  5. An upper bound on the heap memory needed for reading and writing can be estimated from information in the file footer.
  6. Metadata is stored in the Protocol Buffers binary serialization format, which allows fields to be added and removed.

ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.

An ORC file stores groups of rows called stripes, together with auxiliary information in the file footer. The postscript at the end of the file holds the compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Thanks to such large stripes, reads from HDFS are more efficient, because they happen in large contiguous blocks.

The file footer records the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also holds the column-level aggregates: count, min, max, and sum.

The stripe footer contains a directory of stream locations.

Row data is used when scanning tables.

Index data includes the minimum and maximum values for each column and the row positions within each column. ORC indexes are used only for selecting stripes and row groups, not for answering queries.
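How minimum and maximum values help select stripes can be shown with a small plain-Python model. It is a sketch under stated assumptions, not ORC's actual reader: the `Stripe` class, the helper functions, and the single indexed column `id` are all hypothetical, and the stripes here hold two rows each instead of hundreds of megabytes.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    min_id: int  # statistics recorded once, when the stripe is written
    max_id: int
    rows: list


def make_stripe(rows):
    """Build a stripe and record the min/max of its `id` column."""
    ids = [row["id"] for row in rows]
    return Stripe(min(ids), max(ids), rows)


def select_ids_above(stripes, threshold):
    """Return rows with id > threshold, skipping stripes by their max value."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_id <= threshold:
            continue  # no row in this stripe can match, so it is never read
        stripes_read += 1
        matches.extend(row for row in stripe.rows if row["id"] > threshold)
    return matches, stripes_read


stripes = [
    make_stripe([{"id": 1, "name": "emp1"}, {"id": 2, "name": "emp2"}]),
    make_stripe([{"id": 3, "name": "emp3"}, {"id": 4, "name": "emp4"}]),
    make_stripe([{"id": 5, "name": "emp5"}, {"id": 6, "name": "emp6"}]),
]

matches, stripes_read = select_ids_above(stripes, 4)
print([row["name"] for row in matches])  # ['emp5', 'emp6']
print(stripes_read)                      # 1
```

The statistics only decide which stripes to open; the rows inside an opened stripe are still filtered one by one, which matches the remark that the indexes select data but do not answer the query themselves.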

Comparison of file formats

Avro vs Parquet

  1. Avro is a row-based storage format, while Parquet stores data in columns.
  2. Parquet is better suited for analytical queries: reads and queries are much more efficient than writes.
  3. Write operations are more efficient in Avro than in Parquet.
  4. Avro handles schema evolution more maturely. Parquet supports only appending to the schema, while Avro supports fuller evolution, that is, adding or changing columns.
  5. Parquet is ideal for querying a subset of columns in a table with many columns. Avro is suitable for ETL operations in which all columns are queried.

ORC vs Parquet

  1. Parquet stores nested data better.
  2. ORC is better suited for predicate pushdown.
  3. ORC supports ACID properties.
  4. ORC compresses data better.

What else to read on the topic:

  1. Big data analytics in the cloud: how a company can become data-driven.
  2. A humble guide to database schemas.
  3. Our Telegram channel about digital transformation.

source: www.habr.com
