What is special about Cloudera and how to cook it

The market for distributed computing and big data is, according to statistics, growing by 18-19% per year. That means the question of choosing software for these purposes remains relevant. In this post we start with why distributed computing is needed, go into detail on choosing software, talk about using Hadoop with Cloudera, and finish with choosing hardware and how it affects performance in different ways.

Why is distributed computing needed in ordinary business? Everything here is simple and complicated at the same time. Simple, because in most cases we perform relatively simple calculations per unit of data. Complicated, because there is a lot of such data. A great deal. As a result, you have to process terabytes of data in 1000 threads. So the use cases are quite universal: the computations can be applied wherever a large number of metrics has to be taken into account over an even larger data set.

One recent example: the pizzeria chain Dodo Pizza determined, by analyzing customer order data, that when choosing a pizza with arbitrary toppings, users usually work with only six basic sets of toppings plus a few random ones. Based on this, the chain adjusted its purchasing. In addition, it was able to make better recommendations of additional products offered to users at order time, which increased profit.

Another example: product analysis allowed H&M to reduce the assortment in individual stores by 40% while maintaining sales levels. This was achieved by excluding poorly selling items, and seasonality was taken into account in the calculations.

Choosing the tool

The industry standard for this kind of computing is Hadoop. Why? Because Hadoop is an excellent, well-documented framework (Habr itself has many detailed articles on the topic), accompanied by a whole set of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself will distribute them across the available computing capacity. Moreover, that capacity can be added or switched off at any time: horizontal scalability in action.

In 2017 the influential consulting company Gartner concluded that Hadoop would soon become obsolete. The reason is fairly banal: the analysts believe that companies will migrate en masse to the cloud, since there they can pay for computing power as they use it. The second important factor that could "bury" Hadoop is its speed, because options such as Apache Spark or Google Cloud DataFlow are faster than MapReduce, which underlies Hadoop.

Hadoop rests on several pillars, the best known of which are the MapReduce technology (a system for distributing data for computation among servers) and the HDFS file system. The latter is designed specifically for storing data distributed among cluster nodes: each block of a fixed size can be placed on several nodes, and thanks to replication the system tolerates failures of individual nodes. Instead of a file table, a special server called the NameNode is used.
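
To make the block and replica idea concrete, here is a minimal sketch in Python. It is a toy model, not HDFS code: the function names, the round-robin placement, and the sizes are assumptions made for illustration (real HDFS uses its own placement policy), and the returned dictionary only stands in for the kind of block-to-node mapping the NameNode keeps.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]}, a toy stand-in for NameNode metadata."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Hypothetical round-robin policy: start at a different node for each
        # block and take the next `replication` nodes, wrapping around.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


# Hypothetical cluster and file sizes, chosen only for the example.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size=1000, block_size=256, nodes=nodes)
survivors = readable_blocks(placement, failed_nodes={"node1", "node2"})
print(len(placement), "blocks,", len(survivors), "still readable")
```

With three replicas per block, losing any two nodes in this toy cluster still leaves every block readable, which is the fault tolerance the paragraph above describes.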

The figure below shows how MapReduce works. In the first stage the data is split according to some criterion, in the second stage it is distributed across the computing capacity, and in the third stage the computation itself takes place.
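
The same flow can be sketched in a few lines of plain Python. This is an illustration of the general MapReduce pattern on a single machine, not Hadoop's API: the function names, the word-count task, and the number of workers are all assumptions, and the usual map, shuffle and reduce steps correspond only loosely to the three stages described above.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs, num_workers):
    """Shuffle: route every key to one worker and group its values there."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for key, value in pairs:
        workers[hash(key) % num_workers][key].append(value)
    return workers


def reduce_phase(grouped):
    """Reduce: collapse the list of values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, num_workers=3):
    """Run the three phases in sequence; a real cluster runs them in parallel."""
    pairs = [pair for record in records for pair in map_phase(record)]
    totals = {}
    for grouped in shuffle_phase(pairs, num_workers):
        totals.update(reduce_phase(grouped))
    return totals


# Hypothetical input records.
counts = word_count(["pizza with cheese", "pizza with ham", "cheese"])
print(counts)
```

Because each key lands on exactly one worker, the per-worker results never overlap and can simply be merged at the end.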

[Figure: how MapReduce works]
MapReduce was originally created by Google for the needs of its search. Then MapReduce made its way into open source, and Apache took over the project. Google, meanwhile, gradually migrated to other solutions. An interesting tidbit: Google currently has a project called Google Cloud Dataflow, positioned as the next step after Hadoop, as its faster replacement.

A closer look shows that Google Cloud Dataflow is based on a variant of Apache Beam, while Apache Beam includes the well-documented Apache Spark framework, which lets us talk about almost identical execution speed for the two solutions. And Apache Spark works fine on top of the HDFS file system, which allows it to be deployed on Hadoop servers.

Add to this the volume of documentation and ready-made solutions for Hadoop and Spark compared with Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves which code, for Hadoop or for Spark, they should run, based on the task, their experience, and their qualifications.

Cloud or local server

The trend toward a general move to the cloud has even given rise to such an interesting term as Hadoop-as-a-service. In such a scenario, administration of the connected servers becomes very important. Because, alas, despite its popularity, pure Hadoop is a rather difficult tool to configure, since a lot has to be done by hand. For example, configuring servers individually, monitoring their performance, and carefully tuning many parameters. In general, it is a job for an enthusiast, and there is a good chance of messing up somewhere or missing something.

Therefore, various distributions that come equipped from the start with convenient deployment and administration tools have become very popular. One of the most popular distributions that supports Spark and simplifies everything is Cloudera. It has both paid and free versions, and in the latter all basic functionality is available, with no limit on the number of nodes.

During setup, Cloudera Manager will connect to your servers via SSH. An interesting point: when installing, it is better to specify that the installation be carried out with so-called parcels: special packages, each of which contains all the necessary components configured to work with one another. Essentially, this is an improved version of a package manager.

After installation we get a cluster management console, where you can see cluster telemetry and the installed services, and where you can also add or remove resources and edit the cluster configuration.

As a result, the cabin of the rocket that will take you into the bright future of BigData appears before you. But before we say "let's go," let's look under the hood.

Hardware requirements

On its website, Cloudera mentions various possible configurations. The general principles on which they are built are shown in the illustration:

[Figure: general principles of Cloudera hardware configurations]
MapReduce can spoil this optimistic picture. If you look again at the diagram from the previous section, it becomes clear that in almost all cases a MapReduce job can hit a bottleneck when reading data from disk or from the network. This is also noted in the Cloudera blog. As a result, for any fast computation, including through Spark, which is often used for real-time calculations, I/O speed is very important. Therefore, when using Hadoop, it is very important that the cluster consist of balanced and fast machines, which, to put it mildly, is not always guaranteed in cloud infrastructure.
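
A rough back-of-the-envelope calculation shows why I/O dominates. The sketch below assumes a job that has to read every byte once and that all nodes read in parallel at a fixed rate. Every number in it (10 TB of data, 12 disks per node at 150 MB/s, one 10 Gbit/s link per node) is a hypothetical value picked for illustration, not a measurement from the article; only the count of 9 nodes echoes the minimum configuration discussed later.

```python
def read_time_seconds(data_tb, nodes, per_node_mb_s):
    """Lower bound on the time to read `data_tb` terabytes once.

    Assumes perfectly parallel reads at `per_node_mb_s` on every node and
    decimal units (1 TB = 1,000,000 MB).
    """
    data_mb = data_tb * 1_000_000
    return data_mb / (nodes * per_node_mb_s)


# Hypothetical hardware figures, labeled as assumptions.
DISKS_PER_NODE = 12
DISK_MB_S = 150
LINK_GBIT_S = 10

# Case 1: every node reads its own data from local disks.
local_disks = read_time_seconds(
    10, nodes=9, per_node_mb_s=DISKS_PER_NODE * DISK_MB_S
)
# Case 2: all data has to cross the network (1 Gbit/s = 125 MB/s).
over_network = read_time_seconds(
    10, nodes=9, per_node_mb_s=LINK_GBIT_S * 1000 / 8
)

print(f"local disks:  {local_disks / 60:.1f} min")
print(f"over network: {over_network / 60:.1f} min")
```

Under these assumptions neither figure depends on CPU speed at all, so one slow disk group or an oversubscribed link on a single node sets the pace for the whole job.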

Balance in load distribution is achieved by using OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are allocated their own processor resources and specific disks. In our solution, Atos Codex Data Lake Engine, broad virtualization is achieved, which is why we gain both in performance (the impact of the network infrastructure is minimized) and in TCO (extra physical servers are eliminated).

When using BullSequana S200 servers, we get a very even load, without any bottlenecks. The minimum configuration includes 3 BullSequana S200 servers, each with two JBODs, and additional S200s containing four data nodes each can optionally be connected. Here is an example of the load in the TeraGen test:

[Figure: load during the TeraGen test]

Tests with different data volumes and replication values show the same results in terms of load distribution among the cluster nodes. Below is a graph of the distribution of disk access across the performance tests.

[Figure: distribution of disk access across performance tests]

The calculations were made for a minimum configuration of 3 BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, as well as reserved virtual machines in case protection based on OpenStack Virtualization is deployed. TeraSort test result: with a block size of 512 MB, a replication factor of three, and encryption, the run took 23.1 minutes.

How can the system be expanded? Various types of extensions are available for the Data Lake Engine:

  • Data nodes: for every 40 TB of usable space
  • Analytic nodes with the ability to install a GPU
  • Other options depending on business needs (for example, if you need Kafka and the like)
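
As a rough sizing aid, the first rule in the list can be turned into a one-line calculation. The sketch below reads it as "one data node extension per 40 TB of usable space", which is an interpretation of the text rather than an official Atos sizing formula; the function name and the sample volumes are hypothetical.

```python
import math

TB_PER_DATA_NODE = 40  # from the list above: one data node per 40 TB usable


def data_nodes_needed(usable_tb):
    """Number of data node extensions for `usable_tb` of usable space.

    Assumes a partly filled node still counts as a whole node.
    """
    if usable_tb < 0:
        raise ValueError("usable_tb must be non-negative")
    return math.ceil(usable_tb / TB_PER_DATA_NODE)


# Hypothetical target volumes.
for target_tb in (40, 100, 250):
    print(target_tb, "TB ->", data_nodes_needed(target_tb), "data nodes")
```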

The Atos Codex Data Lake Engine includes both the servers themselves and pre-installed software, including a licensed Cloudera kit, Hadoop itself, OpenStack with virtual machines based on the RedHat Enterprise Linux kernel, and data replication and backup systems (including the use of a backup node and Cloudera BDR, Backup and Disaster Recovery). Atos Codex Data Lake Engine became the first solution to be certified by Cloudera.

If you are interested in the details, we will be happy to answer your questions in the comments.

source: www.habr.com
