What's special about Cloudera and how to cook it

According to statistics, the market for distributed computing and big data is growing by 18-19% per year. This means that the question of choosing software for these purposes remains relevant. In this post we will start with why distributed computing is needed, look at software selection in more detail, discuss using Hadoop with Cloudera, and finally talk about choosing hardware and how it affects performance.

Why does an ordinary business need distributed computing? The answer is simple and complicated at the same time. Simple, because in most cases we perform relatively simple calculations per unit of data. Complicated, because there is a lot of such data. A great deal. As a result, terabytes of data have to be processed in 1000 threads. The use cases are therefore quite universal: such computation can be applied wherever a large number of metrics must be taken into account over an even larger body of data.
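
To make "a simple calculation per unit of data, repeated many times" concrete, here is a minimal single-machine sketch. The order records, the `order_total` function, and the worker count are all hypothetical and chosen only for illustration; a real cluster spreads the same pattern across many servers rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def order_total(order):
    """A deliberately simple calculation applied to one unit of data.

    Each order is a list of (price, quantity) pairs (hypothetical format).
    """
    return sum(price * quantity for price, quantity in order)


def process_all(orders, workers=4):
    """Apply the same calculation to every record using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input records.
        return list(pool.map(order_total, orders))


# Hypothetical sample data: three small orders.
orders = [
    [(10.0, 2), (3.5, 1)],
    [(7.0, 3)],
    [(1.0, 10), (2.0, 5)],
]
totals = process_all(orders)
```

The calculation itself is trivial; the difficulty described above comes only from the number of records it has to be applied to.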

A recent example: by analyzing its customer order database, Dodo Pizza found that when choosing a pizza with custom toppings, users usually work with only six basic sets of ingredients plus a couple of random ones. The pizzeria adjusted its purchasing accordingly. In addition, it was able to make better recommendations of extra products to users at the ordering stage, which increased profits.

Another example: merchandise analysis allowed H&M to reduce the assortment in individual stores by 40% while maintaining the level of sales. This was achieved by excluding poorly selling items, with seasonality taken into account in the calculations.

Tool selection

The industry standard for this kind of computing is Hadoop. Why? Because Hadoop is an excellent, well-documented framework (Habr itself has many detailed articles on the topic) that comes with a whole set of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself will distribute them among the available computing resources. Moreover, those resources can be added or switched off at any time: this is horizontal scalability in action.

In 2017, the influential consulting company Gartner concluded that Hadoop would soon become obsolete. The reasoning is rather banal: the analysts believe that companies will migrate to the cloud en masse, because there they can pay according to the computing power they actually use. The second important factor supposedly capable of "burying" Hadoop is speed, because alternatives such as Apache Spark or Google Cloud Dataflow are faster than the MapReduce that underlies Hadoop.

Hadoop rests on several pillars, the most notable of which are MapReduce (a system for distributing data for computation among servers) and the HDFS file system. The latter is designed specifically for storing information distributed among cluster nodes: each block of a fixed size can be placed on several nodes, and thanks to replication the system tolerates failures of individual nodes. Instead of a file table, a special server called the NameNode is used.
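
The block-and-replica idea can be sketched in a few lines. This is only an illustration under stated assumptions: the node names, the file size, and the round-robin placement rule are hypothetical (real HDFS uses a rack-aware placement policy), and the `block_map` dictionary merely stands in for the kind of metadata the NameNode keeps.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the fixed-size blocks a file is cut into.

    The last block may be smaller than block_size.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, used here only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical cluster and file (sizes in MB).
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(file_size=1300, block_size=512)
block_map = place_replicas(len(blocks), nodes, replication=3)
```

With three replicas per block, losing any two nodes still leaves every block readable in this toy model, which is the property the paragraph above describes.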

The illustration below shows how MapReduce works. In the first stage the data is split according to a certain attribute, in the second stage it is distributed across the computing resources, and in the third stage the computation takes place.
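
The three stages can be imitated on a single machine. The sketch below is not Hadoop code; it is a plain-Python approximation in which the function names, the number of reducers, and the sample orders are all hypothetical. It counts how often each ingredient appears across a set of orders.

```python
from collections import defaultdict


def map_phase(order):
    """Stage 1: split the data by an attribute, emitting (ingredient, 1) pairs."""
    return [(ingredient, 1) for ingredient in order]


def shuffle_phase(pairs, num_reducers):
    """Stage 2: distribute the pairs across reducers so that all values
    for one key land on the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Stage 3: perform the actual computation for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(orders, num_reducers=2):
    """Run the three stages in sequence and merge the reducers' outputs."""
    pairs = [pair for order in orders for pair in map_phase(order)]
    result = {}
    for partition in shuffle_phase(pairs, num_reducers):
        result.update(reduce_phase(partition))
    return result


# Hypothetical sample data: ingredients chosen in three orders.
orders = [
    ["cheese", "ham"],
    ["cheese", "mushrooms"],
    ["ham", "cheese"],
]
counts = run_job(orders)
```

Because every key is routed to exactly one reducer, the partial results never overlap and can simply be merged at the end.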

[Figure: MapReduce processing stages]

MapReduce was originally developed by Google for the needs of its search engine. Later the MapReduce model was implemented as open source, and the project ended up under the Apache umbrella. Google, for its part, gradually migrated to other solutions. An interesting nuance: Google now has a project called Google Cloud Dataflow, positioned as the next step after Hadoop and as a quick replacement for it.

A closer look shows that Google Cloud Dataflow is based on a variation of Apache Beam, and Apache Beam in turn can run on the well-documented Apache Spark framework, which lets us speak of nearly equivalent solutions. Apache Spark, for its part, works perfectly well on top of the HDFS file system, which allows it to be deployed on Hadoop servers.

Add to this the volume of documentation and ready-made solutions available for Hadoop and Spark compared with Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves which code, Hadoop or Spark, they will run, depending on the task, their experience, and their qualifications.

Cloud or local server

The trend toward a general move to the cloud has even given rise to such an interesting term as Hadoop-as-a-service. In this scenario, administration of the connected servers becomes very important. Alas, despite its popularity, pure Hadoop is a rather difficult tool to configure, since a lot has to be done by hand. For example, you have to configure servers individually, monitor their performance, and fine-tune many parameters. In general, it is a job for an enthusiast, and there is a good chance of getting something wrong or missing something.

That is why various distributions have become very popular: they come equipped from the start with convenient deployment and administration tools. One of the most popular distributions that supports Spark and simplifies things is Cloudera. It has both paid and free versions, and the free one includes all the core functionality with no limit on the number of nodes.

During setup, Cloudera Manager connects to your servers via SSH. An interesting point: when installing, it is better to specify that the installation be carried out with so-called parcels, special packages each of which contains all the necessary components already configured to work with one another. In essence, this is an improved version of a package manager.

After installation we get a cluster management console, where you can see cluster telemetry and the installed services, add or remove resources, and edit the cluster configuration.

As a result, you find yourself looking at the cockpit of the rocket that will take you into the bright future of BigData. But before we say "let's go", let's take a quick look under the hood.

Hardware requirements

On its website, Cloudera describes several possible configurations. The general principles on which they are built are shown in the illustration:

[Figure: general principles of Cloudera hardware configurations]

MapReduce can spoil this optimistic picture. Looking again at the diagram in the previous section, it becomes clear that in almost every case a MapReduce job can hit a bottleneck when reading data from disk or over the network. This is also noted on the Cloudera blog. As a result, for any fast computation, including computation through Spark, which is often used for real-time processing, I/O speed is very important. When using Hadoop it is therefore essential that the cluster consists of balanced and fast machines, which, to put it mildly, is not always guaranteed in a cloud infrastructure.
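
A back-of-the-envelope estimate shows why I/O dominates. The sketch below assumes a worst case in which each node is limited by the slower of its aggregate disk throughput and its network link, and that data is spread evenly across nodes. The function names and every number passed to them are hypothetical, not measurements from this article.

```python
def bottleneck(disks_per_node, disk_mb_s, network_mb_s):
    """Name the slower of the two I/O paths on a single node."""
    disk_rate = disks_per_node * disk_mb_s
    return "disk" if disk_rate < network_mb_s else "network"


def scan_time_seconds(data_tb, nodes, disks_per_node, disk_mb_s, network_mb_s):
    """Estimate the time for one full pass over the data when I/O is the limit.

    Assumptions: data is spread evenly across nodes, and each node is bound
    by the slower of its aggregate disk rate and its network link.
    """
    data_mb = data_tb * 1024 * 1024
    per_node_rate = min(disks_per_node * disk_mb_s, network_mb_s)
    return data_mb / (nodes * per_node_rate)


# Hypothetical cluster: 9 nodes, 12 disks each at 150 MB/s, 1250 MB/s network.
estimate = scan_time_seconds(
    data_tb=10, nodes=9, disks_per_node=12, disk_mb_s=150, network_mb_s=1250
)
limiting_path = bottleneck(disks_per_node=12, disk_mb_s=150, network_mb_s=1250)
```

In this model, adding CPU cores changes nothing: only more nodes, more disks per node, or a faster network shorten the scan, which is the point made in the paragraph above.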

Balance in load distribution is achieved by using OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are allocated their own processor resources and dedicated disks. In our Atos Codex Data Lake Engine solution, extensive virtualization is achieved, which is why we win both in performance (the impact of the network infrastructure is minimized) and in TCO (unnecessary physical servers are eliminated).

With BullSequana S200 servers we get a very even load, free of some of the bottlenecks. The minimum configuration includes 3 BullSequana S200 servers, each with two JBODs; additional S200s containing four data nodes each can optionally be connected. Here is an example of the load in the TeraGen test:

[Figure: load during the TeraGen test]

Tests with different data volumes and replication values show the same results in terms of load distribution across the cluster nodes. Below is a graph of the distribution of disk access across the performance tests.

[Figure: distribution of disk access across performance tests]

The calculations are based on the minimum configuration of 3 BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, as well as reserved virtual machines in case protection based on OpenStack Virtualization is deployed. TeraSort test result: with a 512 MB block size, a replication factor of three, and encryption, the run takes 23.1 minutes.

How can the system be expanded? Several types of extensions are available for the Data Lake Engine:

  • Data nodes: for every 40 TB of usable space
  • Analytic nodes with the option of installing a GPU
  • Other options depending on business needs (for example, if you need Kafka and the like)
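
The first extension rule can be turned into a small sizing helper. This is a sketch under one assumption: that "for every 40 TB of usable space" means one additional data-node extension per 40 TB, rounded up. The function name and the sample volumes are hypothetical.

```python
import math

# From the list above: data-node extensions are sized per 40 TB of usable space.
TB_PER_DATA_NODE_EXTENSION = 40


def extra_data_node_extensions(required_usable_tb, current_usable_tb=0):
    """Number of data-node extensions needed to reach the required capacity.

    Assumption: one extension per started 40 TB of missing usable space.
    """
    missing_tb = max(0, required_usable_tb - current_usable_tb)
    return math.ceil(missing_tb / TB_PER_DATA_NODE_EXTENSION)


# Hypothetical example: 100 TB required, nothing installed yet.
extensions_needed = extra_data_node_extensions(required_usable_tb=100)
```

Rounding up matters here: a requirement that exceeds a multiple of 40 TB by even a small amount calls for one more extension.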

The Atos Codex Data Lake Engine complex includes both the servers themselves and pre-installed software: the licensed Cloudera kit, Hadoop itself, OpenStack with virtual machines based on the RedHat Enterprise Linux kernel, and data replication and backup systems (including a backup node and Cloudera BDR, Backup and Disaster Recovery). Atos Codex Data Lake Engine was the first virtualization solution to be certified by Cloudera.

If you are interested in the details, we will be happy to answer your questions in the comments.

Source: www.hab.com
