What's special about Cloudera and how to cook it

The market for distributed computing and big data is, according to statistics, growing by 18-19% a year. That keeps the question of choosing software for these tasks relevant. In this post we start with why distributed computing is needed at all, go into detail on choosing software, talk about using Hadoop with Cloudera, and finish with choosing hardware and how it affects performance.

Why does an ordinary business need distributed computing? The answer is simple and complicated at the same time. Simple, because in most cases we perform fairly simple calculations per unit of information. Complicated, because there is a lot of that information. A great deal. As a result, you end up processing terabytes of data in 1000 threads. The use cases are therefore quite universal: these calculations can be applied wherever a large number of metrics has to be taken into account over an even larger body of data.

One recent example: the pizzeria chain Dodo Pizza determined, from an analysis of its customer order database, that when choosing a pizza with custom toppings, users usually work with just six basic sets of ingredients plus a couple of random ones. The chain adjusted its purchasing accordingly. It was also able to recommend better add-on products to users at the ordering stage, which increased profit.

Another example: an analysis of product lines allowed H&M to reduce the assortment in individual stores by 40% while maintaining sales levels. This was achieved by dropping poorly selling items, with seasonality taken into account in the calculations.

Choosing a tool

Hadoop is the industry standard for this kind of computing. Why? Because Hadoop is an excellent, well-documented framework (Habr itself hosts many detailed articles on the topic), and it comes with a whole set of utilities and libraries. You can feed it huge sets of both structured and unstructured data, and the system itself distributes them across the available computing power. Moreover, that computing power can be added or switched off at any time: horizontal scalability in action.

In 2017 the influential consulting firm Gartner concluded that Hadoop would soon become obsolete. The reason is rather banal: the analysts believe companies will migrate to the cloud en masse, because there they can pay for computing power as they use it. The second important factor that could supposedly "bury" Hadoop is speed, because options such as Apache Spark or Google Cloud DataFlow are faster than MapReduce, which underlies Hadoop.

Hadoop rests on several pillars, the most notable of which are MapReduce (a system for distributing data for computation across servers) and the HDFS file system. The latter is designed specifically for storing information distributed across cluster nodes: each fixed-size block can be placed on several nodes, and thanks to replication the system tolerates failures of individual nodes. Instead of a file table, a dedicated server called the NameNode is used.
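
To make the block-and-replica idea concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function names, the node names, and the round-robin placement are assumptions made purely for illustration (real HDFS placement is rack-aware). The `placement` dictionary loosely plays the role of the metadata the NameNode keeps.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Sizes of the fixed-size blocks a file is cut into (the last one may be smaller)."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication):
    """Toy round-robin placement (an assumption): each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

def surviving_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]

# Hypothetical 1300 MB file, 512 MB blocks, four nodes, replication factor 3.
blocks = split_into_blocks(1300 * MB, 512 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
still_readable = surviving_blocks(placement, failed_nodes={"node1", "node2"})
```

With three replicas per block, losing any two of the four hypothetical nodes still leaves every block readable, which is the resilience property described above.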

The illustration below shows how MapReduce works. In the first stage the data is split according to some criterion, in the second it is distributed across the available computing power, and in the third the calculation itself takes place.
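
The three stages described above can be sketched in a few lines of Python. This is a single-machine toy, not Hadoop's API: threads stand in for cluster nodes, and the `orders` data and all function names are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split(records, num_chunks):
    """Stage 1: divide the input into roughly equal chunks."""
    return [records[i::num_chunks] for i in range(num_chunks)]

def map_chunk(chunk):
    """Stage 2 (runs on each worker): emit an (ingredient, 1) pair per ingredient."""
    return [(ingredient, 1) for order in chunk for ingredient in order]

def reduce_pairs(mapped_chunks):
    """Stage 3: group the pairs by key and sum the counts."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

def map_reduce(records, workers=3):
    chunks = split(records, workers)
    # Threads are a stand-in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_pairs(mapped)

# Hypothetical pizza orders, each a list of ingredients.
orders = [
    ["cheese", "tomato"],
    ["cheese", "ham"],
    ["mushroom", "cheese", "tomato"],
    ["ham"],
]
counts = map_reduce(orders)
```

Each per-record calculation is trivial; the point of the pattern is that the map stage can run on as many workers as the data volume requires.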

[Illustration: how MapReduce works]

MapReduce was originally developed by Google for its search needs. The approach then made its way into open source, and the project ended up at Apache. Google, meanwhile, gradually migrated to other solutions. An interesting nuance: Google now has a project called Google Cloud Dataflow, positioned as the next step after Hadoop and as a faster replacement for it.

A closer look shows that Google Cloud Dataflow is based on a variation of Apache Beam, while Apache Beam in turn works with the well-documented Apache Spark framework, which lets us talk about nearly the same execution speed for the resulting solutions. And Apache Spark works perfectly well on top of the HDFS file system, so it can be deployed on Hadoop servers.

Add to this the volume of documentation and ready-made solutions for Hadoop and Spark compared with Google Cloud Dataflow, and the choice of tool becomes obvious. Moreover, engineers can decide for themselves which code, for Hadoop or for Spark, they should run, based on the task, their experience, and their skills.

Cloud or local server

The general trend of moving to the cloud has even produced an interesting term: Hadoop-as-a-service. In that scenario, administration of the connected servers becomes very important, because, alas, despite its popularity, pure Hadoop is a rather difficult tool to set up, with a lot to be done by hand. For example, you have to configure servers individually, monitor their performance, and carefully tune many parameters. All in all, it is a job for an enthusiast, and there is a good chance of getting something wrong or missing something.

That is why distributions that ship with convenient deployment and administration tools have become so popular. One of the most popular distributions that supports Spark and makes everything easier is Cloudera. It comes in paid and free versions, and the free one offers all the core functionality with no limit on the number of nodes.

During setup, Cloudera Manager connects to your servers over SSH. An interesting point: when installing, it is better to specify that the installation be carried out using so-called parcels, special packages each of which contains all the required components already configured to work with one another. In essence, this is an improved version of a package manager.

After installation we get a cluster management console, where you can see cluster telemetry and installed services, add or remove resources, and edit the cluster configuration.

As a result, you find yourself in the cockpit of the rocket that will carry you into the bright future of BigData. But before saying "let's go," let's look under the hood.

Hardware requirements

On its website, Cloudera describes various possible configurations. The general principles on which they are built are shown in the illustration.

[Illustration: general principles of Cloudera hardware configurations]

MapReduce can blur this optimistic picture. If you look again at the diagram from the previous section, it becomes clear that in almost all cases a MapReduce job can hit a bottleneck when reading data from disk or from the network. This is also noted on the Cloudera blog. As a result, for any fast calculations, including those run through Spark, which is often used for real-time calculations, I/O speed is very important. So when using Hadoop it is essential that the cluster consists of balanced, fast machines, which, to put it mildly, is not always guaranteed in cloud infrastructure.
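
A back-of-the-envelope estimate shows why I/O dominates. The sketch below uses a deliberately pessimistic model, assumed for illustration only: every byte a node processes is limited by the slower of its aggregate disk throughput and its network link. All the figures in the usage lines (data volume, disk count, throughput) are hypothetical and are not measurements from the article.

```python
def scan_time_seconds(data_tb, nodes, disks_per_node, disk_mb_per_s, network_mb_per_s):
    """Estimate one full pass over the data and name the limiting resource.

    Pessimistic assumption: each node's share of the data is read at the
    slower of (total disk throughput, network throughput).
    """
    bytes_per_node = data_tb * 1e12 / nodes
    disk_rate = disks_per_node * disk_mb_per_s * 1e6   # bytes per second
    network_rate = network_mb_per_s * 1e6              # bytes per second
    bottleneck = "disk" if disk_rate <= network_rate else "network"
    return bytes_per_node / min(disk_rate, network_rate), bottleneck

# Hypothetical cluster: 10 TB over 9 nodes, 12 disks at 150 MB/s per node.
slow_net = scan_time_seconds(10, 9, 12, 150, network_mb_per_s=125)      # ~1 Gbit/s link
fast_net = scan_time_seconds(10, 9, 12, 150, network_mb_per_s=1250)     # ~10 Gbit/s link
faster_net = scan_time_seconds(10, 9, 12, 150, network_mb_per_s=12500)  # ~100 Gbit/s link
```

Under these assumed numbers the network is the limit until the link outruns the disks, after which adding bandwidth no longer helps. That is the sense in which the machines need to be balanced, not just fast in one dimension.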

Balanced load distribution is achieved by using OpenStack virtualization on servers with powerful multi-core CPUs. Data nodes are given their own processor resources and dedicated disks. Our solution, Atos Codex Data Lake Engine, relies on extensive virtualization, so we gain both in performance (the impact of the network infrastructure is minimized) and in TCO (extra physical servers are eliminated).

With BullSequana S200 servers we get a very even load, free of some of the bottlenecks. The minimum configuration includes 3 BullSequana S200 servers, each with two JBODs, and additional S200 servers containing four data nodes each can optionally be connected. Here is an example of the load in the TeraGen test:

[Chart: load during the TeraGen test]

Tests with different data volumes and replication values show the same results in terms of load distribution across the cluster nodes. Below is a graph of the distribution of disk access across the performance tests.

[Chart: distribution of disk access across the performance tests]

The calculations are based on the minimum configuration of 3 BullSequana S200 servers. It includes 9 data nodes and 3 master nodes, as well as virtual machines reserved in case protection based on OpenStack Virtualization is deployed. The TeraSort test result, with a block size of 512 MB, a replication factor of three, and encryption enabled, is 23.1 minutes.

How can the system be expanded? Several types of extensions are available for the Data Lake Engine:

  • Data nodes: one for every 40 TB of usable space
  • Analytical nodes with the option of installing a GPU
  • Other options depending on business needs (for example, if you need Kafka and the like)
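
The first extension rule above lends itself to a quick sizing calculation. The sketch below takes the 40 TB per data node figure from the list, but it assumes, and the article does not confirm this, that the 40 TB is counted before HDFS replication, so stored volume equals logical data times the replication factor. The function name and the example data volume are hypothetical.

```python
import math

TB_PER_DATA_NODE = 40  # from the article: one data node per 40 TB of usable space

def data_nodes_needed(logical_tb, replication=3, tb_per_node=TB_PER_DATA_NODE):
    """Rough data node count for a given logical data volume.

    Assumption (not from the article): usable space is consumed by every
    replica, so stored volume = logical volume * replication factor.
    """
    stored_tb = logical_tb * replication
    return math.ceil(stored_tb / tb_per_node)

# Hypothetical 100 TB of logical data.
with_replication = data_nodes_needed(100)                    # replication factor 3
without_replication = data_nodes_needed(100, replication=1)  # if 40 TB is already net of replicas
```

If the 40 TB figure is instead net of replication, the second call is the relevant estimate, so it is worth confirming how the vendor counts usable space before sizing a cluster this way.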

The Atos Codex Data Lake Engine solution includes both the servers themselves and pre-installed software: a licensed Cloudera kit, Hadoop itself, OpenStack with virtual machines based on the RedHat Enterprise Linux kernel, and data replication and backup systems (including a backup node and Cloudera BDR, Backup and Disaster Recovery). Atos Codex Data Lake Engine became the first virtualization solution to be certified by Cloudera.

If you are interested in the details, we will be happy to answer your questions in the comments.

Source: www.habr.com
