Configuring Spark on YARN

Hello, Habr! Yesterday, at the Apache Spark meetup hosted by the folks from Rambler&Co, participants asked a lot of questions about configuring this tool. We decided to follow up and share our own experience. The topic is not an easy one, so we invite you to share your experience in the comments as well; perhaps we too misunderstand something or use it the wrong way.

A short introduction to how we use Spark. We run a three-month program called "Big Data Specialist", and throughout the second module our participants work with this tool. Our job as organizers, therefore, is to prepare the cluster for that kind of use.

What is peculiar about our use case is that the number of people working with Spark at the same time can equal the whole group. At a seminar, for example, everyone tries something simultaneously and repeats after the instructor. And that is not a small number: sometimes up to 40 people. There are probably not many companies in the world that face this usage pattern.

Below I describe how and why we chose particular configuration parameters.

Let's start from the very beginning. Spark has 3 options for running on a cluster: standalone, on Mesos, and on YARN. We chose the third because it made the most sense for us. We already have a Hadoop cluster, and our participants are already familiar with its architecture. So, YARN it is.

spark.master=yarn

Now it gets more interesting. Each of these 3 options has 2 deploy modes: client and cluster. Judging by the documentation and various links on the Internet, client mode suits interactive work, for example through a Jupyter notebook, while cluster mode is better suited to production workloads. In our case we were interested in interactive work, so:

spark.submit.deployMode=client
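If the same two choices were passed on the command line rather than through a properties file, the call might look roughly like the sketch below. It only builds the argument list and runs nothing; the helper name `spark_submit_command`, the script name `job.py` and the sample `spark.app.name` value are made up for illustration.

```python
import shlex


def spark_submit_command(app, master="yarn", deploy_mode="client", conf=None):
    """Build a spark-submit argument list; nothing is executed here."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Extra properties are passed as repeated --conf key=value pairs.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical application and property value, for illustration only.
cmd = spark_submit_command("job.py", conf={"spark.app.name": "demo"})
print(" ".join(shlex.quote(part) for part in cmd))
```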

In principle, Spark will now run on YARN one way or another, but that was not enough for us. Since our program is about big data, participants sometimes needed more than they got from an even split of resources. Then we found an interesting feature: dynamic resource allocation. In short, the idea is this: if you have a heavy job and the cluster is free (in the morning, for example), Spark can use this option to give you additional resources. The demand is calculated by a fairly clever formula. We won't go into the details; it works well.

spark.dynamicAllocation.enabled=true

We set this parameter, and Spark crashed on startup and refused to start. Fair enough: I should have read the documentation more carefully. It says that for everything to work, you also need to enable one more parameter.

spark.shuffle.service.enabled=true

Why is it needed? When our job no longer needs so many resources, Spark has to return them to the common pool. The most time-consuming stage in almost any MapReduce job is the shuffle stage. This parameter lets the data produced at that stage be preserved so that the executors can be released. An executor is the process that does all the computation on a worker. It has a certain number of processor cores and a certain amount of memory.
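The dependency between the two settings can be caught before launch with a small check. This is only a sketch for the setup described here: the helper `check_dynamic_allocation` is hypothetical and simply inspects a dict of property names and string values.

```python
def check_dynamic_allocation(conf):
    """Return a list of problems found in a dict of Spark properties."""
    problems = []
    dynamic = conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true"
    shuffle = conf.get("spark.shuffle.service.enabled", "false").lower() == "true"
    # In the setup described here, dynamic allocation failed at startup
    # until the shuffle service was enabled as well.
    if dynamic and not shuffle:
        problems.append(
            "spark.dynamicAllocation.enabled=true needs "
            "spark.shuffle.service.enabled=true"
        )
    return problems


print(check_dynamic_allocation({"spark.dynamicAllocation.enabled": "true"}))
```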

We added this parameter, and everything seemed to work. It became noticeable that participants really were getting more resources when they needed them. But another problem appeared: at some point other participants woke up and wanted to use Spark too, but everything was already taken, and they were unhappy. Understandably so. We went back to the documentation. It turned out that there are a number of other parameters that influence this process. For example, if an executor is idle, after how long can its resources be taken away?

spark.dynamicAllocation.executorIdleTimeout=120s

In our case: if your executors do nothing for two minutes, kindly return them to the common pool. But this parameter was not always enough. It was obvious that a person had not done anything for a long time, yet the resources were not being released. It turned out that there is a separate parameter for how long to wait before reclaiming executors that hold cached data. By default this parameter was set to infinity! We fixed that.

spark.dynamicAllocation.cachedExecutorIdleTimeout=600s

That is, if your executors holding cached data do nothing for ten minutes, hand them back to the common pool. In this mode, the speed at which resources were released and granted became acceptable for a large number of users. The amount of discontent went down. But we decided to go further and cap the maximum number of executors per application, which in practice means per program participant.

spark.dynamicAllocation.maxExecutors=19

Now, of course, there are unhappy people on the other side ("the cluster is idle, and I only get 19 executors"), but what can you do? You can't make everyone happy.
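The combined effect of the two timeouts and the executor cap can be pictured with a toy model. This is not Spark's actual allocation logic, only a sketch of the rules as described above; the `Executor` class and both functions are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Executor:
    idle_seconds: int
    has_cached_data: bool


def should_release(executor, idle_timeout=120, cached_idle_timeout=600):
    """Toy rule: executors with cached data get the longer timeout."""
    limit = cached_idle_timeout if executor.has_cached_data else idle_timeout
    return executor.idle_seconds >= limit


def grantable(requested, running, max_executors=19):
    """Toy rule: never let one application exceed max_executors."""
    return max(0, min(requested, max_executors - running))


print(should_release(Executor(idle_seconds=130, has_cached_data=False)))
print(should_release(Executor(idle_seconds=130, has_cached_data=True)))
print(grantable(requested=30, running=0))
```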

One more small story related to the specifics of our case. Several people once showed up late to a practical session, and for some reason Spark would not start for them. We checked the amount of free resources, and there seemed to be enough. Spark should have started. Luckily, by then the documentation had already settled somewhere in the back of our minds, and we remembered that on startup Spark looks for a port to bind to. If the first port in the range is busy, it moves on to the next one in order. If that one is free, it takes it. And there is a parameter that sets the maximum number of such attempts. The default is 16, which is fewer than the number of people in our group in class. So after 16 attempts Spark gave up and reported that it could not start. We changed this setting.

spark.port.maxRetries=50
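A simplified model of that port search shows why a full classroom hits the limit. The sketch assumes that each retry simply tries the next port number and that the initial attempt is not counted as a retry; 4040 is used as the starting port because it is the default port of the Spark UI, and the function name is hypothetical.

```python
def find_free_port(busy_ports, start_port=4040, max_retries=16):
    """Try start_port, then up to max_retries further ports in order."""
    for offset in range(max_retries + 1):
        candidate = start_port + offset
        if candidate not in busy_ports:
            return candidate
    return None  # every attempt hit a busy port


# Twenty sessions already running on the same host (hypothetical numbers).
busy = set(range(4040, 4060))
print(find_free_port(busy, max_retries=16))
print(find_free_port(busy, max_retries=50))
```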

Next, a few settings that are less tied to the specifics of our case.

To make Spark start faster, it is recommended to archive the jars folder located in the SPARK_HOME directory and put the archive on HDFS. Spark then does not waste time uploading these jars to the workers.

spark.yarn.archive=hdfs:///tmp/spark-archive.zip
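Building such an archive could be scripted along these lines. This is a sketch, not the exact procedure we followed: the function name is made up, and it assumes the jars should sit at the root of the zip file.

```python
import zipfile
from pathlib import Path


def build_spark_archive(jars_dir, archive_path):
    """Zip every *.jar in jars_dir into archive_path, flat at the archive root."""
    jars = sorted(Path(jars_dir).glob("*.jar"))
    # Jars are already compressed, so they are stored without recompression.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for jar in jars:
            archive.write(jar, arcname=jar.name)
    return [jar.name for jar in jars]
```

The resulting file would then be uploaded with something like `hdfs dfs -put spark-archive.zip /tmp/`, so that it matches the path in the setting above.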

It is also recommended to use Kryo as the serializer for better performance. It is more optimized than the default one.

spark.serializer=org.apache.spark.serializer.KryoSerializer

There is also a long-standing problem with Spark: it often crashes because of memory. This frequently happens at the moment when the workers have finished computing and send the result to the driver. We increased this parameter for ourselves. The default is 1 GB; we set it to 3 GB.

spark.driver.maxResultSize=3g

And finally, for dessert: how to upgrade Spark to version 2.1 on the HortonWorks distribution, HDP 2.5.3.0. This version of HDP ships with version 2.0 preinstalled, but we had decided for ourselves that Spark is developing quite actively, and each new version fixes some bugs and adds features, including for the Python API, so we concluded that an upgrade was needed.

We downloaded the build for Hadoop 2.7 from the official website, unzipped it and put it into the HDP folder. We set up the symlinks as needed. We launched it, and it did not start. It printed a very obscure error.

java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

After some googling, we found out that Spark had decided not to wait for Hadoop to catch up and had moved to a newer version of jersey. They argue about this topic among themselves in JIRA. The solution was to download jersey version 1.17.1, put it into the jars folder in SPARK_HOME, zip the folder again and upload it to HDFS.

We got around that error, but a new and rather vague one appeared.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master

Meanwhile, when we tried running version 2.0, everything was fine. Try to guess what is going on. We looked into the logs of this application and saw something like this:

/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar

In short, for some reason hdp.version was not being resolved. After googling, we found a solution. You need to go to the YARN settings in Ambari and add a parameter to the custom yarn-site section:

hdp.version=2.5.3.0-37

This bit of magic helped, and Spark took off. We tested several of our Jupyter notebooks. Everything works. We are ready for the first Spark class on Saturday (tomorrow)!

UPD. During the class, one more problem came up. At some point YARN stopped handing out containers for Spark. We had to change a YARN parameter whose default value was 0.2:

yarn.scheduler.capacity.maximum-am-resource-percent=0.8

That is, only 20% of the cluster's resources could go to application masters, and every Spark session needs one, so new sessions were left waiting. After changing the parameter, we restarted YARN. The problem was solved, and the rest of the participants were able to start their Spark contexts too.
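A back-of-the-envelope calculation shows how this limit translates into a number of simultaneous sessions. The figures below (100 GB of cluster memory, 1.5 GB per application master) are hypothetical and not taken from our cluster, and the real scheduler applies the limit per queue and takes other factors into account, so treat this only as a rough sketch.

```python
def max_concurrent_apps(cluster_memory_mb, am_memory_mb, max_am_percent):
    """Rough estimate of how many application masters fit into the AM budget."""
    am_budget_mb = cluster_memory_mb * max_am_percent
    return int(am_budget_mb // am_memory_mb)


# Hypothetical cluster: 100 GB of memory, 1.5 GB per application master.
for percent in (0.2, 0.8):
    print(percent, max_concurrent_apps(102400, 1536, percent))
```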

Source: www.habr.com
