Configuring Spark on YARN

Hello, Habr! At yesterday's meetup on Apache Spark, hosted by the folks from Rambler&Co, participants asked a lot of questions about configuring this tool. We decided to follow in their footsteps and share our own experience. The topic is not an easy one, so we invite you to share your experience in the comments too; perhaps we have misunderstood something or are using it wrong.

A short introduction on how we use Spark. We run a three-month program called "Big Data Specialist", and throughout the second module our participants work with this tool. Accordingly, our job as organizers is to prepare the cluster for that kind of use.

What makes our use case unusual is that the number of people working in Spark at the same time can equal the whole group. For example, at a seminar, everyone tries something at once, repeating after our instructor. And that is not a small number: sometimes up to 40 people. There are probably not many companies in the world that face a usage pattern like this.

Below I will explain how and why we chose particular configuration parameters.

Let's start from the very beginning. Spark has 3 options for running on a cluster: standalone, on Mesos, and on YARN. We chose the third because it made the most sense for us: we already have a Hadoop cluster, and our participants already know its architecture well. So YARN it is.

spark.master=yarn

Next comes something more interesting. Each of these 3 options supports 2 deploy modes: client and cluster. Judging by the documentation and various links on the Internet, client mode suits interactive work, for example through a Jupyter notebook, while cluster mode is better suited to production workloads. In our case we were interested in interactive work, so:

spark.submit.deployMode=client

Generally speaking, from this point on Spark will run on YARN one way or another, but that was not enough for us. Since our program is about big data, participants sometimes did not have enough resources within an equal share of the cluster. Then we found an interesting feature: dynamic resource allocation. In short, the idea is this: if you have a heavy job and the cluster is free (for example, in the morning), then with this option Spark can give you additional resources. The demand is calculated by a clever formula. We will not go into the details; it works well.

spark.dynamicAllocation.enabled=true

We set this parameter, and Spark crashed on startup and would not launch. Fair enough: I should have read the documentation more carefully. It says that for this to work you also need to enable one more parameter.

spark.shuffle.service.enabled=true
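The dependency between these two parameters is easy to forget, so here is a minimal sketch of a pre-flight check. It is plain Python over a dictionary of properties and does not talk to Spark at all; the function name `check_dynamic_allocation` and the wording of the message are our own illustration, not something Spark provides.

```python
def check_dynamic_allocation(conf):
    """Return a list of problems found in a dict of Spark properties.

    Hypothetical helper: it only encodes the rule described in the text,
    namely that dynamic allocation needs the shuffle service enabled.
    """
    def is_on(key):
        return conf.get(key, "false").strip().lower() == "true"

    problems = []
    if is_on("spark.dynamicAllocation.enabled") and not is_on("spark.shuffle.service.enabled"):
        problems.append(
            "spark.dynamicAllocation.enabled=true also needs "
            "spark.shuffle.service.enabled=true"
        )
    return problems


# The configuration that failed for us, and the corrected one.
broken = {"spark.master": "yarn", "spark.dynamicAllocation.enabled": "true"}
fixed = dict(broken, **{"spark.shuffle.service.enabled": "true"})

print(check_dynamic_allocation(broken))
print(check_dynamic_allocation(fixed))
```

On YARN the external shuffle service itself also has to be set up on the NodeManagers, as the Spark documentation describes; that part is outside this sketch.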

Why is it needed? When our job no longer needs that many resources, Spark should return them to the common pool. The most time-consuming stage of almost any MapReduce job is the Shuffle stage. This parameter makes it possible to preserve the data produced at that stage and release the executors accordingly. An executor is the process that does all the computation on a worker. It has a certain number of processor cores and a certain amount of memory.

We added this parameter, and everything seemed to work. It became noticeable that participants were indeed given more resources when they needed them. But another problem appeared: at some point other participants woke up and also wanted to use Spark, but everything was taken, and they were unhappy. That is understandable. We went back to the documentation. It turned out there are several more parameters that let you influence the process. For example, if an executor is idle, after how long can its resources be taken away?

spark.dynamicAllocation.executorIdleTimeout=120s

In our case, if your executors do nothing for two minutes, kindly return them to the common pool. But this parameter was not always enough. It was obvious that a person had not been doing anything for a long time, yet the resources were not released. It turned out there is a separate parameter for how long to wait before reclaiming executors that hold cached data. By default this parameter was set to infinity! We fixed that.

spark.dynamicAllocation.cachedExecutorIdleTimeout=600s

That is, if your executors holding cached data do nothing for 10 minutes, give them back to the common pool. In this mode, the speed of releasing and handing out resources for a large number of users became decent. The amount of dissatisfaction went down. But we decided to go further and limit the maximum number of executors per application, which in practice means per program participant.

spark.dynamicAllocation.maxExecutors=19

Now, of course, there are unhappy people on the other side: "the cluster is idle, and I only have 19 executors". But what can you do? You cannot please everyone.
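To make the interplay of the three settings above concrete, here is a deliberately simplified model in plain Python. It is not how Spark implements dynamic allocation internally; the `Executor` class and both helper functions are hypothetical names used only to restate the rules from the text.

```python
from dataclasses import dataclass

# Values taken from the settings above.
EXECUTOR_IDLE_TIMEOUT_S = 120         # spark.dynamicAllocation.executorIdleTimeout
CACHED_EXECUTOR_IDLE_TIMEOUT_S = 600  # spark.dynamicAllocation.cachedExecutorIdleTimeout
MAX_EXECUTORS = 19                    # spark.dynamicAllocation.maxExecutors


@dataclass
class Executor:
    idle_seconds: int
    has_cached_data: bool


def should_release(executor):
    """Simplified rule: cached executors get the longer timeout."""
    if executor.has_cached_data:
        timeout = CACHED_EXECUTOR_IDLE_TIMEOUT_S
    else:
        timeout = EXECUTOR_IDLE_TIMEOUT_S
    return executor.idle_seconds >= timeout


def executors_granted(requested):
    """One application never gets more than the configured maximum."""
    return min(requested, MAX_EXECUTORS)


print(should_release(Executor(idle_seconds=150, has_cached_data=False)))  # True
print(should_release(Executor(idle_seconds=150, has_cached_data=True)))   # False
print(should_release(Executor(idle_seconds=700, has_cached_data=True)))   # True
print(executors_granted(50))                                              # 19
```

The second case is the one that bit us before we set the cached timeout: with an infinite timeout, an executor holding cached data would never be handed back.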

And one more small story related to the specifics of our case. Once, several people arrived late to a hands-on class, and for some reason Spark would not start for them. We looked at the amount of free resources, and there seemed to be enough. Spark should have started. Fortunately, by then the documentation had already sunk in somewhere in the back of our minds, and we remembered that on launch Spark looks for a port to start on. If the first port in the range is busy, it moves on to the next one in order. If that one is free, it takes it. And there is a parameter that sets the maximum number of such attempts. The default is 16. That number is smaller than the number of people in our group in class. Accordingly, after 16 attempts Spark gave up and said it could not start. We fixed this parameter.

spark.port.maxRetries=50
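The effect is easy to reproduce on paper. The sketch below is a toy model, not Spark's code, and it rests on three assumptions of ours: every session's driver runs on the same host (as with a shared notebook server), the search starts at port 4040 (the default Spark UI port), and Spark makes one initial attempt plus `maxRetries` retries. The function names are hypothetical.

```python
def find_port(busy_ports, start_port=4040, max_retries=16):
    """Try start_port, then the next ports in order; None if all are busy.

    Assumption: one initial attempt plus max_retries retries.
    """
    for offset in range(max_retries + 1):
        port = start_port + offset
        if port not in busy_ports:
            return port
    return None


def sessions_started(n_users, max_retries):
    """Count how many users on one host manage to bind a port."""
    busy_ports = set()
    started = 0
    for _ in range(n_users):
        port = find_port(busy_ports, max_retries=max_retries)
        if port is not None:
            busy_ports.add(port)
            started += 1
    return started


print(sessions_started(40, max_retries=16))  # 17
print(sessions_started(40, max_retries=50))  # 40
```

Under these assumptions, a class of 40 runs out of ports well before everyone has a session with the default value, which matches what the latecomers saw.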

Next I will cover a few settings that are less tied to the specifics of our case.

To make Spark start faster, it is recommended to archive the jars folder located in the SPARK_HOME directory and put the archive on HDFS. Spark then does not waste time uploading these jars to the workers on every launch.

spark.yarn.archive=hdfs:///tmp/spark-archive.zip
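Building the archive can be scripted. The following is a minimal sketch using only the Python standard library; the function name `build_spark_archive` is our own, and we place the jar files at the root of the zip, which is the layout the Spark documentation asks for with `spark.yarn.archive`.

```python
import zipfile
from pathlib import Path


def build_spark_archive(jars_dir, archive_path):
    """Zip every *.jar in jars_dir into archive_path, flat at the archive root.

    Returns the number of jars written. Jars are already compressed,
    so they are stored without recompression.
    """
    jars = sorted(Path(jars_dir).glob("*.jar"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for jar in jars:
            # arcname drops the directory part so jars sit at the root.
            archive.write(jar, arcname=jar.name)
    return len(jars)
```

A typical call would be `build_spark_archive(Path(os.environ["SPARK_HOME"]) / "jars", "spark-archive.zip")`, followed by `hdfs dfs -put spark-archive.zip /tmp/spark-archive.zip` to place it at the path used in the setting above.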

It is also recommended to use Kryo as the serializer for faster operation. It is better optimized than the default one.

spark.serializer=org.apache.spark.serializer.KryoSerializer

There is also a long-standing problem with Spark: it often crashes because of memory. This frequently happens at the moment when the workers have finished computing and send the result to the driver. We made this parameter larger for ourselves. By default it is 1 GB; we set it to 3.

spark.driver.maxResultSize=3g
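Since our participants work from notebooks, one way to hand all of the settings so far to a PySpark session is the `PYSPARK_SUBMIT_ARGS` environment variable, which PySpark reads when it launches the JVM. The sketch below only builds that string; the helper name `build_submit_args` is hypothetical, and this is an illustration under our assumptions rather than a description of how our cluster is actually wired (the same values could just as well live in `spark-defaults.conf`).

```python
import os
import shlex

# Settings collected from the sections above.
SETTINGS = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.executorIdleTimeout": "120s",
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "600s",
    "spark.dynamicAllocation.maxExecutors": "19",
    "spark.port.maxRetries": "50",
    "spark.yarn.archive": "hdfs:///tmp/spark-archive.zip",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.driver.maxResultSize": "3g",  # 3 GB, written with an explicit unit
}


def build_submit_args(settings, master="yarn", deploy_mode="client"):
    """Build a PYSPARK_SUBMIT_ARGS string from a dict of Spark properties."""
    parts = ["--master", master, "--deploy-mode", deploy_mode]
    for key, value in settings.items():
        parts += ["--conf", f"{key}={value}"]
    # PySpark expects the argument string to end with "pyspark-shell".
    parts.append("pyspark-shell")
    return " ".join(shlex.quote(part) for part in parts)


# Must be set before the first SparkContext is created in the notebook.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(SETTINGS)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```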

And finally, for dessert: how to upgrade Spark to version 2.1 on the HortonWorks distribution, HDP 2.5.3.0. This version of HDP ships with version 2.0 preinstalled, but we had decided for ourselves that Spark is developing quite actively, and each new release fixes some bugs and adds features, including for the Python API. So we decided that an upgrade was in order.

We downloaded the build for Hadoop 2.7 from the official website, unpacked it, and put it in the HDP folder. We set up the symlinks as needed. We launched it, and it did not start. It printed a very strange error.

java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

After some googling, we found out that Spark had decided not to wait for Hadoop to catch up and moved to a newer version of Jersey. They are arguing about this among themselves in JIRA. The solution was to download Jersey version 1.17.1, put it in the jars folder in SPARK_HOME, zip the folder again, and upload it to HDFS.

We got around that error, but a new and rather vague one appeared.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master

Meanwhile, we tried running version 2.0, and everything was fine. Try to guess what is going on. We dug into the logs of this application and saw something like this:

/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar

For some reason, hdp.version was not being resolved. After some googling, we found a solution: go to the YARN settings in Ambari and add a parameter to the custom yarn-site section:

hdp.version=2.5.3.0-37
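Why a missing property shows up as a literal `${hdp.version}` in the path can be illustrated with a few lines of Python. This is only a sketch of generic `${name}` placeholder expansion, not the code Hadoop or YARN actually runs; the `expand` function is hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand(text, properties):
    """Replace ${name} with properties[name]; leave unknown names untouched."""
    return PLACEHOLDER.sub(
        lambda match: properties.get(match.group(1), match.group(0)), text
    )


path = "/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar"

# Without the property the placeholder survives, as in our logs.
print(expand(path, {}))
# With the property defined the path becomes a real one.
print(expand(path, {"hdp.version": "2.5.3.0-37"}))
```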

This magic helped, and Spark took off. We tested several of our Jupyter notebooks, and everything works. We are ready for the first Spark class on Saturday (tomorrow)!

UPD. During the class, one more problem came to light. At some point YARN stopped handing out containers for Spark. We had to adjust a YARN parameter, which by default was 0.2:

yarn.scheduler.capacity.maximum-am-resource-percent=0.8

That is, only 20% of the cluster's resources could be used for application masters, and every Spark session needs one. After changing the parameter, we restarted YARN. The problem was solved, and the remaining participants were also able to start a Spark context.
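A back-of-the-envelope calculation shows why this limit matters for a class of 40. The numbers below (a 64 GB cluster and 1 GB per application master container) are hypothetical, and the formula ignores per-queue and per-user limits of the Capacity Scheduler, so treat it as a rough sketch only.

```python
from fractions import Fraction


def max_concurrent_apps(cluster_memory_mb, am_container_mb, am_resource_percent):
    """Rough upper bound on applications whose AM fits into the AM budget."""
    # Fraction avoids floating-point surprises when flooring the result.
    am_budget_mb = Fraction(str(am_resource_percent)) * cluster_memory_mb
    return int(am_budget_mb // am_container_mb)


CLUSTER_MEMORY_MB = 64 * 1024  # hypothetical cluster size
AM_CONTAINER_MB = 1024         # hypothetical size of one AM container

print(max_concurrent_apps(CLUSTER_MEMORY_MB, AM_CONTAINER_MB, 0.2))  # 12
print(max_concurrent_apps(CLUSTER_MEMORY_MB, AM_CONTAINER_MB, 0.8))  # 51
```

With these made-up numbers, the default would let roughly a dozen sessions start and leave the rest waiting, while 0.8 leaves room for the whole group.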

Source: www.habr.com