Feature Selection in Machine Learning

Hi, Habr!

We at Reksoft have translated the article Feature Selection in Machine Learning into Russian. We hope it will be useful to everyone interested in the topic.

In the real world, data is not always as clean as business customers sometimes think. That is why data mining and data wrangling are in demand. They help identify missing values and patterns in query-structured data that a person cannot spot by eye. Machine learning comes in handy for finding those patterns and for using the relationships discovered in the data to predict outcomes.

To understand any algorithm, you need to look at all the variables in the data and work out what those variables represent. This matters because the reasoning behind the results rests on understanding the data. If the data contains 5 or even 50 variables, you can examine them all. But what if there are 200? Then there will not be enough time to study every variable. On top of that, some algorithms do not work with categorical data, so you have to convert all categorical columns into quantitative variables (they may look quantitative, but the metrics will show that they are categorical) before adding them to the model. The number of variables therefore grows, and now there are around 500. What can you do? One might think the answer is dimensionality reduction. Dimensionality reduction algorithms do reduce the number of parameters, but they hurt interpretability. What if there are other techniques that remove features while keeping the remaining ones easy to understand and interpret?
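To make the growth in column count concrete, here is a minimal sketch of one-hot encoding written with the standard library only. The rows, the column names and the `one_hot` helper are hypothetical and exist only for illustration; in practice a library routine such as `pandas.get_dummies` would normally do this job.

```python
# Hypothetical toy rows: two numeric columns and one categorical column.
rows = [
    {"age": 34, "income": 52000, "plan": "basic"},
    {"age": 27, "income": 61000, "plan": "premium"},
    {"age": 45, "income": 48000, "plan": "family"},
    {"age": 31, "income": 75000, "plan": "basic"},
]


def one_hot(rows, column):
    """Replace one categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per distinct category.
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


encoded_rows = one_hot(rows, "plan")
print(len(rows[0]), "columns before,", len(encoded_rows[0]), "columns after")
print(encoded_rows[0])
```

A single categorical column with three categories turns three columns into five, which is how a dataset with 200 variables can end up with roughly 500.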

Depending on whether the analysis is based on regression or classification, feature selection algorithms may differ, but the main idea behind their implementation stays the same.

Highly Correlated Variables

Variables that are highly correlated with each other give the model the same information, so there is no need to use all of them in the analysis. For example, if a dataset contains the features "Time online" and "Traffic used", we can expect them to be correlated to some degree, and we will see a strong correlation even if we take an unbiased sample of the data. In that case only one of these variables is needed in the model. If you use both, the model will tend to overfit and become biased toward that one piece of information.
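One way to act on this is a simple correlation filter. The sketch below assumes NumPy is available; the toy values, the feature names, the `drop_correlated` helper and the 0.9 threshold are all illustrative assumptions rather than anything prescribed by the article.

```python
import numpy as np

# Hypothetical toy data: traffic is almost a linear function of time online.
time_online = np.arange(1.0, 11.0)
noise = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5])
features = {
    "time_online": time_online,
    "traffic_used": 3.0 * time_online + noise,
    "age": np.array([30.0, 22, 45, 19, 38, 27, 50, 24, 33, 41]),
}


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |Pearson r| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        is_redundant = any(
            abs(np.corrcoef(values, features[other])[0, 1]) > threshold
            for other in kept
        )
        if not is_redundant:
            kept.append(name)
    return kept


kept = drop_correlated(features)
print(kept)  # ['time_online', 'age']
```

Which member of a correlated pair survives depends only on column order here; a real pipeline would usually keep the one that is cheaper to collect or easier to explain.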

P-Values

In algorithms such as linear regression, an initial statistical model is always a good idea. It helps show the importance of features through the p-values obtained from that model. Having set a significance level, we check the resulting p-values: if a feature's p-value is below the chosen significance level, the feature is declared significant, that is, a change in its value is likely to lead to a change in the value of the target.
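The check described above can be sketched with NumPy alone. The synthetic data, the `ols_p_values` helper and the 0.05 significance level are assumptions made for illustration. The p-values use a normal approximation to the t-distribution, which is reasonable only when there are many more rows than columns; a statistics package would report exact values.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))              # column 0 is informative, column 1 is noise
y = 2.0 * X[:, 0] + rng.normal(size=n)   # hypothetical target


def ols_p_values(X, y):
    """Two-sided p-values for OLS coefficients (intercept first), normal approximation."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_stats = beta / std_err
    # P(|Z| > |t|) for a standard normal Z equals erfc(|t| / sqrt(2)).
    return [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats]


alpha = 0.05  # assumed significance level
p_values = ols_p_values(X, y)
for name, p in zip(["intercept", "x0", "x1"], p_values):
    print(f"{name}: p = {p:.4f}, significant = {p < alpha}")
```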

Forward Selection

Forward selection is a technique that uses stepwise regression. Model building starts from scratch, that is, from an empty model, and each iteration adds a variable that improves the model being built. Which variable is added is determined by its significance, which can be calculated with various metrics. The most common approach is to use the p-values obtained from an initial statistical model built with all the variables. Forward selection can sometimes lead to overfitting, because it may add highly correlated variables to the model even though they give the model the same information (the model still shows an improvement).
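The loop can be sketched as follows, assuming NumPy. Note the swapped-in criterion: instead of p-values, this sketch measures improvement as the relative drop in the residual sum of squares (RSS), which is one of the "various metrics" that can play this role. The synthetic data, the helper names and the 5% stopping threshold are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)  # columns 1 and 3 are noise


def rss(X, y, columns):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def forward_selection(X, y, min_gain=0.05):
    """Add the best remaining column until the relative RSS gain drops below min_gain."""
    selected = []
    remaining = list(range(X.shape[1]))
    current = rss(X, y, selected)  # empty model: intercept only
    while remaining:
        scores = {c: rss(X, y, selected + [c]) for c in remaining}
        best = min(scores, key=scores.get)
        if (current - scores[best]) / current < min_gain:
            break  # no candidate improves the model enough
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected


selected = forward_selection(X, y)
print(selected)
```

On this data the two informative columns are added and the noise columns are left out, because adding a noise column reduces the RSS by far less than the threshold.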

Backward Elimination

Backward elimination also removes features step by step, but in the opposite direction to forward selection. Here the initial model includes all the independent variables. Variables are then eliminated (one per iteration) if they add no value to the new regression model at each iteration. Feature elimination is based on the p-values of the initial model. This method, too, is unreliable when it comes to removing highly correlated variables.
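A minimal sketch of this procedure, assuming NumPy, is shown below. As a stand-in for p-values it drops the column whose removal increases the residual sum of squares (RSS) the least, and it stops once every remaining column would cost more than a threshold. The data, the helper names and the 5% threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)  # columns 1 and 3 are noise


def rss(X, y, columns):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return float(residuals @ residuals)


def backward_elimination(X, y, max_loss=0.05):
    """Drop the least useful column while the relative RSS increase stays below max_loss."""
    selected = list(range(X.shape[1]))
    current = rss(X, y, selected)  # full model: all columns
    while selected:
        scores = {c: rss(X, y, [s for s in selected if s != c]) for c in selected}
        weakest = min(scores, key=scores.get)
        if (scores[weakest] - current) / current >= max_loss:
            break  # every remaining column matters
        selected.remove(weakest)
        current = scores[weakest]
    return selected


selected = backward_elimination(X, y)
print(selected)
```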

Recursive Feature Elimination

RFE is a widely used technique/algorithm for selecting an exact number of significant features. Sometimes it is used to explain a given number of the "most important" features that affect the results; sometimes it is used to cut down a very large number of variables (around 200-400), keeping only those that contribute at least something to the model and excluding the rest. RFE uses a ranking system. Features in the dataset are assigned ranks. These ranks are then used to eliminate features recursively, based on the collinearity between them and on the importance of those features in the model. Besides ranking features, RFE can show whether features are important or not even for a given number of features (because the number you chose may well not be optimal, and the optimal number of features may be higher or lower than the chosen one).
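For a concrete picture, here is a short sketch that assumes scikit-learn is installed and uses its `RFE` class with a linear regression as the underlying estimator. The synthetic data and the choice of keeping two features are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)  # columns 1 and 3 are noise

# Refit repeatedly, removing the weakest feature each round (step=1),
# until only the requested number of features is left.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print("kept:", selector.support_)    # boolean mask of the selected features
print("ranks:", selector.ranking_)   # rank 1 marks a selected feature
```

When the right number of features is not known in advance, scikit-learn also provides `RFECV`, which chooses that number by cross-validation.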

Feature Importance Chart

When we talk about the interpretability of machine learning algorithms, we usually discuss linear regression (which lets you analyze feature importance through p-values) and decision trees (which literally show feature importance in the form of a tree, along with its hierarchy). On the other hand, algorithms such as Random Forest, LightGBM and XGBoost often use a feature importance chart, that is, a plot of the variables against their "importance scores". This is especially useful when you need to give a structured justification of feature importance in terms of its impact on the business.
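As an illustration, the sketch below assumes scikit-learn and reads the `feature_importances_` attribute of a fitted random forest, then prints a plain-text bar chart so that no plotting library is needed. The data and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["x0", "x1", "x2", "x3"]
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)  # x1 and x3 are noise

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # impurity-based scores that sum to 1

# Print the features from most to least important with a text bar for each.
for index in np.argsort(importances)[::-1]:
    bar = "#" * int(round(importances[index] * 40))
    print(f"{names[index]:>3} {importances[index]:.3f} {bar}")
```

The same scores can be passed to any plotting library to produce the usual horizontal bar chart.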

Regularization

Regularization is used to control the trade-off between bias and variance. Bias shows how far the model's predictions systematically miss the target because the model is too simple (underfitting). Variance shows how much predictions differ between the training data and the test data, which is the signature of overfitting. Ideally, both bias and variance should be small. This is where regularization comes to the rescue! There are two main techniques:

L1 Regularization - Lasso: Lasso penalizes the model's weights (beta coefficients) to change their importance in the model, and it can even zero them out (that is, remove those variables from the final model). Lasso is generally used when the dataset contains a large number of variables and you want to exclude some of them to better understand how the important features (that is, the features Lasso selects and assigns importance to) affect the model.

L2 Regularization - Ridge: Ridge's job is to keep all the variables and at the same time assign them importance based on their contribution to model performance. Ridge is a good choice when the dataset contains a small number of variables and all of them are needed to interpret the findings and results obtained.

Since Ridge keeps all the variables and Lasso does a better job of establishing their importance, an algorithm was developed that combines the best features of both, known as Elastic-Net.
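The difference between the three penalties is easiest to see in the fitted coefficients. The sketch below assumes scikit-learn; the synthetic data and the penalty strengths (`alpha`, `l1_ratio`) are arbitrary values picked for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)  # columns 1 and 3 are noise

models = {
    "Lasso (L1)": Lasso(alpha=0.5),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Elastic-Net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}
coefficients = {}
for name, model in models.items():
    coefficients[name] = model.fit(X, y).coef_
    print(f"{name:<12}", np.round(coefficients[name], 3))
```

On data like this, Lasso sets the coefficients of the noise columns to exactly zero, while Ridge only shrinks them and keeps every column in the model.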

There are many more ways to select features for machine learning, but the main idea is always the same: show the importance of the variables and then drop some of them based on that importance. Importance is a highly subjective term, since it is not a single metric but a whole set of metrics and charts that can be used to find the key features.

Thank you for reading! Happy learning!

Source: www.habr.com
