Feature Selection in Machine Learning

Hi Habr!

We at Reksoft have translated the article Feature Selection in Machine Learning into Russian. We hope it will be useful to everyone interested in the topic.

In the real world, data is not always as clean as business customers sometimes assume. That is why data mining and data wrangling are in demand. They help uncover missing values and patterns that people cannot spot on their own. Machine learning then comes in handy for finding these patterns and using the relationships discovered in the data to predict outcomes.

To understand any algorithm, you need to look at all the variables in the data and work out what those variables represent. This is critical because the reasoning behind the results rests on understanding the data. If the data contains 5 or even 50 variables, you can examine them all. What if there are 200? Then there simply will not be enough time to study every single variable. Moreover, some algorithms do not work with categorical data, so you have to convert all categorical columns into quantitative variables (they may look quantitative, but the metrics will show that they are categorical) in order to add them to the model. As a result, the number of variables grows, and now there are about 500 of them.

What to do now? One might think that the answer is dimensionality reduction. Dimensionality reduction algorithms do reduce the number of parameters, but they hurt interpretability. What if there are other techniques that eliminate some features while keeping the remaining ones easy to understand and interpret?

Depending on whether the analysis is based on regression or classification, feature selection algorithms may differ, but the main idea behind them stays the same.

Highly Correlated Variables

Variables that are highly correlated with each other give the model the same information, so there is no need to use all of them in the analysis. For example, if a dataset contains the features "Online Time" and "Traffic Used", we can assume that they will be correlated to some degree, and we will see a strong correlation even if we pick an unbiased sample of the data. In this case, only one of these variables is needed in the model. If you use both, the model will be overfitted and biased toward one particular feature.
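
As a rough illustration of this idea, here is a minimal sketch that drops one variable from each strongly correlated pair. The function name drop_highly_correlated, the 0.9 threshold, and the synthetic "online_time" / "traffic_used" data are assumptions made for this example, not something taken from the original article.

```python
import numpy as np

def drop_highly_correlated(X, names, threshold=0.9):
    """Return the names kept after dropping one feature from each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature |Pearson r|
    keep = []
    for j in range(X.shape[1]):
        # Keep feature j only if it is not strongly correlated with a feature already kept.
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Synthetic data (hypothetical): traffic_used is almost a rescaled copy of online_time.
rng = np.random.default_rng(0)
online_time = rng.normal(size=500)
traffic_used = 2.0 * online_time + rng.normal(scale=0.1, size=500)
age = rng.normal(size=500)

X = np.column_stack([online_time, traffic_used, age])
kept = drop_highly_correlated(X, ["online_time", "traffic_used", "age"])
print(kept)
```

Which member of a correlated pair survives here depends only on column order; in practice you would probably keep the one that is easier to explain to the business.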

P-values

For algorithms such as linear regression, building an initial statistical model is always a good idea. It helps show the significance of the features through the p-values obtained from that model. Once a significance level is set, we check the resulting p-values, and if a value falls below the chosen significance level, the feature is declared significant, meaning that a change in its value is likely to lead to a change in the value of the target.
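
The check described above can be sketched roughly as follows. This is an illustration under stated assumptions: the helper ols_p_values is hypothetical, the data is synthetic, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples. A statistics package would normally report exact p-values for you.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS with an intercept; return (coefficients, approximate p-values) per feature."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])      # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation to the t distribution (large sample).
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return beta[1:], p[1:]                         # drop the intercept entries

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(size=300)           # only the first feature drives y

coef, p_values = ols_p_values(X, y)
significant = [j for j, p in enumerate(p_values) if p < 0.05]
print(significant)
```

With a 0.05 significance level, an irrelevant feature will still pass the check about one time in twenty, so the list can occasionally contain a false positive.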

Forward Selection

Forward selection is a technique based on stepwise regression. Model building starts from scratch, that is, from an empty model, and each iteration adds a variable that improves the model being built. Which variable gets added is determined by its significance, which can be calculated using various metrics. The most common way is to use the p-values obtained in the initial statistical model that uses all the variables. Forward selection can sometimes lead to overfitting, because the model may end up with highly correlated variables even though they give it the same information (the model still shows an improvement).
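
A minimal sketch of forward selection might look like this. It is one possible variant, not the article's exact procedure: it refits the model for each candidate and uses the candidate's own p-value (normal approximation, large sample assumed) instead of the p-values of a single initial model. The helper names and the 0.05 threshold are hypothetical.

```python
import math
import numpy as np

def p_value_of_last(X, y):
    """Approximate two-sided p-value of the last column's coefficient in an OLS fit."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])
    se = math.sqrt(sigma2 * np.linalg.inv(A.T @ A)[-1, -1])
    return math.erfc(abs(beta[-1] / se) / math.sqrt(2.0))   # normal approximation

def forward_selection(X, y, alpha=0.05):
    """Start from an empty model and greedily add the most significant variable."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        scores = {j: p_value_of_last(X[:, selected + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        if scores[best] >= alpha:      # no remaining candidate is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 4.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(size=400)

selected = forward_selection(X, y)
print(selected)
```

The order of the result reflects the order in which variables entered the model, so the strongest predictor comes first.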

Backward Selection

Backward selection also eliminates features step by step, but in the opposite direction compared with forward selection. In this case, the initial model includes all the independent variables. Variables are then removed (one per iteration) if they add no value to the new regression model at that iteration. Features are excluded based on the p-values of the initial model. This method is also unreliable when it comes to removing highly correlated variables.
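
Backward elimination can be sketched in the same spirit. Note the assumptions: this variant recomputes the p-values after every removal rather than relying only on the initial model, the p-values use a normal approximation (large sample assumed), and the function names are made up for the example.

```python
import math
import numpy as np

def feature_p_values(X, y):
    """Approximate two-sided p-values for every feature in an OLS fit with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    t_stats = beta / se
    return [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats[1:]]  # skip intercept

def backward_elimination(X, y, alpha=0.05):
    """Start from the full model and drop the least significant variable each round."""
    kept = list(range(X.shape[1]))
    while kept:
        p = feature_p_values(X[:, kept], y)
        worst = max(range(len(kept)), key=lambda i: p[i])
        if p[worst] <= alpha:          # everything left is significant
            break
        kept.pop(worst)                # remove one variable per iteration
    return kept

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=400)

kept = backward_elimination(X, y)
print(kept)
```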

Recursive Feature Elimination

RFE is a widely used technique/algorithm for selecting an exact number of significant features. Sometimes it is used to explain a handful of the "most important" features that influence the results, and sometimes to cut down a very large number of variables (around 200-400), keeping only those that contribute at least something to the model and discarding the rest. RFE relies on a ranking system: every feature in the dataset is assigned a rank. These ranks are then used to eliminate features recursively, based on the collinearity between them and on the importance of those features in the model. Besides ranking the features, RFE can show whether they matter or not even for a given number of features, because the chosen number of features may well not be optimal, and the optimal number may be larger or smaller than the one chosen.
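
To make the ranking idea concrete, here is a simplified sketch of recursive elimination. It is not a library implementation: it assumes a plain least-squares model on standardized features and drops the feature with the smallest absolute coefficient at each step. The function rfe_ranking and its parameter n_features_to_select are hypothetical names for this example.

```python
import numpy as np

def rfe_ranking(X, y, n_features_to_select=2):
    """Rank features by recursive elimination; rank 1 marks the selected features."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # put coefficients on one scale
    yc = y - y.mean()
    active = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)
    while len(active) > n_features_to_select:
        coef, *_ = np.linalg.lstsq(Xs[:, active], yc, rcond=None)
        weakest = active[int(np.argmin(np.abs(coef)))]
        # Features eliminated earlier get a worse (larger) rank.
        ranking[weakest] = len(active) - n_features_to_select + 1
        active.remove(weakest)
    return ranking

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 4.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(size=400)

ranking = rfe_ranking(X, y, n_features_to_select=2)
print(ranking)
```

Because the number of features to keep is an input, a common follow-up is to repeat the procedure for several values and compare validation scores, which addresses the point that the chosen number may not be optimal.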

Feature Importance Chart

When we talk about the interpretability of machine learning algorithms, we usually discuss linear regressions (which let you analyze the significance of features using p-values) and decision trees (which literally show the importance of features in the form of a tree, along with their hierarchy). On the other hand, algorithms such as Random Forest, LightGBM, and XGBoost often use a feature importance chart, that is, a plot of the variables against their "importance numbers". This is especially useful when you need to give a structured justification of how important the features are in terms of their impact on the business.
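
Tree ensembles usually expose their own importance scores. To stay dependency-free, the sketch below swaps in a different, model-agnostic technique: permutation importance computed for a simple least-squares model, printed as a text bar chart. The feature names, the synthetic data, and the function permutation_importance are assumptions for illustration, and the scores are measured on the training data for brevity.

```python
import numpy as np

def permutation_importance(X, y, n_repeats=10, seed=0):
    """Mean increase in MSE of a least-squares fit when each column is shuffled."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    base = np.mean((y - A @ beta) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Ap = A.copy()
            Ap[:, j + 1] = rng.permutation(Ap[:, j + 1])   # break the feature-target link
            scores[j] += np.mean((y - Ap @ beta) ** 2) - base
    return scores / n_repeats

rng = np.random.default_rng(1)
names = ["traffic_used", "tenure", "noise"]                # hypothetical feature names
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=500)

importance = permutation_importance(X, y)

# Text version of an importance chart: one bar per feature, longest bar first.
for name, score in sorted(zip(names, importance), key=lambda t: -t[1]):
    bar = "#" * int(round(20 * max(score, 0.0) / importance.max()))
    print(f"{name:>12} {bar} {score:.2f}")
```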

Regularization

Regularization is used to control the balance between bias and variance. Bias reflects how far the model's predictions are systematically off because the model is too simple for the data (underfitting). Variance reflects how sensitive the model is to the particular training set, which shows up as a gap between its predictions on the training and test datasets (overfitting). Ideally, both bias and variance should be small. This is where regularization comes to the rescue! There are two main techniques:

L1 Regularization - Lasso: Lasso penalizes the model's weights to change their importance to the model and can even zero them out (that is, remove those variables from the final model). Typically, Lasso is used when a dataset contains a large number of variables and you want to exclude some of them to better understand how the important features (the ones that Lasso selected and assigned a nonzero weight) affect the model.

L2 Regularization - Ridge: The job of Ridge is to keep all the variables while assigning them importance based on their contribution to the model's performance. Ridge is a good choice when the dataset contains a small number of variables and all of them are needed to interpret the findings and results.

Since Ridge keeps all the variables and Lasso does a better job of establishing their importance, an algorithm was developed that combines the best features of both regularizations, known as Elastic-Net.
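
The difference between the three penalties can be seen in a small sketch. It assumes a particular form of the objective, (1/2n)||y - Xb||^2 + alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||^2, minimized by coordinate descent on centered data. The function elastic_net and the values of alpha and l1_ratio are illustrative choices: l1_ratio=1 behaves like Lasso, l1_ratio=0 like Ridge, and values in between like Elastic-Net.

```python
import numpy as np

def elastic_net(X, y, alpha=0.1, l1_ratio=1.0, n_iter=200):
    """Coordinate descent for a least-squares loss with combined L1 and L2 penalties.

    Assumes X and y are centered, so no intercept is fitted.
    """
    n, p = X.shape
    b = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]       # residual with feature j left out
            rho = X[:, j] @ r_j / n
            # Soft-thresholding: the L1 part can push a weight exactly to zero.
            soft = np.sign(rho) * max(abs(rho) - alpha * l1_ratio, 0.0)
            # The L2 part only shrinks the weight; it never zeroes it out.
            b[j] = soft / (z[j] + alpha * (1.0 - l1_ratio))
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X = X - X.mean(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=500)   # features 2-4 are irrelevant
y = y - y.mean()

lasso = elastic_net(X, y, alpha=0.3, l1_ratio=1.0)
ridge = elastic_net(X, y, alpha=0.3, l1_ratio=0.0)
mixed = elastic_net(X, y, alpha=0.3, l1_ratio=0.5)

print("lasso:", np.round(lasso, 3))
print("ridge:", np.round(ridge, 3))
print("mixed:", np.round(mixed, 3))
```

In this setup the Lasso weights of the irrelevant features should come out as exact zeros, while Ridge leaves every weight small but nonzero, which mirrors the trade-off described above.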

There are many more ways to select features for machine learning, but the main idea is always the same: demonstrate the importance of the variables and then eliminate some of them based on that importance. Importance is a very subjective term, since it is not a single measure but a whole set of metrics and charts that can be used to find the key features.

Thanks for reading! Happy learning!

Source: www.habr.com
