Feature Selection in Machine Learning

Hello, Habr!

We at Reksoft translated the article Feature Selection in Machine Learning into Russian. We hope it will be useful to everyone interested in the topic.

In the real world, data is not always as clean as business customers sometimes assume. That is why data mining and data wrangling are needed. They help identify missing values and patterns in query-structured data that people cannot spot on their own. Machine learning then helps to find these patterns and use the relationships discovered in the data to predict outcomes.

To understand any algorithm, you need to look at all the variables in the data and work out what those variables represent. This is critical because the reasoning behind the results rests on understanding the data. If the data contains 5 or even 50 variables, you can examine them all. What if there are 200? Then there simply is not enough time to study every variable. On top of that, some algorithms do not work with categorical data, so you have to convert all categorical columns into quantitative variables (they may look quantitative, but the metrics will show that they are categorical) before adding them to the model. The number of variables grows as a result, and now there are about 500. What do you do then? One might think the answer is dimensionality reduction. Dimensionality reduction algorithms do reduce the number of parameters, but they hurt interpretability. What if there are other techniques that eliminate features while keeping the remaining ones easy to understand and interpret?

Depending on whether the analysis is based on regression or classification, the feature selection algorithms may differ, but the main idea behind them stays the same.

Highly Correlated Variables

Variables that are highly correlated with each other give the model the same information, so there is no need to use all of them in the analysis. For example, if a dataset contains the features "Online Time" and "Traffic Used", we can assume they will be correlated to some degree, and we will see a strong correlation even if we pick an unbiased data sample. In this case, only one of these variables is needed in the model. If you use both, the model may overfit and become biased toward one particular feature.
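The idea above can be sketched as a simple correlation filter. This is a minimal illustration, not a method taken from the article: the helper name `drop_correlated`, the 0.9 threshold, and the synthetic "online time" and "traffic used" columns are all assumptions made for the example.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep a feature only if its absolute correlation with every
    already-kept feature is below the threshold (hypothetical helper)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic data: traffic_used is almost a scaled copy of online_time.
rng = np.random.default_rng(0)
online_time = rng.normal(size=500)
traffic_used = 2.0 * online_time + rng.normal(scale=0.1, size=500)
age = rng.normal(size=500)

X = np.column_stack([online_time, traffic_used, age])
selected = drop_correlated(X, ["online_time", "traffic_used", "age"])
print(selected)  # ['online_time', 'age']
```

Which of the two correlated columns survives depends only on column order here; in practice that choice is usually made on domain grounds.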

P-values

For algorithms such as linear regression, fitting an initial statistical model is always a good idea. It helps show the importance of features through the p-values this model produces. After setting a significance level, we check the resulting p-values: if a feature's p-value is below the chosen significance level, the feature is declared significant, meaning that a change in its value is likely to be accompanied by a change in the target value.
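As a rough sketch of this check, the snippet below fits ordinary least squares with NumPy and derives two-sided p-values for the coefficients. Two things are assumptions of this example rather than anything stated in the article: the p-values use a normal approximation to the t-distribution (reasonable only for fairly large samples), and the data, the names `x0` and `x1`, and the 0.05 level are invented for illustration. A statistics package would normally report exact t-based p-values.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided p-values for OLS slopes (intercept excluded).
    Assumption: normal approximation instead of the exact t-distribution."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p - 1)           # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / se
    # P(|Z| > |t|) for a standard normal Z equals erfc(|t| / sqrt(2)).
    p_all = [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats]
    return np.array(p_all[1:])

# Synthetic data: the target depends on x0 only; x1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(size=300)

alpha = 0.05                                       # significance level
p_vals = ols_p_values(X, y)
significant = [name for name, p in zip(["x0", "x1"], p_vals) if p < alpha]
print(p_vals, significant)
```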

Forward Selection

Forward selection is a technique based on stepwise regression. Model building starts from complete zero, that is, from an empty model, and each iteration then adds the variable that improves the model being built. Which variable gets added is determined by its significance, which can be calculated with various metrics. The most common way is to use the p-values obtained from an initial statistical model fitted on all variables. Forward selection can sometimes lead to overfitting, because the model may end up containing highly correlated variables even though they give it the same information (the model still shows an improvement).
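A minimal sketch of the forward loop is shown below. Note the swap: instead of p-values, this example uses the gain in R² as the significance metric, simply because it keeps the code short. The function names, the `min_gain` stopping threshold, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, min_gain=0.01):
    """Start from an empty model and add the best variable each round,
    stopping when the R^2 gain drops below min_gain (assumed criterion)."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {c: r_squared(X, y, selected + [c]) for c in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected

# Synthetic data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

chosen = forward_selection(X, y)
print(chosen)  # [0, 2]
```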

Backward Selection

Backward selection also eliminates features step by step, but in the opposite direction compared to forward selection. Here the initial model includes all independent variables. Variables are then removed (one per iteration) if they add no value to the new regression model built at that iteration. Feature elimination is based on the p-values of the initial model. This method is also unreliable when it comes to removing highly correlated variables.
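The backward loop might look roughly like this. The sketch refits the regression at every iteration and drops the variable with the largest p-value until all remaining p-values are below the significance level. As assumptions of this example: p-values come from a normal approximation rather than the exact t-distribution, and the helper names, the 0.05 level, and the data are hypothetical.

```python
import math
import numpy as np

def p_values(X, y, cols):
    """Approximate two-sided p-values for the slopes of an OLS fit on cols.
    Assumption: normal approximation to the t-distribution."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return [math.erfc(abs(b / s) / math.sqrt(2.0))
            for b, s in zip(beta[1:], se[1:])]

def backward_elimination(X, y, alpha=0.05):
    """Start with every variable; remove the least significant one per
    iteration until all remaining p-values are below alpha."""
    kept = list(range(X.shape[1]))
    while kept:
        pv = p_values(X, y, kept)
        worst = int(np.argmax(pv))
        if pv[worst] < alpha:
            break
        kept.pop(worst)
    return kept

# Synthetic data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

kept = backward_elimination(X, y)
print(kept)
```

A noise column can occasionally survive by chance at the 5% level, which is one reason the result is usually cross-checked with another method.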

Recursive Feature Elimination

RFE is a widely used technique/algorithm for selecting an exact number of significant features. Sometimes the method is used to explain a handful of the "most important" features that influence the results; at other times it is used to reduce a very large number of variables (about 200-400), keeping only those that make at least some contribution to the model and excluding all the rest. RFE uses a ranking system: the features in the dataset are assigned ranks. These ranks are then used to recursively eliminate features based on the collinearity between them and on the importance of those features in the model. Besides ranking features, RFE can show whether these features are important or not even for a given number of features (because it is quite likely that the chosen number of features is not optimal, and the optimal number may be either larger or smaller than the chosen one).
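The recursive loop can be sketched in a few lines. This is a simplified illustration, assuming a plain linear model whose absolute coefficients on standardized features serve as the importance score; it does not model the collinearity aspect mentioned above. The function name `rfe`, the rank convention (1 for selected features, higher for features eliminated earlier), and the data are assumptions of the example. Libraries such as scikit-learn ship a ready-made `RFE` class that would normally be used instead.

```python
import numpy as np

def rfe(X, y, n_keep):
    """Repeatedly fit a linear model and drop the weakest feature until
    n_keep remain. Returns (selected columns, rank per feature)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # comparable coefficient scale
    active = list(range(X.shape[1]))
    ranks = np.ones(X.shape[1], dtype=int)      # selected features keep rank 1
    while len(active) > n_keep:
        A = np.column_stack([np.ones(len(y)), Xs[:, active]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = active[int(np.argmin(np.abs(beta[1:])))]
        ranks[weakest] = len(active) - n_keep + 1
        active.remove(weakest)
    return active, ranks

# Synthetic data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

selected, ranks = rfe(X, y, n_keep=2)
print(selected, ranks)
```

Because the number of features to keep is an input, it is common to rerun the loop for several values of `n_keep` and compare validation scores.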

Feature Importance Plot

When we talk about the interpretability of machine learning algorithms, we usually discuss linear regression (which lets you analyze feature importance through p-values) and decision trees (which literally show feature importance in the form of a tree, along with the hierarchy of the features). Algorithms such as Random Forest, LightGBM and XGBoost, on the other hand, often rely on a feature importance plot, that is, a chart of the variables ordered by their "importance scores". This is especially useful when you need to give a structured justification of feature importance in terms of business impact.

Regularization

Regularization is used to control the balance between bias and variance. Bias shows how far the model's predictions are, on average, from the true values; high bias usually means the model underfits. Variance shows how much the predictions differ between the training and test datasets; high variance usually means the model has overfit the training data. Ideally, both bias and variance should be small. This is where regularization comes in! There are two main techniques:

L1 Regularization - Lasso: Lasso penalizes the model's weights to change their importance for the model and can even shrink them to zero (that is, remove those variables from the final model). Lasso is typically used when a dataset contains a large number of variables and you want to exclude some of them to better understand how the important features affect the model (that is, the features that Lasso selected and assigned importance to).

L2 Regularization - Ridge: Ridge's job is to keep all the variables while assigning importance to them based on their contribution to model performance. Ridge is a good choice when the dataset contains a small number of variables and all of them are needed to interpret the findings and results.

Since Ridge keeps all the variables and Lasso does a better job of establishing their importance, an algorithm was developed that combines the best of both regularizations, known as Elastic-Net.
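The difference between the two penalties can be seen in a small sketch. It assumes standardized features and a centered target, uses the closed-form solution for Ridge and plain coordinate descent with soft-thresholding for Lasso, and picks a penalty strength of 0.2 arbitrarily. The function names and data are hypothetical, and Elastic-Net is not implemented here. The point is only to show that Lasso sets some weights exactly to zero while Ridge shrinks every weight but keeps them all.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge: minimizes (1/2n)||y - Xb||^2 + (lam/2)||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ b + X[:, j] * b[j]   # residual without feature j
            rho = X[:, j] @ partial / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: small correlations become exactly zero.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return b

# Synthetic data: only columns 0 and 2 drive the target.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=500)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize features
yc = y - y.mean()                                  # center the target

ridge_coef = ridge(Xs, yc, lam=0.2)
lasso_coef = lasso(Xs, yc, lam=0.2)
print("ridge:", np.round(ridge_coef, 3))
print("lasso:", np.round(lasso_coef, 3))
```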

There are many more ways to select features for machine learning, but the main idea is always the same: demonstrate the importance of the variables and then eliminate some of them based on the resulting importance. Importance is a very subjective term, since it is not a single metric but a whole set of metrics and charts that can be used to find the key attributes.

Thanks for reading! Happy learning!

Source: www.habr.com
