In February-March 2019, a competition on ranking a social network feed was held: SNA Hackathon.
This is the third time a hackathon has been held under this name. It is organized by the social network ok.ru, so the task and the data relate directly to this social network.
SNA (social network analysis) is here better understood not as analysis of a social graph, but rather as analysis of a social network.
- In 2014, the task was to predict the number of likes a post would receive.
- In 2016, the VVZ task ("people you may know"), which is closer to social graph analysis.
- In 2019, ranking a user's feed by the probability that the user will like a post.
I can't speak about 2014, but in 2016 and 2019, besides data analysis skills, skills in working with big data were also required. I think it was the combination of machine learning and big data processing problems that drew me to these competitions, and my experience in these areas helped me win.
In 2019, the competition was organized on the mlbootcamp platform.
The competition started online on February 7 and consisted of 3 tasks. Anyone could register on the site, download
Task
The source data provides user IDs (userId) and post IDs (objectId). If a user was shown a post, the data contains a row with the userId, the objectId, the user's reactions to that post (feedback), and a set of various features or links to images and texts.
userId | objectId | ownerId | feedback | images |
---|---|---|---|---|
3555 | 22 | 5677 | [liked, clicked] | [hash1] |
12842 | 55 | 32144 | [disliked] | [hash2,hash3] |
13145 | 35 | 5677 | [clicked, reshared] | [hash2] |
The test dataset has the same structure, but the feedback field is missing. The task is to predict the presence of the 'liked' reaction in the feedback field.
The submission file has the following structure:
userId | SortedList[objectId] |
---|---|
123 | 78,13,54,22 |
128 | 35,61,55 |
131 | 35,68,129,11 |
The metric is the average ROC AUC over users.
A more detailed description of the data can be found here
Online stage
At the online stage, the task was split into 3 parts:
- Collaborative system: includes all features except images and texts;
- Images: includes only information about images;
- Texts: includes only information about texts.
Offline stage
At the offline stage, the data included all features, while texts and images were sparse. There were 1.5 times more rows in the dataset, and there were already a lot of them.
Solving the problem
Since I do CV at work, I started my journey in this competition with the "Images" task. The data provided were userId, objectId, ownerId (the group in which the post was published), timestamps of post creation and display, and, of course, the image for the post.
After generating several features based on the timestamps, the next idea was to take the penultimate layer of a neural network pretrained on imagenet and feed these embeddings into boosting.
The results were not impressive. The embeddings from the imagenet network are irrelevant, I thought, so I need to build my own autoencoder.
It took a lot of time and the results did not improve.
Feature generation
Working with images takes a lot of time, so I decided to do something simpler.
As you can see right away, there are several categorical features in the dataset, and so as not to bother too much, I simply took catboost. The solution was excellent: without any tuning I immediately got to the first line of the leaderboard.
There is a lot of data and it is stored in parquet format, so without thinking twice I took Scala and started writing everything in Spark.
The simplest features, which gave more gain than the image embeddings:
- how many times each objectId, userId, and ownerId appeared in the data (should correlate with popularity);
- how many posts from an ownerId the userId has seen (should correlate with the user's interest in the group);
- how many unique userIds viewed posts from an ownerId (reflects the size of the group's audience).
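The counters listed above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the Spark code used in the competition; the toy rows and the function name `count_features` are made up for the example.

```python
from collections import Counter, defaultdict

# Toy impression log: one (userId, objectId, ownerId) tuple per shown post.
rows = [
    (1, 10, 100),
    (1, 11, 100),
    (2, 10, 100),
    (2, 12, 200),
    (3, 12, 200),
]

def count_features(rows):
    """Build the simple frequency counters described in the list above."""
    user_cnt = Counter(u for u, _, _ in rows)              # how active a userId is
    object_cnt = Counter(o for _, o, _ in rows)            # how often an objectId was shown
    owner_cnt = Counter(g for _, _, g in rows)             # how often an ownerId was shown
    user_owner_cnt = Counter((u, g) for u, _, g in rows)   # user's interest in a group
    audience = defaultdict(set)
    for u, _, g in rows:
        audience[g].add(u)
    owner_audience = {g: len(users) for g, users in audience.items()}  # unique viewers
    return user_cnt, object_cnt, owner_cnt, user_owner_cnt, owner_audience

user_cnt, object_cnt, owner_cnt, user_owner_cnt, owner_audience = count_features(rows)
print(object_cnt[10], user_owner_cnt[(1, 100)], owner_audience[100])  # 2 2 2
```

Each counter is then joined back to the rows as an ordinary numeric column.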
From the timestamps it was possible to obtain the time of day when the user viewed the feed (morning/afternoon/evening/night). By combining these categories, you can keep generating features:
- how many times the userId logged in in the evening;
- at what time of day this post (objectId) is shown most often, and so on.
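A rough sketch of the time-of-day features, in plain Python. The bucket boundaries, the use of UTC, and the assumption that timestamps are in milliseconds are all mine; the article does not specify them.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def day_part(ts_ms):
    """Map a millisecond UNIX timestamp to a coarse time-of-day bucket (UTC, assumed)."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def ms(hour):
    """Helper for the toy data: 2019-02-07 at the given UTC hour, in milliseconds."""
    return int(datetime(2019, 2, 7, hour, tzinfo=timezone.utc).timestamp() * 1000)

# Toy rows: (userId, objectId, show timestamp in ms).
shows = [(1, 10, ms(19)), (1, 11, ms(20)), (2, 10, ms(8)), (3, 10, ms(21))]

# How many times each userId viewed the feed in the evening.
evening_shows = Counter(u for u, _, t in shows if day_part(t) == "evening")

# At what time of day each objectId is shown most often.
parts_by_object = defaultdict(Counter)
for _, obj, t in shows:
    parts_by_object[obj][day_part(t)] += 1
top_part = {obj: c.most_common(1)[0][0] for obj, c in parts_by_object.items()}

print(evening_shows[1], top_part[10])  # 2 evening
```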
All of this gradually improved the metric. But the training dataset contains about 20M records, so adding features slowed training down considerably.
I rethought my approach to using the data. Although the data is time-dependent, I did not see any obvious information leaks "from the future"; nevertheless, just in case, I split it like this:
The training set (February and 2 weeks of March) was split into 2 parts.
The model was trained on data from the last N days. The aggregations described above were built on all the data, including the test set. At the same time, data appeared on which various encodings of the target variable could be built. The simplest approach is to reuse the code that already creates new features, and simply feed it the data on which no training will be done and where target = 1.
Thus, we got similar features:
- How many times the userId has seen a post from the group ownerId;
- How many times the userId has liked a post from the group ownerId;
- The percentage of posts from the ownerId that the userId liked.
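The split and the resulting features might look roughly like this. It is a simplified sketch in plain Python: the integer `day` column, the cut-off value, and the helper names are hypothetical, and the fallback of zeros for unseen pairs is my own choice.

```python
from collections import defaultdict

# Toy log: (day, userId, ownerId, liked). `day` is a hypothetical integer day index.
log = [
    (1, 1, 100, 1),
    (2, 1, 100, 0),
    (3, 1, 100, 1),
    (4, 2, 100, 0),
    (9, 1, 100, 1),  # recent rows: the model is trained only on these
    (9, 2, 100, 0),
]

def split_by_day(log, train_from_day):
    """Older rows feed the target statistics; the last days are used for training."""
    history = [r for r in log if r[0] < train_from_day]
    train = [r for r in log if r[0] >= train_from_day]
    return history, train

def user_owner_stats(history):
    """Shows, likes and like rate per (userId, ownerId), computed on history only."""
    shows = defaultdict(int)
    likes = defaultdict(int)
    for _, user, owner, liked in history:
        shows[(user, owner)] += 1
        likes[(user, owner)] += liked
    return {k: (shows[k], likes[k], likes[k] / shows[k]) for k in shows}

history, train = split_by_day(log, train_from_day=8)
stats = user_owner_stats(history)

# Attach the statistics to the training rows; unseen pairs get zeros (an assumption).
features = [stats.get((u, g), (0, 0, 0.0)) for _, u, g, _ in train]
print(features[1])  # (1, 0, 0.0)
```

Because the statistics come only from rows that are never used for training, the target of a training row does not leak into its own features.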
In other words, this turned out to be mean target encoding on a part of the dataset for various combinations of categorical features. In principle, catboost also builds target encoding, so from this point of view there is no benefit, but, for example, it became possible to count the number of unique users who liked posts in a group. At the same time, the main goal was achieved: my dataset shrank several times over, and it became possible to continue generating features.
While catboost can build encodings only from the liked reaction, feedback contains other reactions: reshared, disliked, unliked, clicked, ignored, for which encodings can be built by hand. I recomputed all kinds of aggregates and removed features with low importance so as not to inflate the dataset.
By that time I was in first place by a wide margin. The only confusing thing was that the image embeddings gave almost no gain. The idea came to hand everything over to catboost. We cluster the images with KMeans and get a new categorical feature, imageCat.
Here are some classes after manual filtering and merging of the clusters obtained from KMeans.
Based on imageCat we generate:
- New categorical features:
  - Which imageCat the userId viewed most often;
  - Which imageCat the ownerId shows most often;
  - Which imageCat the userId liked most often;
- Various counters:
  - How many unique imageCat values the userId has viewed;
- About 15 similar features, plus target encoding as described above.
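A small sketch of how such per-user imageCat features could be derived once every image has a cluster label. The cluster names, the toy data, and the function name are invented for illustration; the KMeans step itself is not shown.

```python
from collections import Counter, defaultdict

# Toy rows: (userId, imageCat, liked). imageCat is the cluster label of the post's image.
views = [
    (1, "cats", 1), (1, "cats", 0), (1, "memes", 0),
    (2, "cars", 1), (2, "memes", 1), (2, "memes", 1),
]

def image_cat_features(views):
    """Most viewed / most liked imageCat and the number of distinct imageCat per user."""
    seen = defaultdict(Counter)
    liked = defaultdict(Counter)
    for user, cat, is_liked in views:
        seen[user][cat] += 1
        if is_liked:
            liked[user][cat] += 1
    feats = {}
    for user in seen:
        top_liked = liked[user].most_common(1)
        feats[user] = {
            "top_seen_cat": seen[user].most_common(1)[0][0],
            "top_liked_cat": top_liked[0][0] if top_liked else None,  # None: no likes yet
            "unique_cats": len(seen[user]),
        }
    return feats

feats = image_cat_features(views)
print(feats[1])  # {'top_seen_cat': 'cats', 'top_liked_cat': 'cats', 'unique_cats': 2}
```

The same function works per ownerId if the first column is swapped for the group identifier.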
Texts
The results in the image competition suited me, and I decided to try my hand at texts. I had not worked much with texts before and, foolishly, killed a day on tf-idf and svd. Then I saw the baseline with doc2vec, which does exactly what I need. After slightly tuning the doc2vec parameters, I obtained text embeddings.
Then I simply reused the code for images, replacing the image embeddings with text embeddings. As a result, I took 2nd place in the text competition.
Collaborative system
There remained one competition that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of this competition should have had the greatest impact on the offline stage.
I took all the features in the source data, selected the categorical ones, and computed the same aggregates as for images, except for the features based on the images themselves. Just putting this into catboost got me to 2nd place.
First steps of catboost optimization
First and second places pleased me, but there was an understanding that I had not done anything special, which meant I could expect to lose positions.
The goal of the competition is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.
Let me give a simple example:
userId | objectId | prediction | ground truth |
---|---|---|---|
1 | 10 | 0.9 | 1 |
1 | 11 | 0.8 | 1 |
1 | 12 | 0.7 | 1 |
1 | 13 | 0.6 | 1 |
1 | 14 | 0.5 | 0 |
2 | 15 | 0.4 | 0 |
2 | 16 | 0.3 | 1 |
Let's make a small permutation
userId | objectId | prediction | ground truth |
---|---|---|---|
1 | 10 | 0.9 | 1 |
1 | 11 | 0.8 | 1 |
1 | 12 | 0.7 | 1 |
1 | 13 | 0.6 | 0 |
2 | 16 | 0.5 | 1 |
2 | 15 | 0.4 | 0 |
1 | 14 | 0.3 | 1 |
We get the following results:
Model | AUC | User1 AUC | User2 AUC | mean AUC |
---|---|---|---|---|
Option 1 | 0.8 | 1.0 | 0.0 | 0.5 |
Option 2 | 0.7 | 0.75 | 1.0 | 0.875 |
As you can see, improving the overall AUC metric does not mean improving the mean AUC metric within a user.
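The numbers in the table can be reproduced with a short script. This is a sketch using the pairwise definition of ROC AUC; skipping users whose AUC is undefined (only one class present) is my assumption, since the article does not say how the organizers handled such users.

```python
from collections import defaultdict

def auc(pairs):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in pairs if y == 1]
    neg = [p for p, y in pairs if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_user_auc(rows):
    """Average of per-user AUC values, skipping users with undefined AUC (assumption)."""
    by_user = defaultdict(list)
    for user, pred, truth in rows:
        by_user[user].append((pred, truth))
    scores = [s for s in (auc(v) for v in by_user.values()) if s is not None]
    return sum(scores) / len(scores)

# (userId, prediction, ground truth) taken from the two tables above.
option1 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 1), (1, 0.5, 0), (2, 0.4, 0), (2, 0.3, 1)]
option2 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 0), (2, 0.5, 1), (2, 0.4, 0), (1, 0.3, 1)]

for name, rows in (("Option 1", option1), ("Option 2", option2)):
    overall = auc([(p, y) for _, p, y in rows])
    print(name, overall, mean_user_auc(rows))
# Option 1 0.8 0.5
# Option 2 0.7 0.875
```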
Catboost
5 minutes before the online stage of the "Collaborative Systems" competition closed, Sergey Shalnov moved me to second place. We went the rest of the way together.
Preparing for the offline stage
Winning the online stage guaranteed us an RTX 2080 TI video card, but the main prize of 300,000 rubles and, most likely, the final first place itself made us keep working during these weeks.
As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about
Watching the talk led me to the idea that we need to reset all parameters to their default values, and do the tuning very carefully and only after fixing the feature set. Now one training run took about 15 hours, but a single model managed to get a score better than the one obtained in the ensemble with ranking.
Feature generation
In the Collaborative Systems competition, a large number of features are rated as important for the model. For example, auditweights_spark_svd is the most important feature, but there is no information about what it means. I thought it would be worthwhile to compute various aggregates based on the important features. For example, the average auditweights_spark_svd by user, by group, by object. The same can be computed on the data on which no training is done and where target = 1, that is, the average auditweights_spark_svd by user over the objects they liked. Besides auditweights_spark_svd, there were several other important features. Here are some of them:
- auditweightsCTrGender
- auditweightsCTrHigh
- userOwnerCounterCreateLikes
For example, the average auditweightsCTrGender by userId turned out to be an important feature, as did the average userOwnerCounterCreateLikes by userId+ownerId. This should already make you think that you need to understand the meaning of the fields.
Other important features were auditweightsLikesCount and auditweightsShowsCount. Dividing one by the other gave an even more important feature.
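Both kinds of features, group averages of an important column and the likes-to-shows ratio, can be sketched as follows. The toy values and helper names are illustrative; returning 0.0 when the show count is zero is an assumption.

```python
from collections import defaultdict

rows = [
    {"userId": 1, "auditweights_spark_svd": 0.25, "auditweightsLikesCount": 3, "auditweightsShowsCount": 12},
    {"userId": 1, "auditweights_spark_svd": 0.75, "auditweightsLikesCount": 1, "auditweightsShowsCount": 4},
    {"userId": 2, "auditweights_spark_svd": 0.5, "auditweightsLikesCount": 0, "auditweightsShowsCount": 0},
]

def group_mean(rows, key, value):
    """Average of column `value` within each group defined by column `key`."""
    total = defaultdict(float)
    count = defaultdict(int)
    for r in rows:
        total[r[key]] += r[value]
        count[r[key]] += 1
    return {k: total[k] / count[k] for k in total}

def likes_to_shows(row):
    """Ratio of the two counters; 0.0 when there were no shows (an assumption)."""
    shows = row["auditweightsShowsCount"]
    return row["auditweightsLikesCount"] / shows if shows else 0.0

svd_by_user = group_mean(rows, "userId", "auditweights_spark_svd")
ratios = [likes_to_shows(r) for r in rows]
print(svd_by_user[1], ratios)  # 0.5 [0.25, 0.25, 0.0]
```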
Data leaks
Competitions and production modeling are very different tasks. When preparing data, it is very hard to account for all the details and not pass some non-trivial information about the target variable into the test set. If we are building a production solution, we try to avoid using data leaks when training the model. But if we want to win a competition, then data leaks are the best features.
After studying the data, you can see that for a given objectId the values of auditweightsLikesCount and auditweightsShowsCount change, which means that the ratio of the maximum values of these features will reflect the post's conversion much better than the ratio at the time of display.
The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
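A minimal sketch of this feature: take the maximum of both counters within each objectId and divide one by the other. The toy rows and the function name are made up.

```python
from collections import defaultdict

# Toy rows: (objectId, auditweightsShowsCount, auditweightsLikesCount) at display time.
rows = [(1, 12, 3), (1, 15, 3), (1, 16, 4), (2, 5, 0), (2, 8, 1)]

def max_conversion(rows):
    """auditweightsLikesCountMax / auditweightsShowsCountMax per objectId."""
    max_shows = defaultdict(int)
    max_likes = defaultdict(int)
    for obj, shows, likes in rows:
        max_shows[obj] = max(max_shows[obj], shows)
        max_likes[obj] = max(max_likes[obj], likes)
    # Objects that were never shown are skipped to avoid division by zero.
    return {obj: max_likes[obj] / max_shows[obj] for obj in max_shows if max_shows[obj] > 0}

conversion = max_conversion(rows)
print(conversion)  # {1: 0.25, 2: 0.125}
```

The maxima are taken over all rows of the object, including rows shown later than the one being scored, which is exactly why this counts as a leak.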
But what if we look at the data more closely? Let's sort by show date and get:
objectId | userId | auditweightsShowsCount | auditweightsLikesCount | target (is liked) |
---|---|---|---|---|
1 | 1 | 12 | 3 | probably not |
1 | 2 | 15 | 3 | probably yes |
1 | 3 | 16 | 4 | |
It was surprising when I found the first such example and it turned out that my prediction did not come true. But, given that the maximum values of these features within an object gave a gain, we did not get lazy and decided to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next moment in time. By adding the feature
(auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp jump forward.
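The "next value" lookup can be sketched like this: sort the rows of each objectId by show time and pair every row with its successor. The show-time column, the toy data, and the use of None for missing or undefined values are assumptions made for the example; the ratio keeps the sign convention of the formula above.

```python
from collections import defaultdict

# Toy rows: (objectId, userId, show time, auditweightsShowsCount, auditweightsLikesCount).
rows = [(1, 1, 100, 12, 3), (1, 2, 110, 15, 3), (1, 3, 120, 16, 4)]

def next_value_features(rows):
    """For each row, compare the counters with those of the next show of the same object."""
    by_object = defaultdict(list)
    for r in rows:
        by_object[r[0]].append(r)
    out = {}
    for obj, group in by_object.items():
        group.sort(key=lambda r: r[2])  # order by show time
        for cur, nxt in zip(group, group[1:] + [None]):
            if nxt is None:
                out[(obj, cur[1])] = None  # last show: no "next" values available
                continue
            d_shows = nxt[3] - cur[3]      # auditweightsShowsCountNext - auditweightsShowsCount
            d_likes = cur[4] - nxt[4]      # auditweightsLikesCount - auditweightsLikesCountNext
            ratio = d_shows / d_likes if d_likes else None  # undefined when likes did not change
            out[(obj, cur[1])] = (d_shows, d_likes, ratio)
    return out

leak = next_value_features(rows)
print(leak[(1, 1)], leak[(1, 2)], leak[(1, 3)])  # (3, 0, None) (1, -1, -1.0) None
```

In the toy data the like counter grows right after user 2 was shown the post, which matches the "probably yes" row of the table.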
Similar leaks can be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, of auditweightsCTrGender within objectId+userGender. We found 6 similar fields with leaks and extracted as much information from them as we could.
By this time we had squeezed as much information as possible out of the collaborative features, but had not returned to the image and text competitions. I had a great idea to check: how much do features based directly on images or texts give in the corresponding competitions?
There were no leaks in the image and text competitions, but by that time I had reset the catboost parameters to defaults, cleaned up the code, and added a few features. The totals were:
Solution | score |
---|---|
Maximum with images | 0.6411 |
Maximum without images | 0.6297 |
Second place result | 0.6295 |
Solution | score |
---|---|
Maximum with texts | 0.666 |
Maximum without texts | 0.660 |
Second place result | 0.656 |
Solution | score |
---|---|
Maximum collaborative | 0.745 |
Second place result | 0.723 |
It became clear that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working on them.
Further feature generation for the collaborative systems gave no gain, and we turned to ranking. At the online stage, the ensemble of classification and ranking gave me a small gain, as it turned out, because I had undertrained the classification. None of the loss functions, including YetiRankPairwise, came anywhere close to the result that LogLoss gave (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, which I could not get to run.
Offline stage
At the offline stage, the data structure remained the same, but there were minor changes:
- the identifiers userId, objectId, ownerId were re-randomized;
- several features were removed and several were renamed;
- the data grew by about 1.5 times.
In addition to the listed difficulties, there was one big plus: the team was allocated a large server with an RTX 2080TI. I enjoyed htop for a long time.
There was only one idea: to simply reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually began to verify that the results were reproducible. The main problem we faced was the increase in data volume. We decided to reduce the load a bit and set the catboost parameter ctr_complexity=1. This lowers the score a little, but my model started working, and the result was good: 0.733. Sergey, unlike me, did not split the data into 2 parts and trained on all the data; although this gave the best result at the online stage, at the offline stage it caused many difficulties. If we had taken all the features we generated and tried to shove them into catboost head-on, nothing would have worked at the online stage. Sergey did type optimization, for example, converting float64 types to float32.
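The effect of such a type conversion is easy to see even without a dataframe library. The sketch below uses the standard-library `array` module as a stand-in for a numeric column; in pandas the analogous step would be `astype("float32")`. The sample values are arbitrary.

```python
from array import array

values = [0.123456789, 1.5, 2.0 / 3.0]

as_float64 = array("d", values)  # 8 bytes per value
as_float32 = array("f", values)  # 4 bytes per value

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)

# The price of halving the memory is a small rounding error in each value.
max_error = max(abs(a - b) for a, b in zip(as_float64, as_float32))

print(bytes64, bytes32)
print(max_error < 1e-6)
```

For gradient boosting features this loss of precision is usually negligible, while the memory footprint of floating-point columns is cut in half.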
These results were enough to win, but we were hiding our real score and could not be sure that other teams weren't doing the same.
Fight to the end
Catboost tuning
Our solution was fully reproduced and we added the text and image features, so all that remained was to tune the catboost parameters. Sergey trained on CPU with a small number of iterations, and I trained with ctr_complexity=1. There was one day left, and if we just added iterations or increased ctr_complexity, we could get an even better score by morning and go for a walk all day.
At the offline stage, scores could be hidden very easily by simply selecting a solution other than the best one on the site. We expected drastic changes in the leaderboard in the last minutes before submissions closed and decided not to stop.
From Anna's video, I learned that to improve model quality it is best to tune the following parameters:
- learning_rate: The default value is computed based on the dataset size. Decreasing learning_rate requires increasing the number of iterations to keep the same quality.
- l2_leaf_reg: The regularization coefficient, default value 3, preferably chosen from 2 to 30. Decreasing the value leads to more overfitting.
- bagging_temperature: Adds randomization to the weights of the objects in the sample. The default value is 1, where the weights are drawn from an exponential distribution. Decreasing the value leads to more overfitting.
- random_strength: Affects the choice of splits at a given iteration. The higher random_strength, the higher the chance that a split with low importance is selected. With each subsequent iteration, the randomness decreases. Decreasing the value leads to more overfitting.
The other parameters have a much smaller effect on the final result, so I did not try to tune them. One training iteration on my GPU dataset with ctr_complexity=1 took 20 minutes, and the parameters selected on the reduced dataset differed slightly from the optimal ones on the full dataset. In the end, I did about 30 iterations on 10% of the data, and then about 10 more iterations on all the data. It turned out something like this:
- learning_rate: increased by 40% from the default;
- l2_leaf_reg: left unchanged;
- bagging_temperature and random_strength: reduced to 0.8.
We can conclude that the model was undertrained with the default parameters.
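Written out as a parameter dictionary, the outcome of this tuning might look roughly as follows. The function name is hypothetical, and the default learning rate has to be supplied by the caller because catboost derives it from the dataset; the value 0.03 below is only a placeholder, not a number from the article.

```python
def tuned_params(default_learning_rate):
    """Parameter dict reflecting the adjustments listed above (a sketch, not the exact config)."""
    return {
        "learning_rate": default_learning_rate * 1.4,  # +40% relative to the default
        "l2_leaf_reg": 3,                              # left at the default value
        "bagging_temperature": 0.8,                    # reduced from 1
        "random_strength": 0.8,                        # reduced from 1
        "ctr_complexity": 1,                           # keeps memory use manageable
        "task_type": "GPU",
    }

# 0.03 is a placeholder for whatever default catboost picked for the dataset.
params = tuned_params(default_learning_rate=0.03)
print(params)
# The dict is meant to be unpacked into the model constructor,
# e.g. catboost.CatBoostClassifier(**params).
```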
I was very surprised when I saw the result on the leaderboard:
Model | model 1 | model 2 | model 3 | ensemble |
---|---|---|---|---|
Without tuning | 0.7403 | 0.7404 | 0.7404 | 0.7407 |
With tuning | 0.7406 | 0.7405 | 0.7406 | 0.7408 |
I concluded for myself that if fast model inference is not required, it is better to replace parameter tuning with an ensemble of several models with non-optimized parameters.
Sergey was optimizing the dataset size in order to run it on the GPU. The simplest option is to cut off part of the data, but this can be done in several ways:
- gradually remove the oldest data (the beginning of February) until the dataset starts fitting into memory;
- remove the features with the lowest importance;
- remove the userIds for which there is only one record;
- keep only the userIds that are present in the test set.
And in the end, make an ensemble of all the options.
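Three of these reduction strategies are simple row filters and can be sketched in plain Python. The toy rows, the integer `day` column, and the function names are illustrative; removing low-importance features requires a trained model and is not shown.

```python
from collections import Counter

# Toy training rows: (day, userId). `day` is a hypothetical integer day index.
train = [(1, 1), (2, 1), (3, 2), (5, 3), (6, 3), (7, 4)]
test_users = {1, 4}

def drop_oldest(rows, min_day):
    """Remove the oldest data: keep only rows from `min_day` onwards."""
    return [r for r in rows if r[0] >= min_day]

def drop_single_record_users(rows):
    """Remove userIds that occur only once in the data."""
    counts = Counter(u for _, u in rows)
    return [r for r in rows if counts[r[1]] > 1]

def keep_test_users(rows, test_users):
    """Keep only rows whose userId also appears in the test set."""
    return [r for r in rows if r[1] in test_users]

print(len(drop_oldest(train, 3)),
      len(drop_single_record_users(train)),
      len(keep_test_users(train, test_users)))  # 4 4 3
```

Each filter yields a different training subset, so models trained on them make somewhat different errors, which is what makes the final ensemble useful.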
Final ensemble
By late evening of the last day, we had submitted an ensemble of our models that gave 0.742. Overnight I launched my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. Only at 4 in the morning was it done, and I made the final ensemble, which gave 0.7433 on the public leaderboard.
Because of our different approaches to solving the problem, our predictions were not strongly correlated, which gave a good gain in the ensemble. To get a good ensemble, it is better to use the raw model predictions, predict(prediction_type='RawFormulaVal'), and to set scale_pos_weight=neg_count/pos_count.
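A sketch of blending two models on raw scores. The raw values below are invented stand-ins for what `predict(prediction_type='RawFormulaVal')` would return, and equal blending weights are an assumption; the article does not state the weights that were used.

```python
def scale_pos_weight(labels):
    """neg_count / pos_count, the class weight suggested in the text."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

def blend_raw(raw_a, raw_b, weight_a=0.5):
    """Weighted average of raw (pre-sigmoid) scores from two models."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(raw_a, raw_b)]

labels = [1, 0, 0, 0]
raw_model_1 = [2.0, -1.0, 0.5, -2.0]   # hypothetical raw scores of the first model
raw_model_2 = [1.0, -0.5, -1.5, -1.0]  # hypothetical raw scores of the second model

blended = blend_raw(raw_model_1, raw_model_2)
# Only the order matters for AUC, so the blended raw scores can be ranked directly.
ranking = sorted(range(len(blended)), key=lambda i: -blended[i])

print(scale_pos_weight(labels), blended, ranking)
# 3.0 [1.5, -0.75, -0.5, -1.5] [0, 2, 1, 3]
```

Averaging raw scores instead of probabilities avoids the compression that the sigmoid applies near 0 and 1, so confident predictions of either model keep their influence.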
On the website you can see
Other solutions
Many teams followed the canons of recommender system algorithms. Not being an expert in this field, I cannot evaluate them, but I remember 2 interesting solutions.
- Nikolay Anokhin's solution. Nikolay, being an employee of Mail.ru, did not claim any prizes, so his goal was not to achieve the maximum score but to obtain a simple solution.
- The solution of the team that won the Jury Prize, based on this article from facebook, made very good image clustering possible without manual work.
Conclusion
What stuck in my memory the most:
- If there are categorical features in the data and you know how to do target encoding correctly, it is still better to try catboost.
- If you are participating in a competition, you should not spend time tuning parameters other than learning_rate and iterations. A faster solution is to make an ensemble of several models.
- Boosting models can train on the GPU. Catboost can train very quickly on the GPU, but it eats a lot of memory.
- During development and testing of ideas, it is better to set a small rsm~=0.2 (CPU only) and ctr_complexity=1.
- Unlike other teams, the ensemble of our models gave a large gain. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each of us had our own bugs.
- It is not clear why ranking optimization performed worse than classification optimization.
- I gained some experience working with texts and an understanding of how recommender systems are built.
Thanks to the organizers for the emotions, knowledge, and prizes received.
Source: www.habr.com