In February-March 2019, a competition was held to rank a social network feed
SNA Hackathon
This is the third time a hackathon has been held under this name. It is organized by the social network ok.ru, so the task and the data relate directly to this social network.
SNA (social network analysis) in this case is better understood not as analysis of a social graph, but as analysis of a social network.
- In 2014, the task was to predict the number of likes a post would get.
- In 2016 - the VVZ task (from the Russian for "you may know each other", i.e. suggesting acquaintances), closer to social graph analysis.
- In 2019, rank a user's feed based on the probability that the user will like a post.
I can't speak for 2014, but in 2016 and 2019, besides data analysis skills, skills in working with big data were also required. I think it was the combination of machine learning and big data processing problems that drew me to these competitions, and my experience in these areas helped me win.
mlbootcamp
In 2019, the competition was hosted on the platform
The competition began online on February 7 and consisted of 3 tasks. Anyone could register on the site, download
Task
The source data provides user IDs (userId) and post IDs (objectId). If a user was shown a post, the data contains a row with the userId, the objectId, the user's reactions to that post (feedback), and a set of various features or links to images and texts.
userId | objectId | ownerId | feedback | images |
---|---|---|---|---|
3555 | 22 | 5677 | [liked, clicked] | [hash1] |
12842 | 55 | 32144 | [disliked] | [hash2,hash3] |
13145 | 35 | 5677 | [clicked, reshared] | [hash2] |
The test dataset has the same structure, but the feedback field is missing. The task is to predict whether the 'liked' reaction is present in the feedback field.
The submission file has the following structure:
userId | SortedList[objectId] |
---|---|
123 | 78,13,54,22 |
128 | 35,61,55 |
131 | 35,68,129,11 |
The metric is the mean ROC AUC across users.
A more detailed description of the data can be found at
Online stage
At the online stage, the task was split into 3 parts:
- Collaborative system: includes all features except images and texts;
- Images: includes only information about images;
- Texts: includes only information about texts.
Offline stage
At the offline stage, the data included all features, while texts and images were sparse. The dataset, which was already large, had 1.5 times more rows.
Solving the problem
Since I do CV at work, I started my journey in this competition with the "Images" task. The data provided were userId, objectId, ownerId (the group in which the post was published), timestamps for creating and showing the post, and, of course, the image for the post.
After generating a few timestamp-based features, the next idea was to take the penultimate layer of a neural network pre-trained on imagenet and feed these embeddings into boosting.
The results were not impressive. The embeddings from the imagenet network are irrelevant, I thought, so I need to train my own autoencoder.
It took a lot of time and the result did not improve.
Feature generation
Working with images takes a lot of time, so I decided to do something simpler.
As you can see right away, the dataset has several categorical features, and so as not to bother too much, I just took catboost. The solution was excellent: without any tuning I immediately reached the first line of the leaderboard.
There is a lot of data and it is stored in parquet format, so without thinking twice I took scala and started writing everything in spark.
The simplest features, which gave more gain than image embeddings:
- how many times objectId, userId and ownerId occurred in the data (should correlate with popularity);
- how many posts of ownerId the userId has seen (should correlate with the user's interest in the group);
- how many unique userIds viewed posts of ownerId (reflects the size of the group's audience).
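The counters above can be sketched in a few lines. This is only an illustration in plain Python on a made-up impression log, not the Spark code used in the competition; the row layout and the feature names are assumptions.

```python
from collections import Counter, defaultdict

# Toy impression log: one (userId, objectId, ownerId) tuple per shown post.
# All values are invented for illustration.
shows = [
    (1, 10, 100),
    (1, 11, 100),
    (2, 10, 100),
    (2, 12, 200),
    (3, 12, 200),
    (1, 12, 200),
]

# How many times each id occurs in the data (a proxy for popularity).
user_count = Counter(u for u, _, _ in shows)
object_count = Counter(o for _, o, _ in shows)
owner_count = Counter(w for _, _, w in shows)

# How many posts of each ownerId a given userId has seen.
user_owner_count = Counter((u, w) for u, _, w in shows)

# How many distinct userIds viewed posts of each ownerId (audience size).
owner_users = defaultdict(set)
for u, _, w in shows:
    owner_users[w].add(u)
owner_audience = {w: len(users) for w, users in owner_users.items()}

# Attach the counters to every row as new feature columns.
features = [
    {
        "userId": u, "objectId": o, "ownerId": w,
        "user_count": user_count[u],
        "object_count": object_count[o],
        "owner_count": owner_count[w],
        "user_owner_count": user_owner_count[(u, w)],
        "owner_audience": owner_audience[w],
    }
    for u, o, w in shows
]
```

On 20M rows the same aggregations would be expressed as group-by operations in a distributed engine; the logic stays the same.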
From the timestamps it was possible to get the time of day at which the user viewed the feed (morning/afternoon/evening/night). By combining these categories, you can keep generating features:
- how many times userId visited in the evening;
- at what time of day this post (objectId) is most often shown, and so on.
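A minimal sketch of the time-of-day features, assuming timestamps are Unix milliseconds interpreted in UTC. The bucket boundaries (6, 12, 18 hours) and the helper names are assumptions; the article does not state which cut-offs were used.

```python
from collections import Counter
from datetime import datetime, timezone

def time_of_day(ts_ms: int) -> str:
    """Map a Unix timestamp in milliseconds to a coarse day part (UTC)."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def ms(year, month, day, hour):
    """Helper for building toy timestamps in milliseconds."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp() * 1000)

# Toy log of (userId, objectId, show timestamp); values are invented.
shows = [
    (1, 10, ms(2019, 2, 7, 8)),   # morning
    (1, 11, ms(2019, 2, 7, 19)),  # evening
    (1, 10, ms(2019, 2, 8, 20)),  # evening
    (2, 10, ms(2019, 2, 8, 21)),  # evening
    (2, 11, ms(2019, 2, 9, 2)),   # night
]

# How many times each user viewed the feed in the evening.
evening_visits = Counter(u for u, _, ts in shows if time_of_day(ts) == "evening")

# The day part in which each post is shown most often.
parts_by_object = {}
for _, obj, ts in shows:
    parts_by_object.setdefault(obj, Counter())[time_of_day(ts)] += 1
usual_part = {obj: c.most_common(1)[0][0] for obj, c in parts_by_object.items()}
```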
All this gradually improved the metric. But the training dataset is about 20M records, so adding features slowed training down a lot.
I rethought my approach to using the data. Although the data is time-dependent, I did not see any obvious leaks of information "from the future"; still, just in case, I split it like this:
The training set provided to us (February and 2 weeks of March) was split into 2 parts.
The model was trained on data from the last N days. The aggregations described above were built on all the data, including the test. At the same time, this freed up data on which various encodings of the target variable could be built. The simplest approach is to reuse the code that already creates new features and simply feed it the data that will not be used for training, filtered by target = 1.
Thus, we got similar features:
- how many times userId has seen a post in group ownerId;
- how many times userId liked a post in group ownerId;
- the percentage of posts from ownerId that userId liked.
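The split can be sketched as follows: the older part of the data is used only to compute like statistics, and the newer part receives them as features. This is an illustrative toy in plain Python; the day boundary `TRAIN_FROM_DAY`, the row layout and the feature names are assumptions.

```python
from collections import Counter

# (userId, ownerId, day, liked) with invented values.
rows = [
    (1, 100, 1, 1), (1, 100, 2, 0), (1, 100, 3, 1), (2, 100, 3, 0),
    (1, 100, 20, 0), (2, 100, 21, 1), (3, 200, 22, 0),
]
TRAIN_FROM_DAY = 15  # hypothetical boundary: only the last N days are trained on

history = [r for r in rows if r[2] < TRAIN_FROM_DAY]   # used for encodings only
train = [r for r in rows if r[2] >= TRAIN_FROM_DAY]    # used for model training

# Statistics are computed on the history part, so the labels of the
# training rows never leak into their own features.
shows = Counter((u, w) for u, w, _, _ in history)
likes = Counter((u, w) for u, w, _, liked in history if liked == 1)

def encode(user_id, owner_id):
    seen = shows[(user_id, owner_id)]
    liked = likes[(user_id, owner_id)]
    rate = liked / seen if seen else None  # no history: leave the rate empty
    return {"uo_shows": seen, "uo_likes": liked, "uo_like_rate": rate}

train_features = [encode(u, w) for u, w, _, _ in train]
```

Pairs with no history get an empty rate here; how to fill such gaps (a prior, a global mean) is a separate design choice that the article does not discuss.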
In other words, this turned out to be mean target encoding on part of the dataset for various combinations of categorical features. In principle, catboost also builds target encoding, so from this point of view there is no benefit, but it became possible, for example, to count the number of unique users who liked posts in a given group. At the same time, the main goal was achieved: my dataset shrank several times, and I could keep generating features.
While catboost can build encodings only on the liked reaction, feedback contains other reactions: reshared, disliked, unliked, clicked, ignored, and encodings for these can be built manually. I recalculated all kinds of aggregates and dropped the features with low importance so as not to inflate the dataset.
By that time I was in first place by a wide margin. The only confusing thing was that the image embeddings showed almost no gain. The idea came to hand everything over to the boosting. We cluster the images with KMeans and get a new categorical feature, imageCat.
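To illustrate how embeddings become a categorical feature, here is a tiny hand-written version of Lloyd's k-means on two-dimensional toy vectors. It is a sketch only: real embeddings have hundreds of dimensions, a library implementation such as scikit-learn's `KMeans` would normally be used, and the naive "first k points" initialisation and the `cluster_N` naming are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; returns (centers, cluster label per point)."""
    centers = [tuple(p) for p in points[:k]]  # naive deterministic initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels

# Toy "image embeddings" keyed by image hash; values are invented.
embeddings = {
    "hash1": (0.0, 0.1),
    "hash2": (5.0, 5.1),
    "hash3": (0.1, 0.0),
    "hash4": (5.1, 5.0),
    "hash5": (0.0, 0.0),
    "hash6": (5.0, 5.0),
}

hashes = list(embeddings)
_, labels = kmeans([embeddings[h] for h in hashes], k=2)

# The cluster id becomes the categorical feature imageCat.
image_cat = {h: f"cluster_{label}" for h, label in zip(hashes, labels)}
```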
Here are some classes after manual filtering and merging of the clusters obtained from KMeans.
Based on imageCat we generate:
- New categorical features:
  - which imageCat the userId viewed most often;
  - which imageCat the ownerId shows most often;
  - which imageCat the userId liked most often;
- Various counters:
  - how many unique imageCat values the userId looked at;
  - about 15 similar features, plus target encoding as described above.
Texts
The results in the image competition suited me, and I decided to try my hand at texts. I had not worked much with texts before and, foolishly, killed a day on tf-idf and svd. Then I saw a baseline with doc2vec, which does exactly what I need. After slightly tweaking the doc2vec parameters, I got text embeddings.
Then I simply reused the code for images, replacing the image embeddings with text embeddings. As a result, I took 2nd place in the text competition.
Collaborative system
There was one competition left that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of this particular competition should have had the greatest impact on the offline stage.
I took all the features that were in the source data, selected the categorical ones and computed the same aggregates as for images, except for the features based on the images themselves. Just putting this into catboost got me to 2nd place.
First steps of catboost optimization
One first and one second place pleased me, but I understood that I had not done anything special, which meant I could expect to lose positions.
The task of the competition is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.
Let me give a simple example:
userId | objectId | prediction | ground truth |
---|---|---|---|
1 | 10 | 0.9 | 1 |
1 | 11 | 0.8 | 1 |
1 | 12 | 0.7 | 1 |
1 | 13 | 0.6 | 1 |
1 | 14 | 0.5 | 0 |
2 | 15 | 0.4 | 0 |
2 | 16 | 0.3 | 1 |
Let's do a small rearrangement:
userId | objectId | prediction | ground truth |
---|---|---|---|
1 | 10 | 0.9 | 1 |
1 | 11 | 0.8 | 1 |
1 | 12 | 0.7 | 1 |
1 | 13 | 0.6 | 0 |
2 | 16 | 0.5 | 1 |
2 | 15 | 0.4 | 0 |
1 | 14 | 0.3 | 1 |
We get the following results:
Model | AUC | User1 AUC | User2 AUC | mean AUC |
---|---|---|---|---|
Option 1 | 0.8 | 1.0 | 0.0 | 0.5 |
Option 2 | 0.7 | 0.75 | 1.0 | 0.875 |
As you can see, improving the overall AUC metric does not mean improving the mean AUC within a user.
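The numbers in the table can be reproduced with a small pairwise AUC implementation. The rows below are copied from the two tables above as they are given; skipping users who have only one class is an assumption about how the per-user mean handles that case.

```python
from collections import defaultdict

def roc_auc(scores, labels):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_user_auc(rows):
    """rows: (userId, prediction, ground truth). Average of per-user AUCs."""
    by_user = defaultdict(list)
    for user, score, label in rows:
        by_user[user].append((score, label))
    aucs = []
    for pairs in by_user.values():
        auc = roc_auc([s for s, _ in pairs], [y for _, y in pairs])
        if auc is not None:  # assumption: single-class users are skipped
            aucs.append(auc)
    return sum(aucs) / len(aucs)

option_1 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 1), (1, 0.5, 0),
            (2, 0.4, 0), (2, 0.3, 1)]
option_2 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 0), (2, 0.5, 1),
            (2, 0.4, 0), (1, 0.3, 1)]

for name, rows in (("Option 1", option_1), ("Option 2", option_2)):
    overall = roc_auc([s for _, s, _ in rows], [y for _, _, y in rows])
    print(name, overall, mean_user_auc(rows))
# Option 1 0.8 0.5
# Option 2 0.7 0.875
```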
Catboost
5 minutes before the online stage of the "Collaborative Systems" competition closed, Sergey Shalnov pushed me down to second place. We walked the rest of the way together.
Preparing for the offline stage
Victory in the online stage guaranteed us an RTX 2080 TI video card, but the main prize of 300,000 rubles and, more than that, the final placing itself made us keep working through these weeks.
As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about
Watching the talk led me to the idea that we needed to return all parameters to their default values, and do the tuning very carefully and only after fixing the feature set. Now one training run took about 15 hours, but a single model managed to get a better score than the one obtained in the ensemble with ranking.
Feature generation
In the Collaborative Systems competition, a large number of features are rated as important for the model. For example, auditweights_spark_svd is the most important feature, but there is no information about what it means. I thought it would be worthwhile to compute various aggregates based on the important features. For example, the average auditweights_spark_svd by user, by group, by object. The same can be computed on the data that is not used for training, with target = 1, that is, the average auditweights_spark_svd by user over the objects they liked. Besides auditweights_spark_svd, there were several other important features. Here are some of them:
- auditweightsCtrGender
- auditweightsCtrHigh
- userOwnerCounterCreateLikes
For example, the average auditweightsCtrGender by userId turned out to be an important feature, as did the average userOwnerCounterCreateLikes by userId+ownerId. This should already make you think that you need to understand the meaning of the fields.
Other important features were auditweightsLikesCount and auditweightsShowsCount. Dividing one by the other gave an even more important feature.
Data leaks
Competitions and production modeling are very different tasks. When preparing data, it is very hard to account for every detail and not pass some non-trivial information about the target variable into the test. If we are building a production solution, we try to avoid using data leaks when training the model. But if we want to win a competition, then data leaks are the best features.
After studying the data, you can see that for a given objectId the values of auditweightsLikesCount and auditweightsShowsCount change over time, which means the ratio of the maximum values of these features reflects the post's conversion much better than the ratio at the moment of the show.
The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
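A minimal sketch of this first leak on toy rows: for every objectId, take the largest counters seen anywhere in the data and compare their ratio with the ratio at the moment of the show. The row layout and the names `ctr_at_show` and `ctr_max` are assumptions for illustration.

```python
from collections import defaultdict

# (objectId, auditweightsShowsCount, auditweightsLikesCount) at show time.
# Values are invented.
rows = [(1, 12, 3), (1, 15, 3), (1, 16, 4), (2, 40, 2), (2, 50, 2)]

# Largest counters observed for each post across the whole dataset.
max_shows = defaultdict(int)
max_likes = defaultdict(int)
for obj, shows, likes in rows:
    max_shows[obj] = max(max_shows[obj], shows)
    max_likes[obj] = max(max_likes[obj], likes)

features = []
for obj, shows, likes in rows:
    features.append({
        "objectId": obj,
        # Conversion known at the moment of the show.
        "ctr_at_show": likes / shows if shows else None,
        # Conversion from the latest counters: it uses information "from the future".
        "ctr_max": max_likes[obj] / max_shows[obj] if max_shows[obj] else None,
    })
```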
But what if we look at the data more closely? Let's sort by show date and get:
objectId | userId | auditweightsShowsCount | auditweightsLikesCount | target (is liked) |
---|---|---|---|---|
1 | 1 | 12 | 3 | probably not |
1 | 2 | 15 | 3 | probably yes |
1 | 3 | 16 | 4 |  |
It was surprising when I found the first such example and it turned out that my prediction did not come true. But since the maximum values of these features within an object gave a gain, we were not lazy and decided to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next moment in time. By adding the feature
(auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp jump right away.
Similar leaks can be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, of auditweightsCtrGender within objectId+userGender. We found 6 similar fields with leaks and extracted as much information from them as possible.
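The "next value" idea can be sketched on the three rows from the table above: sort the shows of a post by time and compare each row's counters with those of the following row. The show-time column, the delta names and the `likely_liked` flag are assumptions for illustration; the deltas are kept separate here rather than combined into the ratio.

```python
from itertools import groupby

# (objectId, show time, userId, auditweightsShowsCount, auditweightsLikesCount)
# The counters match the table above; the show times are invented.
rows = [
    (1, 3, 3, 16, 4),
    (1, 1, 1, 12, 3),
    (1, 2, 2, 15, 3),
]
rows.sort(key=lambda r: (r[0], r[1]))  # by object, then by show time

features = []
for _, group in groupby(rows, key=lambda r: r[0]):
    shown = list(group)
    # Pair every row with the next show of the same post (None for the last).
    for current, following in zip(shown, shown[1:] + [None]):
        _, _, user, shows, likes = current
        if following is None:
            shows_delta = likes_delta = likely_liked = None
        else:
            shows_delta = following[3] - shows
            likes_delta = following[4] - likes
            # If the like counter grew before the next show, this user
            # probably liked the post (other users may also have contributed).
            likely_liked = likes_delta > 0
        features.append({
            "userId": user,
            "shows_delta": shows_delta,
            "likes_delta": likes_delta,
            "likely_liked": likely_liked,
        })
```

The flag is only a hint, because the show counter jumps by more than one between rows, so not every show of the post is present in the data.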
By that time, we had squeezed as much information as possible out of the collaborative features, but had not returned to the image and text competitions. I had a great idea to check: how much do features based directly on images or texts give in the corresponding competitions?
There were no leaks in the image and text competitions, but by that time I had returned the catboost parameters to their defaults, cleaned up the code and added a few features. The totals were:
Solution | score |
---|---|
Maximum with images | 0.6411 |
Maximum without images | 0.6297 |
Second place result | 0.6295 |
Solution | score |
---|---|
Maximum with texts | 0.666 |
Maximum without texts | 0.660 |
Second place result | 0.656 |
Solution | score |
---|---|
Maximum in collaborative | 0.745 |
Second place result | 0.723 |
It became obvious that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working on them.
Further feature generation in the collaborative systems gave no gain, and we turned to ranking. At the online stage, the ensemble of classification and ranking gave me a small gain, which, as it turned out, was because I had undertrained the classification. None of the loss functions, including YetiRankPairwise, came anywhere close to the result that LogLoss gave (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, which we could not get to run.
Offline stage
At the offline stage, the data structure stayed the same, but there were minor changes:
- the identifiers userId, objectId and ownerId were regenerated;
- several features were removed and several were renamed;
- the data grew about 1.5 times.
Besides the listed difficulties, there was one big plus: the team was allocated a large server with an RTX 2080TI. I enjoyed watching htop for a long time.
There was only one idea: simply reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually began to verify that the results were reproducible. The main problem we faced was the increase in data volume. We decided to reduce the load a bit and set the catboost parameter ctr_complexity=1. This lowers the score slightly, but my model started working, and the result was good: 0.733. Sergey, unlike me, did not split the data into 2 parts and trained on all the data; although this gave the best result at the online stage, at the offline stage it caused many difficulties. If we had taken all the features we generated and tried to shove them into catboost head-on, nothing would have worked even at the online stage. Sergey optimized the types, for example, converting float64 to float32.
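The effect of the type conversion is easy to see even with the standard library: a 32-bit float takes half the memory of a 64-bit one at the cost of some precision. This is only an illustration with the `array` module on made-up values; with a pandas DataFrame the equivalent step would be `astype("float32")` on the float columns.

```python
from array import array

values = [0.1 * i for i in range(1000)]  # toy feature column

as_float64 = array("d", values)  # 8 bytes per value
as_float32 = array("f", values)  # 4 bytes per value

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)

# Precision lost by the conversion on this column.
max_error = max(abs(a - b) for a, b in zip(as_float64, as_float32))

print(bytes64, bytes32)
```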
These results were enough to win, but we were hiding our real score and could not be sure that other teams were not doing the same.
Fight to the end
Catboost tuning
Our solution was fully reproduced and we had added the text and image features, so all that was left was to tune the catboost parameters. Sergey trained on CPU with a small number of iterations, and I trained the model with ctr_complexity=1. There was one day left, and if you simply added iterations or increased ctr_complexity, then by morning you could get an even better score and have the rest of the day free.
At the offline stage, scores could be hidden very easily by simply not selecting the best solution on the site. We expected drastic changes on the leaderboard in the last minutes before submissions closed and decided not to stop.
From Anna's video, I learned that to improve the quality of the model it is best to tune the following parameters:
- learning_rate: the default value is calculated from the dataset size. Decreasing learning_rate requires increasing the number of iterations to keep the quality.
- l2_leaf_reg: the regularization coefficient, default value 3, preferably chosen from 2 to 30. Decreasing the value leads to more overfitting.
- bagging_temperature: adds randomization to the weights of the objects in the sample. The default value is 1, where the weights are drawn from an exponential distribution. Decreasing the value leads to more overfitting.
- random_strength: affects the choice of splits at a particular iteration. The higher random_strength, the higher the chance that a split with low importance is selected. The randomness decreases with each subsequent iteration. Decreasing the value leads to more overfitting.
Other parameters have a much smaller effect on the final result, so I did not try to tune them. One training run on my GPU dataset with ctr_complexity=1 took 20 minutes, and the parameters selected on a reduced dataset differed slightly from the optimal ones on the full dataset. In the end, I did about 30 runs on 10% of the data, and then about 10 more on all the data. The result was something like this:
- learning_rate increased by 40% from the default;
- l2_leaf_reg left unchanged;
- bagging_temperature and random_strength reduced to 0.8.
We can conclude that the model was undertrained with the default parameters.
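Written down as a parameter dictionary, the result of the tuning might look as follows. This is a sketch, not the actual configuration: the automatically chosen learning rate is not given in the article, so `AUTO_LEARNING_RATE` is a placeholder, and the key names assume the CatBoost Python package.

```python
# Placeholder for the learning rate CatBoost picked automatically;
# the article does not state the actual number.
AUTO_LEARNING_RATE = 0.1

default_params = {
    "learning_rate": AUTO_LEARNING_RATE,
    "l2_leaf_reg": 3.0,            # default regularization coefficient
    "bagging_temperature": 1.0,    # default
    "random_strength": 1.0,        # default
}

tuned_params = dict(default_params)
tuned_params["learning_rate"] = round(AUTO_LEARNING_RATE * 1.4, 6)  # +40%
tuned_params["bagging_temperature"] = 0.8
tuned_params["random_strength"] = 0.8
# l2_leaf_reg stays at its default value.

# Which parameters changed, as (default, tuned) pairs.
changed = {
    name: (default_params[name], value)
    for name, value in tuned_params.items()
    if value != default_params[name]
}
```

Such a dictionary could then be unpacked into the model constructor, for example `CatBoostClassifier(**tuned_params)`.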
I was very surprised when I saw the result on the leaderboard:
Model | model 1 | model 2 | model 3 | ensemble |
---|---|---|---|---|
Without tuning | 0.7403 | 0.7404 | 0.7404 | 0.7407 |
With tuning | 0.7406 | 0.7405 | 0.7406 | 0.7408 |
I concluded for myself that if the model does not need to be applied quickly, it is better to replace parameter tuning with an ensemble of several models with non-optimized parameters.
Sergey was optimizing the size of the dataset so that it could run on the GPU. The simplest option is to cut off part of the data, but this can be done in several ways:
- gradually remove the oldest data (early February) until the dataset starts to fit into memory;
- remove the features with the lowest importance;
- remove the userIds that have only one record;
- keep only the userIds that are in the test.
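Three of these reductions are simple row filters and can be sketched like this (dropping low-importance features needs a trained model and is left out). The row layout, the function names and the toy values are assumptions for illustration.

```python
from collections import Counter

# (userId, day) of each training row; values are invented.
train = [(1, 1), (1, 20), (2, 3), (3, 25), (3, 26), (4, 27)]
test_users = {1, 4}

def drop_oldest(rows, first_day):
    """Keep only the rows from first_day onwards."""
    return [r for r in rows if r[1] >= first_day]

def drop_single_record_users(rows):
    """Remove users that appear in the data only once."""
    counts = Counter(user for user, _ in rows)
    return [r for r in rows if counts[r[0]] > 1]

def keep_test_users(rows, users):
    """Keep only the rows of users that also occur in the test set."""
    return [r for r in rows if r[0] in users]

variants = {
    "recent": drop_oldest(train, first_day=15),
    "multi_record": drop_single_record_users(train),
    "test_users": keep_test_users(train, test_users),
}
```

Each variant yields a differently reduced training set, and a model can be trained on every one of them.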
And in the end, make an ensemble of all the options.
The final ensemble
By late evening of the last day, we had put together an ensemble of our models that gave 0.742. For the night I launched my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. It finished only at 4 am, and I made the final ensemble, which gave 0.7433 on the public leaderboard.
Because we approached the problem differently, our predictions were not strongly correlated, which gave a good gain in the ensemble. To get a good ensemble, it is better to use the raw model predictions (prediction_type='RawFormulaVal') and set scale_pos_weight=neg_count/pos_count.
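A minimal sketch of such an ensemble: average the raw scores of two models row by row, then sort each user's posts by the blended score to get the submission format. The scores are invented, the equal-weight mean is an assumption (the article does not say how the models were weighted), and no catboost call is made here.

```python
from collections import defaultdict

# Raw (pre-sigmoid) scores of two models for the same (userId, objectId) rows.
# Values are invented.
raw_a = {(123, 78): 1.2, (123, 13): 0.4, (123, 54): -0.3,
         (128, 35): 0.1, (128, 61): -0.5}
raw_b = {(123, 78): 0.8, (123, 13): 0.9, (123, 54): -0.1,
         (128, 35): -0.2, (128, 61): 0.6}

# Equal-weight average of the raw scores.
blended = {key: (raw_a[key] + raw_b[key]) / 2 for key in raw_a}

# Sort every user's posts by descending blended score.
by_user = defaultdict(list)
for (user, obj), score in blended.items():
    by_user[user].append((score, obj))
submission = {
    user: [obj for _, obj in sorted(items, reverse=True)]
    for user, items in by_user.items()
}

# Class weight for the positive class from toy training labels.
labels = [1, 0, 0, 0, 1, 0, 0, 0]
pos_count = sum(labels)
neg_count = len(labels) - pos_count
scale_pos_weight = neg_count / pos_count
```

Raw scores are on the same unbounded scale for every model, so averaging them avoids the compression that probabilities near 0 or 1 would introduce.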
On the website you can see
Other solutions
Many teams followed the canons of recommender system algorithms. Not being an expert in this field, I cannot evaluate them, but I remember 2 interesting solutions.
- Nikolay Anokhin's solution. Nikolay, being a Mail.ru employee, did not compete for prizes, so his goal was not to achieve the maximum score but to get a simple solution.
- The solution of the team that won the Jury Prize, based on this article from facebook, made it possible to cluster images very well without manual work.
Conclusion
What stuck in my memory most:
- If the data has categorical features and you know how to do target encoding correctly, it is still better to try catboost.
- If you are taking part in a competition, you should not waste time tuning parameters other than learning_rate and iterations. A faster solution is to make an ensemble of several models.
- Boostings can learn on the GPU. Catboost can learn very quickly on the GPU, but it eats a lot of memory.
- While developing and testing ideas, it is better to set a small rsm~=0.2 (CPU only) and ctr_complexity=1.
- Unlike other teams, the ensemble of our models gave a large gain. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each of us had our own bugs.
- It is not clear why ranking optimization performed worse than classification optimization.
- I gained some experience in working with texts and an understanding of how recommender systems are built.
Thanks to the organizers for the emotions, knowledge and prizes received.
Source: www.habr.com