SNA Hackathon 2019

In February and March 2019, SNA Hackathon 2019, a competition on ranking a social network feed, took place, and our team took first place. In this article I will talk about how the competition was organized, the approaches we tried, and the catboost settings for training on big data.

SNA Hackathon

This is the third time a hackathon has been held under this name. It is organized by the social network ok.ru, so the task and the data are directly related to that social network.
SNA (social network analysis) is better understood here not as the analysis of a social graph, but as the analysis of a social network as a whole.

  • In 2014, the task was to predict the number of likes a post would get.
  • In 2016, the task was VVZ (the Russian abbreviation for "you may know them", i.e. friend suggestions), which is closer to social graph analysis.
  • In 2019, the task was to rank a user's feed by the probability that the user will like a post.

I can't speak for 2014, but in 2016 and 2019 you needed big data skills in addition to data analysis skills. I think it was the combination of machine learning and large-scale data processing that drew me to these competitions, and my experience in both areas helped me win.

mlbootcamp

In 2019, the competition was hosted on the platform https://mlbootcamp.ru.

The competition started online on February 7 and consisted of 3 tasks. Anyone could register on the site, download the baseline and keep their machine busy for a few hours. At the end of the online stage on March 15, the top 15 participants of each contest were invited to the Mail.ru office for the offline stage, which took place from March 30 to April 1.

The task

The source data contains user IDs (userId) and post IDs (objectId). If a user was shown a post, the data contains a row with the userId, the objectId, the user's reactions to that post (feedback), and a set of various features or links to images and texts.

userId   objectId   ownerId   feedback              images
3555     22         5677      [liked, clicked]      [hash1]
12842    55         32144     [disliked]            [hash2,hash3]
13145    35         5677      [clicked, reshared]   [hash2]

The test dataset has the same structure, but the feedback field is missing. The task is to predict whether the 'liked' reaction is present in the feedback field.
The submission file has the following structure:

userId   SortedList[objectId]
123      78,13,54,22
128      35,61,55
131      35,68,129,11
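A submission of this shape can be assembled from per-row model scores roughly as follows. This is only a sketch: the function names are made up for illustration, and the sort direction is an assumption (the text does not say whether the most likely likes go first), so it is left as a parameter.

```python
from collections import defaultdict


def build_submission(scored_rows, descending=True):
    """Group (userId, objectId, score) rows by user and sort each user's posts by score.

    `descending=True` (best score first) is an assumption, not something
    stated in the competition description.
    """
    per_user = defaultdict(list)
    for user_id, object_id, score in scored_rows:
        per_user[user_id].append((score, object_id))

    submission = {}
    for user_id, scored in per_user.items():
        scored.sort(key=lambda pair: pair[0], reverse=descending)
        submission[user_id] = [object_id for _, object_id in scored]
    return submission


def format_line(user_id, object_ids):
    """Render one submission line: userId, then comma-separated objectIds."""
    return f"{user_id} {','.join(str(o) for o in object_ids)}"
```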

The metric is ROC AUC averaged over users.

A more detailed description of the data can be found on the competition website. You can also download the data there, including the test sets and images.

Online stage

At the online stage, the task was split into 3 parts:

  • Collaborative system: includes all features except images and texts;
  • Images: includes only information about images;
  • Texts: includes only information about texts.

Offline stage

At the offline stage, the data included all features, while texts and images were sparse. The dataset, which was already large, had 1.5 times more rows.

Solving the problem

Since I do CV at work, I started my journey in this competition with the "Images" task. The data provided was userId, objectId, ownerId (the group in which the post was published), the timestamps of post creation and display, and, of course, the image for the post.
After generating a few features from the timestamps, the next idea was to take the penultimate layer of a neural network pretrained on imagenet and feed these embeddings into boosting.

The results were not impressive. The embeddings from the imagenet network are irrelevant, I thought, so I need to build my own autoencoder.

It took a lot of time and the result did not improve.

Feature generation

Working with images takes a lot of time, so I decided to do something simpler.
As you can see right away, the dataset contains several categorical features, and to avoid overthinking it I simply took catboost. The solution worked very well: without any tuning I immediately reached the first line of the leaderboard.

There is a lot of data and it is stored in parquet format, so without thinking twice I took scala and started writing everything in spark.

The simplest features, which gave more gain than the image embeddings:

  • how many times the objectId, userId and ownerId occur in the data (should correlate with popularity);
  • how many posts the userId has seen from the ownerId (should correlate with the user's interest in the group);
  • how many unique userIds have viewed posts from the ownerId (reflects the size of the group's audience).
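The author computed these counters in Spark with Scala; the sketch below only illustrates the same three ideas in plain Python on a small in-memory list. The output feature names are invented for the example, and counting distinct posts per (userId, ownerId) pair is one possible reading of "how many posts".

```python
from collections import Counter, defaultdict


def count_features(views):
    """views: list of dicts with userId, objectId and ownerId, one per impression."""
    user_cnt = Counter(v["userId"] for v in views)
    object_cnt = Counter(v["objectId"] for v in views)
    owner_cnt = Counter(v["ownerId"] for v in views)

    user_owner_posts = defaultdict(set)  # distinct posts a user saw from a group
    owner_users = defaultdict(set)       # distinct users who saw a group's posts
    for v in views:
        user_owner_posts[(v["userId"], v["ownerId"])].add(v["objectId"])
        owner_users[v["ownerId"]].add(v["userId"])

    features = []
    for v in views:
        features.append({
            "user_count": user_cnt[v["userId"]],
            "object_count": object_cnt[v["objectId"]],
            "owner_count": owner_cnt[v["ownerId"]],
            "user_owner_posts": len(user_owner_posts[(v["userId"], v["ownerId"])]),
            "owner_unique_users": len(owner_users[v["ownerId"]]),
        })
    return features
```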

From the timestamps it was possible to get the time of day at which the user viewed the feed (morning/afternoon/evening/night). By combining these categories, you can keep generating features:

  • how many times the userId logged in in the evening;
  • at what time of day this post (objectId) is shown most often, and so on.
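As a rough illustration of the time-of-day features, the sketch below buckets a timestamp and counts views per (userId, part of day). Several things here are assumptions rather than facts from the dataset: timestamps are taken to be milliseconds since the epoch, the time zone is UTC, and the bucket boundaries are arbitrary.

```python
from collections import Counter
from datetime import datetime, timezone


def part_of_day(ts_ms):
    """Map a millisecond timestamp (assumed unit) to a coarse part of the day in UTC."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def user_part_of_day_counts(views):
    """views: iterable of (userId, show timestamp in ms). Returns a Counter keyed by (userId, part)."""
    counts = Counter()
    for user_id, ts_ms in views:
        counts[(user_id, part_of_day(ts_ms))] += 1
    return counts
```

The same counter keyed by objectId instead of userId gives the "when is this post usually shown" feature.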

All of this gradually improved the metric. But the training dataset has about 20M records, so adding features slowed training down a lot.

I rethought my approach to using the data. Although the data is time-dependent, I did not see any obvious information leaks "into the future"; still, just in case, I split it like this:

[Figure: scheme of the training set split]

The training set we were given (February and 2 weeks of March) was split into 2 parts.
The model was trained on the data from the last N days. The aggregations described above were built on all the data, including the test set. At the same time, this freed up data on which various encodings of the target variable could be built. The simplest approach is to reuse the code that already creates new features and just feed it the data that will not be trained on and where target = 1.

This gave us similar features:

  • How many times the userId saw a post in the ownerId group;
  • How many times the userId liked a post in the ownerId group;
  • The percentage of posts from the ownerId that the userId liked.
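The split and the three features above might be sketched like this. It is a simplified illustration, not the author's Spark code: the field names `ts` and `feedback`, the cut-off handling and the default values for unseen pairs are all assumptions.

```python
from collections import defaultdict


def split_by_time(rows, train_start_ts):
    """Older rows become 'history' (used only for statistics), newer rows are trained on."""
    history = [r for r in rows if r["ts"] < train_start_ts]
    train = [r for r in rows if r["ts"] >= train_start_ts]
    return history, train


def user_owner_like_stats(history):
    """Shows, likes and like ratio per (userId, ownerId), computed on history only."""
    shows = defaultdict(int)
    likes = defaultdict(int)
    for r in history:
        key = (r["userId"], r["ownerId"])
        shows[key] += 1
        if "liked" in r["feedback"]:
            likes[key] += 1
    return {
        key: {"shows": n, "likes": likes[key], "like_ratio": likes[key] / n}
        for key, n in shows.items()
    }


def add_stats(train, stats):
    """Attach the history statistics to each training row; unseen pairs get neutral defaults."""
    default = {"shows": 0, "likes": 0, "like_ratio": None}
    return [{**r, **stats.get((r["userId"], r["ownerId"]), default)} for r in train]
```

Because the statistics come only from rows that are never trained on, the target of a training row does not leak into its own features.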

In other words, this turned out to be mean target encoding, computed on part of the dataset for various combinations of categorical features. Catboost also builds target encodings, so from that point of view there is no benefit, but it became possible, for example, to count the number of unique users who liked posts in a given group. At the same time the main goal was achieved: my dataset shrank several times, and I could keep generating features.

While catboost can build encodings only from the liked reaction, feedback contains other reactions: reshared, disliked, unliked, clicked, ignored, and encodings for these can be computed by hand. I recalculated all kinds of aggregates and removed the features with low importance so as not to inflate the dataset.

By then I was in first place by a wide margin. The only confusing thing was that the image embeddings gave almost no gain. The idea came to leave everything to boosting: we cluster the images with KMeans and get a new categorical feature, imageCat.

Here are some of the classes after manual filtering and merging of the clusters obtained from KMeans.

[Figure: examples of imageCat classes]

Based on imageCat we generate:

  • New categorical features:
    • Which imageCat the userId viewed most often;
    • Which imageCat the ownerId shows most often;
    • Which imageCat the userId liked most often;
  • Various counters:
    • How many unique imageCat values the userId looked at;
    • About 15 similar features plus target encoding as described above.
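A minimal sketch of two of these imageCat features, assuming each row already carries an `imageCat` value produced by the clustering step. The helper names and the example category labels are hypothetical.

```python
from collections import Counter, defaultdict


def most_common_category(rows, key_field, cat_field="imageCat"):
    """For each key (e.g. userId or ownerId), return the category seen most often."""
    counters = defaultdict(Counter)
    for r in rows:
        counters[r[key_field]][r[cat_field]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counters.items()}


def unique_categories(rows, key_field, cat_field="imageCat"):
    """For each key, count how many distinct categories it has been shown."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[key_field]].add(r[cat_field])
    return {key: len(cats) for key, cats in seen.items()}
```

Passing only the rows where the post was liked to `most_common_category` gives the "most liked imageCat" variant.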

Texts

The results in the image competition suited me, and I decided to try my hand at texts. I had not worked much with texts before and, foolishly, killed a day on tf-idf and svd. Then I saw a baseline with doc2vec, which does exactly what I need. After slightly adjusting the doc2vec parameters, I got text embeddings.

Then I simply reused the code for images, replacing the image embeddings with text embeddings. As a result, I took 2nd place in the text competition.

Collaborative system

One competition remained that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of this competition should have had the greatest impact on the offline stage.
I took all the features from the source data, selected the categorical ones and computed the same aggregates as for images, except for the features based on the images themselves. Simply putting this into catboost got me to 2nd place.

First steps in catboost optimization

One first place and two second places pleased me, but I understood that I had not done anything special, which meant I could expect to lose positions.

The task of the competition is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.

Let me give a simple example:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          1
1        14         0.5          0
2        15         0.4          0
2        16         0.3          1

Let's make a small rearrangement:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          0
2        16         0.5          1
2        15         0.4          0
1        14         0.3          1

We get the following results:

Model      AUC   User1 AUC   User2 AUC   mean AUC
Option 1   0.8   1.0         0.0         0.5
Option 2   0.7   0.75        1.0         0.875

As you can see, improving the overall AUC does not mean improving the mean AUC within a user.
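The numbers in the table can be reproduced with a few lines of Python. This is a sketch of the metric as I understand it from the description: plain pairwise ROC AUC per user, then an unweighted mean. Skipping users whose labels are all the same class is my assumption; the article does not say how the organizers treated them.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_user_auc(rows):
    """rows: iterable of (userId, prediction, label). Averages AUC over users where it is defined."""
    by_user = defaultdict(lambda: ([], []))
    for user_id, score, label in rows:
        by_user[user_id][0].append(label)
        by_user[user_id][1].append(score)
    aucs = [roc_auc(labels, scores) for labels, scores in by_user.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


# The two prediction variants from the tables above: (userId, prediction, ground truth).
option_1 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 1), (1, 0.5, 0), (2, 0.4, 0), (2, 0.3, 1)]
option_2 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 0), (2, 0.5, 1), (2, 0.4, 0), (1, 0.3, 1)]
```

Option 1 wins on the global AUC (0.8 vs 0.7), while option 2 wins on the per-user mean (0.875 vs 0.5), which is the metric that is actually scored.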

Catboost can optimize ranking metrics out of the box. I read about ranking metrics and success stories of using catboost, and set YetiRankPairwise to train overnight. The result was not impressive. Deciding that the model was undertrained, I changed the loss function to QueryRMSE, which, judging by the catboost documentation, converges faster. In the end I got the same results as when training for classification, but ensembles of these two models gave a good gain, which brought me to first place in all three competitions.

5 minutes before the online stage of the "Collaborative Systems" competition closed, Sergey Shalnov pushed me down to second place. We walked the rest of the way together.

Preparing for the offline stage

Winning the online stage guaranteed us an RTX 2080 TI video card, but the main prize of 300,000 rubles and, perhaps even more, the final first place kept us working during those two weeks.

As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about Anna Veronika Dorogush's talk, which answered many of my questions, including some I had not even had yet.

Watching the talk led me to the idea that we should return all parameters to their default values, and tune them very carefully and only after fixing the feature set. One training run now took about 15 hours, but a single model managed to get a better score than the one obtained in the ensemble with ranking.

Feature generation

In the Collaborative Systems competition, a large number of features are rated as important by the model. For example, auditweights_spark_svd is the most important feature, but there is no information about what it means. I thought it would be worth computing various aggregates based on the important features. For example, the mean auditweights_spark_svd by user, by group, by object. The same can be computed on the data that is not trained on and where target = 1, that is, the mean auditweights_spark_svd by user over the objects they liked. Besides auditweights_spark_svd, there were several other important features. Here are some of them:

  • auditweightsCtrGender
  • auditweightsCtrHigh
  • userOwnerCounterCreateLikes

For example, the mean auditweightsCtrGender by userId turned out to be an important feature, as did the mean userOwnerCounterCreateLikes by userId+ownerId. This alone should make you think that you need to understand the meaning of the fields.

Other important features were auditweightsLikesCount and auditweightsShowsCount. Dividing one by the other produced an even more important feature.
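Both kinds of derived features, a group mean of an important column and the likes-to-shows ratio, can be sketched as below. The helper names are mine, and returning `None` for zero shows is just one way to handle the division.

```python
from collections import defaultdict


def group_mean(rows, key_fields, value_field):
    """Mean of `value_field` per combination of `key_fields`, e.g. ("userId",) or ("userId", "ownerId")."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        sums[key] += r[value_field]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def likes_per_show(row):
    """auditweightsLikesCount divided by auditweightsShowsCount; None when there were no shows."""
    shows = row["auditweightsShowsCount"]
    return row["auditweightsLikesCount"] / shows if shows else None
```

Feeding `group_mean` only the liked rows from the part of the data that is not trained on gives the "mean over liked objects" variant described above.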

Data leaks

Competitions and production modeling are very different tasks. When preparing data, it is very hard to account for every detail and not pass some non-trivial information about the target variable into the test set. If we are building a production solution, we try to avoid using data leaks when training the model. But if we want to win a competition, data leaks are the best features.

After studying the data, you can see that the values of auditweightsLikesCount and auditweightsShowsCount change within an objectId, which means that the ratio of the maximum values of these features reflects the conversion of a post much better than the ratio at the moment of display.

The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
But what if we look at the data more closely? Let's sort by show date and get:

objectId   userId   auditweightsShowsCount   auditweightsLikesCount   target (is liked)
1          1        12                       3                        probably not
1          2        15                       3                        probably yes
1          3        16                       4

I was surprised when I found the first such example and my prediction turned out to be wrong. But since the maximum values of these features within an object had given a gain, we were not lazy and decided to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next moment in time. By adding the feature
(auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp jump in score.
Similar leaks can be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, auditweightsCtrGender within objectId+userGender. We found 6 such fields with leaks and extracted as much information from them as we could.
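The "next value" trick can be sketched as follows: within each objectId, sort the impressions by show time and attach the counters of the following impression. The field name `showTs` and the output names are hypothetical, and the ratio is written exactly as in the text, with `None` when the denominator is zero or there is no next row.

```python
from collections import defaultdict


def leak_ratio(row):
    """(ShowsCountNext - ShowsCount) / (LikesCount - LikesCountNext), or None if undefined."""
    if row["showsNext"] is None:
        return None
    numerator = row["showsNext"] - row["auditweightsShowsCount"]
    denominator = row["auditweightsLikesCount"] - row["likesNext"]
    return numerator / denominator if denominator else None


def add_next_counters(rows):
    """Attach the counters of the next impression of the same objectId to every row."""
    by_object = defaultdict(list)
    for r in rows:
        by_object[r["objectId"]].append(r)

    out = []
    for shows in by_object.values():
        shows.sort(key=lambda r: r["showTs"])  # `showTs` is an assumed field name
        for cur, nxt in zip(shows, shows[1:] + [None]):
            row = dict(cur)
            row["showsNext"] = nxt["auditweightsShowsCount"] if nxt else None
            row["likesNext"] = nxt["auditweightsLikesCount"] if nxt else None
            row["leak"] = leak_ratio(row)
            out.append(row)
    return out
```

On the three rows from the table, the first user gets an undefined ratio (the like counter did not move before the next show), while the second user gets a finite value because the counter grew from 3 to 4 right after their impression.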

By that time we had squeezed as much information as we could out of the collaborative features, but had not returned to the image and text competitions. I had a good idea to check: how much do features based directly on images or texts give in the corresponding competitions?

There were no leaks in the image and text competitions, but by then I had returned the catboost parameters to their defaults, cleaned up the code and added a few features. The totals were:

Solution                     score
Maximum with images          0.6411
Maximum without images       0.6297
Second place result          0.6295

Solution                     score
Maximum with texts           0.666
Maximum without texts        0.660
Second place result          0.656

Solution                     score
Maximum in collaborative     0.745
Second place result          0.723

It became clear that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working with them.

Further feature generation in the collaborative systems gave no gain, so we turned to ranking. At the online stage, the ensemble of classification and ranking had given me a small gain, which, as it turned out, was because I had undertrained the classification model. None of the loss functions, including YetiRankPairwise, produced anything close to the result of LogLoss (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, which we could not get to run.

Offline stage

At the offline stage, the data structure stayed the same, but there were small changes:

  • the identifiers userId, objectId and ownerId were re-randomized;
  • several features were removed and several were renamed;
  • the amount of data grew about 1.5 times.

Besides the listed difficulties, there was one big plus: the team was given a large server with an RTX 2080TI. I enjoyed watching htop for a long time.
[Image: htop on the server]

There was only one idea: simply reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually began to verify that the results were reproducible. The main problem we faced was the increase in data volume. We decided to reduce the load a little and set the catboost parameter ctr_complexity=1. This lowers the score a little, but my model started working, and the result was good: 0.733. Sergey, unlike me, did not split the data into 2 parts and trained on all of it; although this gave the best result at the online stage, at the offline stage it caused many difficulties. If we had taken all the features we generated and tried to push them into catboost head-on, nothing would have worked even at the online stage. Sergey optimized types, for example by converting float64 columns to float32. In this article you can find information on memory optimization in pandas. As a result, Sergey trained on CPU using all the data and got about 0.735.
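The effect of the float64 to float32 conversion is easy to see even without pandas. The sketch below uses the standard-library `array` module purely to illustrate the principle; in pandas the analogous step would be something like `df[col].astype("float32")`, and which columns Sergey actually converted is not stated in the article.

```python
from array import array


def downcast_to_float32(values):
    """Store the values as 4-byte floats instead of 8-byte doubles (halves memory, loses precision)."""
    return array("f", values)


# A toy column: 8 bytes per value as float64, 4 bytes per value after the downcast.
column64 = array("d", [0.5, 1.25, 3.0])
column32 = downcast_to_float32(column64)
bytes_before = column64.itemsize * len(column64)
bytes_after = column32.itemsize * len(column32)
```

The price is precision: float32 keeps about 7 significant decimal digits, which is usually enough for model features but is worth checking for large counters and timestamps.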

These results were enough to win, but we were hiding our real score and could not be sure that other teams were not doing the same.

Fight to the end

Catboost tuning

Our solution was fully reproduced, we had added the features from text data and images, so all that remained was to tune the catboost parameters. Sergey trained on CPU with a small number of iterations, and I trained the model with ctr_complexity=1. There was one day left, and if we simply added iterations or increased ctr_complexity, we could get an even better score by morning and have the whole day free.

At the offline stage, a score could be hidden very easily by simply selecting something other than the best solution on the site. We expected drastic changes on the leaderboard in the last minutes before submissions closed and decided not to stop.

From Anna's video I learned that to improve model quality it is best to tune the following parameters:

  • learning_rate: the default value is computed from the size of the dataset. Decreasing learning_rate requires increasing the number of iterations.
  • l2_leaf_reg: the regularization coefficient, default value 3, preferably chosen from 2 to 30. Decreasing the value increases overfitting.
  • bagging_temperature: adds randomization to the weights of the objects in the sample. The default value is 1, at which the weights are drawn from an exponential distribution. Decreasing the value increases overfitting.
  • random_strength: affects the choice of splits at a particular iteration. The higher random_strength, the higher the chance that a split of low importance is selected. With each subsequent iteration the randomness decreases. Decreasing the value increases overfitting.

The other parameters affect the final result much less, so I did not try to tune them. One training iteration on my GPU dataset with ctr_complexity=1 took 20 minutes, and the parameters selected on a reduced dataset differed slightly from the optimal ones on the full dataset. In the end I did about 30 iterations on 10% of the data, and then about 10 more iterations on all the data. The result was roughly this:

  • learning_rate: I increased it by 40% from the default;
  • l2_leaf_reg: left unchanged;
  • bagging_temperature and random_strength: reduced to 0.8.

We can conclude that the model was undertrained with the default parameters.
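Written out as a parameter dictionary, the adjustments above look roughly like this. It is a sketch, not the author's configuration: `tuned_params` is a made-up helper, and because CatBoost picks its default learning rate automatically, the starting value has to be supplied by the caller (for example, read from a model trained with defaults).

```python
def tuned_params(base_learning_rate):
    """Apply the adjustments described in the text to CatBoost's documented defaults."""
    return {
        "learning_rate": base_learning_rate * 1.4,  # raised by 40% from the default
        "l2_leaf_reg": 3.0,                         # left at the default
        "bagging_temperature": 0.8,                 # default is 1
        "random_strength": 0.8,                     # default is 1
    }


# Hypothetical usage, assuming catboost is installed and 0.05 was the automatic default:
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(**tuned_params(0.05))
params = tuned_params(0.05)
```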

I was very surprised when I saw the result on the leaderboard:

Model            model 1   model 2   model 3   ensemble
Without tuning   0.7403    0.7404    0.7404    0.7407
With tuning      0.7406    0.7405    0.7406    0.7408

My conclusion was that if fast inference is not required, it is better to replace parameter tuning with an ensemble of several models trained with non-optimized parameters.

Sergey was optimizing the size of the dataset so that it could run on the GPU. The simplest option is to cut off part of the data, but this can be done in several ways:

  • gradually remove the oldest data (the beginning of February) until the dataset starts to fit in memory;
  • remove the features with the lowest importance;
  • remove the userIds that have only one record;
  • keep only the userIds that are present in the test set.

And finally, make an ensemble of all the options.

Final ensemble

By late evening of the last day, we had put together an ensemble of our models that gave 0.742. Overnight I launched my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. It finished only at 4 in the morning, and I made the final ensemble, which gave 0.7433 on the public leaderboard.

Because we had approached the problem differently, our predictions were not strongly correlated, which gave a good gain in the ensemble. To get a good ensemble, it is better to use the raw model predictions, predict(prediction_type='RawFormulaVal'), and to set scale_pos_weight=neg_count/pos_count.
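A minimal sketch of both pieces of advice: computing the class weight from the labels and blending raw scores from several models. Equal weights are my assumption, since the article does not say how the models were weighted, and the input lists stand in for what `model.predict(..., prediction_type='RawFormulaVal')` would return.

```python
def scale_pos_weight(labels):
    """neg_count / pos_count, the value suggested for CatBoost's scale_pos_weight."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return neg / pos


def blend_raw(raw_predictions, weights=None):
    """Weighted average of raw (pre-sigmoid) scores; one inner list per model, same row order."""
    n_models = len(raw_predictions)
    weights = weights or [1.0 / n_models] * n_models
    return [
        sum(w * p for w, p in zip(weights, column))
        for column in zip(*raw_predictions)
    ]
```

Since only the order of posts within a user matters for the metric, the blended raw scores can be used for sorting directly, without converting them back to probabilities.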

On the website you can see the final results on the private leaderboard.

Other solutions

Many teams followed the canons of recommender system algorithms. Not being an expert in this field, I cannot evaluate them, but I remember 2 interesting solutions.

  • Nikolay Anokhin's solution. Nikolay, being a Mail.ru employee, was not competing for prizes, so his goal was not to achieve the maximum score but to get a simple solution.
  • The solution of the team that won the Jury Prize, based on this article from facebook, made it possible to cluster images very well without manual work.

Conclusion

What stuck in my memory most:

  • If the data has categorical features and you know how to do target encoding correctly, it is still better to try catboost.
  • If you are taking part in a competition, you should not waste time tuning parameters other than learning_rate and iterations. A faster solution is to make an ensemble of several models.
  • Boosting can be trained on a GPU. Catboost trains very quickly on a GPU, but it eats a lot of memory.
  • While developing and testing ideas, it is better to set a small rsm~=0.2 (CPU only) and ctr_complexity=1.
  • Unlike other teams, our ensemble of models gave a large gain. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each of us had our own bugs.
  • It is not clear why optimizing for ranking performed worse than optimizing for classification.
  • I gained some experience working with texts and an understanding of how recommender systems are built.

Thanks to the organizers for the emotions, the knowledge and the prizes.

Source: www.habr.com
