SNA Hackathon 2019

In February and March 2019, SNA Hackathon 2019, a competition on ranking a social network feed, took place, and our team took first place. In this article I will cover how the competition was organized, the approaches we tried, and the catboost settings for training on big data.

SNA Hackathon

This is the third time a hackathon has been held under this name. It is organized by the social network ok.ru, so the task and the data relate directly to this social network.
Here SNA (social network analysis) is best understood not as analysis of the social graph but as analysis of a social network in general.

  • In 2014, the task was to predict the number of likes a post would receive.
  • In 2016, it was the VVZ task (you may be familiar with it), which was closer to social graph analysis.
  • In 2019, the task was to rank a user's feed by the probability that the user will like a post.

I can't speak about 2014, but in 2016 and 2019 the tasks required big-data skills in addition to data analysis skills. I think it was the combination of machine learning and big-data processing problems that drew me to these competitions, and my experience in both areas helped me win.

mlbootcamp

In 2019, the competition was hosted on the platform https://mlbootcamp.ru.

The competition began online on February 7 and consisted of 3 tasks. Anyone could register on the site, download the baseline, and keep their machine busy for a few hours. When the online stage ended on March 15, the top 15 of each competition were invited to the Mail.ru office for the offline stage, which ran from March 30 to April 1.

Task

The source data provides user IDs (userId) and post IDs (objectId). If a user was shown a post, the data contains a row with the userId, the objectId, the user's reactions to the post (feedback), and a set of various features or links to images and texts.

userId   objectId   ownerId   feedback              images
3555     22         5677      [liked, clicked]      [hash1]
12842    55         32144     [disliked]            [hash2, hash3]
13145    35         5677      [clicked, reshared]   [hash2]

The test dataset has the same structure, but the feedback field is missing. The task is to predict whether the 'liked' reaction is present in the feedback field.
The submission file has the following structure:

userId   SortedList[objectId]
123      78,13,54,22
128      35,61,55
131      35,68,129,11

The metric is the mean ROC AUC over users.

A more detailed description of the data can be found on the competition website. The data, including the test sets and images, can also be downloaded there.

Online stage

At the online stage, the task was divided into 3 parts:

  • Collaborative system: includes all features except images and texts;
  • Images: includes only information about images;
  • Texts: includes only information about texts.

Offline stage

At the offline stage, the data included all features, while texts and images were sparse. The dataset had 1.5 times more rows, and there were already a lot of them.

Solving the problem

Since I do CV at work, I started this competition with the "Images" task. The data provided was userId, objectId, ownerId (the group in which the post was published), timestamps for the creation and display of the post, and, of course, the image for the post.
After generating several features from the timestamps, the next idea was to take the penultimate layer of a neural network pre-trained on imagenet and feed these embeddings into boosting.

The results were not impressive. Embeddings from an imagenet network are irrelevant here, I thought, so I need to build my own autoencoder.

It took a lot of time and the results did not improve.

Feature generation

Working with images takes a lot of time, so I decided to do something simpler.
As you can see right away, the dataset contains several categorical features, and to keep things simple I just took catboost. The solution worked very well: without any tuning I went straight to the top line of the leaderboard.

There is a lot of data and it is stored in parquet format, so without thinking twice I took scala and started writing everything in spark.

The simplest features, which gave a larger gain than the image embeddings:

  • how many times objectId, userId and ownerId appeared in the data (should correlate with popularity);
  • how many posts from ownerId the userId has seen (should correlate with the user's interest in the group);
  • how many unique userIds viewed posts from ownerId (reflects the size of the group's audience).
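
The counters above could be computed roughly as follows. This is a minimal Python sketch on made-up rows, not the Scala/Spark code used in the competition; the tuple layout and all variable names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy impressions: (userId, objectId, ownerId). Values are made up.
shows = [
    (1, 10, 100),
    (1, 11, 100),
    (2, 10, 100),
    (2, 12, 200),
    (3, 12, 200),
]

# How often each id occurs in the data (a popularity proxy).
user_count = Counter(u for u, _, _ in shows)
object_count = Counter(o for _, o, _ in shows)
owner_count = Counter(g for _, _, g in shows)

# How many posts of each owner a given user has seen.
user_owner_count = Counter((u, g) for u, _, g in shows)

# How many distinct users saw posts of each owner (audience size).
owner_users = defaultdict(set)
for u, _, g in shows:
    owner_users[g].add(u)
owner_audience = {g: len(users) for g, users in owner_users.items()}

# One feature row per impression, in the same order as `shows`.
features = [
    {
        "user_count": user_count[u],
        "object_count": object_count[o],
        "owner_count": owner_count[g],
        "user_owner_count": user_owner_count[(u, g)],
        "owner_audience": owner_audience[g],
    }
    for u, o, g in shows
]
```

On real data the same group-and-count logic would be expressed as aggregations joined back to the impression table.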

From the timestamps it was possible to derive the time of day when the user viewed the feed (morning / afternoon / evening / night). Combining these categories lets you keep generating features:

  • how many times userId visited in the evening;
  • at what time of day this post (objectId) is shown most often, and so on.
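
A sketch of the time-of-day features, under several assumptions that the text does not state: timestamps are taken to be Unix milliseconds, hours are read in UTC, and the bucket boundaries (6, 12, 18) are arbitrary illustrative choices.

```python
from collections import Counter
from datetime import datetime, timezone


def day_part(ts_ms: int) -> str:
    """Map a millisecond Unix timestamp to a coarse time-of-day bucket (UTC)."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


# Toy impressions: (userId, objectId, show timestamp in ms).
HOUR = 3_600_000
shows = [
    (1, 10, 19 * HOUR),
    (1, 11, 20 * HOUR),
    (1, 12, 7 * HOUR),
    (2, 10, 21 * HOUR),
]

# How many times each user visited in each part of the day.
user_part_count = Counter((u, day_part(ts)) for u, _, ts in shows)
# How many times each post was shown in each part of the day.
object_part_count = Counter((o, day_part(ts)) for _, o, ts in shows)


def most_common_part(object_id: int) -> str:
    """Part of the day in which the given post was shown most often."""
    parts = {p: c for (o, p), c in object_part_count.items() if o == object_id}
    return max(parts, key=parts.get)
```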

All of this gradually improved the metric. But the training dataset is about 20M records, so adding features slowed training down considerably.

I rethought my approach to using the data. Although the data is time-dependent, I did not see any obvious information leaks "from the future"; nevertheless, just in case, I split it like this:

The training set provided to us (February and 2 weeks of March) was divided into 2 parts.
The model was trained on data from the last N days. The aggregations described above were built on all the data, including the test set. At the same time, this freed up data on which various encodings of the target variable could be built. The simplest approach is to reuse the code that already creates new features and simply feed it the data that will not be used for training, filtered to target = 1.

So, we got similar features:

  • how many times userId has seen a post in the group ownerId;
  • how many times userId has liked a post in the group ownerId;
  • the percentage of posts from ownerId that userId liked.
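
The split and the resulting features might look like the sketch below. The row layout, the `N_LAST_DAYS` value and the function name `encode` are hypothetical; the point is only that the aggregates are computed on the older part and attached to the rows of the recent part that the model is trained on.

```python
from collections import defaultdict

# Toy log: (day, userId, ownerId, liked). Values are made up.
log = [
    (1, 1, 100, 1),
    (2, 1, 100, 0),
    (3, 1, 100, 1),
    (4, 2, 100, 0),
    (8, 1, 100, 1),
    (9, 2, 100, 0),
    (9, 3, 200, 1),
]

N_LAST_DAYS = 2  # hypothetical size of the training window
last_day = max(row[0] for row in log)
cutoff = last_day - N_LAST_DAYS

history = [r for r in log if r[0] <= cutoff]  # used only for aggregates
train = [r for r in log if r[0] > cutoff]     # rows the model is fitted on

shows = defaultdict(int)
likes = defaultdict(int)
for _, user, owner, liked in history:
    shows[(user, owner)] += 1
    likes[(user, owner)] += liked


def encode(user, owner):
    """Shows, likes and like rate of `user` in group `owner`, from history only."""
    s = shows.get((user, owner), 0)
    k = likes.get((user, owner), 0)
    return {"shows": s, "likes": k, "like_rate": k / s if s else None}


train_features = [encode(user, owner) for _, user, owner, _ in train]
```

Because the aggregates never see the labels of the training rows, the encoded features do not leak the target of the row they are attached to.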

In other words, this turned out to be mean target encoding on part of the dataset for various combinations of categorical features. In principle, catboost also builds target encoding, so from that point of view there is no benefit, but it became possible, for example, to count the number of unique users who liked posts in a given group. At the same time, the main goal was achieved: my dataset shrank several times over, and I could continue generating features.

While catboost can build encodings only from the liked reaction, feedback contains other reactions too: reshared, disliked, unliked, clicked, ignored. Encodings for these can be built by hand. I recalculated all kinds of aggregates and removed the features with low importance so as not to inflate the dataset.

By that time I was in first place by a wide margin. The only confusing thing was that the image embeddings gave almost no gain. The idea came to hand everything over to catboost: cluster the images with KMeans and get a new categorical feature, imageCat.

Here are some of the classes after manual filtering and merging of the clusters obtained from KMeans.

Based on imageCat we generate:

  • New categorical features:
    • which imageCat userId viewed most often;
    • which imageCat ownerId shows most often;
    • which imageCat userId liked most often;
  • Various counters:
    • how many unique imageCat values userId viewed;
    • about 15 similar features, plus target encoding as described above.
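
A few of the imageCat features from the list can be sketched as follows. The cluster labels ("cats", "food") and the row layout are invented for the example; in the competition imageCat came from KMeans clusters of image embeddings.

```python
from collections import Counter, defaultdict

# Toy rows: (userId, ownerId, imageCat, liked). imageCat is a cluster label.
rows = [
    (1, 100, "cats", 1),
    (1, 100, "cats", 0),
    (1, 100, "cats", 0),
    (1, 200, "food", 1),
    (1, 200, "food", 1),
    (2, 100, "cats", 0),
]

viewed = defaultdict(Counter)     # userId -> imageCat -> number of views
liked_cat = defaultdict(Counter)  # userId -> imageCat -> number of likes
for user, _, cat, liked in rows:
    viewed[user][cat] += 1
    if liked:
        liked_cat[user][cat] += 1


def top(counter):
    """Most frequent key of a Counter, or None if it is empty."""
    return counter.most_common(1)[0][0] if counter else None


user_top_viewed = {u: top(c) for u, c in viewed.items()}
user_top_liked = {u: top(liked_cat.get(u, Counter())) for u in viewed}
user_distinct_cats = {u: len(c) for u, c in viewed.items()}
```

The same pattern, keyed by ownerId instead of userId, gives the "most often shown imageCat" feature for a group.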

Texts

The results in the image competition suited me, and I decided to try my hand at texts. I had not worked much with texts before and, foolishly, killed a day on tf-idf and svd. Then I saw the baseline with doc2vec, which does exactly what I need. After slightly adjusting the doc2vec parameters, I got text embeddings.

Then I simply reused the image code, replacing the image embeddings with text embeddings. As a result, I took 2nd place in the text competition.

Collaborative system

One competition remained that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of that competition should have had the greatest impact on the offline stage.
I took all the features in the source data, selected the categorical ones, and calculated the same aggregates as for images, except for the features based on the images themselves. Simply feeding this to catboost put me in 2nd place.

First steps of catboost optimization

One first place and two second places pleased me, but I understood that I had not done anything special, which meant I could expect to lose positions.

The goal of the competition is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.

Let me give a simple example:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          1
1        14         0.5          0
2        15         0.4          0
2        16         0.3          1
2 16 0.3 1

Let's make a small rearrangement:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          0
2        16         0.5          1
2        15         0.4          0
1        14         0.3          1

We get the following results:

Model      AUC   User1 AUC   User2 AUC   mean AUC
Option 1   0.8   1.0         0.0         0.5
Option 2   0.7   0.75        1.0         0.875

As you can see, improving the overall AUC does not mean improving the mean AUC within a user.
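
The numbers in the table can be checked with a short pure-Python sketch. It assumes ties count as half a win and that users with only one class are skipped when averaging; how the competition metric handled such users is not stated here.

```python
from collections import defaultdict
from itertools import product


def auc(rows):
    """ROC AUC for (score, label) pairs; ties count as half a win."""
    pos = [s for s, y in rows if y == 1]
    neg = [s for s, y in rows if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def mean_user_auc(data):
    """data: (userId, score, label) triples; mean AUC over users where it is defined."""
    by_user = defaultdict(list)
    for user, score, label in data:
        by_user[user].append((score, label))
    aucs = [a for a in map(auc, by_user.values()) if a is not None]
    return sum(aucs) / len(aucs)


# (userId, prediction, ground truth) for the two tables above.
option1 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 1),
           (1, 0.5, 0), (2, 0.4, 0), (2, 0.3, 1)]
option2 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 0),
           (2, 0.5, 1), (2, 0.4, 0), (1, 0.3, 1)]

for name, data in [("option 1", option1), ("option 2", option2)]:
    overall = auc([(s, y) for _, s, y in data])
    print(name, "overall AUC:", overall, "mean user AUC:", mean_user_auc(data))
```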

Catboost can optimize ranking metrics out of the box. I read about ranking metrics and success stories of using catboost, and set YetiRankPairwise to train overnight. The result was not impressive. Deciding that the model was undertrained, I changed the loss function to QueryRMSE, which, judging by the catboost documentation, converges faster. In the end I got the same results as when training for classification, but ensembles of these two models gave a good gain, which brought me to first place in all three competitions.

Five minutes before the close of the online stage of the "Collaborative Systems" competition, Sergey Shalnov pushed me down to second place. We went the rest of the way together.

Preparing for the offline stage

We had secured victory at the online stage and an RTX 2080 TI video card, but the main prize of 300,000 rubles and, most likely, the final first place itself made us keep working through those weeks.

As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about the talk by Anna Veronika Dorogush, which answered many of my questions, including some I had not yet thought to ask.

Watching the talk led me to the idea that we should return all parameters to their default values, and tune them very carefully and only after fixing the feature set. One training run now took about 15 hours, but a single model managed to get a better score than the ensemble with ranking had.

Feature generation

In the Collaborative Systems competition, a large number of features are rated as important for the model. For example, auditweights_spark_svd is the most important feature, yet there is no information about what it means. I thought it would be worthwhile to compute various aggregates from the important features, for example the average auditweights_spark_svd by user, by group, and by object. The same can be computed on the data that is not used for training, filtered to target = 1, that is, the average auditweights_spark_svd by user over the objects the user liked. There were several important features besides auditweights_spark_svd. Here are some of them:

  • auditweightsCTrGender
  • auditweightsCTrHigh
  • userOwnerCounterCreateLikes

For example, the average auditweightsCTrGender by userId turned out to be an important feature, as did the average userOwnerCounterCreateLikes by userId+ownerId. This alone should make you think that you need to understand the meaning of the fields.

auditweightsLikesCount and auditweightsShowsCount were also important features. Dividing one by the other gave an even more important feature.

Data leaks

Competition modeling and production modeling are very different tasks. When preparing data, it is very hard to account for every detail and avoid passing some non-trivial information about the target variable into the test set. When building a production solution, we try to avoid using data leaks when training the model. But if we want to win a competition, data leaks are the best features.

After studying the data, you can see that the values of auditweightsLikesCount and auditweightsShowsCount change for a given objectId, which means that the ratio of the maximum values of these features reflects the post's conversion much better than the ratio at the moment of display.

The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
But what if we look at the data more closely? Let's sort by show date and get:

objectId   userId   auditweightsShowsCount   auditweightsLikesCount   target (is liked)
1          1        12                       3                        probably not
1          2        15                       3                        probably yes
1          3        16                       4

It was surprising when I found the first such example and my prediction turned out not to hold. But since the maximum values of these features within an object had given a gain, we were not lazy and decided to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next moment in time. By adding the feature (auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp jump right away.
A similar leak could be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, of auditweightsCTrGender within objectId+userGender. We found 6 similar fields with leaks and extracted as much information from them as we could.
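
A sketch of how the "next" values could be derived, using the three rows from the example. The `ts` field (show time), the shortened key names and the `likes_delta` helper column are assumptions for illustration; `leak_ratio` follows the formula exactly as written in the text and returns None where it is undefined.

```python
from collections import defaultdict

# Rows from the example above plus a hypothetical show timestamp `ts`.
rows = [
    {"objectId": 1, "userId": 1, "ts": 1, "shows": 12, "likes": 3},
    {"objectId": 1, "userId": 2, "ts": 2, "shows": 15, "likes": 3},
    {"objectId": 1, "userId": 3, "ts": 3, "shows": 16, "likes": 4},
]

by_object = defaultdict(list)
for row in rows:
    by_object[row["objectId"]].append(row)

for group in by_object.values():
    group.sort(key=lambda r: r["ts"])
    for cur, nxt in zip(group, group[1:]):
        cur["shows_next"] = nxt["shows"]
        cur["likes_next"] = nxt["likes"]
        # A like counter that grows right after this show hints at a like.
        cur["likes_delta"] = nxt["likes"] - cur["likes"]
    # The last show of an object has no "next" row.
    group[-1].update(shows_next=None, likes_next=None, likes_delta=None)


def leak_ratio(r):
    """(showsNext - shows) / (likes - likesNext); None when undefined."""
    if r["shows_next"] is None or r["likes"] == r["likes_next"]:
        return None
    return (r["shows_next"] - r["shows"]) / (r["likes"] - r["likes_next"])
```

For the example rows, the like counter does not move after user 1's show and grows by one after user 2's show, which matches the "probably not" / "probably yes" guesses in the table.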

By that time we had squeezed as much information as we could out of the collaborative features, but had not gone back to the image and text competitions. I had a good idea to check: how much do features based directly on images or texts contribute in the corresponding competitions?

There were no leaks in the image and text competitions, but by then I had restored the default catboost parameters, cleaned up the code, and added a few features. In total this gave:

Solution                 score
Maximum with images      0.6411
Maximum without images   0.6297
Second place result      0.6295

Solution                 score
Maximum with texts       0.666
Maximum without texts    0.660
Second place result      0.656

Solution                   score
Maximum in collaborative   0.745
Second place result        0.723

It became clear that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working on them.

Further feature generation for the collaborative system gave no gain, so we turned to ranking. At the online stage, the ensemble of classification and ranking had given me a small gain, but, as it turned out, only because I had undertrained the classification. None of the loss functions, including YetiRankPairwise, came anywhere close to the result of LogLoss (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, but we could not get it to run.

Offline stage

At the offline stage, the data structure stayed the same, but there were small changes:

  • the identifiers userId, objectId, ownerId were reassigned;
  • several features were removed and several were renamed;
  • the data grew by about 1.5 times.

In addition to the listed difficulties, there was one big plus: the team was given a large server with an RTX 2080TI. I enjoyed watching htop for a long time.

There was only one idea: simply reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually began to verify that the results were reproducible. The main problem we faced was the increase in data volume. We decided to reduce the load a little and set the catboost parameter ctr_complexity=1. This lowers the score slightly, but my model started to run, and the result was good: 0.733.

Sergey, unlike me, did not split the data into 2 parts and trained on all of it. Although this gave the best result at the online stage, it caused many difficulties at the offline stage. If we had taken all the features we generated and tried to push them into catboost as is, nothing would have worked even at the online stage. Sergey optimized the types, for example by converting float64 columns to float32. Information on memory optimization in pandas can be found in this article. As a result, Sergey trained on the CPU using all the data and got about 0.735.
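
The effect of narrowing float64 to float32 can be illustrated with the standard-library `array` module. This is only a stand-in: the competition code worked on pandas data frames, where the analogous step is a column `astype("float32")` call.

```python
from array import array

# A toy numeric column.
values = [0.1 * i for i in range(1_000)]

as_float64 = array("d", values)  # 8 bytes per value
as_float32 = array("f", values)  # 4 bytes per value

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)

# Largest rounding error introduced by the narrower type on this column.
max_abs_error = max(abs(a - b) for a, b in zip(as_float64, as_float32))

print(bytes64, bytes32, max_abs_error)
```

The memory footprint halves, at the cost of a small rounding error that tree-based models usually tolerate, since they only compare feature values against split thresholds.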

These results were enough to win, but we were hiding our real score and could not be sure that other teams were not doing the same.

Fight to the end

Catboost tuning

Our solution was fully reproduced and we had added the features from text data and images, so all that remained was to tune the catboost parameters. Sergey trained on the CPU with a small number of iterations, and I trained with ctr_complexity=1. One day remained, and simply adding iterations or increasing ctr_complexity would give an even better score by morning while we took the rest of the day off.

At the offline stage, scores are very easy to hide: you simply select something other than your best solution on the site. We expected drastic changes on the leaderboard in the last minutes before submissions closed and decided not to stop.

From Anna's video, I learned that to improve the quality of the model it is best to tune the following parameters:

  • learning_rate: the default value is calculated from the size of the dataset. Decreasing learning_rate requires increasing the number of iterations.
  • l2_leaf_reg: the regularization coefficient, default value 3, preferably chosen from 2 to 30. Decreasing the value increases overfitting.
  • bagging_temperature: adds randomization to the weights of objects in the sample. The default value is 1, at which the weights are drawn from an exponential distribution. Decreasing the value increases overfitting.
  • random_strength: affects the choice of splits at a given iteration. The higher the random_strength, the higher the chance that a low-importance split is selected. The randomness decreases with each subsequent iteration. Decreasing the value increases overfitting.

Other parameters have a much smaller effect on the final result, so I did not try to tune them. One training run on my GPU dataset with ctr_complexity=1 took 20 minutes, and the parameters selected on the reduced dataset differed slightly from the optimal ones on the full dataset. In the end, I did about 30 runs on 10% of the data, and then about 10 more runs on all the data. The result was something like this:

  • learning_rate: increased by 40% from the default;
  • l2_leaf_reg: left unchanged;
  • bagging_temperature and random_strength: reduced to 0.8.
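
Written out as a parameter dictionary, the adjustments look roughly like this. The default values for l2_leaf_reg and bagging_temperature are the ones quoted above; the random_strength default of 1 and the 0.05 learning rate are assumptions (the real default learning rate depends on the dataset), and the function name is made up.

```python
# Defaults for the three regularization-related parameters.
# random_strength=1.0 is assumed here; the text does not quote its default.
DEFAULTS = {"l2_leaf_reg": 3.0, "bagging_temperature": 1.0, "random_strength": 1.0}


def tuned_params(default_learning_rate: float) -> dict:
    """Apply the adjustments listed above to the default parameters."""
    params = dict(DEFAULTS)
    params["learning_rate"] = default_learning_rate * 1.4  # +40% over the default
    params["bagging_temperature"] = 0.8
    params["random_strength"] = 0.8
    # l2_leaf_reg is deliberately left at its default.
    return params


# 0.05 is a placeholder: the real default is derived from the dataset size.
params = tuned_params(default_learning_rate=0.05)
print(params)
```

A dictionary like this could then be unpacked into the constructor of a catboost model.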

We can conclude that the model was undertrained with the default parameters.

I was very surprised when I saw the result on the leaderboard:

Model            model 1   model 2   model 3   ensemble
Without tuning   0.7403    0.7404    0.7404    0.7407
With tuning      0.7406    0.7405    0.7406    0.7408

My own conclusion was that if fast application of the model is not required, it is better to replace parameter tuning with an ensemble of several models trained with non-optimized parameters.

Sergey was optimizing the size of the dataset so that it could run on the GPU. The simplest option is to cut off part of the data, but this can be done in several ways:

  • gradually remove the oldest data (the beginning of February) until the dataset fits into memory;
  • remove the features with the lowest importance;
  • remove userIds that have only one record;
  • keep only the userIds that appear in the test set.

And in the end, make an ensemble of all the options.
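
Three of the row-level reductions from the list can be sketched as simple filters. The row layout, the `first_day` value and the function names are hypothetical; each variant would then be used to train its own model for the ensemble.

```python
from collections import Counter

# Toy training rows: (day, userId). Toy set of users present in the test set.
train = [(1, 1), (2, 1), (3, 2), (5, 3), (6, 3), (7, 4)]
test_users = {1, 3, 4}


def drop_oldest(rows, first_day):
    """Keep only rows from `first_day` onwards."""
    return [r for r in rows if r[0] >= first_day]


def drop_single_row_users(rows):
    """Remove users that have only one record."""
    counts = Counter(user for _, user in rows)
    return [r for r in rows if counts[r[1]] > 1]


def keep_test_users(rows, users):
    """Keep only rows of users that appear in the test set."""
    return [r for r in rows if r[1] in users]


variants = {
    "recent": drop_oldest(train, first_day=3),
    "multi_row_users": drop_single_row_users(train),
    "test_users": keep_test_users(train, test_users),
}
```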

The last ensemble

By late evening of the last day, we had put together an ensemble of our models that scored 0.742. Overnight I launched my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. It finished only at 4 am, and I made the last ensemble, which scored 0.7433 on the public leaderboard.

Because we had approached the problem differently, our predictions were not strongly correlated, which gave a good gain in the ensemble. To get a good ensemble, it is better to use the raw model predictions, predict(prediction_type='RawFormulaVal'), and to set scale_pos_weight=neg_count/pos_count.
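
The two recommendations can be illustrated without a trained model. The labels and raw scores below are made up, and plain averaging is an assumption: the text does not say how the models were actually weighted in the ensemble.

```python
import math

# Class balance of a toy training set: 2 positives, 6 negatives.
labels = [1, 0, 0, 0, 1, 0, 0, 0]
pos_count = sum(labels)
neg_count = len(labels) - pos_count
scale_pos_weight = neg_count / pos_count


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw (pre-sigmoid) scores of two models for three posts.
raw_a = [2.0, -1.0, 0.5]
raw_b = [1.0, -3.0, 1.5]

# Average in raw space; map to (0, 1) only if probabilities are needed.
raw_blend = [(a + b) / 2 for a, b in zip(raw_a, raw_b)]
prob_blend = [sigmoid(r) for r in raw_blend]

print(scale_pos_weight, raw_blend)
```

Since the metric depends only on the order of posts within a user, the blended raw scores can be used for ranking directly, without converting them to probabilities.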

The final results on the private leaderboard can be seen on the website.

Other solutions

Many teams followed the canons of recommender system algorithms. Not being an expert in this field, I cannot evaluate them, but I remember 2 interesting solutions.

  • Nikolay Anokhin's solution. Nikolay, being an employee of Mail.ru, did not compete for prizes, so his goal was not to achieve the maximum score but to obtain a simple solution.
  • The solution of the team that won the Jury Prize, based on this article from facebook, which made it possible to cluster images very well without manual work.

Conclusion

What stuck in my memory most:

  • If the data contains categorical features and you know how to do target encoding correctly, it is still better to try catboost.
  • If you are taking part in a competition, do not waste time tuning parameters other than learning_rate and iterations. A faster solution is to make an ensemble of several models.
  • Boostings can train on the GPU. Catboost trains very quickly on the GPU, but it eats a lot of memory.
  • While developing and testing ideas, it is better to set a small rsm~=0.2 (CPU only) and ctr_complexity=1.
  • Unlike other teams, the ensemble of our models gave a large gain. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each of us had our own bugs.
  • It is not clear why optimizing for ranking performed worse than optimizing for classification.
  • I gained some experience working with texts and an understanding of how recommender systems are built.

Thanks to the organizers for the emotions, the knowledge, and the prizes.

Source: www.habr.com
