SNA Hackathon 2019

In February-March 2019, the social network feed ranking competition SNA Hackathon 2019 was held, and our team took first place in it. In this article I will talk about how the competition was organized, the approaches we tried, and the catboost settings for training on big data.

SNA Hackathon

This is the third time a hackathon has been held under this name. It is organized by the social network ok.ru, so the task and the data are directly related to this social network.
SNA (social network analysis) in this case is better understood not as analysis of a social graph, but as analysis of a social network.

  • In 2014, the task was to predict the number of likes a post would get.
  • In 2016, it was the VVZ task (you may be familiar with it), which was closer to social graph analysis.
  • In 2019, the task was to rank the user's feed by the probability that the user will like a post.

I can't speak for 2014, but in 2016 and 2019, besides data analysis skills, skills in working with big data were also required. I think it was the combination of machine learning and big data processing problems that drew me to these competitions, and my experience in both areas helped me win.

mlbootcamp

In 2019, the competition was hosted on the platform https://mlbootcamp.ru.

The competition started online on February 7 and consisted of 3 tasks. Anyone could register on the site, download the baseline, and keep their machine busy for a few hours. At the end of the online stage on March 15, the top 15 of each competition were invited to the Mail.ru office for the offline stage, which ran from March 30 to April 1.

Task

The source data provides user IDs (userId) and post IDs (objectId). If a post was shown to a user, the data contains a row with userId, objectId, the user's reactions to this post (feedback), and a set of various features or links to images and texts.

userId | objectId | ownerId | feedback | images
3555 | 22 | 5677 | [liked, clicked] | [hash1]
12842 | 55 | 32144 | [disliked] | [hash2, hash3]
13145 | 35 | 5677 | [clicked, reshared] | [hash2]

The test set has the same structure, but the feedback field is missing. The task is to predict the presence of the liked reaction in the feedback field.
The submission file has the following structure:

userId | SortedList[objectId]
123 | 78,13,54,22
128 | 35,61,55
131 | 35,68,129,11
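
As an illustration of the submission format above, here is a minimal Python sketch that turns per-row model scores into one line per user. The function name `build_submission`, the example scores, and the choice to put the highest-scored post first are assumptions; the article only says that each user gets a sorted list of objectIds.

```python
from collections import defaultdict

def build_submission(rows):
    """rows: iterable of (userId, objectId, score) tuples.

    Returns one text line per user: the userId followed by that user's
    objectIds. Sorting by descending score is an assumption of this sketch.
    """
    by_user = defaultdict(list)
    for user_id, object_id, score in rows:
        by_user[user_id].append((score, object_id))
    lines = []
    for user_id in sorted(by_user):
        ranked = sorted(by_user[user_id], key=lambda pair: -pair[0])
        lines.append(f"{user_id} " + ",".join(str(obj) for _, obj in ranked))
    return lines

# Hypothetical scores chosen so the output matches the first two table rows.
example = [(123, 13, 0.7), (123, 78, 0.9), (123, 22, 0.1), (123, 54, 0.4),
           (128, 61, 0.5), (128, 35, 0.8), (128, 55, 0.2)]
submission = build_submission(example)
```

Only the relative order of the scores within a user matters here, which is consistent with a per-user ranking metric.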

The metric is the mean ROC AUC over users.

A more detailed description of the data can be found on the competition website. You can also download the data there, including the texts and images.

Online stage

At the online stage, the task was split into 3 parts, one per competition: Collaborative System, Images, and Texts.

Offline stage

At the offline stage, the data included all the features, while texts and images were sparse. The dataset had 1.5 times more rows, and there had already been a lot of them.

Solving the problem

Since I do computer vision (CV) at work, I started this competition with the "Images" task. The data provided were userId, objectId, ownerId (the group in which the post was published), timestamps of when the post was created and shown, and, of course, the image for the post.
After generating several timestamp-based features, the next idea was to take the penultimate layer of a neural network pre-trained on imagenet and feed these embeddings into boosting.

The results were not impressive. Embeddings from the imagenet network are not relevant, I thought, so I need to build my own autoencoder.

It took a lot of time and the result did not improve.

Feature generation

Working with images takes a lot of time, so I decided to do something simpler.
As you can see right away, the dataset has several categorical features, and to avoid overthinking it, I simply took catboost. It worked out great: without any tuning I immediately reached the first line of the leaderboard.

There is quite a lot of data and it is stored in parquet format, so without thinking twice I took scala and started writing everything in spark.

The simplest features, which gave a bigger gain than the image embeddings:

  • how many times objectId, userId and ownerId appear in the data (should correlate with popularity);
  • how many posts from ownerId the userId has seen (should correlate with the user's interest in the group);
  • how many unique userIds viewed posts from ownerId (reflects the size of the group's audience).
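
The counters listed above can be sketched in a few lines. The author computed them in Scala on Spark; the version below is a plain-Python stand-in on a tiny in-memory list, and the function name `count_features` is hypothetical. It counts impressions rather than distinct posts, which is an assumption.

```python
from collections import Counter, defaultdict

def count_features(shows):
    """shows: list of (userId, objectId, ownerId) impressions."""
    object_count = Counter(obj for _, obj, _ in shows)        # post popularity
    user_count = Counter(user for user, _, _ in shows)        # user activity
    owner_count = Counter(owner for _, _, owner in shows)     # group popularity
    # How many impressions of posts from a given owner each user has had.
    user_owner_count = Counter((user, owner) for user, _, owner in shows)
    # Audience size: number of distinct users who saw posts of an owner.
    owner_users = defaultdict(set)
    for user, _, owner in shows:
        owner_users[owner].add(user)
    owner_audience = {owner: len(users) for owner, users in owner_users.items()}
    return object_count, user_count, owner_count, user_owner_count, owner_audience

# The three rows from the data example in the task description.
shows = [(3555, 22, 5677), (12842, 55, 32144), (13145, 35, 5677)]
object_count, user_count, owner_count, user_owner_count, owner_audience = count_features(shows)
```

Each dictionary can then be joined back onto the rows by its key to become a numeric feature column.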

From the timestamps it was possible to get the time of day at which the user viewed the feed (morning/afternoon/evening/night). By combining these categories, you can keep generating features:

  • how many times the userId logged in in the evening;
  • at what time of day this post (objectId) is shown most often, and so on.

All of this gradually improved the metric. But the training dataset has about 20M records, so adding features slowed training down a lot.

I reconsidered my approach to using the data. Although the data is time-dependent, I did not see any obvious leaks of information "from the future"; nevertheless, just in case, I split it like this:

[Image: diagram of the data split]

The training set we were given (February and 2 weeks of March) was divided into 2 parts.
The model was trained on data from the last N days. The aggregates described above were built on all the data, including the test set. At the same time, this left data on which various encodings of the target variable could be built. The simplest approach is to reuse the code that already generates new features and simply feed it the data that is not used for training, filtered by target = 1.

This gave us similar features:

  • how many times the userId has seen a post in the group ownerId;
  • how many times the userId has liked a post in the group ownerId;
  • the share of posts from ownerId that the userId liked.

In other words, the result was mean target encoding on part of the dataset for various combinations of categorical features. In principle, catboost also builds target encoding, so from that point of view there is no benefit, but it became possible, for example, to count the number of unique users who liked posts in a group. At the same time, the main goal was achieved: my training dataset shrank several times over, and I could keep generating features.
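
The idea can be illustrated with a small sketch: statistics are computed only on the held-out "history" part and then attached to the rows the model is trained on. This is a plain-Python stand-in for the author's Spark code; the function names, the row layout, and the use of `None` for unseen pairs are assumptions.

```python
from collections import defaultdict

def like_stats(history):
    """history: (userId, ownerId, liked) rows from the part NOT used for training."""
    shows = defaultdict(int)
    likes = defaultdict(int)
    likers = defaultdict(set)
    for user, owner, liked in history:
        shows[(user, owner)] += 1
        if liked:
            likes[(user, owner)] += 1
            likers[owner].add(user)
    return shows, likes, likers

def encode(train_rows, history):
    """Attach history-based statistics to (userId, ownerId) training rows."""
    shows, likes, likers = like_stats(history)
    features = []
    for user, owner in train_rows:
        n = shows.get((user, owner), 0)
        k = likes.get((user, owner), 0)
        features.append({
            "shows": n,                                       # posts seen in the group
            "likes": k,                                       # posts liked in the group
            "like_share": k / n if n else None,               # mean target encoding
            "unique_likers": len(likers.get(owner, ())),      # distinct users who liked
        })
    return features

history = [(1, 10, True), (1, 10, False), (2, 10, True), (1, 20, False)]
train_rows = [(1, 10), (3, 10), (1, 20)]
features = encode(train_rows, history)
```

Because the statistics never see the labels of the training rows themselves, the encoding does not leak the target of the row it is attached to.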

While catboost can build encodings only for the liked reaction, feedback contains other reactions as well: reshared, disliked, unliked, clicked, ignored, and encodings for them can be built by hand. I recomputed all kinds of aggregates and dropped features with low importance so as not to inflate the dataset.

By that time I was in first place by a wide margin. The only confusing thing was that the image embeddings gave almost no gain. Then came the idea to hand everything over to catboost: we cluster the image embeddings with KMeans and get a new categorical feature, imageCat.

Here are some of the classes after manual filtering and merging of the clusters obtained from KMeans.

[Image: examples of the resulting image classes]
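
To make the clustering step concrete, here is a toy sketch of turning embedding vectors into a categorical cluster id. It is a bare-bones Lloyd's k-means written in plain Python, not the library implementation the author used; the 2-dimensional points stand in for real image embeddings, and the function name, seed and iteration count are assumptions.

```python
import random

def kmeans_labels(points, k, iterations=20, seed=0):
    """Assign each point (a tuple of floats) to one of k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers with k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Toy "embeddings": two well-separated groups of points.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
image_cat = kmeans_labels(embeddings, k=2)
```

The resulting `image_cat` values are arbitrary integers, so they should be treated as category labels rather than as ordered numbers.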

Based on imageCat we generate:

  • New categorical features:
    • which imageCat the userId viewed most often;
    • which imageCat the ownerId shows most often;
    • which imageCat the userId liked most often;
  • Various counters:
    • how many unique imageCat values the userId has viewed;
    • about 15 similar features, plus target encoding as described above.

Texts

I was happy with the results in the image competition, so I decided to try my hand at texts. I had not worked much with texts before and, foolishly, wasted a day on tf-idf and svd. Then I saw the baseline with doc2vec, which does exactly what I needed. After slightly adjusting the doc2vec parameters, I got text embeddings.

Then I simply reused the code for images, replacing the image embeddings with text embeddings. As a result, I took 2nd place in the text competition.

Collaborative system

There was one competition left that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of this particular competition should have had the greatest impact on the offline stage.
I took all the features in the source data, selected the categorical ones, and computed the same aggregates as for images, except for the features based on the images themselves. Just putting this into catboost took me to 2nd place.

First steps of catboost optimization

One first place and two second places were pleasing, but I understood that I had not done anything special, which meant I could expect to lose positions.

The task of the competition is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.

Let me give a simple example:

userId | objectId | prediction | ground truth
1 | 10 | 0.9 | 1
1 | 11 | 0.8 | 1
1 | 12 | 0.7 | 1
1 | 13 | 0.6 | 1
1 | 14 | 0.5 | 0
2 | 15 | 0.4 | 0
2 | 16 | 0.3 | 1

Let's make a small rearrangement:

userId | objectId | prediction | ground truth
1 | 10 | 0.9 | 1
1 | 11 | 0.8 | 1
1 | 12 | 0.7 | 1
1 | 13 | 0.6 | 0
2 | 16 | 0.5 | 1
2 | 15 | 0.4 | 0
1 | 14 | 0.3 | 1

We get the following results:

Model | AUC | User 1 AUC | User 2 AUC | mean AUC
Option 1 | 0.8 | 1.0 | 0.0 | 0.5
Option 2 | 0.7 | 0.75 | 1.0 | 0.875

As you can see, improving the overall AUC does not mean improving the mean AUC within users.
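
The numbers in the table can be checked with a short script. This is a minimal sketch: AUC is computed by pair counting, and users who have only one class are skipped when averaging, which is an assumption since the article does not say how the competition handled them. The function names are hypothetical.

```python
from collections import defaultdict

def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def overall_auc(rows):
    """rows: (userId, prediction, ground truth) triples, users ignored."""
    return roc_auc([y for _, _, y in rows], [s for _, s, _ in rows])

def mean_user_auc(rows):
    """Average of the per-user AUC values."""
    by_user = defaultdict(lambda: ([], []))
    for user, score, label in rows:
        by_user[user][0].append(label)
        by_user[user][1].append(score)
    aucs = [roc_auc(labels, scores) for labels, scores in by_user.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)

# The two options from the tables above.
option_1 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 1),
            (1, 0.5, 0), (2, 0.4, 0), (2, 0.3, 1)]
option_2 = [(1, 0.9, 1), (1, 0.8, 1), (1, 0.7, 1), (1, 0.6, 0),
            (2, 0.5, 1), (2, 0.4, 0), (1, 0.3, 1)]

for name, rows in [("Option 1", option_1), ("Option 2", option_2)]:
    print(name, overall_auc(rows), mean_user_auc(rows))
```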

Catboost can optimize ranking metrics out of the box. I read about ranking metrics and success stories of using catboost, and set YetiRankPairwise to train overnight. The result was not impressive. Deciding that the model was undertrained, I changed the loss function to QueryRMSE, which, judging by the catboost documentation, converges faster. In the end I got the same results as with classification training, but an ensemble of these two models gave a good gain, which brought me to first place in all three competitions.
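
A hedged sketch of how a ranking loss might be set up with the catboost Python package, grouping rows by userId. The row layout, the field name `liked`, and the helper names are assumptions; the catboost import is done lazily inside the function so the rest of the snippet runs without the package installed. The article does not describe the author's actual training code.

```python
# Loss functions named in the article.
CLASSIFIER_PARAMS = {"loss_function": "Logloss"}
RANKER_PARAMS = {"loss_function": "YetiRankPairwise"}  # or "QueryRMSE"

def sort_by_group(rows):
    """Rows of one group (here: one userId) are kept adjacent for ranking losses."""
    return sorted(rows, key=lambda r: r["userId"])

def fit_ranker(rows, feature_names, params=RANKER_PARAMS):
    """Train a grouped model; requires the catboost package."""
    from catboost import CatBoost, Pool  # imported lazily on purpose

    rows = sort_by_group(rows)
    pool = Pool(
        data=[[r[name] for name in feature_names] for r in rows],
        label=[r["liked"] for r in rows],       # hypothetical 0/1 target column
        group_id=[r["userId"] for r in rows],   # ranking is done within a user
    )
    model = CatBoost(dict(params))
    model.fit(pool)
    return model

# Tiny illustration of the grouping step only.
rows = [{"userId": 2, "liked": 1}, {"userId": 1, "liked": 0}, {"userId": 1, "liked": 1}]
ordered = sort_by_group(rows)
```

Switching between the two approaches then comes down to passing `CLASSIFIER_PARAMS` or `RANKER_PARAMS` to the same function.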

Five minutes before the online stage of the "Collaborative Systems" competition closed, Sergey Shalnov pushed me down to second place. From then on we went forward together.

Preparing for the offline stage

Winning the online stage guaranteed us an RTX 2080 TI video card, but the main prize of 300,000 rubles and, more importantly, the final first place itself made us keep working during the two weeks between the stages.

As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about Anna Veronika Dorogush's talk, which had answers to many of my questions, including ones I had not even come up with yet.

Watching the talk led me to the idea that we should return all parameters to their default values and tune them very carefully, and only after the feature set is fixed. One training run now took about 15 hours, but a single model managed to get a better score than the one obtained by the ensemble with ranking.

Feature generation

In the Collaborative Systems competition, the model rates a large number of features as important. For example, auditweights_spark_svd is the most important feature, yet there is no information about what it means. I thought it would be worth computing various aggregates based on the important features, for example the mean auditweights_spark_svd by user, by group, by object. The same can be computed on the data that is not used for training with target = 1, that is, the mean auditweights_spark_svd per user over the objects the user liked. There were several important features besides auditweights_spark_svd. Here are some of them:

  • auditweightsCtrGender
  • auditweightsCtrHigh
  • userOwnerCounterCreateLikes

For example, the mean auditweightsCtrGender by userId turned out to be an important feature, as did the mean userOwnerCounterCreateLikes by userId+ownerId. This should already make you think that you need to understand the meaning of the fields.

auditweightsLikesCount and auditweightsShowsCount were also important features. Dividing one by the other gave an even more important feature.
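
The aggregates and the ratio described above can be sketched as follows. This is plain Python on a list of dictionaries rather than the Spark code the author used; the helper names and the sample values are made up for illustration.

```python
from collections import defaultdict

SHOWS = "auditweightsShowsCount"
LIKES = "auditweightsLikesCount"

def group_mean(rows, key_fields, value_field):
    """Mean of value_field for every combination of key_fields."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        sums[key] += row[value_field]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def like_ratio(row):
    """Likes divided by shows; None when the post has no shows yet."""
    return row[LIKES] / row[SHOWS] if row[SHOWS] else None

# Hypothetical rows with made-up feature values.
rows = [
    {"userId": 1, "ownerId": 10, "auditweightsCtrGender": 0.2, SHOWS: 12, LIKES: 3},
    {"userId": 1, "ownerId": 10, "auditweightsCtrGender": 0.4, SHOWS: 20, LIKES: 5},
    {"userId": 2, "ownerId": 10, "auditweightsCtrGender": 0.1, SHOWS: 0, LIKES: 0},
]
mean_ctr_by_user = group_mean(rows, ["userId"], "auditweightsCtrGender")
mean_ctr_by_user_owner = group_mean(rows, ["userId", "ownerId"], "auditweightsCtrGender")
ratios = [like_ratio(r) for r in rows]
```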

Data leaks

Competitions and production modeling are very different tasks. When preparing data, it is very hard to account for every detail and not pass some non-trivial information about the target variable into the test set. If we are building a production solution, we try to avoid using data leaks when training the model. But if we want to win a competition, data leaks are the best features.

Having studied the data, you can see that auditweightsLikesCount and auditweightsShowsCount change across rows of the same objectId, which means the ratio of the maximum values of these features reflects the post's conversion much better than the ratio at the time of display.

The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
But what if we look at the data more closely? Let's sort by show date and get:

objectId | userId | auditweightsShowsCount | auditweightsLikesCount | target (is liked)
1 | 1 | 12 | 3 | probably not
1 | 2 | 15 | 3 | probably yes
1 | 3 | 16 | 4 |

It was a surprise when I found the first example of this kind where my prediction did not come true. But, given that the maximum values of these features within an object gave a gain, we were not lazy and decided to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next moment in time. By adding the feature
(auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp jump in score.
Similar leaks can be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, of auditweightsCtrGender within objectId+userGender. We found 6 similar fields with leaks and extracted as much information as possible from them.
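
A minimal sketch of the two leak features on the rows of the table above. It assumes the rows of one objectId are already sorted by show time; the `likes_grew` flag, the function names, and returning `None` when the denominator is zero are additions of this sketch. The ratio is computed with the same signs as written in the text.

```python
SHOWS = "auditweightsShowsCount"
LIKES = "auditweightsLikesCount"

def max_ratio(rows):
    """First leak: maximum likes divided by maximum shows within one objectId."""
    return max(r[LIKES] for r in rows) / max(r[SHOWS] for r in rows)

def add_next_features(rows):
    """rows: dicts for ONE objectId, already sorted by show time (assumption)."""
    out = []
    for cur, nxt in zip(rows, rows[1:] + [None]):
        feat = dict(cur)
        if nxt is None:
            # Last impression of the object: no "next" values are available.
            feat["likes_grew"] = None
            feat["next_ratio"] = None
        else:
            d_shows = nxt[SHOWS] - cur[SHOWS]
            d_likes = cur[LIKES] - nxt[LIKES]  # sign as written in the article
            feat["likes_grew"] = nxt[LIKES] > cur[LIKES]
            feat["next_ratio"] = d_shows / d_likes if d_likes else None
        out.append(feat)
    return out

# The three rows of objectId 1 from the table.
rows = [
    {"userId": 1, SHOWS: 12, LIKES: 3},
    {"userId": 2, SHOWS: 15, LIKES: 3},
    {"userId": 3, SHOWS: 16, LIKES: 4},
]
features = add_next_features(rows)
```

For the second row the like counter grows by the next impression, which matches the "probably yes" guess in the table.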

By that time we had squeezed as much information as possible out of the collaborative features, but had not returned to the image and text competitions. I had a great idea to check: how much do features based directly on images or texts give in the corresponding competitions?

There were no leaks in the image and text competitions, but by that time I had returned the catboost parameters to their defaults, cleaned up the code, and added a few features. The totals were:

Solution | Score
Maximum with images | 0.6411
Maximum without images | 0.6297
Second-place result | 0.6295

Solution | Score
Maximum with texts | 0.666
Maximum without texts | 0.660
Second-place result | 0.656

Solution | Score
Maximum in collaborative | 0.745
Second-place result | 0.723

It became obvious that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working with them.

Further feature generation in the collaborative system gave no gain, so we moved on to ranking. At the online stage, the ensemble of classification and ranking had given me a small gain, which, as it turned out, was because I had undertrained the classifier. None of the loss functions, including YetiRankPairwise, came anywhere near the result of LogLoss (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, which we could not get to run.

Offline stage

At the offline stage, the data structure stayed the same, but there were minor changes:

  • the identifiers userId, objectId, ownerId were re-randomized;
  • several features were removed and several were renamed;
  • the amount of data grew about 1.5 times.

Besides the difficulties listed above, there was one big plus: the team was given a large server with an RTX 2080TI. I enjoyed looking at htop for a long time.

There was only one idea: simply reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually began to verify that the results were reproducible. The main problem we faced was the increase in data volume. We decided to reduce the load a bit and set the catboost parameter ctr_complexity=1. This lowers the score a little, but my model started working, and the result was good: 0.733. Sergey, unlike me, did not split the data into 2 parts and trained on all the data. Although this gave the best result at the online stage, at the offline stage it caused many difficulties. If we had taken all the features we generated and tried to push them into catboost head-on, nothing would have worked even at the online stage. Sergey optimized types, for example converting float64 columns to float32. In this article you can find information on memory optimization in pandas. As a result, Sergey trained on the CPU using all the data and got about 0.735.
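
The effect of the float64 to float32 conversion mentioned above can be shown with the standard library alone. This is a stand-in, not Sergey's code: in pandas the analogous step would be a call like `df[column].astype("float32")`, while the snippet below uses the `array` module so that it runs without any third-party packages.

```python
from array import array

# 1000 made-up feature values.
values = [0.1 * i for i in range(1000)]

as_float64 = array("d", values)  # 8 bytes per value
as_float32 = array("f", values)  # 4 bytes per value

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)

# The price of the smaller type is a small rounding error per value.
max_error = max(abs(a - b) for a, b in zip(as_float32, values))

print(bytes64, bytes32, max_error)
```

Halving the size of every floating-point column is often enough to make the difference between a dataset that fits into memory and one that does not.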

These results were enough to win, but we were hiding our real score and could not be sure that other teams were not doing the same.

Fight to the last

Catboost tuning

Our solution was fully reproduced, and we added features from the text data and images, so all that remained was to tune the catboost parameters. Sergey trained on the CPU with a small number of iterations, and I trained the variant with ctr_complexity=1. There was one day left, and if you simply added iterations or increased ctr_complexity, you could get an even better score by morning and spend the whole day out for a walk.

At the offline stage, scores could be hidden very easily by simply selecting a solution on the site that was not your best. We expected drastic changes on the leaderboard in the last minutes before submissions closed and decided not to stop.

From Anna's video, I learned that to improve model quality it is best to tune the following parameters:

  • learning_rate - the default value is computed from the size of the dataset. Decreasing learning_rate requires increasing the number of iterations.
  • l2_leaf_reg - the regularization coefficient, default value 3, preferably chosen from 2 to 30. Decreasing the value increases overfitting.
  • bagging_temperature - adds randomization to the weights of objects in the sample. The default value is 1, in which case the weights are drawn from an exponential distribution. Decreasing the value increases overfitting.
  • random_strength - affects the choice of splits at a specific iteration. The higher random_strength is, the higher the chance that a split of low importance is selected. With each subsequent iteration, the randomness decreases. Decreasing the value increases overfitting.

Other parameters have a much smaller effect on the final result, so I did not try to tune them. One training run on my dataset on the GPU with ctr_complexity=1 took 20 minutes, and the parameters selected on a reduced dataset differed slightly from the optimal ones on the full dataset. In the end, I did about 30 tuning iterations on 10% of the data, and then about 10 more on all the data. It came out roughly like this:

  • learning_rate: I increased it by 40% over the default;
  • l2_leaf_reg: left unchanged;
  • bagging_temperature and random_strength: reduced to 0.8.

We can conclude that the model was undertrained with the default parameters.
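
Written down as a parameter dictionary, the tuned values might look like the sketch below. The default learning rate is picked by catboost from the dataset, so the constant `ASSUMED_DEFAULT_LEARNING_RATE` is only a placeholder to show the 40% increase, not a value from the article.

```python
# Placeholder: catboost derives the real default from the dataset size.
ASSUMED_DEFAULT_LEARNING_RATE = 0.03

tuned_params = {
    # 40% above the (assumed) default value
    "learning_rate": round(ASSUMED_DEFAULT_LEARNING_RATE * 1.4, 6),
    # left at its default
    "l2_leaf_reg": 3,
    # both reduced from the default 1 to 0.8
    "bagging_temperature": 0.8,
    "random_strength": 0.8,
}

print(tuned_params)
```

Such a dictionary can be unpacked into the model constructor, for example `CatBoostClassifier(**tuned_params)`.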

I was very surprised when I saw the result on the leaderboard:

Model | model 1 | model 2 | model 3 | ensemble
Without tuning | 0.7403 | 0.7404 | 0.7404 | 0.7407
With tuning | 0.7406 | 0.7405 | 0.7406 | 0.7408

My conclusion for myself: if the model does not need to be applied quickly, it is better to replace parameter tuning with an ensemble of several models trained with non-optimized parameters.

Meanwhile, Sergey was optimizing the size of the dataset so that it could run on the GPU. The simplest option is to cut off part of the data, but this can be done in several ways:

  • gradually remove the oldest data (the beginning of February) until the dataset fits into memory;
  • remove the features with the lowest importance;
  • remove the userIds that have only one entry;
  • keep only the userIds that are present in the test set.

And finally, build an ensemble from all of these options.
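
Three of the row-filtering options above can be sketched as small functions. The row layout with `userId` and `ts` fields, the function names, and the use of a row limit as a proxy for "fits into memory" are assumptions of this sketch.

```python
from collections import Counter

def drop_oldest(rows, max_rows):
    """Keep only the newest max_rows rows (a proxy for fitting into memory)."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    start = max(len(ordered) - max_rows, 0)
    return ordered[start:]

def drop_single_entry_users(rows):
    """Remove users that appear in the data only once."""
    counts = Counter(r["userId"] for r in rows)
    return [r for r in rows if counts[r["userId"]] > 1]

def keep_test_users(rows, test_user_ids):
    """Keep only rows of users that are present in the test set."""
    test_user_ids = set(test_user_ids)
    return [r for r in rows if r["userId"] in test_user_ids]

# Made-up rows: ts is any sortable show time.
rows = [
    {"userId": 1, "ts": 1}, {"userId": 2, "ts": 2},
    {"userId": 1, "ts": 3}, {"userId": 3, "ts": 4},
]
newest = drop_oldest(rows, max_rows=3)
frequent = drop_single_entry_users(rows)
in_test = keep_test_users(rows, test_user_ids=[2, 3])
```

Each function yields a different training subset, so models trained on them can later be combined into an ensemble.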

The last ensemble

By late evening of the last day, we had put together an ensemble of our models that gave 0.742. Overnight I launched my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. It finished only at 4 a.m., and I built the final ensemble, which gave 0.7433 on the public leaderboard.

Because of our different approaches to the problem, our predictions were not strongly correlated, which gave a good gain in the ensemble. To get a good ensemble, it is better to use the raw model predictions, predict(prediction_type='RawFormulaVal'), and to set scale_pos_weight=neg_count/pos_count.
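
A small sketch of the two ingredients mentioned above. The article does not say how the raw predictions were combined, so the plain row-wise average in `blend_raw` is an assumption, as are the function names and the sample numbers; in practice the raw scores would come from each model's predict call with prediction_type='RawFormulaVal'.

```python
def scale_pos_weight(labels):
    """neg_count / pos_count for a list of 0/1 labels."""
    pos_count = sum(1 for y in labels if y == 1)
    neg_count = len(labels) - pos_count
    return neg_count / pos_count

def blend_raw(*model_scores):
    """Average the raw (pre-sigmoid) scores of several models row by row."""
    return [sum(scores) / len(scores) for scores in zip(*model_scores)]

# Made-up labels and raw scores of two models for two rows.
labels = [1, 0, 0, 0, 1, 0]
weight = scale_pos_weight(labels)
blended = blend_raw([1.0, -2.0], [3.0, 0.0])
```

Since only the order of scores within a user matters for the metric, the blended raw values can be used for ranking without converting them to probabilities.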

On the website you can see the final results on the private leaderboard.

Other solutions

Many teams followed the canons of recommender system algorithms. Not being an expert in this field, I cannot evaluate them, but I remember 2 interesting solutions.

  • Nikolay Anokhin's solution. Nikolay, being a Mail.ru employee, was not competing for prizes, so his goal was not to achieve the maximum score but to get an easily scalable solution.
  • The solution of the team that won the jury prize, based on this article from facebook, which produced very good image clustering without manual work.

Conclusion

What stuck in my memory the most:

  • If the data has categorical features and you know how to do target encoding correctly, it is still better to try catboost.
  • If you take part in a competition, you should not waste time tuning parameters other than learning_rate and iterations. A faster solution is to build an ensemble of several models.
  • Boosting can be trained on a GPU. Catboost trains very quickly on a GPU, but it eats a lot of memory.
  • While developing and testing ideas, it is better to set a small rsm~=0.2 (CPU only) and ctr_complexity=1.
  • Unlike for other teams, the ensemble of our models gave a large gain. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each of us had his own bugs.
  • It is not clear why optimizing for ranking performed worse than optimizing for classification.
  • I gained some experience working with texts and an understanding of how recommender systems are built.

Thanks to the organizers for the emotions, the knowledge, and the prizes.

source: www.habr.com
