The Large Hadron Collider and Odnoklassniki

Continuing the theme of machine learning competitions on Habr, we would like to introduce readers to two more platforms. They are certainly not as huge as kaggle, but they definitely deserve attention.

Personally, I am not too fond of kaggle, for several reasons:

  • first, competitions there often run for several months, and active participation takes a lot of effort;
  • second, public kernels (public solutions). Kaggle adepts advise treating them with the calm of Tibetan monks, but in reality it is quite a shame when something you have been working towards for a month or two is suddenly handed to everyone on a silver platter.

Fortunately, machine learning competitions are held on other platforms too, and two of them are discussed below.

IDAO and SNA Hackathon 2019 compared

Official language
  IDAO: English.
  SNA Hackathon 2019: Russian.

Organizers
  IDAO: Yandex, Sberbank, HSE.
  SNA Hackathon 2019: Mail.ru Group.

Dates
  IDAO: online round January 15 - February 11, 2019; on-site final April 4-6, 2019.
  SNA Hackathon 2019: online from February 7 to March 15; offline from March 30 to April 1.

Task
  IDAO: Using a given dataset about particles in the Large Hadron Collider (trajectories, energies and other fairly complex physical parameters), determine whether a particle is a muon or not. From this statement, 2 tasks were derived:
    - in one you only had to submit your predictions,
    - in the other you had to submit the complete code and model for prediction, and execution was subject to fairly strict limits on running time and memory usage.
  SNA Hackathon 2019: For the competition, logs of content shown from open groups in users' news feeds for February-March 2018 were collected. The test set contains the last week and a half of March. Each log entry contains information about what was shown and to whom, as well as how the user reacted to this content: rated it, commented on it, ignored it, or hid it from the feed. The essence of the tasks is to rank the feed of each user of the Odnoklassniki social network, raising as high as possible those posts that will receive a "class". At the online stage, the task was split into 3 parts:
    1. ranking posts by various collaborative features
    2. ranking posts by the images they contain
    3. ranking posts by the text they contain

Metric
  IDAO: a complex custom metric, something like ROC-AUC.
  SNA Hackathon 2019: ROC-AUC averaged over users.

Prizes
  IDAO: First stage: T-shirts for N places and a pass to the second stage, where accommodation and food were paid for during the competition. Second stage: ??? (for certain reasons I did not attend the award ceremony and never found out what the prizes were in the end). Laptops were promised to all members of the winning team.
  SNA Hackathon 2019: First stage: T-shirts for the 100 best participants and a pass to the second stage, where travel to Moscow, accommodation and food were paid for during the competition. Also, towards the end of the first stage, prizes were announced for the best in each of the 3 tasks of stage 1: each winner got an RTX 2080 TI video card! The second stage was a team stage, with teams of 2 to 5 people. Prizes:
    1st place - 300 rubles
    2nd place - 200 rubles
    3rd place - 100 rubles
    Jury prize - 100 rubles

Communication
  IDAO: official telegram group, ~190 participants, communication in English; questions waited several days for an answer.
  SNA Hackathon 2019: official telegram group, ~1500 participants, active discussion of the tasks between participants and organizers.

Baseline solutions
  IDAO: The organizers provided two baseline solutions, a simple one and an advanced one. The simple one needed less than 16 GB of RAM, while the advanced one did not fit into 16 GB. At the same time, looking ahead a little, participants were not able to significantly outperform the advanced solution. There were no difficulties launching these solutions. It is worth noting that the advanced example contained a comment with a hint on where to start improving the solution.
  SNA Hackathon 2019: Primitive baseline solutions were provided for each of the tasks, and participants easily surpassed them. In the first days of the competition, participants ran into several difficulties. First, the data was given in Apache Parquet format, and not every combination of Python and the parquet package worked without errors. Second, downloading the images from the Mail.ru cloud was hard: at the moment there is no simple way to download a large amount of data at once. As a result, these problems delayed participants by a couple of days.
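
To make the SNA Hackathon metric concrete, here is a minimal sketch of ROC-AUC averaged over users. The function names are hypothetical, and skipping users who have only one class is an assumption of this sketch: the text does not say how the official metric handled them.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Returns None when only one class is present, because AUC is undefined.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_user_auc(user_ids, labels, scores):
    """Average the per-user ROC-AUC, skipping users with a single class (assumption)."""
    by_user = defaultdict(lambda: ([], []))
    for user, label, score in zip(user_ids, labels, scores):
        by_user[user][0].append(label)
        by_user[user][1].append(score)
    aucs = [roc_auc(ys, ss) for ys, ss in by_user.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None
```

The pairwise loop is quadratic per user, which is fine for an illustration; on millions of log rows a sort-based AUC would be the practical choice.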

IDAO. First stage

The task was to classify particles as muon/non-muon by their characteristics. The key feature of this task was the presence of a weight column in the training data, which the organizers themselves interpreted as the confidence in the answer for that row. The problem was that quite a few rows contained negative weights.

After thinking for a few minutes about the line with the hint (the hint merely drew attention to this peculiarity of the weight column) and plotting that column, we decided to check 3 options:

1) invert the target for rows with negative weights (and flip the sign of the weights accordingly)
2) shift the weights by the minimum value so that they start from 0
3) do not use the row weights at all

The third option turned out to be the worst, but the first two improved the result. The best was option 1, which immediately put us in second place at that moment in the first task and in first place in the second.
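
Options 1 and 2 can be sketched in a few lines. This is only an illustration on plain dictionaries: the column names "label" and "weight" are hypothetical, and in practice the same thing would be done on a data frame.

```python
def flip_negative_weights(rows, label_key="label", weight_key="weight"):
    """Option 1: make negative weights positive and invert the binary label."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[weight_key] < 0:
            row[weight_key] = -row[weight_key]
            row[label_key] = 1 - row[label_key]
        fixed.append(row)
    return fixed


def shift_weights(rows, weight_key="weight"):
    """Option 2: subtract the minimum weight so that weights start from 0."""
    lowest = min(row[weight_key] for row in rows)
    return [{**row, weight_key: row[weight_key] - lowest} for row in rows]
```

For example, a row with label 1 and weight -0.5 becomes a row with label 0 and weight 0.5 under option 1.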
The next step was to check the data for missing values. The organizers had given us already cleaned data, in which quite a few values were missing and had been replaced with -9999.

We found missing values in the columns MatchedHit_{X,Y,Z}[N] and MatchedHit_D{X,Y,Z}[N], and only for N=2 or 3. As we understood it, some particles did not pass through all 4 detectors and stopped at either the 3rd or the 4th plate. The data also contained Lextra_{X,Y}[N] columns, which apparently describe the same thing as MatchedHit_{X,Y,Z}[N], but using some kind of extrapolation. These meager guesses suggested that Lextra_{X,Y}[N] could be substituted for the missing values in MatchedHit_{X,Y,Z}[N] (for the X and Y coordinates only). MatchedHit_Z[N] was filled with the average. These manipulations allowed us to reach an intermediate 1st place in both tasks.
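
The filling rule described above might look roughly like this. It is a sketch on a list of dictionaries, not our actual code: the function name is hypothetical, and since the text only says "average", the arithmetic mean is assumed for the Z coordinate.

```python
MISSING = -9999  # sentinel the organizers used for missing values


def fill_matched_hits(rows, stations=(0, 1, 2, 3)):
    """Fill missing MatchedHit_{X,Y}[N] from Lextra_{X,Y}[N] and MatchedHit_Z[N] with the mean."""
    rows = [dict(r) for r in rows]  # work on copies
    for n in stations:
        for axis in ("X", "Y"):
            hit, extra = f"MatchedHit_{axis}[{n}]", f"Lextra_{axis}[{n}]"
            for r in rows:
                if r[hit] == MISSING:
                    r[hit] = r[extra]  # use the extrapolated coordinate instead
        z = f"MatchedHit_Z[{n}]"
        known = [r[z] for r in rows if r[z] != MISSING]
        if known:
            z_mean = sum(known) / len(known)  # assumption: mean; a median is a one-line change
            for r in rows:
                if r[z] == MISSING:
                    r[z] = z_mean
    return rows
```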

Considering that nothing was awarded for winning the first stage, we could have stopped there, but we carried on, drew some pretty pictures and came up with new features.

For example, we found that if we plot the points where a particle intersects each of the four detector plates, the points on each plate are grouped into 5 rectangles with an aspect ratio of 4 to 5, centered at the point (0,0), and that there are no points in the first rectangle.

Plate no. / rectangle size   1         2           3           4           5
Plate 1                      500x625   1000x1250   2000x2500   4000x5000   8000x10000
Plate 2                      520x650   1040x1300   2080x2600   4160x5200   8320x10400
Plate 3                      560x700   1120x1400   2240x2800   4480x5600   8960x11200
Plate 4                      600x750   1200x1500   2400x3000   4800x6000   9600x12000

Having determined these dimensions, we added 4 new features for each particle: the number of the rectangle in which it intersects each plate.
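
A rough sketch of that feature is below. It assumes that the sizes are full widths and heights of rectangles centered at (0,0), and that a point gets the number of the smallest rectangle containing it; both are assumptions of this illustration, as is the function name.

```python
# Rectangle sizes (width, height) per plate, smallest first.
RECT_SIZES = {
    1: [(500, 625), (1000, 1250), (2000, 2500), (4000, 5000), (8000, 10000)],
    2: [(520, 650), (1040, 1300), (2080, 2600), (4160, 5200), (8320, 10400)],
    3: [(560, 700), (1120, 1400), (2240, 2800), (4480, 5600), (8960, 11200)],
    4: [(600, 750), (1200, 1500), (2400, 3000), (4800, 6000), (9600, 12000)],
}


def rectangle_number(plate, x, y):
    """Return 1..5 for the smallest centered rectangle containing (x, y), or 0 if none does."""
    for number, (width, height) in enumerate(RECT_SIZES[plate], start=1):
        if abs(x) <= width / 2 and abs(y) <= height / 2:
            return number
    return 0
```

Calling it once per plate with the corresponding hit coordinates gives the 4 features per particle.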

We also noticed that the particles seemed to scatter outwards from the center, and the idea arose to somehow evaluate the "quality" of this scattering. Ideally, one could probably come up with some kind of "ideal" parabola depending on the entry point and estimate the deviation from it, but we limited ourselves to an "ideal" straight line. Having built such ideal straight lines for each entry point, we were able to compute the standard deviation of each particle's trajectory from this line. Since the average deviation for target = 1 was 152 and for target = 0 it was 390, we tentatively rated this feature as good. And indeed, it immediately made it to the top of the most useful features.

We were delighted and added the deviations of all 4 intersection points of each particle from the ideal straight line as 4 additional features (and they worked well too).
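
One possible reading of the "ideal" straight line is sketched below. It assumes the line runs from the origin through the hit on the first plate and is extrapolated to the z position of every other plate; the text does not spell out the construction, so this is an assumption, and the function names are hypothetical. Under this reading the deviation on the first plate is zero by construction.

```python
import math


def line_deviations(hits):
    """Distance in the XY plane between each hit and the line origin -> first hit.

    hits: list of (x, y, z) tuples, one per plate, ordered by plate number.
    """
    x0, y0, z0 = hits[0]
    deviations = []
    for x, y, z in hits:
        scale = z / z0  # how far along the line this plate sits
        deviations.append(math.hypot(x - x0 * scale, y - y0 * scale))
    return deviations


def rms_deviation(hits):
    """Root-mean-square of the per-plate deviations, a single 'quality' number."""
    devs = line_deviations(hits)
    return math.sqrt(sum(d * d for d in devs) / len(devs))
```

The per-plate values correspond to the extra features mentioned above, and the aggregated value to the overall deviation of the trajectory.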

Links to scientific papers on the topic of the competition, kindly provided by the organizers, suggested that we were far from the first to tackle this problem and that some specialized software might exist. Having found a repository on github where the IsMuonSimple, IsMuon and IsMuonLoose methods were implemented, we ported them into our own code with minor modifications. The methods themselves were very simple: for example, if the energy is below a certain threshold, it is not a muon, otherwise it is a muon. Such simple features obviously could not give a gain when gradient boosting is used, so we added a signed "distance" to the threshold. These features also gave a slight improvement. Perhaps, by studying the existing methods more thoroughly, it would have been possible to find stronger ones and add them to the features.
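
The idea of turning a hard cut into a signed distance can be illustrated as follows. The function name and the threshold value are hypothetical; the real cuts come from the IsMuon-style methods and are not reproduced here.

```python
def threshold_features(value, threshold):
    """Return the binary cut and the signed distance to the threshold.

    The binary flag alone tells a tree model little it cannot learn itself;
    the signed distance also says how far the value is from the cut.
    """
    passed = int(value >= threshold)
    distance = value - threshold  # negative below the cut, positive above
    return passed, distance
```

For instance, with a made-up threshold of 3000.0, a value of 2500.0 gives the pair (0, -500.0).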

Towards the end of the competition, we slightly tuned the "fast" solution for the second task; in the end, it differed from the baseline in the following points:

  1. In rows with negative weights, the target was inverted
  2. Missing values in MatchedHit_{X,Y,Z}[N] were filled in
  3. The depth was reduced to 7
  4. The learning rate was reduced to 0.1 (it was 0.19)

As a result, we tried more features (not very successfully), tuned hyperparameters, trained catboost, lightgbm and xgboost, tried different blends of predictions, and before the private leaderboard was opened we were confidently winning the second task and were among the leaders in the first.
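
The simplest form of blending is a weighted average of the predictions of several models. The sketch below shows only that; the function name and the weights are hypothetical, and it is not a record of the exact blend we submitted.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model prediction lists of equal length.

    predictions: one list of scores per model, all in the same row order.
    weights: optional per-model weights; equal weights when omitted.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]
```

Weights are usually picked on a validation set; equal weights are a reasonable starting point.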

After the private leaderboard was opened, we were in 10th place for the 1st task and 3rd for the second. All the leaders got shuffled, and the score on the private part was higher than on the public leaderboard. It seems that the data was not split very well (or, for example, there were no rows with negative weights in the private part), and this was a bit frustrating.

SNA Hackathon 2019 - Texts. First stage

The task was to rank a user's posts on the Odnoklassniki social network based on the text they contain; besides the text, there were a few more characteristics of the post (language, owner, date and time of creation, date and time of viewing).

As classical approaches to working with text, I would highlight two options:

  1. Mapping each word into an n-dimensional vector space such that similar words have similar vectors (read more in our article), then either taking the average word vector for the text or using mechanisms that take the relative position of words into account (CNN, LSTM/GRU).
  2. Using models that can work with whole sentences right away. For example, Bert. In theory, this approach should work better.

Since this was my first experience with texts, it would be wrong of me to teach anyone, so I will teach myself. Here is the advice I would give myself at the start of the competition:

  1. Before you rush to train something, look at the data! Besides the text itself, the data had several columns, and it was possible to squeeze much more out of them than I did. The simplest thing is to do mean target encoding for some of the columns.
  2. Do not train on all the data! There was a lot of data (about 17 million rows), and it was absolutely not necessary to use all of it to test hypotheses. Training and preprocessing were quite slow, and with a smaller sample I clearly would have had time to test more interesting hypotheses.
  3. <Controversial advice> No need to look for a killer model. I spent a long time figuring out Elmo and Bert, hoping they would immediately take me to a high place, and in the end I used pre-trained FastText embeddings for the Russian language. I could not get a better score with Elmo, and I never had time to figure out Bert.
  4. <Controversial advice> No need to look for one killer feature. Looking at the data, I noticed that about 1 percent of the texts do not actually contain any text! But there were links to some resources, and I wrote a simple parser that opened the page and pulled out the title and description. It seemed like a good idea, but then I got carried away and decided to parse all the links for all the texts, and again lost a lot of time. None of this gave a significant improvement in the final result (although I did figure out stemming, for example).
  5. Classic features work. We google, for example, "text features kaggle", read and add everything. TF-IDF gave an improvement, as did statistical features such as text length, word count and the amount of punctuation.
  6. If there are DateTime columns, it is worth splitting them into several separate features (hours, days of the week, etc.). Which features to extract should be analyzed using plots or some metrics. Here, on a hunch, I did everything correctly and picked the right features, but a proper analysis would not have hurt (for example, as we did at the final).
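
Points 5 and 6 are easy to illustrate. Below is a minimal sketch of the statistical text features and of the DateTime split; the feature names are hypothetical, and string.punctuation covers only ASCII punctuation, so real Russian text would need a wider character set.

```python
import string
from datetime import datetime


def text_stats(text):
    """Simple statistical features: length, word count, punctuation count."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_punct": sum(ch in string.punctuation for ch in text),
    }


def datetime_features(ts):
    """Split a datetime into a few separate features."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": int(ts.weekday() >= 5),
    }
```

For example, text_stats("Hello, world!") counts 13 characters, 2 words and 2 punctuation marks, and datetime_features(datetime(2018, 3, 17, 21, 30)) describes a Saturday evening.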

By the end of the competition, I had trained one keras model with convolutions over words and another based on LSTM and GRU. Both used pre-trained FastText embeddings for the Russian language (I tried a number of other embeddings, but these worked best). After averaging the predictions, I took the final 7th place out of 76 participants.

After the first stage, an article by Nikolai Anokhin was published. He took second place (he participated out of competition), and his solution up to a certain point repeated mine, but he went further thanks to the query-key-value attention mechanism.

Second stage: OK & IDAO

The second stages of the two competitions took place almost back to back, so I decided to look at them together.

First, my newly formed team and I ended up in the impressive office of the Mail.ru company, where our task was to combine the models from the three tracks of the first stage: texts, pictures and collaborative. A little more than 2 days were allotted for this, which turned out to be very little. In fact, we were only able to repeat our results from the first stage without gaining anything from the merge. In the end, we took 5th place, but we were not able to use the text model. After looking at other participants' solutions, it seems that it would have been worth trying to cluster the texts and add them to the collaborative model. The side effects of this stage were new impressions, meeting and talking to cool participants and organizers, and a severe lack of sleep, which may have affected the result of the IDAO final.

The task at the IDAO 2019 on-site final was to predict the waiting time for an order for Yandex taxi drivers at an airport. At stage 2, 3 tasks = 3 airports were identified. For each airport, minute-by-minute data on the number of taxi orders over six months was given. As test data, the following month was given, along with minute-by-minute data on orders for the preceding 2 weeks. There was little time (1.5 days), the task was quite specific, and only one person from the team came to the competition; as a result, we finished in a sad place near the bottom. Interesting ideas included attempts to use external data: weather, traffic jams and Yandex taxi order statistics. Although the organizers did not say which airports these were, many participants assumed they were Sheremetyevo, Domodedovo and Vnukovo. Although this assumption was refuted after the competition, features built from, for example, Moscow weather data improved the result both on validation and on the leaderboard.

Conclusion

  1. ML competitions are cool and interesting! Here you will find a use for skills in data analysis and in clever models and techniques, and plain common sense is welcome too.
  2. ML is already a huge body of knowledge that seems to be growing exponentially. I set myself the goal of getting acquainted with different areas (signals, pictures, tables, text) and have already realized how much there is to study. For example, after these competitions I decided to study: clustering algorithms, advanced techniques for working with gradient boosting libraries (in particular, working with CatBoost on the GPU), capsule networks, and the query-key-value attention mechanism.
  3. Not by kaggle alone! There are many other competitions where it is easier to get at least a T-shirt, and there are more chances for other prizes.
  4. Communicate! There is already a large community in the field of machine learning and data analysis, there are thematic groups in telegram and slack, and serious people from Mail.ru, Yandex and other companies answer questions and help both beginners and those continuing their path in this field of knowledge.
  5. I advise everyone inspired by the previous point to visit datafest, a big free conference in Moscow, which will take place on May 10-11.

source: www.habr.com
