Operating machine learning in Mail.ru Mail

Based on my talks at Highload++ and DataFest Minsk 2019.

For many people today, email is an integral part of life online. We use it for business correspondence and to store all kinds of important information related to finances, hotel bookings, orders and much more. In mid-2018, we formulated a product strategy for the development of email. What should modern email be like?

Email should be smart, that is, help users navigate the growing volume of information: filter it, structure it and present it in the most convenient way. It should be useful, letting you solve various tasks right in your mailbox, for example paying fines (a feature that, unfortunately, I use). And at the same time, of course, email must protect information, cutting off spam and guarding against account hacking, that is, be secure.

These areas define a number of key problems, many of which can be solved effectively with machine learning. Here are examples of existing features developed as part of the strategy, one for each area.

  • Smart Reply. Mail has a smart reply feature. A neural network analyzes the text of the email, understands its meaning and purpose, and offers the three most suitable reply options: positive, negative and neutral. This saves considerable time when answering emails, and the replies are often non-standard and amusing.
  • Grouping emails related to orders from online stores. We often shop online, and as a rule stores can send several emails for each order. For example, from AliExpress, the largest such service, a great many emails arrive for a single order, and we calculated that in the extreme case their number can reach 29. So, using a Named Entity Recognition model, we extract the order number and other information from the text and group all the emails into one thread. We also show the basic information about the order in a separate box, which makes this type of email easier to work with.

  • Anti-phishing. Phishing is a particularly dangerous type of fraudulent email, with which attackers try to obtain financial information (including the user's bank cards) and logins. Such emails imitate real ones sent by the service, including visually. So, with the help of Computer Vision, we recognize the logos and design style of emails from large companies (for example, Mail.ru, Sber, Alfa) and take this into account along with the text and other features in our spam and phishing classifiers.

Machine learning

A little about machine learning in email in general. Mail is a highly loaded system: on average 1.5 billion emails a day pass through our servers for 30 million DAU. About 30 machine learning systems support all the necessary functions and features.

Each email passes through an entire classification pipeline. First we cut off spam and keep the good emails. Users often do not notice the antispam at work, because 95-99% of spam never even reaches the Spam folder. Spam recognition is a very important part of our system, and the most difficult one, since in the antispam field there is constant adaptation between defense and attack systems, which gives our team a continuous engineering challenge.

Next, we separate emails from people and from robots. Emails from people are the most important, so for them we provide features such as Smart Reply. Emails from robots are divided into two groups: transactional ones, which are important emails from services, for example purchase confirmations, hotel bookings or financial messages, and informational ones, which are business advertising and discounts.

We believe that transactional emails are equal in importance to personal correspondence. They should be at hand, because we often need to quickly find information about an order or a plane ticket booking, and we spend time searching for these emails. So, for convenience, we automatically sort them into six main categories: travel, orders, finance, tickets, registrations and, finally, fines.

Informational emails are the largest and probably the least important group. They do not require an immediate response, since nothing significant will change in the user's life if such an email goes unread. In our new interface we collapse them into two threads, social networks and newsletters, thereby visually clearing the inbox and leaving only the important messages in view.

Operations

A large number of systems creates many difficulties in operation. After all, models degrade over time, like any software: features break, machines fail, code rots. In addition, data changes constantly: new data is added, user behavior patterns shift, and so on, so a model without proper support will work worse and worse over time.

We must not forget that the deeper machine learning penetrates users' lives, the greater its impact on the ecosystem and, as a result, the greater the financial losses or profits that market players can receive. Therefore, in a growing number of areas, players adapt to the work of ML algorithms (classic examples are advertising, search and the already mentioned antispam).

Also, machine learning tasks have a peculiarity: any change in the system, even a minor one, can generate a lot of work on the model: working with data, retraining, deployment, which can take weeks or months. Therefore, the faster the environment in which your models operate changes, the more effort it takes to maintain them. A team can build many systems and be happy about it, and then spend almost all of its resources on maintaining them, with no opportunity to do anything new. We once ran into exactly this situation in the antispam team. And we drew the obvious conclusion that support needs to be automated.

Automation

What can be automated? Almost everything, in fact. I have identified four areas that define the machine learning infrastructure:

  • data collection;
  • additional training;
  • deployment;
  • testing and monitoring.

If the environment is unstable and constantly changing, then the entire infrastructure around the model turns out to be much more important than the model itself. It may be a good old linear classifier, but if you feed it the right features and set up good feedback from users, it will work much better than State-Of-The-Art models with all the bells and whistles.

Feedback loop

This loop combines data collection, additional training and deployment: in effect, the entire model update cycle. Why is it important? Look at the graph of sign-ups in Mail:

A machine learning developer deployed an anti-bot model that prevents bots from signing up in Mail. The graph drops to a level where only real users remain. Everything is great! But four hours pass, the bots tweak their scripts, and everything returns to where it was. For this release, the developer spent a month adding features and retraining the model, but the spammer was able to adapt in four hours.

To avoid such excruciating pain and having to redo everything later, we must think from the start about what the feedback loop will look like and what we will do when the environment changes. Let's start with data collection: it is the fuel for our algorithms.

Data collection

It is clear that for modern neural networks, the more data the better, and in fact that data is generated by the product's users. Users can help us by labeling data, but we cannot abuse this, because at some point users will get tired of finishing your models' work for them and will switch to another product.

One of the most common mistakes (here I refer to Andrew Ng) is focusing too much on metrics on the test dataset rather than on feedback from the user, which is actually the main measure of the quality of the work, since we are creating a product for the user. If the user does not understand or does not like the model's work, then everything is ruined.

Therefore, the user should always be able to vote and should be given a feedback tool. If we think that an email related to finance has arrived in the mailbox, we need to mark it "finance" and draw a button the user can click to say that this is not finance.

Feedback quality

Let's talk about the quality of user feedback. First, you and the user can put different meanings into the same concept. For example, you and your product managers think that "finance" means emails from the bank, while the user thinks that an email from grandma about her pension also refers to finance. Second, there are users who mindlessly love to press buttons without any logic. Third, the user can be deeply mistaken in their conclusions. A striking example from our practice is the rollout of a classifier for Nigerian spam, a very amusing type of spam in which the user is asked to collect several million dollars from a suddenly discovered distant relative in Africa. After rolling out this classifier, we checked the "Not Spam" clicks on these emails, and it turned out that 80% of them were on juicy Nigerian spam, which suggests that users can be extremely gullible.

And let's not forget that buttons can be clicked not only by people, but also by all sorts of bots that pretend to be a browser. So raw feedback is not suitable for training. What can you do with this information?

We use two approaches:

  • Feedback from a connected ML system. For example, we have an online anti-bot system that, as I mentioned, makes a fast decision based on a limited number of signals. And there is a second, slow system that works after the fact. It has more data about the user, their behavior and so on. As a result, it makes a more informed decision and accordingly has higher precision and recall. You can feed the difference between the outputs of these systems into the first one as training data. Thus, the simpler system will always try to approach the performance of the more complex one.
  • Click classification. You can simply classify each user click and assess its validity and usefulness. We do this in antispam mail, using user attributes, the user's history, sender attributes, the text itself and the output of the classifiers. As a result, we get an automatic system that validates user feedback. And since it needs retraining much less often, its output can become the basis for all other systems. The main priority for this model is precision, because training a model on inaccurate data is fraught with consequences.
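The first approach can be sketched roughly as follows, assuming both verdicts are logged for every event. The `Event` fields and the function name are hypothetical, introduced only for illustration; the article does not describe the actual log format.

```python
from dataclasses import dataclass


@dataclass
class Event:
    features: dict      # signals that were available online at decision time
    fast_is_bot: bool   # verdict of the fast online model
    slow_is_bot: bool   # later verdict of the slow after-the-fact model


def disagreement_examples(events):
    """Return (features, label) pairs where the slow system overruled the fast one.

    The slow verdict is used as the label, so the fast model is retrained
    exactly on the cases it got wrong according to the better-informed system.
    """
    return [
        (e.features, int(e.slow_is_bot))
        for e in events
        if e.fast_is_bot != e.slow_is_bot
    ]
```

Keeping only the disagreements concentrates each retraining round on the fast model's mistakes; whether to also mix in agreed cases is a design choice the article does not settle.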

While we are cleaning the data and further training our ML systems, we must not forget about the users, because for us thousands or millions of errors on a graph are statistics, while for the user every bug is a tragedy. Besides the fact that the user has to somehow live with your error in the product, after giving feedback they expect that a similar situation will be eliminated in the future. Therefore, it is always worth giving users not only the opportunity to vote, but also a way to correct the behavior of ML systems, creating, for example, personal heuristics for each feedback click; in the case of mail, this could be the ability to filter such emails by sender and subject for this user.

You also need to build a model based on reports or support requests in semi-automatic or manual mode so that other users do not suffer from similar problems.

Heuristics for training

There are two problems with these heuristics and crutches. The first is that an ever-growing number of crutches is hard to maintain, let alone their quality and performance over the long haul. The second is that the error may not be frequent, and a few clicks will not be enough to further train the model. It would seem that these two unrelated effects can be significantly reduced if the following approach is applied.

  1. We create a temporary crutch.
  2. We send data from it to the model, which is updated regularly, including on the data received. Here, of course, it is important that the heuristics have high precision so as not to reduce the quality of the data in the training set.
  3. Then we set up monitoring of the crutch's triggers, and if after some time the crutch no longer fires and is completely covered by the model, it can be safely removed. Now this problem is unlikely to happen again.
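Step 3 above could be monitored with something like the sketch below. The class, its thresholds and the idea of tracking "the crutch fired but the model did not" over a sliding window are assumptions made for illustration, not the team's actual tooling.

```python
from collections import deque


class CrutchMonitor:
    """Tracks how often a temporary heuristic catches something the model misses."""

    def __init__(self, window=1000):
        # One entry per crutch trigger: True if the model did NOT fire on it.
        self.recent = deque(maxlen=window)

    def observe(self, crutch_fired, model_fired):
        if crutch_fired:
            self.recent.append(not model_fired)

    def uncovered_share(self):
        if not self.recent:
            return 0.0
        return sum(self.recent) / len(self.recent)

    def can_retire(self, min_hits=100, max_uncovered=0.0):
        # Retire only after enough triggers were seen and the model covered them.
        return len(self.recent) >= min_hits and self.uncovered_share() <= max_uncovered
```

The `min_hits` guard matters: a crutch that simply stopped firing for a day has not been shown to be covered by the model.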

So an army of crutches is very useful. The main thing is that their service is urgent and not permanent.

Additional training

Additional training is the process of adding new data obtained as a result of feedback from users or other systems, and training the existing model on it. There can be several problems with additional training:

  1. The model may simply not support additional training and only learn from scratch.
  2. Nowhere in the book of nature is it written that additional training will necessarily improve the quality of work in production. Often the opposite happens, that is, only deterioration is possible.
  3. Changes can be unpredictable. This is a rather subtle point that we have identified for ourselves. Even if a new model in an A/B test shows similar results compared to the current one, this does not mean that it will work identically. Their work may differ by just one percent, which can bring new errors or bring back old ones that have already been fixed. Both we and the users already know how to live with the current errors, and when a large number of new errors appears, the user may not understand what is happening, because they expect predictable behavior.

Therefore, the most important thing in additional training is to make sure that the model improves, or at least does not get worse.

The first thing that comes to mind when we talk about additional training is the Active Learning approach. What does this mean? For example, the classifier determines whether an email is related to finance, and around its decision boundary we add a sample of labeled examples. This works well, for example, in advertising, where there is a lot of feedback and you can train the model online. But if there is little feedback, then we get a heavily biased sample relative to the production data distribution, on the basis of which it is impossible to evaluate the model's behavior in operation.
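To make the idea of sampling around the decision boundary concrete, here is a minimal uncertainty-sampling sketch. The function name, the 0.5 threshold and the labeling budget are illustrative assumptions.

```python
def select_for_labeling(scored, threshold=0.5, budget=100):
    """Pick the items whose predicted probability is closest to the boundary.

    `scored` is a list of (item_id, probability) pairs produced by the
    current classifier; the returned ids are the ones worth sending for labels.
    """
    ranked = sorted(scored, key=lambda pair: abs(pair[1] - threshold))
    return [item_id for item_id, _ in ranked[:budget]]
```

The sketch also shows where the bias comes from: every selected example sits near the boundary, so the resulting sample does not follow the production distribution.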

In fact, our goal is to preserve the old patterns, the ones the model already knows, and to acquire new ones. Continuity is important here. The model, which we often took great pains to roll out, is already working, so we can rely on its performance.

Different models are used in Mail: trees, linear models, neural networks. For each we make our own additional training algorithm. In the process of additional training, we receive not only new data but often also new features, which we take into account in all the algorithms below.

Linear models

Say we have logistic regression. We compose the loss from the following components:

  • LogLoss on the new data;
  • we regularize the weights of the new features (we do not touch the old ones);
  • we also learn from the old data in order to preserve the old patterns;
  • and, perhaps most importantly, we add Harmonic Regularization, which guarantees that the weights will not change much relative to the old model according to the norm.

Since each Loss component has a coefficient, we can choose the optimal values for our task through cross-validation or based on product requirements.
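The four components can be written down as one function. This is a sketch under assumptions: the harmonic term is taken to be the squared L2 distance to the old weights (the article only says "according to the norm"), new features are appended after the old ones, and old rows are zero-padded for the new features. The coefficient names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, X, y):
    eps = 1e-12
    total = 0.0
    for x, t in zip(X, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(X)


def incremental_loss(w, w_old, new_data, old_data,
                     alpha=1.0, beta=0.5, l2_new=0.01, gamma=1.0):
    """Loss for additional training of a logistic regression.

    w      -- candidate weights (old features first, then new features)
    w_old  -- weights of the production model (old features only)
    """
    X_new, y_new = new_data
    X_old, y_old = old_data
    n_old = len(w_old)
    loss_new = log_loss(w, X_new, y_new)                 # LogLoss on new data
    loss_old = log_loss(w, X_old, y_old)                 # keep the old patterns
    reg_new = sum(wi * wi for wi in w[n_old:])           # only new-feature weights
    harmonic = sum((wi - wo) ** 2 for wi, wo in zip(w, w_old))  # stay near old model
    return alpha * loss_new + beta * loss_old + l2_new * reg_new + gamma * harmonic
```

Any gradient-based or derivative-free optimizer can minimize this; `alpha`, `beta`, `l2_new` and `gamma` are the coefficients that would be tuned by cross-validation or set from product requirements.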

Trees

Let's move on to decision trees. We have put together the following algorithm for additional training of trees:

  1. In production there runs a forest of 100-300 trees, trained on the old data set.
  2. At the end we remove M = 5 trees and add 2M = 10 new ones, trained on the entire data set but with a high weight for the new data, which naturally guarantees an incremental change in the model.
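The two steps above can be illustrated with a toy gradient-boosted ensemble of decision stumps. Everything here is an assumption made for the sake of a runnable example: the real system's trees, loss and library are not described in the article, "at the end" is read as "the last M trees of the ensemble", and `new_weight` is a hypothetical knob for up-weighting new data.

```python
def fit_stump(X, residuals, weights):
    """Fit a weighted least-squares stump (a single split) to the residuals."""
    mean = sum(r * w for r, w in zip(residuals, weights)) / sum(weights)
    best_err, best_split = float("inf"), None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:  # both sides non-empty
            left = [(r, w) for row, r, w in zip(X, residuals, weights) if row[j] <= thr]
            right = [(r, w) for row, r, w in zip(X, residuals, weights) if row[j] > thr]
            lv = sum(r * w for r, w in left) / sum(w for _, w in left)
            rv = sum(r * w for r, w in right) / sum(w for _, w in right)
            err = (sum(w * (r - lv) ** 2 for r, w in left)
                   + sum(w * (r - rv) ** 2 for r, w in right))
            if err < best_err:
                best_err, best_split = err, (j, thr, lv, rv)
    if best_split is None:          # all features constant: predict the mean
        return lambda row: mean
    j, thr, lv, rv = best_split
    return lambda row: lv if row[j] <= thr else rv


def predict(forest, row, lr=0.1):
    return sum(lr * tree(row) for tree in forest)


def retrain_forest(forest, X, y, is_new, m=5, new_weight=5.0, lr=0.1):
    """Drop the last m trees, then boost 2*m new ones on all data,
    giving rows flagged in `is_new` a higher sample weight."""
    kept = list(forest[:max(len(forest) - m, 0)])
    weights = [new_weight if flag else 1.0 for flag in is_new]
    for _ in range(2 * m):
        residuals = [t - predict(kept, row, lr) for row, t in zip(X, y)]
        kept.append(fit_stump(X, residuals, weights))
    return kept
```

Most of the ensemble is untouched, so predictions can only shift by what the replaced and added trees contribute, which is the "incremental change" the list describes.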

Obviously, over time the number of trees grows considerably, and they must be reduced periodically to fit within the timings. To do this, we use the now ubiquitous Knowledge Distillation (KD). Briefly about how it works.

  1. We have the current "complex" model. We run it on the training data set and get the class probability distribution at the output.
  2. Next, we train the student model (in this case, a model with fewer trees) to repeat the results of the complex model, using the class distribution as the target variable.
  3. It is important to note here that we do not use the data set's labels in any way, and therefore we can use arbitrary data. Of course, we use a sample of data from the production stream as the training sample for the student model. Thus, the training set allows us to ensure the model's accuracy, and the stream sample guarantees similar performance on the production distribution, compensating for the bias of the training set.
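The three steps can be sketched as follows. Note the substitution: for brevity the student here is a small logistic regression trained by SGD, not a smaller forest as in the article, and `teacher` is any callable returning a probability. All names and hyperparameters are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def distill(teacher, train_rows, stream_rows, epochs=200, lr=0.5, seed=0):
    """Train a student to reproduce the teacher's probabilities.

    No ground-truth labels are used: the targets are teacher outputs on the
    training rows plus an unlabeled sample from the production stream.
    """
    rows = [list(r) for r in train_rows] + [list(r) for r in stream_rows]
    soft_targets = [teacher(r) for r in rows]
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0]), 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, rows[i])) + b)
            grad = p - soft_targets[i]   # gradient of cross-entropy with soft targets
            w = [wi - lr * grad * xi for wi, xi in zip(w, rows[i])]
            b -= lr * grad
    return lambda row: sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
```

Changing the ratio of `train_rows` to `stream_rows` is the lever discussed later in the article: heavier manipulations of the model call for a larger share of stream data.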

The combination of these two techniques (adding trees and periodically reducing their number using Knowledge Distillation) ensures the introduction of new patterns and complete continuity.

With the help of KD, we also perform various operations on model features, such as removing features and working with gaps. In our case, we have a number of important statistical features (by senders, text hashes, URLs and so on) that are stored in a database, which tends to fail. The model, of course, is not ready for such a turn of events, since failure situations do not occur in the training set. In such cases, we combine KD with augmentation techniques: when training, for part of the data we remove or zero out the necessary features, while keeping the original labels (the outputs of the current model), and the student model learns to reproduce this distribution.
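A sketch of how such an augmented distillation set could be built. The function name, `drop_share` and the choice of zero as the "missing" value are assumptions for illustration.

```python
import random


def augment_with_dropout(rows, teacher, drop_indices, drop_share=0.3, seed=0):
    """Build (student_input, soft_target) pairs for distillation.

    The target is always the teacher's output on the intact row; for a share
    of rows the student sees the same row with the fragile features zeroed,
    which simulates the feature database being unavailable.
    """
    rng = random.Random(seed)
    pairs = []
    for row in rows:
        target = teacher(row)            # computed before any features are dropped
        student_row = list(row)
        if rng.random() < drop_share:
            for j in drop_indices:
                student_row[j] = 0.0
        pairs.append((student_row, target))
    return pairs
```

The student therefore learns to give the "healthy" answer even when the statistical features are missing, which the original training set could never teach it.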

We noticed that the more serious the manipulation of the model, the larger the percentage of the stream sample required.

Removing features, the simplest operation, requires only a small part of the stream, since only a couple of features change and the current model was trained on the same set, so the difference is minimal. To simplify the model (reducing the number of trees several times over), a 50/50 split is already required. And for omissions of important statistical features, which seriously affect the model's performance, even more of the stream is required to level out the work of the new omission-resistant model on all types of emails.

FastText

Let's move on to FastText. Let me remind you that the representation (Embedding) of a word consists of the sum of the embedding of the word itself and of all its character N-grams, usually trigrams. Since there can be a great many trigrams, Bucket Hashing is used, that is, the entire space is mapped into a hashmap of fixed size. As a result, the weight matrix has the dimension of the inner layer by the number of words + buckets.
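The representation just described can be sketched in a few lines. This is an illustration, not the fastText source: the FNV-1a hash, the `<`/`>` word-boundary markers and the row layout (word rows first, bucket rows after) are assumptions, and all function names are hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def feature_rows(word, vocab, n_buckets):
    """Row indices into a (len(vocab) + n_buckets) x dim weight matrix."""
    rows = [vocab[word]] if word in vocab else []
    rows += [len(vocab) + fnv1a(g) % n_buckets for g in char_ngrams(word)]
    return rows


def embed(word, vocab, matrix, n_buckets):
    """Word embedding as the sum of its word row and its n-gram bucket rows."""
    rows = feature_rows(word, vocab, n_buckets)
    return [sum(matrix[r][k] for r in rows) for k in range(len(matrix[0]))]
```

Because n-grams share a fixed number of buckets, an unseen word still gets a representation, and the matrix size stays bounded no matter how many distinct trigrams appear.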

With additional training, new features appear: words and trigrams. Nothing significant happens in the standard follow-up training from Facebook. Only the old weights are retrained with cross-entropy on the new data. Thus, the new features are not used; of course, this approach has all the disadvantages described above associated with the unpredictability of the model in production. That is why we modified FastText a little. We add all the new weights (words and trigrams), train the entire matrix with cross-entropy and add harmonic regularization by analogy with the linear model, which guarantees an insignificant change in the old weights.

CNN

Convolutional networks are a bit more complicated. If only the last layers of the CNN are trained further, then, of course, you can apply harmonic regularization and guarantee continuity. But if additional training of the entire network is required, then such regularization can no longer be applied to all layers. However, there is an option to train complementary embeddings through Triplet Loss (original paper).

Triplet Loss

Using the anti-phishing task as an example, let's look at Triplet Loss in general terms. We take our logo, as well as positive and negative examples of logos of other companies. We minimize the distance between the first and maximize the distance between the second, and we do this with a small gap (margin) to ensure greater compactness of the classes.
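In its standard form, the loss for one (anchor, positive, negative) triplet of embeddings looks like this. Euclidean distance and the margin value of 0.2 are common defaults assumed here, not values taken from the article.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)
```

For example, with the anchor at `[0, 0]`, a positive at `[0, 1]` and a negative at `[3, 4]`, the negative is already far enough away and the loss is zero; moving the negative onto the positive leaves a loss equal to the margin.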

If we train the network further, our metric space changes completely and becomes completely incompatible with the previous one. This is a serious problem in tasks that use vectors. To get around it, we mix in old embeddings during training.

We have added new data to the training set and are training the second version of the model from scratch. In the second stage, we further train our network (Finetuning): first the last layer is trained, and then the entire network is unfrozen. In the process of composing triplets, we compute only part of the embeddings using the model being trained, and the rest using the old one. Thus, in the process of additional training, we ensure the compatibility of the metric spaces v1 and v2. A peculiar version of harmonic regularization.
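One way to read "part of the embeddings from the trained model, the rest from the old one" is the sketch below. How the split is actually made is not stated in the article; choosing the model independently per triplet element with probability `old_share` is an assumption, as are all names and the margin value.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mixed_triplet_loss(anchor, positive, negative, new_model, old_model,
                       old_share=0.5, margin=0.2, rng=random):
    """Triplet loss where each element is embedded by the frozen old model (v1)
    with probability `old_share`, otherwise by the model being trained (v2)."""
    def embed(item):
        model = old_model if rng.random() < old_share else new_model
        return model(item)

    a, p, n = embed(anchor), embed(positive), embed(negative)
    return max(0.0, distance(a, p) - distance(a, n) + margin)
```

Whenever a triplet mixes v1 and v2 embeddings, the loss is only small if the new model places items where the old space expects them, which is what keeps the two metric spaces compatible.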

The whole architecture

If we consider the entire system using antispam as an example, the models are not isolated but nested within one another. We take images, text and other features, and using CNN and FastText we get embeddings. Next, classifiers are applied on top of the embeddings, giving scores for various classes (types of emails, spam, presence of a logo). The scores and features then go into a forest of trees for the final decision. Separate classifiers in this scheme make it possible to better interpret the system's results and, more specifically, to retrain components when problems arise, rather than feeding all the data into the decision trees in raw form.

As a result, we guarantee continuity at every level. At the lower level, in CNN and FastText, we use harmonic regularization; for the classifiers in the middle we also use harmonic regularization and score calibration for consistency of the probability distribution. And tree boosting is trained incrementally or using Knowledge Distillation.

In general, maintaining such a nested machine learning system is usually painful, since any change to a component at a lower level leads to an update of the entire system above it. But since in our setup each component changes only slightly and is compatible with the previous one, the whole system can be updated piece by piece without the need to retrain the entire structure, which allows it to be supported without serious overhead.

Deployment

We have discussed data collection and additional training of different types of models, so we move on to deploying them into the production environment.

A/B testing

As I said earlier, in the process of collecting data we usually get a biased sample, from which it is impossible to evaluate the model's production performance. Therefore, when deploying, the model must be compared with the previous version to understand how things are actually going, that is, A/B tests must be run. In fact, the process of rolling out and analyzing the graphs is quite routine and can easily be automated. We roll out our models gradually to 5%, 30%, 50% and 100% of users, while collecting all available metrics on the model's responses and user feedback. In the case of serious outliers, we automatically roll the model back, and in other cases, having collected a sufficient number of user clicks, we decide to increase the percentage. As a result, we bring the new model to 50% of users completely automatically, and the rollout to the entire audience is approved by a person, although this step can be automated as well.
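The rollout policy described above amounts to a small state machine. The sketch below assumes the stage list from the text; the click threshold, the function name and the convention that 0 means "rolled back" are hypothetical.

```python
STAGES = [5, 30, 50, 100]   # percent of users, as listed in the article


def next_rollout_percent(current, clicks, outlier_detected,
                         min_clicks=1000, auto_limit=50, human_approved=False):
    """Decide the next traffic share for a model currently at `current` percent.

    Returns 0 to signal a rollback. `current` must be one of STAGES.
    """
    if outlier_detected:
        return 0                        # serious outliers: roll back automatically
    if clicks < min_clicks:
        return current                  # not enough user feedback yet
    idx = STAGES.index(current)
    if idx == len(STAGES) - 1:
        return current                  # already at full rollout
    candidate = STAGES[idx + 1]
    if candidate > auto_limit and not human_approved:
        return current                  # the last step needs a person's approval
    return candidate
```

Calling this on a schedule with fresh metrics reproduces the flow in the text: automatic up to 50%, manual approval for 100%, automatic rollback at any stage.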

However, the A/B testing process leaves room for optimization. The fact is that any A/B test is quite long (in our case it takes from 6 to 24 hours depending on the amount of feedback), which makes it quite expensive and resource-limited. In addition, a sufficiently high percentage of the traffic is required for the test in order to speed up the overall time of the A/B test (collecting a statistically significant sample to evaluate metrics at a small percentage can take a very long time), which makes the number of A/B slots extremely limited. Obviously, we need to test only the most promising models, of which we get quite a lot during additional training.

To solve this problem, we trained a separate classifier that predicts the success of an A/B test. To do this, we take decision-making statistics, Precision, Recall and other metrics on the training set, on a held-out set, and on a sample from the stream as features. We also compare the model with the current one in production and with the heuristics, and take into account the model's Complexity. Using all these features, a classifier trained on the history of tests evaluates candidate models, in our case forests of trees, and decides which one to use in the A/B test.

At the time of rollout, this approach allowed us to increase the number of successful A/B tests several times over.

Testing & Monitoring

Testing and monitoring, oddly enough, do not harm our health; on the contrary, they improve it and relieve us of unnecessary stress. Testing lets you prevent a failure, and monitoring lets you detect it in time to reduce the impact on users.

It is important to understand here that sooner or later your system will make mistakes; this follows from the development cycle of any software. At the beginning of a system's development there are always many bugs until everything settles down and the main stage of innovation is over. But over time, entropy takes its toll, and errors appear again, due to the degradation of the components around it and changes in the data, which I spoke about at the beginning.

Here I would like to note that any machine learning system should be considered from the point of view of its benefit over its entire life cycle. The graph below shows an example of how a system works to catch a rare type of spam (the line on the graph is near zero). One day, because of an incorrectly cached attribute, it went berserk. As luck would have it, there was no monitoring for abnormal triggering; as a result, the system began saving emails in large quantities to the "Spam" folder at the decision boundary. Even though the consequences were fixed, the system has already made mistakes so many times that it will not pay for itself even in five years. And this is a complete failure from the point of view of the model's life cycle.

Therefore, such a simple thing as monitoring can become key in the life of a model. In addition to standard and obvious metrics, we consider the distribution of model responses and scores, as well as the distribution of key feature values. Using KL divergence, we can compare the current distribution with the historical one, or the values in the A/B test with the rest of the stream, which lets us notice anomalies in the model and roll back changes in time.
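A minimal sketch of such a check for model scores in the range [0, 1]. The bin count, the additive smoothing and the alert threshold are assumptions that would need tuning on real traffic.

```python
import math


def histogram(values, bins=10, lo=0.0, hi=1.0, smoothing=1.0):
    """Normalized histogram with additive smoothing so no bin is exactly zero."""
    counts = [smoothing] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def drifted(current_scores, reference_scores, threshold=0.1):
    """True if the current score distribution moved away from the reference."""
    p = histogram(current_scores)
    q = histogram(reference_scores)
    return kl_divergence(p, q) > threshold
```

The same function works for comparing an A/B test group against the rest of the stream: pass the test group's scores as `current_scores` and the control traffic as `reference_scores`.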

In most cases, we launch the first versions of our systems using simple heuristics or models that we later use as monitoring. For example, we monitor the NER model against regular expressions for specific online stores, and if the classifier's coverage drops compared to them, then we look into the reasons. Another useful use of heuristics!

Summary

Let's go over the key ideas of the article again.

  • Feedback. We always think about the user: how they will live with our mistakes and how they will be able to report them. Do not forget that users are not a source of pure feedback for training models, and it needs to be cleaned with the help of auxiliary ML systems. If it is not possible to collect a signal from the user, then we look for alternative sources of feedback, for example connected systems.
  • Additional training. The main thing here is continuity, so we rely on the current production model. We train new models so that they do not differ much from the previous one, thanks to harmonic regularization and similar tricks.
  • Deployment. Automatic deployment based on metrics greatly reduces the time needed to roll out models. Monitoring the statistics and distribution of decisions, and the number of false positives reported by users, is mandatory for your restful sleep and a productive weekend.

Well, I hope this helps you improve your ML systems faster, get them to market faster, and make them more reliable and less stressful.

Source: www.habr.com