Operating Machine Learning in Mail.ru Mail

Based on my talks at Highload++ and DataFest Minsk 2019.

For many people today, email is an integral part of life online. We use it for business correspondence and to store all kinds of important information related to finances, hotel bookings, orders and much more. In mid-2018 we formulated a product strategy for the development of Mail. What should modern email be like?

Mail should be smart, that is, it should help users navigate the growing volume of information: filter it, structure it and present it in the most convenient way. It should be useful, letting you solve various tasks right in your mailbox, for example paying fines (a feature that, unfortunately, I use myself). And at the same time, of course, mail should provide information security, cutting off spam and protecting against hacking, that is, it should be safe.

These areas define a number of key problems, many of which can be solved effectively with machine learning. Here are examples of existing features developed as part of the strategy, one for each direction.

  • Smart Reply. Mail has a smart reply feature. A neural network analyzes the text of a message, understands its meaning and purpose, and offers the three most suitable reply options: positive, negative and neutral. This saves a lot of time when answering email, and the suggested replies are often unconventional and amusing.
  • Grouping emails related to orders in online stores. We often shop online, and stores usually send several emails for each order. For example, AliExpress, the largest service, sends a lot of emails per order, and we calculated that in the extreme case their number can reach 29. So, using a Named Entity Recognition model, we extract the order number and other information from the text and group all the emails into a single thread. We also show the key information about the order in a separate block, which makes this type of email easier to work with.

  • Anti-phishing. Phishing is a particularly dangerous type of fraudulent email, with which attackers try to obtain financial information (including the user's bank card details) and logins. Such emails imitate real ones sent by the service, including visually. So, using Computer Vision, we recognize the logos and design style of emails from large companies (for example, Mail.ru, Sber, Alfa) and take this into account, along with the text and other features, in our spam and phishing classifiers.

Machine learning

A little about machine learning in mail in general. Mail is a high-load system: on average 1.5 billion emails a day pass through our servers for 30 million DAU. About 30 machine learning systems support all the necessary functions and features.

Every email passes through an entire classification pipeline. First we cut off spam and keep the good emails. Users often don't notice the work of antispam, because 95-99% of spam never even reaches the Spam folder. Spam detection is a very important part of our system, and the hardest one, since in anti-spam there is constant mutual adaptation between defense and attack systems, which gives our team a continuous engineering challenge.

Next, we separate emails from people and emails from robots. Emails from people are the most important, so we provide features such as Smart Reply for them. Emails from robots are split into two parts: transactional, which are important emails from services (for example, purchase or hotel booking confirmations, finances), and informational, which are commercial advertising and discounts.

We believe that transactional emails are as important as personal correspondence. They should be at hand, because we often need to quickly find information about an order or a flight booking, and we waste time searching for these emails. So, for convenience, we automatically split them into six main categories: travel, orders, finance, tickets, registrations and, finally, fines.

Informational emails are the largest and probably the least important group. They don't require an immediate response, since nothing significant will change in the user's life if such an email goes unread. In our new interface we collapse them into two threads, social networks and newsletters, which visually clears the inbox and leaves only important messages in view.

Operation

A large number of systems creates many operational difficulties. After all, models degrade over time like any software: features break, machines fail, code rots. In addition, data is constantly changing: new data is added, user behavior patterns shift, and so on, so a model without proper support will work worse and worse over time.

We must not forget that the deeper machine learning penetrates users' lives, the greater its influence on the ecosystem and, as a result, the more money market players can gain or lose. Therefore, in a growing number of areas, players adapt to the work of ML algorithms (classic examples are advertising, search and the antispam already mentioned).

Machine learning tasks also have a peculiarity: any change in the system, even a minor one, can generate a lot of work on the model: working with data, retraining, deployment, which can take weeks or months. So the faster the environment in which your models operate changes, the more effort it takes to maintain them. A team can create many systems and be happy about it, and then spend almost all its resources maintaining them, with no opportunity to do anything new. We once faced exactly this situation in the antispam team. And we drew the obvious conclusion that maintenance needs to be automated.

Automation

What can be automated? Almost everything, actually. I have identified four areas that define the machine learning infrastructure:

  • data collection;
  • additional training;
  • deployment;
  • testing & monitoring.

If the environment is unstable and constantly changing, then the whole infrastructure around the model turns out to be much more important than the model itself. It may be a good old linear classifier, but if you feed it the right features and set up good feedback from users, it will work much better than State-Of-The-Art models with all the bells and whistles.

Feedback loop

This loop combines data collection, additional training and deployment: in effect, the whole model update cycle. Why is it important? Look at the registration chart in Mail:

[Figure: registration chart]

A machine learning developer deployed an anti-bot model that prevents bots from registering in Mail. The graph drops to a level where only real users remain. Everything is great! But four hours pass, the bots tweak their scripts, and everything returns to where it was. In this iteration the developer spent a month adding features and retraining the model, while the spammer adapted in four hours.

To avoid this kind of excruciating pain and not have to redo everything later, we must think from the very start about what the feedback loop will look like and what we will do if the environment changes. Let's start with data collection: it is the fuel for our algorithms.

Data collection

It is clear that for modern neural networks, the more data the better, and that data is in fact generated by the users of the product. Users can help us by labeling data, but we must not abuse this, because at some point users will get tired of finishing your models for you and will switch to another product.

One of the most common mistakes (here I refer to Andrew Ng) is focusing too much on metrics on the test dataset rather than on feedback from the user, which is actually the main measure of quality, since we are building a product for the user. If the user doesn't understand or doesn't like what the model does, then everything is in vain.

So the user should always be able to vote and should be given a feedback tool. If we think a finance-related email has arrived in the mailbox, we need to label it "finance" and draw a button that the user can click to say that this is not finance.

Feedback quality

Let's talk about the quality of user feedback. First, you and the user may put different meanings into the same concept. For example, you and your product managers think that "finance" means emails from the bank, while the user thinks that an email from grandma about her pension also counts as finance. Second, there are users who simply love pressing buttons without any logic. Third, the user may be deeply mistaken in their conclusions. A striking example from our practice is the rollout of a classifier for Nigerian spam, a very funny type of spam in which the user is asked to collect several million dollars from a suddenly discovered distant relative in Africa. After deploying this classifier, we checked the "Not Spam" clicks on these emails, and it turned out that 80% of them were on genuine Nigerian spam, which suggests that users can be extremely gullible.

And let's not forget that buttons can be clicked not only by people but also by all kinds of bots that pretend to be a browser. So raw feedback is not good for training. What can be done with this information?

We use two approaches:

  • Feedback from a linked ML system. For example, we have an online anti-bot system which, as I mentioned, makes a fast decision based on a limited number of features. And there is a second, slow system that works after the fact. It has more data about the user, their behavior, and so on. As a result, it makes the better-informed decision and therefore has higher precision and recall. The difference between the outputs of these systems can be fed to the first one as training data. That way the simpler system always tries to approach the performance of the more complex one.
  • Click classification. You can simply classify every user click, assessing its validity and plausibility. We do this in antispam, using attributes of the user, their history, attributes of the sender, the text itself and the output of the classifiers. As a result, we get an automatic system that validates user feedback. And since it needs to be retrained much less often, its output can become the foundation for all the other systems. The main priority for this model is precision, because training a model on inaccurate data is fraught with consequences.
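
The first of these approaches, feedback from a linked ML system, can be sketched roughly as follows. This is a minimal illustration rather than the production pipeline: the `Event` record, its field names and the `disagreement_examples` helper are hypothetical, and the sketch assumes that both systems emit a boolean verdict for the same registration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One registration seen by both systems (hypothetical record layout)."""
    features: dict
    fast_is_bot: bool  # verdict of the fast online model
    slow_is_bot: bool  # verdict of the slower after-the-fact system


def disagreement_examples(events):
    """Collect (features, label) pairs where the slow system overruled the fast one.

    The slow verdict is used as the label, so the fast model is trained
    to imitate the better-informed system on exactly the cases it got wrong.
    """
    return [
        (e.features, e.slow_is_bot)
        for e in events
        if e.fast_is_bot != e.slow_is_bot
    ]
```

Only the disagreements are kept here because they are the cases where the fast model has something to learn; whether to also sample agreeing cases is a design choice the text does not specify.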

While we clean the data and further train our ML systems, we must not forget about users: for us, thousands or millions of errors on a chart are statistics, but for the user every bug is a tragedy. Besides the fact that the user has to somehow live with your error in the product, after giving feedback they expect that a similar situation will not happen again. Therefore it is always worth giving users not only the ability to vote, but also the ability to correct the behavior of ML systems, for example by creating personal heuristics for each feedback click; in the case of mail, this could be the ability to filter such emails by sender and subject for this particular user.

You also need to patch the model on the basis of reports or support requests, in semi-automatic or manual mode, so that other users don't suffer from similar problems.

Heuristics for training

There are two problems with these heuristics and crutches. The first is that an ever-growing number of crutches is hard to maintain, not to mention their quality and performance over the long run. The second is that the error may be infrequent, and a few clicks will not be enough to further train the model. It would seem that these two unrelated effects can be largely neutralized by the following approach.

  1. We create a temporary crutch.
  2. We send data from it to the model, which regularly retrains itself, including on the data received. Here, of course, it is important that the heuristics have high precision so as not to reduce the quality of the data in the training set.
  3. Then we set up monitoring on the crutch firing, and if after some time the crutch no longer fires and is completely covered by the model, it can be safely removed. Now this problem is unlikely to happen again.

So an army of crutches is very useful. The main thing is that their service is urgent and not permanent.
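
The three steps above can be sketched as a small wrapper around a rule. This is only an illustration under stated assumptions: the `TemporaryHeuristic` class, its counters and the `min_fired` threshold are hypothetical names, and "covered by the model" is interpreted here as "the model already agreed every time the rule fired".

```python
class TemporaryHeuristic:
    """A short-lived rule ("crutch") whose usage is monitored (hypothetical API)."""

    def __init__(self, name, rule, label):
        self.name = name
        self.rule = rule            # callable: email -> bool
        self.label = label          # label assigned when the rule fires
        self.fired_total = 0
        self.fired_alone = 0        # rule fired but the model disagreed
        self.training_queue = []    # examples to feed into the next retraining

    def check(self, email, model_label):
        """Apply the rule, update monitoring counters, return True if it fired."""
        if not self.rule(email):
            return False
        self.fired_total += 1
        if model_label != self.label:
            self.fired_alone += 1
            self.training_queue.append((email, self.label))
        return True

    def is_redundant(self, min_fired=100):
        """True once the model agreed on every one of at least `min_fired` hits."""
        return self.fired_total >= min_fired and self.fired_alone == 0

    def reset_window(self):
        """Start a new monitoring window, e.g. after the model was retrained."""
        self.fired_total = 0
        self.fired_alone = 0
```

A rule would typically be checked next to the model's own prediction, its `training_queue` drained at each retraining, and the rule deleted once `is_redundant()` holds for a full window.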

Additional training

Additional training is the process of adding new data, obtained from the feedback of users or other systems, and training the existing model on it. Additional training can run into several problems:

  1. The model may simply not support additional training and only be able to learn from scratch.
  2. Nowhere in the book of nature is it written that additional training will certainly improve the quality of work in production. Often the opposite happens, that is, only degradation is possible.
  3. Changes can be unpredictable. This is a rather subtle point that we identified for ourselves. Even if a new model shows results similar to the current one in an A/B test, this does not mean that it will work identically. Their behavior may differ in just one percent of cases, which can bring new errors or bring back old ones that were already fixed. Both we and the users already know how to live with the current errors, and when a large number of new errors appears, the user may not understand what is happening, because they expect predictable behavior.

Therefore, the most important thing in additional training is to guarantee that the model improves, or at least does not get worse.

The first thing that comes to mind when we talk about additional training is the Active Learning approach. What does it mean? For example, a classifier determines whether an email is related to finance, and around its decision boundary we add a sample of labeled examples. This works well, for example, in advertising, where there is a lot of feedback and the model can be trained online. But if there is little feedback, we get a sample that is heavily biased relative to the production data distribution, and from it we cannot assess how the model will behave in operation.

In fact, our goal is to preserve the old patterns, the already known models, and acquire new ones. Continuity is important here. The model, which we often rolled out with great pains, is already working, so we can rely on its performance.

Different models are used in mail: trees, linear models, neural networks. For each we write our own additional training algorithm. In the course of additional training we receive not only new data but often new features as well, which we take into account in all the algorithms below.

Linear models

Suppose we have logistic regression. We build the model's loss from the following components:

  • LogLoss on the new data;
  • regularization of the weights of the new features (we don't touch the old ones);
  • learning on the old data as well, in order to preserve the old patterns;
  • and, perhaps most importantly, Harmonic Regularization, which guarantees that the weights will not change much relative to the old model in terms of norm.

Since each component of the loss has a coefficient, we can select the optimal values for our task through cross-validation or based on product requirements.
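
The four components above could be combined as in the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names and coefficient names (`old_coef`, `new_feat_coef`, `harmonic_coef`) are hypothetical, new features are assumed to be appended after the old ones in the weight vector, and harmonic regularization is assumed to be the squared L2 distance between the new and old weights (the text only says that the weights must stay close to the old model in norm).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, data, eps=1e-12):
    """Mean log loss of weights `w` on (features, label) pairs.

    zip() stops at the shorter sequence, so old examples that lack the
    newly added features are scored with the old part of `w` only.
    """
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def total_loss(w, w_old, new_data, old_data,
               old_coef=0.5, new_feat_coef=0.1, harmonic_coef=1.0):
    """Combined loss: new data + old data + L2 on new weights + drift penalty."""
    n_old = len(w_old)
    # L2 penalty on the weights of newly added features only.
    new_feat_penalty = sum(wi * wi for wi in w[n_old:])
    # Assumed form of harmonic regularization: stay close to the old weights.
    drift = sum((wi - oi) ** 2 for wi, oi in zip(w, w_old))
    return (log_loss(w, new_data)
            + old_coef * log_loss(w, old_data)
            + new_feat_coef * new_feat_penalty
            + harmonic_coef * drift)
```

Any gradient-based or derivative-free optimizer can minimize `total_loss` over `w`; raising `harmonic_coef` trades adaptation to new data for continuity with the old model.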

Trees

Let's move on to decision trees. We came up with the following algorithm for additional training of trees:

  1. A forest of 100-300 trees, trained on the old dataset, is running in production.
  2. We remove the last M = 5 trees and add 2M = 10 new ones, trained on the whole dataset but with a high weight on the new data, which naturally guarantees an incremental change to the model.
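
The bookkeeping of this algorithm can be sketched as below. This is a structural illustration only: `train_tree` is a hypothetical callable standing in for whatever fits one more tree given the current ensemble (as in boosting), and `new_weight` is an assumed sample weight for new data, not a value from the text.

```python
def update_forest(forest, train_tree, old_data, new_data, m=5, new_weight=5.0):
    """Drop the last `m` trees and append 2*m trees fitted with new data up-weighted.

    `train_tree(ensemble, weighted_examples)` is a hypothetical callable that
    fits one more tree given the current ensemble, as in boosting.
    """
    if not 0 < m <= len(forest):
        raise ValueError("m must be between 1 and the forest size")
    kept = list(forest[:-m])  # copy, so the production forest is left intact
    weighted = ([(x, y, 1.0) for x, y in old_data]
                + [(x, y, new_weight) for x, y in new_data])
    for _ in range(2 * m):
        kept.append(train_tree(kept, weighted))
    return kept
```

Each update grows the forest by M trees, which is why the text goes on to discuss periodically shrinking it.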

Obviously, over time the number of trees grows considerably, and it has to be reduced periodically to fit the timing constraints. For this we use the now ubiquitous Knowledge Distillation (KD). Briefly, here is how it works.

  1. We have the current "complex" model. We run it on the training dataset and get the class probability distribution at the output.
  2. Next, we train the student model (in this case, a model with fewer trees) to reproduce the results of the complex model, using the class distribution as the target variable.
  3. It is important to note that we don't use the dataset labels in any way here, so we can use an arbitrary dataset. Naturally, we use a sample of data from the production stream as training data for the student model. Thus, the training set lets us ensure the accuracy of the model, while the stream sample guarantees similar performance on the production distribution, compensating for the bias of the training set.
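
The distillation steps above can be illustrated with a deliberately tiny student. This is a sketch under stated assumptions: inputs are one-dimensional numbers, the teacher is any callable returning the probability of the positive class, the student is a single decision stump, and squared error against the teacher's probabilities is used as the fitting criterion. None of these choices come from the text; they only keep the example short.

```python
def distill_stump(teacher, unlabeled_xs):
    """Fit a one-split student that reproduces the teacher's probabilities.

    No ground-truth labels are used: the targets are the teacher's outputs
    on `unlabeled_xs`, which may be a sample of the production stream.
    """
    soft = [(x, teacher(x)) for x in unlabeled_xs]
    best = None
    for t in sorted(set(unlabeled_xs)):
        left = [p for x, p in soft if x <= t]
        right = [p for x, p in soft if x > t]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((p - left_mean) ** 2 for p in left)
               + sum((p - right_mean) ** 2 for p in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct inputs")
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean
```

In the setting described in the text the student would be a smaller forest rather than a stump, but the data flow is the same: teacher outputs on unlabeled data become the student's regression targets.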

The combination of these two techniques (adding trees and periodically reducing their number with Knowledge Distillation) ensures that new patterns are introduced while full continuity is preserved.

With KD we also perform various operations on model features, such as removing features and handling missing values. In our case there are a number of important statistical features (by sender, text hashes, URLs and so on) that are stored in a database, which tends to fail. The model, of course, is not prepared for such a turn of events, since failure situations don't occur in the training set. In such cases we combine KD with augmentation techniques: when training on part of the data, we remove or zero out the relevant features while taking the original labels (the outputs of the current model), and the student model learns to reproduce this distribution.

We noticed that the more serious the manipulation of the model, the larger the share of the stream sample that is required.

Removing features, the simplest operation, requires only a small part of the stream, since only a couple of features change and the current model was trained on the same set: the difference is minimal. Simplifying the model (reducing the number of trees several times) already requires a 50/50 split. And for omissions of important statistical features, even more of the stream is needed to level out the behavior of the new, omission-resistant model on all types of emails.

FastText

Let's move on to FastText. Recall that the representation (Embedding) of a word is the sum of the embedding of the word itself and of all its character N-grams, usually trigrams. Since there can be quite a lot of trigrams, Bucket Hashing is used, that is, the whole space is mapped into a fixed-size hash map. As a result, the weight matrix has dimensions of the inner layer size by the number of words plus buckets.

With additional training, new features appear: words and trigrams. In the standard additional training from Facebook, nothing significant happens: only the old weights are retrained with cross-entropy on the new data. Thus the new features are not used; and of course this approach has all the drawbacks described above that come from the unpredictability of the model in production. That is why we modified FastText a little. We add all the new weights (words and trigrams), train the whole matrix with cross-entropy and add harmonic regularization by analogy with the linear model, which guarantees that the old weights change only slightly.
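
The word representation and the "add rows for new words" step can be sketched as follows. This is a toy illustration, not the fastText library's internals: the FNV-1a hash is used as a stand-in hashing function, the row layout (buckets first, word rows after them) is chosen only so that appending words does not shift existing rows, and all function names are hypothetical. Training with cross-entropy and the regularizer is omitted.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def row_ids(word, vocab, n_buckets):
    """Rows used for a word: hashed n-gram buckets, then the word's own row.

    Buckets occupy rows [0, n_buckets); word rows follow (sketch-only layout).
    """
    ids = [fnv1a_32(g) % n_buckets for g in char_ngrams(word)]
    if word in vocab:
        ids.append(n_buckets + vocab[word])
    return ids


def embed(word, vocab, matrix, n_buckets):
    """Word representation: sum of its word row and its n-gram bucket rows."""
    dim = len(matrix[0])
    out = [0.0] * dim
    for r in row_ids(word, vocab, n_buckets):
        for j in range(dim):
            out[j] += matrix[r][j]
    return out


def add_words(vocab, matrix, new_words):
    """Append zero-initialised rows for unseen words; old rows are untouched."""
    dim = len(matrix[0])
    for w in new_words:
        if w not in vocab:
            vocab[w] = len(vocab)
            matrix.append([0.0] * dim)
```

After `add_words`, the old rows can serve as the reference point for a drift penalty of the same kind as in the linear model, so that only the newly added rows are free to move far.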

CNN

Convolutional networks are somewhat more complicated. If only the last layers of a CNN are trained further, then, of course, you can apply harmonic regularization and guarantee continuity. But if the whole network needs additional training, such regularization can no longer be applied to all layers. However, there is an option to train complementary embeddings via Triplet Loss (original paper).

Triplet Loss

Let's look at Triplet Loss briefly, using the anti-phishing task as an example. We take our logo, as well as positive and negative examples of logos of other companies. We minimize the distance between the former and maximize the distance to the latter, with a small margin, in order to make the classes more compact.
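
For a single triplet, the loss described above is commonly written as max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below uses Euclidean distance and a margin of 0.2; both are assumptions made for illustration, since the text does not state which distance or margin is used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Here `anchor` would be the embedding of a logo, `positive` another image of the same logo and `negative` a logo of a different company.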

If we train the network further, our metric space changes completely and becomes entirely incompatible with the previous one. This is a serious problem in tasks that use vectors. To get around it, we mix in old embeddings during training.

We have added new data to the training set and are training the second version of the model from scratch. In the second stage we further train our network (Finetuning): first the last layer is trained, and then the whole network is unfrozen. When composing triplets, we compute only part of the embeddings with the model being trained and the rest with the old one. Thus, in the course of additional training, we ensure the compatibility of the metric spaces v1 and v2. It is a peculiar version of harmonic regularization.
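
The mixing of old and new embeddings when composing triplets can be sketched like this. It is an assumption-laden illustration: the text says only that part of the embeddings comes from the old model, so the per-item random choice, the `p_old` share and the function names below are hypothetical.

```python
import random


def embed_triplet(triplet, embed_new, embed_old, p_old=0.5, rng=random):
    """Embed (anchor, positive, negative), sending each item to the old model
    with probability `p_old` and to the model being trained otherwise."""
    return tuple(
        embed_old(item) if rng.random() < p_old else embed_new(item)
        for item in triplet
    )
```

Because the old model is frozen, its embeddings act as fixed reference points: the loss on a mixed triplet pulls the new model's vectors toward positions that are consistent with the old metric space.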

Overall architecture

If we look at the whole system using antispam as an example, the models are not isolated but nested within one another. We take images, text and other features, and get embeddings using CNN and Fast Text. Next, classifiers are applied on top of the embeddings, producing scores for various classes (email types, spam, presence of a logo). The scores and features then go into a forest of trees, which makes the final decision. Having separate classifiers in this scheme makes it possible to interpret the system's results better and, more specifically, to retrain individual components when problems arise, rather than feeding all the data into decision trees in raw form.

As a result, we guarantee continuity at every level. At the bottom level, in CNN and Fast Text, we use harmonic regularization; for the classifiers in the middle we also use harmonic regularization, plus score calibration to keep the probability distribution consistent. And the tree boosting is trained incrementally or with Knowledge Distillation.

In general, maintaining such a nested machine learning system is usually painful, since a change in any component at the lower level forces an update of the whole system above it. But since in our setup each component changes only slightly and remains compatible with the previous one, the whole system can be updated piece by piece without retraining the entire structure, which allows it to be maintained without serious overhead.

Deployment

We have discussed data collection and additional training of different types of models, so let's move on to deploying them into the production environment.

A/B testing

As I said earlier, in the process of data collection we usually get a biased sample, from which the production performance of the model cannot be evaluated. Therefore, when deploying, the model must be compared with the previous version in order to understand how things are really going, that is, we must run A/B tests. In fact, the process of rolling out and analyzing charts is quite routine and lends itself well to automation. We roll out our models gradually to 5%, 30%, 50% and 100% of users, while collecting all available metrics on model responses and user feedback. In the case of serious outliers we automatically roll the model back, and otherwise, once a sufficient number of user clicks has been collected, we decide to increase the percentage. As a result, we bring a new model to 50% of users completely automatically, while the rollout to the entire audience is approved by a person, although this step can be automated too.
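
The rollout logic described above can be sketched as a small decision function. Only the stage percentages and the "automatic up to 50%, human approval for 100%" rule come from the text; the function name, the two boolean inputs and the action names are hypothetical, and how `metrics_ok` and `enough_feedback` are computed is left out.

```python
STAGES = [5, 30, 50, 100]   # rollout percentages from the text
AUTO_LIMIT = 50             # beyond this share a human approves the step


def next_action(current_pct, metrics_ok, enough_feedback):
    """Return (action, target_percent) for a model currently at `current_pct`."""
    if not metrics_ok:
        return ("rollback", 0)
    if current_pct >= STAGES[-1]:
        return ("done", current_pct)
    if not enough_feedback:
        return ("wait", current_pct)
    target = next(s for s in STAGES if s > current_pct)
    if current_pct >= AUTO_LIMIT:
        return ("ask_human", target)
    return ("advance", target)
```

A scheduler would call `next_action` periodically for every model in rollout and act on the result.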

Nevertheless, the A/B testing process leaves room for optimization. The fact is that any A/B test is quite long (in our case it takes from 6 to 24 hours depending on the amount of feedback), which makes it rather expensive and limited in resources. In addition, a sufficiently high percentage of the stream is needed for the test in order to shorten the overall A/B test time (collecting a statistically significant sample for evaluating metrics at a small percentage can take a very long time), which makes the number of A/B slots extremely limited. Obviously, we need to test only the most promising models, of which we get quite a lot in the course of additional training.

To solve this problem, we trained a separate classifier that predicts the success of an A/B test. To do this, we take as features the decision-making statistics, Precision, Recall and other metrics on the training set, on the holdout set, and on a sample from the stream. We also compare the model with the one currently in production and with heuristics, and take the Complexity of the model into account. Using all these features, a classifier trained on the history of tests evaluates candidate models (in our case these are forests of trees) and decides which one to use in the A/B test.

At the time of implementation, this approach allowed us to increase the number of successful A/B tests several times.

Testing & monitoring

Testing and monitoring, oddly enough, do not harm our health; on the contrary, they improve it and relieve us of unnecessary stress. Testing lets you prevent a failure, and monitoring lets you detect it in time to reduce the impact on users.

It is important to understand here that sooner or later your system will make mistakes: this comes from the development cycle of any software. At the beginning of a system's development there are always many bugs, until everything settles down and the main stage of innovation is completed. But over time entropy takes its toll, and errors appear again, because of the degradation of the surrounding components and changes in the data, which I talked about at the beginning.

Here I would like to note that any machine learning system should be considered in terms of its profit over its entire life cycle. The graph below shows an example of a system that catches a rare type of spam (the line in the graph is near zero). One day, because of an incorrectly cached attribute, it went crazy. As luck would have it, there was no monitoring for abnormal triggering; as a result, the system started sending large numbers of emails at the decision boundary to the "spam" folder. Even after the consequences were corrected, the system had already made so many mistakes that it will not pay for itself even in five years. And this is a complete failure from the point of view of the model's life cycle.

[Figure: triggering of the rare-spam detection system over time]

Therefore, such a simple thing as monitoring can become key in the life of a model. In addition to the standard and obvious metrics, we track the distribution of model responses and scores, as well as the distribution of the values of key features. Using KL divergence, we can compare the current distribution with the historical one, or the values in the A/B test with the rest of the stream, which lets us notice anomalies in the model and roll back changes in time.
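
A minimal sketch of such a check is shown below. It assumes that scores have already been binned into histograms with identical bins; the smoothing constant `eps`, the `drifted` helper and the alert threshold of 0.1 are hypothetical choices for illustration, not values from the text.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) for two histograms over the same bins (smoothed by `eps`)."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def drifted(current_counts, reference_counts, threshold=0.1):
    """Flag the model when the score histogram moved too far from the reference."""
    return kl_divergence(current_counts, reference_counts) > threshold
```

The same function can compare the A/B test group against the rest of the stream by passing the two groups' histograms instead of current and historical ones.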

In most cases we launch the first versions of our systems using simple heuristics or models, which we later use for monitoring. For example, we monitor the NER model against regular expressions for specific online stores, and if the classifier's coverage drops compared to them, we investigate the reasons. Another useful application of heuristics!

Summary

Let's go over the key ideas of the article once more.

  • Feedback. We always think about the user: how they will live with our mistakes and how they will be able to report them. Don't forget that users are not a source of clean feedback for training models, and it needs to be cleaned with the help of auxiliary ML systems. If it is not possible to collect a signal from the user, we look for alternative sources of feedback, for example, linked systems.
  • Additional training. The main thing here is continuity, so we rely on the current production model. We train new models so that they don't differ much from the previous one, thanks to harmonic regularization and similar tricks.
  • Deployment. Automatic deployment based on metrics greatly reduces the time it takes to roll out models. Monitoring statistics, the distribution of decisions and the number of false positives reported by users is essential for your restful sleep and productive weekends.

Well, I hope this helps you improve your ML systems faster, speed up their time to market, and make them more reliable and less stressful.

Source: www.habr.com
