How to open comments and not drown in spam

When your job is to create something beautiful, you don't need to say much about it, because the result is in front of everyone's eyes. But if you scrub graffiti off fences, nobody notices your work as long as the fences look decent, or until you scrub off something you shouldn't have.

Any service where you can leave a comment, write a review, send a message, or upload pictures will sooner or later face the problem of spam, fraud, and obscenity. It cannot be avoided, but it has to be dealt with.

My name is Mikhail, and I work on the Antispam team, which protects users of Yandex services from such problems. Our work rarely gets noticed (and that's a good thing!), so today I'll tell you more about it. You will learn when moderation is useless and why precision is not the only measure of its effectiveness. We will also talk about swearing, using cats and dogs as the example, and about why it sometimes helps to "think like a foul-mouth."

More and more services are appearing at Yandex where users publish their own content. You can ask a question or write an answer in Yandex.Q, discuss neighborhood news in Yandex.District, or share traffic conditions in conversations on Yandex.Maps. But as a service's audience grows, it becomes attractive to scammers and spammers. They come and fill the comments: they offer easy money, advertise miracle cures, and promise social benefits. Because of spammers, some users lose money, while others lose the desire to spend time on a neglected service overgrown with spam.

And that is not the only problem. We try not only to protect users from scammers, but also to create a comfortable atmosphere for communication. If people run into swearing and insults in the comments, they are likely to leave and never come back. That means you need to be able to deal with this too.

Clean Web

As often happens with us, the first developments were born in Search, in the part that fights spam in search results. About ten years ago, the task arose there of filtering adult content out of family search and out of queries that did not call for 18+ answers. That is how the first hand-typed dictionaries of porn and swear words appeared, filled in by analysts. The main task was to sort queries into those where showing adult content is acceptable and those where it is not. For this task, markup was collected, heuristics were built, and models were trained. That is how the first developments for filtering unwanted content came about.

Over time, UGC (user-generated content) began to appear at Yandex: messages written by users themselves, which Yandex only publishes. For the reasons described above, many messages could not be published without review, so moderation was required. It was then decided to create a service that would protect all Yandex UGC products against spam and attackers and would reuse Search's developments in filtering unwanted content. The service was called "Clean Web".

New tasks and help from tolokers

At first, only simple automation worked for us: services sent us texts, and we ran them through swear-word dictionaries, porn dictionaries, and regular expressions, all of which analysts compiled by hand. But over time the service was used in a growing number of Yandex products, and we had to learn to handle new problems.

Often, instead of a review, users publish meaningless strings of letters, trying to boost their achievements. Sometimes they advertise their own company in reviews of a competitor. And sometimes they simply mix up organizations and write in a review of a pet store: "Perfectly cooked fish!" Perhaps one day artificial intelligence will learn to grasp the meaning of any text perfectly, but for now automation sometimes copes worse than a human.

It became clear that we could not do without manual markup, so we added a second stage to our pipeline: sending content to a person for manual review. This stage received the published texts in which the classifier had not seen any problems. You can easily imagine the scale of such a task, so we did not rely only on assessors but also used the "wisdom of the crowd", that is, we turned to tolokers for help. They are the ones who help us identify what the machine missed, and in doing so teach it.

Smart caching and LSH hashing

Another problem we ran into when working with comments was spam, or more precisely, its volume and the speed at which it spreads. When the Yandex.Region audience began to grow rapidly, spammers came there. They learned to get around regular expressions by slightly changing the text. The spam was, of course, still found and deleted, but at Yandex scale an unacceptable message that stays up for even 5 minutes can be seen by hundreds of people.

Of course, this did not suit us, so we built smart text caching based on LSH (locality-sensitive hashing). It works like this: we normalize the text, strip links from it, and cut it into n-grams (sequences of n letters). Then the hashes of the n-grams are computed, and an LSH vector of the document is built from them. The point is that similar texts, even if slightly altered, turn into similar vectors.
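The steps above can be sketched in Python. The article does not say which LSH scheme is used, so this sketch assumes MinHash over character trigrams; the function names, the link pattern, and the parameters (`n=3`, `num_hashes=64`) are illustrative choices, not the production setup.

```python
import hashlib
import re

# Assumed link pattern; the real normalization rules are not described.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def normalize(text: str) -> str:
    """Lowercase, strip links and punctuation, collapse whitespace."""
    text = URL_RE.sub(" ", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of all character n-grams (the whole text if it is shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> list[int]:
    """For each seeded hash function, keep the smallest n-gram hash."""
    grams = char_ngrams(normalize(text), n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt = a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of matching positions; it estimates the Jaccard overlap of the n-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = minhash_signature("Easy money, earn 5000 a day without leaving home")
altered = minhash_signature("Easy m0ney, earn 5000 a day without leaving h0me")
unrelated = minhash_signature("Does anyone know when the playground will be repaired?")
print(similarity(original, altered), similarity(original, unrelated))
```

Changing a couple of letters only affects the few trigrams that contain them, so most positions of the two signatures still agree, while an unrelated text shares almost none.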

This solution made it possible to reuse the verdicts of classifiers and tolokers for similar texts. During a spam attack, as soon as the first message passed the check and landed in the cache with a "spam" verdict, all new similar messages, even modified ones, received the same verdict and were removed automatically. Later we learned to train and automatically retrain spam classifiers, but this "smart cache" stayed with us and still helps us out regularly.
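A verdict cache of this kind might look roughly like the sketch below. It is an illustration under assumptions: the class name `VerdictCache`, the MinHash signature, the banding scheme (16 bands of 2 rows), and the 0.6 similarity threshold are all hypothetical and are not taken from the article.

```python
import hashlib
from collections import defaultdict


def signature(text: str, num_hashes: int, n: int = 3) -> tuple:
    """Compact MinHash signature over character n-grams (assumed LSH scheme)."""
    text = " ".join(text.lower().split())
    grams = {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}
    return tuple(
        min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8,
                            salt=seed.to_bytes(8, "little")).digest()
            for g in grams
        )
        for seed in range(num_hashes)
    )


class VerdictCache:
    """Stores verdicts under LSH band keys so near-duplicates find them."""

    def __init__(self, bands: int = 16, rows: int = 2, threshold: float = 0.6):
        self.bands, self.rows, self.threshold = bands, rows, threshold
        # (band index, band slice) -> list of (full signature, verdict)
        self.buckets = defaultdict(list)

    def _band_keys(self, sig):
        for b in range(self.bands):
            yield b, sig[b * self.rows:(b + 1) * self.rows]

    def put(self, text: str, verdict: str) -> None:
        sig = signature(text, self.bands * self.rows)
        for key in self._band_keys(sig):
            self.buckets[key].append((sig, verdict))

    def get(self, text: str):
        """Return the verdict of the most similar cached text, or None."""
        sig = signature(text, self.bands * self.rows)
        best_verdict, best_sim = None, 0.0
        for key in self._band_keys(sig):
            for cached_sig, verdict in self.buckets.get(key, ()):
                sim = sum(a == b for a, b in zip(sig, cached_sig)) / len(sig)
                if sim >= self.threshold and sim > best_sim:
                    best_verdict, best_sim = verdict, sim
        return best_verdict


cache = VerdictCache()
cache.put("Easy money, earn 5000 a day without leaving home", "spam")
# A lightly altered copy is expected to share band keys with the cached text.
print(cache.get("Easy m0ney, earn 5000 a day without leaving home"))
```

Only texts that share at least one band are compared in full, which keeps lookups cheap during an attack; the threshold then guards against accidental band collisions.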

A classifier of good texts

Before we had a chance to catch our breath from fighting spam, we noticed that 95% of our content was being moderated manually: classifiers react only to violations, and most texts are fine. We were loading assessors who, in 95 cases out of 100, gave the rating "Everything is OK". We had to take on an unusual task: building classifiers of good content. Fortunately, enough markup had accumulated by then.

The first classifier looked like this: we lemmatize the text (reduce words to their base form), throw out all function words, and apply a pre-built "dictionary of good lemmas". If all the words in the text are "good", then the text as a whole contains no violations. On different services, this approach immediately automated 25 to 35% of the manual markup. Of course, the approach is not ideal: it is easy to combine several innocent words into an offensive statement. But it let us quickly reach a good level of automation and bought us time to train more complex models.
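A minimal sketch of such a whitelist classifier follows. The tiny lemma table, stop-word list, and "good" dictionary are toy stand-ins invented for illustration; a real system would use a proper lemmatizer and a much larger dictionary.

```python
import re

# Toy data, assumed for the example only.
LEMMAS = {"dogs": "dog", "cats": "cat", "loved": "love", "walks": "walk"}
FUNCTION_WORDS = {"the", "a", "an", "and", "in", "of", "to", "my", "is"}
GOOD_LEMMAS = {"dog", "cat", "love", "walk", "park", "great"}


def lemmatize(word: str) -> str:
    """Look up the base form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


def is_certainly_good(text: str) -> bool:
    """True only if every content word maps to a whitelisted lemma."""
    words = re.findall(r"[a-z']+", text.lower())
    lemmas = [lemmatize(w) for w in words if w not in FUNCTION_WORDS]
    # Empty texts are not auto-approved; they go to the usual pipeline.
    return bool(lemmas) and all(lemma in GOOD_LEMMAS for lemma in lemmas)


print(is_certainly_good("My dogs loved the park"))   # every lemma is whitelisted
print(is_certainly_good("Easy money in the park"))   # "easy" and "money" are unknown
```

A text that fails the check is not declared bad; it simply continues to the other classifiers and to manual review, which is why the approach is safe to deploy early.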

Later versions of the good-text classifiers already included linear models, decision trees, and combinations of them. To detect rudeness and insults, for example, we are trying the BERT neural network. Here it is important to grasp the meaning of a word in context and the connections between words in different sentences, and BERT does this well. (Incidentally, colleagues from News recently described how the technology is used for a non-standard task: finding errors in headlines.) As a result, it became possible to automate up to 90% of the flow, depending on the service.

Precision, recall, and speed

To improve, you need to understand what benefit particular automatic classifiers bring, what effect changes to them have, and whether the quality of manual checks is degrading. For this we use the precision and recall metrics.

Precision is the share of correct verdicts among all verdicts that content is bad. The higher the precision, the fewer false positives. If you ignore precision, then in theory you can delete all the spam and obscenities, and half of the good messages along with them. On the other hand, if you rely only on precision, then the best technology is the one that catches nobody at all. That is why there is also a recall metric: the share of bad content that was identified out of the total volume of bad content. These two metrics balance each other.
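The two definitions, and both degenerate strategies mentioned above, can be checked on a toy sample. The function name and the data are made up for illustration.

```python
def precision_recall(flagged: list[bool], bad: list[bool]):
    """flagged[i]: the classifier called item i bad; bad[i]: it really is bad."""
    tp = sum(f and b for f, b in zip(flagged, bad))        # bad and caught
    fp = sum(f and not b for f, b in zip(flagged, bad))    # good but flagged
    fn = sum(b and not f for f, b in zip(flagged, bad))    # bad but missed
    precision = tp / (tp + fp) if tp + fp else None        # undefined if nothing is flagged
    recall = tp / (tp + fn) if tp + fn else None           # undefined if nothing is bad
    return precision, recall


bad = [True, True, True, True, False, False, False, False]

# "Delete everything": perfect recall, but half of the verdicts are wrong.
print(precision_recall([True] * 8, bad))
# "Catch nobody": no false positives at all, and zero recall.
print(precision_recall([False] * 8, bad))
# A realistic classifier sits somewhere in between.
print(precision_recall([True, True, False, False, True, False, False, False], bad))
```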

To measure them, we sample the entire incoming stream of each service and hand the content samples to assessors for expert evaluation and comparison with the machine's decisions.

But there is one more important metric.

I wrote above that an unacceptable message can be seen by hundreds of people even in 5 minutes. So we count how many times we showed people bad content before we hid it. This matters because it is not enough to work accurately; you also have to work quickly. And when we were building protection against swearing, we felt this to the full.
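One way to compute such a counter is sketched below. The record layout (`view_times`, `hidden_at`, both in seconds since publication) is a hypothetical format chosen for the example; the article does not describe how the metric is stored or calculated.

```python
def bad_impressions(messages: list[dict]) -> int:
    """Count views of bad messages that happened before they were hidden."""
    total = 0
    for msg in messages:
        hidden_at = msg["hidden_at"]  # None means the message is still visible
        total += sum(
            1 for t in msg["view_times"]
            if hidden_at is None or t < hidden_at
        )
    return total


bad_messages = [
    {"view_times": [10, 40, 290, 400], "hidden_at": 300},  # hidden after 5 minutes
    {"view_times": [5, 6], "hidden_at": None},             # never caught
]
print(bad_impressions(bad_messages))
```

Two classifiers with identical precision and recall can differ sharply on this number if one of them reaches its verdict minutes later than the other.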

Anti-swearing, using cats and dogs as the example

A short lyrical digression. Some might say that obscenities and insults are not as dangerous as malicious links and not as annoying as spam. But we try to maintain comfortable conditions for communication for millions of users, and people do not like coming back to places where they are insulted. It is no accident that a ban on swearing and insults is written into the rules of many communities, including Habr. But we digress.

Swear-word dictionaries cannot cope with the full richness of the Russian language. Although there are only four main swear roots, you can build from them a countless number of words that no regular expression will catch. On top of that, you can write part of a word in transliteration, replace letters with similar-looking combinations, rearrange letters, add asterisks, and so on. We respect Habr's rules, so we will demonstrate this not with live examples but with cats and dogs.

"Law," said the cat. But we understand that the cat said a different word...

We started thinking about "fuzzy matching" algorithms for our dictionary and about smarter preprocessing: we applied transliteration, glued words together across spaces and punctuation, looked for patterns, and wrote separate regular expressions for them. This approach produced results, but it often reduced precision and did not deliver the recall we wanted.
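The preprocessing idea, and the way it costs precision, can be seen in a small sketch. In keeping with the article's convention, "cat" and "dog" stand in for the forbidden words; the look-alike table and function names are assumptions made for the example.

```python
import re

# Assumed look-alike substitutions; a real table would be far larger.
LOOKALIKES = str.maketrans({"0": "o", "@": "a", "4": "a", "3": "e", "1": "i", "$": "s"})
FORBIDDEN = {"cat", "dog"}  # stand-ins for real swear roots


def normalize(text: str) -> str:
    """Undo look-alike characters, then glue the letters together."""
    text = text.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z]", "", text)  # drops spaces, punctuation, asterisks


def contains_forbidden(text: str) -> bool:
    glued = normalize(text)
    return any(word in glued for word in FORBIDDEN)


print(contains_forbidden("you c @ t"))               # spaced out with a look-alike
print(contains_forbidden("what a d*0*g"))            # asterisks and a digit
print(contains_forbidden("Good education matters"))  # "eduCATion": a false positive
```

The last call shows the trade-off: the more aggressively the text is glued and substituted, the more innocent words start to contain a forbidden root.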

Then we decided to "think like foul-mouths". We began introducing noise into the data ourselves: we rearranged letters, generated typos, replaced letters with similar-looking ones, and so on. The initial markup for this was obtained by applying swear-word dictionaries to large corpora of texts. If you take one sentence and distort it in several ways, you end up with many sentences. This way you can enlarge the training sample tens of times over. All that remained was to train, on the resulting pool, some more or less smart model that takes context into account.
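A rough sketch of this kind of noise injection is shown below, again with "cat" as the stand-in word. The three distortions, the look-alike table, and the `augment` helper are illustrative assumptions; the article does not list the exact transformations used.

```python
import random

LOOKALIKES = {"a": "@", "o": "0", "e": "3", "i": "1"}  # assumed substitutions


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap two neighbouring letters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_lookalike(word: str, rng: random.Random) -> str:
    """Replace one letter with a similar-looking character, if possible."""
    positions = [i for i, ch in enumerate(word) if ch in LOOKALIKES]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + LOOKALIKES[word[i]] + word[i + 1:]


def insert_asterisk(word: str, rng: random.Random) -> str:
    """Insert an asterisk somewhere inside the word."""
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + "*" + word[i:]


NOISE = [swap_adjacent, replace_lookalike, insert_asterisk]


def augment(sentence: str, target: str, copies: int, seed: int = 0) -> list[str]:
    """Produce `copies` distorted variants; each keeps the label of the original."""
    rng = random.Random(seed)  # seeded so the generated pool is reproducible
    variants = []
    for _ in range(copies):
        noisy = rng.choice(NOISE)(target, rng)
        variants.append(sentence.replace(target, noisy))
    return variants


for variant in augment("the cat sat on the mat", "cat", copies=5):
    print(variant)
```

Each variant inherits the label that the dictionary assigned to the original sentence, which is how one marked-up example becomes many.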

It is too early to talk about a final solution. We are still experimenting with approaches to this problem, but we can already see that a simple convolutional network with several layers significantly outperforms dictionaries and regular expressions: it manages to raise both precision and recall.

Of course, we understand that there will always be ways around even the most advanced automation, especially when the challenge is this tempting: writing so that a stupid machine won't understand. Here, as in the fight against spam, our goal is not to eliminate the very possibility of writing something obscene; our task is to make sure the game is not worth the candle.

Opening up the ability to share opinions, communicate, and comment is not hard. It is much harder to achieve safe, comfortable conditions and respectful treatment of people. And without that, no community can develop.

Source: www.habr.com