How we moderate ads

Any service whose users can create their own content (UGC, user-generated content) has to do more than solve business problems: it also has to keep that UGC in order. Poor or low-quality content moderation can erode the service's appeal to users, up to the point of shutting it down.

Today we'll tell you about the synergy between Yula and Odnoklassniki, which helps us moderate ads on Yula effectively.

Synergy in general is a very useful thing, and in today's world, where technologies and trends change very quickly, it can be a lifesaver. Why spend scarce resources and time inventing something that has already been invented and polished before you?

We thought the same when we faced the full-scale task of moderating user content: images, text and links. Our users upload millions of pieces of content to Yula every day, and without automated processing it is simply impossible to moderate all of this data by hand.

So we used a ready-made moderation platform, which by that time our colleagues at Odnoklassniki had polished to a state of "almost perfection."

Why Odnoklassniki?

Every day, tens of millions of users come to the social network and publish billions of pieces of content, from photos to videos and texts. The Odnoklassniki moderation platform helps check very large volumes of data and counter spammers and bots.

The OK moderation team has accumulated a lot of experience, having improved its tool for 12 years. Importantly, they could not only share their ready-made solutions but also customize the architecture of their platform to fit our specific tasks.

From here on, for brevity, we'll call the OK moderation platform simply "the platform."

How it all works

Data exchange between Yula and Odnoklassniki is set up through Apache Kafka.

Why we chose this tool:

  • On Yula, all ads are post-moderated, so a synchronous response was not required at the start.
  • If something goes badly wrong and Yula or Odnoklassniki becomes unavailable, including because of peak loads, the data in Kafka will not disappear and can be read later.
  • The platform was already integrated with Kafka, so most of the security issues had already been resolved.

For each ad created or modified by a user on Yula, a JSON document with the ad data is generated and put into Kafka for subsequent moderation. From Kafka, ads are loaded into the platform, where decisions on them are made automatically or manually. Bad ads are blocked with a reason, and those in which the platform finds no violations are marked as "good." Then all decisions are sent back to Yula and applied in the service.

In the end, for Yula it all comes down to simple actions: send an ad to the Odnoklassniki platform and get back a resolution of "ok," or the reason why it is "not ok."
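As a rough illustration of this exchange, here is a minimal sketch of the two messages involved. The article does not publish the actual JSON schema, so every field name below (`ad_id`, `resolution`, `reason` and so on) is a hypothetical placeholder, and the Kafka producer and consumer calls are left out on purpose.

```python
import json


def build_ad_message(ad_id, title, description, photo_urls, category, subcategory, price):
    """Serialize an ad into a JSON payload that could be put into Kafka.

    All field names are assumptions made for this example.
    """
    payload = {
        "ad_id": ad_id,
        "title": title,
        "description": description,
        "photos": photo_urls,
        "category": category,
        "subcategory": subcategory,
        "price": price,
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_decision(raw):
    """Parse a moderation decision and return (ad_id, is_ok, reason)."""
    decision = json.loads(raw)
    is_ok = decision["resolution"] == "ok"
    reason = None if is_ok else decision.get("reason", "unspecified")
    return decision["ad_id"], is_ok, reason
```

On the Yula side, the result of `parse_decision` would be enough to either leave the ad published or block it and show the reason to its author.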

Automatic processing

What happens to an ad after it hits the platform? Each ad is broken down into several entities:

  • title,
  • description,
  • photos,
  • the category and subcategory of the ad chosen by the user,
  • price.

The platform then clusters each entity to find duplicates. Text and photos are clustered according to different schemes.

Before clustering, texts are normalized to remove special characters, substituted letters and other garbage. The resulting data is split into N-grams, each of which is hashed. The result is a set of unique hashes. The similarity between two texts is determined by the Jaccard measure between the two resulting sets. If the similarity is greater than a threshold, the texts are merged into one cluster. To speed up the search for similar clusters, MinHash and locality-sensitive hashing are used.
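The text pipeline described above can be sketched in a few functions. This is only an illustration under stated assumptions: word 3-grams, BLAKE2 as the hash, 64 MinHash permutations and 16 LSH bands are all arbitrary choices made for the example, and the handling of substituted letters is omitted. None of these parameters come from the article.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def _h64(data):
    """64-bit hash of a string (BLAKE2b is an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest(), "big")


def shingle_hashes(text, n=3):
    """Split normalized text into word n-grams and return the set of their hashes."""
    words = normalize(text).split()
    count = max(1, len(words) - n + 1)
    return {_h64(" ".join(words[i:i + n])) for i in range(count)}


def jaccard(a, b):
    """Jaccard measure: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(hashes, num_perm=64):
    """For each of num_perm salted hash functions keep the minimum value over the set."""
    return [min(_h64(f"{seed}:{h}") for h in hashes) for seed in range(num_perm)]


def estimate_similarity(sig_a, sig_b):
    """The share of matching signature positions estimates the Jaccard measure."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def lsh_keys(signature, bands=16):
    """Cut the signature into bands; texts sharing any band key are candidate duplicates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}
```

With this layout only texts that share at least one `lsh_keys` entry need an exact `jaccard` comparison, which is what makes the search over millions of ads tractable.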

For photos, several image clustering options have been devised, from comparing the pHash of images to searching for duplicates with a neural network.

The last method is the most "heavy-duty" one. To train the model, triplets of images (N, A, P) were selected in which N is not similar to A, and P is similar to A (it is a semi-duplicate). The neural network then learned to bring A and P as close together as possible, and to push A and N as far apart as possible. This produces fewer false positives than simply taking embeddings from a pre-trained network.
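The article does not say which loss function was used for this training. One common formulation of the idea is the triplet margin loss, sketched below on plain Python lists; the margin value and the use of Euclidean distance are assumptions made for the example.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(A, P) - d(A, N) + margin).

    The loss is zero once N is farther from A than P is by at least the margin,
    so training pulls A and P together and pushes A and N apart.
    """
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)
```

A triplet that is already well separated contributes nothing, so the gradient comes only from triplets where the dissimilar image sits too close to the anchor.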

When the neural network receives images as input, it generates an N-dimensional (N = 128) vector for each of them, and a query is made to assess how close the images are. Next, a threshold is calculated below which close images are considered duplicates.
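As a sketch of this last step, the function below compares embedding vectors pairwise and reports the pairs that fall under a distance threshold. The brute-force loop, the Euclidean metric and the threshold value are assumptions for illustration; the article does not describe how the closeness query is implemented, and a production system would more likely use a nearest-neighbour index.

```python
import math


def find_duplicates(embeddings, threshold):
    """Return pairs of image ids whose embedding distance is below the threshold.

    embeddings: dict mapping an image id to its vector (128 values in the article,
    shorter vectors work the same way).
    """
    ids = sorted(embeddings)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if math.dist(embeddings[first], embeddings[second]) < threshold:
                pairs.append((first, second))
    return pairs
```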

The model is good at catching spammers who deliberately photograph the same product from different angles in order to bypass the pHash comparison.

An example of spam photos that the neural network glued together as duplicates.

At the final stage, duplicate ads are searched for by text and by image at the same time.

If two or more ads are glued together in a cluster, the system starts automatic blocking, which uses certain algorithms to choose which duplicates to delete and which to keep. For example, if two users have identical photos in their ads, the system will block the more recent ad.
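The example rule from this paragraph (keep the older ad, block the newer ones) could look like the sketch below. The `Ad` fields are hypothetical, and the platform's real selection algorithms are not described in the article, so this shows only the single rule mentioned.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    ad_id: int
    user_id: int
    created_at: float  # Unix timestamp; the field name is an assumption


def split_cluster(ads):
    """Keep the earliest ad of a duplicate cluster and return (kept, blocked)."""
    ordered = sorted(ads, key=lambda ad: (ad.created_at, ad.ad_id))
    return ordered[0], ordered[1:]
```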

Once created, all clusters pass through a series of automatic filters. Each filter assigns a score to the cluster: how likely it is to contain the threat that this filter detects.

For example, the system analyzes the description in an ad and selects potential categories for it. It then takes the category with the highest probability and compares it with the one specified by the ad's author. If they don't match, the ad is blocked for having the wrong category. And since we are kind and honest, we tell the user directly which category to choose so that the ad passes moderation.
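The category check reduces to an argmax followed by a comparison. In the sketch below the classifier itself is out of scope: `predicted` stands for whatever probabilities a model has produced, and the function name and return shape are assumptions for the example.

```python
def check_category(predicted, chosen):
    """Compare the most probable category with the one the author chose.

    predicted: dict mapping a category name to its probability.
    Returns (matches, suggested_category); the suggestion can be shown to the user.
    """
    best = max(predicted, key=predicted.get)
    return best == chosen, best
```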

Notification of a block for the wrong category.

Machine learning feels right at home in our platform. For example, with its help we search titles and descriptions for goods prohibited in the Russian Federation. And neural network models meticulously "examine" images to see whether they contain URLs, spam text, phone numbers, and the same "prohibited" items.

For cases where someone tries to sell a prohibited product disguised as something legal, and there is no text in either the title or the description, we use image tagging. For each image, up to 11 thousand different tags can be assigned that describe what is in the picture.

An attempt to sell a hookah by disguising it as a samovar.

In parallel with the complex filters, simple ones also run, solving obvious text-related problems:

  • profanity filter;
  • URL and phone number detector;
  • mentions of instant messengers and other contacts;
  • understated price;
  • ads in which nothing is being sold, etc.
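A simple filter of this kind can be as small as a pair of regular expressions. The patterns below are illustrative assumptions, not the platform's rules: the URL pattern only catches `http(s)://` and `www.` prefixes, and the phone pattern only covers Russian-style numbers starting with `+7` or `8`.

```python
import re

# Both patterns are deliberately narrow and would miss obfuscated contacts.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"(?:\+7|8)[\s\-(]*\d{3}[\s\-)]*\d{3}[\s\-]*\d{2}[\s\-]*\d{2}")


def detect_contacts(text):
    """Return the URLs and phone numbers found in an ad's text."""
    return {"urls": URL_RE.findall(text), "phones": PHONE_RE.findall(text)}
```

Spammers routinely write digits as words or insert extra symbols, which is one reason such filters work alongside the neural network models rather than instead of them.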

Today, every ad passes through a fine sieve of more than 50 automatic filters that try to find something bad in it.

If none of the detectors fires, a response is sent to Yula saying that the ad is "most likely" perfectly fine. We use this response ourselves, and users who have subscribed to the seller receive a notification that a new product is available.

Notification that the seller has a new product.

As a result, each ad accumulates metadata. Some of it is generated when the ad is created (the author's IP address, user agent, platform, geolocation, etc.), and the rest is the score issued by each filter.

Ad queues

When an ad hits the platform, the system puts it into one of the queues. Each queue is built with a mathematical formula that combines the ad metadata in a way that detects some bad pattern.

For example, you can create a queue of ads in the "Cell Phones" category from Yula users who are supposedly in St. Petersburg, but whose IP addresses are from Moscow or other cities.

An example of ads posted by the same user in different cities.
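The St. Petersburg example can be read as a predicate over an ad's metadata. The sketch below assumes hypothetical field names (`category`, `declared_city`, `ip_city`) and assumes that the city has already been resolved from the IP address by some geo-IP lookup; the article does not show how queue formulas are actually written.

```python
def in_city_mismatch_queue(ad):
    """Queue rule: 'Cell Phones' ads declared in St. Petersburg but posted from another city's IP."""
    return (
        ad["category"] == "Cell Phones"
        and ad["declared_city"] == "Saint Petersburg"
        and ad["ip_city"] != ad["declared_city"]
    )
```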

Or you can build queues from the scores that a neural network assigns to ads, sorting them in descending order.

Each queue, according to its own formula, assigns a final score to the ad. From there you can proceed in different ways:

  • specify a threshold at which an ad receives a certain type of block;
  • send all ads in the queue to moderators for manual review;
  • or combine the previous options: specify an automatic blocking threshold and send to moderators those ads that have not reached this threshold.

Why are these queues needed? Say a user has uploaded a photo of a firearm. The neural network assigns it a score from 95 to 100 and determines with 99 percent accuracy that there is a weapon in the picture. But if the score is below 95, the model's accuracy starts to drop (this is a characteristic of neural network models).

As a result, a queue is built on the model's score, and ads that score between 95 and 100 are automatically blocked as "Prohibited Products." Ads with a score below 95 are sent to moderators for manual processing.

A chocolate Beretta with cartridges. Manual moderation only! 🙂
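The combined option (auto-block above a threshold, manual review below it) fits in a few lines. The threshold of 95 and the "Prohibited Products" reason are taken from the weapon example above; the function name and the returned labels are assumptions made for this sketch.

```python
def route_by_score(score, auto_block_threshold=95.0):
    """Decide what happens to an ad in a score-based queue.

    Returns (action, reason): ads at or above the threshold are blocked
    automatically, everything else goes to a human moderator.
    """
    if score >= auto_block_threshold:
        return "auto_block", "Prohibited Products"
    return "manual_review", None
```

Something like the chocolate Beretta would be expected to land below the threshold and reach a moderator, which is the point of keeping a manual path.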

Manual moderation

As of early 2019, about 94% of all Yula ads are moderated automatically.

If the platform cannot decide on some ads, it sends them for manual moderation. Odnoklassniki developed its own tool for this: moderator tasks immediately display all the information needed to make a quick decision, either that the ad is fine or that it should be blocked, with the reason indicated.

And so that service quality does not suffer during manual moderation, the work of the people is constantly monitored. For example, in the task stream the moderator is shown "traps": ads for which ready-made decisions already exist. If the moderator's decision doesn't match the ready-made one, an error is counted against the moderator.

On average, a moderator spends 10 seconds checking one ad. Moreover, the number of errors is no more than 0.5% of all checked ads.
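One way to turn the "traps" into a quality metric is to compare a moderator's decisions with the known answers. This is a hypothetical sketch: the article does not say how errors are counted, and the 0.5% figure above refers to all checked ads, not necessarily to traps alone.

```python
def trap_error_rate(moderator_decisions, trap_answers):
    """Share of trap ads on which the moderator disagreed with the known decision.

    Both arguments map an ad id to a decision label; ads that are not traps are ignored.
    """
    checked = [ad_id for ad_id in trap_answers if ad_id in moderator_decisions]
    if not checked:
        return 0.0
    errors = sum(1 for ad_id in checked if moderator_decisions[ad_id] != trap_answers[ad_id])
    return errors / len(checked)
```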

Crowd moderation

Our colleagues at Odnoklassniki went even further and used "help from the audience": they wrote a game application for the social network in which you can quickly label a large amount of data, highlighting some bad sign. It is called Odnoklassniki Moderator (https://ok.ru/app/moderator). It is a good way to draw on the help of OK users who want to make the content more pleasant.

A game in which users tag photos that have a phone number on them.

Any queue of ads in the platform can be redirected to the Odnoklassniki Moderator game. Everything that the game's users flag is then sent to internal moderators for verification. This scheme lets you block ads for which filters have not yet been created, and at the same time build training samples.

Storing moderation results

We save all decisions made during moderation so that we don't re-process ads on which a decision has already been made.

Millions of clusters are created from ads every day. Over time, each cluster is labeled "good" or "bad." Each new ad, or a revision of one, that falls into a labeled cluster automatically receives the resolution of the cluster itself. There are about 20 thousand such automatic resolutions per day.

If no new ads arrive in a cluster, it is removed from memory, and its hash and resolution are written to Apache Cassandra.

When the platform receives a new ad, it first tries to find a similar cluster among those already created and take the resolution from it. If there is no such cluster, the platform goes to Cassandra and looks there. Found it? Great, it applies the resolution to the cluster and sends it to Yula. On average there are 70 thousand such "repeated" decisions every day, 8% of the total.
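The two-level lookup described here (clusters in memory first, then the persistent store) can be sketched as follows. A plain dictionary stands in for the Apache Cassandra table, and an exact match on the cluster hash stands in for the similarity search, so this shows only the control flow; class and method names are assumptions.

```python
class DecisionStore:
    """Toy model of the lookup order: active clusters first, then the archive."""

    def __init__(self):
        self.active = {}   # cluster hash -> resolution, clusters still in memory
        self.archive = {}  # stand-in for the Apache Cassandra table

    def label(self, cluster_hash, resolution):
        """Record the resolution of a cluster that is still receiving ads."""
        self.active[cluster_hash] = resolution

    def evict(self, cluster_hash):
        """Move a cluster that no longer receives ads from memory to the archive."""
        if cluster_hash in self.active:
            self.archive[cluster_hash] = self.active.pop(cluster_hash)

    def lookup(self, cluster_hash):
        """Return the stored resolution, or None if the ad needs full processing."""
        if cluster_hash in self.active:
            return self.active[cluster_hash]
        return self.archive.get(cluster_hash)
```

A `None` result means the ad has to go through the filters and queues like any new ad.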

Summing up

We have been using the Odnoklassniki moderation platform for two and a half years. We like the results:

  • We automatically moderate 94% of all ads per day.
  • The cost of moderating one ad has dropped from 2 rubles to 7 kopecks.
  • Thanks to the ready-made tool, we have forgotten about the problems of managing moderators.
  • We increased the number of manually processed ads by 2.5 times with the same number of moderators and the same budget. The quality of manual moderation has also improved thanks to automated control, and hovers around 0.5% errors.
  • We quickly cover new types of spam with filters.
  • We quickly connect new divisions, the "Yula Verticals," to moderation. Since 2017, Yula has added the Real Estate, Vacancies and Auto verticals.

Source: www.habr.com
