Scalable data classification for security and privacy

Content-based data classification is an open problem. Traditional data loss prevention (DLP) systems address it by fingerprinting the data in question and monitoring endpoints for the fingerprints. Given the large number of constantly changing data assets at Facebook, this approach not only fails to scale, it is also ineffective at determining where data lives. This paper describes an end-to-end system built to detect sensitive semantic types within Facebook at scale and to enforce data retention and access controls automatically.

The approach described here is our first end-to-end privacy system that attempts to solve this problem by combining data signals, machine learning, and traditional fingerprinting techniques to map and classify all data at Facebook. The system runs in production, achieving an average F2 score of 0.9+ across various privacy classes while processing a large number of data assets across dozens of data stores. What follows is a translation of Facebook's arXiv paper on scalable data classification for security and privacy based on machine learning.

Introduction

Today, organizations collect and store large amounts of data in a variety of formats and locations [1]. The data is then consumed in many places, sometimes copied or cached multiple times, so that valuable and sensitive business information ends up scattered across many enterprise data stores. When an organization must meet certain legal or regulatory requirements, such as complying with regulations during civil litigation, it becomes necessary to collect information about where the relevant data is located. When a privacy regulation states that an organization must mask all Social Security numbers (SSNs) when sharing personal information with unauthorized entities, the natural first step is to find all SSNs across the organization's data stores. In such circumstances, data classification becomes critical [1]. A classification system allows organizations to enforce privacy and security policies automatically, such as access control policies and data retention. This paper introduces a system we built at Facebook that uses multiple data signals, a scalable system architecture, and machine learning to discover sensitive semantic data types.

Data discovery and classification is the process of finding and labeling data so that relevant information can be retrieved quickly and efficiently when needed. The current process is largely manual: it consists of examining the relevant laws or regulations, determining which types of information should be considered sensitive and what the different sensitivity levels are, and then building classes and classification policies accordingly [1]. Data loss prevention (DLP) systems then fingerprint the data and monitor downstream endpoints for the fingerprints. For a warehouse with a large number of assets and petabytes of data, this approach simply does not scale.

Our goal is to build a data classification system that scales to both persistent and transient user data, without any additional restrictions on data type or format. This is an ambitious goal, and naturally it comes with challenges. First, a given data record can be very long.

Figure 1. Online and offline prediction flows.

We therefore need to represent each record efficiently using features that can later be combined and moved around easily. These features should not only yield accurate classification but also provide the flexibility and extensibility to add and discover new data types easily in the future. Second, we have to deal with large offline tables. Persistent data can be stored in tables that are many petabytes in size, which can slow scans down. Third, we must adhere to strict classification SLAs on transient data. This forces the system to be highly efficient, fast, and accurate. Finally, we must provide low-latency classification for transient data in order to support real-time classification and online use cases.

This paper describes how we dealt with the challenges above and presents a fast and scalable classification system that classifies data elements of all types, formats, and sources based on a common set of features. We scaled out the system architecture and built a custom machine learning model to classify offline and online data quickly. The paper is organized as follows: Section 2 presents the overall system architecture. Section 3 discusses the parts of the machine learning system. Sections 4 and 5 cover related work and outline future directions.

Architecture

To deal with the challenges of persistent and online data at Facebook scale, the classification system has two separate flows, which we discuss in detail.

Persistent data

First, the system has to learn about Facebook's many data assets. For each data store, some basic information is collected, such as the data center containing the data, the system holding the data, and the assets located in that particular store. This forms a metadata catalog that allows the system to retrieve data efficiently without overloading clients and resources used by other engineers.

This metadata catalog provides an authoritative source for all scanned assets and lets us track the state of each asset. Using this information, scheduling priority is set based on the collected data and on internal information from the system, such as the time the asset was last successfully scanned and the time it was created, as well as the past memory and CPU requirements of the asset if it has been scanned before. Then, for each data asset (as resources become available), a job is invoked to actually scan the asset.

Each job is a compiled binary that performs Bernoulli sampling on the latest data available for each asset. The asset is broken down into individual columns, and the classification result for each column is processed independently. In addition, the system scans any nested data inside the columns: JSON, arrays, encoded structures, URLs, base64 serialized data, and more are all scanned. This can significantly increase scan time, because a single table can contain thousands of nested columns in a JSON blob.
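As a rough illustration of the sampling step, the sketch below keeps each row independently with a fixed probability, which is what Bernoulli sampling means. The function name, the sampling rate, and the seed parameter are assumptions made for this example; the paper does not describe the scanner's actual implementation.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], p: float,
                     seed: Optional[int] = None) -> Iterator[T]:
    """Yield each row independently with probability p (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    rng = random.Random(seed)
    for row in rows:
        # random() is uniform on [0, 1), so the row is kept with probability p.
        if rng.random() < p:
            yield row
```

Because every row is decided independently, the sampler works on a stream and never needs to know the table size in advance, which matters for tables too large to hold in memory.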

For each row sampled from the data asset, the classification system extracts float and count features from the content and associates each feature with the column it was taken from. The output of the feature extraction step is a map of all features for each column found in the data asset.
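The extraction step can be pictured roughly as follows. This is a minimal sketch, not the production extractor: the feature names (`sample_count`, `email_count`, `digit_count`, `char_count`) and the regular expressions are hypothetical choices for illustration only.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Mapping

# Hypothetical patterns; a real extractor would use many more signals.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGIT_RE = re.compile(r"\d")


def extract_features(rows: Iterable[Mapping[str, str]]) -> Dict[str, Counter]:
    """Return a map from column name to count features for that column."""
    features: Dict[str, Counter] = {}
    for row in rows:
        for column, value in row.items():
            counts = features.setdefault(column, Counter())
            counts["sample_count"] += 1
            counts["email_count"] += len(EMAIL_RE.findall(value))
            counts["digit_count"] += len(DIGIT_RE.findall(value))
            counts["char_count"] += len(value)
    return features
```

Only the counts leave this function; the raw strings are not stored in the result.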

Why features?

The concept of features is key. Instead of float and count features, we could pass around the raw string samples extracted directly from each data asset. Moreover, machine learning models could be trained directly on each sample rather than on hundreds of feature calculations that only try to approximate the sample. There are several reasons for using features anyway:

  1. Privacy first: Most importantly, the concept of features lets us keep the samples we retrieve in memory only. This ensures that we store samples for a single purpose and never log them through our own efforts. This is especially important for transient data, since the service must maintain some classification state before it provides a prediction.
  2. Memory: Some samples can be very long. Storing such data and sending it to other parts of the system needlessly consumes many extra bytes. The two factors can compound over time, given that there are many data assets with thousands of columns.
  3. Feature aggregation: Features give a clear representation of the results of each scan, which allows the system to combine the results of previous scans of the same data asset in a convenient way. This can be useful for aggregating scan results from a single data asset across multiple runs.
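The aggregation point can be made concrete with a short sketch. Assuming each scan produces a map from column name to count features (the `FeatureMap` alias and the `merge_scans` name are hypothetical), combining several scans of the same asset reduces to adding the counts together:

```python
from collections import Counter
from typing import Dict, Iterable

FeatureMap = Dict[str, Counter]


def merge_scans(scans: Iterable[FeatureMap]) -> FeatureMap:
    """Sum the count features of several scans, column by column."""
    merged: FeatureMap = {}
    for scan in scans:
        for column, counts in scan.items():
            # Counter.update adds counts rather than replacing them.
            merged.setdefault(column, Counter()).update(counts)
    return merged
```

Raw samples could not be merged this cheaply, and keeping them around would also conflict with the privacy argument in the first item.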

The features are then sent to a prediction service, where we use rule-based classification and machine learning to predict the data labels of each column. The service relies on both rule classifiers and machine learning and selects the best prediction given by each prediction object.

Rule classifiers are manual heuristics: they use calculations and coefficients to normalize a feature to a range from 0 to 100. Once such an initial score has been generated for each data type, and provided the column name associated with the data does not appear in any "deny lists", the rule classifier picks the highest normalized score among all data types.
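A minimal sketch of such a rule classifier is shown below. The rule table, the coefficients, the deny lists, and the scoring formula are all hypothetical; the paper only states that scores are normalized to 0 to 100, that deny lists are consulted, and that the highest score wins.

```python
from typing import Dict, Mapping, Optional, Set, Tuple

# Hypothetical rules: data type -> (supporting feature, coefficient).
RULES: Dict[str, Tuple[str, float]] = {
    "EMAIL": ("email_count", 100.0),
    "PHONE": ("phone_count", 90.0),
}

# Hypothetical deny lists: column names a data type is never assigned to.
DENY_LISTS: Dict[str, Set[str]] = {"EMAIL": {"debug_blob"}}


def normalized_score(features: Mapping[str, int], feature_name: str,
                     coefficient: float) -> float:
    """Scale the per-sample hit rate by a coefficient and clamp to [0, 100]."""
    samples = features.get("sample_count", 0)
    if samples <= 0:
        return 0.0
    raw = coefficient * features.get(feature_name, 0) / samples
    return max(0.0, min(100.0, raw))


def classify_by_rules(column_name: str,
                      features: Mapping[str, int]) -> Optional[Tuple[str, float]]:
    """Return the (data type, score) pair with the highest score, if any."""
    best: Optional[Tuple[str, float]] = None
    for data_type, (feature_name, coefficient) in RULES.items():
        if column_name in DENY_LISTS.get(data_type, set()):
            continue  # this column name is denied for this data type
        score = normalized_score(features, feature_name, coefficient)
        if score > 0.0 and (best is None or score > best[1]):
            best = (data_type, score)
    return best
```

For a column named `contact` with 10 samples, 9 email hits, and 1 phone hit, this sketch picks `EMAIL`; the same features on a column named `debug_blob` fall through to `PHONE` because of the deny list.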

Because of the complexity of classification, relying purely on manual heuristics results in low classification accuracy, especially for unstructured data. For this reason, we developed a machine learning system to handle the classification of unstructured data such as user content and addresses. Machine learning has made it possible to start moving away from manual heuristics and to work with additional data signals (for example, column names and data lineage), significantly improving detection accuracy. We dive deeper into our machine learning architecture later.

The prediction service stores the results for each column together with metadata about the time and state of the scan. Any consumers and downstream processes that depend on this data can read it either from the daily published dataset, which aggregates the results of all these scan jobs, or from the real-time Data Catalog API. The published predictions are the foundation for automatic enforcement of privacy and security policies.

Finally, once the prediction service has written all the data and all predictions are stored, our Data Catalog API can return all data type predictions for a given asset in real time. Each day the system publishes a dataset containing the latest predictions for each asset.

Transient data

Although the process above is designed for persistent assets, non-persistent traffic is also considered part of an organization's data and can be just as important. For this reason, the system provides an online API for generating real-time classification predictions for any transient traffic. The real-time prediction system is widely used to classify outbound traffic, inbound traffic into machine learning models, and advertiser data.

Here the API takes two main arguments: a grouping key and the raw data to be predicted. The service performs the same feature extraction as described above and groups the features together by key. These features are also kept in a persistent cache for failure recovery. For each grouping key, the service makes sure it has seen enough samples before calling the prediction service, following the process described above.
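The behavior of the online flow can be sketched as a small class. Everything named here is hypothetical: the class, the `min_samples` threshold, the single `at_sign_count` feature, and the in-memory dictionary that stands in for the persistent cache mentioned in the text.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Optional


class OnlineClassifier:
    """Accumulate features per grouping key; predict once enough are seen."""

    def __init__(self, predict: Callable[[Counter], str], min_samples: int = 3):
        self._predict = predict
        self._min_samples = min_samples
        # Stand-in for a persistent cache; only features are kept, not raw data.
        self._features: Dict[str, Counter] = defaultdict(Counter)

    def classify(self, key: str, raw: str) -> Optional[str]:
        counts = self._features[key]
        counts["sample_count"] += 1
        counts["at_sign_count"] += raw.count("@")
        if counts["sample_count"] < self._min_samples:
            return None  # not enough evidence for this key yet
        return self._predict(counts)
```

A caller would invoke `classify(key, raw)` once per observed value and receive `None` until the threshold for that key is reached.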

Optimization

To scan some stores, we use libraries and techniques that optimize reads from hot storage [2] and ensure there is no disruption for other users accessing the same storage.

For extremely large tables (50+ petabytes), despite all the optimizations and memory efficiency, the system struggles to scan and compute everything before running out of memory. After all, the scan is computed entirely in memory and is not persisted while it runs. If large tables contain thousands of columns with unstructured blobs of data, the job may fail for lack of memory when making predictions over the entire table. This results in reduced coverage. To combat this, we optimized the system to use scan speed as a proxy for how well the system handles the current load. We use speed as a predictive mechanism to spot memory problems and to compute the feature map ahead of time, while using less data than usual.

Data signals

A classification system is only as good as the signals it gets from the data. Here we go over all the signals used by the classification system.

  • Content-based: Naturally, the first and most important signal is content. Bernoulli sampling is performed on every data asset we scan, and features are extracted from the data content. Many signals come from content. Any number of float features are possible, each representing a count of how many times a particular sample type has been seen. For example, we may have features for the number of emails seen in a sample, or for how many emojis are seen in a sample. These feature counts can be normalized and aggregated across different scans.
  • Data lineage: An important signal that can help when content has changed from the parent table. A common example is hashed data. When data in a child table is hashed, it often comes from a parent table where it remains in the clear. Lineage data helps classify certain types of data when they cannot be read clearly or have been transformed from an upstream table.
  • Annotations: Another high-quality signal that helps identify unstructured data. In fact, annotations and lineage data can work together to propagate attributes across different data assets. Annotations help identify the source of unstructured data, while lineage data can help track the flow of that data throughout the store.
  • Data injection: A technique in which special, unreadable characters are intentionally injected into known sources of known data types. Then, whenever we scan content containing the same unreadable character sequence, we can infer that the content comes from that known data type. This is another qualitative data signal similar to annotations, except that content-based detection helps discover the injected data.
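The data injection signal from the last item can be sketched as a simple marker lookup. The marker strings and the mapping below are invented for illustration; the paper does not say what the injected sequences look like or how they are stored.

```python
from typing import Dict, Optional

# Hypothetical markers that were injected into sources of a known data type.
INJECTED_MARKERS: Dict[str, str] = {
    "zq9x-canary-7f3a": "EMAIL",
    "zq9x-canary-c41d": "PHONE",
}


def detect_injected_type(content: str) -> Optional[str]:
    """Return the data type of the first known marker found in the content."""
    for marker, data_type in INJECTED_MARKERS.items():
        if marker in content:
            return data_type
    return None
```

The useful property is that the marker survives copying: if the sequence turns up in a downstream asset, the content most likely originated from the marked source.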

Measuring metrics

An important component is a rigorous methodology for measuring metrics. The main metrics for iterating on classification improvements are the precision and recall of each label, with the F2 score being the most important.
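For reference, the F-beta score with beta = 2 is (1 + 4) · precision · recall / (4 · precision + recall), which weights recall more heavily than precision. The sketch below computes per-label precision, recall, and F2 from pairs of (true label, predicted label); the function names and the input format are assumptions for this example.

```python
from typing import Dict, Iterable, Tuple


def fbeta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 favors recall over precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def per_label_metrics(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F2 for every label seen in the pairs."""
    pairs = list(pairs)
    labels = {t for t, _ in pairs} | {p for _, p in pairs}
    metrics: Dict[str, Dict[str, float]] = {}
    for label in sorted(labels):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f2": fbeta(precision, recall),
        }
    return metrics
```

With precision 0.5 and recall 1.0 the F2 score is about 0.83, while precision 1.0 and recall 0.5 gives about 0.56, which shows how a missed sensitive value costs more than a false alarm.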

Calculating these metrics requires an independent methodology for labeling data assets, one that does not depend on the system itself but can be used for direct comparison with it. Below we describe how we collect ground truth at Facebook and use it to train our classification system.

Collecting ground truth data

We accumulate ground truth from each source listed below into its own table. Each table is responsible for aggregating the latest observed values from that particular source. Each source has data quality checks to ensure that the observed values are of high quality and contain the latest data type labels.

  • Logging platform configurations: Certain fields in Hive tables are populated with data of a specific type. Using and propagating this data serves as a reliable source of ground truth.
  • Manual labeling: Developers who maintain the system, as well as external labelers, are trained to label columns. This generally works well for all data types in the warehouse, and it can be the primary source of ground truth for some unstructured data, such as message data or user content.
  • Columns from parent tables can be tagged or annotated as containing certain data, and we can track that data in child tables.
  • Sampling execution threads: Execution threads at Facebook carry data of a specific type. Using our scanner as a service architecture, we can sample streams that have known data types and send them through the system. The system promises not to store this data.
  • Sample tables: Large Hive tables that are known to contain the entire data corpus can also be used as training data and passed through the scanner as a service. This works well for tables with a full range of data types, so that sampling a column at random is equivalent to sampling the entire set of that data type.
  • Synthetic data: We can even use libraries that generate data on the fly. This works well for simple, public data types such as an address or GPS coordinates.
  • Data stewards: Privacy programs typically use data stewards to manually assign policies to pieces of data. This serves as a highly accurate source of ground truth.

We combine every major source of ground truth into one corpus with all of that data. The biggest challenge with ground truth is making sure it is representative of the data warehouse. Otherwise, the classification engines may overfit. To combat this, all of the sources above are used to ensure balance when training models or computing metrics. In addition, human labelers uniformly sample different columns in the warehouse and label the data accordingly so that the collection of ground truth remains unbiased.

Continuous integration

To enable rapid iteration and improvement, it is important to always measure system performance in real time. We can measure every classification improvement against the system as it stands today, so that further improvements can be guided tactically by data. Here we look at how the system closes the feedback loop provided by ground truth data.

When the scheduling system encounters an asset that has a label from a trusted source, we schedule two jobs. The first uses our production scanner and therefore our production capabilities. The second uses the latest build of the scanner with the latest features, the release candidate (RC). Each job writes its output to its own table, tagging versions along with the classification results.

This is how we compare the classification results of the release candidate and the production model in real time.

While the datasets compare RC and PROD features, many variations of the prediction service's ML classification engine are logged: the most recently built machine learning model, the current model in production, and any experimental models. The same approach allows us to "slice" different versions of the model (agnostic to our rule classifiers) and compare metrics in real time. This makes it easy to determine when an ML experiment is ready to go into production.

Every night, the RC features computed for that day are sent to the ML training pipeline, where the model is trained on the latest RC features and its performance is evaluated against the ground truth dataset.

Every morning, the model completes training and is automatically published as an experimental model. It is automatically included in the list of experimental models.

Some results

More than 100 different data types are labeled with high accuracy. Well-structured types such as emails and phone numbers are classified with an F2 score greater than 0.95. Free-form data types such as user-generated content and names also perform very well, with F2 scores greater than 0.85.

A large number of individual columns of persistent and transient data are classified daily across all stores. More than 500 terabytes are scanned daily across more than 10 data stores. Most of these stores have coverage of over 98%.

Over time, classification has become very efficient, with classification jobs in the persistent offline flow taking an average of 35 seconds from scanning an asset to computing predictions for each column.

Figure 2. Diagram of the continuous integration flow, showing how RC objects are generated and sent to the model.

Figure 3. High-level diagram of the machine learning component.

Machine learning system component

In the previous section, we took a deep dive into the overall system architecture, highlighting scale, optimization, and the offline and online data flows. In this section, we look at the prediction service and describe the machine learning system that powers it.

With more than 100 data types and some unstructured content such as message data and user content, using purely manual heuristics results in subpar classification accuracy, especially for unstructured data. For this reason, we also developed a machine learning system to handle the complexities of unstructured data. Using machine learning lets us start moving away from manual heuristics and work with features and additional data signals (for example, column names and data lineage) to improve accuracy.

The implemented model learns vector representations [3] over dense and sparse features separately. These are then concatenated to form a vector, which goes through a series of batch normalization [4] and nonlinearity steps to produce the final output. The final output is a floating-point number in [0, 1] for each label, indicating the probability that the sample belongs to that sensitivity type. Using PyTorch for the model let us move faster, allowing developers outside the team to make and test changes quickly.

When designing the architecture, it was important to model sparse (for example, text) and dense (for example, numeric) features separately because of their intrinsic differences. For the final architecture, it was also important to run a parameter sweep to find the optimal values for the learning rate, batch size, and other hyperparameters. The choice of optimizer was also an important hyperparameter. We found that the popular Adam optimizer often leads to overfitting, whereas a model trained with SGD is more stable. There were additional nuances that we had to build directly into the model, for example static rules that guarantee the model makes a deterministic prediction when a feature has a certain value. These static rules are defined by our clients. We found that incorporating them directly into the model resulted in a more self-contained and robust architecture, as opposed to implementing a post-processing step to handle these special edge cases. Note also that these rules are disabled during training so as not to interfere with the gradient descent training process.
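To make the description more tangible, here is a minimal PyTorch sketch of a model with this general shape. It is not the production architecture: the layer sizes, the use of `nn.EmbeddingBag` for sparse features, the single hidden stage, and the `rule_mask` mechanism for static rules are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class TypeClassifier(nn.Module):
    """Dense and sparse inputs are embedded separately, then combined."""

    def __init__(self, num_dense: int, vocab_size: int, num_labels: int,
                 hidden: int = 32):
        super().__init__()
        self.dense_proj = nn.Linear(num_dense, hidden)
        # Each row of sparse ids is treated as one bag and averaged.
        self.sparse_embed = nn.EmbeddingBag(vocab_size, hidden, mode="mean")
        self.body = nn.Sequential(
            nn.BatchNorm1d(2 * hidden),
            nn.ReLU(),
            nn.Linear(2 * hidden, num_labels),
        )

    def forward(self, dense, sparse_ids, rule_mask=None):
        x = torch.cat([self.dense_proj(dense), self.sparse_embed(sparse_ids)], dim=1)
        probs = torch.sigmoid(self.body(x))  # one probability per label
        # Hypothetical static rules: force a label to 1.0 at inference only,
        # so the override never interferes with gradient-based training.
        if rule_mask is not None and not self.training:
            probs = torch.where(rule_mask, torch.ones_like(probs), probs)
        return probs
```

Keeping the rule override inside `forward` mirrors the design choice described above: the rules travel with the model instead of living in a separate post-processing step.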

Challenges

One of the challenges was collecting high-quality, reliable data. The model needs ground truth for each class so that it can learn associations between features and labels. In the previous section, we discussed data collection methods for both measuring the system and training models. The analysis showed that data classes such as credit card and bank account numbers are not very common in our warehouse. This makes it difficult to collect large amounts of reliable data to train models. To address this issue, we developed processes for obtaining synthetic ground truth data for these classes. We generate such data for sensitive types including SSNs, credit card numbers, and IBANs, for which the model could not make predictions before. This approach allows sensitive data types to be processed without the privacy risks associated with harvesting real sensitive data.
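As an illustration of what synthetic ground truth can look like, the sketch below generates card-like numbers that pass the Luhn checksum and strings in the SSN layout. The function names are hypothetical, the SSN generator reproduces only the NNN-NN-NNNN shape without the real allocation rules, and IBANs are left out; the paper does not describe its actual generators.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Return the digit that makes `partial + digit` pass the Luhn check."""
    total = 0
    # Walking right to left, the last digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(number: str) -> bool:
    """Check a full digit string against the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def synthetic_card_number(rng: random.Random, length: int = 16) -> str:
    """Random digits plus a Luhn check digit; not tied to any real issuer."""
    partial = "".join(str(rng.randrange(10)) for _ in range(length - 1))
    return partial + str(luhn_check_digit(partial))


def synthetic_ssn(rng: random.Random) -> str:
    """A string in the NNN-NN-NNNN layout (shape only)."""
    return f"{rng.randrange(1000):03d}-{rng.randrange(100):02d}-{rng.randrange(10000):04d}"
```

Generated values like these can be labeled with certainty, since the generator knows which type it produced, and no real user data is involved.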

Apart from ground truth issues, there are open architectural issues that we are working on, such as change isolation and early stopping. Change isolation is important to ensure that when different changes are made to different parts of the network, the impact is isolated to specific classes and does not have a broad effect on overall prediction performance. Improving the early stopping criteria is also critical so that we can stop training at a point that is stable for all classes, rather than at a point where some classes overfit and others do not.

Feature importance

When a new feature is introduced into the model, we want to know its overall impact on the model. We also want to make sure the predictions are human-interpretable so that we can understand exactly which features are being used for each data type. To this end, we developed and introduced per-class feature importance for the PyTorch model. Note that this differs from overall feature importance, which is typically what is supported, because overall importance does not tell us which features are important for a particular class. We measure the importance of a feature by calculating the increase in the model's prediction error after permuting the feature. A feature is "important" when shuffling its values increases the model error, because in that case the model relied on the feature for the prediction. A feature is "unimportant" when shuffling its values leaves the model error unchanged, because in that case the model ignored it [5].
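The permutation idea can be sketched without any particular model library. In the example below, `predict` is any function from a feature row to a label, and the error is measured for one target class only (one versus rest); the function name, the error definition, and the seed parameter are assumptions, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def per_class_permutation_importance(
    predict: Callable[[Sequence[float]], str],
    rows: Sequence[Sequence[float]],
    labels: Sequence[str],
    target: str,
    feature_index: int,
    seed: int = 0,
) -> float:
    """Increase in one-vs-rest error for `target` after shuffling one feature."""
    if not rows:
        raise ValueError("rows must not be empty")

    def error(data: Sequence[Sequence[float]]) -> float:
        wrong = sum(
            1 for row, y in zip(data, labels)
            if (predict(row) == target) != (y == target)
        )
        return wrong / len(data)

    baseline = error(rows)
    column = [row[feature_index] for row in rows]
    random.Random(seed).shuffle(column)
    shuffled_rows: List[List[float]] = []
    for row, value in zip(rows, column):
        new_row = list(row)  # copy so the caller's data is left untouched
        new_row[feature_index] = value
        shuffled_rows.append(new_row)
    return error(shuffled_rows) - baseline
```

A feature the model ignores scores exactly zero under this measure, while a feature the model depends on for the target class scores above zero whenever the shuffle changes its predictions.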

Per-class feature importance lets us make the model interpretable, so that we can see what the model is looking at when it predicts a label. For example, when we analyze ADDR, we verify that an address-related feature such as AddressLinesCount ranks high in the per-class feature importance table, so that our human intuition matches what the model has learned.

Evaluation

It is important to define a single metric for success. We chose F2, a balance between recall and precision that weights recall somewhat more heavily. Recall is more important to the privacy use case than precision, because it is critical for the team not to miss any sensitive data (while ensuring reasonable precision). The actual F2 performance evaluation of our model is beyond the scope of this paper. However, with careful tuning we can achieve high (0.9+) F2 scores for the most important sensitive classes.

Related work

There are many algorithms for automatic classification of unstructured documents using various methods such as pattern matching, document similarity search, and different machine learning methods (Bayesian, decision trees, k-nearest neighbors, and many others) [6]. Any of these can be used as part of a classification system. The problem, however, is scalability. The classification approach in this paper is biased toward flexibility and performance. This allows us to support new classes in the future and keep latency low.

There is also a great deal of work on data fingerprinting. For example, the authors in [7] described a solution that focuses on the problem of catching sensitive data leaks. The underlying assumption is that the data can be fingerprinted in order to match it against a set of known sensitive data. The authors in [8] describe a similar problem of privacy leakage, but their solution is based on a specific Android architecture and classifies data only if user actions result in sharing personal information or if the underlying application leaks user data. The situation here is somewhat different, because user data can also be highly unstructured. We therefore need a more sophisticated technique than fingerprinting.

Finally, to cope with the shortage of data for some types of sensitive data, we introduced synthetic data. There is a large body of literature on data augmentation; for example, the authors in [9] explored the role of noise injection during training and observed positive results in supervised learning. Our approach to privacy is different, because introducing noisy data can be counterproductive; instead, we focus on high-quality synthetic data.

Conclusion

In this paper, we presented a system that can classify a piece of data. This allows us to build systems that enforce privacy and security policies. We have shown that scalable infrastructure, continuous integration, machine learning, and high-quality ground truth data play a key role in the success of many of our privacy initiatives.

There are many directions for future work. These may include supporting unschematized data (files), classifying not only the data type but also the sensitivity level, and using self-supervised learning during training by generating accurate synthetic examples, which in turn will help the model reduce its loss by the greatest amount. Future work could also focus on the investigation workflow, where we go beyond detection and provide root cause analysis of various privacy violations. This will help in cases such as sensitivity analysis, that is, deciding whether the privacy sensitivity of a data type is high (for example, a user IP) or low (for example, a Facebook internal IP).

References

  1. David Ben-David, Tamar Domany, and Abigail Tarem. Enterprise data classification using semantic web technologies. In Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Z. Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web - ISWC 2010, pages 66-81, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  2. Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. f4: Facebook's warm BLOB storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 383-398, Broomfield, CO, October 2014. USENIX Association.
  3. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.
  4. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448-456, Lille, France, 07-09 Jul 2015. PMLR.
  5. Leo Breiman. Random forests. Mach. Learn., 45(1):5-32, October 2001.
  6. Thair Nu Phyu. Survey of classification techniques in data mining.
  7. X. Shu, D. Yao, and E. Bertino. Privacy-preserving detection of sensitive data exposure. IEEE Transactions on Information Forensics and Security, 10(5):1092-1103, 2015.
  8. Zhemin Yang, Min Yang, Yuan Zhang, Guofei Gu, Peng Ning, and Xiaoyang Wang. AppIntent: Analyzing sensitive data transmission in Android for privacy leakage detection. pages 1043-1054, 11 2013.
  9. Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation.

Source: www.habr.com