Quick Draw Doodle Recognition: how to make friends with R, C++ and neural networks
Hi, Habr!
Last autumn Kaggle hosted a competition on classifying hand-drawn pictures, Quick Draw Doodle Recognition, in which, among others, a team of R users took part: Artem Klevtsov, Philipp Upravitelev and Andrey Ogurtsov. We will not describe the competition in detail; that has already been done in a recent publication.
This time medal farming did not work out, but we gained a lot of valuable experience, so I would like to tell the community about several of the most interesting and useful things, both for Kaggle and for everyday work. Among the topics covered: the hard life without OpenCV, JSON parsing (these examples look at integrating C++ code into R scripts or packages using Rcpp), parameterization of scripts and dockerization of the final solution. All the code from this post, in a form suitable for execution, is available in the repository.
1. Efficiently loading data from CSV into a MonetDB database
The data in this competition is provided not as ready-made images but as 340 CSV files (one file per class) containing JSONs with point coordinates. Connecting these points with lines gives a final image of 256x256 pixels. Each record also has a label indicating whether the picture was correctly recognized by the classifier in use when the dataset was collected, a two-letter code of the author's country of residence, a unique identifier, a timestamp and a class name that matches the file name. The simplified version of the original data weighs 7.4 GB in the archive and about 20 GB after unpacking; the full data takes 240 GB after unpacking. The organizers ensured that both versions reproduce the same drawings, which means the full version was redundant. In any case, storing 50 million images as image files or as arrays was immediately deemed unprofitable, so we decided to merge all the CSV files from the archive train_simplified.zip into a database and then generate images of the required size "on the fly" for each batch.
The well-proven MonetDB was chosen as the DBMS, namely its implementation for R in the form of the MonetDBLite package. The package includes an embedded version of the database server and lets you start the server directly from an R session and work with it there. Creating a database and connecting to it is done with a single command:
con <- DBI::dbConnect(drv = MonetDBLite::MonetDBLite(), Sys.getenv("DBDIR"))
We will need to create two tables: one for all the data, the other for service information about the files already loaded (useful if something goes wrong and the process has to be resumed after several files have been loaded):
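The resume logic behind the second table can be sketched in base R without any database. This is only an illustration under assumptions: the environment `state` stands in for the service table, `load_one` stands in for the real CSV import, and the names `load_files` and `flaky` are hypothetical and do not come from the authors' code.

```r
# Load files one by one, recording each success so that a rerun skips them.
# `state` stands in for the service table; `load_one` for the real CSV import.
load_files <- function(files, state, load_one) {
  for (f in setdiff(files, state$loaded)) {
    load_one(f)                         # may fail and abort the loop
    state$loaded <- c(state$loaded, f)  # reached only after a successful load
  }
  invisible(state$loaded)
}

state <- new.env()
state$loaded <- character(0)
files <- c("airplane.csv", "alarm clock.csv", "ambulance.csv")

# First run: the hypothetical importer fails on the third file
flaky <- function(f) {
  if (f == "ambulance.csv") stop("simulated failure")
  invisible(f)
}
try(load_files(files, state, flaky), silent = TRUE)
after_failure <- state$loaded  # the first two files are already recorded

# Second run: only the remaining file is attempted
attempted <- character(0)
load_files(files, state, function(f) attempted <<- c(attempted, f))
```

Because a file is recorded only after its import has finished, an interrupted run never marks a partially loaded file as done.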
Drawing is done with standard R tools and saved to a temporary PNG stored in RAM (on Linux, temporary R directories are located in the /tmp directory, mounted in RAM). The file is then read as a three-dimensional array with values ranging from 0 to 1. This matters because a more conventional BMP would be read into a raw array with hex color codes.
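To make the step from stroke coordinates to a pixel array concrete, here is a minimal base R sketch that rasterizes strokes straight into a matrix instead of going through a temporary PNG. It is not the authors' implementation: the function name `rasterize_strokes`, the list-of-`x`/`y` stroke layout, the one-pixel line width and the grayscale-only output are all assumptions made for illustration.

```r
# Rasterize strokes (lists of x/y vectors in 0..255 coordinates) into a
# square grayscale matrix with values in [0, 1]: 1 = background, 0 = ink.
rasterize_strokes <- function(strokes, scale = 1) {
  size <- as.integer(256 * scale)
  img <- matrix(1, nrow = size, ncol = size)
  # Map a 0-based coordinate to a 1-based pixel index, clamped to the image
  to_px <- function(v) pmin(size, pmax(1L, as.integer(floor(v * scale)) + 1L))
  for (s in strokes) {
    n <- length(s$x)
    if (n == 0L) next
    if (n == 1L) {
      img[to_px(s$y), to_px(s$x)] <- 0
      next
    }
    for (k in seq_len(n - 1L)) {
      # Sample enough points along the segment to leave no gaps
      steps <- ceiling(max(abs(s$x[k + 1] - s$x[k]),
                           abs(s$y[k + 1] - s$y[k]))) + 1
      xs <- seq(s$x[k], s$x[k + 1], length.out = steps)
      ys <- seq(s$y[k], s$y[k + 1], length.out = steps)
      img[cbind(to_px(ys), to_px(xs))] <- 0  # rows are y, columns are x
    }
  }
  img
}

strokes <- list(
  list(x = c(0, 255), y = c(0, 0)),   # horizontal line along the top edge
  list(x = c(10, 10), y = c(10, 20))  # short vertical line
)
img_full <- rasterize_strokes(strokes)               # 256x256
img_half <- rasterize_strokes(strokes, scale = 0.5)  # 128x128
```

The `scale` argument here plays the same role as the scale factor described later in the article: the coordinates stay in the 256x256 space and only the output resolution changes.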
This R-only implementation seemed suboptimal to us, since forming large batches takes an indecently long time, so we decided to draw on our colleagues' experience and use the powerful OpenCV library. At that time there was no ready-made package for R (there is none now), so a minimal implementation of the required functionality was written in C++ and integrated into the R code using Rcpp.
To solve the problem, the following packages and libraries were used:
OpenCV for working with images and drawing lines. We used pre-installed system libraries and header files, as well as dynamic linking.
xtensor for working with multidimensional arrays and tensors. We used the header files included in the R package of the same name. The library lets you work with multidimensional arrays in both row-major and column-major order.
ndjson for parsing JSON. This library is used by xtensor automatically if it is present in the project.
RcppThread for organizing multithreaded processing of a vector of JSON strings. We used the header files provided by this package. Unlike the more popular RcppParallel, this package has, among other things, a built-in loop interruption mechanism.
It is worth noting that xtensor turned out to be a godsend: besides its extensive functionality and high performance, its developers proved quite responsive and answered questions promptly and in detail. With their help, we were able to implement the conversion of OpenCV matrices into xtensor tensors, as well as a way to combine 3-dimensional image tensors into a 4-dimensional tensor of the correct dimension (the batch itself).
R has a well-deserved reputation as a language for processing data that fits in RAM, while Python is more associated with iterative data processing, which makes it easy and natural to implement out-of-core computations (computations using external memory). A classic example, and one relevant to the problem described here, is deep neural networks trained by gradient descent, where the gradient at each step is approximated from a small portion of the observations, a mini-batch.
Deep learning frameworks written in Python have special classes that implement iterators over data: tables, pictures in folders, binary formats and so on. In R we can take advantage of all the features of the Python library keras with its various backends through the package of the same name, which in turn works on top of the reticulate package. The latter deserves a separate long article; it not only lets you run Python code from R, but also lets you pass objects between the R and Python sessions, automatically performing all the necessary type conversions.
We got rid of the need to keep all the data in RAM by using MonetDBLite, and all the "neural network" work will be done by the original Python code, so we only have to write an iterator over the data, since nothing ready-made exists for this situation in either R or Python. There are essentially only two requirements for it: it must return batches in an endless loop, and it must keep its state between iterations (the latter is implemented in R in the simplest way, with closures). Previously you had to explicitly convert R arrays into numpy arrays inside the iterator, but the current version of the keras package does this itself.
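The two requirements above (endless batches, state kept between calls) can be illustrated with a small base R closure. This is a simplified sketch, not the authors' iterator: the name `make_batch_iterator` and its arguments are assumptions, and it returns row indices only, leaving out the database query and image generation.

```r
# Build an iterator over row indices: each call returns the next batch, and
# the position is kept between calls in the enclosing environment.
make_batch_iterator <- function(n_rows, batch_size, shuffle = TRUE) {
  idx <- if (shuffle) sample.int(n_rows) else seq_len(n_rows)
  starts <- seq(1L, n_rows, by = batch_size)
  i <- 0L  # state captured by the closure
  function() {
    i <<- i + 1L
    if (i > length(starts)) i <<- 1L  # wrap around: never run out of batches
    from <- starts[i]
    idx[from:min(from + batch_size - 1L, n_rows)]
  }
}

next_batch <- make_batch_iterator(n_rows = 10, batch_size = 4, shuffle = FALSE)
b1 <- next_batch()  # rows 1..4
b2 <- next_batch()  # rows 5..8
b3 <- next_batch()  # rows 9..10 (the last, shorter batch)
b4 <- next_batch()  # wraps around to rows 1..4
```

The counter `i` lives in the environment of `make_batch_iterator()`, so every call to `next_batch()` sees the value left by the previous call without any global variables.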
The iterator for the training and validation data turned out as follows:
The function takes as input a variable holding the database connection, the numbers of the rows to use, the number of classes, the batch size, the scale (scale = 1 corresponds to rendering images of 256x256 pixels, scale = 0.5 to 128x128 pixels), a color indicator (color = FALSE renders in grayscale; with color = TRUE each stroke is drawn in a new color) and a preprocessing indicator for networks pre-trained on imagenet. The latter is needed to scale pixel values from the interval [0, 1] to the interval [-1, 1], which was used when training the supplied keras models.
The outer function contains argument type checks, a data.table with randomly shuffled row numbers from samples_index and batch numbers, a counter and the maximum number of batches, as well as an SQL expression for fetching data from the database. In addition, we defined a fast analogue of the function keras::to_categorical() inside it. We used almost all of the data for training, leaving half a percent for validation, so the epoch size was limited by the parameter steps_per_epoch when calling keras::fit_generator(), and the condition if (i > max_i) only came into play for the validation iterator.
In the inner function, the row indexes for the next batch are retrieved, the records are fetched from the database while the batch counter is incremented, the JSON is parsed (by the function cpp_process_json_vector(), written in C++) and the arrays corresponding to the pictures are created. Then one-hot vectors with class labels are created, and the arrays with pixel values and labels are combined into a list, which is the return value. To speed things up, we used index creation in data.table tables and modification by reference; without these data.table features it is hard to imagine working effectively with any significant amount of data in R.
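Two small pieces of this description can be sketched in base R: a loop-free one-hot encoder in the spirit of the fast analogue of keras::to_categorical(), and the rescaling from [0, 1] to [-1, 1]. The names `one_hot` and `rescale_pixels` are hypothetical, and the 0-based labels are an assumption chosen to match the convention of keras::to_categorical(); the authors' actual helper may look different.

```r
# One-hot encode 0-based integer class labels using matrix indexing,
# which avoids an explicit loop over the rows.
one_hot <- function(labels, num_classes) {
  stopifnot(all(labels >= 0), all(labels < num_classes))
  m <- matrix(0, nrow = length(labels), ncol = num_classes)
  m[cbind(seq_along(labels), labels + 1L)] <- 1  # one (row, column) pair per label
  m
}

# Rescale pixel values from the interval [0, 1] to the interval [-1, 1]
rescale_pixels <- function(x) x * 2 - 1

y <- one_hot(c(0L, 2L, 1L), num_classes = 3)
px <- rescale_pixels(c(0, 0.5, 1))
```

Indexing with a two-column matrix sets exactly one cell per row, so the cost grows with the number of labels rather than with the size of the whole matrix.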
The results of measuring the speed on a Core i5 laptop are as follows:
doc <- '
Usage:
  train_nn.R --help
  train_nn.R --list-models
  train_nn.R [options]

Options:
  -h --help                   Show this message.
  -l --list-models            List available models.
  -m --model=<model>          Neural network model name [default: mobilenet_v2].
  -b --batch-size=<size>      Batch size [default: 32].
  -s --scale-factor=<ratio>   Scale factor [default: 0.5].
  -c --color                  Use color lines [default: FALSE].
  -d --db-dir=<path>          Path to database directory [default: Sys.getenv("db_dir")].
  -r --validate-ratio=<ratio> Validate sample ratio [default: 0.995].
  -n --n-gpu=<number>         Number of GPUs [default: 1].
'
args <- docopt::docopt(doc)
The docopt package is an implementation of http://docopt.org/ for R. With it, scripts are launched with simple commands such as Rscript bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db or ./bin/train_nn.R -m resnet50 -c -d /home/andrey/doodle_db, if the file train_nn.R is executable (this command starts training the resnet50 model on three-color images of 128x128 pixels; the database must be located in the folder /home/andrey/doodle_db). You can add the learning rate, the optimizer type and any other configurable parameters to the list. While preparing this publication, it turned out that the mobilenet_v2 architecture from the current version of keras cannot be used in R because of changes not yet accounted for in the R package; we are waiting for a fix.
This approach made it possible to speed up experiments with different models considerably compared with the more traditional way of launching scripts in RStudio (we note the tfruns package as a possible alternative). But the main advantage is the ability to easily manage script launches in Docker or simply on a server, without installing RStudio for this.
6. Dockerization of scripts
We used Docker to keep the model training environment portable between team members and to deploy quickly in the cloud. You can start getting acquainted with this tool, which is relatively unusual for an R programmer, with this series of publications or a video course.
Docker lets you both create your own images from scratch and use other images as a base for your own. Looking at the available options, we concluded that installing the NVIDIA drivers, CUDA + cuDNN and the Python libraries makes up a fairly large part of the image, so we decided to take the image tensorflow/tensorflow:1.12.0-gpu as a base and add the necessary R packages to it.
For convenience, the packages used were put into variables; most of the scripts we wrote are copied into the container during the build. We also changed the command shell to /bin/bash so that the contents of /etc/os-release could be used conveniently. This avoided the need to specify the OS version in the code.
In addition, a small bash script was written that lets you launch a container with various commands. For example, these could be scripts for training neural networks that were previously placed inside the container, or a command shell for debugging and monitoring the container:
Script to launch the container
#!/bin/sh
DBDIR=${PWD}/db
LOGSDIR=${PWD}/logs
MODELDIR=${PWD}/models
DATADIR=${PWD}/data
ARGS="--runtime=nvidia --rm -v ${DBDIR}:/db -v ${LOGSDIR}:/app/logs -v ${MODELDIR}:/app/models -v ${DATADIR}:/app/data"
if [ -z "$1" ]; then
CMD="Rscript /app/train_nn.R"
elif [ "$1" = "bash" ]; then
ARGS="${ARGS} -ti"
else
CMD="Rscript /app/train_nn.R $@"
fi
docker run ${ARGS} doodles-tf ${CMD}
If this bash script is run without parameters, the script train_nn.R is called inside the container with default values; if the first positional argument is "bash", the container starts interactively with a command shell. In all other cases, the values of the positional arguments are substituted: CMD="Rscript /app/train_nn.R $@".
It is worth noting that the directories with the source data and the database, as well as the directory for saving trained models, are mounted inside the container from the host system, which lets you access the results of the scripts without unnecessary manipulation.
7. Using multiple GPUs on Google Cloud
One of the features of the competition was very noisy data (see the title picture, borrowed from @Leigh.plt from the ODS slack). Large batches help to combat this, and after experiments on a PC with 1 GPU we decided to master training models on several GPUs in the cloud. We used Google Cloud (a good guide to the basics) because of the large selection of available configurations, reasonable prices and the $300 bonus. Out of greed, I ordered a 4xV100 instance with an SSD and a ton of RAM, and that was a big mistake. Such a machine eats money quickly; you can go broke experimenting without a proven pipeline. For learning purposes, it is better to take a K80. The large amount of RAM did come in handy, though: the cloud SSD did not impress with its performance, so the database was moved to /dev/shm.
Of greatest interest is the code fragment responsible for using multiple GPUs. First, the model is created on the CPU using a context manager, just as in Python:
What useful things were learned from this competition:
On relatively low-powered hardware, you can work with decent volumes of data (many times the size of RAM) without pain. The data.table package saves memory by modifying tables in place, which avoids copying them, and when used correctly it almost always shows the highest speed among all the tools for scripting languages known to us. Storing data in a database lets you, in many cases, not think at all about squeezing the entire dataset into RAM.
Slow functions in R can be replaced with fast ones in C++ using the Rcpp package. If you also use RcppThread or RcppParallel, you get multithreaded implementations, so there is no need to parallelize the code at the R level.
The Rcpp package can be used without serious knowledge of C++; the required minimum is described here. Header files for a number of great C++ libraries such as xtensor are available on CRAN, which means an infrastructure is taking shape for projects that integrate ready-made high-performance C++ code into R. An additional convenience is syntax highlighting and a static C++ code analyzer in RStudio.
docopt lets you run self-contained scripts with parameters. This is convenient on a remote server, including under Docker. In RStudio it is inconvenient to run many-hour experiments with training neural networks, and installing the IDE on the server itself is not always justified.
Docker ensures code portability and reproducibility of results between developers with different versions of the OS and libraries, as well as ease of execution on servers. You can launch the entire training pipeline with a single command.
Google Cloud is a budget-friendly way to experiment on expensive hardware, but you need to choose configurations carefully.
Measuring the speed of individual code fragments is very useful, especially when combining R and C++, and with the bench package it is also very easy.