Distributed Training with Apache MXNet and Horovod

This translation was prepared on the eve of the start of the course "Industrial ML on Big Data".

Distributed training across multiple high-performance compute nodes can cut the time needed to train modern deep neural networks on large amounts of data from weeks to hours or even minutes, which has made this technique prevalent in practical applications of deep learning. Users must understand how to communicate and aggregate data across multiple instances, because this has a large impact on scaling efficiency. They also need to know how to take a training script that runs on a single instance and deploy it across many instances.

In this article, we describe a quick and easy way to distribute training using the open-source deep learning library Apache MXNet and the Horovod distributed training framework. We demonstrate the performance benefits of Horovod and show how to write an MXNet training script so that it runs in a distributed fashion with Horovod.

What is Apache MXNet

Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts away the complexity of implementing neural networks, offers high performance and scalability, and provides APIs for popular programming languages such as Python, C++, Clojure, Java, Julia, R, Scala, and others.

Distributed training in MXNet with a parameter server

The standard distributed training approach in MXNet uses parameter servers. A set of parameter servers collects gradients from each worker, aggregates them, and sends the updated gradients back to the workers for the next optimization iteration. Choosing the right ratio of servers to workers is the key to efficient scaling. If there is only one parameter server, it can become a computational bottleneck. Conversely, if too many servers are used, the resulting many-to-many communication can saturate all of the network connections.
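To make the push, aggregate and pull cycle concrete, here is a minimal single-process simulation of one synchronous parameter-server iteration in plain Python. It is not MXNet's KVStore: `ToyParameterServer`, `push`, `pull` and `sgd_step` are hypothetical names, gradients are assumed to be flat lists of floats, and the server simply averages what it receives before each worker applies a plain SGD step.

```python
from typing import Dict, List


class ToyParameterServer:
    """Hypothetical single-process stand-in for a parameter server.

    Workers push gradients; once every worker has pushed, the server
    averages them element-wise and workers pull the aggregated gradient.
    """

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self._pending: Dict[int, List[float]] = {}
        self._aggregated: List[float] = []

    def push(self, worker_id: int, grad: List[float]) -> None:
        self._pending[worker_id] = grad
        if len(self._pending) == self.num_workers:
            # Every worker has reported: average the gradients element-wise.
            columns = zip(*self._pending.values())
            self._aggregated = [sum(col) / self.num_workers for col in columns]
            self._pending.clear()

    def pull(self) -> List[float]:
        return list(self._aggregated)


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied locally by each worker."""
    return [w - lr * g for w, g in zip(weights, grad)]


# One synchronous iteration with two workers that share the same starting weights.
server = ToyParameterServer(num_workers=2)
weights = [[1.0, 2.0], [1.0, 2.0]]
local_grads = [[0.5, 1.0], [1.5, 3.0]]
for worker_id, grad in enumerate(local_grads):
    server.push(worker_id, grad)
aggregated = server.pull()  # [1.0, 2.0]
weights = [sgd_step(w, aggregated, lr=0.1) for w in weights]
print(weights)
```

Because every worker pushes to and pulls from the same object, a single server handles all gradients of every iteration, which is the bottleneck described above; sharding the parameters across several servers spreads that load at the cost of more connections.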

What is Horovod

Horovod is an open-source distributed deep learning framework developed at Uber. It uses efficient cross-GPU and cross-node communication technologies such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI) to distribute and aggregate model parameters across workers. It makes good use of network bandwidth and scales well when working with deep neural network models. It currently supports several popular machine learning frameworks, namely MXNet, TensorFlow, Keras, and PyTorch.
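The collective at the heart of Horovod is allreduce, and Horovod's original design follows the ring-allreduce pattern. The sketch below simulates a summing ring-allreduce among n workers inside one Python process; it illustrates the data movement only and is not Horovod's or NCCL's implementation. `ring_allreduce` is a hypothetical helper, and it assumes the vector length is divisible by the number of workers.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a summing ring-allreduce over len(worker_data) workers.

    Each vector is split into n equal chunks. In the reduce-scatter phase,
    every worker passes one chunk to its right-hand neighbour per step and
    adds what it receives; in the allgather phase the fully reduced chunks
    travel around the ring once more and overwrite stale copies.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the number of workers")
    size = length // n
    # chunks[w][c] is chunk c as currently held by worker w.
    chunks = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
              for vec in worker_data]

    # Phase 1: reduce-scatter. After n - 1 steps, worker w holds the
    # complete sum of chunk (w + 1) % n.
    for step in range(n - 1):
        # Build all messages first so every send in a step happens "at once".
        messages = [((w + 1) % n, (w - step) % n, list(chunks[w][(w - step) % n]))
                    for w in range(n)]
        for dst, c, payload in messages:
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2: allgather. Forward the reduced chunks around the ring.
    for step in range(n - 1):
        messages = [((w + 1) % n, (w + 1 - step) % n, list(chunks[w][(w + 1 - step) % n]))
                    for w in range(n)]
        for dst, c, payload in messages:
            chunks[dst][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]


# Three workers, each holding a local gradient of length 3.
result = ring_allreduce([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]])
print(result[0])  # [111.0, 222.0, 333.0]
```

Each worker sends 2(n - 1) chunks of length L/n, which is just under 2L values in total regardless of n; this is why a ring-style allreduce keeps making good use of network bandwidth as the cluster grows.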

MXNet and Horovod integration

MXNet integrates with Horovod through the distributed training APIs defined in Horovod. The Horovod communication APIs horovod.broadcast(), horovod.allgather() and horovod.allreduce() are implemented with asynchronous callbacks of the MXNet engine, as part of its task graph. This way, data dependencies between communication and computation are handled seamlessly by the MXNet engine, avoiding performance losses due to synchronization. The distributed optimizer object defined in Horovod, horovod.DistributedOptimizer, extends the MXNet Optimizer so that it calls the corresponding Horovod APIs for distributed parameter updates. All of these implementation details are transparent to end users.
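The wrapping idea behind horovod.DistributedOptimizer can be illustrated without MXNet or MPI. In the sketch below, `ToySGD`, `ToyDistributedOptimizer` and `average_allreduce` are hypothetical stand-ins: the wrapper averages every worker's gradient (the result an averaging allreduce would produce) and then delegates the actual update to the wrapped optimizer, so all replicas stay identical. Real Horovod performs the allreduce asynchronously inside the MXNet engine, which this sketch does not model.

```python
from typing import List

Vector = List[float]


def average_allreduce(grads_per_worker: List[Vector]) -> List[Vector]:
    """Give every worker the element-wise mean of all workers' gradients."""
    n = len(grads_per_worker)
    mean = [sum(values) / n for values in zip(*grads_per_worker)]
    return [list(mean) for _ in range(n)]


class ToySGD:
    """Plain SGD update: w <- w - learning_rate * g."""

    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate

    def update(self, weight: Vector, grad: Vector) -> Vector:
        return [w - self.learning_rate * g for w, g in zip(weight, grad)]


class ToyDistributedOptimizer:
    """Hypothetical wrapper: allreduce the gradients, then call the inner optimizer."""

    def __init__(self, optimizer: ToySGD) -> None:
        self.optimizer = optimizer

    def update(self, weights_per_worker: List[Vector],
               grads_per_worker: List[Vector]) -> List[Vector]:
        reduced = average_allreduce(grads_per_worker)
        return [self.optimizer.update(w, g)
                for w, g in zip(weights_per_worker, reduced)]


opt = ToyDistributedOptimizer(ToySGD(learning_rate=0.5))
new_weights = opt.update([[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]])
print(new_weights)  # [[0.5, 0.0], [0.5, 0.0]]
```

Keeping the communication inside the optimizer wrapper is what lets the training loop itself stay unchanged.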

Quick start

You can quickly get started training a small neural network on the MNIST dataset using MXNet and Horovod on your MacBook.
First, install mxnet and horovod from PyPI:

pip install mxnet
pip install horovod

Note: If you encounter an error during pip install horovod, you may need to set the variable MACOSX_DEPLOYMENT_TARGET=10.vv, where vv is your macOS version number. For example, for macOS Sierra you would run MACOSX_DEPLOYMENT_TARGET=10.12 pip install horovod.
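If you script the installation, the workaround in the note can also be applied from Python. This is a small sketch, assuming you want the deployment target derived from `platform.mac_ver()`; `horovod_install_env` is a hypothetical helper, not part of pip or Horovod.

```python
import os
import platform
from typing import Dict, Optional


def horovod_install_env(macos_version: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the environment with MACOSX_DEPLOYMENT_TARGET set.

    macos_version defaults to the running system's version (e.g. "10.12.6").
    On non-macOS systems platform.mac_ver() returns "" and nothing is added.
    """
    version = macos_version if macos_version is not None else platform.mac_ver()[0]
    env = dict(os.environ)
    if version:
        # Keep only "major.minor", padding a bare major version with ".0".
        major, minor = (version.split(".") + ["0"])[:2]
        env["MACOSX_DEPLOYMENT_TARGET"] = f"{major}.{minor}"
    return env
```

Pass the returned dictionary as the env argument of subprocess.run([sys.executable, "-m", "pip", "install", "horovod"], env=...), which matches prefixing the shell command with MACOSX_DEPLOYMENT_TARGET=10.vv.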

Then install OpenMPI from here.

Finally, download the test script mxnet_mnist.py from here and run the following command in a MacBook terminal from the working directory:

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py

This runs training on two cores of your processor. The output looks like this:

INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec      accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec      accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec      accuracy=0.870000
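Because mpirun starts two processes, each one prints its own progress lines, which is presumably why Batch [50-100] appears twice with different accuracies. The sketch below sums the per-process speeds for each epoch and batch interval to estimate aggregate throughput. It assumes the log format shown above and that lines sharing an interval come from different processes; `total_speed_per_interval` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

LOG_PATTERN = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)-(\d+)\]\s+Speed: ([\d.]+) samples/sec\s+accuracy=([\d.]+)"
)


def total_speed_per_interval(lines: Iterable[str]) -> Dict[Tuple[int, int, int], float]:
    """Sum the 'Speed' values reported for the same (epoch, batch start, batch end)."""
    totals: Dict[Tuple[int, int, int], float] = defaultdict(float)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            epoch, start, end = (int(g) for g in match.group(1, 2, 3))
            totals[(epoch, start, end)] += float(match.group(4))
    return dict(totals)


log = """\
INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec      accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec      accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec      accuracy=0.870000
"""
print(total_speed_per_interval(log.splitlines()))
```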

Performance showcase

When training a ResNet50-v1 model on the ImageNet dataset on 64 GPUs, using eight p3.16xlarge EC2 instances with 8 NVIDIA Tesla V100 GPUs each in the AWS cloud, we achieved a training throughput of 45,000 images/sec (that is, the number of training samples processed per second). Training completed in 44 minutes after 90 epochs, with a best accuracy of 75.7%.
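As a sanity check on these figures, the arithmetic below derives per-GPU and per-instance throughput and the time 90 epochs would take at a steady 45,000 images/sec. It assumes the standard ImageNet-1k (ILSVRC-2012) training set of 1,281,167 images. The result, roughly 42.7 minutes, is close to the reported 44 minutes; the difference would plausibly come from startup, evaluation and throughput variation, which this calculation ignores.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k (ILSVRC-2012) training set size (assumption)
EPOCHS = 90                        # from the text
THROUGHPUT = 45_000                # images/sec across the whole cluster (from the text)
GPUS = 64                          # from the text
INSTANCES = 8                      # p3.16xlarge instances (from the text)

per_gpu = THROUGHPUT / GPUS            # images/sec handled by each GPU
per_instance = THROUGHPUT / INSTANCES  # images/sec handled by each instance
pure_training_minutes = EPOCHS * IMAGENET_TRAIN_IMAGES / THROUGHPUT / 60

print(per_gpu, per_instance, round(pure_training_minutes, 1))  # 703.125 5625.0 42.7
```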

We compared this with MXNet's distributed training approach based on parameter servers on 8, 16, 32 and 64 GPUs, using a single parameter server and server-to-worker ratios of 1 to 1 and 2 to 1, respectively. You can see the results in Figure 1 below. On the left y-axis, the bars show the number of images trained per second; on the right y-axis, the lines show scaling efficiency (that is, the ratio of actual to ideal throughput). As you can see, the number of servers affects scaling efficiency. With only one parameter server, scaling efficiency drops to 38% on 64 GPUs. To achieve the same scaling efficiency as Horovod, you need twice as many servers as workers.

Figure 1. Comparison of distributed training with MXNet using Horovod and using a parameter server
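Scaling efficiency, as plotted on the right-hand axis, is measured throughput divided by ideal linear throughput, that is, N GPUs times the single-GPU rate. The helpers below are a minimal sketch of that definition; the single-GPU and 64-GPU rates in the example are hypothetical values chosen to reproduce a 38% efficiency, not measurements from Figure 1.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, num_gpus: int) -> float:
    """Ratio of measured throughput to ideal linear scaling."""
    return throughput_n / (num_gpus * throughput_1)


def effective_gpus(efficiency: float, num_gpus: int) -> float:
    """Number of perfectly scaled GPUs that would deliver the same throughput."""
    return efficiency * num_gpus


# Hypothetical rates, not measurements from the article:
# 1,000 images/sec on one GPU and 24,320 images/sec on 64 GPUs.
eff = scaling_efficiency(24_320, 1_000, 64)
print(f"{eff:.0%}")  # 38%

# At the single-server efficiency quoted above (38% on 64 GPUs), the cluster
# delivers about the throughput of 24 ideally scaled GPUs.
print(round(effective_gpus(0.38, 64), 2))  # 24.32
```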

In Table 1 below, we compare the final cost per instance when running the experiments on 64 GPUs. Using MXNet with Horovod delivers the best throughput at the lowest cost.

Table 1. Cost comparison between Horovod and a parameter server with a server-to-worker ratio of 2 to 1.
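The footnote at the end of the article states that costs are based on hourly EC2 prices. A minimal way to reproduce that kind of comparison is to multiply instance-hours by the hourly rate and sum over instance types. In the sketch, `run_cost` is a hypothetical helper, every price is a placeholder rather than an actual AWS rate, and the parameter-server run length and server count are invented for illustration; only the eight GPU instances and the 44-minute Horovod run come from the text.

```python
from typing import Dict


def run_cost(hours: float, instance_counts: Dict[str, int],
             hourly_prices: Dict[str, float]) -> float:
    """Cost of one run: hours * sum(count * hourly price) over instance types."""
    return hours * sum(count * hourly_prices[kind]
                       for kind, count in instance_counts.items())


# Placeholder prices per instance-hour; substitute current EC2 on-demand rates.
prices = {"gpu_instance": 1.0, "server_instance": 0.1}

# Horovod: 8 GPU instances for the 44-minute run reported above.
horovod_cost = run_cost(44 / 60, {"gpu_instance": 8}, prices)

# Parameter server: run length and number of extra server instances are hypothetical.
ps_cost = run_cost(1.0, {"gpu_instance": 8, "server_instance": 16}, prices)

print(f"Horovod: {horovod_cost:.2f}  parameter server: {ps_cost:.2f}")
```

The structure of the formula shows why the parameter-server setup tends to cost more: it pays for the extra server instances and for any additional wall-clock time caused by lower scaling efficiency.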

Steps to reproduce

In the following steps, we show you how to reproduce the distributed training result using MXNet and Horovod. To learn more about distributed training with MXNet, read this post.

Step 1

Create a cluster of homogeneous instances with MXNet version 1.4.0 or later and Horovod version 0.16.0 or later to use distributed training. You will also need to install the libraries for GPU training. For our instances, we chose Ubuntu 16.04 Linux with GPU driver 396.44, CUDA 9.2, the cuDNN 7.2.1 library, the NCCL 2.2.13 communicator, and OpenMPI 3.1.1. You can also use the Amazon Deep Learning AMI, which comes with these libraries preinstalled.
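Before launching, it can help to confirm that every instance meets the minimum versions above. The sketch below reads installed versions with importlib.metadata (Python 3.8+) and compares their numeric prefixes. It assumes the distributions are named mxnet and horovod (GPU builds of MXNet are published under names such as mxnet-cu92, so adjust the name accordingly); `version_tuple`, `meets_minimum` and `check_installed` are hypothetical helpers.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum versions stated in Step 1.
MIN_VERSIONS = {"mxnet": "1.4.0", "horovod": "0.16.0"}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Leading numeric part of a version string, e.g. '1.4.0b2' -> (1, 4, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, required: str) -> bool:
    """Compare numerically, padding with zeros so '0.16' equals '0.16.0'."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_installed(package: str) -> Optional[bool]:
    """True/False if the package meets MIN_VERSIONS, None if it is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, MIN_VERSIONS[package])


for name in MIN_VERSIONS:
    print(name, check_installed(name))
```

Comparing integer tuples rather than raw strings avoids the classic mistake of treating "1.10.0" as older than "1.4.0".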

Step 2

Add Horovod API support to your MXNet training script. The script below is based on the MXNet Gluon API and can serve as a simple template. If you already have a suitable training script, only the Horovod-related lines need to be added: the horovod.mxnet import, the lines marked with "# Horovod:" comments, and kvstore=None when creating the Trainer. Here are some of the key changes you need to make to train with Horovod:

  • Set the context based on the Horovod local rank (line 8) so that each process trains on the correct GPU.
  • Broadcast the initial parameters from one worker to all workers (line 18) so that all workers start with the same initial parameters.
  • Create a Horovod DistributedOptimizer (line 25) to update the parameters in a distributed way.

For the complete script, see the Horovod-MXNet MNIST and ImageNet examples.

1  import mxnet as mx
2  import horovod.mxnet as hvd
3
4  # Horovod: initialize Horovod
5  hvd.init()
6
7  # Horovod: pin a GPU to be used to local rank
8  context = mx.gpu(hvd.local_rank())
9
10 # Build model
11 model = ...
12
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
16
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
19
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
23
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
26
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
30
31 # Train model
32 for epoch in range(num_epoch):
33    ...
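Line 18 matters because each process initializes its parameters independently. This toy, single-process sketch shows the effect: four hypothetical replicas start from different random values and become identical once every replica copies the parameters of root rank 0. `init_replica` and `broadcast` are illustrative stand-ins; the real hvd.broadcast_parameters sends the tensors from rank 0 to the other processes over the network.

```python
import random
from typing import List


def init_replica(seed: int, size: int) -> List[float]:
    """Random initialization that differs per worker, as it would without a shared seed."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]


def broadcast(replicas: List[List[float]], root_rank: int = 0) -> List[List[float]]:
    """Every worker receives its own copy of the root worker's parameters."""
    return [list(replicas[root_rank]) for _ in replicas]


workers = [init_replica(seed=rank, size=4) for rank in range(4)]
print(all(w == workers[0] for w in workers))  # False: replicas differ before broadcast
workers = broadcast(workers, root_rank=0)
print(all(w == workers[0] for w in workers))  # True
```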

Step 3

Log in to one of the workers and start distributed training with the MPI command. In this example, distributed training runs on four instances with 4 GPUs each, for a total of 16 GPUs in the cluster. The Stochastic Gradient Descent (SGD) optimizer is used with the following hyperparameters:

  • mini-batch size: 256
  • learning rate: 0.1
  • momentum: 0.9
  • weight decay: 0.0001

As we scaled from one GPU to 64 GPUs, we scaled the learning rate linearly with the number of GPUs (from 0.1 for 1 GPU to 6.4 for 64 GPUs), while keeping the number of images per GPU constant at 256 (from a batch of 256 images for 1 GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters were changed as the number of GPUs increased. We used mixed-precision training, with the float16 data type for the forward pass and float32 for gradients, to take advantage of the fast float16 computation supported by NVIDIA Tesla GPUs.
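The linear scaling rule described above fits in a few lines. This sketch only reproduces the arithmetic stated in the text (a learning rate of 0.1 and 256 images per GPU, both multiplied by the GPU count); `scaled_hyperparameters` is a hypothetical helper, and the weight-decay and momentum adjustments are not modeled because their values are not given.

```python
BASE_LR = 0.1        # learning rate for a single GPU (from the text)
PER_GPU_BATCH = 256  # images per GPU, held constant (from the text)


def scaled_hyperparameters(num_gpus: int) -> dict:
    """Linear scaling rule: learning rate and global batch grow with the GPU count."""
    return {
        "learning_rate": BASE_LR * num_gpus,
        "global_batch_size": PER_GPU_BATCH * num_gpus,
    }


for gpus in (1, 16, 64):
    print(gpus, scaled_hyperparameters(gpus))
```

Scaling the learning rate together with the global batch keeps the size of each update roughly comparable to single-GPU training, since the averaged gradient now summarizes proportionally more samples.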

$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python mxnet_imagenet_resnet50.py
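The command above starts 16 processes across four hosts. Assuming OpenMPI's -map-by slot placement fills each host's slots in the order the hosts are listed, the sketch below shows where each global rank lands and which GPU line 8 of the Step 2 script (mx.gpu(hvd.local_rank())) would select. `map_by_slot` is a hypothetical helper that models this placement; it is not an OpenMPI API.

```python
from typing import Dict, List, Tuple


def map_by_slot(hosts: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (global_rank, host, local_rank) with slots filled host by host."""
    placement = []
    rank = 0
    for host, slots in hosts.items():
        for local_rank in range(slots):
            placement.append((rank, host, local_rank))
            rank += 1
    return placement


hosts = {"server1": 4, "server2": 4, "server3": 4, "server4": 4}
for rank, host, local_rank in map_by_slot(hosts):
    # hvd.rank() would be `rank`, hvd.local_rank() would be `local_rank`,
    # and mx.gpu(local_rank) pins this process to that GPU on `host`.
    print(f"rank {rank:2d} -> {host}, local rank {local_rank} -> mx.gpu({local_rank})")
```

Using the local rank rather than the global rank for GPU selection is what keeps rank 5, for example, on GPU 1 of server2 instead of asking for a nonexistent GPU 5.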

Conclusion

In this article, we looked at a scalable approach to distributed model training using Apache MXNet and Horovod. We demonstrated its scaling efficiency and cost-effectiveness compared with the parameter server approach by training a ResNet50-v1 model on the ImageNet dataset. We also included steps you can use to modify an existing script to run multi-instance training with Horovod.

If you are just getting started with MXNet and deep learning, go to the MXNet installation page to build MXNet first. We also strongly recommend reading the MXNet in 60 minutes tutorial to get started.

If you have already worked with MXNet and want to try distributed training with Horovod, take a look at the Horovod installation page, build Horovod with MXNet support, and follow the MNIST or ImageNet example.

* Cost is calculated from hourly AWS prices for EC2 instances.


Source: www.habr.com
