This translation of the article was prepared ahead of the start of the course.
Distributed training on multiple high-performance computing instances can cut the training time of modern deep neural networks on large datasets from weeks to hours or even minutes, which has made this technique widespread in practical deep learning applications. Users need to understand how data is shared and synchronized across multiple instances, because this has a major impact on scaling efficiency. Users should also know how to deploy a training script that runs on a single instance to multiple instances.
In this article, we describe a quick and simple way to distribute training using the open-source deep learning library Apache MXNet and the Horovod distributed training framework. We demonstrate the performance benefits of Horovod and show how to write an MXNet training script so that it runs in a distributed fashion with Horovod.
What is Apache MXNet
Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts away the complexity of implementing neural networks, is highly performant and scalable, and offers APIs for a range of popular programming languages.
Distributed training in MXNet with a parameter server
Standard distributed training in MXNet uses the parameter server approach. A set of parameter servers collects gradients from each worker, aggregates them, and sends the updated gradients back to the workers for the next optimization iteration. Choosing the right ratio of servers to workers is the key to efficient scaling. With only one parameter server, that server can become the bottleneck of the computation. Conversely, with too many servers, many-to-many communication can saturate all network connections.
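The push, aggregate, and pull cycle described above can be sketched in a few lines. This is a single-process toy simulation, not MXNet's implementation; the function names, the plain averaging rule, and the sample gradients are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def aggregate_on_server(worker_grads: List[Vector]) -> Vector:
    """Server-side step: average the gradients pushed by all workers."""
    n_workers = len(worker_grads)
    return [sum(column) / n_workers for column in zip(*worker_grads)]


def parameter_server_step(weights: Vector, worker_grads: List[Vector], lr: float) -> Vector:
    """One iteration: workers push gradients, the server aggregates them,
    and each worker pulls the same aggregated gradient to update its weights."""
    avg_grad = aggregate_on_server(worker_grads)
    return [w - lr * g for w, g in zip(weights, avg_grad)]


# Hypothetical gradients from three workers for a two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new_weights = parameter_server_step([0.0, 0.0], grads, lr=0.5)
print(new_weights)  # [-1.5, -2.0]
```

In this picture every worker talks to the server on every iteration, which is why a single server can become a bottleneck as the number of workers grows.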
What is Horovod
Horovod is an open-source distributed deep learning framework developed at Uber. It uses efficient cross-GPU and cross-node technologies such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI) to distribute and aggregate model parameters across workers. It optimizes the use of network bandwidth and scales well when working with deep neural network models. It currently supports several popular machine learning frameworks, namely MXNet, TensorFlow, Keras, and PyTorch.
MXNet and Horovod integration
MXNet integrates with Horovod through the distributed training APIs defined in Horovod. The Horovod communication APIs horovod.broadcast(), horovod.allgather() and horovod.allreduce() are implemented as asynchronous callbacks of the MXNet engine, as part of its task graph. This lets the MXNet engine handle the data dependencies between communication and computation, avoiding performance losses caused by synchronization. The distributed optimizer object defined in Horovod, horovod.DistributedOptimizer, extends Optimizer in MXNet so that it calls the corresponding Horovod APIs for distributed parameter updates. All of these implementation details are transparent to end users.
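To make the three collective operations concrete, here is a minimal single-process sketch of what each one leaves on every worker. It is not Horovod code: the helper functions below only borrow the operation names, each inner list stands in for one worker's tensor, and the average flag is an assumption of this sketch.

```python
from typing import List

PerWorker = List[List[float]]  # one inner list per simulated worker


def allreduce(per_worker: PerWorker, average: bool = True) -> PerWorker:
    """Every worker ends up with the element-wise sum (or mean) of all inputs."""
    reduced = [sum(column) for column in zip(*per_worker)]
    if average:
        reduced = [value / len(per_worker) for value in reduced]
    return [list(reduced) for _ in per_worker]


def broadcast(per_worker: PerWorker, root_rank: int = 0) -> PerWorker:
    """Every worker ends up with a copy of the root worker's values."""
    return [list(per_worker[root_rank]) for _ in per_worker]


def allgather(per_worker: PerWorker) -> PerWorker:
    """Every worker ends up with the concatenation of all workers' values."""
    gathered = [value for values in per_worker for value in values]
    return [list(gathered) for _ in per_worker]


workers = [[1.0, 2.0], [3.0, 4.0]]
print(allreduce(workers))               # [[2.0, 3.0], [2.0, 3.0]]
print(broadcast(workers, root_rank=1))  # [[3.0, 4.0], [3.0, 4.0]]
print(allgather(workers))               # [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
```

In distributed training, allreduce is the operation that combines gradients across workers, while broadcast is what makes all workers start from identical parameters.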
Quick start
You can quickly start training a small convolutional neural network on the MNIST dataset using MXNet and Horovod on your MacBook.
First, install mxnet and horovod from PyPI:
pip install mxnet
pip install horovod
Note: If you encounter an error during pip install horovod, you may need to set the variable MACOSX_DEPLOYMENT_TARGET=10.vv, where vv is your macOS version. For example, on macOS Sierra you would run MACOSX_DEPLOYMENT_TARGET=10.12 pip install horovod
Then install OpenMPI.
Finally, download the test script mxnet_mnist.py and run the following command in the MacBook terminal from the working directory:
mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py
This runs training on two cores of your processor. The output will look like this:
INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec accuracy=0.870000
Performance demo
When training a ResNet50-v1 model on the ImageNet dataset on 64 GPUs across eight p3.16xlarge EC2 instances, each with 8 NVIDIA Tesla V100 GPUs in the AWS cloud, we achieved a training throughput of 45,000 images/sec (that is, the number of training samples processed per second). Training finished in 44 minutes after 90 epochs, with a best accuracy of 75.7%.
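As a rough sanity check, the reported throughput and epoch count can be turned into an expected training time. The sketch below assumes the ImageNet-1k training set of 1,281,167 images, a figure that is not stated in the article, and ignores validation passes and startup overhead.

```python
# Assumption (not from the article): size of the ImageNet-1k training set.
IMAGENET_TRAIN_IMAGES = 1_281_167


def estimated_training_minutes(epochs: int, images_per_sec: float,
                               images_per_epoch: int = IMAGENET_TRAIN_IMAGES) -> float:
    """Compute time implied by a steady throughput, ignoring evaluation and startup."""
    total_images = epochs * images_per_epoch
    return total_images / images_per_sec / 60.0


minutes = estimated_training_minutes(epochs=90, images_per_sec=45_000)
print(f"{minutes:.1f} minutes")
```

Under that assumption the estimate comes out at roughly 43 minutes, which is in line with the 44 minutes reported above.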
We compared this with MXNet's distributed training approach based on parameter servers on 8, 16, 32, and 64 GPUs, using a single parameter server and server-to-worker ratios of 1 to 1 and 2 to 1, respectively. The result is shown in Figure 1 below. The bars, read against the left y-axis, show the number of images trained per second; the lines, read against the right y-axis, show the scaling efficiency (that is, the ratio of actual to ideal throughput). As you can see, the choice of the number of servers affects scaling efficiency. With only one parameter server, scaling efficiency drops to 38% on 64 GPUs. To reach the same scaling efficiency as Horovod, you need to increase the number of servers relative to the number of workers.

Figure 1. Comparison of distributed training using MXNet with Horovod and with a parameter server
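The scaling efficiency plotted in Figure 1 is the ratio of actual to ideal throughput, where ideal throughput means perfect linear scaling from a single GPU. A minimal sketch of that calculation follows; the function name and the sample numbers are hypothetical and are not measurements from the article.

```python
def scaling_efficiency(actual_throughput: float,
                       single_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Ratio of measured throughput to the throughput of perfect linear scaling."""
    ideal_throughput = single_gpu_throughput * num_gpus
    return actual_throughput / ideal_throughput


# Hypothetical numbers: 4 GPUs reach 3000 images/sec, one GPU alone reaches 1000.
efficiency = scaling_efficiency(actual_throughput=3000.0,
                                single_gpu_throughput=1000.0,
                                num_gpus=4)
print(f"{efficiency:.0%}")  # 75%
```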
In Table 1 below, we compare the final cost per instance when running the experiments on 64 GPUs. Using MXNet with Horovod gives the best throughput at the lowest cost.

Table 1. Cost comparison between Horovod and the parameter server with a server-to-worker ratio of 2 to 1.
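A cost figure of this kind can be derived from the number of instances, the training time, and an hourly instance rate. The sketch below shows the arithmetic only; the hourly rate is a made-up placeholder, not an AWS price, and the function name is hypothetical.

```python
def training_cost(num_instances: int, training_minutes: float,
                  hourly_rate_usd: float) -> float:
    """Total cost of a run billed per instance-hour."""
    return num_instances * (training_minutes / 60.0) * hourly_rate_usd


# 8 instances and 44 minutes come from the article; 25.0 USD/hour is a placeholder rate.
cost = training_cost(num_instances=8, training_minutes=44, hourly_rate_usd=25.0)
print(f"${cost:.2f}")
```

With a parameter server setup, any additional server instances would be counted in num_instances as well, which is one way the server-to-worker ratio feeds into the cost.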
Steps to reproduce
In the following steps, we show how to reproduce the distributed training result using MXNet and Horovod.
Step 1
Create a cluster of homogeneous instances with MXNet version 1.4.0 or higher and Horovod version 0.16.0 or higher to use distributed training. You will also need to install the libraries for GPU training. For our instances, we chose Ubuntu 16.04 Linux with GPU driver 396.44, CUDA 9.2, the cuDNN 7.2.1 library, the NCCL 2.2.13 communicator, and OpenMPI 3.1.1. Alternatively, you can use a prebuilt image where these libraries are already installed.
Step 2
Add Horovod API support to your MXNet training script. The script below, based on the MXNet Gluon API, can be used as a simple template. If you already have a working training script, the lines you need to add are the horovod.mxnet import and the lines marked with a "# Horovod:" comment. Here are the few critical changes required to train with Horovod:
- Set the context according to the Horovod local rank (line 8) so that training runs on the correct GPU.
- Broadcast the initial parameters from one worker to all the others (line 18) so that every worker starts from the same initial parameters.
- Create a Horovod DistributedOptimizer (line 25) to update the parameters in a distributed fashion.
For the full script, please refer to the Horovod-MXNet examples.
1 import mxnet as mx
2 import horovod.mxnet as hvd
3
4 # Horovod: initialize Horovod
5 hvd.init()
6
7 # Horovod: pin a GPU to be used to local rank
8 context = mx.gpu(hvd.local_rank())
9
10 # Build model
11 model = ...
12
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
16
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
19
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
23
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
26
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
30
31 # Train model
32 for epoch in range(num_epoch):
33     ...
Step 3
Log in to one of the workers to start distributed training using the MPI directive. In this example, distributed training runs on four instances with 4 GPUs each, for a total of 16 GPUs in the cluster. The Stochastic Gradient Descent (SGD) optimizer is used with the following hyperparameters:
- mini-batch size: 256
- learning rate: 0.1
- momentum: 0.9
- weight decay: 0.0001
As we scaled from one GPU to 64 GPUs, we scaled the learning rate linearly with the number of GPUs (from 0.1 for 1 GPU to 6.4 for 64 GPUs) while keeping the number of images per GPU at 256 (from a batch of 256 images for 1 GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters changed as the number of GPUs increased. We used mixed-precision training, with the float16 data type for the forward pass and float32 for gradients, to take advantage of the accelerated float16 computation supported by NVIDIA Tesla GPUs.
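The linear scaling rule described above is simple enough to express directly. The following sketch uses the base values from the hyperparameter list (learning rate 0.1, 256 images per GPU); the function name and the dictionary layout are illustrative choices.

```python
def scale_hyperparameters(num_gpus: int, base_lr: float = 0.1,
                          per_gpu_batch: int = 256) -> dict:
    """Scale the learning rate linearly with the GPU count and
    keep the per-GPU batch size fixed, so the global batch grows with it."""
    return {
        "learning_rate": base_lr * num_gpus,
        "global_batch_size": per_gpu_batch * num_gpus,
    }


for n in (1, 16, 64):
    print(n, scale_hyperparameters(n))
```

For 64 GPUs this gives a learning rate of 6.4 and a global batch of 16,384 images, matching the figures in the text.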
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python mxnet_imagenet_resnet50.py
Conclusion
In this article, we looked at a scalable approach to distributed model training using Apache MXNet and Horovod. We demonstrated its scaling efficiency and cost-effectiveness compared with the parameter server approach, training a ResNet50-v1 model on the ImageNet dataset. We also provided the steps you can use to modify an existing script to run multi-instance training with Horovod.
If you are just getting started with MXNet and deep learning, go to the installation page to build MXNet first. We also strongly recommend reading an introductory article to get started.
If you already work with MXNet and want to try distributed training with Horovod, take a look at the Horovod installation instructions, build it with MXNet support, and follow the examples.
* Cost is calculated based on AWS hourly rates for EC2 instances
Source: www.habr.com
