Distributed training with Apache MXNet and Horovod

This translation was prepared ahead of the start of the course "Industrial ML on Big Data".

Distributed training across multiple high-performance compute instances can cut the training time of modern deep neural networks on large datasets from weeks to hours or even minutes, which has made this technique common in practical deep learning applications. Users need to understand how to share and synchronize data across multiple instances, which in turn has a major impact on scaling efficiency. They also need to know how to deploy a training script that runs on a single instance to multiple instances.

In this article we describe a quick and easy way to run distributed training using the open-source deep learning library Apache MXNet and the distributed training framework Horovod. We show the performance benefits of Horovod and demonstrate how to write an MXNet training script so that it runs in a distributed fashion with Horovod.

What is Apache MXNet

Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts away the complexity of implementing neural networks, is high-performance and scalable, and offers APIs for popular programming languages such as Python, C++, Clojure, Java, Julia, R, Scala, and more.

Distributed training in MXNet with a parameter server

The standard distributed training module in MXNet uses a parameter server approach. It uses a set of parameter servers to collect gradients from each worker, aggregate them, and send the updated gradients back to the workers for the next optimization iteration. Choosing the right ratio of servers to workers is the key to efficient scaling. With only one parameter server, that server can become a computational bottleneck. Conversely, with too many servers, many-to-many communication can saturate all the network connections.
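
To make the aggregation step concrete, here is a toy, single-process sketch of one parameter-server round. It is not MXNet's implementation: the function name `parameter_server_step` and the choice to average the gradients are assumptions made for illustration.

```python
from typing import List


def parameter_server_step(worker_grads: List[List[float]]) -> List[float]:
    """Aggregate one gradient vector per worker into a single averaged vector.

    Hypothetical helper: a real parameter server shards parameters across
    several server processes and talks to the workers over the network.
    """
    num_workers = len(worker_grads)
    # Sum the gradients element-wise across workers, then average.
    return [sum(column) / num_workers for column in zip(*worker_grads)]


# Three workers each push a gradient for the same two parameters.
grads = [[0.5, -1.0], [1.5, 0.0], [1.0, 1.0]]
aggregated = parameter_server_step(grads)

# Every worker pulls the same aggregated result for the next iteration.
pulled = [list(aggregated) for _ in grads]
```

With a single server, every push and pull in this loop goes through one process, which is where the bottleneck described above comes from.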

What is Horovod

Horovod is an open-source distributed deep learning framework developed at Uber. It leverages efficient inter-GPU and inter-node technologies such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI) to distribute and aggregate model parameters across workers. It optimizes the use of network bandwidth and scales well with deep neural network models. It currently supports several popular machine learning frameworks, namely MXNet, TensorFlow, Keras, and PyTorch.

MXNet and Horovod integration

MXNet integrates with Horovod through the distributed training APIs defined in Horovod. The Horovod communication APIs horovod.broadcast(), horovod.allgather(), and horovod.allreduce() are implemented using asynchronous callbacks of the MXNet engine, as part of its task graph. In this way, data dependencies between communication and computation are handled by the MXNet engine, which avoids performance losses due to synchronization. The distributed optimizer object defined in Horovod, horovod.DistributedOptimizer, extends Optimizer in MXNet so that it calls the corresponding Horovod APIs for distributed parameter updates. All of these implementation details are transparent to end users.
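
The semantics of the three collectives can be illustrated without a cluster. The sketch below simulates them in plain Python, with one list entry per worker. The function names are hypothetical stand-ins, not Horovod's implementation, and averaging in the allreduce is an assumption.

```python
from typing import List

Vector = List[float]


def sim_broadcast(values: List[Vector], root_rank: int) -> List[Vector]:
    """Every worker ends up with a copy of the root worker's vector."""
    return [list(values[root_rank]) for _ in values]


def sim_allgather(values: List[Vector]) -> List[Vector]:
    """Every worker ends up with all vectors concatenated in rank order."""
    gathered = [x for vec in values for x in vec]
    return [list(gathered) for _ in values]


def sim_allreduce(values: List[Vector]) -> List[Vector]:
    """Every worker ends up with the element-wise average of all vectors."""
    n = len(values)
    averaged = [sum(col) / n for col in zip(*values)]
    return [list(averaged) for _ in values]


per_worker = [[1.0, 2.0], [3.0, 4.0]]  # list index = worker rank
after_broadcast = sim_broadcast(per_worker, root_rank=0)
after_allgather = sim_allgather(per_worker)
after_allreduce = sim_allreduce(per_worker)
```

In distributed training, broadcast is typically used once to synchronize initial parameters, while allreduce runs on every step to combine gradients.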

Quick start

You can quickly start training a small convolutional neural network on the MNIST dataset using MXNet and Horovod on your MacBook.
First, install mxnet and horovod from PyPI:

pip install mxnet
pip install horovod

Note: if you hit an error during pip install horovod, you may need to set the variable MACOSX_DEPLOYMENT_TARGET=10.vv, where vv is your macOS version. For example, on macOS Sierra you would run MACOSX_DEPLOYMENT_TARGET=10.12 pip install horovod

Next, install OpenMPI.

Finally, download the test script mxnet_mnist.py and run the following command in a terminal on your MacBook from the working directory:

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py

This runs training on two cores of your CPU. The output looks like this:

INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec      accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec      accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec      accuracy=0.870000

Performance demo

When training a ResNet50-v1 model on the ImageNet dataset on 64 GPUs across eight p3.16xlarge EC2 instances, each with 8 NVIDIA Tesla V100 GPUs, on the AWS cloud, we achieved a training throughput of 45,000 images/sec (that is, the number of training samples processed per second). Training completed in 44 minutes after 90 epochs, with a best accuracy of 75.7%.

We compared this with MXNet's distributed training approach using parameter servers on 8, 16, 32, and 64 GPUs, with a single parameter server and with server-to-worker ratios of 1 to 1 and 2 to 1, respectively. The results are shown in Figure 1 below. The bars, read against the left y-axis, show the number of images trained per second. The lines, read against the right y-axis, show the scaling efficiency (that is, the ratio of actual to ideal throughput). As you can see, the choice of the number of servers affects scaling efficiency. With only one parameter server, scaling efficiency drops to 38% on 64 GPUs. To match the scaling efficiency of Horovod, you need to double the number of servers relative to the number of workers.
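
Scaling efficiency is the measured throughput divided by the ideal throughput. A minimal sketch follows, assuming that "ideal" means the single-GPU throughput multiplied by the number of GPUs. The function name is hypothetical, and the numbers in the example are made-up placeholders, not measurements from these experiments.

```python
def scaling_efficiency(actual_throughput: float,
                       single_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Ratio of measured throughput to ideal linear-scaling throughput."""
    if num_gpus <= 0 or single_gpu_throughput <= 0:
        raise ValueError("num_gpus and single_gpu_throughput must be positive")
    ideal = single_gpu_throughput * num_gpus
    return actual_throughput / ideal


# Placeholder numbers for illustration only (not from the experiments).
example = scaling_efficiency(actual_throughput=3200.0,
                             single_gpu_throughput=100.0,
                             num_gpus=64)
print(f"scaling efficiency: {example:.0%}")
```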

Figure 1. Comparison of distributed training using MXNet with Horovod and with a parameter server

In Table 1 below, we compare the total instance cost of running the experiments on 64 GPUs. Using MXNet with Horovod delivers the best throughput at the lowest cost.

Table 1. Cost comparison between Horovod and Parameter Server with a server-to-worker ratio of 2 to 1.

Steps to reproduce

In the following steps, we show you how to reproduce the distributed training results using MXNet and Horovod. To learn more about distributed training with MXNet, read this post.

Step 1

Create a cluster of homogeneous instances with MXNet version 1.4.0 or later and Horovod version 0.16.0 or later to use distributed training. You also need to install the libraries for GPU training. For our instances, we chose Ubuntu 16.04 Linux with GPU driver 396.44, CUDA 9.2, the cuDNN 7.2.1 library, the NCCL 2.2.13 communicator, and OpenMPI 3.1.1. Alternatively, you can use the Amazon Deep Learning AMI, where these libraries come preinstalled.

Step 2

Add Horovod API support to your MXNet training script. The script below, based on the MXNet Gluon API, can serve as a simple template. The Horovod-specific lines, highlighted in the original, are the ones to add if you already have a training script. Here are the critical changes you need to make to train with Horovod:

  • Set the context according to the Horovod local rank (line 8) so that training runs on the correct GPU.
  • Broadcast the initial parameters from one worker to all the others (line 18) so that every worker starts from the same initial parameters.
  • Create a Horovod DistributedOptimizer (line 25) to update the parameters in a distributed manner.

For the complete script, see the Horovod-MXNet MNIST and ImageNet examples.

1  import mxnet as mx
2  import horovod.mxnet as hvd
3
4  # Horovod: initialize Horovod
5  hvd.init()
6
7  # Horovod: pin a GPU to be used to local rank
8  context = mx.gpu(hvd.local_rank())
9
10 # Build model
11 model = ...
12
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
16
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
19
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
23
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
26
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
30
31 # Train model
32 for epoch in range(num_epoch):
33    ...
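
The template leaves the data pipeline elided. One common way to keep workers from training on the same samples is to give each rank its own slice of the dataset. The sketch below is an assumption about how that could look, not part of the script above: `shard_indices` is a hypothetical helper, and in a real script `rank` and `size` would come from `hvd.rank()` and `hvd.size()`.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, size: int) -> List[int]:
    """Return the sample indices owned by worker `rank` out of `size` workers.

    Uses a strided split, so shard sizes differ by at most one sample.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    return list(range(rank, num_samples, size))


# With 10 samples and 4 workers, each index is assigned to exactly one worker.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
```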

Step 3

Log in to one of the workers to start distributed training using the MPI directive. In this example, distributed training runs on four instances with 4 GPUs each, for a total of 16 GPUs in the cluster. The stochastic gradient descent (SGD) optimizer is used with the following hyperparameters:

  • mini-batch size: 256
  • learning rate: 0.1
  • momentum: 0.9
  • weight decay: 0.0001

When scaling from one GPU to 64 GPUs, we scaled the learning rate linearly with the number of GPUs (from 0.1 for 1 GPU to 6.4 for 64 GPUs) while keeping the number of images per GPU fixed at 256 (from a batch of 256 images for 1 GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters were not changed as the number of GPUs increased. We used mixed-precision training, with the float16 data type for the forward pass and float32 for gradients, to take advantage of the accelerated float16 computation supported by NVIDIA Tesla GPUs.
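
The linear scaling rule is simple arithmetic: with a fixed per-GPU batch of 256, both the learning rate and the global batch size grow in proportion to the number of GPUs. A small sketch follows, where the function name `scaled_hyperparameters` is hypothetical and the defaults are the base values listed for a single GPU.

```python
from typing import Tuple


def scaled_hyperparameters(num_gpus: int,
                           base_lr: float = 0.1,
                           batch_per_gpu: int = 256) -> Tuple[float, int]:
    """Return (learning_rate, global_batch_size) scaled linearly with GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    learning_rate = base_lr * num_gpus
    global_batch_size = batch_per_gpu * num_gpus
    return learning_rate, global_batch_size


lr_1, batch_1 = scaled_hyperparameters(1)
lr_64, batch_64 = scaled_hyperparameters(64)
print(lr_1, batch_1)
print(lr_64, batch_64)
```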

$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python mxnet_imagenet_resnet50.py
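
When the host list changes often, it can help to assemble the launch command programmatically. The sketch below rebuilds the same command line from a host-to-GPU mapping. `build_mpirun_command` is a hypothetical helper, and it uses only the flags that appear in the command above.

```python
from typing import Dict, List


def build_mpirun_command(hosts: Dict[str, int], script: str) -> List[str]:
    """Build an mpirun argument list with one process per GPU slot."""
    total_slots = sum(hosts.values())
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "mpirun", "-np", str(total_slots),
        "-H", host_spec,
        "-bind-to", "none", "-map-by", "slot",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]


cmd = build_mpirun_command(
    {"server1": 4, "server2": 4, "server3": 4, "server4": 4},
    "mxnet_imagenet_resnet50.py",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a node where OpenMPI is installed and the other hosts are reachable.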

Conclusion

In this article, we looked at a scalable approach to distributed model training using Apache MXNet and Horovod. We demonstrated its scaling efficiency and cost-effectiveness compared with the parameter server approach on the ImageNet dataset, on which the ResNet50-v1 model was trained. We also included steps you can follow to modify an existing script to run multi-instance training with Horovod.

If you are just getting started with MXNet and deep learning, go to the MXNet installation page to build MXNet first. We also strongly recommend reading the article "MXNet in 60 minutes" to get started.

If you have already worked with MXNet and want to try distributed training with Horovod, take a look at the Horovod installation page, build it with MXNet support, and follow the MNIST or ImageNet example.

*Cost is calculated based on AWS hourly rates for EC2 instances.


Source: www.habr.com
