Distributed training with Apache MXNet and Horovod

This translation was prepared ahead of the start of the course "Industrial ML on Big Data".

Distributed training across multiple high-performance compute instances can cut the training time of modern deep neural networks on large datasets from weeks to hours or even minutes, which has made this technique common in practical deep learning. Users need to understand how to share and synchronize data across multiple instances, which in turn has a significant effect on scaling efficiency. Users also need to know how to take a training script that runs on a single instance and deploy it to multiple instances.

In this article, we discuss a fast and easy way to run distributed training using the open-source deep learning library Apache MXNet and the distributed training framework Horovod. We demonstrate the performance benefits of Horovod and show how to write an MXNet training script that runs in a distributed fashion with Horovod.

What is Apache MXNet?

Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts away the complexity of implementing neural networks, delivers high performance and flexibility, and offers APIs for popular programming languages such as Python, C++, Clojure, Java, Julia, R, Scala, and more.

Distributed training in MXNet with a parameter server

The standard distributed training module in MXNet uses the parameter server approach. It uses a set of parameter servers to collect gradients from each worker, aggregate them, and send the updated gradients back to the workers for the next optimization iteration. Choosing the right ratio of servers to workers is the key to efficient scaling. With only one parameter server, that server can become a computational bottleneck. Conversely, with too many servers, the many-to-many communication can saturate all network connections.
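To make the push/aggregate/pull cycle concrete, here is a minimal pure-Python sketch of one parameter-server round. It is a toy model and not MXNet's implementation: the function names are hypothetical, gradients are plain lists, aggregation is assumed to be a simple mean, and the message count assumes every worker exchanges one push and one pull with every server.

```python
from typing import List

Vector = List[float]


def aggregate_on_server(worker_grads: List[Vector]) -> Vector:
    """Average the gradients pushed by all workers (assumed aggregation rule)."""
    num_workers = len(worker_grads)
    return [sum(values) / num_workers for values in zip(*worker_grads)]


def parameter_server_round(worker_grads: List[Vector]) -> List[Vector]:
    """One round: workers push gradients, the server aggregates them,
    and every worker pulls back a copy of the same aggregated gradient."""
    aggregated = aggregate_on_server(worker_grads)
    return [list(aggregated) for _ in worker_grads]


def messages_per_round(num_workers: int, num_servers: int) -> int:
    """Push + pull messages per round, assuming each worker talks to each server."""
    return 2 * num_workers * num_servers


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one gradient vector per worker
pulled = parameter_server_round(grads)
print(pulled[0])                  # every worker now holds the same averaged gradient
print(messages_per_round(8, 1))   # a single server handles all of the traffic
print(messages_per_round(8, 16))  # many servers multiply the number of connections
```

Under these assumptions, the trade-off described above shows up directly: one server receives every message, while a large server pool grows the number of worker-to-server connections multiplicatively.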

What is Horovod?

Horovod is an open-source distributed deep learning framework developed at Uber. It uses efficient inter-GPU and inter-node communication technologies, such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI), to distribute and aggregate model parameters across workers. It optimizes the use of network bandwidth and scales well with deep neural network models. It currently supports several popular machine learning frameworks, namely MXNet, TensorFlow, Keras, and PyTorch.

MXNet and Horovod integration

MXNet integrates with Horovod through the distributed training APIs defined in Horovod. The Horovod communication APIs horovod.broadcast(), horovod.allgather() and horovod.allreduce() are implemented using asynchronous callbacks of the MXNet engine, as part of its task graph. This way, data dependencies between communication and computation are handled by the MXNet engine, which avoids performance losses due to synchronization. The distributed optimizer object defined in Horovod, horovod.DistributedOptimizer, extends Optimizer in MXNet so that it calls the corresponding Horovod APIs for distributed parameter updates. All of these implementation details are transparent to end users.
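The semantics of the three collective operations named above can be sketched in plain Python. This is only an illustration of what each operation delivers to every rank, under the assumption that each rank holds a small list of floats; it does not use Horovod, and the function `distributed_sgd_step` is a hypothetical stand-in for what a distributed optimizer conceptually does (average gradients across ranks, then apply an ordinary update).

```python
from typing import List

Vector = List[float]


def broadcast(buffers: List[Vector], root_rank: int = 0) -> List[Vector]:
    """Every rank receives a copy of the root rank's buffer."""
    return [list(buffers[root_rank]) for _ in buffers]


def allgather(buffers: List[Vector]) -> List[Vector]:
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


def allreduce(buffers: List[Vector], average: bool = True) -> List[Vector]:
    """Every rank receives the element-wise sum (or mean) of all buffers."""
    reduced = [sum(values) for values in zip(*buffers)]
    if average:
        reduced = [value / len(buffers) for value in reduced]
    return [list(reduced) for _ in buffers]


def distributed_sgd_step(weights: List[Vector],
                         local_grads: List[Vector],
                         lr: float) -> List[Vector]:
    """Average gradients across ranks, then apply plain SGD on each rank."""
    averaged = allreduce(local_grads, average=True)
    return [[w - lr * g for w, g in zip(ws, gs)]
            for ws, gs in zip(weights, averaged)]


# Two ranks start with different weights; broadcast makes them identical.
weights = broadcast([[1.0, 1.0], [0.0, 0.0]], root_rank=0)
# Each rank computes a different local gradient on its own data shard.
local_grads = [[2.0, 0.0], [4.0, 2.0]]
new_weights = distributed_sgd_step(weights, local_grads, lr=0.5)
print(new_weights)  # both ranks end up with the same updated weights
```

Because every rank applies the same averaged gradient to the same starting weights, the replicas stay in sync without a central server.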

Quick start

You can quickly start training a small convolutional neural network on the MNIST dataset using MXNet and Horovod on your MacBook.
First, install mxnet and horovod from PyPI:

pip install mxnet
pip install horovod

Note: if you hit an error during pip install horovod, you may need to set the variable MACOSX_DEPLOYMENT_TARGET=10.vv, where vv is your macOS version. For example, on macOS Sierra you would run:

MACOSX_DEPLOYMENT_TARGET=10.12 pip install horovod

Then install OpenMPI from here.

Finally, download the test script mxnet_mnist.py from here and run the following command in the MacBook terminal from your working directory:

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py

This runs training on two cores of your CPU. The output will look like this:

INFO:root:Epoch[0] Batch [0-50] Speed: 2248.71 samples/sec      accuracy=0.583640
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.89 samples/sec      accuracy=0.882812
INFO:root:Epoch[0] Batch [50-100] Speed: 2273.39 samples/sec      accuracy=0.870000

Performance demo

When training a ResNet50-v1 model on the ImageNet dataset on 64 GPUs, using eight p3.16xlarge EC2 instances on the AWS cloud, each with eight NVIDIA Tesla V100 GPUs, we achieved a training throughput of 45000 images/sec (that is, the number of training samples processed per second). Training completed in 44 minutes after 90 epochs, with a best accuracy of 75.7%.
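As a rough sanity check, the reported throughput and epoch count can be related to the reported training time. The sketch below assumes the ImageNet-1k (ILSVRC 2012) training set of 1,281,167 images, which the article does not state, and it assumes the throughput stays constant for the whole run.

```python
# Assumption (not stated in the article): ImageNet-1k training set size.
IMAGENET_TRAIN_IMAGES = 1_281_167


def training_minutes(epochs: int,
                     images_per_epoch: int,
                     images_per_second: float) -> float:
    """Wall-clock minutes needed if throughput is constant for the whole run."""
    total_images = epochs * images_per_epoch
    return total_images / images_per_second / 60.0


minutes = training_minutes(epochs=90,
                           images_per_epoch=IMAGENET_TRAIN_IMAGES,
                           images_per_second=45_000)
print(f"{minutes:.1f} minutes")
```

Under these assumptions the estimate comes out at roughly 43 minutes, which is in line with the 44 minutes reported; any remaining gap would plausibly come from overhead that steady-state throughput does not capture.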

We compared this with MXNet's distributed training using parameter servers on 8, 16, 32 and 64 GPUs, with a single parameter server and with server-to-worker ratios of 1:1 and 2:1, respectively. The results are shown in Figure 1 below. The bars, read against the left y-axis, show the number of images trained per second, and the lines, read against the right y-axis, show the scaling efficiency (that is, the ratio of actual to ideal throughput). As you can see, the number of servers affects the scaling efficiency. With a single parameter server, scaling efficiency drops to 38% on 64 GPUs. To reach the same scaling efficiency as with Horovod, you need twice as many servers as workers.
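The scaling efficiency defined above (actual throughput divided by ideal throughput) can be written as a short function. The throughput values in this sketch are hypothetical and chosen only so that the result matches the 38% figure; they are not measurements from the article.

```python
def scaling_efficiency(actual_throughput: float,
                       single_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Ratio of measured throughput to ideal (perfectly linear) throughput."""
    if num_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("num_gpus must be >= 1 and single_gpu_throughput > 0")
    ideal_throughput = single_gpu_throughput * num_gpus
    return actual_throughput / ideal_throughput


# Hypothetical numbers, for illustration only.
single_gpu = 100.0        # images/sec on one GPU
measured_on_64 = 2432.0   # images/sec measured on 64 GPUs
efficiency = scaling_efficiency(measured_on_64, single_gpu, 64)
print(f"scaling efficiency: {efficiency:.0%}")  # prints: scaling efficiency: 38%
```

An efficiency of 100% would mean that 64 GPUs deliver exactly 64 times the single-GPU throughput.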

Figure 1. Comparison of distributed training using MXNet with Horovod and with a parameter server

Table 1 below compares the total cost per instance when running the experiments on 64 GPUs. Using MXNet with Horovod provides the best throughput at the lowest cost.

Table 1. Cost comparison between Horovod and a parameter server with a server-to-worker ratio of 2:1.

Steps to reproduce

In the following steps, we show how to reproduce the distributed training result using MXNet and Horovod. To learn more about distributed training with MXNet, read this post.

Step 1

Create a cluster of homogeneous instances with MXNet version 1.4.0 or later and Horovod version 0.16.0 or later to use distributed training. You will also need to install the libraries for GPU training. For our instances, we chose Ubuntu 16.04 Linux with GPU driver 396.44, CUDA 9.2, the cuDNN 7.2.1 library, the NCCL 2.2.13 communicator and OpenMPI 3.1.1. You can also use the Amazon Deep Learning AMI, where these libraries come pre-installed.

Step 2

Add Horovod API support to your MXNet training script. The script below, based on the MXNet Gluon API, can serve as a simple template. The Horovod-specific lines (the horovod.mxnet import and the lines under the "# Horovod:" comments) are the ones to add if you already have a working training script. Here are the critical changes you need to make to train with Horovod:

  • Set the context according to the local Horovod rank (line 8) to make sure training runs on the correct GPU.
  • Broadcast the initial parameters from one worker to all the others (line 18) to make sure all workers start with the same initial parameters.
  • Create a Horovod DistributedOptimizer (line 25) to update the parameters in a distributed manner.

For the full script, see the Horovod-MXNet examples for MNIST and ImageNet.

1  import mxnet as mx
2  import horovod.mxnet as hvd
3
4  # Horovod: initialize Horovod
5  hvd.init()
6
7  # Horovod: pin a GPU to be used to local rank
8  context = mx.gpu(hvd.local_rank())
9
10 # Build model
11 model = ...
12
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
16
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
19
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
23
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
26
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
30
31 # Train model
32 for epoch in range(num_epoch):
33    ...

Step 3

Log in to one of the workers to launch distributed training with the MPI command. In this example, distributed training runs on four instances with four GPUs each, for a total of 16 GPUs in the cluster. The stochastic gradient descent (SGD) optimizer is used with the following hyperparameters:

  • mini-batch size: 256
  • learning rate: 0.1
  • momentum: 0.9
  • weight decay: 0.0001

As we scaled from one GPU to 64 GPUs, we scaled the learning rate linearly with the number of GPUs (from 0.1 for one GPU to 6.4 for 64 GPUs), while keeping the number of images per GPU at 256 (from a batch of 256 images for one GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters were adjusted as the number of GPUs grew. We used mixed-precision training, with the float16 data type for the forward pass and float32 for gradients, to take advantage of the accelerated float16 computation supported by NVIDIA Tesla GPUs.
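The linear scaling described above is simple arithmetic, sketched below. The names `ScaledConfig` and `scale_for_gpus` are hypothetical; the base values (learning rate 0.1 and 256 images per GPU) come from the hyperparameter list, and momentum and weight decay are left out on purpose because the article gives no formula for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    num_gpus: int
    global_batch_size: int
    learning_rate: float


def scale_for_gpus(num_gpus: int,
                   base_learning_rate: float = 0.1,
                   images_per_gpu: int = 256) -> ScaledConfig:
    """Scale the learning rate and global batch size linearly with GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ScaledConfig(
        num_gpus=num_gpus,
        global_batch_size=images_per_gpu * num_gpus,
        learning_rate=base_learning_rate * num_gpus,
    )


for gpus in (1, 8, 16, 32, 64):
    cfg = scale_for_gpus(gpus)
    print(f"{cfg.num_gpus:>2} GPUs: batch={cfg.global_batch_size:>5}, "
          f"lr={cfg.learning_rate:.1f}")
```

For 64 GPUs this gives a global batch of 16,384 images and a learning rate of 6.4, matching the values quoted in the text.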

$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python mxnet_imagenet_resnet50.py
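The process count passed to -np has to equal the sum of the slots listed after -H. One way to keep the two in sync is to generate the command from a host map, as in this sketch. The helper `build_mpirun_command` is hypothetical, and it only reuses the flags that appear in the command above.

```python
import shlex
from typing import Dict, List


def build_mpirun_command(hosts: Dict[str, int], script: str) -> List[str]:
    """Build the mpirun argument list from a {hostname: gpu_slots} mapping."""
    total_slots = sum(hosts.values())  # one process per GPU slot
    host_spec = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "mpirun", "-np", str(total_slots),
        "-H", host_spec,
        "-bind-to", "none", "-map-by", "slot",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]


hosts = {"server1": 4, "server2": 4, "server3": 4, "server4": 4}
command = build_mpirun_command(hosts, "mxnet_imagenet_resnet50.py")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

The returned list can also be handed directly to subprocess.run on the worker that launches the job.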

Conclusion

In this article, we looked at a scalable approach to distributed model training using Apache MXNet and Horovod. We demonstrated its scaling efficiency and cost-effectiveness compared with the parameter server approach, training a ResNet50-v1 model on the ImageNet dataset. We also described the steps you can take to modify an existing script so that it runs training on multiple instances with Horovod.

If you are just getting started with MXNet and deep learning, go to the MXNet installation page to build MXNet first. We also strongly recommend reading the article MXNet in 60 minutes to get started.

If you have already worked with MXNet and want to try distributed training with Horovod, take a look at the Horovod installation page, build it with MXNet, and follow the MNIST or ImageNet example.

*The cost is calculated based on AWS hourly rates for EC2 instances

Learn more about the course "Industrial ML on Big Data"

Source: www.habr.com
