The Jedi Technique for Reducing Convolutional Networks - pruning


Once again you face an object detection task. The priority is run speed with acceptable accuracy. You take the YOLOv3 architecture and fine-tune it. Accuracy (mAp75) is above 0.95. But the run speed is still low. Crap.

Today we will leave quantization aside. Under the cut we look at Model Pruning - cutting off redundant parts of a network to speed up inference without losing accuracy. We will see clearly where, how much, and how to cut, figure out how to do it by hand and where it can be automated. At the end there is a Keras repository.

Introduction

At my previous job, Macroscop in Perm, I picked up one habit: always monitoring the execution time of algorithms. And network run time should always pass a sanity check. Usually, state-of-the-art models don't pass this check in production, which is what led me to Pruning.

Pruning is an old topic, discussed in the Stanford lectures back in 2017. The main idea is to reduce the size of a trained network without losing accuracy by removing various nodes. It sounds cool, but I rarely hear about it being used. Probably there are too few implementations, there are no Russian-language articles, or simply everyone treats pruning as know-how and keeps quiet.
Let's take it apart anyway.

A look into biology

I love it when Deep Learning borrows ideas that come from biology. Like evolution, they can be trusted (did you know that ReLU is very similar to the activation function of neurons in the brain?).

The Model Pruning process is also close to biology. The network's reaction here can be compared with the plasticity of the brain. There are a couple of interesting examples in Norman Doidge's book:

  1. The brain of a woman born with only one hemisphere reprogrammed itself to perform the functions of the missing half.
  2. A man shot off the part of his brain responsible for vision. Over time, other parts of the brain took over its functions. (Do not try to repeat this.)

Likewise, you can cut some of the weak convolutions out of your model. In extreme cases, the remaining convolutions will help replace the ones you cut.

Do you use Transfer Learning or train from scratch?

Option number one. You use Transfer Learning with YOLOv3, RetinaNet, Mask-RCNN or U-Net. But more often than not, we don't need to recognize 80 object classes as in COCO. In my practice, everything is limited to 1-2 classes. One can assume that an architecture built for 80 classes is redundant here. The thought arises that the architecture should be reduced. And it would be nice to do this without losing the existing pre-trained weights.

Option number two. Maybe you have a lot of data and computing resources, or you just need a super-custom architecture. It doesn't matter. The point is that you train the network from scratch. The usual routine: look at the data structure, pick an architecture that is EXCESSIVE in capacity, and push in dropouts against overfitting. I've seen dropout of 0.6, Carl.

In both cases, the network can be reduced. Motivated. Now let's figure out what this pruning-style trimming actually is.

General algorithm

We decided that we can remove convolutions. It looks very simple:

[Image: the basic pruning scheme]

Removing any convolution is stressful for the network and usually leads to some increase in error. This increase is itself an indicator of how correctly we are removing convolutions (for example, a large jump means we are doing something wrong). A small increase is quite acceptable and is often eliminated by a subsequent light fine-tuning with a small LR. Adding a fine-tuning step:

[Image: the pruning scheme with a fine-tuning step added]

Now we need to figure out when we want to stop our Learning<->Pruning loop. There may be exotic options here when we need to reduce the network to a certain size and run speed (for example, for mobile devices). However, the most common option is to continue the loop until the error becomes larger than the allowable one. Adding a condition:

[Image: the pruning scheme with a stop condition added]

So, the algorithm becomes clear. It remains to figure out how to determine the convolutions to be removed.
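To keep the overall algorithm in one place, here is a schematic sketch of the loop described above (plain Python; prune_step, finetune and evaluate are placeholder callbacks standing in for your own pruning, fine-tuning and evaluation code, not functions of any library):

def pruning_loop(model, prune_step, finetune, evaluate, max_error=0.05):
    # Sketch of the loop: cut a little, retrain lightly, check the error.
    # prune_step, finetune and evaluate are user-supplied callbacks;
    # max_error is the allowable accuracy drop.
    base_accuracy = evaluate(model)
    while True:
        candidate = finetune(prune_step(model))
        if base_accuracy - evaluate(candidate) > max_error:
            return model          # error grew too much: keep the last good model
        model = candidate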

Finding Convolutions to Remove

We need to remove some of the convolutions. Rushing in and "shooting off" any of them is a bad idea, although it would work. But since we have a head, we can think and try to pick "weak" convolutions for removal. There are several options:

  1. Smallest L1-measure, or low_magnitude_pruning. The idea is that convolutions with small weights contribute little to the final decision.
  2. Smallest L1-measure taking the mean and standard deviation into account. Here we supplement the estimate with the shape of the distribution.
  3. Convolution masking and exclusion of those that have the least impact on the final accuracy. More accurate definition of insignificant convolutions, but very time and resource consuming.
  4. Others

Each of these options has a right to exist and its own implementation details. Here we consider the variant with the smallest L1-measure.
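As a rough illustration (a small Keras sketch of my own, not the code used later in the article), the per-filter L1-measure of every Conv2D layer can be computed straight from its kernel weights:

import numpy as np
from keras.layers import Conv2D

def filter_l1_norms(model):
    # L1-norm of every filter in every Conv2D layer: the smaller the norm,
    # the weaker the convolution and the better a candidate for removal.
    norms = {}
    for layer in model.layers:
        if isinstance(layer, Conv2D):
            kernel = layer.get_weights()[0]          # shape (h, w, c_in, c_out)
            norms[layer.name] = np.abs(kernel).sum(axis=(0, 1, 2))
    return norms

# Weakest filters first for some layer (the layer name here is hypothetical):
# np.argsort(filter_l1_norms(model)["conv2d_5"])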

Manual process for YOLOv3

The original architecture contains residual blocks. However good they are for deep networks, they get in our way a little here. The difficulty is that you cannot delete convolutions with different indices in these layers:

[Image: a residual block - the joined layers cannot have convolutions removed independently]
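A toy Keras snippet (purely illustrative, not a piece of YOLOv3) shows why: the shortcut and the branch that get summed must keep the same number of channels, so their convolutions cannot be pruned independently.

from keras.layers import Input, Conv2D, Add

x_in = Input((52, 52, 64))
branch = Conv2D(64, 3, padding="same")(x_in)   # convolution inside the residual branch
merged = Add()([x_in, branch])                 # shortcut + branch: channel counts must match

# Pruning, say, 4 filters from `branch` would leave it with 60 channels while
# the shortcut still has 64, and the Add above would no longer be valid.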

Therefore, we select the layers from which convolutions can be removed freely:

[Image: the YOLOv3 layers from which convolutions can be removed freely]

Now let's build a work cycle:

  1. Export the activations
  2. Figure out how much to cut
  3. Cut it out
  4. Train for 10 epochs with LR=1e-4
  5. Test

Exporting the convolutions is useful for evaluating how much can be removed at a particular step. Examples of such exports:

[Image: examples of per-layer distributions of convolution L1-norms]

We can see that almost everywhere 5% of convolutions have a very low L1-norm and can be removed. At each step this export was repeated and an assessment was made of which layers could be cut and by how much.
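Such an assessment could look roughly like this (a sketch only; it assumes the Keras model is loaded into `model`, and the 0.1 cutoff relative to the layer's strongest filter is an arbitrary illustration, not the threshold used in the article):

import numpy as np
from keras.layers import Conv2D

# Rough estimate of how much each layer can lose: the share of filters whose
# L1-norm falls below a small fraction of the layer's strongest filter.
for layer in model.layers:
    if isinstance(layer, Conv2D):
        norms = np.abs(layer.get_weights()[0]).sum(axis=(0, 1, 2))
        weak_share = float((norms < 0.1 * norms.max()).mean())
        print("{}: {:.1%} of filters look removable".format(layer.name, weak_share))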

The whole process took 4 steps (here and below the numbers are for an RTX 2060 Super):

Step | mAp75 | Parameters, millions | Network size, MB | Of the original, % | Run time, ms | Cut condition
---- | ----- | -------------------- | ---------------- | ------------------ | ------------ | -------------
0 | 0.9656 | 60 | 241 | 100 | 180 | -
1 | 0.9622 | 55 | 218 | 91 | 175 | 5% of all
2 | 0.9625 | 50 | 197 | 83 | 168 | 5% of all
3 | 0.9633 | 39 | 155 | 64 | 155 | 15% for layers with 400+ convolutions
4 | 0.9555 | 31 | 124 | 51 | 146 | 10% for layers with 100+ convolutions

Step 2 had one additional positive effect: batch size 4 now fit into memory, which greatly accelerated the fine-tuning.
The process was stopped at step 4, because even prolonged fine-tuning did not bring mAp75 back up to the old values.
As a result (the step 3 numbers), inference was accelerated by about 15%, the size was reduced by 35%, and accuracy was not lost.

Automation for simpler architectures

For simpler network architectures (those without add, concatenate and residual blocks), it is quite possible to process all convolutional layers and automate the cutting of convolutions.

I implemented this option here.
It's simple: you only need a loss function, an optimizer, and batch generators:

import pruning
from keras.optimizers import Adam
from keras.utils import Sequence

# your batch generators (keras.utils.Sequence) for training and validation
train_batch_generator = BatchGenerator...
score_batch_generator = BatchGenerator...

opt = Adam(lr=1e-4)
pruner = pruning.Pruner("config.json", "categorical_crossentropy", opt)

pruner.prune(train_batch_generator, score_batch_generator)

If necessary, you can change the config parameters:

{
    "input_model_path": "model.h5",
    "output_model_path": "model_pruned.h5",
    "finetuning_epochs": 10, # number of fine-tuning epochs between pruning steps
    "stop_loss": 0.1, # loss threshold for stopping the process
    "pruning_percent_step": 0.05, # fraction of convolutions to delete at every pruning step
    "pruning_standart_deviation_part": 0.2 # shift that limits the pruned fraction
}

Additionally, a restriction based on the standard deviation is implemented. Its goal is to limit the fraction being removed, excluding convolutions whose L1-measures are already "sufficient":

[Image: the standard-deviation-based restriction]

Thus, we allow removing only weak convolutions from distributions like the one on the right, and do not touch distributions like the one on the left:

[Image: two L1-norm distributions - pruning is allowed for the right one and skipped for the left one]

When the distribution approaches normal, the pruning_standart_deviation_part coefficient can be selected from:

[Image: sigma intervals of the normal distribution]
I recommend a 2 sigma assumption. Or you can ignore this feature, leaving the value < 1.0.
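For illustration, here is one possible reading of how pruning_percent_step and pruning_standart_deviation_part could interact (a guess at the logic for the sake of the example, not the repository's actual code):

import numpy as np

def pick_filters_to_prune(l1_norms, percent_step=0.05, std_part=0.2):
    # Take the weakest filters by L1-norm, but never those whose norm is
    # already above mean - std_part * std (the "sufficient" ones).
    l1_norms = np.asarray(l1_norms)
    limit = l1_norms.mean() - std_part * l1_norms.std()
    n_cut = int(len(l1_norms) * percent_step)
    weakest_first = np.argsort(l1_norms)[:n_cut]
    return [int(i) for i in weakest_first if l1_norms[i] < limit]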

The output is a graph of the network size, loss, and run time over the whole run, normalized to 1.0. For example, here the network size was reduced almost 2 times without loss of quality (a small convolutional network with 100k weights):

[Image: normalized network size, loss and run time over the pruning run]

The run speed is subject to normal fluctuations and has not changed much. There is an explanation for this:

  1. The number of convolutions per layer changes from convenient values (32, 64, 128) to values not so convenient for GPUs - 27, 51, and so on. I could be wrong here, but most likely it plays a role.
  2. The architecture is not wide but sequential. By reducing the width, we do not touch the depth. So we reduce the load, but not the run time.

Therefore, the improvement showed up as a 20-30% drop in CUDA load during the run, rather than as a drop in run time.

Results

Let's recap. We considered two pruning approaches - for YOLOv3 (where you have to work by hand) and for networks with simpler architectures. In both cases it is possible to reduce the network size and speed it up without losing accuracy. Results:

  • Reducing the size
  • Run acceleration
  • Reducing CUDA Load
  • Environmental friendliness, as a bonus (we optimize the future use of computing resources; somewhere, Greta Thunberg rejoices)

Appendix

  • After the pruning step, you can also add quantization (for example, with TensorRT)
  • TensorFlow provides tooling for low_magnitude_pruning. It works.
  • I want to keep developing the repository and will be glad to get help

Source: habr.com
