Working with neural networks: a checklist for debugging

The code behind machine learning products is often complex and confusing, and finding and eliminating bugs in it is a resource-intensive task. Even the simplest feedforward neural networks demand careful decisions about architecture, weight initialization, and optimization. A small mistake can lead to unpleasant problems.

This article describes an algorithm for debugging your neural networks.

The algorithm consists of five stages:

  • an easy start;
  • confirming the loss;
  • checking intermediate results and connections;
  • parameter diagnostics;
  • tracking your work.

If one stage interests you more than the rest, feel free to skip ahead to it.

Easy start

A neural network with a complex architecture, regularization, and a learning rate scheduler is harder to debug than a plain one. Admittedly, this item is only indirectly related to debugging, but it is still an important recommendation.

An easy start means creating a simplified model and training it on a single data point.

First, we create a simplified model

To get started quickly, create a small network with a single hidden layer and verify that everything works correctly. Then gradually complicate the model, validating each new aspect of its structure (an additional layer, a new parameter, and so on) as you go.

Then we train the model on a single data point

As a quick health check for your project, you can train with one or two data points to confirm that the system is working correctly. The neural network should show 100% training and validation accuracy. If it doesn't, then either the model is too small or you already have a bug.

Even if everything checks out, train the model for one or a few epochs before moving on.
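
As an illustration, here is a minimal PyTorch sketch of this check (the network shape, data, and hyperparameters are all placeholders): train a tiny model on one fabricated point and confirm the loss collapses to near zero.

    import torch
    import torch.nn as nn

    # A tiny network with a single hidden layer: enough to verify the loop end to end.
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # One fabricated data point; the network should memorize it perfectly.
    x = torch.randn(1, 10)
    y = torch.tensor([1])

    for step in range(200):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    print(f"loss on the single point: {loss.item():.6f}")  # should be close to zero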

Confirming the loss

The loss is the main way to evaluate model performance. Make sure the loss is appropriate for the task, and that the loss functions are measured on the correct scale. If you use more than one loss type, make sure they are all of the same order and properly scaled.

It is important to pay attention to the initial loss. Check how close the actual result is to the expected result when the model starts from a random guess. Andrej Karpathy suggests the following: "Make sure you get the loss you expect when you initialize with small parameter values. It's best to check the data loss alone first (with the regularization strength set to zero). For example, for CIFAR-10 with a Softmax classifier we expect the initial loss to be 2.302, because the expected diffuse probability is 0.1 for each class (since there are 10 classes), and the Softmax loss is the negative log probability of the correct class: -ln(0.1) = 2.302".

For a binary example, a similar calculation is done for each class. Suppose, for example, the data is 20% zeros and 80% ones. Then the expected initial loss is -0.2 ln(0.5) - 0.8 ln(0.5) = 0.693147. If your result is greater than 1, it may mean that the neural network's weights are poorly balanced or that the data has not been normalized.
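
These numbers are easy to verify with a couple of lines of Python:

    import math

    # 10-class softmax starting from a uniform random guess: -ln(1/10)
    print(-math.log(0.1))                              # 2.302585...

    # Binary case, 20% zeros and 80% ones, model guessing p = 0.5 everywhere
    print(-0.2 * math.log(0.5) - 0.8 * math.log(0.5))  # 0.693147...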

Checking intermediate results and connections

To debug a neural network, you need to understand the dynamics inside the network, the role of individual intermediate layers, and how those layers are connected. Here are typical errors you may encounter:

  • incorrect expressions for gradient updates;
  • weight updates are not applied;
  • vanishing or exploding gradients.

If the gradient values are zero, the optimizer's learning rate may be too low, or you have encountered an incorrect expression for the gradient update.

In addition, keep track of the values of the activation functions, the weights, and the updates of each layer. For example, the magnitude of the parameter updates (for weights and biases) should be on the order of 1e-3.
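
A quick way to monitor this is to compare parameter values before and after an optimizer step. The helper below is only a sketch of one common variant of this heuristic (Karpathy's update-to-weight ratio from CS231n), reusing the model, data, and optimizer from the first snippet; the function name is illustrative.

    def update_to_weight_ratios(model, old_params):
        # Ratio of the update magnitude to the parameter magnitude for each tensor.
        ratios = {}
        for (name, p), old in zip(model.named_parameters(), old_params):
            update_norm = (p.detach() - old).norm()
            ratios[name] = (update_norm / (old.norm() + 1e-12)).item()
        return ratios

    old_params = [p.detach().clone() for p in model.parameters()]

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # same tensors as in the first snippet
    loss.backward()
    optimizer.step()

    print(update_to_weight_ratios(model, old_params))  # ~1e-3 is a healthy order of magnitude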

There is a phenomenon called the "dying ReLU", or the "vanishing gradient problem", in which ReLU neurons output zero after learning a large negative bias value for their weights. Such neurons never activate again on any data point.

You can use gradient checking to identify these errors by approximating the gradient numerically. If it is close to the analytically computed gradients, then backpropagation is implemented correctly. To set up a gradient check, look at these great resources from CS231n here and here, as well as Andrew Ng's lesson on the topic.
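
A bare-bones numerical check might look like this (a sketch using centered differences; double precision keeps the approximation error small):

    import torch

    def numerical_grad(f, x, h=1e-5):
        # Centered difference: (f(x + h) - f(x - h)) / (2h) for each coordinate.
        grad = torch.zeros_like(x)
        flat = x.view(-1)
        for i in range(flat.numel()):
            orig = flat[i].item()
            flat[i] = orig + h
            f_plus = f(x).item()
            flat[i] = orig - h
            f_minus = f(x).item()
            flat[i] = orig
            grad.view(-1)[i] = (f_plus - f_minus) / (2 * h)
        return grad

    f = lambda t: (t ** 2).sum()
    x = torch.randn(5, dtype=torch.float64, requires_grad=True)
    f(x).backward()  # analytic gradient via autograd
    num = numerical_grad(f, x.detach().clone())
    print(((num - x.grad).norm() / x.grad.norm()).item())  # relative error should be tiny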

Faizan Sheikh identifies three main methods for visualizing a neural network:

  • Preliminary methods: simple approaches that show the general structure of the trained model, such as printing the shapes or filters of the individual layers of the neural network and the parameters in each layer.
  • Activation-based methods: deciphering the activations of individual neurons or groups of neurons to understand their functions.
  • Gradient-based methods: manipulating the gradients formed by the forward and backward passes during training (including saliency maps and class activation maps).

There are several useful tools for visualizing the activations and connections of individual layers, such as ConX and TensorBoard.

Parameter diagnostics

Neural networks have many parameters that interact with one another, which complicates optimization. This area is the subject of active research, so the suggestions below should be treated only as tips, as starting points to build on.

Batch size - you want the batch size to be large enough to give accurate estimates of the error gradient, but small enough that stochastic gradient descent (SGD) can still regularize your network. Small batch sizes lead to fast convergence at the cost of noise during training, and can cause optimization difficulties later. This is described in more detail here.

Learning rate - a rate that is too low leads to slow convergence or the risk of getting stuck in a local minimum, while a rate that is too high causes the optimization to diverge, because you risk "jumping" over the deep but narrow part of the loss function. Try using learning rate scheduling to reduce the rate as training progresses. The CS231n course has a large section devoted to this problem.
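
For instance, PyTorch's built-in schedulers make this easy. A sketch with StepLR (the step size and decay factor are arbitrary placeholders), reusing the model from the first snippet:

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # Multiply the learning rate by 0.1 every 30 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... one epoch of training goes here ...
        scheduler.step()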

Gradient clipping - clipping parameter gradients during backpropagation at a maximum value or norm. Useful for dealing with any exploding gradients you may encounter in step three.
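
In PyTorch this is a one-liner between backward() and step(). A self-contained sketch (the model, data, and the threshold of 1.0 are arbitrary placeholders):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x, y = torch.randn(8, 4), torch.randn(8, 1)
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()

    # Rescale all gradients so their global L2 norm is at most 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()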

Batch normalization - used to normalize the inputs of each layer, which addresses the problem of internal covariate shift. If you use Dropout and batch norm together, check out this article.
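
As a sketch of typical placement (layer sizes are placeholders), batch norm usually sits between the linear layer and the nonlinearity:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.BatchNorm1d(32),  # normalizes each feature over the batch
        nn.ReLU(),
        nn.Linear(32, 2),
    )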

Stochastic gradient descent (SGD) - there are several variants of SGD that use momentum, adaptive learning rates, and the Nesterov method. None of them has a clear advantage in terms of both learning efficiency and generalization (details here).

Regularization - crucial for building a generalizable model, since it adds a penalty for model complexity or for extreme parameter values. It is a way to reduce the variance of a model without significantly increasing its bias. More detailed information is available here.

To check this yourself, turn off regularization and verify the data loss gradient on its own.

Dropout - another method for regularizing your network that helps prevent overfitting. During training, dropout keeps a neuron active with some probability p (a hyperparameter) and sets its output to zero otherwise. As a result, the network has to use a different subset of parameters for each training batch, which prevents individual parameters from becoming dominant.

Important: if you use both dropout and batch normalization, be careful about the order of these operations, or even about using them together. This is still being actively discussed and refined. Here are two important discussions of the topic, on Stack Overflow and on arXiv.
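
For illustration, here is one commonly used ordering (only a sketch; as the discussions above show, there is no settled answer):

    import torch.nn as nn

    block = nn.Sequential(
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),  # batch norm before the activation
        nn.ReLU(),
        nn.Dropout(p=0.5),   # note: in PyTorch, p is the probability of dropping
    )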

Tracking your work

This is about documenting your workflows and experiments. If you document nothing, you may forget, for example, which learning rate or class weights you used. With proper tracking you can easily review and reproduce previous experiments, which reduces the number of duplicated ones.

Manual documentation can become difficult at scale, though. Tools like Comet.ml help you automatically log datasets, code changes, experiment history, and production models, including key details about your model (hyperparameters, model performance metrics, and environment information).
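
A minimal sketch of what such logging can look like with Comet.ml (the API key, project name, and values are placeholders):

    from comet_ml import Experiment  # pip install comet_ml

    experiment = Experiment(api_key="YOUR_API_KEY", project_name="nn-debugging")
    experiment.log_parameters({"learning_rate": 0.01, "batch_size": 32})

    # Inside the training loop, log metrics step by step:
    # experiment.log_metric("train_loss", loss.item(), step=step)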

A neural network can be very sensitive to small changes, and this can degrade the model's performance. Tracking and documenting your work is the first step toward standardizing your environment and modeling process.

I hope this post gives you a starting point for debugging your own neural networks.

Source: habr.com
