Classification of handwritten drawings. A talk at Yandex

A few months ago, our colleagues from Google held a competition on Kaggle to create a classifier for images from the much-talked-about game "Quick, Draw!". The team that included Yandex developer Roman Vlasov took fourth place in the competition. At the January machine learning training session, Roman shared his team's ideas, the final implementation of the classifier, and interesting approaches used by his rivals.


- Hi all! My name is Roma Vlasov, and today I will tell you about the Quick, Draw! Doodle Recognition Challenge.

There were five people on our team. I joined right before the merge deadline. We were unlucky: we got shaken up a little in the final standings. We were shaken out of the money zone, and another team was shaken out of gold. And we took an honorable fourth place.

(During the competition, teams could watch their positions on a leaderboard computed on one part of the provided data set. The final standings, in turn, were computed on the other part. This is done so that participants cannot tune their algorithms to specific data. Because of this, positions in the final "shake up" a bit (from the English shake up - to mix): on the other data, the results may turn out differently. Before the shake-up, Roman's team was in the top three - the money zone, since only the first three places were entitled to a cash prize. After the shake-up, the team was in fourth place. In the same way, another team lost its victory, a gold position. - Ed.)

The competition was also significant in that Evgeny Babakhnin earned a Grandmaster title for it, Ivan Sosin earned a Master, Roman Solovyov remained a Grandmaster, Alex Parinov earned a Master, and I became an Expert - by now I am already a Master.

What is Quick, Draw!? It's a service from Google. Google's goal was to popularize AI, and with this service it wanted to show how neural networks work. You go there, press "Let's draw", and a new page opens where you are told: draw a zigzag, you have 20 seconds to do it. You try to draw a zigzag in 20 seconds, like here, for example. If you succeed, the network says it's a zigzag, and you move on. There are six such pictures in a round.

If Google's network failed to recognize what you drew, the task was marked with a cross. Later I'll explain why it matters whether a drawing was recognized by the network or not.

This service gathered quite a large number of users, and all the pictures they drew were logged.

They managed to collect almost 50 million pictures, from which the train and test data for our competition were formed. By the way, the amount of data in the test set and the number of classes are highlighted in bold for a reason. I will say more about them a little later.

The data format was as follows. These are not just RGB pictures but, roughly speaking, a log of everything the user did. word is our target, countrycode is where the author of the doodle comes from, timestamp is the time. The recognized flag shows whether Google's network recognized the image or not. And drawing itself is a sequence of strokes, an approximation of the curve the user drew, as points. Plus timings: the time elapsed from the start of the drawing.
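
To make the format concrete, here is a minimal sketch of what one record looks like (field names follow the competition CSV; the values themselves are made up for illustration):

```python
import json

# One record from the raw Quick, Draw! CSV (illustrative values).
record = {
    "word": "zigzag",                    # target class
    "countrycode": "FR",                 # where the doodle's author is from
    "timestamp": "2017-03-01 11:02:03",
    "recognized": True,                  # did Google's network recognize it?
    # drawing: a list of strokes; each stroke is [x_points, y_points, timings],
    # timings are milliseconds from the start of drawing
    "drawing": json.dumps([
        [[10, 40, 70], [5, 60, 5], [0, 120, 260]],   # first stroke
        [[70, 100], [5, 60], [400, 520]],            # second stroke
    ]),
}

strokes = json.loads(record["drawing"])
for xs, ys, ts in strokes:
    print(list(zip(xs, ys)), "times:", ts)
```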

The data came in two formats: the first, raw one, and a second, simplified one. In the simplified format the timings were cut out, and the set of points was approximated by a smaller set of points using the Ramer-Douglas-Peucker algorithm. Say you have a large set of points that just approximates some straight line; you can in fact approximate that line with only two points. That is the idea of the algorithm.
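
For illustration, here is a compact Python sketch of the idea (the epsilon here is arbitrary; the actual simplified dataset was produced by Google with its own settings):

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: approximate a polyline with fewer points.
    A point is kept only if it deviates from the chord by more than epsilon."""
    if len(points) < 3:
        return points
    (x1, y1), (x2, y2) = points[0], points[-1]

    def dist(p):  # perpendicular distance from a point to the chord
        x0, y0 = p
        num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        den = math.hypot(x2 - x1, y2 - y1) or 1e-12
        return num / den

    idx, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                    key=lambda t: t[1])
    if dmax > epsilon:
        # recurse on both halves, sharing the split point
        return rdp(points[:idx + 1], epsilon)[:-1] + rdp(points[idx:], epsilon)
    return [points[0], points[-1]]  # everything is close to a straight line

print(rdp([(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)], epsilon=1.0))
```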

The class distribution looked as follows. It is mostly uniform, with some outliers. When we solved the problem, we did not look at it much. The main thing is that no classes were severely underrepresented, so we didn't have to use weighted samplers or oversample the data.

What did the pictures look like? Here is the "airplane" class and examples from it with the recognized and unrecognized tags. Their ratio was somewhere around 1 to 9. As you can see, the data is quite noisy. Still, I would guess it's an airplane. If you look at the unrecognized ones, in most cases it's just noise. Someone even tried to write the word "airplane", but apparently in French.

Most of the participants simply took networks, rendered the data from this sequence of lines as RGB pictures, and fed them to the network. I rendered in roughly the same way: I took a palette of colors, drew the first line with one color from the beginning of the palette, the last line with another color from the end of the palette, and interpolated between them along the palette for all the lines in between. By the way, this gave a better result than drawing simply in black, as on the very first slide.
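
A minimal re-creation of that rendering approach might look like this (the palette endpoints, canvas size, and line thickness are my assumptions, not the actual competition code):

```python
import numpy as np
import cv2

def render_doodle(strokes, size=256, start=(255, 0, 0), end=(0, 0, 255)):
    """Draw strokes as an RGB image, interpolating the stroke color along a
    palette: the first stroke gets `start`, the last one gets `end`."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    n = max(len(strokes) - 1, 1)
    for k, (xs, ys) in enumerate(strokes):
        t = k / n                                   # position in the palette
        color = tuple(int((1 - t) * s + t * e) for s, e in zip(start, end))
        pts = list(zip(xs, ys))
        for p, q in zip(pts[:-1], pts[1:]):
            cv2.line(img, (int(p[0]), int(p[1])),
                     (int(q[0]), int(q[1])), color, thickness=2)
    return img

img = render_doodle([([10, 60, 120], [20, 80, 40]), ([130, 200], [50, 90])])
```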

Other team members, such as Ivan Sosin, tried slightly different approaches to rendering. In one channel he simply drew a gray image, in another channel he painted each stroke with a gradient from start to finish, from 32 to 255, and in the third channel he drew a gradient over all the strokes from 32 to 255.
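
A sketch of this three-channel encoding as I understood it (the exact intensity mapping is an assumption):

```python
import numpy as np
import cv2

def render_three_channel(strokes, size=256):
    """ch0: plain gray drawing; ch1: each stroke shaded 32->255 from its start
    to its end; ch2: whole strokes shaded 32->255 from first stroke to last."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    n = max(len(strokes) - 1, 1)
    for k, (xs, ys) in enumerate(strokes):
        pts = list(zip(xs, ys))
        m = max(len(pts) - 2, 1)
        stroke_val = int(32 + (255 - 32) * k / n)     # ch2: gradient over strokes
        for i, (p, q) in enumerate(zip(pts[:-1], pts[1:])):
            seg_val = int(32 + (255 - 32) * i / m)    # ch1: gradient within a stroke
            color = (255, seg_val, stroke_val)        # (ch0, ch1, ch2)
            cv2.line(img, (int(p[0]), int(p[1])),
                     (int(q[0]), int(q[1])), color, thickness=2)
    return img
```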

Another interesting thing: Alex Parinov fed additional information into the network using the countrycode.

The metric used in the competition was Mean Average Precision. What is the essence of this metric for the competition? You can give three predictions, and if there is no correct one among these three, you get 0. If there is a correct one, its position is taken into account, and the result for that sample is 1 divided by the position of your prediction. For example, you made three predictions and the first one is correct: then you divide 1 by 1 and get 1. If the correct prediction is at position 2, then 1 divided by 2 gives 0.5. And so on.
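
In code, this metric (MAP@3) looks like this:

```python
def map_at_3(predictions, targets):
    """Mean Average Precision at 3: each sample contributes 1/rank of the
    correct class among its top-3 predictions, or 0 if it is absent."""
    total = 0.0
    for preds, target in zip(predictions, targets):
        for rank, p in enumerate(preds[:3], start=1):
            if p == target:
                total += 1.0 / rank
                break
    return total / len(targets)

# first sample: hit at rank 1 -> 1.0; second: rank 2 -> 0.5; third: miss -> 0
print(map_at_3([["cat", "dog", "owl"],
                ["dog", "cat", "owl"],
                ["owl", "dog", "fish"]],
               ["cat", "cat", "cat"]))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```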

That more or less settles data preprocessing - how to draw the pictures and so on. What architectures did we use? We tried heavyweight architectures like PNASNet and SENet, as well as the already classic SE-ResNeXt architectures, which appear in new competitions more and more often. There were also ResNet and DenseNet.

How did we train? All the models we took were pretrained on ImageNet. Although there is a lot of data - 50 million images - a network pretrained on ImageNet still showed a better result than one trained from scratch.
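
In torchvision this amounts to a couple of lines (resnet50 here is just an example backbone; the competition had 340 classes):

```python
import torch.nn as nn
import torchvision.models as models

# Start from ImageNet weights, then replace the classifier head
# for the 340 doodle classes.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 340)
```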

What training techniques did we use? First, Cosine Annealing with Warm Restarts, which I will talk about a little later. It's a technique I use in almost all of my recent competitions, and with it the networks turn out to train quite well and reach a good minimum.

Next, Reduce Learning Rate on Plateau. You start training the network with some specific learning rate, and the loss gradually converges to some value. You check, say, that for ten epochs the loss has not changed. You reduce the learning rate by some factor and keep training. The loss drops a little again, converges at some minimum, and you lower the learning rate again, and so on, until the network finally converges.
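
In PyTorch this is the ReduceLROnPlateau scheduler; `train_one_epoch` and `validate` below are hypothetical helpers:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Cut the LR by 10x if the validation loss hasn't improved for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # hypothetical training step
    val_loss = validate(model)          # hypothetical validation step
    scheduler.step(val_loss)            # the scheduler watches the plateau
```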

Next is an interesting technique: Don't decay the learning rate, increase the batch size. There is a paper with the same title. When you train a network, you don't have to decrease the learning rate; you can simply increase the batch size instead.

This technique, by the way, was used by Alex Parinov. He started with a batch size of 408, and when his network reached some kind of plateau, he simply doubled the batch size, and so on.

In fact, I don't remember what value his batch size reached, but interestingly, there were teams on Kaggle that used the same technique with a batch size of about 10,000. By the way, modern deep learning frameworks such as PyTorch make this very easy. You don't feed your batch to the network all at once; instead, you split it into chunks that fit on your video card, compute the gradients on each chunk, and only after the gradient has been accumulated for the whole batch do you update the weights.
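
A minimal gradient-accumulation sketch (all of `model`, `criterion`, `optimizer`, and `loader` are placeholders):

```python
accum_steps = 8                      # logical batch = 8 x what fits on the GPU
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps   # keep the gradient scale right
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one weight update per logical batch
        optimizer.zero_grad()
```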

By the way, large batch sizes were a good fit for this competition, because the data was quite noisy, and a large batch size helped to approximate the gradient more accurately.

Pseudo-labeling was also used, mostly by Roman Solovyov. He sampled about half of the test data into each batch and trained the network on such batches.
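
A rough sketch of such a pseudo-labeling loop, assuming the model's own predictions are used as labels for the test half of the batch (the talk does not spell out this detail; all names are placeholders):

```python
import torch

for (x_tr, y_tr), x_te in zip(train_loader, test_loader):
    with torch.no_grad():
        pseudo = model(x_te).argmax(dim=1)    # model's own guess as a label
    x = torch.cat([x_tr, x_te])               # batch: half train, half test
    y = torch.cat([y_tr, pseudo])
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```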

The size of the pictures mattered too. The thing is, you have a lot of data and need to train for a long time, and if your picture size is quite large, you will train for a very long time while not gaining much in the quality of the final classifier. So it was worth making a trade-off, and we only tried pictures of moderate sizes.

How was it all trained? First, small-sized pictures were taken and several epochs were run on them, which took quite a bit of time. Then larger pictures were fed in and the network was trained further, then even larger ones, and so on - so as not to train from scratch each time and not spend too much time.
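
Schematically (the sizes, `make_loader`, and `train_one_epoch` are illustrative placeholders):

```python
# Progressive resizing: warm up on small images, then continue from the
# same weights on larger ones.
for size in (64, 128, 224):
    loader = make_loader(image_size=size)       # hypothetical loader factory
    for epoch in range(epochs_per_stage):
        train_one_epoch(model, loader, optimizer)
```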

About optimizers: we used SGD and Adam. In this way it was possible to get a single model with a score of 0.941-0.946 on the public leaderboard, which is pretty good.

If you ensemble the models in some way, you end up with about 0.951. Applying one more technique gives a final public score of 0.954, which is what we got. But more on that later. Next I will tell you how we assembled the models and how we managed to achieve such a final score.

Next, I would like to talk about Cosine Annealing with Warm Restarts, or Stochastic Gradient Descent with Warm Restarts. Roughly speaking, in principle you can use any optimizer, but the idea is this: if you just train one network and it gradually converges to some minimum, everything is OK - you get one network that makes certain mistakes - but you can train it a little differently. You set some initial learning rate and gradually lower it according to a cosine schedule. The network descends to a certain minimum; you save the weights, then set the learning rate back to the value it had at the beginning of training, thereby jumping out of this minimum, and anneal the learning rate again.
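
In PyTorch this schedule is available out of the box; here is a sketch with a snapshot saved at the end of each cosine cycle (`train_one_epoch` is a hypothetical helper, and the linear model is a stand-in for the real CNN):

```python
import torch

model = torch.nn.Linear(10, 340)           # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# LR follows a cosine from 0.1 down toward 0, then jumps back up every T_0 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

snapshots = []
for epoch in range(30):
    train_one_epoch(model, optimizer)      # hypothetical training step
    scheduler.step()
    if (epoch + 1) % 10 == 0:              # end of a cosine cycle: save a snapshot
        snapshots.append({k: v.clone() for k, v in model.state_dict().items()})
# Averaging the predictions of the snapshot models gives the "free" ensemble.
```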

This way you can visit several minima in a row, in which your loss is roughly the same, plus or minus. But the networks with those weights will make different errors on your data. By averaging them, you get some approximation, and your score will be higher.

About how we assembled our models. At the beginning of the talk I said to pay attention to the amount of data in the test set and the number of classes. If you add 1 to the number of samples in the test set and divide by the number of classes, you get the number 330, and it was written on the forum that the classes in the test set are balanced. This could be exploited.

Based on this, Roman Solovyov came up with a metric we called Proxy Score, which correlated quite well with the leaderboard. The gist: you make a prediction, take the top-1 of your predictions, and count the number of objects assigned to each class. Then subtract 330 from each count and sum the absolute values.
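
A direct transcription of the described Proxy Score:

```python
from collections import Counter

def proxy_score(top1_predictions, classes, expected_per_class=330):
    """Sum of absolute deviations of each class's top-1 count from the
    expected 330 per class; lower means closer to a balanced test set."""
    counts = Counter(top1_predictions)
    return sum(abs(counts[c] - expected_per_class) for c in classes)

print(proxy_score(["cat", "cat", "dog"], ["cat", "dog", "owl"],
                  expected_per_class=1))  # |2-1| + |1-1| + |0-1| = 2
```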

These are the values we got. They helped us avoid probing the leaderboard and instead validate locally and select the coefficients for our ensembles.

With an ensemble, you could get such a score. What else could be done? Suppose you use the information that the classes in your test set are balanced.

The balancing approaches differed. An example of one of them: the balancing from the guys who took first place.

What did we do? Our balancing was quite simple; it was proposed by Evgeny Babakhnin. We first sorted our predictions by top-1 and selected candidates from them so that the number per class did not exceed 330. But for some classes this leaves fewer than 330 predictions, so we then sorted by top-2 and top-3 and selected candidates from those as well.
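
A sketch of this greedy balancing (details such as the ordering within each pass are my assumptions):

```python
import numpy as np

def balance_predictions(probs, quota=330):
    """Fill each class with at most `quota` samples, first from confident
    top-1 picks, then top-2, then top-3. probs: (n_samples, n_classes).
    Samples left unassigned (-1) would fall back to their plain top-1."""
    n, k = probs.shape
    assigned = -np.ones(n, dtype=int)
    taken = np.zeros(k, dtype=int)
    ranked = np.argsort(-probs, axis=1)          # classes by confidence
    for depth in range(3):                       # top-1, then top-2, top-3
        # most confident samples at this depth go first
        order = np.argsort(-probs[np.arange(n), ranked[:, depth]])
        for i in order:
            c = ranked[i, depth]
            if assigned[i] == -1 and taken[c] < quota:
                assigned[i] = c
                taken[c] += 1
    return assigned

print(balance_predictions(np.random.rand(12, 4), quota=3))
```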

How did our balancing differ from the first-place balancing? They used an iterative approach: they took the most popular class and decreased its probabilities by some small number until it was no longer the most popular, then took the next most popular class, and so on, lowering them until the counts of all classes became equal.
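
A sketch of that iterative procedure (the step size is a guess, and the quota must be feasible for the loop to terminate):

```python
import numpy as np

def iterative_balance(probs, quota=330, step=1e-4):
    """Repeatedly find the most over-predicted class and shave its
    probabilities until no class's top-1 count exceeds the quota."""
    probs = probs.copy()
    while True:
        top1 = probs.argmax(axis=1)
        counts = np.bincount(top1, minlength=probs.shape[1])
        worst = counts.argmax()
        if counts[worst] <= quota:
            return top1
        probs[:, worst] -= step      # demote the over-popular class a bit

print(iterative_balance(np.random.rand(12, 4), quota=3))
```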

Everyone used more or less the same approach to training networks, but not everyone used balancing. With balancing you could get into gold, and if you were lucky, into the money.

How should the data be preprocessed? Everyone preprocessed the data in more or less the same way - making handcrafted features, trying to encode the timings with different stroke colors, and so on. Alexey Nozdrin-Plotnitsky, who took 8th place, spoke about exactly this.

He did it differently. He said that all these handcrafted features of yours don't work and that you don't need to do this: your network should learn all of it on its own. Instead, he came up with trainable modules that preprocessed your data. He fed them the raw data without preprocessing: the point coordinates and the timings.

Then he took the differences of the coordinates and averaged everything over the timings, obtaining a rather long matrix. He applied 1D convolutions to it several times to get a 64×n matrix, where n is the total number of points; 64 was chosen so that the resulting matrix could be fed to a layer of a convolutional network that expects 64 input channels.

So he had a 64×n matrix, and from it he needed to make a tensor of a size such that the number of channels was 64. He normalized all the X, Y points to the range from 0 to 32 to make a 32×32 tensor. I don't know why he wanted 32×32; it just happened that way. And at each coordinate he placed the corresponding fragment of this 64×n matrix. So he simply got a 32×32×64 tensor that could be fed further into a convolutional neural network. That's all I wanted to say.
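
A loose sketch of such a trainable preprocessing module (the layer shapes, normalization, and the handling of collisions are my guesses, not the original code; the output is channels-first, 64×32×32, as PyTorch expects):

```python
import torch
import torch.nn as nn

class DoodlePreprocessor(nn.Module):
    """1D convolutions turn the raw point sequence into 64-channel features,
    which are then scattered onto a 32x32 grid by the normalized coordinates."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, points):
        # points: (n, 3) -> columns are x, y, t for a single doodle
        deltas = points[1:] - points[:-1]              # differences along the path
        feats = self.conv(deltas.t().unsqueeze(0))[0]  # (64, n-1) feature matrix
        xy = points[1:, :2]
        xy = (xy - xy.min(0).values) / (xy.max(0).values - xy.min(0).values + 1e-6)
        ix = (xy * 31).long()                          # coordinates into 0..31
        grid = torch.zeros(64, 32, 32)
        grid[:, ix[:, 1], ix[:, 0]] = feats            # place features on the grid
        return grid                                    # ready for a CNN backbone

grid = DoodlePreprocessor()(torch.rand(100, 3))
print(grid.shape)  # torch.Size([64, 32, 32])
```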

Source: habr.com
