SNA Hackathon 2019

In February and March 2019, the SNA Hackathon 2019 social network feed ranking contest was held, in which our team took first place. In this article, I will talk about how the contest was organized, the methods we tried, and the catboost settings for training on big data.


SNA Hackathon

The hackathon under this name is being held for the third time. It is organized by the social network ok.ru, so the task and the data are directly related to this social network.
SNA (social network analysis) in this case is better understood not as analysis of a social graph, but rather as analysis of the social network itself.

  • In 2014, the task was to predict the number of likes a post would get.
  • In 2016, the VVZ task (you may be familiar with it), which is closer to social graph analysis.
  • In 2019, ranking a user's feed by the probability that the user will like a post.

I cannot speak for 2014, but in 2016 and 2019, in addition to data analysis skills, the ability to work with big data was also required. I think it was this combination of machine learning and big data problems that attracted me to these competitions, and experience in both areas helped me win.

mlbootcamp

In 2019, the competition was organized on the platform https://mlbootcamp.ru.

The online stage began on February 7 and consisted of 3 tasks. Anyone could register on the site, download the baseline, and train their own model for a few hours. At the end of the online stage on March 15, the top 15 of each contest were invited to the Mail.ru office for the offline stage, which took place from March 30 to April 1.

Task

The source data consists of user IDs (userId) and post IDs (objectId). If a post was shown to a user, the data contains a row with userId, objectId, ownerId, the user's reactions to this post (feedback), and a set of various features or links to images and texts.

userId   objectId   ownerId   feedback            images
3555     22         5677      [liked, clicked]    [hash1]
12842    55         32144     [disliked]          [hash2, hash3]
13145    35         5677      [clicked, shared]   [hash2]

The test dataset has a similar structure, but the feedback field is missing. The task is to predict the presence of the 'liked' reaction in the feedback field.
The submission file has the following structure:

userId   SortedList[objectId]
123      78,13,54,22
128      35,61,55
131      35,68,129,11

The metric is ROC AUC averaged over users.
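To make the metric concrete, here is a minimal sketch of how it can be computed with sklearn; the column names and the skipping of single-class users are my assumptions, not the organizers' exact evaluation code.

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def mean_user_auc(df: pd.DataFrame) -> float:
        """ROC AUC per user, averaged; df has columns userId, prediction, liked (0/1)."""
        aucs = []
        for _, group in df.groupby("userId"):
            # AUC is undefined for a user whose impressions are all liked or all not liked
            if group["liked"].nunique() < 2:
                continue
            aucs.append(roc_auc_score(group["liked"], group["prediction"]))
        return sum(aucs) / len(aucs)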

A more detailed description of the data can be found on the contest site. There you can also download the data, including tests and images.

Online stage

At the online stage, the task was divided into 3 parts:

  • Collaborative system - includes all features except images and texts;
  • Images - includes only information about images;
  • Texts - includes only information about texts.

Offline stage

At the offline stage, the data included all the features, but texts and images were sparse. The number of rows in the dataset, which was already large, grew by a factor of 1.5.

Solving the problem

Since I do computer vision at work, I started my journey in this competition with the "Images" task. The provided data included userId, objectId, ownerId (the group in which the post was published), timestamps of the post's creation and display, and, of course, the image for the post.
After generating a few timestamp-based features, the next idea was to take the penultimate layer of a neural network pre-trained on ImageNet and feed these embeddings into boosting.
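Roughly, the idea looks like the sketch below. The choice of ResNet-50 and the preprocessing are my assumptions; the article does not say which pretrained network was used.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # ImageNet-pretrained network with the classification head removed,
    # so the output is the penultimate-layer embedding.
    model = models.resnet50(pretrained=True)
    model.fc = torch.nn.Identity()
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def image_embedding(path: str) -> torch.Tensor:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return model(img).squeeze(0)  # 2048-dim vector to feed into boosting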


The results were not impressive. The embeddings from the ImageNet network are irrelevant here, I thought, so I need to train my own autoencoder.


It took a lot of time and the result did not improve.

Feature generation

Working with images takes a lot of time, and I decided to do something simpler.
As you can immediately see, there are several categorical features in the dataset, so in order not to overcomplicate things, I just took catboost. The result was excellent: without any tuning, I immediately got to the first line of the leaderboard.
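The first version was essentially the standard catboost recipe: pass the categorical columns as-is and let the library build its own encodings. The toy rows below are modeled on the table above; the real training set had about 20M records.

    import pandas as pd
    from catboost import CatBoostClassifier, Pool

    df = pd.DataFrame({
        "userId":   ["3555", "12842", "13145", "3555"],
        "objectId": ["22", "55", "35", "55"],
        "ownerId":  ["5677", "32144", "5677", "32144"],
        "liked":    [1, 0, 0, 1],
    })
    cat_features = ["userId", "objectId", "ownerId"]

    # catboost handles categorical encoding itself, no manual target encoding needed
    model = CatBoostClassifier(iterations=10, loss_function="Logloss", verbose=False)
    model.fit(Pool(df[cat_features], df["liked"], cat_features=cat_features))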

There is a lot of data, and it is stored in parquet format, so without thinking twice I took Scala and started writing everything in Spark.

The simplest features, which gave a bigger boost than the image embeddings (a PySpark sketch of these counters follows the list):

  • how many times objectId, userId, and ownerId occurred in the data (should correlate with popularity);
  • how many posts userId saw from ownerId (should correlate with the user's interest in the group);
  • how many unique userIds have viewed posts by ownerId (reflects the size of the group's audience).
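The actual aggregation code was written in Scala; below is a rough PySpark sketch of the same counters (the parquet path is illustrative).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("train.parquet")  # illustrative path

    # popularity of the object, the user's interest in the group, audience size of the group
    object_pop = df.groupBy("objectId").agg(F.count("*").alias("objectShows"))
    user_owner = df.groupBy("userId", "ownerId").agg(F.count("*").alias("userOwnerShows"))
    owner_aud = df.groupBy("ownerId").agg(F.countDistinct("userId").alias("ownerUniqueViewers"))

    features = (df
        .join(object_pop, "objectId", "left")
        .join(user_owner, ["userId", "ownerId"], "left")
        .join(owner_aud, "ownerId", "left"))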

From the timestamps it was possible to get the time of day at which the user viewed the feed (morning/afternoon/evening/night). By combining these categories, you can keep generating features (see the sketch after this list):

  • how many times userId logged in in the evening;
  • at what time this post is shown more often (objectId) and so on.
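A sketch of the time-of-day feature in the same PySpark style; the showTimestamp column name, the millisecond units, and the bucket boundaries are my guesses.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("train.parquet")  # illustrative path

    hour = F.hour(F.from_unixtime(F.col("showTimestamp") / 1000))
    df = df.withColumn(
        "dayPart",
        F.when(hour < 6, "night")
         .when(hour < 12, "morning")
         .when(hour < 18, "day")
         .otherwise("evening"))

    # e.g. how many times each user browsed the feed in each part of the day
    user_daypart_counts = df.groupBy("userId", "dayPart").count()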

All this gradually improved the metric. But the size of the training dataset is about 20M records, so adding features greatly slowed down training.

So I revised my approach to using the data. Although the data is time-dependent, I did not see any obvious leaks of information "from the future"; nevertheless, just in case, I split it like this:

[figure: the data split scheme]

The training set provided to us (February and two weeks of March) was divided into 2 parts.
The model was trained on the data of the last N days. The aggregations described above were built on all the data, including the test set. At the same time, this produced data on which various encodings of the target variable can be built. The easiest approach is to reuse the code that already creates the new features and simply feed it the data that will not be used for training, with target = 1.

Thus we got features like these (a pandas sketch of this encoding follows the list):

  • How many times userId has seen posts in the ownerId group;
  • How many times userId liked posts in the ownerId group;
  • The percentage of posts from ownerId that userId liked.
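A minimal pandas sketch of this encoding, assuming history_df is the slice used only for aggregation and the feedback column holds lists of reactions (column names are assumptions):

    import pandas as pd

    def user_owner_like_stats(history_df: pd.DataFrame) -> pd.DataFrame:
        liked = history_df["feedback"].apply(lambda r: "liked" in r).astype(int)
        stats = (history_df.assign(liked=liked)
                 .groupby(["userId", "ownerId"])["liked"]
                 .agg(userOwnerShows="count", userOwnerLikes="sum")
                 .reset_index())
        stats["userOwnerLikeRate"] = stats["userOwnerLikes"] / stats["userOwnerShows"]
        return stats

    # join onto the part of the data the model is actually trained on:
    # train_df = train_df.merge(user_owner_like_stats(history_df),
    #                           on=["userId", "ownerId"], how="left")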

In other words, this is mean target encoding on part of the dataset for various combinations of categorical features. In principle, catboost also builds target encodings, so from that point of view there is no benefit, but, for example, it became possible to count the number of unique users who liked posts in a given group. At the same time, the main goal was achieved: my dataset shrank several times, and it was possible to continue generating features.

While catboost can only build encodings on the liked reaction, feedback has other reactions: reshared, disliked, unwanted, clicked, ignored, which can be encoded manually. I computed all kinds of aggregates and filtered out features with low importance so as not to inflate the dataset.

By then I was in first place by a wide margin. The only thing that bothered me was that the image embeddings gave almost no improvement. The idea came to leave everything to catboost: cluster the images with KMeans and get a new categorical feature, imageCat.
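A minimal sketch of that step: cluster the image embeddings with KMeans and use the cluster id as the categorical feature imageCat. The number of clusters and the embeddings file are assumptions.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    embeddings = np.load("image_embeddings.npy")  # illustrative path
    kmeans = MiniBatchKMeans(n_clusters=50, random_state=42)
    image_cat = kmeans.fit_predict(embeddings)    # imageCat for every image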

Here are some classes after manually filtering and merging clusters obtained from KMeans.

[image: examples of image clusters]

Based on imageCat, we generate:

  • New categorical features:
    • Which imageCat userId viewed most often;
    • Which imageCat ownerId shows most often;
    • Which imageCat userId liked most often;
  • Various counters:
    • How many unique imageCat values userId has viewed;
    • About 15 similar features, plus target encoding as described above.

Texts

The results in the image contest suited me, and I decided to try my hand at the texts. I had not worked much with texts before and foolishly spent a day on tf-idf and svd. Then I saw the baseline with doc2vec, which does exactly what I need. After tweaking the doc2vec parameters a bit, I got text embeddings.
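A minimal gensim sketch in the spirit of that baseline; the corpus here is a toy one and the parameters are illustrative rather than the tuned ones.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = [["first", "post", "text"], ["another", "post"]]  # toy tokenized texts
    documents = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

    model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=20)
    text_embedding = model.infer_vector(["new", "post", "text"])  # goes into boosting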

Then I simply reused the code for the images, replacing the image embeddings with text embeddings. As a result, I got 2nd place in the text contest.

Collaborative system

There was one contest left that I had not yet "poked with a stick", and judging by the AUC on the leaderboard, the results of this particular contest should have had the biggest impact on the offline stage.
I took all the features from the original data, selected the categorical ones, and computed the same aggregates as for the images, except for the features based on the images themselves. Simply putting this into catboost got me to 2nd place.

First steps of catboost optimization

One first place and two second places pleased me, but there was an understanding that I had done nothing special, which meant I could expect to lose positions.

The task of the contest is to rank posts within a user, and all this time I had been solving a classification problem, that is, optimizing the wrong metric.

Let me give you a simple example:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          1
1        14         0.5          0
2        15         0.4          0
2        16         0.3          1

Let's make a small change:

userId   objectId   prediction   ground truth
1        10         0.9          1
1        11         0.8          1
1        12         0.7          1
1        13         0.6          0
2        16         0.5          1
2        15         0.4          0
1        14         0.3          1

We get the following results:

Model      AUC    User1 AUC   User2 AUC   Mean AUC
Option 1   0.8    1.0         0.0         0.5
Option 2   0.7    0.75        1.0         0.875

As can be seen, an improvement in the overall AUC metric does not mean an improvement in the mean AUC metric within a user.

Catboost can optimize ranking metrics out of the box. I read up on ranking metrics and success stories of using them with catboost, and set YetiRankPairwise to train overnight. The result was not impressive. Deciding that the model was undertrained, I changed the loss function to QueryRMSE, which, judging by the catboost documentation, converges faster. In the end I got the same results as when training for classification, but the ensemble of these two models gave a good boost, which brought me to first place in all three contests.
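For reference, a sketch of how ranking losses are plugged into catboost: rows are grouped by user via group_id and a query-wise loss is used. The toy data and parameters are illustrative.

    from catboost import CatBoost, Pool

    X = [[0.1, 3], [0.4, 1], [0.3, 2], [0.8, 5]]
    y = [1, 0, 0, 1]
    group_id = [1, 1, 2, 2]  # rows of one user (query) must be contiguous

    pool = Pool(X, y, group_id=group_id)

    # QueryRMSE here; YetiRankPairwise can be passed the same way via loss_function
    ranker = CatBoost({"loss_function": "QueryRMSE", "iterations": 50, "verbose": False})
    ranker.fit(pool)
    scores = ranker.predict(pool)  # sort each user's posts by this score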

Five minutes before the close of the online stage, Sergey Shalnov moved me down to second place in the Collaborative Systems contest. We walked the rest of the way together.

Preparation for the offline stage

Victory at the online stage was rewarded with an RTX 2080 Ti video card, but the main prize of 300,000 rubles and, more to the point, the final first place made us work for those two weeks.

As it turned out, Sergey also used catboost. We exchanged ideas and features, and I learned about the talk by Anna Veronika Dorogush, which contained answers to many of my questions, and even to some that I had not yet come up with by then.

Watching the talk led me to the idea that all parameters should be returned to their default values, and the settings should be tuned very carefully and only after fixing the set of features. Now a single training run took about 15 hours, but one model managed to get a better score than the earlier ensemble with ranking had.

Feature generation

In the "Collaborative Systems" competition, a large number of features are evaluated as important for the model. For example, auditweights_spark_svd - the most important sign, while there is no information about what it means. I thought it would be worth counting the various aggregates based on important features. For example, average auditweights_spark_svd per user, per group, per object. The same can be calculated on the data on which training is not performed and target = 1, that is, the average auditweights_spark_svd by user by objects that he likes. Important features besides auditweights_spark_svd, there were several. Here are some of them:

  • auditweightsCtrlGender
  • auditweightsCtrHigh
  • userOwnerCounterCreateLikes

For example, the mean value of auditweightsCtrlGender per userId turned out to be an important feature, just like the mean value of userOwnerCounterCreateLikes per userId+ownerId. This alone should make you want to dig into the meaning of the fields.

auditweightsLikesCount and auditweightsShowsCount were also important features. Dividing one by the other, we got an even more important feature.

Data leaks

Competition models and production models are very different tasks. When preparing data, it is very difficult to take into account every detail and not leak some non-trivial information about the target variable into the test set. If we are building a production solution, we will try to avoid data leaks when training the model. But if we want to win the competition, then data leaks are the best features.

After examining the data, you can see that for a given objectId the values of auditweightsLikesCount and auditweightsShowsCount change over time, which means that the ratio of the maximum values of these features will reflect the conversion of the post much better than the ratio at the time of the impression.

The first leak we found is auditweightsLikesCountMax/auditweightsShowsCountMax.
What if we take a closer look at the data? Sort by show date and get:

objectId   userId   auditweightsShowsCount   auditweightsLikesCount   target (is liked)
1          1        12                       3                        probably not
1          2        15                       3                        maybe yes
1          3        16                       4

It was surprising when I found the first such example and my prediction did not come true. But, given that the maximum values of these features within an object gave an improvement, we took the trouble to find auditweightsShowsCountNext and auditweightsLikesCountNext, that is, the values at the next point in time. By adding the feature
(auditweightsShowsCountNext-auditweightsShowsCount)/(auditweightsLikesCount-auditweightsLikesCountNext) we made a sharp leap forward.
Similar leaks could be exploited by finding the next values of userOwnerCounterCreateLikes within userId+ownerId and, for example, of auditweightsCtrlGender within objectId+userGender. We found 6 similar fields with leaks and extracted as much information as possible from them.
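A pandas sketch of one such "next value" feature: sort an object's impressions by show time and take the counters from the following row. The showTimestamp column name is a guess; the ratio is the one given above.

    import pandas as pd

    def add_next_counters(df: pd.DataFrame) -> pd.DataFrame:
        df = df.sort_values(["objectId", "showTimestamp"])
        grouped = df.groupby("objectId")
        df["auditweightsShowsCountNext"] = grouped["auditweightsShowsCount"].shift(-1)
        df["auditweightsLikesCountNext"] = grouped["auditweightsLikesCount"].shift(-1)
        # the denominator can be zero; in a real pipeline that case needs handling
        df["leakRatio"] = (
            (df["auditweightsShowsCountNext"] - df["auditweightsShowsCount"]) /
            (df["auditweightsLikesCount"] - df["auditweightsLikesCountNext"]))
        return df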

By that time, we had squeezed the maximum information out of the collaborative features, but had not returned to the image and text contests. There was a great idea to check: how much do the image and text features themselves contribute in the corresponding contests?

There were no leaks in the image and text contests, but by that time I had returned to the default catboost parameters, cleaned up the code, and added a few features. The totals were:

Solution                   Score
Maximum with images        0.6411
Maximum without images     0.6297
Second place result        0.6295

Solution                   Score
Maximum with texts         0.666
Maximum without texts      0.660
Second place result        0.656

Solution                   Score
Maximum in collaborative   0.745
Second place result        0.723

It became obvious that we were unlikely to squeeze much more out of texts and images, and after trying a couple of the most interesting ideas, we stopped working on them.

Further feature generation in collaborative systems did not give any improvement, and we started on ranking. At the online stage, the ensemble of classification and ranking had given me a small boost, which, as it turned out, was because I had undertrained the classification. None of the loss functions, including YetiRankPairwise, came even close to the result that LogLoss gave (0.745 vs. 0.725). There was still hope for QueryCrossEntropy, which we could not get running.

Offline stage

At the offline stage, the data structure remained the same, but there were small changes:

  • the userId, objectId and ownerId identifiers were re-randomized;
  • a few features were removed and a few renamed;
  • the amount of data increased by about 1.5 times.

In addition to the difficulties listed above, there was one big plus: the team was given a large server with an RTX 2080 Ti. I enjoyed watching htop for a long time.

The idea was simple: just reproduce what we already had. After spending a couple of hours setting up the environment on the server, we gradually started checking that the results were reproducible. The main problem we faced was the increase in the amount of data. We decided to reduce the load a bit and set the catboost parameter ctr_complexity=1. This lowers the score slightly, but my model started working, and the result was good: 0.733.

Sergey, unlike me, did not split the data into 2 parts and trained on all the data. Although this gave the best result at the online stage, it caused many difficulties at the offline stage. If we had taken all the features we generated and tried to shove them into catboost head-on, nothing would have worked even at the online stage. Sergey did type optimization, for example converting float64 types to float32. In this article you can find information on memory optimization in pandas. As a result, Sergey trained on the CPU on all the data and got about 0.735.
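A sketch of this kind of type optimization in pandas, downcasting float64 and int64 columns before handing the frame to catboost:

    import pandas as pd

    def downcast(df: pd.DataFrame) -> pd.DataFrame:
        # float64 -> float32 halves the memory of float columns
        for col in df.select_dtypes(include=["float64"]).columns:
            df[col] = df[col].astype("float32")
        # integer columns are downcast to the smallest type that fits the values
        for col in df.select_dtypes(include=["int64"]).columns:
            df[col] = pd.to_numeric(df[col], downcast="integer")
        return df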

These results were enough to win, but we hid our real score and could not be sure that other teams were not doing the same.

Fight to the last

Tuning catboost

Our solution was fully reproduced, we added the text data and the image features, so all that remained was to tune the catboost parameters. Sergey trained on the CPU with a small number of iterations, and I trained with ctr_complexity=1. One day remained, and if we simply added iterations or increased ctr_complexity, then by morning we could get an even better score and spend the whole day relaxing.

At the offline stage, scores could be hidden very easily, simply by choosing something other than the best solution on the site. We expected dramatic changes to the leaderboard in the last minutes before submissions closed and decided not to stop.

From Anna's video, I learned that to improve the quality of the model, it is best to select the following parameters:

  • learning_rate - the default value is computed based on the dataset size. Increasing learning_rate requires increasing the number of iterations.
  • l2_leaf_reg - the regularization coefficient, default value 3; it should be chosen from 2 to 30. Decreasing the value increases overfitting.
  • bagging_temperature - adds randomization to the object weights in the sample. The default value is 1, with which the weights are drawn from an exponential distribution. Decreasing the value increases overfitting.
  • random_strength - influences the choice of splits at a given iteration. The higher random_strength, the higher the chance that a split with low importance is picked. With each subsequent iteration, the randomness decreases. Decreasing the value increases overfitting.

The other parameters have much less effect on the final result, so I did not try to tune them. One iteration of training on my dataset on the GPU with ctr_complexity=1 took 20 minutes, and the parameters selected on a reduced dataset were slightly different from the optimal ones on the full dataset. In the end, I did about 30 iterations on 10% of the data and then about 10 more on all the data. The outcome was roughly this (a config sketch follows the list):

  • learning_rate I increased by 40% from the default;
  • l2_leaf_reg I left unchanged;
  • bagging_temperature and random_strength I reduced to 0.8.
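The same adjustments expressed as a config sketch; the baseline learning_rate of 0.05 is only an illustration, since catboost derives the real default from the dataset size.

    from catboost import CatBoostClassifier

    default_lr = 0.05  # stand-in for the auto-chosen default, illustration only
    model = CatBoostClassifier(
        learning_rate=default_lr * 1.4,  # +40% from the default
        l2_leaf_reg=3,                   # left at the default
        bagging_temperature=0.8,         # down from the default of 1
        random_strength=0.8,             # down from the default of 1
        # task_type="GPU",               # training ran on the RTX 2080 Ti
    )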

It can be concluded that the model was undertrained with default parameters.

I was very surprised when I saw the result on the leaderboard:

Model            Model 1   Model 2   Model 3   Ensemble
Without tuning   0.7403    0.7404    0.7404    0.7407
With tuning      0.7406    0.7405    0.7406    0.7408

I concluded for myself that if you do not need fast application of the model, it is better to replace parameter tuning with an ensemble of several models trained with non-optimized parameters.

Sergey was involved in optimizing the size of the dataset to run it on the GPU. The easiest option is to cut off part of the data, but this can be done in several ways:

  • gradually remove the oldest data (beginning of February) until the dataset starts to fit into memory;
  • remove features with the lowest importance;
  • remove userId for which there is only one entry;
  • keep only the userIds that appear in the test set.

And in the end, make an ensemble of all the options.

Last Ensemble

By late evening of the last day we had submitted an ensemble of our models that gave 0.742. Overnight I ran my model with ctr_complexity=2, and instead of 30 minutes it trained for 5 hours. It finished only at 4 in the morning, and I made the last ensemble, which gave 0.7433 on the public leaderboard.

Because of the different approaches to solving the problem, our predictions were not strongly correlated, which gave a good boost from the ensemble. To get a good ensemble, it is better to use the raw model predictions predict(prediction_type='RawFormulaVal') and set scale_pos_weight=neg_count/pos_count.
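A sketch of that blending recipe: train with scale_pos_weight = neg/pos, take the raw scores via prediction_type='RawFormulaVal' and average them. The toy data and the equal blend weights are assumptions.

    import numpy as np
    from catboost import CatBoostClassifier

    X = np.random.rand(200, 5)
    y = (X[:, 0] > 0.7).astype(int)                # toy target
    spw = (y == 0).sum() / max((y == 1).sum(), 1)  # scale_pos_weight = neg / pos

    models = [CatBoostClassifier(iterations=50, random_seed=s, scale_pos_weight=spw,
                                 verbose=False).fit(X, y) for s in (1, 2)]

    raw = [m.predict(X, prediction_type="RawFormulaVal") for m in models]
    ensemble_score = np.mean(raw, axis=0)          # rank each user's posts by this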


On the site you can see the final results on the private leaderboard.

Other solutions

Many teams followed the canons of recommender algorithms. Not being an expert in this area, I cannot evaluate them, but I remember 2 interesting solutions.

  • Nikolai Anokhin's solution. Nikolai, being a Mail.ru employee, was not competing for prizes, so his goal was not to get the maximum score but to build an easily scalable solution.
  • The solution of the team that received the Jury Prize, based on this paper from Facebook, made it possible to cluster the images very well without manual work.

Conclusion

What sticks in my memory the most:

  • If there are categorical features in the data, then even if you know how to do target encoding correctly, it is still worth trying catboost.
  • If you are participating in a contest, you should not waste time selecting parameters other than learning_rate and iterations. A faster solution is to make an ensemble of several models.
  • Boostings can learn on the GPU. Catboost can learn very quickly on the GPU, but eats a lot of memory.
  • While developing and testing ideas, it's better to set small rsm~=0.2 (CPU only) and ctr_complexity=1.
  • Unlike for other teams, the ensemble of our models gave a big boost. We only exchanged ideas and wrote in different languages. We had different approaches to splitting the data and, I think, each had his own bugs.
  • It is not clear why ranking optimization performed worse than classification optimization.
  • I got some experience with texts and an understanding of how recommender systems are made.


Thanks to the organizers for the emotions, knowledge and prizes.

Source: habr.com
