Large Hadron Collider and Odnoklassniki

Continuing the topic of machine learning contests on Habr, we want to introduce readers to two more platforms. They are certainly not as huge as Kaggle, but they definitely deserve attention.


Personally, I am not a big fan of Kaggle, for several reasons:

  • firstly, the competitions there often last several months, and active participation takes a lot of energy;
  • secondly, the public kernels (publicly shared solutions). Kaggle adepts advise treating them with the calm of Tibetan monks, but in reality it is quite a shame when something you have been working toward for a month or two suddenly turns out to be laid out on a silver platter for everyone.

Fortunately, machine learning competitions are held on other platforms as well, and a couple of them will be discussed here.

| | IDAO | SNA Hackathon 2019 |
|---|---|---|
| Official language | English | Russian |
| Organizers | Yandex, Sberbank, HSE | Mail.ru Group |
| Dates | Online round: January 15 - February 11, 2019; on-site final: April 4-6, 2019 | Online: February 7 - March 15; offline: March 30 - April 1 |
| Task | Given a set of data about a particle in the Large Hadron Collider (trajectory, momentum, and other fairly complex physical parameters), determine whether it is a muon. Two tasks were derived from this statement: in one you only had to submit predictions; in the other, the full prediction code and model, with fairly strict limits on execution time and memory usage. | Logs of content impressions from open groups in users' news feeds for February-March 2018 were collected; the last week and a half of March is hidden in the test set. Each log entry records what was shown to whom and how the user reacted: liked ("class"), commented, ignored, or hid it from the feed. The goal is to rank each Odnoklassniki user's feed, raising the posts that will get a "class" as high as possible. At the online stage the task was split into three parts: ranking posts by collaborative features, by the images they contain, and by their text. |
| Metric | A complex custom metric, something like ROC-AUC | ROC-AUC averaged over users |
| Prizes | First stage: T-shirts for N places and advancement to the second stage, with accommodation and meals paid during the competition. Second stage: ??? (for some reason I missed the awards ceremony and never found out what the prizes turned out to be); laptops were promised to every member of the winning team. | First stage: T-shirts for the top 100 participants and advancement to the second stage, with travel to Moscow, accommodation, and meals paid during the competition. Toward the end of the first stage, additional prizes were announced for the best result in each of the three tasks: an RTX 2080 Ti video card each. The second stage was a team stage (teams of 2 to 5 people), with prizes: 1st place: 300 rubles, 2nd place: 200 rubles, 3rd place: 100 rubles, jury prize: 100 rubles. |
| Community | Official Telegram group, ~190 participants, communication in English; answers to questions could take several days | Official Telegram group, ~1500 participants, active discussion of the tasks between participants and organizers |
| Baselines | The organizers provided two baseline solutions, a simple one and an advanced one. The simple one needed less than 16 GB of RAM, while the advanced one did not fit into 16 GB. Running a little ahead: participants never managed to significantly outperform the advanced solution, and there were no difficulties launching either. Notably, the advanced example contained a comment hinting at where to start improving the solution. | Primitive baselines were provided for each task and were easily surpassed by the participants. In the first days, participants ran into several difficulties: the data came in Apache Parquet format, and not every combination of Python and the parquet package handled it without errors; and downloading the images from the Mail.ru cloud was painful, since there is currently no easy way to fetch a large volume of data at once. These problems delayed participants by a couple of days. |
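The SNA Hackathon metric, ROC-AUC averaged per user, is easy to sketch from scratch; the following is my own illustration, not the organizers' actual scoring code:

```python
from collections import defaultdict

def roc_auc(y_true, y_score):
    """Rank-based ROC-AUC: the probability that a random positive example
    is scored above a random negative one (ties count as half a win)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    if not pos or not neg:
        return None  # undefined for users with only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_user_auc(users, y_true, y_score):
    """Average the per-user AUCs, skipping single-class users."""
    groups = defaultdict(list)
    for u, y, s in zip(users, y_true, y_score):
        groups[u].append((y, s))
    aucs = [a for rows in groups.values()
            if (a := roc_auc(*zip(*rows))) is not None]
    return sum(aucs) / len(aucs)
```

The pairwise loop is quadratic per user, which is fine for a sketch; a production scorer would use a rank-based formula instead.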

IDAO. First stage

The task was to classify particles as muon / non-muon based on their characteristics. The key feature of this task was the presence of a weight column in the training data, which the organizers themselves interpreted as confidence in the answer for that row. The problem was that quite a few rows had negative weights.


After puzzling for a few minutes over the line with the hint (the hint simply drew attention to this feature of the weight column) and plotting this graph, we decided to test 3 options:

1) invert the target for rows with negative weights (and flip the weights accordingly)
2) shift the weights by the minimum value so that they start from 0
3) ignore the row weights entirely

The third option turned out to be the worst, while the first two improved the score; the best was option 1, which immediately brought us to the then-current second place in the first task and first place in the second.
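As a sketch of option 1 in pandas (the column names "label" and "weight" are my stand-ins, not the exact competition schema):

```python
import pandas as pd

# Toy frame standing in for the training data; "label" and "weight"
# are assumed column names.
df = pd.DataFrame({
    "label":  [1, 0, 1, 0],
    "weight": [0.8, -0.5, 1.2, -1.0],
})

neg = df["weight"] < 0
df.loc[neg, "label"] = 1 - df.loc[neg, "label"]  # invert the target...
df.loc[neg, "weight"] = -df.loc[neg, "weight"]   # ...and flip the weight sign
```

After this, every weight is positive and can be passed as a sample weight to the boosting library as usual.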
Our next step was to check the data for missing values. The organizers gave us already-groomed data in which the fairly numerous missing values had been replaced with -9999.

We found missing values in the MatchedHit_{X,Y,Z}[N] and MatchedHit_D{X,Y,Z}[N] columns, and only for N = 2 or 3. As we understood it, some particles did not fly through all 4 detectors and stopped at either the 3rd or the 4th plate. The data also contained Lextra_{X,Y}[N] columns, which apparently describe the same thing as MatchedHit_{X,Y}[N] but via some kind of extrapolation. These meager guesses suggested that Lextra_{X,Y}[N] could be substituted for the missing values in MatchedHit_{X,Y}[N] (for the X and Y coordinates only), while MatchedHit_Z[N] was filled well by the median. These manipulations took us to intermediate 1st place in both tasks.
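That imputation can be sketched roughly as follows (a minimal illustration with the -9999 sentinel; the real data has more columns than shown here):

```python
import pandas as pd

MISSING = -9999  # sentinel the organizers used for missing values

def impute(df, ns=(2, 3)):
    """Fill MatchedHit_{X,Y}[N] from Lextra_{X,Y}[N];
    fill MatchedHit_Z[N] with the column median."""
    for n in ns:
        for axis in ("X", "Y"):
            hit, extra = f"MatchedHit_{axis}[{n}]", f"Lextra_{axis}[{n}]"
            mask = df[hit] == MISSING
            df.loc[mask, hit] = df.loc[mask, extra]
        z = f"MatchedHit_Z[{n}]"
        median = df.loc[df[z] != MISSING, z].median()
        df.loc[df[z] == MISSING, z] = median
    return df
```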


Considering that nothing was awarded for winning the first stage, we could have stopped there, but we kept going, drew some pretty pictures, and came up with new features.


For example, we found that if we plot the intersection points of a particle with each of the four detector plates, the points on each plate group into 5 rectangles with an aspect ratio of 4 to 5, centered at (0, 0), with no points inside the first rectangle.

| Plate \ rectangle size | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Plate 1 | 500×625 | 1000×1250 | 2000×2500 | 4000×5000 | 8000×10000 |
| Plate 2 | 520×650 | 1040×1300 | 2080×2600 | 4160×5200 | 8320×10400 |
| Plate 3 | 560×700 | 1120×1400 | 2240×2800 | 4480×5600 | 8960×11200 |
| Plate 4 | 600×750 | 1200×1500 | 2400×3000 | 4800×6000 | 9600×12000 |

Having determined these dimensions, we added 4 new categorical features per particle: the index of the rectangle in which it crosses each plate.
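The feature itself can be sketched like this (half-dimensions come from the table above; the fallback value 6 for points outside all rectangles is my own choice):

```python
# Half-width and half-height of the innermost rectangle on each plate
# (from the table above); each next rectangle is twice as large.
RECT_HALF = {
    1: (250.0, 312.5),
    2: (260.0, 325.0),
    3: (280.0, 350.0),
    4: (300.0, 375.0),
}

def rect_index(plate, x, y):
    """Index (1..5) of the smallest centered rectangle containing (x, y);
    6 if the point falls outside all five."""
    hw, hh = RECT_HALF[plate]
    for k in range(5):
        if abs(x) <= hw * 2**k and abs(y) <= hh * 2**k:
            return k + 1
    return 6
```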


We also noticed that the particles seemed to scatter away from the center, and the idea arose to somehow score the "quality" of this scattering. Ideally one could probably come up with some kind of "ideal" parabola depending on the entry point and measure the deviation from it, but we limited ourselves to an "ideal" straight line. Having constructed such ideal lines for each entry point, we could compute the standard deviation of each particle's trajectory from its line. Since the mean deviation for target = 1 turned out to be 152 while for target = 0 it was 390, we tentatively judged the feature to be good, and indeed it immediately landed near the top of the most useful features.

Delighted, we also added the deviations of all 4 intersection points from the ideal line as 4 extra features per particle (they worked well too).
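A minimal sketch of this feature, under the assumption that the "ideal" line passes through the origin and the entry point (the first-plate hit), measuring each hit's perpendicular distance to it:

```python
import numpy as np

def line_deviation(hits):
    """hits: array of shape (4, 3) with the (x, y, z) intersection points,
    one per plate. Returns (std of deviations, per-plate deviations)."""
    hits = np.asarray(hits, dtype=float)
    d = hits[0] / np.linalg.norm(hits[0])  # direction of the "ideal" line
    proj = hits @ d                        # projection of each hit onto the line
    perp = hits - np.outer(proj, d)        # component perpendicular to the line
    dists = np.linalg.norm(perp, axis=1)
    return dists.std(), dists
```

The scalar std is the "quality of scattering" feature; the four per-plate distances are the extra features mentioned above.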

Links to scientific articles on the topic of the contest, provided by the organizers, hinted that we were far from the first to tackle this problem and that specialized software might exist. Having found a GitHub repository implementing the IsMuonSimple, IsMuon, and IsMuonLoose methods, we ported them with minor modifications. The methods themselves are very simple: for example, if the energy is below a certain threshold, it is not a muon; otherwise it is. Such simple features obviously could not give a boost on their own when using gradient boosting, so we also added the "distance to the threshold" as a feature. These features improved the score slightly as well. Perhaps a more thorough analysis of the existing methods would have uncovered stronger ones to add as features.
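Schematically (the threshold value and momentum units below are placeholders, not the actual numbers from that repository):

```python
P_THRESHOLD = 10_000  # assumed momentum cut in MeV, purely illustrative

def is_muon_simple(p):
    """Binary IsMuonSimple-style cut: above the threshold => muon."""
    return int(p > P_THRESHOLD)

def threshold_distance(p):
    """Signed distance to the cut; unlike the bare 0/1 flag, this gives
    gradient boosting a continuous value it can actually split on."""
    return p - P_THRESHOLD
```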

At the end of the competition, we slightly tweaked the "fast" solution for the second task; in the end it differed from the baseline in the following ways:

  1. Inverted the target in rows with negative weight
  2. Filled in the missing values in MatchedHit_{X,Y,Z}[N]
  3. Reduced the depth to 7
  4. Reduced the learning rate to 0.1 (it was 0.19)

In the end, we tried more features (not very successfully), tuned parameters, trained catboost, lightgbm, and xgboost, tried different blendings of their predictions, and, before the private leaderboard opened, confidently led the second task and were among the leaders in the first.

After the private leaderboard opened, we were in 10th place in the first task and 3rd in the second. The leaders all got shuffled, and the scores on private turned out higher than on the public leaderboard. It seems the data was poorly stratified (or, for example, there were no rows with negative weights in the private set), which was a little frustrating.

SNA Hackathon 2019 - Texts. First stage

The task was to rank a user's posts in the Odnoklassniki social network by the text they contain; besides the text, there were a few more characteristics of each post (language, owner, creation date and time, viewing date and time).

As classical approaches to working with text, I would single out two options:

  1. Mapping each word into an n-dimensional vector space such that similar words get similar vectors (see our article for details), then either averaging the word vectors over the text or using mechanisms that take the relative positions of words into account (CNN, LSTM/GRU).
  2. Using models that can work with whole sentences out of the box, such as BERT. In theory, this approach should work better.
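The simplest variant of option 1, averaging word vectors, looks roughly like this (the toy 2-dimensional embedding table stands in for real pre-trained FastText vectors):

```python
import numpy as np

# Toy embedding table; in practice these come from pre-trained FastText.
EMB = {
    "good": np.array([1.0, 0.0]),
    "post": np.array([0.0, 1.0]),
}
DIM = 2

def text_vector(text):
    """Average the vectors of the known words; zeros if none are known."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)
```

The resulting fixed-size vector can then be fed to any downstream classifier or ranker.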

Since this was my first experience with texts, it would be wrong to teach anyone else, so I will teach myself instead. Here are the tips I would give myself at the start of the competition:

  1. Before rushing to train anything, look at the data! Besides the texts themselves, the data had several columns, and much more could have been squeezed out of them than I did. The simplest thing is mean target encoding for some of the columns.
  2. Don't train on all the data! There was a lot of it (about 17 million rows), and it was unnecessary to use all of it to test hypotheses. Training and preprocessing were quite slow, and I would clearly have had time to test more interesting hypotheses.
  3. <controversial advice> No need to hunt for a killer model. I spent a long time on ELMo and BERT, hoping they would immediately carry me to a high place, and in the end used pre-trained FastText embeddings for Russian. I could not reach a better score with ELMo, and I did not have time to figure BERT out.
  4. <controversial advice> No need to hunt for one killer feature. Looking at the data, I noticed that about 1 percent of the texts contained no actual text, only links to external resources, so I wrote a simple parser that opened the site and pulled out the title and description. It seemed like a good idea, but then I got carried away, decided to parse the links for all texts, and again lost a lot of time. None of this gave a significant improvement to the final result (though I did figure out stemming along the way, for example).
  5. Classic features work. Google "text features kaggle", for example, read, and add everything. TF-IDF gave an improvement, and so did statistical features such as text length, word count, and amount of punctuation.
  6. If there are DateTime columns, it is worth parsing them into several separate features (hour, day of week, etc.). Which features to extract should be decided with graphs or some kind of metrics. Here I did everything right on a hunch and picked the necessary features, but a proper analysis would not have hurt (as we did at the final, for example).
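The mean target encoding from tip 1 can be sketched in pandas with simple smoothing toward the global mean (column names below are illustrative):

```python
import pandas as pd

def mean_target_encode(train, col, target, smoothing=10.0):
    """Replace a categorical column with its smoothed per-category target
    mean: rare categories are pulled toward the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    enc = ((stats["mean"] * stats["count"] + global_mean * smoothing)
           / (stats["count"] + smoothing))
    return train[col].map(enc)
```

In a real pipeline the encoding should be computed out-of-fold (or on a holdout) to avoid target leakage; this sketch omits that for brevity.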


By the end of the competition I had trained one keras model with word convolutions and another based on LSTM and GRU. Both used pre-trained FastText embeddings for Russian (I tried a number of other embeddings, but these worked best). After averaging the predictions, I finished in 7th place out of 76 participants.

After the first stage, Nikolai Anokhin, who took second place (he participated out of competition), published an article; his solution matched mine up to a certain point, but he went further thanks to a query-key-value attention mechanism.

Second stage OK & IDAO

The second stages of the two competitions took place almost back to back, so I decided to cover them together.

First, with a newly assembled team, I found myself in the impressive Mail.ru office, where our task was to combine the models from the three first-stage tracks: text, images, and collaborative features. A little over 2 days were allotted for this, which turned out to be very little: in fact, we only managed to reproduce our first-stage results without gaining anything from the merge. In the end we took 5th place, and we never managed to make use of the text model. Looking at other participants' solutions, it seems it would have been worth trying to cluster the texts and add the clusters to the collaborative model. A side effect of this stage was new impressions, acquaintances, and conversations with cool participants and organizers, as well as a severe lack of sleep, which may have affected the result in the IDAO final.

The task at the on-site stage of the IDAO 2019 Final was to predict the waiting time for an order for Yandex taxi drivers at airports. At this stage, 3 tasks = 3 airports. For each airport, minute-by-minute data on the number of taxi orders over six months was given; the test data covered the following month, together with minute-by-minute order data for the preceding 2 weeks. Time was short (1.5 days), the task was rather specific, and only one person from our team made it to the competition, so the result was a sad place near the bottom. Among the interesting ideas were attempts to use external data on weather, traffic jams, and Yandex taxi order statistics. Although the organizers did not say which airports these were, many participants assumed they were Sheremetyevo, Domodedovo, and Vnukovo. Although this assumption was refuted after the contest, features built, for example, from Moscow weather data improved the result both on validation and on the leaderboard.

Conclusion

  1. ML contests are cool and interesting! They exercise data-analysis skills, clever models and techniques, and plain common sense is welcome too.
  2. ML is already a huge body of knowledge that seems to be growing exponentially. I set myself the goal of getting to know different areas (signals, images, tables, text) and have already realized how much there is to learn. After these competitions, for example, I decided to study clustering algorithms, advanced techniques for working with gradient-boosting libraries (in particular, CatBoost on a GPU), capsule networks, and the query-key-value attention mechanism.
  3. Kaggle is not the only game in town! In many other contests it is easier to win at least a T-shirt, and the chances of other prizes are better.
  4. Communicate! There is already a large community around machine learning and data analysis, with thematic groups in Telegram and Slack, and serious people from Mail.ru, Yandex, and other companies answer questions and help both beginners and those continuing their journey in this field.
  5. If the previous point resonates with you, I recommend visiting DataFest, a large free conference in Moscow held on May 10-11.

Source: habr.com
