How we work on the quality and speed of selection of recommendations

My name is Pavel Parkhomenko, I am an ML developer. In this article, I would like to talk about the architecture of the Yandex.Zen service and share technical improvements whose implementation has increased the quality of recommendations. From this post you will learn how to find the documents most relevant to the user among millions in just a few milliseconds; how to continuously decompose a large matrix (consisting of millions of columns and tens of millions of rows) so that new documents receive their vectors within tens of minutes; and how to reuse the user-article matrix decomposition to obtain a good vector representation for videos.


Our recommendation base contains millions of documents in various formats: text articles created on our platform and taken from external sites, videos, narratives and short posts. The development of such a service is associated with a large number of technical challenges. Here are some of them:

  • Separate computational tasks: perform all heavy operations offline, and in real time only apply the models quickly, so that the service responds within 100-200 ms.
  • Quickly take into account user actions. To do this, it is necessary that all events are instantly delivered to the recommender and affect the result of the models.
  • Make the feed adapt quickly to the behavior of new users. People who have just signed in should feel that their feedback influences the recommendations.
  • Quickly understand who to recommend a new article to.
  • Respond quickly to the constant emergence of new content. Tens of thousands of articles are published every day, and many of them have a limited lifespan (news, for example). This sets them apart from films, music and other content that is long-lived and expensive to create.
  • Transfer knowledge from one domain area to another. If the recommender system has trained models for text articles and we add videos to it, we can reuse the existing models so that the new type of content ranks better.

I will tell you how we solved these problems.

Selection of candidates

How to reduce the number of considered documents by a factor of thousands in a few milliseconds, with practically no loss in ranking quality?

Suppose we have trained many ML models, generated features based on them, and trained another model that ranks documents for the user. Everything would be fine, but you cannot simply compute all the features for all documents in real time when there are millions of documents and recommendations have to be built in 100-200 ms. The task is to select from those millions a subset that will be ranked for the user. This stage is usually called candidate selection. It has several requirements. First, the selection must be very fast, so that as much time as possible is left for the ranking itself. Second, while drastically reducing the number of documents to rank, we must preserve the documents relevant to the user as fully as possible.

Our approach to candidate selection has evolved, and at the moment we have arrived at a multi-stage scheme:


First, all documents are divided into groups, and the most popular documents are taken from each group. Groups can be sites, topics, or clusters. For each user, based on their history, the closest groups are selected and the best documents are taken from them. We also use a kNN index to select the documents closest to the user in real time. There are several methods for building a kNN index; the one that worked best for us is HNSW (Hierarchical Navigable Small World graphs). This is a hierarchical model that finds the N vectors closest to the user's vector in a base of millions in a few milliseconds. First, we index our entire document base offline. Since search in the index is quite fast, if there are several strong embeddings, you can build several indexes (one per embedding) and query each of them in real time.
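As an illustration, here is a minimal sketch of building and querying an HNSW index with the open-source hnswlib library; the embedding dimension, index parameters and data are placeholders, and Zen's internal implementation is not necessarily based on this library. With several strong embeddings, one such index per embedding can be queried and the results merged.

```python
import numpy as np
import hnswlib

dim = 128                                                        # embedding dimension (illustrative)
doc_vectors = np.random.rand(100_000, dim).astype(np.float32)    # stand-in for real document embeddings
doc_ids = np.arange(len(doc_vectors))

# Offline: index the whole document base.
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=len(doc_vectors), ef_construction=200, M=16)
index.add_items(doc_vectors, doc_ids)

# Runtime: ef trades recall for speed; fetch the documents closest to the user vector.
index.set_ef(200)
user_vector = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(user_vector, k=100)
```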

We are left with tens of thousands of documents per user. This is still too many to compute all the features for, so at this stage we apply light ranking: a lightweight ranking model with fewer features that approximates the heavy one. Its task is to predict which documents the heavy model will place at the top. The documents with the highest predicted scores go on to the heavy model, that is, to the final ranking stage. This approach allows us to reduce the set of documents considered for a user from millions to thousands in tens of milliseconds.
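One way to implement such a light ranking stage (a sketch, not a description of Zen's actual models) is to distill the heavy ranker: train a small model on cheap features to predict the scores the heavy model produced offline.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Offline: the heavy ranker scores a sample of (user, document) pairs using the full
# feature set; the light model learns to reproduce those scores from cheap features only.
cheap_features = np.random.rand(100_000, 20)     # placeholder for features available for every candidate
heavy_scores = np.random.rand(100_000)           # placeholder for scores produced by the heavy ranker

light_model = HistGradientBoostingRegressor(max_iter=100)
light_model.fit(cheap_features, heavy_scores)

# Runtime: cheaply score tens of thousands of candidates and pass only the top ones
# to the heavy (full-feature) ranking stage.
candidate_features = np.random.rand(30_000, 20)
predicted = light_model.predict(candidate_features)
top_for_heavy_ranking = np.argsort(-predicted)[:1000]
```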

ALS step at runtime

How to take into account user feedback immediately after a click?

An important factor in recommendations is the response time to user feedback. This is especially important for new users: when a person is just starting to use the recommender system, they receive a non-personalized feed of documents on various topics. As soon as they make their first click, you must immediately take this into account and adjust to their interests. If all factors were calculated offline, a quick response would be impossible because of the delay, so user actions must be processed in real time. For this purpose, we use an ALS step at runtime to build the user's vector representation.

Suppose we have a vector representation for all documents. For example, we can build offline embeddings based on the text of an article using ELMo, BERT or other machine learning models. How can we get a vector representation of users in the same space based on their interactions in the system?
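For example, document embeddings could be computed offline with a pretrained BERT-like model by mean-pooling token states; the model name and pooling below are illustrative assumptions, since the post does not specify which models Zen actually uses.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative model choice; the article only mentions "ELMo, BERT or other ML models".
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed_documents(texts):
    """Offline step: turn article texts into fixed-size vectors by mean-pooling token states."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding positions when averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

doc_vectors = embed_documents(["First article text...", "Second article text..."])
```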

The general principle of forming and decomposing the user-document matrix.
Let's say we have m users and n documents. For some users, their attitude toward certain documents is known. Then this information can be represented as an m × n matrix: rows correspond to users, and columns correspond to documents. Since a person has not seen most of the documents, most of the matrix cells will remain empty, while the others will be filled. For each event (like, dislike, click) the matrix holds some value, but let's consider a simplified model in which a like corresponds to 1 and a dislike to −1.

Let's decompose the matrix into two: P (m × d) and Q (d × n), where d is the dimension of the vector representation (usually a small number). Then each object corresponds to a d-dimensional vector (a user to a row of P, a document to a column of Q). These vectors are the embeddings of the corresponding objects. To predict whether a user will like a document, you can simply take the dot product of their embeddings.
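In code, the prediction for a (user, document) pair is just the dot product of two d-dimensional vectors; a tiny numpy sketch with purely illustrative shapes:

```python
import numpy as np

m, n, d = 1_000, 5_000, 64      # users, documents, embedding dimension (illustrative)
P = np.random.rand(m, d)        # user embeddings: row u is the vector of user u
Q = np.random.rand(d, n)        # document embeddings: column i is the vector of document i

def predicted_score(u, i):
    """Predicted affinity of user u to document i: dot product of their embeddings."""
    return P[u] @ Q[:, i]
```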

One of the possible ways to decompose a matrix is ALS (Alternating Least Squares). We will optimize the following loss function:

$$\sum_{(u,i)} \bigl(r_{ui} - \langle p_u, q_i \rangle\bigr)^2 \;\rightarrow\; \min_{P,\,Q}$$

Here r_ui is the interaction of user u with document i, q_i is the vector of document i, and p_u is the vector of user u; the sum runs over the filled (known) cells of the matrix.

Then the optimal user vector in terms of mean squared error (for fixed document vectors) can be found analytically by solving the corresponding linear regression.

This is called an "ALS step". The ALS algorithm itself consists in alternately fixing one of the matrices (users or documents) and updating the other, each time finding the optimal solution.
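As a sketch, the ALS step for one user is just a small linear regression over the documents that user has rated; a numpy version (in practice a regularization term is usually added, and production code relies on vector instructions rather than a generic solver):

```python
import numpy as np

def als_step_for_user(doc_vectors, ratings):
    """One ALS step: the optimal user vector for fixed document vectors.

    doc_vectors: (k, d) matrix of the k documents the user has interacted with
    ratings:     (k,) vector of +1 / -1 feedback for those documents
    Solves the least-squares problem min_p ||doc_vectors @ p - ratings||^2.
    """
    p_u, *_ = np.linalg.lstsq(doc_vectors, ratings, rcond=None)
    return p_u

# Runtime usage: right after a like or click, recompute the user's vector
# from the documents they have interacted with so far.
Q_u = np.random.rand(10, 64)                     # vectors of 10 rated documents (illustrative)
r_u = np.random.choice([-1.0, 1.0], size=10)
p_u = als_step_for_user(Q_u, r_u)
```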

Fortunately, finding the user's vector representation is a fairly fast operation that can be done at runtime using vector instructions. This trick allows user feedback to be reflected in the ranking immediately. The same embedding can also be used in the kNN index to improve candidate selection.

Distributed Collaborative Filtering

How to do incremental distributed matrix factorization and quickly find vector representations of new articles?

Content is not the only source of signals for recommendations. Collaborative information is another important source. Good ranking features can traditionally be obtained from a decomposition of the user-document matrix. But when trying to make such a decomposition, we ran into problems:

1. We have millions of documents and tens of millions of users. The matrix does not fit on a single machine, and the decomposition would take a very long time.
2. Most of the content in the system has a short lifetime: documents remain relevant for only a few hours. Therefore, it is necessary to build their vector representation as soon as possible.
3. If you build a decomposition immediately after the publication of the document, a sufficient number of users will not have time to evaluate it. Therefore, its vector representation is likely to be not very good.
4. If a user likes or dislikes something, we cannot take that into account in the decomposition right away.

To solve these problems, we implemented a distributed decomposition of the user-document matrix with frequent incremental updates. How exactly does it work?

Suppose we have a cluster of N machines (N is in the hundreds) and we want to perform a distributed decomposition of a matrix on them that does not fit on one machine. The question is how to perform this decomposition so that, on the one hand, there is enough data on each machine and, on the other hand, so that the calculations are independent?


We will use the ALS decomposition algorithm described above. Let's consider how to perform one ALS step in a distributed way; the remaining steps will be similar. Suppose the document matrix is fixed and we want to build the user matrix. To do this, we split it into N parts by rows, with each part containing approximately the same number of rows. Each machine receives the non-empty cells of its rows, as well as the document embedding matrix in its entirety. Since that matrix is not very large, and the user-document matrix is usually very sparse, this data fits on an ordinary machine.
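Roughly, the per-machine computation could look like the sketch below (an illustration under the assumptions just described: rows are sharded, the full document matrix is broadcast, and machines do not need to communicate within a step):

```python
import numpy as np

def user_step_on_one_machine(user_rows, Q):
    """Computes new vectors for the users assigned to this machine.

    user_rows: {user_id: [(doc_id, rating), ...]} -- only the non-empty cells
               of the user rows sharded onto this machine
    Q:         (n_docs, d) full document embedding matrix, sent to every machine
    """
    new_user_vectors = {}
    for user_id, events in user_rows.items():
        doc_ids = [doc_id for doc_id, _ in events]
        ratings = np.array([r for _, r in events], dtype=np.float64)
        Q_u = Q[doc_ids]                                   # vectors of the rated documents
        p_u, *_ = np.linalg.lstsq(Q_u, ratings, rcond=None)
        new_user_vectors[user_id] = p_u
    return new_user_vectors

# Each of the N machines runs this on its own shard; the resulting pieces of the
# user matrix are then simply concatenated, with no communication during the step.
Q = np.random.rand(1_000, 64)
shard = {42: [(3, 1.0), (17, -1.0)], 43: [(5, 1.0)]}
new_vectors = user_step_on_one_machine(shard, Q)
```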

This trick can be repeated for several epochs, alternately switching which matrix is fixed, until the model converges. But even then, the matrix decomposition can take several hours. And it still does not solve the problem that embeddings of new documents must be obtained quickly, and the embeddings of documents about which there was little information when the model was built must be updated.

Introducing a quick incremental update of the model helped us. Suppose we have a currently trained model. Since its training, new articles have appeared that our users have interacted with, as well as articles that had little interaction at training time. To quickly obtain embeddings for such articles, we use the user embeddings from the last big training run and take one ALS step to compute the document matrix with the user matrix fixed. This allows us to obtain embeddings quite quickly (within a few minutes after a document is published) and to frequently update the embeddings of fresh documents.
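Under the same assumptions, the incremental update for one fresh document is a single document-side ALS step with the user matrix fixed; a sketch:

```python
import numpy as np

def update_fresh_document(P, interactions):
    """One ALS step on the document side with the user matrix P fixed.

    P:            (n_users, d) user embeddings from the last full training run
    interactions: [(user_id, rating), ...] collected so far for a fresh document
    Returns the new embedding of that document.
    """
    user_ids = [user_id for user_id, _ in interactions]
    ratings = np.array([r for _, r in interactions], dtype=np.float64)
    P_i = P[user_ids]                                      # vectors of the users who reacted
    q_i, *_ = np.linalg.lstsq(P_i, ratings, rcond=None)
    return q_i

# Run every few minutes over recently published or recently active documents,
# so fresh articles get usable embeddings shortly after publication.
```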

In order to take a person's actions into account immediately, at runtime we do not use user embeddings computed offline. Instead, we take an ALS step and obtain an up-to-date user vector.

Transfer to another domain area

How to use user feedback on text articles to build a vector representation of a video?

Initially, we recommended only text articles, so many of our algorithms are tailored to this type of content. When adding content of other types, we faced the need to adapt our models. How did we solve this problem in the case of video? One option is to retrain all models from scratch. But that takes a long time, and besides, some algorithms are demanding in terms of training-sample size, which a new type of content simply does not have in the first moments of its life on the service.

We went another way and reused the text models for video. The same ALS trick helped us create vector representations of videos. We took the user vectors based on text articles and performed an ALS step using video view information. This way we easily obtained vector representations of videos. At runtime, we simply compute the proximity between the user vector obtained from text articles and the video vector.
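A sketch of the same trick for video (the exact view signal and its weighting are assumptions, since the post does not specify them): the user vectors learned from text articles stay fixed, one ALS step over view events yields a video vector in the same space, and the runtime score is a dot product.

```python
import numpy as np

def video_embedding_from_views(P_text, view_events):
    """ALS step: a video vector in the same space as text-based user vectors.

    P_text:      (n_users, d) user embeddings trained on text articles
    view_events: [(user_id, signal), ...], e.g. +1 for a completed view (illustrative)
    """
    user_ids = [user_id for user_id, _ in view_events]
    signals = np.array([s for _, s in view_events], dtype=np.float64)
    P_v = P_text[user_ids]
    q_video, *_ = np.linalg.lstsq(P_v, signals, rcond=None)
    return q_video

def runtime_score(user_vector, video_vector):
    """Proximity between the user's text-based vector and the video vector."""
    return float(user_vector @ video_vector)
```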

Conclusion

Developing the core of a real-time recommender system involves many tasks. You need to process data quickly and apply ML methods to use that data effectively; build complex distributed systems capable of processing user signals and new content in minimal time; and much more.

In the current system, whose architecture I have described, the quality of recommendations for a user grows along with their activity and time spent on the service. But of course, this is where the main difficulty lies: it is hard for the system to immediately understand the interests of a person who has interacted little with the content. Improving recommendations for new users is our key task. We will continue to optimize our algorithms so that content relevant to a person appears in their feed faster, and irrelevant content is not shown.

Source: habr.com
