How we dramatically improved the quality of recommendations in offline retail

Hi all! My name is Sasha, and I am CTO & Co-Founder at LoyaltyLab. Two years ago my friends and I, like all poor students, went out in the evening for beer to the store nearest to our place. We were very upset that the retailer, knowing we would come for beer, did not offer a discount on chips or crackers, even though it seems so logical! We could not understand why this was happening and decided to start our own company. Well, and as a bonus, to write out discounts for ourselves every Friday on those very chips.


Things eventually got to the point where I presented the technical side of the product at NVIDIA GTC. We are happy to share our work with the community, so I am publishing my talk in the form of this article.

Introduction

Like everyone at the beginning of the journey, we started with an overview of how recommender systems are made. And the architecture of the following type turned out to be the most popular:
[Diagram: typical two-stage recommender system architecture]

It consists of two parts:

  1. Sampling candidates for recommendations by a simple and fast model, usually collaborative.
  2. Ranking of candidates by a more complex and slower content model, taking into account all possible features in the data.

Here and below I will use the following terms:

  • candidate / candidate for recommendations - a user-product pair that can potentially end up in the production recommendations.
  • candidate extraction / extractor / candidate extraction method — a process or method for extracting “candidates for recommendations” from the available data.

At the first step, different variations of collaborative filtering are usually used, ALS being the most popular. Surprisingly, most articles about recommender systems only cover various improvements of the collaborative models at this first stage, and hardly anyone talks about other sampling methods. For us, the approach of using only collaborative models, and various optimizations of them, did not deliver the quality we expected, so we dug into the research specifically on this part. At the end of the article I will show how much we were able to improve on ALS, which was our baseline.
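For reference, here is a minimal sketch of what such an ALS candidate extractor could look like with the implicit library. The dataframe columns (user_id, item_id, quantity), the hyperparameters, and the implicit >= 0.5 API are assumptions for illustration, not our production setup.

```python
# A minimal ALS candidate extractor with the `implicit` library; column names,
# hyperparameters and the implicit >= 0.5 API are illustrative, not our setup.
import pandas as pd
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

def train_als(transactions: pd.DataFrame, factors: int = 64):
    users = transactions["user_id"].astype("category")
    items = transactions["item_id"].astype("category")
    # Sparse user-item matrix of purchase quantities.
    matrix = sp.coo_matrix(
        (transactions["quantity"].astype(float), (users.cat.codes, items.cat.codes))
    ).tocsr()
    model = AlternatingLeastSquares(factors=factors, regularization=0.01, iterations=15)
    model.fit(matrix)
    return model, matrix

def als_candidates(model, matrix, user_index: int, k: int = 100):
    # Top-k items for one user; these become the candidates passed to the ranker.
    item_idx, scores = model.recommend(user_index, matrix[user_index], N=k)
    return list(zip(item_idx.tolist(), scores.tolist()))
```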

Before I move on to describing our approach, it is important to note that with real-time recommendations, where we need to take into account events that happened 30 minutes ago, there really are not many approaches that can run within the time budget. In our case, however, we have to build recommendations no more than once a day, and in most cases once a week, which gives us the opportunity to use complex models and multiply the quality.

As a baseline, let's take the metrics that ALS alone shows on the candidate extraction task. The key metrics we monitor are:

  • Precision - the share of correctly selected candidates among all sampled candidates.
  • Recall - the share of products actually bought in the target interval that were covered by the candidates.
  • F1-score - the F-score computed from the previous two.
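As a rough illustration, candidate-level metrics of this kind can be computed per buyer and averaged; here candidates and actual are hypothetical dictionaries mapping user_id to sets of item_ids (extracted candidates and actual purchases in the target interval).

```python
# Per-user precision / recall / F1 on extracted candidates, averaged over users.
# `candidates` and `actual` map user_id -> set of item_ids; names are illustrative.
from statistics import mean

def candidate_metrics(candidates: dict, actual: dict):
    precisions, recalls, f1s = [], [], []
    for user, cand in candidates.items():
        truth = actual.get(user, set())
        hits = len(cand & truth)
        p = hits / len(cand) if cand else 0.0
        r = hits / len(truth) if truth else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        precisions.append(p)
        recalls.append(r)
    return mean(precisions), mean(recalls), mean(f1s)
```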

We will also look at the metrics of the final model after training gradient boosting with additional content features. There are also 3 main metrics:

  • precision@5 — average percentage of hits from the top 5 by probability for each customer.
  • response-rate@5 — conversion of buyers from a visit to the store into the purchase of at least one personal offer (one offer contains 5 products).
  • avg roc-auc per user — ROC-AUC computed for each buyer and then averaged over buyers.
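A possible way to compute the first and the last of these metrics is sketched below; the dataframe columns (user_id, item_id, proba, bought) are illustrative names, not our internal schema.

```python
# precision@5 and average per-user ROC-AUC on the ranker's output.
# `scores` has columns user_id, item_id, proba, bought (illustrative names).
import pandas as pd
from sklearn.metrics import roc_auc_score

def precision_at_5(scores: pd.DataFrame) -> float:
    top5 = (scores.sort_values("proba", ascending=False)
                  .groupby("user_id").head(5))
    return top5.groupby("user_id")["bought"].mean().mean()

def avg_roc_auc_per_user(scores: pd.DataFrame) -> float:
    aucs = [roc_auc_score(g["bought"], g["proba"])
            for _, g in scores.groupby("user_id")
            if g["bought"].nunique() == 2]           # ROC-AUC needs both classes
    return sum(aucs) / len(aucs) if aucs else float("nan")
```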

It is important to note that all these metrics are measured with time-series cross-validation: the model is trained on the first k weeks, and week k + 1 is used as the test data. Thus, seasonal ups and downs have a minimal effect on the interpretation of model quality. Further on, in all charts the abscissa axis is the week number in cross-validation and the ordinate axis is the value of the corresponding metric. All graphs are built on the transactional data of a single client, so that the comparison between them is fair.
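Conceptually, such a split can be sketched like this (the week column and the number of initial training weeks are assumptions):

```python
# Expanding-window split: train on the first k weeks, test on week k + 1.
# The `week` column and the number of initial training weeks are illustrative.
import pandas as pd

def time_series_folds(transactions: pd.DataFrame, first_train_weeks: int = 8):
    weeks = sorted(transactions["week"].unique())
    for k in range(first_train_weeks, len(weeks)):
        train = transactions[transactions["week"].isin(weeks[:k])]
        test = transactions[transactions["week"] == weeks[k]]
        yield train, test
```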

Before we start describing our approach, let's first take a look at the baseline: the trained ALS model.
Candidate extraction metrics:
[Chart: candidate extraction metrics of the ALS baseline across cross-validation weeks]

Final metrics:
[Chart: final model metrics of the ALS baseline across cross-validation weeks]

I treat every algorithm implementation as a kind of business hypothesis. Very roughly, any collaborative model can be seen as the hypothesis that “people tend to buy what people similar to them buy”. As I said, we did not limit ourselves to that semantics alone, and here are some hypotheses that also work really well on offline retail data:

  1. What you bought before.
  2. Similar to what you bought before.
  3. The period of a long past purchase.
  4. Popular by category/brand.
  5. Alternating purchases of different goods from week to week (Markov chains).
  6. Products closest to the buyer in the embedding spaces built by different models (Word2Vec, DSSM, etc.).

What did you buy before

The most obvious heuristic, and one that works very well in grocery retail: here we take all the goods that the loyalty card holder bought in the last K days (usually 1-3 weeks), or during the same K-day window a year earlier. Applying only this method, we obtain the following metrics:
[Chart: metrics of the “what was bought before” heuristic for different periods]

Quite obviously, the longer the period we take, the higher the recall and the lower the precision, and vice versa. On average across clients, the “last 2 weeks” window gives the best results.
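A minimal sketch of this extractor, assuming a transactions dataframe with user_id, item_id and date columns, could look like this:

```python
# All items a loyalty-card holder bought in the last K days; the `date`,
# `user_id` and `item_id` columns are illustrative.
import pandas as pd

def bought_before_candidates(transactions: pd.DataFrame, k_days: int = 14) -> dict:
    cutoff = transactions["date"].max() - pd.Timedelta(days=k_days)
    recent = transactions[transactions["date"] >= cutoff]
    return recent.groupby("user_id")["item_id"].apply(set).to_dict()
```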

Similar to what I bought before

Not surprisingly, “what was bought before” works well for grocery retail, but extracting candidates only from what the user has already bought is not that exciting, because it is unlikely to surprise the buyer with some new product. Therefore, we propose to slightly improve this heuristic using the same collaborative models: from the vectors obtained during ALS training we can retrieve products similar to what the user has already bought. This idea is very similar to “similar videos” in video streaming services, but since we do not know what the user is eating/buying at a particular moment, we can only look for something similar to what he has already bought, especially since we already know how well that works. Applying this method to user transactions over the past 2 weeks, we get the following metrics:
[Chart: metrics of the “similar to what was bought before” method for different k]

Here k is the number of similar products retrieved for each product the buyer purchased in the last 14 days.
This approach worked especially well for one client, for whom it was critical not to recommend anything that was already in the user's purchase history.
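A sketch of this extractor on top of a trained implicit ALS model might look as follows; recent_items_per_user is a hypothetical mapping from users to the item indices they bought in the last 2 weeks, and the (ids, scores) return shape assumes implicit >= 0.5.

```python
# Expand each user's recent purchases with the k nearest items by ALS vectors.
# `recent_items_per_user` maps user_id -> iterable of item indices from the ALS
# matrix; the (ids, scores) return shape assumes implicit >= 0.5.
def similar_to_bought(model, recent_items_per_user: dict, k: int = 5) -> dict:
    candidates = {}
    for user, items in recent_items_per_user.items():
        expanded = set()
        for item_idx in items:
            sim_idx, _ = model.similar_items(item_idx, N=k + 1)
            expanded.update(int(i) for i in sim_idx[1:])   # drop the item itself
        candidates[user] = expanded
    return candidates
```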

Long past purchase period

As we have already found out, the first approach works well for our domain because of the high purchase frequency of goods. But what about goods like washing powder, shampoo, etc., i.e. products that are unlikely to be needed every week or two and that the previous methods cannot extract? This leads to the following idea: calculate the average purchase period of each product over buyers who purchased the product more than k times, and then extract what the buyer has most likely already run out of. The calculated periods for goods can be sanity-checked by eye:
[Table: calculated average purchase periods for sample products]

Then we check whether the end of the product's period falls within the time interval when the recommendations will be live in production, and sample whatever falls into it. The approach can be illustrated like this:
[Diagram: sampling products whose purchase period ends within the target interval]

Here we have 2 main cases that can be considered:

  1. Whether to sample products for customers who bought the product less than K times.
  2. Whether to sample the product if the end of its period falls before the start of the target interval.

The following graph shows what results such a method achieves with different hyperparameters:
[Chart: metrics of the purchase-period method for different hyperparameters]
ft - Take only buyers who bought the product at least K (here K = 5) times
tm — Take only candidates who fall into the target interval

Not surprisingly, the configuration (0, 0) has the highest recall and the lowest precision, since under this condition the most candidates are extracted. However, the best results are obtained when we do not sample products for customers who bought a particular product fewer than k times and do extract, among other things, goods whose period end falls before the start of the target interval.
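To make the method concrete, here is a small sketch of how the per-product purchase periods and the resulting candidates could be computed; column names and the handling of the two cases above are illustrative choices, not our exact implementation.

```python
# Average repurchase period per product (over buyers with >= min_purchases of it),
# and candidates whose period has run out by the end of the target interval.
import pandas as pd

def product_periods(transactions: pd.DataFrame, min_purchases: int = 5) -> pd.DataFrame:
    dates = (transactions.sort_values("date")
             .groupby(["user_id", "item_id"])["date"].agg(list))
    dates = dates[dates.apply(len) >= min_purchases]
    gaps = dates.apply(lambda d: pd.Series(d).diff().dt.days.mean())
    return gaps.groupby(level="item_id").mean().rename("period_days").reset_index()

def period_candidates(transactions: pd.DataFrame, periods: pd.DataFrame, target_end) -> dict:
    last = transactions.groupby(["user_id", "item_id"])["date"].max().reset_index()
    last = last.merge(periods, on="item_id")
    due = last["date"] + pd.to_timedelta(last["period_days"], unit="D")
    hit = last[due <= target_end]   # the buyer has probably run out by the target interval
    return hit.groupby("user_id")["item_id"].apply(set).to_dict()
```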

Popular by Category

Another fairly obvious idea is to sample popular products across different categories or brands. Here, for each customer we calculate the top-k “favorite” categories/brands and extract “popular” products from those categories/brands. In our case, “favorite” and “popular” are defined by the number of product purchases. An additional advantage of this approach is its applicability to the cold-start case, that is, for customers who have made very few purchases, have not been in the store for a long time, or have only just issued a loyalty card. For them, the simplest and best option is to offer goods that are popular with buyers who do have a purchase history. The metrics are as follows:
[Chart: metrics of the “popular by category/brand” method for different category levels]
Here, the number after the word “category” means the nesting level of the category.

In general, it is also not surprising that narrower categories achieve better results, as they extract more accurate “favorite” products for buyers.
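A minimal sketch of this extractor, assuming a transactions dataframe with user_id, item_id and category columns, might look like this:

```python
# Top favourite categories per customer (by purchase count) filled with the most
# popular products in those categories; the `category` column is illustrative.
import pandas as pd

def popular_by_category(transactions: pd.DataFrame,
                        top_categories: int = 3, top_items: int = 10) -> dict:
    popular = (transactions.groupby(["category", "item_id"]).size()
               .rename("cnt").reset_index()
               .sort_values("cnt", ascending=False)
               .groupby("category")["item_id"]
               .apply(lambda s: s.head(top_items).tolist()))

    favourites = (transactions.groupby(["user_id", "category"]).size()
                  .rename("cnt").reset_index()
                  .sort_values("cnt", ascending=False)
                  .groupby("user_id")["category"]
                  .apply(lambda s: s.head(top_categories).tolist()))

    return {user: {item for cat in cats for item in popular.get(cat, [])}
            for user, cats in favourites.items()}
```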

Alternating purchases of different goods from week to week

An interesting approach that I have not seen in articles about recommender systems is a fairly simple and at the same time effective statistical method: Markov chains. Here we take 2 different weeks, and for each customer we build pairs of products [bought in week i]-[bought in week j], where j > i; from these we estimate, for each product, the probability of switching to another product the following week. That is, for each pair of goods (product_i, product_j) we count how many times it occurs among the found pairs and divide by the number of pairs where product_i was in the first week. To extract candidates, we take the buyer's last receipt and retrieve the top-k most likely next products from the resulting transition matrix. The process of building the transition matrix looks like this:
[Diagram: building the week-to-week transition matrix]
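A small sketch of how such a transition matrix and the candidates from it could be computed; the basket dictionaries (user_id to set of item_ids per week) are illustrative inputs, not our production pipeline.

```python
# Sketch: estimate P(product_j next week | product_i this week) as the pair count
# divided by the number of pairs in which product_i appears in the first week,
# then take the top-k most likely "next" products for the buyer's last receipt.
from collections import defaultdict

def transition_matrix(week_i_baskets: dict, week_j_baskets: dict) -> dict:
    # Both arguments map user_id -> set of item_ids bought in that week (j > i).
    pair_counts, first_counts = defaultdict(int), defaultdict(int)
    for user, items_i in week_i_baskets.items():
        items_j = week_j_baskets.get(user, set())
        for a in items_i:
            first_counts[a] += len(items_j)          # pairs with `a` on the left
            for b in items_j:
                pair_counts[(a, b)] += 1
    return {pair: cnt / first_counts[pair[0]] for pair, cnt in pair_counts.items()}

def markov_candidates(last_receipt: set, probs: dict, k: int = 4) -> set:
    candidates = set()
    for a in last_receipt:
        successors = sorted(((b, p) for (x, b), p in probs.items() if x == a),
                            key=lambda t: t[1], reverse=True)[:k]
        candidates.update(b for b, _ in successors)
    return candidates
```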

From real examples in the matrix of transition probabilities, we see the following interesting phenomena:
[Table: examples from the transition probability matrix]
Here you can see interesting patterns in consumer behavior: for example, citrus lovers, or a milk brand from which buyers most likely switch to another one. It is also not surprising that items with high repeat-purchase rates, like butter, end up here as well.

The metrics in the method with Markov chains are as follows:
[Chart: metrics of the Markov chain method for different k]
Here k is the number of products retrieved for each item in the buyer's last transaction.
As we can see, the configuration with k=4 shows the best result. The spike at week 4 can be explained by seasonal behavior around holidays. 

Similar products to buyers, according to characteristics built by different models

So we come to the most difficult and interesting part: the search for nearest neighbors in the spaces of buyer and product vectors built by various models. In our work we use 3 such models:

  • ALS
  • Word2Vec (Item2Vec for such tasks)
  • DSSM

We have already dealt with ALS; you can read about how it is trained here. In the case of Word2Vec, we use the well-known implementation of the model from gensim. By analogy with texts, we treat a purchase receipt as a sentence. Thus, when constructing a product vector, the model learns to predict the product's “context” in the receipt (the rest of the goods in the receipt). In e-commerce data it is better to use the buyer's session instead of a receipt, as the guys from Ozon did. DSSM is more interesting to break down. It was originally written by the guys from Microsoft as a search model; you can read the original research paper here. The architecture of the model looks like this:
[Diagram: the original DSSM architecture]

Here Q is a query (the user's search query) and D[i] is a document (a web page). The model receives the features of the query and of the pages as input. Each input layer is followed by a number of fully connected layers (a multilayer perceptron). The model then learns to minimize the cosine distance between the vectors obtained in its last layers.
Recommendation tasks use exactly the same architecture, except that instead of a query there is a user, and instead of pages there are products. In our case, the architecture turns into the following:
[Diagram: our DSSM-style two-tower architecture for users and products]
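For illustration only, here is a highly simplified two-tower sketch in PyTorch in the spirit of DSSM; the layer sizes, feature dimensions and the in-batch softmax loss are assumptions, not our production model.

```python
# A highly simplified two-tower model in PyTorch, in the spirit of DSSM: a user
# tower and an item tower map their input features to unit vectors, and training
# pushes up the cosine similarity of observed (user, purchased item) pairs while
# treating other items in the batch as negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)      # unit norm: dot product = cosine

class TwoTower(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()
        self.user_tower = Tower(user_dim, emb_dim)
        self.item_tower = Tower(item_dim, emb_dim)

    def forward(self, user_feats: torch.Tensor, item_feats: torch.Tensor):
        return self.user_tower(user_feats), self.item_tower(item_feats)

def in_batch_softmax_loss(u: torch.Tensor, v: torch.Tensor, temperature: float = 0.05):
    # Row i of `u` pairs with row i of `v` (a real purchase); other rows act as negatives.
    logits = (u @ v.t()) / temperature
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)
```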

Now, before checking the results, one last point remains: while in the case of ALS and DSSM we have explicitly defined user vectors, in the case of Word2Vec we only have product vectors. Here, to build a user vector, we identified 3 main approaches:

  1. Simply sum the vectors; for cosine distance this means we just averaged the products in the purchase history.
  2. Sum the vectors with some time-based weighting.
  3. Weight the goods with a TF-IDF coefficient.

In the case of linear weighting of the buyer vector, we proceed from the hypothesis that a product the user bought yesterday influences his behavior more than a product he bought six months ago. So we weight the buyer's most recent week with a coefficient of 1, and the weeks before it with coefficients of ½, ⅓, etc.:
[Diagram: time-weighted summation of product vectors into a buyer vector]

For the TF-IDF coefficients we do exactly the same thing as in TF-IDF for texts, only we treat the buyer as a document, a receipt as a sentence and, accordingly, a product as a word. This way the user vector shifts more towards rare goods, while goods that are frequent and familiar to the buyer change it very little. The approach can be illustrated like this:
[Diagram: TF-IDF weighting of product vectors in the buyer vector]
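A minimal sketch of training Item2Vec on receipts with gensim and building a buyer vector by plain, time-weighted or TF-IDF-weighted summation; the hyperparameters and the optional idf dictionary are illustrative.

```python
# Sketch: receipts play the role of sentences for gensim's Word2Vec, and a buyer
# vector is a weighted sum of product vectors (plain sum, 1/week time weighting,
# or an extra TF-IDF weight per product).
import numpy as np
from gensim.models import Word2Vec

def train_item2vec(receipts):
    # `receipts` is a list of receipts, each a list of item-id strings.
    # A large window covers the whole receipt; values here are illustrative.
    return Word2Vec(sentences=receipts, vector_size=64, window=50, min_count=5, sg=1)

def user_vector(model, weekly_baskets, idf=None):
    # `weekly_baskets`: list of item-id lists, the most recent week first.
    vec = np.zeros(model.wv.vector_size)
    for week_no, basket in enumerate(weekly_baskets):
        time_weight = 1.0 / (week_no + 1)            # 1, 1/2, 1/3, ...
        for item in basket:
            if item in model.wv:
                w = time_weight * (idf.get(item, 1.0) if idf else 1.0)
                vec += w * model.wv[item]
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec
```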

Now let's look at the metrics. This is what ALS results look like:
[Chart: candidate extraction metrics for ALS nearest neighbours]
Metrics for Item2Vec with different ways of constructing the buyer vector:
[Chart: candidate extraction metrics for Item2Vec with different buyer-vector constructions]
Here exactly the same model is used as in our baseline; the only difference is which k we use. To rely on collaborative models alone, you would have to take about 50-70 of the closest products for each customer.

And DSSM metrics:
[Chart: candidate extraction metrics for DSSM]

How to combine all methods?

Cool, you say, but what to do with such a large set of candidate extraction tools? How to choose the optimal configuration for your data? Here we have several problems:

  1. It is necessary to somehow limit the search space for hyperparameters in each method. It is, of course, discrete everywhere, but the number of possible points is very large.
  2. How do we choose the best configuration for our metric from a limited set of specific methods with specific hyperparameters?

We have not yet found an unambiguously correct answer to the first question, so we proceed as follows: for each method, a limiter on the hyperparameter search space is written, based on certain statistics of the data we have. Thus, knowing the average period between people's purchases, we can guess which periods to try for the “what was already bought” and “long past purchase period” methods.

And after we have gone through some adequate number of variations of different methods, we note the following: each implementation extracts a certain number of candidates and has a certain value of the metric (recall) that is key for us. We want to get a certain number of candidates in total, depending on our allowable computing power, with the highest possible metric. Here the problem nicely collapses into the knapsack problem.
[Illustration: the knapsack problem]

Here the number of candidates is the weight of an item, and the method's recall is its value. However, there are 2 more points that should be taken into account when implementing the algorithm:

  • Methods may have overlap in the candidates they pull out.
  • In some cases it makes sense to take the same method twice with different parameters, and the candidates produced by the first will not be a subset of the second.

For example, if we take implementations of the “what was already bought” method with different extraction intervals, their candidate sets will be nested inside one another. At the same time, different parameters of the “periodic purchases” method do not produce fully overlapping outputs. Therefore, we split the sampling methods with different parameters into blocks, so that from each block we take at most one extraction approach with specific hyperparameters (see the sketch below). To do this you need to tweak the knapsack implementation a little, but the asymptotic complexity and the result do not change.
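Here is a small sketch of such a grouped (multiple-choice) knapsack; it treats the total recall as the simple sum of per-method recalls, which ignores the overlap mentioned above, and the block/extractor names and numbers are made up.

```python
# Grouped (multiple-choice) knapsack: from each block of related extractor
# configurations take at most one; weight = number of candidates it produces,
# value = its recall. Summing recalls ignores overlap between blocks.
def pick_extractors(blocks, capacity):
    # blocks: list of blocks, each a list of (name, n_candidates, recall) tuples.
    # dp maps total candidate count -> (best summed recall, chosen extractor names).
    dp = {0: (0.0, [])}
    for block in blocks:
        new_dp = dict(dp)                          # option: skip this block entirely
        for weight, (value, chosen) in dp.items():
            for name, n_candidates, recall in block:
                new_weight = weight + n_candidates
                if new_weight > capacity:
                    continue
                candidate = (value + recall, chosen + [name])
                if new_weight not in new_dp or candidate[0] > new_dp[new_weight][0]:
                    new_dp[new_weight] = candidate
        dp = new_dp
    return max(dp.values(), key=lambda state: state[0])

# Example with made-up numbers:
# blocks = [[("bought_before_7d", 12_000, 0.21), ("bought_before_14d", 20_000, 0.27)],
#           [("als_k50", 50_000, 0.35), ("als_k100", 90_000, 0.41)]]
# best_recall, chosen = pick_extractors(blocks, capacity=100_000)
```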

Such a clever combination gives us the following metrics compared with collaborative models alone:
[Chart: candidate extraction metrics of the combined approach vs. collaborative models]
On the final metrics we see the following picture:
[Chart: final model metrics of the combined approach vs. collaborative models]

However, you can see here that one point needed for business-useful recommendations is still uncovered. So far we have only learned to predict well what the user will buy, say, next week. But simply giving a discount on something he will buy anyway is not that valuable. What is valuable is maximizing the expectation of, for example, the following metrics:

  1. Margin/turnover based on personal recommendations.
  2. Average check of buyers.
  3. Visit frequency.

So we multiply the obtained probabilities by various coefficients and re-rank the products so that the top contains products that affect the metrics above. There is no ready-made recipe for which approach works best; we ourselves experiment with such coefficients directly in production. Here are a few tricks that most often give us the best results (a small sketch of this re-ranking follows the list):

  1. Multiply by the price/margin of the item.
  2. Multiply by the average value of the receipts in which the product occurs, so that products which are usually bought together with something else come out on top.
  3. Multiply by the average visit frequency of buyers of this product, based on the hypothesis that the product provokes more frequent return visits.
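A tiny sketch of this re-ranking step; the column names (proba, margin, avg_receipt, visit_freq) are illustrative, and in practice the coefficients are tuned in production as described above.

```python
# Sketch: multiply the model's purchase probability by a business factor
# (e.g. the item's margin) and take the new top-5 per buyer.
import pandas as pd

def rerank(scored: pd.DataFrame, factor: str = "margin", top_n: int = 5) -> pd.DataFrame:
    # scored: one row per (user_id, item_id) with a `proba` column plus
    # business columns such as margin, avg_receipt, visit_freq (illustrative names).
    scored = scored.assign(business_score=scored["proba"] * scored[factor])
    return (scored.sort_values("business_score", ascending=False)
                  .groupby("user_id")
                  .head(top_n))
```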

After experimenting with coefficients, we got the following metrics in production:
[Chart: production metrics after experimenting with re-ranking coefficients]
Here overall product conversion is the share of purchased products among all products in the recommendations that we generated.

An attentive reader will notice a significant difference between the offline and online metrics. This is explained by the fact that not all dynamic filters on which products may be recommended can be taken into account when training the model. It is normal for us that half of the extracted candidates get filtered out; such specificity is typical for our industry.

In terms of revenue, the picture is as follows: after the launch of the recommendations, the revenue of the test group grows strongly, and the average increase in revenue with our recommendations is now 3-4%:
[Chart: revenue uplift of the test group after launching recommendations]

In conclusion, I want to say that if you do not need real-time recommendations, a very large quality gain can be found in experiments with extracting candidates for recommendations. A large time budget for generating them makes it possible to combine many good methods, which together produce great results for the business.

I will be glad to chat in the comments with everyone who finds this material interesting. You can ask me questions personally on Telegram. I also share my thoughts on AI/startups in my Telegram channel — welcome :)

Source: habr.com
