Operation of machine learning in Mail.ru Mail


Based on my talks at Highload++ and DataFest Minsk 2019.

For many people today, email is an integral part of life online. With it, we conduct business correspondence and store all kinds of important information related to finances, hotel reservations, orders, and much more. In mid-2018, we formulated a product development strategy for Mail. What should modern mail be like?

Mail must be smart: it should help users navigate the ever-growing volume of information by filtering it, structuring it, and presenting it in the most convenient way. It must be useful, letting you solve various tasks right in your mailbox, for example, pay fines (a feature that, unfortunately, I use). And at the same time, of course, mail must protect your information, cutting off spam and defending against hacking; that is, it must be safe.

These directions define a number of key tasks, many of which can be solved effectively using machine learning. Here are examples of existing features developed as part of the strategy, one for each direction.

  • Smart Reply. Mail has a smart reply feature. A neural network analyzes the text of a letter, understands its meaning and purpose, and offers the three most appropriate response options: positive, negative, and neutral. This saves a lot of time when answering letters, and the suggested replies are often pleasantly non-standard and funny.
  • Grouping letters related to orders in online stores. We do a lot of online shopping, and a store typically sends several emails per order. For example, AliExpress, the largest such service, sends a lot of letters for a single order; we calculated that in the extreme case their number can reach 29. Therefore, using a Named Entity Recognition model, we extract the order number and other information from the text and group all the letters into one thread. We also show the basic information about the order in a separate box, which makes it easier to work with this type of email.


  • Antiphishing. Phishing is a particularly dangerous type of fraudulent email with which attackers try to get hold of financial information (including the user's bank card details) and logins. Such letters mimic real ones sent by a service, including visually. Therefore, with the help of Computer Vision, we recognize the logos and visual style of letters from large companies (for example, Mail.ru, Sberbank, Alfa) and take this into account along with text and other features in our spam and phishing classifiers.

Machine learning

A little about machine learning in Mail in general. Mail is a high-load system: on average, 1.5 billion letters a day pass through our servers for 30 million DAU. About 30 machine learning systems serve all the necessary functions and features.

Each email goes through a whole classification pipeline. First, we cut off spam and keep the good emails. Users often do not notice the work of antispam, because 95-99% of spam does not even reach the appropriate folder. Spam detection is a very important part of our system, and the most difficult one: antispam is a constant arms race between defense and attack, which makes it an ongoing engineering challenge for our team.

Next, we separate letters from people and letters from robots. Emails from people are the most important, so we provide features like Smart Reply for them. Emails from robots are divided into two parts: transactional, the important letters from services such as purchase confirmations, hotel reservations, and financial documents, and informational, which is business advertising and discounts.

We believe that transactional emails are equal in importance to personal correspondence. They should be at hand, because we often need to quickly find information about an order or a flight reservation, and we waste time looking for these letters. Therefore, for convenience, we automatically divide them into six main categories: travel, orders, finances, tickets, registrations, and, finally, fines.

Informational emails are the largest and probably least important group; they do not require an instant response, since nothing significant will change in the user's life if such a letter goes unread. In our new interface, we fold them into two threads, social networks and mailing lists, thus visually clearing the inbox and leaving only important messages in sight.


Operation

A large number of systems brings many operational difficulties. After all, models degrade over time like any software: features break, machines fail, buggy code gets deployed. In addition, the data is constantly changing: new data arrives, user behavior patterns shift, and so on, so without proper support a model will work worse and worse over time.

We must not forget that the deeper machine learning penetrates users' lives, the greater the impact it has on the ecosystem and, as a result, the more financial losses or profits market players can incur. Therefore, in more and more areas, players adapt to the work of ML algorithms (classic examples are advertising, search, and the already mentioned antispam).

Machine learning tasks also have a peculiarity: any change in the system, however insignificant, can generate a lot of work on the model: working with data, retraining, deployment, which can take weeks or months. Therefore, the faster the environment in which your models operate changes, the more effort they require to maintain. A team can create a lot of systems and be happy about it, and then spend almost all of its resources maintaining them, with no opportunity to do anything new. We once found ourselves in exactly this situation in the antispam team, and we drew the obvious conclusion: maintenance must be automated.

Automation

What can be automated? In fact, almost everything. I have identified four areas that define the machine learning infrastructure:

  • data collection;
  • retraining;
  • deploy;
  • testing & monitoring.

If the environment is unstable and constantly changing, then the entire infrastructure around the model is much more important than the model itself. It may be a good old linear classifier, but if it is properly wired up with features and good user feedback, it will work much better than state-of-the-art models with all the bells and whistles.

Feedback loop

This loop combines data collection, retraining, and deployment: in fact, the entire model update cycle. Why is it important? Look at the registration graph in Mail:


A machine learning developer implemented an anti-bot model that prevents bots from registering in Mail. The graph drops to a level where only real users remain. Everything is great! But four hours pass, the bot operators tweak their scripts, and everything returns to where it was. The developer spent a month adding features and retraining the model, but the spammer was able to adapt in four hours.

To avoid this kind of pain and the need to redo everything later, we must think in advance about what the feedback loop will look like and what we will do if the environment changes. Let's start with data collection, the fuel for our algorithms.

Data collection

It is clear that for modern neural networks, the more data the better, and that data is, in fact, generated by the users of the product. Users can help us by labeling data, but this should not be abused, because at some point users will get tired of teaching your models and will switch to another product.

One of the most common mistakes (here I refer to Andrew Ng) is to focus too much on metrics on the test dataset and not on user feedback, which is actually the main measure of quality, since we are creating a product for the user. If users do not understand or do not like how the model works, then everything goes to waste.

Therefore, the user should always be able to vote; you should give them a feedback tool. If we believe that a letter related to finance has arrived in the mailbox, we need to label it "finance" and draw a button that the user can click to say that this is not finance.

Feedback quality

Let's talk about the quality of user feedback. First, you and the user can put different meanings into the same concept. For example, you and your product managers think that "finance" means letters from the bank, while the user thinks that a letter from grandma about her pension is also finance. Second, there are users who mindlessly like to press buttons without any logic. Third, the user may be deeply mistaken in their conclusions. A striking example from our practice is the rollout of a classifier for Nigerian spam, a rather funny type of spam where the user is asked to collect several million dollars from a suddenly discovered distant relative in Africa. After deploying this classifier, we checked the "Not spam" clicks on these emails, and it turned out that 80% of them were juicy Nigerian spam, which suggests that users can be extremely gullible.

And let's not forget that buttons can be pressed not only by people, but also by all sorts of bots pretending to be a browser. So raw feedback is no good for training. What can be done with this information?

We take two approaches:

  • Feedback from related ML systems. For example, we have an online anti-bot system that, as I mentioned, makes a quick decision based on a limited number of features. And there is a second, slow system that works after the fact. It has more data about the user, their behavior, and so on. As a result, it makes the most considered decision and accordingly has higher precision and recall. You can feed the difference between the outputs of these systems back into the first one as training data. This way, the simpler system will always try to approach the performance of the more complex one.
  • Click classification. You can simply classify each user click, evaluating its validity and usefulness. We do this in Mail's antispam using the user's attributes, their history, the sender's attributes, the text itself, and the output of the classifiers. As a result, we get an automatic system that validates user feedback. And since it needs to be retrained much less often, its output can serve as the basis for all the other systems. Precision is the main priority in this model, because training a model on inaccurate data is fraught with consequences. A minimal sketch of such a click validator follows this list.
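Here is what such a click validator could look like in code. This is only a sketch: the feature set, thresholds, and the two toy rows are illustrative, not our production features.

```python
# A minimal sketch of a click-validity classifier (hypothetical feature
# names; in production the features come from user history, sender
# reputation, and the outputs of the content classifiers).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row describes one "Not spam" / "Spam" click.
# Columns: account age (days), user's past click agreement rate,
# sender reputation, spam score of the email, clicks in the last hour.
X = np.array([
    [1200, 0.95, 0.9, 0.10, 1],   # plausible click by a trusted user
    [3,    0.10, 0.2, 0.99, 50],  # bot-like burst on an obvious spam email
    # ... thousands of labelled clicks in reality
])
y = np.array([1, 0])  # 1 = valid feedback, 0 = invalid/noise

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X, y)

# Keep only the clicks the validator is very sure about: precision is
# the priority, because this cleaned feedback feeds every other system.
def filter_clicks(features, threshold=0.9):
    proba = clf.predict_proba(features)[:, 1]
    return features[proba >= threshold]
```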

While we are cleaning data and retraining our ML systems, we must not forget about the users: for us, thousands or millions of errors on a chart are statistics, but for the user, every bug is a tragedy. Besides having to somehow live with your error in the product, after giving feedback the user expects the same situation to be excluded in the future. Therefore, you should always give users not only the opportunity to vote, but also a way to correct the behavior of the ML systems, for example, by creating a personal heuristic for each feedback click; in the case of mail, this could be the ability to filter such letters by sender and subject for this particular user.

You also need to patch the model with crutches based on reports or support tickets, in semi-automatic or manual mode, so that other users do not suffer from similar problems.

Heuristics for Learning

There are two problems with these heuristics and crutches. The first is that the ever-growing number of crutches is difficult to maintain, not to mention their quality and performance in the long run. The second is that an error may be infrequent, and a few clicks will not be enough to retrain the model. It would seem that these two unrelated effects can be largely neutralized by applying the following approach.

  1. We create a temporary crutch.
  2. We send data from it to the model, which is regularly retrained, including on the received data. Here, of course, it is important that the heuristic has high precision so as not to reduce the quality of the training data.
  3. Then we set up monitoring on the crutch, and if after some time the crutch no longer fires and is completely covered by the model, then we can safely remove it. Now this problem is unlikely to recur. (A sketch of such coverage monitoring follows below.)
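A minimal sketch of what the coverage check might look like; the `heuristic` and `model` interfaces are assumptions, not our internal API.

```python
# A minimal sketch of "crutch coverage" monitoring. Assumed interfaces:
# heuristic(email) returns True when the crutch fires, and
# model.predict_proba(email) returns the model's spam probability.
def crutch_coverage(emails, heuristic, model, threshold=0.5):
    """Fraction of emails caught by the heuristic that the retrained
    model now catches on its own."""
    caught = [e for e in emails if heuristic(e)]
    if not caught:
        return 1.0  # the heuristic no longer fires at all
    agreed = sum(model.predict_proba(e) >= threshold for e in caught)
    return agreed / len(caught)

# If coverage stays ~1.0 for a few weeks, the crutch can be removed.
```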

So an army of crutches is very useful. The main thing is that their service should be a fixed term, not a permanent one.

Retraining

Retraining is the process of adding new data, obtained from user feedback or from other systems, and training an existing model on it. There can be several problems with retraining:

  1. The model may simply not support incremental training and may only learn from scratch.
  2. Nowhere in the book of nature is it written that retraining will certainly improve the quality of the model in production. It often happens quite the opposite: things only get worse.
  3. Changes can be unpredictable. This is a rather subtle point that we identified for ourselves. Even if a new model shows results similar to the current one in an A/B test, this does not mean at all that it will work identically. They may differ by as little as one percent, which can bring new errors or bring back old, already fixed ones. Both we and our users already know how to live with the current errors, but when a large number of new errors appear, the user may no longer understand what is happening, because they expect predictable behavior.

Therefore, the most important thing in retraining is to guarantee that the model will improve, or at least not get worse.

The first thing that comes to mind when talking about retraining is the Active Learning approach. What does this mean? For example, a classifier determines whether a letter is about finance, and we add a batch of labeled examples around its decision boundary. This works well where there is a lot of feedback and the model can be trained online, for example in advertising. And if there is little feedback, we get a sample that is highly biased relative to the production data distribution, on the basis of which it is impossible to evaluate the model's behavior in operation.
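For illustration, here is a minimal sketch of uncertainty sampling, the classic Active Learning recipe; `clf` stands for any fitted probabilistic classifier.

```python
# A minimal sketch of uncertainty sampling near the decision boundary.
import numpy as np

def select_for_labeling(clf, unlabeled_pool, batch_size=100):
    """Pick the emails the classifier is least sure about."""
    proba = clf.predict_proba(unlabeled_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)      # 0 = right on the boundary
    idx = np.argsort(uncertainty)[:batch_size]
    return unlabeled_pool[idx]             # send these to annotators
```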


In fact, our goal is to preserve the old patterns already known to the model while acquiring new ones. Continuity is important here. The model, which we often rolled out with great difficulty, is already working, so we can rely on its performance.

Different models are used in Mail: trees, linear models, neural networks. For each, we have developed our own retraining algorithm. In the process of retraining, we obtain not only new data but often new features as well, which we take into account in all the algorithms below.

Linear Models

Let's say we have a logistic regression. We compose the model's loss from the following components:

  • LogLoss on new data;
  • we regularize the weights of the new features (we do not touch the old ones);
  • we also train on the old data in order to preserve the old patterns;
  • and, perhaps most importantly, we add Harmonic Regularization, which ensures that the weights do not change much relative to the old model, according to the norm.

Since each loss component has a coefficient, we can choose the optimal values for our task on cross-validation or based on product requirements. A minimal sketch of the whole loss follows.
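Here is one way this composite loss could be written down, assuming a plain NumPy logistic regression; the mixing coefficients and the old/new feature split are illustrative.

```python
# A minimal sketch of the composite retraining loss for logistic
# regression. Assumptions: `new_idx` marks the new features, `w_old`
# holds the production weights for the old features in the same order
# as they appear in `w`, and a, b, c, d are tuned on cross-validation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def retrain_loss(w, w_old, X_new, y_new, X_old, y_old, new_idx,
                 a=1.0, b=0.1, c=0.5, d=1.0):
    old_idx = np.setdiff1d(np.arange(len(w)), new_idx)
    return (a * logloss(w, X_new, y_new)        # fit the new data
            + b * np.sum(w[new_idx] ** 2)       # L2 on new features only
            + c * logloss(w, X_old, y_old)      # keep the old patterns
            # harmonic regularization: stay close to production weights
            + d * np.sum((w[old_idx] - w_old) ** 2))
```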


Trees

Let's move on to decision trees. We put together the following retraining algorithm for trees (a sketch follows the list):

  1. A forest of 100-300 trees works in production, trained on the old data set.
  2. We delete M = 5 trees at the end and add 2M = 10 new ones, trained on the entire data set but with a higher weight on the new data, which naturally guarantees an incremental change in the model.
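In code, one update step might look roughly like this. Note that `train_trees` is a hypothetical helper that fits boosted trees on weighted data, and the weight factor for new data is an assumption.

```python
# A minimal sketch of the incremental forest update.
def update_forest(forest, old_data, new_data, m=5):
    forest = forest[:-m]                   # drop M trees at the end
    # Upweight the new data (the factor of 5 is an assumption).
    weights = [1.0] * len(old_data) + [5.0] * len(new_data)
    # train_trees is a hypothetical helper fitting n boosted trees
    # on the weighted union of old and new data.
    new_trees = train_trees(old_data + new_data, weights, n=2 * m)
    return forest + new_trees              # the forest keeps growing until KD prunes it
```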

Obviously, the number of trees grows greatly over time, and they must be periodically reduced to fit into the latency budget. To do this, we use the now ubiquitous Knowledge Distillation (KD). Briefly, its principle of operation (a minimal sketch follows the list):

  1. We have the current "complex" model. We run it on the training data set and get the probability distribution of classes at the output.
  2. Next, we teach the student model (in this case, the model with fewer trees) to repeat the results of the large model, using the class distribution as the target variable.
  3. It is important to note here that we do not use the dataset's labels in any way, and therefore we can use arbitrary data. Of course, we use a sample of data from the production stream as training data for the student model. Thus, the labeled training set ensures the accuracy of the model, and the stream sample guarantees similar performance on the production distribution, compensating for the bias of the training sample.
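A minimal, runnable sketch of this distillation step on synthetic data; sklearn models stand in for our production forests, and the sizes are illustrative.

```python
# A minimal sketch of Knowledge Distillation for trees: the student
# learns the teacher's class probabilities, so no human labels are
# needed and the data can be sampled straight from the production stream.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              GradientBoostingRegressor)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 20))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_stream = rng.normal(size=(2000, 20))     # unlabeled production sample

teacher = GradientBoostingClassifier(n_estimators=300).fit(X_train, y_train)

# Mix the labelled set with the stream sample to compensate for its bias.
X_kd = np.vstack([X_train, X_stream])
soft_targets = teacher.predict_proba(X_kd)[:, 1]   # teacher's distribution

# The student is several times smaller but repeats the teacher's scores.
student = GradientBoostingRegressor(n_estimators=60).fit(X_kd, soft_targets)
```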


The combination of these two techniques (adding trees and periodically reducing their number with the help of Knowledge Distillation) ensures the introduction of new patterns and complete continuity.

With KD, we also perform various operations on the model's features, such as removing features and handling gaps. In our case, we have a number of important statistical features (on senders, text hashes, URLs, etc.) that are stored in a database and tend to fail. The model, of course, is not prepared for such a turn of events, since failure situations do not occur in the training set. In such cases, we combine KD with augmentation: for part of the training data we remove or null out the relevant features, while taking the labels (the outputs of the current model) from the original data; the student model learns to repeat this distribution.
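A minimal sketch of the augmentation step, assuming the fragile statistical features sit in known columns; the fraction is illustrative.

```python
# A minimal sketch of KD combined with augmentation: for part of the
# sample we null the fragile statistical features, while the targets
# stay the current model's outputs on the *intact* rows, so the student
# learns to reproduce them even when the database lookups fail.
import numpy as np

def augment_for_gaps(X, fragile_cols, fraction=0.3, seed=0):
    rng = np.random.default_rng(seed)
    X_aug = X.copy()
    mask = rng.random(len(X)) < fraction     # rows that simulate a failure
    X_aug[np.ix_(mask, fragile_cols)] = 0.0  # zero out the statistical features
    return X_aug

# soft_targets = current_model.predict_proba(X)[:, 1]  # labels from intact X
# student.fit(augment_for_gaps(X, fragile_cols=[3, 7]), soft_targets)
```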


We noticed that the more serious the manipulation of the model, the greater the share of the stream sample required.

Feature removal, the simplest operation, requires only a small portion of the stream, since only a couple of features change and the current model was trained on almost the same set: the difference is minimal. Simplifying the model (reducing the number of trees severalfold) already requires a 50/50 mix. And for gaps in important statistical features that seriously affect the model's performance, even more of the stream is needed to even out the behavior of the new, gap-resistant model on all types of letters.


FastText

Let's move on to FastText. Let me remind you that a word's representation (embedding) is the sum of the embedding of the word itself and the embeddings of all its character N-grams, usually trigrams. Since there can be quite a lot of trigrams, Bucket Hashing is used, that is, the entire space is mapped into a fixed hashmap. As a result, the weight matrix has the dimensions of the inner layer by the number of words plus buckets.
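A minimal sketch of this representation; Python's built-in `hash` stands in for the FNV-1a hash that FastText actually uses, and the sizes are illustrative.

```python
# A minimal sketch of a FastText-style word representation with bucket
# hashing for character trigrams.
import numpy as np

VOCAB, BUCKETS, DIM = 100_000, 2_000_000, 100
E = np.zeros((VOCAB + BUCKETS, DIM))     # weight matrix: words + buckets

def trigrams(word):
    w = f"<{word}>"                      # FastText pads words with < and >
    return [w[i:i + 3] for i in range(len(w) - 2)]

def word_vector(word, word_id):
    # The embedding is the sum of the word's own row and the rows of
    # all its trigrams, hashed into a fixed number of buckets.
    ids = [word_id] + [VOCAB + hash(g) % BUCKETS for g in trigrams(word)]
    return E[ids].sum(axis=0)
```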

With retraining, new features appear: new words and trigrams. In Facebook's standard incremental training, nothing significant happens: only the old weights are retrained with cross-entropy on the new data. Thus, the new features are not used, and of course this approach has all the disadvantages described above related to the unpredictability of the model in production. Therefore, we slightly modified FastText. We add all the new weights (words and trigrams), train the entire matrix with cross-entropy, and add harmonic regularization by analogy with the linear model, which guarantees an insignificant change in the old weights.


CNN

Convolutional networks are somewhat more complicated. If only the last layers of a CNN are being retrained, then, of course, harmonic regularization can be applied and continuity guaranteed. But if the entire network needs to be retrained, then such regularization can no longer be hung on all the layers. However, there is an option: training compatible embeddings through Triplet Loss (original article).

Triplet Loss

Using the antiphishing task as an example, let's take a general look at Triplet Loss. We take our logo as the anchor, along with positive examples of it and negative examples, logos of other companies. We minimize the distance to the positive and maximize the distance to the negative, doing so with a small margin to make the classes more compact.
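The loss itself is simple; here is a minimal sketch using squared Euclidean distances, with an illustrative margin value.

```python
# A minimal sketch of Triplet Loss for the logo task: anchor = our logo,
# positive = another example of the same logo, negative = a different
# company's logo; `margin` enforces the gap between classes.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.sum((anchor - positive) ** 2)  # pull the same class together
    d_neg = np.sum((anchor - negative) ** 2)  # push other classes away
    return max(d_pos - d_neg + margin, 0.0)
```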


If we retrain the network, our metric space changes completely and becomes utterly incompatible with the previous one. This is a serious problem in tasks that use vectors. To get around it, we mix in old embeddings during training.

We add new data to the training set and train the second version of the model from scratch. At the second stage, we fine-tune our network: first the last layer is trained, and then the entire network is unfrozen. When compiling triplets, only part of the embeddings is computed with the model being trained; the rest use the old one. Thus, during retraining we ensure the compatibility of the metric spaces v1 and v2. A kind of harmonic regularization.
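A minimal sketch of the mixing step; `model_v2` and `model_v1_frozen` are hypothetical embedding callables, and the mixing probability is an assumption.

```python
# A minimal sketch of the compatibility trick: part of every triplet is
# embedded with the frozen v1 network, so minimizing the same Triplet
# Loss forces the v2 space to stay aligned with v1.
import numpy as np

def build_triplet(a_img, p_img, n_img, model_v2, model_v1_frozen,
                  rng=np.random.default_rng(0), p_old=0.5):
    anchor = model_v2(a_img)        # anchor always uses the network being trained
    # Positive/negative are sometimes embedded by the frozen v1 network.
    embed = model_v1_frozen if rng.random() < p_old else model_v2
    return anchor, embed(p_img), embed(n_img)
```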


Whole architecture

If we consider the entire system, using antispam as an example, the models are not isolated but nested in one another. We take images, text, and other features, and obtain embeddings with a CNN and FastText. Then classifiers are applied on top of the embeddings, producing scores for various classes (letter types, spam, the presence of a logo). The scores and features then go into the forest of trees, where the final decision is made. Separate classifiers in this scheme make it possible to better interpret the system's results and to retrain the components more precisely when problems arise, rather than feeding all the data raw into the decision trees.


As a result, we guarantee continuity at every level. At the bottom level, in the CNN and FastText, we use harmonic regularization; for the classifiers in the middle, we also use harmonic regularization plus score calibration for compatibility of the probability distributions. And the tree boosting is trained incrementally or with the help of Knowledge Distillation.

In general, maintaining such a nested machine learning system is usually painful, since any change to a component at the lower level forces an update of the entire system above it. But since in our setup each component changes only slightly and stays compatible with its previous version, the whole system can be updated piece by piece without retraining the entire structure, which allows us to maintain it without serious overhead.

Deploy

We have covered data collection and the retraining of different types of models, so let's move on to deploying them to production.

A/B testing

As I said earlier, during data collection we usually get a biased sample, from which it is impossible to evaluate the model's production performance. Therefore, when deploying, the model must be compared with the previous version to understand how things actually stand, that is, we need to run A/B tests. In fact, the process of rolling out and analyzing charts is quite routine and lends itself perfectly to automation. We gradually roll out our models to 5%, 30%, 50%, and 100% of users, collecting all available metrics from model responses and user feedback along the way. In the case of serious outliers, we automatically roll the model back; in other cases, having collected a sufficient number of user clicks, we decide to increase the percentage. We bring the new model to 50% of users fully automatically, and a human approves the rollout to the entire audience, although this step could be automated too.
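A minimal sketch of this rollout loop; all the hooks (`route_traffic`, `collect_metrics`, `is_outlier`, `rollback`, `notify_human`) are hypothetical names for an internal pipeline, not a real API.

```python
# A minimal sketch of the automated staged rollout.
STAGES = [0.05, 0.30, 0.50]            # share of users on the new model

def auto_rollout(model):
    for share in STAGES:
        route_traffic(model, share)               # hypothetical hook
        metrics = collect_metrics(model,          # waits until enough
                                  min_clicks=10_000)  # feedback arrives
        if is_outlier(metrics):                   # serious regression
            rollback(model)
            return False
    notify_human(model)                # the 100% stage needs manual approval
    return True
```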

However, the A/B testing process leaves room for optimization. The fact is that any A/B test takes quite a long time (in our case, from 6 to 24 hours depending on the amount of feedback), which makes it expensive given limited resources. In addition, a sufficiently high percentage of the test traffic is required to keep the overall A/B test reasonably short (gathering a statistically significant sample to evaluate metrics at a small percentage can take a very long time), which makes the number of A/B slots extremely limited. Obviously, we should bring only the most promising models into testing, and retraining gives us a great many of them.

To solve this problem, we trained a separate classifier that predicts the success of an A/B test. To do this, we take decision statistics, Precision, Recall, and other metrics on the training set, on the holdout set, and on the stream sample as features. We also compare the model with the one currently in production and with heuristics, and take into account the model's Complexity. Using all these features, a classifier trained on the history of tests scores the candidate models, in our case forests of trees, and decides which one to put into the A/B test.
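A minimal sketch of such a predictor; the feature layout and the data files are hypothetical placeholders based on the description above.

```python
# A minimal sketch of the A/B-success predictor: candidates are scored
# by a classifier trained on the history of past tests.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# One row per past candidate: [precision_train, recall_train,
# precision_holdout, recall_holdout, precision_stream, recall_stream,
# delta_vs_production, model_complexity] (hypothetical layout).
X_history = np.load("ab_history_features.npy")   # hypothetical data dump
y_history = np.load("ab_history_outcomes.npy")   # 1 = the test succeeded

ab_predictor = GradientBoostingClassifier().fit(X_history, y_history)

def pick_candidate(candidate_features):
    scores = ab_predictor.predict_proba(candidate_features)[:, 1]
    return int(np.argmax(scores))    # best candidate takes the A/B slot
```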


At the time of implementation, this approach allowed us to increase the number of successful A/B tests by several times.

Testing & Monitoring

Testing and monitoring, oddly enough, do not harm our health; on the contrary, they improve it and relieve unnecessary stress. Testing helps prevent failures, and monitoring helps detect them in time to reduce the impact on users.

It is important to understand that sooner or later your system will make mistakes: this is inherent to the development cycle of any software. At the beginning of a system's development there are always many bugs, until everything settles down and the main stage of innovation is complete. But over time entropy takes its toll, and errors appear again, due to the degradation of the surrounding components and changes in the data, which I spoke about at the beginning.

Here I would like to note that any machine learning system should be evaluated in terms of its profit over its entire life cycle. The graph below shows an example of a system that catches a rare type of spam (the line on the graph is near zero). One day, because of an incorrectly cached feature, it went haywire. Unfortunately, there was no monitoring for anomalous behavior, and as a result the system began dumping letters sitting near its decision boundary into the spam folder in large numbers. Despite fixing the consequences, the system had already made so many mistakes that it will not pay for itself in five years. And that is a complete failure in terms of the model's life cycle.


Therefore, something as simple as monitoring can become key in a model's life. In addition to standard and obvious metrics, we track the distribution of model responses and scores, as well as the distribution of key feature values. Using KL divergence, we can compare the current distribution with the historical one, or the values on the A/B test with the rest of the stream, which allows us to notice anomalies in the model and roll back changes in time.
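A minimal sketch of this check; the bin edges and the drift threshold are assumptions, and scores are assumed to lie in [0, 1].

```python
# A minimal sketch of distribution monitoring with KL divergence:
# compare today's score histogram with the historical one.
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    p, q = p + eps, q + eps            # smooth empty bins
    p, q = p / p.sum(), q / q.sum()    # normalize to distributions
    return float(np.sum(p * np.log(p / q)))

def scores_drifted(current_scores, historical_scores, threshold=0.1):
    bins = np.linspace(0.0, 1.0, 21)   # 20 bins over the score range
    p, _ = np.histogram(current_scores, bins=bins)
    q, _ = np.histogram(historical_scores, bins=bins)
    drifted = kl_divergence(p.astype(float), q.astype(float)) > threshold
    return drifted                     # True -> alert, consider a rollback
```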

In most cases, we launch the first versions of our systems using simple heuristics or models, which we then use for monitoring later on. For example, we monitor the NER model against regular expressions for specific online stores, and if the classifier's coverage sags compared to them, we investigate the reasons. Another useful application of heuristics!

Results

Let's go over the key points of the article.

  • Feedback. We always think about the user: how they will live with our mistakes and how they will be able to report them. Do not forget that users are not a source of pure feedback for training models; it must be cleaned with the help of auxiliary ML systems. If it is impossible to collect a signal from the user, look for alternative sources of feedback, for example, related systems.
  • Retraining. The main thing here is continuity, so we rely on the current production model. We train new models so that they do not differ much from the previous one, thanks to harmonic regularization and similar tricks.
  • Deploy. Automated deployment driven by metrics greatly reduces the time it takes to roll out models. Monitoring the statistics and the distribution of decisions, as well as the number of complaints from users, is a must for your restful sleep and productive weekends.

Well, I hope this read helps you improve your ML systems, bring them to market faster, and make them more reliable, all with less work stress.

Source: habr.com
