The Magic of Ensemble Learning

Hey Habr! We invite Data Engineers and Machine Learning specialists to a free demo lesson, "Deploying ML models to production, using online recommendations as an example". We are also publishing an article by Luca Monno, Head of Financial Analytics at CDP SpA.

One of the simplest and most useful machine learning methods is ensemble learning (EL), the technique underlying XGBoost, bagging, random forests, and many other algorithms.

There are many great articles about it on Towards Data Science, but I chose two stories (first and second) that I liked the most. So why write another article about EL? Because I want to show you how it works with a simple example, one that made me realize there is no magic here.

When I first saw EL in action (working with a few very simple regression models), I couldn't believe my eyes, and I still remember the professor who taught me this method.

I had two different models (two weak learners) with out-of-sample R² scores of 0.90 and 0.93 respectively. Before looking at the result, I expected to get an R² somewhere between the two initial values. In other words, I thought that EL would give a model that performs no worse than the worst model, but no better than the best one.

To my great surprise, the results of a simple averaging of the predictions gave an R² of 0.95.

At first I started looking for an error, but then I thought that there might be some magic hidden here!

What is Ensemble Learning

With EL, you can combine the predictions of two or more models to get a more reliable and performant model. There are many methodologies for working with ensembles of models. Here I will touch on the two most useful ones to give you an idea.

With regression, you can average the predictions of the available models.

With classification, you can let each model vote for a label. The label chosen most often is the one the ensemble will return.
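
A minimal sketch of these two ideas (my own illustration in plain NumPy, not code from the original article):

```python
import numpy as np

# Regression: the ensemble prediction is the mean of the individual predictions.
pred_model_1 = np.array([10.2, 11.8, 9.5])
pred_model_2 = np.array([10.8, 12.4, 9.1])
ensemble_regression = (pred_model_1 + pred_model_2) / 2

# Classification: each model votes for a label, the majority wins.
votes = np.array([
    ["cat", "dog", "dog"],   # predictions of model 1
    ["cat", "dog", "cat"],   # predictions of model 2
    ["dog", "dog", "cat"],   # predictions of model 3
])
ensemble_labels = [
    max(set(column), key=list(column).count)   # most frequent label per sample
    for column in votes.T
]

print(ensemble_regression)   # [10.5 12.1  9.3]
print(ensemble_labels)       # ['cat', 'dog', 'cat']
```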

Why EL Works Better

The main reason EL works better is that every prediction carries an error (we know this from probability theory); combining two predictions can help reduce that error and thus improve performance metrics (RMSE, R², etc.).
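
A quick way to see why (a self-contained sketch, not from the original article): if two predictors have independent, zero-mean errors of the same size, the variance of the averaged error is Var((e₁+e₂)/2) = (Var(e₁)+Var(e₂))/4, i.e. half the variance of either error on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=10_000)

# Two predictors whose errors are independent and equally large.
pred_a = y_true + rng.normal(scale=1.0, size=y_true.shape)
pred_b = y_true + rng.normal(scale=1.0, size=y_true.shape)
pred_avg = (pred_a + pred_b) / 2

def rmse(pred):
    return np.sqrt(np.mean((pred - y_true) ** 2))

print(rmse(pred_a), rmse(pred_b), rmse(pred_avg))
# roughly 1.0, 1.0 and 0.71: the averaged error shrinks by about 1/sqrt(2)
```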

The following diagram shows how two weak algorithms work on a data set. The first algorithm has a steeper slope than necessary, while the second has a slope of almost zero (possibly due to excessive regularization). But together they show better results.

If you look at R², the first and second learning algorithms have values of -0.01 and 0.22 respectively, while for the ensemble it is 0.73.
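
The exact coefficients behind that figure are not given, but the effect is easy to reproduce with made-up numbers (a sketch of my own, not the author's code): one model with too steep a slope and one that is almost flat, averaged together, cancel most of each other's bias.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + rng.normal(scale=2.0, size=x.shape)   # "true" relationship: slope 2

# Two deliberately weak linear models (coefficients invented for illustration):
pred_steep = 3.5 * x - 5.0    # slope too large
pred_flat = 0.3 * x + 8.0     # slope near zero, as if over-regularized
pred_ensemble = (pred_steep + pred_flat) / 2

for name, pred in [("steep", pred_steep), ("flat", pred_flat), ("ensemble", pred_ensemble)]:
    print(name, round(r2_score(y, pred), 2))
# each weak model scores poorly on its own, while the ensemble's R² is far higher
```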

There are many reasons why an algorithm can be a poor model even in a basic example like this one: maybe you decided to use regularization to avoid overfitting, or decided not to remove some outliers, or maybe you used polynomial regression and picked the wrong degree (for example, a second-degree polynomial when the test data shows a clear asymmetry for which a third degree would be better suited).

When EL Works Best

Let's look at two learning algorithms that work on the same data.

In this case, combining the two models did not improve performance much. Initially, the R² values for the two learning algorithms were -0.37 and 0.22 respectively, and for the ensemble it turned out to be -0.04. That is, the EL model ended up with roughly the average of the two scores.

However, there is a big difference between these two examples: in the first example the models' errors were negatively correlated, while in the second they were positively correlated (the coefficients of the three models were not estimated from data; they were simply chosen by the author as an example).

Therefore, ensemble learning can be used to improve the bias/variance trade-off in any case, but when the models' errors are not positively correlated, EL can lead to noticeably better performance.
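
The same kind of toy simulation as above makes the distinction visible (again a sketch with assumed numbers): give both models a large shared error component and averaging buys almost nothing, because the shared part cannot cancel out.

```python
import numpy as np

rng = np.random.default_rng(2)
y_true = rng.normal(size=10_000)
shared = rng.normal(size=y_true.shape)   # error component both models share

# Positively correlated errors: most of each model's error is the shared part.
pred_a = y_true + shared + 0.3 * rng.normal(size=y_true.shape)
pred_b = y_true + shared + 0.3 * rng.normal(size=y_true.shape)
pred_avg = (pred_a + pred_b) / 2

def rmse(pred):
    return np.sqrt(np.mean((pred - y_true) ** 2))

print(rmse(pred_a), rmse(pred_b), rmse(pred_avg))
# all three values are close: averaging removes only the small independent part
```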

Homogeneous and heterogeneous models

Very often EL is used on homogeneous models (as in this example, or in a random forest), but you can in fact combine different models (linear regression + neural network + XGBoost) with different sets of explanatory variables. This is likely to lead to uncorrelated errors and improved performance.
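
A hedged sketch of such a heterogeneous ensemble with scikit-learn's VotingRegressor, which simply averages the base models' predictions (I use GradientBoostingRegressor in place of XGBoost to keep the example dependency-free; the models and synthetic data are illustrative assumptions, not the author's setup):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Heterogeneous base models: linear regression, a small neural network, boosted trees.
base_models = [
    ("linear", LinearRegression()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
    ("gbdt", GradientBoostingRegressor(random_state=0)),
]
ensemble = VotingRegressor(base_models)   # averages the three predictions

for name, model in base_models + [("ensemble", ensemble)]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>8}: R² = {score:.2f}")
```

Whether the ensemble actually beats the strongest base model depends, as above, on how correlated the base models' errors are on your data.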

Comparison with portfolio diversification

EL works in a similar way to diversification in portfolio theory, only for us it works out even better.

When you diversify, you try to reduce the variance of your returns by investing in uncorrelated stocks. A well-diversified portfolio of stocks will perform better than the worst single stock, but never better than the best one.

Quoting Warren Buffett: 

"Diversification is a defense against ignorance, for someone who does not know what he is doing, it [diversification] makes very little sense."

In machine learning, EL helps reduce the variance of your model, but this can result in a model with better overall performance than the best initial model.

To summarize

Combining multiple models into one is a relatively simple technique that can help address the bias-variance problem and improve performance.

If you have two or more models that work well, don't choose between them: use them all (but with caution)!

Are you interested in developing in this direction? Sign up for the free demo lesson "Deploying ML models to production, using online recommendations as an example" and take part in an online meeting with Andrey Kuznetsov, Machine Learning Engineer at Mail.ru Group.

Source: habr.com
