Feature Selection in Machine Learning

Hey Habr!

We at Reksoft have translated the article Feature Selection in Machine Learning into Russian. We hope it will be useful to everyone interested in the topic.

In the real world, data isn't always as clean as business customers sometimes assume. That is why data mining and data wrangling are in demand: they help identify missing values and patterns in structured data that a human cannot spot. And to find those patterns and use them to predict outcomes from the discovered relationships in the data, machine learning comes in handy.

To understand any algorithm, it is necessary to look at all the variables in the data and figure out what they represent. This is critical, because the rationale behind the results rests on an understanding of the data. If the data contains 5 or even 50 variables, you can examine them all. But what if there are 200? Then there simply isn't enough time to study every single variable. Moreover, some algorithms do not work with categorical data, so all categorical columns have to be converted into scale (numeric) variables (they may look like scale variables, but the metrics will show that they are categorical) in order to add them to the model. Thus the number of variables grows, say to around 500. What now? One might think the answer is dimensionality reduction. Dimensionality reduction algorithms do reduce the number of parameters, but they hurt interpretability. What if there are other techniques that eliminate features while keeping the remaining ones easy to understand and interpret?

Depending on whether the analysis is based on regression or classification, feature selection algorithms may differ, but the main idea behind them remains the same.

Strongly correlated variables

Variables that are highly correlated with each other give the model the same information, so it is not necessary to use all of them in the analysis. For example, if the dataset contains the features "Online time" and "Used traffic", we can expect them to be correlated to some extent, and we will see a strong correlation even on an unbiased data sample. In that case, only one of these variables is needed in the model. Using both will overfit the model and bias it toward one particular feature.
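Below is a minimal sketch of this idea with pandas: one variable from each highly correlated pair is dropped. The synthetic column names and the 0.9 correlation threshold are illustrative assumptions, not taken from the article.

```python
# Drop one variable from each highly correlated pair (illustrative sketch).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
online_time = rng.normal(size=200)
df = pd.DataFrame({
    "online_time": online_time,
    "used_traffic": online_time * 0.8 + rng.normal(scale=0.1, size=200),  # nearly a copy
    "age": rng.integers(18, 70, size=200),
})

corr = df.corr().abs()                                   # absolute pairwise correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
print("Kept:", list(df_reduced.columns))
```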

P-values

In algorithms such as linear regression, an initial statistical model is always a good idea. It helps show the importance of features through the p-values obtained from that model. Having set a significance level, we check the resulting p-values; if a p-value is below the chosen significance level, the feature is declared significant, meaning that a change in its value will likely lead to a change in the value of the target.
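Here is a minimal sketch of this check using statsmodels, which reports a p-value for each coefficient of an OLS model. The synthetic data and the 0.05 significance level are assumptions for illustration.

```python
# Check feature p-values from an initial OLS model (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))                              # three candidate features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)   # the third feature is pure noise

X_const = sm.add_constant(X)                               # add an intercept term
model = sm.OLS(y, X_const).fit()

alpha = 0.05
print(model.pvalues)                                       # one p-value per coefficient
significant = model.pvalues[1:] < alpha                    # skip the intercept
print("Significant feature indices:", np.where(significant)[0])
```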

Forward Selection

Forward selection is a stepwise regression technique. Model building starts from scratch, that is, from an empty model, and each iteration adds the variable that improves the model being built. Which variable is added is determined by its significance, which can be computed with various metrics; the most common approach is to use the p-values from an initial statistical model built with all variables. Sometimes forward selection can lead to overfitting, because the model may end up containing highly correlated variables, even though they provide the same information to the model (while the model still shows an improvement).
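As an illustration, here is a rough sketch of forward selection with scikit-learn's SequentialFeatureSelector. Note that it adds features by cross-validated score rather than by p-values; the estimator, cv and n_features_to_select values are illustrative assumptions.

```python
# Forward selection: start from an empty model, add one feature per step (sketch).
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,     # how many features to keep (an assumption)
    direction="forward",        # grow the model from an empty feature set
    cv=5,                       # features are judged by cross-validated score
)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))
```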

Backward Selection

Backward selection also eliminates features step by step, but in the opposite direction to forward selection. In this case, the initial model includes all independent variables, which are then eliminated (one per iteration) if they do not add value to the new regression model at that iteration. Feature exclusion is based on the p-values of the initial model. This method also carries some uncertainty when removing highly correlated variables.
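A rough sketch of p-value-based backward elimination with statsmodels is shown below; the 0.05 threshold and the synthetic data are assumptions for illustration.

```python
# Backward elimination: drop the least significant feature until all pass the threshold.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
y = 1.5 * X["x0"] - 2.0 * X["x3"] + rng.normal(size=300)   # only x0 and x3 matter

features = list(X.columns)
while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvals = model.pvalues.drop("const")        # p-value per remaining feature
    worst = pvals.idxmax()                     # least significant feature
    if pvals[worst] < 0.05:                    # everything left is significant: stop
        break
    features.remove(worst)                     # otherwise drop it and refit

print("Remaining features:", features)
```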

Recursive feature elimination

RFE is a widely used technique/algorithm for selecting an exact number of significant features. Sometimes it is used to explain a number of the "most important" features that affect the results; sometimes it is used to reduce a very large number of variables (around 200-400) to only those that make at least some contribution to the model, excluding all the rest. RFE uses a ranking system: the features in the dataset are ranked, and these ranks are then used to recursively eliminate features, depending on the collinearity between them and their significance in the model. Besides ranking features, RFE can show whether a given set of features is actually important for the chosen number of features (because it is quite possible that the selected number of features is not optimal, and the optimal number may be either larger or smaller).
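For illustration, here is a minimal sketch using scikit-learn's RFE (fixed number of features) together with RFECV, which lets cross-validation choose the number of features; the estimator and the number of features to select are assumptions.

```python
# Recursive feature elimination with a fixed count (RFE) and with CV-chosen count (RFECV).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
estimator = DecisionTreeClassifier(random_state=0)   # any estimator exposing importances works

rfe = RFE(estimator, n_features_to_select=10).fit(X, y)
print("Feature ranking (1 = selected):", rfe.ranking_)
print("Selected:", list(X.columns[rfe.support_]))

rfecv = RFECV(estimator, cv=5).fit(X, y)             # cross-validation picks the feature count
print("Optimal number of features:", rfecv.n_features_)
```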

Feature Importance Chart

When people talk about the interpretability of machine learning algorithms, they usually discuss linear regressions (which allow you to analyse feature significance via p-values) and decision trees (which literally show feature importance in the form of a tree, along with its hierarchy). On the other hand, for algorithms such as Random Forest, LightGBM and XGBoost, a feature importance chart is often used: a plot of the variables and their "amount of importance". This is especially useful when you need to provide a structured justification of feature importance in terms of business impact.
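A minimal sketch of such a chart with a Random Forest is shown below; the dataset and plotting details are illustrative assumptions, and LightGBM and XGBoost expose analogous importance attributes.

```python
# Feature importance chart from a Random Forest (illustrative sketch).
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 10))   # horizontal bar chart
plt.title("Feature importances (Random Forest)")
plt.tight_layout()
plt.show()
```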

Regularization

Regularization is done to control the trade-off between bias and variance. Bias shows how much the model underfits the training dataset, while variance shows how much the predictions differ between the training and test datasets. Ideally, both bias and variance should be small. This is where regularization comes to the rescue! There are two main techniques:

L1 Regularization - Lasso: Lasso penalizes the model's weights, changing their importance for the model, and may even zero some of them out (i.e. remove those variables from the final model). Lasso is typically used when the dataset contains a large number of variables and you want to exclude some of them in order to better understand how the important features affect the model (i.e. the features that Lasso selected and assigned importance to).

L2 Regularization - Ridge: Ridge's job is to keep all variables and at the same time assign them importance based on their contribution to the model's performance. Ridge is a good choice when the dataset contains a small number of variables and all of them are needed to interpret the findings and results.

Since Ridge keeps all the variables and Lasso does a better job of assigning them importance, an algorithm that combines the best features of both regularizations was developed, known as Elastic-Net.
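For illustration, here is a minimal sketch comparing Lasso, Ridge and Elastic-Net coefficients with scikit-learn; the alpha values are arbitrary assumptions, not tuned.

```python
# Compare Lasso, Ridge and Elastic-Net coefficients on the same data (sketch).
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)        # regularization is scale-sensitive

coefs = pd.DataFrame({
    "lasso": Lasso(alpha=1.0).fit(X_scaled, y).coef_,                     # some weights become 0
    "ridge": Ridge(alpha=1.0).fit(X_scaled, y).coef_,                     # all weights kept, shrunk
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_scaled, y).coef_,  # mix of both
}, index=X.columns)
print(coefs)
```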

There are many more ways to select features for machine learning, but the main idea is always the same: demonstrate the importance of the variables and then eliminate some of them based on that importance. Importance is a very subjective term, since it is not a single metric but a whole set of metrics and charts that can be used to find key features.

Thank you for reading! Happy learning!

Source: habr.com
