The correct selection of features for data analysis allows you to:
improve the quality of supervised and unsupervised machine learning models,
reduce training time and the required computing power,
and, for high-dimensional input data, mitigate the “curse of dimensionality”.
Estimating feature importance is also necessary for interpreting the model's results.
We will look at existing feature selection methods for supervised and unsupervised learning problems. Each method is illustrated with an open-source Python implementation so that you can quickly test the proposed algorithms. This is not an exhaustive selection, however: many algorithms have been created over the past 20 years, and here you will find only the most basic of them. For a deeper study, check out this review.
Supervised and unsupervised models
There are supervised selection algorithms that determine the appropriate features for the best performance on supervised learning tasks (for example, classification and regression problems). These algorithms require access to labeled data. For unlabeled data, there are also a number of feature selection methods that evaluate all features based on various criteria: variance, entropy, the ability to preserve local similarity, etc. Relevant features discovered with unsupervised heuristic methods can also be used in supervised models, because they can detect patterns other than the correlation of features with the target variable.
Feature selection methods are usually divided into four categories: filters, wrappers, embedded, and hybrid methods.
Wrappers
In this approach, we evaluate the effectiveness of a subset of features based on the final performance of the applied learning algorithm (for example, the gain in accuracy when solving a classification problem). Any learning algorithm can be used in this combination of search strategy and modeling.
Existing selection strategies:
Forward selection: start with an empty set of features, then iteratively add the feature that gives the best improvement in model quality.
Backward selection: start with the full set of features, then remove the “worst” feature at each iteration.
Implementation: these algorithms are implemented in the mlxtend package; here is an example of use.
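For instance, a minimal sketch of forward selection with mlxtend's SequentialFeatureSelector (the dataset, estimator, and number of features below are arbitrary illustrative choices; set forward=False for backward selection):

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Forward selection: greedily add the feature that best improves CV accuracy
sfs = SFS(KNeighborsClassifier(n_neighbors=3),
          k_features=5,          # stop after selecting 5 features
          forward=True,          # set forward=False for backward selection
          floating=False,
          scoring='accuracy',
          cv=5)
sfs = sfs.fit(X, y)
print("Selected feature indices:", sfs.k_feature_idx_)
print("Cross-validation score:", sfs.k_score_)
```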
RFE (Recursive feature elimination): a greedy search algorithm that selects features by recursively considering smaller and smaller feature sets. It ranks the features according to the order in which they are eliminated.
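A minimal RFE sketch with scikit-learn, assuming an arbitrary linear estimator and target number of features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the weakest feature until only 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print("Selected features (mask):", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```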
Embedded methods
This group includes algorithms that train the model and select features at the same time. This is usually implemented with an l1 regularizer (a sparsity regularizer) or a constraint that restricts some of the features.
SMLR (Sparse Multinomial Logistic Regression): this algorithm implements l1 regularization via ARD (Automatic Relevance Determination) on top of classic multinomial logistic regression. The regularization estimates the importance of each feature and zeroes out those that are useless for prediction.
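SMLR itself is not part of scikit-learn; as a rough stand-in that shows the same sparsity effect, here is a sketch of plain l1-penalized multinomial logistic regression (the dataset and regularization strength C are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The l1 penalty drives the coefficients of uninformative features to exactly zero
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=10000)
clf.fit(X, y)
print(clf.coef_)  # zero entries mark features the model found useless for prediction
```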
ARD (Automatic Relevance Determination Regression): the model uses Bayesian ridge regression. It shrinks the coefficient weights toward zero more strongly than, for example, ordinary least squares.
ARD zeroes out the weights of some features, thereby helping to identify the relevant dimensions.
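A minimal ARD sketch with scikit-learn's ARDRegression on synthetic data (the data-generating setup is hypothetical, chosen so that only the first three features are informative):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only the first three features actually influence the target
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(200)

ard = ARDRegression()
ard.fit(X, y)
print(np.round(ard.coef_, 3))  # weights of the irrelevant features shrink toward zero
```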
Other examples of regularization algorithms: Lasso (l1 regularization), ridge regression (l2 regularization), and Elastic Net (both l1 and l2 regularization). If you plot these methods, you can see that Lasso regression constrains the coefficients to a diamond-shaped region, ridge regression delineates a circle, and Elastic Net occupies an intermediate position.
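A quick comparison sketch: with the same (arbitrary) regularization strength, Lasso and Elastic Net zero out some coefficients, while ridge regression only shrinks them:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

for model in (Lasso(alpha=1.0), Ridge(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, "- zeroed coefficients:", int(np.sum(model.coef_ == 0)))
```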
A comprehensive description of these algorithms is provided here.
Filters
In this approach, we evaluate the importance of features based only on their intrinsic characteristics, without involving learning algorithms. These methods are faster and require fewer computational resources than wrapper methods. If there is not enough data to model the statistical correlation between features, filters can produce worse results than wrappers. Unlike wrappers, these methods are less prone to overfitting. They are widely used for high-dimensional data, where wrapper methods are too computationally expensive.
Supervised methods
Relief: this method randomly samples instances from the dataset and updates the significance of each feature based on the difference between the selected instance and the two nearest objects of the same and of the opposite class. If a feature's value differs between the two nearest neighbors of the same class, its importance decreases; if, on the contrary, the feature's value differs between objects of different classes, its importance increases accordingly.
In other words, the weight of a feature decreases if its value differs more between nearest objects of the same class than between nearest objects of different classes; otherwise the weight increases.
The advanced ReliefF algorithm uses feature weighting and searches for more nearest neighbors.
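A minimal ReliefF sketch using the skrebate (scikit-rebate) package; the dataset and the number of neighbors are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from skrebate import ReliefF

X, y = load_breast_cancer(return_X_y=True)

# ReliefF scores each feature by how well it separates neighbors of different classes
relief = ReliefF(n_features_to_select=10, n_neighbors=100)
relief.fit(X, y)
top = np.argsort(relief.feature_importances_)[::-1][:10]
print("Top 10 features by ReliefF importance:", top)
```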
Fisher score: commonly used in binary classification problems. The Fisher ratio (FiR) of a feature is defined as the distance between its mean values in the two classes, divided by their variances:

$$FiR_i = \frac{(\mu_{i,1} - \mu_{i,2})^2}{\sigma_{i,1}^2 + \sigma_{i,2}^2}$$

where $\mu_{i,1}, \mu_{i,2}$ and $\sigma_{i,1}^2, \sigma_{i,2}^2$ are the means and variances of feature $i$ in the two classes.
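A minimal NumPy sketch of this ratio on a binary dataset (the dataset is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X0, X1 = X[y == 0], X[y == 1]

# Squared distance between class means divided by the sum of class variances
fir = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 / (X0.var(axis=0) + X1.var(axis=0))
print("Top features by Fisher ratio:", np.argsort(fir)[::-1][:10])
```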
Chi-squared score: checks whether there is a significant difference between the observed and expected frequencies of two categorical variables. Thus, it tests the null hypothesis that there is no relationship between the two variables.
The chi-square test of independence.
To correctly apply the chi-square test to check the relationship between various features in the dataset and the target variable, the following conditions must be met: the variables must be categorical and independent, and the expected frequency must be greater than 5. The last condition guarantees that the CDF (cumulative distribution function) of the test statistic can be approximated by the chi-square distribution. Learn more here.
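A minimal sketch with scikit-learn's chi2 filter; it assumes non-negative features (counts or suitably encoded categories), and the dataset and k are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # iris features are non-negative

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
```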
CFS (Correlation-based feature selection): The rationale for this method can be formulated as follows:
Features are relevant if their values vary systematically depending on membership in a particular category.
Thus, a good subset of features contains features that are highly correlated with the target variable while not being correlated with each other. The merit of a subset $S_k$ of $k$ features is calculated as follows:

$$Merit_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

Here $\overline{r_{cf}}$ is the average correlation between a feature and the class, and $\overline{r_{ff}}$ is the average correlation between features. The CFS criterion is then defined as follows:

$$CFS = \max_{S_k}\left[\frac{r_{cf_1} + r_{cf_2} + \dots + r_{cf_k}}{\sqrt{k + 2\,(r_{f_1 f_2} + \dots + r_{f_i f_j} + \dots + r_{f_k f_1})}}\right]$$
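A minimal NumPy sketch of the merit score for a given feature subset, using Pearson correlation as the (assumed) correlation measure; the dataset and subset are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def cfs_merit(X, y, subset):
    """Merit of a feature subset: high feature-class, low feature-feature correlation."""
    k = len(subset)
    # average absolute feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # average absolute feature-feature correlation over all pairs in the subset
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs]) if pairs else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

print("Merit of subset {0, 1, 2}:", round(cfs_merit(X, y, [0, 1, 2]), 3))
```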
FCBF (Fast correlation-based filter): this method is faster and more efficient than ReliefF and CFS, and is therefore more commonly used for high-dimensional input data. It is essentially a typical relevance-and-redundancy approach: first, the Symmetrical Uncertainty (twice the mutual information I(X, Y) between X and Y, divided by the sum of their entropies) is computed for all features, then the features are sorted by this criterion, and redundant ones are removed.
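A minimal sketch of the Symmetrical Uncertainty ranking step (not the full FCBF redundancy removal), assuming features discretized into bins; the dataset and the number of bins are arbitrary choices:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.datasets import load_iris
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def symmetrical_uncertainty(x, y):
    # SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), all measured in nats
    mi = mutual_info_score(x, y)
    return 2.0 * mi / (entropy(np.bincount(x)) + entropy(np.bincount(y)))

X, y = load_iris(return_X_y=True)
X_disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit_transform(X).astype(int)

su = [symmetrical_uncertainty(X_disc[:, j], y) for j in range(X_disc.shape[1])]
print("Symmetrical Uncertainty per feature:", np.round(su, 3))
```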
Unsupervised methods
Variance: it has been shown that estimating feature variance can be an effective way to select features. As a rule, features with near-zero variance are not significant and can be removed.
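A minimal sketch with scikit-learn's VarianceThreshold on a tiny hypothetical matrix whose first column is constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 0.1],
              [0.0, 1.5, 0.2],
              [0.0, 0.5, 0.1],
              [0.0, 1.0, 0.3]])  # the first column has zero variance

selector = VarianceThreshold(threshold=0.0)  # drop features with zero variance
X_reduced = selector.fit_transform(X)
print("Kept feature indices:", selector.get_support(indices=True))
```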
Mean absolute difference: compute the mean absolute difference between the values of a feature and its mean value (implementation).
Features with higher values tend to have higher discriminative power.
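A minimal NumPy sketch of this score (the dataset is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)

# Mean absolute deviation of each feature from its mean
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
print("Top features by mean absolute difference:", np.argsort(mad)[::-1][:10])
```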
Dispersion ratio: the arithmetic mean $AM_i$ of a (positive) feature divided by its geometric mean $GM_i$. Since $AM_i \ge GM_i$, with equality holding if and only if all the feature's values are equal ($X_{i1} = X_{i2} = \dots = X_{in}$), the ratio $R_i = AM_i / GM_i$ can be used as a dispersion measure: higher values correspond to more relevant features (implementation).
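A minimal NumPy sketch of the dispersion ratio, assuming strictly positive feature values (true for the iris dataset used here):

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # all iris feature values are strictly positive

am = X.mean(axis=0)                  # arithmetic mean per feature
gm = np.exp(np.log(X).mean(axis=0))  # geometric mean per feature
print("Dispersion ratio AM/GM per feature:", np.round(am / gm, 4))
```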
Laplacian Score: it is based on the observation that data points from the same class are often located close to each other, so we can evaluate the importance of a feature by its ability to preserve this locality. The method embeds the data into a nearest-neighbors graph using an arbitrary distance measure and then computes a weight matrix. For each feature, the Laplacian score is then calculated; the smallest values correspond to the most important dimensions. In practice, however, when selecting a subset of features, a different clustering algorithm (such as k-means) is usually applied to choose the most effective group.
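A minimal NumPy sketch of the Laplacian score, assuming a heat-kernel-weighted k-nearest-neighbors graph (k and the kernel bandwidth are arbitrary illustrative choices); smaller scores indicate more important features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X, _ = load_iris(return_X_y=True)
n, d = X.shape

# k-NN graph with heat-kernel weights
dist = kneighbors_graph(X, n_neighbors=5, mode="distance", include_self=False).toarray()
t = np.mean(dist[dist > 0]) ** 2                      # heuristic kernel bandwidth
W = np.where(dist > 0, np.exp(-dist ** 2 / t), 0.0)
W = np.maximum(W, W.T)                                # symmetrize the graph

D = np.diag(W.sum(axis=1))                            # degree matrix
L = D - W                                             # graph Laplacian
ones = np.ones(n)

scores = np.empty(d)
for r in range(d):
    f = X[:, r]
    f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones  # remove the trivial component
    scores[r] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)

print("Laplacian scores (smaller = more important):", np.round(scores, 4))
```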
Laplacian score combined with distance-based entropy: this algorithm builds on the Laplacian score, replacing k-means clustering with an entropy measure, and demonstrates higher stability on high-dimensional datasets (implementation).
MCFS (Multi-Cluster Feature Selection): spectral analysis is performed to measure the correlation between different features. The eigenvectors of the graph Laplacian are used to cluster the data and evaluate the features; their computation is described in this paper.
The LFSBSS (localized feature selection), weighted k-means, SPEC, and Apriori algorithms are reviewed here and implemented in this package.
Hybrid methods
Another way to implement feature selection is to combine filters and wrappers into a two-phase process: the features are first filtered by their statistical properties, and then wrapper methods are applied to the reduced set.
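A minimal sketch of such a two-phase scheme with scikit-learn: a fast statistical filter (the ANOVA F-score) first prunes the feature set, and then a wrapper (RFE) refines it; the dataset, estimator, and feature counts are arbitrary illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=15)),                          # phase 1: filter
    ("wrapper", RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)),  # phase 2: wrapper
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X, y)
print("Features kept after both phases:", int(pipeline["wrapper"].support_.sum()))
```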
Other sources
A great deal of literature has been devoted to the feature selection problem, and here we have only scratched the surface of the entire body of research.
Many other feature selection algorithms that I did not mention here are implemented in the scikit-feature package.
Relevant features can also be identified using PLS (partial least squares), as described in this article, or using linear dimensionality reduction methods, as shown here.