FSelectorRcpp is an Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with a sparse matrix support. It is also equipped with a parallel backend.
install.packages('FSelectorRcpp') # stable release version on CRAN
devtools::install_github('mi2-warsaw/FSelectorRcpp') # dev version
# windows users should have Rtools for devtools installation
# https://cran.r-project.org/bin/windows/Rtools/
In the modern statistical learning the biggest bottlenecks are computation times of model training procedures and the overfitting. Both are caused by the same issue - the high dimension of explanatory variables space. Researchers have encountered problems with too big sets of features used in machine learning algorithms also in terms of model interpretaion. This motivates applying feature selection algorithms before performing statisical modeling, so that on smaller set of attributes the training time will be shorter, the interpretation might be clearer and the noise from non important features can be avoided. More motivation can be found in (John, Kohavi, and Pfleger 1994).
Many methods were developed to reduce the curse of dimensionality like Principal Component Analysis (Pearson 1901) or Singular Value Decomposition (Eckart and Young 1936) which approximates the variables by smaller number of combinations of original variables, but this approach is hard to interpret in the final model.
Sophisticated methods of attribute selection as Boruta algoritm (Kursa and Rudnicki 2010), genetic algorithms Aziz et al. (2013) or simulated annealing techniques (Khachaturyan, Semenovsovskaya, and Vainshtein 1981) are known and broadly used but in some cases for those algorithms computations can take even days, not to mention that datasets are growing every hour.
Few classification and regression models can reduce redundand variables during the training phase of statistical learning process, e.g. Decision Trees Breiman et al. (1984), LASSO Regularized Generalized Linear Models (with cross-validation) (Friedman, Hastie, and Tibshirani 2010) or Regularized Support Vector Machine (Xu, Caramanis, and Mannor 2009), but still computations starting with full set of explanatory variables are time consuming and the understaning of the feature selection procedure in this case is not simple and those methods are sometimes used without the understanding.
In business applications there appear a need to provide a fast feature selection that is extremely easy to understand. For such demands easy methods are prefered. This motivates using simple techniques like Entropy Based Feature Selection (Largeron, Moulin, and Géry 2011), where every feature can be checked independently so that computations can be performed in a parallel to shorter the procedure’s time. For this approach we provide an R interface to Rcpp reimplementation (Dirk Eddelbuettel 2011) of methods included in FSelector package which we also extended with parallel background and sparse matrix support. This has significant impact on computations time and can be used on greater datasets, comparing to FSelector. Additionally we avoided the Weka (Hall et al. 2009) dependency and we provided faster discretization implementations than those from entropy package, used originally in FSelector.
A simple entropy based feature selection workflow. Information gain is an easy, linear algorithm that computes the entropy of a dependent and explanatory variables, and the conditional entropy of a dependent variable with a respect to each explanatory variable separately. This simple statistic enables to calculate the belief of the distribution of a dependent variable when we only know the distribution of a explanatory variable.
information_gain( # Calculate the score for each attribute
formula = Species ~ ., # that is on the right side of the formula.
data = iris, # Attributes must exist in the passed data.
type = "infogain" # Choose the type of a score to be calculated.
) %>%
cut_attrs( # Then take attributes with the highest rank.
k = 2 # For example: 2 attrs with the higehst rank.
) %>%
to_formula( # Create a new formula object with
attrs = ., # the most influencial attrs.
class = "Species"
) %>%
glm(
formula = ., # Use that formula in any classification algorithm.
data = iris,
family = "binomial"
)
Call: glm(formula = ., family = "binomial", data = iris)
Coefficients:
(Intercept) Petal.Width Petal.Length
-69.45 33.89 17.60
Degrees of Freedom: 149 Total (i.e. Null); 147 Residual
Null Deviance: 191
Residual Deviance: 5.17e-09 AIC: 6
Apply a complicated feature selection search engine, that checks
combinations of subsets of the specified attributes’ set, to score the
currently considered subset, depending on the criterion passed with the
fun
parameter.
evaluator_R2_lm <- # Create a scorer function.
function(
attributes, # That takes the currently considered subset of attributes
data, # from a specified dataset.
dependent =
names(data)[1] # To find features that best describe the dependent variable.
) {
summary( # In this situation we take the r.squared statistic
lm( # from the summary of a linear model object.
to_formula( # This is the score to use to choose between considered
attributes, # subsets of attributes.
dependent
),
data = data)
)$r.squared
}
feature_search( # feature_search work in 2 modes - 'exhaustive' and 'greedy'
attributes =
names(iris)[-1], # It takes attribues and creates combinations of it's subsets.
fun = evaluator_R2_lm, # And it calculates the score of a subset that depends on the
data = iris, # evaluator function passed in the `fun` parameter.
mode = "exhaustive", # exhaustive - means to check all possible
sizes = # attributes' subset combinations
1:length(attributes) # of sizes passed in sizes.
)$all
Sepal.Width Petal.Length Petal.Width Species values
1 1 0 0 0 0.01382265
2 0 1 0 0 0.7599546
3 0 0 1 0 0.6690277
4 0 0 0 1 0.6187057