Algorithms that rank the importance of discrete attributes based on their entropy with a continuous class attribute. This function is a reimplementation of FSelector's information.gain, gain.ratio and symmetrical.uncertainty.
information_gain(
formula,
data,
x,
y,
type = c("infogain", "gainratio", "symuncert"),
equal = FALSE,
discIntegers = TRUE,
nbins = 5,
threads = 1
)
formula - An object of class formula with the model description.
data - A data.frame accompanying the formula.
x - A data.frame or sparse matrix with attributes.
y - A vector with the response variable.
type - Method name: "infogain", "gainratio" or "symuncert".
equal - A logical. Whether to discretize the dependent variable with equal frequency binning.
discIntegers - A logical. If TRUE (default), integers are treated as numeric vectors and are discretized. If FALSE, integers are treated as factors and are left as is.
nbins - Number of bins used for discretization. Only used if `equal = TRUE` and the response is numeric.
threads - Defunct. Number of threads for the parallel backend, now turned off for safety reasons.
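To make the equal frequency binning mentioned for a numeric response more concrete, here is a minimal base R sketch of the idea. `equal_freq_bins` is a hypothetical helper written for illustration; the discretization used inside the package may differ in its details (for example in how ties or bin edges are handled).

```r
# Hypothetical helper (not part of the package): split a numeric vector
# into `nbins` bins that each hold roughly the same number of observations.
equal_freq_bins <- function(v, nbins = 5) {
  # Bin edges are the empirical quantiles at 0, 1/nbins, ..., 1.
  breaks <- quantile(v, probs = seq(0, 1, length.out = nbins + 1),
                     names = FALSE)
  # unique() guards against duplicated edges when the data has many ties.
  cut(v, breaks = unique(breaks), include.lowest = TRUE)
}

set.seed(1)
response <- runif(100)          # a continuous response, for illustration
binned <- equal_freq_bins(response, nbins = 5)

# Each bin should contain about the same number of observations.
print(table(binned))
```

The resulting factor can then be treated like any other discrete class variable when entropies are computed.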
A data.frame with the following columns:
attributes - variable names.
importance - worth of the attributes.
type = "infogain"
is $$H(Class) + H(Attribute) - H(Class, Attribute)$$
type = "gainratio"
is $$\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute)}$$
type = "symuncert"
is $$2\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute) + H(Class)}$$
where H(X) is the Shannon entropy of a variable X and H(X, Y) is the joint Shannon entropy of the variables X and Y.
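The three formulas can be checked by hand on already-discrete data. The sketch below uses only base R; `entropy` and `manual_scores` are hypothetical helpers written for illustration, not functions exported by the package, and the use of natural logarithms is an assumption. It also skips the discretization step that the package applies to numeric attributes.

```r
# Shannon entropy (natural logarithm assumed) of one or more discrete vectors.
# Passing two vectors yields the joint entropy H(X, Y).
entropy <- function(...) {
  counts <- table(...)
  p <- counts[counts > 0] / sum(counts)   # drop empty cells: 0 * log(0) := 0
  -sum(p * log(p))
}

# Hypothetical helper: the three measures for a single discrete attribute.
manual_scores <- function(attribute, class) {
  h_class <- entropy(class)
  h_attr  <- entropy(attribute)
  gain    <- h_class + h_attr - entropy(class, attribute)
  c(infogain  = gain,
    gainratio = gain / h_attr,
    symuncert = 2 * gain / (h_attr + h_class))
}

# Toy data: the attribute determines the class exactly.
attribute <- c("a", "a", "b", "b")
class     <- c("yes", "yes", "no", "no")

scores <- manual_scores(attribute, class)
print(scores)
```

In this toy case all three entropies equal log(2), so the information gain is log(2) and both normalized measures equal 1, their maximum.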
irisX <- iris[-5]
y <- iris$Species
# data.frame interface
information_gain(x = irisX, y = y)
#> attributes importance
#> 1 Sepal.Length 0.4521286
#> 2 Sepal.Width 0.2672750
#> 3 Petal.Length 0.9402853
#> 4 Petal.Width 0.9554360
# formula interface
information_gain(formula = Species ~ ., data = iris)
#> attributes importance
#> 1 Sepal.Length 0.4521286
#> 2 Sepal.Width 0.2672750
#> 3 Petal.Length 0.9402853
#> 4 Petal.Width 0.9554360
information_gain(formula = Species ~ ., data = iris, type = "gainratio")
#> attributes importance
#> 1 Sepal.Length 0.4196464
#> 2 Sepal.Width 0.2472972
#> 3 Petal.Length 0.8584937
#> 4 Petal.Width 0.8713692
information_gain(formula = Species ~ ., data = iris, type = "symuncert")
#> attributes importance
#> 1 Sepal.Length 0.4155563
#> 2 Sepal.Width 0.2452743
#> 3 Petal.Length 0.8571872
#> 4 Petal.Width 0.8705214
# sparse matrix interface
library(Matrix)
i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
x <- sparseMatrix(i, j, x = x)
y <- c(1, 1, 1, 1, 2, 2, 2, 2)
information_gain(x = x, y = y)
#> attributes importance
#> 1 1 0
#> 2 2 0
#> 3 3 0
#> 4 4 0
#> 5 5 0
#> 6 6 0
#> 7 7 0
#> 8 8 0
#> 9 9 0
#> 10 10 0
information_gain(x = x, y = y, type = "gainratio")
#> attributes importance
#> 1 1 NaN
#> 2 2 NaN
#> 3 3 NaN
#> 4 4 NaN
#> 5 5 NaN
#> 6 6 NaN
#> 7 7 NaN
#> 8 8 NaN
#> 9 9 NaN
#> 10 10 NaN
information_gain(x = x, y = y, type = "symuncert")
#> attributes importance
#> 1 1 0
#> 2 2 0
#> 3 3 0
#> 4 4 0
#> 5 5 0
#> 6 6 0
#> 7 7 0
#> 8 8 0
#> 9 9 0
#> 10 10 0