Algorithms that rank discrete attributes by importance, based on their entropy with a continuous class attribute. This function is a reimplementation of FSelector's information.gain, gain.ratio and symmetrical.uncertainty.
information_gain(
  formula,
  data,
  x,
  y,
  type = c("infogain", "gainratio", "symuncert"),
  equal = FALSE,
  discIntegers = TRUE,
  nbins = 5,
  threads = 1
)
Argument | Description |
---|---|
formula | An object of class formula with model description. |
data | A data.frame accompanying formula. |
x | A data.frame or sparse matrix with attributes. |
y | A vector with response variable. |
type | Method name. |
equal | A logical. Whether to discretize the dependent variable with equal frequency binning discretization or not. |
discIntegers | A logical. If TRUE (default), integers are treated as numeric vectors and are discretized. If FALSE, integers are treated as factors and are left as is. |
nbins | Number of bins used for discretization. Only used if `equal = TRUE` and the response is numeric. |
threads | Defunct. Number of threads for the parallel backend; now turned off for safety reasons. |
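`nbins` sets how many bins a numeric response is cut into when `equal = TRUE`. As a rough illustration of quantile-based (equal-frequency) binning in base R, assuming this is the kind of discretization meant; the variable names are hypothetical and the sketch does not claim to mirror the package internals:

```r
set.seed(1)
y <- rnorm(100)   # hypothetical numeric response
nbins <- 5

# Cut points at the 0%, 20%, ..., 100% quantiles give bins of roughly equal size.
breaks <- quantile(y, probs = seq(0, 1, length.out = nbins + 1))

# include.lowest = TRUE keeps the minimum value inside the first bin.
y_binned <- cut(y, breaks = breaks, include.lowest = TRUE)

table(y_binned)
```

The resulting factor can then be treated like any other discrete class attribute.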
A data.frame with the following columns:
attributes - variable names.
importance - worth of the attributes.
type = "infogain"
is $$H(Class) + H(Attribute) - H(Class, Attribute)$$
type = "gainratio"
is $$\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute)}$$
type = "symuncert"
is $$2\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute) + H(Class)}$$
where H(X) is Shannon's entropy of a variable X and H(X, Y) is the joint Shannon's entropy of the variables X and Y.
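The three formulas can be checked by hand on a small contingency table. The following is a minimal base-R sketch, not the package's implementation: `entropy` and `scores` are hypothetical helpers, and the use of the natural logarithm is an assumption.

```r
# Shannon entropy (natural log, an assumption) of a vector of counts.
# Empty cells are dropped so that 0 * log(0) never occurs.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Hypothetical helper: all three scores for one discrete attribute
# and a discrete class.
scores <- function(attribute, class) {
  joint   <- table(attribute, class)     # contingency table
  h_attr  <- entropy(rowSums(joint))     # H(Attribute)
  h_class <- entropy(colSums(joint))     # H(Class)
  h_joint <- entropy(as.vector(joint))   # H(Class, Attribute)
  gain    <- h_class + h_attr - h_joint
  c(infogain  = gain,
    gainratio = gain / h_attr,
    symuncert = 2 * gain / (h_attr + h_class))
}

# Toy data: the attribute determines the class exactly.
attribute <- c("a", "a", "b", "b")
class     <- c("x", "x", "y", "y")
scores(attribute, class)
```

In this toy case all three entropies equal log(2), so the information gain is log(2) and both normalized scores equal 1.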
Zygmunt Zawadzki zygmunt@zstat.pl
irisX <- iris[-5]
y <- iris$Species

## data.frame interface
information_gain(x = irisX, y = y)
#>     attributes importance
#> 1 Sepal.Length  0.4521286
#> 2  Sepal.Width  0.2672750
#> 3 Petal.Length  0.9402853
#> 4  Petal.Width  0.9554360

# formula interface
information_gain(formula = Species ~ ., data = iris)
#>     attributes importance
#> 1 Sepal.Length  0.4521286
#> 2  Sepal.Width  0.2672750
#> 3 Petal.Length  0.9402853
#> 4  Petal.Width  0.9554360

information_gain(formula = Species ~ ., data = iris, type = "gainratio")
#>     attributes importance
#> 1 Sepal.Length  0.4196464
#> 2  Sepal.Width  0.2472972
#> 3 Petal.Length  0.8584937
#> 4  Petal.Width  0.8713692

information_gain(formula = Species ~ ., data = iris, type = "symuncert")
#>     attributes importance
#> 1 Sepal.Length  0.4155563
#> 2  Sepal.Width  0.2452743
#> 3 Petal.Length  0.8571872
#> 4  Petal.Width  0.8705214

# sparse matrix interface
library(Matrix)
i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
x <- sparseMatrix(i, j, x = x)
y <- c(1, 1, 1, 1, 2, 2, 2, 2)

information_gain(x = x, y = y)
#>    attributes importance
#> 1           1          0
#> 2           2          0
#> 3           3          0
#> 4           4          0
#> 5           5          0
#> 6           6          0
#> 7           7          0
#> 8           8          0
#> 9           9          0
#> 10         10          0

information_gain(x = x, y = y, type = "gainratio")
#>    attributes importance
#> 1           1        NaN
#> 2           2        NaN
#> 3           3        NaN
#> 4           4        NaN
#> 5           5        NaN
#> 6           6        NaN
#> 7           7        NaN
#> 8           8        NaN
#> 9           9        NaN
#> 10         10        NaN

information_gain(x = x, y = y, type = "symuncert")
#>    attributes importance
#> 1           1          0
#> 2           2          0
#> 3           3          0
#> 4           4          0
#> 5           5          0
#> 6           6          0
#> 7           7          0
#> 8           8          0
#> 9           9          0
#> 10         10          0
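The pattern in the sparse matrix results (0 for "infogain" and "symuncert", NaN for "gainratio") is what the formulas give when an attribute has zero entropy, for example if it collapses to a single level. That cause is an assumption, not something stated on this page; the arithmetic itself can be checked in base R, again assuming the natural logarithm:

```r
h_class <- log(2)    # entropy of a balanced two-class response
h_attr  <- 0         # assumed: the attribute has a single level
h_joint <- h_class   # a constant attribute adds nothing to the joint entropy

gain <- h_class + h_attr - h_joint
gain                            # infogain:  0
gain / h_attr                   # gainratio: 0 / 0 gives NaN
2 * gain / (h_attr + h_class)   # symuncert: 0
```

Only the gain ratio divides by H(Attribute) alone, which is why it is the only score that becomes undefined in this situation.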