---
title: "Integer variables"
author: "Zygmunt Zawadzki"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Integer variables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `information_gain` and `discretize` functions are the two most important functions in the `FSelectorRcpp` package. Usually they are easy to use, and the user does not need to worry about anything. However, there is one case where user intervention may be required.

By default, `information_gain` and `discretize` discretize numeric and integer columns and leave the other ones as is. In the example below the `x` column is discretized:

```{r}
library(FSelectorRcpp)

data <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
  z = c("x", "x", "y", "y", "y", "x"),
  stringsAsFactors = FALSE)

discretize(data, y ~ .)
```

So far, so good. However, there is a problem when a column is of type integer. Integers are often used to encode the levels of a factor, e.g., the number of a day in a week (0-6, 1-7), an id from some mapping (e.g., `1 - Poland, 2 - USA, 3 - Germany`), and probably many others (gender and so on), possibly any variable that is going to be one-hot encoded in the final model. In such cases, discretizing those values doesn't make any sense, because they're already discrete.

On the other hand, some variables might be encoded as integers and still be worth discretizing, e.g., age, income, height, weight, and so on.

It would be tough for `FSelectorRcpp` to guess whether an integer-encoded variable should be discretized, because that depends heavily on the context. So we (as the authors of `FSelectorRcpp`) decided that by default integer columns are treated like numerics and are discretized. However, the user can control this behavior with the `discIntegers` parameter.
See the example below for how the `discretize` function behaves when integer columns are present:

```{r}
library(FSelectorRcpp)

data <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
  z = c("x", "x", "y", "y", "y", "x"),
  int = as.integer(c(1, 1, 1, 2, 2, 1)),
  uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11)),
  stringsAsFactors = FALSE)

# default (integers are discretized)
discretize(data, y ~ .)

# discIntegers is set to FALSE - integers are left as is.
discretize(data, y ~ ., discIntegers = FALSE)
```

Your data may contain both kinds of integer columns, in which case `discIntegers` alone is not enough. You then need to manually convert the columns that should be discretized to `numeric` and call `discretize` with `discIntegers = FALSE`, so that the remaining integer columns are left as is. The code below shows a simple approach that converts integer columns to numeric if they contain a lot of distinct values:

```{r}
# TRUE for an integer vector in which the share of distinct values
# exceeds `threshold`; always FALSE for non-integer vectors.
can_discretize <- function(x, threshold = 0.9) {
  is.integer(x) && (length(unique(x)) / length(x) > threshold)
}

can_discretize(1:10)                                  # all values distinct
can_discretize(rnorm(10))                             # not an integer vector
can_discretize(as.integer(c(rep(1, 10), rep(2, 10)))) # only two distinct values

library(dplyr)

set.seed(123)
dt <- tibble(
  x = 1:20,
  y = rnorm(20),
  z = as.integer(c(rep(1, 10), rep(2, 10)))
)

glimpse(dt)

# Only `x` passes can_discretize, so only `x` is converted to numeric.
glimpse(dt %>% mutate_if(can_discretize, as.numeric))
```
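
When the columns to discretize are already known by name, the same conversion can be done without dplyr. The chunk below is a minimal sketch in base R. The helper `integers_to_numeric` and the data frame `mixed` are hypothetical names used only for illustration, and are not part of `FSelectorRcpp`:

```{r}
# Hypothetical helper (not part of FSelectorRcpp): convert the named
# columns to numeric (double) and leave every other column untouched.
integers_to_numeric <- function(df, cols) {
  stopifnot(all(cols %in% names(df)))
  df[cols] <- lapply(df[cols], as.numeric)
  df
}

mixed <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  int = as.integer(c(1, 1, 1, 2, 2, 1)),              # category codes
  uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11))   # a measured quantity
)

# Only `uniqueInt` should be discretized, so only it is converted.
mixed_converted <- integers_to_numeric(mixed, "uniqueInt")
sapply(mixed_converted, class)
```

Passing the result to `discretize(mixed_converted, y ~ ., discIntegers = FALSE)` should then discretize `uniqueInt` while leaving `int` as is, following the behavior described above.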