%\VignetteIndexEntry{mRMRe: an R package for parallelized mRMR ensemble feature selection}
%\VignetteDepends{}
%\VignetteSuggests{}
%\VignetteKeywords{}
%\VignettePackage{mRMRe}

\documentclass[12pt]{article}

\usepackage[utf8]{inputenc}
\usepackage{authblk}

\title{mRMRe: an R package for parallelized mRMR ensemble feature selection}
\author[1]{Nicolas De Jay}
\author[1]{Simon Papillon-Cavanagh}
\author[2]{Catharina Olsen}
\author[2]{Gianluca Bontempi}
\author[1]{Benjamin Haibe-Kains}
\affil[1]{Bioinformatics and Computational Biology Laboratory, Institut de recherches cliniques de Montr\'{e}al, Montreal, Quebec, Canada}
\affil[2]{Machine Learning Group, Universit\'{e} Libre de Bruxelles, Brussels, Belgium}

\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE}

<<>>=
options(keep.source=TRUE)
@

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle
\tableofcontents

%------------------------------------------------------------
\section{Introduction}
%------------------------------------------------------------

\textit{mRMRe} is an R package for parallelized mRMR ensemble feature selection.

%------------------------------------------------------------
\subsection{Installation and Settings}
%------------------------------------------------------------

\textit{mRMRe} requires that \textit{Rcpp} is installed. It should be installed automatically when you install \textit{mRMRe}.

Install \textit{mRMRe} from CRAN, or from Bioconductor using the \textit{biocLite} function. The CRAN route is:

<<>>=
install.packages("mRMRe")
@

Load \textit{mRMRe} into your current workspace:

<<>>=
library(mRMRe)
@

The \textit{mRMRe} package allows its users to set the number of threads it will use for computations. One may use the following method to avoid crowding computing clusters, or to utilize them fully.
<<>>=
set.thread.count(2)
@

Load the example dataset \textit{cgps} into your current workspace:

<<>>=
data(cgps)
data.annot <- data.frame(cgps.annot)
data.cgps <- data.frame(cgps.ic50, cgps.ge)
@

%------------------------------------------------------------
\subsection{Requirements}
%------------------------------------------------------------

\textit{mRMRe} has only been tested on Windows and Linux platforms. It requires that the \textit{OpenMP} C library be installed on the hosts on which the package is intended to run.

%------------------------------------------------------------
\section{Measures of Association}
%------------------------------------------------------------

%------------------------------------------------------------
\subsection{Mutual Information Matrix}
%------------------------------------------------------------

\textit{mRMRe} offers a fully parallelized implementation to compute the Mutual Information Matrix (MIM). The object \textit{data.cgps} should be a data frame with samples/observations in rows and features/variables in columns. The method supports the following column types: "numeric" ("integer" or "double"), "ordered factor" and "Surv".

Mutual information (MI) between two columns is estimated using a linear approximation based on correlation, such that MI is estimated as
%\begin{equation}
$I(x, y) = -\frac{1}{2} \ln{(1 - \rho(x,y)^2)}$,
%\end{equation}
where $I$ and $\rho$ respectively represent the MI and correlation coefficient between features $x$ and $y$. Correlation between continuous variables can be computed using either Pearson's or Spearman's estimator, while Cramer's V and Somers' Dxy index are used for correlation between discrete variables and between continuous variables and survival data, respectively.

<<>>=
## Test on a dummy dataset.
# Create a dummy data set
library(survival)
df <- data.frame(
  "surv1" = Surv(runif(100), sample(0:1, 100, replace = TRUE)),
  "cont1" = runif(100),
  "disc1" = factor(sample(1:5, 100, replace = TRUE), ordered = TRUE),
  "surv2" = Surv(runif(100), sample(0:1, 100, replace = TRUE)),
  "cont2" = runif(100),
  "cont3" = runif(100),
  "surv3" = Surv(runif(100), sample(0:1, 100, replace = TRUE)),
  "disc2" = factor(sample(1:5, 100, replace = TRUE), ordered = TRUE))
dd <- mRMR.data(data = df)

# Show a partial mutual information matrix.
print(mim(subsetData(dd, 1:4, 1:4)))
@

<<>>=
## Test on the 'cgps' dataset, where the
## variables are all of continuous type.
dd <- mRMR.data(data = data.cgps)
dd <- subsetData(dd, 1:10, 1:10)

# Uses Spearman as correlation estimator
spearman_mim <- mim(dd, continuous_estimator = "spearman")
print(spearman_mim[1:4, 1:4])

# Uses Pearson as correlation estimator
pearson_mim <- mim(dd, continuous_estimator = "pearson")
print(pearson_mim[1:4, 1:4])
@

%------------------------------------------------------------
\subsection{Correlations}
%------------------------------------------------------------

The \textit{mRMRe} package offers an efficient, stratified and weighted implementation of the major correlation estimators: Cramer's V, Somers' Dxy index (based on the concordance index), and the Pearson and Spearman correlation coefficients.
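
To give a feel for how such correlation estimates translate into MI values under the approximation $I(x, y) = -\frac{1}{2} \ln{(1 - \rho(x,y)^2)}$ stated earlier, the following worked values may help. The correlation values $0$, $0.5$ and $0.9$ are chosen purely for illustration and are not taken from the \textit{cgps} data.

\begin{eqnarray*}
\rho = 0   &\Rightarrow& I = -\frac{1}{2} \ln{(1 - 0)} = 0, \\
\rho = 0.5 &\Rightarrow& I = -\frac{1}{2} \ln{(0.75)} \approx 0.144, \\
\rho = 0.9 &\Rightarrow& I = -\frac{1}{2} \ln{(0.19)} \approx 0.830.
\end{eqnarray*}

Because the approximation depends on $\rho$ only through $\rho^2$, the sign of the correlation does not change the estimated MI, and the estimate grows without bound as $|\rho|$ approaches $1$.
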
<<>>=
# Compute c-index between features 1 and 2
correlate(cgps.ge[, 1], cgps.ge[, 2], method = "cindex")

# Compute Cramer's V
x <- sample(factor(c("CAT_1", "CAT_2", "CAT_3"), ordered = TRUE), 100, replace = TRUE)
y <- sample(factor(c("CAT_1", "CAT_2"), ordered = TRUE), 100, replace = TRUE)
correlate(x, y, method = "cramersv")

# Compute Pearson coefficient with random strata and
# sample weights between features 1 and 2
strata <- sample(factor(c("STRATUM_1", "STRATUM_2", "STRATUM_3"), ordered = TRUE),
                 nrow(cgps.ge), replace = TRUE)
weights <- runif(nrow(cgps.ge))
correlate(cgps.ge[, 1], cgps.ge[, 2], strata = strata, weights = weights, method = "pearson")
@

%------------------------------------------------------------
\section{mRMR Feature Selection}
%------------------------------------------------------------

\textit{mRMRe} offers a highly efficient implementation of the mRMR feature selection \cite{Ding:2005tl,meyer2008informationtheoritic}. The two crucial aspects of our implementation are, first, parallelizing the key steps of the algorithm and, second, using a lazy procedure to compute only the part of the MIM that is required during the search for the best set of features (instead of estimating the full MIM).

%------------------------------------------------------------
\subsection{Classic mRMR}
%------------------------------------------------------------

Here is an example of the classic mRMR feature selection \cite{Ding:2005tl}.

<<>>=
dd <- mRMR.data(data = data.cgps)
mRMR.classic(data = dd, target_indices = c(1), feature_count = 30)
@

%------------------------------------------------------------
\subsection{Ensemble mRMR}
%------------------------------------------------------------

%Our ensemble approach allows to create a tree-like set of solutions of non redundant mRMR solutions.
%The topology of the ensemble tree is user defined through the \textit{levels} parameter.
%A binary tree of depth 5 can be generated with \textit{levels=rep(2, 5)}, therefore creating $2^{5}$ mRMR solutions.

<<>>=
dd <- mRMR.data(data = data.cgps)

# For mRMR.classic-like results
mRMR.ensemble(data = dd, target_indices = c(1), solution_count = 1, feature_count = 30)

# For mRMR.ensemble-like results
mRMR.ensemble(data = dd, target_indices = c(1), solution_count = 5, feature_count = 30)
@

%------------------------------------------------------------
\section{Fixed Selected Features}
%------------------------------------------------------------

The \textit{mRMRe} package allows features to be selected while some features are held fixed in the selection. It also supports returning the solutions with or without the fixed features.

<<>>=
ensemble <- mRMR.ensemble(data = dd, target_indices = c(1), solution_count = 5,
                          feature_count = 10, fixed_feature_count = 1)
solutions(ensemble, with_fixed_features = FALSE)
@

%------------------------------------------------------------
\section{Causality Inference}
%------------------------------------------------------------

The \textit{mRMRe} package allows one to infer causality through the use of the co-information lattice method \cite{Bell:2003tb,McGill:1954gz}.

<<>>=
ensemble <- mRMR.ensemble(data = dd, target_indices = c(1), solution_count = 5, feature_count = 10)
causality(ensemble)
@

%------------------------------------------------------------
\section{Session Info}
%------------------------------------------------------------

<<>>=
toLatex(sessionInfo())
@

\bibliographystyle{plain}
\bibliography{biblio}

\end{document}