\documentclass[nojss]{jss}
\usepackage{amsmath,thumbpdf}
\shortcites{mvtnorm}

%% commands
\newcommand{\ui}{\underline{i}}
\newcommand{\oi}{\overline{\imath}}
\newcommand{\argmin}{\operatorname{argmin}\displaylimits}
\newcommand{\fixme}[1]{\emph{\marginpar{FIXME} (#1)}}
\newcommand{\class}[1]{`\texttt{#1}'}

%% need no \usepackage{Sweave}
\SweaveOpts{engine=R, eps=FALSE, keep.source=TRUE}

<<echo=FALSE, results=hide>>=
suppressWarnings(RNGversion("3.5.2"))
library("partykit")
options(prompt = "R> ", continue = "+ ", digits = 4, useFancyQuotes = FALSE)
@

%\VignetteIndexEntry{Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R}
%\VignetteDepends{AER,Formula,mlbench,sandwich,strucchange,survival,TH.data,vcd,psychotools,psychotree}
%\VignetteKeywords{parametric models, object-orientation, recursive partitioning}
%\VignettePackage{partykit}

\author{Achim Zeileis\\Universit\"at Innsbruck
  \And Torsten Hothorn\\Universit\"at Z\"urich}
\Plainauthor{Achim Zeileis, Torsten Hothorn}

\title{Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in \proglang{R}}
\Plaintitle{Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R}
\Shorttitle{Model-Based Recursive Partitioning in \proglang{R}}

\Keywords{parametric models, object-orientation, recursive partitioning}

\Abstract{
MOB is a generic algorithm for \underline{mo}del-\underline{b}ased recursive
partitioning \citep{Zeileis+Hothorn+Hornik:2008}. Rather than fitting one
global model to a dataset, it estimates local models on subsets of data that
are ``learned'' by recursively partitioning. It proceeds in the following way:
(1)~fit a parametric model to a data set, (2)~test for parameter instability
over a set of partitioning variables, (3)~if there is some overall parameter
instability, split the model with respect to the variable associated with the
highest instability, (4)~repeat the procedure in each of the resulting
subsamples. It is discussed how these steps of the conceptual algorithm are
translated into computational tools in an object-oriented manner, allowing the
user to plug in various types of parametric models. For representing the
resulting trees, the \proglang{R} package \pkg{partykit} is employed and
extended with generic infrastructure for recursive partitions where nodes are
associated with statistical models. Compared to the previously available
implementation in the \pkg{party} package, the new implementation supports
more inference options, is easier to extend to new models, and provides more
convenience features.
}

\Address{
Achim Zeileis \\
Department of Statistics \\
Faculty of Economics and Statistics \\
Universit\"at Innsbruck \\
Universit\"atsstr.~15 \\
6020 Innsbruck, Austria \\
E-mail: \email{Achim.Zeileis@R-project.org} \\
URL: \url{http://eeecon.uibk.ac.at/~zeileis/} \\

Torsten Hothorn\\
Institut f\"ur Epidemiologie, Biostatistik und Pr\"avention \\
Universit\"at Z\"urich \\
Hirschengraben 84\\
CH-8001 Z\"urich, Switzerland \\
E-mail: \email{Torsten.Hothorn@R-project.org}\\
URL: \url{http://user.math.uzh.ch/hothorn/}\\
}

\begin{document}

\section{Overview}

To implement the model-based recursive partitioning (MOB) algorithm of
\cite{Zeileis+Hothorn+Hornik:2008} in software, infrastructure for three
aspects is required: (1)~statistical ``\emph{models}'', (2)~recursive
``\emph{party}''tions, and (3)~``\emph{mobsters}'' carrying out the MOB
algorithm.
Along with \cite{Zeileis+Hothorn+Hornik:2008}, an implementation of all three steps was provided in the \pkg{party} package \citep{party} for the \proglang{R} system for statistical computing \citep{R}. This provided one very flexible \code{mob()} function combining \pkg{party}'s \proglang{S}4 classes for representing trees with binary splits and the \proglang{S}4 model wrapper functions from \pkg{modeltools} \citep{modeltools}. However, while this supported many applications of interest, it was somewhat limited in several directions: (1)~The \proglang{S}4 wrappers for the models were somewhat cumbersome to set up. (2)~The tree infrastructure was originally designed for \code{ctree()} and somewhat too narrowly focused on it. (3)~Writing new ``mobster'' interfaces was not easy because unexported \proglang{S}4 classes were used.

Hence, a leaner and more flexible interface (based on \proglang{S}3 classes) is now provided in \pkg{partykit} \citep{partykit}: (1)~New models are much easier to provide in a basic version and customization does not require setting up an additional \proglang{S}4 class-and-methods layer anymore. (2)~The trees are built on top of \pkg{partykit}'s flexible `\code{party}' objects, inheriting many useful methods and providing new ones dealing with the fitted models associated with the tree's nodes. (3)~New ``mobsters'' dedicated to specific models, e.g., \code{lmtree()} and \code{glmtree()} for MOBs of (generalized) linear models, are readily provided.

The remainder of this vignette is organized as follows: Section~\ref{sec:algorithm} very briefly reviews the original MOB algorithm of \cite{Zeileis+Hothorn+Hornik:2008} and also highlights relevant subsequent work. Section~\ref{sec:implementation} introduces the new \code{mob()} function in \pkg{partykit} in detail, discussing how all steps of the MOB algorithm are implemented and which options for customization are available. For illustration, logistic-regression-based recursive partitioning is applied to the Pima Indians diabetes data set from the UCI machine learning repository \citep{mlbench2}. Sections~\ref{sec:illustration} and~\ref{sec:mobster} present further illustrative examples \citep[including replications from][]{Zeileis+Hothorn+Hornik:2008} before Section~\ref{sec:conclusion} provides some concluding remarks.

\section{MOB: Model-based recursive partitioning}
\label{sec:algorithm}

First, the theory underlying MOB (model-based recursive partitioning) is briefly reviewed; a more detailed discussion is provided by \cite{Zeileis+Hothorn+Hornik:2008}. To fix notation, consider a parametric model $\mathcal{M}(Y, \theta)$ with (possibly vector-valued) observations $Y$ and a $k$-dimensional vector of parameters $\theta$. This model could be a (possibly multivariate) normal distribution for $Y$, a psychometric model for a matrix of responses $Y$, or some kind of regression model when $Y = (y, x)$ can be split up into a dependent variable $y$ and regressors $x$. An example for the latter could be a linear regression model $y = x^\top \theta$ or a generalized linear model (GLM) or a survival regression. Given $n$ observations $Y_i$ ($i = 1, \dots, n$) the model can be fitted by minimizing some objective function $\sum_{i = 1}^n \Psi(Y_i, \theta)$, e.g., a residual sum of squares or a negative log-likelihood leading to ordinary least squares (OLS) or maximum likelihood (ML) estimation, respectively.
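For example, in the linear regression case $\Psi$ is the squared residual so that minimization yields the OLS estimator, while employing the negative log-likelihood of a model density $f$ yields the ML estimator:
%
\begin{equation*}
\Psi(Y_i, \theta) = (y_i - x_i^\top \theta)^2
\qquad \text{or} \qquad
\Psi(Y_i, \theta) = -\log f(Y_i; \theta).
\end{equation*}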
If a global model for all $n$ observations does not fit well and further covariates $Z_1, \dots, Z_\ell$ are available, it might be possible to partition the $n$ observations with respect to these variables and find a well-fitting local model in each cell of the partition. The MOB algorithm tries to find such a partition adaptively using a greedy forward search. The reasons for looking at such local models might be different for different types of models: First, the detection of interactions and nonlinearities in regression relationships might be of interest just like in standard classification and regression trees \citep[see e.g.,][]{Zeileis+Hothorn+Hornik:2008}. Additionally, however, this approach makes it possible to add explanatory variables to models that otherwise do not have regressors or where the link between the regressors and the parameters of the model is unclear \citep[this idea is pursued by][]{Strobl+Wickelmaier+Zeileis:2011}. Finally, the model-based tree can be employed as a thorough diagnostic test of the parameter stability assumption (also termed measurement invariance in psychometrics). If the tree does not split at all, parameter stability (or measurement invariance) cannot be rejected while a tree with splits would indicate in which way the assumption is violated \citep[][employ this idea in psychometric item response theory models]{Strobl+Kopf+Zeileis:2015}.

The basic idea is to grow a tree in which every node is associated with a model of type $\mathcal{M}$. To assess whether splitting of the node is necessary a fluctuation test for parameter instability is performed. If there is significant instability with respect to any of the partitioning variables $Z_j$, the node is split into $B$ locally optimal segments (the current version of the software has $B = 2$ as the default and as the only option for numeric/ordered variables) and then the procedure is repeated in each of the $B$ children. If no more significant instabilities can be found, the recursion stops. More precisely, the steps of the algorithm are
%
\begin{enumerate}
\item Fit the model once to all observations in the current node.
\item Assess whether the parameter estimates are stable with respect to every partitioning variable $Z_1, \dots, Z_\ell$. If there is some overall instability, select the variable $Z_j$ associated with the highest parameter instability, otherwise stop.
\item Compute the split point(s) that locally optimize the objective function $\Psi$.
\item Split the node into child nodes and repeat the procedure until some stopping criterion is met.
\end{enumerate}
%
This conceptual framework is extremely flexible and can be adapted to various tasks by choosing particular models, tests, and methods in each of the steps:
%
\begin{enumerate}
\item \emph{Model estimation:} The original MOB introduction \citep{Zeileis+Hothorn+Hornik:2008} discussed regression models: OLS regression, GLMs, and survival regression. Subsequently, \cite{Gruen+Kosmidis+Zeileis:2012} have also adapted MOB to beta regression for limited response variables. Furthermore, MOB provides a generic way of adding covariates to models that otherwise have no regressors: this can either serve as a check whether the model is indeed independent of the regressors or lead to local models for subsets.
Both views are of interest when employing MOB to detect parameter instabilities in psychometric models for item responses such as the Bradley-Terry or the Rasch model \citep[see][respectively]{Strobl+Wickelmaier+Zeileis:2011,Strobl+Kopf+Zeileis:2015}.
\item \emph{Parameter instability tests:} To assess the stability of all model parameters across a certain partitioning variable, the general class of score-based fluctuation tests proposed by \cite{Zeileis+Hornik:2007} is employed. Based on the empirical score function observations (i.e., empirical estimating functions or contributions to the gradient), ordered with respect to the partitioning variable, the fluctuation or instability in the model's parameters can be tested. From this general framework Andrews' sup\textit{LM} test is used for assessing numerical partitioning variables and a $\chi^2$ test for categorical partitioning variables (see \citealp{Zeileis:2005} and \citealp{Merkle+Zeileis:2013} for unifying views emphasizing regression and psychometric models, respectively). Furthermore, the test statistics for ordinal partitioning variables suggested by \cite{Merkle+Fan+Zeileis:2014} have been added as further options (but are not used by default as the simulation of $p$-values is computationally demanding).
\item \emph{Partitioning:} As the objective function $\Psi$ is additive, it is easy to compute a single optimal split point (or cut point or break point). For each conceivable split, the model is estimated on the two resulting subsets and the resulting objective functions are summed. The split that optimizes this segmented objective function is then selected as the optimal split. For optimally splitting the data into $B > 2$ segments, the same idea can be used and a full grid search can be avoided by employing dynamic programming algorithms \citep{Hawkins:2001,Bai+Perron:2003} but at the moment the latter is not implemented in the software. Optionally, however, categorical partitioning variables can be split into all of their categories (i.e., in that case $B$ is set to the number of levels without searching for optimal groupings).
\item \emph{Pruning:} For determining the optimal size of the tree, one can either use a pre-pruning or post-pruning strategy. For the former, the algorithm stops when no significant parameter instabilities are detected in the current node (or when the node becomes too small). For the latter, one would first grow a large tree (subject only to a minimal node size requirement) and then prune back splits that did not improve the model, e.g., judging by information criteria such as AIC or BIC \citep[see e.g.,][]{Su+Wang+Fan:2004}. Currently, pre-pruning is used by default (via Bonferroni-corrected $p$-values from the score-based fluctuation tests) but AIC/BIC-based post-pruning is optionally available (especially for large data sets where traditional significance levels are not useful).
\end{enumerate}
%
In the following it is discussed how most of the options above are embedded in a common computational framework using \proglang{R}'s facilities for model estimation and object orientation.

\section[A new implementation in R]{A new implementation in \proglang{R}}
\label{sec:implementation}

This section introduces a new implementation of the general model-based recursive partitioning (MOB) algorithm in \proglang{R}.
Along with \cite{Zeileis+Hothorn+Hornik:2008}, a function \code{mob()} had been introduced to the \pkg{party} package \citep{party} which continues to work but turned out to be somewhat inflexible for certain applications/extensions. Hence, the \pkg{partykit} package \citep{partykit} provides a completely rewritten (and not backward compatible) implementation of a new \code{mob()} function along with convenience interfaces \code{lmtree()} and \code{glmtree()} for fitting linear model and generalized linear model trees, respectively. The function \code{mob()} itself is intended to be the workhorse function that can also be employed to quickly explore new models -- whereas \code{lmtree()} and \code{glmtree()} will be the typical user interfaces facilitating applications.

The new \code{mob()} function has the following arguments:
\begin{Code}
mob(formula, data, subset, na.action, weights, offset,
  fit, control = mob_control(), ...)
\end{Code}
All arguments in the first line are standard for modeling functions in \proglang{R} with a \code{formula} interface. They are employed by \code{mob()} to do some data preprocessing (described in detail in Section~\ref{sec:formula}) before the \code{fit} function is used for parameter estimation on the preprocessed data. The \code{fit} function has to be set up in a certain way (described in detail in Section~\ref{sec:fit}) so that \code{mob()} can extract all information that is needed in the MOB algorithm for parameter instability tests (see Section~\ref{sec:sctest}) and partitioning/splitting (see Section~\ref{sec:split}), i.e., the estimated parameters, the associated objective function, and the score function contributions. A list of \code{control} options can be set up by the \code{mob_control()} function, including options for pruning (see Section~\ref{sec:prune}). Additional arguments \code{...} are passed on to the \code{fit} function.

The result is an object of class `\code{modelparty}' inheriting from `\code{party}'. The \code{info} element of the overall `\code{party}' and the individual `\code{node}'s contain various information about the models. Details are provided in Section~\ref{sec:object}.

Finally, a wide range of standard (and some extra) methods are available for working with `\code{modelparty}' objects, e.g., for extracting information about the models, for visualization, computing predictions, etc. The standard set of methods is introduced in Section~\ref{sec:methods}. However, as will be discussed there, it may take some effort by the user to efficiently compute certain pieces of information. Hence, convenience interfaces such as \code{lmtree()} or \code{glmtree()} can alleviate these obstacles considerably, as illustrated in Section~\ref{sec:interface}.

\subsection{Formula processing and data preparation}
\label{sec:formula}

The formula processing within \code{mob()} is essentially done in ``the usual way'', i.e., there is a \code{formula} and \code{data} and optionally further arguments such as \code{subset}, \code{na.action}, \code{weights}, and \code{offset}. These are processed into a \code{model.frame} from which the response and the covariates can be extracted and passed on to the actual \code{fit} function.

As there are possibly three groups of variables (response, regressors, and partitioning variables), the \pkg{Formula} package \citep{Formula} is employed for processing these three parts. Thus, the formula can be of type \verb:y ~ x1 + ... + xk | z1 + ... + zl:
where the variables on the left of the \code{|} specify the data $Y$ and the variables on the right specify the partitioning variables $Z_j$. As pointed out above, the $Y$ can often be split up into response (\code{y} in the example above) and regressors (\code{x1}, \dots, \code{xk} in the example above). If there are no regressors and just constant fits are employed, then the formula can be specified as \verb:y ~ 1 | z1 + ... + zl:.

So far, this formula representation is really just a specification of groups of variables and does not imply anything about the type of model that is to be fitted to the data in the nodes of the tree. The \code{mob()} function does not know anything about the type of model and just passes (subsets of) the \code{y} and \code{x} variables on to the \code{fit} function. Only the partitioning variables \code{z} are used internally for the parameter instability tests (see Section~\ref{sec:sctest}).

As different \code{fit} functions will require the data in different formats, \code{mob_control()} allows setting the \code{ytype} and \code{xtype}. The default is to assume that \code{y} is just a single column of the model frame that is extracted as \code{ytype = "vector"}. Alternatively, it can be a \code{"data.frame"} of all response variables or a \code{"matrix"} set up via \code{model.matrix()}. The options \code{"data.frame"} and \code{"matrix"} are also available for \code{xtype} with the latter being the default. Note that if \code{"matrix"} is used, then transformations (e.g., logs, square roots etc.) and dummy/interaction codings are computed and turned into columns of a numeric matrix while for \code{"data.frame"} the original variables are preserved.

By specifying the \code{ytype} and \code{xtype}, \code{mob()} is also provided with the information on how to correctly subset \code{y} and \code{x} when recursively partitioning data. In each step, only the subset of \code{y} and \code{x} pertaining to the current node of the tree is passed on to the \code{fit} function. Similarly, subsets of \code{weights} and \code{offset} are passed on (if specified).

\subsubsection*{Illustration}

For illustration, we employ the popular benchmark data set on diabetes among Pima Indian women that is provided by the UCI machine learning repository \citep{mlbench2} and available in the \pkg{mlbench} package \citep{mlbench}:
%
<<>>=
data("PimaIndiansDiabetes", package = "mlbench")
@
%
Following \cite{Zeileis+Hothorn+Hornik:2008} we want to fit a model for \code{diabetes} employing the plasma glucose concentration \code{glucose} as a regressor. As the influence of the remaining variables on \code{diabetes} is less clear, their relationship should be learned by recursive partitioning. Thus, we employ the following formula:
%
<<>>=
pid_formula <- diabetes ~ glucose | pregnant + pressure + triceps +
  insulin + mass + pedigree + age
@
%
Before passing this to \code{mob()}, a \code{fit} function is needed and a logistic regression function is set up in the following section.

\subsection{Model fitting and parameter estimation}
\label{sec:fit}

The \code{mob()} function does not actually carry out any parameter estimation itself, but assumes that one of the many \proglang{R} functions available for model estimation is supplied.
However, to be able to carry out the steps of the MOB algorithm, \code{mob()} needs to be able to extract certain pieces of information: especially the estimated parameters, the corresponding objective function, and associated score function contributions. Also, the interface of the fitting function clearly needs to be standardized so that \code{mob()} knows how to invoke the model estimation.

Currently, two possible interfaces for the \code{fit} function can be employed:
%
\begin{enumerate}
\item The \code{fit} function can take the following inputs
\begin{Code}
fit(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...,
  estfun = FALSE, object = FALSE)
\end{Code}
where \code{y}, \code{x}, \code{weights}, \code{offset} are (the subset of) the preprocessed data. In \code{start} starting values for the parameter estimates may be supplied and \code{...} is passed on from the \code{mob()} function. The \code{fit} function then has to return an output list with the following elements:
\begin{itemize}
\item \code{coefficients}: Estimated parameters $\hat \theta$.
\item \code{objfun}: Value of the minimized objective function $\sum_i \Psi(y_i, x_i, \hat \theta)$.
\item \code{estfun}: Empirical estimating functions (or score function contributions) $\Psi'(y_i, x_i, \hat \theta)$. Only needed if \code{estfun = TRUE}, otherwise optionally \code{NULL}.
\item \code{object}: A model object for which further methods could be available (e.g., \code{predict()}, or \code{fitted()}, etc.). Only needed if \code{object = TRUE}, otherwise optionally \code{NULL}.
\end{itemize}
By making \code{estfun} and \code{object} optional, the fitting function might be able to save computation time by only optimizing the objective function but avoiding further computations (such as setting up a covariance matrix, residuals, diagnostics, etc.).
\item The \code{fit} function can also have a simpler interface with only the following inputs:
\begin{Code}
fit(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...)
\end{Code}
The meaning of all arguments is the same as above. However, in this case \code{fit} needs to return a classed model object for which methods for \code{coef()}, \code{logLik()}, and \code{estfun()} \citep[see][and the \pkg{sandwich} package]{sandwich} are available for extracting the parameter estimates, the maximized log-likelihood, and associated empirical estimating functions (or score contributions), respectively. Internally, a function of type (1) is set up by \code{mob()} in case a function of type (2) is supplied. However, as pointed out above, a function of type (1) might be useful to save computation time.
\end{enumerate}
%
In either case the \code{fit} function can, of course, choose to ignore any arguments that are not applicable, e.g., if there are no regressors \code{x} in the model or if starting values are not supported.

The \code{fit} function of type (2) is typically convenient to quickly try out a new type of model in recursive partitioning. However, when writing a new \code{mob()} interface such as \code{lmtree()} or \code{glmtree()}, it will typically be better to use type (1). Similarly, supporting starting values in \code{fit} is then recommended to save computation time.

\subsubsection*{Illustration}

For recursively partitioning the \code{diabetes ~ glucose} relationship (as already set up in the previous section), we employ a logistic regression model.
An interface of type (2) can be easily set up:
%
<<>>=
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}
@
%
Thus, \code{y}, \code{x}, and the starting values are passed on to \code{glm()} which returns an object of class `\code{glm}' for which all required methods (\code{coef()}, \code{logLik()}, and \code{estfun()}) are available.

Using this \code{fit} function and the \code{formula} already set up above, the MOB algorithm can be easily applied to the \code{PimaIndiansDiabetes} data:
%
<<>>=
pid_tree <- mob(pid_formula, data = PimaIndiansDiabetes, fit = logit)
@
%
The result is a logistic regression tree with three terminal nodes that can be easily visualized via \code{plot(pid_tree)} (see Figure~\ref{fig:pid_tree}) and printed:
<<>>=
pid_tree
@
%
The tree finds three groups of Pima Indian women:
\begin{itemize}
\item[\#2] Women with low body mass index who have on average a low risk of diabetes; however, this risk increases clearly with glucose level.
\item[\#4] Women with average and high body mass index, younger than 30 years, who have a higher average risk that also increases with glucose level.
\item[\#5] Women with average and high body mass index, older than 30 years, who have a high average risk that increases only slowly with glucose level.
\end{itemize}

Note that the example above is used for illustration here and \code{glmtree()} is recommended over using \code{mob()} plus manually setting up a \code{logit()} function. The same tree structure can be found via:
%
<<>>=
pid_tree2 <- glmtree(diabetes ~ glucose | pregnant + pressure + triceps +
  insulin + mass + pedigree + age,
  data = PimaIndiansDiabetes, family = binomial)
@
%
However, \code{glmtree()} is slightly faster as it avoids many unnecessary computations, see Section~\ref{sec:interface} for further details.

\begin{figure}[p!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<<fig=TRUE>>=
plot(pid_tree)
@
\caption{\label{fig:pid_tree} Logistic-regression-based tree for the Pima Indians diabetes data. The plots in the leaves report the estimated regression coefficients.}
\setkeys{Gin}{width=\textwidth}
<<fig=TRUE>>=
plot(pid_tree2, tp_args = list(ylines = 1, margins = c(1.5, 1.5, 1.5, 2.5)))
@
\caption{\label{fig:pid_tree2} Logistic-regression-based tree for the Pima Indians diabetes data. The plots in the leaves give spinograms for \code{diabetes} versus \code{glucose}.}
\end{figure}

Here, we only point out that \code{plot(pid_tree2)} produces a nicer visualization (see Figure~\ref{fig:pid_tree2}). As $y$ is \code{diabetes}, a binary variable, and $x$ is \code{glucose}, a numeric variable, a spinogram is chosen automatically for visualization (using the \pkg{vcd} package, \citealp{vcd}). The x-axis breaks in the spinogram are the five-point summary of \code{glucose} on the full data set. The fitted lines are the mean predicted probabilities in each group.

\subsection{Testing for parameter instability}
\label{sec:sctest}

In each node of the tree, first the parametric model is fitted to all observations in that node (see Section~\ref{sec:fit}). Subsequently it is of interest to find out whether the model parameters are stable over each particular ordering implied by the partitioning variables $Z_j$ or whether splitting the sample with respect to one of the $Z_j$ might capture instabilities in the parameters and thus improve the fit. The tests used in this step belong to the class of generalized M-fluctuation tests \citep{Zeileis:2005,Zeileis+Hornik:2007}.
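All of these tests are computed from the empirical estimating functions (score contributions) of the model fitted in the respective node. For a first intuition, this matrix can also be inspected directly; e.g., for the root-node model of \code{pid_tree} (a quick sketch, assuming the fitted `\code{glm}' object is stored in the node \code{info}, which is the default):
\begin{Code}
library("sandwich")
## score contributions of the root-node model:
## one row per observation, one column per parameter
head(estfun(pid_tree$node$info$object))
\end{Code}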
Depending on the class of each of the partitioning variables in \code{z} different types of tests are chosen by \code{mob()}:
\begin{enumerate}
\item For numeric partitioning variables (of class `\code{numeric}' or `\code{integer}') the sup\textit{LM}~statistic is used which is the maximum over all single-split \textit{LM} statistics. Associated $p$-values can be obtained from a table of critical values \citep[based on][]{Hansen:1997} stored within the package. If there are ties in the partitioning variable, the sequence of \textit{LM} statistics (and hence their maximum) is not unique and hence the results by default may depend to some degree on the ordering of the observations. To explore this, the option \code{breakties = TRUE} can be set in \code{mob_control()} which breaks ties randomly. If there are only a few ties, the influence is often negligible. If there are many ties (say only a dozen unique values of the partitioning variable), then the variable may be better treated as an ordered factor (see below).
\item For categorical partitioning variables (of class `\code{factor}'), a $\chi^2$~statistic is employed which captures the fluctuation within each of the categories of the partitioning variable. This is also an \textit{LM}-type test and is asymptotically equivalent to the corresponding likelihood ratio test. However, it is somewhat cheaper to compute the \textit{LM} statistic because the model just has to be fitted once in the current node and not separately for each category of each possible partitioning variable. See also \cite{Merkle+Fan+Zeileis:2014} for a more detailed discussion.
\item For ordinal partitioning variables (of class `\code{ordered}' inheriting from `\code{factor}') the same $\chi^2$ test as for the unordered categorical variables is used by default \citep[as suggested by][]{Zeileis+Hothorn+Hornik:2008}. Although this test is consistent for any parameter instabilities across ordered variables, it does not exploit the ordering information. Recently, \cite{Merkle+Fan+Zeileis:2014} proposed an adapted max\textit{LM} test for ordered variables and, alternatively, a weighted double maximum test. Both are optionally available in the new \code{mob()} implementation by setting \code{ordinal = "L2"} or \code{ordinal = "max"} in \code{mob_control()}, respectively. Unfortunately, computing $p$-values from both tests is more costly than for the default \code{ordinal = "chisq"}. For \code{"L2"} suitable tables of critical values have to be simulated on the fly using \code{ordL2BB()} from the \pkg{strucchange} package \citep{strucchange}. For \code{"max"} a multivariate normal probability has to be computed using the \pkg{mvtnorm} package \citep{mvtnorm}.
\end{enumerate}

All of the parameter instability tests above can be computed in an object-oriented manner as only the ``\code{estfun}'' has to be available for the fitted model object, either by computing it in the \code{fit} function directly or by providing an \code{estfun()} extractor (see Section~\ref{sec:fit}).

All tests also require an estimate of the corresponding variance-covariance matrix of the estimating functions. The default is to compute this using an outer-product-of-gradients (OPG) estimator. Alternatively, the corresponding information matrix or sandwich matrix can be used if: (a)~the estimating functions are actually maximum likelihood scores, and (b)~a \code{vcov()} method (based on an estimate of the information) is provided for the fitted model objects.
The corresponding option in \code{mob_control()} is to set \code{vcov = "info"} or \code{vcov = "sandwich"} rather than \code{vcov = "opg"} (the default).

For each of the $j = 1, \dots, \ell$ partitioning variables in \code{z} the test selected in the control options is employed and the corresponding $p$-value $p_j$ is computed. To adjust for multiple testing, the $p$-values can be Bonferroni adjusted (which is the default). To determine whether there is some overall instability, it is checked whether the minimal $p$-value $p_{j^*} = \min_{j = 1, \dots, \ell} p_j$ falls below a pre-specified significance level $\alpha$ (by default $\alpha = 0.05$) or not. If there is significant instability, the variable $Z_{j^*}$ associated with the minimal $p$-value is used for splitting the node.

\subsubsection*{Illustration}

In the logistic-regression-based MOB \code{pid_tree} computed above, the parameter instability tests can be extracted using the \code{sctest()} function from the \pkg{strucchange} package (for \underline{s}tructural \underline{c}hange \underline{test}). In the first node, the test statistics and Bonferroni-corrected $p$-values are:
%
<<>>=
library("strucchange")
sctest(pid_tree, node = 1)
@
%
Thus, the body \code{mass} index has the lowest $p$-value and is highly significant and hence used for splitting the data. In the second node, no further significant parameter instabilities can be detected and hence partitioning stops in that branch.
%
<<>>=
sctest(pid_tree, node = 2)
@
%
In the third node, however, there is still significant instability associated with the \code{age} variable and hence partitioning continues.
%
<<>>=
sctest(pid_tree, node = 3)
@
%
Because no further instabilities can be found in the fourth and fifth node, the recursive partitioning stops there.

\subsection{Splitting}
\label{sec:split}

In this step, the observations in the current node are split with respect to the chosen partitioning variable $Z_{j^*}$ into $B$ child nodes. As pointed out above, the \code{mob()} function currently only supports binary splits, i.e., $B = 2$. For determining the split point, an exhaustive search procedure is adopted: For each conceivable split point in $Z_{j^*}$, the two subset models are fit and the split associated with the minimal value of the objective function $\Psi$ is chosen.

Computationally, this means that the \code{fit} function is applied to the subsets of \code{y} and \code{x} for each possible binary split. The ``\code{objfun}'' values are simply summed up (because the objective function is assumed to be additive) and its minimum across splits is determined. In a search over a numeric partitioning variable, this means that typically a lot of subset models have to be fitted and often these will not vary a lot from one split to the next. Hence, the parameter estimates ``\code{coefficients}'' from the previous split are employed as \code{start} values for estimating the coefficients associated with the next split. Thus, if the \code{fit} function makes use of these starting values, the model fitting can often be sped up.

\subsubsection*{Illustration}

For the Pima Indians diabetes data, the split points found for \code{pid_tree} are displayed both in the print output and the visualization (see Figures~\ref{fig:pid_tree} and~\ref{fig:pid_tree2}).
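To make the exhaustive search more concrete, it can be sketched as a stylized re-implementation in plain \proglang{R} code. The function below is purely illustrative (assuming a numeric partitioning variable \code{z}, a model matrix \code{x}, and the \code{logit()} fitting function from above, and ignoring minimal node size requirements as well as starting values):
\begin{Code}
## summed objective function (here: negative log-likelihood) over
## all candidate binary splits z < s vs. z >= s
split_search <- function(z, y, x) {
  s <- sort(unique(z))[-1]
  obj <- sapply(s, function(si) {
    left <- z < si
    ml <- logit(y[left],  x[left,  , drop = FALSE])
    mr <- logit(y[!left], x[!left, , drop = FALSE])
    -as.numeric(logLik(ml)) - as.numeric(logLik(mr))
  })
  s[which.min(obj)]  ## split point minimizing the summed objfun
}
\end{Code}
In \code{mob()} itself the same kind of search additionally reuses the previous ``\code{coefficients}'' as \code{start} values, which is what often speeds up the fitting considerably.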
\subsection{Pruning}
\label{sec:prune}

By default, the size of \code{mob()} trees is determined only by the significance tests, i.e., when there is no more significant parameter instability (by default at level $\alpha = 0.05$) the tree stops growing. Additional stopping criteria are only the minimal node size (\code{minsize}) and the maximum tree depth (\code{maxdepth}, by default infinity).

However, for very large sample sizes traditional significance levels are typically not useful because even tiny parameter instabilities can be detected. To avoid overfitting in such a situation, one would either have to use much smaller significance levels or add some form of post-pruning to reduce the size of the tree afterwards. Similarly, one could wish to first grow a very large tree (using a large $\alpha$ level) and then prune it afterwards, e.g., using some information criterion like AIC or BIC.

To accommodate such post-pruning strategies, \code{mob_control()} has an argument \code{prune} that can be a \code{function(objfun, df, nobs)} that returns \code{TRUE} if a node should be pruned and \code{FALSE} if not. The arguments supplied are a vector of objective function values in the current node and the sum of all child nodes, a vector of corresponding degrees of freedom, and the number of observations in the current node and in total. For example, if the objective function used is the negative log-likelihood, then for BIC-based pruning the \code{prune} function is: \code{(2 * objfun[1] + log(nobs[1]) * df[1]) < (2 * objfun[2] + log(nobs[2]) * df[2])} (see also the explicit version sketched at the end of this section). As the negative log-likelihood is the default objective function when using the ``simple'' \code{fit} interface, \code{prune} can also be set to \code{"AIC"} or \code{"BIC"} and then suitable functions will be set up internally. Note, however, that for other objective functions this strategy is not appropriate and the functions would have to be defined differently (which \code{lmtree()} does for example).

In the literature, there is no clear consensus as to how many degrees of freedom should be assigned to the selection of a split point. Hence, \code{mob_control()} allows setting \code{dfsplit} which by default is \code{dfsplit = TRUE} and then \code{as.integer(dfsplit)} (i.e., 1 by default) degrees of freedom per split are used. This can be modified to \code{dfsplit = FALSE} (or equivalently \code{dfsplit = 0}) or \code{dfsplit = 3} etc.\ for lower or higher penalization of additional splits.

\subsubsection*{Illustration}

With $n = \Sexpr{nrow(PimaIndiansDiabetes)}$ observations, the sample size is quite reasonable for using the classical significance level of $\alpha = 0.05$ which is also reflected by the moderate tree size with three terminal nodes. However, if we wished to explore further splits, a conceivable strategy could be the following:
%
<<>>=
pid_tree3 <- mob(pid_formula, data = PimaIndiansDiabetes,
  fit = logit, control = mob_control(verbose = TRUE,
  minsize = 50, maxdepth = 4, alpha = 0.9, prune = "BIC"))
@
This first grows a large tree until the nodes become too small (minimum node size: 50~observations) or the tree becomes too deep (maximum depth 4~levels) or the $p$-values come very close to one (larger than $\alpha = 0.9$). Here, this large tree has eleven nodes which are subsequently pruned based on whether or not they improve the BIC of the model. For this data set, the resulting BIC-pruned tree is in fact identical to the pre-pruned \code{pid_tree} considered above.
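For illustration, the rule applied by \code{prune = "BIC"} above could equivalently be supplied as an explicit \code{prune} function, a sketch assuming the negative log-likelihood as objective function and following the argument conventions described above:
\begin{Code}
## objfun: objective function in the current node and sum over child nodes
## df:     corresponding degrees of freedom
## nobs:   number of observations in the current node and in total
## prune the node if its BIC is better (smaller) than that of the children
prune_bic <- function(objfun, df, nobs) {
  (2 * objfun[1] + log(nobs[1]) * df[1]) <
    (2 * objfun[2] + log(nobs[2]) * df[2])
}
\end{Code}
This could then be passed as \code{mob_control(prune = prune_bic)}.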
\subsection[Fitted `party' objects]{Fitted `\texttt{party}' objects}
\label{sec:object}

The result of \code{mob()} is an object of class `\code{modelparty}' inheriting from `\code{party}'. See the other vignettes in the \pkg{partykit} package \citep{partykit} for more details of the general `\code{party}' class. Here, we just point out that the main difference between a `\code{modelparty}' and a plain `\code{party}' is that additional information about the data and the associated models is stored in the \code{info} elements: both of the overall `\code{party}' and the individual `\code{node}'s. The details are exemplified below.

\subsubsection*{Illustration}

In the \code{info} of the overall `\code{party}' the following information is stored for \code{pid_tree}:
%
<<>>=
names(pid_tree$info)
@
%
The \code{call} contains the \code{mob()} call. The \code{formula} and \code{Formula} are virtually the same but are simply stored as plain `\code{formula}' and extended `\code{Formula}' \citep{Formula} objects, respectively. The \code{terms} contain separate lists of terms for the \code{response} (and regressor) and the \code{partitioning} variables. The \code{fit}, \code{control}, and \code{dots} are the arguments that were provided to \code{mob()} and \code{nreg} is the number of regressor variables.

Furthermore, each \code{node} of the tree contains the following information:
%
<<>>=
names(pid_tree$node$info)
@
%
The \code{coefficients}, \code{objfun}, and \code{object} are the results of the \code{fit} function for that node. In \code{nobs} and \code{p.value} the number of observations and the minimal $p$-value are provided, respectively. Finally, \code{test} contains the parameter instability test results. Note that the \code{object} element can also be suppressed through \code{mob_control()} to save memory space.

\subsection{Methods}
\label{sec:methods}

There is a wide range of standard methods available for objects of class `\code{modelparty}'. The standard \code{print()}, \code{plot()}, and \code{predict()} methods build on the corresponding methods for `\code{party}' objects but provide some additional special options. Furthermore, methods are provided that are typically available for models with formula interfaces: \code{formula()} (optionally one can set \code{extended = TRUE} to get the `\code{Formula}'), \code{getCall()}, \code{model.frame()}, \code{weights()}. Finally, there is a standard set of methods for statistical model objects: \code{coef()}, \code{residuals()}, \code{logLik()} (optionally setting \code{dfsplit = FALSE} suppresses counting the splits in the degrees of freedom, see Section~\ref{sec:prune}), \code{deviance()}, \code{fitted()}, and \code{summary()}.

Some of these methods rely on reusing the corresponding methods for the individual model objects in the terminal nodes. Functions such as \code{coef()}, \code{print()}, and \code{summary()} also take a \code{node} argument that can specify the node IDs to be queried.

Two methods have non-standard arguments to allow for additional flexibility when dealing with model trees. Typically, ``normal'' users do not have to use these arguments directly but they are very flexible and facilitate writing convenience interfaces such as \code{glmtree()} (see Section~\ref{sec:interface}).
\begin{itemize}
\item The \code{predict()} method has the following arguments: \code{predict(object, newdata = NULL, type = "node", ...)}. The argument \code{type} can either be a function or a character string.
More precisely, if \code{type} is a function it should be a \code{function (object, newdata = NULL, ...)} that returns a vector or matrix of predictions from a fitted model \code{object} either with or without \code{newdata}. If \code{type} is a character string, such a function is set up internally as \code{predict(object, newdata = newdata, type = type, ...)}, i.e., it relies on a suitable \code{predict()} method being available for the fitted models associated with the `\code{party}' object.
\item The \code{plot()} method has the following arguments: \code{plot(x, terminal_panel = NULL, FUN = NULL)}. This simply calls the \code{plot()} method for `\code{party}' objects with a custom panel function for the terminal panels. By default, \code{node_terminal} is used to include some short text in each terminal node. This text can be set up by \code{FUN} with the default being the number of observations and estimated parameters. However, more elaborate terminal panel functions can be written, as done for the convenience interfaces.
\end{itemize}

Finally, two \proglang{S}3-style functions are provided without the corresponding generics (as these reside in packages that \pkg{partykit} does not depend on). The \code{sctest.modelparty} can be used in combination with the \code{sctest()} generic from \pkg{strucchange} as illustrated in Section~\ref{sec:sctest}. The \code{refit.modelparty} function extracts (or refits if necessary) the fitted model objects in the specified nodes (by default from all nodes).

\subsubsection*{Illustration}

The \code{plot()} and \code{print()} methods have already been illustrated for the \code{pid_tree} above. However, here we add that the \code{print()} method can also be used to show more detailed information about particular nodes instead of showing the full tree:
%
<<>>=
print(pid_tree, node = 3)
@
%
Information about the model and coefficients can, for example, be extracted by:
%
<<>>=
coef(pid_tree)
coef(pid_tree, node = 1)
## IGNORE_RDIFF_BEGIN
summary(pid_tree, node = 1)
## IGNORE_RDIFF_END
@
%
As the coefficients pertain to a logistic regression, they can be easily interpreted as odds ratios by taking the \code{exp()}:
%
<<>>=
exp(coef(pid_tree)[,2])
@
%
<<>>=
risk <- round(100 * (exp(coef(pid_tree)[,2])-1), digits = 1)
@
%
i.e., the odds increase by \Sexpr{risk[1]}\%, \Sexpr{risk[2]}\% and \Sexpr{risk[3]}\% with respect to glucose in the three terminal nodes.

Log-likelihoods and information criteria are available (which by default also penalize the estimation of splits):
<<>>=
logLik(pid_tree)
AIC(pid_tree)
BIC(pid_tree)
@
%
Mean squared residuals (or deviances) can be extracted in different ways:
<<>>=
mean(residuals(pid_tree)^2)
deviance(pid_tree)/sum(weights(pid_tree))
deviance(pid_tree)/nobs(pid_tree)
@
%
Predicted nodes can also be easily obtained:
%
<<>>=
pid <- head(PimaIndiansDiabetes)
predict(pid_tree, newdata = pid, type = "node")
@
%
More predictions, e.g., predicted probabilities within the nodes, can also be obtained but require some extra coding if only \code{mob()} is used (see the sketch at the end of this section). However, with the \code{glmtree()} interface this is very easy as shown below.

Finally, several methods for `\code{party}' objects are, of course, also available for `\code{modelparty}' objects, e.g., querying width and depth of the tree:
%
<<>>=
width(pid_tree)
depth(pid_tree)
@
%
Also subtrees can be easily extracted:
%
<<>>=
pid_tree[3]
@
%
The subtree is again a completely valid `\code{modelparty}' and hence it could also be visualized via \code{plot(pid_tree[3])} etc.
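As an example of such extra coding with plain \code{mob()} output, the fitted model objects can first be extracted with \code{refit.modelparty()} and each node's own \code{predict()} method can then be applied. The following sketch (not evaluated here) would yield the in-sample predicted probabilities of the `\code{glm}' objects in all nodes:
\begin{Code}
pid_mods <- refit.modelparty(pid_tree)
## in-sample predicted probabilities from each node's logistic regression
lapply(pid_mods, function(m) head(predict(m, type = "response")))
\end{Code}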
\subsection{Extensions and convenience interfaces}
\label{sec:interface}

As illustrated above, dealing with MOBs can be carried out in a very generic and object-oriented way. Almost all information required for dealing with recursively partitioned models can be encapsulated in the \code{fit()} function and \code{mob()} does not require more information on what type of model is actually used.

However, for certain tasks more detailed information about the type of model used or the type of data it can be fitted to can (and should) be exploited. Notable examples for this are visualizations of the data along with the fitted model or model-based predictions in the leaves of the tree. To conveniently accommodate such specialized functionality, the \pkg{partykit} package provides two convenience interfaces \code{lmtree()} and \code{glmtree()} and encourages other packages to do the same (e.g., \code{raschtree()} and \code{bttree()} in \pkg{psychotree}). Furthermore, such a convenience interface can avoid duplicated formula processing in both \code{mob()} and its \code{fit} function.

Specifically, \code{lmtree()} and \code{glmtree()} interface \code{lm.fit()}, \code{lm.wfit()}, and \code{glm.fit()} and then provide some extra computations to return valid fitted `\code{lm}' and `\code{glm}' objects in the nodes of the tree. The resulting `\code{modelparty}' object gains an additional class `\code{lmtree}'/`\code{glmtree}' to dispatch to its enhanced \code{plot()} and \code{predict()} methods.

\subsubsection*{Illustration}

The \code{pid_tree2} object was already created above with \code{glmtree()} (instead of \code{mob()} as for \code{pid_tree}) to illustrate the enhanced plotting capabilities in Figure~\ref{fig:pid_tree2}. Here, the enhanced \code{predict()} method is used to obtain predicted means (i.e., probabilities) and associated linear predictors (on the logit scale) in addition to the previously available predicted node IDs.
%
<<>>=
predict(pid_tree2, newdata = pid, type = "node")
predict(pid_tree2, newdata = pid, type = "response")
predict(pid_tree2, newdata = pid, type = "link")
@

\section{Illustrations}
\label{sec:illustration}

In the remainder of the vignette, further empirical illustrations of the MOB method are provided, including replications of the results from \cite{Zeileis+Hothorn+Hornik:2008}:
\begin{enumerate}
\item An investigation of the price elasticity of the demand for economics journals across covariates describing the type of journal (e.g., its price, its age, and whether it is published by a society, etc.).
\item Prediction of house prices in the well-known Boston Housing data set, also taken from the UCI machine learning repository.
\item Exploration of how teaching ratings and beauty scores of professors are associated and how this association changes across further explanatory variables such as age, gender, and native speaker status of the professors.
\item Assessment of differences in the preferential treatment of women/children (``women and children first'') across subgroups of passengers on board the ill-fated maiden voyage of the RMS Titanic.
\item Modeling of breast cancer survival by capturing heterogeneity in certain (treatment) effects across patients.
\item Modeling of paired comparisons of topmodel candidates by capturing heterogeneity in their attractiveness scores across participants in a survey.
\end{enumerate}

More details about several of the underlying data sets, previous studies exploring the data, and the results based on MOB can be found in \cite{Zeileis+Hothorn+Hornik:2008}. Here, we focus on using the \pkg{partykit} package to replicate the analysis and explore the resulting trees. The first three illustrations employ the \code{lmtree()} convenience function, the fourth is based on logistic regression using \code{glmtree()}, and the fifth uses \code{survreg()} from \pkg{survival} \citep{survival} in combination with \code{mob()} directly. The sixth and last illustration is deferred to a separate section and shows in detail how to set up new ``mobster'' functionality from scratch.

\subsection{Demand for economic journals}

The price elasticity of the demand for 180~economic journals is assessed by an OLS regression in log-log form: The dependent variable is the logarithm of the number of US library subscriptions and the regressor is the logarithm of price per citation. The data are provided by the \pkg{AER} package \citep{AER} and can be loaded and transformed via:
%
<<>>=
data("Journals", package = "AER")
Journals <- transform(Journals,
  age = 2000 - foundingyear,
  chars = charpp * pages)
@
%
Subsequently, the stability of the price elasticity across the remaining variables can be assessed using MOB. Below, \code{lmtree()} is used with the following partitioning variables: raw price and citations, age of the journal, number of characters, and whether the journal is published by a scientific society or not. A minimal segment size of 10~observations is employed and by setting \code{verbose = TRUE} detailed progress information about the recursive partitioning is displayed while growing the tree:
%
<<>>=
j_tree <- lmtree(log(subs) ~ log(price/citations) | price + citations +
  age + chars + society, data = Journals, minsize = 10, verbose = TRUE)
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.75\textwidth}
<<fig=TRUE>>=
plot(j_tree)
@
\caption{\label{fig:Journals} Linear-regression-based tree for the journals data. The plots in the leaves show linear regressions of log(subscriptions) by log(price/citation).}
\end{figure}
%
The resulting tree just has one split and two terminal nodes for young journals (with a somewhat larger price elasticity) and old journals (with a somewhat lower price elasticity), respectively. Figure~\ref{fig:Journals} can be obtained by \code{plot(j_tree)} and the corresponding printed representation is shown below.
%
<<>>=
j_tree
@
%
Finally, various quantities of interest such as the coefficients, standard errors and test statistics, and the associated parameter instability tests could be extracted by the following code. The corresponding output is suppressed here.
%
<<results=hide>>=
coef(j_tree, node = 1:3)
summary(j_tree, node = 1:3)
sctest(j_tree, node = 1:3)
@

\subsection{Boston housing data}

The Boston housing data are a popular and well-investigated empirical basis for illustrating nonlinear regression methods both in machine learning and statistics. We follow previous analyses by segmenting a bivariate linear regression model for the house values.
The data set is available in package \pkg{mlbench} and can be obtained and transformed via:
%
<<>>=
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
  chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
  rad = factor(rad, ordered = TRUE))
@
%
It provides $n = \Sexpr{NROW(BostonHousing)}$ observations of the median value of owner-occupied homes in Boston (in USD~1000) along with $\Sexpr{NCOL(BostonHousing)}$ covariates including in particular the number of rooms per dwelling (\code{rm}) and the percentage of lower status of the population (\code{lstat}). A segment-wise linear relationship between the value and these two variables is very intuitive, whereas the shape of the influence of the remaining covariates is rather unclear and hence should be learned from the data. Therefore, a linear regression model for median value explained by \verb:rm^2: and \verb:log(lstat): is employed and partitioned with respect to all remaining variables. Choosing appropriate transformations of the dependent variable and the regressors that enter the linear regression model is important to obtain a well-fitting model in each segment and we follow in our choice the recommendations of \cite{Breiman+Friedman:1985}. Monotonic transformations of the partitioning variables do not affect the recursive partitioning algorithm and hence do not have to be performed.

However, it is important to distinguish between numerical and categorical variables for choosing an appropriate parameter stability test. The variable \code{chas} is a dummy indicator variable (for tract bounds with Charles river) and thus needs to be turned into a factor. Furthermore, the variable \code{rad} is an index of accessibility to radial highways and takes only 9 distinct values. Hence, it is most appropriately treated as an ordered factor. Note that both transformations only affect the parameter stability test chosen (step~2), not the splitting procedure (step~3).

% Note that with splittry = 0 (according to old version of mob) there is no
% split in dis
<<>>=
bh_tree <- lmtree(medv ~ log(lstat) + I(rm^2) | zn + indus + chas +
  nox + age + dis + rad + tax + crim + b + ptratio,
  data = BostonHousing)
bh_tree
@
%
The corresponding visualization is shown in Figure~\ref{fig:BostonHousing}. It shows partial scatter plots of the dependent variable against each of the regressors in the terminal nodes. Each scatter plot also shows the fitted values, i.e., a projection of the fitted hyperplane.

\setkeys{Gin}{width=\textwidth}
\begin{figure}[p!]
\centering
<<BostonHousing-plot, fig=TRUE, include=FALSE>>=
plot(bh_tree)
@
\includegraphics[width=18cm,keepaspectratio,angle=90]{mob-BostonHousing-plot}
\caption{\label{fig:BostonHousing} Linear-regression-based tree for the Boston housing data. The plots in the leaves give partial scatter plots for \code{rm} (upper panel) and \code{lstat} (lower panel).}
\end{figure}

From this visualization, it can be seen that in the nodes~4, 6, 7 and 8 the increase of value with the number of rooms dominates the picture (upper panel) whereas in node~9 the decrease with the lower status population percentage (lower panel) is more pronounced. Splits are performed in the variables \code{tax} (property-tax rate) and \code{ptratio} (pupil-teacher ratio).
For summarizing the quality of the fit, we could compute the mean squared error, log-likelihood or AIC:
%
<<>>=
mean(residuals(bh_tree)^2)
logLik(bh_tree)
AIC(bh_tree)
@

\subsection{Teaching ratings data}

\cite{Hamermesh+Parker:2005} investigate the correlation of beauty and teaching evaluations for professors. They provide data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000--2002 at the University of Texas at Austin. It is of interest how the average teaching evaluation per course (on a scale 1--5) depends on a standardized measure of beauty (as assessed by a committee of six persons based on photos). \cite{Hamermesh+Parker:2005} employ a linear regression, weighted by the number of students per course and adjusting for several further main effects: gender, whether or not the teacher is from a minority, a native speaker, or has tenure, respectively, and whether the course is taught in the upper or lower division. Additionally, the age of the professors is available as a regressor but not considered by \cite{Hamermesh+Parker:2005} because the corresponding main effect is not found to be significant in either linear or quadratic form.

Here, we employ a similar model but use the available regressors differently. The basic model is again a linear regression for teaching rating by beauty, estimated by weighted least squares (WLS). All remaining explanatory variables are \emph{not} used as regressors but as partitioning variables because we argue that it is unclear how they influence the correlation between teaching rating and beauty. Hence, MOB is used to adaptively incorporate these further variables and determine potential interactions.

First, the data are loaded from the \pkg{AER} package \citep{AER} and only the single-credit courses are excluded.
%
<<>>=
data("TeachingRatings", package = "AER")
tr <- subset(TeachingRatings, credits == "more")
@
%
The single-credit courses include elective modules that are quite different from the remaining courses (e.g., yoga, aerobics, or dance) and are hence omitted from the main analysis.

WLS estimation of the null model (with only an intercept) and the main effects model is carried out in a first step:
%
<<>>=
tr_null <- lm(eval ~ 1, data = tr, weights = students)
tr_lm <- lm(eval ~ beauty + gender + minority + native + tenure + division,
  data = tr, weights = students)
@
%
Then, the model-based tree can be estimated with \code{lmtree()} using only \code{beauty} as a regressor and all remaining variables for partitioning. For WLS estimation, we set \code{weights = students} and \code{caseweights = FALSE} because the weights are only proportionality factors and do not signal exactly replicated observations \citep[see][for a discussion of the different types of weights]{Lumley:2020}.
%
<<>>=
(tr_tree <- lmtree(eval ~ beauty | minority + age + gender + division +
  native + tenure, data = tr, weights = students, caseweights = FALSE))
@
%
\begin{figure}[t!]
\setkeys{Gin}{width=\textwidth}
<<fig=TRUE>>=
plot(tr_tree)
@
\caption{\label{fig:tr_tree} WLS-based tree for the teaching ratings data. The plots in the leaves show scatterplots for teaching rating by beauty score.}
\end{figure}
%
The resulting tree can be visualized by \code{plot(tr_tree)} and is shown in Figure~\ref{fig:tr_tree}.
This shows that despite age not having a significant main effect \citep[as reported by][]{Hamermesh+Parker:2005}, it clearly plays an important role: While the correlation of teaching rating and beauty score is rather moderate for younger professors, there is a clear correlation for older professors (with the cutoff age somewhat lower for female professors). % <>= coef(tr_lm)[2] coef(tr_tree)[, 2] @ % The $R^2$ of the tree is also clearly improved over the main effects model: % <>= 1 - c(deviance(tr_lm), deviance(tr_tree))/deviance(tr_null) @ \subsection{Titanic survival data} To illustrate how differences in treatment effects can be captured by MOB, the Titanic survival data is considered: The question is whether ``women and children first'' is applied in the same way for all subgroups of the passengers of the Titanic. Or, in other words, whether the effectiveness of preferential treatment for women/children differed across subgroups. The \code{Titanic} data is provided in base \proglang{R} as a contingency table and transformed here to a `\code{data.frame}' for use with \code{glmtree()}: % <>= data("Titanic", package = "datasets") ttnc <- as.data.frame(Titanic) ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4] names(ttnc)[2] <- "Gender" ttnc <- transform(ttnc, Treatment = factor( Gender == "Female" | Age == "Child", levels = c(FALSE, TRUE), labels = c("Male&Adult", "Female|Child"))) @ % The data provides factors \code{Survived} (yes/no), \code{Class} (1st, 2nd, 3rd, crew), \code{Gender} (male, female), and \code{Age} (child, adult). Additionally, a factor \code{Treatment} is added that distinguishes women/children from male adults. To investigate how the preferential treatment effect (\code{Survived ~ Treatment}) differs across the remaining explanatory variables, the following logistic-regression-based tree is considered. A significance level of \code{alpha = 0.01} is employed here to avoid overfitting and separation problems in the logistic regression. % <>= ttnc_tree <- glmtree(Survived ~ Treatment | Class + Gender + Age, data = ttnc, family = binomial, alpha = 0.01) ttnc_tree @ % \begin{figure}[t!] \setkeys{Gin}{width=\textwidth} <>= plot(ttnc_tree, tp_args = list(ylines = 1, margins = c(1.5, 1.5, 1.5, 2.5))) @ \caption{\label{fig:ttnc_tree} Logistic-regression-based tree for the Titanic survival data. The plots in the leaves give spinograms for survival status versus preferential treatment (women or children).} \end{figure} % This shows that the treatment effect differs strongly across passenger classes, see also the result of \code{plot(ttnc_tree)} in Figure~\ref{fig:ttnc_tree}. The treatment effect is much lower in the 3rd class where women/children still have higher survival rates than adult men but the odds ratio is much lower compared to all remaining classes. The split between the 2nd and the remaining two classes (1st, crew) is due to a lower overall survival rate (intercepts of \Sexpr{round(coef(ttnc_tree)[2, 1], digits = 2)} and \Sexpr{round(coef(ttnc_tree)[3, 1], digits = 2)}, respectively) while the odds ratios associated with the preferential treatment are rather similar (\Sexpr{round(coef(ttnc_tree)[2, 2], digits = 2)} and \Sexpr{round(coef(ttnc_tree)[3, 2], digits = 2)}, respectively). Another option for assessing the class effect would be to immediately split into all four classes rather than using recursive binary splits. This can be obtained by setting \code{catsplit = "multiway"} in the \code{glmtree()} call above, as sketched below.
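%
For concreteness, such a call could look as follows (a sketch, not evaluated here):
%
<<Titanic-multiway, eval=FALSE>>=
glmtree(Survived ~ Treatment | Class + Gender + Age,
  data = ttnc, family = binomial, alpha = 0.01, catsplit = "multiway")
@
%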
This yields a tree with just a single split but four kid nodes. \subsection{German breast cancer data} To illustrate that the MOB approach can also be used beyond (generalized) linear regression models, it is employed in the following to analyze censored survival times among German women with node-positive breast cancer. The data is available in the \pkg{TH.data} package and the survival time is transformed from days to years: % <>= data("GBSG2", package = "TH.data") GBSG2$time <- GBSG2$time/365 @ % For the regression, a parametric Weibull model based on the \code{survreg()} function from the \pkg{survival} package \citep{survival} is used. A fitting function for \code{mob()} can be easily set up: % <>= library("survival") wbreg <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) { survreg(y ~ 0 + x, weights = weights, dist = "weibull", ...) } @ % As the \pkg{survival} package currently does not provide a \code{logLik()} method for `\code{survreg}' objects, this needs to be added here: % <>= logLik.survreg <- function(object, ...) structure(object$loglik[2], df = sum(object$df), class = "logLik") @ % Without the \code{logLik()} method, \code{mob()} would not know how to extract the optimized objective function from the fitted model. With the functions above available, a censored Weibull regression tree can be fitted: The dependent variable is the censored survival time and the two regressor variables are the main risk factor (number of positive lymph nodes) and the treatment variable (hormonal therapy). All remaining variables are used for partitioning: age, tumor size and grade, progesterone and estrogen receptor, and menopausal status. The minimal segment size is set to 80. % <>= gbsg2_tree <- mob(Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat, data = GBSG2, fit = wbreg, control = mob_control(minsize = 80)) @ % \begin{figure}[p!] \centering \setkeys{Gin}{width=0.6\textwidth} <>= plot(gbsg2_tree) @ \caption{\label{fig:GBSG2} Censored Weibull-regression-based tree for the German breast cancer data.
The plots in the leaves report the estimated regression coefficients.} \setkeys{Gin}{width=\textwidth} <>= gbsg2node <- function(mobobj, col = "black", linecol = "red", cex = 0.5, pch = NULL, jitter = FALSE, xscale = NULL, yscale = NULL, ylines = 1.5, id = TRUE, xlab = FALSE, ylab = FALSE) { ## obtain dependent variable mf <- model.frame(mobobj) y <- Formula::model.part(mobobj$info$Formula, mf, lhs = 1L, rhs = 0L) if(isTRUE(ylab)) ylab <- names(y)[1L] if(identical(ylab, FALSE)) ylab <- "" if(is.null(ylines)) ylines <- ifelse(identical(ylab, ""), 0, 2) y <- y[[1L]] ## plotting character and response if(is.null(pch)) pch <- y[,2] * 18 + 1 y <- y[,1] y <- as.numeric(y) pch <- rep(pch, length.out = length(y)) if(jitter) y <- jitter(y) ## obtain explanatory variables x <- Formula::model.part(mobobj$info$Formula, mf, lhs = 0L, rhs = 1L) xnam <- colnames(x) z <- seq(from = min(x[,2]), to = max(x[,2]), length = 51) z <- data.frame(a = rep(sort(x[,1])[c(1, NROW(x))], c(51, 51)), b = z) names(z) <- names(x) z$x <- model.matrix(~ ., data = z) ## fitted node ids fitted <- mobobj$fitted[["(fitted)"]] if(is.null(xscale)) xscale <- range(x[,2]) + c(-0.1, 0.1) * diff(range(x[,2])) if(is.null(yscale)) yscale <- range(y) + c(-0.1, 0.1) * diff(range(y)) ## panel function for scatter plots in nodes rval <- function(node) { ## node index nid <- id_node(node) ix <- fitted %in% nodeids(mobobj, from = nid, terminal = TRUE) ## dependent variable y <- y[ix] ## predictions yhat <- if(is.null(node$info$object)) { refit.modelparty(mobobj, node = nid) } else { node$info$object } yhat <- predict(yhat, newdata = z, type = "quantile", p = 0.5) pch <- pch[ix] ## viewport setup top_vp <- viewport(layout = grid.layout(nrow = 2, ncol = 3, widths = unit(c(ylines, 1, 1), c("lines", "null", "lines")), heights = unit(c(1, 1), c("lines", "null"))), width = unit(1, "npc"), height = unit(1, "npc") - unit(2, "lines"), name = paste("node_scatterplot", nid, sep = "")) pushViewport(top_vp) grid.rect(gp = gpar(fill = "white", col = 0)) ## main title top <- viewport(layout.pos.col = 2, layout.pos.row = 1) pushViewport(top) mainlab <- paste(ifelse(id, paste("Node", nid, "(n = "), ""), info_node(node)$nobs, ifelse(id, ")", ""), sep = "") grid.text(mainlab) popViewport() plot_vp <- viewport(layout.pos.col = 2, layout.pos.row = 2, xscale = xscale, yscale = yscale, name = paste("node_scatterplot", nid, "plot", sep = "")) pushViewport(plot_vp) ## scatterplot grid.points(x[ix,2], y, gp = gpar(col = col, cex = cex), pch = pch) grid.lines(z[1:51,2], yhat[1:51], default.units = "native", gp = gpar(col = linecol)) grid.lines(z[52:102,2], yhat[52:102], default.units = "native", gp = gpar(col = linecol, lty = 2)) grid.xaxis(at = c(ceiling(xscale[1]*10), floor(xscale[2]*10))/10) grid.yaxis(at = c(ceiling(yscale[1]), floor(yscale[2]))) if(isTRUE(xlab)) xlab <- xnam[2] if(!identical(xlab, FALSE)) grid.text(xlab, x = unit(0.5, "npc"), y = unit(-2, "lines")) if(!identical(ylab, FALSE)) grid.text(ylab, y = unit(0.5, "npc"), x = unit(-2, "lines"), rot = 90) grid.rect(gp = gpar(fill = "transparent")) upViewport() upViewport() } return(rval) } class(gbsg2node) <- "grapcon_generator" plot(gbsg2_tree, terminal_panel = gbsg2node, tnex = 2, tp_args = list(xscale = c(0, 52), yscale = c(-0.5, 8.7))) @ \caption{\label{fig:GBSG2-scatter} Censored Weibull-regression-based tree for the German breast cancer data. 
The plots in the leaves depict censored (hollow) and uncensored (solid) survival time by number of positive lymph nodes along with fitted median survival for patients with (dashed line) and without (solid line) hormonal therapy.} \end{figure} % Based on progesterone receptor, a tree with two leaves is found: % <>= gbsg2_tree coef(gbsg2_tree) logLik(gbsg2_tree) @ % The visualization produced by \code{plot(gbsg2_tree)} is shown in Figure~\ref{fig:GBSG2}. A nicer graphical display using a scatter plot (with indication of censoring) and fitted regression curves is shown in Figure~\ref{fig:GBSG2-scatter}. This uses a custom panel function whose code is too long and elaborate to be shown here. Interested readers are referred to the \proglang{R} code underlying the vignette. The visualization shows that for higher progesterone receptor levels: (1)~survival times are higher overall, (2)~the treatment effect of hormonal therapy is higher, and (3)~the negative effect of the main risk factor (number of positive lymph nodes) is less severe. \section{Setting up a new mobster} \label{sec:mobster} To conclude this vignette, we present another illustration that shows how to set up new mobster functionality from scratch. To do so, we implement the Bradley-Terry tree suggested by \cite{Strobl+Wickelmaier+Zeileis:2011} ``by hand''. The \pkg{psychotree} package already provides an easy-to-use mobster called \code{bttree()} but as an implementation exercise we recreate its functionality here. The only inputs required are a suitable data set with paired comparisons (\code{Topmodel2007} from \pkg{psychotree}) and a parametric model for paired comparison data (\code{btmodel()} from \pkg{psychotools}, implementing the Bradley-Terry model). The latter optionally computes the empirical estimating functions and already comes with a suitable extractor method. <>= data("Topmodel2007", package = "psychotree") library("psychotools") @ % The Bradley-Terry (or Bradley-Terry-Luce) model is a standard model for paired comparisons in the social sciences. It parametrizes the probability $\pi_{ij}$ for preferring some object $i$ over another object $j$ in terms of corresponding ``ability'' or ``worth'' parameters $\theta_i$: \begin{eqnarray*} \pi_{ij}\phantom{)} & = & \frac{\theta_i}{\theta_i + \theta_j} \\ \mathsf{logit}(\pi_{ij}) & = & \log(\theta_i) - \log(\theta_j) \end{eqnarray*} This model can be easily estimated by maximum likelihood as a logistic or log-linear GLM. This is the approach used internally by \code{btmodel()}. The \code{Topmodel2007} data provide paired comparisons of attractiveness among the six finalists of the TV show \emph{Germany's Next Topmodel~2007}: Barbara, Anni, Hana, Fiona, Mandy, Anja. The data were collected in a survey with 192~respondents at Universit{\"a}t T{\"u}bingen and the available covariates comprise gender, age, and familiarity with the TV~show. The latter is assessed by three yes/no questions: (1)~Do you recognize the women?/Do you know the show? (2)~Did you watch it regularly? (3)~Did you watch the final show?/Do you know who won? To fit the Bradley-Terry tree to the data, the available building blocks just have to be tied together. First, we set up the basic/simple model fitting interface described in Section~\ref{sec:fit}: % <>= btfit1 <- function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...) btmodel(y, ...) @ % The function \code{btfit1()} simply calls \code{btmodel()} ignoring all arguments except \code{y} as the others are not needed here.
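%
As a quick plausibility check (a sketch, not evaluated here), the wrapper could be applied once to the full sample, which should give the same fit as calling \code{btmodel()} directly:
%
<<topmodel-check, eval=FALSE>>=
btfit1(Topmodel2007$preference)
@
%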
No more processing is required because \class{btmodel} objects come with \code{coef()}, \code{logLik()}, and \code{estfun()} methods. Hence, we can call \code{mob()} now, specifying the response and the partitioning variables (and no regressors because there are no regressors in this model). % <>= bt1 <- mob(preference ~ 1 | gender + age + q1 + q2 + q3, data = Topmodel2007, fit = btfit1) @ % An alternative way to fit the exact same tree somewhat more quickly would be to employ the extended interface described in Section~\ref{sec:fit}: % <>= btfit2 <- function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ..., estfun = FALSE, object = FALSE) { rval <- btmodel(y, ..., estfun = estfun, vcov = object) list( coefficients = rval$coefficients, objfun = -rval$loglik, estfun = if(estfun) rval$estfun else NULL, object = if(object) rval else NULL ) } @ % Still \code{btmodel()} is employed for fitting the model but the quantities \code{estfun} and \code{vcov} are only computed if they are really required. This may save some computation time -- about 20\% on the authors' machine at the time of writing -- when computing the tree: % <>= bt2 <- mob(preference ~ 1 | gender + age + q1 + q2 + q3, data = Topmodel2007, fit = btfit2) @ % The speed-up is not huge but comes almost for free: just a few additional lines of code in \code{btfit2()} are required. For other models where it is more costly to set up a full model (with variance-covariance matrix, predictions, residuals, etc.) larger speed-ups are also possible. Both trees, \code{bt1} and \code{bt2}, are equivalent (except for the details of the fitting function). Hence, in the following we only explore \code{bt2}. However, the same code can be applied to \code{bt1} as well. Many tools come completely for free and are inherited from the general \class{modelparty}/\class{party} infrastructure. For example, both printing (\code{print(bt2)}) and plotting (\code{plot(bt2)}) show the estimated parameters in the terminal nodes, which can also be extracted by the \code{coef()} method: <>= bt2 coef(bt2) @ The corresponding visualization is shown in the upper panel of Figure~\ref{fig:topmodel-plot1}. In all cases, the estimated coefficients on the logit scale omitting the fixed zero reference level (Anja) are reported. To show the corresponding worth parameters $\theta_i$ including the reference level, one can simply provide a small panel function \code{worthf()}. This applies the \code{worth()} function from \pkg{psychotools} to the fitted-model object stored in the \code{info} slot of each node, yielding the lower panel in Figure~\ref{fig:topmodel-plot1}. % <>= worthf <- function(info) paste(info$object$labels, format(round(worth(info$object), digits = 3)), sep = ": ") plot(bt2, FUN = worthf) @ % \begin{figure}[p!] \centering <>= plot(bt2) @ \vspace*{0.5cm} <>= <> @ \caption{\label{fig:topmodel-plot1} Bradley-Terry-based tree for the topmodel attractiveness data. The default plot (upper panel) reports the estimated coefficients on the log scale while the adapted plot (lower panel) shows the corresponding worth parameters.} \end{figure} % To show a graphical display of these worth parameters rather than printing their numerical values, one can use a simple glyph-style plot.
A simple way to generate these is to use the \code{plot()} method for \class{btmodel} objects from \pkg{psychotools} and to apply it to all terminal nodes via \code{nodeapply()} (see Figure~\ref{fig:topmodel-plot2}): % <>= par(mfrow = c(2, 2)) nodeapply(bt2, ids = c(3, 5, 6, 7), FUN = function(n) plot(n$info$object, main = n$id, ylim = c(0, 0.4))) @ % \begin{figure}[t!] \centering <>= <> @ \caption{\label{fig:topmodel-plot2} Worth parameters in the terminal nodes of the Bradley-Terry-based tree for the topmodel attractiveness data.} \end{figure} % Alternatively, one could set up a proper panel-generating function in \pkg{grid} that allows one to create the glyphs within the terminal nodes of the tree (see Figure~\ref{fig:topmodel-plot3}). As the code for this panel-generating function \code{node_btplot()} is too complicated to display within the vignette, we refer interested readers to the underlying code. Given this panel-generating function, Figure~\ref{fig:topmodel-plot3} can be generated with <>= plot(bt2, drop = TRUE, tnex = 2, terminal_panel = node_btplot(bt2, abbreviate = 1, yscale = c(0, 0.5))) @ \begin{figure}[t!] \centering <>= ## visualization function node_btplot <- function(mobobj, id = TRUE, worth = TRUE, names = TRUE, abbreviate = TRUE, index = TRUE, ref = TRUE, col = "black", linecol = "lightgray", cex = 0.5, pch = 19, xscale = NULL, yscale = NULL, ylines = 1.5) { ## node ids node <- nodeids(mobobj, terminal = FALSE) ## get all coefficients cf <- partykit:::apply_to_models(mobobj, node, FUN = function(z) if(worth) worth(z) else coef(z, all = FALSE, ref = TRUE)) cf <- do.call("rbind", cf) rownames(cf) <- node ## get one full model mod <- partykit:::apply_to_models(mobobj, node = 1L, FUN = NULL) if(!worth) { if(is.character(ref) | is.numeric(ref)) { reflab <- ref ref <- TRUE } else { reflab <- mod$ref } if(is.character(reflab)) reflab <- match(reflab, mod$labels) cf <- cf - cf[,reflab] } ## reference if(worth) { cf_ref <- 1/ncol(cf) } else { cf_ref <- 0 } ## labeling if(is.character(names)) { colnames(cf) <- names names <- TRUE } ## abbreviation if(is.logical(abbreviate)) { nlab <- max(nchar(colnames(cf))) abbreviate <- if(abbreviate) as.numeric(cut(nlab, c(-Inf, 1.5, 4.5, 7.5, Inf))) else nlab } colnames(cf) <- abbreviate(colnames(cf), abbreviate) if(index) { x <- 1:NCOL(cf) if(is.null(xscale)) xscale <- range(x) + c(-0.1, 0.1) * diff(range(x)) } else { x <- rep(0, length(cf)) if(is.null(xscale)) xscale <- c(-1, 1) } if(is.null(yscale)) yscale <- range(cf) + c(-0.1, 0.1) * diff(range(cf)) ## panel function for bt plots in nodes rval <- function(node) { ## node index id <- id_node(node) ## dependent variable setup cfi <- cf[id,] ## viewport setup top_vp <- viewport(layout = grid.layout(nrow = 2, ncol = 3, widths = unit(c(ylines, 1, 1), c("lines", "null", "lines")), heights = unit(c(1, 1), c("lines", "null"))), width = unit(1, "npc"), height = unit(1, "npc") - unit(2, "lines"), name = paste("node_btplot", id, sep = "")) pushViewport(top_vp) grid.rect(gp = gpar(fill = "white", col = 0)) ## main title top <- viewport(layout.pos.col = 2, layout.pos.row = 1) pushViewport(top) mainlab <- paste(ifelse(id, paste("Node", id, "(n = "), ""), info_node(node)$nobs, ifelse(id, ")", ""), sep = "") grid.text(mainlab) popViewport() ## actual plot plot_vpi <- viewport(layout.pos.col = 2, layout.pos.row = 2, xscale = xscale, yscale = yscale, name = paste("node_btplot", id, "plot", sep = "")) pushViewport(plot_vpi) grid.lines(xscale, c(cf_ref, cf_ref), gp = gpar(col = linecol), default.units = "native") if(index) {
grid.lines(x, cfi, gp = gpar(col = col, lty = 2), default.units = "native") grid.points(x, cfi, gp = gpar(col = col, cex = cex), pch = pch, default.units = "native") grid.xaxis(at = x, label = if(names) names(cfi) else x) } else { if(names) grid.text(names(cfi), x = x, y = cfi, default.units = "native") else grid.points(x, cfi, gp = gpar(col = col, cex = cex), pch = pch, default.units = "native") } grid.yaxis(at = c(ceiling(yscale[1] * 100)/100, floor(yscale[2] * 100)/100)) grid.rect(gp = gpar(fill = "transparent")) upViewport(2) } return(rval) } class(node_btplot) <- "grapcon_generator" plot(bt2, drop = TRUE, tnex = 2, terminal_panel = node_btplot(bt2, abbreviate = 1, yscale = c(0, 0.5))) @ \caption{\label{fig:topmodel-plot3} Bradley-Terry-based tree for the topmodel attractiveness data with visualizations of the worth parameters in the terminal nodes.} \end{figure} Finally, to illustrate how different predictions can be easily computed, we set up a small data set with three new individuals: % <>= tm <- data.frame(age = c(60, 25, 35), gender = c("male", "female", "female"), q1 = "no", q2 = c("no", "no", "yes"), q3 = "no") tm @ % For these we can easily compute (1)~the predicted node ID, (2)~the corresponding worth parameters, and (3)~the associated ranks. <>= tm predict(bt2, tm, type = "node") predict(bt2, tm, type = function(object) t(worth(object))) predict(bt2, tm, type = function(object) t(rank(-worth(object)))) @ This completes the tour of fitting, printing, plotting, and predicting the Bradley-Terry tree model. Convenience interfaces that employ code like that shown above can be easily defined and collected in new packages such as \pkg{psychotree}. \section{Conclusion} \label{sec:conclusion} The function \code{mob()} in the \pkg{partykit} package provides a new, flexible, and object-oriented implementation of the general algorithm for model-based recursive partitioning, using \pkg{partykit}'s recursive partytioning infrastructure. New model fitting functions can be easily provided, especially if standard extractor functions (such as \code{coef()}, \code{estfun()}, \code{logLik()}, etc.) are available. The resulting model trees can then be learned, visualized, investigated, and employed for predictions. To gain some efficiency in the computations and to allow for further customization (in particular specialized visualization and prediction methods), convenience interfaces \code{lmtree()} and \code{glmtree()} are provided for recursive partitioning based on (generalized) linear models. \bibliography{party} \end{document}