%\VignetteIndexEntry{Lexis functions for representation and analysis of follow-up data} \SweaveOpts{results=verbatim, keep.source=TRUE, include=FALSE, eps=FALSE} \documentclass[a4paper, dvipsnames, twoside, 12pt]{report} \newcommand{\Title}{\texttt{Lexis} functions in \texttt{Epi}\\ for representation and analysis of\\ follow-up data} \newcommand{\Tit}{\texttt{Lexis} FU} \newcommand{\Version}{Version 10} \newcommand{\Dates}{November 2024} \newcommand{\Where}{SDCC} \newcommand{\Homepage}{\url{http://bendixcarstensen.com/} } \newcommand{\Faculty}{\begin{tabular}{rl} Bendix Carstensen & Steno Diabetes Center Copenhagen, Herlev, Denmark\\ & {\small \& Department of Biostatistics, University of Copenhagen} \\ & \texttt{b@bxc.dk} \quad \texttt{bcar0029@regionh.dk} \\ & \url{http://BendixCarstensen.com} \\[1em] \end{tabular}} \input{topreport} \renewcommand{\rwpre}{./flup} \chapter*{Introduction} \addcontentsline{toc}{chapter}{Introduction} \section{Technicalities} First we set some graphics parameters for convenience and load the packages needed: <<>>= options(width = 90, show.signif.stars = FALSE, SweaveHooks=list(fig = function() par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0) / 1.6, las = 1, lend = "butt", bty = "n"))) library(Epi) library(popEpi) library(survival) clear() @ % must be after clear() because 'anfang' is used at the end <>= glmLexis <- glm.Lexis gamLexis <- gam.Lexis coxphLexis <- coxph.Lexis anfang <- Sys.time() cat("Start time:", format(anfang, "%F, %T"), "\n") @ % <>= vers <- data.frame(R = substr(R.version.string, 11, 15), Epi = as.character(packageVersion( "Epi")), popEpi = as.character(packageVersion("popEpi"))) names(vers) <- paste(" ", names(vers)) print(vers, row.names = FALSE) @ % \section{About this vignette} This vignette is an introduction to (parts of) the \texttt{Lexis} machinery in the \texttt{Epi} package, intended for representation and manipulation of follow-up data (``event history data'') from studies where exact dates of events are known. It accommodates follow-up through multiple states and on multiple time scales. This vignette uses a data example from the \texttt{Epi} package to illustrate the set-up of a simple \texttt{Lexis} object (a data frame of follow-up intervals), as well as the subdivision of follow-up intervals needed for multistate representation and analysis of transition rates by flexible parametric functions. The first chapter is exclusively on manipulation of the follow-up representation, but it points to the subsequent chapter where analysis is based on a \texttt{Lexis} representation with very small follow-up intervals. Chapter 2 uses analysis of mortality rates among Danish diabetes patients (available in the \texttt{Epi} package) currently on insulin treatment or not, to illustrate the use of \texttt{Lexis} object in the analysis of rates. Chapter 3 discusses creation and manipulation of multistate data, and chapter 4 is a list of all \texttt{Lexis} functions. \section{History} The \texttt{Lexis} machinery in the \texttt{Epi} package was first conceived and implemented by Martyn Plummer\cite{Plummer.2011,Carstensen.2011a}, and since its first appearance in the \texttt{Epi} package in 2008 it has been expanded with a number of utilities. An overview of these can be found in the last chapter of this note, \hyperlink{lexfun}{``\texttt{Lexis} functions''}. \subsection{Wilhelm Lexis} \begin{minipage}[t]{0.6\linewidth} \raggedright The \code{Lexis} functions(s) are named after the German demographer and economist Wilhelm Lexis (full name Wilhelm Hector Richard Albrecht Lexis) (17 July 1837 -- 24 August 1914), who in his book ``Einleitung in die Theorie der Bev\"{o}lkerungsstatistik'' [Introduction to the theory of population statistics] (1875) devised a diagram showing follow-up of persons on multiple time scales, notably calendar time and age. The diagram that nowadays is called a Lexis diagram, is usually drawn in a slightly different manner than that Lexis used in his book. The display of follow-up on two timescales naturally leads to representation on several time scales and statistical modeling of occurrence rates with two (or more) timescales as explanatory terms. Hence the naming of the machinery that represents follow-up on multiple time scales after Wilhelm Lexis. \end{minipage} \hfill \begin{minipage}[t]{0.35\linewidth} \ \\[-1em] \includegraphics[width=1.0\textwidth]{Wilhelm_Lexis} \end{minipage} \subsection{Modeling of rates} In 1980 John Whitehead published a paper: ``Fitting Cox's regression model to survival data using GLIM'', cite{Whitehead.1980} in which he devised the likelihood of a model for many small time bands with constant intensity in each, and demonstrated that Cox's partial likelihood could be approximated by a Poisson likelihood. This is basically what underlies the time-splitting and subsequent modeling of transition rates in this book. \noindent\\[2em] \ldots so there is very little (if anything) new in this note. \chapter{Representation of follow-up data in \texttt{Epi}} In the \texttt{Epi}-package, follow-up data is represented by adding some extra variables and a few attributes to a data frame. Such a data frame is called a \texttt{Lexis} object. The tools for handling follow-up data then use the structure of this for special plots, tabulations and modeling. Specifically, follow-up data requires a choice of time scale, a time of entry, a time of exit and an indication of the status at exit (normally either ``alive'' or ``dead'') for each person. Implicitly is also assumed a status \emph{during} the follow-up (usually ``alive''). \begin{figure}[htbp] \centering \setlength{\unitlength}{1pt} \begin{picture}(210, 70)(0, 75) %\scriptsize \thicklines \put( 0, 80){\makebox(0, 0)[r]{Age-scale}} \put( 50, 80){\line(1, 0){150}} \put( 50, 80){\line(0, 1){5}} \put(100, 80){\line(0, 1){5}} \put(150, 80){\line(0, 1){5}} \put(200, 80){\line(0, 1){5}} \put( 50, 77){\makebox(0, 0)[t]{35}} \put(100, 77){\makebox(0, 0)[t]{40}} \put(150, 77){\makebox(0, 0)[t]{45}} \put(200, 77){\makebox(0, 0)[t]{50}} \put( 0, 115){\makebox(0, 0)[r]{Follow-up}} \put( 80, 105){\makebox(0, 0)[r]{\small Two}} \put( 90, 105){\line(1, 0){87}} \put( 90, 100){\line(0, 1){10}} \put(100, 100){\line(0, 1){10}} \put(150, 100){\line(0, 1){10}} \put(180, 105){\circle{6}} \put( 95, 110){\makebox(0, 0)[b]{1}} \put(125, 110){\makebox(0, 0)[b]{5}} \put(165, 110){\makebox(0, 0)[b]{3}} \put( 50, 130){\makebox(0, 0)[r]{\small One}} \put( 60, 130){\line(1, 0){70}} \put( 60, 125){\line(0, 1){10}} \put(100, 125){\line(0, 1){10}} \put(130, 130){\circle*{6}} \put( 80, 135){\makebox(0, 0)[b]{4}} \put(115, 135){\makebox(0, 0)[b]{3}} \end{picture} \caption{\it Follow-up of two persons on the age-scale} \label{fig:fu2} \end{figure} \section{Time scales} A time scale is a variable that varies deterministically \emph{within} each person during follow-up, \textit{e.g.}: \begin{itemize}[noitemsep] \item Age \item Calendar time \item Time since start of treatment \item Time since relapse \end{itemize} All time scales advance at the same pace, so the time followed is the same on all time scales. Therefore, it will suffice to use only the entry point on each of the time scales, for example: \begin{itemize}[noitemsep] \item Age at entry \item Date of entry \item Time at treatment (\emph{at} treatment, time since treatment is 0) \item Time at relapse (\emph{at} relapse, time since relapse is 0) \end{itemize} In the \texttt{Epi} package, follow-up in a cohort is represented in a \texttt{Lexis} object. A \texttt{Lexis} object is a data frame with some extra structure to represent the follow-up. For the \texttt{DMlate} dataset of follow-up of diabetes patients in Denmark with recorded date of birth, date of diabetes, date of first insulin use, date of first oral drug use, date of exit and date of death --- we can construct a \texttt{Lexis} object by first including follow-up from entry at date of diabetes (\texttt{dodm}) to exit (\texttt{dox}): <<>>= data(DMlate) head(DMlate) dmL <- Lexis(entry = list(per = dodm, age = dodm-dobth, tfD = 0), exit = list(per = dox), exit.status = factor(!is.na(dodth), labels = c("DM", "Dead")), data = DMlate) timeScales(dmL) @ % The 4 excluded persons are persons with date of diabetes equal to date of death. The \texttt{entry} argument is a \emph{named} list with the entry points on each of the time scales we want to use. The names of the list defines the names of the time scales. The \texttt{exit} argument gives the exit time on \emph{one} of the time scales, so the name of the element in this list must match one of the names of the \texttt{entry} list. This is sufficient, because the follow-up time on all time scales is the same, in this case $\mathtt{dox}-\mathtt{dodm}$. The \texttt{exit.status} will normally be a categorical variable (a \emph{factor}) that indicates the exit status --- in this case whether the person (still) is in state \texttt{DM} or exits to \texttt{Dead} at the end of follow-up. We could also specify an \texttt{entry.status}; the default is to assume that all persons enter in the \emph{first} level of the factor \texttt{exit.state}s --- in this case \texttt{DM} (because $\mathtt{FALSE}<\mathtt{TRUE}$). Now take a look at the result: <<>>= str(dmL) head(dmL)[, 1:11] @ % The \texttt{Lexis} object \texttt{dmL} has a variable for each time scale, the value of which is the entry time for each person on this time scale. The length of the follow-up time is in the variable \texttt{lex.dur} (\texttt{dur}ation). Note that the exit status is in the variable \texttt{lex.Xst} (e\texttt{X}it \texttt{st}ate). The variable \texttt{lex.Cst} indicates the state where follow-up takes place (\texttt{C}urrent \texttt{st}ate), in this case \texttt{DM} (alive with diabetes) for all persons. This implies that observations \emph{censored} in state \texttt{A}, say, are characterized by having $\mathtt{lex.Cst} = \mathtt{lex.Xst} = \mathtt{A}$. There is a \texttt{summary} function for \texttt{Lexis} objects that lists the number of transitions and records as well as the total amount of follow-up time; it also (optionally) prints information about the names of the variables that constitute the time scales: <<>>= summary(dmL, timeScales = TRUE) @ % It is possible to get a visualization of the follow-up along the time scales chosen by using the \texttt{plot} method for \texttt{Lexis} objects. \texttt{dmL} is an object of \emph{class} \texttt{Lexis}, so using the function \texttt{plot()} on it means that \R\ will look for the function \texttt{plot.Lexis} and use this function. <>= plot(dmL) @ % The function allows quite a bit of control over the output, and a \texttt{points.Lexis} function allows plotting of the endpoints of follow-up: <>= par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0) / 1.6) plot(dmL, 1:2, lwd = 1, col = c("blue", "red")[dmL$sex], grid = TRUE, lty.grid = 1, col.grid = gray(0.7), xlim = 1960 + c(0, 60), xaxs = "i", ylim = 40 + c(0, 60), yaxs = "i", las = 1) points(dmL, 1:2, pch = c(NA, 3)[dmL$lex.Xst], col = "lightgray", lwd = 3, cex = 0.3) points(dmL, 1:2, pch = c(NA, 3)[dmL$lex.Xst], col = c("blue", "red")[dmL$sex], lwd = 1, cex = 0.3) box(bty = 'o') @ % In the above code you will note that the values of the arguments \texttt{col} and \texttt{pch} are indexed by factors, using the convention in \R\ that the index is taken as \emph{number of the level} of the supplied factor. Thus \texttt{c("blue", "red")[dmL\$sex]} is \texttt{"blue"} when \texttt{sex} is \texttt{M} (the first level of \texttt{sex}) and. \texttt{"red"} when \texttt{sex} is \texttt{F} (the second level of \texttt{sex}). The results of these two plotting commands are in figure \ref{fig:Lexis-diagram}, p. \pageref{fig:Lexis-diagram}. \begin{figure}[tb] \centering \includegraphics[width = 0.35\textwidth]{flup-dmL1} \includegraphics[width = 0.63\textwidth]{flup-dmL2} \caption{\it Lexis diagram of the \textrm{\tt DMlate} dataset; left panel is the default version, right panel is with some bells and whistles. The red lines are for women, blue for men, crosses indicate deaths.} \label{fig:Lexis-diagram} \end{figure} \section{Splitting the follow-up time along a time scale} In next chapter we shall conduct statistical analysis of mortality rates, and a prerequisite for parametric analysis of rates is that follow-up time is subdivided in smaller intervals, where we can reasonably assume that rates are constant. The follow-up time in a cohort can be subdivided (``split'') along a time scale, for example current age. This is achieved by the \texttt{splitLexis} (note that it is \emph{not} called \texttt{split.Lexis}). This requires that the time scale and the breakpoints on this time scale are supplied. Try: <<>>= dmS1 <- splitLexis(dmL, "age", breaks = seq(0, 100, 5)) summary(dmL) summary(dmS1) @ % We see that the number of persons and events and the amount of follow-up is the same in the two data sets; only the number of records differ --- the extra records all have \texttt{lex.Cst} = \texttt{DM} and \texttt{lex.Xst} = \texttt{DM}. To see how records are split for each individual, it is useful to list the results for a few individuals (whom we selected with a view to the illustrative usefulness): <<>>= wh.id <- c(9, 27, 52, 484) subset(dmL , lex.id %in% wh.id)[, 1:10] subset(dmS1, lex.id %in% wh.id)[, 1:10] @ % The resulting object, \texttt{dmS1}, is again a \texttt{Lexis} object. Note that the values of the timescales (\texttt{per}, \texttt{age}, \texttt{tfD}) are updated for each of the the resulting intervals. The follow-up in \texttt{dmS1} may be split further along another time scale, for example diabetes duration, \texttt{tfD}. Subsequently we list the results for the chosen individuals: <<>>= dmS2 <- splitLexis(dmS1, "tfD", breaks = c(0, 1, 2, 5, 10, 20, 30, 40)) subset(dmS2, lex.id %in% wh.id)[, 1:10] @ % A more efficient (and more intuitive) way of making this double split is to use the \texttt{splitMulti} function from the \texttt{popEpi} package: <<>>= dmM <- splitMulti(dmL, age = seq(0, 100, 5), tfD = c(0, 1, 2, 5, 10, 20, 30, 40), drop = FALSE) summary(dmS2) summary(dmM) @ % Note we used the argument \texttt{drop = FALSE} which will retain follow-up also outside the window defined by the range of the breaks. Otherwise, the default for \texttt{splitMulti} would be to drop follow-up outside \texttt{age} [0, 100] and \texttt{tfD} [0, 40]. This clipping behaviour is not available in \texttt{splitLexis}, nevertheless this may be exactly what we want in some situations. The recommended way of splitting follow-up time is by \texttt{splitMulti}, because it is faster. But you should be aware that the result is a \texttt{data.table} object unless you set the option \texttt{"popEpi.datatable" = FALSE}. In some circumstances \texttt{data.table}s behaves slightly differently from \texttt{data.frame}s. See the manual for \texttt{data.table}. \section{Cutting follow up time at dates of intermediate events} If we have a recording of the date of a specific event as for example recovery or relapse, we may classify follow-up time as being before or after this intermediate event, but it requires that follow-up records that straddle the event be cut in two and placed in separate records, one representing follow-up \emph{before} the intermediate event, and another representing follow-up \emph{after} the intermediate event. This is achieved with the function \texttt{cutLexis}, which takes three arguments: the time point of the intermediate event, the time scale that this point refers to, and the value of the (new) state following the date. Optionally, we may also define a new time scale with the argument \texttt{new.scale = }. We are interested in the time before and after inception of insulin use, which occurs at the date \texttt{doins}: <<>>= subset(dmL, lex.id %in% wh.id)[, 1:11] dmC <- cutLexis(data = dmL, cut = dmL$doins, timescale = "per", new.state = "Ins", new.scale = "tfI") subset(dmC, lex.id %in% wh.id)[, 1:11] @ % Note that the process of cutting time is simplified by having all types of events referred to the calendar time scale. This is a generally applicable advice in handling follow-up data: Get all event times as \emph{dates}, location of events and follow-up on other time scales can then easily be derived from this. Note that individual 52 has had his follow-up cut at 6.55 years from diabetes diagnosis and individual 484 at 5.70 years from diabetes diagnosis. This dataset could then be split along the time scales as we did before with \texttt{dmL}. The result of this can also be achieved by cutting the split dataset \texttt{dmS2} instead of \texttt{dmL}: <<>>= dmS2C <- cutLexis(data = dmS2, cut = dmS2$doins, timescale = "per", new.state = "Ins", new.scale = "tfI") subset(dmS2C, lex.id %in% wh.id)[, 1:11] @ % $ Thus it does not matter in which order we use \texttt{splitLexis} and \texttt{cutLexis}. Mathematicians would say that \texttt{splitLexis} and \texttt{cutLexis} are commutative. Note that for \texttt{lex.id} = 484, the follow-up subsequent to the event is classified as being in state \texttt{Ins}, but that the final transition to state \texttt{Dead} is preserved. Note that we defined a new time scale, \texttt{tfI}, using the argument \texttt{new.scale = "tfI"}. This has a special status relative to the other three time scales: it is defined as time since entry into a state, namely \texttt{Ins}, this is noted in the time scale part of the summary of \texttt{Lexis} object --- the information sits in the attribute \texttt{time.since} of the \texttt{Lexis} object, which can be accessed by the function \texttt{timeSince()} or through the \texttt{summary()}: <<>>= summary(dmS2C, timeScales = TRUE) @ % Finally we can get a quick overview of the states and transitions by using \texttt{boxes} --- \texttt{scale.R} scales transition rates to rates per 1000 PY: <>= boxes(dmC, boxpos = TRUE, scale.R = 1000, show.BE = TRUE) legendbox(70, 95) @ % \insfig{box1}{0.8}{States, person years, transitions and rates in the cut dataset. The numbers \emph{in} the boxes are person-years and the number of persons \texttt{B}eginning, resp. \texttt{E}nding their follow-up in each state (triggered by \textrm{\tt show.BE = TRUE}). The numbers at the arrows are the number of transitions and transition rates per 1000 (triggered by \textrm{\tt scale.R = 1000}).} The explanatory box in the upper right corner was generated by \texttt{legendbox}. \chapter{Modeling rates from \texttt{Lexis} objects} \section{Covariates} In the \texttt{Lexis} dataset \texttt{dmS2C} there are three types of covariates that can be used to describe mortality rates: \begin{enumerate}[noitemsep] \item time-dependent covariates \item time scales \item fixed covariates \end{enumerate} There is only one time-dependent covariate, namely \texttt{lex.Cst}, the current state of the person's follow up; it takes the values \texttt{DM} and \texttt{Ins} according to whether the person has ever purchased insulin at the beginning of a given follow-up interval. The time-scales are obvious candidates for explanatory variables for the rates, notably age and time from diagnosis (duration of diabetes) and insulin. \subsection{Time scales as covariates} If we want to model the effect of the time scale variables on occurrence rates, we will for each interval use either the value of the left endpoint in each interval or the middle. There is a function \texttt{timeBand} which returns either of these: <<>>= timeBand(dmS2C, "age", "middle")[1:10] # For nice printing and column labelling we use the data.frame() function: data.frame(dmS2C[, c("per", "age", "tfD", "lex.dur")], mid.age = timeBand(dmS2C, "age", "middle"), mid.t = timeBand(dmS2C, "tfD", "middle"), left.t = timeBand(dmS2C, "tfD", "left" ), right.t = timeBand(dmS2C, "tfD", "right" ), fact.t = timeBand(dmS2C, "tfD", "factor"))[1:15, ] @ % Note that the values of these functions are characteristics of the intervals defined by \texttt{breaks = }, \emph{not} the midpoints nor left or right endpoints of the \emph{actual} follow-up intervals (which would be \texttt{tfD} and \texttt{tfD+lex.dur}, respectively). These functions are intended for modeling time scale variables as factors (categorical variables) in which case the coding must be independent of the censoring and mortality pattern --- it should only depend on the chosen grouping of the time scale. Modeling time scales as \emph{quantitative} should not be based on these codings but directly on the values of the time-scale variables, i.e. the left endpoints of the intervals. \subsection{Differences between time scales} Apparently, the only fixed variable is \texttt{sex}, but the dates of birth (\texttt{dobth}), diagnosis (\texttt{dodm}) and first insulin purchase (\texttt{doins}) are available as fixed covariates too. These can be constructed as differences between time scales. These would then be age at birth (hardly relevant since it is the same for all persons), age at diabetes diagnosis and age at insulin treatment. \subsection{Keeping the relation between time scales} The midpoint (as well as the right interval endpoint) should be used with caution if the variable age at diagnosis, \texttt{dodm-dobth}, is modeled too; the age at diabetes is logically equal to the difference between current age (\texttt{age}) and time since diabetes diagnosis (\texttt{tfD}): <<>>= summary((dmS2$age - dmS2$tfD) - (dmS2$dodm - dmS2$dobth)) @ % This calculation refers to the value of the timescales at the \emph{beginning} of each interval --- which are in the timescale variables in the \texttt{Lexis} object. But when using the middle of the intervals, this relationship is not preserved: <<>>= summary(timeBand(dmS2, "age", "middle") - timeBand(dmS2, "tfD", "middle") - (dmS2$dodm - dmS2$dobth)) @ % If all three variables are to be included in a model, we must make sure that the \emph{substantial} relationship between the variables be maintained. One way is to recompute age at diabetes diagnosis from the two midpoint variables, but more straightforward would be to use the left endpoint of the intervals, that is the time scale variables in the \texttt{Lexis} object. If we dissolve the relationship between the variables \texttt{age}, \texttt{tfD} and age at diagnosis by grouping we may obtain identifiability of the three separate effects, but it will be at the expense of an arbitrary allocation of a linear trend between the three effects.. For the sake of clarity, consider current age, $a$, age at diagnosis $e$ and duration of disease, $d$, where \[ \text{current age} = \text{age at diagnosis} + \text{disease duration}, \quad \text{\ie} \quad a = e + d \quad \Leftrightarrow \quad e + d - a = 0 \] If we model the effect of the quantitative variables $a$, $e$ and $d$ on the log-rates by three functions $f$, $g$ and $h$: $\log(\lambda) = f(a) + g(d) + h(e)$ then for any $\kappa$: \begin{align*} \log(\lambda) & = f(a)+g(d)+h(e)+\kappa(e+d-a)\\ & = \big(f(a)-\kappa a \big)+ \big(g(d)+\kappa d \big)+ \big(h(e)+\kappa e \big) \\ & = \tilde f(a)+ \tilde g(d)+ \tilde h(e) \end{align*} In practical modeling this will turn up as a singular model matrix with one parameter aliased, corresponding to some arbitrarily chosen value of $\kappa$ (depending on software conventions for singular models). This phenomenon is well known from age-period-cohort models \cite{Carstensen.2007a}. Thus we see that we can move any slope around between the three terms, so if we achieve identifiability by using grouping of one of the variables we will in reality have settled for a particular value of $\kappa$, without knowing how and why we chose just that. There is \emph{no way} to separate the three effects. The only resorts are either to stick to predictions which are independent of the particular parametrization or to choose a parametrization with explicitly defined constraints clearly stated. \section{Modeling of rates} As mentioned, the purpose of subdividing follow-up data in smaller intervals is to be able to model effects of time scale variables as parametric functions. When we split along a time scale we can get intervals that are as small as we want; if they are sufficiently small, an assumption of constant rates in each interval becomes reasonable. In a model that assumes a constant occurrence rate in each of the intervals, the likelihood contribution from each interval is the same as the likelihood contribution from a Poisson variate $D$, say, with mean $\lambda \ell$ where $\lambda$ is the rate and $\ell$ is the interval length, and where the value of the variate $D$ is 1 or 0 according to whether an event has occurred or not. Moreover, the likelihood contributions from all follow-up intervals from a single person are \emph{conditionally} independent (conditional on having survived till the start of the interval in question). This implies that the total contribution to the likelihood from a single person is a product of terms, and hence the same as the likelihood of a number of independent Poisson terms, one from each interval. Note that the observations are neither Poisson distributed (\eg they can only ever assume values 0 or 1) nor independent --- it is only the \emph{likelihood} for the follow-up data that happens to be the same as the likelihood from independent Poisson variates because it is a product of terms. Different models can have the same likelihood; a model cannot be inferred from its likelihood. Parametric modeling of the rates is obtained by using the \emph{values} of the time scales for each interval as \emph{quantitative} explanatory variables, using for example splines. And of course also the values of the fixed covariates and the time-dependent variables for each interval. Thus the model will be one where the rate is assumed constant in each (small) interval, but where a parametric form of the \emph{size} of the rate in each interval is imposed by the model, using the time scale as a quantitative covariate. \subsection{Interval length} In the first chapter we illustrated cutting and splitting by listing the results for a few individuals across a number of intervals. For illustrational purposes we used 5-year age bands to avoid excessive listings, but since the doubling time for mortality on the age scale is only slightly larger than 5 years, the assumption about constant rates in each interval would be pretty far fetched if we were to use 5 year intervals. Thus, for modeling purposes we split the follow-up in 3 month intervals. When we use intervals of 3 months length it is superfluous to split along multiple time scales --- the precise location of tightly spaced splits will be irrelevant from any practical point of view. \texttt{splitLexis} and \texttt{splitMulti} will allocate the actual split times for all of the time scale variables, so these can be used directly in modeling. So we split the cut dataset in 3 months intervals along the age scale: <<>>= dmCs <- splitLexis(dmC, time.scale = "age", breaks = seq(0, 110, 1/4)) summary(dmCs, t = T) @ % We see that we now have 228,748 records and 9996 persons, so about 23 records per person. The total risk time is 54,275 years, a bit less than 3 months on average per record as expected. \subsection{Practicalities for splines} In this study we want to look at how mortality depend on age (\texttt{age}) and time since start of insulin use (\texttt{tfI}). If we want to use splines in the description we must allocate knots for anchoring the splines at each of the time scales, either by some \textit{ad hoc} method or by using some sort of penalized splines as for example by \texttt{gam}; the latter will not be treated here; it belongs in the realm of the \texttt{mgcv} package. Here we shall use the former approach and allocate 5 knots on each of the time-scales. We allocate knots so that we have the events evenly distributed between the knots. Since the insulin state starts at 0 for all individuals we include 0 as the first knot, such that any set of natural splines with these knots will have the value 0 at 0 on the time scale. <<>>= (a.kn <- with(subset(dmCs, lex.Xst == "Dead"), quantile(age+lex.dur, seq(5, 95, , 5) /100))) (i.kn <- c(0, with(subset(dmCs, lex.Xst == "Dead" & lex.Cst == "Ins"), quantile(tfI+lex.dur, seq(20, 95, , 4) / 100)))) @ % In the \texttt{Epi} package there is a convenience wrapper, \texttt{Ns}, for the \texttt{n}atural \texttt{s}pline generator \texttt{ns}, that takes the smallest and the largest of a set of supplied knots to be the boundary knots, so the explicit definition of the boundary knots becomes superfluous. Note that it is a feature of the \texttt{Ns} (via the features of \texttt{ns}) that any generated spline function is 0 at the leftmost knot (the smaller of the boundary knots). \subsection{Poisson models} A model that describes mortality rates as a function of only age (ignoring the insulin status) would then be: <<>>= ma <- glm((lex.Xst == "Dead") ~ Ns(age, knots = a.kn), family = poisson, offset = log(lex.dur), data = dmCs) summary(ma) @ % The offset, \texttt{log(lex.dur)} comes from the fact that the likelihood for the follow-up data during $\ell$ time is the same as that for independent Poisson variates with mean $\lambda \ell$, and that the default link function for the Poisson family is the log, so that we are using a linear model for the log-mean, $\log(\lambda) + \log(\ell)$. But when we want a model for the log-rate ($\log(\lambda)$), the term $\log(\ell)$ must still be included as a covariate, but with regression coefficient fixed to 1; a so-called \emph{offset}. This is however a technicality; it just exploits that the likelihood of a particular Poisson model and that of the rates model is the same. In the \texttt{Epi} package is a \texttt{glm} family, \texttt{poisreg}, that has a more intuitive interface to the likelihood of rates, namely where the response is a 2-column matrix of events and person-time, respectively. This is in concert with the fact that the outcome variable in follow-up studies is bivariate: (event, risk time). <<>>= Ma <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn), family = poisreg, data = dmCs) summary(Ma) @ % There is a convenience wrapper for \texttt{glm} with the \texttt{poisreg} family, exploiting the multistate structure in the \texttt{Lexis} object. It just requires specification of the transitions in terms of the arguments \texttt{from} and \texttt{to}: <<>>= Xa <- glmLexis(dmCs, formula = ~ Ns(age, knots = a.kn), from = "DM", to = "Dead",) @ % The result is a \texttt{glm} object but with an extra attribute, \texttt{Lexis} with the name of the data, transition(s) modeled and model formula <<>>= attr(Xa, "Lexis") @ % There are similar wrappers for \texttt{gam} and \texttt{coxph} models, \texttt{gamLexis} and \texttt{coxphLexis}, but these will not be elaborated in detail here. The \texttt{from=} argument can be omitted, in which case all possible transitions \emph{into} any of the ``\texttt{to}'' states is modeled. Similarly \texttt{to=} can be omitted, it defaults to the set of absorbing states. There are a couple of functions that show the absorbing and transient states: <<>>= transient(dmCs) absorbing(dmCs) preceding(dmCs, absorbing(dmCs)) @ So the default will be to model transitions from \texttt{DM} and \texttt{Ins} to \texttt{Dead}: <<>>= xa <- glmLexis(dmCs, formula = ~ Ns(age, knots = a.kn)) @ We can check if the four models fitted are the same: <<>>= c(ma = deviance(ma), Ma = deviance(Ma), Xa = deviance(Xa), xa = deviance(xa)) @ % Oops! the model \texttt{Xa} is apparently not the same as the other three? This is because the explicit specification \verb|from = "DM", to = "Dead"|, omits modeling contributions from the $\mathtt{Ins}\rightarrow\mathtt{Dead}$ transition --- the output actually said so --- see also figure \ref{fig:box1} on p. \pageref{fig:box1}. The other three models all use both transitions --- and assume that the two transition rates are the same, \ie that start of insulin has no effect on mortality. We shall relax this assumption later. The parameters from the model do not have any direct interpretation \textit{per se}, but we can compute the estimated mortality rates for a range of ages using \texttt{ci.pred} with a suitably defined prediction data frame. Note that if we use the \texttt{poisson} family of models, we must specify all covariates in the model, including the variable in the offset, \texttt{lex.dur} (remember that this was a covariate with coefficient fixed at 1). We set the latter to 1000, because we want the mortality rates per 1000 person-years. Using the \texttt{poisreg} family, the prediction will ignore any value of \texttt{lex.dur} specified in the prediction data frame, the returned rates will be per unit in which \texttt{lex.dur} is recorded when fitting the model. <>= nd <- data.frame(age = 40:85, lex.dur = 1000) pr.0 <- ci.pred(ma, newdata = nd) # mortality per 1000 PY pr.a <- ci.pred(Ma, newdata = nd)*1000 # mortality per 1000 PY summary(pr.0 / pr.a) matshade(nd$age, pr.a, plot = TRUE, type = "l", lty = 1, log = "y", xlab = "Age (years)", ylab = "DM mortality per 1000 PY") @ % $ \insfig{pr-a}{0.8}{Mortality among Danish diabetes patients by age with 95\% CI as shaded area. We see that the rates increase linearly on the log-scale, that is, exponentially by age.} \section{Time dependent covariates} One approach to modeling mortality rates by insulin status would be to assume that the mortality rate-ratio between patients on insulin and not on insulin is a fixed quantity, independent of time since start of insulin and independent of age. This is commonly termed a proportional hazards assumption, because the rates (hazards) in the two groups are proportional along the age (baseline time) scale. <<>>= pm <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn) + lex.Cst + sex, family = poisreg, data = dmCs) round(ci.exp(pm), 3) @ % Again we can simplify the code using \code{glmLexis}: <<>>= pm <- glmLexis(dmCs, ~ Ns(age, knots = a.kn) + lex.Cst + sex) round(ci.exp(pm), 3) @ % So we see that persons on insulin have about twice the mortality of persons not on insulin and that women have 2/3 the mortality of men. \chapter{Multiple time scales} \section{Time since insulin start} If we want to assess how the excess mortality depends on the time since start of insulin treatment, we can add a spline term in \texttt{tfI}, \texttt{t}ime \texttt{f}rom \texttt{I}nsulin start. But since \texttt{tfI} is a time scale defined as time since entry into a new state (\texttt{Ins}), the variable \texttt{tfI} is missing for those in the \texttt{DM} state, so before modeling we must set the \texttt{NA}s to 0, which we do with \texttt{tsNA20} (acronym for \texttt{t}ime\texttt{s}cale \texttt{NA}s to zero): <<>>= pm <- glm(cbind(lex.Xst == "Dead", lex.dur) ~ Ns(age, knots = a.kn) + Ns(tfI, knots = i.kn) + lex.Cst + sex, family = poisreg, data = tsNA20(dmCs)) @ % As noted before we could do this simpler with \texttt{glmLexis}, even without the \texttt{from=} and \texttt{to=} arguments, because we are modeling all transitions \emph{into} the absorbing state (\texttt{Dead}): <<>>= Pm <- glmLexis(tsNA20(dmCs), form = ~ Ns(age, knots = a.kn) + Ns(tfI, knots = i.kn) + lex.Cst + sex) c(deviance(Pm), deviance(pm)) identical(model.matrix(Pm), model.matrix(pm)) @ % The coding of the effect of \texttt{tfI} is so that the value is 0 at 0, so the meaning of the estimate of the effect of \texttt{lex.Cst} is the RR between persons with and without insulin, immediately after start of insulin: <<>>= round(ci.exp(Pm, subset = "ex"), 3) @ % We see that the effect of sex is pretty much the same as before, but the effect of \texttt{lex.Cst} is much larger; it now refers to a different quantity, namely the RR between persons just started on insulin (i.e. at time \texttt{tfI} = 0) and persons not on insulin. In the model \texttt{pm} above, the effect of \texttt{lex.Cst} was the average effect of insulin exposure, assuming that it was constant over time since start of insulin. If we want to see the effect of time since insulin, it is best viewed jointly with the effect of age, so we construct a prediction data frame --- a data frame with the explanatory variables from the model and values of these for which we want to see the predicted occurrence rates: <>= ndI <- data.frame(expand.grid(tfI = c(NA, seq(0, 15, 0.1)), ai = seq(40, 80, 10)), lex.Cst = "Ins", sex = "M") ndI <- transform(ndI, age = ai + tfI) head(ndI) ndA <- data.frame(age = seq(40, 100, 0.1), tfI = 0, lex.Cst = "DM", sex = "M") pri <- ci.pred(Pm, ndI) * 100 pra <- ci.pred(Pm, ndA) * 100 matshade(ndI$age, pri, plot = TRUE, xlab = "Attained age (years)", ylab = "DM mortality per 100 PY", las = 1, log = "y", lty = 1, col = "blue") matshade(ndA$age, pra) @ % \begin{expl} \texttt{expand.grid} yields a data frame with all combinations of \texttt{tfI} and \texttt{ai}, the latter is age at insulin start; we want predictions for different values of this. But it is (current) \texttt{age} that is in the model, so we must construct this. The \texttt{NA}s are inserted in order to produce separate curve for each value of \texttt{ai}. The prediction data frame for persons not on insulin is simpler, but must still include the \texttt{tfI} variable, but now uniformly set to 0. \texttt{ci.pred} will give predicted rates from the \texttt{Pm} model, per 1 person-year (because \texttt{lex.dur} is in years), so we multiply by 100 to get rates per 100 PY (\% / year). \texttt{matshade} produces curves with shaded confidence bands. \end{expl} \insfig{ins-time}{0.8}{Mortality rates of persons on insulin, starting insulin at ages 40, 50, \ldots, 80 (blue), compared with persons not on insulin (black curve). Shaded areas are 95\% CI.} In figure \ref{fig:ins-time}, p. \pageref{fig:ins-time}, we see that mortality is high just after insulin start, but falls by almost a factor 3 during the first year. Also we see that there is a tendency that mortality in a given age is smallest for those with the longest duration of insulin use. Or among those who started insulin first --- the two effects cannot be separated. \section{The Cox model} In the implementation of the Cox-model with \texttt{age} as baseline time scale, \texttt{age} appears as response variable, slightly counter-intuitive since it really is a covariate. Hence the age part of the linear predictors is not in the specification of the covariates: <<>>= cm <- coxph(Surv(age, age + lex.dur, lex.Xst == "Dead") ~ Ns(tfI, knots = i.kn) + lex.Cst + sex, data = tsNA20(dmCs)) @ % There is also a multistate wrapper for Cox models, requiring a l.h.s. side for the \texttt{formula =} argument: <<>>= Cm <- coxphLexis(tsNA20(dmCs), formula = age ~ Ns(tfI, knots = i.kn) + lex.Cst + sex) round(cbind(ci.exp(cm), ci.exp(Cm)), 4) @ % Note that this is really a model with two time scales: the baseline time scale \texttt{age} and the time since insulin, \texttt{tfi}. The effects of age and time since insulin are modeled differently, \texttt{age} non-parametrically and \texttt{tfI} with a smooth parametric spline. And only the spline effects is shown in the parameters. We can compare the estimates of the insulin effect from the Cox model with those from the Poisson model --- we must add \texttt{NA}s to the Cox-parameters for the comparison because the Cox-model does not give any parameters for the baseline time scale (\texttt{age}), but also remove one of the parameters, because \texttt{coxph} parametrizes factors (in this case \texttt{lex.Cst}) by all defined levels and not only by the levels present in the dataset at hand (note the line of \texttt{1.0000}s in the print above): <<>>= round(cbind(ci.exp(Pm), rbind(matrix(NA, 5, 3), ci.exp(cm)[-6, ])), 3) @ % Thus we see that the Poisson and Cox gives pretty much the same results with regards to the regression parameters, but only the Poisson gives a parametrization of the baseline hazard. You may argue that Cox requires a smaller dataset, because there is no need to subdivide data in small intervals \emph{before} insulin use. But certainly the time \emph{after} insulin inception needs to be subdivided in smaller intervals (as the \texttt{Lexis} data frame is) if the effect of this time should be modeled. The drawback of the Cox-modeling is that it is not possible to show the absolute rates as we did in figure \ref{fig:ins-time} on page \pageref{fig:ins-time}. \section{Marginal effect of time since insulin} When we plot the marginal effect of \texttt{tfI} from the two models we get pretty much the same; here we plot the RR relative to \texttt{tfI} = 2 years. Note that we are deriving the RR as the ratio of two sets of predictions, from the data frames \texttt{nd} and \texttt{nr}---variables assumed to have the same values in the two data frames need not be included in the prediction frames, but numerical variables omitted must be indicated in the \texttt{xvars=} argument. For further details, consult the help page for \texttt{ci.lin}, specifically the use of a list as the \texttt{ctr.mat} argument: <>= nd <- data.frame(tfI = seq(0, 15, , 151), lex.Cst = "Ins", sex = "M") nr <- data.frame(tfI = 2 , lex.Cst = "Ins", sex = "M") # We need to use xvars="age" in ci.exp because age is in the model # but not in the prediction frames nd and nr ppr <- ci.exp(pm, list(nd, nr), xvars = "age") cpr <- ci.exp(cm, list(nd, nr)) par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n") matshade(nd$tfI, cbind(ppr, cpr), plot = T, lty = c(1, 2), lwd = 3, log = "y", xlab = "Time since insulin (years)", ylab = "Mortality rate ratio") abline(h = 1, lty = 3) @ % \insfig{Ieff}{0.8}{The naked duration effects on mortality relative to 2 years of duration. Black from Poisson model, red from Cox model. The two sets of estimates are identical, and so are the standard errors, so the two shaded areas overlap almost perfectly.} In figure \ref{fig:Ieff}, p. \pageref{fig:Ieff}, we see that the duration effect is exactly the same from the two modeling approaches (Cox and Poisson). We will also want the RR relative to the non-insulin users --- recall these are coded 0 on the \texttt{tfI} variable: <>= nd <- data.frame(tfI = seq(0, 15, , 151), lex.Cst = "Ins", sex = "M") nr <- data.frame(tfI = 0 , lex.Cst = "DM" , sex = "M") ppr <- ci.exp(pm, list(nd, nr), xvars = "age") cpr <- ci.exp(cm, list(nd, nr)) par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n") matshade(nd$tfI, cbind(ppr, cpr), lwd = 3, xlab = "Time since insulin (years)", ylab = "Rate ratio relative to non-Insulin", lty = c(1, 2), log = "y", plot = TRUE) abline(h = 1, lty = 3) @ % \insfig{IeffR}{0.8}{Insulin duration effect (state \textrm{\tt Ins}) relative to no insulin (state \textrm{\tt DM}), black from Poisson model, red from Cox model. The \emph{shape} is the same as the previous figure, but the RR is now relative to non-insulin, instead of relative to insulin users at 2 years duration. The estimates from the Cox model and the Poisson model are virtually identical, and so are the standard errors, so the two shaded areas overlap almost perfectly.} In figure \ref{fig:IeffR}, p. \pageref{fig:IeffR}, we see the effect of increasing duration of insulin use \emph{for a fixed age} which is a bit artificial, so we would like to see the \emph{joint} effects of age and insulin duration. What we cannot see is how the duration affects mortality as a function of \emph{current} age (at the age attained at the same time as the attained \texttt{tfI}). Another way of interpreting this curve is as the rate ratio for a person on insulin relative to a person of the same age not on insulin, so we see that the RR (or hazard ratio, HR as some call it) is over 5 at the start of insulin (the \texttt{lex.Cst} estimate), and decreases to about 1.5 in the long term. Figure \ref{fig:ins-time}, \ref{fig:Ieff} and \ref{fig:IeffR} all indicate a declining RR by insulin duration, but only from figure \ref{fig:ins-time} it is visible that mortality actually is \emph{in}creasing by age after some 2 years after insulin start. This point would not be available if we had only fitted a Cox model where we do not have access to the baseline hazard as a function of age; the Cox model only gives the rate ratio of the blue to the black curve in \ref{fig:ins-time}. \section{Age$\times$duration interaction} The model we fitted assumes that the HR (or RR) is the same regardless of the age at start of insulin --- the hazards are multiplicative. Sometimes this is termed the \emph{proportional hazards} assumption: For \emph{any} fixed age the HR is the same as a function of time since insulin, and vice versa. A more correct term would be ``main effects model'' --- there is no interaction between age (the baseline time scale) and other covariates. So there is really no need for the term ``proportional hazards''; a well defined and precise statistical term for it has existed for eons. \subsection{Age at insulin start} In order to check the consistency of the multiplicative assumption across the spectrum of age at insulin inception, we can fit an interaction model. One approach to this --- which is not a completely general interaction --- would be using a non-linear effect of age at insulin inception (for convenience we use the same knots as for age). Note that the prediction data frames would be the same as we used above, because we do not compute age at insulin for use as a separate variable, but rather enter it in the model as the difference between current age (\texttt{age}) and insulin duration (\texttt{tfI}). At first glance we might think of doing: <<>>= ii <- glmLexis(tsNA20(dmCs), formula = ~ Ns(age , knots = a.kn) + Ns( tfI, knots = i.kn) + Ns(age - tfI, knots = a.kn) + lex.Cst + sex) @ % But this fits a model where the rate-ratio between persons with and without insulin at start of insulin (where \texttt{tfI} = 0) will be the same at any age, which is a bit too restrictive for the interaction we want. We want the \texttt{age-tfI} term to be specific for the insulin exposed so will use: <<>>= im <- glmLexis(tsNA20(dmCs), formula = ~ Ns(age , knots = a.kn) + Ns( tfI, knots = i.kn) + lex.Cst : Ns(age - tfI, knots = a.kn) + lex.Cst + sex) ci.exp(im) @ % The model (\texttt{im}) allows age-effects that differ non-linearly between persons with and without insulin, because the interaction term \texttt{lex.Cst:Ns(age-tfI...} for persons not on insulin is merely an age term (since \texttt{tfI} is coded 0 for all follow-up not on insulin). We can compare the two models fitted: <<>>= anova(ii, im, test = 'Chisq') @ % so we see that the second model (\texttt{im}, the interaction model) provides a substantial further improvement, by allowing non-linear HR along the age-scale. We can illustrate the estimated rates from the \texttt{im} model in figure \ref{fig:dur-int}, p. \pageref{fig:dur-int}: <>= pii <- ci.pred(im, ndI) pia <- ci.pred(im, ndA) par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0) / 1.6, las = 1, bty = "n") matshade(ndI$age, pii * 1000, plot = T, log = "y", xlab = "Age", ylab = "Mortality per 1000 PY", lty = 1, lwd = 2, col = c("blue", "forestgreen", "red"), alpha = 0.1) matshade(ndA$age, pia * 1000) @ % \insfig{dur-int}{0.8}{Age at insulin as interaction between age and duration, for persons starting insulin at ages 40, 50,\ldots (in blue) and persons not on insulin (in black).} We can also plot the RRs from the interaction model (figure \ref{fig:dur-int-RR}, p. \pageref{fig:dur-int-RR}); for this we need the reference frames, and the machinery from \texttt{ci.exp} allowing a list of two data frames: <>= ndR <- transform(ndI, tfI = 0, lex.Cst = "DM") cbind(head(ndI), head(ndR)) Rii <- ci.exp(im , list(ndI, ndR)) par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n") matshade(ndI$age, Rii, plot = T, log = "y", xlab = "Age (years)", ylab = "Rate ratio vs, non-Insulin", lty = 1, lwd = 2, col = c("blue", "forestgreen", "red"), alpha = 0.1) abline(h = 1) @ % \insfig{dur-int-RR}{0.9}{RRs from the interaction model.} \section{Separate models} In the above we insisted on making a joint model for the \texttt{DM}$\rightarrow$\texttt{Dead} and the \texttt{Ins}$\rightarrow$\texttt{Dead} transitions, but it would actually have been more sensible to model the two transitions separately: <<>>= dmd <- glmLexis(dmCs, from = "DM", to = "Dead", formula = ~ Ns(age, knots = a.kn) + sex) ind <- glmLexis(dmCs, from = "Ins", to = "Dead", formula = ~ Ns(age , knots = a.kn) + Ns( tfI, knots = i.kn) + Ns(age - tfI, knots = a.kn) + sex) ini <- ci.pred(ind, ndI) dmi <- ci.pred(dmd, ndI) dma <- ci.pred(dmd, ndA) @ % The estimated mortality rates are shown in figure \ref{fig:sep-mort}, p. \pageref{fig:sep-mort}, using: <>= par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n") matshade(ndI$age, ini * 100, plot = TRUE, log = "y", xlab = "Age (years)", ylab = "Mortality rates per 100 PY", lwd = 2, col = "red") matshade(ndA$age, dma*100, lwd = 2, col = "black") @ % $ The estimated RRs can now be computed exploiting the fact that the estimates from the two models are uncorrelated, and hence qualify for \texttt{ci.ratio}: <>= par(mar = c(3, 3, 1, 1), mgp = c(3, 1, 0)/1.6, las = 1, bty = "n") matshade(ndI$age, ci.ratio(ini, dmi), plot = TRUE, log = "y", xlab = "Age (years)", ylab = "RR insulin vs. no insulin", lwd = 2, col = "red") abline(h = 1) @ % \begin{figure}[tb] \centering \includegraphics[width = 0.49\textwidth]{flup-sep-mort} \includegraphics[width = 0.49\textwidth]{flup-sep-HR} \caption{\it Left panel: Mortality rates from separate models for the two mortality transitions; the \textrm{\tt DM}$\rightarrow$\textrm{\tt Dead} transition modeled by age alone; \textrm{\tt Ins}$\rightarrow$\textrm{\tt Dead} transition modeled with spline effects of current age, time since insulin and and age at insulin. \newline Right panel: Mortality HR of insulin vs. no insulin.} \label{fig:Ins-noIns} \end{figure} The only difference between the interaction model and the two separate models is that the latter allows different \texttt{sex}-effects between mortality rates from \texttt{DM} and \texttt{Ins}. There actually \emph{is} a difference between the estimates but hardly visible. \chapter{More states} \section{Subdividing states} It may be of interest to subdivide the state \texttt{Dead} according to whether the event has occurred or not. This will enable us to estimate the number of patients that ever go on insulin. This is done with \texttt{cutLexis} by using the argument \texttt{split.states = TRUE}. <<>>= dmCs <- cutLexis(data = dmS2, cut = dmS2$doins, timescale = "per", new.state = "Ins", new.scale = "tfI", split.states = TRUE) summary(dmCs) @ % We can illustrate the numbers and the transitions (figure \ref{fig:box4}, p. \pageref{fig:box4}) <>= boxes(dmCs, boxpos = list(x = c(15, 15, 85, 85), y = c(85, 15, 85, 15)), scale.R = 1000, show.BE = TRUE) legendbox(70, 50) @ % $ \insfig{box4}{0.7}{Transitions between 4 states: the numbers \emph{in} the boxes are person-years (middle), and below the number of persons who start, respectively end their follow-up in each of the states.} Note that it is only the mortality rates that we have been modeling, namely the transitions \texttt{DM}$\rightarrow$\texttt{Dead} and \texttt{Ins}$\rightarrow$\texttt{Dead(Ins)}. If we were to model the cumulative risk of using insulin or currently being on insulin we would also need a model for the DM$\rightarrow$Ins transition. Subsequent to that we would then compute the probability of being in each state conditional on suitable starting conditions (including time of start). With models where transition rates depend on several time scales this is not a trivial task. This is treated in more detail in the vignette on \texttt{simLexis}. \section{Multiple intermediate events} We may be interested in initiation of either insulin or OAD (oral anti-diabetic drugs), thus giving rise to more states and more time scales. This can be accomplished by the \texttt{mcutLexis} function, that generalizes \texttt{cutLexis}: <<>>= dmM <- mcutLexis(dmL, timescale = "per", wh = c("doins", "dooad"), new.states = c("Ins", "OAD"), new.scales = c("tfI", "tfO"), ties.resolve = TRUE) @ % The \texttt{Lexis} machinery does not know what a reasonable order of states is, so that will have to be fixed by hand using \texttt{Relevel}: <<>>= levels(dmM) dmM <- Relevel(dmM, c("DM", "OAD", "Ins", "OAD-Ins", "Ins-OAD", "Dead")) summary(dmM, t = T) @ % We see that we now have two time scales defined as time since entry into states. <<>>= wh <- c(subset(dmM, lex.Cst == "Ins-OAD")$lex.id[1:2], subset(dmM, lex.Cst == "OAD-Ins")$lex.id[1:2]) print(subset(dmM, lex.id %in% wh), nd = 2) @ % \begin{expl} We use subset to locate all records with \texttt{lex.Cst} equal to \texttt{Ins-OAD}, resp. \texttt{OAD-Ins}, and extract the ids (\texttt{lex.id}) from these. We the select the two first of each, and print all records for these persons. \end{expl} We can also illustrate the transitions to the different states, as in figure \ref{fig:mbox} --- the specification of the \texttt{boxpos} argument is facilitated by the logical ordering of the states <>= boxes(dmM, boxpos = list(x = c(15, 40, 40, 85, 85, 80), y = c(50, 90, 10, 90, 10, 50)), scale.R = 1000, show.BE = TRUE) legendbox(6, 95) @ % \insfig{mbox}{1.0}{Boxes for the \textrm{\tt dmM} object. The numbers \emph{in} the boxes are person-years (middle), and below the number of persons who start, respectively end their follow-up in each of the states.} We may not be interested in whether persons were prescribed insulin before OAD or vice versa, in which case we would combine the levels with both insulin and OAD to one, regardless of order (figure \ref{fig:mboxr}): <>= summary(dmMr <- Relevel(dmM, list(1, 2, 3, 'OAD+Ins' = 4:5, 6))) boxes(dmMr, boxpos = list(x = c(15, 15, 85, 85, 50), y = c(85, 15, 85, 15, 50)), scale.R = 1000, show.BE = TRUE) @ % \insfig{mboxr}{1.0}{Boxes for the \textrm{\tt dmMr} object with collapsed states. The numbers \emph{in} the boxes are person-years (middle), and below the number of persons who start, respectively end their follow-up in each of the states.} Diagrams as those in figures \ref{fig:mbox} and \ref{fig:mboxr} gives an overview of the possible transitions, which states it might be relevant to collapse, and which transitions to model and how. \subsection{Modeling rates} The modeling of the transition rates is straightforward; split the data along some timescale and then use \texttt{glmLexis} or \texttt{gamLexis}, where it is possible to select the transitions modeled. This is also possible with the \texttt{coxphLexis} function, but it requires that a single time scale be selected as the baseline time scale, and the effect of this will not be accessible. Here is a brief sketch of models that might be considered: <<>>= dmMs <- splitMulti(dmMr, age = 0:100) summary(dmMs) levels(dmMs) rateIns <- gamLexis(dmMr, ~ s(age) + lex.Cst, from = 1:2 , to = 3:4) rateOAD <- gamLexis(dmMr, ~ s(age) + lex.Cst, from = c(1,3), to = c(2, 4)) rateDth <- gamLexis(dmMr, ~ s(age) + lex.Cst) ci.exp(rateIns, subset = "lex") ci.exp(rateOAD, subset = "lex") ci.exp(rateDth, subset = "lex") @ % \chapter{\texttt{Lexis} functions} \hypertarget{lexfun}{The \texttt{Lexis} machinery} has evolved over time since it was first introduced in a workable version in \texttt{Epi\_1.0.5} in August 2008. Over the years there have been additions of tools for handling multistate data. Here is a list of the current functions relating to \texttt{Lexis} objects with a very brief description; it does not replace the documentation, so read that before use. Unless otherwise stated, functions named \texttt{something.Lexis} (with a ``\texttt{.}'') are S3 methods for \texttt{Lexis} objects, so you can skip the ``\texttt{.Lexis}'' in daily use. \setlist{noitemsep} \begin{description} \item[Define]\ \\ \begin{description} \item[\texttt{cal.yr}] transforms \texttt{Date} variables (measured in days) to \texttt{cal.yr} format (measured in years) \item[\texttt{Lexis}] defines a \texttt{Lexis} object \end{description} \item[Cut and split]\ \\ \begin{description} \item[\texttt{cutLexis}] cut follow-up at intermediate event \item[\texttt{mcutLexis}] cut follow-up at multiple intermediate events, keeping track of history \item[\texttt{rcutLexis}] cut follow-up at intermediate, possibly recurring, events, only recording the current state \item[\texttt{countLexis}] cut follow-up at intermediate event time and count the no. events so far \item[\texttt{splitLexis}] split follow up along a time scale \item[\texttt{splitMulti}] split follow up along several time scales --- from the \texttt{popEpi} package, faster and has simpler syntax than \texttt{splitLexis} \item[\texttt{addCov.Lexis}] add clinical measurements at a given date to a \texttt{Lexis} object \item[\texttt{addDrug.Lexis}] add drug exposures to a \texttt{Lexis} object \item[\texttt{coarse.Lexis}] combine successive records in a \texttt{Lexis} object \end{description} \item[Boxes and plots]\ \\ \begin{description} \item[\texttt{boxes.Lexis}] draw a diagram of states and transitions \item[\texttt{legendbox}] draw a box explaining the numbers output by \texttt{boxes.Lexis} \item[\texttt{plot.Lexis}] draw a standard Lexis diagram \item[\texttt{points.Lexis}] add points to a Lexis diagram \item[\texttt{lines.Lexis}] add lines to a Lexis diagram \item[\texttt{PY.ann.Lexis}] annotate life lines in a Lexis diagram \end{description} \newpage \item[Summarize and query]\ \\ \begin{description} \item[\texttt{summary.Lexis}] overview of transitions, risk time etc. \item[\texttt{levels.Lexis}] what are the states in the \texttt{Lexis} object \item[\texttt{nid.Lexis}] number of persons in the \texttt{Lexis} object --- how many unique values of \texttt{lex.id} are present \item[\texttt{entry}] entry time \item[\texttt{exit}] exit time \item[\texttt{status}] status at entry or exit \item[\texttt{timeBand}] factor of time bands \item[\texttt{timeScales}] what time scales are in the \texttt{Lexis} object \item[\texttt{timeSince}] what time scales are defined as time since a given state \item[\texttt{breaks}] what breaks are currently defined \item[\texttt{absorbing}] what are the absorbing states \item[\texttt{transient}] what are the transient states \item[\texttt{preceding}, \texttt{before}] which states precede this \item[\texttt{succeeding}, \texttt{after}] which states can follow this \item[\texttt{tmat.Lexis}] transition matrix for the \texttt{Lexis} object \end{description} \item[Manipulate]\ \\ \begin{description} \item[\texttt{subset.Lexis}, \texttt{[}] subset of a \texttt{Lexis} object \item[\texttt{merge.Lexis}] merges a \texttt{Lexis} objects with a \texttt{data.frame} \item[\texttt{cbind.Lexis}] bind a \texttt{data.frame} to a \texttt{Lexis} object \item[\texttt{rbind.Lexis}] put two \texttt{Lexis} objects head-to-foot \item[\texttt{transform.Lexis}] transform and add variables \item[\texttt{tsNA20}] turn \texttt{NA}s to 0s for time scales \item[\texttt{factorize.Lexis}] turn \texttt{lex.Cst} and \texttt{lex.Xst} into factors with levels equal to the actually occurring values in both \item[\texttt{Relevel.Lexis}] reorder and/or combine states \item[\texttt{rm.tr}] remove transitions from a \texttt{Lexis} object \item[\texttt{bootLexis}] bootstrap sample of \emph{persons} (\texttt{lex.id}) from a \texttt{Lexis} object \item[\texttt{unLexis}] remove \texttt{Lexis} attributes from a \texttt{Lexis} object \end{description} \item[Simulate]\ \\ \begin{description} \item[\texttt{simLexis}] simulate a \texttt{Lexis} object from specified transition rate models \item[\texttt{nState}, \texttt{pState}] count state occupancy from a simulated \texttt{Lexis} object \item[\texttt{plot.pState}, \texttt{lines.pState}] plot state occupancy from a \texttt{pState} object \end{description} \item[Stack]\ \\ \begin{description} \item[\texttt{stack.Lexis}] make a stacked object for simultaneous analysis of transitions --- returns a \texttt{stacked.Lexis} object \item[\texttt{subset.stacked.Lexis}] subsets of a \texttt{stacked.Lexis} object \item[\texttt{transform.stacked.Lexis}] transform a \texttt{stacked.Lexis} object \end{description} \newpage \item[Interface to other packages]\ \\ \begin{description} \item[\texttt{msdata.Lexis}] interface to \texttt{mstate} package \item[\texttt{etm.Lexis}] interface to \texttt{etm} package \item[\texttt{crr.Lexis}] interface to \texttt{cmprsk} package \end{description} \item[Statistical models]\ \\ \begin{description} \item[\texttt{AaJ.Lexis}] compute the Aalen-Johansen estimator for a \texttt{Lexis} object --- wrapper for \texttt{survfit} from \texttt{survival} \item[\texttt{ci.Crisk}] compute cumulative risks from model objects for rates \item[\texttt{glmLexis}] fit a \texttt{glm} model using the \texttt{poisreg} family to (hopefully) time split data \item[\texttt{gamLexis}] fit a \texttt{gam} model (from the \texttt{mgcv} package) using the \texttt{poisreg} family to (hopefully) time split data \item[\texttt{coxphLexis}] fit a Cox model to follow-up in a \texttt{Lexis} object \item In versions of Epi up to 2.56 these functions were called \texttt{glm.Lexis}, \texttt{gam.Lexis} and \texttt{coxph.Lexis} --- but they are not S3 methods so naming was illogical. The versions with the old names still exist in \texttt{Epi} for backward compatibility. \end{description} \end{description} <>= ende <- Sys.time() cat(" Start time:", format(anfang, "%F, %T"), "\n End time:", format( ende, "%F, %T"), "\nElapsed time:", round(difftime(ende, anfang, units = "mins"), 2), "minutes\n") @ % \renewcommand{\bibname}{References} \bibliographystyle{plain} \begin{thebibliography}{1} \bibitem{Plummer.2011} M~Plummer and B~Carstensen. \newblock Lexis: An {R} class for epidemiological studies with long-term follow-up. \newblock {\em Journal of Statistical Software}, 38(5):1--12, 1 2011. \bibitem{Carstensen.2011a} B~Carstensen and M~Plummer. \newblock Using {L}exis objects for multi-state models in {R}. \newblock {\em Journal of Statistical Software}, 38(6):1--18, 1 2011. \bibitem{Iacobelli.2013} S~Iacobelli and B~Carstensen. \newblock {Multiple time scales in multi-state models}. \newblock {\em Stat Med}, 32(30):5315--5327, Dec 2013. \bibitem{Carstensen.2007a} B~Carstensen. \newblock Age-{P}eriod-{C}ohort models for the {L}exis diagram. \newblock {\em Statistics in Medicine}, 26(15):3018--3045, July 2007. \bibitem{Whitehead.1980} J~Whitehead. \newblock Fitting {C}ox's regression model to survival data using {GLIM}. \newblock {\em Applied Statistics}, 29(3):268--275, 1980. \end{thebibliography} \addcontentsline{toc}{chapter}{\bibname} \end{document}