%\VignetteIndexEntry{Getting started with outbreak detection}
\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{natbib}
\bibliographystyle{apalike}
\usepackage{lmodern}
\usepackage{amsmath}
\usepackage{amsfonts,amssymb}
\newcommand{\pkg}[1]{{\bfseries #1}}
\newcommand{\surveillance}{\pkg{surveillance}}
\usepackage{hyperref}
\hypersetup{
  pdfauthor = {Michael H\"ohle and Andrea Riebler and Michaela Paul},
  pdftitle = {Getting started with outbreak detection},
  pdfsubject = {R package 'surveillance'}
}

\title{Getting started with outbreak detection}
\author{
  Michael H{\"o}hle\thanks{Corresponding author: Department of Statistics,
    University of Munich, Ludwigstr.\ 33, 80539 M{\"u}nchen, Germany,
    Email: \texttt{hoehle@stat.uni-muenchen.de}}, Andrea Riebler and Michaela Paul\\
  Department of Statistics\\
  University of Munich\\
  Germany
}
\date{17 November 2007}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Sweave
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{Sweave}
%Put all plots in a separate directory
\SweaveOpts{prefix.string=plots/surveillance, width=9, height=4.5}
\setkeys{Gin}{width=1\textwidth}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Initial R code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<<setup, echo=FALSE, results=hide>>=
library("surveillance")
options(SweaveHooks=list(fig=function() par(mar=c(4,4,2,0)+.5)))
options(width=70)

## create directory for plots
dir.create("plots", showWarnings=FALSE)

######################################################################
# Do we need to compute or can we just fetch cached results?
######################################################################
CACHEFILE <- "surveillance-cache.RData"
compute <- !file.exists(CACHEFILE)
message("Doing computations: ", compute)
if(!compute) load(CACHEFILE)
@

\begin{document}

\fbox{\vbox{\small
\noindent\textbf{Disclaimer}: This vignette reflects the state of the
package at version 0.9-7 and is hence somewhat outdated. New
functionality has been added to the package since then: this includes
various endemic-epidemic modelling frameworks for surveillance data
(\texttt{hhh4}, \texttt{twinSIR}, and \texttt{twinstim}), as well as
further outbreak detection methods (\texttt{glrnb}, \texttt{boda}, and
\texttt{farringtonFlexible}). These new features are described in
detail in \citet{meyer.etal2014} and \citet{salmon.etal2014},
respectively.
%and corresponding vignettes are included in the package;
%see \texttt{vignette(package = "surveillance")} for an overview.
Note in particular that use of the new \texttt{S4} class \texttt{sts}
instead of \texttt{disProg} is encouraged to encapsulate time series
data.
}}

{\let\newpage\relax\maketitle}

\begin{abstract}
\noindent This document gives an introduction to the \textsf{R} package
\surveillance\ containing tools for outbreak detection in routinely
collected surveillance data. The package contains implementations of
the procedures described by~\citet{stroup89} and \citet{farrington96}
as well as the system used at the Robert Koch Institute, Germany. For
evaluation purposes, the package contains example datasets and
functionality to generate surveillance data by simulation. To compare
the algorithms, benchmark numbers like sensitivity, specificity, and
detection delay can be computed for a set of time series.
As \surveillance\ is an open-source package, it should be easy to
integrate new algorithms; as an example of this process, a simple
Bayesian surveillance algorithm is described, implemented and
evaluated.\\
\noindent{\bf Keywords:} infectious disease, monitoring, aberrations,
outbreak, time series of counts.
\end{abstract}

\newpage

\section{Introduction}\label{sec:intro}

In an attempt to meet the threats of infectious diseases to society,
public health authorities have created comprehensive mechanisms for
the collection of disease data. As a consequence, the abundance of
data has demanded the development of automated algorithms for the
detection of abnormalities. Typically, such an algorithm monitors a
univariate time series of counts using a combination of heuristic
methods and statistical modelling. Prominent examples of surveillance
algorithms are the work by~\citet{stroup89} and~\citet{farrington96}.
A comprehensive survey of outbreak detection methods can be found
in~\citet{farrington2003}.

The \surveillance\ package was written with the aim of providing a
test-bench for surveillance algorithms. From the Comprehensive R
Archive Network (CRAN) the package can be downloaded together with its
source code. It allows users to test new algorithms and compare their
results with those of standard surveillance methods. A few real-world
outbreak datasets are included together with mechanisms for simulating
surveillance data. With the package at hand, comparisons like the one
described by~\citet{hutwagner2005} should be easy to conduct.

The purpose of this document is to illustrate the basic functionality
of the package with R-code examples. Section~\ref{sec:data} contains a
description of the data format used to store surveillance data,
mentions the built-in datasets, and illustrates how to create new
datasets by simulation. Section~\ref{sec:algo} gives a short
description of how to use the surveillance algorithms and how to
illustrate their results. Further information on the individual
functions can be found on the corresponding help pages of the package.

\section{Surveillance Data}\label{sec:data}

Denote by $\{y_t\>;t=1,\ldots,n\}$ the time series of counts
representing the surveillance data. Because such data typically are
collected on a weekly basis, we shall also use the alternative
notation $\{y_{i:j}\}$ with $j\in\{1,\ldots,52\}$ being the week
number in year $i\in\{-b,\ldots,-1,0\}$. That way the years are
indexed such that the most current year has index zero. For evaluation
of the outbreak detection algorithms it is also possible to store for
each week -- if known -- whether there was an outbreak that week. The
resulting bivariate series $\{(y_t,x_t)\>; t=1,\ldots,n\}$ is
represented in \surveillance\ by an object of class \texttt{disProg}
(disease progress), which is basically a \texttt{list} containing two
vectors: \texttt{observed}, the observed counts, and the boolean
vector \texttt{state} indicating whether there was a known outbreak
that week.

A number of time series are contained in the package (see
\texttt{data(package="surveillance")}), mainly originating from the
SurvStat@RKI database at \url{https://survstat.rki.de/} maintained by
the Robert Koch Institute, Germany~\citep{survstat}. For example, the
object \texttt{k1} describes cryptosporidiosis surveillance data for
the German federal state Baden-W\"{u}rttemberg 2001--2005. The peak in
2001 is due to an outbreak of cryptosporidiosis among a group of army
soldiers in a boot camp~\citep{bulletin3901}.

<<k1, fig=TRUE>>=
data(k1)
plot(k1, main = "Cryptosporidiosis in BW 2001-2005")
@
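Internally, a \texttt{disProg} object is little more than the list
described above, so small example series are easy to construct by
hand. The following sketch (not evaluated) builds a toy object from
hypothetical counts and outbreak states; it assumes the constructor
\texttt{create.disProg} with arguments \texttt{week}, \texttt{observed}
and \texttt{state} -- see its help page for the actual interface.

<<eval=FALSE>>=
## A minimal hand-made disProg object with hypothetical data.
y <- c(1, 0, 2, 1, 5, 9, 4, 2, 1, 0)   # observed weekly counts
x <- c(0, 0, 0, 0, 1, 1, 1, 0, 0, 0)   # 1 = known outbreak week
toy <- create.disProg(week = 1:10, observed = y, state = x)
plot(toy)
@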
For evaluation purposes it is also of interest to generate
surveillance data by simulation. The package contains functionality to
generate surveillance data containing point-source-like outbreaks, as
caused for example by a Salmonella serovar. The model is a Hidden
Markov Model (HMM), where a binary state $X_t$, $t=1,\ldots,n$,
denotes whether there was an outbreak, and $Y_t$ is the number of
observed counts; see Figure~\ref{fig:hmm}.

\begin{figure}[htb]
  \centering
  \includegraphics[width=.75\textwidth]{figures/HMM}
  \caption{The Hidden Markov Model}
  \label{fig:hmm}
\end{figure}

The state $X_t$ is a homogeneous Markov chain with transition matrix
\begin{center}
\begin{tabular}{c|cc}
$X_t\backslash X_{t+1}$ & 0 & 1\\
\hline
$0$ & $p$ & $1 - p$ \\
$1$ & $1 - r$ & $r$
\end{tabular}
\end{center}
Hence $1-p$ is the probability to switch to an outbreak state, and $r$
is the probability that an outbreak state $X_t=1$ is followed by
$X_{t+1}=1$. Furthermore, the observation $Y_t$ is Poisson-distributed
with log-link mean depending on a seasonal effect and a time trend,
i.e.\
\[
\log \mu_t = A \cdot \sin \, (\omega \cdot (t + \varphi)) + \alpha + \beta t.
\]
In case of an outbreak ($X_t=1$) the mean increases by $K$, altogether
\begin{equation}\label{eq:hmm}
  Y_t \sim \operatorname{Po}(\mu_t + K \cdot X_t).
\end{equation}
The model in (\ref{eq:hmm}) corresponds to a single-source,
common-vehicle outbreak, where the length of an outbreak is controlled
by the transition probability $r$. The weekly numbers of additional
outbreak cases are simply independently Poisson distributed with mean
$K$. A physiologically better motivated alternative could be to
operate with a stochastic incubation time (e.g.\ log-normal or gamma
distributed) for each individual exposed to the source, which results
in a temporal diffusion of the peak.

The advantage of (\ref{eq:hmm}) is that estimation can be done by a
generalized linear model (GLM) using $X_t$ as a covariate, and that it
allows for an easy definition of a correctly identified outbreak: each
$X_t=1$ has to be identified. More advanced setups would require more
involved definitions of an outbreak, e.g.\ as a connected series of
time instances where the number of outbreak cases is greater than
zero. Care is then required in defining what a correct identification
means for time-wise overlapping outbreaks.

In \surveillance\ the function \verb+sim.pointSource+ is used to
simulate such a point-source epidemic; the result is an object of
class \verb+disProg+.
\label{ex:sts}
<<>>=
set.seed(1234)
sts <- sim.pointSource(p = 0.99, r = 0.5, length = 400, A = 1,
                       alpha = 1, beta = 0, phi = 0, frequency = 1,
                       state = NULL, K = 1.7)
@
<<fig=TRUE>>=
plot(sts)
@
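To make the mechanics of model~(\ref{eq:hmm}) concrete, the chain and
the counts can also be simulated by hand outside the package. The
following sketch (not evaluated) mirrors the \verb+sim.pointSource+
call above, but note that the seasonal frequency $\omega$ is an
assumption here -- one annual peak for weekly data, i.e.\
$\omega=2\pi/52$; consult the help page of \verb+sim.pointSource+ for
the exact parameterization used by the package.

<<eval=FALSE>>=
## Hand-coded simulation of the HMM above (a sketch, not the
## package's own code): P(X[t+1]=1 | X[t]=0) = 1-p and
## P(X[t+1]=1 | X[t]=1) = r.
set.seed(1)
n <- 400; p <- 0.99; r <- 0.5
A <- 1; alpha <- 1; beta <- 0; phi <- 0; K <- 1.7
omega <- 2*pi/52                # assumed: one peak per 52 weeks
X <- numeric(n)                 # chain starts in the non-outbreak state
for (t in 2:n)
  X[t] <- rbinom(1, 1, ifelse(X[t-1] == 0, 1 - p, r))
mu <- exp(A * sin(omega * ((1:n) + phi)) + alpha + beta * (1:n))
Y <- rpois(n, mu + K * X)       # observed counts
@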
\section{Surveillance Algorithms}\label{sec:algo}

Surveillance data often exhibit strong seasonality; therefore, most
surveillance algorithms only use a set of so-called \emph{reference
values} as the basis for drawing conclusions. Let $y_{0:t}$ be the
number of cases in the current week (denoted week $t$ in year $0$),
$b$ the number of years to go back in time, and $w$ the number of
weeks around $t$ to include from those previous years. For the year
zero we use $w_0$ as the number of previous weeks to include --
typically $w_0=w$. Altogether, the set of reference values is thus
defined to be
\[
R(w,w_0,b) = \left(\bigcup\limits_{i=1}^b\bigcup\limits_{j=\,-w}^w
y_{-i:t+j}\right) \cup \left(\bigcup_{k=-w_0}^{-1} y_{0:t+k}\right).
\]
Note that the number of cases of the current week is not part of
$R(w,w_0,b)$.

A surveillance algorithm is a procedure using the reference values to
create a prediction $\hat{y}_{0:t}$ for the current week. This
prediction is then compared with the observed $y_{0:t}$: if the
observed number of cases is much higher than the predicted number, the
current week is flagged for further investigations. In order to do
surveillance for time $0:t$, an important concern is the choice of $b$
and $w$. Values as far back as time $-b:t-w$ contribute to
$R(w,w_0,b)$ and thus have to exist in the observed time series.

Currently, four different types of algorithms are implemented in
\surveillance: the Centers for Disease Control and Prevention (CDC)
method~\citep{stroup89}, the Communicable Disease Surveillance Centre
(CDSC) method~\citep{farrington96}, the method used at the Robert Koch
Institute (RKI), Germany~\citep{altmann2003}, and a Bayesian approach
documented in~\citet{riebler2004}. A detailed description of each
method is beyond the scope of this note, but to give an idea of the
framework, the Bayesian approach developed in~\citet{riebler2004} is
presented: within a Bayesian framework, quantiles of the posterior
predictive distribution are used as a measure for defining alarm
thresholds.

The model assumes that the reference values are independently and
identically Poisson distributed with parameter $\lambda$, and a gamma
distribution is used as prior for $\lambda$. The reference values are
defined to be $R_{\text{Bayes}}= R(w,w_0,b) = \{y_1, \ldots, y_{n}\}$,
and $y_{0:t}$ is the value we are trying to predict. Thus, $\lambda
\sim \text{Ga}(\alpha, \beta)$ and $y_i|\lambda \sim \text{Po}(\lambda)$,
$i = 1,\ldots,{n}$. Standard derivations show that the posterior
distribution is
\begin{equation*}
  \lambda|y_1, \ldots, y_{n} \sim \text{Ga}(\alpha + \sum_{i=1}^{n} y_i,
  \beta + n).
\end{equation*}
Computing the predictive distribution
\begin{equation*}
  f(y_{0:t}|y_1,\ldots,y_{n}) =
  \int\limits^\infty_0{f(y_{0:t}|\lambda)\,
  f(\lambda|y_1,\ldots,y_{n})}\, d\lambda
\end{equation*}
we get the Poisson-gamma distribution
\begin{equation*}
  y_{0:t}|y_1,\ldots,y_{n} \sim \text{PoGa}(\alpha + \sum_{i=1}^{n} y_i,
  \beta + n),
\end{equation*}
which is a generalization of the negative binomial distribution, i.e.\
\[
  y_{0:t}|y_1,\ldots,y_{n} \sim \text{NegBin}(\alpha + \sum_{i=1}^{n} y_i,
  \tfrac{\beta + n}{\beta + n + 1}).
\]
Using the Jeffreys prior $\text{Ga}(\tfrac{1}{2}, 0)$ as a
non-informative prior distribution for $\lambda$, the parameters of
the negative binomial distribution are
\begin{equation*}
  \alpha + \sum_{i=1}^{n} y_i = \frac{1}{2} +
  \sum_{y_{i:j} \in R_{\text{Bayes}}}\!\! y_{i:j}
  \quad\text{and}\quad
  \frac{\beta + n}{\beta + n + 1} =
  \frac{|R_{\text{Bayes}}|}{|R_{\text{Bayes}}| + 1}.
\end{equation*}
Using a quantile parameter $\alpha$ (not to be confused with the shape
parameter of the gamma prior), the smallest value $y_\alpha$ is
computed such that
\begin{equation*}
  P(y \leq y_\alpha) \geq 1-\alpha.
\end{equation*}
Now
\begin{equation*}
  A_{0:t} = I(y_{0:t} \geq y_\alpha),
\end{equation*}
i.e.\ if $y_{0:t}\geq y_\alpha$, the current week is flagged as an
alarm.
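Given the closed form above, the alarm threshold for a single week is
easy to compute by hand: \textsf{R}'s \texttt{qnbinom} returns exactly
the smallest value $y_\alpha$ with $P(y \leq y_\alpha) \geq 1-\alpha$,
and its \texttt{size}/\texttt{prob} parameterization matches the
$\text{NegBin}$ notation used above. The following sketch (not
evaluated) uses hypothetical reference values.

<<eval=FALSE>>=
## The Bayes alarm threshold for one week, under the Jeffreys
## prior Ga(1/2, 0).
ref  <- c(2, 0, 1, 3, 1, 2)              # hypothetical reference values
size <- 1/2 + sum(ref)                   # 1/2 + sum of reference values
prob <- length(ref) / (length(ref) + 1)  # (beta+n)/(beta+n+1), beta = 0
y.alpha <- qnbinom(1 - 0.01, size, prob) # threshold for alpha = 0.01
y.alpha                                  # alarm if current count >= y.alpha
@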
As an example, the \verb+Bayes1+ method uses the last six weeks as
reference values, i.e.\ $(w,w_0,b)=(6,6,0)$, and is applied to the
\texttt{k1} dataset with $\alpha=0.01$ as follows.

<<k1-bayes, fig=TRUE>>=
k1.b660 <- algo.bayes(k1,
  control = list(range = 27:192, b = 0, w = 6, alpha = 0.01))
plot(k1.b660, disease = "k1")
@

Several extensions of this simple Bayesian approach are imaginable:
for example, the innate over-dispersion of the data could be modelled
by using a negative binomial distribution, and time trends and
mechanisms to correct for past outbreaks could be integrated -- but
all at the cost of non-standard inference for the predictive
distribution. Here, simulation-based methods like Markov chain Monte
Carlo or heuristic approximations have to be used to obtain the
required alarm thresholds.

In general, the \surveillance\ package makes it easy to add additional
algorithms -- also those not based on reference values -- by using the
existing implementations as a starting point; a skeleton of such a
user-defined algorithm is sketched below.

The following call applies the CDC and Farrington procedures to the
simulated time series \verb+sts+ from page~\pageref{ex:sts}. Note that
the CDC procedure operates on four-week aggregated data -- to make the
upper bound value comparable, the aggregated counts for each week are
shown as circles in the plot.

<<cdc-farrington, eval=FALSE>>=
cntrl <- list(range = 300:400, m = 1, w = 3, b = 5, alpha = 0.01)
sts.cdc        <- algo.cdc(sts, control = cntrl)
sts.farrington <- algo.farrington(sts, control = cntrl)
@
<<echo=FALSE>>=
if (compute) {
<<cdc-farrington>>
}
@
<<fig=TRUE>>=
par(mfcol=c(1,2))
plot(sts.cdc, legend.opts=NULL)
plot(sts.farrington, legend.opts=NULL)
@

Typically, one is interested in evaluating the performance of the
various surveillance algorithms. An easy way is to look at the
sensitivity and specificity of the procedure -- a correct
identification of an outbreak is defined as follows: if the algorithm
raises an alarm for time $t$, i.e.\ $A_t=1$, and $X_t=1$, we have a
correct classification; if $A_t=1$ and $X_t=0$, we have a false
positive, etc. In case of more involved outbreak models, where an
outbreak lasts for more than one week, a correct identification could
mean that at least one of the outbreak weeks is correctly identified;
see e.g.\ \citet{hutwagner2005}.

To compute various performance scores, the function
\verb+algo.quality+ can be used on a \verb+survRes+ object.

<<>>=
print(algo.quality(k1.b660))
@

This computes the number of false positives, true negatives, false
negatives, the sensitivity, and the specificity. Furthermore,
\texttt{dist} is defined as
\[
  \sqrt{(\text{Spec}-1)^2 + (\text{Sens}-1)^2},
\]
that is, the Euclidean distance to the optimal point $(1,1)$, which
serves as a heuristic way of combining sensitivity and specificity
into a single score. Of course, weighted versions are also imaginable.
Finally, \texttt{lag} is the average number of weeks between the first
week of a consecutive run of $X_t=1$'s (i.e.\ an outbreak) and the
first alarm raised by the algorithm.
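As mentioned above, a new algorithm can be integrated by mimicking the
existing \texttt{algo.*} implementations. The following skeleton is
only a sketch under that convention: the returned \texttt{survRes}
structure (components \texttt{alarm}, \texttt{upperbound},
\texttt{disProgObj} and \texttt{control}) is assumed from the existing
implementations, and the three-standard-deviations rule is a
deliberately naive placeholder.

<<eval=FALSE>>=
## Sketch (not evaluated) of a user-defined algorithm: raise an alarm
## whenever the current count exceeds the mean of the last w
## observations by three standard deviations.
algo.rule3sd <- function(disProgObj, control = list(range = NULL, w = 6)) {
  y <- disProgObj$observed
  upperbound <- alarm <- numeric(length(control$range))
  for (k in seq_along(control$range)) {
    t <- control$range[k]
    ref <- y[(t - control$w):(t - 1)]   # reference values, cf. R(w,w,0)
    upperbound[k] <- mean(ref) + 3 * sd(ref)
    alarm[k] <- y[t] > upperbound[k]
  }
  structure(list(alarm = alarm, upperbound = upperbound,
                 disProgObj = disProgObj, control = control),
            class = "survRes")
}
@

Since the result carries the assumed \texttt{survRes} structure, it
can be passed to \texttt{algo.quality} in the same way as the output
of the built-in methods.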
To compare the results of several algorithms on a single time series,
we declare a list of control objects -- each containing the name and
settings of the algorithm we want to apply to the data.

<<>>=
control <- list(
  list(funcName = "rki1"), list(funcName = "rki2"),
  list(funcName = "rki3"), list(funcName = "bayes1"),
  list(funcName = "bayes2"), list(funcName = "bayes3"),
  list(funcName = "cdc", alpha = 0.05),
  list(funcName = "farrington", alpha = 0.05)
)
control <- lapply(control, function(ctrl) {
  ctrl$range <- 300:400; return(ctrl)
})
@
%
In the above, \texttt{rki1}, \texttt{rki2} and \texttt{rki3} are three
methods with reference values $R_\text{rki1}(6,6,0)$,
$R_\text{rki2}(6,6,1)$ and $R_\text{rki3}(4,0,2)$, all called with
$\alpha=0.05$. The \texttt{bayes*} methods use the Bayesian algorithm
with the same setup of reference values. The CDC method is special
since it operates on aggregated four-week blocks. To make everything
comparable, a common $\alpha=0.05$ level is used for all algorithms.
All algorithms in \texttt{control} are applied to \texttt{sts} using:

<<eval=FALSE>>=
algo.compare(algo.call(sts, control = control))
@
<<echo=FALSE>>=
if (compute) {
  acall <- algo.call(sts, control = control)
}
print(algo.compare(acall), digits = 3)
@

A test on a set of time series can be done as follows. Firstly, a list
containing 10 simulated time series is created. Secondly, all the
algorithms specified in the \texttt{control} object are applied to
each series. Finally, the results for the 10 series are combined in
one result matrix.

<<>>=
#Create 10 series
ten <- lapply(1:10, function(x) {
  sim.pointSource(p = 0.975, r = 0.5, length = 400, A = 1,
                  alpha = 1, beta = 0, phi = 0, frequency = 1,
                  state = NULL, K = 1.7)})
@
<<ten-surv, eval=FALSE>>=
#Do surveillance on all 10, get results as list
ten.surv <- lapply(ten, function(ts) {
  algo.compare(algo.call(ts, control = control))
})
@
<<echo=FALSE>>=
if (compute) {
<<ten-surv>>
}
@
<<eval=FALSE>>=
#Average results
algo.summary(ten.surv)
@
<<echo=FALSE>>=
print(algo.summary(ten.surv), digits = 3)
@

A similar procedure can be applied when evaluating the 14 surveillance
series drawn from SurvStat@RKI~\citep{survstat}. A problem, however,
is that after conversion to 52 weeks/year, these series are only 209
weeks long. This is insufficient to apply e.g.\ the CDC algorithm. To
conduct the comparison on as large a dataset as possible, the
following trick is used: the function \texttt{enlargeData} replicates
the requested \texttt{range} and inserts it before the original data,
after which the evaluation can be done on all 209 values.

<<>>=
#Update range in each - cyclic continuation
range <- (2*4*52) + 1:length(k1$observed)
control <- lapply(control, function(cntrl) {
  cntrl$range <- range; return(cntrl)})

#Auxiliary function to enlarge data
enlargeData <- function(disProgObj, range = 1:156, times = 1){
  disProgObj$observed <- c(rep(disProgObj$observed[range], times),
                           disProgObj$observed)
  disProgObj$state <- c(rep(disProgObj$state[range], times),
                        disProgObj$state)
  return(disProgObj)
}

#Outbreaks
outbrks <- c("m1", "m2", "m3", "m4", "m5", "q1_nrwh", "q2",
             "s1", "s2", "s3", "k1", "n1", "n2", "h1_nrwrp")

#Load and enlarge data.
outbrks <- lapply(outbrks, function(name) {
  data(list = name)
  enlargeData(get(name), range = 1:(4*52), times = 2)
})

#Apply the full set of algorithms to one outbreak series
one.survstat.surv <- function(outbrk) {
  algo.compare(algo.call(outbrk, control = control))
}
@
<<eval=FALSE>>=
algo.summary(lapply(outbrks, one.survstat.surv))
@
<<echo=FALSE>>=
if (compute) {
  res.survstat <- algo.summary(lapply(outbrks, one.survstat.surv))
}
print(res.survstat, digits = 3)
@

In both this study and the earlier simulation study, the Bayesian
approach seems to do quite well. However, the limited extent of the
comparisons does not allow for any more strongly supported statements.
Consult the work of~\citet{riebler2004} for a more thorough comparison
using simulation studies.

<<echo=FALSE, results=hide>>=
if (compute) { # save computed results
  save(list = c("sts.cdc", "sts.farrington", "acall", "res.survstat",
                "ten.surv"), file = CACHEFILE)
  tools::resaveRdaFiles(CACHEFILE)
}
@

\section{Discussion and Future Work}

Many extensions and additions are imaginable to improve the package.
For now, the package is intended as an academic tool providing a
test-bench for integrating new surveillance algorithms. Because all
algorithms are implemented in R, performance has not been a priority.
In particular, the current implementation of the Farrington procedure
is rather slow and would benefit from an optimization, e.g.\ by
rewriting critical fragments in C.

One important improvement would be to provide more involved mechanisms
for the simulation of epidemics. In particular, it would be
interesting to include multi-day outbreaks originating from
single-source exposure but delayed by varying incubation
times~\citep{hutwagner2005}, or SEIR-like
epidemics~\citep{andersson2000}. However, defining what is meant by a
correct outbreak identification, especially in the case of overlapping
outbreaks, creates new challenges which have to be met.

\section{Acknowledgements}

We are grateful to K.\ Stark and D.\ Altmann, RKI, Germany, for
discussions and information on the surveillance methods used by the
RKI. Our thanks to C.\ Lang, University of Munich, for his work on the
R implementation, and to M.\ Kobl, T.\ Schuster and M.\ Rossman,
University of Munich, for their initial work on gathering the outbreak
data from SurvStat@RKI. The research was conducted with financial
support from the Collaborative Research Centre SFB 386 funded by the
German Research Foundation (DFG).

\bibliography{references}

\end{document}