%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{MALDIquant: Quantitative Analysis of Mass Spectrometry Data} %\VignetteKeywords{Bioinformatics, Proteomics, Mass Spectrometry} %\VignettePackage{MALDIquant} \documentclass[12pt]{article} \usepackage{natbib} \usepackage{hyperref} \usepackage{tikz} \usepackage{bibentry} % inline bibentries \nobibliography* % no special bibliography for bibentry \newcommand{\R}{\texttt{R}} \newcommand{\CRAN}{\texttt{CRAN}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\Mq}{\Rpackage{MALDIquant}} \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} \newcommand{\workflow}{ \tikzstyle{line} = [draw, -latex] \tikzstyle{mqbox}=[rectangle, draw, text width=8.5em, text centered, minimum height=2.0em, fill=white] \begin{tikzpicture}[node distance=3.2em] % nodes \node [mqbox] (di) {Data Import}; \node [mqbox, below of=di] (qc) {Quality Control}; \node [mqbox, below of=qc] (vs) {Transformation \& Smoothing}; \node [mqbox, below of=vs] (bc) {Baseline Correction}; \node [mqbox, below of=bc] (cb) {Intensity Calibration}; \node [mqbox, below of=cb] (sa) {Spectra Alignment}; \node [mqbox, below of=sa] (pd) {Peak Detection}; \node [mqbox, below of=pd] (pb) {Peak Binning}; \node [mqbox, below of=pb] (fm) {Feature Matrix}; % edges \path [line] (di) -- (qc); \path [line] (qc) -- (vs); \path [line] (vs) -- (bc); \path [line] (bc) -- (cb); \path [line] (cb) -- (sa); \path [line] (sa) -- (pd); \path [line] (pd) -- (pb); \path [line] (pb) -- (fm); \end{tikzpicture} } \title{\Mq{}: Quantitative Analysis of Mass Spectrometry Data} \author{ Sebastian Gibb% \thanks{\email{mail@sebastiangibb.de}} } \date{\today} \begin{document} <>= library("knitr") opts_chunk$set(tidy.opts=list(width.cutoff=45), tidy=FALSE, fig.align="center", fig.height=4.25, comment=NA, prompt=TRUE) @ <>= suppressPackageStartupMessages(library("MALDIquant")) @ \maketitle \begin{abstract} \Mq{} provides a complete analysis pipeline for MALDI-TOF and other 2D mass spectrometry data.\\ This vignette describes the usage of the \Mq{} package and guides the user through a typical preprocessing workflow. \end{abstract} \clearpage \tableofcontents \section*{Foreword} \Mq{} is free and open source software for the \R{} \citep{RPROJECT} environment and under active development. If you use it, please support the project by citing it in publications: \begin{quote} \bibentry{MALDIquant} \end{quote} If you have any questions, bugs, or suggestions do not hesitate to contact me (\email{mail@sebastiangibb.de}). \\ Please visit \url{http://strimmerlab.github.io/software/maldiquant/}. \section{Introduction} \Mq{} comprising all steps from importing of raw data, preprocessing (e.g. baseline removal), peak detection and non-linear peak alignment to calibration of mass spectra. \Mq{} was initially developed for clinical proteomics using Matrix-Assisted Laser Desorption/Ionization (MALDI) technology. However, the algorithms implemented in \Mq{} are generic and may be equally applied to other 2D mass spectrometry data. \Mq{} was carefully designed to be independent of any specific mass spectrometry hardware. Nonetheless, a lot of open and native file formates, e.g. binary data files from Bruker flex series instruments, mzXML, mzML, etc. are supported through the associated \R{} package \Rpackage{MALDIquantForeign}. \section{Setup} After starting \R{} we could install \Mq{} and \Rpackage{MALDIquantForeign} directly from \CRAN{} using \Rfunction{install.packages}: <>= install.packages(c("MALDIquant", "MALDIquantForeign")) @ Before we can use \Mq{} we have to load the package. <>= library("MALDIquant") @ \section{MALDIquant objects} \Mq{} is written in an object-oriented programming approach and uses \R{}'s S4 objects. A spectrum is represented by an \Robject{MassSpectrum} and a list of peaks by an \Robject{MassPeaks} instance. To create such objects manually we could use \Rfunction{createMassSpectrum} and \Rfunction{createMassPeaks}. In general we do not need these functions because \Rpackage{MALDIquantForeign's} import routines will generate the \Robject{MassSpectrum}/\Robject{MassPeaks} objects. <>= s <- createMassSpectrum(mass=1:10, intensity=1:10, metaData=list(name="Spectrum1")) s @ Each \Robject{MassSpectrum}/\Robject{MassPeaks} stores the mass and intensity values of a spectrum respective of the peaks. Additionally they contain a list of metadata. To access these information we use \Rfunction{mass}, \Rfunction{intensity} and \Rfunction{metaData}. <>= mass(s) intensity(s) metaData(s) @ \section{Workflow} A Mass Spectrometry Analysis often follows the same workflow (see also Fig. \ref{fig:workflow}). After importing the raw data (see also the \Rpackage{MALDIquantForeign} package) we control the quality of the spectra and draw some plots. We apply a variance-stabilizing transformation and smoothing filter. Next we remove the chemical background using a Baseline Correction method. To compare the intensities across spectra we calibrate the intensity values (often called normalization) and the mass values (warping, alignment). Subsequently we perfom a Peak Detection and do some post processing like filtering etc. \begin{figure}[htbp] \centering \begin{small} \workflow{} \caption{MS Analysis Workflow} \label{fig:workflow} \end{small} \end{figure} \clearpage \subsection{Data Import} Normally we will use some of the import methods provided by \Rpackage{MALDIquantForeign}, e.g. \Rfunction{importBrukerFlex}, \Rfunction{importMzMl}, etc. But in this vignette we will use a small example dataset shipped with \Mq{}. This dataset is a subset of MALDI-TOF data described in \cite{Fiedler2009}. <>= data(fiedler2009subset) @ \Robject{fiedler2009subset} is a \Robject{list} of 16 \Robject{MassSpectrum} objects. The 16 spectra are 8 biological samples with 2 technical replicates. <>= length(fiedler2009subset) fiedler2009subset[1:2] @ \subsection{Quality Control} For a basic quality control we test whether all spectra contain the same number of data points and are not empty. <>= any(sapply(fiedler2009subset, isEmpty)) table(sapply(fiedler2009subset, length)) @ Subsequently we control the mass difference between each data point (should be equal or monotonically increasing) because \Mq{} is designed for profile data and not for centroided data. <>= all(sapply(fiedler2009subset, isRegular)) @ Finally we draw some plots and inspect the spectra visually. <>= plot(fiedler2009subset[[1]]) plot(fiedler2009subset[[16]]) @ \subsection{Variance Stabilization} We use the square root transformation to simplify graphical visualization and to overcome the potential dependency of the variance from the mean. <>= spectra <- transformIntensity(fiedler2009subset, method="sqrt") @ \subsection{Smoothing} Next we use a 21 point Savitzky-Golay-Filter \citep{Savitzky1964} to smooth the spectra. <>= spectra <- smoothIntensity(spectra, method="SavitzkyGolay", halfWindowSize=10) @ \subsection{Baseline Correction} Before we correct the baseline we visualize it. Here we use the SNIP algorithm \citep{Ryan1988}. <>= baseline <- estimateBaseline(spectra[[16]], method="SNIP", iterations=100) plot(spectra[[16]]) lines(baseline, col="red", lwd=2) @ If we are satisfied with our estimated baseline we remove it. <>= spectra <- removeBaseline(spectra, method="SNIP", iterations=100) plot(spectra[[1]]) @ \subsection{Intensity Calibration/Normalization} For better comparison and to overcome (very) small batch effects we equalize the intensity values using the Total-Ion-Current-Calibration (often called normalization). <>= spectra <- calibrateIntensity(spectra, method="TIC") @ \subsection{Warping/Alignment} Now we (re)calibrate the mass values. Our alignment procedure is a peak based warping algorithm. If you need a finer control or want to investigate the impact of different parameters please use \Rfunction{determineWarpingFunctions} instead of the easier \Rfunction{alignSpectra}. <>= spectra <- alignSpectra(spectra, halfWindowSize=20, SNR=2, tolerance=0.002, warpingMethod="lowess") @ Before we call the Peak Detection we want to average the technical replicates. Therefore we look for the sample name that is stored in the metadata because each technical replicate has the same sample name. <>= samples <- factor(sapply(spectra, function(x)metaData(x)$sampleName)) @ Next we use \Rfunction{averageMassSpectra} to create a mean spectrum for each biological sample. <>= avgSpectra <- averageMassSpectra(spectra, labels=samples, method="mean") @ \subsection{Peak Detection} The next crucial step is the Peak Detection. Before we perform the peak detection algorithm we estimate the noise of the spectra to get a feeling for the signal-to-noise ratio. <>= noise <- estimateNoise(avgSpectra[[1]]) plot(avgSpectra[[1]], xlim=c(4000, 5000), ylim=c(0, 0.002)) lines(noise, col="red") lines(noise[,1], noise[, 2]*2, col="blue") @ We decide to use a signal-to-noise ratio of 2 (blue line). <>= peaks <- detectPeaks(avgSpectra, method="MAD", halfWindowSize=20, SNR=2) plot(avgSpectra[[1]], xlim=c(4000, 5000), ylim=c(0, 0.002)) points(peaks[[1]], col="red", pch=4) @ \subsection{Peak Binning} After the alignment the peak positions (mass) are very similar but not identical. The binning is needed to make similar peak mass values identical. <>= peaks <- binPeaks(peaks, tolerance=0.002) @ \subsection{Feature Matrix} We choose a very low signal-to-noise ratio to keep as much features as possible. To remove some false positive peaks we remove less frequent peaks. <>= peaks <- filterPeaks(peaks, minFrequency=0.25) @ At the end of the analysis we create a feature matrix that could be used in further statistical analysis. Please note that missing values (not detected peaks) are imputed/interpolated from the corresponding spectrum. <>= featureMatrix <- intensityMatrix(peaks, avgSpectra) head(featureMatrix[, 1:3]) @ \section{Summary} We shortly described a complete example workflow of a mass spectrometry data analysis. Please note that this workflow is only an example and could not cover every use case.\\ \Mq{} provides a lot of more functions than we mentioned in this vignette. The described functions are the most used ones but they have a lot of more parameters which could/need adjust to your data (e.g. \Robject{halfWindowSize}, \Robject{SNR}, \Robject{tolerance}, etc.). That's why we suggest the user to read the manual pages of theses functions carefully.\\ We also provide more examples in the demo directory and at: \begin{quote} \url{http://strimmerlab.github.io/software/maldiquant/} \end{quote} Please do not hesitate to contact me (\email{mail@sebastiangibb.de}) if you have any questions. \section{Session Information} <>= toLatex(sessionInfo(), locale=FALSE) @ \bibliographystyle{apalike} \bibliography{bibliography} \end{document}