---
title: Introduction to Numero
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
**Author:** Ville-Petteri Mäkinen
**Co-authors:** Song Gao, Stefan Mutter, Aaron E Casey
**Version:** `r packageVersion("Numero")`
**Abstract:** In textbook examples, multivariable datasets are clustered into
distinct subgroups that can be clearly identified by a set of optimal
mathematical criteria. However, many real-world datasets arise from
synergistic consequences of multiple effects, noisy and partly redundant
measurements, and may represent a continuous spectrum of the different phases
of a phenomenon. In medicine, complex diseases associated with ageing are
typical examples. We postulate that population-based biomedical datasets (and
many other real-world examples) do not contain an intrinsic clustered
structure that would give rise to mathematically well-defined subgroups. From
a modeling point of view, the lack of intrinsic structure means that the data
points inhabit a contiguous cloud in high-dimensional space without abrupt
changes in density to indicate subgroup boundaries; hence, mathematical
criteria cannot segment the cloud reliably by its internal structure. Yet we
need data-driven classification and subgrouping to aid decision-making and to
facilitate the development of testable hypotheses. For this reason, we
developed the Numero package, a more flexible and transparent process that
allows human observers to create usable multivariable subgroups even when
conventional clustering frameworks struggle.
**Citation:** Gao S, Mutter S, Casey AE, Mäkinen V-P (2018) Numero: a
statistical framework to define multivariable subgroups in complex
population-based datasets, Int J Epidemiology,
https://doi.org/10.1093/ije/dyy113
***
``` {r echo=FALSE, results="hide"}
# Save space on screen printouts.
options(digits = 3)
```
***
---
###########################################################################
---
# Getting started {.tabset}
## Installation
You can install Numero using the standard procedure:
``` {r eval=FALSE}
# Install the package from a remote repository.
install.packages("Numero")
```
The functions come in two flavors: the ones starting with `numero` provide
high level pipeline components that will serve most users, whereas the `nro`
functions perform specific tasks and provide a more granular interface to
the package. In this introductory document, we will only use the `numero`
group of functions.
To activate and list the library functions, type
``` {r}
# Activate the library.
library("Numero")
packageVersion("Numero")
ls("package:Numero")
```
Each Numero function comes with a help page that contains a code example,
such as
``` {r eval=FALSE}
# Access function documentation (not shown in vignette).
? numero.create
```
You can run all the code examples in one go by typing
``` {r eval=FALSE}
# Run all code examples (not shown in vignette).
fn <- system.file("extcode", "examples.R", package = "Numero")
source(fn)
```
Please use the tabs at the top to navigate to the next page.
## Dataset
In the examples, we use data on diabetic kidney disease from a previous
publication (see the readme file below). While simplified, the data contain
enough information to replicate some of the findings from the original study.
The main clinical characteristics are summarized in Table 1, and you can
print the readme file on the screen by typing
```{r eval=FALSE}
# Show readme file on screen (not shown in vignette).
fn <- system.file("extdata", "finndiane.readme.txt", package = "Numero")
cat(readChar(fn, 1e5))
```
Table: Summary of the FinnDiane dataset. The mean and
standard deviation are reported for each continuous variable. Abbreviations:
urinary albumin excretion rate (uALB), triglycerides (TG), high density
lipoprotein subclass 2 (HDL2). P-values were estimated by the t-test for
continuous variables and by Fisher’s test for binary traits.
Trait | No kidney disease | Diabetic kidney disease | P-value
---------------------------------- | ----------------- | ----------------------- | --------
Men / Women | 192 / 196 | 119 / 106 | 0.45
Age (years) | 38.8 ± 12.2 | 41.7 ± 9.7 | 0.0012
Type 1 diabetes duration (years) | 25.3 ± 10.3 | 28.6 ± 7.8 | <0.001
Log10 uALB (mg/24h) | 1.20 ± 0.51 | 2.72 ± 0.59 | <0.001
Log10 TG (mmol/L) | 0.034 ± 0.201 | 0.159 ± 0.212 | <0.001
Total cholesterol (mmol/L) | 4.89 ± 0.77 | 5.35 ± 0.96 | <0.001
HDL2 cholesterol (mmol/L) | 0.54 ± 0.16 | 0.51 ± 0.18 | 0.027
Log10 serum creatinine (µmol/L) | 1.94 ± 0.09 | 2.14 ± 0.24 | <0.001
Metabolic syndrome | 90 (23.2%) | 114 (50.7%) | <0.001
Macrovascular disease | 16 (4.1%) | 38 (16.9%) | <0.001
Diabetic retinopathy | 133 (34.4%) | 178 (79.1%) | <0.001
Died during follow-up | 13 (3.4%) | 43 (19.1%) | <0.001
The data are included in the package as a tab-delimited file:
``` {r}
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname, stringsAsFactors = FALSE)
nrow(dataset)
colnames(dataset)
```
Please use the tabs at the top to navigate to the next page.
## Data integrity
Most datasets contain missing entries, duplicated rows and other unusable
features; it is therefore important to carefully inspect all data points
and their identification. In this case, the biochemical and clinical
information about individual participants is organized as rows and each
row is identified by the variable INDEX. To create a data frame where the
row names are set to INDEX, type
``` {r}
# Manage unusable entries and data identification.
dataset <- numero.clean(data = dataset, identity = "INDEX")
```
Almost all functions within Numero work only on numeric values, so if the
dataset contains factors or other categorical data types, please convert them
to integer labels before analyses.
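As a minimal sketch (not needed for the FinnDiane data, which are already fully numeric), categorical columns could be converted to integer labels with base R:
``` {r eval=FALSE}
# Sketch: convert any character or factor columns to integer labels.
for(v in colnames(dataset)) {
    if(!is.numeric(dataset[,v]))
        dataset[,v] <- as.integer(factor(dataset[,v]))
}
```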
---
###########################################################################
---
# Basic pipeline {.tabset}
## Prepare
Before any analysis, we recommend using `numero.clean` to check that
the data are in a usable format (see the previous section on data integrity).
A typical usage case of the Numero framework involves a set of training
variables that are considered to explain medical or other outcomes. Here,
we use the biochemical data as the training variables under the general
hypothesis that the metabolic profile (as measured by biochemistry)
predicts adverse clinical events. The kidney disease dataset includes five
biochemical measures:
``` {r}
# Select training variables.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
```
Centering and scaling are basic machine learning techniques that ensure
variables with large numerical values (due to the choice of measurement unit)
do not bias the results. To create a standardized training set, type
``` {r results="hide"}
# Center and scale the training data.
trdat.basic <- scale.default(dataset[,trvars])
```
All the training variables now have unit variance, so we can expect their
impact on the SOM to depend only on their information content and
statistical associations:
``` {r}
# Calculate standard deviations.
apply(trdat.basic, 2, sd, na.rm = TRUE)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Create
Conceptually, the SOM algorithm is analogous to a human observer who wants to
make sense of a set of objects. Suppose there is a pile of portrait photos of
ordinary people on a round table, and your task is to organize them on the
table in such a way that similar photos are placed next to each other and, by
consequence, dissimilar photos end up on the opposite sides of the table.
During the organization process, your brain interprets the multi-variable
visual data (age, gender, ethnicity, hair style, head shape, etc.) and
projects it onto two dimensions (vertical and horizontal positions).
The SOM algorithm, also known as Kohonen map, mimics the organization process
and was originally inspired by the plasticity of neuronal networks. As this
is a self-organizing process, the starting point may have an impact on the
outcomes. We have added an extra initialization step to ensure repeatable
results. The function `numero.create` applies K-means clustering to create
an initial map model, and then refines the model using the SOM algorithm:
``` {r}
# Create a new self-organizing map based on the training set.
modl.basic <- numero.create(data = trdat.basic)
summary(modl.basic)
```
The resulting data structure contains the SOM itself and additional
information that is needed for further statistical analyses; please see
the manual pages for details.
The output from `numero.create` contains the positions of the data points
on a virtual round "table"; we use the term *map* instead of table, and refer
to the positions as the *layout* of a dataset on the SOM. The results also
contain an internal model of the training dataset; the model comprises a set
of *district centroids* that summarize the salient features of the training
dataset; please see the section on terminology for further descriptions.
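If you want to see how these pieces are organized in the returned object, a quick structural inspection is usually enough (the element names reflect the package version used here and may differ in later versions):
``` {r eval=FALSE}
# Inspect the structure of the model object (output not shown in vignette).
str(modl.basic, max.level = 2)
```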
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
After the training is complete, it is prudent to examine the SOM for
potential problems with the data. The Numero package provides three
different quality tools:
**Histogram:** The distribution of data points on the map can be expressed as
a histogram that shows the counts of data points in the districts.
If any areas of the SOM are devoid of data points, or disproportionately
populated by them, it usually reflects unusual or discontinuous value
distributions within the data (it can also be caused by poorly chosen
preprocessing methods). The histogram is rarely flat, and up to two-fold differences
between the least and most populated districts are common, but if the pattern
is more extreme, careful examination of the distributions of the input data
in relation to the preprocessing procedures is warranted.
**Coverage:** The SOM is intrinsically robust against missing values if the
missingness pattern is close to random. We define data coverage as the
mean proportion of usable values per multivariable data point. If the coverage
drops substantially in certain districts compared to others, it is possible
that the data point layout has been biased by a non-random pattern of missing
values.
**Model fit:** The location of each data point on the map depends on how
similar it is to district centroids; the district with the most similar
centroid is chosen as the optimal location. The similarity of a centroid and
a data point can be quantified using mathematical formulas, such as the
Euclidean distance. The Numero package provides two measures. The first is
the residual distance $d$ between a data point and its best-matching centroid.
The second is independent of scale and is calculated as
$z = (d - d_{mean}) / d_{sd}$, where $d_{mean}$ and $d_{sd}$ are the mean
and standard deviation of the residuals in the training set.
``` {r}
# Calculate map quality measures for the training data.
qc.basic <- numero.quality(model = modl.basic)
```
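As a rough consistency check of the formula above, the standardized residuals can be recomputed directly from the layout, and the histogram fold difference mentioned earlier can be inspected from the district-level planes; the values may differ slightly from the package output if Numero derives the reference statistics differently:
``` {r eval=FALSE}
# Approximate the z-scores from the raw residuals (illustration only).
d <- qc.basic$layout$RESIDUAL
z.approx <- (d - mean(d, na.rm = TRUE))/sd(d, na.rm = TRUE)
summary(z.approx - qc.basic$layout$RESIDUAL.z)
# Fold difference between the most and least populated districts.
h <- qc.basic$planes[,"HISTOGRAM"]
max(h, na.rm = TRUE)/min(h, na.rm = TRUE)
```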
Outliers can be detected by plotting the distribution of quality measures:
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.basic$layout$RESIDUAL, breaks = 50,
main = NULL, xlab = "RESIDUAL", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
hist(x = qc.basic$layout$RESIDUAL.z, breaks = 50,
main = NULL, xlab = "RESIDUAL.z", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
```
In this example, the distribution of residuals is heavily skewed and there
are a few data points that are several standard deviations from the mean
(RESIDUAL.z). The skewness is often due to skewed distributions in the data;
urinary albumin and serum creatinine are highly skewed towards larger values.
The Numero framework includes automatic procedures based on log-transforms
or rank-based preprocessing to mitigate skewness. We will explore these in
separate examples later in the document.
The quality measures can be visualized on the map to reveal subgroups of
problematic data points:
``` {r results="hide", fig.width=9, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts. "}
# Plot map quality measures.
numero.plot(results = qc.basic, subplot = c(1,4))
```
While coverage is uniform across the map (there were only a few missing
values), the model fit is noticeably worse on the top part of the map
(higher RESIDUAL). The poor model fit is also reflected in the sample
histogram that shows a substantial depletion of data points in the top
part of the map. In practice, these structural issues do not necessarily
present a problem (as we see in the next section), but they need to be
considered when interpreting the significance of map patterns.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
Organizing a pile of photos on a table is achievable for a human observer
-- recall the photo analogy from the Create section -- but a human would
struggle with rows of numerical data, especially when there are many data
points. Also, it is of interest to examine each variable one at a
time (e.g. if there is a map region associated with high cholesterol), or
to see if the outcomes of interest (e.g. mortality) accumulate on a
particular section of the SOM.
Since we consider the SOM as a map, we can apply concepts from geography to
overcome these challenges. Just as a city map can be colored according to
the average regional house prices, we can color the SOM according to the
average regional cholesterol concentrations or mortality rates.
Before the SOM is visualized, however, we need to estimate descriptive
statistics so that the colors can be calibrated according to the strength
of the statistical evidence.
Any finite dataset will produce regional variation on the SOM. For example,
it is reasonable to expect that parts of the map will differ with respect to
average mortality simply by chance. But how do we know whether the observed
regional variation exceeds the random expectation? The Numero package
contains a permutation engine that provides the answer:
``` {r}
# Map statistics for the whole dataset.
stats.basic <- numero.evaluate(model = qc.basic, data = dataset)
summary(stats.basic)
```
The output contains data frames for the data point layout, district averages
for each variable (i.e. component planes), and z-scores and p-values for
the observed amount of regional variation. Of note, the function omits
P-values for columns that overlap with the training data as it does not make
sense to estimate statistical significance for the same data that were used
for constructing the model.
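For example, the district averages (component planes) and the permutation statistics can be inspected directly; the element names below are the same as those used later in this document:
``` {r eval=FALSE}
# Peek at district averages and permutation statistics (not shown in vignette).
head(stats.basic$planes[,trvars])
head(stats.basic$statistics)
```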
Overall, the results above comprise the minimal information needed to color
the SOM according to district averages.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random. The numbers show the average values in original units for selected districts."}
# Plot map colorings of training variables.
numero.plot(results = stats.basic, variables = trvars, subplot = c(2,3))
```
As expected, all the training variables (including CHOL) produce strong
colors, since they directly determined the data point layout.
On the other hand, we did not use
age or sex as training variables, yet both show significant regional variation:
``` {r}
stats.basic$statistics[c("CHOL","MALE","AGE","T1D_DURAT"),
c("TRAINING","Z","P.z")]
```
These signals indicate that the biochemical profiles are associated
with age and sex, which is not a surprise considering the effects of ageing
and sex dimorphism in human populations. Of note, P-values are only reported
for the non-biochemical variables that were not used for training.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
To recap, we hypothesized that the metabolic profile indicates and/or predicts
adverse health outcomes. For this reason, we chose to use only the biochemical
data to construct the SOM, which allows us to use the clinical traits as
validation outcomes from a modelling perspective. The next step is to divide
the dataset into a limited number of subgroups that summarize the typical
metabolic profiles we can expect to see in the data.
It is important to use only the training variables in this step so that we
remain agnostic to the clinical traits and preserve the original unsupervised
study design.
The Numero framework includes an interactive tool to conduct the subgrouping
based on SOM colorings. Unfortunately, the R vignette format is not compatible
with the interactive procedure, therefore we will rely on the automatic
subgrouping feature to proceed with the example:
``` {r}
# Automatic subgrouping based on the training set.
subgr.basic <- numero.subgroup(results = stats.basic,
variables = trvars, automatic = TRUE)
```
It is possible to use the interactive feature to tweak the automatic result
by passing the current subgrouping as the map topology:
``` {r eval=FALSE}
# Interactive subgrouping.
subgr.basic <- numero.subgroup(results = stats.basic,
variables = trvars, topology = subgr.basic)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure."}
# Plot results from the subgrouping procedure.
numero.plot(results = stats.basic, variables = trvars,
topology = subgr.basic, subplot = c(2,3))
```
It is evident from the figure above that the automatic subgrouping could be
improved to better cover the most important patterns on the map. Serum creatinine and urinary
albumin are biomarkers of kidney disease, so they are highly concordant,
whereas cholesterol and triglycerides show patterns that do not fit well
with the kidney biomarkers. Therefore, additional subgroups and manual
adjustments of the district labels may be the preferred course of action
if this were an actual research project.
With the interactive tool, users can choose their own subgroups based on
the SOM patterns. This is the greatest strength of the pipeline: multiple
people will be able to participate in the process of defining the subgroups,
and different subgroups can be created for different settings. For example,
cardiovascular disease is the most common cause of death in people with type
1 diabetes, and therefore a subgrouping that takes cholesterol into account
may be warranted.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
We omitted the clinical end-points intentionally from the previous analyses to
minimize their influence on the subgrouping step, as required by the
study design. Since we have now defined the subgroups, it is safe to plot the
colorings for all the variables. The subplot option was used here to ensure a
pleasing look for the vignette, but it can be omitted in most circumstances.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of selected variables in the kidney disease dataset."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.basic,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = stats.basic$map$topology, subplot = c(2,3))
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables with subgroup labels."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.basic,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = subgr.basic, subplot = c(2,3))
```
The subgroup with the highest urinary albumin overlaps with the
clinical classification of diabetic kidney disease (top-right coloring).
This is expected, since albuminuria is one of the key diagnostic markers
for kidney disease. Accordingly, the same subgroup overlaps with high
prevalence of retinopathy (bottom-left) and high mortality rate
(bottom-right).
In research papers, it is often easier to report conventional comparison
statistics between the subgroups you have defined rather than the SOM
visualizations (however, the colorings should always be included as
supplements). The Numero package includes the function `numero.summary()`
that combines the visual subgroupings with the original data:
``` {r}
# Compare subgroups.
report.basic <- numero.summary(results = stats.basic, topology = subgr.basic)
colnames(report.basic)
```
For instance, the report contains the statistical signals for mortality
across the subgroups:
``` {r}
# Show results for mortality rate.
rows <- which(report.basic$VARIABLE == "DECEASED")
report.basic[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The first row of the output shows the chi-squared test for the full contingency
table. The largest subgroup is always chosen as the reference, hence
P = 1 for the subgroup with the most members. The subgroup that captured the
highest urinary albumin values should show the greatest mortality rate.
Lastly, it is often useful to identify the members of these subgroups.
The output contains that information as attributes; please see the
documentation on `numero.summary()` and `nroSummary()` for details.
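As a minimal sketch, the attached attributes can be listed with base R; the exact names and their contents are described on those manual pages:
``` {r eval=FALSE}
# List the attributes attached to the subgroup report (not shown in vignette).
names(attributes(report.basic))
```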
---
###########################################################################
---
# Preprocessing and adjustment {.tabset}
## Prepare
The introductory analyses revealed that both sex and age were associated with
the metabolic profiles via the SOM. Moreover, men and women display anatomical
and metabolic differences, which usually complicate the interpretation of the
SOM. For this reason, we recommend using a sex-specific standardization
procedure that eliminates the differences. If necessary, separate
visualizations can be made afterwards for men and women using the same map.
Ageing is another factor that is often considered a confounder, especially
within cross-sectional study designs. For this reason, we will reduce
the co-variation with AGE and T1D_DURAT in the training set prior to
creating the SOM.
To prepare the training data, we can use the tool provided with the
Numero package:
``` {r}
# Mitigate stratification and confounding factors.
trdata.adj <- numero.prepare(data = dataset, variables = trvars, batch = "MALE",
confounders = c("AGE", "T1D_DURAT"))
colnames(trdata.adj)
```
In addition to adjusting for the confounders, the function also applies
the logarithm to variables that are highly skewed. While binary data can
be used for training, we recommend using continuous variables only for
building a SOM since they are easier to adjust for confounders and less
prone to numerical instability.
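One way to see the effect is to compare the skewness of a strongly skewed variable, such as urinary albumin, before and after preparation. The sketch below uses a simple moment-based skewness estimate rather than any Numero function:
``` {r eval=FALSE}
# Simple skewness estimate (third standardized moment).
skew <- function(x) {
    x <- x[is.finite(x)]
    mean((x - mean(x))^3)/sd(x)^3
}
c(original = skew(dataset[,"uALB"]), prepared = skew(trdata.adj[,"uALB"]))
```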
We also need to know which identities belong to men or women so that
the sexes can be separated in later analysis steps. As a by-product of
the batch correction for MALE, the identities of batch members are included
in the output according to the values in the MALE column
(1 = man, 0 = woman):
``` {r}
subsets <- attr(trdata.adj, "subsets")
women <- subsets[["0"]]
men <- subsets[["1"]]
c(length(men), length(women))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Create
The SOM is created in the same way as in the basic pipeline, but this time
using the sex-adjusted training variables:
``` {r}
# Create a new self-organizing map based on sex-adjusted data.
modl.adj <- numero.create(data = trdata.adj)
summary(modl.adj)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
After the training is complete, the structure of the SOM can be inspected
as we did in the previous example:
``` {r}
# Calculate map quality measures for sex-adjusted data.
qc.adj <- numero.quality(model = modl.adj)
```
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals. The training data were adjusted for age and sex."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.adj$layout$RESIDUAL, breaks = 20,
main = NULL, xlab = "RESIDUAL", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
hist(x = qc.adj$layout$RESIDUAL.z, breaks = 20,
main = NULL, xlab = "RESIDUAL.z", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
```
The preprocessing resulted in a substantial improvement of the residual
distributions. Although the residuals are still skewed, the most extreme
residuals have been removed:
``` {r}
# Maximum residuals from unadjusted and adjusted analyses.
c(max(qc.basic$layout$RESIDUAL.z, na.rm = TRUE),
max(qc.adj$layout$RESIDUAL.z, na.rm = TRUE))
```
Furthermore, the spread of data points across the map is now more
balanced
``` {r}
# Variation in point density in unadjusted and adjusted analyses.
c(sd(qc.basic$planes[,"HISTOGRAM"], na.rm = TRUE),
sd(qc.adj$planes[,"HISTOGRAM"], na.rm = TRUE))
```
and the depletion of data points in the poor quality area is attenuated:
``` {r results="hide", fig.width=9, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of data point quality measures across map districts from age and sex adjusted analysis."}
# Plot map quality measures.
numero.plot(results = qc.adj, subplot = c(1,4))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
We can estimate the map statistics using the whole dataset as before, except
that the model is now age and sex adjusted. Note how we use the original
unadjusted data as input `data = dataset` but the evaluation
is done via the map layout created from adjusted data:
``` {r}
# Map statistics for the whole dataset.
stats.adj <- numero.evaluate(model = qc.adj, data = dataset)
```
Furthermore, it is also possible to conduct sex-specific analyses now that we
have identified men and women in the dataset:
``` {r results="hide"}
# Map statistics for women.
stats.adjW <- numero.evaluate(model = qc.adj, data = dataset[women,])
```
``` {r results="hide"}
# Map statistics for men.
stats.adjM <- numero.evaluate(model = qc.adj, data = dataset[men,])
```
We could have used `data = trdata.adj` instead of `data = dataset`;
however, describing the sex-adjusted subgroups with the original unadjusted
data is easier to interpret in relation to what you might see when
meeting the participants in the real world.
If the adjustments of the training data are successful, we expect
to see non-significant P-values for MALE and AGE when using the original
data:
``` {r}
stats.adj$statistics[c("MALE","AGE","T1D_DURAT"), c("Z","P.z")]
```
Indeed, the regional variation in the proportion of men to women is now
within the range of random sampling fluctuation. That said, the sex
adjustment only affects the individuals' positions on the map; the
well-known sex differences in the average biomarker concentrations are
not affected:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from female participants."}
numero.plot(results = stats.adjW, variables = trvars, subplot = c(2,3))
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from male participants."}
numero.plot(results = stats.adjM, variables = trvars, subplot = c(2,3))
```
The color patterns look very similar between the sexes, but the regional
averages are systematically different. For instance, HDL2C is higher in
women across all areas of the map, even though the patterns themselves
are almost indistinguishable.
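A quick numeric check of this shift is to compare the district averages between the sex-specific results; the sketch assumes the $planes elements of the two result sets are aligned district by district, which holds because both were evaluated on the same map:
``` {r eval=FALSE}
# Difference in HDL2C district averages between women and men.
summary(stats.adjW$planes[,"HDL2C"] - stats.adjM$planes[,"HDL2C"])
```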
As we estimated the statistics separately for men and women, the color
scales are likely to differ, which may make it more difficult to compare
the data patterns. Another option is to calibrate the colors between the
sexes against a common reference, that is, the same shade of blue will
always correspond to the same concentration of cholesterol, for example,
regardless of which subset of individuals is being visualized.
This is accomplished by using the results of the full dataset as the
reference point `reference = stats.adj`:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from female participants. Color scales were derived from the full dataset."}
numero.plot(results = stats.adjW, variables = trvars,
subplot = c(2,3), reference = stats.adj)
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from male participants. Color scales were derived from the full dataset."}
numero.plot(results = stats.adjM, variables = trvars,
subplot = c(2,3), reference = stats.adj)
```
It is now easier to see from the colors that women tend to have higher
HDL2C (bright red areas on the right side and less blue on the left side),
whereas they tend to have lower serum creatinine (more blue for women
compared to men).
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
To recap, we followed the procedures in the basic pipeline, except that the
training data were adjusted for age and sex differences, and we also analyzed
men and women separately. The subgrouping step proceeds as before.
The Numero framework includes an interactive tool to conduct the
subgrouping based on SOM colorings
``` {r eval=FALSE}
# Interactive subgrouping based on the training set.
subgr.adj <- numero.subgroup(results = stats.adj, variables = trvars)
```
As mentioned previously, the vignette format is not compatible with the
interactive tool; therefore, we use the automatic subgrouping
feature to demonstrate this part of the pipeline.
``` {r}
# Automatic subgrouping based on the training set.
subgr.adj <- numero.subgroup(results = stats.adj,
variables = trvars, automatic = TRUE)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure. The map was created from sex-adjusted data."}
# Plot results from subgrouping procedure.
numero.plot(results = stats.adj, variables = trvars,
topology = subgr.adj, subplot = c(2,3))
```
Although the sex adjustment had only a minor effect on the map colorings, we
can observe small changes in how the subgroups settled on the map.
As was discussed earlier, the subgroups could be made more reasonable, which
highlights the importance of being able to set the subgroup labels yourself
for the most useful or parsimonious results.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
In the last step, the subgroups are compared using conventional statistics
as we did earlier. It is always useful to plot the subgroup labels in
relation to the variables that were not used for training:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables with subgroup labels. The map was created from age and sex adjusted data."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.adj,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = subgr.adj, subplot = c(2,3))
```
The effect of the sex adjustment on the variable MALE is immediately
apparent: the regional variation in sex ratio is so close to the random
expectation that the color scale has been squashed flat. In the unadjusted
model, the difference between the predominantly male and the
predominantly female districts was substantial, whereas
here the gap has been reduced:
``` {r}
# District averages from unadjusted analysis.
summary(stats.basic$planes[,"MALE"])
```
``` {r}
# District averages from adjusted analysis.
summary(stats.adj$planes[,"MALE"])
```
The results for mortality have changed as a consequence of the changes in
subgrouping, but not substantially:
``` {r}
# Compare subgroups.
report.adj <- numero.summary(results = stats.adj, topology = subgr.adj)
```
``` {r}
# Show results for mortality rate.
rows <- which(report.adj$VARIABLE == "DECEASED")
report.adj[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
Again, the statistical signal is mostly coming from the high albuminuria
subgroup. It is important to note that while the preprocessing affected
the map quality and the results in general, the overall interpretations
of the results did not change dramatically.
---
###########################################################################
---
# Multiple datasets {.tabset}
## Prepare
Men and women are biologically different, and also have different
biochemical profiles as we observed earlier. In this example, we
use the sexes as if they were two different epidemiological cohorts.
Indeed, differences in data collection between epidemiological studies
lead to differences in the mean concentrations of biomarkers. In this
respect, the male-female comparison is an illustrative and easy
proxy for discussing multi-cohort applications of the SOM.
A typical multi-cohort setting involves a discovery cohort and one or
more replication cohorts with additional meta-analyses across all
available data. We will focus on two aspects: 1) defining subgroups
in the discovery cohort and 2) assigning participants from the replication
cohort into these subgroups. For the meta-analysis part, established
statistical methods can be applied after the subgroups have been defined
(not discussed here).
We also use this example as an opportunity to show how rank-based
preprocessing of the data affects the quality measures by setting
`method = "tapered"`. The SOM framework includes two rank-based methods:
"uniform" for normalized ranks between -1 and +1 and "tapered" for a version
that puts more samples around zero and less around -1 or +1.
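The sketch below illustrates the general idea with simulated values; it is a conceptual toy example, not the exact transforms implemented in the package:
``` {r eval=FALSE}
# Toy illustration of uniform versus tapered ranks (not the package's formulas).
x <- rlnorm(1000)                        # skewed input values
u <- 2*(rank(x) - 0.5)/length(x) - 1     # uniform ranks in [-1, 1]
tap <- sign(u)*u^2                       # tapered variant: more mass near zero
par(mfrow = c(1,2))
hist(u, breaks = 30, main = "uniform ranks")
hist(tap, breaks = 30, main = "tapered ranks (toy)")
```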
To facilitate more intuitive discussion, we refer to the sexes as two
different cohorts from now on. To simulate incompatible datasets, we
add a dataset of men with the metabolic syndrome as a third cohort:
``` {r}
ds.discov <- dataset[women,]
ds.replic <- dataset[men,]
ds.mets <- ds.replic[which(ds.replic[,"METAB_SYNDR"] == 1),]
```
We create four preprocessed datasets (one for discovery/training and three
for replication) to demonstrate how data can be used across cohorts. Summary
statistics are shown for urinary albumin as it is the most important
clinical biomarker:
``` {r results="hide"}
# Discovery cohort and training set.
trdata.discov <- numero.prepare(data = ds.discov, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.discov[,"uALB"])
```
``` {r results="hide"}
# Replication cohort, version A.
param <- attr(trdata.discov, "pipeline")
trdata.replicA <- numero.prepare(data = ds.replic, pipeline = param)
```
``` {r}
summary(trdata.replicA[,"uALB"])
```
The first replication dataset A is analogous to a situation where two
cohorts/batches are compared without batch correction, that is, we assume
that there is no measurement or other bias between the descriptive cohort
statistics. This is the safe starting point, since the quality control
will reveal if the assumption is violated, and we can adjust the second
round of analyses accordingly.
``` {r results="hide"}
# Replication cohort, version B.
trdata.replicB <- numero.prepare(data = ds.replic, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.replicB[,"uALB"])
```
The second dataset B is adjusted for discrepancies in average values.
We still assume that both cohorts represent the same underlying population
that is sampled the same way. This situation is similar to having two
separate birth cohorts of approximately the same age from the same broader
population, but with biochemical data measured in different labs.
``` {r results="hide"}
# Replication cohort, MetS version.
trdata.mets <- numero.prepare(data = ds.mets, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.mets[,"uALB"])
```
The third dataset reflects a situation where the study cohorts have been
recruited differently or are otherwise demographically incompatible.
This represents a challenge to the subgrouping analysis and can lead to
erroneous conclusions. Bias of this type is difficult to distinguish from
measurement bias, and therefore needs to be accounted for by re-sampling
or calibration techniques before using the SOM framework.
Please use the tabs at the top to navigate to the next page.
## Create
The SOM is created the usual way and trained with the discovery dataset,
except that we set the map radius to the same value as before to make it
easier to compare results across examples:
``` {r}
# Create a new self-organizing map based on the discovery dataset.
radius.basic <- attr(modl.basic$map$topology, "radius")
modl.discov <- numero.create(data = trdata.discov, radius = radius.basic)
summary(modl.discov)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
In this section, we inspect the structure of the SOM for the discovery cohort
and the versions of the replication cohort. Notice how the model is the same,
but new data are used for the replication cohorts:
``` {r results="hide"}
# Calculate map quality measures.
qc.discov <- numero.quality(model = modl.discov)
qc.replicA <- numero.quality(model = modl.discov, data = trdata.replicA)
qc.replicB <- numero.quality(model = modl.discov, data = trdata.replicB)
qc.mets <- numero.quality(model = modl.discov, data = trdata.mets)
```
The training data were transformed into ranks during the preprocessing step.
In many cases, this is the safest approach since it mitigates the effect of
outliers or highly skewed distributions. On the other hand, the rank transform
discards the original differences in magnitudes between data points, which
may be undesirable in some circumstances.
In this example, the rank preprocessing resulted in less skewed residuals
compared to the earlier examples:
``` {r}
# Define comparable histogram bins.
rz <- c(qc.adj$layout[,"RESIDUAL.z"], qc.discov$layout[,"RESIDUAL.z"])
rz.breaks <- seq(min(rz, na.rm = TRUE), max(rz, na.rm = TRUE), length.out=20)
```
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals when the training data were preprocessed by scaling & centering or by tapered ranking."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.adj$layout[,"RESIDUAL.z"], breaks = rz.breaks,
main = NULL, xlab = "RESIDUAL.z (scale & center)",
ylab = "Number of data points", col = "#FFEFA0", cex = 0.8)
hist(x = qc.discov$layout[,"RESIDUAL.z"], breaks = rz.breaks,
main = NULL, xlab = "RESIDUAL.z (rank)",
ylab = "Number of data points", col = "#FFEFA0", cex = 0.8)
```
Residuals for the replication datasets are expected to be higher than those
for the training set (a model always fits its own training data best).
Nevertheless, all the rank-transformed replication sets had lower maximum
residuals than the previous centered and scaled training set:
``` {r}
# Comparison of maximum residuals across examples.
r <- c(max(qc.basic$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.adj$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.discov$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.replicA$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.replicB$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.mets$layout[,"RESIDUAL.z"], na.rm = TRUE))
names(r) <- c("basic", "adj", "discov", "replicA", "replicB", "mets")
print(r)
```
More balanced data point density is another potential benefit of the
rank-transform:
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the discovery cohort. "}
# Plot map quality measures.
numero.plot(results = qc.discov, subplot = c(1,4))
```
On the other hand, the centered and scaled analysis in the previous chapter
revealed a connection between weak model fit and high burden of disease;
this aspect of the data was lost due to the rank transform. The concentric
pattern of good fit at the center and weaker fit on the edges (see RESIDUAL)
is often seen when using rank-based preprocessing, which suggests that the
pattern should not be interpreted as a characteristic of any
particular dataset.
Comparisons between the replication sets revealed important clues on how
well the data fit together:
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset A. "}
# Plot map quality measures.
numero.plot(results = qc.replicA, subplot = c(1,4))
```
For replication set A, the data point histogram is very uneven, with the
majority of data points clustered in one quadrant of the map. To recap,
replication A was preprocessed using the same parameters as the discovery
set under the assumption that the two datasets were samples from the same
underlying population with identical measurement protocols. Evidently, this
assumption is wrong since the multi-variable data did not overlap well.
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset B. "}
# Plot map quality measures.
numero.plot(results = qc.replicB, subplot = c(1,4))
```
Replication B was assumed to come from the same population, but with
potential bias in the measured values. Because the replication data were
preprocessed independently of the discovery set, resulting in identical
standardized distributions, the data point histogram and residuals are
similar to those of the discovery set. This does not mean that the initial
assumption is correct,
but it is possible to investigate demographic factors such as age, sex
or clinical disease profiles to check that the overall characteristics are
compatible across the different regions on the map.
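A simple example of such a check is to compare the age distributions of the two cohorts with a conventional test; a thorough check would cover the other demographic and clinical variables as well:
``` {r eval=FALSE}
# Example compatibility check: compare age between the cohorts.
t.test(ds.discov[,"AGE"], ds.replic[,"AGE"])
```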
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset with the metabolic syndrome. "}
# Plot map quality measures.
numero.plot(results = qc.mets, subplot = c(1,4))
```
Lastly, the metabolic syndrome subset was created as an example of biased
sampling, where the replication is not compatible with the discovery
dataset. However, this is hard to detect from the quality plots. As mentioned
above, independent preprocessing will reproduce an even data point
histogram and residuals (in most settings); therefore, the quality measures
alone cannot identify problems arising from this type of flaw in the
study design.
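Traditional descriptive statistics, however, can expose this kind of sampling bias even when the quality maps look clean. For instance, outcome prevalences can be compared between the full replication cohort and its metabolic syndrome subset:
``` {r eval=FALSE}
# Percentage of deceased participants in the two replication datasets.
round(100*c(replic = mean(ds.replic[,"DECEASED"], na.rm = TRUE),
            mets = mean(ds.mets[,"DECEASED"], na.rm = TRUE)), 1)
```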
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
To create colorings of the SOM for the different datasets, we combine
several features of the Numero pipeline. The first task is to identify
where the data points go on the map. For the discovery cohort (that
includes the training set), this information is contained in the model
itself. However, for the other cohorts, the district assignments
are obtained as part of the quality control; the output from
`numero.quality()` contains the layout of data points that were used
for calculating the histograms and residuals in the previous section.
The estimation of map statistics proceeds as before, except that the
quality control results are used for the replication datasets:
``` {r results="hide"}
# Map statistics for discovery and replication datasets.
stats.discov <- numero.evaluate(model = modl.discov, data = ds.discov)
stats.replicA <- numero.evaluate(model = qc.replicA, data = ds.replic)
stats.replicB <- numero.evaluate(model = qc.replicB, data = ds.replic)
stats.mets <- numero.evaluate(model = qc.mets, data = ds.mets)
```
The second task is to choose the coloring scheme. In the previous chapter,
we experimented with sex-specific colorings and here we will apply the same
technique of a common reference `reference = stats.discov`. This will
ensure that colors between the datasets are directly comparable, that is,
the same data value will always correspond to the same color.
In addition, we reduce the "gain" or global amplification of the color range.
This has the effect of making the available value-color mapping wider so that
values that exceed the reference range wash out less:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables for the discovery dataset."}
numero.plot(results = stats.discov, variables = trvars,
gain = 0.8, subplot = c(2,3))
```
Notice how the strongest reds and blues are not used for the discovery set
due to the reduced (<1.0) gain. This means that they remain available
for distinguishing more extreme values on the map.
The results for replication A are very similar to the discovery set in both
the color patterns and the district averages. Yet we know from the
quality analysis that the two datasets did not overlap well on the SOM and
the assumptions on study design were invalid -- this is one of the reasons
it is important to check the quality of the SOM.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for replication A."}
numero.plot(results = stats.replicA, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
Replication B reproduces the patterns of the discovery set in relative
terms (areas of lower and higher concentrations match), but the absolute
values are different. This is expected since we assumed that the data
patterns would be the same but there was systematic bias in the measurement
data.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for replication B."}
numero.plot(results = stats.replicB, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
The metabolic syndrome dataset produced colorings that were substantially
different from the discovery set:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for MetS dataset."}
numero.plot(results = stats.mets, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
While the relative patterns appear somewhat similar, the district averages
reveal that the SOM of the discovery set does not replicate well. The
clearest sign of the poor fit is biological: the lowest serum creatinine
district averages in the discovery set are within the range that you would
expect to see in healthy adults (60 - 110 µmol/L); however, the district
averages for the MetS dataset are higher in those map areas, which may
indicate a substantial difference in kidney health.
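As a numeric counterpart of this observation, the ranges of the creatinine district averages can be compared between the two sets of results (using the $planes element as elsewhere in this document; the values are in the units of the original CREAT column):
``` {r eval=FALSE}
# Range of creatinine district averages in the discovery and MetS results.
rbind(discovery = range(stats.discov$planes[,"CREAT"], na.rm = TRUE),
      mets = range(stats.mets$planes[,"CREAT"], na.rm = TRUE))
```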
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
In previous chapters, the subgrouping was done based on the training
variables only so that the other traits could be used as methodologically
independent data. Here, the discovery set enables us to define the subgroups
based on all variables (if we so choose) since we did not use the replication
datasets for training. For this example, a good option is to use clinically
interesting variables for the subgrouping:
``` {r}
# Selection of clinically interesting variables.
clinvars <- c("uALB", "AGE", "DIAB_KIDNEY", "DIAB_RETINO",
"MACROVASC", "DECEASED")
```
As mentioned earlier, the Numero framework includes an interactive tool to
conduct the subgrouping based on SOM colorings, but the vignette does not
allow user interaction so we use the automatic feature instead:
``` {r}
# Automatic subgrouping based on the training set.
subgr.discov <- numero.subgroup(results = stats.discov,
variables = clinvars, automatic = TRUE)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure."}
# Plot results from subgrouping procedure.
numero.plot(results = stats.discov, variables = clinvars,
topology = subgr.discov, subplot = c(2,3))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
In the last step, the subgroups are compared using conventional statistics.
We use the dataset-specific map statistics to characterize the
subgroups in the discovery and replication sets:
``` {r eval=FALSE}
# Compare subgroups.
report.discov <- numero.summary(results = stats.discov,
topology = subgr.discov)
report.replicA <- numero.summary(results = stats.replicA,
topology = subgr.discov)
report.replicB <- numero.summary(results = stats.replicB,
topology = subgr.discov)
report.mets <- numero.summary(results = stats.mets,
topology = subgr.discov)
```
``` {r echo=FALSE, results="hide"}
suppressWarnings({
report.discov <- numero.summary(results = stats.discov,
topology = subgr.discov)
report.replicA <- numero.summary(results = stats.replicA,
topology = subgr.discov)
report.replicB <- numero.summary(results = stats.replicB,
topology = subgr.discov)
report.mets <- numero.summary(results = stats.mets,
topology = subgr.discov)})
```
The mortality rates in the discovery set are as expected, with the
highest mortality observed in the districts that had the highest
kidney disease prevalence:
``` {r}
# Show results for mortality rate in the discovery set.
rows <- which(report.discov$VARIABLE == "DECEASED")
report.discov[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The same pattern is also observable for replication A; however, the
subgroup sizes are poorly balanced and the statistics are consequently
weaker:
``` {r}
# Show results for mortality rate in replication A.
rows <- which(report.replicA$VARIABLE == "DECEASED")
report.replicA[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The outcomes in the discovery set are replicated well in dataset B,
and the subgroups are also well balanced in size:
``` {r}
# Show results for mortality rate in replication B.
rows <- which(report.replicB$VARIABLE == "DECEASED")
report.replicB[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The mortality outcomes for the metabolic syndrome dataset are
not statistically significant. This is partly explained by the smaller
size of the dataset (less statistical power), but the overall mortality
risk was also already higher due to the biased sampling; therefore, direct
comparisons with the discovery set are problematic.
``` {r}
# Show results for mortality rate in metabolic syndrome subset.
rows <- which(report.mets$VARIABLE == "DECEASED")
report.mets[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
In all studies, we recommend careful verification of the design and the
validity of assumptions using traditional statistical tests, SOM quality
measures, SOM colorings and also the final subgroup comparisons.
---
###########################################################################
---
# Large datasets and maps
The Numero package is also suitable for analyzing larger datasets (>100k
rows of data). Given the increase in statistical power, the map radius
can be safely increased, but this usually causes difficulties in plotting
the colorings on screen due to the inefficient way the default R graphics
are implemented. In particular, it may not be feasible to do the
interactive part using the standard R plotting.
It is possible to use web browsers for a better user experience but, to view
the map colorings, the plots need to be saved in an HTML file first. Here,
we are using the results from the basic example that was introduced in the
beginning of the document:
``` {r eval=FALSE}
# Save map colorings of training variables.
numero.plot(results = stats.basic, variables = trvars,
folder = "/tmp/Results")
```
``` {r echo=FALSE}
s <- paste("\n",
"*** numero.plot ***\n",
"Thu Sep 5 15:44:02 2019\n",
"\n",
"Resources:\n",
"5 column(s) included\n",
"destination folder '/tmp/Results'\n",
"\n",
"Figure 1:\n",
"5 subplot(s)\n",
"file name '/tmp/Results/figure01.svg'\n",
"97622 bytes saved in '/tmp/Results/figure01.svg'\n",
"99982 bytes saved in '/tmp/Results/figure01.html'\n",
"\n",
"Summary:\n",
"1 figure(s) -> '/tmp/Results'\n", sep="")
cat(s)
```
You can replace the folder with something suitable on your computer. Here we
used the temporary folder of a Linux system as an example. The command
creates both a Scalable Vector Graphics (SVG) file that you can open in
editors such as Inkscape, and a web page (HTML) that you can open in
a browser. Importantly, the web page is interactive, and you can select
regions of the SOM and click on empty space between plots to save
the topology and regions in a tab-delimited text file.
Assuming we assigned districts to different regions and downloaded the
results as the file 'Downloads/regions.txt', we can then import the
spreadsheet into R by typing
``` {r eval=FALSE}
# Import topology and region assignments.
subgr <- read.delim(file = "Downloads/regions.txt", stringsAsFactors=FALSE)
```
and calculate subgroup statistics with
``` {r eval=FALSE}
# Calculate subgroup statistics.
report <- numero.summary(results = stats.basic, topology = subgr)
```
as before.
**Useful hints**: If it takes a long time to train a SOM for a large dataset,
try setting the `subsample` parameter to something less than the number of
rows in the training set. It may also take a long time to evaluate the map
statistics; to reduce waiting time, you can split the data columns into smaller
batches and then run each batch on a different processor in parallel, as
sketched below.
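A hedged sketch of both ideas follows; `subsample` is documented on the `numero.create` help page, whereas the column batching is a generic R pattern built around `parallel::mclapply`, not a Numero feature. The objects `trdat.big` and `bigdata` are hypothetical placeholders for a large preprocessed training set and the corresponding full data frame:
``` {r eval=FALSE}
# Sketch: train on a subsample of rows (see ?numero.create for details).
modl.big <- numero.create(data = trdat.big, subsample = 20000)
# Sketch: evaluate map statistics in column batches on multiple cores.
library(parallel)
batches <- split(colnames(bigdata), cut(seq_along(colnames(bigdata)), 4))
stats.list <- mclapply(batches, function(vars) {
    numero.evaluate(model = modl.big, data = bigdata[, vars, drop = FALSE])
}, mc.cores = 4)   # mclapply uses forking; on Windows, use parLapply instead
```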
---
###########################################################################
---
# Terminology
**Best-matching centroid**: The BMC is the district centroid that is the most
similar to a data point. The concept is closely related to the data point
layout: the location of a data point on the map is determined by finding the
BMC for that data point.
**Coloring**: The Numero framework always creates a single map model. However, the map districts can be painted with different colors. This enables the user to create multiple colorings of the map to visualize regional differences. These colorings can be made for each variable, which helps to identify which parts of the map are particularly important for a specific phenomenon. Again, this is similar to a real city map where the districts are colored according to the income level of the local residents, or according to the mean age, smoking rates, obesity etc.
**Data point**: We define the term data point as a single uniquely identifiable row in a spreadsheet of data (with variables as columns). For instance, in the diabetic kidney disease dataset in the basic pipeline example, a data point refers to a patient (and vice versa) as there is only one row per patient. On the
other hand, if a patient dataset included multiple examination visits, and the
visits were organized on separate rows, a patient could be linked to multiple
data points.
**District**: A district refers to a pre-defined division of the map into uniformly sized areas. The districts are created mainly for technical reasons: using districts speeds up calculations and enables the estimation of map-related statistics. This is analogous to a real city being divided into districts to estimate regional demographics, for instance.
**District centroid**: The SOM algorithm works through the districts during the optimization of the data point layout on the map. The computational process eventually converges to a stable configuration that is stored as a set of district centroids. From a practical point of view, a district centroid represents the typical averaged profile that captures the characteristics of the data points within the district. In technical terms, the district centroid (also known as the prototype) contains the mean weighted data values across all the data points, where the weights are determined by the neighborhood function used in the SOM algorithm.
**Layout**: We make a distinction between what is a map, and what is the layout of data points on it. The layout is a vector of data point locations as coordinates, whereas the map is a more integrated concept that also includes information that is necessary to find the locations of new previously unseen data points, and to draw and paint the map in visual form.
**Map**: A map is a general term to describe the two-dimensional canvas onto which the multivariable data points are projected. The concept is analogous to a geographic map that indicates where people live, except that the location is not based on geography (i.e. physical distances), but comes from the data (i.e. distances = data-based similarities).
**Subgroup**: We expect that most uses of Numero will result in the subgrouping of a complex dataset. Visually, we define a subgroup via a contiguous set of adjacent districts on the map. Consequently, all the data points that are located within the set of districts are the subgroup members. Of note, our concept of a subgroup is different from subgroup discovery in data mining; we use the term as a replacement to the word 'cluster' to emphasize the lack of intrinsic structure in the original data space.
# Build information
```{r echo=FALSE}
sessionInfo()
Sys.time()
```