---
title: Introduction to Numero
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
**Author:** Ville-Petteri Mäkinen
**Co-authors:** Song Gao, Stefan Mutter, Aaron E Casey
**Version:** `r packageVersion("Numero")`
**Abstract:** In textbook examples, multivariable datasets are clustered into
distinct subgroups that can be clearly identified by a set of optimal
mathematical criteria. However, many real-world datasets arise from
synergistic consequences of multiple effects, noisy and partly redundant
measurements, and may represent a continuous spectrum of the different phases
of a phenomenon. In medicine, complex diseases associated with ageing are
typical examples. We postulate that population-based biomedical datasets (and
many other real-world examples) do not contain an intrinsic clustered
structure that would give rise to mathematically well-defined subgroups. From
a modeling point of view, the lack of intrinsic structure means that the data
points inhabit a contiguous cloud in high-dimensional space without abrupt
changes in density to indicate subgroup boundaries; hence, mathematical
criteria cannot segment the cloud reliably by its internal structure. Yet we
need data-driven classification and subgrouping to aid decision-making and to
facilitate the development of testable hypotheses. For this reason, we
developed the Numero package, a more flexible and transparent process that
allows human observers to create usable multivariable subgroups even when
conventional clustering frameworks struggle.
**Citation:** Gao S, Mutter S, Casey AE, Mäkinen V-P (2018) Numero: a
statistical framework to define multivariable subgroups in complex
population-based datasets, Int J Epidemiology,
https://doi.org/10.1093/ije/dyy113
***
``` {r echo=FALSE, results="hide"}
# Save space on screen printouts.
options(digits = 3)
```
***
---
###########################################################################
---
# Getting started {.tabset}
## Installation
You can install Numero using the standard procedure:
``` {r eval=FALSE}
# Install the package from a remote repository.
install.packages("Numero")
```
The functions come in two flavors: the ones starting with `numero` provide
high level pipeline components that will serve most users, whereas the `nro`
functions perform specific tasks and provide a more granular interface to
the package. In this introductory document, we will only use the `numero`
group of functions.
To activate and list the library functions, type
``` {r}
# Activate the library.
library("Numero")
packageVersion("Numero")
ls("package:Numero")
```
Each Numero function comes with a help page that contains a code example,
such as
``` {r eval=FALSE}
# Access function documentation (not shown in vignette).
? numero.create
```
You can run all the code examples in one go by typing
``` {r eval=FALSE}
# Run all code examples (not shown in vignette).
fn <- system.file("extcode", "examples.R", package = "Numero")
source(fn)
```
Please use the tabs at the top to navigate to the next page.
## Dataset
In the examples, we use data on diabetic kidney disease from a previous
publication (see the readme file below). While simplified, the data contain
enough information to replicate some of the findings from the original study.
The main clinical characteristics are summarized in Table 1, and you can
print the readme file on the screen by typing
```{r eval=FALSE}
# Show readme file on screen (not shown in vignette).
fn <- system.file("extdata", "finndiane.readme.txt", package = "Numero")
cat(readChar(fn, 1e5))
```
Table: Summary of the FinnDiane dataset. The mean and
standard deviation are reported for each continuous variable. Abbreviations:
urinary albumin excretion rate (uALB), triglycerides (TG), high density
lipoprotein subclass 2 (HDL2). P-values were estimated by the t-test for
continuous variables and by Fisher’s test for binary traits.
Trait | No kidney disease | Diabetic kidney disease | P-value
---------------------------------- | ----------------- | ----------------------- | --------
Men / Women | 192 / 196 | 119 / 106 | 0.45
Age (years) | 38.8 ± 12.2 | 41.7 ± 9.7 | 0.0012
Type 1 diabetes duration (years) | 25.3 ± 10.3 | 28.6 ± 7.8 | <0.001
Log10 uALB (mg/24h) | 1.20 ± 0.51 | 2.72 ± 0.59 | <0.001
Log10 TG (mmol/L) | 0.034 ± 0.201 | 0.159 ± 0.212 | <0.001
Total cholesterol (mmol/L) | 4.89 ± 0.77 | 5.35 ± 0.96 | <0.001
HDL2 cholesterol (mmol/L) | 0.54 ± 0.16 | 0.51 ± 0.18 | 0.027
Log10 serum creatinine (µmol/L) | 1.94 ± 0.09 | 2.14 ± 0.24 | <0.001
Metabolic syndrome | 90 (23.2%) | 114 (50.7%) | <0.001
Macrovascular disease | 16 (4.1%) | 38 (16.9%) | <0.001
Diabetic retinopathy | 133 (34.4%) | 178 (79.1%) | <0.001
Died during follow-up | 13 (3.4%) | 43 (19.1%) | <0.001
The data are included in the package as a tab-delimited file:
``` {r}
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname, stringsAsFactors = FALSE)
nrow(dataset)
colnames(dataset)
```
Please use the tabs at the top to navigate to the next page.
## Data integrity
Most datasets contain missing entries, duplicated rows and other unusable
features; it is therefore important to carefully inspect all data points
and their identification. In this case, the biochemical and clinical
information about individual participants is organized as rows and each
row is identified by the variable INDEX. To create a data frame where the
row names are set to INDEX, type
``` {r}
# Manage unusable entries and data identification.
dataset <- numero.clean(data = dataset, identity = "INDEX")
```
Almost all functions within Numero work only on numeric values, so if the
dataset contains factors or other categorical data types, please convert them
to integer labels before analyses.
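As a minimal sketch (not needed for the FinnDiane data, which are already fully numeric), categorical columns could be converted to integer labels with base R:
``` {r eval=FALSE}
# Sketch: convert any character or factor columns to integer labels.
for(v in colnames(dataset)) {
    if(!is.numeric(dataset[,v]))
        dataset[,v] <- as.integer(factor(dataset[,v]))
}
```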
---
###########################################################################
---
# Basic pipeline {.tabset}
## Prepare
Before any analysis, we recommend using `numero.clean` to check that
the data are in a usable format (see the previous section on data integrity).
A typical usage case of the Numero framework involves a set of training
variables that are considered to explain medical or other outcomes. Here,
we use the biochemical data as the training variables under the general
hypothesis that the metabolic profile (as measured by biochemistry)
predicts adverse clinical events. The kidney disease dataset includes five
biochemical measures:
``` {r}
# Select training variables.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
```
Centering and scaling are basic machine learning techniques that ensure
variables with large numerical values (due to the choice of measurement unit)
do not bias the results. To create a standardized training set, type
``` {r results="hide"}
# Center and scale the training data.
trdat.basic <- scale.default(dataset[,trvars])
```
All the training variables now have unit variance, so we can expect their
impact on the SOM to depend only on their information content and
statistical associations:
``` {r}
# Calculate standard deviations.
apply(trdat.basic, 2, sd, na.rm = TRUE)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Create
Conceptually, the SOM algorithm is analogous to a human observer who wants to
make sense of a set of objects. Suppose there is a pile of portrait photos of
ordinary people on a round table, and your task is to organize them on the
table in such a way that similar photos are placed next to each other and, by
consequence, dissimilar photos end up on the opposite sides of the table.
During the organization process, your brain interprets the multi-variable
visual data (age, gender, ethnicity, hair style, head shape, etc.) and
projects it onto two dimensions (vertical and horizontal positions).
The SOM algorithm, also known as Kohonen map, mimics the organization process
and was originally inspired by the plasticity of neuronal networks. As this
is a self-organizing process, the starting point may have an impact on the
outcomes. We have added an extra initialization step to ensure repeatable
results. The function `numero.create` applies K-means clustering to create
an initial map model, and then refines the model using the SOM algorithm:
``` {r}
# Create a new self-organizing map based on the training set.
modl.basic <- numero.create(data = trdat.basic)
summary(modl.basic)
```
The resulting data structure contains the SOM itself and additional
information that is needed for further statistical analyses; please see
the manual pages for details.
The output from `numero.create` contains the positions of the data points
on a virtual round "table"; we use the term *map* instead of table, and refer
to the positions as the *layout* of a dataset on the SOM. The results also
contain an internal model of the training dataset; the model comprises a set
of *district centroids* that summarize the salient features of the training
dataset; please see the section on terminology for further descriptions.
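If you want to see how these pieces are organized in the returned object, a quick structural inspection is usually enough (the element names reflect the package version used here and may differ in later versions):
``` {r eval=FALSE}
# Inspect the structure of the model object (output not shown in vignette).
str(modl.basic, max.level = 2)
```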
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
After the training is complete, it is prudent to examine the SOM for
potential problems with the data. The Numero package provides three
different quality tools:
**Histogram:** The distribution of data points on the map can be expressed as
a histogram that shows the counts of data points in the districts.
If any areas of the SOM are devoid of data points, or disproportionately
populated by them, it usually reflects unusual or discontinuous value
distributions within the data (it can also be caused by poorly chosen
preprocessing methods). The histogram is rarely flat, and up to two-fold differences
between the least and most populated districts are common, but if the pattern
is more extreme, careful examination of the distributions of the input data
in relation to the preprocessing procedures is warranted.
**Coverage:** The SOM is intrinsically robust against missing values if the
missingness pattern is close to random. We define data coverage as the
mean proportion of usable values per multivariable data point. If the coverage
drops substantially in certain districts compared to others, it is possible
that the data point layout has been biased by a non-random pattern of missing
values.
**Model fit:** The location of each data point on the map depends on how
similar it is to district centroids; the district with the most similar
centroid is chosen as the optimal location. The similarity of a centroid and
a data point can be quantified using mathematical formulas, such as the
Euclidean distance. The Numero package provides two measures. The first is
the residual distance $d$ between a data point and its best-matching centroid.
The second is independent of scale and is calculated as
$z = (d - d_{mean}) / d_{sd}$, where $d_{mean}$ and $d_{sd}$ are the mean
and standard deviation of the residuals in the training set.
``` {r}
# Calculate map quality measures for the training data.
qc.basic <- numero.quality(model = modl.basic)
```
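As a rough consistency check of the formula above, the standardized residuals can be recomputed directly from the layout, and the histogram fold difference mentioned earlier can be inspected from the district-level planes; the values may differ slightly from the package output if Numero derives the reference statistics differently:
``` {r eval=FALSE}
# Approximate the z-scores from the raw residuals (illustration only).
d <- qc.basic$layout$RESIDUAL
z.approx <- (d - mean(d, na.rm = TRUE))/sd(d, na.rm = TRUE)
summary(z.approx - qc.basic$layout$RESIDUAL.z)
# Fold difference between the most and least populated districts.
h <- qc.basic$planes[,"HISTOGRAM"]
max(h, na.rm = TRUE)/min(h, na.rm = TRUE)
```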
Outliers can be detected by plotting the distribution of quality measures:
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.basic$layout$RESIDUAL, breaks = 50,
main = NULL, xlab = "RESIDUAL", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
hist(x = qc.basic$layout$RESIDUAL.z, breaks = 50,
main = NULL, xlab = "RESIDUAL.z", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
```
In this example, the distribution of residuals is heavily skewed and there
are a few data points that are several standard deviations from the mean
(RESIDUAL.z). The skewness is often due to skewed distributions in the data;
urinary albumin and serum creatinine are highly skewed towards larger values.
The Numero framework includes automatic procedures based on log-transforms
or rank-based preprocessing to mitigate skewness. We will explore these in
separate examples later in the document.
The quality measures can be visualized on the map to reveal subgroups of
problematic data points:
``` {r results="hide", fig.width=9, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts. "}
# Plot map quality measures.
numero.plot(results = qc.basic, subplot = c(1,4))
```
While coverage is uniform across the map (there were only a few missing
values), the model fit is noticeably worse on the top part of the map
(higher RESIDUAL). The poor model fit is also reflected in the sample
histogram that shows a substantial depletion of data points in the top
part of the map. In practice, these structural issues do not necessarily
present a problem (as we see in the next section), but they need to be
considered when interpreting the significance of map patterns.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
Organizing a pile of photos on a table is achievable for a human observer
-- recall the photo analogy from the Create section -- but a human would
struggle with rows of numerical data, especially when there are many data
points. Also, it is of interest to examine each variable one at a
time (e.g. if there is a map region associated with high cholesterol), or
to see if the outcomes of interest (e.g. mortality) accumulate on a
particular section of the SOM.
Since we consider the SOM as a map, we can apply concepts from geography to
overcome these challenges. Just as a city map can be colored according to
the average regional house prices, we can color the SOM according to the
average regional cholesterol concentrations or mortality rates.
Before the SOM is visualized, however, we need to estimate descriptive
statistics so that the colors can be calibrated according to the strength
of the statistical evidence.
Any finite dataset will produce regional variation on the SOM. For example,
it is reasonable to expect that parts of the map will differ with respect to
average mortality simply by chance. But how do we know whether the observed
regional variation exceeds the random expectation? The Numero package
contains a permutation engine that provides the answer:
``` {r}
# Map statistics for the whole dataset.
stats.basic <- numero.evaluate(model = qc.basic, data = dataset)
summary(stats.basic)
```
The output contains data frames for the data point layout, district averages
for each variable (i.e. component planes), and z-scores and p-values for
the observed amount of regional variation. Of note, the function omits
P-values for columns that overlap with the training data as it does not make
sense to estimate statistical significance for the same data that were used
for constructing the model.
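For example, the district averages (component planes) and the permutation statistics can be inspected directly; the element names below are the same as those used later in this document:
``` {r eval=FALSE}
# Peek at district averages and permutation statistics (not shown in vignette).
head(stats.basic$planes[,trvars])
head(stats.basic$statistics)
```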
Overall, the results above comprise the minimal information needed to color
the SOM according to district averages.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random. The numbers show the average values in original units for selected districts."}
# Plot map colorings of training variables.
numero.plot(results = stats.basic, variables = trvars, subplot = c(2,3))
```
As expected, all the training variables (including CHOL) produce strong
colors, since they directly determined the data point layout.
On the other hand, we did not use
age or sex as training variables, yet both show significant regional variation:
``` {r}
stats.basic$statistics[c("CHOL","MALE","AGE","T1D_DURAT"),
c("TRAINING","Z","P.z")]
```
These signals indicate that the biochemical profiles are associated
with age and sex, which is not a surprise considering the effects of ageing
and sex dimorphism in human populations. Of note, P-values are only reported
for the non-biochemical variables that were not used for training.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
To recap, we hypothesized that the metabolic profile indicates and/or predicts
adverse health outcomes. For this reason, we chose to use only the biochemical
data to construct the SOM, which allows us to use the clinical traits as
validation outcomes from a modelling perspective. The next step is to divide
the dataset into a limited number of subgroups that summarize the typical
metabolic profiles we can expect to see in the data.
It is important to use only the training variables in this step so that we
remain agnostic to the clinical traits and preserve the original unsupervised
study design.
The Numero framework includes an interactive tool to conduct the subgrouping
based on SOM colorings. Unfortunately, the R vignette format is not compatible
with the interactive procedure, therefore we will rely on the automatic
subgrouping feature to proceed with the example:
``` {r}
# Automatic subgrouping based on the training set.
subgr.basic <- numero.subgroup(results = stats.basic,
variables = trvars, automatic = TRUE)
```
It is possible to use the interactive feature to tweak the automatic result
by passing the current subgrouping as the map topology:
``` {r eval=FALSE}
# Interactive subgrouping.
subgr.basic <- numero.subgroup(results = stats.basic,
variables = trvars, topology = subgr.basic)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure."}
# Plot results from the subgrouping procedure.
numero.plot(results = stats.basic, variables = trvars,
topology = subgr.basic, subplot = c(2,3))
```
It is evident from the figure above that the automatic subgrouping could be
improved to better cover the most important patterns on the map. Serum creatinine and urinary
albumin are biomarkers of kidney disease, so they are highly concordant,
whereas cholesterol and triglycerides show patterns that do not fit well
with the kidney biomarkers. Therefore, additional subgroups and manual
adjustments of the district labels may be the preferred course of action
if this were an actual research project.
With the interactive tool, users can choose their own subgroups based on
the SOM patterns. This is the greatest strength of the pipeline: multiple
people will be able to participate in the process of defining the subgroups,
and different subgroups can be created for different settings. For example,
cardiovascular disease is the most common cause of death in people with type
1 diabetes, and therefore a subgrouping that takes cholesterol into account
may be warranted.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
We omitted the clinical end-points intentionally from the previous analyses to
minimize their influence on the subgrouping step, as required by the
study design. Since we have now defined the subgroups, it is safe to plot the
colorings for all the variables. The subplot option was used here to ensure a
pleasing look for the vignette, but it can be omitted in most circumstances.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of selected variables in the kidney disease dataset."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.basic,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = stats.basic$map$topology, subplot = c(2,3))
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables with subgroup labels."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.basic,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = subgr.basic, subplot = c(2,3))
```
The subgroup with the highest urinary albumin overlaps with the
clinical classification of diabetic kidney disease (top-right coloring).
This is expected, since albuminuria is one of the key diagnostic markers
for kidney disease. Accordingly, the same subgroup overlaps with high
prevalence of retinopathy (bottom-left) and high mortality rate
(bottom-right).
In research papers, it is often easier to report conventional comparison
statistics between the subgroups you have defined rather than the SOM
visualizations (however, the colorings should always be included as
supplements). The Numero package includes the function `numero.summary()`
that combines the visual subgroupings with the original data:
``` {r}
# Compare subgroups.
report.basic <- numero.summary(results = stats.basic, topology = subgr.basic)
colnames(report.basic)
```
For instance, the report contains the statistical signals for mortality
across the subgroups:
``` {r}
# Show results for mortality rate.
rows <- which(report.basic$VARIABLE == "DECEASED")
report.basic[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The first row of the output shows the chi-squared test for the full contingency
table. The largest subgroup is always chosen as the reference, hence
P = 1 for the subgroup with the most members. The subgroup that captured the
highest urinary albumin values should show the greatest mortality rate.
Lastly, it is often useful to identify the members of these subgroups.
The output contains that information as attributes; please see the
documentation on `numero.summary()` and `nroSummary()` for details.
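As a minimal sketch, the attached attributes can be listed with base R; the exact names and their contents are described on those manual pages:
``` {r eval=FALSE}
# List the attributes attached to the subgroup report (not shown in vignette).
names(attributes(report.basic))
```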
---
###########################################################################
---
# Preprocessing and adjustment {.tabset}
## Prepare
The introductory analyses revealed that both sex and age were associated with
the metabolic profiles via the SOM. Moreover, men and women display anatomical
and metabolic differences, which usually complicate the interpretation of the
SOM. For this reason, we recommend using a sex-specific standardization
procedure that eliminates the differences. If necessary, separate
visualizations can be made afterwards for men and women using the same map.
Ageing is another factor that is often considered a confounder, especially
within cross-sectional study designs. For this reason, we will reduce
the co-variation with AGE and T1D_DURAT in the training set prior to
creating the SOM.
To prepare the training data, we can use the tool provided with the
Numero package:
``` {r}
# Mitigate stratification and confounding factors.
trdata.adj <- numero.prepare(data = dataset, variables = trvars, batch = "MALE",
confounders = c("AGE", "T1D_DURAT"))
colnames(trdata.adj)
```
In addition to adjusting for the confounders, the function also applies
the logarithm to variables that are highly skewed. While binary data can
be used for training, we recommend using continuous variables only for
building a SOM since they are easier to adjust for confounders and less
prone to numerical instability.
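One way to see the effect is to compare the skewness of a strongly skewed variable, such as urinary albumin, before and after preparation. The sketch below uses a simple moment-based skewness estimate rather than any Numero function:
``` {r eval=FALSE}
# Simple skewness estimate (third standardized moment).
skew <- function(x) {
    x <- x[is.finite(x)]
    mean((x - mean(x))^3)/sd(x)^3
}
c(original = skew(dataset[,"uALB"]), prepared = skew(trdata.adj[,"uALB"]))
```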
We also need to know which identities belong to men or women so that
the sexes can be separated in later analysis steps. As a by-product of
the batch correction for MALE, the identities of batch members are included
in the output according to the values in the MALE column
(1 = man, 0 = woman):
``` {r}
subsets <- attr(trdata.adj, "subsets")
women <- subsets[["0"]]
men <- subsets[["1"]]
c(length(men), length(women))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Create
The SOM is created in the same way as in the basic pipeline, but this time
using the sex-adjusted training variables:
``` {r}
# Create a new self-organizing map based on sex-adjusted data.
modl.adj <- numero.create(data = trdata.adj)
summary(modl.adj)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
After the training is complete, the structure of the SOM can be inspected
as we did in the previous example:
``` {r}
# Calculate map quality measures for sex-adjusted data.
qc.adj <- numero.quality(model = modl.adj)
```
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals. The training data were adjusted for age and sex."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.adj$layout$RESIDUAL, breaks = 20,
main = NULL, xlab = "RESIDUAL", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
hist(x = qc.adj$layout$RESIDUAL.z, breaks = 20,
main = NULL, xlab = "RESIDUAL.z", ylab = "Number of data points",
col = "#FFEFA0", cex = 0.8)
```
The preprocessing resulted in a substantial improvement of the residual
distributions. Although the residuals are still skewed, the most extreme
residuals have been removed:
``` {r}
# Maximum residuals from unadjusted and adjusted analyses.
c(max(qc.basic$layout$RESIDUAL.z, na.rm = TRUE),
max(qc.adj$layout$RESIDUAL.z, na.rm = TRUE))
```
Furthermore, the spread of data points across the map is now more
balanced
``` {r}
# Variation in point density in unadjusted and adjusted analyses.
c(sd(qc.basic$planes[,"HISTOGRAM"], na.rm = TRUE),
sd(qc.adj$planes[,"HISTOGRAM"], na.rm = TRUE))
```
and the depletion of data points in the poor quality area is attenuated:
``` {r results="hide", fig.width=9, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of data point quality measures across map districts from age and sex adjusted analysis."}
# Plot map quality measures.
numero.plot(results = qc.adj, subplot = c(1,4))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
We can estimate the map statistics using the whole dataset as before, except
that the model is now age and sex adjusted. Note how we use the original
unadjusted data as input `data = dataset` but the evaluation
is done via the map layout created from adjusted data:
``` {r}
# Map statistics for the whole dataset.
stats.adj <- numero.evaluate(model = qc.adj, data = dataset)
```
Furthermore, it is also possible to conduct sex-specific analyses now that we
have identified men and women in the dataset:
``` {r results="hide"}
# Map statistics for women.
stats.adjW <- numero.evaluate(model = qc.adj, data = dataset[women,])
```
``` {r results="hide"}
# Map statistics for men.
stats.adjM <- numero.evaluate(model = qc.adj, data = dataset[men,])
```
We could have used `data = trdata.adj` instead of `data = dataset`;
however, describing the sex-adjusted subgroups with the original unadjusted
data is easier to interpret in relation to what you might see when
meeting the participants in the real world.
If the adjustments of the training data are successful, we expect
to see non-significant P-values for MALE and AGE when using the original
data:
``` {r}
stats.adj$statistics[c("MALE","AGE","T1D_DURAT"), c("Z","P.z")]
```
Indeed, the regional variation in the proportion of men to women is now
within the range of random sampling fluctuation. That said, the sex
adjustment only affects the individuals' positions on the map; the
well-known sex differences in the average biomarker concentrations are
not affected:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from female participants."}
numero.plot(results = stats.adjW, variables = trvars, subplot = c(2,3))
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from male participants."}
numero.plot(results = stats.adjM, variables = trvars, subplot = c(2,3))
```
The color patterns look very similar between the sexes, but the regional
averages are systematically different. For instance, HDL2C is higher in
women across all areas of the map, even though the patterns themselves
are almost indistinguishable.
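A quick numeric check of this shift is to compare the district averages between the sex-specific results; the sketch assumes the $planes elements of the two result sets are aligned district by district, which holds because both were evaluated on the same map:
``` {r eval=FALSE}
# Difference in HDL2C district averages between women and men.
summary(stats.adjW$planes[,"HDL2C"] - stats.adjM$planes[,"HDL2C"])
```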
As we estimated the statistics separately for men and women, the color
scales are likely to differ, which may make it more difficult to compare
the data patterns. Another option is to calibrate the colors between the
sexes against a common reference, that is, the same shade of blue will
always correspond to the same concentration of cholesterol, for example,
regardless of which subset of individuals is being visualized.
This is accomplished by using the results of the full dataset as the
reference point `reference = stats.adj`:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from female participants. Color scales were derived from the full dataset."}
numero.plot(results = stats.adjW, variables = trvars,
subplot = c(2,3), reference = stats.adj)
```
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables using the data from male participants. Color scales were derived from the full dataset."}
numero.plot(results = stats.adjM, variables = trvars,
subplot = c(2,3), reference = stats.adj)
```
It is now easier to see from the colors that women tend to have higher
HDL2C (bright red areas on the right side and less blue on the left side),
whereas they tend to have lower serum creatinine (more blue for women
compared to men).
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
To recap, we followed the procedures in the basic pipeline, except that the
training data were adjusted for age and sex differences, and we also analyzed
men and women separately. The subgrouping step proceeds as before.
The Numero framework includes an interactive tool to conduct the
subgrouping based on SOM colorings
``` {r eval=FALSE}
# Interactive subgrouping based on the training set.
subgr.adj <- numero.subgroup(results = stats.adj, variables = trvars)
```
As mentioned previously, the vignette format is not compatible with the
interactive tool; therefore, we use the automatic subgrouping
feature to demonstrate this part of the pipeline.
``` {r}
# Automatic subgrouping based on the training set.
subgr.adj <- numero.subgroup(results = stats.adj,
variables = trvars, automatic = TRUE)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure. The map was created from sex-adjusted data."}
# Plot results from subgrouping procedure.
numero.plot(results = stats.adj, variables = trvars,
topology = subgr.adj, subplot = c(2,3))
```
Although the sex adjustment had only a minor effect on the map colorings, we
can observe small changes in how the subgroups settled on the map.
As was discussed earlier, the subgroups could be made more reasonable, which
highlights the importance of being able to set the subgroup labels yourself
for the most useful or parsimonious results.
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
In the last step, the subgroups are compared using conventional statistics
as we did earlier. It is always useful to plot the subgroup labels in
relation to the variables that were not used for training:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables with subgroup labels. The map was created from age and sex adjusted data."}
# Plot results from subgrouping procedure for non-biochemical variables.
numero.plot(results = stats.adj,
variables = c("AGE",
"MALE",
"DIAB_KIDNEY",
"DIAB_RETINO",
"MACROVASC",
"DECEASED"),
topology = subgr.adj, subplot = c(2,3))
```
The effect of the sex adjustment on the variable MALE is immediately
apparent: the regional variation in sex ratio is so close to the random
expectation that the color scale has been squashed flat. In the unadjusted
model, the difference between the predominantly male and the
predominantly female districts was substantial, whereas
here the gap has been reduced:
``` {r}
# District averages from unadjusted analysis.
summary(stats.basic$planes[,"MALE"])
```
``` {r}
# District averages from adjusted analysis.
summary(stats.adj$planes[,"MALE"])
```
The results for mortality have changed as a consequence of the changes in
subgrouping, but not substantially:
``` {r}
# Compare subgroups.
report.adj <- numero.summary(results = stats.adj, topology = subgr.adj)
```
``` {r}
# Show results for mortality rate.
rows <- which(report.adj$VARIABLE == "DECEASED")
report.adj[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
Again, the statistical signal is mostly coming from the high albuminuria
subgroup. It is important to note that while the preprocessing affected
the map quality and the results in general, the overall interpretations
of the results did not change dramatically.
---
###########################################################################
---
# Multiple datasets {.tabset}
## Prepare
Men and women are biologically different, and also have different
biochemical profiles as we observed earlier. In this example, we
use the sexes as if they were two different epidemiological cohorts.
Indeed, differences in data collection between epidemiological studies
lead to differences in the mean concentrations of biomarkers. In this
respect, the male-female comparison is an illustrative and easy
proxy for discussing multi-cohort applications of the SOM.
A typical multi-cohort setting involves a discovery cohort and one or
more replication cohorts with additional meta-analyses across all
available data. We will focus on two aspects: 1) defining subgroups
in the discovery cohort and 2) assigning participants from the replication
cohort into these subgroups. For the meta-analysis part, established
statistical methods can be applied after the subgroups have been defined
(not discussed here).
We also use this example as an opportunity to show how rank-based
preprocessing of the data affects the quality measures by setting
`method = "tapered"`. The SOM framework includes two rank-based methods:
"uniform" for normalized ranks between -1 and +1 and "tapered" for a version
that puts more samples around zero and less around -1 or +1.
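The sketch below illustrates the general idea with simulated values; it is a conceptual toy example, not the exact transforms implemented in the package:
``` {r eval=FALSE}
# Toy illustration of uniform versus tapered ranks (not the package's formulas).
x <- rlnorm(1000)                        # skewed input values
u <- 2*(rank(x) - 0.5)/length(x) - 1     # uniform ranks in [-1, 1]
tap <- sign(u)*u^2                       # tapered variant: more mass near zero
par(mfrow = c(1,2))
hist(u, breaks = 30, main = "uniform ranks")
hist(tap, breaks = 30, main = "tapered ranks (toy)")
```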
To facilitate more intuitive discussion, we refer to the sexes as two
different cohorts from now on. To simulate incompatible datasets, we
add a dataset of men with the metabolic syndrome as a third cohort:
``` {r}
ds.discov <- dataset[women,]
ds.replic <- dataset[men,]
ds.mets <- ds.replic[which(ds.replic[,"METAB_SYNDR"] == 1),]
```
We create four preprocessed datasets (one for discovery/training and three
for replication) to demonstrate how data can be used across cohorts. Summary
statistics are shown for urinary albumin as it is the most important
clinical biomarker:
``` {r results="hide"}
# Discovery cohort and training set.
trdata.discov <- numero.prepare(data = ds.discov, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.discov[,"uALB"])
```
``` {r results="hide"}
# Replication cohort, version A.
param <- attr(trdata.discov, "pipeline")
trdata.replicA <- numero.prepare(data = ds.replic, pipeline = param)
```
``` {r}
summary(trdata.replicA[,"uALB"])
```
The first replication dataset A is analogous to a situation where two
cohorts/batches are compared without batch correction, that is, we assume
that there is no measurement or other bias between the descriptive cohort
statistics. This is the safe starting point, since the quality control
will reveal if the assumption is violated, and we can adjust the second
round of analyses accordingly.
``` {r results="hide"}
# Replication cohort, version B.
trdata.replicB <- numero.prepare(data = ds.replic, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.replicB[,"uALB"])
```
The second dataset B is adjusted for discrepancies in average values.
We still assume that both cohorts represent the same underlying population
that is sampled the same way. This situation is similar to having two
separate birth cohorts of approximately the same age from the same broader
population, but with biochemical data measured in different labs.
``` {r results="hide"}
# Replication cohort, MetS version.
trdata.mets <- numero.prepare(data = ds.mets, variables = trvars,
confounders = c("AGE", "T1D_DURAT"),
method = "tapered")
```
``` {r}
summary(trdata.mets[,"uALB"])
```
The third dataset reflects a situation where the study cohorts have been
recruited differently or are otherwise demographically incompatible.
This represents a challenge to the subgrouping analysis and can lead to
erroneous conclusions. Bias of this type is difficult to distinguish from
measurement bias, and therefore needs to be accounted for by re-sampling
or calibration techniques before using the SOM framework.
Please use the tabs at the top to navigate to the next page.
## Create
The SOM is created the usual way and trained with the discovery dataset,
except that we set the map radius to the same value as before to make it
easier to compare results across examples:
``` {r}
# Create a new self-organizing map based on the discovery dataset.
radius.basic <- attr(modl.basic$map$topology, "radius")
modl.discov <- numero.create(data = trdata.discov, radius = radius.basic)
summary(modl.discov)
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Quality
In this section, we inspect the structure of the SOM for the discovery cohort
and the versions of the replication cohort. Notice how the model is the same,
but new data are used for the replication cohorts:
``` {r results="hide"}
# Calculate map quality measures.
qc.discov <- numero.quality(model = modl.discov)
qc.replicA <- numero.quality(model = modl.discov, data = trdata.replicA)
qc.replicB <- numero.quality(model = modl.discov, data = trdata.replicB)
qc.mets <- numero.quality(model = modl.discov, data = trdata.mets)
```
The training data were transformed into ranks during the preprocessing step.
In many cases, this is the safest approach since it mitigates the effect of
outliers or highly skewed distributions. On the other hand, the rank transform
discards the original differences in magnitudes between data points, which
may be undesirable in some circumstances.
In this example, the rank preprocessing resulted in less skewed residuals
compared to the earlier examples:
``` {r}
# Define comparable histogram bins.
rz <- c(qc.adj$layout[,"RESIDUAL.z"], qc.discov$layout[,"RESIDUAL.z"])
rz.breaks <- seq(min(rz, na.rm = TRUE), max(rz, na.rm = TRUE), length.out=20)
```
``` {r results="hide", fig.width=7, fig.height=3, fig.align="center", fig.cap="Figure: Distribution of model residuals when the training data were preprocessed by scaling & centering or by tapered ranking."}
# Plot frequencies of data points at different quality levels.
par(mar = c(5,4,1,0), mfrow = c(1,2))
hist(x = qc.adj$layout[,"RESIDUAL.z"], breaks = rz.breaks,
main = NULL, xlab = "RESIDUAL.z (scale & center)",
ylab = "Number of data points", col = "#FFEFA0", cex = 0.8)
hist(x = qc.discov$layout[,"RESIDUAL.z"], breaks = rz.breaks,
main = NULL, xlab = "RESIDUAL.z (rank)",
ylab = "Number of data points", col = "#FFEFA0", cex = 0.8)
```
Residuals for the replication datasets are expected to be higher than those
for the training set (a model always fits its own training data best).
Nevertheless, all the rank-transformed replication sets had lower maximum
residuals than the previous centered and scaled training set:
``` {r}
# Comparison of maximum residuals across examples.
r <- c(max(qc.basic$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.adj$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.discov$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.replicA$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.replicB$layout[,"RESIDUAL.z"], na.rm = TRUE),
max(qc.mets$layout[,"RESIDUAL.z"], na.rm = TRUE))
names(r) <- c("basic", "adj", "discov", "replicA", "replicB", "mets")
print(r)
```
More balanced data point density is another potential benefit of the
rank-transform:
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the discovery cohort. "}
# Plot map quality measures.
numero.plot(results = qc.discov, subplot = c(1,4))
```
On the other hand, the centered and scaled analysis in the previous chapter
revealed a connection between weak model fit and high burden of disease;
this aspect of the data was lost due to the rank transform. The concentric
pattern of good fit at the center and weaker fit on the edges (see RESIDUAL)
is often seen when using rank-based preprocessing, which suggests that the
pattern should not be interpreted as a characteristic of any
particular dataset.
Comparisons between the replication sets revealed important clues on how
well the data fit together:
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset A. "}
# Plot map quality measures.
numero.plot(results = qc.replicA, subplot = c(1,4))
```
For replication set A, the data point histogram is very uneven, with the
majority of data points clustered in one quadrant of the map. To recap,
replication A was preprocessed using the same parameters as the discovery
set under the assumption that the two datasets were samples from the same
underlying population with identical measurement protocols. Evidently, this
assumption is wrong since the multi-variable data did not overlap well.
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset B. "}
# Plot map quality measures.
numero.plot(results = qc.replicB, subplot = c(1,4))
```
Replication B was assumed to come from the same population, but with
potential bias in the measured values. Because the replication data were
preprocessed independently of the discovery set, resulting in identical
standardized distributions, the data point histogram and residuals are
similar to those of the discovery set. This does not mean that the initial
assumption is correct,
but it is possible to investigate demographic factors such as age, sex
or clinical disease profiles to check that the overall characteristics are
compatible across the different regions on the map.
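A simple example of such a check is to compare the age distributions of the two cohorts with a conventional test; a thorough check would cover the other demographic and clinical variables as well:
``` {r eval=FALSE}
# Example compatibility check: compare age between the cohorts.
t.test(ds.discov[,"AGE"], ds.replic[,"AGE"])
```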
``` {r results="hide", fig.width=10, fig.height=3, fig.align="center", fig.cap="Figure: Visualization of quality measures across map districts for the replication dataset with the metabolic syndrome. "}
# Plot map quality measures.
numero.plot(results = qc.mets, subplot = c(1,4))
```
Lastly, the metabolic syndrome subset was created as an example of biased
sampling, where the replication is not compatible with the discovery
dataset. However, this is hard to detect from the quality plots. As mentioned
above, independent preprocessing will reproduce an even data point
histogram and residuals (in most settings); therefore, the quality measures
alone cannot identify problems arising from this type of flaw in the
study design.
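Traditional descriptive statistics, however, can expose this kind of sampling bias even when the quality maps look clean. For instance, outcome prevalences can be compared between the full replication cohort and its metabolic syndrome subset:
``` {r eval=FALSE}
# Percentage of deceased participants in the two replication datasets.
round(100*c(replic = mean(ds.replic[,"DECEASED"], na.rm = TRUE),
            mets = mean(ds.mets[,"DECEASED"], na.rm = TRUE)), 1)
```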
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Statistics
To create colorings of the SOM for the different datasets, we combine
several features of the Numero pipeline. The first task is to identify
where the data points go on the map. For the discovery cohort (that
includes the training set), this information is contained in the model
itself. However, for the other cohorts, the district assignments
are obtained as part of the quality control; the output from
`numero.quality()` contains the layout of data points that were used
for calculating the histograms and residuals in the previous section.
The estimation of map statistics proceeds as before, except that the
quality control results are used for the replication datasets:
``` {r results="hide"}
# Map statistics for discovery and replication datasets.
stats.discov <- numero.evaluate(model = modl.discov, data = ds.discov)
stats.replicA <- numero.evaluate(model = qc.replicA, data = ds.replic)
stats.replicB <- numero.evaluate(model = qc.replicB, data = ds.replic)
stats.mets <- numero.evaluate(model = qc.mets, data = ds.mets)
```
The second task is to choose the coloring scheme. In the previous chapter,
we experimented with sex-specific colorings and here we will apply the same
technique of a common reference `reference = stats.discov`. This will
ensure that colors between the datasets are directly comparable, that is,
the same data value will always correspond to the same color.
In addition, we reduce the "gain" or global amplification of the color range.
This has the effect of making the available value-color mapping wider so that
values that exceed the reference range wash out less:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of training variables for the discovery dataset."}
numero.plot(results = stats.discov, variables = trvars,
gain = 0.8, subplot = c(2,3))
```
Notice how the strongest reds and blues are not used for the discovery set
due to the reduced (<1.0) gain. This means that they remain available
for distinguishing more extreme values on the map.
The results for replication A are very similar to the discovery set in both
the color patterns and the district averages. Yet we know from the
quality analysis that the two datasets did not overlap well on the SOM and
the assumptions on study design were invalid -- this is one of the reasons
it is important to check the quality of the SOM.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for replication A."}
numero.plot(results = stats.replicA, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
Replication B reproduces the patterns of the discovery set in relative
terms (areas of lower and higher concentrations match), but the absolute
values are different. This is expected since we assumed that the data
patterns would be the same but there was systematic bias in the measurement
data.
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for replication B."}
numero.plot(results = stats.replicB, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
The metabolic syndrome dataset produced colorings that were substantially
different from the discovery set:
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Colorings of selected variables for MetS dataset."}
numero.plot(results = stats.mets, variables = trvars,
gain = 0.8, subplot = c(2,3), reference = stats.discov)
```
While the relative patterns appear somewhat similar, the district averages
reveal that the SOM of the discovery set does not replicate well. The
clearest sign of the poor fit is biological: the lowest serum creatinine
district averages in the discovery set are within the range that you would
expect to see in healthy adults (60 - 110 µmol/L); however, the district
averages for the MetS dataset are higher in those map areas, which may
indicate a substantial difference in kidney health.
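As a numeric counterpart of this observation, the ranges of the creatinine district averages can be compared between the two sets of results (using the $planes element as elsewhere in this document; the values are in the units of the original CREAT column):
``` {r eval=FALSE}
# Range of creatinine district averages in the discovery and MetS results.
rbind(discovery = range(stats.discov$planes[,"CREAT"], na.rm = TRUE),
      mets = range(stats.mets$planes[,"CREAT"], na.rm = TRUE))
```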
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Subgroups
In previous chapters, the subgrouping was done based on the training
variables only so that the other traits could be used as methodologically
independent data. Here, the discovery set enables us to define the subgroups
based on all variables (if we so choose) since we did not use the replication
datasets for training. For this example, a good option is to use clinically
interesting variables for the subgrouping:
``` {r}
# Selection of clinically interesting variables.
clinvars <- c("uALB", "AGE", "DIAB_KIDNEY", "DIAB_RETINO",
"MACROVASC", "DECEASED")
```
As mentioned earlier, the Numero framework includes an interactive tool to
conduct the subgrouping based on SOM colorings, but the vignette does not
allow user interaction so we use the automatic feature instead:
``` {r}
# Automatic subgrouping based on the training set.
subgr.discov <- numero.subgroup(results = stats.discov,
variables = clinvars, automatic = TRUE)
```
The colorings can also be saved as an interactive web page, which is
useful when the standard R plotting gets too slow or unusable due to multiple
colorings or large map size (see the chapter on large datasets and maps).
``` {r results="hide", fig.width=6, fig.height=4, fig.align="center", fig.cap="Figure: Statistically normalized colorings of the training variables in the kidney disease dataset. The labels show the results from the subgrouping procedure."}
# Plot results from subgrouping procedure.
numero.plot(results = stats.discov, variables = clinvars,
topology = subgr.discov, subplot = c(2,3))
```
Please use the tabs at the top to navigate to the next page.
---
#---------------------------------------------------------------------------
---
## Compare
In the last step, the subgroups are compared using conventional statistics.
We use the dataset-specific map statistics to characterize the
subgroups in the discovery and replication sets:
``` {r eval=FALSE}
# Compare subgroups.
report.discov <- numero.summary(results = stats.discov,
topology = subgr.discov)
report.replicA <- numero.summary(results = stats.replicA,
topology = subgr.discov)
report.replicB <- numero.summary(results = stats.replicB,
topology = subgr.discov)
report.mets <- numero.summary(results = stats.mets,
topology = subgr.discov)
```
``` {r echo=FALSE, results="hide"}
suppressWarnings({
report.discov <- numero.summary(results = stats.discov,
topology = subgr.discov)
report.replicA <- numero.summary(results = stats.replicA,
topology = subgr.discov)
report.replicB <- numero.summary(results = stats.replicB,
topology = subgr.discov)
report.mets <- numero.summary(results = stats.mets,
topology = subgr.discov)})
```
The mortality rates in the discovery set are as expected, with the
highest mortality observed in the districts that had the highest
kidney disease prevalence:
``` {r}
# Show results for mortality rate in the discovery set.
rows <- which(report.discov$VARIABLE == "DECEASED")
report.discov[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The same pattern is also observable for replication A; however, the
subgroup sizes are poorly balanced and the statistics are consequently
weaker:
``` {r}
# Show results for mortality rate in replication A.
rows <- which(report.replicA$VARIABLE == "DECEASED")
report.replicA[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The outcomes in the discovery set are replicated well in dataset B,
and the subgroups are also well balanced in size:
``` {r}
# Show results for mortality rate in replication B.
rows <- which(report.replicB$VARIABLE == "DECEASED")
report.replicB[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
The mortality outcomes for the metabolic syndrome dataset are
not statistically significant. This is partly explained by the smaller
size of the dataset (less statistical power), but the overall mortality
risk was also already higher due to the biased sampling; therefore, direct
comparisons with the discovery set are problematic.
``` {r}
# Show results for mortality rate in metabolic syndrome subset.
rows <- which(report.mets$VARIABLE == "DECEASED")
report.mets[rows,c("SUBGROUP","N","MEAN","P.chisq")]
```
In all studies, we recommend careful verification of the design and the
validity of assumptions using traditional statistical tests, SOM quality
measures, SOM colorings and also the final subgroup comparisons.
---
###########################################################################
---
# Large datasets and maps
The Numero package is also suitable for analyzing larger datasets (>100k
rows of data). Given the increase in statistical power, the map radius
can be safely increased, but this usually causes difficulties in plotting
the colorings on screen due to the inefficient way the default R graphics
are implemented. In particular, it may not be feasible to do the
interactive part using the standard R plotting.
It is possible to use web browsers for a better user experience but, to view
the map colorings, the plots need to be saved in an HTML file first. Here,
we are using the results from the basic example that was introduced in the
beginning of the document:
``` {r eval=FALSE}
# Save map colorings of training variables.
numero.plot(results = stats.basic, variables = trvars,
folder = "/tmp/Results")
```
``` {r echo=FALSE}
s <- paste("\n",
"*** numero.plot ***\n",
"Thu Sep 5 15:44:02 2019\n",
"\n",
"Resources:\n",
"5 column(s) included\n",
"destination folder '/tmp/Results'\n",
"\n",
"Figure 1:\n",
"5 subplot(s)\n",
"file name '/tmp/Results/figure01.svg'\n",
"97622 bytes saved in '/tmp/Results/figure01.svg'\n",
"99982 bytes saved in '/tmp/Results/figure01.html'\n",
"\n",
"Summary:\n",
"1 figure(s) -> '/tmp/Results'\n", sep="")
cat(s)
```
You can replace the folder with something suitable on your computer. Here we
used the temporary folder of a Linux system as an example. The command
creates both a Scalable Vector Graphics (SVG) file that you can open in
editors such as Inkscape, and a web page (HTML) that you can open in
a browser. Importantly, the web page is interactive, and you can select
regions of the SOM and click on empty space between plots to save
the topology and regions in a tab-delimited text file.
Assuming we assigned districts to different regions and downloaded the
results as the file 'Downloads/regions.txt', we can then import the
spreadsheet into R by typing
``` {r eval=FALSE}
# Import topology and region assignments.
subgr <- read.delim(file = "Downloads/regions.txt", stringsAsFactors=FALSE)
```
and calculate subgroup statistics with
``` {r eval=FALSE}
# Calculate subgroup statistics.
report <- numero.summary(results = stats.basic, topology = subgr)
```
as before.
**Useful hints**: If it takes a long time to train a SOM for a large dataset,
try setting the `subsample` parameter to something less than the number of
rows in the training set. It may also take a long time to evaluate the map
statistics; to reduce waiting time, you can split the data columns into smaller
batches and then run each batch on a different processor in parallel, as
sketched below.
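A hedged sketch of both ideas follows; `subsample` is documented on the `numero.create` help page, whereas the column batching is a generic R pattern built around `parallel::mclapply`, not a Numero feature. The objects `trdat.big` and `bigdata` are hypothetical placeholders for a large preprocessed training set and the corresponding full data frame:
``` {r eval=FALSE}
# Sketch: train on a subsample of rows (see ?numero.create for details).
modl.big <- numero.create(data = trdat.big, subsample = 20000)
# Sketch: evaluate map statistics in column batches on multiple cores.
library(parallel)
batches <- split(colnames(bigdata), cut(seq_along(colnames(bigdata)), 4))
stats.list <- mclapply(batches, function(vars) {
    numero.evaluate(model = modl.big, data = bigdata[, vars, drop = FALSE])
}, mc.cores = 4)   # mclapply uses forking; on Windows, use parLapply instead
```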
---
###########################################################################
---
# Terminology
**Best-matching centroid**: The BMC is the district centroid that is the most
similar to a data point. The concept is closely related to the data point
layout: the location of a data point on the map is determined by finding the
BMC for that data point.
**Coloring**: The Numero framework always creates a single map model. However, the map districts can be painted with different colors. This enables the user to create multiple colorings of the map to visualize regional differences. These colorings can be made for each variable, which helps to identify which parts of the map are particularly important for a specific phenomenon. Again, this is similar to a real city map where the districts are colored according to the income level of the local residents, or according to the mean age, smoking rates, obesity etc.
**Data point**: We define the term data point as a single uniquely identifiable row in a spreadsheet of data (with variables as columns). For instance, in the diabetic kidney disease dataset in the basic pipeline example, a data point refers to a patient (and vice versa) as there is only one row per patient. On the
other hand, if a patient dataset included multiple examination visits, and the
visits were organized on separate rows, a patient could be linked to multiple
data points.
**District**: A district refers to a pre-defined division of the map into uniformly sized areas. The districts are created mainly for technical reasons: using districts speeds up calculations and enables the estimation of map-related statistics. This is analogous to a real city being divided into districts to estimate regional demographics, for instance.
**District centroid**: The SOM algorithm works through the districts during the optimization of the data point layout on the map. The computational process eventually converges to a stable configuration that is stored as a set of district centroids. From a practical point of view, a district centroid represents the typical averaged profile that captures the characteristics of the data points within the district. In technical terms, the district centroid (also known as the prototype) contains the mean weighted data values across all the data points, where the weights are determined by the neighborhood function used in the SOM algorithm.
**Layout**: We make a distinction between what is a map, and what is the layout of data points on it. The layout is a vector of data point locations as coordinates, whereas the map is a more integrated concept that also includes information that is necessary to find the locations of new previously unseen data points, and to draw and paint the map in visual form.
**Map**: A map is a general term to describe the two-dimensional canvas onto which the multivariable data points are projected. The concept is analogous to a geographic map that indicates where people live, except that the location is not based on geography (i.e. physical distances), but comes from the data (i.e. distances = data-based similarities).
**Subgroup**: We expect that most uses of Numero will result in the subgrouping of a complex dataset. Visually, we define a subgroup via a contiguous set of adjacent districts on the map. Consequently, all the data points that are located within the set of districts are the subgroup members. Of note, our concept of a subgroup is different from subgroup discovery in data mining; we use the term as a replacement to the word 'cluster' to emphasize the lack of intrinsic structure in the original data space.
# Build information
```{r echo=FALSE}
sessionInfo()
Sys.time()
```