Olink® Analyze Vignette

Olink DS team

2024-11-22

Olink® Analyze is an R package that provides a versatile toolbox for fast and easy handling of Olink® NPX data in proteomics research. It includes functions for importing Olink® NPX datasets, quality control (QC) plot functions and functions for various statistical tests. The package is meant to provide a convenient pipeline for Olink NPX data analysis.

Installation

You can install Olink® Analyze from CRAN.

install.packages("OlinkAnalyze")

List of functions

Preprocessing

Statistical analysis

Visualization

Sample datasets

Usage

Load the library

# Load OlinkAnalyze
library(OlinkAnalyze)

# Load other libraries used in Vignette
library(dplyr)
library(ggplot2)
library(stringr)

Preprocessing

Read NPX data (read_NPX)

The read_NPX function imports an NPX file into a tidy format to work with in R. No prior alterations to the NPX output file should be made for this function to work as expected.

Function arguments

  • filename: Path to the NPX output file.
data <- read_NPX("~/NPX_file_location.xlsx")

Function output

A tibble in long format containing:

  • SampleID: Sample names or IDs.
  • Index: Unique number for each SampleID. It is used to compensate for non-unique sample IDs.
  • OlinkID: Unique ID for each assay assigned by Olink. If an assay is included in more than one panel, it has a different OlinkID in each one.
  • UniProt: UniProt ID.
  • Assay: Common gene name for the assay.
  • MissingFreq: Missing frequency for the OlinkID, i.e. frequency of samples with NPX value below limit of detection (LOD).
  • Panel: Olink Panel that samples ran on. Read more about Olink Panels here: https://olink.com/products/compare
  • Panel_Version: Version of the panel. A new panel version might include some different or improved assays.
  • PlateID: Name of the plate.
  • QC_Warning: Indication whether the sample passed Olink QC. More information about Olink quality control metrics can be found in our FAQ (Search “Quality control”).
  • LOD: Limit of detection (LOD) is the minimum level of an individual protein that can be measured. LOD is defined as 3 times the standard deviation over background.
  • NPX: Normalized Protein eXpression is Olink’s unit of protein expression level on a log2 scale. The majority of the functions in this package use NPX values for calculations. Read more about NPX in the Olink FAQ (Search “What is NPX?”) or in Olink’s Data normalization and standardization white paper.
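
As a quick check after import, the columns listed above can be inspected directly. The sketch below is only an illustration: it uses the npx_data1 sample dataset bundled with the package as a stand-in for the output of read_NPX (an assumption made so the example runs without an NPX file), and the object names npx and panel_summary are hypothetical.

```r
library(OlinkAnalyze)
library(dplyr)

# npx_data1 is a sample dataset shipped with OlinkAnalyze; here it stands in
# for the tibble returned by read_NPX() (assumption for illustration only).
npx <- npx_data1

# Column names and types of the long-format table
glimpse(npx)

# Number of distinct samples and assays per panel
panel_summary <- npx %>%
  group_by(Panel) %>%
  summarise(n_samples = n_distinct(SampleID),
            n_assays  = n_distinct(OlinkID),
            .groups   = "drop")
panel_summary
```

Because the data are in long format, each row is one sample and assay combination, so counting distinct SampleID and OlinkID values is usually more informative than counting rows.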

Read multiple NPX data files (read_NPX)

In order to import multiple NPX data files at once, the read_NPX function can be used in combination with the list.files, lapply and dplyr::bind_rows functions, as seen below. The pattern argument of the list.files function specifies the NPX file format (.csv, .parquet, or either). This method requires that all NPX files are stored in the same folder and have identical column names. No prior alterations to the NPX output file should be made for this method to work as expected.

# Read in multiple NPX files in .csv format
data <- list.files(path = "path/to/dir/with/NPX/files",
                   pattern = "csv$",
                   full.names = TRUE) |>
  lapply(FUN = function(x) {
    OlinkAnalyze::read_NPX(x) |>
      dplyr::mutate(File = x) # Optionally add a column identifying the source file
  }) |>
  dplyr::bind_rows() # Optional: return a single data frame of all files instead of a list of data frames

# Read in multiple NPX files in .parquet format
data <-  list.files(path = "path/to/dir/with/NPX/files",
                    pattern = "parquet$",
                    full.names = TRUE) |>
         lapply(OlinkAnalyze::read_NPX)  |>
         dplyr::bind_rows()

# Read in multiple NPX files in either format
data <-  list.files(path = "path/to/dir/with/NPX/files",  
                    pattern = "parquet$|csv$",          
                    full.names = TRUE) |>
         lapply(OlinkAnalyze::read_NPX)  |>
         dplyr::bind_rows()

Statistical analysis

Post-hoc ANOVA analysis (olink_anova_posthoc)

olink_anova_posthoc performs a post-hoc ANOVA test using the function emmeans from the R library emmeans with Tukey p-value adjustment per assay (by OlinkID) at confidence level 0.95.

The function handles both factor and numerical variables and/or covariates. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation (SD) difference in the numerical variable, e.g. mean NPX at mean (numerical variable) versus mean NPX at mean (numerical variable) + 1*SD (numerical variable).
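
To make the 1 SD comparison concrete, here is a minimal base R sketch on simulated data. It is not the package's internal code: the variable age, the data frame toy and the simulated effect sizes are all hypothetical, and an ordinary linear model is used in place of the ANOVA machinery.

```r
# Simulated data (hypothetical): NPX depends linearly on a numeric variable "age"
set.seed(1)
age <- rnorm(50, mean = 60, sd = 10)
toy <- data.frame(age = age,
                  NPX = 2 + 0.05 * age + rnorm(50, sd = 0.3))

fit <- lm(NPX ~ age, data = toy)

# Reference points: mean(age) and mean(age) + 1 * SD(age)
ref <- data.frame(age = c(mean(toy$age), mean(toy$age) + sd(toy$age)))
pred <- predict(fit, newdata = ref)

# Difference in predicted mean NPX for a 1 SD increase in age
diff_1sd <- unname(pred[2] - pred[1])

# For a linear model this equals slope * SD(age)
all.equal(diff_1sd, unname(coef(fit)["age"]) * sd(toy$age))
```

The last line shows why the comparison is convenient: for a numerical variable the reported difference is the fitted slope rescaled to one standard deviation of that variable.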

Control samples and control assays (AssayType is not “assay”, or Assay contains “control” or “ctrl”) should be removed before using this function.
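
One possible way to remove controls from the bundled npx_data1 sample dataset is sketched below. It assumes that control samples can be recognised by "control" or "ctrl" in SampleID, which is an assumption about the sample naming rather than a rule stated above. Data sets that contain an AssayType column could filter on that column instead.

```r
library(OlinkAnalyze)
library(dplyr)
library(stringr)

npx_data1_no_controls <- npx_data1 %>%
  # Drop control samples (assumes they are labelled in SampleID)
  filter(!str_detect(SampleID, regex("control|ctrl", ignore_case = TRUE))) %>%
  # Drop control assays, identified by name
  filter(!str_detect(Assay, regex("control|ctrl", ignore_case = TRUE)))

# Rows before and after filtering
c(before = nrow(npx_data1), after = nrow(npx_data1_no_controls))
```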

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
  • covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted, but crossed analysis is not inferred from main effects.
  • outcome: Name of the column from df that contains the dependent variable. Default: NPX.
  • effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
  • mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
  • verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.
# npx_data1_no_controls: npx_data1 with control samples and control assays removed
# calculate the p-value for the ANOVA
anova_results_oneway <- olink_anova(df = npx_data1_no_controls, 
                                    variable = 'Site')
# extracting the significant proteins
anova_results_oneway_significant <- anova_results_oneway %>%
  filter(Threshold == 'Significant') %>%
  pull(OlinkID)
anova_posthoc_oneway_results <- olink_anova_posthoc(df = npx_data1_no_controls,
                                                    olinkid_list = anova_results_oneway_significant,
                                                    variable = 'Site',
                                                    effect = 'Site')

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels in contrast.
  • conf.low <dbl>: Lower bound of the confidence interval for the estimate.
  • conf.high <dbl>: Upper bound of the confidence interval for the estimate.
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Tukey).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).
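
The output columns can be used to narrow the results down to the contrasts of interest. The following sketch runs the post-hoc test for a single assay from the npx_data1 sample data and keeps only significant contrasts. The control-removal step assumes controls are labelled with "control" or "ctrl" in SampleID or Assay, and the object names posthoc_one and significant_contrasts are hypothetical.

```r
library(OlinkAnalyze)
library(dplyr)
library(stringr)

# Remove control samples and control assays
# (assumes they are labelled with "control" or "ctrl")
npx_no_controls <- npx_data1 %>%
  filter(!str_detect(SampleID, regex("control|ctrl", ignore_case = TRUE)),
         !str_detect(Assay, regex("control|ctrl", ignore_case = TRUE)))

# Post-hoc test on Site for one assay
posthoc_one <- olink_anova_posthoc(df = npx_no_controls,
                                   olinkid_list = "OID01216",
                                   variable = "Site",
                                   effect = "Site",
                                   verbose = FALSE)

# Keep significant contrasts and the columns describing them
significant_contrasts <- posthoc_one %>%
  filter(Threshold == "Significant") %>%
  select(Assay, contrast, estimate, conf.low, conf.high, Adjusted_pval)
significant_contrasts
```

The result may have zero rows if no pair of sites differs for the chosen assay.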

Post-hoc linear mixed effects model analysis (olink_lmer_posthoc)

The olink_lmer_posthoc function is similar to olink_lmer but performs a post-hoc analysis based on a linear mixed effects model, using the function lmer from the R library lmerTest and the function emmeans from the R library emmeans. The function handles both factor and numerical variables and/or covariates. Differences in estimated marginal means are calculated for all pairwise comparisons between levels of a given variable. Degrees of freedom are estimated using Satterthwaite’s approximation. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation (SD) difference in the numerical variable, e.g. mean NPX at mean(numerical variable) versus mean NPX at mean(numerical variable) + 1*SD(numerical variable). The output tibble is arranged by ascending adjusted p-values.
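
For a single assay, the kind of model described above can be sketched by calling lmerTest and emmeans directly. This is only an illustration under stated assumptions and not necessarily what olink_lmer_posthoc does internally: the model formula, the choice of assay, the control-sample filter on SampleID and the object names are all choices made for this example, and the lme4, lmerTest and emmeans packages must be installed.

```r
library(OlinkAnalyze)
library(dplyr)
library(stringr)

# One assay from the bundled sample data, control samples removed
# (assumption: control samples are labelled in SampleID)
one_assay <- npx_data1 %>%
  filter(OlinkID == "OID01216",
         !str_detect(SampleID, regex("control|ctrl", ignore_case = TRUE))) %>%
  mutate(across(c(Site, Treatment, Subject), as.factor))

# Linear mixed effects model: crossed fixed effects, random intercept per subject
fit <- lmerTest::lmer(NPX ~ Site * Treatment + (1 | Subject), data = one_assay)

# Estimated marginal means and pairwise contrasts for Treatment,
# with Satterthwaite degrees of freedom
emm <- emmeans::emmeans(fit, specs = pairwise ~ Treatment,
                        lmer.df = "satterthwaite")

# Contrasts with confidence intervals and p-values
contrast_tbl <- summary(emm$contrasts, infer = c(TRUE, TRUE))
contrast_tbl
```

Each row of contrast_tbl is one pairwise comparison between levels of Treatment, with its estimate, degrees of freedom, confidence limits and p-value.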

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel, 1-2 variables with at least 2 levels, and subject ID.
  • variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
  • outcome: Name of the column from df that contains the dependent variable. Default: NPX.
  • random: Single character value or character array with random effects.
  • covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted, but crossed analysis is not inferred from main effects.
  • mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
  • verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.
if (requireNamespace("lme4", quietly = TRUE) && requireNamespace("lmerTest", quietly = TRUE)){
  # Linear mixed model with two variables.
  lmer_results_twoway <- olink_lmer(df = npx_data1, 
                                    variable = c('Site', 'Treatment'),
                                    random = 'Subject')
  # extracting the significant proteins
  lmer_results_twoway_significant <- lmer_results_twoway %>%
    filter(Threshold == 'Significant', term == 'Treatment') %>%
    pull(OlinkID)
  # performing post-hoc analysis
  lmer_posthoc_twoway_results <- olink_lmer_posthoc(df = npx_data1,
                                                    olinkid_list = lmer_results_twoway_significant,
                                                    variable = c('Site', 'Treatment'),
                                                    random = 'Subject',
                                                    effect = 'Treatment') 
}

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels in contrast.
  • conf.low <dbl>: Lower bound of the confidence interval for the estimate.
  • conf.high <dbl>: Upper bound of the confidence interval for the estimate.
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Benjamini & Hochberg).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).

Additional Statistical Tests

Many other statistical functions can be found within Olink Analyze, including:

To learn more about these functions, consult their help documentation using the help() function.

Exploratory analysis

Visualization

Theming function (set_plot_theme)

This function sets a coherent plot theme for plots by adding it to a ggplot object. It is mainly used for aesthetic reasons.

npx_data1 %>% 
  filter(OlinkID == 'OID01216') %>% 
  ggplot(aes(x = Treatment, y = NPX, fill = Treatment)) +
  geom_boxplot() +
  set_plot_theme()