---
title: "transformations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{transformations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(deident)
```

# Transformations

Out of the box, `deident` features a set of transformations to aid in the de-identification of data sets. Each transformation is implemented as an `R6Class` and extends `BaseDeident`. User-defined transformations can be implemented in a similar manner.

To demonstrate the different transformations we supply a toy data set, `df`, comprising 26 observations of three variables:

* A: character, a to z
* B: numeric, 1 to 26
* C: character, `X` if `B <= 13`, `Y` if `B > 13`

``` {r, include=F}
df <- data.frame(
  A = letters,
  B = 1:26,
  C = sort(rep(c("X", "Y"), 13))
)
df
```

## Pseudonymizer

Apply a cached random replacement cipher. Each re-occurrence of the same key receives the same replacement value.

Implemented `deident` options:

``` {r, eval=F}
deident(df, "psudonymize", A)
deident(df, "Pseudonymizer", A)
deident(df, Pseudonymizer, A)
deident(df, Pseudonymizer$new(), A)

psu <- Pseudonymizer$new()
deident(df, psu, A)
```

### Options

By default `Pseudonymizer` replaces values in variables with a random alpha-numeric string of 5 characters. This can be changed by calling `set_method` on an instantiated `Pseudonymizer` with the desired function:

``` {r}
psu <- Pseudonymizer$new()

# Replacement method: a random string of 12 lower-case letters
new_method <- function(key, ...){
  paste(sample(letters, 12, replace = TRUE), collapse = "")
}

psu$set_method(new_method)

deident(df, psu, A)
```

The first argument to the method receives the key to be transformed.

## Shuffler

Randomly shuffle the values of a variable.

Implemented `deident` options:

``` {r, eval=F}
deident(df, "shuffle", A)
deident(df, "Shuffler", A)
deident(df, Shuffler, A)
deident(df, Shuffler$new(), A)

shuffle <- Shuffler$new()
deident(df, shuffle, A)
```

## Encrypter

Apply cryptographic hashing to a variable.

Implemented `deident` options:

``` {r, eval=F}
deident(df, "encrypt", A)
deident(df, "Encrypter", A)
deident(df, Encrypter, A)
deident(df, Encrypter$new(), A)

encrypt <- Encrypter$new()
deident(df, encrypt, A)
```

### Options

At initialization, `Encrypter` can be given `hash_key` and `seed` values to control the hashing. Users are advised to set these values and keep them private.

``` {r}
encrypt <- Encrypter$new(hash_key="deident_hash_key_123", seed=202)
deident(df, encrypt, A)
```

## Perturber

Apply Gaussian white noise to a numeric variable.

Implemented `deident` options:

``` {r, eval=F}
# B is the numeric column of df; A is character and cannot be perturbed
deident(df, "perturb", B)
deident(df, "Perturber", B)
deident(df, Perturber, B)
deident(df, Perturber$new(), B)

perturb <- Perturber$new()
deident(df, perturb, B)
```

### Options

At initialization, `Perturber` can be given a scale for the white noise via the `sd` argument.

``` {r}
# perturb <- Perturber$new(noise=adaptive_noise(0.2))
# deident(df, perturb, B)
```

## Blurer

Aggregate categorical values according to a user-supplied list. The list must be supplied to `Blurer` at initialization.

Implemented `deident` options:

``` {r, eval=F}
# Named vector mapping each letter to its aggregated category
letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

blur <- Blurer$new(blur = letter_blur)
deident(df, blur, A)
```

## NumericBlurer

Aggregate numeric values according to a user-supplied vector of breaks (cuts). If no vector is supplied, `NumericBlurer` defaults to a binary classification around 0.

Implemented `deident` options:

``` {r, eval=F}
deident(df, "numeric_blur", B)
deident(df, "NumericBlurer", B)
deident(df, NumericBlurer, B)
deident(df, NumericBlurer$new(), B)

numeric_blur <- NumericBlurer$new()
deident(df, numeric_blur, B)
```

### Options

At initialization, `NumericBlurer` takes an argument `cuts` to define the limits of each interval.

``` {r}
numeric_blur <- NumericBlurer$new(cuts=c(5, 10, 15, 20))
deident(df, numeric_blur, B)
```

## GroupedShuffler

Apply `Shuffler` to a data set after first grouping the data on one or more columns. The grouping must be defined at initialization.

Implemented `deident` options:

``` {r, eval=F}
grouped_shuffle <- GroupedShuffler$new(C)
deident(df, grouped_shuffle, B)
```

### Options

At initialization, `GroupedShuffler` takes an argument `limit`: if any subgroup has fewer than `limit` observations, all of its values are dropped.

``` {r}
grouped_shuffle <- GroupedShuffler$new(C, limit=1)
deident(df, grouped_shuffle, B)
```

## Drop

Define a column to be removed from the pipeline.

Implemented `deident` options:

``` {r, eval=F}
deident(df, Drop, B)

drop <- deident:::Drop$new()
deident(df, drop, B)
```
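
## Sketch: the underlying ideas in base R

To build intuition for what some of the transformations above do, the following sketch mimics three of them (cached pseudonymization, numeric blurring and grouped shuffling) on the toy data set using only base R. It is an illustration under stated assumptions, not the `deident` implementation: the helper names `pseudonymize_sketch`, `numeric_blur_sketch` and `grouped_shuffle_sketch` are hypothetical and not part of the package, and the interval labels produced by `cut` may differ from those `NumericBlurer` returns.

```r
# Base R illustration only; none of these helpers exist in deident.
set.seed(1)  # fixed seed so the sketch is repeatable

df <- data.frame(
  A = letters,
  B = 1:26,
  C = sort(rep(c("X", "Y"), 13))
)

# 1. Cached pseudonymization: every distinct key is mapped to one random
#    5-character alphanumeric string; repeated keys reuse the cached string.
pseudonymize_sketch <- function(keys) {
  keys <- as.character(keys)
  pool <- c(letters, LETTERS, 0:9)
  lookup <- character(0)  # named vector acting as the cache
  for (key in unique(keys)) {
    repeat {
      candidate <- paste(sample(pool, 5, replace = TRUE), collapse = "")
      if (!candidate %in% lookup) break  # avoid reusing a replacement
    }
    lookup[key] <- candidate
  }
  unname(lookup[keys])
}

# 2. Numeric blur: replace each number with the interval it falls into.
numeric_blur_sketch <- function(x, cuts) {
  as.character(cut(x, breaks = c(-Inf, cuts, Inf)))
}

# 3. Grouped shuffle: permute values within each group, never across groups.
grouped_shuffle_sketch <- function(x, group) {
  ave(x, group, FUN = function(v) v[sample.int(length(v))])
}

out <- data.frame(
  A = pseudonymize_sketch(df$A),
  B_blur = numeric_blur_sketch(df$B, cuts = c(5, 10, 15, 20)),
  B_shuffled = grouped_shuffle_sketch(df$B, df$C),
  C = df$C
)
head(out)
```

In this sketch, shuffled values of `B` stay inside their original `C` group, so group-level summaries are preserved while row-level links are broken. The cache in `pseudonymize_sketch` lives only for a single call, so separate calls produce unrelated replacements.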