--- title: "Using tidy data with Bayesian models" author: "Matthew Kay" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true df_print: kable params: EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true") vignette: > %\VignetteIndexEntry{Using tidy data with Bayesian models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r chunk_options, include=FALSE} if (requireNamespace("pkgdown", quietly = TRUE) && pkgdown::in_pkgdown()) { tiny_width = small_width = med_width = 6.75 tiny_height = small_height = med_height = 4.5 large_width = 8 large_height = 5.25 } else { tiny_width = 5.5 tiny_height = 3 + 2/3 small_width = med_width = 6.75 small_height = med_height = 4.5 large_width = 8 large_height = 5.25 } eval_chunks = if (isTRUE(exists("params"))) params$EVAL else FALSE knitr::opts_chunk$set( fig.width = small_width, fig.height = small_height, eval = eval_chunks ) if (capabilities("cairo") && Sys.info()[['sysname']] != "Darwin") { knitr::opts_chunk$set( dev.args = list(png = list(type = "cairo")) ) } dir.create("models", showWarnings = FALSE) ``` ## Introduction This vignette introduces the `tidybayes` package, which facilitates the use of tidy data (one observation per row) with Bayesian models in R. This vignette is geared towards working with tidy data in general-purpose modeling functions like JAGS or Stan. For a similar introduction to the use of `tidybayes` with high-level modeling functions such as those in `brms` or `rstanarm`, see `vignette("tidy-brms")` or `vignette("tidy-rstanarm")`. This vignette also describes how to use `ggdist` (the sister package to `tidybayes`) for visualizing model output. The default output (and sometimes input) data formats of popular modeling functions like JAGS and Stan often don't quite conform to the ideal of [tidy data](https://dx.doi.org/10.18637/jss.v059.i10). For example, input formats might expect a list instead of a data frame, and for all variables to be encoded as numeric values (requiring translation of factors to numeric values and the creation of index variables to store the number of levels per factor or the number of observations in a data frame). Output formats will often be in matrix form (requiring conversion for use with libraries like ggplot), and will use numeric indices (requiring conversion back into factor level names if the you wish to make meaningfully-labeled plots or tables). `tidybayes` automates all of these sorts of tasks. ### Philosophy There are a few core ideas that run through the `tidybayes` API that should (hopefully) make it easy to use: 1. __Tidy data does not always mean all parameter names as values__. In contrast to the `ggmcmc` library (which translates model results into a data frame with a `Parameter` and `value` column), the `spread_draws` function in `tidybayes` produces data frames where the columns are named after parameters and (in some cases) indices of those parameters, as automatically as possible and using a syntax as close to the same way you would refer to those variables in the model's language as possible. A similar function to `ggmcmc`'s approach is also provided in `gather_draws`, since sometimes you *do* want variable names as values in a column. The goal is for `tidybayes` to do the tedious work of figuring out how to make a data frame look the way you need it to, including turning parameters with indices like `"b[1,2]"` and the like into tidy data for you. 2. __Fit into the tidyverse__. 
`tidybayes` methods fit into a workflow familiar to users of the `tidyverse` (`dplyr`, `tidyr`, `ggplot2`, etc), which means fitting into the pipe (`%>%`) workflow, using and respecting grouped data frames (thus `spread_draws` and `gather_draws` return results already grouped by variable indices, and methods like `median_qi` calculate point summaries and intervals for variables and groups simultaneously), and not reinventing too much of the wheel if it is already made easy by functions provided by existing `tidyverse` packages (unless it makes for much clearer code for a common idiom). For compatibility with other package column names (such as `broom::tidy`), `tidybayes` provides transformation functions like `to_broom_names` that can be dropped directly into data transformation pipelines. 3. __Focus on composable operations and plotting primitives, not monolithic plots and operations__. Several other packages (notably `bayesplot` and `ggmcmc`) already provide an excellent variety of pre-made methods for plotting Bayesian results. `tidybayes` shies away from duplicating this functionality. Instead, it focuses on providing composable operations for generating and manipulating Bayesian samples in a tidy data format, and graphical primitives for `ggplot` that allow you to build custom plots easily. Most simply, where `bayesplot` and `ggmcmc` tend to have functions with many options that return a full ggplot object, `tidybayes` tends towards providing primitives (like `geom`s) that you can compose and combine into your own custom plots. I believe both approaches have their place: pre-made functions are especially useful for common, quick operations that don't need customization (like many diagnostic plots), while composable operations tend to be useful for more complex custom plots (in [my opinion](https://blog.mjskay.com/2017/11/05/i-don-t-want-your-monolithic-ggplot-function/)). 4. __Sensible defaults make life easy.__ But options (and the data being tidy in the first place) make it easy to go your own way when you need to. 5. __Variable names in models should be descriptive, not cryptic__. This principle implies avoiding cryptic (and short) subscripts in favor of longer (but descriptive) ones. This is a matter of readability and accessibility of models to others. For example, a common pattern among Stan users (and in the Stan manual) is to use variables like `J` to refer to the number of elements in a group (e.g., number of participants) and a corresponding index like `j` to refer to specific elements in that group. I believe this sacrifices too much readability for the sake of concision; I prefer a pattern like `n_participant` for the size of the group and `participant` (or a mnemonic short form like `p`) for specific elements. In functions where names are auto-generated (like `compose_data`), `tidybayes` will (by default) assume you want these sorts of more descriptive names; however, you can always override the default naming scheme. ### Supported model types `tidybayes` aims to support a variety of models with a uniform interface. 
Currently supported models include [rstan](https://cran.r-project.org/package=rstan), [cmdstanr](https://mc-stan.org/cmdstanr/), [brms](https://cran.r-project.org/package=brms), [rstanarm](https://cran.r-project.org/package=rstanarm), [runjags](https://cran.r-project.org/package=runjags), [rjags](https://cran.r-project.org/package=rjags), [jagsUI](https://cran.r-project.org/package=jagsUI), [coda::mcmc and coda::mcmc.list](https://cran.r-project.org/package=coda), [posterior::draws](https://mc-stan.org/posterior/), [MCMCglmm](https://cran.r-project.org/package=MCMCglmm), and anything with its own `as.mcmc.list` implementation. If you install the [tidybayes.rethinking](https://mjskay.github.io/tidybayes.rethinking/) package, models from the [rethinking](https://github.com/rmcelreath/rethinking) package are also supported. For an up-to-date list of supported models, see `?"tidybayes-models"`.

## Setup

The following libraries are required to run this vignette:

```{r setup, message = FALSE, warning = FALSE}
library(magrittr)
library(dplyr)
library(forcats)
library(modelr)
library(ggdist)
library(tidybayes)
library(ggplot2)
library(cowplot)
library(broom)
library(rstan)
library(rstanarm)
library(brms)
library(bayesplot)
library(RColorBrewer)

theme_set(theme_tidybayes() + panel_border())
```

These options help Stan run faster:

```{r, eval=FALSE}
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```

```{r hidden_options, include=FALSE}
# While the previous code chunk is the actual recommended approach,
# CRAN vignette building policy limits us to 2 cores, so we use at most
# 2 to build this vignette (but show the previous chunk to
# the reader as a best practice example)
rstan_options(auto_write = TRUE)
options(mc.cores = 1) #min(2, parallel::detectCores()))
options(width = 120)
```

## Example dataset

To demonstrate `tidybayes`, we will use a simple dataset with 10 observations from each of 5 conditions:

```{r}
set.seed(5)
n = 10
n_condition = 5
ABC = tibble(
  condition = factor(rep(c("A","B","C","D","E"), n)),
  response = rnorm(n * 5, c(0,1,2,1,-1), 0.5)
)
```

A snapshot of the data looks like this:

```{r}
head(ABC, 10)
```

This is a typical tidy format data frame: one observation per row. Graphically:

```{r fig.width = tiny_width, fig.height = tiny_height}
ABC %>%
  ggplot(aes(x = response, y = fct_rev(condition))) +
  geom_point(alpha = 0.5) +
  ylab("condition")
```

## Using `compose_data` to prepare a data frame for the model

Shunting data from a data frame into a format usable in samplers like JAGS or Stan can involve a tedious set of operations, like generating index variables storing the number of observations or the number of levels in a factor. `compose_data` automates these operations.
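To make the tedium concrete, here is a minimal sketch (not part of the original vignette) of the kind of hand-rolled list you would otherwise build for the `ABC` data frame; the element names match the inputs expected by the Stan model below:

```r
# A sketch of the manual munging that compose_data() automates: translate the
# factor to integer codes and record the sizes by hand.
data_list = list(
  n = nrow(ABC),                         # number of observations
  n_condition = nlevels(ABC$condition),  # number of factor levels
  condition = as.numeric(ABC$condition), # factor -> integer codes 1..5
  response = ABC$response                # numeric responses, unchanged
)
```

Doing this by hand gets error-prone as soon as a model involves several factors or data frames, which is exactly the bookkeeping `compose_data()` takes care of. (If you prefer different names for the auto-generated size variables, `compose_data()` also accepts a `.n_name` argument, e.g. `n_prefix("N")`; see `?compose_data` for the exact interface.)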
A hierarchical model of our example data might fit an overall mean across the conditions (`overall_mean`), the standard deviation of the condition means (`condition_mean_sd`), the mean within each condition (`condition_mean[condition]`) and the standard deviation of the responses given a condition mean (`response_sd`): ```{stan ABC_stan_model, output.var = "ABC_stan", results = "hide"} data { int n; int n_condition; int condition[n]; real response[n]; } parameters { real overall_mean; vector[n_condition] condition_zoffset; real response_sd; real condition_mean_sd; } transformed parameters { vector[n_condition] condition_mean; condition_mean = overall_mean + condition_zoffset * condition_mean_sd; } model { response_sd ~ cauchy(0, 1); // => half-cauchy(0, 1) condition_mean_sd ~ cauchy(0, 1); // => half-cauchy(0, 1) overall_mean ~ normal(0, 5); condition_zoffset ~ normal(0, 1); // => condition_mean ~ normal(overall_mean, condition_mean_sd) for (i in 1:n) { response[i] ~ normal(condition_mean[condition[i]], response_sd); } } ``` We have compiled and loaded this model into the variable `ABC_stan`. This model expects these variables as input: * `n`: number of observations * `n_condition`: number of conditions * `condition`: a vector of integers indicating the condition of each observation * `response`: a vector of observations Our data frame (`ABC`) only has `response` and `condition`, and `condition` is in the wrong format (it is a factor instead of numeric). However, `compose_data` can generate a list containing the above variables in the correct format automatically. It recognizes that `condition` is a factor and converts it to a numeric, adds the `n_condition` variable automatically containing the number of levels in `condition`, and adds the `n` column containing the number of observations (number of rows in the data frame): ```{r} compose_data(ABC) ``` This makes it easy to skip right to running the model without munging the data yourself: ```{r message = FALSE, results = 'hide'} m = sampling(ABC_stan, data = compose_data(ABC), control = list(adapt_delta = 0.99)) ``` The results look like this: ```{r} m ``` ## Extracting draws from a fit in tidy-format using `spread_draws` ### Extracting model variable indices into a separate column in a tidy format data frame Now that we have our results, the fun begins: getting the draws out in a tidy format! The default methods in Stan for extracting draws from the model do so in a nested format: ```{r} str(rstan::extract(m)) ``` There are also methods for extracting draws as matrices or data frames in Stan (and other model types, such as JAGS and MCMCglmm, have their own formats). The `spread_draws` method yields a common format for all model types supported by `tidybayes`. It lets us instead extract draws into a data frame in tidy format, with a `.chain` and `.iteration` column storing the chain and iteration for each row (if available), a `.draw` column that uniquely indexes each draw, and the remaining columns corresponding to model variables or variable indices. The `spread_draws` method accepts any number of column specifications, which can include names for variables and names for variable indices. 
For example, we can extract the `condition_mean` variable as a tidy data frame, and put the value of its first (and only) index into the `condition` column, using a syntax that directly echoes how we would specify indices of the `condition_mean` variable in the model itself: ```{r} m %>% spread_draws(condition_mean[condition]) %>% head(10) ``` ### Automatically converting columns and indices back into their original data types As-is, the resulting variables don't know anything about where their indices came from. The index of the `condition_mean` variable was originally derived from the `condition` factor in the `ABC` data frame. But Stan doesn't know this: it is just a numeric index to Stan, so the `condition` column just contains numbers (`1, 2, 3, 4, 5`) instead of the factor levels these numbers correspond to (`"A", "B", "C", "D", "E"`). We can recover this missing type information by passing the model through `recover_types` before using `spread_draws`. In itself `recover_types` just returns a copy of the model, with some additional attributes that store the type information from the data frame (or other objects) that you pass to it. This doesn't have any useful effect by itself, but functions like `spread_draws` use this information to convert any column or index back into the data type of the column with the same name in the original data frame. In this example, `spread_draws` recognizes that the `condition` column was a factor with five levels (`"A", "B", "C", "D", "E"`) in the original data frame, and automatically converts it back into a factor: ```{r} m %>% recover_types(ABC) %>% spread_draws(condition_mean[condition]) %>% head(10) ``` Because we often want to make multiple separate calls to `spread_draws`, it is often convenient to decorate the original model using `recover_types` immediately after it has been fit, so we only have to call it once: ```{r} m %<>% recover_types(ABC) ``` Now we can omit the `recover_types` call before subsequent calls to `spread_draws`. ## Point summaries and intervals with the `point_interval` functions: `[median|mean|mode]_[qi|hdi]` ### With simple variables, wide format `tidybayes` provides a family of functions for generating point summaries and intervals from draws in a tidy format. These functions follow the naming scheme `[median|mean|mode]_[qi|hdi]`, for example, `median_qi`, `mean_qi`, `mode_hdi`, and so on. The first name (before the `_`) indicates the type of point summary, and the second name indicates the type of interval. `qi` yields a quantile interval (a.k.a. equi-tailed interval, central interval, or percentile interval) and `hdi` yields a highest density interval. Custom point or interval functions can also be applied using the `point_interval` function. For example, we might extract the draws corresponding to the overall mean and standard deviation of observations: ```{r} m %>% spread_draws(overall_mean, response_sd) %>% head(10) ``` Like with `condition_mean[condition]`, this gives us a tidy data frame. If we want the median and 95% quantile interval of the variables, we can apply `median_qi`: ```{r} m %>% spread_draws(overall_mean, response_sd) %>% median_qi(overall_mean, response_sd) ``` `median_qi` summarizes each input column using its median. If there are multiple columns to summarize, each gets its own `x.lower` and `x.upper` column (for each column `x`) corresponding to the bounds of the `.width`% interval. If there is only one column, the names `.lower` and `.upper` are used for the interval bounds. 
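As a quick illustration of that naming rule (a small variation on the example above, not from the original vignette), summarizing a single column yields plain `.lower` and `.upper` columns:

```r
# With a single column, the interval bounds are named .lower and .upper
# rather than overall_mean.lower and overall_mean.upper
m %>%
  spread_draws(overall_mean) %>%
  median_qi(overall_mean)
```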
We can specify the columns we want to get medians and intervals from, as above, or if we omit the list of columns, `median_qi` will use every column that is not a grouping column or a special column (like `.chain`, `.iteration`, or `.draw`). Thus in the above example, `overall_mean` and `response_sd` are redundant arguments to `median_qi` because they are also the only columns we gathered from the model. So we can simplify the previous code to the following:

```{r}
m %>%
  spread_draws(overall_mean, response_sd) %>%
  median_qi()
```

### With indexed variables

When we have a variable with one or more indices, such as `condition_mean`, we can apply `median_qi` (or other functions in the `point_interval` family) as we did before:

```{r}
m %>%
  spread_draws(condition_mean[condition]) %>%
  median_qi()
```

How did `median_qi` know what to aggregate? Data frames returned by `spread_draws` are automatically grouped by all index variables you pass to it; in this case, that means it groups by `condition`. `median_qi` respects groups, and calculates the point summaries and intervals within all groups. Then, because no columns were passed to `median_qi`, it acts on the only non-special (`.`-prefixed) and non-group column, `condition_mean`. So the above shortened syntax is equivalent to this more verbose call:

```{r}
m %>%
  spread_draws(condition_mean[condition]) %>%
  group_by(condition) %>%    # this line not necessary (done automatically by spread_draws)
  median_qi(condition_mean)
```

When given only a single column, `median_qi` will use the names `.lower` and `.upper` for the lower and upper ends of the intervals.

`tidybayes` also provides an implementation of `posterior::summarise_draws()` for grouped data frames (`tidybayes::summarise_draws.grouped_df()`), which you can use to quickly get convergence diagnostics:

```{r}
m %>%
  spread_draws(condition_mean[condition]) %>%
  summarise_draws()
```

## Plotting points and intervals

### Using `geom_pointinterval`

Plotting medians and intervals is straightforward using `ggdist::geom_pointinterval()` or `ggdist::stat_pointinterval()`, which are similar to `ggplot2::geom_pointrange()` but with sensible defaults for multiple intervals. For example:

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  ggplot(aes(y = fct_rev(condition), x = condition_mean)) +
  stat_pointinterval()
```

These functions have `.width = c(.66, .95)` by default (showing 66% and 95% intervals), but this can be changed by passing a `.width` argument to `ggdist::stat_pointinterval()`.

### Intervals with posterior densities ("half-eye plots"): `stat_halfeye`

The `ggdist::stat_halfeye()` geom provides a shortcut to generating "half-eye plots" (combinations of intervals and densities). This example also demonstrates how to change the interval probability (here, to 90% and 50% intervals):

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  ggplot(aes(y = fct_rev(condition), x = condition_mean)) +
  stat_halfeye(.width = c(.90, .5))
```

Or say you want to annotate portions of the densities in color; the `fill` aesthetic can vary within a slab in all geoms and stats in the `ggdist::geom_slabinterval()` family, including `ggdist::stat_halfeye()`.
For example, if you want to annotate a domain-specific region of practical equivalence (ROPE), you could do something like this:

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  ggplot(aes(y = fct_rev(condition), x = condition_mean, fill = after_stat(abs(x) < .8))) +
  stat_halfeye() +
  geom_vline(xintercept = c(-.8, .8), linetype = "dashed") +
  scale_fill_manual(values = c("gray80", "skyblue"))
```

### Other visualizations of distributions: `stat_slabinterval`

There are a variety of additional stats for visualizing distributions in the `ggdist::geom_slabinterval()` family of stats and geoms:

*[Figure: the slabinterval family of geoms and stats]*

See `vignette("slabinterval", package = "ggdist")` for an overview.

### Intervals with multiple probability levels: the `.width =` argument

If you wish to summarise the data before plotting (sometimes useful for large samples), `median_qi()` and its sister functions can also produce an arbitrary number of probability intervals by setting the `.width =` argument:

```{r}
m %>%
  spread_draws(condition_mean[condition]) %>%
  median_qi(.width = c(.95, .8, .5))
```

The results are in a tidy format: one row per index (`condition`) and probability level (`.width`). This facilitates plotting. For example, assigning `-.width` to the `linewidth` aesthetic will show all intervals, making thicker lines correspond to smaller intervals:

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  median_qi(.width = c(.95, .66)) %>%
  ggplot(aes(
    y = fct_rev(condition), x = condition_mean, xmin = .lower, xmax = .upper,
    # linewidth = -.width means smaller probability interval => thicker line
    # this can be omitted, geom_pointinterval includes it automatically
    # if a .width column is in the input data.
    linewidth = -.width
  )) +
  geom_pointinterval()
```

`ggdist::geom_pointinterval()` includes `linewidth = -.width` as a default aesthetic mapping to facilitate exactly this usage.

### Plotting posteriors as quantile dotplots

Intervals are nice if the alpha level happens to line up with whatever decision you are trying to make, but getting a shape of the posterior is better (hence eye plots, above). On the other hand, making inferences from density plots is imprecise (estimating the area of one shape as a proportion of another is a hard perceptual task). Reasoning about probability in frequency formats is easier, motivating [quantile dotplots](https://github.com/mjskay/when-ish-is-my-bus/blob/master/quantile-dotplots.md) ([Kay et al. 2016](https://doi.org/10.1145/2858036.2858558), [Fernandes et al. 2018](https://doi.org/10.1145/3173574.3173718)), which also allow precise estimation of arbitrary intervals (down to the dot resolution of the plot, 100 in the example below).

Within the slabinterval family of geoms in `ggdist` is the `dots` and `dotsinterval` family, which automatically determine appropriate bin sizes for dotplots and can calculate quantiles from samples to construct quantile dotplots. `ggdist::stat_dotsinterval()` is the variant designed for use on samples (`ggdist::stat_dots()` is similar, but omits the interval):

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  ggplot(aes(x = condition_mean, y = fct_rev(condition))) +
  stat_dotsinterval(quantiles = 100)
```

The idea is to get away from thinking about the posterior as indicating one canonical point or interval, but instead to represent it as (say) 100 approximately equally likely points.
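To build intuition for what the `quantiles` argument is doing, here is a rough sketch (an illustration of the idea, not the exact internals of `ggdist`): take evenly-spaced quantiles of the draws and plot them as an ordinary dotplot. The `binwidth` below is just picked by eye for this example:

```r
# Sketch of the quantile dotplot idea: 100 evenly-spaced quantiles of the
# draws for condition "A", shown as a plain dotplot.
condition_a = m %>%
  spread_draws(condition_mean[condition]) %>%
  filter(condition == "A")

tibble(q = quantile(condition_a$condition_mean, ppoints(100))) %>%
  ggplot(aes(x = q)) +
  geom_dotplot(binwidth = 0.05) +
  scale_y_continuous(breaks = NULL) +
  labs(x = "condition_mean[A]", y = NULL)
```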
### Alternative point summaries and intervals: median, mean, mode; qi, hdi, hdci

The `point_interval()` family of functions follow the naming scheme `[median|mean|mode]_[qi|hdi|hdci]`, and all work in the same way as `median_qi()`: they take a series of names (or expressions calculated on columns) and summarize those columns with the corresponding point summary function (median, mean, or mode) and interval (qi, hdi, or hdci). `qi` yields a quantile interval (a.k.a. equi-tailed interval, central interval, or percentile interval), `hdi` yields one or more highest (posterior) density interval(s), and `hdci` yields a single highest-density *continuous* interval (which, for a multimodal distribution, is not necessarily a highest-density region). These can be used in any combination desired.

The `*_hdi` functions have an additional difference: in the case of multimodal distributions, they may return multiple intervals for each probability level. Here are some draws from a multimodal normal mixture:

```{r}
set.seed(123)
multimodal_draws = tibble(
  x = c(rnorm(5000, 0, 1), rnorm(2500, 4, 1))
)
```

Passed through `mode_hdi()`, we get multiple intervals at the 80% probability level:

```{r}
multimodal_draws %>%
  mode_hdi(x, .width = .80)
```

This is easier to see when plotted:

```{r fig.width = large_width, fig.height = large_height/2}
multimodal_draws %>%
  ggplot(aes(x = x)) +
  stat_slab(aes(y = 0)) +
  stat_pointinterval(aes(y = -0.5), point_interval = median_qi, .width = c(.95, .80)) +
  annotate("text", label = "median, 80% and 95% quantile intervals",
    x = 6, y = -0.5, hjust = 0, vjust = 0.3) +
  stat_pointinterval(aes(y = -0.25), point_interval = mode_hdi, .width = c(.95, .80)) +
  annotate("text", label = "mode, 80% and 95% highest-density intervals",
    x = 6, y = -0.25, hjust = 0, vjust = 0.3) +
  xlim(-3.25, 18) +
  scale_y_continuous(breaks = NULL)
```

## Combining variables with different indices in a single tidy format data frame

`spread_draws()` supports extracting variables that have different indices. It automatically matches up indices with the same name, and duplicates values as necessary to produce one row per combination of the levels of all indices. For example, we might want to calculate the difference between each condition mean and the overall mean. To do that, we can extract draws from the overall mean and all condition means:

```{r}
m %>%
  spread_draws(overall_mean, condition_mean[condition]) %>%
  head(10)
```

Within each draw, `overall_mean` is repeated as necessary to correspond to every index of `condition_mean`. Thus, the `dplyr::mutate()` function can be used to take the differences over all rows, then we can summarize with `median_qi()`:

```{r}
m %>%
  spread_draws(overall_mean, condition_mean[condition]) %>%
  mutate(condition_offset = condition_mean - overall_mean) %>%
  median_qi(condition_offset)
```

## Posterior predictions

We can use combinations of variables with different indices to generate predictions from the model.
In this case, we can combine the condition means with the residual standard deviation to generate predictive distributions from the model, then summarize them with `median_qi()` and show the resulting intervals with `ggdist::geom_interval()`, comparing them to the data:

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition], response_sd) %>%
  mutate(y_rep = rnorm(n(), condition_mean, response_sd)) %>%
  median_qi(y_rep, .width = c(.95, .8, .5)) %>%
  ggplot(aes(y = fct_rev(condition), x = y_rep)) +
  geom_interval(aes(xmin = .lower, xmax = .upper)) + # auto-sets aes(color = fct_rev(ordered(.width)))
  geom_point(aes(x = response), data = ABC) +
  scale_color_brewer()
```

If this model is well-calibrated, about 95% of the data should be within the outer intervals, 80% in the next-smallest intervals, and 50% in the smallest intervals.

### Posterior predictions with posterior distributions of means

Altogether, data, posterior predictions, and posterior distributions of the means:

```{r fig.width = tiny_width, fig.height = tiny_height}
draws = m %>%
  spread_draws(condition_mean[condition], response_sd)

reps = draws %>%
  mutate(y_rep = rnorm(n(), condition_mean, response_sd))

ABC %>%
  ggplot(aes(y = condition)) +
  stat_interval(aes(x = y_rep), .width = c(.95, .8, .5), data = reps) +
  stat_pointinterval(aes(x = condition_mean), .width = c(.95, .66), position = position_nudge(y = -0.3), data = draws) +
  geom_point(aes(x = response)) +
  scale_color_brewer()
```

## Comparing levels of a factor

`compare_levels()` allows us to compare the value of some variable across levels of some factor. By default it computes all pairwise differences, though this can be changed using the `comparison =` argument:

```{r fig.width = tiny_width, fig.height = tiny_height}
m %>%
  spread_draws(condition_mean[condition]) %>%
  compare_levels(condition_mean, by = condition) %>%
  ggplot(aes(y = condition, x = condition_mean)) +
  stat_halfeye()
```

## Gathering all model variable names into a single column: `gather_draws` and `gather_variables`

We might also prefer all model variable names to be in a single column (long-format) instead of as column names. There are two methods for obtaining long-format data frames with `tidybayes`, whose use depends on where in the data processing chain you want to transform into long format: `gather_draws()` and `gather_variables()`. There are also two methods for wide (or semi-wide) format data frames, `spread_draws()` (described previously) and `tidy_draws()`.

`gather_draws()` is the counterpart to `spread_draws()`, except it puts all variable names in a `.variable` column and all draws in a `.value` column:

```{r}
m %>%
  gather_draws(overall_mean, condition_mean[condition]) %>%
  median_qi()
```

Note that `condition = NA` for the `overall_mean` row, because it does not have an index with that name in the specification passed to `gather_draws()`. While this works well if we do not need to perform computations that involve multiple columns, the semi-wide format returned by `spread_draws()` is very useful for computations that involve multiple column names, such as the calculation of the `condition_offset` above.
If we want to make intermediate computations on the format returned by `spread_draws` and *then* gather variables into one column, we can use `gather_variables()`, which will gather all non-grouped variables that are not special columns (like `.chain`, `.iteration`, or `.draw`):

```{r}
m %>%
  spread_draws(overall_mean, condition_mean[condition]) %>%
  mutate(condition_offset = condition_mean - overall_mean) %>%
  gather_variables() %>%
  median_qi()
```

Note how `overall_mean` is now repeated here for each condition, because we have performed the gather after spreading model variables across columns.

Finally, if we want raw model variable names as column names instead of having indices split out into their own columns, we can use `tidy_draws()`. Generally speaking, `spread_draws()` and `gather_draws()` are more useful than `tidy_draws()`, but it is provided as a common method for generating data frames from many types of Bayesian models, and is used internally by `gather_draws()` and `spread_draws()`:

```{r}
m %>%
  tidy_draws() %>%
  head(10)
```

Combining `tidy_draws()` with `gather_variables()` also allows us to derive similar output to `ggmcmc::ggs()`, if desired:

```{r}
m %>%
  tidy_draws() %>%
  gather_variables() %>%
  head(10)
```

But again, this approach does not handle variable indices for us automatically, so using `spread_draws()` and `gather_draws()` is generally recommended unless you do not have variable indices to worry about.

## Selecting variables using regular expressions

You can use regular expressions in the specifications passed to `spread_draws()` and `gather_draws()` to match multiple columns by passing `regex = TRUE`. Our example fit contains variables named `condition_mean[i]` and `condition_zoffset[i]`. We could extract both using a single regular expression:

```{r}
m %>%
  spread_draws(`condition_.*`[condition], regex = TRUE) %>%
  head(10)
```

This result is equivalent in this case to `spread_draws(c(condition_mean, condition_zoffset)[condition])`, but does not require us to list each variable explicitly---this can be useful, for example, in models with naming schemes like `b_[some name]` for coefficients.

## Drawing fit curves with uncertainty

To demonstrate drawing fit curves with uncertainty, let's fit a slightly naive model to part of the `mtcars` dataset using `brms::brm()`:

```{r m_mpg_brms, results = "hide", message = FALSE, warning = FALSE, cache = TRUE}
m_mpg = brm(
  mpg ~ hp * cyl,
  data = mtcars,
  file = "models/tidybayes_m_mpg.rds"  # cache model (can be removed)
)
```

We can draw fit curves with probability bands using `add_epred_draws()` and `ggdist::stat_lineribbon()`:

```{r}
mtcars %>%
  group_by(cyl) %>%
  data_grid(hp = seq_range(hp, n = 51)) %>%
  add_epred_draws(m_mpg) %>%
  ggplot(aes(x = hp, y = mpg, color = ordered(cyl))) +
  stat_lineribbon(aes(y = .epred)) +
  geom_point(data = mtcars) +
  scale_fill_brewer(palette = "Greys") +
  scale_color_brewer(palette = "Set2")
```

Or we can sample a reasonable number of fit lines (say 100) and overplot them:

```{r}
mtcars %>%
  group_by(cyl) %>%
  data_grid(hp = seq_range(hp, n = 101)) %>%
  # NOTE: this shows the use of ndraws to subsample within add_epred_draws()
  # ONLY do this IF you are planning to make spaghetti plots, etc.
  # NEVER subsample to a small sample to plot intervals, densities, etc.
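  # (Subsampling is fine here because each sampled line is shown individually;
  #  intervals, densities, and point summaries computed from a small subsample
  #  of draws would be noticeably noisier than ones computed from all draws.)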
add_epred_draws(m_mpg, ndraws = 100) %>% ggplot(aes(x = hp, y = mpg, color = ordered(cyl))) + geom_line(aes(y = .epred, group = paste(cyl, .draw)), alpha = .1) + geom_point(data = mtcars) + scale_color_brewer(palette = "Dark2") ``` For more examples of fit line uncertainty, see the corresponding sections in `vignette("tidy-brms")` or `vignette("tidy-rstanarm")`. ## Compatibility with other packages ### Compatibility of `point_interval` with `broom::tidy`: A model comparison example Combining `to_broom_names()` with `median_qi()` (or more generally, the `point_interval()` family of functions) makes it easy to compare results against models supported by `broom::tidy()`. For example, let's compare our model's fits for conditional means against an ordinary least squares (OLS) regression: ```{r} m_linear = lm(response ~ condition, data = ABC) ``` Combining `emmeans::emmeans` with `broom::tidy`, we can generate tidy-format summaries of conditional means from the above model: ```{r, eval = eval_chunks && requireNamespace("emmeans", quietly = TRUE)} linear_results = m_linear %>% emmeans::emmeans(~ condition) %>% tidy(conf.int = TRUE) %>% mutate(model = "OLS") linear_results ``` We can derive corresponding fits from our model: ```{r} bayes_results = m %>% spread_draws(condition_mean[condition]) %>% median_qi(estimate = condition_mean) %>% to_broom_names() %>% mutate(model = "Bayes") bayes_results ``` Here, `to_broom_names()` will convert `.lower` and `.upper` into `conf.low` and `conf.high` so the names of the columns we need to make the comparison (`condition`, `estimate`, `conf.low`, and `conf.high`) all line up easily. This makes it simple to combine the two tidy data frames together using `bind_rows`, and plot them: ```{r fig.width = tiny_width, fig.height = tiny_height} bind_rows(linear_results, bayes_results) %>% mutate(condition = fct_rev(condition)) %>% ggplot(aes(y = condition, x = estimate, xmin = conf.low, xmax = conf.high, color = model)) + geom_pointinterval(position = position_dodge(width = .3)) ``` Observe the shrinkage towards the overall mean in the Bayesian model compared to the OLS model. ### Compatibility with `bayesplot` using `unspread_draws` and `ungather_draws` Functions from other packages might expect draws in the form of a data frame or matrix with variables as columns and draws as rows. That is the format returned by `tidy_draws()`, but not by `gather_draws()` or `spread_draws()`, which split indices from variables out into columns. It may be desirable to use the `spread_draws()` or `gather_draws()` functions to transform your draws in some way, and then convert them *back* into the draw $\times$ variable format to pass them into functions from other packages, like `bayesplot`. The `unspread_draws()` and `ungather_draws()` functions invert `spread_draws()` and `gather_draws()` to return a data frame with variable column names that include indices in them and draws as rows. As an example, let's re-do the previous example of `compare_levels()`, but use `bayesplot::mcmc_areas()` to plot the results instead of `ggdist::stat_eye()`. 
First, the result of `compare_levels()` looks like this: ```{r} m %>% spread_draws(condition_mean[condition]) %>% compare_levels(condition_mean, by = condition) %>% head(10) ``` To get a version we can pass to `bayesplot::mcmc_areas()`, all we need to do is invert the `spread_draws()` call we started with: ```{r} m %>% spread_draws(condition_mean[condition]) %>% compare_levels(condition_mean, by = condition) %>% unspread_draws(condition_mean[condition]) %>% head(10) ``` We can pass that into `bayesplot::mcmc_areas()` directly. The `drop_indices = TRUE` argument to `unspread_draws()` indicates that `.chain`, `.iteration`, and `.draw` should not be included in the output: ```{r fig.width = tiny_width, fig.height = tiny_height} m %>% spread_draws(condition_mean[condition]) %>% compare_levels(condition_mean, by = condition) %>% unspread_draws(condition_mean[condition], drop_indices = TRUE) %>% bayesplot::mcmc_areas() ``` If you are instead working with tidy draws generated by `gather_draws()` or `gather_variables()`, the `ungather_draws()` function will transform those draws into the draw $\times$ variable format. It has the same syntax as `unspread_draws()`. ### Compatibility with `emmeans` (formerly `lsmeans`) The `emmeans::emmeans()` function provides a convenient syntax for generating marginal estimates from a model, including numerous types of contrasts. It also supports some Bayesian modeling packages, like `MCMCglmm`, `rstanarm`, and `brms`. However, it does not provide draws in a tidy format. The `gather_emmeans_draws()` function converts output from `emmeans` into a tidy format, keeping the `emmeans` reference grid and adding a `.value` column with long-format draws. (Another approach to using `emmeans` contrast methods is to use `emmeans_comparison()` to convert emmeans contrast methods into comparison functions that can be used with `tidybayes::compare_levels()`. See the documentation of `emmeans_comparison()` for more information). #### Using `rstanarm` or `brms` Both `rstanarm` and `brms` behave similarly when used with `emmeans::emmeans()`. The example below uses `rstanarm`, but should work similarly for `brms`. Given this `rstanarm` model: ```{r results = "hide", message = FALSE} m_rst = stan_glm(response ~ condition, data = ABC) ``` We can use `emmeans::emmeans()` to get conditional means with uncertainty: ```{r, eval = eval_chunks && requireNamespace("emmeans", quietly = TRUE)} m_rst %>% emmeans::emmeans( ~ condition) %>% gather_emmeans_draws() %>% median_qi() ``` Or `emmeans::emmeans()` with `emmeans::contrast()` to do all pairwise comparisons: ```{r, eval = eval_chunks && requireNamespace("emmeans", quietly = TRUE)} m_rst %>% emmeans::emmeans( ~ condition) %>% emmeans::contrast(method = "pairwise") %>% gather_emmeans_draws() %>% median_qi() ``` See the documentation for `emmeans::pairwise.emmc()` for a list of the numerous contrast types supported by `emmeans::emmeans()`. As before, we can plot the results instead of using a table: ```{r fig.width = tiny_width, fig.height = tiny_height, eval = eval_chunks && requireNamespace("emmeans", quietly = TRUE)} m_rst %>% emmeans::emmeans( ~ condition) %>% emmeans::contrast(method = "pairwise") %>% gather_emmeans_draws() %>% ggplot(aes(x = .value, y = contrast)) + stat_halfeye() ``` `gather_emmeans_draws()` also supports `emm_list` objects, which contain estimates from different reference grids (see `emmeans::ref_grid()` for more information on reference grids). 
An additional column with the default name of `.grid` is added to indicate the reference grid for each row in the output:

```{r, eval = eval_chunks && requireNamespace("emmeans", quietly = TRUE)}
m_rst %>%
  emmeans::emmeans(pairwise ~ condition) %>%
  gather_emmeans_draws() %>%
  median_qi()
```

#### Using `MCMCglmm`

Let's do the same example as above again, this time using `MCMCglmm::MCMCglmm()` instead of `rstanarm`. The only difference when using `MCMCglmm()` is that you must also pass the original data used to fit the model to the `emmeans()` call (see `vignette("models", package = "emmeans")` for more information). First, we'll fit the model:

```r
# MCMCglmm does not support tibbles directly,
# so we convert ABC to a data.frame on the way in
m_glmm = MCMCglmm::MCMCglmm(response ~ condition, data = as.data.frame(ABC))
```

Now we can use `emmeans()` and `gather_emmeans_draws()` exactly as we did with `rstanarm`, but we need to include a `data` argument in the `emmeans()` call:

```r
m_glmm %>%
  emmeans::emmeans( ~ condition, data = ABC) %>%
  emmeans::contrast(method = "pairwise") %>%
  gather_emmeans_draws() %>%
  ggplot(aes(x = .value, y = contrast)) +
  stat_halfeye()
```