---
title: "The Mindfulness Study"
package: multimedia
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The Mindfulness Study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    fig.align = "center",
    collapse = TRUE,
    comment = "#>",
    warning = FALSE,
    message = FALSE,
    cache = FALSE,
    dev.args = list(bg = "transparent"),
    out.width = 600,
    crop = NULL
)
knitr::knit_hooks$set(output = multimedia::ansi_aware_handler)
suppressPackageStartupMessages(library(ggplot2))
options(
    ggplot2.discrete.colour = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.discrete.fill = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.continuous.colour = function(...) {
        scale_color_distiller(palette = "Spectral", ...)
    },
    ggplot2.continuous.fill = function(...) {
        scale_fill_distiller(palette = "Spectral", ...)
    },
    crayon.enabled = TRUE
)

th <- theme_classic() +
    theme(
        panel.background = element_rect(fill = "transparent"),
        strip.background = element_rect(fill = "transparent"),
        plot.background = element_rect(fill = "transparent", color = NA),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent"),
        legend.position = "bottom"
    )
theme_set(th)
```

```{r helpers, include = FALSE}
normalize <- function(x) {
    x <- as_tibble(x)
    x / rowSums(x)
}
```

In most microbiome mediation analysis studies, microbiome composition is thought to mediate some outcome of interest. Sometimes, though, we might want to treat the microbiome as an outcome. Our motivating example here looks at the relationship between mindfulness and the microbiome. In this setting, we might imagine that mindfulness training influences behaviors (diet, sleep, exercise) that could then change the microbiome. Clinical surveys that gauge those behaviors now play the role of mediator. Alternatively, mindfulness could trigger neurophysiological changes that influence the microbiome in more subtle ways. The point of mediation analysis is to provide evidence for one or the other of these possibilities.

```{r libraries}
library(ggdist)
library(ggrepel)
library(patchwork) # used for the p1 | p2 figure layout in the final section
library(phyloseq)
library(tidyverse)
library(multimedia)
data(mindfulness)
set.seed(20231228)
```

This package needs a `mediation_data` object to organize the pretreatment, treatment, mediator, and outcome variables. `taxa_names(mindfulness)` specifies that all the taxa should be outcome variables. The mediator variable names have already been prefixed with "mediator", so we can select them with the `starts_with` helper. We treat the `subject` column as a pretreatment variable: the study tracked participants across a few timepoints, and we want to control for different baseline behavioral and microbiome profiles.

```{r data}
exper <- mediation_data(
    mindfulness,
    taxa_names(mindfulness),
    "treatment",
    starts_with("mediator"),
    "subject"
)
exper
```
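If you want to double check how the columns were assigned, the accessors used throughout this vignette return each component of the `mediation_data` object. A minimal sketch:

```{r inspect_components}
# Peek at each component of the mediation_data object. dim(), colnames(), and
# head() simply confirm how the columns were split across components.
dim(outcomes(exper))
colnames(mediators(exper))
head(treatments(exper))
head(pretreatments(exper))
```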
The next code block applies a logistic-normal multinomial (LNM) model to the microbiome outcome. This is an example where the response variables are all modeled jointly. We use a ridge regression model for each of the mediators; this is a better choice than ordinary linear regression because we have to estimate many subject-level parameters and our sample size is not that large.

```{r lnm}
# use Flavonifractor as the reference for the log-ratio transform
outcomes(exper) <- outcomes(exper) |>
    select(Flavonifractor, everything())

model <- multimedia(
    exper,
    lnm_model(seed = .Random.seed[1]),
    glmnet_model(lambda = 0.5, alpha = 0)
) |>
    estimate(exper)
```

# Evaluating Effects

With this model, we can estimate direct and indirect effects, averaging over the two possible treatment settings of the mediators. Since the LNM outputs relative abundance profiles, all the estimated effects should be interpreted on a probability scale.

```{r lnm_direct}
direct <- direct_effect(model, exper) |>
    group_by(outcome) |>
    summarise(direct_effect = mean(direct_effect)) |>
    arrange(-abs(direct_effect))
direct
```

These are the indirect effects obtained when we switch the mediators to their treatment setting one at a time.

```{r lnm_indirect}
indirect <- indirect_pathwise(model, exper) |>
    group_by(mediator, outcome) |>
    summarise(indirect_effect = mean(indirect_effect)) |>
    arrange(-abs(indirect_effect))
indirect
```

`plot_mediators` visualizes the relationship between mediator and outcome variables. The title for each subplot is the estimated indirect effect (note that it's `Control` - `Treatment` in the current definition). For example, in the panel relating sleep disturbances and Blautia, there is a rough positive association between these variables. The control group has a higher number of sleep disturbances, which translates into a positive indirect effect for this pair. The model accounts for more than these pairwise relationships, though -- it controls for subject-level baselines and includes direct effects of treatment -- so these figures should be interpreted within the larger model context.

```{r mediator_plot, fig.width = 12, fig.height = 6}
exper_rela <- exper
outcomes(exper_rela) <- normalize(outcomes(exper))
plot_mediators(indirect, exper_rela)
```

# Model Alterations

Hypothesis testing is built on the idea that a simple submodel might explain the data just as well as a full, more complicated one. In mediation analysis, we can zero out sets of edges to define submodels. For example, we can re-estimate a model that does not allow any relationship between the treatment and the mediators.

```{r alteration_m}
altered <- model |>
    nullify("T->M") |>
    estimate(exper)
```

These modified models can be used to simulate synthetic null data that help contextualize the effects seen in the real data. For example, we can compare imaginary cohorts drawn from the original and altered models. Notice that we re-assign all the treatments below. This removes potential confounding between the treatment and subject-level effects -- such confounding doesn't exist on average across the population, because the study was randomized, but we want to remove any associations present in our finite sample.

```{r mediation_sample}
new_assign <- treatments(exper)[sample(nrow(treatments(exper))), , drop = FALSE]
profile <- setup_profile(model, new_assign, new_assign)
m1 <- sample(model, profile = profile, pretreatment = pretreatments(exper))
m2 <- sample(altered, profile = profile, pretreatment = pretreatments(exper))
```

The sampled objects above have the same class as `exper`, so we can extract the real and simulated mediator data and compare treatment effects along those edges. As expected, the altered model shows no systematic difference between treatment groups. This lends some evidence to the conclusion that mindfulness does influence the mediators, which was implicit in our discussion of indirect effects above.
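Before plotting, here is a quick tabular version of the same comparison, a sketch that reuses the sampled objects above and assumes the mediator columns are numeric. Under the nullified `T->M` edges, the two group means for each mediator should be close.

```{r altered_mediator_check}
# Average each mediator within treatment groups for the altered-model samples.
bind_cols(new_assign, mediators(m2)) |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    group_by(mediator, treatment) |>
    summarise(mean_value = mean(value), .groups = "drop")
```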
```{r synthetic_mediation_plot, fig.width = 10, fig.height = 4}
list(
    real = bind_cols(treatments(exper), mediators(exper), pretreatments(exper)),
    original = bind_cols(new_assign, mediators(m1), pretreatments(exper)),
    altered = bind_cols(new_assign, mediators(m2), pretreatments(exper))
) |>
    bind_rows(.id = "source") |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    ggplot() +
    geom_boxplot(
        aes(value, reorder(mediator, value, median), fill = treatment)
    ) +
    facet_grid(~source)
```

Here are the analogous alterations that remove the direct (`T->Y`) and mediator-to-outcome (`M->Y`) edges. In the figure that follows, the $x$-axis is on a log scale, and the LNM's estimated treatment effects for rare taxa are not practically significant. It's interesting that, even though the original data don't show much difference between treatment groups, the model suggests that there are real effects, even for the abundant taxa. It's possible that, in the raw boxplots, heterogeneity across subjects and mediators masks true effects, and that the LNM detects these effects by appropriately accounting for those sources of heterogeneity.

```{r outcome_alteration}
altered_direct <- model |>
    nullify("T->Y") |>
    estimate(exper)

altered_indirect <- model |>
    nullify("M->Y") |>
    estimate(exper)

profile <- setup_profile(model, treatments(exper), treatments(exper))
y1 <- sample(model, profile = profile, pretreatment = pretreatments(exper))
y2 <- sample(
    altered_direct,
    profile = profile,
    pretreatment = pretreatments(exper)
)
y3 <- sample(
    altered_indirect,
    profile = profile,
    pretreatment = pretreatments(exper)
)
```

```{r alteration_plot, fig.width = 8, fig.height = 7}
taxa_order <- normalize(outcomes(exper)) |>
    log() |>
    apply(2, \(x) {
        x[is.infinite(x)] <- NA
        median(x, na.rm = TRUE)
    }) |>
    order()

combined_samples <- list(
    "Original Data" = normalize(outcomes(exper)),
    "Full Model" = normalize(outcomes(y1)),
    "No Direct Effect" = normalize(outcomes(y2)),
    "No Indirect Effect" = normalize(outcomes(y3))
) |>
    bind_rows(.id = "source") |>
    mutate(treatment = rep(treatments(exper)$treatment, 4)) |>
    pivot_longer(colnames(outcomes(exper))) |>
    mutate(name = factor(name, levels = colnames(outcomes(exper))[taxa_order]))

ggplot(combined_samples) +
    geom_boxplot(
        aes(log(value), name, fill = treatment),
        size = 0.4, outlier.size = 0.8, position = "identity", alpha = 0.7
    ) +
    labs(x = "log(Relative Abundance)", y = "Genus", fill = "", col = "") +
    facet_grid(~source, scales = "free") +
    theme(panel.border = element_rect(fill = NA))
```

The same samples can also be compared on the presence/absence scale. The next figure shows the fraction of samples in which each genus is detected, under each data source.

```{r alteration_presence, fig.width = 6, fig.height = 7}
combined_samples |>
    group_by(name, source, treatment) |>
    summarise(presence = mean(value > 0)) |>
    ggplot() +
    geom_col(
        aes(presence, name, fill = treatment, col = treatment),
        position = "identity", width = 1, alpha = 0.7
    ) +
    facet_grid(~source, scales = "free") +
    scale_x_continuous(expand = c(0, 0)) +
    theme(panel.border = element_rect(fill = NA))
```

# Synthetic Null Testing

A standard approach to inference is to apply a bootstrap (this is done at the end of the vignette). An alternative, which is less common but well suited to high dimensions, is to calibrate variable selection using synthetic null data. This is the intuition behind the knockoff, Clipper, and mirror statistic algorithms. The general recipe is:

1. Simulate synthetic null data from a model/sampling mechanism where we know for certain that an effect does not exist.
1. Use the full model to estimate effects on this synthetic null data.
1. Define a selection rule so that the fraction of selected synthetic null effects stays below a false discovery rate threshold (a small sketch of this step follows the list).
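To make the third step concrete, here is a minimal sketch of one common ranking rule. It uses a hypothetical data frame `effects_df` with a `source` column marking real versus synthetic rows and an `effect` column; the `null_contrast()` and `fdr_summary()` calls below carry out the real computation, whose FDR estimator may differ in its details.

```{r fdr_sketch_example, eval = FALSE}
# Rank features by absolute effect size and, for each cutoff, estimate the FDR
# of the rule "select every feature ranked at least this high" by the fraction
# of synthetic (known-null) features among the selections.
fdr_sketch <- function(effects_df) {
    effects_df |>
        arrange(desc(abs(effect))) |>
        mutate(
            n_synthetic = cumsum(source == "synthetic"),
            n_selected = row_number(),
            fdr_hat = n_synthetic / n_selected
        )
}
```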
The synthetic null effects essentially serve as negative controls that calibrate inference. In `multimedia`, `null_contrast` estimates effects on the real and synthetic null data, while `fdr_summary` ranks and selects features. Each point in the figure below is the estimated effect for a single feature in either the real or the synthetic data. We rank features by their absolute effect size; as soon as negative control features enter, the estimated false discovery rate for the rule that selects all features up to that point begins to increase.

```{r direct_contrast}
contrast <- null_contrast(model, exper, "T->Y", direct_effect)
fdr_direct <- fdr_summary(contrast, "direct_effect") |>
    dplyr::rename(effect = direct_effect)
```

The indirect effect analog is shown below. It seems we have more evidence for direct than for indirect effects, which is consistent with the literature on the difficulty of estimating indirect effects in mediation analysis.

```{r indirect_contrast}
contrast <- null_contrast(model, exper, "M->Y", indirect_overall)
fdr_indirect <- fdr_summary(contrast, "indirect_overall") |>
    dplyr::rename(effect = indirect_effect)
```

```{r contrast_plot, fig.width = 7, fig.height = 3}
fdr_data <- bind_rows(fdr_direct, fdr_indirect, .id = "effect_type") |>
    mutate(effect_type = c("Direct", "Indirect")[as.integer(effect_type)])

ggplot(fdr_data, aes(effect, fdr_hat)) +
    geom_text_repel(
        data = filter(fdr_data, keep),
        aes(label = outcome, col = source),
        size = 3, force = 10
    ) +
    labs(x = "Effect", y = expression(widehat(FDR)), col = "Data Source") +
    geom_point(aes(col = source, size = keep)) +
    scale_size_discrete("Selected", range = c(.5, 3), guide = "none") +
    scale_y_continuous(limits = c(-0.1, 0.55)) +
    facet_wrap(~effect_type, scales = "free_x") +
    theme(
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 14),
        legend.text = element_text(size = 12)
    )
```
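If you prefer the selections as a table rather than a figure, the columns already present in `fdr_data` are enough. This is a sketch; `keep` marks the rows that pass the selection rule, for both the real and synthetic sources, matching the enlarged points in the figure above.

```{r selected_features_table}
# Tabulate the features retained under the synthetic null calibration.
fdr_data |>
    filter(keep) |>
    arrange(effect_type, fdr_hat) |>
    select(effect_type, source, outcome, effect, fdr_hat)
```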
The next figure shows how a few taxa (Roseburia, Blautia, and Faecalibacterium) vary with each mediator, averaging repeated measurements within subjects.

```{r, fig.width = 7, fig.height = 4.3}
bind_mediation(exper_rela) |>
    select(
        subject, starts_with("mediator"), treatment,
        Roseburia, Blautia, Faecalibacterium
    ) |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    mutate(mediator = str_remove(mediator, "mediator.")) |>
    pivot_longer(
        Roseburia:Faecalibacterium,
        names_to = "taxon",
        values_to = "abundance"
    ) |>
    group_by(subject, mediator, treatment, taxon) |>
    summarise(abundance = mean(abundance), value = mean(value)) |>
    ggplot() +
    geom_point(aes(value, abundance, col = treatment)) +
    facet_grid(taxon ~ mediator, scales = "free") +
    labs(y = "Relative Abundance", x = "Mediator Value", col = "") +
    theme(
        axis.title = element_text(size = 14),
        strip.text = element_text(size = 12),
        strip.text.y = element_text(size = 12, angle = 0),
        legend.text = element_text(size = 12),
        panel.border = element_rect(fill = NA)
    )
```

Here is the analog for indirect pathwise effects. Since these effects distinguish between individual mediators, we can facet by mediator.

```{r indirect_pathwise_contrast, fig.width = 8, fig.height = 4}
contrast <- null_contrast(model, exper, "M->Y", indirect_pathwise)
fdr <- fdr_summary(contrast, "indirect_pathwise")

ggplot(fdr, aes(indirect_effect, fdr_hat)) +
    geom_text_repel(
        data = filter(fdr, keep),
        aes(label = outcome, col = source),
        size = 4
    ) +
    labs(x = "Indirect Effect", y = "Estimated False Discovery Rate") +
    geom_point(aes(col = source, size = keep)) +
    scale_size_discrete(range = c(.2, 3)) +
    facet_wrap(~mediator)
```

# Random Forests

The LNM makes relatively strong assumptions about the outcome model: the effects must enter linearly through a multinomial likelihood. As a nonparametric alternative, we can fit a separate random forest model to each microbe's abundance. Since this model doesn't directly account for compositions, it helps to transform the responses to relative abundances first.

```{r rf_estimate}
model <- multimedia(
    exper_rela,
    rf_model(num.trees = 50),
    glmnet_model(lambda = 0.5, alpha = 0)
) |>
    estimate(exper_rela)
```

We can compute direct and indirect effects from the estimated model.

```{r rf_effects}
direct <- direct_effect(model, exper_rela)
direct_sorted <- direct |>
    group_by(outcome, contrast) |>
    summarise(direct_effect = mean(direct_effect)) |>
    arrange(-direct_effect)

indirect_sorted <- indirect_pathwise(model, exper_rela) |>
    group_by(outcome, mediator, contrast) |>
    summarise(indirect_effect = mean(indirect_effect)) |>
    arrange(-abs(indirect_effect))

direct_sorted
indirect_sorted
```

We can visualize some of the features with the largest direct and indirect effects. Nonlinearity seems to have helped uncover a few mediator-to-microbe relationships, for example, sleep disturbance with Ruminococcus2 and cereal with Faecalibacterium.

```{r rf_mediation_plots, fig.width = 11, fig.height = 5}
plot_mediators(indirect_sorted, exper_rela)
```

# Bootstrap

The bootstrap can help evaluate uncertainty for more complex statistics, and it's a good option for our random forest analysis. In each iteration, we resample timepoints from the original experiment and compute the list of estimates specified by `funs` below. The result concatenates estimates across all bootstrap iterations.

```{r rf_bootstrap}
funs <- list(direct = direct_effect, indirect = indirect_overall)
# inference <- bootstrap(model, exper_rela, funs, B = 1e3)
# read precomputed bootstrap results rather than rerunning the call above
inference <- readRDS(url("https://go.wisc.edu/ba3e87"))
```

We can now visualize the bootstrap confidence intervals associated with the estimated effects. The outer interval is a 95% bootstrap confidence interval, and the inner interval covers the central 66% of the bootstrap distribution (roughly the median plus or minus one-third of the mass). Some associations certainly seem worth following up on, though more data will be needed for validation. Notice that the direct effects seem generally stronger than the indirect effects. Even with a flexible random forest model, changes in diet and sleep alone do not seem to be enough to explain the changes in microbiome composition seen in this study. This suggests that mechanisms related to the gut-brain axis may lie behind the observed direct effects.

```{r rf_bootstrap_figures, fig.width = 8.5, fig.height = 7}
p1 <- ggplot(inference$indirect) +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(indirect_effect, reorder(outcome, indirect_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(
        title = "Indirect Effects",
        y = "Taxon",
        x = "Effect (Rel. Abund Scale)"
    )

p2 <- ggplot(inference$direct) +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(direct_effect, reorder(outcome, direct_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(title = "Direct Effects", y = "Taxon", x = "Effect (Rel. Abund Scale)")

p1 | p2
```
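For a numeric companion to these intervals, the bootstrap draws can be summarized directly. This sketch assumes `inference$direct` has one row per bootstrap draw with `outcome` and `direct_effect` columns, as in the plotting code above.

```{r bootstrap_quantiles}
# Quantile-based 95% bootstrap intervals for the direct effects, one row per taxon.
inference$direct |>
    group_by(outcome) |>
    summarise(
        estimate = median(direct_effect),
        lower = quantile(direct_effect, 0.025),
        upper = quantile(direct_effect, 0.975),
        .groups = "drop"
    ) |>
    arrange(desc(abs(estimate)))
```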
```{r}
sessionInfo()
```