--- title: "Microbiome - Metabolome Mediation Analysis" package: multimedia output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Microbiome - Metabolome Mediation Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r invisible, include = FALSE} knitr::opts_chunk$set( fig.align = "center", collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, cache = FALSE, dev.args = list(bg = "transparent"), out.width = 600, crop = NULL ) knitr::knit_hooks$set(output = multimedia::ansi_aware_handler) suppressPackageStartupMessages(library(ggplot2)) options( ggplot2.discrete.colour = c( "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9" ), ggplot2.discrete.fill = c( "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9" ), ggplot2.continuous.colour = function(...) { scale_color_distiller(palette = "Spectral", ...) }, ggplot2.continuous.fill = function(...) { scale_fill_distiller(palette = "Spectral", ...) }, crayon.enabled = TRUE ) th <- theme_classic() + theme( panel.background = element_rect(fill = "transparent"), strip.background = element_rect(fill = "transparent"), plot.background = element_rect(fill = "transparent", color = NA), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.background = element_rect(fill = "transparent"), legend.box.background = element_rect(fill = "transparent"), legend.position = "bottom" ) theme_set(th) ``` Joint microbiome-metabolome experiments can give a view into how microbes shape their environments. `multimedia` lets us analyze high-dimensional outcome and mediation variables simultaneously. This allows us to analyze treatment $\to$ microbiome $\to$ metabolome and treatment $\to$ metabolome $\to$ microbiome paths. The methodology only allows us to analyze one kind of path at a time; in any case, it's impossible to completely disambiguate the directionality of the association from a single timepoint without prior knowledge. That said, the relationships that are uncovered from either of these directions can already suggest novel microbe-metabolite relationships. ```{r setup} library(brms) library(compositions) library(ggdist) library(glue) library(patchwork) library(tidyverse) library(vroom) library(multimedia) set.seed(20231222) ``` # Data Processing We'll load data from the Borenstein Lab's [microbiome-metabolome curated data](https://go.wisc.edu/582dhj) repository. This is a great resource that consistently documents each study and ensures uniform data processing. Our data is from [Franzosa et al.'s](https://go.wisc.edu/14b4a2) study of intestinal bowel disease. We'll treat disease status as the treatment, the microbiome as mediators, and metabolites as the outcome, in the spirit of that study's discussion: > Together, these findings suggest that yet-to-be characterized molecules in the gut metabolome, linked to inflammation and ultimately IBD, may be largely microbially derived or modified. ```{r read_data} Sys.setenv("VROOM_CONNECTION_SIZE" = 5e6) taxa <- read_tsv("https://go.wisc.edu/l015v0")[, -1] metabolites <- read_tsv("https://go.wisc.edu/0t3gs3")[, -1] metadata <- read_tsv("https://go.wisc.edu/9z36wr") ``` We'll filter quite aggressively so that the examples in this vignette run quickly. We then compile everything into a `mediation_data` object that organizes the treatment, mediators, and outcomes. We apply a centered log transform to the microbiome relative abundance and a log transform to the metabolite expression levels. Note that I've hidden some of the preprocessing functions (e.g., `simplify_tax_names`); these do what you expect, and you can read them in the `.Rmd` source. ```{r functions, echo = FALSE} filter_mean <- function(x, threshold = 3) { x[, colMeans(x) > threshold] } simplify_tax_names <- function(x) { colnames(x) <- colnames(x) |> str_extract("g__.*+") |> str_replace("g__", "g") |> str_remove_all("[[:punct:]]") x } simplify_metab_names <- function(x) { new_names <- colnames(x) |> str_remove_all("[[\\.|\\/|\\*|,|\\:|\\(|\\)]]") |> str_replace_all("-", "_") |> str_replace_all(" ", "_") |> str_remove("Cluster_|Clser") |> str_remove("HILIC|C8|C18") |> str_remove("pos|neg") |> str_remove("__") colnames(x) <- paste0("m", new_names) x } annotated_metabolites <- function(metabolites) { str_remove(colnames(metabolites), "C18-neg_Cluster_[0-9]+: ") |> str_remove("C8-pos_Cluster_[0-9]+: ") |> str_remove("HILIC-neg_Cluster_[0-9]+: ") |> str_remove("HILIC-pos_Cluster_[0-9]+: ") |> unique() |> setdiff("NA") } ``` ```{r} taxa <- clr(taxa) |> filter_mean() |> simplify_tax_names() metabolites <- log(1 + metabolites) |> filter_mean(threshold = 6) |> select(matches(annotated_metabolites(metabolites))) |> simplify_metab_names() combined <- metabolites |> bind_cols(taxa, metadata) |> as_tibble() exper <- mediation_data( combined, colnames(metabolites), "Study.Group", colnames(taxa) ) exper ``` # Lasso Approach We will fit sparse regression models for each outcome so that, for any metabolite, only a subset of microbes is expected to influence its abundance. ```{r lasso} model <- multimedia(exper, glmnet_model(lambda = 0.1)) |> estimate(exper) ``` We can have different direct effects depending on the treatment assignments of the mediators. Here we'll average over all mediator settings and plot the features with the largest direct effects. Note that the boxplots can reflect indirect effects as well, and this is why the ordering below (based on direct effect estimate) doesn't exactly correspond to the distance between the boxplots. ```{r, lasso_direct, fig.width = 6, fig.height = 5} direct <- list( CD = direct_effect(model, exper), UC = direct_effect(model, exper, 3, 2) ) |> map_dfr(effect_summary, .id = "treatment") vis_direct <- direct |> slice_max(abs(direct_effect), n = 20) |> pull(outcome) combined |> select(any_of(vis_direct), Study.Group) |> pivot_longer(-Study.Group, names_to = "feature") |> ggplot() + geom_boxplot( aes(value, reorder(feature, value, median), fill = Study.Group ) ) + labs( x = "log(1 + intensity)", y = "Metabolite", fill = "Group" ) ``` The block below calculates the analogous overall indirect effects. We're particularly interested in the difference between metabolites that have strong indirect vs. direct effects, because these are expected to have different relationships with the microbiome. To simplify this comparison, we combine the top direct and indirect effects in the object `top_effects`. ```{r} indirect_effect <- list( CD = indirect_overall(model, exper), UC = indirect_overall(model, exper, 3, 2) ) top_direct <- dplyr::rename(direct, effect = direct_effect) top_indirect <- dplyr::rename(bind_rows(indirect_effect), effect = indirect_effect ) top_effects <- list(direct = top_direct, indirect = top_indirect) |> bind_rows(.id = "type") vis_outcomes <- c( "m0181_hydrocinnamic_acid", "m1303_lithocholate", "m0036_creatinine", "m0253_sphingosine", "m1478_C182_CE", "m0295_arginine" ) top_effects <- bind_rows( filter(top_effects, outcome %in% vis_outcomes[1:3], type == "indirect"), filter(top_effects, outcome %in% vis_outcomes[4:6], type == "direct"), ) ``` The plot below is an MDS of microbiome compositions where point sizes represent metabolite abundances. Notice that the for indirect effects, there is a clear relationship between metabolites and microbiome composition. This is consistent with what we expect from a mediation analysis: Indirect effects have to travel through the mediators, which in this case are microbiome compositions. ```{r, fig.width = 8, fig.height = 4} eig <- \(x, k) 100 * round(x[k] / sum(x), 4) mds <- cmdscale(dist(mediators(exper)), eig = TRUE, k = 2) coords <- data.frame(mds$points) |> bind_cols(treatments(exper)) |> bind_cols(outcomes(exper)[, vis_outcomes]) |> pivot_longer(top_effects$outcome, names_to = "outcome", values_to = "abundance" ) |> left_join(top_effects) |> group_by(outcome) |> mutate(abundance_quantile = as.integer(as.factor(cut(abundance, 10))) / 10) ggplot(coords) + geom_point(aes(X1, X2, col = Study.Group, size = abundance_quantile)) + scale_size_area(max_size = 1.5, breaks = c(0.25, 0.5, 0.75)) + labs( x = glue("MDS1 [{eig(mds$eig, 1)}%]"), y = glue("MDS2 [{eig(mds$eig, 2)}%]"), size = "Metabolite Abundance Quantile", col = "Group" ) + facet_wrap(~ type + reorder(outcome, -abundance)) ``` Mediation analysis hinges on the assumption that mediators and outcomes don't share any unobserved confounding. This assumption can't be directly tested with data, but it is possible to see how sensitive conclusions are to violations of this assumption. Specifically, in sensitivity analysis, we deliberately imagine that model assumptions are violated, and examine the extent to which indirect effect estimates change. `multimedia` implements several types of experimental sensitivity analyses (functions starting `sensitivity_*`). However, here, we'll use a more standard approach that correlates the errors between subsets of mediators and outcomes, implemented in the `sensitivity()` function. Specifically, we'll correlate the mediator model error for `gEnterocloster` and the outcomes visualized above (`m0181_hydrocinnamic_acid`, `m1303_lithocholate`, etc.) to see whether our estimated overall indirect effects are sensitivity to violations in assumptions for this selection of mediators and outcomes. ```{r} confound_ix <- data.frame( outcome = vis_outcomes, mediator = "gEnterocloster" ) rho_seq <- seq(-0.5, 0.5, length.out = 10) # sensitivity_overall <- sensitivity( # model, exper, confound_ix, n_boot = 100, rho_seq = rho_seq, t1 = 1, t2 = 2 # ) sensitivity_overall <- read_csv("https://go.wisc.edu/aj4vg9") ``` We visualize the results below. Across correlations strengths from -0.5 to 0.5, none of the overall indirect effects seem sensitive: Though magnitudes of the estimated effects may change slightly, the sign of the estimates remain unchanged for all the mediator-outcome pairs that we tested. Only the association between `gEnterocloster` and `m0295_arginine` seems at risk of switching, and this mainly seems to be related to the large variance in its estimate. ```{r visualize_sensitivity_overall, fig.width = 5, fig.height = 4} sensitivity_overall |> filter(outcome %in% vis_outcomes) |> ggplot(aes(rho, indirect_effect)) + geom_line(linewidth = 1.5) + geom_ribbon( aes( ymin = indirect_effect - 2 * indirect_effect_standard_error, ymax = indirect_effect + 2 * indirect_effect_standard_error ), alpha = 0.2, linewidth = 1 ) + scale_x_continuous(expand = c(0, 0)) + facet_wrap(~ reorder(outcome, indirect_effect)) + geom_hline(yintercept = 0, linetype = 3, linewidth = 1) + labs( x = expression("Confounder Correlation Strength" ~ rho), y = "Estimated Overall Indirect Effect" ) ``` Pathwise indirect effects are those that appear when modifying the treatment status for a single mediator. Unlike overall indirect effects, they give us effects for individual mediator-outcome pairs. Now, instead of averaging over mediator treatment assignments, we average over direct edge treatment assignments. The block below is time-consuming, so if you are following along, you can just read in the pre-computed result in the next block. ```{r lasso_indirect} # indirect_sorted <- list( # CD = indirect_pathwise(model, exper), # UC = indirect_pathwise(model, exper, 3, 2) # ) |> # map_dfr(effect_summary, .id = "treatment") indirect_sorted <- read_csv("https://go.wisc.edu/0c6qkt") ``` We can look at the effects which have the strongest indirect effects. These make sense. For example, genus Biophila seems to have a negative relationship with Taurine. Maybe that metabolite is produced by a competitor? Or maybe the microbe eats the metabolite. In any case, IBD seems to decrease the abundance of Biophila, which consequently increases the abundance of the metabolite. ```{r lasso_mediators_plot, fig.width = 9, fig.height = 4.5} p <- indirect_sorted %>% split(.$treatment) |> map(~ arrange(., -indirect_effect) |> plot_mediators(exper, treatment = "Study.Group", nrow = 2)) p[["CD"]] ``` We can create models where the effects have been deliberately removed, which can be used for synthetic null testing. Here are simulated samples from a "nullified" model without direct effects, a full model with direct effects, and the original samples. To the extent that the altered and fitted models differ, the direct treatment edge is needed to improve model fit. ```{r lasso_synthetic, fig.width = 12, fig.height = 7} altered <- model |> nullify("T->Y") |> estimate(exper) # sample at original treatment assignments profile <- setup_profile(model, treatments(exper), treatments(exper)) samples <- list( real = outcomes(exper), fitted = outcomes(sample(model, profile = profile)), altered = outcomes(sample(altered, profile = profile)) ) |> bind_rows(.id = "source") |> bind_cols(treatment = rep(treatments(exper)$Study.Group, 3)) |> pivot_longer(direct$outcome) |> mutate(name = factor(name, levels = unique(direct$outcome))) # visualize ggplot(filter(samples, samples$name %in% unique(samples$name)[1:20])) + geom_boxplot(aes(value, name, fill = treatment)) + facet_wrap(~source) ``` How can we understand the uncertainty of the estimated models? One approach is to form the bootstrap distribution of the effects that we care about. It's a little problematic to form bootstrap confidence intervals for sparse regression. We can still interpret the full bootstrap distributions, though. Effects that have large nonzero components have more evidence of being real. This block is again slow, so you can skip ahead and load a pre-computed version. ```{r lasso_bootstrap} fs <- list(direct = direct_effect, indirect = indirect_overall) # inference <- bootstrap(model, exper, fs, B = 1000) inference <- readRDS(url("https://go.wisc.edu/08o4l0")) ``` ```{r lasso_bootstrap_vis, fig.width = 9, fig.height = 5} filter_outcomes <- inference$indirect |> effect_summary() |> group_by(outcome) |> summarise(indirect_effect = median(indirect_effect)) |> slice_max(abs(indirect_effect), n = 20) |> pull(outcome) p1 <- filter(inference$indirect, outcome %in% filter_outcomes) |> ggplot() + geom_vline(xintercept = 0) + stat_slabinterval( aes(indirect_effect, reorder(outcome, indirect_effect, median)), .width = c(.66, .95) ) + labs(x = "Indirect Effect", y = "Metabolite") filter_outcomes <- inference$direct |> group_by(outcome) |> summarise(direct_effect = median(direct_effect)) |> slice_max(abs(direct_effect), n = 20) |> pull(outcome) p2 <- filter(inference$direct, outcome %in% filter_outcomes) |> ggplot() + geom_vline(xintercept = 0) + stat_slabinterval( aes(direct_effect, reorder(outcome, direct_effect, median)), .width = c(.66, .95) ) + labs(x = "Direct Effect", y = "Metabolite") p1 | p2 ``` Finally, let's revisit the indirect effects that seem to be traveling through specific microbe-metabolite pairs, which we recovered using `indirect_pathwise()`. As with the overall indirect effects, we can carry out a sensitvity analysis, asking how conclusions would have changed if there had in fact been a confounder between specific mediators and metabolites. In this example, we will look at the strongest pathwise indirect effects that were related to Crohns Disease. The block below correlates the pairs in each row of the `confound_ix` data.frame. We use the same range of correlation strengths `rho_seq` as before. Since estimating pathwise indirect effects is more consuming than computing overall effects, we will settle with `n_boot = 1`, which unfortunately means we won't have confidence bands around our sensitivity curve. ```{r} confound_ix <- data.frame( mediator = c("gCholadousia", "gBilophila", "gCAG103"), outcome = c( "m0066_taurine", "m0066_taurine", "m0900_ketodeoxycholate" ) ) #sensitivity_path <- sensitivity_pathwise(model, exper, confound_ix, rho_seq, 1) sensitivity_path <- read_csv("https://go.wisc.edu/ph7ek4") ``` The results are given below. It seems that the estimated taurine $\iff$ Biophila effect is quite robust: even if there had been an unobserved confounder that correlates the mediator and outcome with strength $\rho = 0.5$, we still see a clear indirect effect. The other two estimates require a bit more interpretation. The taurine $\iff$ Choladousia effect seems highly variable for small levels of $\rho$, and for larger values, the estimate completely vanishes. We should not have too much confidence in this result, even for low levels of confounding. On the other hand, the ketodeoxycholate $\iff$ CAG103 seems quite robust for smaller values of $\rho$, but weakens under very strong confounding. This situation is more ambiguous. It's possible that a confounder could be responsible for the indirect effect; however, it would have to be quite strong, perhaps stronger than is plausible. ```{r, fig.width = 9, fig.height = 2.8} confound_ix |> left_join(sensitivity_path) |> ggplot(aes(rho, indirect_effect, group = contrast)) + geom_hline(yintercept = 0, linetype = 3, linewidth = 1) + geom_vline(xintercept = 0, linetype = 3, linewidth = 1) + geom_line(linewidth = 2) + labs( x = expression("Confounding parameter"~ rho), y = "Estimated Indirect Effect" ) + theme( axis.title = element_text(size = 16), strip.text = element_text(size = 14) ) + facet_wrap( reorder(outcome, indirect_effect) ~ reorder(mediator, indirect_effect), scales = "free_y" ) ``` # Hurdle-Lognormal Approach There are many zeros in the metabolites data, and sparse regression is not the best approach for modeling these kinds of outcomes. Indeed, we seem to have been most sensitive to indirect effects for highly abundant metabolites (which are present in most samples), and this raises the question of whether we might have missed true indirect effects where that model's linearity assumptions were not appropriate. As an alternative, we can work with a hurdle model, which explicitly models zeros together with a nonnegative component. For this, we'll return the outcomes to their original (unlogged) scaling. We'll apply a BRMS hurdle model. So that this isn't too slow during the indirect effect estimation, I've further reduced the number of taxa and metabolites in this analysis. ```{r reduced_experiment} exper2 <- exper outcomes(exper2) <- exp(outcomes(exper2)[, 1:50]) - 1 mediators(exper2) <- mediators(exper2)[, 1:50] ``` ```{r hurdle_estimate} # model <- multimedia( # exper2, # brms_model(family = hurdle_lognormal()) # ) |> # estimate(exper2) model <- readRDS(url("https://go.wisc.edu/wj197j")) ``` With the updated model, we can again compute pathwise indirect effects. This time, we are more sensitive to metabolites that completely vanish in the IBD (treatment) group. In all of the selected pairs, there is also a treatment effect on microbe abundance. If this hadn't been present, and if the metabolite simply vanished, then that would be a direct effect. ```{r hurdle_indirect, fig.width = 12, fig.height = 9} #indirect_sorted <- indirect_pathwise(model, exper2) |> indirect_sorted <- read_csv("https://go.wisc.edu/r269qh") |> effect_summary() plot_mediators(indirect_sorted, exper2, treatment = "Study.Group") ``` We can see how well the hurdle model captured the zero pattern by simulating data from the fitted model. There are a few outliers, which is why I've trimmed the $y$-axis. The model has not captured the specific shape of the treatment group scatterplots above, which tends to have regions with only zeros for some metabolites. In retrospect, this is natural. The hurdle model can only increase the zero probability as a linear function of species abundance. We would need a step function preidctor to support the kinds of transitions we see above. ```{r hurdle_synthetic, fig.width = 12, fig.height = 9} profile <- setup_profile(model, treatments(exper), treatments(exper)) samples <- sample(model, profile = profile, pretreatment = pretreatments(exper)) colnames(outcomes(samples)) <- colnames(outcomes(exper2)) plot_mediators(indirect_sorted, samples, treatment = "Study.Group") ``` Finally, we can look at direct effects from this model, which can be compared to the analogous figure for the sparse regression. Now, many outcomes with exact zeros under treatment have been detected, which is consistent with the the hurdle model's effectiveness in accurately reflecting features that influence zero-inflation. ```{r hurdle_direct, fig.width = 6, height = 5} #direct <- direct_effect(model, exper) |> direct <- read_csv("https://go.wisc.edu/28ft5e") |> effect_summary() |> slice_max(abs(direct_effect), n = 6) combined |> select(any_of(direct$outcome), Study.Group) |> pivot_longer(-Study.Group, names_to = "feature") |> mutate(feature = factor(feature, levels = unique(direct$outcome))) |> ggplot() + geom_boxplot(aes(value, feature, fill = Study.Group)) ``` ```{r} sessionInfo() ```