---
title: "Microbiome - Metabolome Mediation Analysis"
package: multimedia
output: 
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Microbiome - Metabolome Mediation Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r invisible, include = FALSE}
knitr::opts_chunk$set(
    fig.align = "center",
    collapse = TRUE,
    comment = "#>",
    warning = FALSE,
    message = FALSE,
    cache = FALSE,
    dev.args = list(bg = "transparent"),
    out.width = 600,
    crop = NULL
)

knitr::knit_hooks$set(output = multimedia::ansi_aware_handler)
suppressPackageStartupMessages(library(ggplot2))
options(
    ggplot2.discrete.colour = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.discrete.fill = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.continuous.colour = function(...) {
        scale_color_distiller(palette = "Spectral", ...)
    },
    ggplot2.continuous.fill = function(...) {
        scale_fill_distiller(palette = "Spectral", ...)
    },
    crayon.enabled = TRUE
)

th <- theme_classic() +
    theme(
        panel.background = element_rect(fill = "transparent"),
        strip.background = element_rect(fill = "transparent"),
        plot.background = element_rect(fill = "transparent", color = NA),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent"),
        legend.position = "bottom"
    )

theme_set(th)
```

Joint microbiome-metabolome experiments can give a view into how microbes shape
their environments. `multimedia` lets us analyze high-dimensional outcome and
mediation variables simultaneously. This allows us to analyze treatment $\to$
microbiome $\to$ metabolome and treatment $\to$ metabolome $\to$ microbiome
paths. The methodology only allows us to analyze one kind of path at a time; in
any case, it's impossible to completely disambiguate the directionality of the
association from a single timepoint without prior knowledge. That said, the
relationships that are uncovered from either of these directions can already
suggest novel microbe-metabolite relationships.

```{r setup}
library(brms)
library(compositions)
library(ggdist)
library(glue)
library(patchwork)
library(tidyverse)
library(vroom)
library(multimedia)
set.seed(20231222)
```

# Data Processing

We'll load data from the Borenstein Lab's 
[microbiome-metabolome curated data](https://go.wisc.edu/582dhj)
repository. This is a great resource that consistently documents each study and
ensures uniform data processing. Our data is from 
[Franzosa et al.'s](https://go.wisc.edu/14b4a2) study of intestinal bowel
disease. We'll treat disease status as the treatment, the microbiome as
mediators, and metabolites as the outcome, in the spirit of that study's
discussion:

> Together, these findings suggest that yet-to-be characterized molecules in the
gut metabolome, linked to inflammation and ultimately IBD, may be largely
microbially derived or modified.

```{r read_data}
Sys.setenv("VROOM_CONNECTION_SIZE" = 5e6)
taxa <- read_tsv("https://go.wisc.edu/l015v0")[, -1]
metabolites <- read_tsv("https://go.wisc.edu/0t3gs3")[, -1]
metadata <- read_tsv("https://go.wisc.edu/9z36wr")
```

We'll filter quite aggressively so that the examples in this vignette run
quickly. We then compile everything into a `mediation_data` object that
organizes the treatment, mediators, and outcomes. We apply a centered log
transform to the microbiome relative abundance and a log transform to the
metabolite expression levels. Note that I've hidden some of the preprocessing
functions (e.g., `simplify_tax_names`); these do what you expect, and you can
read them in the `.Rmd` source.

```{r functions, echo = FALSE}
filter_mean <- function(x, threshold = 3) {
    x[, colMeans(x) > threshold]
}

simplify_tax_names <- function(x) {
    colnames(x) <- colnames(x) |>
        str_extract("g__.*+") |>
        str_replace("g__", "g") |>
        str_remove_all("[[:punct:]]")
    x
}

simplify_metab_names <- function(x) {
    new_names <- colnames(x) |>
        str_remove_all("[[\\.|\\/|\\*|,|\\:|\\(|\\)]]") |>
        str_replace_all("-", "_") |>
        str_replace_all(" ", "_") |>
        str_remove("Cluster_|Clser") |>
        str_remove("HILIC|C8|C18") |>
        str_remove("pos|neg") |>
        str_remove("__")

    colnames(x) <- paste0("m", new_names)
    x
}

annotated_metabolites <- function(metabolites) {
    str_remove(colnames(metabolites), "C18-neg_Cluster_[0-9]+: ") |>
        str_remove("C8-pos_Cluster_[0-9]+: ") |>
        str_remove("HILIC-neg_Cluster_[0-9]+: ") |>
        str_remove("HILIC-pos_Cluster_[0-9]+: ") |>
        unique() |>
        setdiff("NA")
}
```

```{r}
taxa <- clr(taxa) |>
    filter_mean() |>
    simplify_tax_names()

metabolites <- log(1 + metabolites) |>
    filter_mean(threshold = 6) |>
    select(matches(annotated_metabolites(metabolites))) |>
    simplify_metab_names()

combined <- metabolites |>
    bind_cols(taxa, metadata) |>
    as_tibble()

exper <- mediation_data(
    combined,
    colnames(metabolites),
    "Study.Group",
    colnames(taxa)
)

exper
```

# Lasso Approach

We will fit sparse regression models for each outcome so that, for any
metabolite, only a subset of microbes is expected to influence its abundance.

```{r lasso}
model <- multimedia(exper, glmnet_model(lambda = 0.1)) |>
    estimate(exper)
```

We can have different direct effects depending on the treatment assignments of
the mediators. Here we'll average over all mediator settings and plot the
features with the largest direct effects. Note that the boxplots can reflect
indirect effects as well, and this is why the ordering below (based on direct
effect estimate) doesn't exactly correspond to the distance between the
boxplots.

```{r, lasso_direct, fig.width = 6, fig.height = 5}
direct <- list(
    CD = direct_effect(model, exper),
    UC = direct_effect(model, exper, 3, 2)
) |>
    map_dfr(effect_summary, .id = "treatment")

vis_direct <- direct |>
    slice_max(abs(direct_effect), n = 20) |>
    pull(outcome)

combined |>
    select(any_of(vis_direct), Study.Group) |>
    pivot_longer(-Study.Group, names_to = "feature") |>
    ggplot() +
    geom_boxplot(
        aes(value, reorder(feature, value, median),
            fill = Study.Group
        )
    ) +
    labs(
        x = "log(1 + intensity)",
        y = "Metabolite",
        fill = "Group"
    )
```

The block below calculates the analogous overall indirect effects. We're
particularly interested in the difference between metabolites that have strong
indirect vs. direct effects, because these are expected to have different
relationships with the microbiome. To simplify this comparison, we combine the
top direct and indirect effects in the object `top_effects`.

```{r}
indirect_effect <- list(
    CD = indirect_overall(model, exper),
    UC = indirect_overall(model, exper, 3, 2)
)

top_direct <- dplyr::rename(direct, effect = direct_effect)
top_indirect <- dplyr::rename(bind_rows(indirect_effect),
    effect = indirect_effect
)
top_effects <- list(direct = top_direct, indirect = top_indirect) |>
    bind_rows(.id = "type")

vis_outcomes <- c(
    "m0181_hydrocinnamic_acid", "m1303_lithocholate",
    "m0036_creatinine", "m0253_sphingosine", "m1478_C182_CE",
    "m0295_arginine"
)
top_effects <- bind_rows(
    filter(top_effects, outcome %in% vis_outcomes[1:3], type == "indirect"),
    filter(top_effects, outcome %in% vis_outcomes[4:6], type == "direct"),
)
```

The plot below is an MDS of microbiome compositions where point sizes represent
metabolite abundances. Notice that the for indirect effects, there is a clear
relationship between metabolites and microbiome composition. This is consistent
with what we expect from a mediation analysis: Indirect effects have to travel
through the mediators, which in this case are microbiome compositions.

```{r, fig.width = 8, fig.height = 4}
eig <- \(x, k) 100 * round(x[k] / sum(x), 4)

mds <- cmdscale(dist(mediators(exper)), eig = TRUE, k = 2)
coords <- data.frame(mds$points) |>
    bind_cols(treatments(exper)) |>
    bind_cols(outcomes(exper)[, vis_outcomes]) |>
    pivot_longer(top_effects$outcome,
        names_to = "outcome",
        values_to = "abundance"
    ) |>
    left_join(top_effects) |>
    group_by(outcome) |>
    mutate(abundance_quantile = as.integer(as.factor(cut(abundance, 10))) / 10)

ggplot(coords) +
    geom_point(aes(X1, X2, col = Study.Group, size = abundance_quantile)) +
    scale_size_area(max_size = 1.5, breaks = c(0.25, 0.5, 0.75)) +
    labs(
        x = glue("MDS1 [{eig(mds$eig, 1)}%]"),
        y = glue("MDS2 [{eig(mds$eig, 2)}%]"),
        size = "Metabolite Abundance Quantile",
        col = "Group"
    ) +
    facet_wrap(~ type + reorder(outcome, -abundance))
```

Mediation analysis hinges on the assumption that mediators and outcomes don't
share any unobserved confounding. This assumption can't be directly tested with
data, but it is possible to see how sensitive conclusions are to violations of
this assumption. Specifically, in sensitivity analysis, we deliberately imagine
that model assumptions are violated, and examine the extent to which indirect
effect estimates change. `multimedia` implements several types of experimental
sensitivity analyses (functions starting `sensitivity_*`). However, here, we'll
use a more standard approach that correlates the errors between subsets of
mediators and outcomes, implemented in the `sensitivity()` function. 
Specifically, we'll correlate the mediator model error for `gEnterocloster` and
the outcomes visualized above (`m0181_hydrocinnamic_acid`, `m1303_lithocholate`,
etc.) to see whether our estimated overall indirect effects are sensitivity to
violations in assumptions for this selection of mediators and outcomes.

```{r}
confound_ix <- data.frame(
    outcome = vis_outcomes,
    mediator = "gEnterocloster"
)
rho_seq <- seq(-0.5, 0.5, length.out = 10)
# sensitivity_overall <- sensitivity(
#     model, exper, confound_ix, n_boot = 100, rho_seq = rho_seq, t1 = 1, t2 = 2
# )
sensitivity_overall <- read_csv("https://go.wisc.edu/aj4vg9")
```

We visualize the results below. Across correlations strengths from -0.5 to 0.5,
none of the overall indirect effects seem sensitive: Though magnitudes of the
estimated effects may change slightly, the sign of the estimates remain
unchanged for all the mediator-outcome pairs that we tested. Only the
association between `gEnterocloster` and `m0295_arginine` seems at risk of
switching, and this mainly seems to be related to the large variance in its
estimate.

```{r visualize_sensitivity_overall, fig.width = 5, fig.height = 4}
sensitivity_overall |>
    filter(outcome %in% vis_outcomes) |>
    ggplot(aes(rho, indirect_effect)) +
    geom_line(linewidth = 1.5) +
    geom_ribbon(
        aes(
            ymin = indirect_effect - 2 * indirect_effect_standard_error, 
            ymax = indirect_effect + 2 * indirect_effect_standard_error
        ),
        alpha = 0.2, linewidth = 1
    ) +
    scale_x_continuous(expand = c(0, 0)) +
    facet_wrap(~ reorder(outcome, indirect_effect)) +
    geom_hline(yintercept = 0, linetype = 3, linewidth = 1) +
    labs(
        x = expression("Confounder Correlation Strength" ~ rho),
        y = "Estimated Overall Indirect Effect"
    )
```

Pathwise indirect effects are those that appear when modifying the treatment
status for a single mediator. Unlike overall indirect effects, they give us
effects for individual mediator-outcome pairs. Now, instead of averaging over
mediator treatment assignments, we average over direct edge treatment
assignments. The block below is time-consuming, so if you are following along,
you can just read in the pre-computed result in the next block.

```{r lasso_indirect}
# indirect_sorted <- list(
#     CD = indirect_pathwise(model, exper),
#     UC = indirect_pathwise(model, exper, 3, 2)
# ) |>
#     map_dfr(effect_summary, .id = "treatment")
indirect_sorted <- read_csv("https://go.wisc.edu/0c6qkt")
```

We can look at the effects which have the strongest indirect effects. These make
sense. For example, genus Biophila seems to have a negative relationship with
Taurine. Maybe that metabolite is produced by a competitor? Or maybe the microbe
eats the metabolite. In any case, IBD seems to decrease the abundance of
Biophila, which consequently increases the abundance of the metabolite.

```{r lasso_mediators_plot, fig.width = 9, fig.height = 4.5}
p <- indirect_sorted %>%
    split(.$treatment) |>
    map(~ arrange(., -indirect_effect) |>
        plot_mediators(exper, treatment = "Study.Group", nrow = 2))

p[["CD"]]
```

We can create models where the effects have been deliberately removed, which can
be used for synthetic null testing. Here are simulated samples from a
"nullified" model without direct effects, a full model with direct effects, and
the original samples. To the extent that the altered and fitted models differ,
the direct treatment edge is needed to improve model fit.

```{r lasso_synthetic, fig.width = 12, fig.height = 7}
altered <- model |>
    nullify("T->Y") |>
    estimate(exper)

# sample at original treatment assignments
profile <- setup_profile(model, treatments(exper), treatments(exper))
samples <- list(
    real = outcomes(exper),
    fitted = outcomes(sample(model, profile = profile)),
    altered = outcomes(sample(altered, profile = profile))
) |>
    bind_rows(.id = "source") |>
    bind_cols(treatment = rep(treatments(exper)$Study.Group, 3)) |>
    pivot_longer(direct$outcome) |>
    mutate(name = factor(name, levels = unique(direct$outcome)))

# visualize
ggplot(filter(samples, samples$name %in% unique(samples$name)[1:20])) +
    geom_boxplot(aes(value, name, fill = treatment)) +
    facet_wrap(~source)
```

How can we understand the uncertainty of the estimated models? One approach is
to form the bootstrap distribution of the effects that we care about. It's a
little problematic to form bootstrap confidence intervals for sparse regression.
We can still interpret the full bootstrap distributions, though. Effects that
have large nonzero components have more evidence of being real. This block is
again slow, so you can skip ahead and load a pre-computed version.

```{r lasso_bootstrap}
fs <- list(direct = direct_effect, indirect = indirect_overall)
# inference <- bootstrap(model, exper, fs, B = 1000)
inference <- readRDS(url("https://go.wisc.edu/08o4l0"))
```

```{r lasso_bootstrap_vis, fig.width = 9, fig.height = 5}
filter_outcomes <- inference$indirect |>
    effect_summary() |>
    group_by(outcome) |>
    summarise(indirect_effect = median(indirect_effect)) |>
    slice_max(abs(indirect_effect), n = 20) |>
    pull(outcome)

p1 <- filter(inference$indirect, outcome %in% filter_outcomes) |>
    ggplot() +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(indirect_effect, reorder(outcome, indirect_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(x = "Indirect Effect", y = "Metabolite")

filter_outcomes <- inference$direct |>
    group_by(outcome) |>
    summarise(direct_effect = median(direct_effect)) |>
    slice_max(abs(direct_effect), n = 20) |>
    pull(outcome)
p2 <- filter(inference$direct, outcome %in% filter_outcomes) |>
    ggplot() +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(direct_effect, reorder(outcome, direct_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(x = "Direct Effect", y = "Metabolite")

p1 | p2
```

Finally, let's revisit the indirect effects that seem to be traveling through
specific microbe-metabolite pairs, which we recovered using
`indirect_pathwise()`. As with the overall indirect effects, we can carry out a
sensitvity analysis, asking how conclusions would have changed if there had in
fact been a confounder between specific mediators and metabolites. In this
example, we will look at the strongest pathwise indirect effects that were
related to Crohns Disease. The block below correlates the pairs in each row of
the `confound_ix` data.frame. We use the same range of correlation strengths
`rho_seq` as before. Since estimating pathwise indirect effects is more
consuming than computing overall effects, we will settle with `n_boot = 1`,
which unfortunately means we won't have confidence bands around our sensitivity
curve.

```{r}
confound_ix <- data.frame(
    mediator = c("gCholadousia", "gBilophila", "gCAG103"),
    outcome = c(
        "m0066_taurine", "m0066_taurine", "m0900_ketodeoxycholate"
    )
)

#sensitivity_path <- sensitivity_pathwise(model, exper, confound_ix, rho_seq, 1)
sensitivity_path <- read_csv("https://go.wisc.edu/ph7ek4")
```

The results are given below. It seems that the estimated taurine $\iff$ Biophila
effect is quite robust: even if there had been an unobserved confounder that
correlates the mediator and outcome with strength $\rho = 0.5$, we still see a
clear indirect effect. The other two estimates require a bit more
interpretation. The taurine $\iff$ Choladousia effect seems highly variable for
small levels of $\rho$, and for larger values, the estimate completely vanishes.
We should not have too much confidence in this result, even for low levels of
confounding. On the other hand, the ketodeoxycholate $\iff$ CAG103 seems quite
robust for smaller values of $\rho$, but weakens under very strong confounding.
This situation is more ambiguous. It's possible that a confounder could be
responsible for the indirect effect; however, it would have to be quite strong,
perhaps stronger than is plausible.

```{r, fig.width = 9, fig.height = 2.8}
confound_ix |>
    left_join(sensitivity_path) |>
    ggplot(aes(rho, indirect_effect, group = contrast)) +
    geom_hline(yintercept = 0, linetype = 3, linewidth = 1) +
    geom_vline(xintercept = 0, linetype = 3, linewidth = 1) +
    geom_line(linewidth = 2) +
    labs(
        x = expression("Confounding parameter"~ rho),
        y = "Estimated Indirect Effect"
    ) +
    theme(
        axis.title = element_text(size = 16),
        strip.text = element_text(size = 14)
    ) +
    facet_wrap(
        reorder(outcome, indirect_effect) ~ reorder(mediator, indirect_effect),
        scales = "free_y"
    )
```

# Hurdle-Lognormal Approach

There are many zeros in the metabolites data, and sparse regression is not the
best approach for modeling these kinds of outcomes. Indeed, we seem to have been
most sensitive to indirect effects for highly abundant metabolites (which are
present in most samples), and this raises the question of whether we might have
missed true indirect effects where that model's linearity assumptions were not
appropriate.

As an alternative, we can work with a hurdle model, which explicitly models
zeros together with a nonnegative component. For this, we'll return the outcomes
to their original (unlogged) scaling. We'll apply a BRMS hurdle model. So that
this isn't too slow during the indirect effect estimation, I've further reduced
the number of taxa and metabolites in this analysis.

```{r reduced_experiment}
exper2 <- exper
outcomes(exper2) <- exp(outcomes(exper2)[, 1:50]) - 1
mediators(exper2) <- mediators(exper2)[, 1:50]
```

```{r hurdle_estimate}
# model <- multimedia(
#     exper2,
#     brms_model(family = hurdle_lognormal())
# ) |>
#     estimate(exper2)
model <- readRDS(url("https://go.wisc.edu/wj197j"))
```

With the updated model, we can again compute pathwise indirect effects. This
time, we are more sensitive to metabolites that completely vanish in the IBD
(treatment) group. In all of the selected pairs, there is also a treatment
effect on microbe abundance. If this hadn't been present, and if the metabolite
simply vanished, then that would be a direct effect.

```{r hurdle_indirect, fig.width = 12, fig.height = 9}
#indirect_sorted <- indirect_pathwise(model, exper2) |>
indirect_sorted <- read_csv("https://go.wisc.edu/r269qh") |>
    effect_summary()
plot_mediators(indirect_sorted, exper2, treatment = "Study.Group")
```

We can see how well the hurdle model captured the zero pattern by simulating
data from the fitted model. There are a few outliers, which is why I've trimmed
the $y$-axis. The model has not captured the specific shape of the treatment
group scatterplots above, which tends to have regions with only zeros for some
metabolites. In retrospect, this is natural. The hurdle model can only increase
the zero probability as a linear function of species abundance. We would need a
step function preidctor to support the kinds of transitions we see above.

```{r hurdle_synthetic, fig.width = 12, fig.height = 9}
profile <- setup_profile(model, treatments(exper), treatments(exper))
samples <- sample(model, profile = profile, pretreatment = pretreatments(exper))
colnames(outcomes(samples)) <- colnames(outcomes(exper2))
plot_mediators(indirect_sorted, samples, treatment = "Study.Group")
```

Finally, we can look at direct effects from this model, which can be compared to
the analogous figure for the sparse regression. Now, many outcomes with exact
zeros under treatment have been detected, which is consistent with the the
hurdle model's effectiveness in accurately reflecting features that influence
zero-inflation.

```{r hurdle_direct, fig.width = 6, height = 5}
#direct <- direct_effect(model, exper) |>
direct <- read_csv("https://go.wisc.edu/28ft5e") |>
    effect_summary() |>
    slice_max(abs(direct_effect), n = 6)

combined |>
    select(any_of(direct$outcome), Study.Group) |>
    pivot_longer(-Study.Group, names_to = "feature") |>
    mutate(feature = factor(feature, levels = unique(direct$outcome))) |>
    ggplot() +
    geom_boxplot(aes(value, feature, fill = Study.Group))
```

```{r}
sessionInfo()
```