---
title: "The Mindfulness Study"
package: multimedia
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The Mindfulness Study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    fig.align = "center",
    collapse = TRUE,
    comment = "#>",
    warning = FALSE,
    message = FALSE,
    cache = FALSE,
    dev.args = list(bg = "transparent"),
    out.width = 600,
    crop = NULL
)
knitr::knit_hooks$set(output = multimedia::ansi_aware_handler)
suppressPackageStartupMessages(library(ggplot2))
options(
    ggplot2.discrete.colour = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.discrete.fill = c(
        "#9491D9", "#F24405", "#3F8C61", "#8C2E62", "#F2B705", "#11A0D9"
    ),
    ggplot2.continuous.colour = function(...) {
        scale_color_distiller(palette = "Spectral", ...)
    },
    ggplot2.continuous.fill = function(...) {
        scale_fill_distiller(palette = "Spectral", ...)
    },
    crayon.enabled = TRUE
)

th <- theme_classic() +
    theme(
        panel.background = element_rect(fill = "transparent"),
        strip.background = element_rect(fill = "transparent"),
        plot.background = element_rect(fill = "transparent", color = NA),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.background = element_rect(fill = "transparent"),
        legend.box.background = element_rect(fill = "transparent"),
        legend.position = "bottom"
    )
theme_set(th)
```

```{r helpers, include = FALSE}
normalize <- function(x) {
    x <- as_tibble(x)
    x / rowSums(x)
}
```

In most microbiome mediation analysis studies, microbiome composition is thought to mediate some outcome of interest. Sometimes, though, we might want to treat the microbiome as an outcome. Our motivating example here looks at the relationship between mindfulness and the microbiome. In this setting, we might imagine that mindfulness training influences behaviors (diet, sleep, exercise) that could then change the microbiome. Clinical surveys that gauge those behaviors now play the role of mediator. Alternatively, mindfulness could trigger neurophysiological changes that influence the microbiome in more subtle ways. The point of mediation analysis is to provide evidence for one or the other of these possibilities.

```{r libraries}
library(ggdist)
library(ggrepel)
library(patchwork) # used for the p1 | p2 figure layout in the final section
library(phyloseq)
library(tidyverse)
library(multimedia)
data(mindfulness)
set.seed(20231228)
```

This package needs a `mediation_data` object to organize the pretreatment, treatment, mediator, and outcome variables. `taxa_names(mindfulness)` specifies that all the taxa should be outcome variables. The mediator variable names have already been prefixed with "mediator", so we can select them with the `starts_with` helper. We treat the `subject` column as a pretreatment variable: the study tracked participants across a few timepoints, and we want to control for different baseline behavioral and microbiome profiles.

```{r data}
exper <- mediation_data(
    mindfulness,
    taxa_names(mindfulness),
    "treatment",
    starts_with("mediator"),
    "subject"
)
exper
```
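If you want to double check how the columns were assigned, the accessors used throughout this vignette return each component of the `mediation_data` object. A minimal sketch:

```{r inspect_components}
# Peek at each component of the mediation_data object. dim(), colnames(), and
# head() simply confirm how the columns were split across components.
dim(outcomes(exper))
colnames(mediators(exper))
head(treatments(exper))
head(pretreatments(exper))
```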
The next code block applies a logistic-normal multinomial (LNM) model to the microbiome outcome. This is an example where the response variables are all modeled jointly. We use a ridge regression model for each of the mediators; this is a better choice than ordinary linear regression because we have to estimate many subject-level parameters and our sample size is not that large.

```{r lnm}
# use Flavonifractor as the reference for the log-ratio transform
outcomes(exper) <- outcomes(exper) |>
    select(Flavonifractor, everything())

model <- multimedia(
    exper,
    lnm_model(seed = .Random.seed[1]),
    glmnet_model(lambda = 0.5, alpha = 0)
) |>
    estimate(exper)
```

# Evaluating Effects

With this model, we can estimate direct and indirect effects, averaging over the two possible treatment settings of the mediators. Since the LNM outputs relative abundance profiles, all the estimated effects should be interpreted on a probability scale.

```{r lnm_direct}
direct <- direct_effect(model, exper) |>
    group_by(outcome) |>
    summarise(direct_effect = mean(direct_effect)) |>
    arrange(-abs(direct_effect))
direct
```

These are the indirect effects obtained when we switch the mediators to their treatment setting one at a time.

```{r lnm_indirect}
indirect <- indirect_pathwise(model, exper) |>
    group_by(mediator, outcome) |>
    summarise(indirect_effect = mean(indirect_effect)) |>
    arrange(-abs(indirect_effect))
indirect
```

`plot_mediators` visualizes the relationship between mediator and outcome variables. The title for each subplot is the estimated indirect effect (note that it's `Control` - `Treatment` in the current definition). For example, in the panel relating sleep disturbances and Blautia, there is a rough positive association between these variables. The control group has a higher number of sleep disturbances, which translates into a positive indirect effect for this pair. The model accounts for more than these pairwise relationships, though -- it controls for subject-level baselines and includes direct effects of treatment -- so these figures should be interpreted within the larger model context.

```{r mediator_plot, fig.width = 12, fig.height = 6}
exper_rela <- exper
outcomes(exper_rela) <- normalize(outcomes(exper))
plot_mediators(indirect, exper_rela)
```

# Model Alterations

Hypothesis testing is built on the idea that a simple submodel might explain the data just as well as a full, more complicated one. In mediation analysis, we can zero out sets of edges to define submodels. For example, we can re-estimate a model that does not allow any relationship between the treatment and the mediators.

```{r alteration_m}
altered <- model |>
    nullify("T->M") |>
    estimate(exper)
```

These modified models can be used to simulate synthetic null data that help contextualize the effects seen in the real data. For example, we can compare imaginary cohorts drawn from the original and altered models. Notice that we re-assign all the treatments below. This removes potential confounding between the treatment and subject-level effects -- such confounding doesn't exist on average across the population, because the study was randomized, but we want to remove any associations present in our finite sample.

```{r mediation_sample}
new_assign <- treatments(exper)[sample(nrow(treatments(exper))), , drop = FALSE]
profile <- setup_profile(model, new_assign, new_assign)
m1 <- sample(model, profile = profile, pretreatment = pretreatments(exper))
m2 <- sample(altered, profile = profile, pretreatment = pretreatments(exper))
```

The sampled objects above have the same class as `exper`, so we can extract the real and simulated mediator data and compare treatment effects along those edges. As expected, the altered model shows no systematic difference between treatment groups. This lends some evidence to the conclusion that mindfulness does influence the mediators, which was implicit in our discussion of indirect effects above.
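Before plotting, here is a quick tabular version of the same comparison, a sketch that reuses the sampled objects above and assumes the mediator columns are numeric. Under the nullified `T->M` edges, the two group means for each mediator should be close.

```{r altered_mediator_check}
# Average each mediator within treatment groups for the altered-model samples.
bind_cols(new_assign, mediators(m2)) |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    group_by(mediator, treatment) |>
    summarise(mean_value = mean(value), .groups = "drop")
```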
```{r synthetic_mediation_plot, fig.width = 10, fig.height = 4}
list(
    real = bind_cols(treatments(exper), mediators(exper), pretreatments(exper)),
    original = bind_cols(new_assign, mediators(m1), pretreatments(exper)),
    altered = bind_cols(new_assign, mediators(m2), pretreatments(exper))
) |>
    bind_rows(.id = "source") |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    ggplot() +
    geom_boxplot(
        aes(value, reorder(mediator, value, median), fill = treatment)
    ) +
    facet_grid(~source)
```

Here are the analogous alterations that remove the direct (`T->Y`) and mediator-to-outcome (`M->Y`) edges. In the figure that follows, the $x$-axis is on a log scale, and the LNM's estimated treatment effects for rare taxa are not practically significant. It's interesting that, even though the original data don't show much difference between treatment groups, the model suggests that there are real effects, even for the abundant taxa. It's possible that, in the raw boxplots, heterogeneity across subjects and mediators masks true effects, and that the LNM detects these effects by appropriately accounting for those sources of heterogeneity.

```{r outcome_alteration}
altered_direct <- model |>
    nullify("T->Y") |>
    estimate(exper)

altered_indirect <- model |>
    nullify("M->Y") |>
    estimate(exper)

profile <- setup_profile(model, treatments(exper), treatments(exper))
y1 <- sample(model, profile = profile, pretreatment = pretreatments(exper))
y2 <- sample(
    altered_direct,
    profile = profile,
    pretreatment = pretreatments(exper)
)
y3 <- sample(
    altered_indirect,
    profile = profile,
    pretreatment = pretreatments(exper)
)
```

```{r alteration_plot, fig.width = 8, fig.height = 7}
taxa_order <- normalize(outcomes(exper)) |>
    log() |>
    apply(2, \(x) {
        x[is.infinite(x)] <- NA
        median(x, na.rm = TRUE)
    }) |>
    order()

combined_samples <- list(
    "Original Data" = normalize(outcomes(exper)),
    "Full Model" = normalize(outcomes(y1)),
    "No Direct Effect" = normalize(outcomes(y2)),
    "No Indirect Effect" = normalize(outcomes(y3))
) |>
    bind_rows(.id = "source") |>
    mutate(treatment = rep(treatments(exper)$treatment, 4)) |>
    pivot_longer(colnames(outcomes(exper))) |>
    mutate(name = factor(name, levels = colnames(outcomes(exper))[taxa_order]))

ggplot(combined_samples) +
    geom_boxplot(
        aes(log(value), name, fill = treatment),
        size = 0.4, outlier.size = 0.8, position = "identity", alpha = 0.7
    ) +
    labs(x = "log(Relative Abundance)", y = "Genus", fill = "", col = "") +
    facet_grid(~source, scales = "free") +
    theme(panel.border = element_rect(fill = NA))
```

The same samples can also be compared on the presence/absence scale. The next figure shows the fraction of samples in which each genus is detected, under each data source.

```{r alteration_presence, fig.width = 6, fig.height = 7}
combined_samples |>
    group_by(name, source, treatment) |>
    summarise(presence = mean(value > 0)) |>
    ggplot() +
    geom_col(
        aes(presence, name, fill = treatment, col = treatment),
        position = "identity", width = 1, alpha = 0.7
    ) +
    facet_grid(~source, scales = "free") +
    scale_x_continuous(expand = c(0, 0)) +
    theme(panel.border = element_rect(fill = NA))
```

# Synthetic Null Testing

A standard approach to inference is to apply a bootstrap (this is done at the end of the vignette). An alternative, which is less common but well suited to high dimensions, is to calibrate variable selection using synthetic null data. This is the intuition behind the knockoff, Clipper, and mirror statistic algorithms. The general recipe is:

1. Simulate synthetic null data from a model/sampling mechanism where we know for certain that an effect does not exist.
1. Use the full model to estimate effects on this synthetic null data.
1. Define a selection rule so that the fraction of selected synthetic null effects stays below a false discovery rate threshold (a small sketch of this step follows the list).
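To make the third step concrete, here is a minimal sketch of one common ranking rule. It uses a hypothetical data frame `effects_df` with a `source` column marking real versus synthetic rows and an `effect` column; the `null_contrast()` and `fdr_summary()` calls below carry out the real computation, whose FDR estimator may differ in its details.

```{r fdr_sketch_example, eval = FALSE}
# Rank features by absolute effect size and, for each cutoff, estimate the FDR
# of the rule "select every feature ranked at least this high" by the fraction
# of synthetic (known-null) features among the selections.
fdr_sketch <- function(effects_df) {
    effects_df |>
        arrange(desc(abs(effect))) |>
        mutate(
            n_synthetic = cumsum(source == "synthetic"),
            n_selected = row_number(),
            fdr_hat = n_synthetic / n_selected
        )
}
```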
The synthetic null effects essentially serve as negative controls that calibrate inference. In `multimedia`, `null_contrast` estimates effects on the real and synthetic null data, while `fdr_summary` ranks and selects features. Each point in the figure below is the estimated effect for a single feature in either the real or the synthetic data. We rank features by their absolute effect size; as soon as negative control features enter, the estimated false discovery rate for the rule that selects all features up to that point begins to increase.

```{r direct_contrast}
contrast <- null_contrast(model, exper, "T->Y", direct_effect)
fdr_direct <- fdr_summary(contrast, "direct_effect") |>
    dplyr::rename(effect = direct_effect)
```

The indirect effect analog is shown below. It seems we have more evidence for direct than for indirect effects, which is consistent with the literature on the difficulty of estimating indirect effects in mediation analysis.

```{r indirect_contrast}
contrast <- null_contrast(model, exper, "M->Y", indirect_overall)
fdr_indirect <- fdr_summary(contrast, "indirect_overall") |>
    dplyr::rename(effect = indirect_effect)
```

```{r contrast_plot, fig.width = 7, fig.height = 3}
fdr_data <- bind_rows(fdr_direct, fdr_indirect, .id = "effect_type") |>
    mutate(effect_type = c("Direct", "Indirect")[as.integer(effect_type)])

ggplot(fdr_data, aes(effect, fdr_hat)) +
    geom_text_repel(
        data = filter(fdr_data, keep),
        aes(label = outcome, col = source),
        size = 3, force = 10
    ) +
    labs(x = "Effect", y = expression(widehat(FDR)), col = "Data Source") +
    geom_point(aes(col = source, size = keep)) +
    scale_size_discrete("Selected", range = c(.5, 3), guide = "none") +
    scale_y_continuous(limits = c(-0.1, 0.55)) +
    facet_wrap(~effect_type, scales = "free_x") +
    theme(
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 14),
        legend.text = element_text(size = 12)
    )
```
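If you prefer the selections as a table rather than a figure, the columns already present in `fdr_data` are enough. This is a sketch; `keep` marks the rows that pass the selection rule, for both the real and synthetic sources, matching the enlarged points in the figure above.

```{r selected_features_table}
# Tabulate the features retained under the synthetic null calibration.
fdr_data |>
    filter(keep) |>
    arrange(effect_type, fdr_hat) |>
    select(effect_type, source, outcome, effect, fdr_hat)
```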
The next figure shows how a few taxa (Roseburia, Blautia, and Faecalibacterium) vary with each mediator, averaging repeated measurements within subjects.

```{r, fig.width = 7, fig.height = 4.3}
bind_mediation(exper_rela) |>
    select(
        subject, starts_with("mediator"), treatment,
        Roseburia, Blautia, Faecalibacterium
    ) |>
    pivot_longer(starts_with("mediator"), names_to = "mediator") |>
    mutate(mediator = str_remove(mediator, "mediator.")) |>
    pivot_longer(
        Roseburia:Faecalibacterium,
        names_to = "taxon",
        values_to = "abundance"
    ) |>
    group_by(subject, mediator, treatment, taxon) |>
    summarise(abundance = mean(abundance), value = mean(value)) |>
    ggplot() +
    geom_point(aes(value, abundance, col = treatment)) +
    facet_grid(taxon ~ mediator, scales = "free") +
    labs(y = "Relative Abundance", x = "Mediator Value", col = "") +
    theme(
        axis.title = element_text(size = 14),
        strip.text = element_text(size = 12),
        strip.text.y = element_text(size = 12, angle = 0),
        legend.text = element_text(size = 12),
        panel.border = element_rect(fill = NA)
    )
```

Here is the analog for indirect pathwise effects. Since these effects distinguish between individual mediators, we can facet by mediator.

```{r indirect_pathwise_contrast, fig.width = 8, fig.height = 4}
contrast <- null_contrast(model, exper, "M->Y", indirect_pathwise)
fdr <- fdr_summary(contrast, "indirect_pathwise")

ggplot(fdr, aes(indirect_effect, fdr_hat)) +
    geom_text_repel(
        data = filter(fdr, keep),
        aes(label = outcome, col = source),
        size = 4
    ) +
    labs(x = "Indirect Effect", y = "Estimated False Discovery Rate") +
    geom_point(aes(col = source, size = keep)) +
    scale_size_discrete(range = c(.2, 3)) +
    facet_wrap(~mediator)
```

# Random Forests

The LNM makes relatively strong assumptions about the outcome model: the effects must enter linearly through a multinomial likelihood. As a nonparametric alternative, we can fit a separate random forest model to each microbe's abundance. Since this model doesn't directly account for compositions, it helps to transform the responses to relative abundances first.

```{r rf_estimate}
model <- multimedia(
    exper_rela,
    rf_model(num.trees = 50),
    glmnet_model(lambda = 0.5, alpha = 0)
) |>
    estimate(exper_rela)
```

We can compute direct and indirect effects from the estimated model.

```{r rf_effects}
direct <- direct_effect(model, exper_rela)
direct_sorted <- direct |>
    group_by(outcome, contrast) |>
    summarise(direct_effect = mean(direct_effect)) |>
    arrange(-direct_effect)

indirect_sorted <- indirect_pathwise(model, exper_rela) |>
    group_by(outcome, mediator, contrast) |>
    summarise(indirect_effect = mean(indirect_effect)) |>
    arrange(-abs(indirect_effect))

direct_sorted
indirect_sorted
```

We can visualize some of the features with the largest direct and indirect effects. Nonlinearity seems to have helped uncover a few mediator-to-microbe relationships, for example, sleep disturbance with Ruminococcus2 and cereal with Faecalibacterium.

```{r rf_mediation_plots, fig.width = 11, fig.height = 5}
plot_mediators(indirect_sorted, exper_rela)
```

# Bootstrap

The bootstrap can help evaluate uncertainty for more complex statistics, and it's a good option for our random forest analysis. In each iteration, we resample timepoints from the original experiment and compute the list of estimates specified by `funs` below. The result concatenates estimates across all bootstrap iterations.

```{r rf_bootstrap}
funs <- list(direct = direct_effect, indirect = indirect_overall)
# inference <- bootstrap(model, exper_rela, funs, B = 1e3)
# read precomputed bootstrap results rather than rerunning the call above
inference <- readRDS(url("https://go.wisc.edu/ba3e87"))
```

We can now visualize the bootstrap confidence intervals associated with the estimated effects. The outer interval is a 95% bootstrap confidence interval, and the inner interval covers the central 66% of the bootstrap distribution (roughly the median plus or minus one-third of the mass). Some associations certainly seem worth following up on, though more data will be needed for validation. Notice that the direct effects seem generally stronger than the indirect effects. Even with a flexible random forest model, changes in diet and sleep alone do not seem to be enough to explain the changes in microbiome composition seen in this study. This suggests that mechanisms related to the gut-brain axis may lie behind the observed direct effects.

```{r rf_bootstrap_figures, fig.width = 8.5, fig.height = 7}
p1 <- ggplot(inference$indirect) +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(indirect_effect, reorder(outcome, indirect_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(
        title = "Indirect Effects",
        y = "Taxon",
        x = "Effect (Rel. Abund Scale)"
    )

p2 <- ggplot(inference$direct) +
    geom_vline(xintercept = 0) +
    stat_slabinterval(
        aes(direct_effect, reorder(outcome, direct_effect, median)),
        .width = c(.66, .95)
    ) +
    labs(title = "Direct Effects", y = "Taxon", x = "Effect (Rel. Abund Scale)")

p1 | p2
```
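For a numeric companion to these intervals, the bootstrap draws can be summarized directly. This sketch assumes `inference$direct` has one row per bootstrap draw with `outcome` and `direct_effect` columns, as in the plotting code above.

```{r bootstrap_quantiles}
# Quantile-based 95% bootstrap intervals for the direct effects, one row per taxon.
inference$direct |>
    group_by(outcome) |>
    summarise(
        estimate = median(direct_effect),
        lower = quantile(direct_effect, 0.025),
        upper = quantile(direct_effect, 0.975),
        .groups = "drop"
    ) |>
    arrange(desc(abs(estimate)))
```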
```{r}
sessionInfo()
```