1 Introduction

This project explores potential signatures of a vegan diet across the microbiome, metabolome, and lipidome. We used data from healthy vegan and omnivorous human subjects in two countries (Czech Republic and Italy), with subjects grouped by Country and Diet, resulting in four distinct groups.

To assess the generalizability of these findings, we validated our results with an independent cohort from the Czech Republic for external validation.

1.1 Statistical Methods

The statistical modeling approach is described in detail in this report. Briefly, the methods used included:

Multivariate analysis: We conducted multivariate analyses (PERMANOVA, PCA, correlation analyses) to explore the effects of diet, country, and their possible interaction (diet : country) on the microbiome, lipidome, and metabolome compositions in an integrative manner. This part of the analysis is not available on the GitHub page, but the code will be provided upon request.
Linear models: Linear models were applied to estimate the effects of diet, country, and their interaction (diet:country) on individual lipids, metabolites, bacterial taxa and pathways (“features”). Features that significantly differed between diet groups (based on the estimated average conditional effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were further examined in the independent external validation cohort to assess whether these associations were reproducible. Next, we fit linear models restricted to vegan participants to test whether omics profiles varied with the duration of vegan diet. Fixed-effect predictors were diet duration (per 10 years), country, their interaction, and age (included due to correlation with diet duration).
Predictive models (elastic net): We employed elastic net logistic regression (via the glmnet R package) to predict vegan status based on metabolome, lipidome, microbiome and pathways data (one model per dataset; four models in total). We considered three combinations of Lasso and Ridge penalties (alpha = 0, 0.2, 0.4). For each alpha, we selected the penalty strength (λ_1se) using 10-fold cross-validation. This value corresponds to the most regularized model whose performance was within one standard error of the minimum deviance. The alpha–lambda pair with the lowest deviance was chosen to fit the final model, whose coefficients are reported.
To estimate model performance, we repeated the full modeling procedure (including hyperparameter tuning) 500 times on bootstrap resamples of the training data. In each iteration, the model was trained on the resampled data and evaluated on the out-of-bag subjects (i.e., those not included in the training set in that iteration). The mean, and 2.5th, and 97.5th percentiles of the resulting ROC-AUC values represent the estimated out-of-sample AUC and its 95% confidence interval.
Finally, the final model was applied to an independent validation cohort to generate predicted probabilities of vegan status. These probabilities were then used to assess external discrimination between diet groups (ROC-AUC in the independent validation cohort). The elastic net models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. It also offered a complementary perspective on which features are most clearly associated with diet

2 Initiation

2.1 Set home directory

Open code

setwd('/home/ticf/secured_data/GitRepo/ticf/478_MOCA_italian')

2.2 Upload initiation file

Open code

source('478_initiation.R')

3 Data

3.1 Upload and epxlore

3.1.1 Get pathways

Only these pathways that have non-zero values in all observation in at least one dataset will be chosen

Open code

data_path_originalCZ <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_CZ_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway)) %>%
  select(X..Pathway)
dim(data_path_originalCZ)
## [1] 580   1


data_path_originalIT <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_IT_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway)) %>%
  select(X..Pathway)
dim(data_path_originalIT)
## [1] 488   1


data_path_validation <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_Validation_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway)) %>%
  select(X..Pathway)
dim(data_path_validation)
## [1] 580   1


tr <- intersect(
  data_path_originalCZ$X..Pathway,
  data_path_originalIT$X..Pathway
)

paths <- intersect(tr, data_path_validation$X..Pathway)
length(paths)
## [1] 481

3.1.2 Italian data

Open code

data_path_originalIT <- read.delim(
  'gitignore/data/pathw/Pathway_abundance_MetaCyc_IT_humann.tsv',  
  header = TRUE, sep = "\t"
  ) %>% 
  filter(X..Pathway %in% paths)

features <- data_path_originalIT[,1]
split_features <- strsplit(features, ": ")
feature_name <- sapply(split_features, `[`, 1)
feature_description <- sapply(split_features, `[`, 2)
feature_name <- gsub(" |-", "_", feature_name)

row.names(data_path_originalIT) <- c(feature_name)
data_path_originalIT <- data.frame(t(data_path_originalIT[,-1]))
attr(data_path_originalIT, "description") <- feature_description


originalIT_features <- data_path_originalIT%>%
  select(where(~ mean(. == 0) < 0.7)) %>% 
  select(-UNMAPPED, -UNINTEGRATED) %>% 
  colnames()

3.1.3 Czech data

Open code

data_path_originalCZ <- read.delim(
  'gitignore/data/pathw/Pathway_abundance_MetaCyc_CZ_humann.tsv',  
  header = TRUE, sep = "\t"
  ) %>% 
  filter(X..Pathway %in% paths)

features <- data_path_originalCZ[,1]
split_features <- strsplit(features, ": ")
feature_name <- sapply(split_features, `[`, 1)
feature_description <- sapply(split_features, `[`, 2)
feature_name <- gsub(" |-", "_", feature_name)

row.names(data_path_originalCZ) <- c(feature_name)
data_path_originalCZ <- data.frame(t(data_path_originalCZ[,-1]))
attr(data_path_originalCZ, "description") <- feature_description
 
originalCZ_features <- data_path_originalCZ %>%
  select(where(~ mean(. == 0) < 0.7)) %>% 
  select(-UNMAPPED, -UNINTEGRATED) %>% 
  colnames()

3.1.4 Validation data

Open code

data_pathways_validation <- read.delim(
  'gitignore/data/pathw/Pathway_abundance_MetaCyc_Validation_humann.tsv', 
  header = TRUE, 
  sep = "\t"
  ) %>% 
  filter(X..Pathway %in% paths)

features <- data_pathways_validation[,1]
split_features <- strsplit(features, ": ")
feature_name <- sapply(split_features, `[`, 1)
feature_description <- sapply(split_features, `[`, 2)
feature_name <- gsub(" |-", "_", feature_name)

row.names(data_pathways_validation) <- c(feature_name)
data_pathways_validation <- data.frame(t(data_pathways_validation[,-1]))
attr(data_pathways_validation, "description") <- feature_description
row.names(data_pathways_validation) <- gsub("^X", "K", row.names(data_pathways_validation))

validation_features <- data_pathways_validation %>%
  select(-UNMAPPED, -UNINTEGRATED) %>% 
  colnames()

3.1.5 Merging data

Modify data

Open code


# which taxa
set.seed(478)

features <- intersect(originalIT_features, originalCZ_features)
features <- intersect(features, validation_features)


# CZ
data_path_originalCZ_filtered <- data_path_originalCZ %>%
  select(any_of(features))

data_path_originalCZ_filtered <- data_path_originalCZ_filtered/ rowSums(data_path_originalCZ_filtered)

data_path_originalCZ_filtered <- data_path_originalCZ_filtered %>% 
  mutate(
    ID = row.names(.),
    Country = 'CZ'
  ) %>% 
  select(ID, Country, any_of(features))

# IT
data_path_originalIT_filtered <- data_path_originalIT %>%
  select(any_of(features))

data_path_originalIT_filtered <- data_path_originalIT_filtered/ rowSums(data_path_originalIT_filtered)

data_path_originalIT_filtered <- data_path_originalIT_filtered %>% 
  mutate(
    ID = row.names(.),
    Country = 'IT'
  ) %>% 
  select(ID, Country, any_of(features))

# joining the table

data_path_original_filtered <- bind_rows(data_path_originalIT_filtered,
                                         data_path_originalCZ_filtered)

bacteria_data <- data_path_original_filtered %>% 
  select(all_of(features))

if (file.exists("gitignore/data_path_original_impCLR.RData") == FALSE) {
  
  bacteria_data_imp <- lrSVD(
    bacteria_data,
    label = 0, 
    dl = NULL, 
    z.delete = FALSE, 
    ncp = 2
  )

  row.names( bacteria_data_imp) <- row.names(bacteria_data) 
  
  bacteria_data <- data.frame(clr(bacteria_data_imp)) %>%
    mutate(ID = row.names(.))

  training_metadata <- read.xlsx("gitignore/data/lipidome_training_cohort.xlsx") %>%
    select(Sample, Diet) %>%
    mutate(ID = Sample) %>%
    select(-Sample)

  data_path_original_filtered <- data_path_original_filtered %>%
    select(ID, Country) %>%
    left_join(bacteria_data, by = "ID")

  data_pathways_original_clr <- data_path_original_filtered %>%
    mutate(
      Data = if_else(Country == "CZ", "CZ_tr", "IT_tr")
    ) %>%
    left_join(training_metadata, by = "ID") %>%
    select(ID, Diet, Country, Data, everything())

  save(
    data_pathways_original_clr,
    file = "gitignore/data_path_original_impCLR.RData"
  )
}

load("gitignore/data_path_original_impCLR.RData")

if (file.exists("gitignore/data_pathways_original_impCLR.csv") == FALSE)
  write.csv(data_pathways_original_clr,
            "gitignore/data_pathways_original_impCLR.csv")

## Show variances of CLR proportions across samples
data_variance <- data_pathways_original_clr %>%
  rowwise() %>%   
  mutate(variance = var(c_across(-(ID:Data)))) %>%  
  ungroup() %>%       
  select(ID, variance)

## Look at distribution
hist(data_variance$variance)

Open code


## Show extreme samples
data_variance %>% arrange(desc(variance))
## # A tibble: 166 × 2
##    ID     variance
##    <chr>     <dbl>
##  1 T173      11.0 
##  2 T181       9.32
##  3 T182       8.98
##  4 VOV004     8.57
##  5 VOV100     8.54
##  6 T176       8.49
##  7 VOV042     8.48
##  8 VOV088     8.48
##  9 VOV006     8.46
## 10 T189       8.45
## # ℹ 156 more rows

Get diet informationDiet` from another dataset for validating data

Open code

validation_metadata <- read.xlsx('gitignore/data/lipidome_validation_cohort.xlsx') %>% 
  select(X1, X2) %>% 
  mutate(ID = X1,
         Diet = X2) %>% 
  select(-X1, -X2)

data_pathways_validation_filtered <- data_pathways_validation %>% 
  select(any_of(features))

set.seed(478)

data_pathways_validation_filtered <-  (data_pathways_validation_filtered/
  rowSums(data_pathways_validation_filtered ))

if (file.exists("gitignore/data_pathways_validation_impCLR.RData") == FALSE) {
  
  data_pathways_validation_filtered_imp <- lrSVD(
    data_pathways_validation_filtered,
    label = 0, dl = NULL, z.delete = FALSE
  )

row.names(data_pathways_validation_filtered_imp) <- row.names(data_pathways_validation_filtered)

  data_pathways_validation_clr <- data.frame(
    clr(data_pathways_validation_filtered_imp)
  ) %>%
    mutate(
      ID = row.names(.),
      Country = "CZ",
      Data = "valid"
    ) %>%
    left_join(validation_metadata, by = "ID") %>%
    select(ID, Diet, Country, Data, any_of(features)) %>%
    filter(!is.na(Diet))

  ## Add Diet for K284 which has the diet missing
  data_pathways_validation_clr[
    which(data_pathways_validation_clr$ID == "K284"), "Diet"
  ] <- "VEGAN"


  save(
    data_pathways_validation_clr, 
    file = "gitignore/data_pathways_validation_impCLR.RData"
  )
}

load("gitignore/data_pathways_validation_impCLR.RData")

if (file.exists("gitignore/data_pathways_validation_impCLR.csv") == FALSE)
  write.csv(data_pathways_validation_clr,
            "gitignore/data_pathways_validation_impCLR.csv")

## Show variances of CLR proportions across samples
data_variance <- data_pathways_validation_clr %>%
  rowwise() %>%   
  mutate(variance = var(c_across(-(ID:Data)))) %>%  
  ungroup() %>%       
  select(ID, variance)

## Look at distribution
hist(data_variance$variance)

Open code


## Show extreme samples
data_variance %>% arrange(desc(variance))
## # A tibble: 101 × 2
##    ID    variance
##    <chr>    <dbl>
##  1 K16       7.90
##  2 K55       7.38
##  3 K74       7.36
##  4 K119      7.29
##  5 K38       7.27
##  6 K12       7.22
##  7 K61       7.13
##  8 K292      7.10
##  9 K82       7.07
## 10 K64       7.02
## # ℹ 91 more rows

3.1.6 Merge training and validation dataset

Open code

data_merged <- bind_rows(
  data_pathways_original_clr,
  data_pathways_validation_clr
)

3.2 Explore

3.2.0.1 Distributions

The following plot will show distribution of 36 randomly selected pathways

Open code

size = c(6,6)
check <- data_pathways_original_clr[, 5:ncol(data_pathways_original_clr)]

check <- check[, sample(1:ncol(check), size[1]*size[2])]


par(mfrow = c(size[1],size[2]))
par(mar=c(2,1.5,2,0.5))

for(x in 1:ncol(check)){
  hist(check[,x], 
       16, 
       col='blue', 
       main = paste0(colnames(check)[x])
  )
}

Data seems to have relatively symmetric distribution

3.2.0.2 Pathways accross groups

Open code

set.seed(478)
colo <- c('#F9FFAF','#329243')

outcomes <- data.frame(
  variable = data_merged %>% 
    select(any_of(sample(features, 35))) %>% 
    colnames()
)

boxplot_cond <- function(variable) {
  
  
  p <- ggboxplot(data_merged, 
                 x = 'Diet', 
                 y = variable, 
                 fill = 'Diet', 
                 tip.length = 0.15,
                 palette = colo,
                 outlier.shape = 1,
                 lwd = 0.25,
                 outlier.size = 0.8,
                 facet.by = 'Data',
                 title = variable,
                 ylab = 'pathway level') +
    
    theme(
      plot.title = element_text(size = 10), 
      axis.title = element_text(size = 8),  
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    ) 
    
  
  return(p)
}

# Plot all outcomes
plots <- map(outcomes$variable, boxplot_cond)

# Create a matrix of plots
plots_arranged <- ggarrange(plotlist = plots, ncol = 5, nrow = 7,  common.legend = TRUE)
plots_arranged

Levels of pathways across all 3 cohorts (Czech and Italian training cohorts and an independent Czech valdiation cohort) and across dietary groups

4 Linear models across pathways

We will fit a feature-specific linear model where the clr-transformed pathway represents the outcome variable whereas country (Italy vs Czech), diet (vegan vs omnivore), and their interaction (country:diet) all represent fixed-effects predictors. So, each model has following form

\[ clr(\text{pathway level}) = \alpha + \beta_{1} \times \text{country} + \beta_{2} \times \text{diet} + \beta_{3} \times \text{country:diet} + \epsilon \]

The variables were coded as follows: \(diet = -0.5\) for omnivores and \(diet = 0.5\) for vegans; \(country = -0.5\) for the Czech cohort and \(country = 0.5\) for the Italian cohort.
This parameterization allows us to interpret the linear model summary output as presenting the conditional effects of diet averaged across both countries and the conditional effects of country averaged across both diet groups. We will then use the emmeans package (Lenth 2024) to obtain specific estimates for the effect of diet in the Italian and Czech cohorts separately, still from a single model.

pathways that will show a significant diet effect (average effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) will be then visualized using a forest plot, with country-specific diet effect along with diet effect based on independent validation cohort, to evaluate how generalizable are these findings.

4.1 Select and wrangle data

Open code

data_analysis_pathways <- data_pathways_original_clr %>%
  na.omit() %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 0.5, -0.5
      )
    ),
    Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    ),
    ) %>%
  dplyr::select(
    ID,
    Country,
    Country_IT,
    Diet,
    Diet_VEGAN,
    dplyr::everything()
  )

summary(data_analysis_pathways[ , 1:12])
##       ID              Country            Country_IT           Diet          
##  Length:166         Length:166         Min.   :-0.50000   Length:166        
##  Class :character   Class :character   1st Qu.:-0.50000   Class :character  
##  Mode  :character   Mode  :character   Median :-0.50000   Mode  :character  
##                                        Mean   :-0.03012                     
##                                        3rd Qu.: 0.50000                     
##                                        Max.   : 0.50000                     
##    Diet_VEGAN           Data            X1CMET2_PWY    ALLANTOINDEG_PWY 
##  Min.   :-0.50000   Length:166         Min.   :1.273   Min.   :-6.0996  
##  1st Qu.:-0.50000   Class :character   1st Qu.:2.422   1st Qu.:-4.9392  
##  Median : 0.50000   Mode  :character   Median :2.728   Median :-3.4441  
##  Mean   : 0.08434                      Mean   :2.663   Mean   :-3.6060  
##  3rd Qu.: 0.50000                      3rd Qu.:3.016   3rd Qu.:-2.3188  
##  Max.   : 0.50000                      Max.   :3.741   Max.   :-0.6658  
##  ANAEROFRUCAT_PWY ANAGLYCOLYSIS_PWY ARG.POLYAMINE_SYN  ARGININE_SYN4_PWY
##  Min.   :0.1183   Min.   :1.490     Min.   :-2.72033   Min.   :-2.4001  
##  1st Qu.:1.5725   1st Qu.:2.407     1st Qu.: 0.07355   1st Qu.: 0.7401  
##  Median :1.9835   Median :2.770     Median : 0.37284   Median : 1.4195  
##  Mean   :1.8992   Mean   :2.710     Mean   : 0.35348   Mean   : 1.3716  
##  3rd Qu.:2.2510   3rd Qu.:3.023     3rd Qu.: 0.77630   3rd Qu.: 2.1978  
##  Max.   :3.1912   Max.   :4.664     Max.   : 1.97414   Max.   : 3.0797

data_analysis_pathways[1:5 , 1:8]
##       ID Country Country_IT Diet Diet_VEGAN  Data X1CMET2_PWY ALLANTOINDEG_PWY
## 1 VOV002      IT        0.5 OMNI       -0.5 IT_tr    2.186503        -5.783155
## 2 VOV003      IT        0.5 OMNI       -0.5 IT_tr    2.008170        -1.685330
## 3 VOV004      IT        0.5 OMNI       -0.5 IT_tr    3.418194        -4.651552
## 4 VOV006      IT        0.5 OMNI       -0.5 IT_tr    3.174203        -4.576071
## 5 VOV007      IT        0.5 OMNI       -0.5 IT_tr    1.479598        -2.290716

4.1.1 Define number of pathways and covariates

Open code

n_covarites <- 6
n_features <- ncol(data_analysis_pathways) - n_covarites

4.1.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGdiet_inCZ <- vector('double', n_features)
est_VGdiet_inIT <- vector('double', n_features)
est_VGdiet_avg <- vector('double', n_features)

est_ITcountry_avg <- vector('double', n_features)
diet_country_int <- vector('double', n_features)


P_VGdiet_inCZ <- vector('double', n_features)
P_VGdiet_inIT <- vector('double', n_features)
P_VGdiet_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)
P_diet_country_int <- vector('double', n_features)

CI_L_VGdiet_inCZ <- vector('double', n_features)
CI_L_VGdiet_inIT <- vector('double', n_features)
CI_L_VGdiet_avg <- vector('double', n_features)

CI_U_VGdiet_inCZ <- vector('double', n_features)
CI_U_VGdiet_inIT <- vector('double', n_features)
CI_U_VGdiet_avg <- vector('double', n_features)

4.2 Run linear models over pathways

Open code


for (i in 1:n_features) {
  
  ## define variable
  data_analysis_pathways$outcome <- data_analysis_pathways[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Country_IT * Diet_VEGAN, data = data_analysis_pathways)

  ## get contrast (effects of diet BY COUNTRY)
  contrast_emm <- summary(
    pairs(
      emmeans(
        model,
        specs = ~ Diet_VEGAN | Country_IT
        ),
      interaction = TRUE,
      adjust = "none"
      ),
    infer = c(TRUE, TRUE)
    )

  ## save results
  outcome[i] <- names(data_analysis_pathways)[i + n_covarites]
  
  ## country effect
  est_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 1
  ]

  P_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 4
  ]
  
  
  ## diet effect
  tr <- confint(model)
  
  CI_L_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][1]
  CI_U_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][2]
  
  est_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
  
  est_VGdiet_inCZ[i] <- -contrast_emm$estimate[1]
  P_VGdiet_inCZ[i] <- contrast_emm$p.value[1]
  CI_L_VGdiet_inCZ[i] <- -contrast_emm$upper.CL[1]
  CI_U_VGdiet_inCZ[i] <- -contrast_emm$lower.CL[1]
  
  
  est_VGdiet_inIT[i] <- -contrast_emm$estimate[2]
  P_VGdiet_inIT[i] <- contrast_emm$p.value[2]
  CI_L_VGdiet_inIT[i] <- -contrast_emm$upper.CL[2]
  CI_U_VGdiet_inIT[i] <- -contrast_emm$lower.CL[2]
  
  
  ## interaction
  diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 1
  ]

  P_diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 4
  ]
}

4.3 Results table

Open code

result_pathways <- data.frame(
  outcome,
  est_ITcountry_avg, P_ITcountry_avg,
  est_VGdiet_avg, P_VGdiet_avg,
  est_VGdiet_inCZ, P_VGdiet_inCZ,
  est_VGdiet_inIT, P_VGdiet_inIT,
  diet_country_int, P_diet_country_int,
  CI_L_VGdiet_avg, CI_U_VGdiet_avg,
  CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
  CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
)

4.3.1 Adjust p values

Open code

result_pathways <- result_pathways %>% 
  dplyr::mutate(
    fdr_ITcountry_avg = p.adjust(P_ITcountry_avg, method = 'BH'),
    fdr_VGdiet_avg = p.adjust(P_VGdiet_avg, method = 'BH'),
    
    fdr_VGdiet_inCZ = p.adjust(P_VGdiet_inCZ, method = 'BH'),
    fdr_VGdiet_inIT = p.adjust(P_VGdiet_inIT, method = 'BH'),
    fdr_diet_country_int = p.adjust(P_diet_country_int, method = 'BH')
  ) %>% 
  dplyr::select(
    outcome,
    est_ITcountry_avg, P_ITcountry_avg, fdr_ITcountry_avg,
    est_VGdiet_avg, P_VGdiet_avg, fdr_VGdiet_avg,
    est_VGdiet_inCZ, P_VGdiet_inCZ, fdr_VGdiet_inCZ,
    est_VGdiet_inIT, P_VGdiet_inIT, fdr_VGdiet_inIT,
    diet_country_int, P_diet_country_int, fdr_diet_country_int,
    CI_L_VGdiet_avg, CI_U_VGdiet_avg,
    CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
    CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
  )

4.3.2 Result: show and save

Open code

kableExtra::kable(result_pathways %>%  filter(fdr_VGdiet_avg < 0.05),
                  caption = "Result of linear models, modelling CLR-transformed relative level of given fcuntional pathway with `Diet`, `Country` and `Diet x Country` interaction as fixed-effect predictors. Only the pathways that differ between vegans and omnivores in training cohorts (FDR < 0.05, average diet effect over both countries) are shown. `est` prefix: denotes estimated effects (regression coefficient), i.e. how much clr-transformed relative pathway level differ in vegans compared to omnivores, and in Italian vs Czech cohort, respectively; `P`: p-value, `fdr`: p-value after adjustment for multiple comparison, `CI_L` and `CI_U`: lower and upper bounds of 95% confidence interval respectively. `avg` suffix shows effect averaged across subgroups, whereas `inCZ` and `inIT` shows effect in Czech or Italian cohort respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of vegan diet between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and `P_diet_country_int` (its p-value) All estimates in a single row are based on a single model"
                  )

Result of linear models, modelling CLR-transformed relative level of given fcuntional pathway with `Diet`, `Country` and `Diet x Country` interaction as fixed-effect predictors. Only the pathways that differ between vegans and omnivores in training cohorts (FDR < 0.05, average diet effect over both countries) are shown. `est` prefix: denotes estimated effects (regression coefficient), i.e. how much clr-transformed relative pathway level differ in vegans compared to omnivores, and in Italian vs Czech cohort, respectively; `P`: p-value, `fdr`: p-value after adjustment for multiple comparison, `CI_L` and `CI_U`: lower and upper bounds of 95% confidence interval respectively. `avg` suffix shows effect averaged across subgroups, whereas `inCZ` and `inIT` shows effect in Czech or Italian cohort respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of vegan diet between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and `P_diet_country_int` (its p-value) All estimates in a single row are based on a single model
outcome	est_ITcountry_avg	P_ITcountry_avg	fdr_ITcountry_avg	est_VGdiet_avg	P_VGdiet_avg	fdr_VGdiet_avg	est_VGdiet_inCZ	P_VGdiet_inCZ	fdr_VGdiet_inCZ	est_VGdiet_inIT	P_VGdiet_inIT	fdr_VGdiet_inIT	diet_country_int	P_diet_country_int	fdr_diet_country_int	CI_L_VGdiet_avg	CI_U_VGdiet_avg	CI_L_VGdiet_inCZ	CI_U_VGdiet_inCZ	CI_L_VGdiet_inIT	CI_U_VGdiet_inIT
CALVIN_PWY	-0.0296002	0.7062271	0.7756810	0.2613223	0.0010627	0.0250528	0.3326251	0.0031400	0.0456284	0.1900196	0.0882619	0.9858171	-0.1426055	0.3644016	0.5988432	0.1065206	0.4161240	0.1135788	0.5516714	-0.0287793	0.4088185
GALACTITOLCAT_PWY	-0.7049222	0.0030045	0.0114303	-0.9807686	0.0000454	0.0022705	-1.3052878	0.0001197	0.0038094	-0.6562494	0.0489003	0.9858171	0.6490384	0.1673536	0.5246896	-1.4428039	-0.5187332	-1.9590736	-0.6515020	-1.3092967	-0.0032021
HEXITOLDEGSUPER_PWY	-0.6345150	0.0033358	0.0122899	-0.9086781	0.0000337	0.0019655	-1.2010968	0.0001017	0.0035588	-0.6162595	0.0422636	0.9858171	0.5848373	0.1716696	0.5246896	-1.3292685	-0.4880877	-1.7962374	-0.6059562	-1.2107279	-0.0217911
HSERMETANA_PWY	0.0746296	0.3489152	0.4504382	0.2473730	0.0021839	0.0388779	0.3473933	0.0023538	0.0446191	0.1473527	0.1912725	0.9858171	-0.2000406	0.2098315	0.5283527	0.0904967	0.4042494	0.1254113	0.5693753	-0.0743785	0.3690840
NONOXIPENT_PWY	-0.1079297	0.1953293	0.3065706	0.3293683	0.0001085	0.0047487	0.3660825	0.0021621	0.0446191	0.2926541	0.0136120	0.9858171	-0.0734285	0.6588361	0.8034587	0.1654654	0.4932712	0.1341578	0.5980072	0.0609913	0.5243168
PROPFERM_PWY	-1.1733941	0.0000000	0.0000000	-0.5452806	0.0023327	0.0388779	-1.1841185	0.0000045	0.0002256	0.0935574	0.7077896	0.9858171	1.2776759	0.0003881	0.0445874	-0.8933922	-0.1971690	-1.6767008	-0.6915363	-0.3984685	0.5855832
PWY_1269	0.2494440	0.0568017	0.1234820	0.4002225	0.0024462	0.0389173	0.3686893	0.0467370	0.1237939	0.4317557	0.0200073	0.9858171	0.0630665	0.8086764	0.8985293	0.1434741	0.6569709	0.0053870	0.7319915	0.0688638	0.7946476
PWY_5104	-0.0029780	0.9880447	0.9965869	-0.8870113	0.0000146	0.0018502	-0.8735792	0.0022026	0.0446191	-0.9004434	0.0015980	0.2796561	-0.0268642	0.9461154	0.9714199	-1.2788612	-0.4951614	-1.4280518	-0.3191067	-1.4542896	-0.3465972
PWY_5367	-0.1763380	0.3098164	0.4170606	-0.8229389	0.0000044	0.0015281	-1.3000263	0.0000004	0.0001260	-0.3458515	0.1593626	0.9858171	0.9541748	0.0065146	0.2672971	-1.1647303	-0.4811475	-1.7836654	-0.8163873	-0.8289442	0.1372413
PWY_5747	0.3801128	0.1486819	0.2513565	-0.8724046	0.0010737	0.0250528	-0.8978253	0.0165274	0.0736276	-0.8469838	0.0234498	0.9858171	0.0508415	0.9228098	0.9699202	-1.3896726	-0.3551365	-1.6297661	-0.1658845	-1.5780978	-0.1158697
PWY_5971	0.2851474	0.1681036	0.2736570	-0.7492655	0.0003689	0.0117385	-1.2273875	0.0000420	0.0016315	-0.2711434	0.3530071	0.9858171	0.9562441	0.0215055	0.3516863	-1.1559662	-0.3425647	-1.8028741	-0.6519009	-0.8459800	0.3036931
PWY_6284	-0.0763172	0.6416310	0.7174788	-0.7170244	0.0000211	0.0018502	-1.1472182	0.0000018	0.0002256	-0.2868306	0.2167977	0.9858171	0.8603876	0.0094022	0.2672971	-1.0402209	-0.3938279	-1.6045453	-0.6898911	-0.7436411	0.1699800
PWY_6293	-0.3991880	0.0295207	0.0743327	-0.7979908	0.0000204	0.0018502	-1.2508209	0.0000027	0.0002256	-0.3451607	0.1810402	0.9858171	0.9056603	0.0137467	0.2672971	-1.1569706	-0.4390110	-1.7587819	-0.7428600	-0.8525478	0.1622265
PWY_6629	-0.1491528	0.0501929	0.1140748	0.2410823	0.0017135	0.0374820	0.3128749	0.0039398	0.0492480	0.1692897	0.1150530	0.9858171	-0.1435852	0.3436827	0.5866151	0.0918025	0.3903621	0.1016422	0.5241076	-0.0417044	0.3802838
PWY_7237	-0.2382669	0.0372708	0.0893478	0.3820200	0.0009489	0.0250528	0.5125117	0.0016953	0.0423814	0.2515284	0.1187112	0.9858171	-0.2609833	0.2517765	0.5640539	0.1579766	0.6060635	0.1954874	0.8295360	-0.0651378	0.5681946
PWY_801	-0.4000136	0.0326876	0.0817191	-0.7960755	0.0000309	0.0019655	-1.2545161	0.0000040	0.0002256	-0.3376348	0.2000867	0.9858171	0.9168813	0.0145845	0.2686626	-1.1627289	-0.4294220	-1.7733353	-0.7356969	-0.8558680	0.1805984
PWY_8178	-0.0688124	0.4122732	0.5197219	0.3181576	0.0002039	0.0079305	0.3622281	0.0026075	0.0446191	0.2740872	0.0217826	0.9858171	-0.0881408	0.5992903	0.7464470	0.1528518	0.4834635	0.1283182	0.5961380	0.0404415	0.5077329
PWY_8188	-1.1733941	0.0000000	0.0000000	-0.5452806	0.0023327	0.0388779	-1.1841185	0.0000045	0.0002256	0.0935574	0.7077896	0.9858171	1.2776759	0.0003881	0.0445874	-0.8933922	-0.1971690	-1.6767008	-0.6915363	-0.3984685	0.5855832
PWY_8189	-1.1733941	0.0000000	0.0000000	-0.5452806	0.0023327	0.0388779	-1.1841185	0.0000045	0.0002256	0.0935574	0.7077896	0.9858171	1.2776759	0.0003881	0.0445874	-0.8933922	-0.1971690	-1.6767008	-0.6915363	-0.3984685	0.5855832
PWY0_1586	0.0341764	0.6869672	0.7608813	0.2681826	0.0018360	0.0377992	0.3571433	0.0033129	0.0456284	0.1792219	0.1361326	0.9858171	-0.1779213	0.2949060	0.5689951	0.1010066	0.4353587	0.1205871	0.5936995	-0.0570671	0.4155109
PWY0_42	0.3942904	0.0154267	0.0440852	-0.5433769	0.0009276	0.0250528	-0.4880001	0.0337446	0.1072755	-0.5987536	0.0093549	0.9858171	-0.1107535	0.7314183	0.8512073	-0.8614198	-0.2253340	-0.9380348	-0.0379654	-1.0482800	-0.1492273
TRPSYN_PWY	-0.2950693	0.0005635	0.0028177	0.3120611	0.0002732	0.0095630	0.3618949	0.0026771	0.0446191	0.2622274	0.0283480	0.9858171	-0.0996675	0.5531917	0.7251576	0.1464543	0.4776680	0.1275591	0.5962306	0.0281564	0.4962985

Open code


if(file.exists('gitignore/lm_results/result_pathways_filt.csv') == FALSE){
  write.table(result_pathways, 
              'gitignore/lm_results/result_pathways_filt.csv', 
              row.names = FALSE)
  }

5 Elastic net

To assess the predictive power of pathways features on diet strategy, we employed Elastic Net logistic regression.

As we expected very high level of co-linearity, we allowed \(alpha\) to rather small (0, 0.2 or 0.4).

The performance of the predictive models was evaluated through their capacity of discriminate between vegan and omnivore diets, using out-of-sample area under ROC curve (AUC; estimated with out-of-bag bootstrap) as the measure of discriminatory capacity.

All features were transformed by 2 standard deviations (resulting in standard deviation of 0.5)

5.1 Prepare data for `glmnet`

Open code


data_pathways_glmnet <- data_pathways_original_clr %>%
  dplyr::mutate(
    vegan = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 1, 0
      )
    )
  ) %>%
  dplyr::mutate(
    dplyr::across(all_of(features), ~ arm::rescale(.))
  ) %>%
  dplyr::select(
    ID, vegan, Country, all_of(features)
  )

5.2 Fit model

Open code


modelac <- "elanet_pathways_filt"

assign(
  modelac,
  run(
    expr = clust_glmnet_sep(
      data = data_pathways_glmnet,
      outcome = "vegan",
      clust_id = "ID",
      sample_method = "oos_boot",
      N = 500,
      alphas = c(0, 0.2, 0.4),
      family = "binomial",
      seed = 478
    ),
    path = paste0("gitignore/run/", modelac)
  )
)

5.3 See results

5.3.1 Model summary

Open code

elanet_pathways_filt$model_summary
##   alpha     lambda       auc auc_OutOfSample auc_oos_CIL auc_oos_CIU accuracy
## 1   0.4 0.04303069 0.9412819       0.7897421   0.6777675   0.8811993 0.873494
##   accuracy_OutOfSample accuracy_oos_CIL accuracy_oos_CIU
## 1            0.7252609        0.6105935        0.8321154
elanet_pathways_filt$country_AUC
##   auc_OutOfSample_IT auc_oos_CIL_IT auc_oos_CIU_IT auc_OutOfSample_CZ
## 1          0.6698679      0.4822149      0.8389606          0.8739966
##   auc_oos_CIL_CZ auc_oos_CIU_CZ
## 1      0.7339286      0.9936312

5.3.2 ROC curve - internal validation

Open code

elanet_pathways_filt$plot

Receiver operating characteristic (ROC) curves from a random subset of out-of-bag bootstrap iterations, illustrating the model’s ability to discriminate between vegan and omnivore status based on standardized CLR-transformed pathways reads. Each curve corresponds to an elastic net model trained on a bootstrap resample and evaluated on the subjects not included in that iteration (i.e., out-of-bag samples). The curve plots the true positive rate (sensitivity) against the false positive rate (1 – specificity) across thresholds of predicted vegan probability. The area under the curve (AUC) quantifies overall discriminatory performance, with values closer to 1 indicating stronger separation between groups.

5.3.3 Estimated coefficients

Open code

tr <- data.frame(
  label = row.names(
    elanet_pathways_filt$betas
    )[
      which(
        abs(
          elanet_pathways_filt$betas
          )>0
        )
      ],
  beta = elanet_pathways_filt$betas[
    abs(
      elanet_pathways_filt$betas
      )>0
    ]
  )[-1, ]

tr$pathway <- attr(data_path_originalCZ, 
                   "description")[
                     colnames(data_path_originalCZ) %in% tr$label]

kableExtra::kable(tr %>% select(label, pathway, beta))

	label	pathway	beta
2	CENTFERM_PWY	pyruvate fermentation to butanoate	-0.1635717
3	FUCCAT_PWY	fucose degradation	0.2839196
4	GALACTITOLCAT_PWY	galactitol degradation	-0.1996850
5	GLUCARDEG_PWY	D-glucarate degradation I	0.2602967
6	GOLPDLCAT_PWY	superpathway of glycerol degradation to 1,3-propanediol	0.1236880
7	HEXITOLDEGSUPER_PWY	superpathway of hexitol degradation (bacteria)	-0.2176896
8	HSERMETANA_PWY	L-methionine biosynthesis III	0.1004888
9	LACTOSECAT_PWY	lactose and galactose degradation I	-0.0286681
10	NONOXIPENT_PWY	pentose phosphate pathway (non-oxidative branch) I	0.3841247
11	PPGPPMET_PWY	ppGpp metabolism	0.2606652
12	PROPFERM_PWY	superpathway of L-alanine fermentation (Stickland reaction)	-0.1535219
13	PWY_5104	L-isoleucine biosynthesis IV	-0.2145823
14	PWY_5136	fatty acid β-oxidation II (plant peroxisome)	0.1595753
15	PWY_5345	superpathway of L-methionine biosynthesis (by sulfhydrylation)	0.0000036
16	PWY_5367	petroselinate biosynthesis	-0.3153989
17	PWY_5494	pyruvate fermentation to propanoate II (acrylate pathway)	-0.0597671
18	PWY_5747	2-methylcitrate cycle II	-0.0426507
19	PWY_622	starch biosynthesis	-0.0278495
20	PWY_6284	superpathway of unsaturated fatty acids biosynthesis (E. coli)	-0.1552727
21	PWY_6293	superpathway of L-cysteine biosynthesis (fungi)	-0.2893336
22	PWY_6470	peptidoglycan biosynthesis V (β-lactam resistance)	-0.4052525
23	PWY_6572	chondroitin sulfate degradation I (bacterial)	0.1353945
24	PWY_6590	superpathway of Clostridium acetobutylicum acidogenic fermentation	-0.1394431
25	PWY_6807	xyloglucan degradation II (exoglucanase)	0.1192990
26	PWY_6892	thiazole component of thiamine diphosphate biosynthesis I	0.0488946
27	PWY_6895	superpathway of thiamine diphosphate biosynthesis II	-0.2609380
28	PWY_6936	seleno-amino acid biosynthesis (plants)	0.2567455
29	PWY_6992	1,5-anhydrofructose degradation	-0.0422256
30	PWY_7111	pyruvate fermentation to isobutanol (engineered)	-0.0626192
31	PWY_7185	UTP and CTP dephosphorylation I	-0.1626717
32	PWY_7210	pyrimidine deoxyribonucleotides biosynthesis from CTP	-0.1586225
33	PWY_7234	inosine-5’-phosphate biosynthesis III	-0.0885173
34	PWY_7237	myo-, chiro- and scyllo-inositol degradation	0.1260601
35	PWY_7312	dTDP-β-D-fucofuranose biosynthesis	-0.1364504
36	PWY_7400	L-arginine biosynthesis IV (archaebacteria)	-0.0556371
37	PWY_7858	(5Z)-dodecenoate biosynthesis II	0.1653035
38	PWY_801	homocysteine and cysteine interconversion	-0.2465140
39	PWY_8178	pentose phosphate pathway (non-oxidative branch) II	0.2210654
40	PWY_8188	L-alanine degradation VI (reductive Stickland reaction)	-0.1546633
41	PWY_8189	L-alanine degradation V (oxidative Stickland reaction)	-0.1541587
42	PWY0_1477	ethanolamine utilization	-0.0720184
43	PWY0_42	2-methylcitrate cycle I	-0.0442314
44	PWY66_389	phytol degradation	0.0854299
45	PWY66_391	fatty acid β-oxidation VI (mammalian peroxisome)	-0.0083984
46	REDCITCYC	TCA cycle VI (Helicobacter)	0.0710738
47	SO4ASSIM_PWY	assimilatory sulfate reduction I	0.0312470
48	SULFATE_CYS_PWY	superpathway of sulfate assimilation and cysteine biosynthesis	0.0208325
49	TRPSYN_PWY	L-tryptophan biosynthesis	0.1046697
50	UDPNAGSYN_PWY	UDP-N-acetyl-D-glucosamine biosynthesis I	-0.0136180

5.3.4 Plot of coefficients

Open code

elacoef <- data.frame(
  pathway = row.names(elanet_pathways_filt$betas),
  beta_ela = elanet_pathways_filt$betas[, 1]
) %>%
  arrange(abs(beta_ela)) %>%
  filter(abs(beta_ela) > 0,
         !grepl('Intercept', pathway)) %>%
  mutate(pathway = factor(pathway, levels = pathway)) 


plotac <- "elanet_beta_pathway"
path <- "gitignore/figures"

assign(plotac, 
  ggplot(elacoef,
    aes(
      x = pathway,
      y = beta_ela
    )
  ) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  labs(
    y = "Standardized beta coefficients",
    x = "pathway"
  ) +
  theme_minimal() +
  coord_flip() + 
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {
    ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 7,
    height = 12
  )
}

get(plotac)

Regression coefficients from the elastic net model predicting vegan diet strategy based on clr-transformed and standardized pathways. Pathways are ordered by the magnitude of the standardized coefficients, indicating their relative importance in distinguishing between the diet groups. The sign of each coefficient indicates the direction of association with vegan diet status, with positive values indicating a higher probability of vegan status and vice versa. Pathways whose effects were shrunk to zero are not shown.

6 External validation

External validation was performed with an independent Czech cohort.

As a first step, we will use the previously developed and internally validated elastic net model to predict vegan status in the independent Czech cohort. The validation data will be standardized using the mean and standard deviation of each pathway level from the training cohort to ensure comparability across datasets. For each subject in the external validation cohort, we will estimate the predicted probability of being vegan using the elastic net model. This predicted probability will then be used as a variable to discriminate between the diet groups in the independent cohort.

In a 2nd step, we will look at pathways that significantly differed between diet groups (average vegan diet effect across both countries, FDR < 0.05) estimated with linear models (one per pathway) with training cohort. Then we will fit linear models also for external validation cohort. Effect of vegan diet on these pathways will be shown along with 95% confidence interval for all cohorts: training Czech and Italian cohorts, but also in Czech independent (validating) cohort

6.1 Diet discrimination (elastic net)

6.1.1 Get table of weights, means and SDs

Open code


coefs_pathways <- get_coef(
  original_data = data_analysis_pathways,
  glmnet_model = elanet_pathways_filt)

6.1.2 Identify shared and missing predictors

Open code

## Which are common with the validations et
common_predictors <- intersect(coefs_pathways$predictor, colnames(data_pathways_validation_clr))

6.1.3 Standardize data in validation set

Open code

data_pathways_validation_pred <- data_pathways_validation_clr %>%
  dplyr::mutate(
    vegan = if_else(
      Diet == 'VEGAN', 1, 0
    )
  ) %>%
  dplyr::select(
    vegan,
    dplyr::all_of(common_predictors)
  ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      - coefs_pathways$mean[
        match(
          cur_column(), 
          coefs_pathways$predictor
          )
        ]
      )
    ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      / coefs_pathways$SD[
        match(
          cur_column(), 
          coefs_pathways$predictor
          )
        ]
      )
    )

6.1.4 Get predicted value

Open code

elanet_pathways_filt$fit
## 
## Call:  glmnet::glmnet(x = original_predictors, y = original_outcome,      family = family, alpha = optim_par$alpha[1], lambda = optim_par$lamb_1se[1],      standardize = standardize) 
## 
##   Df  %Dev  Lambda
## 1 49 37.63 0.04303
newx <- as.matrix(data_pathways_validation_pred[,-1])

tr <- data_pathways_validation_pred %>% 
  dplyr::mutate(
    predicted_logit = as.numeric(
      predict(
        elanet_pathways_filt$fit, 
        newx = newx
        )
      )
    ) %>% 
  dplyr::mutate(
    predicted = inv_logit(predicted_logit)
  )

6.2 Result of external validation

6.2.1 ROC curve

Open code

roc_pathway <- pROC::roc(
  vegan ~ predicted_logit,
  data = tr,
  direction = "<",
  levels = c(0, 1),
  ci = TRUE
)

plotac <- "roc_pathway_pl"
path <- "gitignore/figures"

assign(plotac, ggroc(roc_pathway))
get(plotac)

Receiver operating characteristics (ROC) curve showing the model’s ability to discriminate between vegan and omnivore status in the external validation Czech cohort. The curve plots the true positive rate (sensitivity) against the true positive rate (specificity) at various thresholds of predicted vegan status, as estimated from the elastic net model developed on the training data. The area under the curve (AUC) represents the model’s overall performance, with values closer to 1 indicating stronger discrimination.

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 6,
    height = 4.5
  )
}

6.2.2 Table

Open code

mod <- elanet_pathways_filt

trainAUC <- mean(mod[["valid_performances"]]$auc_resamp_test)
trainCI <- quantile(mod[["valid_performances"]]$auc_resamp_test,
  probs = c(1 / 40, 39 / 40)
)

res <- data.frame(
  alpha = c(mod$model_summary$alpha, rep("", 4)),
  lambda = c(round(mod$model_summary$lambda, 4), rep("", 4)),
  `performance type` = c(
    "Training set AUC",
    "Out-of-sample AUC (all)",
    "Out-of-sample AUC (Czech)",
    "Out-of-sample AUC (Italy)",
    "External validation AUC"
  ),
  `performance [95% CI]` = c(
    sprintf("%.3f [%.3f to %.3f]", trainAUC, trainCI[1], trainCI[2]),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$model_summary$auc_OutOfSample,
      mod$model_summary$auc_oos_CIL,
      mod$model_summary$auc_oos_CIU
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$country_AUC$auc_OutOfSample_CZ,
      mod$country_AUC$auc_oos_CIL_CZ,
      mod$country_AUC$auc_oos_CIU_CZ
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$country_AUC$auc_OutOfSample_IT,
      mod$country_AUC$auc_oos_CIL_IT,
      mod$country_AUC$auc_oos_CIU_IT
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      roc_pathway[["ci"]][2],
      roc_pathway[["ci"]][1],
      roc_pathway[["ci"]][3]
    )
  )
)


kableExtra::kable(
  res %>% mutate(across(where(is.numeric), ~ round(.x, 3))),
  caption = "Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using CLR-transformed pathways based on bacteria reads. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized `alpha` (mixing parameter) and `lambda` (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method."
)

Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using CLR-transformed pathways based on bacteria reads. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized `alpha` (mixing parameter) and `lambda` (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method.
alpha	lambda	performance.type	performance..95..CI.
0.4	0.043	Training set AUC	0.994 [0.975 to 1.000]
		Out-of-sample AUC (all)	0.790 [0.678 to 0.881]
		Out-of-sample AUC (Czech)	0.874 [0.734 to 0.994]
		Out-of-sample AUC (Italy)	0.670 [0.482 to 0.839]
		External validation AUC	0.774 [0.684 to 0.864]

6.3 Diet effect across datasets (forest plot)

Similarly as in training data cohorts, we will fit linear model per each of the selected pathway level with a single fixed effect factor of diet.

6.3.1 Linear model in validation cohort

Open code

data_analysis_pathways <- data_pathways_validation_clr %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        Diet == 'VEGAN', 1, 0
      )
    )  ) %>%
  dplyr::select(
    Diet_VEGAN,
    dplyr::everything()
  )

summary(data_analysis_pathways[, 1:12])
##    Diet_VEGAN          ID                Diet             Country         
##  Min.   :0.0000   Length:101         Length:101         Length:101        
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :1.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.5743                                                           
##  3rd Qu.:1.0000                                                           
##  Max.   :1.0000                                                           
##      Data            X1CMET2_PWY    ALLANTOINDEG_PWY  ANAEROFRUCAT_PWY
##  Length:101         Min.   :1.820   Min.   :-4.1711   Min.   :1.139   
##  Class :character   1st Qu.:2.582   1st Qu.:-3.5853   1st Qu.:1.598   
##  Mode  :character   Median :2.730   Median :-3.3514   Median :1.842   
##                     Mean   :2.695   Mean   :-2.8111   Mean   :1.825   
##                     3rd Qu.:2.847   3rd Qu.:-2.0140   3rd Qu.:2.029   
##                     Max.   :3.105   Max.   :-0.5367   Max.   :2.748   
##  ANAGLYCOLYSIS_PWY ARG.POLYAMINE_SYN ARGININE_SYN4_PWY   ARGSYN_PWY   
##  Min.   :1.990     Min.   :-1.4687   Min.   :-1.4437   Min.   :2.000  
##  1st Qu.:2.777     1st Qu.: 0.2802   1st Qu.: 0.5786   1st Qu.:2.752  
##  Median :2.942     Median : 0.5856   Median : 0.9553   Median :2.907  
##  Mean   :2.913     Mean   : 0.5673   Mean   : 0.9585   Mean   :2.854  
##  3rd Qu.:3.086     3rd Qu.: 0.8994   3rd Qu.: 1.4342   3rd Qu.:3.001  
##  Max.   :3.541     Max.   : 1.7341   Max.   : 2.1724   Max.   :3.400

6.3.1.1 Define number of pathways and covariates

Open code

n_covarites <- 5
n_features <- ncol(data_analysis_pathways) - n_covarites

6.3.1.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGdiet <- vector('double', n_features)
P_VGdiet <- vector('double', n_features)
CI_L_VGdiet <- vector('double', n_features)
CI_U_VGdiet <- vector('double', n_features)

6.3.1.3 Linear models per outcome

Open code

for (i in 1:n_features) {
  ## define variable
  data_analysis_pathways$outcome <- data_analysis_pathways[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_VEGAN, data = data_analysis_pathways)

  ## save results
  outcome[i] <- names(data_analysis_pathways)[i + n_covarites]

  ## extract diet effect
  tr <- confint(model)

  CI_L_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][1]
  CI_U_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][2]

  est_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
}

6.3.1.4 Results table

Open code

## relevant pathways
diet_sensitive_pathways <- result_pathways %>%
  filter(
    fdr_VGdiet_avg < 0.05
  ) %>%
  select(
    outcome
  )

result_pathways_val <- data.frame(
  outcome,
  est_VGdiet, P_VGdiet,
  CI_L_VGdiet, CI_U_VGdiet
) %>% 
  filter(outcome %in% diet_sensitive_pathways$outcome)

tr <- attr(
  data_path_originalCZ,
  "description")[colnames(data_path_originalCZ) %in% result_pathways_val$outcome]

val_res <- cbind(pathway = tr, result_pathways_val) %>% 
  arrange(desc(est_VGdiet))

kableExtra::kable(val_res,
                  caption = 'Results of linear models estimating the effect of diet on the CLR-transformed proportion of a given functional pathway as the outcome. Only pathways whose proportion significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `est` represents the estimated effects (regression coefficient), indicating how much the CLR-transformed proportions differ between vegans and omnivores. `P`: p-value, `fdr`: p-value adjusted for multiple comparisons, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model')

Results of linear models estimating the effect of diet on the CLR-transformed proportion of a given functional pathway as the outcome. Only pathways whose proportion significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `est` represents the estimated effects (regression coefficient), indicating how much the CLR-transformed proportions differ between vegans and omnivores. `P`: p-value, `fdr`: p-value adjusted for multiple comparisons, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model
pathway	outcome	est_VGdiet	P_VGdiet	CI_L_VGdiet	CI_U_VGdiet
peptidoglycan maturation (meso-diaminopimelate containing)	PWY0_1586	0.1535083	0.0140132	0.0317355	0.2752811
pentose phosphate pathway (non-oxidative branch) I	NONOXIPENT_PWY	0.1334798	0.0209651	0.0205883	0.2463713
L-tryptophan biosynthesis	TRPSYN_PWY	0.1327050	0.0186200	0.0226470	0.2427630
myo-, chiro- and scyllo-inositol degradation	PWY_7237	0.1258992	0.1422591	-0.0429823	0.2947807
pentose phosphate pathway (non-oxidative branch) II	PWY_8178	0.1132709	0.0398410	0.0053646	0.2211772
superpathway of L-tryptophan biosynthesis	PWY_6629	0.0985685	0.0431588	0.0030925	0.1940445
L-methionine biosynthesis III	HSERMETANA_PWY	0.0880344	0.0798108	-0.0106562	0.1867249
Calvin-Benson-Bassham cycle	CALVIN_PWY	0.0596844	0.2390520	-0.0402962	0.1596651
galactitol degradation	GALACTITOLCAT_PWY	0.0535177	0.8777221	-0.6348888	0.7419241
superpathway of hexitol degradation (bacteria)	HEXITOLDEGSUPER_PWY	-0.0106956	0.9728494	-0.6326706	0.6112795
2-methylcitrate cycle II	PWY_5747	-0.0366123	0.7913669	-0.3104904	0.2372658
palmitate biosynthesis (type II fatty acid synthase)	PWY_5971	-0.0392706	0.7701391	-0.3052183	0.2266771
2-methylcitrate cycle I	PWY0_42	-0.0517980	0.7101465	-0.3275435	0.2239474
CMP-3-deoxy-D-manno-octulosonate biosynthesis	PWY_1269	-0.0634412	0.6797402	-0.3674721	0.2405898
L-isoleucine biosynthesis IV	PWY_5104	-0.2815587	0.1150244	-0.6329268	0.0698094
L-alanine degradation V (oxidative Stickland reaction)	PWY_8189	-0.4025018	0.0014258	-0.6458656	-0.1591379
L-alanine degradation VI (reductive Stickland reaction)	PWY_8188	-0.4025018	0.0014258	-0.6458656	-0.1591379
superpathway of L-alanine fermentation (Stickland reaction)	PROPFERM_PWY	-0.4025018	0.0014258	-0.6458656	-0.1591379
superpathway of L-cysteine biosynthesis (fungi)	PWY_6293	-0.4259077	0.0020150	-0.6923133	-0.1595022
homocysteine and cysteine interconversion	PWY_801	-0.4273921	0.0020285	-0.6949062	-0.1598780
superpathway of unsaturated fatty acids biosynthesis (E. coli)	PWY_6284	-0.9274652	0.0000322	-1.3497696	-0.5051608
petroselinate biosynthesis	PWY_5367	-1.3126488	0.0000030	-1.8382687	-0.7870289

Open code


if(file.exists('gitignore/lm_results/result_pathways_validation_filt.csv') == FALSE){
  write.table(val_res,
              'gitignore/lm_results/result_pathways_validation_filt.csv', 
              row.names = FALSE)
  }

6.3.2 Forest plot

6.3.2.1 Data preparation

Open code


len <- nrow(diet_sensitive_pathways)

## subset result tables
result_pathways_subset <- result_pathways %>%
  filter(outcome %in% diet_sensitive_pathways$outcome)

result_pathways_val_subset <- result_pathways_val %>%
  filter(outcome %in% diet_sensitive_pathways$outcome)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_pathways$outcome, 3),
  beta = c(
    result_pathways_subset$est_VGdiet_inCZ,
    result_pathways_subset$est_VGdiet_inIT,
    result_pathways_val_subset$est_VGdiet
  ),
  lower = c(
    result_pathways_subset$CI_L_VGdiet_inCZ,
    result_pathways_subset$CI_L_VGdiet_inIT,
    result_pathways_val_subset$CI_L_VGdiet
  ),
  upper = c(
    result_pathways_subset$CI_U_VGdiet_inCZ,
    result_pathways_subset$CI_U_VGdiet_inIT,
    result_pathways_val_subset$CI_U_VGdiet
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)


validation_order <- data_forest %>%
  left_join(
    val_res %>% select(outcome, pathway),
    by = 'outcome'
  ) %>% 
group_by(outcome) %>%
  summarise(beta_mean = mean(beta), .groups = "drop",
            pathway  = first(pathway)) %>%
  arrange(beta_mean) %>%
  pull(pathway)


up_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(
    elacoef %>% mutate(outcome = pathway) %>% select(-pathway), 
    by = 'outcome') %>% 
  filter(beta_CZ > 0,
         beta_IT > 0,
         lower_Validation > 0,
         beta_ela > 0.1) %>% 
  select(outcome)

down_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(elacoef %>% mutate(outcome = pathway) %>% select(-pathway),
            by = 'outcome') %>% 
  filter(beta_CZ < 0,
         beta_IT < 0,
         upper_Validation < 0,
         beta_ela < -0.1) %>% 
  select(outcome)

winners <- as.character(c(up_winners$outcome, down_winners$outcome))

 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
  left_join(
    val_res %>% select(outcome, pathway),
    by = 'outcome'
  ) %>% 
  left_join(
    elacoef %>% mutate(outcome = pathway) %>% select(-pathway), 
    by = 'outcome') %>%  
   mutate(pathway = factor(pathway, levels = validation_order))

6.3.2.2 Plotting

Open code

plotac <- "forest_pathway"
path <- "gitignore/figures"

colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

assign(
  plotac,
  ggplot(data_forest, 
         aes(x = pathway, 
             y = beta, 
             ymin = lower,
             ymax = upper, 
             color = dataset)) +
    geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
    geom_hline(yintercept = 0, color = "black") +
    geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
    scale_color_manual(values = colors) +
    labs(
      y = "Effect of vegan diet on clr-trasformed pathway",
      x = "Outcome",
      color = "Dataset"
    ) +
    theme_minimal() +
    coord_flip() +
    scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$pathway, "**"), 
             as.character(data_forest$pathway)
      ), data_forest$pathway
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

The forest plot illustrates the effects of a vegan diet on the CLR-transformed relative level of selected metabolic pathways, along with their 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech cohort (validation). Green, blue, and grey points/lines represent differences in CLR-transformed pathway proportions between vegans and omnivores within the Italian cohort, Czech cohort, and Czech validation cohort, respectively. Positive values suggest a higher relative pathway level in vegans compared to omnivores. Only pathways that showed significant differences between vegan and omnivorous diets (as an average effect across both training cohorts) were selected, and these effects were further validated in the independent cohort. The estimates for the training cohorts were obtained from a single linear model that included `Diet`, `Country`, and the interaction term `Diet:Country` as predictors. In the independent Czech validation cohort, `Diet` was the only fixed-effect predictor. Patways validated in the linear model and showing predictive power in the elastic net model (|β| > 0.1) are **bold**

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 10,
    height = 12
  )
}

6.3.3 Boxplot

Open code

plotac <- "boxplot_pathway"
path <- "gitignore/figures"

colo <- c("#F9FFAF", "#329243")

boxplot_cond <- function(variable) {
  p <- ggboxplot(data_merged,
    x = "Diet",
    y = variable,
    fill = "Diet",
    tip.length = 0.15,
    palette = colo,
    outlier.shape = 1,
    lwd = 0.25,
    outlier.size = 0.8,
    facet.by = "Data",
    title = variable,
    ylab = "CLR(pathway proportion)"
  ) +

    theme(
      plot.title = element_text(size = 10),
      axis.title = element_text(size = 8),
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    )

  return(p)
}

# Plot all outcomes
plots <- map(diet_sensitive_pathways$outcome, boxplot_cond)

# Create a matrix of plots
plots_arranged <- ggarrange(plotlist = plots, ncol = 4, nrow = 6, 
                            common.legend = TRUE)
assign(plotac, plots_arranged)

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 9,
    height = 13
  )
}

get(plotac)

7 Linear model VG duration

Next, we fit another series of linear models, this time modelling clr-transformed functional pathways (inferred from shotgun metagenomic sequencing) using the following fixed-effect predictors: duration of vegan status (Diet_duration, scaled in tens of years), Country, their interaction (Diet_duration × Country), and Age:

\[ \text{CLR(pathway proportion)} = \alpha + \beta_{1} \times \text{Country} + \beta_{2} \times \text{Diet duration} + \beta_{3} \times (\text{Country}:\text{Diet duration}) + \beta_{4} \times \text{Age} + \epsilon \]

This analysis includes only vegan participants, while omnivores are excluded. The aim was to test whether functional pathways that differ between vegans and omnivores also vary within the vegan group itself, depending on how long participants have been vegan. In other words, we asked whether long-term vegans show stronger up- or down-regulation of diet-sensitive pathways compared to those who adopted the diet more recently.

Because longer vegan duration is likely correlated with age (e.g. a 20-year-old cannot have 20 years of vegan history, unlike a 40- or 60-year-old), we also adjusted for age in the models.

7.1 Get data

7.1.1 Training

Open code

meta_trainIT <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 1)
meta_trainCZ <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 2) %>% 
  dplyr::mutate(ID = paste0('T', ID),
                Sex = SEX) %>% 
  dplyr::select(-SEX)

meta_trainIT[1:5,]
##       ID   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F
meta_trainCZ[1:5,]
##     ID   COHORT GRP Diet_duration      Age Sex
## 1  T97 CZ_train  OM             0 26.76438   M
## 2  T98 CZ_train  OM             0 27.61370   F
## 3 T134 CZ_train  OM             0 21.64384   F
## 4 T136 CZ_train  OM             0 31.25479   M
## 5 T137 CZ_train  OM             0 40.42192   F
data_meta_original <- bind_rows(meta_trainIT, meta_trainCZ) %>% 
  rename(`Sample` = `ID`)
  
data_meta_original[1:5,]
##   Sample   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F

data_pathways_original2 <- data_pathways_original_clr %>% 
  rename(`Sample` = `ID`) %>% 
  left_join(data_meta_original, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:Sex, everything())

data_pathways_original2 %>% dim()
## [1] 166 359

data_pathways_original2[
  1:4, 
  (ncol(data_pathways_original2)-10):ncol(data_pathways_original2)
  ]
##   SO4ASSIM_PWY SULFATE_CYS_PWY TCA_GLYOX_BYPASS        TCA THISYN_PWY
## 1   -0.1880565       0.4369528       -0.2921062  0.1011286   1.643560
## 2   -0.4462730       0.1693518       -0.5477345  0.1521391   1.478927
## 3   -2.4927554      -1.6876045       -2.1987741 -2.0383007   2.849858
## 4   -2.2425268      -1.4401533       -2.7500641 -2.5363305   2.799085
##   THISYNARA_PWY THRESYN_PWY TRNA_CHARGING_PWY TRPSYN_PWY UDPNAGSYN_PWY
## 1      1.438471    1.880293          2.420052   1.424845      1.358719
## 2      0.884228    1.814359          2.601992   1.624480      1.603390
## 3      2.317259    3.015584          3.856117   1.818168      2.435054
## 4      2.327101    3.044502          3.956278   2.064342      3.331618
##   VALSYN_PWY
## 1   2.479489
## 2   2.416387
## 3   3.539453
## 4   3.409071

data_pathways_original2[1:4, 1:10]
##   Sample Diet   COHORT GRP Diet_duration  Age Sex Country  Data X1CMET2_PWY
## 1 VOV002 OMNI IT_train  OM             0 61.4   F      IT IT_tr    2.186503
## 2 VOV003 OMNI IT_train  OM             0 43.7   F      IT IT_tr    2.008170
## 3 VOV004 OMNI IT_train  OM             0 61.1   F      IT IT_tr    3.418194
## 4 VOV006 OMNI IT_train  OM             0 31.7   F      IT IT_tr    3.174203

7.1.2 Validation

Open code

data_meta_valid <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 3) %>% 
  rename(`Sample` = `ID`)

data_pathways_valid2 <- data_pathways_validation_clr %>% 
  rename(`Sample` = `ID`) %>% 
  left_join(data_meta_valid, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:SEX, everything())

data_pathways_valid2 %>% dim()
## [1] 101 359

data_pathways_valid2[
  1:4, 
  (ncol(data_pathways_valid2)-10):ncol(data_pathways_valid2)
  ]
##   SO4ASSIM_PWY SULFATE_CYS_PWY TCA_GLYOX_BYPASS       TCA THISYN_PWY
## 1   -0.8517185     -0.08304722        -3.365452 -3.033998   2.165097
## 2   -0.1495979      0.56899174        -3.314443 -2.782678   2.088872
## 3   -2.2760134     -1.47638331        -3.369634 -3.061441   2.328548
## 4   -3.5178606     -2.70918461        -3.528815 -3.179935   2.517770
##   THISYNARA_PWY THRESYN_PWY TRNA_CHARGING_PWY TRPSYN_PWY UDPNAGSYN_PWY
## 1      2.410630    2.475191          3.147883   2.856516      2.047496
## 2      1.721339    2.516181          2.951190   2.587668      2.403907
## 3      2.304818    2.513833          3.060194   2.777382      2.104066
## 4      2.168090    2.770383          3.276645   3.078048      2.792120
##   VALSYN_PWY
## 1   3.237373
## 2   3.172197
## 3   3.174123
## 4   3.403835

data_pathways_valid2[1:4, 1:10]
##   Sample  Diet COHORT GRP Diet_duration      Age SEX Country  Data X1CMET2_PWY
## 1     K4 VEGAN CZ_val  VN          <NA> 30.98904   M      CZ valid    2.830535
## 2     K5 VEGAN CZ_val  VN          <NA> 30.66575   F      CZ valid    2.582339
## 3     K7 VEGAN CZ_val  VN          <NA> 32.97808   F      CZ valid    2.825255
## 4    K12 VEGAN CZ_val  VN          <NA> 27.89315   M      CZ valid    3.022772

data_pathways_valid2 %>% select(Sample, Diet, GRP)
##     Sample  Diet GRP
## 1       K4 VEGAN  VN
## 2       K5 VEGAN  VN
## 3       K7 VEGAN  VN
## 4      K12 VEGAN  VN
## 5      K13 VEGAN  VN
## 6      K15 VEGAN  VN
## 7      K16 VEGAN  VN
## 8      K18 VEGAN  VN
## 9      K19 VEGAN  VN
## 10     K25 VEGAN  VN
## 11     K26 VEGAN  VN
## 12     K31 VEGAN  VN
## 13     K32 VEGAN  VN
## 14     K34 VEGAN  VN
## 15     K35 VEGAN  VN
## 16     K38 VEGAN  VN
## 17     K39 VEGAN  VN
## 18     K42  OMNI  OM
## 19     K43  OMNI  OM
## 20     K45 VEGAN  VN
## 21     K46 VEGAN  VN
## 22     K48 VEGAN  VN
## 23     K49 VEGAN  VN
## 24     K55 VEGAN  VN
## 25     K56 VEGAN  VN
## 26     K61 VEGAN  VN
## 27     K62 VEGAN  VN
## 28     K64 VEGAN  VN
## 29     K65 VEGAN  VN
## 30     K73 VEGAN  VN
## 31     K74 VEGAN  VN
## 32     K82 VEGAN  VN
## 33     K85 VEGAN  VN
## 34    K103 VEGAN  VN
## 35    K106 VEGAN  VN
## 36    K107 VEGAN  VN
## 37    K114 VEGAN  VN
## 38    K117 VEGAN  VN
## 39    K119 VEGAN  VN
## 40    K120 VEGAN  VN
## 41    K123  OMNI  OM
## 42    K124  OMNI  OM
## 43    K126  OMNI  OM
## 44    K127  OMNI  OM
## 45    K130  OMNI  OM
## 46    K131  OMNI  OM
## 47    K143  OMNI  OM
## 48    K150  OMNI  OM
## 49    K151  OMNI  OM
## 50    K154  OMNI  OM
## 51    K155  OMNI  OM
## 52    K175  OMNI  OM
## 53    K176  OMNI  OM
## 54    K178  OMNI  OM
## 55    K179  OMNI  OM
## 56    K182  OMNI  OM
## 57    K183  OMNI  OM
## 58    K198  OMNI  OM
## 59    K199  OMNI  OM
## 60    K201  OMNI  OM
## 61    K202  OMNI  OM
## 62    K205 VEGAN  VN
## 63    K206 VEGAN  VN
## 64    K209  OMNI  OM
## 65    K210  OMNI  OM
## 66    K216 VEGAN  VN
## 67    K217 VEGAN  VN
## 68    K219 VEGAN  VN
## 69    K220 VEGAN  VN
## 70    K222  OMNI  OM
## 71    K223  OMNI  OM
## 72    K226 VEGAN  VN
## 73    K227 VEGAN  VN
## 74    K230  OMNI  OM
## 75    K231  OMNI  OM
## 76    K234  OMNI  OM
## 77    K235  OMNI  OM
## 78    K238  OMNI  OM
## 79    K239  OMNI  OM
## 80    K250 VEGAN  VN
## 81    K251 VEGAN  VN
## 82    K253  OMNI  OM
## 83    K254  OMNI  OM
## 84    K258 VEGAN  VN
## 85    K260 VEGAN  VN
## 86    K261 VEGAN  VN
## 87    K263 VEGAN  VN
## 88    K264 VEGAN  VN
## 89    K266  OMNI  OM
## 90    K267  OMNI  OM
## 91    K270  OMNI  OM
## 92    K281 VEGAN  VN
## 93    K285 VEGAN  VN
## 94    K289 VEGAN  VN
## 95    K291 VEGAN  VN
## 96    K292 VEGAN  VN
## 97    K298  OMNI  OM
## 98    K299  OMNI  OM
## 99    K318  OMNI  OM
## 100   K329  OMNI  OM
## 101   K330  OMNI  OM

data_pathways_valid2 %>% select(Sample, Diet, GRP, Diet_duration) %>% 
  filter(
    (Diet == 'VEGAN' & GRP == 'OM') | (Diet == 'OMNI' & GRP == 'VN')
  )
## [1] Sample        Diet          GRP           Diet_duration
## <0 rows> (or 0-length row.names)

7.2 Training cohort

7.2.1 Select data

Open code

data_analysis <- data_pathways_original2 %>%
    mutate(Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    )
  ) %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>% 
  select(1:8, Country_IT, all_of(diet_sensitive_pathways$outcome))

summary(data_analysis[ , 1:12])
##     Sample              Diet              COHORT              GRP           
##  Length:96          Length:96          Length:96          Length:96         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Diet_duration         Age            Sex              Country         
##  Min.   : 0.600   Min.   :18.25   Length:96          Length:96         
##  1st Qu.: 3.150   1st Qu.:27.57   Class :character   Class :character  
##  Median : 5.000   Median :32.15   Mode  :character   Mode  :character  
##  Mean   : 5.492   Mean   :34.97                                        
##  3rd Qu.: 6.000   3rd Qu.:40.77                                        
##  Max.   :20.000   Max.   :60.70                                        
##    Country_IT        CALVIN_PWY    GALACTITOLCAT_PWY HEXITOLDEGSUPER_PWY
##  Min.   :-0.5000   Min.   :1.496   Min.   :-5.9819   Min.   :-4.7341    
##  1st Qu.:-0.5000   1st Qu.:2.395   1st Qu.:-4.9624   1st Qu.:-3.7146    
##  Median :-0.5000   Median :2.764   Median :-4.1624   Median :-2.9213    
##  Mean   :-0.1042   Mean   :2.698   Mean   :-3.7681   Mean   :-2.6037    
##  3rd Qu.: 0.5000   3rd Qu.:2.928   3rd Qu.:-2.8106   3rd Qu.:-1.6531    
##  Max.   : 0.5000   Max.   :4.289   Max.   : 0.3174   Max.   : 0.8302
data_analysis[1:10, 1:12]
##    Sample  Diet   COHORT GRP Diet_duration  Age Sex Country Country_IT
## 1  VOV010 VEGAN IT_train  VG           2.6 25.7   M      IT        0.5
## 2  VOV012 VEGAN IT_train  VG           3.0 55.7   F      IT        0.5
## 3  VOV014 VEGAN IT_train  VG           2.0 35.7   F      IT        0.5
## 4  VOV018 VEGAN IT_train  VG           3.0 33.5   F      IT        0.5
## 5  VOV019 VEGAN IT_train  VG           2.0 20.3   F      IT        0.5
## 6  VOV020 VEGAN IT_train  VG           5.0 24.7   M      IT        0.5
## 7  VOV021 VEGAN IT_train  VG           0.6 18.5   F      IT        0.5
## 8  VOV029 VEGAN IT_train  VG           7.0 52.3   F      IT        0.5
## 9  VOV043 VEGAN IT_train  VG           5.0 29.8   F      IT        0.5
## 10 VOV044 VEGAN IT_train  VG           5.0 32.2   F      IT        0.5
##    CALVIN_PWY GALACTITOLCAT_PWY HEXITOLDEGSUPER_PWY
## 1    2.011687        -3.3527234          -2.1492342
## 2    2.051551        -5.2216023          -3.9759989
## 3    2.105309        -5.8716201          -4.6238101
## 4    2.564119        -5.3142270          -4.0664170
## 5    2.460516        -2.8677805          -1.6570946
## 6    1.495842        -2.6777707          -1.4943911
## 7    2.047515        -0.4930437           0.1209159
## 8    2.789296        -4.8926198          -3.6448097
## 9    3.165458        -3.0580872          -1.8510735
## 10   2.246076        -4.0299535          -2.8231830

7.2.2 Define number of pathways and covariates

Open code

n_covarites <- 9
n_features <- ncol(data_analysis) - n_covarites

7.2.3 Create empty objects

Open code

outcome <- vector('double', n_features)

est_Diet_duration_inCZ <- vector('double', n_features)
est_Diet_duration_inIT <- vector('double', n_features)
est_Diet_duration_avg <- vector('double', n_features)

est_ITcountry_avg <- vector('double', n_features)

diet_country_int <- vector('double', n_features)

P_Diet_duration_inCZ <- vector('double', n_features)
P_Diet_duration_inIT <- vector('double', n_features)
P_Diet_duration_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)

est_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)


P_diet_country_int <- vector('double', n_features)

CI_L_Diet_duration_inCZ <- vector('double', n_features)
CI_L_Diet_duration_inIT <- vector('double', n_features)
CI_L_Diet_duration_avg <- vector('double', n_features)

CI_U_Diet_duration_inCZ <- vector('double', n_features)
CI_U_Diet_duration_inIT <- vector('double', n_features)
CI_U_Diet_duration_avg <- vector('double', n_features)

7.2.4 Estimate over outcomes

Open code

if(file.exists('gitignore/lm_results/result_pathways_SGB30_VGdur_training.Rds') == FALSE){
for (i in 1:n_features) {
  
  ## define variable
  data_analysis$outcome <- data_analysis[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Country_IT * Diet_duration + Age, 
              data = data_analysis %>% 
                mutate(Diet_duration = (Diet_duration/10),
                       Age = (Age-33)/10))

  ## get contrast (effects of diet BY COUNTRY)
  contrast_emm <- summary(
    pairs(
      emmeans(
        model,
        specs = ~ Diet_duration | Country_IT,
        at = list(Diet_duration = c(0, 1)) 
        ),
      interaction = TRUE,
      adjust = "none"
      ),
    infer = c(TRUE, TRUE)
    )

  ## save results
  outcome[i] <- names(data_analysis)[i + n_covarites]
  
  ## country and age effect
  est_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 1
  ]
  
  est_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 1
  ]

  P_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 4
  ]
  
  
  ## diet effect
  tr <- confint(model)
  
  CI_L_Diet_duration_avg[i] <- tr[which(row.names(tr) == 'Diet_duration'),][1]
  CI_U_Diet_duration_avg[i] <- tr[which(row.names(tr) == 'Diet_duration'),][2]
  
  est_Diet_duration_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 1
  ]

  P_Diet_duration_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 4
  ]
  
  est_Diet_duration_inCZ[i] <- -contrast_emm[1,3]
  P_Diet_duration_inCZ[i] <- contrast_emm$p.value[1]
  CI_L_Diet_duration_inCZ[i] <- -contrast_emm$upper.CL[1]
  CI_U_Diet_duration_inCZ[i] <- -contrast_emm$lower.CL[1]
  
  est_Diet_duration_inIT[i] <- -contrast_emm[2,3]
  P_Diet_duration_inIT[i] <- contrast_emm$p.value[2]
  CI_L_Diet_duration_inIT[i] <- -contrast_emm$upper.CL[2]
  CI_U_Diet_duration_inIT[i] <- -contrast_emm$lower.CL[2]
  
  ## interaction
  diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_duration"
    ), 1
  ]

  P_diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_duration"
    ), 4
  ]
}

  result_pathways <- data.frame(
  outcome,
  est_ITcountry_avg, P_ITcountry_avg,
  est_Age, P_Age,
  est_Diet_duration_avg, P_Diet_duration_avg,
  est_Diet_duration_inCZ, P_Diet_duration_inCZ,
  est_Diet_duration_inIT, P_Diet_duration_inIT,
  diet_country_int, P_diet_country_int,
  CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
  CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
  CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
)
  
  saveRDS(result_pathways, 
              'gitignore/lm_results/result_pathways_SGB30_VGdur_training.Rds')
}

result_pathways <- readRDS('gitignore/lm_results/result_pathways_SGB30_VGdur_training.Rds')

7.2.5 Show and save results

Open code

kableExtra::kable(result_pathways %>% 
  dplyr::select(
    outcome,
    est_ITcountry_avg, P_ITcountry_avg,
    est_Age, P_Age,
    est_Diet_duration_avg, P_Diet_duration_avg,
    est_Diet_duration_inCZ, P_Diet_duration_inCZ,
    est_Diet_duration_inIT, P_Diet_duration_inIT,
    diet_country_int, P_diet_country_int,
    CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
    CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
    CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
    ) %>% 
    arrange(P_Diet_duration_avg),
  caption = "Results of linear models, modelling clr-transformed pathways (functional pathways inferred from shotgun metagenomic sequencing, expressed as relative proportions) with vegan status duration (`Diet_duration`), `Country`, their interaction (`Diet_duration × Country`), and `Age` as predictors, using training data only (Czech and Italian vegan cohorts). Only pathways with clr-transformed levels differing by diet (FDR < 0.05, average effect across both countries) are shown. `est` prefix: denotes estimated effects (regression coefficients), i.e. expected change in clr-transforemd pathway proportion per 10 years of vegan diet or age, or for country (Italy vs Czech Republic). `P`: p-value. `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix: effect averaged across cohorts, whereas `inCZ` and `inIT` indicate effects in the Czech or Italian cohort, respectively. All estimates in a single row are based on a single model with interaction",
  escape = FALSE
)

Results of linear models, modelling clr-transformed pathways (functional pathways inferred from shotgun metagenomic sequencing, expressed as relative proportions) with vegan status duration (`Diet_duration`), `Country`, their interaction (`Diet_duration × Country`), and `Age` as predictors, using training data only (Czech and Italian vegan cohorts). Only pathways with clr-transformed levels differing by diet (FDR < 0.05, average effect across both countries) are shown. `est` prefix: denotes estimated effects (regression coefficients), i.e. expected change in clr-transforemd pathway proportion per 10 years of vegan diet or age, or for country (Italy vs Czech Republic). `P`: p-value. `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix: effect averaged across cohorts, whereas `inCZ` and `inIT` indicate effects in the Czech or Italian cohort, respectively. All estimates in a single row are based on a single model with interaction
outcome	est_ITcountry_avg	est_Age	P_Age	est_Diet_duration_avg	P_Diet_duration_avg	est_Diet_duration_inCZ	P_Diet_duration_inCZ	est_Diet_duration_inIT	P_Diet_duration_inIT	diet_country_int	P_diet_country_int	CI_L_Diet_duration_avg	CI_U_Diet_duration_avg	CI_L_Diet_duration_inCZ	CI_U_Diet_duration_inCZ	CI_L_Diet_duration_inIT	CI_U_Diet_duration_inIT
PWY0_42	-0.2773973	-0.1004082	0.3386596	0.8437349	0.0175709	0.1036314	0.7679825	1.5838384	0.0057482	1.4802070	0.0192055	0.1507842	1.5366857	-0.5920523	0.7993152	0.4716463	2.6960305
PWY_5104	-1.1008129	-0.1255575	0.3748064	0.8819153	0.0640597	-0.3115837	0.5111188	2.0754142	0.0072186	2.3869979	0.0053976	-0.0526011	1.8164317	-1.2497857	0.6266184	0.5755072	3.5753213
PWY0_1586	-0.1532302	-0.1244985	0.0668966	0.2778437	0.2187251	0.0725126	0.7482276	0.4831749	0.1829736	0.4106623	0.3064622	-0.1677868	0.7234743	-0.3748755	0.5199007	-0.2320661	1.1984158
PWY_5747	-0.5426096	0.1577512	0.3724702	0.6957188	0.2399696	-0.1858384	0.7537062	1.5772759	0.0982099	1.7631142	0.0955870	-0.4726613	1.8640988	-1.3588264	0.9871497	-0.2979845	3.4525362
PWY_5367	0.0661884	-0.0863856	0.4895611	0.4896242	0.2423707	0.1337625	0.7495402	0.8454859	0.2087336	0.7117233	0.3390805	-0.3368826	1.3161310	-0.6960039	0.9635290	-0.4810648	2.1720365
TRPSYN_PWY	-0.4380714	-0.0620052	0.3493535	0.2170546	0.3270580	0.0714053	0.7475225	0.3627038	0.3076609	0.2912985	0.4594053	-0.2205018	0.6546109	-0.3678767	0.5106874	-0.3395780	1.0649857
PWY_7237	-0.4726868	-0.1277910	0.1446405	0.2753235	0.3453469	0.0708106	0.8085468	0.4798364	0.3057300	0.4090259	0.4305587	-0.3012190	0.8518660	-0.5080058	0.6496269	-0.4455194	1.4051922
PWY_6284	0.1978288	-0.0630640	0.6001771	0.3609705	0.3700236	0.1088588	0.7872983	0.6130822	0.3429492	0.5042234	0.4813548	-0.4349332	1.1568742	-0.6901838	0.9079015	-0.6643502	1.8905146
PWY_6629	-0.2677474	-0.0508378	0.3895271	0.1611869	0.4141907	0.0707759	0.7205976	0.2515979	0.4270884	0.1808220	0.6063980	-0.2291330	0.5515069	-0.3210834	0.4626353	-0.3748691	0.8780649
PWY_1269	0.0542379	-0.0743909	0.4739454	0.2771687	0.4248289	-0.0012748	0.9970775	0.5556122	0.3193538	0.5568870	0.3678700	-0.4095946	0.9639319	-0.6907467	0.6881970	-0.5466489	1.6578733
GALACTITOLCAT_PWY	-0.7587257	-0.1317090	0.4444355	0.4407269	0.4438474	-0.0974639	0.8658540	0.9789177	0.2900121	1.0763816	0.2940863	-0.6976062	1.5790601	-1.2402865	1.0453588	-0.8481171	2.8059525
HEXITOLDEGSUPER_PWY	-0.6644502	-0.0964017	0.5397155	0.3674010	0.4844670	-0.0815175	0.8770494	0.8163194	0.3337190	0.8978369	0.3376733	-0.6721866	1.4069885	-1.1252051	0.9621701	-0.8522278	2.4848666
NONOXIPENT_PWY	-0.1145726	-0.0847598	0.2142740	0.1431280	0.5290129	0.0976397	0.6686444	0.1886162	0.6051177	0.0909765	0.8219570	-0.3067688	0.5930247	-0.3540314	0.5493108	-0.5334720	0.9107045
PROPFERM_PWY	-0.6354697	0.0042961	0.9746762	0.2765237	0.5413465	0.1743986	0.7010282	0.3786488	0.6022046	0.2042502	0.7997380	-0.6193986	1.1724460	-0.7250571	1.0738544	-1.0593144	1.8166120
PWY_8188	-0.6354697	0.0042961	0.9746762	0.2765237	0.5413465	0.1743986	0.7010282	0.3786488	0.6022046	0.2042502	0.7997380	-0.6193986	1.1724460	-0.7250571	1.0738544	-1.0593144	1.8166120
PWY_8189	-0.6354697	0.0042961	0.9746762	0.2765237	0.5413465	0.1743986	0.7010282	0.3786488	0.6022046	0.2042502	0.7997380	-0.6193986	1.1724460	-0.7250571	1.0738544	-1.0593144	1.8166120
PWY_6293	-0.6016035	0.1982998	0.1229276	-0.1430936	0.7374860	-0.6189605	0.1508899	0.3327732	0.6273239	0.9517337	0.2121950	-0.9885097	0.7023224	-1.4677108	0.2297898	-1.0241270	1.6896734
PWY_8178	-0.0496325	-0.0790724	0.2486419	0.0534566	0.8148275	0.0533209	0.8160012	0.0535923	0.8836815	0.0002714	0.9994669	-0.3986201	0.5055333	-0.4005387	0.5071806	-0.6719948	0.7791794
PWY_801	-0.6715556	0.2036234	0.1236193	-0.0705987	0.8722625	-0.6203486	0.1615842	0.4791513	0.4970850	1.0994999	0.1616967	-0.9403352	0.7991378	-1.4935153	0.2528181	-0.9167835	1.8750860
CALVIN_PWY	-0.0743894	-0.0533955	0.3990599	0.0301266	0.8865682	0.0180314	0.9322252	0.0422218	0.9008715	0.0241904	0.9486865	-0.3882129	0.4484661	-0.4019580	0.4380208	-0.6292167	0.7136603
HSERMETANA_PWY	0.0093006	0.0204975	0.7260141	-0.0029857	0.9878091	0.0471015	0.8102860	-0.0530730	0.8656333	-0.1001744	0.7733788	-0.3900740	0.3841025	-0.3415134	0.4357163	-0.6743529	0.5682070
PWY_5971	0.7845608	-0.0940461	0.5469506	0.0002204	0.9996626	-0.0740198	0.8875266	0.0744606	0.9290834	0.1484804	0.8728599	-1.0323683	1.0328091	-1.1106809	0.9626414	-1.5828533	1.7317746

Open code


if(file.exists('gitignore/lm_results/result_pathways_VGdur_training.csv') == FALSE){
  write.table(result_pathways, 
              'gitignore/lm_results/result_pathways_VGdur_training.csv', row.names = FALSE)
  }

7.3 Validation cohort

Open code

data_analysis_pathways <- data_pathways_valid2 %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>%
  dplyr::mutate(Diet_duration = as.numeric(as.character(Diet_duration))) %>% 
  dplyr::select(
    Diet_duration, Age,
    all_of(diet_sensitive_pathways$outcome)
  )

summary(data_analysis_pathways[ , 1:12])
##  Diet_duration         Age          CALVIN_PWY    GALACTITOLCAT_PWY
##  Min.   : 0.170   Min.   :21.54   Min.   :2.242   Min.   :-5.686   
##  1st Qu.: 4.062   1st Qu.:30.05   1st Qu.:2.720   1st Qu.:-5.191   
##  Median : 6.000   Median :32.81   Median :2.883   Median :-4.437   
##  Mean   : 6.790   Mean   :32.37   Mean   :2.865   Mean   :-3.758   
##  3rd Qu.:10.000   3rd Qu.:34.24   3rd Qu.:3.052   3rd Qu.:-2.130   
##  Max.   :15.000   Max.   :44.08   Max.   :3.372   Max.   : 1.556   
##  HEXITOLDEGSUPER_PWY HSERMETANA_PWY  NONOXIPENT_PWY   PROPFERM_PWY   
##  Min.   :-4.440      Min.   :1.852   Min.   :1.932   Min.   :-3.650  
##  1st Qu.:-3.945      1st Qu.:2.325   1st Qu.:2.398   1st Qu.:-3.072  
##  Median :-3.199      Median :2.451   Median :2.629   Median :-2.934  
##  Mean   :-2.646      Mean   :2.443   Mean   :2.610   Mean   :-2.836  
##  3rd Qu.:-1.018      3rd Qu.:2.602   3rd Qu.:2.832   3rd Qu.:-2.763  
##  Max.   : 0.303      Max.   :3.075   Max.   :3.384   Max.   :-1.320  
##     PWY_1269          PWY_5104         PWY_5367          PWY_5747     
##  Min.   :-1.5780   Min.   :-4.596   Min.   :-5.0226   Min.   :-6.481  
##  1st Qu.: 0.3895   1st Qu.:-4.332   1st Qu.:-4.3303   1st Qu.:-6.231  
##  Median : 0.7839   Median :-4.173   Median :-2.6540   Median :-6.127  
##  Mean   : 0.8536   Mean   :-3.878   Mean   :-2.6949   Mean   :-5.946  
##  3rd Qu.: 1.2113   3rd Qu.:-3.994   3rd Qu.:-1.8717   3rd Qu.:-6.027  
##  Max.   : 2.3199   Max.   :-1.523   Max.   : 0.7546   Max.   :-2.981

7.3.0.1 Define number of pathways and covariates

Open code

n_covarites <- 2
n_features <- ncol(data_analysis_pathways) - n_covarites

7.3.0.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGduration <- vector('double', n_features)
P_VGduration <- vector('double', n_features)
est_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)
CI_L_VGduration <- vector('double', n_features)
CI_U_VGduration <- vector('double', n_features)

7.3.0.3 Estimate over outcomes

Open code

for (i in 1:n_features) {
  ## define variable
  data_analysis_pathways$outcome <- data_analysis_pathways[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_duration + Age, 
              data = data_analysis_pathways %>% 
                mutate(Diet_duration = (Diet_duration)/10,
                       Age = (Age-33)/10))

  ## save results
  outcome[i] <- names(data_analysis_pathways)[i + n_covarites]

  ## diet effect
  tr <- confint(model)

  CI_L_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][1]
  CI_U_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][2]

  est_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 1
  ]

  P_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 4
  ]
  
  est_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 1
  ]

  P_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 4
  ]
}

7.3.0.4 Results table

Open code

result_pathways_val <- data.frame(
  outcome,
  est_VGduration, P_VGduration,
  CI_L_VGduration, CI_U_VGduration,
  est_Age, P_Age
)

kableExtra::kable(result_pathways_val %>% 
    arrange(P_VGduration),
caption = "Results of linear models estimating the effect of vegan diet status duration and age on clr-transformed pathway proportions (functional pathways inferred from shotgun metagenomic sequencing). Only pathways with significant differences between vegans and omnivores in training cohorts (FDR < 0.05, average effect over both countries) were included. `est`: regression coefficient, i.e. expected change in clr-transformed pathway proportion per +10 years of vegan diet and +10 years of age respectively. `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. All estimates in a single row are based on a single model"
)

Results of linear models estimating the effect of vegan diet status duration and age on clr-transformed pathway proportions (functional pathways inferred from shotgun metagenomic sequencing). Only pathways with significant differences between vegans and omnivores in training cohorts (FDR < 0.05, average effect over both countries) were included. `est`: regression coefficient, i.e. expected change in clr-transformed pathway proportion per +10 years of vegan diet and +10 years of age respectively. `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. All estimates in a single row are based on a single model
outcome	est_VGduration	P_VGduration	CI_L_VGduration	CI_U_VGduration	est_Age	P_Age
PWY0_1586	-0.2034972	0.1217437	-0.4631027	0.0561082	-0.1596314	0.1513086
PWY_5971	0.4404654	0.1852968	-0.2180837	1.0990145	0.3086420	0.2720620
PWY_8188	-0.2530201	0.2466083	-0.6864001	0.1803598	-0.0053484	0.9767893
PROPFERM_PWY	-0.2530201	0.2466083	-0.6864001	0.1803598	-0.0053484	0.9767893
PWY_8189	-0.2530201	0.2466083	-0.6864001	0.1803598	-0.0053484	0.9767893
PWY_6284	0.4282847	0.4363589	-0.6676821	1.5242515	0.3688943	0.4288990
PWY_5367	0.5092329	0.4401775	-0.8048535	1.8233193	0.4477825	0.4232473
PWY_5104	0.2435981	0.4856992	-0.4527699	0.9399660	0.1381028	0.6404712
PWY_5747	-0.2012648	0.5319405	-0.8433037	0.4407740	0.7033059	0.0123152
NONOXIPENT_PWY	-0.0511207	0.6947732	-0.3111956	0.2089541	-0.3182347	0.0055092
TRPSYN_PWY	-0.0291099	0.8042250	-0.2636445	0.2054247	-0.2272285	0.0258614
PWY_1269	0.0913651	0.8077977	-0.6587243	0.8414546	-0.5837282	0.0710437
PWY_8178	-0.0294339	0.8122229	-0.2768912	0.2180233	-0.3027070	0.0055220
HSERMETANA_PWY	-0.0245491	0.8165366	-0.2358853	0.1867871	-0.2385405	0.0100427
PWY_801	0.0487496	0.8200686	-0.3793120	0.4768112	0.2074197	0.2563364
PWY_6293	0.0484819	0.8204130	-0.3780595	0.4750232	0.2063192	0.2571641
PWY_7237	0.0314140	0.8652072	-0.3382422	0.4010702	-0.4885971	0.0028790
PWY0_42	0.0433593	0.8785533	-0.5234589	0.6101774	0.3927582	0.1068248
CALVIN_PWY	-0.0101717	0.9288856	-0.2378529	0.2175096	-0.2283578	0.0212871
PWY_6629	0.0075671	0.9437346	-0.2066243	0.2217585	-0.1913790	0.0391760
HEXITOLDEGSUPER_PWY	-0.0226809	0.9746535	-1.4487920	1.4034301	0.1999101	0.7411747
GALACTITOLCAT_PWY	0.0239274	0.9768329	-1.6221390	1.6699939	0.2729324	0.6960884

Open code


if (file.exists("gitignore/result_pathways_VGdur_valid.csv") == FALSE) {
  write.table(result_pathways_val,
    "gitignore/result_pathways_VGdur_valid.csv",
    row.names = FALSE
  )
}

7.4 Forest plot

7.4.1 Prepare data

Open code


## subset result tables
result_pathways_subset <- result_pathways %>%
  filter(outcome %in% diet_sensitive_pathways$outcome)

result_pathways_val_subset <- result_pathways_val %>%
  filter(outcome %in% diet_sensitive_pathways$outcome)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_pathways$outcome, 3),
  beta = c(
    result_pathways_subset$est_Diet_duration_inCZ,
    result_pathways_subset$est_Diet_duration_inIT,
    result_pathways_val_subset$est_VGduration
  ),
  lower = c(
    result_pathways_subset$CI_L_Diet_duration_inCZ,
    result_pathways_subset$CI_L_Diet_duration_inIT,
    result_pathways_val_subset$CI_L_VGduration
  ),
  upper = c(
    result_pathways_subset$CI_U_Diet_duration_inCZ,
    result_pathways_subset$CI_U_Diet_duration_inIT,
    result_pathways_val_subset$CI_U_VGduration
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)


 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
   left_join(
     val_res %>% select(outcome, pathway), 
     by = 'outcome') %>%
  left_join(
    elacoef %>% mutate(outcome = pathway) %>% select(-pathway), 
    by = 'outcome') %>% 
   mutate(outcome = factor(pathway, levels = validation_order))

7.4.1.1 Plotting

Open code


plotac <- "forest_plot_pathways_VGduration"
path <- "gitignore/figures"

colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

assign(plotac, ggplot(
  data_forest, aes(x = outcome, y = beta, ymin = lower, ymax = upper, color = dataset)
) +
  geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
  scale_color_manual(values = colors) +
  labs(
    y = "Effect of vegan diet duration (+10y) on CLR(relative level)",
    x = "Outcome",
    color = "Dataset"
  ) +
  theme_minimal() +
  coord_flip() + 
  scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$outcome, "**"), 
             as.character(data_forest$outcome)
      ), data_forest$outcome
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

Forest plot shows the effect of vegan status duration (within the vegan sub-population, scale of tens years) on clr-transformed relative level of selected functional pathways (inferred from shotgun metagenomic sequencing), with 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech validation cohort. Green, blue, and grey points/lines represent the expected change in clr-transformed relative pathway level per +10 years of vegan diet within the Italian, Czech, and Czech validation cohorts, respectively. Positive values indicate higher relative pathway level in long-term vegans compared to short-term vegans. Only pathways with significant differences between vegans and omnivores (average effect over the two countries) were included. Estimates for the training cohorts come from a single linear model with `Diet_duration`, `Country`, their interaction (`Diet_duration:Country`), and `Age` as fixed-effect predictors. `Age` was included as it correlates with `Diet_duration` and could act as a confounder. In the Czech validation cohort, `Diet_duration` and `Age` were fixed-effect predictors. Diet-sensitive pathways are shown in **bold**, in the same order as in the forest plot of vegan–omnivore differences

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 9,
    height = 13
  )
}

8 Summary info about Patways

8.1 Proportions of unmapped and unintegrated

8.1.1 Czech trainng cohort

Open code

StratfiedPaths_originalCZ <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_CZ_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway))

dim(StratfiedPaths_originalCZ)
## [1] 580  89

all_paths <- colSums(StratfiedPaths_originalCZ[, -1])

unmapped <- StratfiedPaths_originalCZ[1, -1] / all_paths
unintegrated <- StratfiedPaths_originalCZ[2, -1] / all_paths
classified <- 1 - unintegrated - unmapped


kableExtra::kable(
  dplyr::bind_rows(
    quantile(as.numeric(unmapped), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(unintegrated), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(classified), probs = c(0, 0.25, 0.5, 0.75, 1))
  ) %>%
    dplyr::mutate(
      category = c("unmapped", "unintegrated", "classified")
    ) %>%
    dplyr::select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Czech training cohort. 'Unmapped': reads not aligned to UniRef90. 'Unintegrated': reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. 'Classified': reads assigned to known MetaCyc pathways (unstratified community-level abundances)"
)

Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Czech training cohort. ‘Unmapped’: reads not aligned to UniRef90. ‘Unintegrated’: reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. ‘Classified’: reads assigned to known MetaCyc pathways (unstratified community-level abundances)
category	0%	25%	50%	75%	100%
unmapped	0.1463292	0.1816258	0.2156915	0.2538069	0.3634129
unintegrated	0.5993433	0.7027560	0.7373852	0.7718285	0.8265473
classified	0.0271235	0.0416706	0.0438855	0.0460305	0.0594788

8.1.2 Italian training cohort

Open code

StratifiedPaths_originalIT <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_IT_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway))

dim(StratifiedPaths_originalIT)
## [1] 488  79

all_paths <- colSums(StratifiedPaths_originalIT[, -1])

unmapped <- StratifiedPaths_originalIT[1, -1] / all_paths
unintegrated <- StratifiedPaths_originalIT[2, -1] / all_paths
classified <- 1 - unintegrated - unmapped


kableExtra::kable(
  dplyr::bind_rows(
    quantile(as.numeric(unmapped), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(unintegrated), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(classified), probs = c(0, 0.25, 0.5, 0.75, 1))
  ) %>%
    dplyr::mutate(
      category = c("unmapped", "unintegrated", "classified")
    ) %>%
    dplyr::select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Italian training cohort. 'Unmapped': reads not aligned to UniRef90. 'Unintegrated': reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. 'Classified': reads assigned to known MetaCyc pathways (unstratified community-level abundances)"
)

Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Italian training cohort. ‘Unmapped’: reads not aligned to UniRef90. ‘Unintegrated’: reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. ‘Classified’: reads assigned to known MetaCyc pathways (unstratified community-level abundances)
category	0%	25%	50%	75%	100%
unmapped	0.1291023	0.1754808	0.2275788	0.2617836	0.4390892
unintegrated	0.5217834	0.6929692	0.7215643	0.7704262	0.8291034
classified	0.0288690	0.0432038	0.0469259	0.0526642	0.0798983

8.1.3 Czech validation cohort

Open code

StratifiedPaths_validation <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_Validation_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway))

dim(StratifiedPaths_validation)
## [1] 580 104

all_paths <- colSums(StratifiedPaths_validation[, -1])

unmapped <- StratifiedPaths_validation[1, -1] / all_paths
unintegrated <- StratifiedPaths_validation[2, -1] / all_paths
classified <- 1 - unintegrated - unmapped


kableExtra::kable(
  dplyr::bind_rows(
    quantile(as.numeric(unmapped), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(unintegrated), probs = c(0, 0.25, 0.5, 0.75, 1)),
    quantile(as.numeric(classified), probs = c(0, 0.25, 0.5, 0.75, 1))
  ) %>%
    dplyr::mutate(
      category = c("unmapped", "unintegrated", "classified")
    ) %>%
    dplyr::select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Czech validation cohort. 'Unmapped': reads not aligned to UniRef90. 'Unintegrated': reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. 'Classified': reads assigned to known MetaCyc pathways (unstratified community-level abundances)"
)

Summary statistics (minimum, 25th, 50th, 75th, and maximum) of the proportions of unmapped, unintegrated, and classified pathway reads across samples from the Czech validation cohort. ‘Unmapped’: reads not aligned to UniRef90. ‘Unintegrated’: reads aligned to UniRef90 gene families but not assigned to any MetaCyc pathway. ‘Classified’: reads assigned to known MetaCyc pathways (unstratified community-level abundances)
category	0%	25%	50%	75%	100%
unmapped	0.1363081	0.1825077	0.2098586	0.2292549	0.3303064
unintegrated	0.6290565	0.7209362	0.7419081	0.7664460	0.8230822
classified	0.0369937	0.0444025	0.0488882	0.0517536	0.0616683

Open code


data_path_validation <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_Validation_humann.tsv",
  header = TRUE, sep = "\t"
) %>%
  filter(!grepl("\\|", X..Pathway)) %>%
  select(X..Pathway)
dim(data_path_validation)
## [1] 580   1

8.2 Taxon-assigned vs. taxon-unassigned pathways

This refers to the proportion of reads that were functionally assigned to a MetaCyc pathway but whose taxonomic origin could not be determined, from the all classified reads.

8.2.1 Czech trainng cohort

Open code

Paths_originalCZ <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_CZ_humann.tsv",
  header = TRUE, sep = "\t", check.names = FALSE
) %>%
  filter(
    !grepl("UNMAPPED", `# Pathway`),
    !grepl("UNINTEGRATED", `# Pathway`)
  )

name_col <- names(Paths_originalCZ)[1]

unstrat <- Paths_originalCZ %>% filter(!grepl("\\|", `# Pathway`))
dim(unstrat)
## [1] 578  89
unstrat[1:10, 1:3]
##                                                                               # Pathway
## 1                                   12DICHLORETHDEG-PWY: 1,2-dichloroethane degradation
## 2                                      1CMET2-PWY: folate transformations III (E. coli)
## 3            3-HYDROXYPHENYLACETATE-DEGRADATION-PWY: 4-hydroxyphenylacetate degradation
## 4                    4-HYDROXYMANDELATE-DEGRADATION-PWY: 4-hydroxymandelate degradation
## 5                                            AEROBACTINSYN-PWY: aerobactin biosynthesis
## 6                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast
## 7                                             ANAEROFRUCAT-PWY: homolactic fermentation
## 8                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)
## 9                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis
## 10 ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation
##         T36        T47
## 1     0.000    0.00000
## 2  7206.790 4995.96065
## 3     0.000    0.00000
## 4     0.000    0.00000
## 5     0.000    0.00000
## 6     0.000   22.69929
## 7  3694.568 2757.45106
## 8  5606.681 6779.42061
## 9  1129.840  467.06740
## 10    0.000    0.00000

strat <- Paths_originalCZ %>% filter(grepl("\\|", `# Pathway`))
stratCZ <- strat

dim(strat)
## [1] 20794    89
strat[1:10, 1:3]
##                                                                                          # Pathway
## 1       1CMET2-PWY: folate transformations III (E. coli)|g__Akkermansia.s__Akkermansia_muciniphila
## 2            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_finegoldii
## 3           1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii
## 4            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis
## 5            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_sp_CAG_268
## 6            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_timonensis
## 7          1CMET2-PWY: folate transformations III (E. coli)|g__Anaerostipes.s__Anaerostipes_hadrus
## 8            1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae
## 9  1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_cellulosilyticus
## 10           1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_clarus
##         T36       T47
## 1  1365.836  0.000000
## 2     0.000  0.000000
## 3     0.000  0.000000
## 4     0.000  0.000000
## 5     0.000  0.000000
## 6     0.000  0.000000
## 7     0.000 13.076579
## 8     0.000  5.707278
## 9     0.000  0.000000
## 10    0.000  0.000000

strat_unclassified <- strat %>%
  filter(grepl("\\|unclassified", `# Pathway`))
dim(strat_unclassified)
## [1] 425  89
strat_unclassified[1:10, 1:3]
##                                                                                            # Pathway
## 1                                      1CMET2-PWY: folate transformations III (E. coli)|unclassified
## 2                                            AEROBACTINSYN-PWY: aerobactin biosynthesis|unclassified
## 3                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast|unclassified
## 4                                             ANAEROFRUCAT-PWY: homolactic fermentation|unclassified
## 5                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)|unclassified
## 6                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis|unclassified
## 7  ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation|unclassified
## 8                                        ARGININE-SYN4-PWY: L-ornithine biosynthesis II|unclassified
## 9                               ARGSYN-PWY: L-arginine biosynthesis I (via L-ornithine)|unclassified
## 10                            ARGSYNBSUB-PWY: L-arginine biosynthesis II (acetyl cycle)|unclassified
##           T36         T47
## 1   850.22127  735.377758
## 2     0.00000    0.000000
## 3     0.00000    0.000000
## 4  1236.08342   69.609446
## 5  1582.32882 1230.262897
## 6    53.57712   60.195263
## 7     0.00000    0.000000
## 8    17.32423    7.024764
## 9  1881.25353 1426.920220
## 10 1578.50191 1479.972027

quants <- quantile(
  colSums(strat_unclassified[, -1]) / colSums(unstrat[, -1]),
  probs = c(0, 0.25, 0.5, 0.75, 1)
)

quants_df <- as.data.frame(t(quants))
names(quants_df) <- c("0%", "25%", "50%", "75%", "100%")
quants_df$category <- "unclassified_within_taxon-assigned"

kableExtra::kable(
  quants_df %>% select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Czech training cohort). 'Unclassified pathways': MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled '|unclassified' in the stratified output)"
)

Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Czech training cohort). ‘Unclassified pathways’: MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled ‘|unclassified’ in the stratified output)
category	0%	25%	50%	75%	100%
unclassified_within_taxon-assigned	0.024712	0.0665779	0.0909481	0.1279512	0.3327366

8.2.2 Italian trainng cohort

Open code

Paths_originalIT <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_IT_humann.tsv",
  header = TRUE, sep = "\t", check.names = FALSE
) %>%
  filter(
    !grepl("UNMAPPED", `# Pathway`),
    !grepl("UNINTEGRATED", `# Pathway`)
  )

name_col <- names(Paths_originalCZ)[1]

unstrat <- Paths_originalIT %>% filter(!grepl("\\|", `# Pathway`))
dim(unstrat)
## [1] 486  79
unstrat[1:10, 1:3]
##                                                                               # Pathway
## 1                                   12DICHLORETHDEG-PWY: 1,2-dichloroethane degradation
## 2                                      1CMET2-PWY: folate transformations III (E. coli)
## 3            3-HYDROXYPHENYLACETATE-DEGRADATION-PWY: 4-hydroxyphenylacetate degradation
## 4                                            AEROBACTINSYN-PWY: aerobactin biosynthesis
## 5                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast
## 6                                             ANAEROFRUCAT-PWY: homolactic fermentation
## 7                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)
## 8                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis
## 9  ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation
## 10                                       ARGININE-SYN4-PWY: L-ornithine biosynthesis II
##        VOV002      VOV003
## 1     0.00000     0.00000
## 2  5831.38667 15090.47182
## 3    10.16048     0.00000
## 4     0.00000    27.10151
## 5     0.00000   375.52244
## 6  2392.63624  6594.16155
## 7  5156.72228 14176.11481
## 8  1649.41125  3106.04621
## 9    73.37596     0.00000
## 10 3006.12980  5661.73935

strat <- Paths_originalIT %>% filter(grepl("\\|", `# Pathway`))
stratIT <- strat

dim(strat)
## [1] 18706    79
strat[1:10, 1:3]
##                                                                                          # Pathway
## 1       1CMET2-PWY: folate transformations III (E. coli)|g__Akkermansia.s__Akkermansia_muciniphila
## 2            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_finegoldii
## 3           1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii
## 4            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis
## 5               1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_sp_An66
## 6            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_sp_CAG_268
## 7            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_timonensis
## 8          1CMET2-PWY: folate transformations III (E. coli)|g__Anaerostipes.s__Anaerostipes_hadrus
## 9            1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae
## 10 1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_cellulosilyticus
##       VOV002    VOV003
## 1    0.00000 268.83988
## 2   49.94327 322.46788
## 3   16.31622   0.00000
## 4  389.90038 623.18145
## 5    0.00000   0.00000
## 6  843.30879   0.00000
## 7    0.00000   0.00000
## 8    0.00000   0.00000
## 9   36.86016  91.25667
## 10   0.00000   0.00000

strat_unclassified <- strat %>%
  filter(grepl("\\|unclassified", `# Pathway`))
dim(strat_unclassified)
## [1] 440  79
strat_unclassified[1:10, 1:3]
##                                                                                            # Pathway
## 1                                      1CMET2-PWY: folate transformations III (E. coli)|unclassified
## 2            3-HYDROXYPHENYLACETATE-DEGRADATION-PWY: 4-hydroxyphenylacetate degradation|unclassified
## 3                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast|unclassified
## 4                                             ANAEROFRUCAT-PWY: homolactic fermentation|unclassified
## 5                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)|unclassified
## 6                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis|unclassified
## 7  ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation|unclassified
## 8                                        ARGININE-SYN4-PWY: L-ornithine biosynthesis II|unclassified
## 9                               ARGSYN-PWY: L-arginine biosynthesis I (via L-ornithine)|unclassified
## 10                            ARGSYNBSUB-PWY: L-arginine biosynthesis II (acetyl cycle)|unclassified
##        VOV002    VOV003
## 1  592.735952 1831.3830
## 2    8.815428    0.0000
## 3    0.000000    0.0000
## 4  211.424264  819.0017
## 5  453.479555 1874.8780
## 6   65.875697  438.1747
## 7   20.241094    0.0000
## 8  192.409181  200.4070
## 9  603.938329 4149.4989
## 10 636.519706 4733.9368

quants <- quantile(
  colSums(strat_unclassified[, -1]) / colSums(unstrat[, -1]),
  probs = c(0, 0.25, 0.5, 0.75, 1)
)

quants_df <- as.data.frame(t(quants))
names(quants_df) <- c("0%", "25%", "50%", "75%", "100%")
quants_df$category <- "unclassified_within_taxon-assigned"

kableExtra::kable(
  quants_df %>% select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Italian training cohort). 'Unclassified pathways': MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled '|unclassified' in the stratified output)"
)

Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Italian training cohort). ‘Unclassified pathways’: MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled ‘|unclassified’ in the stratified output)
category	0%	25%	50%	75%	100%
unclassified_within_taxon-assigned	0.0357561	0.0851309	0.1141805	0.1542638	0.4046326

8.2.3 Validation cohort

Open code

Paths_validation <- read.delim(
  "gitignore/data/pathw/Pathway_abundance_MetaCyc_Validation_humann.tsv",
  header = TRUE, sep = "\t", check.names = FALSE
) %>%
  filter(
    !grepl("UNMAPPED", `# Pathway`),
    !grepl("UNINTEGRATED", `# Pathway`)
  )

unstrat <- Paths_validation %>% filter(!grepl("\\|", `# Pathway`))
dim(unstrat)
## [1] 578 104
unstrat[1:10, 1:3]
##                                                                               # Pathway
## 1                                   12DICHLORETHDEG-PWY: 1,2-dichloroethane degradation
## 2                                      1CMET2-PWY: folate transformations III (E. coli)
## 3            3-HYDROXYPHENYLACETATE-DEGRADATION-PWY: 4-hydroxyphenylacetate degradation
## 4                    4-HYDROXYMANDELATE-DEGRADATION-PWY: 4-hydroxymandelate degradation
## 5                                            AEROBACTINSYN-PWY: aerobactin biosynthesis
## 6                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast
## 7                                             ANAEROFRUCAT-PWY: homolactic fermentation
## 8                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)
## 9                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis
## 10 ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation
##            4         5
## 1     0.0000    0.0000
## 2  6827.5671 6081.1138
## 3     0.0000    0.0000
## 4     0.0000    0.0000
## 5     0.0000    0.0000
## 6     0.0000    0.0000
## 7  2021.6986 4361.0989
## 8  8939.6788 8361.6527
## 9   796.0169  316.3769
## 10    0.0000    0.0000

strat <- Paths_validation %>% filter(grepl("\\|", `# Pathway`))

stratVAL <- strat

dim(strat)
## [1] 20794   104
strat[1:10, 1:3]
##                                                                                          # Pathway
## 1       1CMET2-PWY: folate transformations III (E. coli)|g__Akkermansia.s__Akkermansia_muciniphila
## 2            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_finegoldii
## 3           1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_onderdonkii
## 4            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_putredinis
## 5            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_sp_CAG_268
## 6            1CMET2-PWY: folate transformations III (E. coli)|g__Alistipes.s__Alistipes_timonensis
## 7          1CMET2-PWY: folate transformations III (E. coli)|g__Anaerostipes.s__Anaerostipes_hadrus
## 8            1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_caccae
## 9  1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_cellulosilyticus
## 10           1CMET2-PWY: folate transformations III (E. coli)|g__Bacteroides.s__Bacteroides_clarus
##           4         5
## 1  30.77154   0.00000
## 2   0.00000   0.00000
## 3   0.00000   0.00000
## 4  73.75813 204.65122
## 5   0.00000   0.00000
## 6   0.00000   0.00000
## 7  54.62160  37.22825
## 8   0.00000   0.00000
## 9   0.00000   0.00000
## 10  0.00000   0.00000

strat_unclassified <- strat %>%
  filter(grepl("\\|unclassified", `# Pathway`))
dim(strat_unclassified)
## [1] 425 104
strat_unclassified[1:10, 1:3]
##                                                                                            # Pathway
## 1                                      1CMET2-PWY: folate transformations III (E. coli)|unclassified
## 2                                            AEROBACTINSYN-PWY: aerobactin biosynthesis|unclassified
## 3                      ALLANTOINDEG-PWY: superpathway of allantoin degradation in yeast|unclassified
## 4                                             ANAEROFRUCAT-PWY: homolactic fermentation|unclassified
## 5                                      ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)|unclassified
## 6                ARG+POLYAMINE-SYN: superpathway of arginine and polyamine biosynthesis|unclassified
## 7  ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate degradation|unclassified
## 8                                        ARGININE-SYN4-PWY: L-ornithine biosynthesis II|unclassified
## 9                               ARGSYN-PWY: L-arginine biosynthesis I (via L-ornithine)|unclassified
## 10                            ARGSYNBSUB-PWY: L-arginine biosynthesis II (acetyl cycle)|unclassified
##             4         5
## 1   822.03841  701.9697
## 2     0.00000    0.0000
## 3     0.00000    0.0000
## 4   179.78691  259.8946
## 5   845.21715  793.1012
## 6     0.00000    0.0000
## 7     0.00000    0.0000
## 8    89.80743    0.0000
## 9  1308.48866 1227.9350
## 10 1658.82435 1324.9933

quants <- quantile(
  colSums(strat_unclassified[, -1]) / colSums(unstrat[, -1]),
  probs = c(0, 0.25, 0.5, 0.75, 1)
)

quants_df <- as.data.frame(t(quants))
names(quants_df) <- c("0%", "25%", "50%", "75%", "100%")
quants_df$category <- "unclassified_within_taxon-assigned"

kableExtra::kable(
  quants_df %>% select(category, everything()),
  caption = "Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Czech validation cohort). 'Unclassified pathways': MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled '|unclassified' in the stratified output)"
)

Summary statistics (minimum, 25th, 50th, 75th, maximum) of the proportion of pathway signal that was taxonomically un-assigned, relative to all classified pathways (Czech validation cohort). ‘Unclassified pathways’: MetaCyc pathway abundances that HUMAnN could not attribute to specific taxa (rows labeled ‘|unclassified’ in the stratified output)
category	0%	25%	50%	75%	100%
unclassified_within_taxon-assigned	0.0259304	0.0801659	0.1083844	0.1322093	0.2306524

8.3 Taxa contribution to diet-sensitive pathways

8.3.1 PWY-5367: petroselinate biosynthesis

8.3.1.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
g__Blautia.s__Ruminococcus_torques	0.1836	0.9822	1.0000	1.0000
g__Escherichia.s__Escherichia_coli	0.0000	0.0000	0.0596	1.0000
g__Citrobacter.s__Citrobacter_freundii	0.0000	0.0000	0.0000	1.0000
unclassified	0.0000	0.0000	0.0000	1.0000
g__Enterococcus.s__Enterococcus_faecalis	0.0000	0.0000	0.0000	0.3800
g__Klebsiella.s__Klebsiella_pneumoniae	0.0000	0.0000	0.0000	0.1882
g__Enterobacter.s__Enterobacter_roggenkampii	0.0000	0.0000	0.0000	0.1190
g__Streptococcus.s__Streptococcus_pneumoniae	0.0000	0.0000	0.0000	0.1078
g__Escherichia.s__Escherichia_fergusonii	0.0000	0.0000	0.0000	0.0843
g__Escherichia.s__Escherichia_marmotae	0.0000	0.0000	0.0000	0.0726

8.3.1.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	50%	75%	100%
g__Escherichia.s__Escherichia_coli	0.6316	1	1.0000
g__Blautia.s__Ruminococcus_torques	0.0000	0	1.0000
g__Escherichia.s__Escherichia_marmotae	0.0000	0	1.0000
g__Leclercia.s__Leclercia_adecarboxylata	0.0000	0	1.0000
unclassified	0.0000	0	1.0000
g__Enterobacter.s__Enterobacter_roggenkampii	0.0000	0	0.9279
g__Enterobacter.s__Enterobacter_mori	0.0000	0	0.7381
g__Citrobacter.s__Citrobacter_freundii	0.0000	0	0.7378
g__Citrobacter.s__Citrobacter_braakii	0.0000	0	0.4592
g__Klebsiella.s__Klebsiella_oxytoca	0.0000	0	0.4285

8.3.1.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
g__Blautia.s__Ruminococcus_torques	1	1	1	1.0000
unclassified	0	0	0	1.0000
g__Enterococcus.s__Enterococcus_hirae	0	0	0	0.8935
g__Enterobacter.s__Enterobacter_bugandensis	0	0	0	0.8379
g__Escherichia.s__Escherichia_coli	0	0	0	0.6883
g__Enterococcus.s__Enterococcus_faecium	0	0	0	0.5535
g__Enterococcus.s__Enterococcus_durans	0	0	0	0.4465
g__Klebsiella.s__Klebsiella_oxytoca	0	0	0	0.3625
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0	0.3454
g__Escherichia.s__Escherichia_fergusonii	0	0	0	0.3117

8.3.2 NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I

8.3.2.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("NONOXIPENT-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0535	0.1265	0.2137	0.2827	0.6020
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0125	0.0417	0.1063	0.2775
g__Blautia.s__Ruminococcus_torques	0.0000	0.0131	0.0404	0.0732	0.1831
g__Eubacterium.s__Eubacterium_eligens	0.0000	0.0000	0.0299	0.0727	0.4486
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0123	0.0266	0.0518	0.1936
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0057	0.0208	0.0565	0.3011
g__Roseburia.s__Roseburia_faecis	0.0000	0.0000	0.0160	0.0556	0.5527
g__Eubacterium.s__Eubacterium_hallii	0.0000	0.0015	0.0136	0.0295	0.0907
g__Parabacteroides.s__Parabacteroides_distasonis	0.0000	0.0000	0.0109	0.0300	0.2507
g__Collinsella.s__Collinsella_aerofaciens	0.0000	0.0000	0.0108	0.0306	0.1724

8.3.2.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("NONOXIPENT-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0061	0.2064	0.3930	0.5533	0.8199
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0024	0.0196	0.0485	0.4152
g__Parabacteroides.s__Parabacteroides_distasonis	0.0000	0.0018	0.0169	0.0475	0.1819
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0020	0.0167	0.0388	0.4557
g__Collinsella.s__Collinsella_aerofaciens	0.0000	0.0007	0.0159	0.0401	0.3181
g__Roseburia.s__Roseburia_faecis	0.0000	0.0000	0.0070	0.0316	0.2220
g__Flavonifractor.s__Flavonifractor_plautii	0.0000	0.0000	0.0031	0.0200	0.0997
g__Agathobaculum.s__Agathobaculum_butyriciproducens	0.0000	0.0000	0.0020	0.0121	0.0645
g__Escherichia.s__Escherichia_coli	0.0000	0.0000	0.0000	0.0717	0.7774
g__Roseburia.s__Roseburia_hominis	0.0000	0.0000	0.0000	0.0121	0.3341

8.3.2.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("NONOXIPENT-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `NONOXIPENT-PWY: pentose phosphate pathway (non-oxidative branch) I` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0353	0.1040	0.1365	0.2026	0.4451
g__Blautia.s__Ruminococcus_torques	0.0000	0.0403	0.0730	0.0974	0.2555
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0330	0.0591	0.1089	0.2600
g__Roseburia.s__Roseburia_faecis	0.0000	0.0047	0.0484	0.1113	0.3453
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0196	0.0439	0.1011	0.3910
g__Collinsella.s__Collinsella_aerofaciens	0.0000	0.0264	0.0424	0.0685	0.2160
g__Eubacterium.s__Eubacterium_hallii	0.0000	0.0257	0.0365	0.0512	0.2285
g__Blautia.s__Blautia_obeum	0.0003	0.0218	0.0343	0.0457	0.1634
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0198	0.0334	0.0540	0.2550
g__Blautia.s__Blautia_wexlerae	0.0000	0.0114	0.0210	0.0359	0.1837

8.3.3 TRPSYN-PWY: L-tryptophan biosynthesis

8.3.3.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("TRPSYN-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0171	0.0970	0.1647	0.2772	0.5479
g__Bacteroides.s__Bacteroides_vulgatus	0.0000	0.0261	0.0950	0.2094	0.7458
g__Eubacterium.s__Eubacterium_eligens	0.0000	0.0000	0.0312	0.0656	0.2780
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0040	0.0247	0.0743	0.3175
g__Blautia.s__Ruminococcus_torques	0.0000	0.0030	0.0209	0.0405	0.1779
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0072	0.0179	0.0310	0.1868
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0000	0.0130	0.0348	0.2000
g__Ruminococcus.s__Ruminococcus_bromii	0.0000	0.0000	0.0115	0.0505	0.3164
g__Blautia.s__Blautia_obeum	0.0000	0.0030	0.0098	0.0234	0.2721
g__Firmicutes_unclassified.s__Firmicutes_bacterium_CAG_41	0.0000	0.0000	0.0095	0.0173	0.0639

8.3.3.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("TRPSYN-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
unclassified	0.1310	0.2364	0.3691	0.8064
g__Bacteroides.s__Bacteroides_vulgatus	0.0213	0.0920	0.2311	0.7756
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0142	0.0382	0.2364
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0128	0.0387	0.4200
g__Lawsonibacter.s__Lawsonibacter_asaccharolyticus	0.0000	0.0072	0.0258	0.4782
g__Firmicutes_unclassified.s__Firmicutes_bacterium_CAG_65_45_313	0.0000	0.0000	0.0340	0.4960
g__Faecalibacterium.s__Faecalibacterium_prausnitzii	0.0000	0.0000	0.0207	0.4824
g__Roseburia.s__Roseburia_hominis	0.0000	0.0000	0.0188	0.1981
g__Ruminococcus.s__Ruminococcus_bromii	0.0000	0.0000	0.0187	0.0688
g__Eubacterium.s__Eubacterium_eligens	0.0000	0.0000	0.0179	0.0849

8.3.3.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("TRPSYN-PWY", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `TRPSYN-PWY: L-tryptophan biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0166	0.0867	0.1270	0.2093	0.4686
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0276	0.0511	0.0903	0.2820
g__Blautia.s__Blautia_obeum	0.0031	0.0310	0.0460	0.0701	0.2020
g__Blautia.s__Ruminococcus_torques	0.0000	0.0291	0.0455	0.0712	0.1757
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0195	0.0408	0.0849	0.4596
g__Bacteroides.s__Bacteroides_vulgatus	0.0000	0.0135	0.0376	0.0684	0.2935
g__Blautia.s__Blautia_wexlerae	0.0018	0.0208	0.0305	0.0449	0.0889
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0179	0.0293	0.0459	0.2038
g__Roseburia.s__Roseburia_faecis	0.0000	0.0000	0.0289	0.0737	0.2440
g__Ruminococcus.s__Ruminococcus_bromii	0.0000	0.0000	0.0204	0.1113	0.3487

8.3.4 PWY-8178: pentose phosphate pathway (non-oxidative branch) II

8.3.4.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-8178", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0599	0.1435	0.2467	0.3486	0.6512
g__Blautia.s__Ruminococcus_torques	0.0000	0.0119	0.0459	0.0664	0.1660
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0110	0.0388	0.0930	0.2692
g__Eubacterium.s__Eubacterium_eligens	0.0000	0.0044	0.0382	0.0794	0.4492
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0078	0.0237	0.0634	0.3114
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0089	0.0205	0.0428	0.1478
g__Roseburia.s__Roseburia_inulinivorans	0.0000	0.0000	0.0138	0.0365	0.2216
g__Roseburia.s__Roseburia_faecis	0.0000	0.0000	0.0137	0.0492	0.5145
g__Blautia.s__Blautia_wexlerae	0.0000	0.0056	0.0136	0.0266	0.2900
g__Eubacterium.s__Eubacterium_hallii	0.0000	0.0000	0.0135	0.0278	0.0714

8.3.4.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-8178", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0088	0.2572	0.4697	0.6587	0.9114
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0010	0.0133	0.0367	0.3171
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0014	0.0132	0.0455	0.3451
g__Parabacteroides.s__Parabacteroides_distasonis	0.0000	0.0000	0.0110	0.0337	0.2201
g__Collinsella.s__Collinsella_aerofaciens	0.0000	0.0000	0.0088	0.0293	0.2724
g__Flavonifractor.s__Flavonifractor_plautii	0.0000	0.0000	0.0052	0.0265	0.0888
g__Roseburia.s__Roseburia_faecis	0.0000	0.0000	0.0048	0.0231	0.1822
g__Agathobaculum.s__Agathobaculum_butyriciproducens	0.0000	0.0000	0.0028	0.0131	0.0538
g__Roseburia.s__Roseburia_hominis	0.0000	0.0000	0.0027	0.0122	0.2948
g__Escherichia.s__Escherichia_coli	0.0000	0.0000	0.0000	0.0374	0.7374

8.3.4.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-8178", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-8178: pentose phosphate pathway (non-oxidative branch) II` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	0.0413	0.1185	0.1669	0.2232	0.4912
g__Blautia.s__Ruminococcus_torques	0.0000	0.0426	0.0673	0.0986	0.2491
g__Fusicatenibacter.s__Fusicatenibacter_saccharivorans	0.0000	0.0291	0.0509	0.0966	0.2282
g__Roseburia.s__Roseburia_faecis	0.0000	0.0019	0.0444	0.0978	0.3274
g__Lachnospiraceae_unclassified.s__Eubacterium_rectale	0.0000	0.0178	0.0397	0.0863	0.3465
g__Blautia.s__Blautia_wexlerae	0.0090	0.0227	0.0375	0.0535	0.2396
g__Anaerostipes.s__Anaerostipes_hadrus	0.0000	0.0234	0.0360	0.0635	0.2427
g__Collinsella.s__Collinsella_aerofaciens	0.0000	0.0210	0.0339	0.0548	0.1649
g__Blautia.s__Blautia_obeum	0.0004	0.0213	0.0330	0.0462	0.1463
g__Eubacterium.s__Eubacterium_hallii	0.0000	0.0222	0.0326	0.0457	0.2198

8.3.5 PWY-801: homocysteine and cysteine interconversion

8.3.5.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-801", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
unclassified	1	1	1	1.0000
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0	1.0000
g__Klebsiella.s__Klebsiella_variicola	0	0	0	0.7376

8.3.5.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-801", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
unclassified	1	1	1	1.0000
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0	1.0000
g__Escherichia.s__Escherichia_coli	0	0	0	0.2929
g__Klebsiella.s__Klebsiella_oxytoca	0	0	0	0.0383

8.3.5.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-801", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-801: homocysteine and cysteine interconversion` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
unclassified	1	1	1	1.0000
g__Enterococcus.s__Enterococcus_gallinarum	0	0	0	0.7631
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0	0.2465
g__Enterococcus.s__Enterococcus_saccharolyticus	0	0	0	0.2369

8.3.6 PWY-6293: superpathway of L-cysteine biosynthesis (fungi)

8.3.6.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-6293", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	1	1	1	1	1

8.3.6.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-6293", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	1	1	1	1	1

8.3.6.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-6293", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6293: superpathway of L-cysteine biosynthesis (fungi)` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	0%	25%	50%	75%	100%
unclassified	1	1	1	1	1

8.3.7 PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)

8.3.7.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-6284", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	50%	75%	100%
g__Escherichia.s__Escherichia_coli	0.583	1	1.0000
unclassified	0.000	1	1.0000
g__Enterococcus.s__Enterococcus_faecalis	0.000	0	1.0000
g__Klebsiella.s__Klebsiella_pneumoniae	0.000	0	0.6643
g__Escherichia.s__Escherichia_marmotae	0.000	0	0.0502

8.3.7.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-6284", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	50%	75%	100%
g__Escherichia.s__Escherichia_coli	1	1	1.0000
g__Citrobacter.s__Citrobacter_braakii	0	0	1.0000
g__Klebsiella.s__Klebsiella_oxytoca	0	0	1.0000
g__Leclercia.s__Leclercia_adecarboxylata	0	0	1.0000
unclassified	0	0	1.0000
g__Citrobacter.s__Citrobacter_freundii	0	0	0.9148
g__Enterobacter.s__Enterobacter_mori	0	0	0.6450
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0.4175
g__Kosakonia.s__Kosakonia_sacchari	0	0	0.2467
g__Escherichia.s__Escherichia_fergusonii	0	0	0.0186

8.3.7.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-6284", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-6284: superpathway of unsaturated fatty acids biosynthesis (E. coli)` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	50%	75%	100%
g__Escherichia.s__Escherichia_coli	0.2299	1.0	1.0000
unclassified	0.0000	0.5	1.0000
g__Streptococcus.s__Streptococcus_oralis	0.0000	0.0	1.0000
g__Enterobacter.s__Enterobacter_bugandensis	0.0000	0.0	0.7701
g__Enterococcus.s__Enterococcus_faecium	0.0000	0.0	0.5860
g__Klebsiella.s__Klebsiella_pneumoniae	0.0000	0.0	0.4828
g__Enterococcus.s__Enterococcus_durans	0.0000	0.0	0.4140

8.3.8 PWY-5367: petroselinate biosynthesis

8.3.8.1 Czech training cohort

Open code


res <- stratCZ %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
g__Blautia.s__Ruminococcus_torques	0.1836	0.9822	1.0000	1.0000
g__Escherichia.s__Escherichia_coli	0.0000	0.0000	0.0596	1.0000
g__Citrobacter.s__Citrobacter_freundii	0.0000	0.0000	0.0000	1.0000
unclassified	0.0000	0.0000	0.0000	1.0000
g__Enterococcus.s__Enterococcus_faecalis	0.0000	0.0000	0.0000	0.3800
g__Klebsiella.s__Klebsiella_pneumoniae	0.0000	0.0000	0.0000	0.1882
g__Enterobacter.s__Enterobacter_roggenkampii	0.0000	0.0000	0.0000	0.1190
g__Streptococcus.s__Streptococcus_pneumoniae	0.0000	0.0000	0.0000	0.1078
g__Escherichia.s__Escherichia_fergusonii	0.0000	0.0000	0.0000	0.0843
g__Escherichia.s__Escherichia_marmotae	0.0000	0.0000	0.0000	0.0726

8.3.8.2 Italian training cohort

Open code


res <- stratIT %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Italian training cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	50%	75%	100%
g__Escherichia.s__Escherichia_coli	0.6316	1	1.0000
g__Blautia.s__Ruminococcus_torques	0.0000	0	1.0000
g__Escherichia.s__Escherichia_marmotae	0.0000	0	1.0000
g__Leclercia.s__Leclercia_adecarboxylata	0.0000	0	1.0000
unclassified	0.0000	0	1.0000
g__Enterobacter.s__Enterobacter_roggenkampii	0.0000	0	0.9279
g__Enterobacter.s__Enterobacter_mori	0.0000	0	0.7381
g__Citrobacter.s__Citrobacter_freundii	0.0000	0	0.7378
g__Citrobacter.s__Citrobacter_braakii	0.0000	0	0.4592
g__Klebsiella.s__Klebsiella_oxytoca	0.0000	0	0.4285

8.3.8.3 Czech validation cohort

Open code


res <- stratVAL %>%
  filter(grepl("PWY-5367", `# Pathway`)) %>%
  mutate(`# Pathway` = sub(".*\\|", "", `# Pathway`)) %>%
  {
    mat <- select(., -`# Pathway`)
    mat_prop <- sweep(mat, 2, colSums(mat), "/")
    quants <- t(apply(mat_prop, 1, function(row) {
      quantile(row, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    }))
    cbind(`# Pathway` = .$`# Pathway`, quants)
  } %>%
  as.data.frame()

colnames(res)[-1] <- c("0%", "25%", "50%", "75%", "100%")

res[, -1] <- lapply(res[, -1], as.numeric)

kableExtra::kable(
  res %>%
    arrange(desc(`50%`), desc(`75%`), desc(`100%`)) %>%
    filter(`100%` > 0) %>%
    head(10) %>%
    mutate(across(-`# Pathway`, ~ round(., 4))),
  caption = "Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row 'unclassified' indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa"
)

Top 10 taxa contributing to pathway `PWY-5367: petroselinate biosynthesis` in the Czech validation cohort. Values represent the distribution (0th, 25th, 50th, 75th, 100th percentiles) of each taxon’s proportional contribution to the total pathway abundance across samples. The row ‘unclassified’ indicates the fraction of the pathway abundance that HUMAnN could not assign to specific taxa
# Pathway	25%	50%	75%	100%
g__Blautia.s__Ruminococcus_torques	1	1	1	1.0000
unclassified	0	0	0	1.0000
g__Enterococcus.s__Enterococcus_hirae	0	0	0	0.8935
g__Enterobacter.s__Enterobacter_bugandensis	0	0	0	0.8379
g__Escherichia.s__Escherichia_coli	0	0	0	0.6883
g__Enterococcus.s__Enterococcus_faecium	0	0	0	0.5535
g__Enterococcus.s__Enterococcus_durans	0	0	0	0.4465
g__Klebsiella.s__Klebsiella_oxytoca	0	0	0	0.3625
g__Klebsiella.s__Klebsiella_pneumoniae	0	0	0	0.3454
g__Escherichia.s__Escherichia_fergusonii	0	0	0	0.3117

9 Reproducibility

Open code

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Prague
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] vegan_2.6-6.1         lattice_0.22-5        permute_0.9-7        
##  [4] zCompositions_1.5.0-4 truncnorm_1.0-8       NADA_1.6-1.1         
##  [7] survival_3.7-0        MicrobiomeStat_1.2    glmnet_4.1-8         
## [10] pROC_1.18.0           arm_1.12-2            lme4_1.1-35.5        
## [13] Matrix_1.7-0          MASS_7.3-65           car_3.1-2            
## [16] carData_3.0-5         emmeans_1.10.4        brms_2.21.0          
## [19] Rcpp_1.0.13           rms_6.8-1             Hmisc_5.1-3          
## [22] glmmTMB_1.1.9         ggtext_0.1.2          ggdist_3.3.2         
## [25] cowplot_1.1.1         ggpubr_0.4.0          sjPlot_2.8.16        
## [28] kableExtra_1.4.0      flextable_0.9.6       gtsummary_2.0.2      
## [31] compositions_2.0-8    janitor_2.2.0         stringi_1.8.7        
## [34] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
## [37] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
## [40] tidyr_1.3.1           tibble_3.3.0          ggplot2_3.5.1        
## [43] tidyverse_1.3.1       readxl_1.3.1          openxlsx_4.2.8       
## [46] RJDBC_0.2-10          rJava_1.0-11          DBI_1.2.3            
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.6                matrixStats_1.3.0       httr_1.4.2             
##   [4] insight_0.20.2          numDeriv_2016.8-1.1     tools_4.4.3            
##   [7] backports_1.5.0         utf8_1.2.6              sjlabelled_1.2.0       
##  [10] R6_2.6.1                mgcv_1.9-1              withr_3.0.2            
##  [13] Brobdingnag_1.2-7       prettyunits_1.2.0       gridExtra_2.3          
##  [16] bayesm_3.1-6            quantreg_5.98           cli_3.6.5              
##  [19] textshaping_0.3.6       performance_0.12.2      officer_0.6.6          
##  [22] sandwich_3.0-1          labeling_0.4.2          mvtnorm_1.1-3          
##  [25] robustbase_0.93-9       polspline_1.1.25        ggridges_0.5.3         
##  [28] askpass_1.1             QuickJSR_1.3.1          commonmark_1.9.1       
##  [31] systemfonts_1.0.4       StanHeaders_2.32.10     foreign_0.8-90         
##  [34] gfonts_0.2.0            svglite_2.1.3           rstudioapi_0.16.0      
##  [37] httpcode_0.3.0          generics_0.1.4          shape_1.4.6            
##  [40] distributional_0.4.0    zip_2.2.0               inline_0.3.19          
##  [43] loo_2.4.1               abind_1.4-5             lifecycle_1.0.4        
##  [46] multcomp_1.4-18         yaml_2.3.10             snakecase_0.11.1       
##  [49] grid_4.4.3              promises_1.2.0.1        crayon_1.5.3           
##  [52] haven_2.4.3             pillar_1.11.0           knitr_1.50             
##  [55] statip_0.2.3            boot_1.3-31             estimability_1.5.1     
##  [58] codetools_0.2-19        glue_1.7.0              V8_4.4.2               
##  [61] fontLiberation_0.1.0    data.table_1.15.4       vctrs_0.6.5            
##  [64] cellranger_1.1.0        gtable_0.3.0            assertthat_0.2.1       
##  [67] datawizard_0.12.2       xfun_0.52               mime_0.12              
##  [70] coda_0.19-4             modeest_2.4.0           timeDate_3043.102      
##  [73] iterators_1.0.14        statmod_1.4.36          TH.data_1.1-0          
##  [76] nlme_3.1-168            fontquiver_0.2.1        rstan_2.32.6           
##  [79] fBasics_4041.97         tensorA_0.36.2.1        TMB_1.9.14             
##  [82] rpart_4.1.24            colorspace_2.0-2        nnet_7.3-20            
##  [85] tidyselect_1.2.1        processx_3.8.4          timeSeries_4032.109    
##  [88] compiler_4.4.3          curl_6.4.0              rvest_1.0.2            
##  [91] htmlTable_2.4.0         SparseM_1.81            xml2_1.3.3             
##  [94] fontBitstreamVera_0.1.1 posterior_1.6.0         checkmate_2.3.2        
##  [97] scales_1.3.0            DEoptimR_1.0-10         callr_3.7.6            
## [100] spatial_7.3-15          digest_0.6.37           minqa_1.2.4            
## [103] rmarkdown_2.27          htmltools_0.5.8.1       pkgconfig_2.0.3        
## [106] base64enc_0.1-3         stabledist_0.7-2        dbplyr_2.1.1           
## [109] fastmap_1.2.0           rlang_1.1.6             htmlwidgets_1.6.4      
## [112] shiny_1.9.1             farver_2.1.0            zoo_1.8-9              
## [115] jsonlite_2.0.0          magrittr_2.0.3          Formula_1.2-4          
## [118] bayesplot_1.8.1         munsell_0.5.0           gdtools_0.3.7          
## [121] stable_1.1.6            plyr_1.8.6              pkgbuild_1.3.1         
## [124] parallel_4.4.3          ggrepel_0.9.5           sjmisc_2.8.10          
## [127] ggeffects_1.7.0         splines_4.4.3           gridtext_0.1.5         
## [130] hms_1.1.3               sjstats_0.19.0          ps_1.7.7               
## [133] uuid_1.0-3              markdown_1.13           ggsignif_0.6.3         
## [136] stats4_4.4.3            rmutil_1.1.10           rstantools_2.1.1       
## [139] crul_1.5.0              reprex_2.0.1            evaluate_1.0.4         
## [142] RcppParallel_5.1.8      modelr_0.1.8            nloptr_2.0.0           
## [145] tzdb_0.5.0              foreach_1.5.2           httpuv_1.6.5           
## [148] MatrixModels_0.5-3      openssl_1.4.6           clue_0.3-65            
## [151] broom_1.0.6             xtable_1.8-4            rstatix_0.7.0          
## [154] later_1.3.0             viridisLite_0.4.0       ragg_1.4.0             
## [157] lmerTest_3.1-3          cluster_2.1.8.1         timechange_0.3.0       
## [160] bridgesampling_1.1-2

1 Introduction

1.1 Statistical Methods

2 Initiation

2.1 Set home directory

2.2 Upload initiation file

3 Data

3.1 Upload and epxlore

3.1.1 Get pathways

3.1.2 Italian data

3.1.3 Czech data

3.1.4 Validation data

3.1.5 Merging data

3.1.6 Merge training and validation dataset

3.2 Explore

3.2.0.1 Distributions

3.2.0.2 Pathways accross groups

4 Linear models across pathways

4.1 Select and wrangle data

4.1.1 Define number of pathways and covariates

4.1.2 Create empty objects

4.2 Run linear models over pathways

4.3 Results table

4.3.1 Adjust p values

4.3.2 Result: show and save

5 Elastic net

5.1 Prepare data for glmnet

5.2 Fit model

5.3 See results

5.3.1 Model summary

5.3.2 ROC curve - internal validation

5.3.3 Estimated coefficients

5.3.4 Plot of coefficients

6 External validation

6.1 Diet discrimination (elastic net)

6.1.1 Get table of weights, means and SDs

6.1.2 Identify shared and missing predictors

6.1.3 Standardize data in validation set

6.1.4 Get predicted value

6.2 Result of external validation

6.2.1 ROC curve

6.2.2 Table

6.3 Diet effect across datasets (forest plot)

6.3.1 Linear model in validation cohort

6.3.1.1 Define number of pathways and covariates

6.3.1.2 Create empty objects

6.3.1.3 Linear models per outcome

6.3.1.4 Results table

6.3.2 Forest plot

6.3.2.1 Data preparation

6.3.2.2 Plotting

6.3.3 Boxplot

7 Linear model VG duration

7.1 Get data

7.1.1 Training

7.1.2 Validation

7.2 Training cohort

7.2.1 Select data

7.2.2 Define number of pathways and covariates

7.2.3 Create empty objects

7.2.4 Estimate over outcomes

7.2.5 Show and save results

7.3 Validation cohort

7.3.0.1 Define number of pathways and covariates

7.3.0.2 Create empty objects

7.3.0.3 Estimate over outcomes

7.3.0.4 Results table

7.4 Forest plot

7.4.1 Prepare data

7.4.1.1 Plotting

8 Summary info about Patways

8.1 Proportions of unmapped and unintegrated

8.1.1 Czech trainng cohort

8.1.2 Italian training cohort

8.1.3 Czech validation cohort

8.2 Taxon-assigned vs. taxon-unassigned pathways

8.2.1 Czech trainng cohort

8.2.2 Italian trainng cohort

8.2.3 Validation cohort

8.3 Taxa contribution to diet-sensitive pathways

8.3.1 PWY-5367: petroselinate biosynthesis

5.1 Prepare data for `glmnet`