Vegan-specific signature implies healthier metabolic profile: findings from diet-related multi-omics observational study based on different European populations

Statistical report for metabolom analysis


Authors and affiliations

Anna Ouradova1,*, Giulio Ferrero2,3,*, Miriam Bratova4, Nikola Daskova4, Alena Bohdanecka4,5, Klara Dohnalova6, Marie Heczkova4, Karel Chalupsky6, Maria Kralova7,8, Marek Kuzma9, István Modos4, Filip Tichanek4, Lucie Najmanova9, Barbara Pardini10, Helena Pelantová9, Sonia Tarallo10, Petra Videnska11, Jan Gojda1,#, Alessio Naccarati10,#, Monika Cahova4,#


* These authors have contributed equally to this work and share first authorship
# These authors have contributed equally to this work and share last authorship

1 Department of Internal Medicine, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic
2 Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
3 Department of Computer Science, University of Turin, Turin, Italy
4 Institute for Clinical and Experimental Medicine, Prague, Czech Republic
5 First Faculty of Medicine, Charles University, Prague, Czech Republic
6 Czech Centre for Phenogenomics, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic
7 Ambis University, Department of Economics and Management, Prague, Czech Republic
8 Department of Informatics, Brno University of Technology, Brno, Czech Republic
9 Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic
10 Italian Institute for Genomic Medicine (IIGM), c/o IRCCS Candiolo, Turin, Italy
11 Mendel University, Department of Chemistry and Biochemistry, Brno, Czech Republic


This is a statistical report of the study A vegan diet signature from a multi-omics study on different European populations is related to favorable metabolic outcomes that is currenlty under review

When using this code or data, cite the original publication:

TO BE ADDED

BibTex citation for the original publication:

TO BE ADDED


Original GitHub repository: https://github.com/filip-tichanek/ItCzVegans

Statistical reports can be found on the reports hub.

Data analysis is described in detail in the statistical methods report.


1 Introduction

This project explores potential signatures of a vegan diet across the microbiome, metabolome, and lipidome. We used data from healthy vegan and omnivorous human subjects in two countries (Czech Republic and Italy), with subjects grouped by Country and Diet, resulting in four distinct groups.

To assess the generalizability of these findings, we validated our results with an independent cohort from the Czech Republic for external validation.

1.1 Statistical Methods

The statistical modeling approach is described in detail in this report. Briefly, the methods used included:

  • Multivariate analysis: We conducted multivariate analyses (PERMANOVA, PCA, correlation analyses) to explore the effects of diet, country, and their possible interaction (diet : country) on the microbiome, lipidome, and metabolome compositions in an integrative manner. This part of the analysis is not available on the GitHub page, but the code will be provided upon request.

  • Linear models: Linear models were applied to estimate the effects of diet, country, and their interaction (diet:country) on individual lipids, metabolites, bacterial taxa and pathways (“features”). Features that significantly differed between diet groups (based on the estimated average conditional effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were further examined in the independent external validation cohort to assess whether these associations were reproducible. Next, we fit linear models restricted to vegan participants to test whether omics profiles varied with the duration of vegan diet. Fixed-effect predictors were diet duration (per 10 years), country, their interaction, and age (included due to correlation with diet duration).

  • Predictive models (elastic net): We employed elastic net logistic regression (via the glmnet R package) to predict vegan status based on metabolome, lipidome, microbiome and pathways data (one model per dataset; four models in total). We considered three combinations of Lasso and Ridge penalties (alpha = 0, 0.2, 0.4). For each alpha, we selected the penalty strength (λ1se) using 10-fold cross-validation. This value corresponds to the most regularized model whose performance was within one standard error of the minimum deviance. The alpha–lambda pair with the lowest deviance was chosen to fit the final model, whose coefficients are reported.
    To estimate model performance, we repeated the full modeling procedure (including hyperparameter tuning) 500 times on bootstrap resamples of the training data. In each iteration, the model was trained on the resampled data and evaluated on the out-of-bag subjects (i.e., those not included in the training set in that iteration). The mean, and 2.5th, and 97.5th percentiles of the resulting ROC-AUC values represent the estimated out-of-sample AUC and its 95% confidence interval.
    Finally, the final model was applied to an independent validation cohort to generate predicted probabilities of vegan status. These probabilities were then used to assess external discrimination between diet groups (ROC-AUC in the independent validation cohort). The elastic net models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. It also offered a complementary perspective on which features are most clearly associated with diet

2 Initiation

2.1 Set home directory

Open code
setwd('/home/ticf/1_ticf_sec/478_MOCA_italian/')

2.2 Upload initiation file

Open code
source('478_initiation.R')

3 Data

3.1 Upload all original data

3.1.1 Training set

Open code
data_metabolites_original <- read.xlsx('gitignore/data/serum_metabolome_training_cohort.xlsx')

3.1.2 Validation set

Open code
data_metabolites_validation <- read.xlsx('gitignore/data/Serum_metabolome_Validation_cohort_new.xlsx') %>% 
  select(-X38)

3.1.3 Merge training and validation dataset

Open code
tr1 <- data_metabolites_original %>% 
  mutate(Data = if_else(Country == 'CZ', 'CZ_tr', 'IT_tr')) %>% 
  select(Data, Diet, `formate`:`2-hydroxybutyrate`)

tr2 <- data_metabolites_validation %>% 
  mutate(Data = 'valid',
         Diet = if_else(SKUPINA == 0 , 'OMNI', 'VEGAN')) %>% 
  select(Data, Diet, `formate`:`2-hydroxybutyrate`)

data_merged <- bind_rows(tr1, tr2)

3.2 Explore

3.2.1 Distributions - raw data

The following plot will show distribution of 36 randomly selected metabolites

Open code
check <- data_metabolites_original %>% 
  dplyr::select(
    `formate`: `2-hydroxybutyrate`
    ) %>% 
  na.omit()


size = c(6,6)
par(mfrow = c(size[1],size[2]))
par(mar=c(2,1.5,2,0.5))
set.seed(16)

for(x in 1:ncol(check)){
  hist(check[,x], 
       16, 
       col='blue', 
       main = paste0(colnames(check)[x])
  )
}
  

Data seems to be highly right-tailed

3.2.2 Distribution - Log2 transformed

Open code

par(mfrow = c(size[1],size[2]))
par(mar=c(2,1.5,2,0.5))
set.seed(16)

for(x in 1:ncol(check)){
  hist(log2(check[,x]+1e-8), 
       16, 
       col='blue', 
       main = paste0('log2',colnames(check)[x])
  )
}

Seems more symmetrical and Gaussian

3.2.2.1 Comparison training vs validation cohort

Open code
tr1 <- data_metabolites_original %>% 
  select(formate:`2-hydroxybutyrate`) %>% 
  mutate(dataset = 'training')
  
tr2 <- data_metabolites_validation %>% select(formate:`2-hydroxybutyrate`) %>% 
  mutate(dataset = 'validation')

tr <- bind_rows(tr1, tr2)

size = c(6,6)
par(mfrow = c(size[1],size[2]))
par(mar = c(2,1.5,2,0.5))
par(mgp = c(3, 0.5, 0 ))

for(x in 1:(ncol(tr)-1)){
  plot(log2(tr[, x]) ~ factor(tr$dataset),
       main = paste0(colnames(check)[x]),
       ylim = c(-24, -10)
  )
}

3.2.2.2 Metabolites accross groups

Open code
colo <- c('#329243', '#F9FFAF')

outcomes <- data.frame(
  variable = data_merged %>% 
    select(formate:`2-hydroxybutyrate`) %>% 
    colnames())

boxplot_cond <- function(variable) {
  
  
  p <- ggboxplot(data_merged, 
                 x = 'Diet', 
                 y = variable, 
                 fill = 'Diet', 
                 tip.length = 0.15,
                 palette = colo,
                 outlier.shape = 1,
                 lwd = 0.25,
                 outlier.size = 0.8,
                 facet.by = 'Data',
                 title = variable,
                 ylab = 'Metabolite level') +
    
    theme(
      plot.title = element_text(size = 10), 
      axis.title = element_text(size = 8),  
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    ) 
    
  
  return(p)
}

# Plot all outcomes
plots <- map(outcomes$variable, boxplot_cond)

# Create a matrix of plots
plots_arranged <- ggarrange(plotlist = plots, ncol = 5, nrow = 7,  common.legend = TRUE)
plots_arranged

Levels of metabolites across all 3 cohorts (Czech and Italian training cohorts and an independent Czech valdiation cohort) and across dietary groups

4 Linear models across metabolites

We will fit a feature-specific linear model where the log2-transformed metabolite represents the outcome variable whereas country (Italy vs Czech), diet (vegan vs omnivore), and their interaction (country:diet) all represent fixed-effects predictors. So, each model has following form

\[ log_{2}(\text{metabolite level}) = \alpha + \beta_{1} \times \text{country} + \beta_{2} \times \text{diet} + \beta_{3} \times \text{country:diet} + \epsilon \]

The variables were coded as follows: \(diet = -0.5\) for omnivores and \(diet = 0.5\) for vegans; \(country = -0.5\) for the Czech cohort and \(country = 0.5\) for the Italian cohort.
This parameterization allows us to interpret the linear model summary output as presenting the conditional effects of diet averaged across both countries and the conditional effects of country averaged across both diet groups. We will then use the emmeans package (Lenth 2024) to obtain specific estimates for the effect of diet in the Italian and Czech cohorts separately, still from a single model.

Metabolites that will show a significant diet effect (average effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) will be then visualized using a forest plot, with country-specific diet effect along with diet effect based on independent validation cohort, to evaluate how generalizable are these findings.

4.1 Define transformation function for each dataset

Given the distribution of the estimated metabolites concentrations, we will use log2-transformed values

Open code
trans_metabolite <- function(x){
  log2(x + 1e-8)
}

4.2 Select and wrangle data

Open code
data_analysis_metabolom <- data_metabolites_original %>%
  na.omit() %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 0.5, -0.5
      )
    ),
    Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    ),
    dplyr::across(
      `formate`: `2-hydroxybutyrate`, ~ trans_metabolite(.)
    )
  ) %>%
  dplyr::select(
    Sample,
    Country,
    Country_IT,
    Diet,
    Diet_VEGAN,
    Group,
    dplyr::everything()
  )

summary(data_analysis_metabolom[ , 1:12])
##     Sample            Country            Country_IT           Diet          
##  Length:173         Length:173         Min.   :-0.50000   Length:173        
##  Class :character   Class :character   1st Qu.:-0.50000   Class :character  
##  Mode  :character   Mode  :character   Median :-0.50000   Mode  :character  
##                                        Mean   :-0.04913                     
##                                        3rd Qu.: 0.50000                     
##                                        Max.   : 0.50000                     
##    Diet_VEGAN          Group              formate         tryptophan    
##  Min.   :-0.50000   Length:173         Min.   :-20.03   Min.   :-17.69  
##  1st Qu.:-0.50000   Class :character   1st Qu.:-19.59   1st Qu.:-17.32  
##  Median : 0.50000   Mode  :character   Median :-19.25   Median :-17.17  
##  Mean   : 0.07803                      Mean   :-18.98   Mean   :-17.18  
##  3rd Qu.: 0.50000                      3rd Qu.:-18.30   3rd Qu.:-17.07  
##  Max.   : 0.50000                      Max.   :-17.93   Max.   :-16.10  
##  phenylalanine      histidine         tyrosine         glucose      
##  Min.   :-16.70   Min.   :-17.31   Min.   :-17.17   Min.   :-13.00  
##  1st Qu.:-16.34   1st Qu.:-16.93   1st Qu.:-16.55   1st Qu.:-12.61  
##  Median :-16.25   Median :-16.80   Median :-16.34   Median :-12.50  
##  Mean   :-16.23   Mean   :-16.80   Mean   :-16.36   Mean   :-12.47  
##  3rd Qu.:-16.12   3rd Qu.:-16.67   3rd Qu.:-16.20   3rd Qu.:-12.34  
##  Max.   :-15.76   Max.   :-16.16   Max.   :-15.25   Max.   :-11.58

4.2.1 Define number of metabolites and covariates

Open code
n_covarites <- 6
n_features <- ncol(data_analysis_metabolom) - n_covarites

4.2.2 Create empty objects

Open code
outcome <- vector('double', n_features)
log2FD_VGdiet_inCZ <- vector('double', n_features)
log2FD_VGdiet_inIT <- vector('double', n_features)
log2FD_VGdiet_avg <- vector('double', n_features)

log2FD_ITcountry_avg <- vector('double', n_features)
diet_country_int <- vector('double', n_features)


P_VGdiet_inCZ <- vector('double', n_features)
P_VGdiet_inIT <- vector('double', n_features)
P_VGdiet_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)
P_diet_country_int <- vector('double', n_features)

CI_L_VGdiet_inCZ <- vector('double', n_features)
CI_L_VGdiet_inIT <- vector('double', n_features)
CI_L_VGdiet_avg <- vector('double', n_features)

CI_U_VGdiet_inCZ <- vector('double', n_features)
CI_U_VGdiet_inIT <- vector('double', n_features)
CI_U_VGdiet_avg <- vector('double', n_features)

4.3 Run linear models over metabolites

Open code
for (i in 1:n_features) {
  
  ## define variable
  data_analysis_metabolom$outcome <- data_analysis_metabolom[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Country_IT * Diet_VEGAN, data = data_analysis_metabolom)

  ## get contrast (effects of diet BY COUNTRY)
  contrast_emm <- summary(
    pairs(
      emmeans(
        model,
        specs = ~ Diet_VEGAN | Country_IT
        ),
      interaction = TRUE,
      adjust = "none"
      ),
    infer = c(TRUE, TRUE)
    )

  ## save results
  outcome[i] <- names(data_analysis_metabolom)[i + n_covarites]
  
  ## country effect
  log2FD_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 1
  ]

  P_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 4
  ]
  
  
  ## diet effect
  tr <- confint(model)
  
  CI_L_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][1]
  CI_U_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][2]
  
  log2FD_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
  
  log2FD_VGdiet_inCZ[i] <- -contrast_emm$estimate[1]
  P_VGdiet_inCZ[i] <- contrast_emm$p.value[1]
  CI_L_VGdiet_inCZ[i] <- -contrast_emm$upper.CL[1]
  CI_U_VGdiet_inCZ[i] <- -contrast_emm$lower.CL[1]
  
  
  log2FD_VGdiet_inIT[i] <- -contrast_emm$estimate[2]
  P_VGdiet_inIT[i] <- contrast_emm$p.value[2]
  CI_L_VGdiet_inIT[i] <- -contrast_emm$upper.CL[2]
  CI_U_VGdiet_inIT[i] <- -contrast_emm$lower.CL[2]
  
  
  ## interaction
  diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 1
  ]

  P_diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 4
  ]
}

4.4 Results table

Open code
result_metabolom <- data.frame(
  outcome,
  log2FD_ITcountry_avg, P_ITcountry_avg,
  log2FD_VGdiet_avg, P_VGdiet_avg,
  log2FD_VGdiet_inCZ, P_VGdiet_inCZ,
  log2FD_VGdiet_inIT, P_VGdiet_inIT,
  diet_country_int, P_diet_country_int,
  CI_L_VGdiet_avg, CI_U_VGdiet_avg,
  CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
  CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
)

4.4.1 Adjust p values

Open code
result_metabolom <- result_metabolom %>% 
  dplyr::mutate(
    fdr_ITcountry_avg = p.adjust(P_ITcountry_avg, method = 'BH'),
    fdr_VGdiet_avg = p.adjust(P_VGdiet_avg, method = 'BH'),
    fdr_VGdiet_inCZ = p.adjust(P_VGdiet_inCZ, method = 'BH'),
    fdr_VGdiet_inIT = p.adjust(P_VGdiet_inIT, method = 'BH'),
    fdr_diet_country_int = p.adjust(P_diet_country_int, method = 'BH')
  ) %>% 
  dplyr::select(
    outcome,
    log2FD_ITcountry_avg, P_ITcountry_avg, fdr_ITcountry_avg,
    log2FD_VGdiet_avg, P_VGdiet_avg, fdr_VGdiet_avg,
    log2FD_VGdiet_inCZ, P_VGdiet_inCZ, fdr_VGdiet_inCZ,
    log2FD_VGdiet_inIT, P_VGdiet_inIT, fdr_VGdiet_inIT,
    diet_country_int, P_diet_country_int, fdr_diet_country_int,
    CI_L_VGdiet_avg, CI_U_VGdiet_avg,
    CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
    CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
  )

4.4.2 Result: show and save

Open code
kableExtra::kable(result_metabolom,
                  caption = "Results of linear models, modeling the log2-transformed level of a given metabolite with `Diet`, `Country`, and `Diet:Country` interaction as fixed-effect predictors. Only metabolites with significantly different levels between diet groups are shown (FDR < 0.05, average effect across both countries). `log2FD` prefix: denotes estimated effects (regression coefficients), i.e., how much log2-transformed metabolite levels differ in vegans compared to omnivores, and between Italian and Czech cohorts, respectively; `P`: p-value; `fdr`: p-value after adjustment for multiple comparisons; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval, respectively. `avg` suffix denotes effects averaged across subgroups, whereas `inCZ` and `inIT` denote effects in the Czech or Italian cohort, respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of a vegan diet between Italian and Czech cohorts; positive values indicate a stronger effect in the Italian, negative values a stronger effect in the Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model."
                  ) 
Results of linear models, modeling the log2-transformed level of a given metabolite with Diet, Country, and Diet:Country interaction as fixed-effect predictors. Only metabolites with significantly different levels between diet groups are shown (FDR < 0.05, average effect across both countries). log2FD prefix: denotes estimated effects (regression coefficients), i.e., how much log2-transformed metabolite levels differ in vegans compared to omnivores, and between Italian and Czech cohorts, respectively; P: p-value; fdr: p-value after adjustment for multiple comparisons; CI_L and CI_U: lower and upper bounds of the 95% confidence interval, respectively. avg suffix denotes effects averaged across subgroups, whereas inCZ and inIT denote effects in the Czech or Italian cohort, respectively. Interaction effects are reported as diet_country_int (difference in the effect of a vegan diet between Italian and Czech cohorts; positive values indicate a stronger effect in the Italian, negative values a stronger effect in the Czech cohort) and P_diet_country_int (its p-value). All estimates in a single row are based on a single model.
outcome log2FD_ITcountry_avg P_ITcountry_avg fdr_ITcountry_avg log2FD_VGdiet_avg P_VGdiet_avg fdr_VGdiet_avg log2FD_VGdiet_inCZ P_VGdiet_inCZ fdr_VGdiet_inCZ log2FD_VGdiet_inIT P_VGdiet_inIT fdr_VGdiet_inIT diet_country_int P_diet_country_int fdr_diet_country_int CI_L_VGdiet_avg CI_U_VGdiet_avg CI_L_VGdiet_inCZ CI_U_VGdiet_inCZ CI_L_VGdiet_inIT CI_U_VGdiet_inIT
formate 1.2756132 0.0000000 0.0000000 0.0725783 0.0091069 0.0187830 0.0739017 0.0529521 0.1027894 0.0712550 0.0756434 0.1313806 -0.0026467 0.9616865 0.9665166 0.0182748 0.1268819 -0.0009521 0.1487555 -0.0074370 0.1499469
tryptophan 0.1587058 0.0000004 0.0000010 -0.0568079 0.0599258 0.0988776 -0.0129552 0.7543948 0.8584493 -0.1006607 0.0217582 0.0578906 -0.0877055 0.1455672 0.4003097 -0.1160166 0.0024007 -0.0945703 0.0686600 -0.1864607 -0.0148607
phenylalanine 0.1149355 0.0000180 0.0000397 -0.1006116 0.0001588 0.0004765 -0.0957169 0.0084039 0.0240306 -0.1055062 0.0057720 0.0183972 -0.0097893 0.8511236 0.9665166 -0.1520154 -0.0492077 -0.1665736 -0.0248602 -0.1799961 -0.0310163
histidine 0.1021990 0.0003020 0.0005862 -0.0754526 0.0071189 0.0156616 -0.0925386 0.0164038 0.0386661 -0.0583666 0.1477098 0.2321153 0.0341720 0.5381061 0.8181237 -0.1301250 -0.0207801 -0.1679009 -0.0171763 -0.1375931 0.0208600
tyrosine 0.2172448 0.0000002 0.0000008 -0.1764605 0.0000212 0.0000777 -0.0757827 0.1746785 0.2834809 -0.2771382 0.0000045 0.0000296 -0.2013554 0.0135152 0.0637147 -0.2560839 -0.0968370 -0.1855382 0.0339727 -0.3925214 -0.1617549
glucose 0.2042069 0.0000000 0.0000000 0.0294804 0.2268282 0.3118888 -0.0087509 0.7942477 0.8667242 0.0677117 0.0562146 0.1030602 0.0764626 0.1175782 0.3527345 -0.0184981 0.0774589 -0.0748861 0.0573842 -0.0018145 0.1372379
mannose -0.3851135 0.0000000 0.0000000 -0.0075879 0.8573508 0.9126637 0.0116376 0.8414869 0.8677834 -0.0268134 0.6612302 0.7212081 -0.0384510 0.6488879 0.8922208 -0.0907960 0.0756202 -0.1030591 0.1263343 -0.1473912 0.0937644
threonine 0.1072997 0.0036942 0.0060954 -0.0322902 0.3768737 0.4606234 0.0918177 0.0693572 0.1204626 -0.1563982 0.0035036 0.0128466 -0.2482159 0.0008255 0.0206291 -0.1042358 0.0396553 -0.0073544 0.1909898 -0.2606553 -0.0521410
lactate 0.1141207 0.1079126 0.1424446 -0.0025189 0.9715852 0.9982752 -0.0570212 0.5587577 0.7091925 0.0519835 0.6120868 0.7212081 0.1090047 0.4412625 0.7664033 -0.1419096 0.1368719 -0.2491619 0.1351194 -0.1500093 0.2539763
glycerol -0.4783502 0.0000000 0.0000000 0.2280676 0.0020679 0.0052493 -0.0065122 0.9484003 0.9484003 0.4626473 0.0000208 0.0001142 0.4691595 0.0015470 0.0206291 0.0841685 0.3719666 -0.2048672 0.1918428 0.2541216 0.6711731
glycine 0.0526110 0.3695555 0.4355475 0.3841430 0.0000000 0.0000000 0.4860745 0.0000000 0.0000003 0.2822116 0.0010648 0.0043925 -0.2038629 0.0831272 0.3166966 0.2687063 0.4995798 0.3269527 0.6451962 0.1149308 0.4494924
ornithine -0.0186175 0.6573762 0.6779192 -0.0100253 0.8111914 0.8923105 0.1173914 0.0436688 0.0900669 -0.1374420 0.0248718 0.0578906 -0.2548334 0.0027346 0.0206291 -0.0927415 0.0726909 0.0033727 0.2314101 -0.2573070 -0.0175770
lysine -0.0777062 0.0216447 0.0340130 -0.3392074 0.0000000 0.0000000 -0.2376291 0.0000007 0.0000123 -0.4407856 0.0000000 0.0000000 -0.2031565 0.0028285 0.0206291 -0.4053835 -0.2730312 -0.3288485 -0.1464098 -0.5366823 -0.3448890
asparagine 0.2022381 0.0000535 0.0001103 0.2937644 0.0000000 0.0000001 0.2917136 0.0000246 0.0001355 0.2958151 0.0000458 0.0002160 0.0041015 0.9665166 0.9665166 0.1974656 0.3900631 0.1589724 0.4244549 0.1562675 0.4353628
dimethylamine -0.1062383 0.4391315 0.4830447 0.1352541 0.3249083 0.4123836 0.0444494 0.8141955 0.8667242 0.2260589 0.2564313 0.3679232 0.1816096 0.5083386 0.8181237 -0.1351864 0.4056947 -0.3283346 0.4172333 -0.1658397 0.6179576
2-oxoisocaproate 0.0644592 0.2543679 0.3108941 -0.2423710 0.0000288 0.0000949 -0.2580569 0.0010967 0.0045238 -0.2266851 0.0061324 0.0183972 0.0313719 0.7811114 0.9546918 -0.3536323 -0.1311097 -0.4114231 -0.1046908 -0.3879151 -0.0654550
citrate 0.0520531 0.4040612 0.4597938 0.3502631 0.0000001 0.0000004 0.1636893 0.0580469 0.1064194 0.5368368 0.0000000 0.0000002 0.3731476 0.0031256 0.0206291 0.2274184 0.4731077 -0.0056437 0.3330222 0.3588213 0.7148524
glutamine -0.1170194 0.0000073 0.0000172 0.1192102 0.0000050 0.0000208 0.1628214 0.0000061 0.0000519 0.0755991 0.0406035 0.0802104 -0.0872223 0.0863718 0.3166966 0.0692988 0.1691217 0.0940218 0.2316209 0.0032719 0.1479264
pyruvate 0.2836625 0.0000053 0.0000135 0.0001306 0.9982752 0.9982752 0.0366745 0.6597014 0.8063017 -0.0364133 0.6774985 0.7212081 -0.0730878 0.5454158 0.8181237 -0.1189406 0.1192017 -0.1274570 0.2008059 -0.2089607 0.1361341
acetone -1.4752586 0.0000000 0.0000000 -0.0857535 0.5067286 0.5766222 -0.1237970 0.4868652 0.6694397 -0.0477100 0.7986838 0.8236427 0.0760870 0.7682203 0.9546918 -0.3401821 0.1686751 -0.4745096 0.2269155 -0.4164055 0.3209855
proline -0.0601208 0.0731143 0.1005321 0.0514584 0.1245750 0.1868625 0.0983343 0.0338050 0.0743709 0.0045826 0.9245413 0.9245413 -0.0937517 0.1615414 0.4100667 -0.0143548 0.1172717 0.0076151 0.1890534 -0.0907882 0.0999534
acetate 0.3309603 0.0000003 0.0000010 0.1071815 0.0868641 0.1365007 0.2629752 0.0025311 0.0092805 -0.0486121 0.5905828 0.7212081 -0.3115873 0.0132534 0.0637147 -0.0156776 0.2300407 0.0936222 0.4323282 -0.2266487 0.1294245
alanine 0.4407706 0.0000000 0.0000000 0.1112648 0.0306749 0.0532774 0.0946521 0.1803970 0.2834809 0.1278775 0.0857030 0.1414099 0.0332254 0.7452635 0.9546918 0.0104878 0.2120418 -0.0442621 0.2335663 -0.0181596 0.2739146
3-hydroxybutyrate -0.9479525 0.0000000 0.0000000 0.1342617 0.3112437 0.4108417 0.0713201 0.6959938 0.8202785 0.1972033 0.3047362 0.4057239 0.1258833 0.6345903 0.8922208 -0.1266980 0.3952214 -0.2883952 0.4310353 -0.1809565 0.5753632
ethanol -0.2601231 0.0000000 0.0000000 0.0923122 0.0297442 0.0532774 0.0591555 0.3096327 0.4442557 0.1254689 0.0413205 0.0802104 0.0663134 0.4321880 0.7664033 0.0091778 0.1754466 -0.0554396 0.1737506 0.0049979 0.2459400
2-propanol -0.0020939 0.9419149 0.9419149 0.0390849 0.1749891 0.2510713 0.0447128 0.2599048 0.3898572 0.0334570 0.4221865 0.5358521 -0.0112558 0.8447466 0.9665166 -0.0175624 0.0957323 -0.0333717 0.1227974 -0.0486314 0.1155454
2-oxoisovalerate  -0.1728101 0.0013380 0.0023239 -0.1361641 0.0110225 0.0213967 -0.1937298 0.0087384 0.0240306 -0.0785983 0.3073666 0.4057239 0.1151315 0.2787369 0.6123968 -0.2407424 -0.0315857 -0.3378839 -0.0495757 -0.2301440 0.0729474
3-methyl-2-oxovalerate 0.1784157 0.0008928 0.0016367 0.0388866 0.4620118 0.5445139 0.0458970 0.5287354 0.6979307 0.0318761 0.6771906 0.7212081 -0.0140209 0.8944265 0.9665166 -0.0652427 0.1430158 -0.0976381 0.1894321 -0.1190188 0.1827711
3-hydroxyisobutyrate 0.1641638 0.0458431 0.0657749 -0.2772298 0.0008489 0.0023345 -0.2834668 0.0126634 0.0321455 -0.2709927 0.0231637 0.0578906 0.0124741 0.9391682 0.9665166 -0.4383273 -0.1161322 -0.5055288 -0.0614047 -0.5044411 -0.0375444
valine 0.0182953 0.5905666 0.6286677 -0.2501700 0.0000000 0.0000000 -0.2117525 0.0000113 0.0000745 -0.2885876 0.0000000 0.0000003 -0.0768351 0.2592746 0.6111473 -0.3171717 -0.1831684 -0.3041098 -0.1193953 -0.3856805 -0.1914947
leucine 0.0539532 0.1315259 0.1669368 -0.2200006 0.0000000 0.0000000 -0.1882463 0.0001764 0.0008316 -0.2517549 0.0000024 0.0000201 -0.0635086 0.3737022 0.7254219 -0.2902826 -0.1497186 -0.2851253 -0.0913673 -0.3536014 -0.1499084
isoleucine 0.0721086 0.0328621 0.0492931 -0.0953559 0.0049898 0.0117617 -0.1304241 0.0053289 0.0175853 -0.0602878 0.2162240 0.3243359 0.0701363 0.2969197 0.6123968 -0.1615209 -0.0291909 -0.2216280 -0.0392201 -0.1561683 0.0355927
2-hydroxybutyrate -0.2965328 0.0000000 0.0000000 -0.2371837 0.0000029 0.0000138 -0.3151485 0.0000063 0.0000519 -0.1592189 0.0263139 0.0578906 0.1559296 0.1136270 0.3527345 -0.3339619 -0.1404054 -0.4485507 -0.1817463 -0.2994614 -0.0189764
Open code

if(file.exists('gitignore/lm_results/result_metabolom.csv') == FALSE){
  write.table(result_metabolom, 
              'gitignore/lm_results/result_metabolom.csv', 
              row.names = FALSE)
  }

5 Elastic net

To assess the predictive power of metabolome features on diet strategy, we employed Elastic Net logistic regression.

As we expected very high level of co-linearity, we allowed \(alpha\) to rather small (0, 0.2 or 0.4).

The performance of the predictive models was evaluated through their capacity of discriminate between vegan and omnivore diets, using out-of-sample area under ROC curve (AUC; estimated with out-of-bag bootstrap) as the measure of discriminatory capacity.

All features were transformed by 2 standard deviations (resulting in standard deviation of 0.5)

5.1 Prepare data for glmnet

Open code
data_metabolites_glmnet <- data_metabolites_original %>%
  dplyr::mutate(
    vegan = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 1, 0
      )
    ),
    dplyr::across(
      `formate`:`2-hydroxybutyrate`, ~ arm::rescale(trans_metabolite(.))
    )
  ) %>% 
  dplyr::select(
    Sample, vegan, Country, `formate`:`2-hydroxybutyrate`
  )

dim(data_metabolites_glmnet)
## [1] 173  36
names(data_metabolites_glmnet)
##  [1] "Sample"                 "vegan"                  "Country"               
##  [4] "formate"                "tryptophan"             "phenylalanine"         
##  [7] "histidine"              "tyrosine"               "glucose"               
## [10] "mannose"                "threonine"              "lactate"               
## [13] "glycerol"               "glycine"                "ornithine"             
## [16] "lysine"                 "asparagine"             "dimethylamine"         
## [19] "2-oxoisocaproate"       "citrate"                "glutamine"             
## [22] "pyruvate"               "acetone"                "proline"               
## [25] "acetate"                "alanine"                "3-hydroxybutyrate"     
## [28] "ethanol"                "2-propanol"             "2-oxoisovalerate "     
## [31] "3-methyl-2-oxovalerate" "3-hydroxyisobutyrate"   "valine"                
## [34] "leucine"                "isoleucine"             "2-hydroxybutyrate"
summary(data_metabolites_glmnet)
##     Sample              vegan         Country             formate       
##  Length:173         Min.   :0.000   Length:173         Min.   :-0.8030  
##  Class :character   1st Qu.:0.000   Class :character   1st Qu.:-0.4692  
##  Mode  :character   Median :1.000   Mode  :character   Median :-0.2044  
##                     Mean   :0.578                      Mean   : 0.0000  
##                     3rd Qu.:1.000                      3rd Qu.: 0.5165  
##                     Max.   :1.000                      Max.   : 0.8004  
##    tryptophan       phenylalanine       histidine            tyrosine       
##  Min.   :-1.21480   Min.   :-1.2954   Min.   :-1.352559   Min.   :-1.36217  
##  1st Qu.:-0.33244   1st Qu.:-0.3114   1st Qu.:-0.349523   1st Qu.:-0.32851  
##  Median : 0.03462   Median :-0.0632   Median :-0.006086   Median : 0.02603  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.28038   3rd Qu.: 0.2953   3rd Qu.: 0.351011   3rd Qu.: 0.26732  
##  Max.   : 2.59340   Max.   : 1.2582   Max.   : 1.704357   Max.   : 1.86951  
##     glucose            mannose           threonine           lactate        
##  Min.   :-1.44184   Min.   :-1.57714   Min.   :-1.91626   Min.   :-0.98660  
##  1st Qu.:-0.37787   1st Qu.:-0.30430   1st Qu.:-0.32355   1st Qu.:-0.32800  
##  Median :-0.08058   Median : 0.08151   Median :-0.04016   Median :-0.03976  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.34563   3rd Qu.: 0.38405   3rd Qu.: 0.27712   3rd Qu.: 0.20942  
##  Max.   : 2.37649   Max.   : 0.94062   Max.   : 1.57739   Max.   : 1.86847  
##     glycerol           glycine           ornithine           lysine        
##  Min.   :-1.86801   Min.   :-2.19120   Min.   :-1.5540   Min.   :-1.82097  
##  1st Qu.:-0.31759   1st Qu.:-0.28974   1st Qu.:-0.2927   1st Qu.:-0.30989  
##  Median : 0.02681   Median : 0.05298   Median : 0.0154   Median : 0.04735  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.30485   3rd Qu.: 0.30088   3rd Qu.: 0.3300   3rd Qu.: 0.35728  
##  Max.   : 1.25029   Max.   : 1.21701   Max.   : 1.9208   Max.   : 1.55589  
##    asparagine       dimethylamine     2-oxoisocaproate      citrate        
##  Min.   :-1.19714   Min.   :-1.0765   Min.   :-1.21974   Min.   :-1.55282  
##  1st Qu.:-0.36044   1st Qu.:-0.3972   1st Qu.:-0.36202   1st Qu.:-0.26115  
##  Median : 0.02313   Median :-0.1378   Median : 0.01584   Median :-0.03012  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.32250   3rd Qu.: 0.4977   3rd Qu.: 0.35055   3rd Qu.: 0.28205  
##  Max.   : 1.41408   Max.   : 1.0813   Max.   : 1.28311   Max.   : 1.37139  
##    glutamine           pyruvate          acetone            proline        
##  Min.   :-1.83024   Min.   :-1.3249   Min.   :-1.55383   Min.   :-1.52820  
##  1st Qu.:-0.29243   1st Qu.:-0.3740   1st Qu.:-0.29407   1st Qu.:-0.32477  
##  Median : 0.03193   Median :-0.0523   Median : 0.07351   Median :-0.02069  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.33934   3rd Qu.: 0.3684   3rd Qu.: 0.31267   3rd Qu.: 0.32296  
##  Max.   : 1.00878   Max.   : 1.2458   Max.   : 1.10845   Max.   : 1.48558  
##     acetate             alanine          3-hydroxybutyrate    ethanol        
##  Min.   :-0.980091   Min.   :-1.602400   Min.   :-1.3186   Min.   :-1.13922  
##  1st Qu.:-0.323562   1st Qu.:-0.411117   1st Qu.:-0.3224   1st Qu.:-0.37898  
##  Median :-0.001196   Median :-0.004515   Median :-0.1249   Median : 0.04808  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.276962   3rd Qu.: 0.337241   3rd Qu.: 0.2263   3rd Qu.: 0.33658  
##  Max.   : 2.673149   Max.   : 1.326617   Max.   : 1.5052   Max.   : 1.46759  
##    2-propanol        2-oxoisovalerate    3-methyl-2-oxovalerate
##  Min.   :-1.339002   Min.   :-1.352854   Min.   :-1.59828      
##  1st Qu.:-0.277101   1st Qu.:-0.349204   1st Qu.:-0.30949      
##  Median :-0.009172   Median :-0.003996   Median :-0.03688      
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.00000      
##  3rd Qu.: 0.304907   3rd Qu.: 0.340219   3rd Qu.: 0.32342      
##  Max.   : 1.645117   Max.   : 1.496604   Max.   : 1.36931      
##  3-hydroxyisobutyrate     valine            leucine            isoleucine      
##  Min.   :-1.42810     Min.   :-1.08398   Min.   :-1.393193   Min.   :-1.33229  
##  1st Qu.:-0.35778     1st Qu.:-0.38337   1st Qu.:-0.311148   1st Qu.:-0.33120  
##  Median :-0.02365     Median : 0.01007   Median : 0.005775   Median :-0.02574  
##  Mean   : 0.00000     Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.28404     3rd Qu.: 0.33301   3rd Qu.: 0.343020   3rd Qu.: 0.29964  
##  Max.   : 1.83272     Max.   : 1.96794   Max.   : 2.566032   Max.   : 1.48801  
##  2-hydroxybutyrate 
##  Min.   :-0.96906  
##  1st Qu.:-0.35015  
##  Median :-0.05846  
##  Mean   : 0.00000  
##  3rd Qu.: 0.32582  
##  Max.   : 1.40274

5.2 Fit model

Open code

modelac <- "elanet_metabolit_all"

assign(
  modelac,
  run(
    expr = clust_glmnet_sep(
      data = data_metabolites_glmnet,
      outcome = "vegan",
      clust_id = "Sample",
      sample_method = "oos_boot",
      N = 500,
      alphas = c(0, 0.2, 0.4),
      family = "binomial",
      seed = 478
    ),
    path = paste0("gitignore/run/", modelac)
  )
)

5.3 See results

5.3.1 Model summary

Open code
elanet_metabolit_all$model_summary
##   alpha     lambda       auc auc_OutOfSample auc_oos_CIL auc_oos_CIU  accuracy
## 1     0 0.02730688 0.9757534       0.9493977   0.9002391   0.9890267 0.9248555
##   accuracy_OutOfSample accuracy_oos_CIL accuracy_oos_CIU
## 1            0.8897059        0.8175096        0.9516129
elanet_metabolit_all$country_AUC
##   auc_OutOfSample_IT auc_oos_CIL_IT auc_oos_CIU_IT auc_OutOfSample_CZ
## 1          0.9502619      0.8583088              1          0.9497413
##   auc_oos_CIL_CZ auc_oos_CIU_CZ
## 1      0.8826595              1

5.3.2 ROC curve - internal validation

Open code
elanet_metabolit_all$plot

Receiver operating characteristic (ROC) curves from a random subset of out-of-bag bootstrap iterations, illustrating the model’s ability to discriminate between vegan and omnivore status based on standardized log2(metabolite) levels. Each curve corresponds to an elastic net model trained on a bootstrap resample and evaluated on the subjects not included in that iteration (i.e., out-of-bag samples). The curve plots the true positive rate (sensitivity) against the false positive rate (1 – specificity) across thresholds of predicted vegan probability. The area under the curve (AUC) quantifies overall discriminatory performance, with values closer to 1 indicating stronger separation between groups.

5.3.3 Estimated coefficients

Open code
elanet_metabolit_all$betas
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                 s0
## (Intercept)             0.51654945
## formate                -0.23781875
## tryptophan              0.09514574
## phenylalanine          -0.26041520
## histidine              -0.46961530
## tyrosine               -0.40207966
## glucose                -0.06094302
## mannose                 0.09576230
## threonine              -0.29336072
## lactate                -0.03716616
## glycerol                0.40862126
## glycine                 0.89462176
## ornithine               0.15526294
## lysine                 -1.29145285
## asparagine              0.82351717
## dimethylamine           0.06858337
## 2-oxoisocaproate       -0.19919398
## citrate                 0.38672431
## glutamine               0.63824650
## pyruvate                0.06344946
## acetone                -0.14041313
## proline                 0.25009447
## acetate                 0.29136570
## alanine                 0.16148569
## 3-hydroxybutyrate       0.34663556
## ethanol                 0.19167449
## 2-propanol             -0.07389282
## 2-oxoisovalerate       -0.01538734
## 3-methyl-2-oxovalerate  0.51986873
## 3-hydroxyisobutyrate   -0.40782423
## valine                 -0.53956958
## leucine                -0.36156524
## isoleucine              0.08659878
## 2-hydroxybutyrate      -0.45713854

5.3.4 Plot of coefficients

Open code
elacoef <- data.frame(
  metabolite = row.names(elanet_metabolit_all$betas),
  beta_ela = elanet_metabolit_all$betas[, 1]
) %>%
  arrange(abs(beta_ela)) %>%
  filter(abs(beta_ela) > 0,
         !grepl('Intercept', metabolite)) %>%
  mutate(metabolite = factor(metabolite, levels = metabolite)) %>% 
  mutate(outcome = as.character(metabolite))

plotac <- "elanet_beta_metabolite"
path <- "gitignore/figures"

assign(plotac, elacoef %>% 
  ggplot(
    aes(
      x = metabolite,
      y = beta_ela
    )
  ) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  labs(
    y = "Standardized beta coefficients",
    x = "Metabolite"
  ) +
  theme_minimal() +
  coord_flip() + 
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {
    ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 5,
    height = 6
  )
}

get(plotac)

Regression coefficients from the elastic net model predicting vegan diet strategy based on log2-transformed and standardized metabolite levels. Metabolites are ordered by the magnitude of the standardized coefficients, indicating their relative importance in distinguishing between the diet groups. The sign of each coefficient indicates the direction of association with vegan diet status, with positive values indicating a higher probability of vegan status and vice versa. Metabolites whose effects were shrunk to zero are not shown.

6 External validation

External validation was performed with an independent Czech cohort.

As a first step, we will use the previously developed and internally validated elastic net model to predict vegan status in the independent Czech cohort. The validation data will be standardized using the mean and standard deviation of each metabolite from the training cohort to ensure comparability across datasets. For each subject in the external validation cohort, we will estimate the predicted probability of being vegan using the elastic net model. This predicted probability will then be used as a variable to discriminate between the diet groups in the independent cohort.

In a 2nd step, we will look at metabolites that significantly differed between diet groups (average vegan diet effect across both countries, FDR < 0.05) estimated with linear models (one per metabolite) with training cohort. Then we will fit linear models also for external validation cohort. Effect of vegan diet on these metabolites will be shown along with 95% confidence interval for all cohorts: training Czech and Italian cohorts, but also in Czech independent (validating) cohort

6.1 Diet discrimination (elastic net)

6.1.0.1 Get table of weights, means and SDs

Open code

coefs_metabolom <- get_coef(
  original_data = data_analysis_metabolom,
  glmnet_model = elanet_metabolit_all)

coefs_metabolom
## # A tibble: 34 × 5
##    predictor     beta_scaled beta_OrigScale  mean     SD
##    <chr>               <dbl>          <dbl> <dbl>  <dbl>
##  1 (Intercept)        0.517         NA       NA   NA    
##  2 formate           -0.238         -0.312  -19.0  0.655
##  3 tryptophan         0.0951         0.0399 -17.2  0.209
##  4 phenylalanine     -0.260         -0.0959 -16.2  0.184
##  5 histidine         -0.470         -0.177  -16.8  0.189
##  6 tyrosine          -0.402         -0.239  -16.4  0.297
##  7 glucose           -0.0609        -0.0227 -12.5  0.186
##  8 mannose            0.0958         0.0632 -18.8  0.330
##  9 threonine         -0.293         -0.143  -16.0  0.244
## 10 lactate           -0.0372        -0.0336 -14.1  0.453
## # ℹ 24 more rows

6.1.0.2 Identify shared and missing predictors

Open code
## Colnames in validation set
names(data_metabolites_validation)
##  [1] "NazevSouboru"           "CisloPacienta"          "SkupinaNazev"          
##  [4] "SKUPINA"                "formate"                "tryptophan"            
##  [7] "phenylalanine"          "histidine"              "tyrosine"              
## [10] "glucose"                "mannose"                "threonine"             
## [13] "lactate"                "glycerol"               "glycine"               
## [16] "ornithine"              "lysine"                 "asparagine"            
## [19] "dimethylamine"          "2-oxoisocaproate"       "citrate"               
## [22] "glutamine"              "pyruvate"               "acetone"               
## [25] "proline"                "acetate"                "alanine"               
## [28] "3-hydroxybutyrate"      "ethanol"                "2-propanol"            
## [31] "2-oxoisovalerate "      "3-methyl-2-oxovalerate" "3-hydroxyisobutyrate"  
## [34] "valine"                 "leucine"                "isoleucine"            
## [37] "2-hydroxybutyrate"

## Which are missing in the validation set
missing <- setdiff(
  coefs_metabolom$predictor[-1], 
  colnames(
    data_metabolites_validation
    )
  )

missing
## character(0)

## Which are common with the validations et
common_predictors <- intersect(coefs_metabolom$predictor, colnames(data_metabolites_validation))

common_predictors
##  [1] "formate"                "tryptophan"             "phenylalanine"         
##  [4] "histidine"              "tyrosine"               "glucose"               
##  [7] "mannose"                "threonine"              "lactate"               
## [10] "glycerol"               "glycine"                "ornithine"             
## [13] "lysine"                 "asparagine"             "dimethylamine"         
## [16] "2-oxoisocaproate"       "citrate"                "glutamine"             
## [19] "pyruvate"               "acetone"                "proline"               
## [22] "acetate"                "alanine"                "3-hydroxybutyrate"     
## [25] "ethanol"                "2-propanol"             "2-oxoisovalerate "     
## [28] "3-methyl-2-oxovalerate" "3-hydroxyisobutyrate"   "valine"                
## [31] "leucine"                "isoleucine"             "2-hydroxybutyrate"

6.1.0.3 Standardize data in validation set

Open code
data_metabolites_validation_pred <- data_metabolites_validation %>%
  dplyr::mutate(
    vegan = if_else(
      SKUPINA == 1, 1, 0
    )
  ) %>%
  dplyr::select(
    vegan,
    dplyr::all_of(common_predictors)
  ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = trans_metabolite
      )
    ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      - coefs_metabolom$mean[
        match(
          cur_column(), 
          coefs_metabolom$predictor
          )
        ]
      )
    ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      / coefs_metabolom$SD[
        match(
          cur_column(), 
          coefs_metabolom$predictor
          )
        ]
      )
    )

data_metabolites_validation_pred %>% names()
##  [1] "vegan"                  "formate"                "tryptophan"            
##  [4] "phenylalanine"          "histidine"              "tyrosine"              
##  [7] "glucose"                "mannose"                "threonine"             
## [10] "lactate"                "glycerol"               "glycine"               
## [13] "ornithine"              "lysine"                 "asparagine"            
## [16] "dimethylamine"          "2-oxoisocaproate"       "citrate"               
## [19] "glutamine"              "pyruvate"               "acetone"               
## [22] "proline"                "acetate"                "alanine"               
## [25] "3-hydroxybutyrate"      "ethanol"                "2-propanol"            
## [28] "2-oxoisovalerate "      "3-methyl-2-oxovalerate" "3-hydroxyisobutyrate"  
## [31] "valine"                 "leucine"                "isoleucine"            
## [34] "2-hydroxybutyrate"

6.1.0.4 Get predicted value

Open code
elanet_metabolit_all$fit
## 
## Call:  glmnet::glmnet(x = original_predictors, y = original_outcome,      family = family, alpha = optim_par$alpha[1], lambda = optim_par$lamb_1se[1],      standardize = standardize) 
## 
##   Df  %Dev  Lambda
## 1 33 60.94 0.02731
newx <- as.matrix(data_metabolites_validation_pred[,-1])

tr <- data_metabolites_validation_pred %>% 
  dplyr::mutate(
    predicted_logit = as.numeric(
      predict(
        elanet_metabolit_all$fit, 
        newx = newx
        )
      )
    ) %>% 
  dplyr::mutate(
    predicted = inv_logit(predicted_logit)
  )

6.1.1 Result of external validation

6.1.1.1 ROC curve

Open code
roc_metabolite <- pROC::roc(
  vegan ~ predicted_logit,
  data = tr,
  direction = "<",
  levels = c(0, 1),
  ci = TRUE
)

roc_metabolite
## 
## Call:
## roc.formula(formula = vegan ~ predicted_logit, data = tr, direction = "<",     levels = c(0, 1), ci = TRUE)
## 
## Data: predicted_logit in 49 controls (vegan 0) < 91 cases (vegan 1).
## Area under the curve: 0.9123
## 95% CI: 0.8588-0.9658 (DeLong)

plotac <- "roc_metabolite_pl"
path <- "gitignore/figures"

assign(plotac, ggroc(roc_metabolite))
get(plotac)

Receiver operating characteristics (ROC) curve showing the model’s ability to discriminate between vegan and omnivore status in the external validation Czech cohort, based on log2-transformed metabolite levels. The curve plots the true positive rate (sensitivity) against the true positive rate (specificity) at various thresholds of predicted vegan status, as estimated from the elastic net model developed on the training data. The area under the curve (AUC) represents the model’s overall performance, with values closer to 1 indicating stronger discrimination.
Open code

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 6,
    height = 4.5
  )
}

6.1.1.2 Table

Open code
mod <- elanet_metabolit_all

trainAUC <- mean(mod[["valid_performances"]]$auc_resamp_test)
trainCI <- quantile(mod[["valid_performances"]]$auc_resamp_test,
  probs = c(1 / 40, 39 / 40))

res <- data.frame(
  alpha = c(mod$model_summary$alpha, rep("", 4)),
  lambda = c(round(mod$model_summary$lambda, 4), rep("", 4)),
  `performance type` = c(
 "Training set AUC",
"Out-of-sample AUC (all)",
"Out-of-sample AUC (Czech)",
"Out-of-sample AUC (Italy)",
"External validation AUC"
  ),
  `performance [95% CI]` = c(
    sprintf("%.3f [%.3f to %.3f]", trainAUC, trainCI[1], trainCI[2]),
    sprintf("%.3f [%.3f to %.3f]",
            mod$model_summary$auc_OutOfSample,
            mod$model_summary$auc_oos_CIL,
            mod$model_summary$auc_oos_CIU),
    sprintf("%.3f [%.3f to %.3f]",
            mod$country_AUC$auc_OutOfSample_CZ,
            mod$country_AUC$auc_oos_CIL_CZ,
            mod$country_AUC$auc_oos_CIU_CZ),
    sprintf("%.3f [%.3f to %.3f]",
            mod$country_AUC$auc_OutOfSample_IT,
            mod$country_AUC$auc_oos_CIL_IT,
            mod$country_AUC$auc_oos_CIU_IT),
    sprintf("%.3f [%.3f to %.3f]",
            roc_metabolite[["ci"]][2],
            roc_metabolite[["ci"]][1],
            roc_metabolite[["ci"]][3])
  )
)


kableExtra::kable(
  res %>% mutate(across(where(is.numeric), ~ round(.x, 3))),
  caption = 'Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using log2‑transformed metabolite levels. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized `alpha` (mixing parameter) and `lambda` (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method.'
)
Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using log2‑transformed metabolite levels. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized alpha (mixing parameter) and lambda (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method.
alpha lambda performance.type performance..95..CI.
0 0.0273 Training set AUC 0.988 [0.970 to 1.000]
Out-of-sample AUC (all) 0.949 [0.900 to 0.989]
Out-of-sample AUC (Czech) 0.950 [0.883 to 1.000]
Out-of-sample AUC (Italy) 0.950 [0.858 to 1.000]
External validation AUC 0.912 [0.859 to 0.966]

6.2 Diet effect across datasets (forest plot)

Similarly as in training data cohorts, we will fit linear model per each of the selected metabolite level (\(log_{2}\) - transformed), with a single fixed effect factor of diet.

6.2.1 Linear model in validation cohort

Open code
data_analysis_metabolom <- data_metabolites_validation %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        SKUPINA == 1, 1, 0
      )
    ),
    dplyr::across(
      `formate`:`2-hydroxybutyrate`, ~ trans_metabolite(.)
    )
  ) %>%
  dplyr::select(
    Diet_VEGAN,
    dplyr::everything()
  )

summary(data_analysis_metabolom[, 1:12])
##    Diet_VEGAN   NazevSouboru       CisloPacienta      SkupinaNazev      
##  Min.   :0.00   Length:140         Length:140         Length:140        
##  1st Qu.:0.00   Class :character   Class :character   Class :character  
##  Median :1.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.65                                                           
##  3rd Qu.:1.00                                                           
##  Max.   :1.00                                                           
##     SKUPINA        formate         tryptophan     phenylalanine   
##  Min.   :0.00   Min.   :-21.07   Min.   :-19.59   Min.   :-17.22  
##  1st Qu.:0.00   1st Qu.:-18.04   1st Qu.:-16.79   1st Qu.:-15.66  
##  Median :1.00   Median :-17.90   Median :-16.67   Median :-15.57  
##  Mean   :0.65   Mean   :-17.93   Mean   :-16.70   Mean   :-15.56  
##  3rd Qu.:1.00   3rd Qu.:-17.79   3rd Qu.:-16.57   3rd Qu.:-15.45  
##  Max.   :1.00   Max.   :-17.40   Max.   :-16.14   Max.   :-14.97  
##    histidine         tyrosine         glucose          mannose      
##  Min.   :-18.05   Min.   :-19.05   Min.   :-15.29   Min.   :-20.47  
##  1st Qu.:-16.63   1st Qu.:-15.92   1st Qu.:-12.24   1st Qu.:-17.69  
##  Median :-16.48   Median :-15.77   Median :-12.14   Median :-17.56  
##  Mean   :-16.47   Mean   :-15.79   Mean   :-12.16   Mean   :-17.54  
##  3rd Qu.:-16.32   3rd Qu.:-15.60   3rd Qu.:-12.04   3rd Qu.:-17.35  
##  Max.   :-15.75   Max.   :-15.12   Max.   :-11.77   Max.   :-16.69

6.2.1.1 Define number of metabolites and covariates

Open code
n_covarites <- 5
n_features <- ncol(data_analysis_metabolom) - n_covarites

6.2.1.2 Create empty objects

Open code
outcome <- vector('double', n_features)
log2FD_VGdiet <- vector('double', n_features)
P_VGdiet <- vector('double', n_features)
CI_L_VGdiet <- vector('double', n_features)
CI_U_VGdiet <- vector('double', n_features)

6.2.1.3 Linear models per outcome

Open code
for (i in 1:n_features) {
  ## define variable
  data_analysis_metabolom$outcome <- data_analysis_metabolom[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_VEGAN, data = data_analysis_metabolom)

  ## save results
  outcome[i] <- names(data_analysis_metabolom)[i + n_covarites]

  ## extract diet effect
  tr <- confint(model)

  CI_L_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][1]
  CI_U_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][2]

  log2FD_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
}

6.2.1.4 Results table

Open code
## relevant metabolites
diet_sensitive_metabolites <- result_metabolom %>%
  filter(
    fdr_VGdiet_avg < 0.05
  ) %>%
  select(
    outcome
  )

len <- nrow(diet_sensitive_metabolites)

result_metabolom_val <- data.frame(
  outcome,
  log2FD_VGdiet, P_VGdiet,
  CI_L_VGdiet, CI_U_VGdiet
) %>% 
  filter(outcome %in% diet_sensitive_metabolites$outcome)

kableExtra::kable(result_metabolom_val,
                  caption = 'Results of linear models estimating the effect of diet on metabolite levels. Only metabolites that significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `log2FD` denotes the estimated effects (regression coefficient), indicating how much the log2-transformed metabolite levels differ between vegans and omnivores. `P`: p-value, `fdr`: p-value adjusted for multiple comparisons, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model') 
Results of linear models estimating the effect of diet on metabolite levels. Only metabolites that significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. log2FD denotes the estimated effects (regression coefficient), indicating how much the log2-transformed metabolite levels differ between vegans and omnivores. P: p-value, fdr: p-value adjusted for multiple comparisons, and CI_L and CI_U represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model
outcome log2FD_VGdiet P_VGdiet CI_L_VGdiet CI_U_VGdiet
formate 0.1397995 0.0134426 0.0294080 0.2501911
phenylalanine -0.0239133 0.5569092 -0.1042089 0.0563822
histidine 0.1928632 0.0000376 0.1033398 0.2823867
tyrosine -0.0529005 0.4108809 -0.1797119 0.0739110
glycerol 1.0869457 0.0000000 0.7652919 1.4085994
glycine 0.5251078 0.0000000 0.3845024 0.6657132
lysine -0.0603541 0.3180455 -0.1794414 0.0587332
asparagine 0.2260915 0.0000543 0.1187963 0.3333867
2-oxoisocaproate -0.2146024 0.0001050 -0.3208361 -0.1083687
citrate 0.4309305 0.0000001 0.2809046 0.5809563
glutamine 0.3275636 0.0000000 0.2157616 0.4393657
2-oxoisovalerate  -0.1131207 0.1129090 -0.2533195 0.0270781
3-hydroxyisobutyrate -0.3644292 0.0013001 -0.5838998 -0.1449586
valine -0.0850854 0.0666411 -0.1760896 0.0059189
leucine -0.0029199 0.9647790 -0.1334316 0.1275918
isoleucine 0.1097519 0.1853915 -0.0532959 0.2727997
2-hydroxybutyrate -0.0803085 0.3548272 -0.2513515 0.0907345
Open code

if(file.exists('gitignore/lm_results/result_metabolom_validation.csv') == FALSE){
  write.table(result_metabolom_val,
              'gitignore/lm_results/result_metabolom_validation.csv', 
              row.names = FALSE)
  }

6.2.2 Forest plot

6.2.2.1 Data preparation

Open code

## subset result tables
result_metabolom_subset <- result_metabolom %>%
  filter(outcome %in% diet_sensitive_metabolites$outcome)

result_metabolom_val_subset <- result_metabolom_val %>%
  filter(outcome %in% diet_sensitive_metabolites$outcome)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_metabolites$outcome, 3),
  beta = c(
    result_metabolom_subset$log2FD_VGdiet_inCZ,
    result_metabolom_subset$log2FD_VGdiet_inIT,
    result_metabolom_val_subset$log2FD_VGdiet
  ),
  lower = c(
    result_metabolom_subset$CI_L_VGdiet_inCZ,
    result_metabolom_subset$CI_L_VGdiet_inIT,
    result_metabolom_val_subset$CI_L_VGdiet
  ),
  upper = c(
    result_metabolom_subset$CI_U_VGdiet_inCZ,
    result_metabolom_subset$CI_U_VGdiet_inIT,
    result_metabolom_val_subset$CI_U_VGdiet
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)

validation_order <- data_forest %>%
  group_by(outcome) %>%
  summarise(beta_mean = mean(beta), .groups = "drop") %>%
  arrange(beta_mean) %>%
  pull(outcome)

up_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(elacoef, by = 'outcome') %>% 
  filter(beta_CZ > 0,
         beta_IT > 0,
         lower_Validation > 0,
         beta_ela > 0.1) %>% 
  select(outcome)

down_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(elacoef, by = 'outcome') %>% 
  filter(beta_CZ < 0,
         beta_IT < 0,
         upper_Validation < 0,
         beta_ela < -0.1) %>% 
  select(outcome)

winners <- as.character(c(up_winners$outcome, down_winners$outcome))

 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
  left_join(
    elacoef %>% select(-metabolite), 
    by = 'outcome') %>% 
   mutate(outcome = factor(outcome, levels = validation_order))

6.2.2.2 Plotting

Open code
colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

plotac <- "forest_metabolite"
path <- "gitignore/figures"

assign(
  plotac, 
    ggplot(
      data_forest,
      aes(x = outcome, y = beta, ymin = lower, ymax = upper, color = dataset)) +
  geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
  scale_color_manual(values = colors) +
  labs(
    y = "Effect of vegan duration (+10y) on log2 lipid level",
    x = "Outcome",
    color = "Dataset"
  ) +
  theme_minimal() +
  coord_flip() +
  scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$outcome, "**"), 
             as.character(data_forest$outcome)
      ), data_forest$outcome
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 6,
    height = 8
  )
}

get(plotac)

The forest plot illustrates the effects of a vegan diet on the levels of selected log2-transformed metabolites, along with their 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech cohort (validation). Green, blue, and grey points/lines represent differences in log2-transformed metabolite levels between vegans and omnivores within the Italian cohort, Czech cohort, and Czech validation cohort, respectively. Positive values suggest a higher metabolite level in vegans compared to omnivores. Only metabolites that showed significant differences between vegan and omnivorous diets in training cohorts (FDR < 0.05, average effect across both countries) were selected, and these effects were further validated in the independent cohort. The estimates for the training cohorts were obtained from a single linear model that included Diet, Country, and the interaction term Diet x Country as fixed-effect predictors. In the independent Czech validation cohort, Diet was the only fixed-effect predictor. Metabolites validated in the linear model and showing predictive power in the elastic net model (|β| > 0.1) are bold

6.2.3 Boxplot

Open code
plotac <- "boxplot_metabolom"
path <- "gitignore/figures"

colo <- c('#F9FFAF','#329243')

data_merged_log2 <- data_merged %>%
  mutate(across(`formate`:`2-hydroxybutyrate`, ~ trans_metabolite(.)))

boxplot_cond <- function(variable) {
  
  p <- ggboxplot(data_merged_log2, 
                 x = 'Diet', 
                 y = variable, 
                 fill = 'Diet', 
                 tip.length = 0.15,
                 palette = colo,
                 outlier.shape = 1,
                 lwd = 0.25,
                 outlier.size = 0.8,
                 facet.by = 'Data',
                 title = variable,
                 ylab = 'Log2(metabolite level)') +
    
    theme(
      plot.title = element_text(size = 10), 
      axis.title = element_text(size = 8),  
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    ) 
  return(p)
}

# Plot all outcomes
plots <- map(diet_sensitive_metabolites$outcome, boxplot_cond)

# Create a matrix of plots
assign(plotac, 
       ggarrange(plotlist = plots, ncol = 3, nrow = 6,  common.legend = TRUE)
       )

get(plotac)

Log2-traqnsformed metabolites levels across all 3 cohorts (Czech and Italian training cohorts and an independent Czech valdiation cohort) and across dietary groups
Open code

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 6,
    height = 10
  )
}

7 Linear model VG duration

Next, we fit another series of linear models, this time modelling log2 metabolite levels using the following fixed-effect predictors: duration of vegan status (Diet_duration, scaled in tens of years), country, their interaction (Diet_duration × country), and age:

\[ \log_{2}(\text{metabolite level}) = \alpha + \beta_{1} \times \text{country} + \beta_{2} \times \text{diet duration} + \beta_{3} \times (\text{country}:\text{diet duration}) + \epsilon \]

This analysis includes only vegan participants, while omnivores are excluded. The goal was to examine whether metabolites that differ between vegans and non-vegans also vary within the vegan group itself, depending on how long participants have been vegan. In other words, we asked whether long-term vegans show stronger up- or down-regulation of diet-sensitive metabolites compared to those who adopted the diet more recently.

Because longer vegan duration is likely correlated with age (e.g. a 20-year-old cannot have 20 years of vegan history, unlike a 40-year-old), we also adjusted for age in the models.

7.1 Get data

7.1.1 Training

Open code
meta_trainIT <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 1)
meta_trainCZ <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 2) %>% 
  dplyr::mutate(ID = paste0('T', ID),
                Sex = SEX) %>% 
  dplyr::select(-SEX)

meta_trainIT[1:5,]
##       ID   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F
meta_trainCZ[1:5,]
##     ID   COHORT GRP Diet_duration      Age Sex
## 1  T97 CZ_train  OM             0 26.76438   M
## 2  T98 CZ_train  OM             0 27.61370   F
## 3 T134 CZ_train  OM             0 21.64384   F
## 4 T136 CZ_train  OM             0 31.25479   M
## 5 T137 CZ_train  OM             0 40.42192   F
data_meta_original <- bind_rows(meta_trainIT, meta_trainCZ) %>% 
  rename(`Sample` = `ID`)
  
data_meta_original[1:5,]
##   Sample   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F

gtsummary::tbl_summary(
  data_meta_original %>% 
    mutate(Category = paste0(COHORT, GRP)) %>% 
    select(Sex, Category),
  by = 'Category'
)

Characteristic

CZ_trainOM
N = 33

1

CZ_trainVG
N = 62

1

IT_trainOM
N = 40

1

IT_trainVG
N = 38

1
Sex



    F 17 (52%) 25 (40%) 24 (60%) 22 (58%)
    M 16 (48%) 37 (60%) 16 (40%) 16 (42%)
1

n (%)

Open code

data_metabolite_original2 <- data_metabolites_original %>% 
  left_join(data_meta_original, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:Sex, everything())

data_metabolite_original2 %>% dim()
## [1] 173  42

data_metabolite_original2[
  1:4, 
  (ncol(data_metabolite_original2)-10):ncol(data_metabolite_original2)
  ]
##        alanine 3-hydroxybutyrate      ethanol   2-propanol 2-oxoisovalerate 
## 1 0.0001213836      2.387548e-05 8.178577e-06 2.724559e-06      5.733548e-06
## 2 0.0001159496      9.752541e-06 7.881878e-06 2.636666e-06      6.229350e-06
## 3 0.0001374516      9.122670e-06 7.059981e-06 2.858835e-06      5.990228e-06
## 4 0.0001086865      1.477925e-05 7.276468e-06 2.215811e-06      4.624115e-06
##   3-methyl-2-oxovalerate 3-hydroxyisobutyrate       valine      leucine
## 1           6.041008e-06         2.418543e-06 7.093858e-05 6.090779e-05
## 2           5.920311e-06         4.066098e-06 6.269643e-05 6.538483e-05
## 3           5.437201e-06         4.733488e-06 7.109963e-05 6.308931e-05
## 4           3.964752e-06         2.978589e-06 5.723841e-05 5.201635e-05
##     isoleucine 2-hydroxybutyrate
## 1 1.384813e-05      7.989465e-06
## 2 1.612050e-05      1.078008e-05
## 3 1.423981e-05      1.042251e-05
## 4 1.250699e-05      1.032980e-05

data_metabolite_original2[1:4, 1:10]
##   Sample Country  Diet   COHORT GRP Diet_duration      Age Sex    Group
## 1   T119      CZ VEGAN CZ_train  VG           3.0 38.21644   M VEGAN_CZ
## 2   T120      CZ VEGAN CZ_train  VG           5.0 31.00000   M VEGAN_CZ
## 3   T126      CZ VEGAN CZ_train  VG           6.0 25.22740   M VEGAN_CZ
## 4   T127      CZ VEGAN CZ_train  VG           3.2 22.48219   F VEGAN_CZ
##        formate
## 1 1.363904e-06
## 2 1.579640e-06
## 3 1.065157e-06
## 4 1.337780e-06

7.1.2 Validation

Open code
data_meta_valid <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 3) %>% 
  rename(`Sample` = `ID`)

gtsummary::tbl_summary(
  data_meta_valid %>% 
    mutate(Category = GRP) %>% 
    select(SEX, Category),
  by = 'Category'
)

Characteristic

OM
N = 50

1

VN
N = 92

1
SEX

    F 25 (50%) 47 (51%)
    M 25 (50%) 45 (49%)
1

n (%)

Open code


data_metabolite_valid2 <- data_metabolites_validation %>% 
  mutate(Diet =  if_else(SKUPINA == 1, 'VEGAN', 'OMNI')) %>% 
  rename(`Sample` = `CisloPacienta`) %>% 
  select(Sample, Diet, everything()) %>% 
  select(-c(NazevSouboru:SKUPINA)) %>% 
  mutate(Sample = paste0('K', as.numeric(Sample))) %>% 
  left_join(data_meta_valid, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:SEX, dplyr::everything())

pats_val1 <- data_metabolites_validation %>% 
  mutate(Diet =  if_else(SKUPINA == 1, 'VEGAN', 'OMNI')) %>% 
  rename(`Sample` = `CisloPacienta`) %>% 
  select(Sample, Diet) %>% 
  mutate(Sample = paste0('K', as.numeric(Sample))) %>% 
  select(Sample) %>% pull

pats_val2 <- data_meta_valid$Sample

union(setdiff(pats_val1, pats_val2),
                 setdiff(pats_val2, pats_val1))
## [1] "K56"  "K209"

data_metabolite_valid2 %>% dim()
## [1] 140  40

data_metabolite_valid2[
  1:4, 
  (ncol(data_metabolite_valid2)-10):ncol(data_metabolite_valid2)
  ]
##        alanine 3-hydroxybutyrate      ethanol   2-propanol 2-oxoisovalerate 
## 1 0.0001321337      6.622714e-06 1.322319e-06 1.862368e-06      3.882214e-06
## 2 0.0001666794      2.636257e-05 1.950937e-06 2.478307e-06      4.176860e-06
## 3 0.0001372282      2.689236e-06 1.276704e-06 1.305063e-06      5.873894e-06
## 4 0.0002100940      3.813916e-06 1.582076e-06 1.988745e-06      4.902592e-06
##   3-methyl-2-oxovalerate 3-hydroxyisobutyrate       valine      leucine
## 1           7.514717e-06         6.263071e-06 8.426543e-05 7.687952e-05
## 2           5.784112e-06         2.784464e-06 5.376007e-05 5.273321e-05
## 3           9.832289e-06         4.093783e-06 8.149183e-05 7.399149e-05
## 4           6.437421e-06         2.800246e-06 6.654833e-05 5.688137e-05
##     isoleucine 2-hydroxybutyrate
## 1 1.584328e-05      1.076326e-05
## 2 1.230743e-05      9.433298e-06
## 3 1.906236e-05      7.830384e-06
## 4 1.341109e-05      7.831377e-06

data_metabolite_valid2[1:4, 1:10]
##   Sample Diet COHORT GRP Diet_duration      Age SEX      formate   tryptophan
## 1    K42 OMNI CZ_val  OM          <NA> 37.01096   M 3.411723e-06 9.782523e-06
## 2    K43 OMNI CZ_val  OM          <NA> 39.52603   F 3.632946e-06 7.101108e-06
## 3   K123 OMNI CZ_val  OM          <NA> 29.12603   M 4.213210e-06 1.087543e-05
## 4   K124 OMNI CZ_val  OM          <NA> 27.95616   F 3.643107e-06 9.685863e-06
##   phenylalanine
## 1  1.977202e-05
## 2  2.134558e-05
## 3  2.377540e-05
## 4  1.942135e-05

data_metabolite_valid2 %>% select(Sample, Diet, GRP)
##     Sample  Diet GRP
## 1      K42  OMNI  OM
## 2      K43  OMNI  OM
## 3     K123  OMNI  OM
## 4     K124  OMNI  OM
## 5     K126  OMNI  OM
## 6     K127  OMNI  OM
## 7     K130  OMNI  OM
## 8     K131  OMNI  OM
## 9     K143  OMNI  OM
## 10    K144  OMNI  OM
## 11    K150  OMNI  OM
## 12    K151  OMNI  OM
## 13    K154  OMNI  OM
## 14    K155  OMNI  OM
## 15    K175  OMNI  OM
## 16    K176  OMNI  OM
## 17    K178  OMNI  OM
## 18    K179  OMNI  OM
## 19    K182  OMNI  OM
## 20    K183  OMNI  OM
## 21    K194  OMNI  OM
## 22    K195  OMNI  OM
## 23    K198  OMNI  OM
## 24    K199  OMNI  OM
## 25    K201  OMNI  OM
## 26    K202  OMNI  OM
## 27    K210  OMNI  OM
## 28    K222  OMNI  OM
## 29    K223  OMNI  OM
## 30    K230  OMNI  OM
## 31    K231  OMNI  OM
## 32    K234  OMNI  OM
## 33    K235  OMNI  OM
## 34    K238  OMNI  OM
## 35    K239  OMNI  OM
## 36    K253  OMNI  OM
## 37    K254  OMNI  OM
## 38    K266  OMNI  OM
## 39    K267  OMNI  OM
## 40    K269  OMNI  OM
## 41    K270  OMNI  OM
## 42    K298  OMNI  OM
## 43    K299  OMNI  OM
## 44    K309  OMNI  OM
## 45    K310  OMNI  OM
## 46    K318  OMNI  OM
## 47    K319  OMNI  OM
## 48    K329  OMNI  OM
## 49    K330  OMNI  OM
## 50      K4 VEGAN  VN
## 51      K5 VEGAN  VN
## 52      K7 VEGAN  VN
## 53     K12 VEGAN  VN
## 54     K13 VEGAN  VN
## 55     K15 VEGAN  VN
## 56     K16 VEGAN  VN
## 57     K18 VEGAN  VN
## 58     K19 VEGAN  VN
## 59     K25 VEGAN  VN
## 60     K26 VEGAN  VN
## 61     K31 VEGAN  VN
## 62     K32 VEGAN  VN
## 63     K34 VEGAN  VN
## 64     K35 VEGAN  VN
## 65     K38 VEGAN  VN
## 66     K39 VEGAN  VN
## 67     K45 VEGAN  VN
## 68     K46 VEGAN  VN
## 69     K48 VEGAN  VN
## 70     K49 VEGAN  VN
## 71     K51 VEGAN  VN
## 72     K52 VEGAN  VN
## 73     K55 VEGAN  VN
## 74     K61 VEGAN  VN
## 75     K62 VEGAN  VN
## 76     K64 VEGAN  VN
## 77     K65 VEGAN  VN
## 78     K70 VEGAN  VN
## 79     K71 VEGAN  VN
## 80     K73 VEGAN  VN
## 81     K74 VEGAN  VN
## 82     K81 VEGAN  VN
## 83     K82 VEGAN  VN
## 84     K85 VEGAN  VN
## 85     K86 VEGAN  VN
## 86     K92 VEGAN  VN
## 87     K94 VEGAN  VN
## 88     K95 VEGAN  VN
## 89    K100 VEGAN  VN
## 90    K101 VEGAN  VN
## 91    K103 VEGAN  VN
## 92    K104 VEGAN  VN
## 93    K106 VEGAN  VN
## 94    K107 VEGAN  VN
## 95    K109 VEGAN  VN
## 96    K110 VEGAN  VN
## 97    K113 VEGAN  VN
## 98    K114 VEGAN  VN
## 99    K116 VEGAN  VN
## 100   K117 VEGAN  VN
## 101   K119 VEGAN  VN
## 102   K120 VEGAN  VN
## 103   K146 VEGAN  VN
## 104   K147 VEGAN  VN
## 105   K165 VEGAN  VN
## 106   K166 VEGAN  VN
## 107   K171 VEGAN  VN
## 108   K172 VEGAN  VN
## 109   K187 VEGAN  VN
## 110   K188 VEGAN  VN
## 111   K190 VEGAN  VN
## 112   K191 VEGAN  VN
## 113   K205 VEGAN  VN
## 114   K206 VEGAN  VN
## 115   K216 VEGAN  VN
## 116   K217 VEGAN  VN
## 117   K219 VEGAN  VN
## 118   K220 VEGAN  VN
## 119   K226 VEGAN  VN
## 120   K227 VEGAN  VN
## 121   K250 VEGAN  VN
## 122   K251 VEGAN  VN
## 123   K257 VEGAN  VN
## 124   K258 VEGAN  VN
## 125   K260 VEGAN  VN
## 126   K261 VEGAN  VN
## 127   K263 VEGAN  VN
## 128   K264 VEGAN  VN
## 129   K281 VEGAN  VN
## 130   K282 VEGAN  VN
## 131   K284 VEGAN  VN
## 132   K285 VEGAN  VN
## 133   K288 VEGAN  VN
## 134   K289 VEGAN  VN
## 135   K291 VEGAN  VN
## 136   K292 VEGAN  VN
## 137   K295 VEGAN  VN
## 138   K296 VEGAN  VN
## 139   K301 VEGAN  VN
## 140   K302 VEGAN  VN

data_metabolite_valid2 %>% select(Sample, Diet, GRP, Diet_duration) %>% 
  filter(
    (Diet == 'VEGAN' & GRP == 'OM') | (Diet == 'OMNI' & GRP == 'VN')
  )
## [1] Sample        Diet          GRP           Diet_duration
## <0 rows> (or 0-length row.names)

7.2 Training cohort

7.2.1 Select data

Open code
data_analysis <- data_metabolite_original2 %>%
    mutate(Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    )
  ) %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>% 
  select(Sample:Group, Country_IT, all_of(diet_sensitive_metabolites$outcome))

summary(data_analysis[ , 1:11])
##     Sample            Country              Diet              COHORT         
##  Length:98          Length:98          Length:98          Length:98         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      GRP            Diet_duration         Age            Sex           
##  Length:98          Min.   : 0.600   Min.   :18.25   Length:98         
##  Class :character   1st Qu.: 3.212   1st Qu.:27.70   Class :character  
##  Mode  :character   Median : 5.000   Median :32.56   Mode  :character  
##                     Mean   : 5.472   Mean   :35.00                     
##                     3rd Qu.: 6.000   3rd Qu.:40.31                     
##                     Max.   :20.000   Max.   :60.70                     
##     Group             Country_IT         formate         
##  Length:98          Min.   :-0.5000   Min.   :9.235e-07  
##  Class :character   1st Qu.:-0.5000   1st Qu.:1.265e-06  
##  Mode  :character   Median :-0.5000   Median :1.548e-06  
##                     Mean   :-0.1122   Mean   :2.059e-06  
##                     3rd Qu.: 0.5000   3rd Qu.:3.096e-06  
##                     Max.   : 0.5000   Max.   :3.915e-06

7.2.2 Define number of metabolite and covariates

Open code
n_covarites <- 10
n_features <- ncol(data_analysis) - n_covarites

7.2.3 Create empty objects

Open code
outcome <- vector('double', n_features)

logFD_Diet_duration_inCZ <- vector('double', n_features)
logFD_Diet_duration_inIT <- vector('double', n_features)
logFD_Diet_duration_avg <- vector('double', n_features)

logFD_ITcountry_avg <- vector('double', n_features)
diet_country_int <- vector('double', n_features)

logFD_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)

P_Diet_duration_inCZ <- vector('double', n_features)
P_Diet_duration_inIT <- vector('double', n_features)
P_Diet_duration_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)
P_diet_country_int <- vector('double', n_features)

CI_L_Diet_duration_inCZ <- vector('double', n_features)
CI_L_Diet_duration_inIT <- vector('double', n_features)
CI_L_Diet_duration_avg <- vector('double', n_features)

CI_U_Diet_duration_inCZ <- vector('double', n_features)
CI_U_Diet_duration_inIT <- vector('double', n_features)
CI_U_Diet_duration_avg <- vector('double', n_features)

7.2.4 Estimate over outcomes

Open code
if(file.exists('gitignore/lm_results/result_metabolite_VGdur_training.Rds') == FALSE){
for (i in 1:n_features) {
  
  ## define variable
  data_analysis$outcome <- data_analysis[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Country_IT * Diet_duration + Age, 
              data = data_analysis %>% 
                mutate(Diet_duration = (Diet_duration/10),
                       outcome = trans_metabolite(outcome),
                       Age = (Age-33)/10))

  ## get contrast (effects of diet BY COUNTRY)
  contrast_emm <- summary(
    pairs(
      emmeans(
        model,
        specs = ~ Diet_duration | Country_IT,
        at = list(Diet_duration = c(0, 1)) 
        ),
      interaction = TRUE,
      adjust = "none"
      ),
    infer = c(TRUE, TRUE)
    )

  ## save results
  outcome[i] <- names(data_analysis)[i + n_covarites]
  
  ## country and Age effect
  logFD_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 1
  ]

  P_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 4
  ]
  
    logFD_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 1
  ]

  P_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 4
  ]
  
  ## diet effect
  tr <- confint(model)
  
  CI_L_Diet_duration_avg[i] <- tr[which(row.names(tr) == 'Diet_duration'),][1]
  CI_U_Diet_duration_avg[i] <- tr[which(row.names(tr) == 'Diet_duration'),][2]
  
  logFD_Diet_duration_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 1
  ]

  P_Diet_duration_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 4
  ]
  
  logFD_Diet_duration_inCZ[i] <- -contrast_emm[1,3]
  P_Diet_duration_inCZ[i] <- contrast_emm$p.value[1]
  CI_L_Diet_duration_inCZ[i] <- -contrast_emm$upper.CL[1]
  CI_U_Diet_duration_inCZ[i] <- -contrast_emm$lower.CL[1]
  
  logFD_Diet_duration_inIT[i] <- -contrast_emm[2,3]
  P_Diet_duration_inIT[i] <- contrast_emm$p.value[2]
  CI_L_Diet_duration_inIT[i] <- -contrast_emm$upper.CL[2]
  CI_U_Diet_duration_inIT[i] <- -contrast_emm$lower.CL[2]
  
  ## interaction
  diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_duration"
    ), 1
  ]

  P_diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_duration"
    ), 4
  ]
}

  result_metabolite <- data.frame(
  outcome,
  logFD_ITcountry_avg, P_ITcountry_avg,
  logFD_Age, P_Age,  
  logFD_Diet_duration_avg, P_Diet_duration_avg,
  logFD_Diet_duration_inCZ, P_Diet_duration_inCZ,
  logFD_Diet_duration_inIT, P_Diet_duration_inIT,
  diet_country_int, P_diet_country_int,
  CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
  CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
  CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
)
  
  saveRDS(result_metabolite, 
              'gitignore/lm_results/result_metabolite_VGdur_training.Rds')
}

result_metabolite <- readRDS('gitignore/lm_results/result_metabolite_VGdur_training.Rds')

7.2.5 Show and save results

Open code
kableExtra::kable(result_metabolite %>% 
  dplyr::select(
    outcome,
    logFD_ITcountry_avg, P_ITcountry_avg,
    logFD_Diet_duration_avg, P_Diet_duration_avg,
    logFD_Diet_duration_inCZ, P_Diet_duration_inCZ,
    logFD_Diet_duration_inIT, P_Diet_duration_inIT,
    diet_country_int, P_diet_country_int,
    CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
    CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
    CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
  ) %>% arrange(P_Diet_duration_avg),
  caption = "Result of linear models, modelling the log2-transformed level of given metabolite with vegan status duration (`Diet_duration`), `Country`, their interaction (`Diet_duration x Country`) and `Age` as predictors, using training data only (both Czech and Italian cohorts). Only metabolites whose log2-transformed levels differ between diet in training cohorts (FDR < 0.05, average diet effet across both countries) are shown. `logFD` prefix: denotes estimated effects (regression coefficient), i.e. expected change in log2 metabolite level per 10 years of vegan diet or age, or for country (Italy vs Czechia). `P`: p-value, `CI_L` and `CI_U`: lower and upper bounds of 95% confidence interval respectively. `avg` suffix shows effect averaged across subgroups, whereas `inCZ` and `inIT` show effect in Czech or Italian cohort respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of vegan diet duration between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model with interaction",
  escape = FALSE
)
Result of linear models, modelling the log2-transformed level of given metabolite with vegan status duration (Diet_duration), Country, their interaction (Diet_duration x Country) and Age as predictors, using training data only (both Czech and Italian cohorts). Only metabolites whose log2-transformed levels differ between diet in training cohorts (FDR < 0.05, average diet effet across both countries) are shown. logFD prefix: denotes estimated effects (regression coefficient), i.e. expected change in log2 metabolite level per 10 years of vegan diet or age, or for country (Italy vs Czechia). P: p-value, CI_L and CI_U: lower and upper bounds of 95% confidence interval respectively. avg suffix shows effect averaged across subgroups, whereas inCZ and inIT show effect in Czech or Italian cohort respectively. Interaction effects are reported as diet_country_int (difference in the effect of vegan diet duration between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and P_diet_country_int (its p-value). All estimates in a single row are based on a single model with interaction
outcome logFD_ITcountry_avg P_ITcountry_avg logFD_Diet_duration_avg P_Diet_duration_avg logFD_Diet_duration_inCZ P_Diet_duration_inCZ logFD_Diet_duration_inIT P_Diet_duration_inIT diet_country_int P_diet_country_int CI_L_Diet_duration_avg CI_U_Diet_duration_avg CI_L_Diet_duration_inCZ CI_U_Diet_duration_inCZ CI_L_Diet_duration_inIT CI_U_Diet_duration_inIT
citrate 0.1143157 0.4930994 0.4022623 0.0110006 0.2482721 0.1130090 0.5562525 0.0282494 0.3079804 0.2687216 0.0943973 0.7101274 -0.0598810 0.5564252 0.0606037 1.0519013
phenylalanine 0.1657750 0.0166864 -0.0801155 0.2099722 -0.0086252 0.8922892 -0.1516058 0.1412463 -0.1429806 0.2101544 -0.2061433 0.0459122 -0.1347709 0.1175204 -0.3545048 0.0512931
lysine -0.2203053 0.0303230 -0.0999406 0.2877428 -0.0981387 0.2969192 -0.1017424 0.5006567 -0.0036037 0.9828180 -0.2855565 0.0856753 -0.2839283 0.0876508 -0.4005756 0.1970907
histidine 0.0900923 0.1934038 -0.0439469 0.4951805 -0.0612703 0.3426407 -0.0266234 0.7972240 0.0346469 0.7630355 -0.1713873 0.0834936 -0.1888300 0.0662894 -0.2317968 0.1785499
leucine 0.0392781 0.6895034 -0.0569630 0.5349177 -0.0366777 0.6895893 -0.0772482 0.6010874 -0.0405705 0.8043271 -0.2385798 0.1246538 -0.2184645 0.1451090 -0.3696431 0.2151466
isoleucine 0.2262944 0.0138836 -0.0517482 0.5403743 0.0399003 0.6370529 -0.1433968 0.2929302 -0.1832970 0.2258693 -0.2189719 0.1154754 -0.1274798 0.2072804 -0.4126192 0.1258257
2-oxoisocaproate 0.3461730 0.0298305 -0.0890765 0.5444433 0.1254247 0.3943196 -0.3035777 0.2010185 -0.4290024 0.1041750 -0.3798475 0.2016945 -0.1656184 0.4164677 -0.7717058 0.1645504
tyrosine 0.0284829 0.7856861 -0.0493386 0.6139168 -0.1288198 0.1899391 0.0301426 0.8480927 0.1589624 0.3633615 -0.2428935 0.1442162 -0.3225558 0.0649162 -0.2814720 0.3417571
2-hydroxybutyrate -0.2117327 0.1272228 0.0390258 0.7618445 0.0333535 0.7958006 0.0446981 0.8292779 0.0113446 0.9606360 -0.2159417 0.2939933 -0.2218526 0.2885596 -0.3657880 0.4551843
glycine -0.1613635 0.2326582 -0.0259395 0.8365036 -0.1583859 0.2099492 0.1065069 0.5989009 0.2648928 0.2395579 -0.2748467 0.2229678 -0.4075260 0.0907543 -0.2942225 0.5072363
formate 1.2291700 0.0000000 0.0126308 0.8630555 -0.0167243 0.8195217 0.0419858 0.7218102 0.0587101 0.6535405 -0.1323831 0.1576446 -0.1618738 0.1284252 -0.1914799 0.2754516
asparagine 0.0080122 0.9523680 0.0168787 0.8927392 -0.1598072 0.2040944 0.1935645 0.3379878 0.3533717 0.1162509 -0.2310177 0.2647751 -0.4079356 0.0883212 -0.2055375 0.5926665
glutamine -0.2476564 0.0003372 -0.0072353 0.9074647 -0.0909137 0.1467927 0.0764431 0.4462767 0.1673568 0.1344316 -0.1305073 0.1160367 -0.2143011 0.0324737 -0.1220193 0.2749055
glycerol -0.2363158 0.2527762 -0.0186424 0.9227130 -0.0353652 0.8541236 -0.0019196 0.9950489 0.0334456 0.9223394 -0.3991932 0.3619083 -0.4162720 0.3455416 -0.6145891 0.6107498
valine -0.0412983 0.6386216 0.0077384 0.9248301 -0.0103518 0.8996564 0.0258287 0.8449300 0.0361805 0.8048743 -0.1546900 0.1701668 -0.1729322 0.1522285 -0.2356737 0.2873310
2-oxoisovalerate  0.0607476 0.6484160 0.0114058 0.9268662 0.1359799 0.2758019 -0.1131683 0.5719322 -0.2491482 0.2630375 -0.2346857 0.2574973 -0.1103418 0.3823016 -0.5093644 0.2830278
3-hydroxyisobutyrate 0.5652168 0.0075870 -0.0094720 0.9610123 0.3205106 0.1008869 -0.3394546 0.2780552 -0.6599652 0.0588473 -0.3932201 0.3742761 -0.0635966 0.7046178 -0.9572717 0.2783625
Open code

if(file.exists('gitignore/lm_results/result_metabolite_VGdur_training.csv') == FALSE){
  write.table(result_metabolite, 
              'gitignore/lm_results/result_metabolite_VGdur_training.csv',
              row.names = FALSE)
  }

7.3 Validation cohort

Open code
data_analysis_metabolite <- data_metabolite_valid2 %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>%
  dplyr::mutate(Diet_duration = as.numeric(as.character(Diet_duration))) %>% 
  dplyr::select(
    Diet_duration, Age,
    all_of(diet_sensitive_metabolites$outcome)
  )

summary(data_analysis_metabolite[ , 1:11])
##  Diet_duration        Age           formate          phenylalanine      
##  Min.   : 0.17   Min.   :21.54   Min.   :3.078e-06   Min.   :1.480e-05  
##  1st Qu.: 4.00   1st Qu.:31.56   1st Qu.:3.675e-06   1st Qu.:1.914e-05  
##  Median : 7.00   Median :34.17   Median :4.167e-06   Median :2.030e-05  
##  Mean   : 7.19   Mean   :34.57   Mean   :4.149e-06   Mean   :2.061e-05  
##  3rd Qu.:10.00   3rd Qu.:37.22   3rd Qu.:4.458e-06   3rd Qu.:2.198e-05  
##  Max.   :25.00   Max.   :50.88   Max.   :5.776e-06   Max.   :3.111e-05  
##    histidine            tyrosine            glycerol        
##  Min.   :8.295e-06   Min.   :1.093e-05   Min.   :1.851e-06  
##  1st Qu.:1.041e-05   1st Qu.:1.569e-05   1st Qu.:4.560e-06  
##  Median :1.144e-05   Median :1.746e-05   Median :6.533e-06  
##  Mean   :1.169e-05   Mean   :1.764e-05   Mean   :7.868e-06  
##  3rd Qu.:1.285e-05   3rd Qu.:1.972e-05   3rd Qu.:9.496e-06  
##  Max.   :1.813e-05   Max.   :2.421e-05   Max.   :5.184e-05  
##     glycine              lysine            asparagine       
##  Min.   :3.435e-05   Min.   :1.185e-05   Min.   :2.839e-06  
##  1st Qu.:6.676e-05   1st Qu.:1.719e-05   1st Qu.:5.024e-06  
##  Median :8.419e-05   Median :2.081e-05   Median :5.899e-06  
##  Mean   :8.683e-05   Mean   :2.201e-05   Mean   :6.005e-06  
##  3rd Qu.:1.024e-04   3rd Qu.:2.469e-05   3rd Qu.:6.876e-06  
##  Max.   :1.421e-04   Max.   :3.807e-05   Max.   :1.028e-05  
##  2-oxoisocaproate   
##  Min.   :5.581e-06  
##  1st Qu.:7.584e-06  
##  Median :8.774e-06  
##  Mean   :9.136e-06  
##  3rd Qu.:1.017e-05  
##  Max.   :1.899e-05

7.3.0.1 Define number of metabolite and covariates

Open code
n_covarites <- 2
n_features <- ncol(data_analysis_metabolite) - n_covarites

7.3.0.2 Create empty objects

Open code
outcome <- vector('double', n_features)
logFD_VGduration <- vector('double', n_features)
P_VGduration <- vector('double', n_features)
logFD_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)
CI_L_VGduration <- vector('double', n_features)
CI_U_VGduration <- vector('double', n_features)

7.3.0.3 Estimate over outcomes

Open code
for (i in 1:n_features) {
  ## define variable
  data_analysis_metabolite$outcome <- data_analysis_metabolite[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_duration + Age, 
              data = data_analysis_metabolite %>% 
                mutate(Diet_duration = (Diet_duration)/10,
                       outcome = trans_metabolite(outcome),
                       Age = (Age-33)/10))

  ## save results
  outcome[i] <- names(data_analysis_metabolite)[i + n_covarites]

  ## diet effect
  tr <- confint(model)

  CI_L_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][1]
  CI_U_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][2]

  logFD_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 1
  ]

  P_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 4
  ]

    logFD_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 1
  ]

  P_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 4
  ]
  
}

7.3.0.4 Results table

Open code
result_metabolite_val <- data.frame(
  outcome,
  logFD_VGduration, P_VGduration,
  CI_L_VGduration, CI_U_VGduration,
  logFD_Age, P_Age
)

kableExtra::kable(result_metabolite_val %>% arrange(P_VGduration),
  caption = "Results of linear models estimating the effect of vegan diet status duration and age on log2-trasformed metabolite levels. Only metabolites metabolites with significantly different levels between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `logFD` denotes estimated effects (regression coefficient), i.e. expected change in log2 metabolite level per +10 years of vegan diet and age, respectively. `P`: p-value, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model"
)
Results of linear models estimating the effect of vegan diet status duration and age on log2-trasformed metabolite levels. Only metabolites metabolites with significantly different levels between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. logFD denotes estimated effects (regression coefficient), i.e. expected change in log2 metabolite level per +10 years of vegan diet and age, respectively. P: p-value, and CI_L and CI_U represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model
outcome logFD_VGduration P_VGduration CI_L_VGduration CI_U_VGduration logFD_Age P_Age
valine -0.2129785 0.0101981 -0.3740814 -0.0518756 0.1167345 0.0548426
glutamine -0.1862766 0.0177533 -0.3394330 -0.0331202 0.0980499 0.0890129
lysine -0.2883669 0.0178233 -0.5256127 -0.0511211 0.0840764 0.3435407
leucine -0.2088764 0.0242676 -0.3899040 -0.0278487 0.1192279 0.0803475
glycine 0.2873724 0.0291661 0.0298822 0.5448627 -0.1421564 0.1415944
isoleucine -0.1898327 0.0621485 -0.3895297 0.0098643 0.0567564 0.4470274
2-hydroxybutyrate -0.2585286 0.1101122 -0.5769415 0.0598844 0.1207362 0.3110292
histidine -0.0935354 0.1794716 -0.2309640 0.0438932 -0.0123405 0.8098482
2-oxoisovalerate  0.1225618 0.2444741 -0.0854190 0.3305426 -0.0048163 0.9505117
3-hydroxyisobutyrate 0.1733428 0.3357731 -0.1828034 0.5294891 0.0445163 0.7377100
glycerol 0.1667869 0.4642500 -0.2844422 0.6180161 -0.5576731 0.0013345
formate 0.0411991 0.4950196 -0.0783767 0.1607750 -0.0127748 0.7746800
phenylalanine 0.0329646 0.5615259 -0.0795269 0.1454560 0.0425598 0.3120995
citrate -0.0750249 0.6180556 -0.3732167 0.2231669 -0.0571242 0.6079397
tyrosine -0.0243180 0.7274266 -0.1626363 0.1140004 0.0738384 0.1550599
asparagine 0.0205672 0.8156797 -0.1543895 0.1955239 0.0684715 0.2958396
2-oxoisocaproate 0.0135937 0.8932256 -0.1872522 0.2144395 0.0170601 0.8199426
Open code

if (file.exists("gitignore/lm_results/result_metabolite_VGdur_valid.csv") == FALSE) {
  write.table(result_metabolite_val,
    "gitignore/lm_results/result_metabolite_VGdur_valid.csv",
    row.names = FALSE
  )
}

7.4 Forest plot

7.4.1 Prepare data

Open code

## subset result tables
result_metabolite_subset <- result_metabolite %>%
  filter(outcome %in% diet_sensitive_metabolites$outcome)

result_metabolite_val_subset <- result_metabolite_val %>%
  filter(outcome %in% diet_sensitive_metabolites$outcome)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_metabolites$outcome, 3),
  beta = c(
    result_metabolite_subset$logFD_Diet_duration_inCZ,
    result_metabolite_subset$logFD_Diet_duration_inIT,
    result_metabolite_val_subset$logFD_VGduration
  ),
  lower = c(
    result_metabolite_subset$CI_L_Diet_duration_inCZ,
    result_metabolite_subset$CI_L_Diet_duration_inIT,
    result_metabolite_val_subset$CI_L_VGduration
  ),
  upper = c(
    result_metabolite_subset$CI_U_Diet_duration_inCZ,
    result_metabolite_subset$CI_U_Diet_duration_inIT,
    result_metabolite_val_subset$CI_U_VGduration
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)


 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
  left_join(
    elacoef %>% mutate(outcome = metabolite) %>% select(-metabolite), 
    by = 'outcome') %>% 
   mutate(outcome = factor(outcome, levels = validation_order))

7.4.1.1 Plotting

Open code

plotac <- "forest_plot_metabolite_VGduration"
path <- "gitignore/figures"

colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

assign(plotac, ggplot(
  data_forest, aes(x = outcome, y = beta, ymin = lower, ymax = upper, color = dataset)
) +
  geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
  scale_color_manual(values = colors) +
  labs(
    y = "Effect of vegan duration (+10y) on log2 metabolite level",
    x = "Outcome",
    color = "Dataset"
  ) +
  theme_minimal() +
  coord_flip() + 
  scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$outcome, "**"), 
             as.character(data_forest$outcome)
      ), data_forest$outcome
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

Forest plot shows the effect of vegan status duration (within the vegan sub-population) on log2-transformed levels of selected metabolites, with 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech validation cohort. Green, blue, and grey points/lines represent the expected change in log2 metabolite level per +10 years of vegan diet within the Italian, Czech, and Czech validation cohorts, respectively. Positive values indicate higher metabolite levels in long-term vegans compared to short-term vegans. Only metabolites with significant differences between vegans and omnivores in training cohorts (average effect over the two countries) were included. Estimates for the training cohorts come from a single linear model with Diet_duration, Country, their interaction (Diet_duration x Country), and Age as fixed-effect predictors. Age was included as it correlates with Diet_duration and could act as a confounder. In the Czech validation cohort, Diet_duration and Age were fixed-effect predictors. Diet-sensitive metabolites are shown in bold, in the same order as in the forest plot of vegan–omnivore differences
Open code

if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 9,
    height = 13
  )
}

8 Reproducibility

Open code
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Prague
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] vegan_2.6-6.1         lattice_0.22-5        permute_0.9-7        
##  [4] zCompositions_1.5.0-4 truncnorm_1.0-8       NADA_1.6-1.1         
##  [7] survival_3.7-0        MicrobiomeStat_1.2    glmnet_4.1-8         
## [10] pROC_1.18.0           arm_1.12-2            lme4_1.1-35.5        
## [13] Matrix_1.7-0          MASS_7.3-65           car_3.1-2            
## [16] carData_3.0-5         emmeans_1.10.4        brms_2.21.0          
## [19] Rcpp_1.0.13           rms_6.8-1             Hmisc_5.1-3          
## [22] glmmTMB_1.1.9         ggtext_0.1.2          ggdist_3.3.2         
## [25] cowplot_1.1.1         ggpubr_0.4.0          sjPlot_2.8.16        
## [28] kableExtra_1.4.0      flextable_0.9.6       gtsummary_2.0.2      
## [31] compositions_2.0-8    janitor_2.2.0         stringi_1.8.7        
## [34] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
## [37] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
## [40] tidyr_1.3.1           tibble_3.3.0          ggplot2_3.5.1        
## [43] tidyverse_1.3.1       readxl_1.3.1          openxlsx_4.2.8       
## [46] RJDBC_0.2-10          rJava_1.0-11          DBI_1.2.3            
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.6                matrixStats_1.3.0       httr_1.4.2             
##   [4] insight_0.20.2          numDeriv_2016.8-1.1     tools_4.4.3            
##   [7] backports_1.5.0         utf8_1.2.6              sjlabelled_1.2.0       
##  [10] R6_2.6.1                mgcv_1.9-1              withr_3.0.2            
##  [13] Brobdingnag_1.2-7       prettyunits_1.2.0       gridExtra_2.3          
##  [16] bayesm_3.1-6            quantreg_5.98           cli_3.6.5              
##  [19] textshaping_0.3.6       performance_0.12.2      gt_0.11.0              
##  [22] officer_0.6.6           sandwich_3.0-1          sass_0.4.9             
##  [25] labeling_0.4.2          mvtnorm_1.1-3           robustbase_0.93-9      
##  [28] polspline_1.1.25        ggridges_0.5.3          askpass_1.1            
##  [31] QuickJSR_1.3.1          commonmark_1.9.1        systemfonts_1.0.4      
##  [34] StanHeaders_2.32.10     foreign_0.8-90          gfonts_0.2.0           
##  [37] svglite_2.1.3           rstudioapi_0.16.0       httpcode_0.3.0         
##  [40] generics_0.1.4          shape_1.4.6             distributional_0.4.0   
##  [43] zip_2.2.0               inline_0.3.19           loo_2.4.1              
##  [46] abind_1.4-5             lifecycle_1.0.4         multcomp_1.4-18        
##  [49] yaml_2.3.10             snakecase_0.11.1        grid_4.4.3             
##  [52] promises_1.2.0.1        crayon_1.5.3            haven_2.4.3            
##  [55] pillar_1.11.0           knitr_1.50              statip_0.2.3           
##  [58] boot_1.3-31             estimability_1.5.1      codetools_0.2-19       
##  [61] glue_1.7.0              V8_4.4.2                fontLiberation_0.1.0   
##  [64] data.table_1.15.4       vctrs_0.6.5             cellranger_1.1.0       
##  [67] gtable_0.3.0            assertthat_0.2.1        datawizard_0.12.2      
##  [70] xfun_0.52               mime_0.12               coda_0.19-4            
##  [73] modeest_2.4.0           timeDate_3043.102       iterators_1.0.14       
##  [76] statmod_1.4.36          TH.data_1.1-0           nlme_3.1-168           
##  [79] fontquiver_0.2.1        rstan_2.32.6            fBasics_4041.97        
##  [82] tensorA_0.36.2.1        TMB_1.9.14              rpart_4.1.24           
##  [85] colorspace_2.0-2        nnet_7.3-20             tidyselect_1.2.1       
##  [88] processx_3.8.4          timeSeries_4032.109     compiler_4.4.3         
##  [91] curl_6.4.0              rvest_1.0.2             htmlTable_2.4.0        
##  [94] SparseM_1.81            xml2_1.3.3              fontBitstreamVera_0.1.1
##  [97] posterior_1.6.0         checkmate_2.3.2         scales_1.3.0           
## [100] DEoptimR_1.0-10         callr_3.7.6             spatial_7.3-15         
## [103] digest_0.6.37           minqa_1.2.4             rmarkdown_2.27         
## [106] htmltools_0.5.8.1       pkgconfig_2.0.3         base64enc_0.1-3        
## [109] stabledist_0.7-2        dbplyr_2.1.1            fastmap_1.2.0          
## [112] rlang_1.1.6             htmlwidgets_1.6.4       shiny_1.9.1            
## [115] farver_2.1.0            zoo_1.8-9               jsonlite_2.0.0         
## [118] magrittr_2.0.3          Formula_1.2-4           bayesplot_1.8.1        
## [121] munsell_0.5.0           gdtools_0.3.7           stable_1.1.6           
## [124] plyr_1.8.6              pkgbuild_1.3.1          parallel_4.4.3         
## [127] ggrepel_0.9.5           sjmisc_2.8.10           ggeffects_1.7.0        
## [130] splines_4.4.3           gridtext_0.1.5          hms_1.1.3              
## [133] sjstats_0.19.0          ps_1.7.7                uuid_1.0-3             
## [136] markdown_1.13           ggsignif_0.6.3          stats4_4.4.3           
## [139] rmutil_1.1.10           rstantools_2.1.1        crul_1.5.0             
## [142] reprex_2.0.1            evaluate_1.0.4          RcppParallel_5.1.8     
## [145] modelr_0.1.8            nloptr_2.0.0            tzdb_0.5.0             
## [148] foreach_1.5.2           httpuv_1.6.5            cards_0.2.2            
## [151] MatrixModels_0.5-3      openssl_1.4.6           clue_0.3-65            
## [154] broom_1.0.6             xtable_1.8-4            rstatix_0.7.0          
## [157] later_1.3.0             viridisLite_0.4.0       ragg_1.4.0             
## [160] lmerTest_3.1-3          cluster_2.1.8.1         timechange_0.3.0       
## [163] bridgesampling_1.1-2

References

Lenth, Russell V. 2024. “Emmeans: Estimated Marginal Means, Aka Least-Squares Means.” https://CRAN.R-project.org/package=emmeans.