Vegan-specific signature implies healthier metabolic profile: findings from diet-related multi-omics observational study based on different European populations

Statistical report - statistical methodology in detail

Authors and affiliations

Anna Ouradova¹,*, Giulio Ferrero²,³,*, Miriam Bratova⁴, Nikola Daskova⁴, Alena Bohdanecka⁴,⁵, Klara Dohnalova⁶, Marie Heczkova⁴, Karel Chalupsky⁶, Maria Kralova⁷,⁸, Marek Kuzma⁹, István Modos⁴, Filip Tichanek⁴, Lucie Najmanova⁹, Barbara Pardini¹⁰, Helena Pelantová⁹, Sonia Tarallo¹⁰, Petra Videnska¹¹, Jan Gojda¹,#, Alessio Naccarati¹⁰,#, Monika Cahova⁴,#

* These authors have contributed equally to this work and share first authorship.
# These authors have contributed equally to this work and share last authorship.

¹ Department of Internal Medicine, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic
² Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
³ Department of Computer Science, University of Turin, Turin, Italy
⁴ Institute for Clinical and Experimental Medicine, Prague, Czech Republic
⁵ First Faculty of Medicine, Charles University, Prague, Czech Republic
⁶ Czech Centre for Phenogenomics, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic
⁷ Ambis University, Department of Economics and Management, Prague, Czech Republic
⁸ Department of Informatics, Brno University of Technology, Brno, Czech Republic
⁹ Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic
¹⁰ Italian Institute for Genomic Medicine (IIGM), c/o IRCCS Candiolo, Turin, Italy
¹¹ Mendel University, Department of Chemistry and Biochemistry, Brno, Czech Republic

This is a statistical report for the study “A vegan diet signature from a multi-omics study on different European populations is related to favorable metabolic outcomes”, which is currently under review.

When using this code or data, cite the original publication:

TO BE ADDED

BibTex citation for the original publication:

TO BE ADDED

Original GitHub repository: https://github.com/filip-tichanek/ItCzVegans

Statistical reports can be found on the reports hub.

Data analysis is described in detail in the statistical methods report.

1 Introduction

This project explores potential signatures of a vegan diet across the microbiome, metabolome, and lipidome. We used data from healthy vegan and omnivorous human subjects in two countries (Czech Republic and Italy), with subjects grouped by Country and Diet, resulting in four distinct groups.

To assess the generalizability of these findings, we validated our results in an independent cohort from the Czech Republic (external validation).

1.1 Statistical Methods

The statistical modeling approach is described in detail in this report. Briefly, the methods used included:

Multivariate analysis: We conducted multivariate analyses (PERMANOVA, PCA, correlation analyses) to explore the effects of diet, country, and their possible interaction (diet:country) on the microbiome, lipidome, and metabolome compositions in an integrative manner. This part of the analysis is not available on the GitHub page, but the code will be provided upon request.
Linear models: Linear models were applied to estimate the effects of diet, country, and their interaction (diet:country) on individual lipids, metabolites, bacterial taxa and pathways (“features”). Features that significantly differed between diet groups (based on the estimated average conditional effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were further examined in the independent external validation cohort to assess whether these associations were reproducible. Next, we fit linear models restricted to vegan participants to test whether omics profiles varied with the duration of vegan diet. Fixed-effect predictors were diet duration (per 10 years), country, their interaction, and age (included due to correlation with diet duration).
Predictive models (elastic net): We employed elastic net logistic regression (via the glmnet R package) to predict vegan status based on metabolome, lipidome, microbiome and pathways data (one model per dataset; four models in total). We considered three combinations of Lasso and Ridge penalties (alpha = 0, 0.2, 0.4). For each alpha, we selected the penalty strength (λ_1se) using 10-fold cross-validation. This value corresponds to the most regularized model whose performance was within one standard error of the minimum deviance. The alpha–lambda pair with the lowest deviance was chosen to fit the final model, whose coefficients are reported.
To estimate model performance, we repeated the full modeling procedure (including hyperparameter tuning) 500 times on bootstrap resamples of the training data. In each iteration, the model was trained on the resampled data and evaluated on the out-of-bag subjects (i.e., those not included in the training set in that iteration). The mean and the 2.5th and 97.5th percentiles of the resulting ROC-AUC values represent the estimated out-of-sample AUC and its 95% confidence interval.
Finally, the final model was applied to an independent validation cohort to generate predicted probabilities of vegan status. These probabilities were then used to assess external discrimination between diet groups (ROC-AUC in the independent validation cohort). The elastic net models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. They also offered a complementary perspective on which features are most clearly associated with diet.

1.2 Statistical Methods in detail

All statistical analyses were performed using R, version 4.4.1 (2024-06-14) (R Core Team 2023). Data visualizations were done with the ggplot2 package (Wickham 2016).

1.2.1 Linear model per feature

For each dataset in the training cohorts, we fitted a feature-specific linear model in which the transformed feature (metabolite [log2], lipid [log2], microbiome [CLR], or pathway [CLR]) was the outcome variable, and country (Italy vs Czech Republic), diet (vegan vs omnivore), and their interaction (country:diet) were fixed-effect predictors. Each model thus has the following form:

\[ g(\text{outcome}) = \alpha + \beta_{1} \times \text{country} + \beta_{2} \times \text{diet} + \beta_{3} \times \text{country:diet} + \epsilon \]

with \(g\) representing the transformation applied: \(\log_{2}\) for metabolomic and lipidomic data, and the centered log-ratio (CLR) for microbiome and pathway data (both compositional). For compositional datasets, zeros were assumed to be false zeros and were imputed by log-ratio singular value decomposition using the lrSVD function in the zCompositions package (J. Palarea-Albaladejo and Martín-Fernández 2015), since this method has been shown to be efficient for replacing zeros even in sparse compositional datasets (Javier Palarea-Albaladejo et al. 2022).
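As an illustration of the CLR transformation only, the following base R sketch applies it to a small matrix of made-up relative abundances. The helper `clr()` and the object `abund` are hypothetical names, and the sketch assumes zeros have already been replaced (it does not perform the lrSVD imputation).

```r
# Centered log-ratio (CLR) transform for one sample of strictly positive parts.
clr <- function(x) {
  stopifnot(all(x > 0))   # zeros must be replaced before taking logs
  log_x <- log(x)
  log_x - mean(log_x)     # subtract the log of the geometric mean
}

# Hypothetical relative abundances: three samples (rows), four taxa (columns)
abund <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.40, 0.40, 0.10,
    0.25, 0.25, 0.25, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("s", 1:3), paste0("taxon", 1:4))
)

# Apply the transform row-wise; t() puts samples back in rows
clr_abund <- t(apply(abund, 1, clr))

# CLR values of each sample sum to zero (up to floating-point error)
rowSums(clr_abund)
```

Each CLR value expresses a part relative to the geometric mean of its own sample, which is why the rows sum to zero.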

The variables were coded as follows: \(\text{diet} = -0.5\) for omnivores and \(\text{diet} = 0.5\) for vegans; \(\text{country} = -0.5\) for the Czech cohort and \(\text{country} = 0.5\) for the Italian cohort.
This parameterization allowed us to interpret the linear model summary as presenting the average conditional effects of diet across both countries and the average conditional effects of country across both diet groups. We then used the emmeans package (Lenth 2024) to obtain specific estimates for the effect of diet in the Italian and Czech cohorts separately, still from a single model.
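The consequence of the ±0.5 coding can be checked on simulated data. The sketch below is a hedged illustration in base R, not the report's code: the data, effect sizes, and object names are invented, and the country-specific diet effects are derived directly from the coefficients and cell means rather than with emmeans.

```r
set.seed(1)
n <- 40  # hypothetical number of subjects per country-by-diet group
dat <- expand.grid(id      = seq_len(n),
                   country = c(-0.5, 0.5),   # -0.5 = Czech, 0.5 = Italy
                   diet    = c(-0.5, 0.5))   # -0.5 = omnivore, 0.5 = vegan

# Simulated transformed feature with a diet effect that differs by country
dat$y <- 10 + 0.3 * dat$country + 1.0 * dat$diet +
  0.8 * dat$country * dat$diet + rnorm(nrow(dat), sd = 0.5)

fit <- lm(y ~ country * diet, data = dat)
b <- coef(fit)

# Diet effect within each country, computed from the raw group means
cell <- tapply(dat$y, list(country = dat$country, diet = dat$diet), mean)
diet_cz <- cell["-0.5", "0.5"] - cell["-0.5", "-0.5"]
diet_it <- cell["0.5", "0.5"] - cell["0.5", "-0.5"]

# The 'diet' coefficient is the average of the two country-specific effects
c(model = unname(b["diet"]), average = (diet_cz + diet_it) / 2)

# Country-specific effects recovered from the same single model
c(czech = unname(b["diet"] - 0.5 * b["country:diet"]),
  italy = unname(b["diet"] + 0.5 * b["country:diet"]))
```

With the default 0/1 dummy coding, the diet coefficient would instead be the diet effect in the reference country only.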

Features that showed a significant diet effect (average effect of diet across both countries, adjusted for multiple comparisons, FDR < 0.05) were then visualized in a forest plot. The plot displays the estimated difference in the level of a given feature between vegan and omnivorous subjects, with 95% confidence intervals, separately for each of the three cohorts (Czech and Italian training cohorts and the Czech external validation cohort), to evaluate whether the association of that feature with diet generalizes to other datasets.
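The feature selection step can be sketched as follows. This is an illustration on simulated data in base R. It assumes the Benjamini-Hochberg procedure as the FDR adjustment, which the text above does not name explicitly, and all object names are hypothetical.

```r
set.seed(2)
n <- 80        # hypothetical number of subjects
n_feat <- 50   # hypothetical number of features
diet    <- rep(c(-0.5, 0.5), each = n / 2)
country <- rep(c(-0.5, 0.5), times = n / 2)

# Simulated transformed features; only the first five truly differ by diet
X <- matrix(rnorm(n * n_feat), nrow = n,
            dimnames = list(NULL, paste0("feature", seq_len(n_feat))))
X[, 1:5] <- X[, 1:5] + 2 * diet

# One linear model per feature; keep the p-value of the average diet effect
p_diet <- apply(X, 2, function(y) {
  summary(lm(y ~ country * diet))$coefficients["diet", "Pr(>|t|)"]
})

# Benjamini-Hochberg adjustment (assumed here) and selection at FDR < 0.05
q_diet <- p.adjust(p_diet, method = "BH")
selected <- names(q_diet)[q_diet < 0.05]
selected
```

Only the features in `selected` would be carried forward to the forest plot and the external validation cohort.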

1.2.2 Linear model for the effect of vegan diet duration

Next, we fit another series of linear models, this time modelling omics profiles using the following fixed-effect predictors: duration of vegan status (Diet_duration, scaled in tens of years), Country, their interaction (Diet_duration × Country), and Age:

\[ g(\text{outcome}) = \alpha + \beta_{1} \times \text{Country} + \beta_{2} \times \text{Diet duration} + \beta_{3} \times (\text{Country}:\text{Diet duration}) + \beta_{4} \times \text{Age} + \epsilon \]

This analysis included only vegan participants; omnivores were excluded. The aim was to test whether omics features that differ between vegans and omnivores also vary within the vegan group itself, depending on how long participants have been vegan. In other words, we asked whether long-term vegans show stronger up- or down-regulation of diet-sensitive features than those who adopted the diet more recently. Because age correlates with the duration of the vegan diet, we also adjusted for age in the models.
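A minimal sketch of such a model on simulated vegan-only data is given below. Variable names (`duration10`, `duration_years`, `vegans`) and all numbers are hypothetical. The second fit is included only to show what expressing duration per 10 years does to the coefficient.

```r
set.seed(3)
n <- 100  # hypothetical number of vegan participants
vegans <- data.frame(
  country        = sample(c(-0.5, 0.5), n, replace = TRUE),
  duration_years = runif(n, min = 1, max = 25)
)
# Age is simulated to correlate with diet duration, motivating the adjustment
vegans$age <- 20 + vegans$duration_years + rnorm(n, sd = 6)
vegans$duration10 <- vegans$duration_years / 10   # duration in tens of years

# Simulated transformed feature that increases with diet duration
vegans$y <- 5 + 0.4 * vegans$duration10 + 0.02 * vegans$age +
  rnorm(n, sd = 0.5)

fit10 <- lm(y ~ country * duration10 + age, data = vegans)
fit1  <- lm(y ~ country * duration_years + age, data = vegans)

# The per-decade coefficient is ten times the per-year coefficient
c(per_10_years = unname(coef(fit10)["duration10"]),
  per_year     = unname(coef(fit1)["duration_years"]))
```

Rescaling the predictor changes only the unit of the estimate, not the fit or the p-value.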

1.2.3 Diet prediction

To explore whether omics profiles contain a consistent signal differentiating dietary groups, we fitted a separate elastic net logistic regression model to each dataset (metabolome, lipidome, microbiome, pathways) using the glmnet R package (Friedman, Tibshirani, and Hastie 2010), combined with a custom bootstrap-based function for internal validation.

Due to the expected high collinearity among features, we limited the search for the mixing parameter \(\alpha\) to relatively small values (0, 0.2, 0.4), that is, to ridge-like penalties. All features were standardized to have mean 0 and standard deviation 0.5 using the arm R package (Gelman and Su 2021) to ensure scale comparability.
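The scaling to a standard deviation of 0.5 amounts to centering each feature and dividing it by two standard deviations. The sketch below reproduces this in base R on made-up values. It does not call the arm package, and the helper name `standardize_half_sd()` is hypothetical.

```r
# Center a numeric vector and divide by two standard deviations
standardize_half_sd <- function(x) (x - mean(x)) / (2 * sd(x))

set.seed(6)
feature <- rnorm(50, mean = 12, sd = 3)   # hypothetical log2 intensities
z <- standardize_half_sd(feature)

c(mean = mean(z), sd = sd(z))   # approximately 0 and exactly 0.5 up to rounding
```

After this scaling, a one-unit change in a feature corresponds to a change of two standard deviations on its original scale.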

Models were first evaluated for out-of-sample performance using an out-of-bag bootstrap approach (500 iterations). Predictive performance was quantified using the out-of-sample area under the ROC curve (AUC; internal validation), estimated via the pROC package (Robin et al. 2011). AUC was estimated for both cohorts combined and for each country separately. Final external validation was performed on an independent Czech cohort.

The modelling and validation procedure involved the following steps:

Training and internal validation

For each alpha (0, 0.2, 0.4), the cv.glmnet function was used to perform 10-fold cross-validation on the training data, using the default glmnet lambda sequence. For each alpha, we selected lambda.1se, corresponding to the most regularized model whose cross-validated deviance was within one standard error of the minimum. The alpha–lambda pair with the lowest deviance was chosen.
The final model was fitted on the full training set using glmnet with the selected alpha and lambda values.
To estimate internal performance, the entire modelling procedure (including tuning) was repeated 500 times on bootstrap resamples of the training data. For each resample, the optimal alpha and lambda were re-selected.
The model was trained on each bootstrap resample, and AUC was calculated on the out-of-bag subjects (i.e., those not included in that iteration’s training set). This yielded 500 out-of-sample AUC estimates.
The median, 2.5th, and 97.5th percentiles of the AUCs were reported to summarize internal out-of-sample performance and its 95% confidence interval.
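The steps above can be sketched with glmnet on simulated data. This is an illustration under stated assumptions, not the custom validation function used for the report: `fit_tuned()` and `auc_rank()` are hypothetical helpers, alphas are compared by the cross-validated deviance at lambda.1se, AUC is computed from the rank-sum statistic instead of pROC, and only 20 bootstrap resamples are drawn to keep the example fast.

```r
library(glmnet)

# AUC from the rank-sum (Mann-Whitney) statistic; y is 0/1, score is numeric
auc_rank <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tune alpha and lambda.1se by 10-fold CV, then refit with the chosen pair
fit_tuned <- function(x, y, alphas = c(0, 0.2, 0.4)) {
  cvs <- lapply(alphas, function(a) {
    cv.glmnet(x, y, family = "binomial", alpha = a,
              nfolds = 10, type.measure = "deviance")
  })
  # Cross-validated deviance at lambda.1se for each alpha (assumed criterion)
  dev_1se <- sapply(cvs, function(cv) cv$cvm[match(cv$lambda.1se, cv$lambda)])
  best <- which.min(dev_1se)
  glmnet(x, y, family = "binomial", alpha = alphas[best],
         lambda = cvs[[best]]$lambda.1se)
}

# Simulated training data: 100 subjects, 20 features, two informative
set.seed(4)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))

final_fit <- fit_tuned(x, y)   # final model on the full training set

B <- 20   # 500 in the report; reduced here for speed
oob_auc <- replicate(B, {
  idx <- sample(n, replace = TRUE)        # bootstrap resample
  oob <- setdiff(seq_len(n), idx)         # subjects left out of the resample
  fit <- fit_tuned(x[idx, ], y[idx])      # tuning repeated in each resample
  pr  <- predict(fit, newx = x[oob, , drop = FALSE], type = "response")
  auc_rank(y[oob], as.numeric(pr))
})

# Median and percentile interval of the out-of-bag AUCs
quantile(oob_auc, probs = c(0.5, 0.025, 0.975))
```

Repeating the tuning inside every resample means the out-of-bag AUC reflects the whole procedure, including the choice of alpha and lambda, rather than a single fixed model.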

External validation

External validation cohort data were standardized using the means and standard deviations of the training cohort to ensure consistent scaling.
The final model (fitted once on the full training data) was applied to the external validation cohort to generate predicted probabilities of vegan status.
For each subject in the external cohort, the final model returned a predicted probability of being vegan (a continuous score between 0 and 1).
This predicted probability was treated as the discrimination variable: by varying the threshold across its full range, we obtained sensitivity and specificity pairs from which the ROC curve and its AUC were computed. This AUC reflects the model’s ability to generalize to an independent cohort.
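The external validation logic can be illustrated in base R as below. The sketch is hedged in several ways: the cohorts are simulated, an ordinary logistic regression (`glm`) stands in for the elastic net model, and the helpers `roc_auc()`, `make_cohort()`, and `scale_with()` are hypothetical. It shows the two points described above: validation data are scaled with training means and standard deviations, and the AUC is obtained by sweeping a threshold over the predicted probabilities.

```r
# ROC-AUC by sweeping a threshold over the predicted probabilities
roc_auc <- function(y, prob) {
  thr  <- c(Inf, sort(unique(prob), decreasing = TRUE), -Inf)
  sens <- sapply(thr, function(t) mean(prob[y == 1] >= t))  # sensitivity
  fpr  <- sapply(thr, function(t) mean(prob[y == 0] >= t))  # 1 - specificity
  sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)    # trapezoidal rule
}

set.seed(5)
make_cohort <- function(n) {
  vegan <- rep(c(0, 1), each = n / 2)
  data.frame(vegan = vegan,
             f1 = rnorm(n, mean = 10 + 1.2 * vegan),
             f2 = rnorm(n, mean = 5 - 0.8 * vegan, sd = 2))
}
train <- make_cohort(120)   # hypothetical training cohort
valid <- make_cohort(60)    # hypothetical external validation cohort

feats <- c("f1", "f2")
mu <- colMeans(train[feats])
s2 <- 2 * sapply(train[feats], sd)   # two SDs, so training SD becomes 0.5

# Scale any cohort with the TRAINING means and standard deviations
scale_with <- function(d) {
  d[feats] <- sweep(sweep(as.matrix(d[feats]), 2, mu, "-"), 2, s2, "/")
  d
}
train_std <- scale_with(train)
valid_std <- scale_with(valid)

# Stand-in model fitted once on the training data
fit  <- glm(vegan ~ f1 + f2, family = binomial, data = train_std)
prob <- predict(fit, newdata = valid_std, type = "response")

roc_auc(valid_std$vegan, prob)   # external AUC
```

Scaling the validation cohort with its own means and standard deviations would shift the features relative to the scale on which the coefficients were estimated, which is why the training parameters are reused.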

The models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. They also offered a complementary perspective on which features are most clearly associated with diet.

2 Reproducibility


sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Prague
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] htmlwidgets_1.6.4 compiler_4.4.3    fastmap_1.2.0     cli_3.6.5        
##  [5] tools_4.4.3       htmltools_0.5.8.1 yaml_2.3.10       rmarkdown_2.27   
##  [9] knitr_1.50        jsonlite_2.0.0    xfun_0.52         digest_0.6.37    
## [13] rlang_1.1.6       evaluate_1.0.4

References

Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1). https://doi.org/10.18637/jss.v033.i01.

Gelman, Andrew, and Yu-Sung Su. 2021. “arm: Data Analysis Using Regression and Multilevel/Hierarchical Models.” https://CRAN.R-project.org/package=arm.

Lenth, Russell V. 2024. “emmeans: Estimated Marginal Means, aka Least-Squares Means.” https://CRAN.R-project.org/package=emmeans.

Palarea-Albaladejo, Javier, Josep Antoni Martín-Fernández, Anne Ruiz-Gazen, and Christine Thomas-Agnan. 2022. “lrSVD: An Efficient Imputation Algorithm for Incomplete High-Throughput Compositional Data.” Journal of Chemometrics 36 (12). https://doi.org/10.1002/cem.3459.

Palarea-Albaladejo, J., and J. A. Martín-Fernández. 2015. “zCompositions – R Package for Multivariate Imputation of Left-Censored Data Under a Compositional Approach.” Chemometrics and Intelligent Laboratory Systems 143. https://doi.org/10.1016/j.chemolab.2015.02.019.

R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77.

Wickham, Hadley. 2016. “ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.