Vegan-specific signature implies healthier metabolic profile: findings from diet-related multi-omics observational study based on different European populations
Statistical report - statistical methodology in detail
Authors and affiliations
Anna Ouradova1,*, Giulio Ferrero2,3,*, Miriam Bratova4, Nikola Daskova4, Alena Bohdanecka4,5, Klara Dohnalova6, Marie Heczkova4, Karel Chalupsky6, Maria Kralova7,8, Marek Kuzma9, István Modos4, Filip Tichanek4, Lucie Najmanova9, Barbara Pardini10, Helena Pelantová9, Sonia Tarallo10, Petra Videnska11, Jan Gojda1,#, Alessio Naccarati10,#, Monika Cahova4,#
* These authors have contributed equally to this work and share first authorship
# These authors have contributed equally to this work and share last authorship
1 Department of Internal Medicine, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic
2 Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
3 Department of Computer Science, University of Turin, Turin, Italy
4 Institute for Clinical and Experimental Medicine, Prague, Czech Republic
5 First Faculty of Medicine, Charles University, Prague, Czech Republic
6 Czech Centre for Phenogenomics, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic
7 Ambis University, Department of Economics and Management, Prague, Czech Republic
8 Department of Informatics, Brno University of Technology, Brno, Czech Republic
9 Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic
10 Italian Institute for Genomic Medicine (IIGM), c/o IRCCS Candiolo, Turin, Italy
11 Mendel University, Department of Chemistry and Biochemistry, Brno, Czech Republic
This is a statistical report of the study "A vegan diet signature from a multi-omics study on different European populations is related to favorable metabolic outcomes", which is currently under review.
When using this code or data, cite the original publication:
TO BE ADDED
BibTex citation for the original publication:
TO BE ADDED
Original GitHub repository: https://github.com/filip-tichanek/ItCzVegans
Statistical reports can be found on the reports hub.
Data analysis is described in detail in the statistical methods report.
1 Introduction
This project explores potential signatures of a vegan diet across the microbiome, metabolome, and lipidome. We used data from healthy vegan and omnivorous human subjects in two countries (Czech Republic and Italy), with subjects grouped by Country and Diet, resulting in four distinct groups.
To assess the generalizability of these findings, we validated our results in an independent cohort from the Czech Republic (external validation).
1.1 Statistical Methods
The statistical modeling approach is described in detail in this report. Briefly, the methods used included:
Multivariate analysis: We conducted multivariate analyses (PERMANOVA, PCA, correlation analyses) to explore the effects of diet, country, and their possible interaction (diet:country) on the microbiome, lipidome, and metabolome compositions in an integrative manner. This part of the analysis is not available on the GitHub page, but the code will be provided upon request.
Linear models: Linear models were applied to estimate the effects of diet, country, and their interaction (diet:country) on individual lipids, metabolites, bacterial taxa and pathways (“features”). Features that significantly differed between diet groups (based on the estimated average conditional effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were further examined in the independent external validation cohort to assess whether these associations were reproducible. Next, we fit linear models restricted to vegan participants to test whether omics profiles varied with the duration of vegan diet. Fixed-effect predictors were diet duration (per 10 years), country, their interaction, and age (included due to correlation with diet duration).
Predictive models (elastic net): We employed elastic net logistic regression (via the glmnet R package) to predict vegan status based on metabolome, lipidome, microbiome and pathways data (one model per dataset; four models in total). We considered three combinations of Lasso and Ridge penalties (alpha = 0, 0.2, 0.4). For each alpha, we selected the penalty strength (λ1se) using 10-fold cross-validation. This value corresponds to the most regularized model whose performance was within one standard error of the minimum deviance. The alpha–lambda pair with the lowest deviance was chosen to fit the final model, whose coefficients are reported.
To estimate model performance, we repeated the full modeling procedure (including hyperparameter tuning) 500 times on bootstrap resamples of the training data. In each iteration, the model was trained on the resampled data and evaluated on the out-of-bag subjects (i.e., those not included in the training set in that iteration). The mean of the resulting ROC-AUC values, together with their 2.5th and 97.5th percentiles, represents the estimated out-of-sample AUC and its 95% confidence interval.
Finally, the final model was applied to an independent validation cohort to generate predicted probabilities of vegan status. These probabilities were then used to assess external discrimination between diet groups (ROC-AUC in the independent validation cohort). The elastic net models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. They also offered a complementary perspective on which features are most clearly associated with diet.
1.2 Statistical Methods in detail
All statistical analyses were performed using R, version 4.4.1 (2024-06-14) (R Core Team 2023). Data visualizations were done with the ggplot2 package (Wickham 2016).
1.2.1 Linear model per feature
For each dataset in the training cohorts, we fitted a feature-specific linear model in which the transformed feature (metabolite [log2], lipid [log2], microbiome [CLR] and pathways [CLR]) was the outcome variable, and country (Italy vs Czech), diet (vegan vs omnivore), and their interaction (country:diet) were fixed-effect predictors. Each model therefore has the following form:
\[ g(\text{outcome}) = \alpha + \beta_{1} \times \text{country} + \beta_{2} \times \text{diet} + \beta_{3} \times \text{country:diet} + \epsilon \]
with \(g\) representing the transformation applied: \(\log_{2}\) for metabolomic and lipidomic data, and centered log-ratio (CLR) for microbiome and pathways data (both compositional). For compositional datasets, zeros were assumed to be false zeros and were imputed with Log-Ratio Singular Value Decomposition, using the lrSVD function in the zCompositions package (J. Palarea-Albaladejo and Martín-Fernández 2015), since this method has been shown to be effective for replacing zeros even in sparse compositional datasets (Javier Palarea-Albaladejo et al. 2022).
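As an illustration of the CLR step only, the following minimal sketch applies the transform row-wise to a small simulated matrix. It assumes zeros have already been replaced (the lrSVD imputation is not reproduced here); the function name clr_transform and the simulated counts are hypothetical and not taken from the project code.

```r
# Centered log-ratio (CLR) transform, applied row-wise (one row per sample).
# Assumes a strictly positive matrix, i.e. zeros were replaced beforehand.
clr_transform <- function(x) {
  log_x <- log(x)
  # Subtract each sample's mean log value (log of its geometric mean)
  sweep(log_x, 1, rowMeans(log_x), "-")
}

set.seed(1)
# Hypothetical counts: 5 samples x 4 taxa, all strictly positive
counts <- matrix(rpois(20, lambda = 50) + 1, nrow = 5,
                 dimnames = list(paste0("s", 1:5), paste0("taxon", 1:4)))
props <- counts / rowSums(counts)   # closure to proportions
clr_values <- clr_transform(props)

# Each row of CLR values sums to zero (up to floating-point error)
rowSums(clr_values)
```

Because CLR is invariant to the per-sample scale, applying the function to raw counts or to proportions gives the same values.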
The variables were coded as follows: \(diet = -0.5\) for omnivores and \(diet = 0.5\) for vegans; \(country = -0.5\) for the Czech cohort and \(country = 0.5\) for the Italian cohort.
This parameterization allowed us to interpret the linear model summary as presenting the average conditional effects of diet across both countries and the average conditional effects of country across both diet groups. We then used the emmeans package (Lenth 2024) to obtain specific estimates for the effect of diet in the Italian and Czech cohorts separately, still from a single model.
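The effect of the ±0.5 coding can be sketched on simulated data. The example below is an assumption-laden illustration, not the project code: the data frame dat, the feature values and the effect sizes are invented, and the country-specific diet effects are derived by simple coefficient arithmetic rather than with emmeans.

```r
set.seed(42)
n_per_group <- 25
dat <- expand.grid(rep = seq_len(n_per_group),
                   country = c(-0.5, 0.5),   # -0.5 = Czech, 0.5 = Italy
                   diet = c(-0.5, 0.5))      # -0.5 = omnivore, 0.5 = vegan

# Simulated log2 feature: diet effect 1.0 in the Czech cohort, 0.4 in the Italian one
true_diet_effect <- ifelse(dat$country < 0, 1.0, 0.4)
dat$log2_feature <- 10 + 0.3 * dat$country +
  true_diet_effect * dat$diet + rnorm(nrow(dat), sd = 0.5)

fit <- lm(log2_feature ~ country * diet, data = dat)
b <- coef(fit)

# With +/-0.5 coding, the 'diet' coefficient is the average diet effect
diet_avg   <- unname(b["diet"])
# Diet effect at country = -0.5 (Czech) and at country = +0.5 (Italy)
diet_czech <- unname(b["diet"] - 0.5 * b["country:diet"])
diet_italy <- unname(b["diet"] + 0.5 * b["country:diet"])

c(average = diet_avg, czech = diet_czech, italy = diet_italy)
```

In this saturated two-by-two design, the average diet effect is exactly the mean of the two country-specific effects, which is what makes the main-effect row of the model summary directly interpretable.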
Features that showed a significant diet effect (average effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were then visualized using a forest plot. The plot displayed the estimated difference in the level of a given feature between vegan and omnivorous subjects, with 95% confidence intervals, separately for all three cohorts (Czech and Italian training cohorts, as well as the Czech external validation cohort), to evaluate whether the associations found between a feature and diet generalize to other datasets.
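The feature selection step can be sketched as follows. The report states only that an FDR threshold of 0.05 was used; the Benjamini-Hochberg method shown here is an assumption, and the p-values are simulated placeholders rather than study results.

```r
set.seed(7)
n_features <- 200
# Hypothetical p-values for the average diet effect:
# 20 features with signal, 180 null features
p_values <- c(rbeta(20, 0.5, 50), runif(180))
names(p_values) <- paste0("feature_", seq_len(n_features))

# FDR adjustment across all features (BH is an assumption, see lead-in)
q_values <- p.adjust(p_values, method = "BH")

# Features carried forward to the forest plot and external validation
selected <- names(q_values)[q_values < 0.05]
length(selected)
```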
1.2.2 Linear model for the effect of vegan diet duration
Next, we fit another series of linear models, this time modelling omics profiles using the following fixed-effect predictors: duration of vegan status (Diet_duration, scaled in tens of years), Country, their interaction (Diet_duration × Country), and Age:
\[ \text{CLR(pathway proportion)} = \alpha + \beta_{1} \times \text{Country} + \beta_{2} \times \text{Diet duration} + \beta_{3} \times (\text{Country}:\text{Diet duration}) + \beta_{4} \times \text{Age} + \epsilon \]
This analysis included only vegan participants; omnivores were excluded. The aim was to test whether omics features that differ between vegans and omnivores also vary within the vegan group itself, depending on how long participants have been vegan. In other words, we asked whether long-term vegans show stronger up- or down-regulation of diet-sensitive features compared to those who adopted the diet more recently. Because age correlates with the duration of the vegan diet, we also adjusted for age in the models.
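A minimal sketch of such a duration model on simulated data is shown below. All variable names (vegans, duration_years, clr_feature) and effect sizes are hypothetical; the point is only to show the scaling of duration to tens of years and the model structure.

```r
set.seed(11)
n <- 90
vegans <- data.frame(
  country = sample(c(-0.5, 0.5), n, replace = TRUE),  # -0.5 = Czech, 0.5 = Italy
  duration_years = runif(n, 0.5, 25)
)
# Age is simulated so that it correlates with diet duration
vegans$age <- 25 + vegans$duration_years + rnorm(n, sd = 6)
# Duration expressed in tens of years, so the slope is "per 10 years"
vegans$diet_duration <- vegans$duration_years / 10
vegans$clr_feature <- 0.3 * vegans$diet_duration + 0.01 * vegans$age +
  rnorm(n, sd = 0.5)

fit_duration <- lm(clr_feature ~ country * diet_duration + age, data = vegans)

# Age-adjusted change in the CLR feature per 10 years of vegan diet,
# averaged across countries (because country is coded +/-0.5)
duration_effect <- coef(fit_duration)["diet_duration"]
duration_effect
```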
1.2.3 Diet prediction
To explore whether omics profiles contain a consistent signal differentiating dietary groups, we fitted a separate elastic net logistic regression model for each dataset (metabolome, lipidome, microbiome, pathways) using the glmnet R package (Friedman, Tibshirani, and Hastie 2010), combined with a custom function for internal validation by bootstrap.
Due to the expected high collinearity among features, we limited the search for the mixing parameter \(\alpha\) to relatively small values (0, 0.2, 0.4). All features were standardized to have mean = 0 and standard deviation = 0.5 using the arm R package (Gelman and Su 2021) to ensure scale comparability.
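For a numeric feature, scaling to mean 0 and standard deviation 0.5 amounts to centering and dividing by two standard deviations. The base R sketch below illustrates this; it is a stand-in for the arm-based step used in the report, and the helper name standardize_half_sd and the simulated features are hypothetical.

```r
# Center each column and divide by 2 * SD, giving mean 0 and SD 0.5.
# 'center' and 'scale' default to the column means and SDs of x itself,
# but can be supplied explicitly (e.g. taken from another dataset).
standardize_half_sd <- function(x, center = colMeans(x),
                                scale = apply(x, 2, sd)) {
  centered <- sweep(x, 2, center, "-")
  sweep(centered, 2, 2 * scale, "/")
}

set.seed(3)
# Hypothetical features on very different scales
features <- cbind(lipid_a = rnorm(50, mean = 10, sd = 3),
                  lipid_b = rnorm(50, mean = 200, sd = 40))
features_std <- standardize_half_sd(features)

round(colMeans(features_std), 3)
round(apply(features_std, 2, sd), 3)
```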
Models were first evaluated for out-of-sample performance using an out-of-bag bootstrap approach (500 iterations). Predictive performance was quantified using the out-of-sample area under the ROC curve (AUC; internal validation), estimated via the pROC package (Robin et al. 2011). AUC was estimated for both cohorts combined and for each country separately. Final external validation was performed on an independent Czech cohort.
The modelling and validation procedure involved the following steps:
Training and internal validation
For each alpha (0, 0.2, 0.4), the cv.glmnet function was used to perform 10-fold cross-validation on the training data, using the default glmnet lambda sequence. For each alpha, we selected lambda.1se, corresponding to the most regularized model whose cross-validated deviance was within one standard error of the minimum. The alpha–lambda pair with the lowest deviance was chosen.
The final model was fitted on the full training set using glmnet with the selected alpha and lambda values.
To estimate internal performance, the entire modelling procedure (including tuning) was repeated 500 times on bootstrap resamples of the training data. For each resample, the optimal alpha and lambda were re-selected.
The model was trained on each bootstrap resample, and AUC was calculated on the out-of-bag subjects (i.e., those not included in that iteration’s training set). This yielded 500 out-of-sample AUC estimates.
The median, 2.5th, and 97.5th percentiles of the AUCs were reported to summarize internal out-of-sample performance and its 95% confidence interval.
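The training and internal validation steps above can be sketched as follows, under several stated assumptions: the data are simulated, the helper names tune_enet and auc_rank are hypothetical, the AUC is computed with a rank-based formula instead of pROC, the number of bootstrap iterations is reduced from 500 to 20 for speed, and the alpha–lambda pair is compared by the cross-validated deviance at lambda.1se (the report does not state at which lambda the deviances were compared). This is not the project's custom validation function.

```r
library(glmnet)

# Rank-based (Mann-Whitney) AUC; a base R stand-in for pROC
auc_rank <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# For each alpha, run 10-fold CV and take lambda.1se; keep the alpha whose
# CV deviance at lambda.1se is lowest (assumption, see lead-in)
tune_enet <- function(x, y, alphas = c(0, 0.2, 0.4)) {
  fits <- lapply(alphas, function(a) {
    cv.glmnet(x, y, family = "binomial", alpha = a, nfolds = 10)
  })
  dev_1se <- sapply(fits, function(cv) cv$cvm[which(cv$lambda == cv$lambda.1se)])
  best <- which.min(dev_1se)
  list(cv = fits[[best]], alpha = alphas[best],
       lambda = fits[[best]]$lambda.1se)
}

set.seed(123)
n <- 100
p <- 30
y <- rep(c(0, 1), each = n / 2)          # 1 = vegan (hypothetical labels)
x <- matrix(rnorm(n * p), n, p)
x[y == 1, 1:5] <- x[y == 1, 1:5] + 1     # five informative features

# Tuning on the full training data (basis of the final model)
final <- tune_enet(x, y)

# Out-of-bag bootstrap: re-tune on each resample, evaluate on left-out subjects
n_boot <- 20   # the report used 500 iterations
oob_auc <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  oob <- setdiff(seq_len(n), idx)
  fit <- tune_enet(x[idx, ], y[idx])
  pred <- predict(fit$cv, newx = x[oob, , drop = FALSE],
                  s = "lambda.1se", type = "response")
  auc_rank(as.numeric(pred), y[oob])
})

quantile(oob_auc, c(0.5, 0.025, 0.975))
```

Re-running the tuning inside every bootstrap iteration keeps hyperparameter selection inside the resampling loop, so the out-of-bag AUC is not flattered by tuning on the evaluation subjects.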
External validation
External validation cohort data were standardized using the means and standard deviations of the training cohort to ensure consistent scaling.
The final model (fitted once on the full training data) was applied to the external validation cohort to generate predicted probabilities of vegan status.
For each subject in the external cohort, the final model returned a predicted probability of being vegan (a continuous score between 0 and 1).
This predicted probability was treated as the discrimination variable: by varying the threshold across its full range, we obtained sensitivity and specificity pairs from which the ROC curve and its AUC were computed. This AUC reflects the model’s ability to generalize to an independent cohort.
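The external validation steps can be illustrated with a small simulated example. Everything here is hypothetical: the cohorts are simulated, the coefficient vector merely stands in for a fitted elastic net model, and the ROC curve is built by hand in base R rather than with pROC.

```r
set.seed(5)
p <- 3
make_cohort <- function(n) {
  label <- rep(c(0, 1), each = n / 2)         # 1 = vegan
  x <- matrix(rnorm(n * p, mean = 5, sd = 2), n, p)
  x[label == 1, 1] <- x[label == 1, 1] + 2    # feature 1 carries the signal
  list(x = x, label = label)
}
train <- make_cohort(60)
valid <- make_cohort(40)

# Standardize validation data with TRAINING means and SDs (target SD = 0.5)
mu <- colMeans(train$x)
s <- apply(train$x, 2, sd)
valid_std <- sweep(sweep(valid$x, 2, mu, "-"), 2, 2 * s, "/")

# Hypothetical coefficients standing in for the fitted final model
beta0 <- 0
beta <- c(2, 0, 0)
prob <- as.numeric(plogis(beta0 + valid_std %*% beta))

# ROC curve: sweep the threshold from high to low over all predicted values
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
sens <- sapply(thresholds, function(t) mean(prob[valid$label == 1] >= t))
fpr  <- sapply(thresholds, function(t) mean(prob[valid$label == 0] >= t))

# Area under the ROC curve by the trapezoidal rule
auc_external <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
auc_external
```

The trapezoidal area obtained this way coincides with the rank-based (Mann-Whitney) estimate of the probability that a randomly chosen vegan receives a higher predicted probability than a randomly chosen omnivore.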
The models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. They also offered a complementary perspective on which features are most clearly associated with diet.
2 Reproducibility
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=cs_CZ.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=cs_CZ.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=cs_CZ.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Prague
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] htmlwidgets_1.6.4 compiler_4.4.3 fastmap_1.2.0 cli_3.6.5
## [5] tools_4.4.3 htmltools_0.5.8.1 yaml_2.3.10 rmarkdown_2.27
## [9] knitr_1.50 jsonlite_2.0.0 xfun_0.52 digest_0.6.37
## [13] rlang_1.1.6 evaluate_1.0.4