Dietary intake, Nutritional status, and Health outcomes among Vegan, Vegetarian and Omnivore families: results from the observational study

Statistical report - methods and results description

Authors and affiliations

Marina Heniková¹,², Anna Ouřadová¹, Eliška Selinger¹,³, Filip Tichanek⁴, Petra Polakovičová⁴, Dana Hrnčířová², Pavel Dlouhý², Martin Světnička⁵, Eva El-Lababidi⁵, Jana Potočková¹, Tilman Kühn⁶, Monika Cahová⁴, Jan Gojda¹

¹ Department of Internal Medicine, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic.
² Department of Hygiene, Third Faculty of Medicine, Charles University, Prague, Czech Republic.
³ National Health Institute, Prague, Czech Republic.
⁴ Institute for Clinical and Experimental Medicine, Prague, Czech Republic.
⁵ Department of Pediatrics, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic.
⁶ Department of Epidemiology, MedUni, Vienna, Austria.

This is a statistical report of the study currently under review at Communications Medicine.

When using this code or data, cite the original publication:

TO BE ADDED

BibTex citation for the original publication:

TO BE ADDED

Original GitHub repository: https://github.com/filip-tichanek/kompas_clinical

Statistical reports can be found on the reports hub.

Data analysis is described in detail in the statistical methods report.

1 Introduction

This project is designed to evaluate and compare clinical outcomes across three distinct dietary strategy groups:

Vegans
Vegetarians
Omnivores

The dataset includes both adults and children, with data clustered within families.

1.1 Main Questions

The study addresses the following key questions:

Q1. Do clinical outcomes vary significantly across different diet strategies?

Q2. Beyond diet group, which factors (e.g., sex, age, breastfeeding status for children, or supplementation when applicable) most strongly influence clinical outcomes? How correlated (“clustered”) are these characteristics within the same family?

Q3. Could the clinical characteristics effectively discriminate between different diet groups?

1.2 Statistical Methods

For full methodological details, see the statistical methods report. In brief:

Robust linear mixed-effects models (rLME) were used to estimate adjusted differences between diet groups (Q1) and assess the importance of other variables (Q2), including how much clinical characteristics tend to cluster within families. Covariates included age, sex, breastfeeding status for children, and relevant supplementation factors where applicable.
Elastic net logistic regression was employed to answer Q3, evaluating whether clinical characteristics provide a strong overall signal distinguishing between diet groups, incorporating a predictive perspective.

All analyses were conducted separately for adults and children.

1.3 Statistical methods in the Supplementary Materials

All statistical analyses were conducted in R, version 4.4.3 (2025-02-28) (R Core Team 2023). Summary statistics by dietary group, stratified by age (children below 3 years, children aged 3 years and above, adults), are presented as medians (25th and 75th percentiles) for continuous variables and as counts (%) for categorical variables. Data were visualized mainly with the ‘ggplot2’ (Wickham 2016) and ‘ggpubr’ (Kassambara 2020) R packages. The whole statistical procedure, including R code, assumption checks, and additional results, can be found in the online statistical report: https://filip-tichanek.github.io/kompas_clinical/.

Differences in numerical clinical outcomes were assessed with robust linear mixed-effects (rLME) models fitted with the ‘robustlmm’ package (Koller 2016), separately for children and adults. Models were adjusted for key confounders that were pre-specified from domain knowledge and the literature, based on their known association with clinical characteristics and on observed differences in their distribution among dietary groups. This approach was chosen to avoid data-driven selection, which can bias estimates through overfitting or by introducing collider bias. Specifically, we included age (log2-transformed for children), sex, and breastfeeding-related covariates (exclusive breastfeeding [0/1], partial breastfeeding [0/1], and breastfeeding duration in months), along with a random intercept for family to account for within-family dependency. Where relevant (e.g., for biogenic elements and vitamins), supplementation status was also included. For children’s morphological characteristics, birth weight was an additional covariate. If diagnostic checks indicated non-normality or heteroscedasticity of residuals, we log2-transformed the outcome where this helped.
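
As an illustration only, the child model described above could be written as the following random-intercept model for outcome y of individual i in family j. The notation is an assumption of this sketch, not taken from the original analysis code: VG and VN are indicator variables for the vegetarian and vegan groups (omnivores are assumed to be the reference), b collects the three breastfeeding-related covariates, and u_j is the family random intercept. The normal distributions are the nominal assumptions; the robust fit is meant to limit the influence of observations that deviate from them.

```latex
y_{ij} = \beta_0
       + \beta_1\,\mathrm{VG}_{ij}
       + \beta_2\,\mathrm{VN}_{ij}
       + \beta_3 \log_2(\mathrm{age}_{ij})
       + \beta_4\,\mathrm{sex}_{ij}
       + \boldsymbol{\gamma}^{\top}\mathbf{b}_{ij}
       + u_j + \varepsilon_{ij},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

In this notation, the adjusted diet group differences correspond to the coefficients β1 and β2 (and their contrast).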

We likewise fitted conventional linear mixed-effects models (LME) with ‘lme4’ (Bates et al. 2015) to evaluate the importance of each variable group. Specifically, we compared models with and without each variable group (diet, sex, age, breastfeeding-related variables in the case of children, and the random effect of family) using the Akaike Information Criterion (AIC), which estimates how well a model is expected to predict new data (lower is better). An increase in AIC when a variable group was removed indicates that the group contributes to the predictive performance of the model.

For (r)LME, observations with missing outcome data were excluded from analysis, assuming missingness was unrelated to diet group. Thirteen missing values for partial breastfeeding were imputed using a regression-based approach, which in effect assigned 0 if a child was <1.77 years old and 1 otherwise, reflecting the strong relationship of this variable with age.
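
The resulting assignment rule for partial breastfeeding can be sketched in a few lines of base R. The data frame and column names (`children`, `age_years`, `partial_bf`) are hypothetical, and only the final age cutoff from the text is reproduced, not the regression step that produced it.

```r
# Hypothetical toy data: partial breastfeeding (0/1) with missing values
children <- data.frame(
  age_years  = c(0.8, 1.5, 1.77, 2.4, 6.0),
  partial_bf = c(1, NA, NA, NA, 0)
)

cutoff <- 1.77  # age threshold reported in the text

# Fill only the missing entries: 0 below the cutoff, 1 otherwise
is_missing <- is.na(children$partial_bf)
children$partial_bf[is_missing] <-
  ifelse(children$age_years[is_missing] < cutoff, 0, 1)

children
```

Observed values are left untouched; only the missing entries are filled in.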

The modeling process included:

Fitting a random-intercept generalized additive mixed model (GAMM) with age as a non-linear predictor using ‘mgcv’ (Wood 2011).
Checking residuals and refitting with a log2-transformed outcome when needed.
Testing linear vs. quadratic age effects in subsequent (r)LME models if nonlinearity was indicated.
Applying robust LME modeling.
Fitting further mixed-effects models with the ‘lme4’ package, excluding specific covariates or the random effect of family, to evaluate their importance for predicting each clinical outcome. The AIC was used to measure covariate importance in terms of estimated out-of-sample predictive accuracy. A decrease in AIC after a covariate’s inclusion suggests an improvement in the model’s predictive capability, indicating the importance of that variable.
When the AIC was reduced by including the random effect of family, we also calculated the intraclass correlation (ICC), which shows how strongly observations are correlated within a family after controlling for the other variables, and thus the strength of within-family clustering.
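
The last two steps can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration: the variable names, the simulated effect sizes, and the use of maximum likelihood (REML = FALSE) so that AIC values are comparable across models with different fixed effects. The original analysis code may differ in all of these respects.

```r
library(lme4)

set.seed(1)
# Simulated data (hypothetical): 60 families, 3 members each, diet shared within family
n_fam  <- 60
family <- factor(rep(seq_len(n_fam), each = 3))
diet   <- factor(rep(sample(rep(c("OM", "VG", "VN"), each = 20)), each = 3))
dat <- data.frame(
  family = family,
  diet   = diet,
  age    = runif(n_fam * 3, 20, 60),
  sex    = factor(sample(c("F", "M"), n_fam * 3, replace = TRUE))
)
fam_effect <- rnorm(n_fam, sd = 1)           # shared family-level deviation
diet_shift <- c(OM = 0, VG = -1, VN = -2)    # assumed true diet effects
dat$y <- 5 + unname(diet_shift[as.character(dat$diet)]) +
  fam_effect[as.integer(dat$family)] + rnorm(nrow(dat), sd = 1)

# Full model and models lacking one component
full      <- lmer(y ~ diet + age + sex + (1 | family), data = dat, REML = FALSE)
no_diet   <- lmer(y ~ age + sex + (1 | family), data = dat, REML = FALSE)
no_family <- lm(y ~ diet + age + sex, data = dat)

# Positive values: AIC rises when the component is removed, so it helps prediction
delta_aic <- c(
  diet   = AIC(no_diet) - AIC(full),
  family = AIC(no_family) - AIC(full)
)

# ICC: share of the residual (covariate-adjusted) variance attributable to family
vc         <- as.data.frame(VarCorr(full))
var_family <- vc$vcov[vc$grp == "family"]
var_resid  <- vc$vcov[vc$grp == "Residual"]
icc        <- var_family / (var_family + var_resid)

delta_aic
icc
```

In this simulation the family and residual variances are both set to 1, so the estimated ICC should land near 0.5, up to sampling noise.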

We visualized results using volcano plots (adjusted standardized differences between diet groups) and heatmaps (AIC changes after covariate removal). Results with P < 0.05 (significance level α = 0.05) were considered significant. Raw P-values (not corrected for multiple comparisons) are reported to maximize sensitivity for a potential risk associated with the vegan/vegetarian diet (omitting a true risk was considered more serious than allowing a few false positives). However, FDR-corrected P-values and confidence intervals for the diet group differences can be found in the online statistical report: https://filip-tichanek.github.io/kompas_clinical/.
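
To show how raw and FDR-corrected P-values can lead to different conclusions, here is a small base R sketch. The P-values are made up, and the Benjamini-Hochberg method is an assumption: the text does not state which FDR procedure was used.

```r
# Hypothetical raw P-values for six clinical outcomes
p_raw <- c(0.001, 0.012, 0.03, 0.04, 0.20, 0.65)

# Benjamini-Hochberg adjustment (assumed FDR method)
p_fdr <- p.adjust(p_raw, method = "BH")

data.frame(
  p_raw,
  p_fdr,
  sig_raw = p_raw < 0.05,
  sig_fdr = p_fdr < 0.05
)
```

Adjusted P-values are never smaller than the raw ones, so reporting raw values keeps more outcomes flagged, which matches the stated preference for sensitivity over strict false-positive control.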

For binary outcomes, we applied logistic generalized additive mixed-effects models (GAMM), reporting odds ratios (OR).

1.3.1 Diet prediction

To assess the predictive power of clinical outcomes on diet strategy, we employed Elastic Net logistic regression using the ‘glmnet’ R package (Friedman, Tibshirani, and Hastie 2010).

For both adults and children, we first fitted a baseline model incorporating basic subject characteristics (age, sex, and, for children, breastfeeding status) as predictors. We then expanded the analysis with a reduced model that included these basic characteristics along with diverse clinical outcomes not primarily affected by supplementation. Finally, we fitted a full model, incorporating all clinical characteristics, including those strongly influenced by supplementation.

Missing predictor values were imputed using predictive mean matching (single stochastic imputation) with the ‘mice’ R package (Buuren and Groothuis-Oudshoorn 2011). All numerical predictors were standardized by dividing by 2 standard deviations using the ‘arm’ R package (Gelman and Su 2022) to ensure scale comparability.
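
The 2-SD scaling can be sketched in base R as follows. The helper `rescale_2sd` and the simulated `ferritin` variable are hypothetical; the helper is intended to mirror what ‘arm’ does for a continuous predictor (centering on the mean, then dividing by two standard deviations), but it is not the package function itself.

```r
# Center on the mean and divide by two standard deviations
rescale_2sd <- function(x) {
  (x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE))
}

set.seed(1)
ferritin <- rlnorm(50, meanlog = 3.5, sdlog = 0.6)  # simulated predictor
ferritin_z <- rescale_2sd(ferritin)

c(mean = mean(ferritin_z), sd = sd(ferritin_z))
```

After rescaling, a one-unit change in the predictor corresponds to a change of two standard deviations on the original scale, which puts all numerical predictors on a common footing before penalization.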

The predictive performance of the models was evaluated based on their ability to discriminate between diet groups in out-of-sample data, using the area under the ROC curve (AUC) as the measure of discriminatory capacity (estimated with the ‘pROC’ R package (Robin et al. 2011)). To achieve this, we applied a cluster bootstrap resampling method (500 simulations), maintaining family-wise integrity in training and testing sets (i.e., all members of a single family were included in either the training or testing sample in each iteration) to prevent data leakage and overestimation of accuracy. This validation procedure was implemented using custom functions.

Estimated accuracies (AUC values) were compared against the baseline model to assess whether the more complex models provided a significant AUC gain. A model was considered to offer a significant improvement if the lower bound of the 95% confidence interval for the difference in AUC (complex model minus baseline model) was above zero.

The process of building the elastic net models and estimating accuracy involved the following steps:

The cv.glmnet function from the ‘glmnet’ package was used to determine the optimal alpha and lambda values (lambda.1se was selected for use).
The glmnet function from the ‘glmnet’ package was used to fit the model using all available data and the hyperparameter values optimized in the previous step.
Data were resampled 500 times, with all members of a single family allocated to the resampled dataset together to maintain family unit integrity. Hyperparameters were re-optimized for each resample.
The glmnet function was applied with resampled data for training. Data of families that were NOT present in the i-th (resampled) dataset were used for estimation of out-of-sample AUC (validation). This was done for all resamples, totaling 500 iterations.
The average AUC and its 2.5th and 97.5th percentiles were reported as the out-of-sample AUC and the bounds of its 95% confidence interval.
The difference between the out-of-sample AUCs of the baseline and the more complex model was calculated for each data resample, yielding the average difference in AUC (expressed as AUC_gain) and its 95% CI.
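
The resampling logic above can be sketched in base R. This is not the authors' custom code, and it deliberately simplifies: plain logistic regression (`glm`) stands in for the elastic net, so there is no hyperparameter tuning; a rank-based formula replaces ‘pROC’ for the AUC; the data are simulated; and 200 resamples are used instead of 500 to keep it quick. All object names are hypothetical.

```r
set.seed(42)

# Simulated data: 80 families with 1-4 members; diet is shared within a family
n_fam     <- 80
fam_size  <- sample(1:4, n_fam, replace = TRUE)
family    <- rep(seq_len(n_fam), times = fam_size)
vegan_fam <- rbinom(n_fam, 1, 0.5)
dat <- data.frame(
  family = family,
  vegan  = vegan_fam[family],
  age    = runif(length(family), 20, 60),
  sex    = rbinom(length(family), 1, 0.5)
)
# Hypothetical clinical marker that is lower in vegans
dat$marker <- rnorm(nrow(dat), mean = -1.5 * dat$vegan)

# Rank-based AUC (Mann-Whitney formulation)
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Fit on the training families, evaluate on the held-out families
oos_auc <- function(formula, train, test) {
  fit <- glm(formula, family = binomial, data = train)
  auc_rank(predict(fit, newdata = test, type = "response"), test$vegan)
}

n_boot  <- 200
fam_ids <- unique(dat$family)

res <- t(replicate(n_boot, {
  # Resample whole families with replacement
  drawn <- sample(fam_ids, replace = TRUE)
  rows  <- unlist(lapply(drawn, function(f) which(dat$family == f)))
  train <- dat[rows, ]
  # Families never drawn form the validation set, so no family is split
  test  <- dat[!dat$family %in% drawn, ]
  base  <- oos_auc(vegan ~ age + sex, train, test)
  full  <- oos_auc(vegan ~ age + sex + marker, train, test)
  c(baseline = base, full = full, gain = full - base)
}))

# Mean and percentile-based 95% interval for each AUC and for the gain
auc_summary <- apply(res, 2, function(x) {
  c(mean = mean(x), quantile(x, c(0.025, 0.975)))
})
auc_summary

# Decision rule from the text: the gain counts as significant
# if the lower bound of its 95% interval is above zero
significant_gain <- unname(auc_summary["2.5%", "gain"] > 0)
```

Because the gain is computed within each resample, the baseline and the more complex model are always compared on the same training and validation families.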

1.4 Statistical methods in the main manuscript

All analyses were performed in R (v4.4.3) (R Core Team 2023). Clinical characteristics across diet groups were summarized separately for children <3 years, children ≥3 years, and adults, reporting medians (25th and 75th percentiles) for continuous variables and counts (%) for categorical variables.

Differences in clinical outcomes were assessed using robust linear mixed-effects models with the ‘robustlmm’ R package (Koller 2016), adjusting for key prespecified variables (age, sex, and, for children, breastfeeding-related variables). A random effect for family was included to account for within-family correlations. To assess the importance of each variable, we also fitted standard linear mixed-effects models with the ‘lme4’ R package (Bates et al. 2015) and compared models with and without specific variable groups using the Akaike Information Criterion (AIC), which reflects how much a given variable contributes to outcome prediction. When necessary, outcomes were log2-transformed, and observations with missing outcomes were excluded (assuming missingness unrelated to diet). Thirteen missing values in partial breastfeeding were imputed based on age.

To evaluate whether clinical characteristics could predict diet group, we applied Elastic Net logistic regression using ‘glmnet’ R package (Friedman, Tibshirani, and Hastie 2010), comparing (i) a baseline model (age, sex, and, for children, breastfeeding status), (ii) a reduced model (adding clinical variables not strongly influenced by supplementation), and (iii) a full model (including all clinical characteristics). Missing predictor values were imputed using single stochastic imputation with ‘mice’ R package (Buuren and Groothuis-Oudshoorn 2011). Out-of-sample predictive accuracy was assessed via area under the ROC curve using ‘pROC’ R package (Robin et al. 2011), with 500 cluster bootstrap resamples that kept entire families together in training or testing sets. Differences in AUC, along with 95% confidence intervals, were used to determine whether more complex models provided significant improvements in predictive performance.

Detailed descriptions of the modeling approaches and assumptions are provided in the Supplementary Materials. We used a significance level of α = 0.05 (P < 0.05) for the robust linear mixed-effects models. P-values were not corrected for multiple comparisons to maximize sensitivity for a potential risk associated with the vegan/vegetarian diet (omitting a true risk was considered more serious than allowing a few false positives). However, FDR-corrected P-values and confidence intervals for the diet group differences can be found in the online statistical report, along with all relevant R code: https://filip-tichanek.github.io/kompas_clinical/.

2 Reproducibility

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Prague
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] htmlwidgets_1.6.4 compiler_4.4.3    fastmap_1.2.0     cli_3.6.3        
##  [5] tools_4.4.3       htmltools_0.5.8.1 yaml_2.3.5        rmarkdown_2.27   
##  [9] knitr_1.48        jsonlite_1.8.8    xfun_0.46         digest_0.6.37    
## [13] rlang_1.1.4       evaluate_1.0.0

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1). https://doi.org/10.18637/jss.v067.i01.

Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.

Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1). https://doi.org/10.18637/jss.v033.i01.

Gelman, Andrew, and Yu-Sung Su. 2022. “arm: Data Analysis Using Regression and Multilevel/Hierarchical Models.” https://CRAN.R-project.org/package=arm.

Kassambara, Alboukadel. 2020. “ggpubr: ’ggplot2’ Based Publication Ready Plots.” https://CRAN.R-project.org/package=ggpubr.

Koller, Manuel. 2016. “robustlmm: An R Package for Robust Estimation of Linear Mixed-Effects Models.” Journal of Statistical Software 75 (6). https://doi.org/10.18637/jss.v075.i06.

R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77.

Wickham, Hadley. 2016. “ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.

Wood, S. N. 2011. “Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (1): 3–36.