Vegan-specific signature implies healthier metabolic profile: findings from diet-related multi-omics observational study based on different European populations

Statistical report for microbiome analysis (SGB level, prevalence > 30% in training dataset)

Authors and affiliations

Anna Ouradova^1,*, Giulio Ferrero^2,3,*, Miriam Bratova⁴, Nikola Daskova⁴, Alena Bohdanecka^4,5, Klara Dohnalova⁶, Marie Heczkova⁴, Karel Chalupsky⁶, Maria Kralova^7,8, Marek Kuzma⁹, István Modos⁴, Filip Tichanek⁴, Lucie Najmanova⁹, Barbara Pardini¹⁰, Helena Pelantová⁹, Sonia Tarallo¹⁰, Petra Videnska¹¹, Jan Gojda^1,#, Alessio Naccarati^10,#, Monika Cahova^4,#

^* These authors have contributed equally to this work and share first authorship
^# These authors have contributed equally to this work and share last authorship

¹ Department of Internal Medicine, Kralovske Vinohrady University Hospital and Third Faculty of Medicine, Charles University, Prague, Czech Republic
² Department of Clinical and Biological Sciences, University of Turin, Turin, Italy
³ Department of Computer Science, University of Turin, Turin, Italy
⁴ Institute for Clinical and Experimental Medicine, Prague, Czech Republic
⁵ First Faculty of Medicine, Charles University, Prague, Czech Republic
⁶ Czech Centre for Phenogenomics, Institute of Molecular Genetics of the Czech Academy of Sciences, Prague, Czech Republic
⁷ Ambis University, Department of Economics and Management, Prague, Czech Republic
⁸ Department of Informatics, Brno University of Technology, Brno, Czech Republic
⁹ Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic
¹⁰ Italian Institute for Genomic Medicine (IIGM), c/o IRCCS Candiolo, Turin, Italy
¹¹ Mendel University, Department of Chemistry and Biochemistry, Brno, Czech Republic

This is a statistical report of the study A vegan diet signature from a multi-omics study on different European populations is related to favorable metabolic outcomes that is currenlty under review

When using this code or data, cite the original publication:

TO BE ADDED

BibTex citation for the original publication:

TO BE ADDED

Original GitHub repository: https://github.com/filip-tichanek/ItCzVegans

Statistical reports can be found on the reports hub.

Data analysis is described in detail in the statistical methods report.

1 Introduction

This project explores potential signatures of a vegan diet across the microbiome, metabolome, and lipidome. We used data from healthy vegan and omnivorous human subjects in two countries (Czech Republic and Italy), with subjects grouped by Country and Diet, resulting in four distinct groups.

To assess the generalizability of these findings, we validated our results with an independent cohort from the Czech Republic for external validation.

1.1 Statistical Methods

The statistical modeling approach is described in detail in this report. Briefly, the methods used included:

Multivariate analysis: We conducted multivariate analyses (PERMANOVA, PCA, correlation analyses) to explore the effects of diet, country, and their possible interaction (diet : country) on the microbiome, lipidome, and metabolome compositions in an integrative manner. This part of the analysis is not available on the GitHub page, but the code will be provided upon request.
Linear models: Linear models were applied to estimate the effects of diet, country, and their interaction (diet:country) on individual lipids, metabolites, bacterial taxa and pathways (“features”). Features that significantly differed between diet groups (based on the estimated average conditional effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) were further examined in the independent external validation cohort to assess whether these associations were reproducible. Next, we fit linear models restricted to vegan participants to test whether omics profiles varied with the duration of vegan diet. Fixed-effect predictors were diet duration (per 10 years), country, their interaction, and age (included due to correlation with diet duration).
Predictive models (elastic net): We employed elastic net logistic regression (via the glmnet R package) to predict vegan status based on metabolome, lipidome, microbiome and pathways data (one model per dataset; four models in total). We considered three combinations of Lasso and Ridge penalties (alpha = 0, 0.2, 0.4). For each alpha, we selected the penalty strength (λ_1se) using 10-fold cross-validation. This value corresponds to the most regularized model whose performance was within one standard error of the minimum deviance. The alpha–lambda pair with the lowest deviance was chosen to fit the final model, whose coefficients are reported.
To estimate model performance, we repeated the full modeling procedure (including hyperparameter tuning) 500 times on bootstrap resamples of the training data. In each iteration, the model was trained on the resampled data and evaluated on the out-of-bag subjects (i.e., those not included in the training set in that iteration). The mean, and 2.5th, and 97.5th percentiles of the resulting ROC-AUC values represent the estimated out-of-sample AUC and its 95% confidence interval.
Finally, the final model was applied to an independent validation cohort to generate predicted probabilities of vegan status. These probabilities were then used to assess external discrimination between diet groups (ROC-AUC in the independent validation cohort). The elastic net models were not intended for practical prediction, but to quantify the strength of the signal separating the dietary groups, with its uncertainty, by using all features of a given dataset jointly. It also offered a complementary perspective on which features are most clearly associated with diet

2 Initiation

2.1 Set home directory

Open code

setwd('/home/ticf/secured_data/GitRepo/ticf/478_MOCA_italian')

2.2 Upload initiation file

Open code

source('478_initiation.R')

3 Data

3.1 Upload all original data

3.1.1 Training set

3.1.1.1 Connect metadata from lipidom table

Open code

training_metadata <- read.xlsx('gitignore/data/lipidome_training_cohort.xlsx') %>% 
  select(Sample, Country, Diet) %>% 
  mutate(ID = Sample)

3.1.1.2 Connect training data

Open code

data_microbiome_original_raw  <- read.table(
  'gitignore/data/0_Data_Metaphlan4_SGB_subset.txt') 

colnames(data_microbiome_original_raw) <- data_microbiome_original_raw[1,]

data_microbiome_original_raw <- data_microbiome_original_raw %>% 
  t() %>% 
  data.frame()

colnames(data_microbiome_original_raw) <- data_microbiome_original_raw[1,]

data_microbiome_original_raw <- data_microbiome_original_raw[-1, ] %>% 
  left_join(training_metadata, by = 'ID') %>% 
  mutate(Data = 'valid', Sample = ID) %>% 
  select(Sample, Data, Diet, Country, everything()) %>% 
  select(-ID)

dim(data_microbiome_original_raw)
## [1] 166 684

3.1.2 Validation set

3.1.2.1 Get metadata from lipidom table

Open code

data_lipids_validation <- read.xlsx('gitignore/data/lipidome_validation_cohort.xlsx') %>% 
  mutate(ID = X1) %>% 
  select(ID, X2)

3.1.2.2 Connect validation data

Open code

data_microbiome_validation_raw <- read.table(
  'gitignore/data/0_Data_Metaphlan4_SGB_subset_validation.txt') 

colnames(data_microbiome_validation_raw) <- data_microbiome_validation_raw[1,]

data_microbiome_validation_raw <- data_microbiome_validation_raw %>% 
  t() %>% 
  data.frame()

colnames(data_microbiome_validation_raw) <- data_microbiome_validation_raw[1,]

data_microbiome_validation_raw <- data_microbiome_validation_raw[-1, ] %>% 
  mutate(
    ID= paste0("K", gsub("\\..*", "", trimws(ID)))) %>% 
  left_join(data_lipids_validation, by = 'ID') %>% 
  mutate(Data = 'valid', Sample = ID, Diet = X2) %>% 
  select(Sample, Data, Diet, everything()) %>% 
  select(-ID, -X2)

dim(data_microbiome_validation_raw)
## [1] 103 683

3.1.3 Get center-log transformed value

Open code

set.seed(478)

## Training data
metadata <- data_microbiome_original_raw[, c("Sample", "Country", "Diet")]

bacteria_d <- data_microbiome_original_raw[, -c(1:4)] %>% 
  mutate(across(everything(), as.numeric)) %>% 
  select(where(~ mean(. != 0) >= 0.5)) 

dim(data_microbiome_original_raw[, -c(1:4)])
## [1] 166 680
dim(bacteria_d)
## [1] 166 159

rel_taxons <- c(colnames(bacteria_d))

bacteria_data <- bacteria_d / rowSums(bacteria_d)
dim(bacteria_data)
## [1] 166 159
    
bacteria_data <- lrSVD(bacteria_data, 
                        label = 0, 
                        dl = NULL,
                        z.warning = 0.9,
                        z.delete = FALSE,
                       ncp = 1)

clr_bacteria_data <- clr(bacteria_data)
data_microbiome_original <- cbind(metadata, clr_bacteria_data)

if(file.exists('gitignore/data_microbiome_SGB50_training_impCLR.csv') == FALSE){
  write.csv(data_microbiome_original ,
            'gitignore/data_microbiome_SGB50_training_impCLR.csv')
}

## Show variances of CLR proportions across samples
data_variance <- data_microbiome_original %>%
  rowwise() %>%   
  mutate(variance = var(c_across(-(Sample:Diet)))) %>%  
  ungroup() %>%       
  select(Sample, variance)  

## Look at distribution
hist(data_variance$variance)

Open code


## Show extreme samples

## Validation data
metadata <- data_microbiome_validation_raw[, c("Sample", "Data", "Diet")]
bacteria_d <- data_microbiome_validation_raw[, -(1:3)] %>% 
  mutate(across(everything(), as.numeric)) %>% 
  select(all_of(rel_taxons))

bacteria_data <- bacteria_d / rowSums(bacteria_d)
min(colSums(bacteria_data))
## [1] 0.002750605
    
bacteria_data <- lrSVD(bacteria_data, 
                        label = 0, 
                        dl = NULL,
                        z.warning = 0.9, 
                        z.delete = FALSE,
                       ncp = 1)



clr_bacteria_data <- clr(bacteria_data)
data_microbiome_validation <- cbind(metadata, clr_bacteria_data)

## Add Diet for K284 which has the diet missing
data_microbiome_validation[
  which(data_microbiome_validation$Sample == 'K284'), 'Diet'
  ] <- 'VEGAN'

data_microbiome_validation$`Wujia_chipingensis|SGB5111`
##   [1] -2.92558573 -2.57877869 -2.48166039 -2.63933531 -2.29228727 -2.46890790
##   [7] -1.07508612 -2.47408922 -1.64066899 -2.75324924 -2.90470144 -0.44780088
##  [13] -2.77455915 -2.50439920 -2.65599579 -0.59288175  2.74812720 -1.03660838
##  [19] -2.79939457 -2.83753105 -2.96068965 -2.87819558 -3.03439229 -0.78436806
##  [25] -2.89675146 -0.63992092 -2.13606771 -2.34290003 -2.65403706  0.34352666
##  [31]  0.55104714  2.38939173  2.79294541 -2.66285398  0.32251211  0.43085198
##  [37]  4.75509724 -2.75336263 -0.97936846 -2.87671255 -2.85466074 -2.93907679
##  [43]  3.66700444 -2.93560948  1.59751781 -3.11993575 -2.77173744 -3.34986124
##  [49] -0.44193727  3.41349541  3.15512582  4.23053587 -2.43190972 -3.26389343
##  [55] -3.15221786 -2.89650842  4.18824162 -2.48300190 -2.23272938 -2.84640123
##  [61] -2.70142402  2.74698827  4.05797161  2.24121888 -2.67484213  4.04414696
##  [67] -1.40774567 -2.12658921 -1.43491687 -2.67698574 -2.78569574  1.19091522
##  [73]  4.92999406 -2.84243779 -3.02128704  2.84556112 -2.65215215 -3.05042733
##  [79] -3.09907790  3.53068758 -2.42651658 -0.75222217 -2.37529435  3.07879163
##  [85] -0.02028561  2.23415945 -2.54370243 -1.70031458 -2.55838450 -2.37362781
##  [91] -2.57418614 -1.39422115 -1.84617881 -2.82390333  3.42291000 -2.26742789
##  [97] -3.02312499 -2.63266669 -2.47838062 -2.38627082  2.15881645  3.58870849
## [103] -2.14898880



if(file.exists('gitignore/data_microbiome_SGB50_validation_impCLR.csv') == FALSE){
  write.csv(data_microbiome_validation,
            'gitignore/data_microbiome_SGB50_validation_impCLR.csv')
}

3.1.4 Merge training and validation dataset

Open code


cbind(
  colnames(data_microbiome_original), 
  colnames(data_microbiome_validation))
##        [,1]                                                
##   [1,] "Sample"                                            
##   [2,] "Country"                                           
##   [3,] "Diet"                                              
##   [4,] "Bacteroides_stercoris|SGB1830"                     
##   [5,] "Alistipes_putredinis|SGB2318"                      
##   [6,] "Candidatus_Cibionibacter_quicibialis|SGB15286"     
##   [7,] "Bacteroides_uniformis|SGB1836"                     
##   [8,] "GGB9602_SGB15031|SGB15031"                         
##   [9,] "Phocaeicola_vulgatus|SGB1814"                      
##  [10,] "Faecalibacterium_prausnitzii|SGB15316"             
##  [11,] "Lachnospiraceae_bacterium_CLA_AA_H244|SGB4993"     
##  [12,] "Sutterella_wadsworthensis|SGB9283"                 
##  [13,] "Oscillibacter_sp_ER4|SGB15254"                     
##  [14,] "Parabacteroides_distasonis|SGB1934"                
##  [15,] "Parabacteroides_merdae|SGB1949"                    
##  [16,] "Oscillibacter_valericigenes|SGB15053"              
##  [17,] "Brotolimicola_acetigignens|SGB4914"                
##  [18,] "Alistipes_shahii|SGB2295"                          
##  [19,] "Alistipes_onderdonkii|SGB2303"                     
##  [20,] "Faecalibacterium_SGB15346|SGB15346"                
##  [21,] "Faecalibacterium_prausnitzii|SGB15332"             
##  [22,] "Gemmiger_formicilis|SGB15300"                      
##  [23,] "GGB33469_SGB15236|SGB15236"                        
##  [24,] "Alistipes_communis|SGB2290"                        
##  [25,] "Escherichia_coli|SGB10068"                         
##  [26,] "Faecalibacterium_prausnitzii|SGB15318"             
##  [27,] "Faecalibacterium_prausnitzii|SGB15342"             
##  [28,] "Fusicatenibacter_saccharivorans|SGB4874"           
##  [29,] "Vescimonas_coprocola|SGB15089"                     
##  [30,] "Oscillibacter_sp_MSJ_31|SGB15249"                  
##  [31,] "Bacteroides_caccae|SGB1877"                        
##  [32,] "Oscillospiraceae_bacterium_CLA_AA_H250|SGB14861"   
##  [33,] "Bifidobacterium_longum|SGB17248"                   
##  [34,] "Alistipes_senegalensis|SGB2296"                    
##  [35,] "Roseburia_intestinalis|SGB4951"                    
##  [36,] "Dysosmobacter_welbionis|SGB15078"                  
##  [37,] "Eubacterium_rectale|SGB4933"                       
##  [38,] "Alistipes_ihumii|SGB2328"                          
##  [39,] "Roseburia_faecis|SGB4925"                          
##  [40,] "GGB9699_SGB15216|SGB15216"                         
##  [41,] "Oscillospiraceae_bacterium|SGB15225"               
##  [42,] "Faecalibacterium_prausnitzii|SGB15317"             
##  [43,] "GGB9667_SGB15164|SGB15164"                         
##  [44,] "Lachnospira_pectinoschiza|SGB5075"                 
##  [45,] "Anaerotignum_faecicola|SGB5190"                    
##  [46,] "Hydrogenoanaerobacterium_saccharovorans|SGB15350"  
##  [47,] "Roseburia_hominis|SGB4936"                         
##  [48,] "Ruminococcus_bicirculans|SGB4262"                  
##  [49,] "Bilophila_wadsworthia|SGB15452"                    
##  [50,] "Odoribacter_splanchnicus|SGB1790"                  
##  [51,] "GGB9715_SGB15265|SGB15265"                         
##  [52,] "GGB9730_SGB15291|SGB15291"                         
##  [53,] "Bifidobacterium_adolescentis|SGB17244"             
##  [54,] "Bacteroides_ovatus|SGB1871"                        
##  [55,] "GGB9614_SGB15049|SGB15049"                         
##  [56,] "Simiaoa_sunii|SGB4910"                             
##  [57,] "Barnesiella_intestinihominis|SGB1965"              
##  [58,] "Agathobaculum_butyriciproducens|SGB14993"          
##  [59,] "Lachnospira_eligens|SGB5082"                       
##  [60,] "Lacrimispora_amygdalina|SGB4716"                   
##  [61,] "Akkermansia_muciniphila|SGB9226"                   
##  [62,] "Clostridiales_bacterium|SGB15143"                  
##  [63,] "Ruminococcus_bromii|SGB4285"                       
##  [64,] "Gemmiger_formicilis|SGB15299"                      
##  [65,] "Clostridiaceae_bacterium|SGB4269"                  
##  [66,] "GGB9760_SGB15373|SGB15373"                         
##  [67,] "Clostridiaceae_bacterium_AF18_31LB|SGB4767"        
##  [68,] "Bacteroides_cellulosilyticus|SGB1844"              
##  [69,] "Faecalibacterium_sp_CLA_AA_H233|SGB15315"          
##  [70,] "Ruthenibacterium_lactatiformans|SGB15271"          
##  [71,] "Lawsonibacter_asaccharolyticus|SGB15154"           
##  [72,] "Clostridium_sp_AF20_17LB|SGB4714"                  
##  [73,] "Clostridium_fessum|SGB4705"                        
##  [74,] "Clostridium_sp_AF36_4|SGB4644"                     
##  [75,] "GGB13404_SGB14252|SGB14252"                        
##  [76,] "GGB9627_SGB15081|SGB15081"                         
##  [77,] "GGB9707_SGB15229|SGB15229"                         
##  [78,] "Parasutterella_excrementihominis|SGB9262"          
##  [79,] "Roseburia_inulinivorans|SGB4940"                   
##  [80,] "Phocaeicola_dorei|SGB1815"                         
##  [81,] "Oscillibacter_valericigenes|SGB15124"              
##  [82,] "Clostridiaceae_bacterium|SGB4770"                  
##  [83,] "Bacteroides_thetaiotaomicron|SGB1861"              
##  [84,] "Flavonifractor_plautii|SGB15132"                   
##  [85,] "Fusicatenibacter_sp_CLA_AA_H277|SGB4780"           
##  [86,] "Lachnospiraceae_bacterium_AM48_27BH|SGB4706"       
##  [87,] "Collinsella_aerofaciens|SGB14535"                  
##  [88,] "Clostridium_leptum|SGB14853"                       
##  [89,] "GGB52130_SGB14966|SGB14966"                        
##  [90,] "Clostridium_sp_AF27_2AA|SGB4712"                   
##  [91,] "Blautia_sp_MCC283|SGB4828"                         
##  [92,] "Clostridiales_bacterium_KLE1615|SGB5090"           
##  [93,] "Lachnospira_sp_NSJ_43|SGB5087"                     
##  [94,] "Coprococcus_catus|SGB4670"                         
##  [95,] "Clostridiaceae_bacterium_Marseille_Q4143|SGB4768"  
##  [96,] "GGB45432_SGB63101|SGB63101"                        
##  [97,] "Clostridium_sp_AM33_3|SGB4711"                     
##  [98,] "GGB9342_SGB14306|SGB14306"                         
##  [99,] "Faecalicatena_fissicatena|SGB4871"                 
## [100,] "Alistipes_finegoldii|SGB2301"                      
## [101,] "Oscillospiraceae_bacterium_Marseille_Q3528|SGB4778"
## [102,] "Faecalibacterium_sp_HTFF|SGB15340"                 
## [103,] "GGB3653_SGB4964|SGB4964"                           
## [104,] "Intestinimonas_butyriciproducens|SGB15126"         
## [105,] "Blautia_glucerasea|SGB4816"                        
## [106,] "GGB2653_SGB3574|SGB3574"                           
## [107,] "bacterium_210917_DFI_7_65|SGB14999"                
## [108,] "GGB9646_SGB15123|SGB15123"                         
## [109,] "GGB51441_SGB71759|SGB71759"                        
## [110,] "Alistipes_indistinctus|SGB2325"                    
## [111,] "Lachnospiraceae_bacterium|SGB4781"                 
## [112,] "GGB3109_SGB4121|SGB4121"                           
## [113,] "GGB9534_SGB14937|SGB14937"                         
## [114,] "Anaerotruncus_rubiinfantis|SGB25416"               
## [115,] "Blautia_SGB4815|SGB4815"                           
## [116,] "Lawsonibacter_hominis|SGB15131"                    
## [117,] "Clostridium_SGB4750|SGB4750"                       
## [118,] "Holdemania_filiformis|SGB4046"                     
## [119,] "Anaeromassilibacillus_senegalensis|SGB14894"       
## [120,] "Lachnospira_pectinoschiza|SGB5089"                 
## [121,] "Blautia_obeum|SGB4811"                             
## [122,] "Adlercreutzia_equolifaciens|SGB14797"              
## [123,] "Blautia_wexlerae|SGB4837"                          
## [124,] "Clostridium_sp_AM22_11AC|SGB4749"                  
## [125,] "Anaerotruncus_colihominis|SGB14963"                
## [126,] "Eubacterium_ramulus|SGB4959"                       
## [127,] "Blautia_faecis|SGB4820"                            
## [128,] "Clostridiaceae_bacterium_Marseille_Q4145|SGB4769"  
## [129,] "Roseburia_sp_AF02_12|SGB4938"                      
## [130,] "Intestinimonas_gabonensis|SGB79840"                
## [131,] "Blautia_massiliensis|SGB4826"                      
## [132,] "Lachnospiraceae_bacterium_CLA_AA_H215|SGB4777"     
## [133,] "Dorea_formicigenerans|SGB4575"                     
## [134,] "Clostridia_bacterium_UC5_1_1D1|SGB14995"           
## [135,] "Clostridium_sp_AM49_4BH|SGB4652"                   
## [136,] "Mediterraneibacter_faecis|SGB4563"                 
## [137,] "Ruminococcus_torques|SGB4608"                      
## [138,] "Coprococcus_comes|SGB4577"                         
## [139,] "Streptococcus_parasanguinis|SGB8071"               
## [140,] "Eubacterium_ventriosum|SGB5045"                    
## [141,] "Enterocloster_citroniae|SGB4761"                   
## [142,] "GGB9635_SGB15106|SGB15106"                         
## [143,] "GGB9758_SGB15368|SGB15368"                         
## [144,] "GGB9708_SGB15234|SGB15234"                         
## [145,] "Wujia_chipingensis|SGB5111"                        
## [146,] "Paraprevotella_clara|SGB1798"                      
## [147,] "Dorea_longicatena|SGB4581"                         
## [148,] "Ruminococcus_lactaris|SGB4557"                     
## [149,] "Anaerobutyricum_hallii|SGB4532"                    
## [150,] "Streptococcus_salivarius|SGB8007_group"            
## [151,] "Anaerostipes_hadrus|SGB4540"                       
## [152,] "Lawsonibacter_SGB15145|SGB15145"                   
## [153,] "Lachnospiraceae_bacterium|SGB4782"                 
## [154,] "Mediterraneibacter_butyricigenes|SGB25493"         
## [155,] "Dorea_sp_AF36_15AT|SGB4552"                        
## [156,] "Oliverpabstia_intestinalis|SGB4868"                
## [157,] "Lentihominibacter_faecis|SGB3957"                  
## [158,] "Bacteroides_xylanisolvens|SGB1867"                 
## [159,] "Faecalibacillus_intestinalis|SGB6754"              
## [160,] "Romboutsia_timonensis|SGB6148"                     
## [161,] "Dorea_longicatena|SGB4582"                         
## [162,] "Clostridiaceae_unclassified_SGB4771|SGB4771"       
##        [,2]                                                
##   [1,] "Sample"                                            
##   [2,] "Data"                                              
##   [3,] "Diet"                                              
##   [4,] "Bacteroides_stercoris|SGB1830"                     
##   [5,] "Alistipes_putredinis|SGB2318"                      
##   [6,] "Candidatus_Cibionibacter_quicibialis|SGB15286"     
##   [7,] "Bacteroides_uniformis|SGB1836"                     
##   [8,] "GGB9602_SGB15031|SGB15031"                         
##   [9,] "Phocaeicola_vulgatus|SGB1814"                      
##  [10,] "Faecalibacterium_prausnitzii|SGB15316"             
##  [11,] "Lachnospiraceae_bacterium_CLA_AA_H244|SGB4993"     
##  [12,] "Sutterella_wadsworthensis|SGB9283"                 
##  [13,] "Oscillibacter_sp_ER4|SGB15254"                     
##  [14,] "Parabacteroides_distasonis|SGB1934"                
##  [15,] "Parabacteroides_merdae|SGB1949"                    
##  [16,] "Oscillibacter_valericigenes|SGB15053"              
##  [17,] "Brotolimicola_acetigignens|SGB4914"                
##  [18,] "Alistipes_shahii|SGB2295"                          
##  [19,] "Alistipes_onderdonkii|SGB2303"                     
##  [20,] "Faecalibacterium_SGB15346|SGB15346"                
##  [21,] "Faecalibacterium_prausnitzii|SGB15332"             
##  [22,] "Gemmiger_formicilis|SGB15300"                      
##  [23,] "GGB33469_SGB15236|SGB15236"                        
##  [24,] "Alistipes_communis|SGB2290"                        
##  [25,] "Escherichia_coli|SGB10068"                         
##  [26,] "Faecalibacterium_prausnitzii|SGB15318"             
##  [27,] "Faecalibacterium_prausnitzii|SGB15342"             
##  [28,] "Fusicatenibacter_saccharivorans|SGB4874"           
##  [29,] "Vescimonas_coprocola|SGB15089"                     
##  [30,] "Oscillibacter_sp_MSJ_31|SGB15249"                  
##  [31,] "Bacteroides_caccae|SGB1877"                        
##  [32,] "Oscillospiraceae_bacterium_CLA_AA_H250|SGB14861"   
##  [33,] "Bifidobacterium_longum|SGB17248"                   
##  [34,] "Alistipes_senegalensis|SGB2296"                    
##  [35,] "Roseburia_intestinalis|SGB4951"                    
##  [36,] "Dysosmobacter_welbionis|SGB15078"                  
##  [37,] "Eubacterium_rectale|SGB4933"                       
##  [38,] "Alistipes_ihumii|SGB2328"                          
##  [39,] "Roseburia_faecis|SGB4925"                          
##  [40,] "GGB9699_SGB15216|SGB15216"                         
##  [41,] "Oscillospiraceae_bacterium|SGB15225"               
##  [42,] "Faecalibacterium_prausnitzii|SGB15317"             
##  [43,] "GGB9667_SGB15164|SGB15164"                         
##  [44,] "Lachnospira_pectinoschiza|SGB5075"                 
##  [45,] "Anaerotignum_faecicola|SGB5190"                    
##  [46,] "Hydrogenoanaerobacterium_saccharovorans|SGB15350"  
##  [47,] "Roseburia_hominis|SGB4936"                         
##  [48,] "Ruminococcus_bicirculans|SGB4262"                  
##  [49,] "Bilophila_wadsworthia|SGB15452"                    
##  [50,] "Odoribacter_splanchnicus|SGB1790"                  
##  [51,] "GGB9715_SGB15265|SGB15265"                         
##  [52,] "GGB9730_SGB15291|SGB15291"                         
##  [53,] "Bifidobacterium_adolescentis|SGB17244"             
##  [54,] "Bacteroides_ovatus|SGB1871"                        
##  [55,] "GGB9614_SGB15049|SGB15049"                         
##  [56,] "Simiaoa_sunii|SGB4910"                             
##  [57,] "Barnesiella_intestinihominis|SGB1965"              
##  [58,] "Agathobaculum_butyriciproducens|SGB14993"          
##  [59,] "Lachnospira_eligens|SGB5082"                       
##  [60,] "Lacrimispora_amygdalina|SGB4716"                   
##  [61,] "Akkermansia_muciniphila|SGB9226"                   
##  [62,] "Clostridiales_bacterium|SGB15143"                  
##  [63,] "Ruminococcus_bromii|SGB4285"                       
##  [64,] "Gemmiger_formicilis|SGB15299"                      
##  [65,] "Clostridiaceae_bacterium|SGB4269"                  
##  [66,] "GGB9760_SGB15373|SGB15373"                         
##  [67,] "Clostridiaceae_bacterium_AF18_31LB|SGB4767"        
##  [68,] "Bacteroides_cellulosilyticus|SGB1844"              
##  [69,] "Faecalibacterium_sp_CLA_AA_H233|SGB15315"          
##  [70,] "Ruthenibacterium_lactatiformans|SGB15271"          
##  [71,] "Lawsonibacter_asaccharolyticus|SGB15154"           
##  [72,] "Clostridium_sp_AF20_17LB|SGB4714"                  
##  [73,] "Clostridium_fessum|SGB4705"                        
##  [74,] "Clostridium_sp_AF36_4|SGB4644"                     
##  [75,] "GGB13404_SGB14252|SGB14252"                        
##  [76,] "GGB9627_SGB15081|SGB15081"                         
##  [77,] "GGB9707_SGB15229|SGB15229"                         
##  [78,] "Parasutterella_excrementihominis|SGB9262"          
##  [79,] "Roseburia_inulinivorans|SGB4940"                   
##  [80,] "Phocaeicola_dorei|SGB1815"                         
##  [81,] "Oscillibacter_valericigenes|SGB15124"              
##  [82,] "Clostridiaceae_bacterium|SGB4770"                  
##  [83,] "Bacteroides_thetaiotaomicron|SGB1861"              
##  [84,] "Flavonifractor_plautii|SGB15132"                   
##  [85,] "Fusicatenibacter_sp_CLA_AA_H277|SGB4780"           
##  [86,] "Lachnospiraceae_bacterium_AM48_27BH|SGB4706"       
##  [87,] "Collinsella_aerofaciens|SGB14535"                  
##  [88,] "Clostridium_leptum|SGB14853"                       
##  [89,] "GGB52130_SGB14966|SGB14966"                        
##  [90,] "Clostridium_sp_AF27_2AA|SGB4712"                   
##  [91,] "Blautia_sp_MCC283|SGB4828"                         
##  [92,] "Clostridiales_bacterium_KLE1615|SGB5090"           
##  [93,] "Lachnospira_sp_NSJ_43|SGB5087"                     
##  [94,] "Coprococcus_catus|SGB4670"                         
##  [95,] "Clostridiaceae_bacterium_Marseille_Q4143|SGB4768"  
##  [96,] "GGB45432_SGB63101|SGB63101"                        
##  [97,] "Clostridium_sp_AM33_3|SGB4711"                     
##  [98,] "GGB9342_SGB14306|SGB14306"                         
##  [99,] "Faecalicatena_fissicatena|SGB4871"                 
## [100,] "Alistipes_finegoldii|SGB2301"                      
## [101,] "Oscillospiraceae_bacterium_Marseille_Q3528|SGB4778"
## [102,] "Faecalibacterium_sp_HTFF|SGB15340"                 
## [103,] "GGB3653_SGB4964|SGB4964"                           
## [104,] "Intestinimonas_butyriciproducens|SGB15126"         
## [105,] "Blautia_glucerasea|SGB4816"                        
## [106,] "GGB2653_SGB3574|SGB3574"                           
## [107,] "bacterium_210917_DFI_7_65|SGB14999"                
## [108,] "GGB9646_SGB15123|SGB15123"                         
## [109,] "GGB51441_SGB71759|SGB71759"                        
## [110,] "Alistipes_indistinctus|SGB2325"                    
## [111,] "Lachnospiraceae_bacterium|SGB4781"                 
## [112,] "GGB3109_SGB4121|SGB4121"                           
## [113,] "GGB9534_SGB14937|SGB14937"                         
## [114,] "Anaerotruncus_rubiinfantis|SGB25416"               
## [115,] "Blautia_SGB4815|SGB4815"                           
## [116,] "Lawsonibacter_hominis|SGB15131"                    
## [117,] "Clostridium_SGB4750|SGB4750"                       
## [118,] "Holdemania_filiformis|SGB4046"                     
## [119,] "Anaeromassilibacillus_senegalensis|SGB14894"       
## [120,] "Lachnospira_pectinoschiza|SGB5089"                 
## [121,] "Blautia_obeum|SGB4811"                             
## [122,] "Adlercreutzia_equolifaciens|SGB14797"              
## [123,] "Blautia_wexlerae|SGB4837"                          
## [124,] "Clostridium_sp_AM22_11AC|SGB4749"                  
## [125,] "Anaerotruncus_colihominis|SGB14963"                
## [126,] "Eubacterium_ramulus|SGB4959"                       
## [127,] "Blautia_faecis|SGB4820"                            
## [128,] "Clostridiaceae_bacterium_Marseille_Q4145|SGB4769"  
## [129,] "Roseburia_sp_AF02_12|SGB4938"                      
## [130,] "Intestinimonas_gabonensis|SGB79840"                
## [131,] "Blautia_massiliensis|SGB4826"                      
## [132,] "Lachnospiraceae_bacterium_CLA_AA_H215|SGB4777"     
## [133,] "Dorea_formicigenerans|SGB4575"                     
## [134,] "Clostridia_bacterium_UC5_1_1D1|SGB14995"           
## [135,] "Clostridium_sp_AM49_4BH|SGB4652"                   
## [136,] "Mediterraneibacter_faecis|SGB4563"                 
## [137,] "Ruminococcus_torques|SGB4608"                      
## [138,] "Coprococcus_comes|SGB4577"                         
## [139,] "Streptococcus_parasanguinis|SGB8071"               
## [140,] "Eubacterium_ventriosum|SGB5045"                    
## [141,] "Enterocloster_citroniae|SGB4761"                   
## [142,] "GGB9635_SGB15106|SGB15106"                         
## [143,] "GGB9758_SGB15368|SGB15368"                         
## [144,] "GGB9708_SGB15234|SGB15234"                         
## [145,] "Wujia_chipingensis|SGB5111"                        
## [146,] "Paraprevotella_clara|SGB1798"                      
## [147,] "Dorea_longicatena|SGB4581"                         
## [148,] "Ruminococcus_lactaris|SGB4557"                     
## [149,] "Anaerobutyricum_hallii|SGB4532"                    
## [150,] "Streptococcus_salivarius|SGB8007_group"            
## [151,] "Anaerostipes_hadrus|SGB4540"                       
## [152,] "Lawsonibacter_SGB15145|SGB15145"                   
## [153,] "Lachnospiraceae_bacterium|SGB4782"                 
## [154,] "Mediterraneibacter_butyricigenes|SGB25493"         
## [155,] "Dorea_sp_AF36_15AT|SGB4552"                        
## [156,] "Oliverpabstia_intestinalis|SGB4868"                
## [157,] "Lentihominibacter_faecis|SGB3957"                  
## [158,] "Bacteroides_xylanisolvens|SGB1867"                 
## [159,] "Faecalibacillus_intestinalis|SGB6754"              
## [160,] "Romboutsia_timonensis|SGB6148"                     
## [161,] "Dorea_longicatena|SGB4582"                         
## [162,] "Clostridiaceae_unclassified_SGB4771|SGB4771"

common_microbiome <- intersect(
  colnames(data_microbiome_original), 
  colnames(data_microbiome_validation))[-c(1:2)]

tr1 <- data_microbiome_original %>% 
  mutate(Data = if_else(Country == 'CZ', 'CZ_tr', 'IT_tr')) %>% 
  select(Data, Diet, all_of(common_microbiome))

tr2 <- data_microbiome_validation %>% 
  mutate(Data = 'valid',
         Diet = Diet) %>% 
  select(Data, Diet, all_of(common_microbiome))

data_merged <- bind_rows(tr1, tr2)

3.2 Explore

3.2.0.1 Distributions - clr transformed

Open code

check <- data_microbiome_original %>% 
  dplyr::select(-c(Sample:Diet)
    ) %>% 
  na.omit()


size = c(8, 7)
par(mfrow = c(size[1],size[2]))
par(mar=c(2,1.5,2,0.5))
set.seed(16)
ran <- sample(1:ncol(check), size[1]*size[2], replace = FALSE)
for(x in ran){
  hist(check[,x], 
       16, 
       col='blue', 
       main = paste0(colnames(check)[x])
  )
}

Histogram of CLR-transformed proportions for a random selection of 56 bacterial taxa (proportion: number of sequences for each bacterium relative to the total library depth)

3.2.0.2 Taxon proportions accross groups

Open code

colo <- c('#329243', '#F9FFAF')

data_merged <- na.omit(data_merged)

outcomes <- common_microbiome[
    sample(
      1:length(common_microbiome), 35, replace = FALSE
      )
  ]

boxplot_cond <- function(variable) {
  
  
  p <- ggboxplot(data_merged, 
                 x = 'Diet', 
                 y = variable, 
                 fill = 'Diet', 
                 tip.length = 0.15,
                 palette = colo,
                 outlier.shape = 1,
                 lwd = 0.25,
                 outlier.size = 0.8,
                 facet.by = 'Data',
                 title = variable,
                 ylab = 'CLR(taxa proportion)') +
    
    theme(
      plot.title = element_text(size = 10), 
      axis.title = element_text(size = 8),  
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    ) 
    
  return(p)
}

# Plot all outcomes
plots <- map(outcomes, boxplot_cond)

# Create a matrix of plots
plots_arranged <- ggarrange(plotlist = plots, ncol = 5, nrow = 7,  common.legend = TRUE)
plots_arranged

CLR-trasformed counts in randomly chosen 35 bacterial taxa, across all 3 cohorts (Czech and Italian training cohorts and an independent Czech valdiation cohort) and both dietary groups

4 Linear models across taxa

We will fit a feature-specific linear model where the CLR-transformed count of bacteria reads represents the outcome variable whereas country (Italy vs Czech), diet (vegan vs omnivore), and their interaction (country:diet) all represent fixed-effects predictors. So, each model has the following form

\[ CLR({N_i}) = \alpha + \beta_{1} \times country + \beta_{2} \times diet + \beta_{3} \times country:diet + \epsilon \] where \(N_i\) is read count of \(i\)-th bacteria taxa

The variables were coded as follows: \(diet = -0.5\) for omnivores and \(diet = 0.5\) for vegans; \(country = -0.5\) for the Czech cohort and \(country = 0.5\) for the Italian cohort.
This parameterization allows us to interpret the linear model summary output as presenting the conditional effects of diet averaged across both countries and the conditional effects of country averaged across both diet groups. We will then use the emmeans package (Lenth 2024) to obtain specific estimates for the effect of diet in the Italian and Czech cohorts separately, still from a single model.

Taxa that will show a significant diet effect (average effect of diet across both countries, adjusted for multiple comparisons with FDR < 0.05) will be then visualized using a forest plot, with country-specific diet effect along with diet effect based on independent validation cohort, to evaluate how generalizable these findings are (see external validation section).

Note that p-value for avg effects are the same as produced with car::Anova(model, type = 'III').

4.1 Select data

Open code

data_analysis <- data_microbiome_original %>%
  na.omit() %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 0.5, -0.5
      )
    ),
    Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    )
  ) %>%
  dplyr::select(
    Sample,
    Country,
    Country_IT,
    Diet,
    Diet_VEGAN,
    dplyr::everything()
  )

summary(data_analysis[ , 1:12])
##     Sample            Country            Country_IT           Diet          
##  Length:166         Length:166         Min.   :-0.50000   Length:166        
##  Class :character   Class :character   1st Qu.:-0.50000   Class :character  
##  Mode  :character   Mode  :character   Median :-0.50000   Mode  :character  
##                                        Mean   :-0.03012                     
##                                        3rd Qu.: 0.50000                     
##                                        Max.   : 0.50000                     
##    Diet_VEGAN       Bacteroides_stercoris|SGB1830 Alistipes_putredinis|SGB2318
##  Min.   :-0.50000   Min.   :-7.4094               Min.   :-13.402             
##  1st Qu.:-0.50000   1st Qu.:-4.8717               1st Qu.:  2.795             
##  Median : 0.50000   Median :-2.6812               Median :  4.898             
##  Mean   : 0.08434   Mean   :-0.6375               Mean   :  3.421             
##  3rd Qu.: 0.50000   3rd Qu.: 3.6156               3rd Qu.:  6.001             
##  Max.   : 0.50000   Max.   : 8.4105               Max.   :  9.279             
##  Candidatus_Cibionibacter_quicibialis|SGB15286 Bacteroides_uniformis|SGB1836
##  Min.   :-6.367                                Min.   :-5.274               
##  1st Qu.: 3.190                                1st Qu.: 3.710               
##  Median : 4.126                                Median : 4.933               
##  Mean   : 3.945                                Mean   : 4.801               
##  3rd Qu.: 5.229                                3rd Qu.: 6.146               
##  Max.   : 8.116                                Max.   : 9.617               
##  GGB9602_SGB15031|SGB15031 Phocaeicola_vulgatus|SGB1814
##  Min.   :-7.0147           Min.   :-6.827              
##  1st Qu.:-4.0305           1st Qu.: 3.362              
##  Median :-0.5367           Median : 4.485              
##  Mean   :-0.6963           Mean   : 4.306              
##  3rd Qu.: 2.2064           3rd Qu.: 5.785              
##  Max.   : 7.2414           Max.   : 9.945              
##  Faecalibacterium_prausnitzii|SGB15316
##  Min.   :-3.631                       
##  1st Qu.: 3.007                       
##  Median : 4.067                       
##  Mean   : 3.928                       
##  3rd Qu.: 5.202                       
##  Max.   : 9.427

4.1.1 Define number of microbiome and covariates

Open code

n_covarites <- 5
n_features <- ncol(data_analysis) - n_covarites

4.1.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGdiet_inCZ <- vector('double', n_features)
est_VGdiet_inIT <- vector('double', n_features)
est_VGdiet_avg <- vector('double', n_features)

est_ITcountry_avg <- vector('double', n_features)
diet_country_int <- vector('double', n_features)


P_VGdiet_inCZ <- vector('double', n_features)
P_VGdiet_inIT <- vector('double', n_features)
P_VGdiet_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)
P_diet_country_int <- vector('double', n_features)


CI_L_VGdiet_inCZ <- vector('double', n_features)
CI_L_VGdiet_inIT <- vector('double', n_features)
CI_L_VGdiet_avg <- vector('double', n_features)

CI_U_VGdiet_inCZ <- vector('double', n_features)
CI_U_VGdiet_inIT <- vector('double', n_features)
CI_U_VGdiet_avg <- vector('double', n_features)

4.1.3 Estimate over outcomes

Open code

for (i in 1:n_features) {
  
  ## define variable
  data_analysis$outcome <- data_analysis[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Country_IT * Diet_VEGAN, data = data_analysis)

  ## get contrast (effects of diet BY COUNTRY)
  contrast_emm <- summary(
    pairs(
      emmeans(
        model,
        specs = ~ Diet_VEGAN | Country_IT
        ),
      interaction = TRUE,
      adjust = "none"
      ),
    infer = c(TRUE, TRUE)
    )

  ## save results
  outcome[i] <- names(data_analysis)[i + n_covarites]
  
  ## country effect
  est_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 1
  ]

  P_ITcountry_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT"
    ), 4
  ]
  
  
  ## diet effect
  tr <- confint(model)
  
  CI_L_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][1]
  CI_U_VGdiet_avg[i] <- tr[which(row.names(tr) == 'Diet_VEGAN'),][2]
  
  est_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet_avg[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
  
  est_VGdiet_inCZ[i] <- -contrast_emm[1,3]
  P_VGdiet_inCZ[i] <- contrast_emm$p.value[1]
  CI_L_VGdiet_inCZ[i] <- -contrast_emm$upper.CL[1]
  CI_U_VGdiet_inCZ[i] <- -contrast_emm$lower.CL[1]
  
  est_VGdiet_inIT[i] <- -contrast_emm[2,3]
  P_VGdiet_inIT[i] <- contrast_emm$p.value[2]
  CI_L_VGdiet_inIT[i] <- -contrast_emm$upper.CL[2]
  CI_U_VGdiet_inIT[i] <- -contrast_emm$lower.CL[2]
  
  ## interaction
  diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 1
  ]

  P_diet_country_int[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Country_IT:Diet_VEGAN"
    ), 4
  ]
}

4.1.4 Results table

Open code

result_microbiome <- data.frame(
  outcome,
  est_ITcountry_avg, P_ITcountry_avg,
  est_VGdiet_avg, P_VGdiet_avg,
  est_VGdiet_inCZ, P_VGdiet_inCZ,
  est_VGdiet_inIT, P_VGdiet_inIT,
  diet_country_int, P_diet_country_int,
  CI_L_VGdiet_avg, CI_U_VGdiet_avg,
  CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
  CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
)

4.1.5 Adjust p values

Open code

result_microbiome <- result_microbiome %>% 
  dplyr::mutate(
    fdr_ITcountry_avg = p.adjust(P_ITcountry_avg, method = 'BH'),
    fdr_VGdiet_avg = p.adjust(P_VGdiet_avg, method = 'BH'),
    
    fdr_VGdiet_inCZ = p.adjust(P_VGdiet_inCZ, method = 'BH'),
    fdr_VGdiet_inIT = p.adjust(P_VGdiet_inIT, method = 'BH'),
    fdr_diet_country_int = p.adjust(P_diet_country_int, method = 'BH')
  ) %>% 
  dplyr::select(
    outcome,
    est_ITcountry_avg, P_ITcountry_avg, fdr_ITcountry_avg,
    est_VGdiet_avg, P_VGdiet_avg, fdr_VGdiet_avg,
    est_VGdiet_inCZ, P_VGdiet_inCZ, fdr_VGdiet_inCZ,
    est_VGdiet_inIT, P_VGdiet_inIT, fdr_VGdiet_inIT,
    diet_country_int, P_diet_country_int, fdr_diet_country_int,
    CI_L_VGdiet_avg, CI_U_VGdiet_avg,
    CI_L_VGdiet_inCZ, CI_U_VGdiet_inCZ,
    CI_L_VGdiet_inIT, CI_U_VGdiet_inIT
  )

4.1.6 Show and save results

Open code

kableExtra::kable(result_microbiome %>% filter(fdr_VGdiet_avg < 0.05),
  caption = "Results of linear models, modeling the CLR-transformed relative abundance of a given taxon as the outcome, with `Diet`, `Country`, and `Diet x Country` interaction as fixed-effect predictors. Only taxa whose CLR-transformed relative abundance differs between diets (FDR < 0.05, average diet effect across both countries) are shown. `est` prefix: denotes estimated effects (regression coefficients), i.e., how much CLR-transformed relative abundance differs in vegans compared to omnivores and in the Italian cohort vs. the Czech cohort, respectively; `P`: p-value; `fdr`: p-value after adjustment for multiple comparisons; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix denotes effects averaged across subgroups, whereas `inCZ` and `inIT` denote effects in the Czech or Italian cohort, respectively. Interaction effects are reported as `diet_country_int` (difference in the effects of a vegan diet between the Italian and Czech cohorts; positive values indicate a stronger effect in the Italian, negative values a stronger effect in the Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model"
)

Results of linear models, modeling the CLR-transformed relative abundance of a given taxon as the outcome, with `Diet`, `Country`, and `Diet x Country` interaction as fixed-effect predictors. Only taxa whose CLR-transformed relative abundance differs between diets (FDR < 0.05, average diet effect across both countries) are shown. `est` prefix: denotes estimated effects (regression coefficients), i.e., how much CLR-transformed relative abundance differs in vegans compared to omnivores and in the Italian cohort vs. the Czech cohort, respectively; `P`: p-value; `fdr`: p-value after adjustment for multiple comparisons; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix denotes effects averaged across subgroups, whereas `inCZ` and `inIT` denote effects in the Czech or Italian cohort, respectively. Interaction effects are reported as `diet_country_int` (difference in the effects of a vegan diet between the Italian and Czech cohorts; positive values indicate a stronger effect in the Italian, negative values a stronger effect in the Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model
outcome	est_ITcountry_avg	P_ITcountry_avg	fdr_ITcountry_avg	est_VGdiet_avg	P_VGdiet_avg	fdr_VGdiet_avg	est_VGdiet_inCZ	P_VGdiet_inCZ	fdr_VGdiet_inCZ	est_VGdiet_inIT	P_VGdiet_inIT	fdr_VGdiet_inIT	diet_country_int	P_diet_country_int	fdr_diet_country_int	CI_L_VGdiet_avg	CI_U_VGdiet_avg	CI_L_VGdiet_inCZ	CI_U_VGdiet_inCZ	CI_L_VGdiet_inIT	CI_U_VGdiet_inIT
Escherichia_coli\|SGB10068	0.4885048	0.3322301	0.4001863	-1.550017	0.0023876	0.0421807	-1.532608	0.0325347	0.2702736	-1.5674247	0.0286641	0.3255423	-0.0348163	0.9723958	0.997886	-2.5419087	-0.5581245	-2.9361481	-0.1290687	-2.9693791	-0.1654704
Lawsonibacter_asaccharolyticus\|SGB15154	3.1895898	0.0000000	0.0000000	-1.511171	0.0003857	0.0122662	-1.934277	0.0012725	0.0338735	-1.0880649	0.0665818	0.4221042	0.8462118	0.3115665	0.997886	-2.3342471	-0.6880946	-3.0989399	-0.7696136	-2.2514126	0.0752827
GGB9707_SGB15229\|SGB15229	0.8114240	0.0203405	0.0336889	1.162887	0.0009787	0.0228726	1.122442	0.0232780	0.2221447	1.2033319	0.0150079	0.2993182	0.0808896	0.9071720	0.997886	0.4790288	1.8467455	0.1547742	2.0901104	0.2367569	2.1699070
Lachnospiraceae_bacterium_AM48_27BH\|SGB4706	-0.8689938	0.0255963	0.0411092	1.439783	0.0002616	0.0103977	1.904670	0.0006224	0.0329892	0.9748962	0.0755911	0.4292495	-0.9297741	0.2298324	0.997886	0.6781517	2.2014148	0.8269521	2.9823885	-0.1016047	2.0513971
GGB2653_SGB3574\|SGB3574	0.1508178	0.6943092	0.7310938	-1.516517	0.0001125	0.0059615	-1.776950	0.0012782	0.0338735	-1.2560829	0.0215915	0.2993182	0.5208674	0.4975567	0.997886	-2.2729581	-0.7600750	-2.8473245	-0.7065760	-2.3252481	-0.1869176
Eubacterium_ramulus\|SGB4959	-2.2427927	0.0000001	0.0000005	-1.349327	0.0010975	0.0228726	-1.337502	0.0211232	0.2221447	-1.3611519	0.0188505	0.2993182	-0.0236502	0.9767968	0.997886	-2.1509484	-0.5477052	-2.4718063	-0.2031971	-2.4941752	-0.2281286
Blautia_massiliensis\|SGB4826	-3.9444585	0.0000000	0.0000000	-1.871578	0.0011508	0.0228726	-1.575550	0.0506497	0.2914093	-2.1676062	0.0074083	0.2528624	-0.5920560	0.6013355	0.997886	-2.9882184	-0.7549380	-3.1556100	0.0045096	-3.7458814	-0.5893311
Ruminococcus_torques\|SGB4608	-2.6288787	0.0000003	0.0000011	-3.194236	0.0000000	0.0000001	-3.983262	0.0000000	0.0000068	-2.4052096	0.0006514	0.0517831	1.5780528	0.1088827	0.997886	-4.1607301	-2.2277418	-5.3508637	-2.6156611	-3.7712662	-1.0391530
Clostridiaceae_unclassified_SGB4771\|SGB4771	-0.0205064	0.9672729	0.9795949	2.285882	0.0000092	0.0007316	2.101828	0.0033628	0.0594093	2.4699374	0.0005973	0.0517831	0.3681099	0.7127412	0.997886	1.3004337	3.2713312	0.7074052	3.4962498	1.0770901	3.8627846

Open code


if(file.exists('gitignore/lm_results/result_microbiome_SGB50.csv') == FALSE){
  write.table(result_microbiome, 
              'gitignore/lm_results/result_microbiome_SGB50.csv', row.names = FALSE)
}

dim(result_microbiome %>% filter(fdr_VGdiet_avg < 0.05))
## [1]  9 22

5 Elastic net

To assess the predictive power of microbiome features to discriminate between diet strategy, we employed Elastic Net logistic regression.

As we expected very high level of co-linearity, we allowed \(alpha\) to rather small (0, 0.2 or 0.4).

The performance of the predictive models was evaluated through their capacity of discriminate between vegan and omnivore diets, using out-of-sample area under ROC curve (AUC; estimated with out-of-bag bootstrap) as the measure of discriminatory capacity.

All features were transformed by 2 standard deviations (resulting in standard deviation of 0.5).

5.1 Prepare data for `glmnet`

Open code

data_microbiome_glmnet <- data_microbiome_original %>%
  na.omit() %>%
  dplyr::mutate(
    vegan = as.numeric(
      dplyr::if_else(
        Diet == "VEGAN", 1, 0
      )
    ),
    dplyr::across(
      `Bacteroides_stercoris|SGB1830`:`Clostridiaceae_unclassified_SGB4771|SGB4771`, 
      ~ arm::rescale(.)
    )
  ) %>%
  dplyr::select(
    vegan,Country,
    dplyr::everything()
  ) %>%
  dplyr::select(
    Sample, vegan, Country,
    `Bacteroides_stercoris|SGB1830`:`Clostridiaceae_unclassified_SGB4771|SGB4771`
  )

5.2 Fit model

Open code

modelac <- "elanet_microbiome_SGB50"

assign(
  modelac,
  run(
    expr = clust_glmnet_sep(
      data = data_microbiome_glmnet,
      outcome = "vegan",
      clust_id = "Sample",
      sample_method = "oos_boot",
      N = 500,
      alphas = c(0, 0.2, 0.4),
      family = "binomial",
      seed = 478
    ),
    path = paste0("gitignore/run/", modelac)
  )
)

5.3 Model summary

Open code

elanet_microbiome_SGB50$model_summary
##   alpha     lambda       auc auc_OutOfSample auc_oos_CIL auc_oos_CIU  accuracy
## 1   0.4 0.03003871 0.9801285       0.8076462    0.703754   0.8898782 0.9277108
##   accuracy_OutOfSample accuracy_oos_CIL accuracy_oos_CIU
## 1            0.7160903        0.6101695        0.8142185
elanet_microbiome_SGB50$country_AUC
##   auc_OutOfSample_IT auc_oos_CIL_IT auc_oos_CIU_IT auc_OutOfSample_CZ
## 1          0.7537663       0.580963      0.9003157          0.8446148
##   auc_oos_CIL_CZ auc_oos_CIU_CZ
## 1      0.7113765      0.9521635

5.4 ROC curve - internal validation

Open code

elanet_microbiome_SGB50$plot

Receiver operating characteristic (ROC) curves from a random subset of out-of-bag bootstrap iterations, illustrating the model’s ability to discriminate between vegan and omnivore status based on standardized CLR-transformed count of bacteria reads. Each curve corresponds to an elastic net model trained on a bootstrap resample and evaluated on the subjects not included in that iteration (i.e., out-of-bag samples). The curve plots the true positive rate (sensitivity) against the false positive rate (1 – specificity) across thresholds of predicted vegan probability. The area under the curve (AUC) quantifies overall discriminatory performance, with values closer to 1 indicating stronger separation between groups.

5.5 Estimated coefficients

Open code

data.frame(
  microbiome = row.names(
    elanet_microbiome_SGB50$betas
    )[
      which(
        abs(
          elanet_microbiome_SGB50$betas
          )>0
        )
      ],
  beta = elanet_microbiome_SGB50$betas[
    abs(
      elanet_microbiome_SGB50$betas
      )>0
    ]
  ) %>% 
  mutate(
    is_in_ExtValCoh = if_else(
      microbiome %in% names(data_microbiome_validation),
      1, 0
      )
    )
##                                          microbiome         beta
## 1                                       (Intercept)  0.459437083
## 2                      Alistipes_putredinis|SGB2318 -0.040595361
## 3                         GGB9602_SGB15031|SGB15031 -0.207538429
## 4                      Phocaeicola_vulgatus|SGB1814  0.191109782
## 5                 Sutterella_wadsworthensis|SGB9283  0.122229088
## 6                Parabacteroides_distasonis|SGB1934  0.133299124
## 7                Brotolimicola_acetigignens|SGB4914  0.024530137
## 8                Faecalibacterium_SGB15346|SGB15346  0.045951193
## 9                        GGB33469_SGB15236|SGB15236 -0.196421487
## 10                        Escherichia_coli|SGB10068 -0.070608506
## 11                   Alistipes_senegalensis|SGB2296 -0.021737338
## 12            Faecalibacterium_prausnitzii|SGB15317  0.026446203
## 13                Lachnospira_pectinoschiza|SGB5075 -0.350096766
## 14                   Anaerotignum_faecicola|SGB5190 -0.074526932
## 15 Hydrogenoanaerobacterium_saccharovorans|SGB15350 -0.476105504
## 16                        Roseburia_hominis|SGB4936  0.206257236
## 17                 Odoribacter_splanchnicus|SGB1790 -0.042556044
## 18                       Bacteroides_ovatus|SGB1871  0.368055222
## 19                        GGB9614_SGB15049|SGB15049  0.001210077
## 20             Barnesiella_intestinihominis|SGB1965 -0.140209994
## 21         Agathobaculum_butyriciproducens|SGB14993  0.255841779
## 22                      Ruminococcus_bromii|SGB4285  0.139827020
## 23          Lawsonibacter_asaccharolyticus|SGB15154 -0.446866250
## 24                       Clostridium_fessum|SGB4705 -0.426048240
## 25                    Clostridium_sp_AF36_4|SGB4644  0.137175853
## 26                        GGB9627_SGB15081|SGB15081  0.135012837
## 27                        GGB9707_SGB15229|SGB15229  0.287233674
## 28                  Roseburia_inulinivorans|SGB4940 -0.061759880
## 29                        Phocaeicola_dorei|SGB1815 -0.102024353
## 30          Fusicatenibacter_sp_CLA_AA_H277|SGB4780  0.053963003
## 31      Lachnospiraceae_bacterium_AM48_27BH|SGB4706  0.400550282
## 32                       GGB52130_SGB14966|SGB14966 -0.126367545
## 33                        Blautia_sp_MCC283|SGB4828  0.207911475
## 34          Clostridiales_bacterium_KLE1615|SGB5090  0.017004134
## 35                    Lachnospira_sp_NSJ_43|SGB5087  0.252316487
## 36                        Coprococcus_catus|SGB4670  0.182910611
## 37 Clostridiaceae_bacterium_Marseille_Q4143|SGB4768 -0.105592864
## 38                       GGB45432_SGB63101|SGB63101 -0.231582035
## 39                    Clostridium_sp_AM33_3|SGB4711  0.071248003
## 40                          GGB3653_SGB4964|SGB4964  0.130591614
## 41        Intestinimonas_butyriciproducens|SGB15126  0.008629126
## 42                       Blautia_glucerasea|SGB4816  0.006153689
## 43                          GGB2653_SGB3574|SGB3574 -0.292917447
## 44                    Holdemania_filiformis|SGB4046 -0.281554004
## 45      Anaeromassilibacillus_senegalensis|SGB14894 -0.126497626
## 46                      Eubacterium_ramulus|SGB4959 -0.416678884
## 47                     Roseburia_sp_AF02_12|SGB4938  0.190190893
## 48                     Blautia_massiliensis|SGB4826 -0.053674556
## 49                  Clostridium_sp_AM49_4BH|SGB4652  0.082575818
## 50                Mediterraneibacter_faecis|SGB4563 -0.360995811
## 51                     Ruminococcus_torques|SGB4608 -0.799742276
## 52                        Coprococcus_comes|SGB4577 -0.150653522
## 53                   Eubacterium_ventriosum|SGB5045  0.233526304
## 54                        GGB9635_SGB15106|SGB15106 -0.133505579
## 55                        GGB9758_SGB15368|SGB15368 -0.115064648
## 56                    Ruminococcus_lactaris|SGB4557 -0.027709486
## 57        Mediterraneibacter_butyricigenes|SGB25493  0.296675686
## 58                Bacteroides_xylanisolvens|SGB1867  0.166528066
## 59      Clostridiaceae_unclassified_SGB4771|SGB4771  0.460711452
##    is_in_ExtValCoh
## 1                0
## 2                1
## 3                1
## 4                1
## 5                1
## 6                1
## 7                1
## 8                1
## 9                1
## 10               1
## 11               1
## 12               1
## 13               1
## 14               1
## 15               1
## 16               1
## 17               1
## 18               1
## 19               1
## 20               1
## 21               1
## 22               1
## 23               1
## 24               1
## 25               1
## 26               1
## 27               1
## 28               1
## 29               1
## 30               1
## 31               1
## 32               1
## 33               1
## 34               1
## 35               1
## 36               1
## 37               1
## 38               1
## 39               1
## 40               1
## 41               1
## 42               1
## 43               1
## 44               1
## 45               1
## 46               1
## 47               1
## 48               1
## 49               1
## 50               1
## 51               1
## 52               1
## 53               1
## 54               1
## 55               1
## 56               1
## 57               1
## 58               1
## 59               1

5.6 Plot beta coefficients

Open code

elacoef <- data.frame(
  microbiome = row.names(elanet_microbiome_SGB50$betas),
  beta_ela = elanet_microbiome_SGB50$betas[, 1]
) %>%
  arrange(abs(beta_ela)) %>%
  filter(abs(beta_ela) > 0,
         !grepl('Intercept', microbiome)) %>%
  mutate(microbiome = factor(microbiome, levels = microbiome)) 

plotac <- "elanet_beta_microbiome_SGB50"
path <- "gitignore/figures"

assign(plotac, 
  ggplot(elacoef,
    aes(
      x = microbiome,
      y = beta_ela
    )
  ) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  labs(
    y = "Standardized beta coefficients",
    x = "Bacteria species"
  ) +
  theme_minimal() +
  coord_flip() + 
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

Regression coefficients from the elastic net model predicting vegan diet strategy based on CLR-transformed and standardized proportion of bacterial taxa. Taxa are ordered by the magnitude of the standardized coefficients, indicating their relative importance in distinguishing between the diet groups. The sign of each coefficient indicates the direction of association with vegan diet status, with positive values indicating a higher likelihood of vegan status and negative values indicating omnivore status. Taxa whose effects were shrunk to zero are not shown.

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 7,
    height = 14
  )
}

6 External validation

External validation was performed with an independent Czech cohort.

As a first step, we will use the previously developed and internally validated elastic net model to predict vegan status in the independent Czech cohort. The validation data will be standardized using the mean and standard deviation of each taxa as taken from the training cohort to ensure comparability across datasets. For each subject in the external validation cohort, we will estimate the predicted probability of being vegan using the elastic net model. This predicted probability will then be used as a variable to discriminate between the diet groups in the independent cohort.

In a 2nd step, we will look at taxa that significantly differed between diet groups (average vegan diet effect across both countries, FDR < 0.05) estimated by linear models (one per a taxa) with data of training cohort. Then we will fit linear models also for external validation cohort. Effect of vegan diet on these taxa will be shown along with 95% confidence interval for all cohorts: training Czech and Italian cohorts, but also in Czech independent (validating) cohort

6.1 Prediction of diet (elastic net)

6.1.1 Get table of weights, means and SDs

Open code


coefs_microbiome_all <- get_coef(
  original_data = data_analysis,
  glmnet_model = elanet_microbiome_SGB50)

coefs_microbiome_all
## # A tibble: 160 × 5
##    predictor                             beta_scaled beta_OrigScale   mean    SD
##    <chr>                                       <dbl>          <dbl>  <dbl> <dbl>
##  1 (Intercept)                                0.459          NA     NA     NA   
##  2 Bacteroides_stercoris|SGB1830              0               0     -0.638  4.55
##  3 Alistipes_putredinis|SGB2318              -0.0406         -0.342  3.42   4.21
##  4 Candidatus_Cibionibacter_quicibialis…      0               0      3.94   2.20
##  5 Bacteroides_uniformis|SGB1836              0               0      4.80   2.00
##  6 GGB9602_SGB15031|SGB15031                 -0.208          -1.47  -0.696  3.53
##  7 Phocaeicola_vulgatus|SGB1814               0.191           1.04   4.31   2.72
##  8 Faecalibacterium_prausnitzii|SGB15316      0               0      3.93   2.04
##  9 Lachnospiraceae_bacterium_CLA_AA_H24…      0               0      2.11   1.70
## 10 Sutterella_wadsworthensis|SGB9283          0.122           1.16  -0.432  4.76
## # ℹ 150 more rows

6.1.2 Identify shared and missing predictors

Open code


## Which are missing in the validation set
missing <- setdiff(
  coefs_microbiome_all$predictor[-1], 
  colnames(
    data_microbiome_validation
    )
  )

missing
## character(0)

## Which are common with the validations et
common_predictors <- intersect(
  coefs_microbiome_all$predictor,
  colnames(data_microbiome_validation)
)

6.1.3 Standardize data in validation set

Open code

data_microbiome_validation_pred_all <- data_microbiome_validation %>%
  dplyr::mutate(
    vegan = if_else(
      Diet == "VEGAN", 1, 0
    )
  ) %>%
  dplyr::select(
    vegan,
    dplyr::all_of(common_predictors)
  ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      - coefs_microbiome_all$mean[
        match(
          cur_column(), 
          coefs_microbiome_all$predictor
          )
        ]
      )
    ) %>% 
  dplyr::mutate(
    across(
      .cols = -vegan,
      .fns = ~ . 
      / coefs_microbiome_all$SD[
        match(
          cur_column(), 
          coefs_microbiome_all$predictor
          )
        ]
      )
    )

6.1.4 Result

6.1.4.1 ROC curve

Open code


newx <- as.matrix(data_microbiome_validation_pred_all[,-1])

predicted <- predict(
  elanet_microbiome_SGB50$fit, 
  newx = newx)

tr <- data_microbiome_validation_pred_all %>% 
  dplyr::mutate(
    predicted_logit = as.numeric(
      predict(
        elanet_microbiome_SGB50$fit,
        newx = newx
        )
      )
    ) %>% 
  dplyr::mutate(
    predicted = inv_logit(predicted_logit)
  )

roc_microbiome_all <- pROC::roc(
      vegan ~ predicted_logit,
      data = tr,
      direction = "<",
      levels = c(0, 1),
      ci = TRUE
      )

roc_microbiome_all
## 
## Call:
## roc.formula(formula = vegan ~ predicted_logit, data = tr, direction = "<",     levels = c(0, 1), ci = TRUE)
## 
## Data: predicted_logit in 43 controls (vegan 0) < 59 cases (vegan 1).
## Area under the curve: 0.8538
## 95% CI: 0.7818-0.9257 (DeLong)

plotac <- "roc_microbiom_SGB50"
path <- "gitignore/figures"

assign(plotac, ggroc(roc_microbiome_all))
get(plotac)

Receiver Operating Characteristic (ROC) curve illustrating the model’s ability to distinguish between vegan and omnivore status based on CLR-transformed counts of bacteria reads in the external validation Czech cohort. The curve plots the true positive rate (sensitivity) against the true positive rate (specificity) at various thresholds of predicted vegan status, as estimated from the elastic net model developed on the training data. The area under the curve (AUC) represents the model’s overall performance, with values closer to 1 indicating stronger discrimination.

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 6,
    height = 4.5
  )
}

6.1.4.2 Table

Open code

mod <- elanet_microbiome_SGB50

trainAUC <- mean(mod[["valid_performances"]]$auc_resamp_test)
trainCI <- quantile(mod[["valid_performances"]]$auc_resamp_test,
  probs = c(1 / 40, 39 / 40)
)

res <- data.frame(
  alpha = c(mod$model_summary$alpha, rep("", 4)),
  lambda = c(round(mod$model_summary$lambda, 4), rep("", 4)),
  `performance type` = c(
    "Training set AUC",
    "Out-of-sample AUC (all)",
    "Out-of-sample AUC (Czech)",
    "Out-of-sample AUC (Italy)",
    "External validation AUC"
  ),
  `performance [95% CI]` = c(
    sprintf("%.3f [%.3f to %.3f]", trainAUC, trainCI[1], trainCI[2]),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$model_summary$auc_OutOfSample,
      mod$model_summary$auc_oos_CIL,
      mod$model_summary$auc_oos_CIU
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$country_AUC$auc_OutOfSample_CZ,
      mod$country_AUC$auc_oos_CIL_CZ,
      mod$country_AUC$auc_oos_CIU_CZ
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      mod$country_AUC$auc_OutOfSample_IT,
      mod$country_AUC$auc_oos_CIL_IT,
      mod$country_AUC$auc_oos_CIU_IT
    ),
    sprintf(
      "%.3f [%.3f to %.3f]",
      roc_microbiome_all[["ci"]][2],
      roc_microbiome_all[["ci"]][1],
      roc_microbiome_all[["ci"]][3]
    )
  )
)


kableExtra::kable(
  res %>% mutate(across(where(is.numeric), ~ round(.x, 3))),
  caption = "Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using CLR-transformed counts of bacteria reads. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized `alpha` (mixing parameter) and `lambda` (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method."
)

Performance of the elastic‑net logistic regression model for discriminating vegan from omnivore status using CLR-transformed counts of bacteria reads. The model was developed on the combined training data (Czech and Italian cohorts), with the optimmized `alpha` (mixing parameter) and `lambda` (penalty strength) selected via 10-fold cross-validation. Internal validation employed 500 out‑of‑bag bootstrap resamples: the out‑of‑sample AUC is the mean across resamples, and its 95 % confidence interval (CI) is given by the 2.5th and 97.5th percentiles of the bootstrap distribution. The training‑set AUC and its CI were computed analogously from the in‑bag predictions. External validation was carried out on an independent Czech cohort; the reported AUC is the point estimate on that cohort, and its 95% CI was obtained with DeLong’s method.
alpha	lambda	performance.type	performance..95..CI.
0.4	0.03	Training set AUC	0.998 [0.990 to 1.000]
		Out-of-sample AUC (all)	0.808 [0.704 to 0.890]
		Out-of-sample AUC (Czech)	0.845 [0.711 to 0.952]
		Out-of-sample AUC (Italy)	0.754 [0.581 to 0.900]
		External validation AUC	0.854 [0.782 to 0.926]

6.2 Diet effect across datasets

Similarly as in training data cohorts, we will fit linear model per each of the selected taxa (\(CLR\) - transformed), with a single fixed effect factor of diet.

6.2.1 Linear models in validation cohort

Open code


diet_sensitive_taxa <- result_microbiome %>% 
  filter(fdr_VGdiet_avg < 0.05) %>% 
  pull(outcome)

len <- length(diet_sensitive_taxa)

data_analysis_microbiome <- data_microbiome_validation %>%
  dplyr::mutate(
    Diet_VEGAN = as.numeric(
      dplyr::if_else(
        Diet == 'VEGAN', 1, 0
      )
    )
  ) %>%
  dplyr::select(
    Diet_VEGAN,
    all_of(diet_sensitive_taxa)
  )

6.2.1.1 Define number of microbiome and covariates

Open code

n_covarites <- 1
n_features <- ncol(data_analysis_microbiome) - n_covarites

6.2.1.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGdiet <- vector('double', n_features)
P_VGdiet <- vector('double', n_features)
CI_L_VGdiet <- vector('double', n_features)
CI_U_VGdiet <- vector('double', n_features)

6.2.1.3 Estimate over outcomes

Open code

for (i in 1:n_features) {
  ## define variable
  data_analysis_microbiome$outcome <- data_analysis_microbiome[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_VEGAN, data = data_analysis_microbiome)

  ## save results
  outcome[i] <- names(data_analysis_microbiome)[i + n_covarites]

  ## diet effect
  tr <- confint(model)

  CI_L_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][1]
  CI_U_VGdiet[i] <- tr[which(row.names(tr) == "Diet_VEGAN"), ][2]

  est_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 1
  ]

  P_VGdiet[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_VEGAN"
    ), 4
  ]
}

6.2.1.4 Results table

Open code

result_microbiome_val <- data.frame(
  outcome,
  est_VGdiet, P_VGdiet,
  CI_L_VGdiet, CI_U_VGdiet
)

kableExtra::kable(result_microbiome_val,
  caption = "Results of linear models estimating the effect of diet on the CLR-transformed relative abundance of a given taxon as the outcome. Only taxa whose relative abundance significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `est` represents the estimated effects (regression coefficient), indicating how much the CLR-transformed relative abundnace of a given differ between vegans and omnivores. `P`: p-value, `fdr`: p-value adjusted for multiple comparisons, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model"
)

Results of linear models estimating the effect of diet on the CLR-transformed relative abundance of a given taxon as the outcome. Only taxa whose relative abundance significantly differed between diet groups in training cohorts (FDR < 0.05, average effect across both training cohorts) were included. `est` represents the estimated effects (regression coefficient), indicating how much the CLR-transformed relative abundnace of a given differ between vegans and omnivores. `P`: p-value, `fdr`: p-value adjusted for multiple comparisons, and `CI_L` and `CI_U` represent the lower and upper bounds of the 95% confidence interval, respectively. All estimates in a single row are based on a single model
outcome	est_VGdiet	P_VGdiet	CI_L_VGdiet	CI_U_VGdiet
Escherichia_coli\|SGB10068	-0.0694146	0.8746042	-0.9398378	0.8010086
Lawsonibacter_asaccharolyticus\|SGB15154	-1.9190565	0.0000027	-2.6837374	-1.1543757
GGB9707_SGB15229\|SGB15229	-0.2846930	0.5608316	-1.2525875	0.6832016
Lachnospiraceae_bacterium_AM48_27BH\|SGB4706	0.3032289	0.0960612	-0.0548456	0.6613034
GGB2653_SGB3574\|SGB3574	-0.6563113	0.0846063	-1.4038436	0.0912210
Eubacterium_ramulus\|SGB4959	-0.7616333	0.0028997	-1.2565170	-0.2667496
Blautia_massiliensis\|SGB4826	-0.2022311	0.5258513	-0.8325042	0.4280419
Ruminococcus_torques\|SGB4608	-1.9930762	0.0002807	-3.0431601	-0.9429923
Clostridiaceae_unclassified_SGB4771\|SGB4771	0.4428120	0.1852384	-0.2157601	1.1013841

Open code


if (file.exists("gitignore/lm_results/result_microbiom_validation_SGB50.csv") == FALSE) {
  write.table(result_microbiome_val,
    "gitignore/lm_results/result_microbiom_validation_SGB50.csv",
    row.names = FALSE
  )
}

6.2.2 Forest plot

6.2.2.1 Prepare data

Open code


## subset result tables
result_microbiome_subset <- result_microbiome %>%
  filter(outcome %in% diet_sensitive_taxa)

result_microbiome_val_subset <- result_microbiome_val %>%
  filter(outcome %in% diet_sensitive_taxa)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_taxa, 3),
  beta = c(
    result_microbiome_subset$est_VGdiet_inCZ,
    result_microbiome_subset$est_VGdiet_inIT,
    result_microbiome_val_subset$est_VGdiet
  ),
  lower = c(
    result_microbiome_subset$CI_L_VGdiet_inCZ,
    result_microbiome_subset$CI_L_VGdiet_inIT,
    result_microbiome_val_subset$CI_L_VGdiet
  ),
  upper = c(
    result_microbiome_subset$CI_U_VGdiet_inCZ,
    result_microbiome_subset$CI_U_VGdiet_inIT,
    result_microbiome_val_subset$CI_U_VGdiet
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)

## define ordering
validation_order <- data_forest %>%
  group_by(outcome) %>%
  summarise(beta_mean = mean(beta), .groups = "drop") %>%
  arrange(beta_mean) %>%
  pull(outcome)


## Define 'winners'
up_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(elacoef %>% mutate(outcome = microbiome) %>% select(-microbiome), 
            by = 'outcome') %>% 
  filter(beta_CZ > 0,
         beta_IT > 0,
         lower_Validation > 0,
         beta_ela > 0.1) %>% 
  pull(outcome)

down_winners <- data_forest %>% 
  pivot_wider(names_from = dataset,
              values_from = c(beta, lower, upper)) %>% 
  left_join(elacoef %>% mutate(outcome = microbiome) %>% select(-microbiome), 
            by = 'outcome') %>% 
  filter(beta_CZ < 0,
         beta_IT < 0,
         upper_Validation < 0,
         beta_ela < -0.1) %>% 
  pull(outcome)

winners <- c(up_winners, down_winners)

 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
  left_join(
    elacoef %>% mutate(outcome = microbiome) %>% select(-microbiome), 
    by = 'outcome') %>% 
   mutate(outcome = factor(outcome, levels = validation_order))

6.2.2.2 Plotting

Open code

plotac <- "forest_plot_microbiome_SGB50"
path <- "gitignore/figures"

colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

assign(plotac, ggplot(
  data_forest, aes(x = outcome, y = beta, ymin = lower, ymax = upper, color = dataset)
) +
  geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
  scale_color_manual(values = colors) +
  labs(
    y = "Effect of vegan diet on CLR-transformed relative abundance",
    x = "Outcome",
    color = "Dataset"
  ) +
  theme_minimal() +
  coord_flip() + # Flip coordinates to have outcomes on the y-axis
  scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$outcome, "**"), 
             as.character(data_forest$outcome)
      ), data_forest$outcome
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

The forest plot illustrates the effects of a vegan diet on the CLR-transformed relative abundances of selected taxa, along with their 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech cohort (validation). Green, blue, and grey points/lines represent differences in CLR-transformed relative abundance between vegans and omnivores within the Italian cohort, Czech cohort, and Czech validation cohort, respectively. Positive values suggest a higher relative abundance in vegans compared to omnivores. Only taxa whose relative abundances differed between vegan and omnivores (as an average effect across both training cohorts) were selected, and these effects were further validated in the independent cohort. The estimates for the training cohorts were obtained from a single linear model that included `Diet`, `Country`, and the interaction term `Diet x Country` as fixed-effect predictors. In the independent Czech validation cohort, `Diet` was the only fixed-effect predictor. Taxa validated in the linear model and showing predictive power in the elastic net model (|β| > 0.1) are **bold**

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 8,
    height = 5
  )
}

6.2.3 Boxplot

Open code

plotac <- "boxplot_microbiome_SGB50"
path <- "gitignore/figures"

colo <- c('#F9FFAF','#329243')

boxplot_cond <- function(variable) {
  
  p <- ggboxplot(data_merged, 
                 x = 'Diet', 
                 y = variable, 
                 fill = 'Diet', 
                 tip.length = 0.15,
                 palette = colo,
                 outlier.shape = 1,
                 lwd = 0.25,
                 outlier.size = 0.8,
                 facet.by = 'Data',
                 title = variable,
                 ylab = 'CLR(taxa proportion)') +
    
    theme(
      plot.title = element_text(size = 10), 
      axis.title = element_text(size = 8),  
      axis.text.y = element_text(size = 7),
      axis.text.x = element_blank(),
      axis.title.x = element_blank()
    ) 
  return(p)
}

# Plot all outcomes
plots <- map(diet_sensitive_taxa, boxplot_cond)

# Create a matrix of plots
assign(plotac, 
       ggarrange(plotlist = plots, ncol = 3, nrow = 3,  common.legend = TRUE)
       )

get(plotac)

CLR-transformed relative abundances of selected taxa across all 3 cohorts (Czech and Italian training cohorts and an independent Czech valdiation cohort) and across dietary groups

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 9,
    height = 12
  )
}

7 Linear model VG duration

Next, we fit another series of linear models, this time modelling clr-transformed taxa proportions (derived from shotgun metagenomic sequencing) using the following fixed-effect predictors: duration of vegan status (Diet_duration, scaled in tens of years), Country, their interaction (Diet_duration × Country), and Age:

\[ \text{CLR(taxa proportion)} = \alpha + \beta_{1} \times \text{Country} + \beta_{2} \times \text{Diet duration} + \beta_{3} \times (\text{Country}:\text{Diet duration}) + \beta_{4} \times \text{Age} + \epsilon \]

This analysis includes only vegan participants, while omnivores are excluded. The aim was to test whether specific taxa that differ between vegans and omnivores also vary within the vegan group itself, depending on how long participants have been vegan. In other words, we asked whether long-term vegans show stronger enrichment or depletion of diet-sensitive taxa compared to those who adopted the diet more recently.

Because longer vegan duration is likely correlated with age (e.g. a 20-year-old cannot have 20 years of vegan history, unlike a 40- or 60-year-old), we also adjusted for age in the models.

7.1 Get data

7.1.1 Training

Open code

meta_trainIT <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 1)
meta_trainCZ <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 2) %>% 
  dplyr::mutate(ID = paste0('T', ID),
                Sex = SEX) %>% 
  dplyr::select(-SEX)

meta_trainIT[1:5,]
##       ID   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F
meta_trainCZ[1:5,]
##     ID   COHORT GRP Diet_duration      Age Sex
## 1  T97 CZ_train  OM             0 26.76438   M
## 2  T98 CZ_train  OM             0 27.61370   F
## 3 T134 CZ_train  OM             0 21.64384   F
## 4 T136 CZ_train  OM             0 31.25479   M
## 5 T137 CZ_train  OM             0 40.42192   F
data_meta_original <- bind_rows(meta_trainIT, meta_trainCZ) %>% 
  rename(`Sample` = `ID`)
  
data_meta_original[1:5,]
##   Sample   COHORT GRP Diet_duration  Age Sex
## 1 VOV002 IT_train  OM             0 61.4   F
## 2 VOV003 IT_train  OM             0 43.7   F
## 3 VOV004 IT_train  OM             0 61.1   F
## 4 VOV006 IT_train  OM             0 31.7   F
## 5 VOV007 IT_train  OM             0 31.8   F

data_microbiome_original2 <- data_microbiome_original %>% 
  left_join(data_meta_original, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:Sex, everything())

data_microbiome_original2 %>% dim()
## [1] 166 167

data_microbiome_original2[
  1:4, 
  (ncol(data_microbiome_original2)-10):ncol(data_microbiome_original2)
  ]
##   Lawsonibacter_SGB15145|SGB15145 Lachnospiraceae_bacterium|SGB4782
## 1                       -3.254506                         1.8114292
## 2                       -3.007886                         0.8013213
## 3                       -2.681117                        -3.3882164
## 4                       -3.568587                        -4.0931105
##   Mediterraneibacter_butyricigenes|SGB25493 Dorea_sp_AF36_15AT|SGB4552
## 1                                 0.2638296                  1.7620764
## 2                                -0.5842658                  0.1247778
## 3                                -0.4519300                  0.6997812
## 4                                -2.2933822                 -0.1508825
##   Oliverpabstia_intestinalis|SGB4868 Lentihominibacter_faecis|SGB3957
## 1                          0.9384336                        3.6253807
## 2                         -1.3365725                       -2.5539683
## 3                         -6.0743603                        0.4644484
## 4                         -4.7664821                       -5.1422731
##   Bacteroides_xylanisolvens|SGB1867 Faecalibacillus_intestinalis|SGB6754
## 1                        -3.1769622                            -4.827240
## 2                        -0.4187274                             1.398531
## 3                         2.1325082                             1.680664
## 4                         5.0549940                             3.599577
##   Romboutsia_timonensis|SGB6148 Dorea_longicatena|SGB4582
## 1                      4.672154                 1.9383473
## 2                      1.742167                -0.8691416
## 3                      1.622448                 1.1114341
## 4                      2.229049                 2.7024698
##   Clostridiaceae_unclassified_SGB4771|SGB4771
## 1                                 -0.07941369
## 2                                  2.29991361
## 3                                 -0.05719320
## 4                                  0.49060577

data_microbiome_original2[1:4, 1:10]
##   Sample Country  Diet   COHORT GRP Diet_duration      Age Sex
## 1    T36      CZ VEGAN CZ_train  VG           3.5 31.64932   F
## 2    T47      CZ VEGAN CZ_train  VG           9.0 43.79178   M
## 3    T55      CZ VEGAN CZ_train  VG          10.0 29.83014   M
## 4    T59      CZ VEGAN CZ_train  VG           8.0 30.24658   M
##   Bacteroides_stercoris|SGB1830 Alistipes_putredinis|SGB2318
## 1                    -4.3668515                     2.101865
## 2                     2.8102872                     2.489236
## 3                     0.3063494                     2.099532
## 4                    -4.6809321                     4.473566

7.1.2 Validation

Open code

data_meta_valid <- read.xlsx('gitignore/data/diet_duration_age.xlsx', sheet = 3) %>% 
  rename(`Sample` = `ID`)

data_microbiome_valid2 <- data_microbiome_validation %>% 
  left_join(data_meta_valid, by = 'Sample') %>% 
  select(Sample:Diet, COHORT:SEX, everything())

data_microbiome_valid2 %>% dim()
## [1] 103 167

data_microbiome_valid2[
  1:4, 
  (ncol(data_microbiome_valid2)-10):ncol(data_microbiome_valid2)
  ]
##   Lawsonibacter_SGB15145|SGB15145 Lachnospiraceae_bacterium|SGB4782
## 1                     0.001834729                         0.8058615
## 2                     1.292624120                         0.5246770
## 3                    -0.570515546                        -4.0497319
## 4                     0.048857873                         2.4131420
##   Mediterraneibacter_butyricigenes|SGB25493 Dorea_sp_AF36_15AT|SGB4552
## 1                               -0.55278161                  0.6628579
## 2                               -2.27296648                 -0.5977329
## 3                               -0.05967569                  0.5214815
## 4                               -0.88270985                 -0.3220111
##   Oliverpabstia_intestinalis|SGB4868 Lentihominibacter_faecis|SGB3957
## 1                           2.193594                       -1.2077709
## 2                           2.343465                        0.5487621
## 3                           1.564105                       -0.9938200
## 4                           1.535005                        0.6384993
##   Bacteroides_xylanisolvens|SGB1867 Faecalibacillus_intestinalis|SGB6754
## 1                        -1.4252094                            1.8385311
## 2                        -1.7240507                           -3.5985503
## 3                         2.6782510                           -0.2160523
## 4                         0.3518265                            2.2343130
##   Romboutsia_timonensis|SGB6148 Dorea_longicatena|SGB4582
## 1                    -0.3390487                 0.3386811
## 2                    -2.7192637                -2.4139221
## 3                     0.4630830                 1.4634185
## 4                    -1.4804285                -5.0108075
##   Clostridiaceae_unclassified_SGB4771|SGB4771
## 1                                   0.5347861
## 2                                   1.0420602
## 3                                  -3.2722020
## 4                                   0.8586087

data_microbiome_valid2[1:4, 1:10]
##   Sample  Data  Diet COHORT GRP Diet_duration      Age SEX
## 1     K4 valid VEGAN CZ_val  VN          <NA> 30.98904   M
## 2     K5 valid VEGAN CZ_val  VN          <NA> 30.66575   F
## 3     K7 valid VEGAN CZ_val  VN          <NA> 32.97808   F
## 4    K12 valid VEGAN CZ_val  VN          <NA> 27.89315   M
##   Bacteroides_stercoris|SGB1830 Alistipes_putredinis|SGB2318
## 1                     0.9136811                     2.723792
## 2                     0.7035927                     4.177365
## 3                     2.1800671                     3.713772
## 4                     2.3322609                    -1.676288

data_microbiome_valid2 %>% select(Sample, Diet, GRP)
##     Sample  Diet GRP
## 1       K4 VEGAN  VN
## 2       K5 VEGAN  VN
## 3       K7 VEGAN  VN
## 4      K12 VEGAN  VN
## 5      K13 VEGAN  VN
## 6      K15 VEGAN  VN
## 7      K16 VEGAN  VN
## 8      K18 VEGAN  VN
## 9      K19 VEGAN  VN
## 10     K25 VEGAN  VN
## 11     K26 VEGAN  VN
## 12     K31 VEGAN  VN
## 13     K32 VEGAN  VN
## 14     K34 VEGAN  VN
## 15     K35 VEGAN  VN
## 16     K38 VEGAN  VN
## 17     K39 VEGAN  VN
## 18     K42  OMNI  OM
## 19     K43  OMNI  OM
## 20     K45 VEGAN  VN
## 21     K46 VEGAN  VN
## 22     K48 VEGAN  VN
## 23     K49 VEGAN  VN
## 24     K55 VEGAN  VN
## 25     K56 VEGAN  VN
## 26     K61 VEGAN  VN
## 27     K62 VEGAN  VN
## 28     K64 VEGAN  VN
## 29     K65 VEGAN  VN
## 30     K73 VEGAN  VN
## 31     K74 VEGAN  VN
## 32     K82 VEGAN  VN
## 33     K85 VEGAN  VN
## 34    K103 VEGAN  VN
## 35    K106 VEGAN  VN
## 36    K107 VEGAN  VN
## 37    K114 VEGAN  VN
## 38    K117 VEGAN  VN
## 39    K119 VEGAN  VN
## 40    K120 VEGAN  VN
## 41    K123  OMNI  OM
## 42    K124  OMNI  OM
## 43    K126  OMNI  OM
## 44    K127  OMNI  OM
## 45    K130  OMNI  OM
## 46    K131  OMNI  OM
## 47    K143  OMNI  OM
## 48    K144  <NA>  OM
## 49    K150  OMNI  OM
## 50    K151  OMNI  OM
## 51    K154  OMNI  OM
## 52    K155  OMNI  OM
## 53    K175  OMNI  OM
## 54    K176  OMNI  OM
## 55    K178  OMNI  OM
## 56    K179  OMNI  OM
## 57    K182  OMNI  OM
## 58    K183  OMNI  OM
## 59    K198  OMNI  OM
## 60    K199  OMNI  OM
## 61    K201  OMNI  OM
## 62    K202  OMNI  OM
## 63    K205 VEGAN  VN
## 64    K206 VEGAN  VN
## 65    K209  OMNI  OM
## 66    K210  OMNI  OM
## 67    K216 VEGAN  VN
## 68    K217 VEGAN  VN
## 69    K219 VEGAN  VN
## 70    K220 VEGAN  VN
## 71    K222  OMNI  OM
## 72    K223  OMNI  OM
## 73    K226 VEGAN  VN
## 74    K227 VEGAN  VN
## 75    K230  OMNI  OM
## 76    K231  OMNI  OM
## 77    K234  OMNI  OM
## 78    K235  OMNI  OM
## 79    K238  OMNI  OM
## 80    K239  OMNI  OM
## 81    K250 VEGAN  VN
## 82    K251 VEGAN  VN
## 83    K253  OMNI  OM
## 84    K254  OMNI  OM
## 85    K258 VEGAN  VN
## 86    K260 VEGAN  VN
## 87    K261 VEGAN  VN
## 88    K263 VEGAN  VN
## 89    K264 VEGAN  VN
## 90    K266  OMNI  OM
## 91    K267  OMNI  OM
## 92    K270  OMNI  OM
## 93    K281 VEGAN  VN
## 94    K284 VEGAN  VN
## 95    K285 VEGAN  VN
## 96    K289 VEGAN  VN
## 97    K291 VEGAN  VN
## 98    K292 VEGAN  VN
## 99    K298  OMNI  OM
## 100   K299  OMNI  OM
## 101   K318  OMNI  OM
## 102   K329  OMNI  OM
## 103   K330  OMNI  OM

7.2 Training cohort

7.2.1 Select data

Open code

data_analysis <- data_microbiome_original2 %>%
    mutate(Country_IT = as.numeric(
      dplyr::if_else(
        Country == "IT", 0.5, -0.5
      )
    )
  ) %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>% 
  select(1:8, Country_IT, all_of(diet_sensitive_taxa))

summary(data_analysis[ , 1:12])
##     Sample            Country              Diet              COHORT         
##  Length:96          Length:96          Length:96          Length:96         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      GRP            Diet_duration         Age            Sex           
##  Length:96          Min.   : 0.600   Min.   :18.25   Length:96         
##  Class :character   1st Qu.: 3.150   1st Qu.:27.57   Class :character  
##  Mode  :character   Median : 5.000   Median :32.15   Mode  :character  
##                     Mean   : 5.492   Mean   :34.97                     
##                     3rd Qu.: 6.000   3rd Qu.:40.77                     
##                     Max.   :20.000   Max.   :60.70                     
##    Country_IT      Escherichia_coli|SGB10068
##  Min.   :-0.5000   Min.   :-5.3506          
##  1st Qu.:-0.5000   1st Qu.:-4.0660          
##  Median :-0.5000   Median :-0.9493          
##  Mean   :-0.1042   Mean   :-1.0616          
##  3rd Qu.: 0.5000   3rd Qu.: 1.3307          
##  Max.   : 0.5000   Max.   : 6.6459          
##  Lawsonibacter_asaccharolyticus|SGB15154 GGB9707_SGB15229|SGB15229
##  Min.   :-10.77923                       Min.   :-4.8004          
##  1st Qu.: -1.34999                       1st Qu.:-1.2932          
##  Median :  0.55754                       Median : 1.3718          
##  Mean   :  0.07592                       Mean   : 0.8356          
##  3rd Qu.:  2.64107                       3rd Qu.: 2.5950          
##  Max.   :  5.67190                       Max.   : 5.8891

7.2.2 Define number of microbiome and covariates

Open code

n_covarites <- 9
n_features <- ncol(data_analysis) - n_covarites

7.2.3 Create empty objects

Open code

outcome <- vector('double', n_features)

est_Diet_duration_inCZ <- vector('double', n_features)
est_Diet_duration_inIT <- vector('double', n_features)
est_Diet_duration_avg <- vector('double', n_features)

est_ITcountry_avg <- vector('double', n_features)
diet_country_int <- vector('double', n_features)

est_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)

P_Diet_duration_inCZ <- vector('double', n_features)
P_Diet_duration_inIT <- vector('double', n_features)
P_Diet_duration_avg <- vector('double', n_features)

P_ITcountry_avg <- vector('double', n_features)
P_diet_country_int <- vector('double', n_features)

CI_L_Diet_duration_inCZ <- vector('double', n_features)
CI_L_Diet_duration_inIT <- vector('double', n_features)
CI_L_Diet_duration_avg <- vector('double', n_features)

CI_U_Diet_duration_inCZ <- vector('double', n_features)
CI_U_Diet_duration_inIT <- vector('double', n_features)
CI_U_Diet_duration_avg <- vector('double', n_features)

7.2.4 Estimate over outcomes

Open code

if (file.exists(
  "gitignore/lm_results/result_microbiome_SGB50_VGdur_training.Rds"
) == FALSE) {
  for (i in 1:n_features) {
    ## define variable
    data_analysis$outcome <- data_analysis[, (i + n_covarites)]

    ## fit model
    model <- lm(outcome ~ Country_IT * Diet_duration + Age,
      data = data_analysis %>%
        mutate(
          Diet_duration = (Diet_duration / 10),
          Age = (Age - 33) / 10
        )
    )

    ## get contrast (effects of diet BY COUNTRY)
    contrast_emm <- summary(
      pairs(
        emmeans(
          model,
          specs = ~ Diet_duration | Country_IT,
          at = list(Diet_duration = c(0, 1))
        ),
        interaction = TRUE,
        adjust = "none"
      ),
      infer = c(TRUE, TRUE)
    )

    ## save results
    outcome[i] <- names(data_analysis)[i + n_covarites]

    ## country and Age effect
    est_ITcountry_avg[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Country_IT"
      ), 1
    ]

    P_ITcountry_avg[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Country_IT"
      ), 4
    ]

    est_Age[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Age"
      ), 1
    ]

    P_Age[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Age"
      ), 4
    ]

    ## diet effect
    tr <- confint(model)

    CI_L_Diet_duration_avg[i] <- tr[which(row.names(tr) == "Diet_duration"), ][1]
    CI_U_Diet_duration_avg[i] <- tr[which(row.names(tr) == "Diet_duration"), ][2]

    est_Diet_duration_avg[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Diet_duration"
      ), 1
    ]

    P_Diet_duration_avg[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Diet_duration"
      ), 4
    ]

    est_Diet_duration_inCZ[i] <- -contrast_emm[1, 3]
    P_Diet_duration_inCZ[i] <- contrast_emm$p.value[1]
    CI_L_Diet_duration_inCZ[i] <- -contrast_emm$upper.CL[1]
    CI_U_Diet_duration_inCZ[i] <- -contrast_emm$lower.CL[1]

    est_Diet_duration_inIT[i] <- -contrast_emm[2, 3]
    P_Diet_duration_inIT[i] <- contrast_emm$p.value[2]
    CI_L_Diet_duration_inIT[i] <- -contrast_emm$upper.CL[2]
    CI_U_Diet_duration_inIT[i] <- -contrast_emm$lower.CL[2]

    ## interaction
    diet_country_int[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Country_IT:Diet_duration"
      ), 1
    ]

    P_diet_country_int[i] <- summary(model)$coefficients[
      which(
        names(model$coefficients) == "Country_IT:Diet_duration"
      ), 4
    ]
  }

  result_microbiome <- data.frame(
    outcome,
    est_ITcountry_avg, P_ITcountry_avg,
    est_Age, P_Age,
    est_Diet_duration_avg, P_Diet_duration_avg,
    est_Diet_duration_inCZ, P_Diet_duration_inCZ,
    est_Diet_duration_inIT, P_Diet_duration_inIT,
    diet_country_int, P_diet_country_int,
    CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
    CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
    CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
  )

  saveRDS(
    result_microbiome,
    "gitignore/lm_results/result_microbiome_SGB50_VGdur_training.Rds"
  )
}

result_microbiome <- readRDS(
  "gitignore/lm_results/result_microbiome_SGB50_VGdur_training.Rds"
)

7.2.5 Show and save results

Open code

kableExtra::kable(
  result_microbiome %>%
    dplyr::select(
      outcome,
      est_ITcountry_avg, P_ITcountry_avg,
      est_Age, P_Age,
      est_Diet_duration_avg, P_Diet_duration_avg,
      est_Diet_duration_inCZ, P_Diet_duration_inCZ,
      est_Diet_duration_inIT, P_Diet_duration_inIT,
      diet_country_int, P_diet_country_int,
      CI_L_Diet_duration_avg, CI_U_Diet_duration_avg,
      CI_L_Diet_duration_inCZ, CI_U_Diet_duration_inCZ,
      CI_L_Diet_duration_inIT, CI_U_Diet_duration_inIT
    ) %>% arrange(P_Diet_duration_avg),
  caption = "Results of linear models, modelling CLR-transformed relative abundances of taxa (inferred from shotgun metagenomic sequencing) as the outcome, with vegan status duration (`Diet_duration`), `Country`, their interaction (`Diet_duration × Country`), and `Age` as fixed-effect predictors, using training data only (Czech and Italian vegan cohorts). Only taxa with CLR-transformed relative abundances differing by diet (FDR < 0.05, average effect across both countries) were included. `est`: denotes estimated effects (regression coefficients), i.e. expected change in clr-transformed relative abundnace per 10 years of vegan diet or age, or for country (Italy vs Czechia). `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix: effect averaged across cohorts, whereas `inCZ` and `inIT` indicate effects in the Czech or Italian cohort, respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of vegan diet duration between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model with interaction",
  escape = FALSE
)

Results of linear models, modelling CLR-transformed relative abundances of taxa (inferred from shotgun metagenomic sequencing) as the outcome, with vegan status duration (`Diet_duration`), `Country`, their interaction (`Diet_duration × Country`), and `Age` as fixed-effect predictors, using training data only (Czech and Italian vegan cohorts). Only taxa with CLR-transformed relative abundances differing by diet (FDR < 0.05, average effect across both countries) were included. `est`: denotes estimated effects (regression coefficients), i.e. expected change in clr-transformed relative abundnace per 10 years of vegan diet or age, or for country (Italy vs Czechia). `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. `avg` suffix: effect averaged across cohorts, whereas `inCZ` and `inIT` indicate effects in the Czech or Italian cohort, respectively. Interaction effects are reported as `diet_country_int` (difference in the effect of vegan diet duration between Italian and Czech cohorts; positive values indicate a stronger effect in Italian, negative values a stronger effect in Czech cohort) and `P_diet_country_int` (its p-value). All estimates in a single row are based on a single model with interaction
outcome	est_ITcountry_avg	P_ITcountry_avg	est_Age	P_Age	est_Diet_duration_avg	P_Diet_duration_avg	est_Diet_duration_inCZ	P_Diet_duration_inCZ	est_Diet_duration_inIT	P_Diet_duration_inIT	diet_country_int	P_diet_country_int	CI_L_Diet_duration_avg	CI_U_Diet_duration_avg	CI_L_Diet_duration_inCZ	CI_U_Diet_duration_inCZ	CI_L_Diet_duration_inIT	CI_U_Diet_duration_inIT
Eubacterium_ramulus\|SGB4959	-3.0241099	0.0026160	0.0399395	0.8836012	2.2067261	0.0171792	1.1922007	0.1947639	3.2212515	0.0297831	2.0290508	0.2130633	0.4008851	4.012567	-0.6207624	3.005164	0.3228608	6.119642
Lachnospiraceae_bacterium_AM48_27BH\|SGB4706	-1.1027483	0.2401632	0.1328495	0.6101633	-0.4654026	0.5930527	-0.0396491	0.9638002	-0.8911561	0.5238915	-0.8515070	0.5827783	-2.1891526	1.258347	-1.7701975	1.690899	-3.6577902	1.875478
Escherichia_coli\|SGB10068	-0.8920315	0.4970426	-0.0679348	0.8524443	0.4883506	0.6891894	-0.9465653	0.4405544	1.9232665	0.3274679	2.8698317	0.1885707	-1.9293193	2.906020	-3.3737703	1.480640	-1.9571150	5.803648
Clostridiaceae_unclassified_SGB4771\|SGB4771	-2.1585710	0.1180814	0.5346135	0.1638347	0.3220043	0.8008569	-1.4781631	0.2504107	2.1221716	0.3016614	3.6003348	0.1154867	-2.2063799	2.850388	-4.0165190	1.060193	-1.9359072	6.180250
Ruminococcus_torques\|SGB4608	-2.4697096	0.0605216	0.1453199	0.6888539	0.1750625	0.8851892	-0.4343591	0.7212763	0.7844841	0.6869561	1.2188432	0.5725051	-2.2264625	2.576588	-2.8453555	1.976637	-3.0699847	4.638953
GGB9707_SGB15229\|SGB15229	0.4133691	0.6590228	0.1627098	0.5329294	-0.1235040	0.8872622	-0.3885014	0.6570505	0.1414934	0.9193933	0.5299947	0.7325617	-1.8491162	1.602108	-2.1209193	1.343916	-2.6281295	2.911116
GGB2653_SGB3574\|SGB3574	1.1122778	0.3197002	-0.4067798	0.1920358	-0.0854657	0.9343257	0.3287073	0.7523026	-0.4996387	0.7641180	-0.8283460	0.6538040	-2.1399639	1.969032	-1.7338937	2.391308	-3.7971264	2.797849
Blautia_massiliensis\|SGB4826	-5.2698785	0.0009497	-0.4295762	0.3197913	0.1173654	0.9349990	-1.2071259	0.4043109	1.4418566	0.5328888	2.6489825	0.3024436	-2.7332541	2.967985	-4.0689880	1.654736	-3.1334128	6.017126
Lawsonibacter_asaccharolyticus\|SGB15154	4.6204742	0.0003335	-0.3772567	0.2769607	0.0303182	0.9790742	0.7885767	0.4973356	-0.7279402	0.6949009	-1.5165169	0.4617089	-2.2593926	2.320029	-1.5101646	3.087318	-4.4029463	2.947066

Open code


if (file.exists(
  "gitignore/lm_results/result_microbiome_SGB50_VGdur_training.csv"
) == FALSE) {
  write.table(result_microbiome,
    "gitignore/lm_results/result_microbiome_SGB50_VGdur_training.csv",
    row.names = FALSE
  )
}

7.3 Validation cohort

Open code

data_analysis_microbiome <- data_microbiome_valid2 %>%
  filter(Diet == 'VEGAN',
         !is.na(Diet_duration)) %>%
  dplyr::mutate(Diet_duration = as.numeric(as.character(Diet_duration))) %>% 
  dplyr::select(
    Diet_duration, Age,
    all_of(diet_sensitive_taxa)
  )

summary(data_analysis_microbiome[ , 1:11])
##  Diet_duration         Age        Escherichia_coli|SGB10068
##  Min.   : 0.170   Min.   :21.54   Min.   :-4.869           
##  1st Qu.: 4.062   1st Qu.:30.05   1st Qu.:-4.471           
##  Median : 6.000   Median :32.81   Median :-4.016           
##  Mean   : 6.790   Mean   :32.37   Mean   :-2.871           
##  3rd Qu.:10.000   3rd Qu.:34.24   3rd Qu.:-1.921           
##  Max.   :15.000   Max.   :44.08   Max.   : 3.879           
##  Lawsonibacter_asaccharolyticus|SGB15154 GGB9707_SGB15229|SGB15229
##  Min.   :-5.6576                         Min.   :-5.74435         
##  1st Qu.:-3.8140                         1st Qu.:-3.47211         
##  Median :-1.8483                         Median :-0.01788         
##  Mean   :-2.1718                         Mean   :-1.11893         
##  3rd Qu.:-0.5406                         3rd Qu.: 0.72750         
##  Max.   : 1.2122                         Max.   : 3.62343         
##  Lachnospiraceae_bacterium_AM48_27BH|SGB4706 GGB2653_SGB3574|SGB3574
##  Min.   :-2.6946                             Min.   :-5.2888        
##  1st Qu.:-0.1665                             1st Qu.:-3.8391        
##  Median : 0.2422                             Median :-2.2085        
##  Mean   : 0.1637                             Mean   :-2.3608        
##  3rd Qu.: 0.6530                             3rd Qu.:-1.2721        
##  Max.   : 2.5235                             Max.   : 0.7467        
##  Eubacterium_ramulus|SGB4959 Blautia_massiliensis|SGB4826
##  Min.   :-2.8304             Min.   :-1.867              
##  1st Qu.:-0.9214             1st Qu.: 1.115              
##  Median :-0.1930             Median : 1.899              
##  Mean   :-0.1585             Mean   : 2.153              
##  3rd Qu.: 0.5940             3rd Qu.: 3.366              
##  Max.   : 3.6354             Max.   : 5.640              
##  Ruminococcus_torques|SGB4608 Clostridiaceae_unclassified_SGB4771|SGB4771
##  Min.   :-5.2869              Min.   :-4.0655                            
##  1st Qu.:-4.8202              1st Qu.:-2.3478                            
##  Median :-3.4857              Median :-0.8808                            
##  Mean   :-2.7697              Mean   :-1.0196                            
##  3rd Qu.:-0.8871              3rd Qu.: 0.4252                            
##  Max.   : 4.1503              Max.   : 1.8320

7.3.0.1 Define number of microbiome and covariates

Open code

n_covarites <- 2
n_features <- ncol(data_analysis_microbiome) - n_covarites

7.3.0.2 Create empty objects

Open code

outcome <- vector('double', n_features)
est_VGduration <- vector('double', n_features)
P_VGduration <- vector('double', n_features)
est_Age <- vector('double', n_features)
P_Age <- vector('double', n_features)
CI_L_VGduration <- vector('double', n_features)
CI_U_VGduration <- vector('double', n_features)

7.3.0.3 Estimate over outcomes

Open code

for (i in 1:n_features) {
  ## define variable
  data_analysis_microbiome$outcome <- data_analysis_microbiome[, (i + n_covarites)]

  ## fit model
  model <- lm(outcome ~ Diet_duration + Age, 
              data = data_analysis_microbiome %>% 
                mutate(Diet_duration = Diet_duration/10,
                       Age = (Age-33)/10))

  ## save results
  outcome[i] <- names(data_analysis_microbiome)[i + n_covarites]

  ## diet effect
  tr <- confint(model)

  CI_L_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][1]
  CI_U_VGduration[i] <- tr[which(row.names(tr) == "Diet_duration"), ][2]

  est_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 1
  ]

  P_VGduration[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Diet_duration"
    ), 4
  ]
  
    
  est_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 1
  ]

  P_Age[i] <- summary(model)$coefficients[
    which(
      names(model$coefficients) == "Age"
    ), 4
  ]
}

7.3.0.4 Results table

Open code

result_microbiome_val <- data.frame(
  outcome,
  est_Age, P_Age,
  est_VGduration, P_VGduration,
  CI_L_VGduration, CI_U_VGduration
)

kableExtra::kable(result_microbiome_val %>% arrange(P_VGduration),
  caption = "Results of linear models estimating the effect of vegan diet duration on CLR-transformed relative abundances of taxa (inferred from shotgun metagenomic sequencing) as the outcome. Only taxa with significant differences between vegans and omnivores in training cohorts (FDR < 0.05, average effect over both countries) were included. `est`: regression coefficient, i.e. expected change in CLR-transformed relative abundance per +10 years of vegan diet duration and age, respectively. `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. All estimates in a single row are based on a single model"
)

Results of linear models estimating the effect of vegan diet duration on CLR-transformed relative abundances of taxa (inferred from shotgun metagenomic sequencing) as the outcome. Only taxa with significant differences between vegans and omnivores in training cohorts (FDR < 0.05, average effect over both countries) were included. `est`: regression coefficient, i.e. expected change in CLR-transformed relative abundance per +10 years of vegan diet duration and age, respectively. `P`: p-value; `CI_L` and `CI_U`: lower and upper bounds of the 95% confidence interval. All estimates in a single row are based on a single model
outcome	est_Age	P_Age	est_VGduration	P_VGduration	CI_L_VGduration	CI_U_VGduration
Lachnospiraceae_bacterium_AM48_27BH\|SGB4706	0.5262566	0.1434342	-0.6457823	0.1284727	-1.484789	0.1932242
Blautia_massiliensis\|SGB4826	-0.6591379	0.3206348	-0.8070428	0.3029822	-2.364076	0.7499909
Lawsonibacter_asaccharolyticus\|SGB15154	-1.2928799	0.1122039	0.5873405	0.5365784	-1.307748	2.4824289
GGB2653_SGB3574\|SGB3574	0.4467054	0.5115506	0.3452029	0.6669190	-1.255716	1.9461215
Clostridiaceae_unclassified_SGB4771\|SGB4771	-0.3701377	0.5712316	0.3307498	0.6678817	-1.207880	1.8693799
GGB9707_SGB15229\|SGB15229	1.4554517	0.1287218	-0.4402186	0.6939131	-2.673176	1.7927392
Escherichia_coli\|SGB10068	0.2302403	0.7892944	0.2587873	0.7990847	-1.771730	2.2893045
Ruminococcus_torques\|SGB4608	-0.5752660	0.5504503	-0.2229724	0.8442807	-2.490358	2.0444130
Eubacterium_ramulus\|SGB4959	0.0666441	0.9037034	0.0914083	0.8881749	-1.207145	1.3899618

Open code


if (file.exists("gitignore/lm_results/result_microbiom_SGB50_VGdur_valid.csv") == FALSE) {
  write.table(result_microbiome_val,
    "gitignore/lm_results/result_microbiom_SGB50_VGdur_valid.csv",
    row.names = FALSE
  )
}

7.4 Forest plot

7.4.1 Prepare data

Open code


## subset result tables
result_microbiome_subset <- result_microbiome %>%
  filter(outcome %in% diet_sensitive_taxa)

result_microbiome_val_subset <- result_microbiome_val %>%
  filter(outcome %in% diet_sensitive_taxa)

## create a data frame
data_forest <- data.frame(
  outcome = rep(diet_sensitive_taxa, 3),
  beta = c(
    result_microbiome_subset$est_Diet_duration_inCZ,
    result_microbiome_subset$est_Diet_duration_inIT,
    result_microbiome_val_subset$est_VGduration
  ),
  lower = c(
    result_microbiome_subset$CI_L_Diet_duration_inCZ,
    result_microbiome_subset$CI_L_Diet_duration_inIT,
    result_microbiome_val_subset$CI_L_VGduration
  ),
  upper = c(
    result_microbiome_subset$CI_U_Diet_duration_inCZ,
    result_microbiome_subset$CI_U_Diet_duration_inIT,
    result_microbiome_val_subset$CI_U_VGduration
  ),
  dataset = c(
    rep("CZ", len),
    rep("IT", len),
    rep("Validation", len)
  )
)


 data_forest <- data_forest %>%
  mutate(in_winner = if_else(outcome %in% winners, TRUE, FALSE, missing = FALSE)) %>%
  left_join(
    elacoef %>% mutate(outcome = microbiome) %>% select(-microbiome), 
    by = 'outcome') %>% 
   mutate(outcome = factor(outcome, levels = validation_order))

7.4.1.1 Plotting

Open code


plotac <- "forest_plot_microbiome_SGB50_VGduration"
path <- "gitignore/figures"

colors <- c("CZ" = "#150999", "IT" = "#329243", "Validation" = "grey60")

assign(plotac, ggplot(
  data_forest, aes(x = outcome, y = beta, ymin = lower, ymax = upper, color = dataset)
) +
  geom_pointrange(position = position_dodge(width = 0.5), size = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  geom_errorbar(position = position_dodge(width = 0.5), width = 0.2) +
  scale_color_manual(values = colors) +
  labs(
    y = "Effect of vegan diet duration (+10y) on CLR(relative abundance)",
    x = "Outcome",
    color = "Dataset"
  ) +
  theme_minimal() +
  coord_flip() + 
  scale_x_discrete(
    labels = setNames(
      ifelse(data_forest$in_winner, 
             paste0("**", data_forest$outcome, "**"), 
             as.character(data_forest$outcome)
      ), data_forest$outcome
    )
  ) +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = ggtext::element_markdown(size = 10),  
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    legend.position = "bottom"
  )
)

get(plotac)

Forest plot shows the effect of duration of vegan diet (within the vegan sub-population) on CLR-transformed relative abundance of selected taxa (inferred from shotgun metagenomic sequencing), with 95% confidence intervals, across two training cohorts (Czech and Italian) and one independent Czech validation cohort. Green, blue, and grey points/lines represent the expected change in CLR-transformed relative abundance per +10 years of vegan diet duration within the Italian, Czech, and Czech validation cohorts, respectively. Positive values indicate higher relative abundances in long-term vegans compared to short-term vegans. Only taxa with significant differences between vegans and omnivores in training cohorts (average effect over the two countries) were included. Estimates for the training cohorts come from a single linear model with predictors `Diet_duration`, `Country`, their interaction (`Diet_duration x Country`), and `Age` as fixed-effect predictors. `Age` was included as it correlates with `Diet_duration` and could act as a confounder. In the Czech validation cohort, `Diet_duration` and `Age` were fixed-effect predictors. Diet-sensitive taxa are shown in **bold**, in the same order as in the forest plot of vegan–omnivore differences

Open code


if (file.exists(paste0(path, "/", plotac, ".svg")) == FALSE) {  
  ggsave(
    path = paste0(path),
    filename = plotac,
    device = "svg",
    width = 8,
    height = 5
  )
}

8 Reproducibility

Open code

sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Prague
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] vegan_2.6-6.1         lattice_0.22-5        permute_0.9-7        
##  [4] zCompositions_1.5.0-4 truncnorm_1.0-8       NADA_1.6-1.1         
##  [7] survival_3.7-0        MicrobiomeStat_1.2    glmnet_4.1-8         
## [10] pROC_1.18.0           arm_1.12-2            lme4_1.1-35.5        
## [13] Matrix_1.7-0          MASS_7.3-65           car_3.1-2            
## [16] carData_3.0-5         emmeans_1.10.4        brms_2.21.0          
## [19] Rcpp_1.0.13           rms_6.8-1             Hmisc_5.1-3          
## [22] glmmTMB_1.1.9         ggtext_0.1.2          ggdist_3.3.2         
## [25] cowplot_1.1.1         ggpubr_0.4.0          sjPlot_2.8.16        
## [28] kableExtra_1.4.0      flextable_0.9.6       gtsummary_2.0.2      
## [31] compositions_2.0-8    janitor_2.2.0         stringi_1.8.7        
## [34] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
## [37] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
## [40] tidyr_1.3.1           tibble_3.3.0          ggplot2_3.5.1        
## [43] tidyverse_1.3.1       readxl_1.3.1          openxlsx_4.2.8       
## [46] RJDBC_0.2-10          rJava_1.0-11          DBI_1.2.3            
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.6                matrixStats_1.3.0       httr_1.4.2             
##   [4] insight_0.20.2          numDeriv_2016.8-1.1     tools_4.4.3            
##   [7] backports_1.5.0         utf8_1.2.6              sjlabelled_1.2.0       
##  [10] R6_2.6.1                mgcv_1.9-1              withr_3.0.2            
##  [13] Brobdingnag_1.2-7       prettyunits_1.2.0       gridExtra_2.3          
##  [16] bayesm_3.1-6            quantreg_5.98           cli_3.6.5              
##  [19] textshaping_0.3.6       performance_0.12.2      officer_0.6.6          
##  [22] sandwich_3.0-1          labeling_0.4.2          mvtnorm_1.1-3          
##  [25] robustbase_0.93-9       polspline_1.1.25        ggridges_0.5.3         
##  [28] askpass_1.1             QuickJSR_1.3.1          commonmark_1.9.1       
##  [31] systemfonts_1.0.4       StanHeaders_2.32.10     foreign_0.8-90         
##  [34] gfonts_0.2.0            svglite_2.1.3           rstudioapi_0.16.0      
##  [37] httpcode_0.3.0          generics_0.1.4          shape_1.4.6            
##  [40] distributional_0.4.0    zip_2.2.0               inline_0.3.19          
##  [43] loo_2.4.1               abind_1.4-5             lifecycle_1.0.4        
##  [46] multcomp_1.4-18         yaml_2.3.10             snakecase_0.11.1       
##  [49] grid_4.4.3              promises_1.2.0.1        crayon_1.5.3           
##  [52] haven_2.4.3             pillar_1.11.0           knitr_1.50             
##  [55] statip_0.2.3            boot_1.3-31             estimability_1.5.1     
##  [58] codetools_0.2-19        glue_1.7.0              V8_4.4.2               
##  [61] fontLiberation_0.1.0    data.table_1.15.4       vctrs_0.6.5            
##  [64] cellranger_1.1.0        gtable_0.3.0            assertthat_0.2.1       
##  [67] datawizard_0.12.2       xfun_0.52               mime_0.12              
##  [70] coda_0.19-4             modeest_2.4.0           timeDate_3043.102      
##  [73] iterators_1.0.14        statmod_1.4.36          TH.data_1.1-0          
##  [76] nlme_3.1-168            fontquiver_0.2.1        rstan_2.32.6           
##  [79] fBasics_4041.97         tensorA_0.36.2.1        TMB_1.9.14             
##  [82] rpart_4.1.24            colorspace_2.0-2        nnet_7.3-20            
##  [85] tidyselect_1.2.1        processx_3.8.4          timeSeries_4032.109    
##  [88] compiler_4.4.3          curl_6.4.0              rvest_1.0.2            
##  [91] htmlTable_2.4.0         SparseM_1.81            xml2_1.3.3             
##  [94] fontBitstreamVera_0.1.1 posterior_1.6.0         checkmate_2.3.2        
##  [97] scales_1.3.0            DEoptimR_1.0-10         callr_3.7.6            
## [100] spatial_7.3-15          digest_0.6.37           minqa_1.2.4            
## [103] rmarkdown_2.27          htmltools_0.5.8.1       pkgconfig_2.0.3        
## [106] base64enc_0.1-3         stabledist_0.7-2        dbplyr_2.1.1           
## [109] fastmap_1.2.0           rlang_1.1.6             htmlwidgets_1.6.4      
## [112] shiny_1.9.1             farver_2.1.0            zoo_1.8-9              
## [115] jsonlite_2.0.0          magrittr_2.0.3          Formula_1.2-4          
## [118] bayesplot_1.8.1         munsell_0.5.0           gdtools_0.3.7          
## [121] stable_1.1.6            plyr_1.8.6              pkgbuild_1.3.1         
## [124] parallel_4.4.3          ggrepel_0.9.5           sjmisc_2.8.10          
## [127] ggeffects_1.7.0         splines_4.4.3           gridtext_0.1.5         
## [130] hms_1.1.3               sjstats_0.19.0          ps_1.7.7               
## [133] uuid_1.0-3              markdown_1.13           ggsignif_0.6.3         
## [136] stats4_4.4.3            rmutil_1.1.10           rstantools_2.1.1       
## [139] crul_1.5.0              reprex_2.0.1            evaluate_1.0.4         
## [142] RcppParallel_5.1.8      modelr_0.1.8            nloptr_2.0.0           
## [145] tzdb_0.5.0              foreach_1.5.2           httpuv_1.6.5           
## [148] MatrixModels_0.5-3      openssl_1.4.6           clue_0.3-65            
## [151] broom_1.0.6             xtable_1.8-4            rstatix_0.7.0          
## [154] later_1.3.0             viridisLite_0.4.0       ragg_1.4.0             
## [157] lmerTest_3.1-3          cluster_2.1.8.1         timechange_0.3.0       
## [160] bridgesampling_1.1-2

References

Lenth, Russell V. 2024. “Emmeans: Estimated Marginal Means, Aka Least-Squares Means.” https://CRAN.R-project.org/package=emmeans.

1 Introduction

1.1 Statistical Methods

2 Initiation

2.1 Set home directory

2.2 Upload initiation file

3 Data

3.1 Upload all original data

3.1.1 Training set

3.1.1.1 Connect metadata from lipidom table

3.1.1.2 Connect training data

3.1.2 Validation set

3.1.2.1 Get metadata from lipidom table

3.1.2.2 Connect validation data

3.1.3 Get center-log transformed value

3.1.4 Merge training and validation dataset

3.2 Explore

3.2.0.1 Distributions - clr transformed

3.2.0.2 Taxon proportions accross groups

4 Linear models across taxa

4.1 Select data

4.1.1 Define number of microbiome and covariates

4.1.2 Create empty objects

4.1.3 Estimate over outcomes

4.1.4 Results table

4.1.5 Adjust p values

4.1.6 Show and save results

5 Elastic net

5.1 Prepare data for glmnet

5.2 Fit model

5.3 Model summary

5.4 ROC curve - internal validation

5.5 Estimated coefficients

5.6 Plot beta coefficients

6 External validation

6.1 Prediction of diet (elastic net)

6.1.1 Get table of weights, means and SDs

6.1.2 Identify shared and missing predictors

6.1.3 Standardize data in validation set

6.1.4 Result

6.1.4.1 ROC curve

6.1.4.2 Table

6.2 Diet effect across datasets

6.2.1 Linear models in validation cohort

6.2.1.1 Define number of microbiome and covariates

6.2.1.2 Create empty objects

6.2.1.3 Estimate over outcomes

6.2.1.4 Results table

6.2.2 Forest plot

6.2.2.1 Prepare data

6.2.2.2 Plotting

6.2.3 Boxplot

7 Linear model VG duration

7.1 Get data

7.1.1 Training

7.1.2 Validation

7.2 Training cohort

7.2.1 Select data

7.2.2 Define number of microbiome and covariates

7.2.3 Create empty objects

7.2.4 Estimate over outcomes

7.2.5 Show and save results

7.3 Validation cohort

7.3.0.1 Define number of microbiome and covariates

7.3.0.2 Create empty objects

7.3.0.3 Estimate over outcomes

7.3.0.4 Results table

7.4 Forest plot

7.4.1 Prepare data

7.4.1.1 Plotting

8 Reproducibility

References

5.1 Prepare data for `glmnet`