LongitudinalRandomForest is a novel framework for analyzing compositional data in longitudinal studies using Random Forest models. It samples at the subject level so that each iteration contains exactly one sample per subject, which keeps samples within an iteration independent while still using the full dataset across runs.
This framework is specifically designed to address key challenges in analyzing longitudinal compositional data, such as microbiome profiles, by:
- Ensuring subject independence through randomized sampling
- Handling repeated measures in a statistically valid way
- Utilizing Random Forest models for both regression and classification
- Implementing cross-validation with robust sampling strategies
- Subject-Independent Sampling: Each run includes only one sample per subject to avoid intra-subject correlation.
- Full Sample Coverage: All samples appear at least once across multiple randomized datasets.
- Compositional Data Support: Ideal for datasets like microbiome profiles that require special considerations.
- Flexible ML Tasks: Supports both classification and regression modes.
- Cross-Validation Ready: Implements multiple folds and repetitions for robust evaluation.
For subjects with multiple samples (e.g., 5 timepoints per individual):
- The orchestrator creates N random subsets (where N ≥ number of timepoints)
- Each subset contains one sample per subject
- All subsets together ensure complete sample coverage
This design allows robust analysis while maintaining independence between samples.
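The subsetting idea above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `orchestrator.sh` or `create_balanced_folds.py`: the function name `make_subject_independent_subsets`, the `sample_to_subject` dictionary input, and the cycling strategy are all assumptions made for the example. Each subject's samples are shuffled once and then cycled through, so every subset holds one sample per subject and every sample appears at least once, provided the number of subsets is at least the largest number of samples any subject has.

```python
import random
from collections import defaultdict


def make_subject_independent_subsets(sample_to_subject, n_subsets=None, seed=0):
    """Return a list of subsets, each holding one sample ID per subject.

    sample_to_subject: dict mapping sample ID -> subject ID (hypothetical format).
    n_subsets: number of subsets; defaults to the largest per-subject sample count.
    """
    rng = random.Random(seed)

    # Group sample IDs by subject.
    by_subject = defaultdict(list)
    for sample_id, subject_id in sample_to_subject.items():
        by_subject[subject_id].append(sample_id)

    max_samples = max(len(samples) for samples in by_subject.values())
    if n_subsets is None:
        n_subsets = max_samples
    if n_subsets < max_samples:
        raise ValueError("n_subsets must be >= the largest number of samples per subject")

    # Shuffle each subject's samples once, then cycle through them so that
    # every sample is used at least once across the subsets.
    subsets = [[] for _ in range(n_subsets)]
    for subject_id in sorted(by_subject):
        samples = sorted(by_subject[subject_id])
        rng.shuffle(samples)
        for i in range(n_subsets):
            subsets[i].append(samples[i % len(samples)])
    return subsets


if __name__ == "__main__":
    # Toy metadata: three subjects with 5, 3, and 1 timepoints.
    mapping = {f"A_t{t}": "A" for t in range(5)}
    mapping.update({f"B_t{t}": "B" for t in range(3)})
    mapping["C_t0"] = "C"
    for subset in make_subject_independent_subsets(mapping, seed=42):
        print(subset)
```

Subjects with fewer samples than there are subsets are reused across subsets, which is unavoidable if every subset must contain every subject.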
- Split the dataset based on a variable of interest (e.g., maternal cytokine levels):
  - Category 0: Mid-range values (e.g., 25th–75th percentile)
  - Category 1: Extremes (e.g., lowest 25% and highest 25%)
- Train models using microbiome (or other compositional) data to predict a target (e.g., infant age).
- Run Random Forest on 5 balanced subsets (1 sample per subject), and repeat for both categories.
- Cross-validation is performed with 10 folds, repeated 10 times, for each subset of data.
- Predict the external group using models trained on the other subset.
- Average predictions across folds for the final output.
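The workflow above can be approximated with scikit-learn as follows. This is a hedged sketch, not the implementation in `regr_for_external_dataset_pred.py` or `average_external_predictions.py`: the helper names `split_by_quartiles` and `cv_and_external_prediction`, the use of mean absolute error, the `n_estimators=50` setting, and the synthetic data are all assumptions chosen for illustration. It handles a single subject-independent subset in regression mode; a full run would loop over the 5 subsets and both categories.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold


def split_by_quartiles(values):
    """Label mid-range values (25th-75th percentile) as 0 and extremes as 1."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    return np.where((values >= q25) & (values <= q75), 0, 1)


def cv_and_external_prediction(X_train, y_train, X_external,
                               n_splits=10, n_repeats=10, seed=0):
    """Run repeated k-fold CV on one subset and average external predictions."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    cv_errors, external_preds = [], []
    for fold, (train_idx, test_idx) in enumerate(cv.split(X_train)):
        model = RandomForestRegressor(n_estimators=50, random_state=seed + fold)
        model.fit(X_train[train_idx], y_train[train_idx])
        # Held-out error within the training category.
        held_out = model.predict(X_train[test_idx])
        cv_errors.append(np.mean(np.abs(held_out - y_train[test_idx])))
        # Predictions for the other (external) category from this fold's model.
        external_preds.append(model.predict(X_external))
    # Average the error and the external predictions over all folds and repeats.
    return float(np.mean(cv_errors)), np.mean(external_preds, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in: 60 samples x 8 taxa as relative abundances.
    X = rng.dirichlet(np.ones(8), size=60)
    age = 100 * X[:, 0] + rng.normal(0, 1, size=60)  # target (e.g., infant age)
    cytokine = rng.normal(size=60)                   # variable of interest

    category = split_by_quartiles(cytokine)
    mid, extreme = category == 0, category == 1
    # Fewer folds and repeats than the defaults, only to keep the demo fast.
    mae, preds = cv_and_external_prediction(X[mid], age[mid], X[extreme],
                                            n_splits=5, n_repeats=2)
    print(f"CV MAE: {mae:.2f}; external predictions: {preds.shape}")
```

The defaults `n_splits=10` and `n_repeats=10` mirror the 10 folds repeated 10 times described above. Swapping the roles of the two categories gives predictions for the mid-range group from models trained on the extremes.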
├── orchestrator.sh   # Main orchestration script
├── codes/
│   ├── regr_for_external_dataset_pred.py
│   ├── parse_ML_results.py
│   ├── create_balanced_folds.py
│   └── average_external_predictions.py
├── data/
│   ├── microbiome_data.tsv
│   └── metadata.tsv
└── results/
    └── ...