LongitudinalRandomForest

LongitudinalRandomForest is a framework for analyzing longitudinal compositional data with Random Forest models. It samples at the subject level so that each iteration uses exactly one sample per subject, which keeps samples independent within a run while still using the full dataset across runs.

Overview

This framework addresses key challenges in analyzing longitudinal compositional data, such as microbiome profiles, by:

- Ensuring subject independence through randomized sampling
- Handling repeated measures in a statistically valid way
- Using Random Forest models for both regression and classification
- Implementing cross-validation with robust sampling strategies

Key Features

- Subject-Independent Sampling: Each run includes only one sample per subject to avoid intra-subject correlation.
- Full Sample Coverage: All samples appear at least once across multiple randomized datasets.
- Compositional Data Support: Ideal for datasets like microbiome profiles that require special considerations.
- Flexible ML Tasks: Supports both classification and regression modes.
- Cross-Validation Ready: Implements multiple folds and repetitions for robust evaluation.

Methodology

Sample Randomization

For subjects with multiple samples (e.g., 5 timepoints per individual):

- The orchestrator creates N random subsets (where N ≥ number of timepoints)
- Each subset contains one sample per subject
- All subsets together ensure complete sample coverage

This design allows robust analysis while maintaining independence between samples.
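
The randomization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in orchestrator.sh: the function name `make_subsets`, its parameters, and the toy sample IDs are all hypothetical. It assigns every sample of a subject to some subset once, then fills any remaining subsets with random repeats from the same subject.

```python
import random
from collections import defaultdict

def make_subsets(sample_to_subject, n_subsets=None, seed=0):
    """Return a list of subsets, each holding exactly one sample ID per subject.

    Hypothetical helper for illustration; not part of the repository.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for sample, subject in sample_to_subject.items():
        by_subject[subject].append(sample)

    # Enforce N >= the largest number of timepoints of any subject.
    max_samples = max(len(s) for s in by_subject.values())
    n_subsets = max(n_subsets or 0, max_samples)

    subsets = [[] for _ in range(n_subsets)]
    for subject in sorted(by_subject):
        samples = sorted(by_subject[subject])
        # Use every sample once, then pad with random repeats so the
        # subject contributes exactly one sample to each subset.
        extra = [rng.choice(samples) for _ in range(n_subsets - len(samples))]
        picks = samples + extra
        rng.shuffle(picks)
        for subset, sample in zip(subsets, picks):
            subset.append(sample)
    return subsets

# Toy example: 3 subjects with 5 timepoints each (made-up IDs).
samples = {f"subj{i}_t{t}": f"subj{i}" for i in range(3) for t in range(5)}
subsets = make_subsets(samples)
```

With 5 timepoints per subject and no explicit `n_subsets`, the sketch produces 5 subsets of 3 samples each, and the union of the subsets contains all 15 samples.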

Analysis Workflow

- Split the dataset based on a variable of interest (e.g., maternal cytokine levels):
  - Category 0: Mid-range values (e.g., 25th–75th percentile)
  - Category 1: Extremes (e.g., lowest 25% and highest 25%)
- Train models using microbiome (or other compositional) data to predict a target (e.g., infant age).
- Run Random Forest on each of the 5 balanced subsets (one sample per subject), and repeat for both categories.
- Cross-validate each subset with 10 folds, repeated 10 times.
- Predict each group as an external dataset using models trained on the other group.
- Average predictions across folds for the final output.
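
The workflow above can be sketched for a single subset with scikit-learn. This is a rough illustration under assumptions, not the repository's implementation: the data are synthetic, and the names `cytokine`, `age`, and `train_and_predict_external` are hypothetical. It splits subjects into the two categories by quartiles, fits one Random Forest per training fold of a 10-fold, 10-repeat cross-validation on Category 0, and averages the resulting predictions for Category 1 as the external group. Scoring the held-out fold is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for one subset (one sample per subject).
n_subjects, n_taxa = 80, 6
X = rng.dirichlet(np.ones(n_taxa), size=n_subjects)  # relative abundances, rows sum to 1
age = 100 * X[:, 0] + rng.normal(0, 1, n_subjects)    # made-up target (e.g., infant age)
cytokine = rng.normal(size=n_subjects)                # made-up variable of interest

# Category 0: 25th-75th percentile; Category 1: both tails.
q25, q75 = np.percentile(cytokine, [25, 75])
category = np.where((cytokine >= q25) & (cytokine <= q75), 0, 1)

def train_and_predict_external(X_train, y_train, X_ext,
                               n_splits=10, n_repeats=10, seed=0):
    """Fit one model per CV training fold and average predictions on X_ext."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    fold_preds = []
    for train_idx, _ in cv.split(X_train):
        model = RandomForestRegressor(n_estimators=25, random_state=seed)
        model.fit(X_train[train_idx], y_train[train_idx])
        fold_preds.append(model.predict(X_ext))
    # Average over all folds and repetitions (100 models here).
    return np.mean(fold_preds, axis=0)

# Train on Category 0, predict Category 1 as the external group.
ext_pred = train_and_predict_external(
    X[category == 0], age[category == 0], X[category == 1]
)
```

Swapping the two category masks gives the reverse direction, and repeating the call for each of the balanced subsets covers the full workflow. For classification, `RandomForestClassifier` would take the place of the regressor.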

📂 Repository Structure

├── orchestrator.sh           # Main orchestration script
├── codes/
│   ├── regr_for_external_dataset_pred.py
│   ├── parse_ML_results.py
│   ├── create_balanced_folds.py
│   └── average_external_predictions.py
├── data/
│   ├── microbiome_data.tsv
│   └── metadata.tsv
└── results/
    └── ...
