Reusable Python framework for integrating bulk multi-omics datasets using joint latent factors (PCA/SVD/NMF/ICA) and UMAP visualization.
OmicsAtlas provides an end-to-end, reproducible pipeline for combining transcriptomics, proteomics, and metabolomics (or any other tabular omics layer) into a unified low-dimensional embedding, with batch correction, clustering, statistical analysis, and ground-truth validation built in.
- Multi-omics integration — concatenate and jointly decompose transcriptomics, proteomics, metabolomics (or any combination)
- Preprocessing pipeline — normalization (log1p, zscore, quantile, minmax, median-ratio), imputation (median, mean, knn, zero), feature filtering, scaling
- Batch correction — ComBat (empirical Bayes), linear regression, mean centering
- Embedding — UMAP, t-SNE, PCA with quality metrics (trustworthiness, continuity) and parameter sweeps
- Clustering — KMeans (auto-k via silhouette), DBSCAN, hierarchical, Leiden
- Statistical analysis — differential features, feature importance, outlier detection, permutation tests, correlation analysis
- Truthset validation — ground-truth validation with ARI, NMI, silhouette, batch mixing, embedding stability metrics
- Golden reference regression testing — detect numerical drift across releases
- HTML QC report with embedded visualizations
- Provenance tracking — SHA-256 hashes, software versions, platform info, config snapshots
- Robust input validation — guards for single-sample inputs, all-NaN columns, zero-variance features, n_clusters bounds
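To give a flavor of the batch-correction options, here is a sketch of per-batch mean centering, the simplest of the listed methods (ComBat additionally shrinks per-batch estimates with empirical Bayes). This is an illustration with made-up data, not OmicsAtlas's own implementation:

```python
import numpy as np
import pandas as pd

def mean_center_batches(X, batch):
    """Illustrative per-batch mean centering: subtract each batch's
    feature means, then add back the global feature means."""
    global_mean = X.mean(axis=0)
    corrected = X.copy()
    for _, idx in X.groupby(batch).groups.items():
        corrected.loc[idx] = X.loc[idx] - X.loc[idx].mean(axis=0) + global_mean
    return corrected

# Toy layer: one feature, two batches separated by a constant offset
X = pd.DataFrame({"geneA": [1.0, 2.0, 11.0, 12.0]}, index=["s1", "s2", "s3", "s4"])
batch = pd.Series(["b1", "b1", "b2", "b2"], index=X.index)
corrected = mean_center_batches(X, batch)
# After correction both batches share the same mean (the global mean, 6.5),
# while within-batch differences between samples are preserved
```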
pip install -e .
# For Leiden clustering support:
pip install -e ".[leiden]"
# For development:
pip install -e ".[dev]"

# Full pipeline
omicsatlas run \
-i data/transcriptomics.csv \
-i data/proteomics.csv \
-m data/metadata.csv \
--batch-column batch \
--n-factors 20 \
--method pca \
-o results/
# Validate inputs before running
omicsatlas validate \
-i data/transcriptomics.csv \
-i data/proteomics.csv \
-m data/metadata.csv
# Truthset validation
omicsatlas validate-truthset \
-i truthset_data/truthset_transcriptomics.csv \
-i truthset_data/truthset_proteomics.csv \
-i truthset_data/truthset_metabolomics.csv \
-m truthset_data/truthset_metadata.csv \
-g truthset_data/ground_truth.json
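Under the hood, ground-truth validation compares predicted cluster labels against known labels. Two of the listed metrics, ARI and NMI, are available directly in scikit-learn; for example, with hypothetical labels:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth groups vs. predicted clusters,
# with one sample assigned to the wrong cluster
truth = ["tumour", "tumour", "normal", "normal", "normal"]
pred = [0, 0, 1, 1, 0]

ari = adjusted_rand_score(truth, pred)
nmi = normalized_mutual_info_score(truth, pred)
# Both metrics are invariant to label permutations and reach 1.0 only for a
# perfect match; the single mis-assignment pulls them strictly below 1.
```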
# Show version and environment info
omicsatlas info

from omicsatlas import (
    load_omics, load_metadata,
    preprocess_pipeline, correct_batch,
    compute_joint_factors, compute_embedding,
    cluster_samples, cluster_metrics,
    validate_truthset,
)
# Load data
omics = {
    "transcriptomics": load_omics("transcriptomics.csv"),
    "proteomics": load_omics("proteomics.csv"),
}
metadata = load_metadata("metadata.csv")
# Preprocess each layer
for name in omics:
    omics[name] = preprocess_pipeline(omics[name])
# Batch correction
for name in omics:
    shared = omics[name].index.intersection(metadata.index)
    omics[name] = correct_batch(omics[name].loc[shared], metadata.loc[shared, "batch"])
# Integrate, embed, cluster
result = compute_joint_factors(omics, n_factors=20, method="pca")
embedding = compute_embedding(result["factors"], method="umap")
labels = cluster_samples(embedding, n_clusters=3)
metrics = cluster_metrics(embedding, labels)

- Omics files: CSV/TSV with samples as rows, features as columns, first column as sample ID index
- Metadata file: CSV/TSV with sample IDs as index, columns for batch, group, covariates
- All omics layers must share sample IDs (intersection is used)
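The intersection behaviour in the last point can be sketched with pandas (illustrative only; the layer names and values are made up):

```python
from functools import reduce
import pandas as pd

# Two toy layers indexed by sample ID; "s1" is missing from proteomics
layers = {
    "transcriptomics": pd.DataFrame({"g1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"]),
    "proteomics": pd.DataFrame({"p1": [4.0, 5.0]}, index=["s2", "s3"]),
}

# Keep only the sample IDs present in every layer
shared = reduce(lambda a, b: a.intersection(b), (df.index for df in layers.values()))
aligned = {name: df.loc[shared] for name, df in layers.items()}
# → every layer now covers the same samples ("s2", "s3"), row-aligned by ID
```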
Pass a YAML config file with --config:
n_factors: 20
method: pca # pca, svd, nmf, ica
normalize: log1p # log1p, zscore, quantile, minmax, median_ratio, none
batch_column: batch
n_neighbors: 15
min_dist: 0.1
embedding_method: umap # umap, tsne, pca
n_clusters: null # null for auto-detection
random_state: 42

src/omicsatlas/ Core framework modules (io, preprocess, batch, integrate, embed, stats, plot, report, validation, cli)
scripts/ Data generation and pipeline scripts
tests/ Test suite (smoke, edge-case, coverage, truthset)
tests/golden_reference/ Golden reference outputs for regression testing
docs/ Sphinx documentation (api, cli, quickstart, validation)
toy_data/ Small example datasets
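Setting `n_clusters: null` in the configuration above triggers auto-detection, which the feature list describes as silhouette-based selection for KMeans. A minimal sketch of that idea — an assumed illustration, not the package's internal code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def auto_k(X, k_min=2, k_max=6, random_state=42):
    """Pick the k whose KMeans labelling maximizes the silhouette score."""
    best_k, best_score = k_min, -1.0
    for k in range(k_min, min(k_max, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Three well-separated 2-D blobs: silhouette peaks at k = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
```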
Full documentation lives under docs/ and can be built with Sphinx:
cd docs && make html

Topics covered: quickstart, cli, api, configuration, validation.
# Run all tests
pytest -v
# Run with coverage
pytest --cov=omicsatlas --cov-report=term-missing
# Run only truthset validation tests
pytest tests/test_truthset.py -v

If you use OmicsAtlas in your research, please cite it via the metadata in CITATION.cff.
See CHANGELOG.md for release notes.
MIT