OmicsAtlas

Reusable Python framework for integrating bulk multi-omics datasets using joint latent factors (PCA/SVD/NMF/ICA) and UMAP visualization.

OmicsAtlas provides an end-to-end, reproducible pipeline for combining transcriptomics, proteomics, and metabolomics (or any other tabular omics layer) into a unified low-dimensional embedding, with batch correction, clustering, statistical analysis, and ground-truth validation built in.

Features

  • Multi-omics integration — concatenate and jointly decompose transcriptomics, proteomics, metabolomics (or any combination)
  • Preprocessing pipeline — normalization (log1p, zscore, quantile, minmax, median-ratio), imputation (median, mean, knn, zero), feature filtering, scaling
  • Batch correction — ComBat (empirical Bayes), linear regression, mean centering
  • Embedding — UMAP, t-SNE, PCA with quality metrics (trustworthiness, continuity) and parameter sweeps
  • Clustering — KMeans (auto-k via silhouette), DBSCAN, hierarchical, Leiden
  • Statistical analysis — differential features, feature importance, outlier detection, permutation tests, correlation analysis
  • Truthset validation — ground-truth validation with ARI, NMI, silhouette, batch mixing, embedding stability metrics
  • Golden reference regression testing — detect numerical drift across releases
  • HTML QC report with embedded visualizations
  • Provenance tracking — SHA-256 hashes, software versions, platform info, config snapshots
  • Robust input validation — guards for single-sample inputs, all-NaN columns, zero-variance features, n_clusters bounds
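The auto-k KMeans selection listed above can be sketched with scikit-learn. This is an illustration of the technique, not OmicsAtlas internals; the synthetic data and candidate k range are made up:

```python
# Illustrative sketch of auto-k selection via silhouette score:
# fit KMeans for each candidate k and keep the k with the highest
# mean silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs: 30 samples x 5 features
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 5)) for c in (0.0, 5.0, 10.0)])

def auto_k(X, k_range=range(2, 7), random_state=42):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

print(auto_k(X))  # 3 for these well-separated blobs
```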

Installation

pip install -e .

# For Leiden clustering support:
pip install -e ".[leiden]"

# For development:
pip install -e ".[dev]"

Quick start

CLI usage

# Full pipeline
omicsatlas run \
  -i data/transcriptomics.csv \
  -i data/proteomics.csv \
  -m data/metadata.csv \
  --batch-column batch \
  --n-factors 20 \
  --method pca \
  -o results/

# Validate inputs before running
omicsatlas validate \
  -i data/transcriptomics.csv \
  -i data/proteomics.csv \
  -m data/metadata.csv

# Truthset validation
omicsatlas validate-truthset \
  -i truthset_data/truthset_transcriptomics.csv \
  -i truthset_data/truthset_proteomics.csv \
  -i truthset_data/truthset_metabolomics.csv \
  -m truthset_data/truthset_metadata.csv \
  -g truthset_data/ground_truth.json

# Show version and environment info
omicsatlas info

Python API

from omicsatlas import (
    load_omics, load_metadata,
    preprocess_pipeline, correct_batch,
    compute_joint_factors, compute_embedding,
    cluster_samples, cluster_metrics,
    validate_truthset,
)

# Load data
omics = {
    "transcriptomics": load_omics("transcriptomics.csv"),
    "proteomics": load_omics("proteomics.csv"),
}
metadata = load_metadata("metadata.csv")

# Preprocess each layer
for name in omics:
    omics[name] = preprocess_pipeline(omics[name])

# Batch correction
for name in omics:
    shared = omics[name].index.intersection(metadata.index)
    omics[name] = correct_batch(omics[name].loc[shared], metadata.loc[shared, "batch"])

# Integrate, embed, cluster
result = compute_joint_factors(omics, n_factors=20, method="pca")
embedding = compute_embedding(result["factors"], method="umap")
labels = cluster_samples(embedding, n_clusters=3)
metrics = cluster_metrics(embedding, labels)
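The truthset metrics mentioned in Features (ARI, NMI) can also be computed directly with scikit-learn when ground-truth groups are known. A minimal sketch with made-up labels (not the framework's own validation code):

```python
# Sketch: comparing predicted cluster labels against known ground-truth
# groups using ARI and NMI. Both metrics are invariant to label
# renaming, so a perfect clustering scores 1.0 even if cluster IDs differ.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_groups = [0, 0, 1, 1, 2, 2]   # e.g. metadata["group"] codes
pred_labels = [1, 1, 0, 0, 2, 2]   # e.g. output of cluster_samples(...)

ari = adjusted_rand_score(true_groups, pred_labels)
nmi = normalized_mutual_info_score(true_groups, pred_labels)
print(f"ARI={ari:.2f}  NMI={nmi:.2f}")  # ARI=1.00  NMI=1.00
```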

Input format

  • Omics files: CSV/TSV with samples as rows, features as columns, first column as sample ID index
  • Metadata file: CSV/TSV with sample IDs as index, columns for batch, group, covariates
  • All omics layers must share sample IDs (the intersection of sample IDs across layers and metadata is used)
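The expected layout can be sketched with pandas (hypothetical sample and feature names; the framework's own loaders may differ in detail):

```python
# Sketch of the expected input layout: samples as rows, features as
# columns, first column as the sample ID index. Only samples present
# in both the omics table and the metadata are kept.
import io
import pandas as pd

transcriptomics_csv = io.StringIO(
    "sample_id,GENE1,GENE2\nS1,2.3,0.1\nS2,1.8,0.4\nS3,0.2,3.1\n"
)
metadata_csv = io.StringIO(
    "sample_id,batch,group\nS1,A,ctrl\nS2,B,ctrl\nS4,A,case\n"
)

omics = pd.read_csv(transcriptomics_csv, index_col=0)
metadata = pd.read_csv(metadata_csv, index_col=0)

# The intersection of sample IDs is used
shared = omics.index.intersection(metadata.index)
print(sorted(shared))  # ['S1', 'S2'] -- S3 and S4 are dropped
```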

Configuration

Pass a YAML config file with --config:

n_factors: 20
method: pca          # pca, svd, nmf, ica
normalize: log1p     # log1p, zscore, quantile, minmax, median_ratio, none
batch_column: batch
n_neighbors: 15
min_dist: 0.1
embedding_method: umap  # umap, tsne, pca
n_clusters: null     # null for auto-detection
random_state: 42
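A config in this shape can be read with PyYAML. This is a sketch of the parsing step only (the CLI's actual handling of --config may differ); note that YAML null maps to Python None:

```python
# Sketch: parsing the YAML config above with PyYAML. Keys mirror the
# example config; this is not the CLI's own loader.
import yaml

config_text = """
n_factors: 20
method: pca
normalize: log1p
batch_column: batch
n_neighbors: 15
min_dist: 0.1
embedding_method: umap
n_clusters: null
random_state: 42
"""

config = yaml.safe_load(config_text)
print(config["method"], config["n_clusters"])  # pca None
```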

Repository layout

src/omicsatlas/         Core framework modules (io, preprocess, batch,
                        integrate, embed, stats, plot, report, validation, cli)
scripts/                Data generation and pipeline scripts
tests/                  Test suite (smoke, edge-case, coverage, truthset)
tests/golden_reference/ Golden reference outputs for regression testing
docs/                   Sphinx documentation (api, cli, quickstart, validation)
toy_data/               Small example datasets

Documentation

Full documentation lives under docs/ and can be built with Sphinx:

cd docs && make html

Topics covered: quickstart, cli, api, configuration, validation.

Testing

# Run all tests
pytest -v

# Run with coverage
pytest --cov=omicsatlas --cov-report=term-missing

# Run only truthset validation tests
pytest tests/test_truthset.py -v

Citation

If you use OmicsAtlas in your research, please cite it via the metadata in CITATION.cff.

Changelog

See CHANGELOG.md for release notes.

License

MIT
