Merged
99 changes: 55 additions & 44 deletions README.md
@@ -1,17 +1,46 @@
# SpatialTranscriptFormer
# SpatialTranscriptFormer Framework

> [!WARNING]
> **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.

**SpatialTranscriptFormer** bridges histology and biological pathways through a high-performance transformer architecture. By modeling the dense interplay between morphological features and gene expression signatures, it provides an interpretable and spatially-coherent mapping of the tissue microenvironment.
<!-- -->

> [!TIP]
> **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.

**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.

## Python API: Quick Start

The framework is designed to be integrated programmatically into your scanpy/AnnData workflows:

```python
from spatial_transcript_former import SpatialTranscriptFormer, Predictor, FeatureExtractor
from spatial_transcript_former.predict import inject_predictions

# 1. Initialize model and backbone
model = SpatialTranscriptFormer.from_pretrained("./checkpoints/stf_small/")
extractor = FeatureExtractor(backbone="phikon", device="cuda")
predictor = Predictor(model, device="cuda")

# 2. Predict from features
predictions = predictor.predict_wsi(features, coords) # (1, G)

# 3. Integrate with Scanpy
inject_predictions(adata, coords, predictions[0], gene_names=model.gene_names)
```

For more details, see the **[Python API Reference](docs/API.md)**.

## Key Technical Pillars

- **Modular Architecture**: Decoupled backbones, interaction modules, and output heads.
- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss**.
- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
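
The composite objective named in the list above can be pictured with a minimal NumPy sketch. It is not the repository's loss implementation: the function name `mse_pcc_loss`, the `pcc_weight` and `eps` parameters, and the choice to correlate each gene across spots are all assumptions made for illustration.

```python
import numpy as np


def mse_pcc_loss(pred, target, pcc_weight=1.0, eps=1e-8):
    """Illustrative composite loss: MSE + pcc_weight * (1 - mean Pearson r).

    pred, target: arrays of shape (n_spots, n_genes).
    Pearson correlation is computed per gene across spots (an assumption).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Pointwise error term.
    mse = np.mean((pred - target) ** 2)

    # Center each gene column, then compute its Pearson correlation.
    pred_c = pred - pred.mean(axis=0, keepdims=True)
    target_c = target - target.mean(axis=0, keepdims=True)
    cov = (pred_c * target_c).sum(axis=0)
    denom = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0)) + eps
    pcc = cov / denom

    # Low correlation is penalized; a perfect correlation contributes ~0.
    return mse + pcc_weight * (1.0 - pcc.mean())
```

In this sketch a spatially constant prediction has zero correlation with the target, so it always pays the full `pcc_weight` penalty on top of its MSE. That is one way a correlation term can discourage collapsed, flat predictions.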

---

## License

@@ -28,76 +57,58 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC

The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).

---

## Installation

This project requires [Conda](https://docs.conda.io/en/latest/).

1. Clone the repository.
2. Run the automated setup script:
3. On Windows: `.\setup.ps1`
- On Windows: `.\setup.ps1`
- On Linux/HPC: `bash setup.sh`

## Usage
## Exemplar Recipe: HEST-1k Benchmark

### Dataset Access
The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI pipeline as an exemplar for reproducing our benchmarks on the [HEST-1k dataset](https://huggingface.co/datasets/MahmoodLab/hest).

The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
### 1. Dataset Access & Preprocessing

```bash
# List available filtering options
stf-download --list-options

# Download a specific subset (e.g., Breast Cancer samples from Visium)
# Download a specific subset
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data

# Download all human samples
stf-download --species "Homo sapiens" --local_dir hest_data
```

> [!NOTE]
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
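
For readers who prefer to fetch files directly from Python, the sketch below uses `snapshot_download` from `huggingface_hub` against the same gated dataset. It is not how `stf-download` is implemented; the helper names and the glob pattern (which assumes file names begin with the sample ID followed by `_` or `.`) are assumptions for illustration.

```python
from fnmatch import fnmatch


def hest_patterns(sample_ids):
    """Build glob patterns intended to match files named after each sample ID."""
    return [f"*{sid}[_.]**" for sid in sample_ids]


def preview_matches(paths, sample_ids):
    """Local dry run: return the paths that the patterns would select."""
    patterns = hest_patterns(sample_ids)
    return [p for p in paths if any(fnmatch(p, pat) for pat in patterns)]


def download_hest_samples(sample_ids, local_dir="hest_data"):
    """Download only the files for the given samples (requires prior login)."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="MahmoodLab/hest",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=hest_patterns(sample_ids),
    )
```

Calling `preview_matches` on a list of candidate paths before `download_hest_samples(["MEND29"])` is a cheap way to check that the pattern selects the intended files and nothing with a longer, similar ID.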

### Train Models

We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.
### 2. Training with Presets

```bash
# Recommended: Run the Interaction model (Small)
python scripts/run_preset.py --preset stf_small

# Run the lightweight Tiny version
python scripts/run_preset.py --preset stf_tiny

# Run baselines
python scripts/run_preset.py --preset he2rna_baseline
```

For a complete list of configurations, see the [Training Guide](docs/TRAINING_GUIDE.md).

### Real-Time Monitoring

Monitor training progress, loss curves, and **prediction variance (collapse detector)** via the web dashboard:
### 3. Inference & Visualization

```bash
python scripts/monitor.py --run-dir runs/stf_interaction_l4
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
```

### Inference & Visualization
Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.

Generate spatial maps comparing Ground Truth vs Predictions:
## Documentation

```bash
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
```
### Framework APIs & Usage

Visualization plots will be saved to the `./results` directory.
- **[Python API Reference](docs/API.md)**: Full documentation for `Trainer`, `Predictor`, and `SpatialDataset`.
- **[Bring Your Own Data Guide](src/spatial_transcript_former/recipes/custom/README.md)**: Templates and examples for training on your own non-HEST spatial transcriptomics data.
- **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**: Detailed documentation for the included HEST-1k dataset recipe.
- **[Training Guide](docs/TRAINING_GUIDE.md)**: Complete list of configuration flags and preset configurations for HEST models.

## Documentation
### Theory & Interpretability

- [Models](docs/MODELS.md): Detailed model architectures and scaling parameters.
- [Data Structure](docs/DATA_STRUCTURE.md): Organization of HEST data on disk.
- [Pathway Mapping](docs/PATHWAY_MAPPING.md): Clinical interpretability and pathway integration.
- [Gene Analysis](docs/GENE_ANALYSIS.md): Modeling strategies for high-dimensional gene space.
- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
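
As a rough mental model for the quad-flow interaction referenced above, the four flows can be read as the four blocks of an attention mask over a sequence of pathway tokens followed by histology patch tokens. The sketch below is an assumption-laden illustration, not the repository's code: the function name `quad_flow_mask`, the token ordering, and the reading of `x2y` as "x queries attend to y keys" are all hypothetical.

```python
import numpy as np


def quad_flow_mask(n_pathways, n_patches, flows=("p2p", "p2h", "h2p", "h2h")):
    """Boolean attention mask; True means the query (row) may attend to the key (column).

    Tokens are assumed to be ordered [pathways..., patches...].
    """
    n = n_pathways + n_patches
    p = slice(0, n_pathways)   # pathway token positions
    h = slice(n_pathways, n)   # histology patch token positions
    blocks = {"p2p": (p, p), "p2h": (p, h), "h2p": (h, p), "h2h": (h, h)}

    mask = np.zeros((n, n), dtype=bool)
    for flow in flows:
        if flow not in blocks:
            raise ValueError(f"unknown flow: {flow!r}")
        rows, cols = blocks[flow]
        mask[rows, cols] = True  # enable this query/key block
    return mask
```

Under these assumptions, `quad_flow_mask(50, 196, flows=("p2p", "p2h"))` would let the 50 pathway tokens read from everything while the patch tokens attend to nothing, and enabling all four flows recovers full self-attention.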

## Development
