27 changes: 17 additions & 10 deletions CLAUDE.md
@@ -47,7 +47,7 @@ Single entry point: `protspace = protspace.cli.app:app`
| `protspace project` | HDF5 → dimensionality reduction |
| `protspace annotate` | Fetch protein annotations |
| `protspace bundle` | Combine projections + annotations → .parquetbundle |
| `protspace stats` | Compute projection quality statistics (cluster-validity + faithfulness) |
| `protspace stats` | Compute projection quality statistics (annotation-based cluster-validity + faithfulness) |
| `protspace serve` | Launch Dash web frontend |
| `protspace style` | Add annotation colors/styles |

@@ -67,17 +67,20 @@ protspace prepare -i <input> -m <methods> -o <output> [options]
# Parameter sweep: protspace prepare -i emb.h5 -m "umap2:n_neighbors=15" -m "umap2:n_neighbors=50" -m pca2 -o output
# Inline params: protspace prepare -i emb.h5 -m "pca2,umap2:n_neighbors=50;min_dist=0.3" -o output
# Quality stats (opt-in): protspace prepare -i emb.h5 -m pca2,umap2 --stats -o output
# Quality stats scoped to specific annotations: protspace prepare -i emb.h5 -m pca2 --stats --stats-annotation major_group,ec_number -o output
```

### protspace stats Usage

Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles; faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).
Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Validity is **annotation-based**: silhouette/DBI/CH are scored on a user-selected annotation's own category labels (not auto-clustering), computed once for the source embedding and again for each projection — `statistics.parquet` (bundle 5th part) gains an `annotation` column and `space_kind ∈ {embedding, projection}`. `--stats-annotation auto|name1,name2` (default `auto`) picks which annotation column(s) to score (all "suitable" low-cardinality categoricals, or an explicit list); requires `-a/--annotations`. Auto-clustering (KMeans elbow/silhouette) is retained for the per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles, but is no longer self-scored — instead its **ARI**/**NMI** agreement against each scored annotation is recorded (`stat_family=cluster_agreement`). Faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).
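
The `auto` selection mentioned above can be pictured with a minimal sketch. It assumes the suitability rule stated in `docs/cli.md` (at least 2 and at most min(50, max(2, n/2)) distinct non-empty values, not numeric, not a generated `cluster_*` column). The function name `is_suitable`, taking n as the row count, and treating "numeric" as "every distinct value parses as a float" are assumptions made for illustration, not the project's confirmed implementation.

```python
from typing import Optional, Sequence


def _is_number(text: str) -> bool:
    """Return True when the text parses as a float (assumed meaning of 'numeric')."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def is_suitable(name: str, values: Sequence[Optional[str]]) -> bool:
    """Hypothetical suitability check for one annotation column."""
    # Generated membership columns are never scored.
    if name.startswith("cluster_"):
        return False
    # Missing and empty values are ignored when counting categories.
    distinct = {v.strip() for v in values if v is not None and v.strip()}
    n = len(values)  # assumption: n is the number of proteins (rows)
    upper = min(50, max(2, n / 2))
    if not (2 <= len(distinct) <= upper):
        return False
    # A column whose categories are all numbers is treated as numeric.
    return not all(_is_number(v) for v in distinct)
```

For example, a `major_group` column with three categories over eight proteins would pass, while a per-protein identifier column would fail the cardinality bound.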

```bash
# Standalone (embeddings needed for faithfulness)
# Standalone (embeddings needed for faithfulness + the once-per-embedding annotation-validity pass)
protspace stats -i emb.h5 -p project_dir -o statistics.parquet
# Enrich annotations in place + emit cluster legend styles for `bundle --settings`
# Enrich annotations in place, score annotation-based validity, + emit cluster legend styles for `bundle --settings`
protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --settings-out styles.json
# Score only specific annotations instead of every suitable categorical (default: auto)
protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --stats-annotation major_group,ec_number
# Elbow + silhouette-optimal clusterings side by side
protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --cluster-selection both
# Fold a stats parquet + settings into a bundle
@@ -149,12 +152,14 @@ src/protspace/
├── stats/ # Projection quality statistics (opt-in, --stats)
│ ├── __init__.py # Lazy STATISTICS registry + compute_statistics entry
│ ├── base.py # StatContext / StatRow / AnnotationColumn / StatsReport
│ ├── driver.py # Per-projection contexts, embedding id-join, run stats
│ ├── driver.py # Per-projection contexts + once-per-embedding pass, embedding id-join, run stats
│ ├── carriage.py # Route rows to bundle parts (metadata / annotations / legend)
│ ├── annotation_select.py # Pick "suitable" annotations (auto/list) + build id→category labels
│ ├── cluster/kmeans_elbow.py # KMeans + distance-to-chord elbow (subsampled at scale)
│ └── metrics/
│ ├── validity.py # silhouette / Davies-Bouldin / Calinski-Harabasz
│ └── faithfulness.py # kNN-overlap / trustworthiness / continuity
│ ├── validity.py # Auto-cluster (KMeans) + ARI/NMI agreement vs annotations
│ ├── annotation_validity.py # silhouette / Davies-Bouldin / Calinski-Harabasz per annotation
│ └── faithfulness.py # kNN-overlap / trustworthiness / continuity
├── utils/
│ ├── __init__.py # Lazy exports: REDUCERS dict, reducer constants
│ ├── constants.py # DimensionReductionConfig, method name constants
@@ -216,7 +221,7 @@ HDF5 file (float16 embeddings)
2. `projections_metadata` — projection names, dimensions, parameters (faithfulness rides in `info_json.quality` when `--stats`)
3. `projections_data` — reduced coordinates per protein per projection
4. `settings` (optional) — annotation styles, pinned values, display config
5. `statistics` (optional) — tidy per-projection cluster-validity table (`protspace stats` / `prepare --stats`)
5. `statistics` (optional) — tidy table of annotation-based validity (silhouette/DBI/CH per annotation, `space_kind ∈ {embedding, projection}`, `annotation` column) + auto-cluster ARI/NMI agreement (`stat_family=cluster_agreement`) (`protspace stats` / `prepare --stats`)

Positional layout is `core(3) + settings? + statistics?`. When statistics are present but settings are absent, the settings slot is written as **zero bytes** so statistics stay at position five (readers branch on emptiness, not part count). Both bundled and separate-file (`--no-bundled`) output persist `settings.parquet` and `statistics.parquet` when present.
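
A minimal sketch of the "branch on emptiness, not part count" rule follows. It assumes the bundle has already been split into an ordered list of byte strings; the on-disk delimiter is not shown here, and the names `BundleParts` and `assign_parts` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class BundleParts(NamedTuple):
    """Hypothetical container for the positional bundle layout."""
    core: Tuple[bytes, bytes, bytes]
    settings: Optional[bytes]
    statistics: Optional[bytes]


def assign_parts(parts: List[bytes]) -> BundleParts:
    """Map already-split parts onto core(3) + settings? + statistics?."""
    if len(parts) < 3:
        raise ValueError("a bundle needs at least the three core parts")
    core = (parts[0], parts[1], parts[2])
    # A zero-byte slot means "absent"; position alone decides the role.
    settings = parts[3] if len(parts) > 3 and parts[3] else None
    statistics = parts[4] if len(parts) > 4 and parts[4] else None
    return BundleParts(core, settings, statistics)
```

With this reading, `assign_parts([a, b, c, b"", stats])` yields no settings but keeps the statistics at position five, which is the case the zero-byte slot exists for.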

@@ -240,10 +245,12 @@ uv run pytest tests/ --cov=src/protspace # With coverage
| `test_settings_converter.py` | 31 | Settings table ↔ visualization state conversion |
| `test_uniprot_annotation_retriever.py` | 24 | UniProt API mocking, inactive entry resolution |
| `test_pipeline_utils.py` | 70 | ReductionPipeline, EmbeddingSet, method parsing, multi-input merging, inline param overrides |
| `test_stats.py` | 43 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
| `test_stats_cli.py` | 12 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard, `--cluster-selection` validation |
| `test_stats.py` | 48 | Projection statistics: elbow, annotation-based validity (silhouette/DBI/CH per annotation), auto-cluster ARI/NMI agreement, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
| `test_stats_cli.py` | 15 | `protspace stats` CLI + `prepare` stats wiring, `--stats-annotation` (auto/list) wiring, `--settings-out` guard, `--cluster-selection` validation |
| `test_stats_carriage.py` | 10 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
| `test_stats_bundle.py` | 7 | Optional 5th (statistics) bundle part round-trip |
| `test_annotation_select.py` | 4 | Annotation selection: suitability filter (cardinality/numeric/id-like exclusion), `auto` vs explicit-list label building, missing-value dropping |
| `test_annotation_validity.py` | 5 | `AnnotationValidityStatistic`: silhouette/DBI/CH scored per annotation on `ctx.coords`, embedding vs. projection `space_kind`, missing-value exclusion, single-category no-op |
| `test_biocentral_embedder.py` | 23 | Biocentral API client, embedding flow |
| `test_fasta.py` | 17 | FASTA parsing, edge cases, CSV annotation loading |
| `test_biocentral_retriever.py` | 14 | Biocentral prediction retriever (TMbed parsing, per-sequence) |
4 changes: 2 additions & 2 deletions README.md
@@ -10,7 +10,7 @@ ProtSpace is a visualization tool for exploring **protein embeddings** or **simi

- **Multiple projections**: PCA, UMAP, t-SNE, MDS, PaCMAP, LocalMAP
- **Automatic annotations**: UniProt, InterPro, and Taxonomy
- **Quality metrics** _(opt-in)_: per-projection cluster-validity + faithfulness (local & global) via `--stats`
- **Quality metrics** _(opt-in)_: annotation-based cluster-validity + faithfulness (local & global) via `--stats`
- **Structure viewer**: Integrated protein structure visualization
- **Export**: PNG, PDF, SVG, HTML

@@ -65,7 +65,7 @@ protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet
protspace bundle -p projections/ -a annotations.parquet -s statistics.parquet -o output.parquetbundle
```

Or compute quality metrics inline during `prepare` with `--stats` (opt-in): cluster-validity + faithfulness per projection. See the [CLI Reference](docs/cli.md#projection-statistics---stats).
Or compute quality metrics inline during `prepare` with `--stats` (opt-in): annotation-based cluster-validity + faithfulness per projection. See the [CLI Reference](docs/cli.md#projection-statistics---stats).

## 📊 Example Output

26 changes: 17 additions & 9 deletions docs/cli.md
@@ -130,8 +130,9 @@ This produces three projections: `ProtT5 — PCA 2`, `ProtT5 — UMAP 2 (n=15)`,
| ---- | ----------- | ------- |
| `-o, --output` | Output directory. | `.` |
| `--bundled / --no-bundled` | Bundle into single `.parquetbundle`. | bundled |
| `--stats / --no-stats` | Compute projection quality statistics (cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
| `--stats / --no-stats` | Compute projection quality statistics (annotation-based cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
| `--cluster-selection` | With `--stats`, how to choose the cluster count K: `elbow`, `silhouette`, or `both`. | `elbow` |
| `--stats-annotation` | With `--stats`, which annotation column(s) to score for cluster-validity: `auto` (all suitable low-cardinality categoricals) or a comma-separated list. | `auto` |
| `--keep-tmp` | Cache intermediates for resumability. | on |
| `--no-log` | Skip writing `run.log`. | off |
| `--dump-cache` | Print cached annotations and exit. | off |
@@ -179,46 +180,53 @@ protspace bundle -p projections/ -a annotations.parquet \

## `protspace stats`

Compute per-projection quality statistics for an existing project directory and write them as a `statistics.parquet` (the optional 5th `.parquetbundle` part). No annotations are required. See [Projection Statistics](#projection-statistics---stats) for what is computed.
Compute per-projection quality statistics for an existing project directory and write them as a `statistics.parquet` (the optional 5th `.parquetbundle` part). Faithfulness and the auto-cluster membership columns need no annotations; annotation-based validity (and its ARI/NMI agreement with the auto-clusters) needs `-a/--annotations`. See [Projection Statistics](#projection-statistics---stats) for what is computed.

```bash
# Statistics for a project (embeddings needed for faithfulness)
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet

# Also enrich an annotations parquet in place with per-protein cluster-membership
# columns, and write the auto cluster-legend styles for `bundle`
# columns, score annotation-based validity, and write the auto cluster-legend styles
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
-a annotations.parquet --settings-out cluster_styles.json

# Score only specific annotations instead of every suitable categorical (default: auto)
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
-a annotations.parquet --stats-annotation major_group,ec_number

# Emit both the elbow and the silhouette-optimal clustering
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
-a annotations.parquet --cluster-selection both
```

| Flag | Description | Default |
| ---- | ----------- | ------- |
| `-i, --input` | HDF5 embedding file(s) (for faithfulness). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
| `-i, --input` | HDF5 embedding file(s) (for faithfulness + the once-per-embedding annotation-validity pass). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
| `-p, --projections` | Project directory with `projections_metadata.parquet` + `projections_data.parquet`. | — |
| `-o, --output` | Output `statistics.parquet` path. | — |
| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` membership columns (per-point silhouette attached as `value|score`). | — |
| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` membership columns (per-point silhouette attached as `value|score`), and to score for annotation-based validity + ARI/NMI agreement. | — |
| `--cluster-selection` | Cluster count K selection: `elbow`, `silhouette`, or `both`. | `elbow` |
| `--stats-annotation` | Which annotation column(s) to score for cluster-validity: `auto` (all suitable low-cardinality categoricals) or a comma-separated list. Requires `-a`. | `auto` |
| `--settings-out` | Write auto cluster-legend styles here (JSON) for `bundle --settings`. Requires `-a`. | — |
| `--metric` | High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS). | `euclidean` |
| `--seed` | Random seed. | `42` |
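
Downstream code that reads the enriched annotations may want to separate a membership label from its attached silhouette. The sketch below assumes the `value|score` cell shape described for `-a, --annotations` (for example `cluster 3|0.41`) and that the score is a float after the last `|`; the helper name `split_value_score` is hypothetical.

```python
from typing import Optional, Tuple


def split_value_score(cell: str) -> Tuple[str, Optional[float]]:
    """Split 'value|score' into (value, score); score is None when absent."""
    value, sep, tail = cell.rpartition("|")
    if not sep:
        # No separator: the cell carries only a label.
        return cell, None
    try:
        return value, float(tail)
    except ValueError:
        # The text after '|' is not a number, so keep the cell whole.
        return cell, None
```

Cells without a score come back unchanged with `None`, so the same helper works whether or not scores were written.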

## Projection Statistics (`--stats`)

`prepare --stats` (opt-in) and the standalone `protspace stats` command compute two families of per-projection quality metrics and bake them into the output:
`prepare --stats` (opt-in) and the standalone `protspace stats` command compute three families of per-projection quality metrics and bake them into the output:

- **Cluster validity** — KMeans labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**, written to the tidy `statistics.parquet` (the bundle's 5th part). The cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
- **Annotation-based validity** — silhouette, Davies–Bouldin, and Calinski–Harabasz scored using an annotation's own category labels (not auto-clustering) — how well proteins already grouped by an annotation (e.g. `major_group`, `ec_number`) separate in a given space. Computed once for the source embedding (a separability "ceiling") and again for each projection, written to the tidy `statistics.parquet` (the bundle's 5th part) with `space_kind ∈ {embedding, projection}` and an `annotation` column naming which one was scored. `--stats-annotation auto|name1,name2` (default `auto`) picks which annotation column(s) to score — `auto` scores every "suitable" low-cardinality categorical (≥2 and ≤min(50, max(2, n/2)) distinct non-empty values, not numeric, and not a generated `cluster_*` column); requires `-a/--annotations`.
- **Auto-cluster agreement** — KMeans labels the projection; the cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. This auto-clustering is no longer scored against itself (that was circular); instead, when annotations are supplied, each labelling's **ARI** (adjusted Rand index) and **NMI** (normalized mutual information) agreement with every scored annotation is recorded (`stat_family=cluster_agreement`). Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
- **Faithfulness** — how well the projection preserves the source embedding's structure; each row is tagged `scope`:
- **local** (kNN-neighbourhood): **kNN-overlap**, **trustworthiness**, **continuity**.
- **global** (whole-layout): **random_triplet** (relative-ordering accuracy over random triplets, ∈[0,1]) and **spearman_distance** (rank correlation of all pairwise distances, ∈[−1,1]).

These per-projection scalars ride in each projection's `info_json.quality`.
These per-projection scalars ride in each projection's `info_json.quality` — they never land in `statistics.parquet`.
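
To make the global **random_triplet** score concrete, here is a minimal sketch of relative-ordering accuracy over random triplets. It assumes Euclidean distance in both spaces and a fixed number of sampled triplets; the function name, the `n_triplets` parameter, and the sampling scheme are illustrative assumptions, not the project's confirmed implementation.

```python
import math
import random
from typing import Sequence


def random_triplet_accuracy(
    high: Sequence[Sequence[float]],
    low: Sequence[Sequence[float]],
    n_triplets: int = 1000,
    seed: int = 42,
) -> float:
    """Fraction of sampled triplets (i, j, k) whose ordering is preserved."""
    if len(high) != len(low):
        raise ValueError("high and low must describe the same points")
    n = len(high)
    if n < 3:
        raise ValueError("need at least three points")
    rng = random.Random(seed)  # deterministic for a given seed
    agree = 0
    for _ in range(n_triplets):
        i, j, k = rng.sample(range(n), 3)
        # Is j closer to i than k is, in each space?
        closer_high = math.dist(high[i], high[j]) < math.dist(high[i], high[k])
        closer_low = math.dist(low[i], low[j]) < math.dist(low[i], low[k])
        agree += closer_high == closer_low
    return agree / n_triplets
```

A projection that preserves every pairwise distance scores 1.0 under this definition, and the result always lies in [0, 1].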

Notes:
- Off by default — the compute (a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
- Off by default — the compute (annotation-validity + a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
- Annotation-based validity and cluster agreement need `-a/--annotations`; faithfulness and the membership columns do not.
- Uses the projection's own high-dim metric (e.g. `cosine`) for faithfulness; falls back to `--metric` / `euclidean` when the reducer doesn't record one.
- Best-effort: a failure for one statistic or projection is logged and skipped, never failing the run. At large scale the heavier metrics are subsampled (silhouette/faithfulness) or fit on a bounded subsample (KMeans elbow) with a deterministic seed.
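
As a reference for the cluster-agreement family, the adjusted Rand index between an auto-cluster labelling and an annotation's categories can be computed from their contingency counts. This is a standard-formula sketch in plain Python; the project may well use a library implementation instead, which the text here does not confirm, and the function name is an assumption.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]
) -> float:
    """Adjusted Rand index between two labellings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labellings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs placed together in both labellings, and in each one separately.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs  # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labellings put everything in one group.
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The score is 1.0 for identical partitions regardless of how the groups are named, about 0 for chance-level agreement, and can be negative when the partitions disagree more than chance would predict.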
