tsenoner · Jul 2, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -47,7 +47,7 @@ Single entry point: `protspace = protspace.cli.app:app`
 | `protspace project` | HDF5 → dimensionality reduction |
 | `protspace annotate` | Fetch protein annotations |
 | `protspace bundle` | Combine projections + annotations → .parquetbundle |
-| `protspace stats` | Compute projection quality statistics (cluster-validity + faithfulness) |
+| `protspace stats` | Compute projection quality statistics (annotation-based cluster-validity + faithfulness) |
 | `protspace serve` | Launch Dash web frontend |
 | `protspace style` | Add annotation colors/styles |
 
@@ -67,17 +67,20 @@ protspace prepare -i <input> -m <methods> -o <output> [options]
 # Parameter sweep: protspace prepare -i emb.h5 -m "umap2:n_neighbors=15" -m "umap2:n_neighbors=50" -m pca2 -o output
 # Inline params: protspace prepare -i emb.h5 -m "pca2,umap2:n_neighbors=50;min_dist=0.3" -o output
 # Quality stats (opt-in): protspace prepare -i emb.h5 -m pca2,umap2 --stats -o output
+# Quality stats scoped to specific annotations: protspace prepare -i emb.h5 -m pca2 --stats --stats-annotation major_group,ec_number -o output
 ```
 
 ### protspace stats Usage
 
-Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles; faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).
+Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Validity is **annotation-based**: silhouette/DBI/CH are scored on a user-selected annotation's own category labels (not auto-clustering), computed once for the source embedding and again for each projection — `statistics.parquet` (bundle 5th part) gains an `annotation` column and `space_kind ∈ {embedding, projection}`. `--stats-annotation auto|name1,name2` (default `auto`) picks which annotation column(s) to score (all "suitable" low-cardinality categoricals, or an explicit list); requires `-a/--annotations`. Auto-clustering (KMeans elbow/silhouette) is retained for the per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles, but is no longer self-scored — instead its **ARI**/**NMI** agreement against each scored annotation is recorded (`stat_family=cluster_agreement`). Faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).
 
 ```bash
-# Standalone (embeddings needed for faithfulness)
+# Standalone (embeddings needed for faithfulness + the once-per-embedding annotation-validity pass)
 protspace stats -i emb.h5 -p project_dir -o statistics.parquet
-# Enrich annotations in place + emit cluster legend styles for `bundle --settings`
+# Enrich annotations in place, score annotation-based validity, and emit cluster legend styles for `bundle --settings`
 protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --settings-out styles.json
+# Score only specific annotations instead of every suitable categorical (default: auto)
+protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --stats-annotation major_group,ec_number
 # Elbow + silhouette-optimal clusterings side by side
 protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --cluster-selection both
 # Fold a stats parquet + settings into a bundle
@@ -149,12 +152,14 @@ src/protspace/
 ├── stats/                      # Projection quality statistics (opt-in, --stats)
 │   ├── __init__.py             # Lazy STATISTICS registry + compute_statistics entry
 │   ├── base.py                 # StatContext / StatRow / AnnotationColumn / StatsReport
-│   ├── driver.py               # Per-projection contexts, embedding id-join, run stats
+│   ├── driver.py               # Per-projection contexts + once-per-embedding pass, embedding id-join, run stats
 │   ├── carriage.py             # Route rows to bundle parts (metadata / annotations / legend)
+│   ├── annotation_select.py    # Pick "suitable" annotations (auto/list) + build id→category labels
 │   ├── cluster/kmeans_elbow.py # KMeans + distance-to-chord elbow (subsampled at scale)
 │   └── metrics/
-│       ├── validity.py         # silhouette / Davies-Bouldin / Calinski-Harabasz
-│       └── faithfulness.py     # kNN-overlap / trustworthiness / continuity
+│       ├── validity.py             # Auto-cluster (KMeans) + ARI/NMI agreement vs annotations
+│       ├── annotation_validity.py  # silhouette / Davies-Bouldin / Calinski-Harabasz per annotation
+│       └── faithfulness.py         # kNN-overlap / trustworthiness / continuity
 ├── utils/
 │   ├── __init__.py             # Lazy exports: REDUCERS dict, reducer constants
 │   ├── constants.py            # DimensionReductionConfig, method name constants
@@ -216,7 +221,7 @@ HDF5 file (float16 embeddings)
 2. `projections_metadata` — projection names, dimensions, parameters (faithfulness rides in `info_json.quality` when `--stats`)
 3. `projections_data` — reduced coordinates per protein per projection
 4. `settings` (optional) — annotation styles, pinned values, display config
-5. `statistics` (optional) — tidy per-projection cluster-validity table (`protspace stats` / `prepare --stats`)
+5. `statistics` (optional) — tidy table of annotation-based validity (silhouette/DBI/CH per annotation, `space_kind ∈ {embedding, projection}`, `annotation` column) + auto-cluster ARI/NMI agreement (`stat_family=cluster_agreement`); written by `protspace stats` / `prepare --stats`
 
 Positional layout is `core(3) + settings? + statistics?`. When statistics are present but settings are absent, the settings slot is written as **zero bytes** so statistics stay at position five (readers branch on emptiness, not part count). Both bundled and separate-file (`--no-bundled`) output persist `settings.parquet` and `statistics.parquet` when present.
 
@@ -240,10 +245,12 @@ uv run pytest tests/ --cov=src/protspace     # With coverage
 | `test_settings_converter.py` | 31 | Settings table ↔ visualization state conversion |
 | `test_uniprot_annotation_retriever.py` | 24 | UniProt API mocking, inactive entry resolution |
 | `test_pipeline_utils.py` | 70 | ReductionPipeline, EmbeddingSet, method parsing, multi-input merging, inline param overrides |
-| `test_stats.py` | 43 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
-| `test_stats_cli.py` | 12 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard, `--cluster-selection` validation |
+| `test_stats.py` | 48 | Projection statistics: elbow, annotation-based validity (silhouette/DBI/CH per annotation), auto-cluster ARI/NMI agreement, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
+| `test_stats_cli.py` | 15 | `protspace stats` CLI + `prepare` stats wiring, `--stats-annotation` (auto/list) wiring, `--settings-out` guard, `--cluster-selection` validation |
 | `test_stats_carriage.py` | 10 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
 | `test_stats_bundle.py` | 7 | Optional 5th (statistics) bundle part round-trip |
+| `test_annotation_select.py` | 4 | Annotation selection: suitability filter (cardinality/numeric/id-like exclusion), `auto` vs explicit-list label building, missing-value dropping |
+| `test_annotation_validity.py` | 5 | `AnnotationValidityStatistic`: silhouette/DBI/CH scored per annotation on `ctx.coords`, embedding vs. projection `space_kind`, missing-value exclusion, single-category no-op |
 | `test_biocentral_embedder.py` | 23 | Biocentral API client, embedding flow |
 | `test_fasta.py` | 17 | FASTA parsing, edge cases, CSV annotation loading |
 | `test_biocentral_retriever.py` | 14 | Biocentral prediction retriever (TMbed parsing, per-sequence) |

diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ ProtSpace is a visualization tool for exploring **protein embeddings** or **simi
 
 - **Multiple projections**: PCA, UMAP, t-SNE, MDS, PaCMAP, LocalMAP
 - **Automatic annotations**: UniProt, InterPro, and Taxonomy
-- **Quality metrics** _(opt-in)_: per-projection cluster-validity + faithfulness (local & global) via `--stats`
+- **Quality metrics** _(opt-in)_: annotation-based cluster-validity + faithfulness (local & global) via `--stats`
 - **Structure viewer**: Integrated protein structure visualization
 - **Export**: PNG, PDF, SVG, HTML
 
@@ -65,7 +65,7 @@ protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet
 protspace bundle -p projections/ -a annotations.parquet -s statistics.parquet -o output.parquetbundle
 ```
 
-Or compute quality metrics inline during `prepare` with `--stats` (opt-in): cluster-validity + faithfulness per projection. See the [CLI Reference](docs/cli.md#projection-statistics---stats).
+Or compute quality metrics inline during `prepare` with `--stats` (opt-in): annotation-based cluster-validity + faithfulness per projection. See the [CLI Reference](docs/cli.md#projection-statistics---stats).
 
 ## 📊 Example Output
 

diff --git a/docs/cli.md b/docs/cli.md
@@ -130,8 +130,9 @@ This produces three projections: `ProtT5 — PCA 2`, `ProtT5 — UMAP 2 (n=15)`,
 | ---- | ----------- | ------- |
 | `-o, --output` | Output directory. | `.` |
 | `--bundled / --no-bundled` | Bundle into single `.parquetbundle`. | bundled |
-| `--stats / --no-stats` | Compute projection quality statistics (cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
+| `--stats / --no-stats` | Compute projection quality statistics (annotation-based cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
 | `--cluster-selection` | With `--stats`, how to choose the cluster count K: `elbow`, `silhouette`, or `both`. | `elbow` |
+| `--stats-annotation` | With `--stats`, which annotation column(s) to score for cluster-validity: `auto` (all suitable low-cardinality categoricals) or a comma-separated list. | `auto` |
 | `--keep-tmp` | Cache intermediates for resumability. | on |
 | `--no-log` | Skip writing `run.log`. | off |
 | `--dump-cache` | Print cached annotations and exit. | off |
@@ -179,46 +180,53 @@ protspace bundle -p projections/ -a annotations.parquet \
 
 ## `protspace stats`
 
-Compute per-projection quality statistics for an existing project directory and write them as a `statistics.parquet` (the optional 5th `.parquetbundle` part). No annotations are required. See [Projection Statistics](#projection-statistics---stats) for what is computed.
+Compute per-projection quality statistics for an existing project directory and write them as a `statistics.parquet` (the optional 5th `.parquetbundle` part). Faithfulness and the auto-cluster membership columns need no annotations; annotation-based validity (and its ARI/NMI agreement with the auto-clusters) needs `-a/--annotations`. See [Projection Statistics](#projection-statistics---stats) for what is computed.
 
 ```bash
 # Statistics for a project (embeddings needed for faithfulness)
 protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet
 
 # Also enrich an annotations parquet in place with per-protein cluster-membership
-# columns, and write the auto cluster-legend styles for `bundle`
+# columns, score annotation-based validity, and write the auto cluster-legend styles
 protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
   -a annotations.parquet --settings-out cluster_styles.json
 
+# Score only specific annotations instead of every suitable categorical (default: auto)
+protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
+  -a annotations.parquet --stats-annotation major_group,ec_number
+
 # Emit both the elbow and the silhouette-optimal clustering
 protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
   -a annotations.parquet --cluster-selection both
 ```
 
 | Flag | Description | Default |
 | ---- | ----------- | ------- |
-| `-i, --input` | HDF5 embedding file(s) (for faithfulness). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
+| `-i, --input` | HDF5 embedding file(s) (for faithfulness + the once-per-embedding annotation-validity pass). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
 | `-p, --projections` | Project directory with `projections_metadata.parquet` + `projections_data.parquet`. | — |
 | `-o, --output` | Output `statistics.parquet` path. | — |
-| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` membership columns (per-point silhouette attached as `value|score`). | — |
+| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` membership columns (per-point silhouette attached as `value|score`), and to score for annotation-based validity + ARI/NMI agreement. | — |
 | `--cluster-selection` | Cluster count K selection: `elbow`, `silhouette`, or `both`. | `elbow` |
+| `--stats-annotation` | Which annotation column(s) to score for cluster-validity: `auto` (all suitable low-cardinality categoricals) or a comma-separated list. Requires `-a`. | `auto` |
 | `--settings-out` | Write auto cluster-legend styles here (JSON) for `bundle --settings`. Requires `-a`. | — |
 | `--metric` | High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS). | `euclidean` |
 | `--seed` | Random seed. | `42` |
 
 ## Projection Statistics (`--stats`)
 
-`prepare --stats` (opt-in) and the standalone `protspace stats` command compute two families of per-projection quality metrics and bake them into the output:
+`prepare --stats` (opt-in) and the standalone `protspace stats` command compute three families of quality metrics (per projection, plus one annotation-validity pass over the source embedding) and bake them into the output:
 
-- **Cluster validity** — KMeans labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**, written to the tidy `statistics.parquet` (the bundle's 5th part). The cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
+- **Annotation-based validity** — silhouette, Davies–Bouldin, and Calinski–Harabasz scored using an annotation's own category labels (not auto-clustering) — how well proteins already grouped by an annotation (e.g. `major_group`, `ec_number`) separate in a given space. Computed once for the source embedding (a separability "ceiling") and again for each projection, written to the tidy `statistics.parquet` (the bundle's 5th part) with `space_kind ∈ {embedding, projection}` and an `annotation` column naming which one was scored. `--stats-annotation auto|name1,name2` (default `auto`) picks which annotation column(s) to score — `auto` scores every "suitable" low-cardinality categorical (≥2 and ≤min(50, max(2, n/2)) distinct non-empty values, not numeric, and not a generated `cluster_*` column); requires `-a/--annotations`.
+- **Auto-cluster agreement** — KMeans labels the projection; the cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. This auto-clustering is no longer scored against itself (that was circular); instead, when annotations are supplied, each labelling's **ARI** (adjusted Rand index) and **NMI** (normalized mutual information) agreement with every scored annotation is recorded (`stat_family=cluster_agreement`). Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
 - **Faithfulness** — how well the projection preserves the source embedding's structure; each row is tagged `scope`:
   - **local** (kNN-neighbourhood): **kNN-overlap**, **trustworthiness**, **continuity**.
   - **global** (whole-layout): **random_triplet** (relative-ordering accuracy over random triplets, ∈[0,1]) and **spearman_distance** (rank correlation of all pairwise distances, ∈[−1,1]).
 
-  These per-projection scalars ride in each projection's `info_json.quality`.
+  These per-projection scalars ride in each projection's `info_json.quality` — they never land in `statistics.parquet`.
 
 Notes:
-- Off by default — the compute (a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
+- Off by default — the compute (annotation-validity + a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
+- Annotation-based validity and cluster agreement need `-a/--annotations`; faithfulness and the membership columns do not.
 - Uses the projection's own high-dim metric (e.g. `cosine`) for faithfulness; falls back to `--metric` / `euclidean` when the reducer doesn't record one.
 - Best-effort: a failure for one statistic or projection is logged and skipped, never failing the run. At large scale the heavier metrics are subsampled (silhouette/faithfulness) or fit on a bounded subsample (KMeans elbow) with a deterministic seed.
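
The `auto` suitability rule documented under "Annotation-based validity" (at least 2 and at most `min(50, max(2, n/2))` distinct non-empty values, not numeric, not a generated `cluster_*` column) can be sketched as follows. This is a minimal illustration, not the code in `annotation_select.py`: the helper names `is_suitable` and `_is_numeric` are hypothetical, `n` is assumed to be the number of rows (proteins), and "numeric" is assumed to mean that every non-empty value parses as a float.

```python
from typing import Optional, Sequence


def _is_numeric(value: str) -> bool:
    """Assumption: a value counts as numeric if float() can parse it."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def is_suitable(name: str, values: Sequence[Optional[str]]) -> bool:
    """Hypothetical check mirroring the documented `auto` rule."""
    # Generated membership columns are never scored.
    if name.startswith("cluster_"):
        return False
    # Missing and empty values are dropped before counting categories.
    distinct = {v for v in values if v is not None and v.strip() != ""}
    # Columns whose values are all numeric are not categorical.
    if distinct and all(_is_numeric(v) for v in distinct):
        return False
    n = len(values)  # assumption: n = number of proteins (rows)
    cap = min(50, max(2, n / 2))
    return 2 <= len(distinct) <= cap


# Toy annotation table with 8 proteins, so the cap is min(50, max(2, 4)) = 4.
columns = {
    "major_group": ["a", "b", "a", "b", "c", "a", "b", "c"],   # 3 categories
    "protein_id": [f"P{i}" for i in range(8)],                  # 8 categories, over the cap
    "length": ["120", "98", "120", "301", "98", "77", "77", "301"],  # numeric
    "cluster_elbow_pca2": ["cluster 1|0.4"] * 4 + ["cluster 2|0.3"] * 4,  # generated
    "organism": ["human"] * 8,                                  # single category
}

selected = [name for name, vals in columns.items() if is_suitable(name, vals)]
print(selected)
```

With these toy columns only `major_group` passes: the id-like column exceeds the cardinality cap, `length` is numeric, the `cluster_*` column is excluded by name, and `organism` has a single category.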