feat(stats): annotation-based cluster-validity + ARI/NMI agreement by tsenoner · Pull Request #65 · tsenoner/protspace

tsenoner · 2026-07-03T21:05:44Z

Reworks projection cluster-validity to be annotation-based, aligning it with the original request in #31 (which asked for validity scores computed "for any selected feature/annotation" on the "original high-dimensional embeddings").

Why

The shipped metric computed silhouette/DBI/CH on auto-KMeans labels of the 2D projection. That is partly circular (KMeans optimises the same compactness that silhouette and CH reward) and annotation-agnostic. It answered "does this projection form clean KMeans blobs?" rather than "how well does this projection/embedding separate my major_group?"

What changed

Annotation-based validity — silhouette / Davies-Bouldin / Calinski-Harabasz are now scored on a user-selected annotation's own category labels, on both the source embedding (the true-separability "ceiling", computed once per embedding) and each 2D projection (as-displayed). stat_family="annotation_validity", space_kind ∈ {embedding, projection}.
ARI/NMI agreement — the retained auto-clusterings are compared against each annotation ("did KMeans recover major_group?"). stat_family="cluster_agreement", reusing the KMeans labels (no second sweep).
Removed the circular auto-cluster self-validity rows. Kept n_clusters meta + the cluster_elbow_*/cluster_silhouette_* membership columns (+ per-point silhouette confidence + auto legend).
New flag --stats-annotation on prepare and stats: auto (all suitable low-cardinality categoricals; default, gated on --stats) or a comma-separated list.
statistics.parquet gains an annotation column (additive — the bundle treats the table as opaque parquet, so no reader breaks).
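
To make the two row families concrete, here is a minimal sketch using scikit-learn's metric functions on toy data. It is not the code from this PR: the helper names `annotation_validity` and `cluster_agreement`, the `metric`/`value` row keys, the PCA projection and the fixed K=3 are all assumptions chosen for illustration. Only `stat_family`, `space_kind` and `annotation` come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def annotation_validity(points, labels):
    """Score how well `points` separate the annotation's own categories."""
    return {
        "silhouette": float(silhouette_score(points, labels)),
        "davies_bouldin": float(davies_bouldin_score(points, labels)),
        "calinski_harabasz": float(calinski_harabasz_score(points, labels)),
    }


def cluster_agreement(cluster_labels, annotation_labels):
    """Compare an automatic clustering against the annotation labels."""
    return {
        "ari": float(adjusted_rand_score(annotation_labels, cluster_labels)),
        "nmi": float(normalized_mutual_info_score(annotation_labels, cluster_labels)),
    }


rng = np.random.default_rng(0)
# Toy data: three well-separated groups in a 16-dimensional "embedding".
centers = rng.normal(scale=10.0, size=(3, 16))
annotation = np.repeat(np.array(["a", "b", "c"]), 40)
embedding = np.vstack([c + rng.normal(size=(40, 16)) for c in centers])
projection = PCA(n_components=2, random_state=0).fit_transform(embedding)

# Same labels, two spaces: the embedding acts as the separability ceiling,
# the projection is scored as displayed.
rows = []
for space_kind, points in (("embedding", embedding), ("projection", projection)):
    for metric, value in annotation_validity(points, annotation).items():
        rows.append(
            {
                "stat_family": "annotation_validity",
                "space_kind": space_kind,
                "annotation": "major_group",
                "metric": metric,
                "value": value,
            }
        )

# The KMeans labels are computed once and reused for the agreement scores.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(projection)
agreement = cluster_agreement(kmeans_labels, annotation)
```

Comparing the embedding row with the projection row for the same annotation shows how much separability the 2D projection loses. The ARI/NMI values answer the separate question of whether unsupervised clustering recovers the annotation.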

Design

Spec + plan committed under docs/superpowers/. Key decisions were made interactively: score on both spaces; --stats-annotation auto|list; keep auto-clusters + add ARI/NMI; gap/BIC K-selection deferred to #64.

Testing

Full fast suite green (593 passed). New tests: test_annotation_select.py (suitability filter), test_annotation_validity.py (per-space scoring, missing-value exclusion, subsample determinism), plus driver/CLI coverage for the embedding+projection passes and --stats-annotation. Built task-by-task with per-task + whole-branch review.
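
The two behaviours named in the test list, missing-value exclusion and subsample determinism, can be sketched with the standard library alone. The helper names, the set of missing-value markers and the seed handling below are assumptions for illustration, not the implementation under test.

```python
import random

# Assumed markers for "no annotation value"; the real filter may differ.
MISSING = {"", "nan", "none", "na"}


def present_indices(labels):
    """Indices of points whose annotation value is usable for scoring."""
    return [
        i
        for i, value in enumerate(labels)
        if value is not None and str(value).strip().lower() not in MISSING
    ]


def subsample(indices, max_points, seed=0):
    """Deterministic subsample: the same inputs and seed give the same indices."""
    indices = list(indices)
    if len(indices) <= max_points:
        return indices
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return sorted(rng.sample(indices, max_points))


labels = ["a", None, "b", "", "NaN", "a"]
kept = present_indices(labels)
first = subsample(range(100), 10, seed=1)
second = subsample(range(100), 10, seed=1)
```

A test for determinism then only needs to assert that `first == second`. A test for exclusion asserts that the dropped indices never reach the scoring function.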

Parked follow-ups (after this lands)

Regenerate the 3FTx sample bundle and update the frontend spec protspace_web#296 for the new annotation column / annotation_validity rows.
The paused feat/projection-statistics → main merge.

Refs: #31, #64, protspace_web#296

🤖 Generated with Claude Code

Rework cluster-validity to score user-selected annotations (silhouette/DBI/CH on both the embedding and each projection) + ARI/NMI vs the auto-clusters, replacing the circular auto-KMeans self-validity. Keeps the group-detection membership columns. Gap/BIC k-selection deferred to #64.
Refs: #31, #64, protspace_web#296
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

8 TDD tasks: annotation dimension in the data model, suitability filter + label builder, AnnotationValidityStatistic (embedding + projection), ARI/NMI agreement folded into ClusterValidityStatistic, driver once-per-embedding pass, --stats-annotation on stats + prepare, docs.
Refs: #31, #64, protspace_web#296
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ation) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Threads an `annotations` kwarg through `compute_statistics` into every projection's StatContext, registers AnnotationValidityStatistic in the statistics registry, and adds a once-per-embedding pass that runs any statistic opting in via `embedding_space` (currently just annotation-validity) against the raw embedding as a separability ceiling. Also patches faithfulness.py's StatRow constructions with the now-required `annotation` field, and fixes the Task-1 test debt this exposed (_statrow helper + the 8→9 column schema assertion).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
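
The driver shape described in this commit message can be sketched as follows. This is a toy stand-in, not the project's `compute_statistics`: the `StatRow` fields other than `annotation`, the class names, the `run_statistics` function and the placeholder values are hypothetical. Only the `embedding_space` opt-in and the empty `annotation` for rows that are not annotation-scoped are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class StatRow:
    stat_family: str
    space_kind: str  # "embedding" or "projection"
    space_name: str
    annotation: str  # "" for rows that are not annotation-scoped
    value: float


class AnnotationValidity:
    stat_family = "annotation_validity"
    embedding_space = True  # opts in to the once-per-embedding pass

    def compute(self, space_kind, space_name, points, annotations):
        # Placeholder value (number of categories); a real statistic would
        # score silhouette/DBI/CH of `points` against the labels.
        return [
            StatRow(self.stat_family, space_kind, space_name, name, float(len(set(labels))))
            for name, labels in annotations.items()
        ]


class Faithfulness:
    stat_family = "faithfulness"
    embedding_space = False  # projection-only statistic

    def compute(self, space_kind, space_name, points, annotations):
        return [StatRow(self.stat_family, space_kind, space_name, "", 0.0)]


def run_statistics(embedding, projections, statistics, annotations):
    rows = []
    # Pass 1: once per embedding, only for statistics that opt in.
    for stat in statistics:
        if getattr(stat, "embedding_space", False):
            rows.extend(stat.compute("embedding", "raw", embedding, annotations))
    # Pass 2: every statistic on every projection.
    for name, points in projections.items():
        for stat in statistics:
            rows.extend(stat.compute("projection", name, points, annotations))
    return rows


rows = run_statistics(
    embedding=[[0.0] * 8] * 4,
    projections={"pca": [[0.0, 0.0]] * 4, "umap": [[0.0, 0.0]] * 4},
    statistics=[AnnotationValidity(), Faithfulness()],
    annotations={"major_group": ["a", "a", "b", "b"]},
)
```

With an opt-in attribute, the embedding pass costs nothing for statistics that only make sense on a projection, and the ceiling is computed once no matter how many projections exist.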

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…line Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

_faith_row() built StatRow(...) without the now-required `annotation` field, breaking 3 tests after the annotation-dimension schema change. Faithfulness rows are not annotation-scoped, so annotation="".
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…otation
Update CLAUDE.md, docs/cli.md, README.md, and the prepare-bundle Colab notebook to describe the shipped feature: silhouette/DBI/CH validity is now scored per user-selected annotation on both the embedding and each projection (space_kind embedding|projection, annotation column in statistics.parquet), auto-clustering is no longer self-scored but instead reports ARI/NMI agreement with each annotation, and the new --stats-annotation (auto|comma-list) flag picks which columns to score on prepare and stats. Refresh the stats/ package-structure tree and the test-file table with current test counts (grep -c '^def test_'), including the new test_annotation_select.py and test_annotation_validity.py.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
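
As a rough illustration of the `auto|comma-list` behaviour, the sketch below parses the flag value and applies a simple suitability filter. The function names, the case-insensitive handling of `auto`, and the cardinality thresholds are assumptions made for this example. The actual rules live in the PR's annotation-selection module and may differ.

```python
def parse_stats_annotation(raw):
    """Return "auto" or a list of column names from the flag's string value."""
    value = raw.strip()
    if value.lower() == "auto":  # assumed normalisation of the keyword
        return "auto"
    return [name.strip() for name in value.split(",") if name.strip()]


def suitable_annotations(columns, min_categories=2, max_categories=20):
    """Pick low-cardinality categorical columns (thresholds are assumptions)."""
    selected = []
    for name, values in columns.items():
        present = [v for v in values if v not in (None, "")]
        n_categories = len(set(present))
        # Reject constants, overly fine-grained columns, and ID-like columns
        # where every present value is unique.
        if min_categories <= n_categories <= max_categories and n_categories < len(present):
            selected.append(name)
    return selected


columns = {
    "major_group": ["a", "b", "a", "b"],
    "identifier": ["p1", "p2", "p3", "p4"],
    "constant": ["x", "x", "x", "x"],
}
selection = parse_stats_annotation(" AUTO ")
chosen = suitable_annotations(columns) if selection == "auto" else selection
```

Here `chosen` contains only `major_group`: the identifier column is unique per row and the constant column has a single category, so neither defines groups that a validity score could measure.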

…xtra-copy, cleanups)
Fixes six minor findings from the whole-branch review of the annotation-based cluster-validity feature: stale eight-column docstrings in stats/base.py, a strict "auto" guard in cli/stats.py that didn't match the parser's normalised comparison, a shared/aliased `extra` dict across StatRows in annotation_validity.py, an unused np.unique return in validity.py, a duplicated _clean() call in annotation_select.py, and a doc/notebook wording+formatting nit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
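
The shared `extra` dict finding is a general Python aliasing pitfall. The following self-contained sketch reproduces it with a hypothetical `Row` dataclass; it is not the code from annotation_validity.py.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    metric: str
    extra: dict = field(default_factory=dict)


metrics = ("silhouette", "davies_bouldin")

# Aliased: every row holds a reference to the same dict object.
base = {"n_points": 120}
aliased = [Row(m, base) for m in metrics]
aliased[0].extra["subsampled"] = True
leaked = "subsampled" in aliased[1].extra  # the mutation shows up on row 1 too

# Fixed: each row gets its own shallow copy.
base = {"n_points": 120}
copied = [Row(m, dict(base)) for m in metrics]
copied[0].extra["subsampled"] = True
isolated = "subsampled" not in copied[1].extra
```

A shallow copy is enough while the values are scalars. Nested mutable values would need `copy.deepcopy` instead.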

tsenoner and others added 12 commits July 2, 2026 22:49

feat(stats): add annotation dimension to StatRow + StatContext

a6e0210

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(stats): annotation selection + suitability filter

3570e7f

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(stats): AnnotationValidityStatistic (silhouette/DBI/CH per annot…

3f0731d

…ation) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

refactor(stats): drop auto-cluster self-validity, add ARI/NMI agreement

1efa34c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(stats): stats --stats-annotation scores selected annotations

2d145ac

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(stats): prepare --stats-annotation flows selection into the pipe…

f268a5d

…line Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

tsenoner wants to merge 12 commits into feat/projection-statistics from feat/annotation-cluster-validity
