feat(stats): cluster-selection, silhouette-as-score, global faithfulness metrics #63
Merged
tsenoner merged 4 commits into … on Jul 2, 2026
…-score, global faithfulness

Sub-branch of feat/projection-statistics for separate review.

- --cluster-selection elbow|silhouette|both (prepare + stats): emit the elbow clustering (`cluster_<proj>`), the max-silhouette-K clustering (`cluster_silhouette_<proj>`), or both; validity rows carry the matching label_kind (kmeans_elbow / kmeans_silhouette). kmeans_elbow optionally returns the silhouette-optimal K + labels (computed only on request).
- Per-point silhouette is now attached to the membership value as `cluster N|<sil>` (the UniProt-ECO / InterPro-bit-score convention) instead of a separate silhouette_<proj> column; gated by --no-scores. Legend builder strips the suffix to recover the bare category.
- Two global faithfulness metrics: random_triplet (relative-ordering accuracy over random triplets) and spearman_distance (rank correlation of all pairwise distances). Rows tagged scope=local|global.

Tests updated for the single-column format; added cases for cluster-selection, score gating, global metrics, and silhouette-K selection. 572 fast tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
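The `value|score` membership format and the legend-side suffix stripping described in this commit can be sketched roughly as follows. This is an illustration only: the helper names `attach_score` and `strip_score`, the `include_scores` flag wiring, and the four-decimal rounding are assumptions, not the project's actual code.

```python
from typing import Optional


def attach_score(label: str, score: Optional[float], include_scores: bool = True) -> str:
    """Return 'label|score', or the bare label when scores are disabled or missing.

    Hypothetical helper: `include_scores=False` stands in for --no-scores.
    """
    if not include_scores or score is None:
        return label
    # Four decimals is an assumption chosen for readability.
    return f"{label}|{score:.4f}"


def strip_score(value: str) -> str:
    """Recover the bare category by dropping everything after the last '|'."""
    return value.rsplit("|", 1)[0]


# Three points: two in cluster 3, one in cluster 1.
values = [
    attach_score(f"cluster {k}", sil)
    for k, sil in [(3, 0.60131), (3, 0.41), (1, -0.0523)]
]

# A legend keyed by bare category collapses the per-point scores.
legend = sorted({strip_score(v) for v in values})
```

Splitting on the last `|` keeps the helper safe for values that carry no score at all, since `rsplit` then returns the input unchanged.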
- random_triplet was NOT row-order invariant for n<=sample_threshold (it samples triplets by array position). Canonicalise emb/coords/ids by id up front in FaithfulnessStatistic.compute so EVERY metric depends only on the id-set, in both the subsampled and non-subsampled paths. Invariance test now parametrised over both regimes and asserts all five metrics.
- prepare: validate --cluster-selection before the expensive query/embed/similarity stages (fail-fast), mirroring the stats command; add a CLI rejection test.
- Refresh stale docs/help/comments that still referenced the removed separate silhouette_<proj> column (carriage.py, cli/stats.py) and fix a "dense ranks" comment (ordinal ranks) + hoist a repeated fancy-index in random_triplet.

574 fast tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
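The canonicalisation step in the first bullet above amounts to sorting the three parallel inputs by id before any metric runs. A minimal sketch, assuming plain Python sequences; the function name `canonicalise_by_id` and the sample ids are hypothetical, and the real `FaithfulnessStatistic.compute` presumably works on arrays:

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def canonicalise_by_id(
    ids: Sequence[str], emb: Sequence[Vector], coords: Sequence[Vector]
) -> Tuple[List[str], List[Vector], List[Vector]]:
    """Reorder the three parallel sequences so that rows are sorted by id."""
    if not (len(ids) == len(emb) == len(coords)):
        raise ValueError("ids, emb and coords must have the same length")
    order = sorted(range(len(ids)), key=lambda i: ids[i])
    return (
        [ids[i] for i in order],
        [emb[i] for i in order],
        [coords[i] for i in order],
    )


# The same three items supplied in two different row orders.
run_a = canonicalise_by_id(
    ["P3", "P1", "P2"],
    [(0.3, 0.3), (0.1, 0.1), (0.2, 0.2)],
    [(3.0,), (1.0,), (2.0,)],
)
run_b = canonicalise_by_id(
    ["P2", "P3", "P1"],
    [(0.2, 0.2), (0.3, 0.3), (0.1, 0.1)],
    [(2.0,), (3.0,), (1.0,)],
)
```

After this step both runs hold identical data in identical positions, so any metric that samples by array position with a fixed seed sees the same input regardless of the caller's row order.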
…j> + sync docs

- Rename the elbow clustering's membership column cluster_<proj> -> cluster_elbow_<proj> so both selections are explicitly named (cluster_elbow_ / cluster_silhouette_). The column name is the only provenance signal that survives to the frontend (AnnotationColumn.extra is dropped at carriage), so name the method in it.
- Bring docs + notebook current with the whole extras feature set (they only reflected the base PR): --cluster-selection, silhouette-as-attached-score (no separate silhouette_ column), and the local/global faithfulness split. Updated docs/cli.md, CLAUDE.md, README.md, ProtSpace_Preparation.ipynb.

574 fast tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
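How a selection value maps to the renamed membership columns might look like the sketch below. `ClusterSelection(str, Enum)` is named in a later commit of this PR, but the member names, the helper `membership_columns`, and the projection name `pca_2` are assumptions for illustration.

```python
from enum import Enum
from typing import List


class ClusterSelection(str, Enum):
    """Accepted --cluster-selection values (member names are assumed)."""

    ELBOW = "elbow"
    SILHOUETTE = "silhouette"
    BOTH = "both"


def membership_columns(selection: ClusterSelection, projection: str) -> List[str]:
    """Return the membership column names emitted for one projection."""
    columns = []
    if selection in (ClusterSelection.ELBOW, ClusterSelection.BOTH):
        columns.append(f"cluster_elbow_{projection}")
    if selection in (ClusterSelection.SILHOUETTE, ClusterSelection.BOTH):
        columns.append(f"cluster_silhouette_{projection}")
    return columns


both_columns = membership_columns(ClusterSelection("both"), "pca_2")

# Constructing the enum from an unknown string fails immediately.
try:
    ClusterSelection("kmeans")
    rejected = False
except ValueError:
    rejected = True
```

Because the method name sits in the column name itself, a consumer that only sees column names can still tell the two clusterings apart.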
Quality pass over the projection-stats "extras" (cluster-selection, silhouette-as-score, global faithfulness).

Correctness
- random_triplet: sample two DISTINCT others per anchor (j != m != anchor) instead of drawing uniformly from [0, n). Self-pairs are distance-0 and trivially "agree" in both spaces, biasing the accuracy score upward.

Robustness / efficiency
- faithfulness: return the n > hard_ceiling skip row BEFORE the canonical sort/copy, so oversized inputs (metrics skipped anyway) don't pay a wasted O(n log n) sort + two array copies.
- cluster-validity: fall back to the 'elbow' default when the raw stats API receives an unrecognised cluster_selection (the CLI already validates via a Typer enum) instead of silently emitting no labelling at all.

Simplify
- model --cluster-selection as ClusterSelection(str, Enum) in common_options; Typer auto-validates, deleting two duplicated manual validation blocks in prepare.py + stats.py.
- validity: carry selection_name in a _Labeling NamedTuple (drops the reverse-derivation; shrinks _emit_labeling's signature 8 -> 5 args).
- kmeans_elbow: unify the two duplicate ElbowResult return sites.
- faithfulness: factor the 3x repeated local-scope extra dict.

Docs
- sync stale test-count table in CLAUDE.md (37->43, 11->12, 9->10).
- sync driver.compute_statistics docstring params (cluster_selection, include_scores, max_fit_sample, n_triplets_per_point, cluster_annotations).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
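The distinct-sampling fix under "Correctness" can be illustrated with a small pure-Python sketch. It is not the project's implementation: the function name, the Euclidean distance choice, and the seed handling are assumptions; only the parameter name `n_triplets_per_point` comes from the commit message. The key point is that `j`, `m`, and the anchor are pairwise distinct, so no zero-distance self-pair can inflate the score.

```python
import math
import random
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def random_triplet_accuracy(
    emb: Sequence[Vector],
    coords: Sequence[Vector],
    n_triplets_per_point: int = 5,
    seed: int = 0,
) -> float:
    """Fraction of (anchor, j, m) triplets whose 'is j closer than m?' answer
    agrees between the embedding space and the projected coordinates."""
    n = len(emb)
    if n != len(coords):
        raise ValueError("emb and coords must have the same length")
    if n < 3:
        raise ValueError("need at least three points to form a triplet")
    rng = random.Random(seed)
    agree = 0
    for anchor in range(n):
        for _ in range(n_triplets_per_point):
            # Draw two distinct indices from the n - 1 non-anchor slots, then
            # shift every index >= anchor up by one so the anchor is skipped.
            j, m = (i + (i >= anchor) for i in rng.sample(range(n - 1), 2))
            high = math.dist(emb[anchor], emb[j]) < math.dist(emb[anchor], emb[m])
            low = math.dist(coords[anchor], coords[j]) < math.dist(coords[anchor], coords[m])
            agree += high == low
    return agree / (n * n_triplets_per_point)


emb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 4.0), (3.0, 3.0, 3.0)]
perfect = random_triplet_accuracy(emb, emb)                 # layout identical to the embedding
collapsed = random_triplet_accuracy(emb, [(0.0,)] * 5)      # every point projected to one spot
```

Sampling from `range(n - 1)` and shifting avoids rejection loops while still guaranteeing that neither sampled index equals the anchor.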
Sub-branch of #61 (targets `feat/projection-statistics`, not `main`) so this new work is reviewed on top of the already-reviewed base.

What's new

1. `--cluster-selection elbow | silhouette | both`

Choose how the cluster count K is picked (on `prepare --stats` and `protspace stats`):

- `elbow` (default) → `cluster_<proj>` membership column, validity rows `label_kind=kmeans_elbow`.
- `silhouette` → `cluster_silhouette_<proj>` (K maximising silhouette over the sweep), `label_kind=kmeans_silhouette`.
- `both` → both columns + both label_kinds (statistics rows are distinguished by `label_kind`).

`kmeans_elbow` gains an optional silhouette-K pass (computed only on request, so the default path is unchanged).

2. Per-point silhouette as an attached confidence (not a separate column)

The membership value now carries the per-point silhouette as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores), replacing the separate `silhouette_<proj>` column. Gated by `--no-scores`. The auto legend strips the suffix to key categories by the bare `cluster N`.

3. Global faithfulness metrics

Two whole-layout metrics added alongside the local kNN ones (rows tagged `scope: local|global`):

- `random_triplet`: relative-ordering accuracy over random triplets (∈ [0, 1]).
- `spearman_distance`: rank correlation of all pairwise distances (∈ [−1, 1]).

Verification

- … `--stats --cluster-selection both`): two clusterings (elbow K=7 vs silhouette K=2), membership values `cluster 3|0.6013`, 5 faithfulness metrics.
- `random_triplet` is now row-order invariant in all paths, `prepare` validates `--cluster-selection` fail-fast, stale docs refreshed.

Notes / decisions for review

- … `cluster_<proj>` (backward compatible); silhouette uses `cluster_silhouette_<proj>`.
- `random_triplet` needs `paired_distances`-supported metrics (euclidean/cosine/manhattan) and degrades best-effort otherwise.

🤖 Generated with Claude Code