feat(stats): projection statistics (cluster-validity + faithfulness)#61
feat(stats): projection statistics (cluster-validity + faithfulness)#61jcoludar wants to merge 13 commits into
Conversation
Add a protspace.stats package computing per-projection statistics, baked into the .parquetbundle as an optional fifth part: - cluster_validity: KMeans + distance-to-chord elbow -> silhouette, Davies-Bouldin, Calinski-Harabasz on the projection coordinates. - faithfulness: kNN-overlap + trustworthiness/continuity vs the source embedding (high-dim metric from the reducer; large-n sampling guard). Tidy long-format table (8 cols: space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json) — new statistics add rows, not columns. Registry mirrors the lazy REDUCERS pattern; sklearn imports stay function-local to preserve CLI startup. Bundle I/O carries an optional 5th statistics part (core+settings?+stats?) with a zero-byte settings slot keeping it unambiguous; read_bundle keeps its 2-tuple shape (new read_statistics_from_bundle accessor) and replace_settings_in_bundle preserves a trailing stats part so `protspace style` is non-lossy. Wiring: ReductionPipeline computes stats (best-effort, never fatal) behind prepare --stats/--no-stats; new `protspace stats` subcommand for the discrete path; `bundle -s/--statistics` folds a stats parquet in. Refs tsenoner/protspace_web#219 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Web counterpart (consumes this, lands after a stats-bearing release): tsenoner/protspace_web#295. Tracking issue: tsenoner/protspace_web#219. |
CI's `ruff format --check` flagged 9 files that were committed without running `ruff format` (`ruff check` lint passed, but the formatter check is a separate CI step). Pure formatting — no behavior change. Stats suite still 30 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Where to put the information:
|
…lity) Phase 1A of route-projection-statistics: carry each statistic in the bundle part whose existing frontend consumer matches its granularity, instead of one opaque fifth part. - StatRow gains a `destination` (default "statistics_part", not a tidy-table column); StatsReport.partition() groups rows by destination and to_arrow() serialises only the statistics_part bucket -- the fifth part is now aggregate cluster-validity only. - Faithfulness rows (kNN-overlap / trustworthiness / continuity, incl. the skip row) are marked destination="projection_metadata". - New stats/carriage.py route_faithfulness_to_metadata() folds those rows into each projection's info_json.quality (per-metric value + k/metric/sampling provenance; NaN skip value -> null so info_json stays valid JSON). Wired into ReductionPipeline._compute_statistics before create_output serialises info_json. - `protspace stats` stays a pure aggregate-only producer (faithfulness no longer written to statistics.parquet); the prep stats+bundle path is unaffected. Tests: destination/partition/to_arrow restriction; faithfulness routing incl. skip row; carriage router (provenance, NaN->null, info_json round-trip, multi-embedding); end-to-end `protspace stats` aggregate-only. Existing fifth-part tests updated to the narrowed contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…thfulness The deployed prep pipeline builds bundles via standalone `protspace project` + `stats` + `bundle` subprocesses, not the in-process `prepare` pipeline. After the Phase-1A routing, `protspace stats` wrote faithfulness nowhere (only aggregate validity → statistics.parquet), so the prep path lost it. `protspace stats` now folds faithfulness into `projections_metadata.parquet` in place (parses each row's info_json, injects `quality`, preserves all other columns and the reducer's existing info, re-serialises). The existing `protspace bundle -p` then carries the enriched metadata into the bundle's 2nd part with no bundle/prep code change. statistics.parquet stays aggregate-only. This matches the spec scenario "the standalone stats path recomputes and merges it into projections_metadata" and makes Phase 1 deliver faithfulness end-to-end in the production prep flow. Tests: stats rewrites metadata.info_json.quality (columns/rows preserved, reducer info kept); end-to-end `stats` → `bundle -p` ships a bundle whose projections_metadata carries quality while the fifth part stays validity-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n columns Phase 2A of route-projection-statistics (tsenoner #61 review bullets 1-2): surface the elbow-K labelling and per-point silhouette as per-protein annotation columns so the frontend color-by control renders them with no new UI. - New AnnotationColumn output type (name, kind categorical|numeric, values keyed by identifier); StatsReport carries an annotation_columns channel and add() routes mixed StatRow / AnnotationColumn lists. - ClusterValidityStatistic emits `cluster_<projection>` (non-numeric "cluster N" labels → categorical inference) and `silhouette_<projection>` (per-point silhouette_samples over the full labelled set → numeric). Per-point silhouette is O(n^2) with no subsample path, so it has its own hard-ceiling skip guard; both are gated by the cluster_annotations param and emitted only for a genuine (>=2) clustering with aligned ids. - carriage.merge_annotation_columns joins the columns onto the annotations frame by identifier (absent proteins get no value); wired into the prepare pipeline before create_output's .astype(str) so typing survives. - `protspace stats` gains -a/--annotations: enriches the annotations parquet in place with the computed columns (stringified to match the prepare path), so the prep `project -> stats -a -> bundle -a` flow carries them. Without -a the expensive per-protein computation is skipped. Tests: validity per-protein outputs + ceiling guard + disable; carriage join + annotations-table typing; stats -a enrichment; end-to-end stats -a -> bundle -a ships cluster_/silhouette_ columns in the bundle's annotations part. Auto-styling (colored-without-manual-step) is the next increment; columns already color via the default palette when selected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2A.4 of route-projection-statistics. Generate a full LegendPersistedSettings envelope per cluster-membership column so clusters are colored when selected with no manual styling step. - carriage.build_cluster_legend_settings: for each categorical AnnotationColumn build a complete envelope the frontend's sanitizeLegendSettingsEntry accepts — maxVisibleValues / shapeSize / sortMode / hiddenValues / enableDuplicateStackUI / selectedPaletteId + categories keyed by the exact label with a Kelly-palette color, zOrder and shape. Numeric (silhouette) columns keep the default ramp. - prepare path: BaseProcessor.save_output gains settings=; the pipeline builds the cluster styles from the report and writes them into the bundle's settings part. - prep path: `protspace stats --settings-out <json>` writes the styles; `protspace bundle --settings <json>` folds them into the settings part. Tests: envelope validity (every required field/type + distinct palette colors); end-to-end stats --settings-out -> bundle --settings styles clusters in the settings part. Deferred (follow-up): preserving the generated cluster styles across a later `protspace style` rewrite (replace_settings_in_bundle) — a rare re-style path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Routing is implemented across this PR and protspace_web#295. Where each statistic lands:
Computed columns appear in color-by automatically, grouped under a new "Statistics" section. protspace stats handles carriage: faithfulness into projections_metadata, per-protein columns into the annotations parquet (-a), and the cluster legend styles via --settings-out for bundle --settings. The prep pipeline picks all of it up. Design + phased plan: the route-projection-statistics openspec change in the web repo. Both PRs green. Defaults chosen:
Deferred (not blocking):
Sequencing: I folded the routing directly into #61/#295 rather than landing the opaque fifth part first and stacking. Can split the per-protein annotations into a separate PR if that's easier to review. |
…faults Refinements to the not-yet-released stats subsystem from a /simplify pass, an xhigh /code-review, and researched fixes for the deferred review findings. Correctness: - base_processor.save_output: persist `settings` in the unbundled (--no-bundle) branch too — cluster legend styles were silently dropped there. - cli/stats: union same-name -i inputs (merge_same_name_sets) so multi-file same-embedding runs don't collapse to the last set; guard --settings-out to require -a; write metadata/annotations parquet atomically (temp + rename); infer 3D from z when projection metadata is absent. - faithfulness continuity: compute via a correct dual (`_continuity`) so the embedding is ranked by the run's high-dim metric (was always euclidean); bit-identical on the euclidean default path. - validity silhouette: report the aggregate as the exact per-point mean when the per-point column is computed (was an inconsistent sampled estimate). - faithfulness subsample: select on canonical id order so scores are row-order invariant (matching the already order-invariant seed). Scaling / cleanup: - kmeans_elbow: fit the K sweep on a bounded subsample (MiniBatchKMeans) + predict above 50k points; full-batch unchanged below. Remove the write-only silhouette_optimal_k cross-check and its duplicate silhouette compute. - Collapse duplicated faithfulness try/except into a loop; drop the self._stats_settings side channel (return (table, settings)); remove a redundant empty-table branch, a needless list copy, a double np.unique, and an unused shape param; centralize the reduction `source` stamp via a closure. Behavior: - prepare: `--stats` now defaults to False (opt-in) — heavy compute + bundle column/style injection should not run on every run. Adds regression tests (continuity dual, subsample determinism/order-invariance, silhouette consistency, unbundled settings, --settings-out guard). 565 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Post-review hardening pushed (
|
- docs/cli.md: add the `protspace stats` command, the `prepare --stats` flag, `bundle -s/--settings`, and a "Projection Statistics" concept section. - README.md: quality-metrics feature bullet + stats step in the power-user workflow. - CLAUDE.md: stats command + usage, stats/ package tree, cli/stats.py, the 5-part bundle layout (statistics part + settings in unbundled output), and stats test-file rows. - ProtSpace_Preparation.ipynb: a "Quality statistics" cell pointing to the CLI (the notebook installs from PyPI, so live-wiring the toggle waits for a release). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-score, global faithfulness Sub-branch of feat/projection-statistics for separate review. - --cluster-selection elbow|silhouette|both (prepare + stats): emit the elbow clustering (`cluster_<proj>`), the max-silhouette-K clustering (`cluster_silhouette_<proj>`), or both; validity rows carry the matching label_kind (kmeans_elbow / kmeans_silhouette). kmeans_elbow optionally returns the silhouette-optimal K + labels (computed only on request). - Per-point silhouette is now attached to the membership value as `cluster N|<sil>` (the UniProt-ECO / InterPro-bit-score convention) instead of a separate silhouette_<proj> column; gated by --no-scores. Legend builder strips the suffix to recover the bare category. - Two global faithfulness metrics: random_triplet (relative-ordering accuracy over random triplets) and spearman_distance (rank correlation of all pairwise distances). Rows tagged scope=local|global. Tests updated for the single-column format; added cases for cluster-selection, score gating, global metrics, and silhouette-K selection. 572 fast tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- random_triplet was NOT row-order invariant for n<=sample_threshold (it samples triplets by array position). Canonicalise emb/coords/ids by id up front in FaithfulnessStatistic.compute so EVERY metric depends only on the id-set, in both the subsampled and non-subsampled paths. Invariance test now parametrised over both regimes and asserts all five metrics. - prepare: validate --cluster-selection before the expensive query/embed/similarity stages (fail-fast), mirroring the stats command; add a CLI rejection test. - Refresh stale docs/help/comments that still referenced the removed separate silhouette_<proj> column (carriage.py, cli/stats.py) and fix a "dense ranks" comment (ordinal ranks) + hoist a repeated fancy-index in random_triplet. 574 fast tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…j> + sync docs - Rename the elbow clustering's membership column cluster_<proj> -> cluster_elbow_<proj> so both selections are explicitly named (cluster_elbow_ / cluster_silhouette_). The column name is the only provenance signal that survives to the frontend (AnnotationColumn.extra is dropped at carriage), so name the method in it. - Bring docs + notebook current with the whole extras feature set (they only reflected the base PR): --cluster-selection, silhouette-as-attached-score (no separate silhouette_ column), and the local/global faithfulness split. Updated docs/cli.md, CLAUDE.md, README.md, ProtSpace_Preparation.ipynb. 574 fast tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Quality pass over the projection-stats "extras" (cluster-selection, silhouette-as-score, global faithfulness). Correctness - random_triplet: sample two DISTINCT others per anchor (j != m != anchor) instead of drawing uniformly from [0, n). Self-pairs are distance-0 and trivially "agree" in both spaces, biasing the accuracy score upward. Robustness / efficiency - faithfulness: return the n > hard_ceiling skip row BEFORE the canonical sort/copy, so oversized inputs (metrics skipped anyway) don't pay a wasted O(n log n) sort + two array copies. - cluster-validity: fall back to the 'elbow' default when the raw stats API receives an unrecognised cluster_selection (the CLI already validates via a Typer enum) instead of silently emitting no labelling at all. Simplify - model --cluster-selection as ClusterSelection(str, Enum) in common_options; Typer auto-validates, deleting two duplicated manual validation blocks in prepare.py + stats.py. - validity: carry selection_name in a _Labeling NamedTuple (drops the reverse-derivation; shrinks _emit_labeling's signature 8 -> 5 args). - kmeans_elbow: unify the two duplicate ElbowResult return sites. - faithfulness: factor the 3x repeated local-scope extra dict. Docs - sync stale test-count table in CLAUDE.md (37->43, 11->12, 9->10). - sync driver.compute_statistics docstring params (cluster_selection, include_scores, max_fit_sample, n_triplets_per_point, cluster_annotations). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(stats): cluster-selection, silhouette-as-score, global faithfulness metrics
Summary
Adds a
protspace.statssubsystem that computes per-projection quality statistics atpreparation time and bakes them into the
.parquetbundleas an optional fifth part. This is theengine half of the projection-statistics MVP (tracking issue: tsenoner/protspace_web#219;
related: #31). The
protspace_webPR that consumes it follows separately — this PR lands first.Today the pipeline (
embed → project → annotate → bundle) ships coordinates with zero qualitymetrics, so judging a projection is purely visual. This change answers two questions per projection:
cluster_validity: KMeans with a distance-to-chord elbow estimate ofK, scored by silhouette, Davies–Bouldin, Calinski–Harabasz on the projection coords.
faithfulness: kNN-overlap and trustworthiness / continuitybetween the source embedding and the projection — i.e. how much the reduction distorted the
neighbourhood structure.
What's in this PR
src/protspace/stats/— a generalizedStatisticcontract (each statistic declares whether itneeds the embedding and returns one or more
StatRows) + a lazySTATISTICSregistry mirroringthe existing
REDUCERSpattern. sklearn imports stay function-local to preserve CLI startup.stats/cluster/kmeans_elbow.py— KMeans sweepK ∈ [2, min(round(√n), 50)], elbow viaperpendicular deviation from the first→last inertia chord (argmax index → K).
stats/metrics/validity.py— silhouette (seeded sample above threshold) / DB / CH at that K.stats/metrics/faithfulness.py— kNN-overlap@k and trustworthiness/continuity vs the embedding.stats/driver.py—compute_statistics(...)iterating registered statistics per projection,isolating per-statistic and per-projection failures (a bad reduction is logged and skipped,
never sinks the report).
space_kind, space_name, stat_family, label_kind, metric, metric_kind, value, extra_json). New statistics add rows, not columns.data/io/bundle.py) — an optional fifth partstatistics.parquet. Layoutcore(3) + settings? + statistics?; when statistics is present without settings, a zero-bytesettings slot keeps the fifth position unambiguous.
write_bundle/read_bundle/extract_bundle_to_dirandreplace_settings_in_bundleare all updated — the last soprotspace styleno longer silently drops a trailing stats part.ReductionPipeline.run(the one stage holding embeddings and projections)computes stats behind
prepare --stats/--no-stats; newprotspace stats -i emb.h5 -p project_dir -o statistics.parquetfor the discrete path;bundle -s/--statisticsfolds a stats parquet in.manifold.trustworthiness) isalready a core dep.
Robustness hardening (post adversarial review)
This branch was put through a multi-agent review; the confirmed findings are fixed here:
trustworthinessrequires
n_neighbors < n/2(strict); k is now clamped to(n-1)//2instead ofn-2.cluster_validityscores the full projection;faithfulnessscores the embedding-alignedsubset. Previously clustering could be scored on the id-intersection subset only.
default_metric) so faithfulness uses the run's metricrather than defaulting to euclidean for PCA/MDS/PaCMAP.
source column maps each projection to its own embedding.
n_clustersreports the achieved distinct-cluster count (KMeans can collapse on coincidentpoints), keeping the requested K in
extra.Tests
tests/test_stats.py,tests/test_stats_bundle.py,tests/test_stats_cli.py— known-answernumeric fixtures (blob separation; faithfulness on identity vs random projections; label-permutation
alignment), the 8-column schema, the 5-part bundle round-trip, and the
protspace styleround-trip.Scope (MVP) & non-goals
In: per-projection
cluster_validity(unsupervised/elbow) +faithfulness, baked at prep time,carried in the bundle. Explicitly out (non-breaking future expansions): embedding-space
cluster-validity, annotation-feature label sources, on-demand recompute, the broader
ProtSpaceExtractorpair/edge/set analyses (future typed bundle parts), and frontend rendering. Theregistry + long-format table leave seams for the scalar expansions.
Data-format change: additive, backward compatible — existing 3- and 4-part bundles read and
write unchanged.
Refs tsenoner/protspace_web#219, #31