Merged
12 changes: 7 additions & 5 deletions CLAUDE.md
@@ -71,13 +71,15 @@ protspace prepare -i <input> -m <methods> -o <output> [options]

### protspace stats Usage

Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_*`/`silhouette_*` annotation columns + auto legend styles; faithfulness → each projection's `info_json.quality`.
Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles; faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).

```bash
# Standalone (embeddings needed for faithfulness)
protspace stats -i emb.h5 -p project_dir -o statistics.parquet
# Enrich annotations in place + emit cluster legend styles for `bundle --settings`
protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --settings-out styles.json
# Elbow + silhouette-optimal clusterings side by side
protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --cluster-selection both
# Fold a stats parquet + settings into a bundle
protspace bundle -p project_dir -a annotations.parquet -s statistics.parquet --settings styles.json -o out.parquetbundle
```
@@ -210,7 +212,7 @@ HDF5 file (float16 embeddings)
## Output Format

`.parquetbundle` = concatenated Apache Parquet tables separated by `---PARQUET_DELIMITER---`:
1. `protein_annotations` — identifier + annotation columns (incl. per-protein `cluster_*`/`silhouette_*` when `--stats`)
1. `protein_annotations` — identifier + annotation columns (incl. per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership, with per-point silhouette attached as `value|score`, when `--stats`)
2. `projections_metadata` — projection names, dimensions, parameters (faithfulness rides in `info_json.quality` when `--stats`)
3. `projections_data` — reduced coordinates per protein per projection
4. `settings` (optional) — annotation styles, pinned values, display config
@@ -238,9 +240,9 @@ uv run pytest tests/ --cov=src/protspace # With coverage
| `test_settings_converter.py` | 31 | Settings table ↔ visualization state conversion |
| `test_uniprot_annotation_retriever.py` | 24 | UniProt API mocking, inactive entry resolution |
| `test_pipeline_utils.py` | 70 | ReductionPipeline, EmbeddingSet, method parsing, multi-input merging, inline param overrides |
| `test_stats.py` | 37 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity), subsample determinism/order-invariance, silhouette consistency |
| `test_stats_cli.py` | 11 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard |
| `test_stats_carriage.py` | 9 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
| `test_stats.py` | 43 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
| `test_stats_cli.py` | 12 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard, `--cluster-selection` validation |
| `test_stats_carriage.py` | 10 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
| `test_stats_bundle.py` | 7 | Optional 5th (statistics) bundle part round-trip |
| `test_biocentral_embedder.py` | 23 | Biocentral API client, embedding flow |
| `test_fasta.py` | 17 | FASTA parsing, edge cases, CSV annotation loading |
2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@ ProtSpace is a visualization tool for exploring **protein embeddings** or **simi

- **Multiple projections**: PCA, UMAP, t-SNE, MDS, PaCMAP, LocalMAP
- **Automatic annotations**: UniProt, InterPro, and Taxonomy
- **Quality metrics** _(opt-in)_: cluster-validity + faithfulness per projection (`--stats`)
- **Quality metrics** _(opt-in)_: per-projection cluster-validity + faithfulness (local & global) via `--stats`
- **Structure viewer**: Integrated protein structure visualization
- **Export**: PNG, PDF, SVG, HTML

20 changes: 15 additions & 5 deletions docs/cli.md
@@ -131,6 +131,7 @@ This produces three projections: `ProtT5 — PCA 2`, `ProtT5 — UMAP 2 (n=15)`,
| `-o, --output` | Output directory. | `.` |
| `--bundled / --no-bundled` | Bundle into single `.parquetbundle`. | bundled |
| `--stats / --no-stats` | Compute projection quality statistics (cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
| `--cluster-selection` | With `--stats`, how to choose the cluster count K: `elbow`, `silhouette`, or `both`. | `elbow` |
| `--keep-tmp` | Cache intermediates for resumability. | on |
| `--no-log` | Skip writing `run.log`. | off |
| `--dump-cache` | Print cached annotations and exit. | off |
@@ -185,17 +186,22 @@ Compute per-projection quality statistics for an existing project directory and
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet

# Also enrich an annotations parquet in place with per-protein cluster-membership
# + silhouette columns, and write the auto cluster-legend styles for `bundle`
# columns, and write the auto cluster-legend styles for `bundle`
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
-a annotations.parquet --settings-out cluster_styles.json

# Emit both the elbow and the silhouette-optimal clustering
protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
-a annotations.parquet --cluster-selection both
```

| Flag | Description | Default |
| ---- | ----------- | ------- |
| `-i, --input` | HDF5 embedding file(s) (for faithfulness). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
| `-p, --projections` | Project directory with `projections_metadata.parquet` + `projections_data.parquet`. | — |
| `-o, --output` | Output `statistics.parquet` path. | — |
| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` / `silhouette_*` columns. | — |
| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` membership columns (per-point silhouette attached as `value\|score`). | — |
| `--cluster-selection` | Cluster count K selection: `elbow`, `silhouette`, or `both`. | `elbow` |
| `--settings-out` | Write auto cluster-legend styles here (JSON) for `bundle --settings`. Requires `-a`. | — |
| `--metric` | High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS). | `euclidean` |
| `--seed` | Random seed. | `42` |
@@ -204,11 +210,15 @@ protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \

`prepare --stats` (opt-in) and the standalone `protspace stats` command compute two families of per-projection quality metrics and bake them into the output:

- **Cluster validity** — KMeans with an elbow-chosen K labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**. Written to the tidy `statistics.parquet` (the bundle's 5th part). Per-protein **cluster-membership** (`cluster_<projection>`) and **silhouette** (`silhouette_<projection>`) columns are also added to the annotations, and the membership columns get an auto Kelly-palette legend (the bundle's 4th settings part).
- **Faithfulness** — how well the projection preserves the source embedding's neighbourhoods: **kNN-overlap**, **trustworthiness**, and **continuity**. These per-projection scalars ride in each projection's `info_json.quality`.
- **Cluster validity** — KMeans labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**, written to the tidy `statistics.parquet` (the bundle's 5th part). The cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
- **Faithfulness** — how well the projection preserves the source embedding's structure; each row is tagged `scope`:
- **local** (kNN-neighbourhood): **kNN-overlap**, **trustworthiness**, **continuity**.
- **global** (whole-layout): **random_triplet** (relative-ordering accuracy over random triplets, ∈[0,1]) and **spearman_distance** (rank correlation of all pairwise distances, ∈[−1,1]).

These per-projection scalars ride in each projection's `info_json.quality`.
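
As a rough illustration of the `value|score` convention described above, the sketch below splits a membership value back into its label and silhouette. The helper `parse_cluster_value` and the `ClusterValue` type are hypothetical names, not part of ProtSpace. The sketch assumes the score is whatever follows the last `|`, that values carry no score under `--no-scores`, and that absent values are empty strings.

```python
from typing import NamedTuple, Optional


class ClusterValue(NamedTuple):
    """Hypothetical container for one parsed membership value."""

    label: str              # e.g. "cluster 3"
    score: Optional[float]  # per-point silhouette, or None when not attached


def parse_cluster_value(raw: str) -> Optional[ClusterValue]:
    """Split a `cluster N|<silhouette>` string into label and score.

    Returns None for an empty (absent) value. A value without a `|`, or with
    a non-numeric suffix, is returned whole as the label with score=None.
    """
    if not raw:
        return None
    label, sep, score_text = raw.rpartition("|")
    if not sep:
        # No separator: the value is a bare label (e.g. scores suppressed).
        return ClusterValue(raw, None)
    try:
        return ClusterValue(label, float(score_text))
    except ValueError:
        # Suffix is not a number, so treat the whole string as the label.
        return ClusterValue(raw, None)
```

For example, `parse_cluster_value("cluster 3|0.42")` yields the label `cluster 3` with score `0.42`, while `parse_cluster_value("cluster 3")` yields the same label with no score.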

Notes:
- Off by default — the compute (a KMeans elbow sweep) and the extra bundle columns/styles are opt-in.
- Off by default — the compute (a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
- Uses the projection's own high-dim metric (e.g. `cosine`) for faithfulness; falls back to `--metric` / `euclidean` when the reducer doesn't record one.
- Best-effort: a failure for one statistic or projection is logged and skipped, never failing the run. At large scale the heavier metrics are subsampled (silhouette/faithfulness) or fit on a bounded subsample (KMeans elbow) with a deterministic seed.
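
To make the two `--cluster-selection` modes concrete, here is a minimal sketch that picks K from an already-computed KMeans sweep. Everything in it is an assumption for illustration: the function names are hypothetical, and the elbow rule used (the K whose inertia lies furthest below the straight line joining the first and last points of the sweep) is a common heuristic that may differ from what ProtSpace implements. Only the `kmeans_elbow` / `kmeans_silhouette` keys mirror the `label_kind` values named above.

```python
from __future__ import annotations


def pick_k_by_silhouette(silhouette_by_k: dict[int, float]) -> int:
    """Return the K with the highest mean silhouette (smallest K on ties)."""
    return max(sorted(silhouette_by_k), key=lambda k: silhouette_by_k[k])


def pick_k_by_elbow(inertia_by_k: dict[int, float]) -> int:
    """Return the K furthest below the chord of the normalised inertia curve.

    Assumed heuristic, not necessarily ProtSpace's own elbow rule.
    """
    ks = sorted(inertia_by_k)
    if len(ks) < 3:
        return ks[0]  # too few points to have an interior elbow
    k_lo, k_hi = ks[0], ks[-1]
    i_lo, i_hi = inertia_by_k[k_lo], inertia_by_k[k_hi]
    i_span = (i_lo - i_hi) or 1.0  # guard against a flat curve

    def gap_below_chord(k: int) -> float:
        x = (k - k_lo) / (k_hi - k_lo)          # K rescaled to [0, 1]
        y = (inertia_by_k[k] - i_hi) / i_span   # inertia rescaled to [0, 1]
        return 1.0 - x - y                      # chord is y = 1 - x

    return max(ks, key=gap_below_chord)


def select_k(
    mode: str,
    inertia_by_k: dict[int, float],
    silhouette_by_k: dict[int, float],
) -> dict[str, int]:
    """Map each requested selection method to its chosen K."""
    if mode not in {"elbow", "silhouette", "both"}:
        raise ValueError(f"unknown cluster selection: {mode!r}")
    chosen: dict[str, int] = {}
    if mode in ("elbow", "both"):
        chosen["kmeans_elbow"] = pick_k_by_elbow(inertia_by_k)
    if mode in ("silhouette", "both"):
        chosen["kmeans_silhouette"] = pick_k_by_silhouette(silhouette_by_k)
    return chosen
```

With `both`, the two methods can disagree on K, which is why each selection gets its own membership column rather than overwriting the other.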

7 changes: 5 additions & 2 deletions notebooks/ProtSpace_Preparation.ipynb
@@ -313,7 +313,10 @@
"source": [
"## 📊 Quality statistics (optional)\n",
"\n",
"Gauge how well each projection preserves your data. The CLI can bake two metric families into the bundle — **cluster-validity** (silhouette, Davies–Bouldin, Calinski–Harabasz) and **faithfulness** (kNN-overlap, trustworthiness, continuity):\n",
"Gauge how well each projection preserves your data. The CLI bakes two metric families into the bundle:\n",
"\n",
"- **cluster-validity** — silhouette, Davies–Bouldin, Calinski–Harabasz on a KMeans clustering; choose the cluster count K by `elbow`, `silhouette`, or `both` (`--cluster-selection`).\n",
"- **faithfulness** — *local* neighbourhood preservation (kNN-overlap, trustworthiness, continuity) and *global* layout preservation (random_triplet, spearman_distance).\n",
"\n",
"```bash\n",
"# inline during prepare (opt-in)\n",
@@ -323,7 +326,7 @@
"protspace stats -i embeddings.h5 -p output/tmp -o statistics.parquet\n",
"```\n",
"\n",
"These also add auto-colored per-protein `cluster_<projection>` / `silhouette_<projection>` columns you can explore directly in the viewer. See [the CLI docs](https://github.com/tsenoner/protspace/blob/main/docs/cli.md#projection-statistics---stats)."
"This also adds an auto-colored per-protein `cluster_elbow_<projection>` membership column — with each point's silhouette confidence attached to its value — that you can explore directly in the viewer. See [the CLI docs](https://github.com/tsenoner/protspace/blob/main/docs/cli.md#projection-statistics---stats)."
]
},
{
8 changes: 8 additions & 0 deletions src/protspace/cli/common_options.py
@@ -16,6 +16,14 @@ class Metric(str, Enum):
manhattan = "manhattan"


class ClusterSelection(str, Enum):
"""How `--stats` chooses the cluster count K."""

elbow = "elbow" # inertia elbow (default)
silhouette = "silhouette" # max-silhouette K
both = "both" # emit both clusterings


# ---------------------------------------------------------------------------
# Shared option types
# ---------------------------------------------------------------------------
17 changes: 15 additions & 2 deletions src/protspace/cli/prepare.py
@@ -18,6 +18,7 @@

from protspace.cli.app import app, setup_logging
from protspace.cli.common_options import (
ClusterSelection,
Metric,
Opt_BatchSize,
Opt_Eps,
@@ -120,8 +121,18 @@
typer.Option(
"--stats/--no-stats",
help="Compute projection quality statistics (cluster-validity + "
"faithfulness); adds cluster_*/silhouette_* columns + legend styles to the "
"bundle. Opt-in (off by default): can be slow on large runs.",
"faithfulness); adds cluster_* membership columns (with per-point "
"silhouette confidence) + legend styles to the bundle. Opt-in (off by "
"default): can be slow on large runs.",
rich_help_panel="Output",
),
]
Opt_ClusterSelection = Annotated[
ClusterSelection,
typer.Option(
"--cluster-selection",
help="With --stats, how to choose the cluster count K: 'elbow' (default), "
"'silhouette' (max-silhouette K), or 'both' (emit both clusterings).",
rich_help_panel="Output",
),
]
@@ -301,6 +312,7 @@ def prepare(
annotations: Opt_Annotations = None,
scores: Opt_Scores = True,
stats: Opt_Stats = False,
cluster_selection: Opt_ClusterSelection = ClusterSelection.elbow,
refetch: Opt_Refetch = None,
# Output
output: Opt_Output = Path("."),
@@ -517,6 +529,7 @@ def prepare(
keep_tmp=keep_tmp,
no_scores=not scores,
stats=stats,
cluster_selection=cluster_selection.value,
refetch_stages=refetch_stages,
annotations=annotation_list,
intermediate_dir=cache_dir,
30 changes: 21 additions & 9 deletions src/protspace/cli/stats.py
@@ -14,6 +14,7 @@
import typer

from protspace.cli.app import app, setup_logging
from protspace.cli.common_options import ClusterSelection

logger = logging.getLogger(__name__)

@@ -148,11 +149,11 @@ def _merge_quality_into_metadata(meta_path: Path, quality_by_name: dict) -> None
def _merge_annotations_with_columns(ann_path: Path, report) -> int:
"""Merge the report's per-protein ``AnnotationColumn``s into ``ann_path``.

Rewrites the annotations parquet in place with the computed ``cluster_*`` /
``silhouette_*`` columns joined by identifier. Added columns are stringified
(membership → category labels, silhouette → numeric strings, absent → empty)
so they match the prepare path's all-string annotations and the frontend's
content-based type inference. Returns the number of columns added.
Rewrites the annotations parquet in place with the computed ``cluster_*``
membership columns joined by identifier (each value a ``cluster N`` label with
the per-point silhouette attached as ``|score``). Added columns are stringified
(absent → empty) so they match the prepare path's all-string annotations and the
frontend's content-based type inference. Returns the number of columns added.
"""
import pyarrow as pa
import pyarrow.parquet as pq
@@ -199,7 +200,8 @@ def stats(
"-a",
"--annotations",
help="Annotations parquet to enrich in place with per-protein "
"cluster-membership + silhouette columns. Omit to skip per-protein outputs.",
"cluster-membership columns (per-point silhouette attached as |score). "
"Omit to skip per-protein outputs.",
),
] = None,
settings_out: Annotated[
@@ -218,6 +220,14 @@ def stats(
help="High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS).",
),
] = "euclidean",
cluster_selection: Annotated[
ClusterSelection,
typer.Option(
"--cluster-selection",
help="How to choose the cluster count K: 'elbow' (default), 'silhouette' "
"(max-silhouette K), or 'both' (emit both clusterings).",
),
] = ClusterSelection.elbow,
verbose: Annotated[
int, typer.Option("-v", "--verbose", count=True, help="Increase verbosity.")
] = 0,
@@ -251,10 +261,12 @@ def stats(
)

reductions = _load_reductions(projections, default_metric=metric)
# Per-protein outputs (cluster membership + per-point silhouette) are only
# computed when there's an annotations file to land them in — silhouette_samples
# Per-protein output (cluster membership with attached per-point silhouette) is
# only computed when there's an annotations file to land it in — silhouette_samples
# is O(n^2), so we don't pay for it with nowhere to write.
params = {} if annotations is not None else {"cluster_annotations": False}
params = {"cluster_selection": cluster_selection.value}
if annotations is None:
params["cluster_annotations"] = False
report = compute_statistics(
embedding_sets,
reductions,
7 changes: 7 additions & 0 deletions src/protspace/data/processors/pipeline.py
@@ -74,6 +74,7 @@ class PipelineConfig:
keep_tmp: bool = False
no_scores: bool = False
stats: bool = False
cluster_selection: str = "elbow" # elbow | silhouette | both (for --stats)
refetch_stages: frozenset[str] = field(default_factory=frozenset)
annotations: list[str] | None = None
intermediate_dir: Path | None = None
@@ -731,6 +732,12 @@ def _compute_statistics(
embedding_sets,
all_reductions,
rng_seed=self.config.reducer_params.random_state,
params={
"cluster_selection": self.config.cluster_selection,
# Silhouette-as-confidence on cluster values is a score, so it
# honours --no-scores like UniProt/InterPro annotation scores.
"include_scores": not self.config.no_scores,
},
# Faithfulness high-dim metric: reducers like PCA/MDS/PaCMAP omit
# 'metric' from their params, so fall back to the run's metric
# rather than silently assuming euclidean.