tsenoner · Jul 2, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -71,13 +71,15 @@ protspace prepare -i <input> -m <methods> -o <output> [options]
 
 ### protspace stats Usage
 
-Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_*`/`silhouette_*` annotation columns + auto legend styles; faithfulness → each projection's `info_json.quality`.
+Compute per-projection quality statistics for an existing project directory (also available inline via `prepare --stats`). Cluster-validity → `statistics.parquet` (bundle 5th part) + per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (each value a `cluster N` label with the per-point silhouette attached as `|score`) + auto legend styles; faithfulness (local kNN + global metrics, tagged `scope`) → each projection's `info_json.quality`. `--cluster-selection elbow|silhouette|both` picks the K-selection method(s).
 
 ```bash
 # Standalone (embeddings needed for faithfulness)
 protspace stats -i emb.h5 -p project_dir -o statistics.parquet
 # Enrich annotations in place + emit cluster legend styles for `bundle --settings`
 protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --settings-out styles.json
+# Elbow + silhouette-optimal clusterings side by side
+protspace stats -i emb.h5 -p project_dir -o statistics.parquet -a annotations.parquet --cluster-selection both
 # Fold a stats parquet + settings into a bundle
 protspace bundle -p project_dir -a annotations.parquet -s statistics.parquet --settings styles.json -o out.parquetbundle
 ```
@@ -210,7 +212,7 @@ HDF5 file (float16 embeddings)
 ## Output Format
 
 `.parquetbundle` = concatenated Apache Parquet tables separated by `---PARQUET_DELIMITER---`:
-1. `protein_annotations` — identifier + annotation columns (incl. per-protein `cluster_*`/`silhouette_*` when `--stats`)
+1. `protein_annotations` — identifier + annotation columns (incl. per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership, with per-point silhouette attached as `value|score`, when `--stats`)
 2. `projections_metadata` — projection names, dimensions, parameters (faithfulness rides in `info_json.quality` when `--stats`)
 3. `projections_data` — reduced coordinates per protein per projection
 4. `settings` (optional) — annotation styles, pinned values, display config
@@ -238,9 +240,9 @@ uv run pytest tests/ --cov=src/protspace     # With coverage
 | `test_settings_converter.py` | 31 | Settings table ↔ visualization state conversion |
 | `test_uniprot_annotation_retriever.py` | 24 | UniProt API mocking, inactive entry resolution |
 | `test_pipeline_utils.py` | 70 | ReductionPipeline, EmbeddingSet, method parsing, multi-input merging, inline param overrides |
-| `test_stats.py` | 37 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity), subsample determinism/order-invariance, silhouette consistency |
-| `test_stats_cli.py` | 11 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard |
-| `test_stats_carriage.py` | 9 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
+| `test_stats.py` | 43 | Projection statistics: elbow, cluster-validity, faithfulness (dual continuity + global metrics), cluster-selection (elbow/silhouette/both), subsample determinism/order-invariance, silhouette consistency |
+| `test_stats_cli.py` | 12 | `protspace stats` CLI + `prepare` stats wiring, `--settings-out` guard, `--cluster-selection` validation |
+| `test_stats_carriage.py` | 10 | Routing rows to bundle parts (metadata quality, annotation columns, cluster legend) |
 | `test_stats_bundle.py` | 7 | Optional 5th (statistics) bundle part round-trip |
 | `test_biocentral_embedder.py` | 23 | Biocentral API client, embedding flow |
 | `test_fasta.py` | 17 | FASTA parsing, edge cases, CSV annotation loading |

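The `value|score` convention described above (a `cluster N` label with the per-point silhouette attached after `|`) can be read back with a small helper. This is a minimal sketch, not code from the repository: the function name `split_cluster_value` is hypothetical, and it assumes that an absent membership is stored as an empty string and that the score is omitted entirely when scores are suppressed.

```python
from typing import Optional, Tuple


def split_cluster_value(value: str) -> Tuple[Optional[str], Optional[float]]:
    """Split a membership value such as 'cluster 3|0.42' into (label, score).

    Hypothetical helper: assumes an empty string means "no membership" and
    that a value without '|' carries no silhouette score.
    """
    if not value:
        return None, None  # absent membership (assumed to be an empty string)
    label, sep, score = value.partition("|")
    if not sep or not score:
        return label, None  # label only, e.g. when scores are suppressed
    return label, float(score)
```

Splitting on the first `|` keeps the label intact even when a downstream consumer only needs the categorical part for colouring.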
diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ ProtSpace is a visualization tool for exploring **protein embeddings** or **simi
 
 - **Multiple projections**: PCA, UMAP, t-SNE, MDS, PaCMAP, LocalMAP
 - **Automatic annotations**: UniProt, InterPro, and Taxonomy
-- **Quality metrics** _(opt-in)_: cluster-validity + faithfulness per projection (`--stats`)
+- **Quality metrics** _(opt-in)_: per-projection cluster-validity + faithfulness (local & global) via `--stats`
 - **Structure viewer**: Integrated protein structure visualization
 - **Export**: PNG, PDF, SVG, HTML
 

diff --git a/docs/cli.md b/docs/cli.md
@@ -131,6 +131,7 @@ This produces three projections: `ProtT5 — PCA 2`, `ProtT5 — UMAP 2 (n=15)`,
 | `-o, --output` | Output directory. | `.` |
 | `--bundled / --no-bundled` | Bundle into single `.parquetbundle`. | bundled |
 | `--stats / --no-stats` | Compute projection quality statistics (cluster-validity + faithfulness). See [Projection Statistics](#projection-statistics---stats). | off |
+| `--cluster-selection` | With `--stats`, how to choose the cluster count K: `elbow`, `silhouette`, or `both`. | `elbow` |
 | `--keep-tmp` | Cache intermediates for resumability. | on |
 | `--no-log` | Skip writing `run.log`. | off |
 | `--dump-cache` | Print cached annotations and exit. | off |
@@ -185,17 +186,22 @@ Compute per-projection quality statistics for an existing project directory and
 protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet
 
 # Also enrich an annotations parquet in place with per-protein cluster-membership
-# + silhouette columns, and write the auto cluster-legend styles for `bundle`
+# columns, and write the auto cluster-legend styles for `bundle`
 protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
   -a annotations.parquet --settings-out cluster_styles.json
+
+# Emit both the elbow and the silhouette-optimal clustering
+protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
+  -a annotations.parquet --cluster-selection both
 ```
 
 | Flag | Description | Default |
 | ---- | ----------- | ------- |
 | `-i, --input` | HDF5 embedding file(s) (for faithfulness). Repeat for multi-embedding; `-i file.h5:name` to override the name. | — |
 | `-p, --projections` | Project directory with `projections_metadata.parquet` + `projections_data.parquet`. | — |
 | `-o, --output` | Output `statistics.parquet` path. | — |
-| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_*` / `silhouette_*` columns. | — |
+| `-a, --annotations` | Annotations parquet to enrich in place with per-protein `cluster_elbow_*` / `cluster_silhouette_*` membership columns (per-point silhouette attached as `value\|score`). | — |
+| `--cluster-selection` | Cluster count K selection: `elbow`, `silhouette`, or `both`. | `elbow` |
 | `--settings-out` | Write auto cluster-legend styles here (JSON) for `bundle --settings`. Requires `-a`. | — |
 | `--metric` | High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS). | `euclidean` |
 | `--seed` | Random seed. | `42` |
@@ -204,11 +210,15 @@ protspace stats -i embeddings/prot_t5.h5 -p projections/ -o statistics.parquet \
 
 `prepare --stats` (opt-in) and the standalone `protspace stats` command compute two families of per-projection quality metrics and bake them into the output:
 
-- **Cluster validity** — KMeans with an elbow-chosen K labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**. Written to the tidy `statistics.parquet` (the bundle's 5th part). Per-protein **cluster-membership** (`cluster_<projection>`) and **silhouette** (`silhouette_<projection>`) columns are also added to the annotations, and the membership columns get an auto Kelly-palette legend (the bundle's 4th settings part).
-- **Faithfulness** — how well the projection preserves the source embedding's neighbourhoods: **kNN-overlap**, **trustworthiness**, and **continuity**. These per-projection scalars ride in each projection's `info_json.quality`.
+- **Cluster validity** — KMeans labels the projection, scored by **silhouette**, **Davies–Bouldin**, and **Calinski–Harabasz**, written to the tidy `statistics.parquet` (the bundle's 5th part). The cluster count K is chosen by the inertia **elbow** and/or by **max silhouette** — `--cluster-selection elbow|silhouette|both`. Each selection also becomes a per-protein membership column — `cluster_elbow_<projection>` and/or `cluster_silhouette_<projection>` — with the point's **silhouette attached to its value** as `cluster N|<silhouette>` (the same `value|score` convention as UniProt evidence codes / InterPro bit scores; suppressed by `--no-scores`). Membership columns get an auto Kelly-palette legend (the bundle's 4th settings part); in `statistics.parquet` the two selections are distinguished by `label_kind` (`kmeans_elbow` / `kmeans_silhouette`).
+- **Faithfulness** — how well the projection preserves the source embedding's structure; each row is tagged `scope`:
+  - **local** (kNN-neighbourhood): **kNN-overlap**, **trustworthiness**, **continuity**.
+  - **global** (whole-layout): **random_triplet** (relative-ordering accuracy over random triplets, ∈[0,1]) and **spearman_distance** (rank correlation of all pairwise distances, ∈[−1,1]).
+
+  These per-projection scalars ride in each projection's `info_json.quality`.
 
 Notes:
-- Off by default — the compute (a KMeans elbow sweep) and the extra bundle columns/styles are opt-in.
+- Off by default — the compute (a KMeans sweep + faithfulness) and the extra bundle columns/styles are opt-in.
 - Uses the projection's own high-dim metric (e.g. `cosine`) for faithfulness; falls back to `--metric` / `euclidean` when the reducer doesn't record one.
 - Best-effort: a failure for one statistic or projection is logged and skipped, never failing the run. At large scale the heavier metrics are subsampled (silhouette/faithfulness) or fit on a bounded subsample (KMeans elbow) with a deterministic seed.
 

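For readers unfamiliar with the two global faithfulness metrics named in the docs above, the following is a rough, standard-library-only sketch of what `random_triplet` and `spearman_distance` measure. It is an illustration under assumptions, not the project's implementation: the helper names, the Euclidean distance, the triplet count, and the tie handling are all chosen here for clarity.

```python
import itertools
import math
import random


def _pairwise(points):
    """All pairwise Euclidean distances, in a fixed pair order."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


def _ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined (division by zero) if either side is constant


def spearman_distance(high, low):
    """Rank correlation of all pairwise distances, in [-1, 1]."""
    return _pearson(_ranks(_pairwise(high)), _ranks(_pairwise(low)))


def random_triplet(high, low, n_triplets=1000, seed=42):
    """Fraction of random triplets (i, j, k) whose ordering of d(i, j)
    versus d(i, k) agrees between the two spaces, in [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triplets):
        i, j, k = rng.sample(range(len(high)), 3)
        hi = math.dist(high[i], high[j]) < math.dist(high[i], high[k])
        lo = math.dist(low[i], low[j]) < math.dist(low[i], low[k])
        hits += hi == lo
    return hits / n_triplets
```

Both functions take the high-dimensional points and their projection in the same row order. Computing every pairwise distance is quadratic in the number of points, which is consistent with the note above that heavier metrics are subsampled at large scale.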
diff --git a/notebooks/ProtSpace_Preparation.ipynb b/notebooks/ProtSpace_Preparation.ipynb
@@ -313,7 +313,10 @@
    "source": [
     "## 📊 Quality statistics (optional)\n",
     "\n",
-    "Gauge how well each projection preserves your data. The CLI can bake two metric families into the bundle — **cluster-validity** (silhouette, Davies–Bouldin, Calinski–Harabasz) and **faithfulness** (kNN-overlap, trustworthiness, continuity):\n",
+    "Gauge how well each projection preserves your data. The CLI bakes two metric families into the bundle:\n",
+    "\n",
+    "- **cluster-validity** — silhouette, Davies–Bouldin, Calinski–Harabasz on a KMeans clustering; choose the cluster count K by `elbow`, `silhouette`, or `both` (`--cluster-selection`).\n",
+    "- **faithfulness** — *local* neighbourhood preservation (kNN-overlap, trustworthiness, continuity) and *global* layout preservation (random_triplet, spearman_distance).\n",
     "\n",
     "```bash\n",
     "# inline during prepare (opt-in)\n",
@@ -323,7 +326,7 @@
     "protspace stats -i embeddings.h5 -p output/tmp -o statistics.parquet\n",
     "```\n",
     "\n",
-    "These also add auto-colored per-protein `cluster_<projection>` / `silhouette_<projection>` columns you can explore directly in the viewer. See [the CLI docs](https://github.com/tsenoner/protspace/blob/main/docs/cli.md#projection-statistics---stats)."
+    "This also adds auto-colored per-protein membership columns (`cluster_elbow_<projection>` by default; `cluster_silhouette_<projection>` with `--cluster-selection silhouette` or `both`), with each point's silhouette attached to its value as a confidence score, that you can explore directly in the viewer. See [the CLI docs](https://github.com/tsenoner/protspace/blob/main/docs/cli.md#projection-statistics---stats)."
    ]
   },
   {

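The difference between the `elbow` and `silhouette` choices of K can be sketched from the per-K curves of a KMeans sweep. The elbow rule below (largest gap under the chord joining the first and last inertia values) is an assumption made for illustration; the diff does not show which elbow heuristic the project uses. All function names are hypothetical.

```python
from typing import Dict


def pick_k_silhouette(silhouette: Dict[int, float]) -> int:
    """K with the highest mean silhouette."""
    return max(silhouette, key=silhouette.get)


def pick_k_elbow(inertia: Dict[int, float]) -> int:
    """K where the inertia curve bends most (assumed chord-distance heuristic)."""
    ks = sorted(inertia)
    k0, k1 = ks[0], ks[-1]
    y0, y1 = inertia[k0], inertia[k1]
    if len(ks) < 3 or y0 == y1:
        return k0  # too few points, or a flat curve: no bend to find

    def gap(k: int) -> float:
        # Rescale both axes to [0, 1]; the chord is then y = 1 - x.
        x = (k - k0) / (k1 - k0)
        y = (inertia[k] - y1) / (y0 - y1)
        return (1 - x) - y

    return max(ks, key=gap)


def select_k(mode: str, inertia: Dict[int, float], silhouette: Dict[int, float]) -> Dict[str, int]:
    """Map each requested selection method to its chosen K."""
    picks = {}
    if mode in ("elbow", "both"):
        picks["elbow"] = pick_k_elbow(inertia)
    if mode in ("silhouette", "both"):
        picks["silhouette"] = pick_k_silhouette(silhouette)
    if not picks:
        raise ValueError(f"unknown cluster selection: {mode!r}")
    return picks
```

The two rules can disagree on the same sweep, which is the case `both` is meant to expose side by side.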
diff --git a/src/protspace/cli/common_options.py b/src/protspace/cli/common_options.py
@@ -16,6 +16,14 @@ class Metric(str, Enum):
     manhattan = "manhattan"
 
 
+class ClusterSelection(str, Enum):
+    """How `--stats` chooses the cluster count K."""
+
+    elbow = "elbow"  # inertia elbow (default)
+    silhouette = "silhouette"  # max-silhouette K
+    both = "both"  # emit both clusterings
+
+
 # ---------------------------------------------------------------------------
 # Shared option types
 # ---------------------------------------------------------------------------

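A short sketch of how the `ClusterSelection` enum added above could fan out into the names used elsewhere in this change: the `label_kind` values (`kmeans_elbow` / `kmeans_silhouette`) and the `cluster_elbow_<projection>` / `cluster_silhouette_<projection>` columns. The enum is copied from the diff; the helpers `label_kinds` and `column_names` are hypothetical and only illustrate the naming scheme.

```python
from enum import Enum
from typing import List, Union


class ClusterSelection(str, Enum):
    """How `--stats` chooses the cluster count K (as defined in the diff)."""

    elbow = "elbow"
    silhouette = "silhouette"
    both = "both"


def _methods(selection: Union[ClusterSelection, str]) -> List[str]:
    # Accepts the enum member or its plain string value; raises ValueError otherwise.
    selection = ClusterSelection(selection)
    if selection is ClusterSelection.both:
        return ["elbow", "silhouette"]
    return [selection.value]


def label_kinds(selection: Union[ClusterSelection, str]) -> List[str]:
    """Hypothetical helper: `label_kind` values expected in statistics.parquet."""
    return [f"kmeans_{m}" for m in _methods(selection)]


def column_names(selection: Union[ClusterSelection, str], projection: str) -> List[str]:
    """Hypothetical helper: per-protein membership column names."""
    return [f"cluster_{m}_{projection}" for m in _methods(selection)]
```

Because the enum mixes in `str`, passing `cluster_selection.value` (as `prepare.py` and `stats.py` do) or the member itself resolves to the same selection.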
diff --git a/src/protspace/cli/prepare.py b/src/protspace/cli/prepare.py
@@ -18,6 +18,7 @@
 
 from protspace.cli.app import app, setup_logging
 from protspace.cli.common_options import (
+    ClusterSelection,
     Metric,
     Opt_BatchSize,
     Opt_Eps,
@@ -120,8 +121,18 @@
     typer.Option(
         "--stats/--no-stats",
         help="Compute projection quality statistics (cluster-validity + "
-        "faithfulness); adds cluster_*/silhouette_* columns + legend styles to the "
-        "bundle. Opt-in (off by default): can be slow on large runs.",
+        "faithfulness); adds cluster_* membership columns (with per-point "
+        "silhouette confidence) + legend styles to the bundle. Opt-in (off by "
+        "default): can be slow on large runs.",
+        rich_help_panel="Output",
+    ),
+]
+Opt_ClusterSelection = Annotated[
+    ClusterSelection,
+    typer.Option(
+        "--cluster-selection",
+        help="With --stats, how to choose the cluster count K: 'elbow' (default), "
+        "'silhouette' (max-silhouette K), or 'both' (emit both clusterings).",
         rich_help_panel="Output",
     ),
 ]
@@ -301,6 +312,7 @@ def prepare(
     annotations: Opt_Annotations = None,
     scores: Opt_Scores = True,
     stats: Opt_Stats = False,
+    cluster_selection: Opt_ClusterSelection = ClusterSelection.elbow,
     refetch: Opt_Refetch = None,
     # Output
     output: Opt_Output = Path("."),
@@ -517,6 +529,7 @@ def prepare(
             keep_tmp=keep_tmp,
             no_scores=not scores,
             stats=stats,
+            cluster_selection=cluster_selection.value,
             refetch_stages=refetch_stages,
             annotations=annotation_list,
             intermediate_dir=cache_dir,

diff --git a/src/protspace/cli/stats.py b/src/protspace/cli/stats.py
@@ -14,6 +14,7 @@
 import typer
 
 from protspace.cli.app import app, setup_logging
+from protspace.cli.common_options import ClusterSelection
 
 logger = logging.getLogger(__name__)
 
@@ -148,11 +149,11 @@ def _merge_quality_into_metadata(meta_path: Path, quality_by_name: dict) -> None
 def _merge_annotations_with_columns(ann_path: Path, report) -> int:
     """Merge the report's per-protein ``AnnotationColumn``s into ``ann_path``.
 
-    Rewrites the annotations parquet in place with the computed ``cluster_*`` /
-    ``silhouette_*`` columns joined by identifier. Added columns are stringified
-    (membership → category labels, silhouette → numeric strings, absent → empty)
-    so they match the prepare path's all-string annotations and the frontend's
-    content-based type inference. Returns the number of columns added.
+    Rewrites the annotations parquet in place with the computed ``cluster_*``
+    membership columns joined by identifier (each value a ``cluster N`` label with
+    the per-point silhouette attached as ``|score``). Added columns are stringified
+    (absent → empty) so they match the prepare path's all-string annotations and the
+    frontend's content-based type inference. Returns the number of columns added.
     """
     import pyarrow as pa
     import pyarrow.parquet as pq
@@ -199,7 +200,8 @@ def stats(
             "-a",
             "--annotations",
             help="Annotations parquet to enrich in place with per-protein "
-            "cluster-membership + silhouette columns. Omit to skip per-protein outputs.",
+            "cluster-membership columns (per-point silhouette attached as |score). "
+            "Omit to skip per-protein outputs.",
         ),
     ] = None,
     settings_out: Annotated[
@@ -218,6 +220,14 @@ def stats(
             help="High-dim distance metric for faithfulness when the projection metadata omits one (e.g. PCA/MDS).",
         ),
     ] = "euclidean",
+    cluster_selection: Annotated[
+        ClusterSelection,
+        typer.Option(
+            "--cluster-selection",
+            help="How to choose the cluster count K: 'elbow' (default), 'silhouette' "
+            "(max-silhouette K), or 'both' (emit both clusterings).",
+        ),
+    ] = ClusterSelection.elbow,
     verbose: Annotated[
         int, typer.Option("-v", "--verbose", count=True, help="Increase verbosity.")
     ] = 0,
@@ -251,10 +261,12 @@ def stats(
     )
 
     reductions = _load_reductions(projections, default_metric=metric)
-    # Per-protein outputs (cluster membership + per-point silhouette) are only
-    # computed when there's an annotations file to land them in — silhouette_samples
+    # Per-protein output (cluster membership with attached per-point silhouette) is
+    # only computed when there's an annotations file to land it in — silhouette_samples
     # is O(n^2), so we don't pay for it with nowhere to write.
-    params = {} if annotations is not None else {"cluster_annotations": False}
+    params = {"cluster_selection": cluster_selection.value}
+    if annotations is None:
+        params["cluster_annotations"] = False
     report = compute_statistics(
         embedding_sets,
         reductions,

diff --git a/src/protspace/data/processors/pipeline.py b/src/protspace/data/processors/pipeline.py
@@ -74,6 +74,7 @@ class PipelineConfig:
     keep_tmp: bool = False
     no_scores: bool = False
     stats: bool = False
+    cluster_selection: str = "elbow"  # elbow | silhouette | both (for --stats)
     refetch_stages: frozenset[str] = field(default_factory=frozenset)
     annotations: list[str] | None = None
     intermediate_dir: Path | None = None
@@ -731,6 +732,12 @@ def _compute_statistics(
                 embedding_sets,
                 all_reductions,
                 rng_seed=self.config.reducer_params.random_state,
+                params={
+                    "cluster_selection": self.config.cluster_selection,
+                    # Silhouette-as-confidence on cluster values is a score, so it
+                    # honours --no-scores like UniProt/InterPro annotation scores.
+                    "include_scores": not self.config.no_scores,
+                },
                 # Faithfulness high-dim metric: reducers like PCA/MDS/PaCMAP omit
                 # 'metric' from their params, so fall back to the run's metric
                 # rather than silently assuming euclidean.