Skip to content

Support sample size for the silhouette dist. #1518

Merged
kei51e merged 1 commit into
masterfrom
fix/issue-36126
Jun 3, 2026
Merged

Support sample size for the silhouette dist. #1518
kei51e merged 1 commit into
masterfrom
fix/issue-36126

Conversation

@kei51e

@kei51e kei51e commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Description

This bounds the dist/silhouette work to a silhouette_sample_size subsample (default 5000). K-means clustering itself still runs on the full (or max_nrow) data; only the silhouette diagnostics are estimated from the subsample. Non-sampled rows get NA silhouette scores, scattered back to full length so the per-row result stays positionally aligned to the data.

Key changes (R/kmeans.R):

  • silhouette_sample_index(): pick the sorted, reproducible subsample index (returns the full index when sample_size is unset / non-positive / >= n).
  • compute_silhouette_per_row(sample_idx=): compute silhouette only on the subsample and scatter the results back to length n (NA for non-sampled rows).
  • iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the k-means input and the silhouette dist stay consistent.
  • exp_kmeans(silhouette_sample_size = 5000): new argument; bounds both the shared single-model path and the grouped per-group path.

Tests (tests/testthat/test_kmeans_1.R):

  • Grouped per-group subsampling (alignment + bounded scoring).
  • Positional correctness of the scatter, verified against an independently computed reference silhouette.
  • Cross-run reproducibility under a fixed seed (identical scores and NA positions).
  • iterate_silhouette() uses the supplied mat/dist and ignores df for the matrix build.

Checklist

Make sure you have performed the following items before submitting this pull request.
If not, please describe the reason.

  • Add test cases for this fix/enhancement
  • Pass devtools::check() (not run)
  • Pass devtools::test() (verified testthat::test_file("tests/testthat/test_kmeans_1.R"): 90 pass / 0 fail / 0 warn; full suite not run)
  • Test installing from github (not done)
  • Tested with Exploratory (not done)

🤖 Generated with Claude Code

…ows (#36126)

The per-row and elbow/silhouette-method diagnostics call stats::dist(), which is
O(n^2) in memory and can OOM on large data. Bound that work to a
silhouette_sample_size subsample (default 5000); k-means itself still runs on the
full (or max_nrow) data. Non-sampled rows get NA silhouette scores, scattered back
to full length so the result stays positionally aligned to the data.

- silhouette_sample_index(): pick the sorted, reproducible subsample index
- compute_silhouette_per_row(sample_idx=): compute on the subsample, scatter to n
- iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the kmeans
  input and the silhouette dist stay consistent

Add tests for the grouped path, positional correctness of the scatter, cross-run
reproducibility, and mat/dist consistency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kei51e kei51e changed the title fix(kmeans): bound O(n^2) silhouette dist to silhouette_sample_size rows (#36126) Support sample size for the silhouette dist. Jun 3, 2026
@kei51e kei51e requested a review from Copilot June 3, 2026 21:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable subsampling mechanism to bound the O(n²) distance computation used by silhouette diagnostics in exp_kmeans(), while keeping k-means fitting on the full (or max_nrow) dataset and preserving per-row output alignment by scattering sampled silhouette values back to full length with NA for non-sampled rows.

Changes:

  • Introduces silhouette_sample_index() and a silhouette_sample_size argument to cap silhouette computations.
  • Updates silhouette diagnostics to compute on a subsample and scatter results back to full row alignment via compute_silhouette_per_row(sample_idx=).
  • Extends coverage in test_kmeans_1.R to validate bounded scoring, positional scatter correctness, grouped behavior, and reproducibility.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
R/kmeans.R Adds subsampling + scatter logic for silhouette diagnostics and threads it through exp_kmeans() / iterate_silhouette().
tests/testthat/test_kmeans_1.R Adds tests for subsample indexing, scatter alignment, grouped behavior, reproducibility, and mat/dist reuse.
DESCRIPTION Bumps package version/date to reflect the change.
Comments suppressed due to low confidence (1)

R/kmeans.R:133

  • compute_silhouette_per_row() assumes sample_idx, mat, and (if provided) d are consistent, but it doesn’t validate this. If a caller accidentally passes a subset mat with a mismatched/duplicate/out-of-range sample_idx (or a dist built from a different matrix), the function can fail with hard-to-diagnose errors or silently mis-scatter results. Add explicit checks so misuse fails fast with a clear message.
  if (is.null(sample_idx)) {
    sample_idx <- seq_len(n)
  }
  # mat/d already correspond to sample_idx rows; the cluster ids must be subset to match.
  ids <- as.integer(cluster_ids[sample_idx])
  if (length(unique(ids)) < 2 || nrow(unique(mat)) < 2) {
    return(na_result)
  }
  if (is.null(d)) {
    d <- stats::dist(mat)
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/kmeans.R
@kei51e

kei51e commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Re: the suppressed low-confidence note on compute_silhouette_per_row() (validating sample_idx/mat/d consistency) -- not adopting it. This is an internal, non-exported helper with two adjacent call sites in exp_kmeans() that build mat/sample_idx/dist consistently; there is no external caller that could pass mismatched arguments. The guards that matter (< 2 clusters, < 2 distinct points) are already present, and the scatter order is correct (silhouette() preserves the ids order, so widths[i] -> sample_idx[i]). Adding fail-fast validation here would be defensive complexity for a misuse that cannot occur.

@kei51e kei51e merged commit 324e2c1 into master Jun 3, 2026
3 of 4 checks passed
@kei51e kei51e deleted the fix/issue-36126 branch June 3, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants