Support sample size for the silhouette dist. by kei51e · Pull Request #1518 · exploratory-io/exploratory_func

kei51e · 2026-06-03T20:40:19Z

Description

This bounds the dist/silhouette work to a silhouette_sample_size subsample (default 5000). K-means clustering itself still runs on the full (or max_nrow) data; only the silhouette diagnostics are estimated from the subsample. Non-sampled rows get NA silhouette scores, scattered back to full length so the per-row result stays positionally aligned to the data.

Key changes (R/kmeans.R):

silhouette_sample_index(): pick the sorted, reproducible subsample index (returns the full index when sample_size is unset / non-positive / >= n).
compute_silhouette_per_row(sample_idx=): compute silhouette only on the subsample and scatter the results back to length n (NA for non-sampled rows).
iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the k-means input and the silhouette dist stay consistent.
exp_kmeans(silhouette_sample_size = 5000): new argument; bounds both the shared single-model path and the grouped per-group path.

Tests (tests/testthat/test_kmeans_1.R):

Grouped per-group subsampling (alignment + bounded scoring).
Positional correctness of the scatter, verified against an independently computed reference silhouette.
Cross-run reproducibility under a fixed seed (identical scores and NA positions).
iterate_silhouette() uses the supplied mat/dist and ignores df for the matrix build.

Checklist

Make sure you have performed the following items before submitting this pull request.
If not, please describe the reason.

Add test cases for this fix/enhancement
Pass devtools::check() (not run)
Pass devtools::test() (verified testthat::test_file("tests/testthat/test_kmeans_1.R"): 90 pass / 0 fail / 0 warn; full suite not run)
Test installing from github (not done)
Tested with Exploratory (not done)

🤖 Generated with Claude Code

…ows (#36126) The per-row and elbow/silhouette-method diagnostics call stats::dist(), which is O(n^2) in memory and can OOM on large data. Bound that work to a silhouette_sample_size subsample (default 5000); k-means itself still runs on the full (or max_nrow) data. Non-sampled rows get NA silhouette scores, scattered back to full length so the result stays positionally aligned to the data. - silhouette_sample_index(): pick the sorted, reproducible subsample index - compute_silhouette_per_row(sample_idx=): compute on the subsample, scatter to n - iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the kmeans input and the silhouette dist stay consistent Add tests for the grouped path, positional correctness of the scatter, cross-run reproducibility, and mat/dist consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a configurable subsampling mechanism to bound the O(n²) distance computation used by silhouette diagnostics in exp_kmeans(), while keeping k-means fitting on the full (or max_nrow) dataset and preserving per-row output alignment by scattering sampled silhouette values back to full length with NA for non-sampled rows.

Changes:

Introduces silhouette_sample_index() and a silhouette_sample_size argument to cap silhouette computations.
Updates silhouette diagnostics to compute on a subsample and scatter results back to full row alignment via compute_silhouette_per_row(sample_idx=).
Extends coverage in test_kmeans_1.R to validate bounded scoring, positional scatter correctness, grouped behavior, and reproducibility.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`R/kmeans.R`	Adds subsampling + scatter logic for silhouette diagnostics and threads it through `exp_kmeans()` / `iterate_silhouette()`.
`tests/testthat/test_kmeans_1.R`	Adds tests for subsample indexing, scatter alignment, grouped behavior, reproducibility, and mat/dist reuse.
`DESCRIPTION`	Bumps package version/date to reflect the change.

Comments suppressed due to low confidence (1)

R/kmeans.R:133

compute_silhouette_per_row() assumes sample_idx, mat, and (if provided) d are consistent, but it doesn’t validate this. If a caller accidentally passes a subset mat with a mismatched/duplicate/out-of-range sample_idx (or a dist built from a different matrix), the function can fail with hard-to-diagnose errors or silently mis-scatter results. Add explicit checks so misuse fails fast with a clear message.

  if (is.null(sample_idx)) {
    sample_idx <- seq_len(n)
  }
  # mat/d already correspond to sample_idx rows; the cluster ids must be subset to match.
  ids <- as.integer(cluster_ids[sample_idx])
  if (length(unique(ids)) < 2 || nrow(unique(mat)) < 2) {
    return(na_result)
  }
  if (is.null(d)) {
    d <- stats::dist(mat)
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kei51e · 2026-06-03T21:22:54Z

Re: the suppressed low-confidence note on compute_silhouette_per_row() (validating sample_idx/mat/d consistency) -- not adopting it. This is an internal, non-exported helper with two adjacent call sites in exp_kmeans() that build mat/sample_idx/dist consistently; there is no external caller that could pass mismatched arguments. The guards that matter (< 2 clusters, < 2 distinct points) are already present, and the scatter order is correct (silhouette() preserves the ids order, so widths[i] -> sample_idx[i]). Adding fail-fast validation here would be defensive complexity for a misuse that cannot occur.

kei51e changed the title ~~fix(kmeans): bound O(n^2) silhouette dist to silhouette_sample_size rows (#36126)~~ Support sample size for the silhouette dist. Jun 3, 2026

kei51e requested a review from Copilot June 3, 2026 21:09

Copilot started reviewing on behalf of kei51e June 3, 2026 21:09 View session

Copilot AI reviewed Jun 3, 2026

View reviewed changes

Comment thread R/kmeans.R

kei51e merged commit 324e2c1 into master Jun 3, 2026
3 of 4 checks passed

kei51e deleted the fix/issue-36126 branch June 3, 2026 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support sample size for the silhouette dist. #1518

Support sample size for the silhouette dist. #1518
kei51e merged 1 commit into
masterfrom
fix/issue-36126

kei51e commented Jun 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

kei51e commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kei51e commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

kei51e commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kei51e commented Jun 3, 2026 •

edited

Loading