Support sample size for the silhouette dist. #1518
Conversation
…ows (#36126) The per-row and elbow/silhouette-method diagnostics call stats::dist(), which is O(n^2) in memory and can OOM on large data. Bound that work to a silhouette_sample_size subsample (default 5000); k-means itself still runs on the full (or max_nrow) data. Non-sampled rows get NA silhouette scores, scattered back to full length so the result stays positionally aligned to the data. - silhouette_sample_index(): pick the sorted, reproducible subsample index - compute_silhouette_per_row(sample_idx=): compute on the subsample, scatter to n - iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the kmeans input and the silhouette dist stay consistent Add tests for the grouped path, positional correctness of the scatter, cross-run reproducibility, and mat/dist consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a configurable subsampling mechanism to bound the O(n²) distance computation used by silhouette diagnostics in exp_kmeans(), while keeping k-means fitting on the full (or max_nrow) dataset and preserving per-row output alignment by scattering sampled silhouette values back to full length with NA for non-sampled rows.
Changes:
- Introduces
silhouette_sample_index()and asilhouette_sample_sizeargument to cap silhouette computations. - Updates silhouette diagnostics to compute on a subsample and scatter results back to full row alignment via
compute_silhouette_per_row(sample_idx=). - Extends coverage in
test_kmeans_1.Rto validate bounded scoring, positional scatter correctness, grouped behavior, and reproducibility.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
R/kmeans.R |
Adds subsampling + scatter logic for silhouette diagnostics and threads it through exp_kmeans() / iterate_silhouette(). |
tests/testthat/test_kmeans_1.R |
Adds tests for subsample indexing, scatter alignment, grouped behavior, reproducibility, and mat/dist reuse. |
DESCRIPTION |
Bumps package version/date to reflect the change. |
Comments suppressed due to low confidence (1)
R/kmeans.R:133
compute_silhouette_per_row()assumessample_idx,mat, and (if provided)dare consistent, but it doesn’t validate this. If a caller accidentally passes a subsetmatwith a mismatched/duplicate/out-of-rangesample_idx(or adistbuilt from a different matrix), the function can fail with hard-to-diagnose errors or silently mis-scatter results. Add explicit checks so misuse fails fast with a clear message.
if (is.null(sample_idx)) {
sample_idx <- seq_len(n)
}
# mat/d already correspond to sample_idx rows; the cluster ids must be subset to match.
ids <- as.integer(cluster_ids[sample_idx])
if (length(unique(ids)) < 2 || nrow(unique(mat)) < 2) {
return(na_result)
}
if (is.null(d)) {
d <- stats::dist(mat)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Re: the suppressed low-confidence note on |
Description
This bounds the dist/silhouette work to a
silhouette_sample_sizesubsample (default 5000). K-means clustering itself still runs on the full (ormax_nrow) data; only the silhouette diagnostics are estimated from the subsample. Non-sampled rows getNAsilhouette scores, scattered back to full length so the per-row result stays positionally aligned to the data.Key changes (
R/kmeans.R):silhouette_sample_index(): pick the sorted, reproducible subsample index (returns the full index whensample_sizeis unset / non-positive />= n).compute_silhouette_per_row(sample_idx=): compute silhouette only on the subsample and scatter the results back to lengthn(NAfor non-sampled rows).iterate_silhouette(mat=): reuse the same subsampled scaled matrix so the k-means input and the silhouettediststay consistent.exp_kmeans(silhouette_sample_size = 5000): new argument; bounds both the shared single-model path and the grouped per-group path.Tests (
tests/testthat/test_kmeans_1.R):iterate_silhouette()uses the suppliedmat/distand ignoresdffor the matrix build.Checklist
Make sure you have performed the following items before submitting this pull request.
If not, please describe the reason.
testthat::test_file("tests/testthat/test_kmeans_1.R"): 90 pass / 0 fail / 0 warn; full suite not run)🤖 Generated with Claude Code