Skip to content

feat: add Leiden cluster detection for file dependency graph#143

Merged
ArtemisMucaj merged 7 commits into
mainfrom
claude/leiden-cluster-detection-rTrgU
Apr 13, 2026
Merged

feat: add Leiden cluster detection for file dependency graph#143
ArtemisMucaj merged 7 commits into
mainfrom
claude/leiden-cluster-detection-rTrgU

Conversation

@ArtemisMucaj
Copy link
Copy Markdown
Owner

@ArtemisMucaj ArtemisMucaj commented Apr 11, 2026

Implements community detection on the file-level call graph using the
Leiden algorithm (Traag et al. 2019), which guarantees internally-connected
clusters unlike Louvain.

Key design points:

  • Differentiated edge weights by reference kind (Call=1.0, Inheritance=0.8,
    Implementation=0.7, TypeReference=0.6, Import=0.5, Unknown=0.3).
  • O(edges) batch cohesion computation: build file→cluster map once, walk
    edges once, classify as internal or external.
  • Four-step cluster naming: common directory → dominant symbol (>40%) →
    frequent keyword from symbol names → "{dir}-{keyword}" slug.
  • Fallback to singleton clusters when graph has fewer than 10 nodes.
  • Three-phase Leiden loop (local moving → refinement → aggregation) capped
    at 50 iterations with a 1e-6 modularity-gain convergence threshold.

New public surface:

  • domain: Cluster, ClusterGraph
  • use case: ClusterDetectionUseCase with detect, cluster_for_file,
    architecture_overview
  • CLI: codesearch clusters list|get|overview

All 125 existing tests continue to pass; 8 new unit tests cover the
algorithm helpers (kind_weight, renumber, leiden, slugify, naming).

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D

Summary by CodeRabbit

  • New Features
    • Automatic cluster detection grouping related files with per-cluster name, dominant language, size, and cohesion
    • New CLI/API endpoints: clusters list (text/JSON), clusters get <file> (text/JSON), clusters overview — architecture table of top inter-cluster dependencies
  • Models
    • Added serializable Cluster and ClusterGraph outputs for reporting and API use
  • Tests
    • Added unit tests for clustering behavior and naming/cohesion heuristics

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds file-level clustering via a Leiden-style detector: new petgraph dependency, domain cluster models, a ClusterDetectionUseCase that builds a weighted file graph from FileRelationshipUseCase and computes clusters with naming/cohesion heuristics, plus CLI/controller/router wiring and JSON/text outputs.

Changes

Cohort / File(s) Summary
Dependencies
Cargo.toml
Added dependency: petgraph = "0.6".
Domain Models
src/domain/models/cluster.rs, src/domain/models/mod.rs
New Cluster and ClusterGraph structs with Serde derives; module added and re-exported.
Clustering Use Case
src/application/use_cases/cluster_detection.rs, src/application/use_cases/mod.rs
New ClusterDetectionUseCase implementing graph construction from FileRelationshipUseCase, Leiden-style iterative clustering (local moves, refinement, aggregation), cohesion calculation, cluster naming heuristics, and public APIs: create_clusters, cluster_for_file, architecture_overview. Re-exported in mod.rs.
CLI
src/cli/mod.rs
Added ClustersSubcommand and OutputFormatTextJson enum; wired clusters command with list, get, and overview subcommands and format flag.
Connector / API
src/connector/api/container.rs, src/connector/api/controller/clusters_controller.rs, src/connector/api/controller/mod.rs, src/connector/api/router.rs
Container accessor to construct ClusterDetectionUseCase; new ClustersController with async list/get/overview returning JSON or text; controller re-exported; router dispatch added for cluster commands.
Library Exports
src/lib.rs
Re-exported ClusterDetectionUseCase, ClustersSubcommand, Cluster, and ClusterGraph.
Tests
src/application/use_cases/cluster_detection.rs (module tests)
Unit tests covering weight/slugify heuristics, Leiden steps, naming, cohesion, and partition renumbering.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI Client
    participant Router as Router
    participant Controller as ClustersController
    participant UseCase as ClusterDetectionUseCase
    participant FileGraph as FileRelationshipUseCase
    participant Detector as LeidenDetector

    CLI->>Router: Commands::Clusters::List(repository, format)
    Router->>Controller: list(repository, format)
    Controller->>UseCase: create_clusters(repository)
    UseCase->>FileGraph: load nodes & edges
    FileGraph-->>UseCase: nodes & edges
    UseCase->>Detector: build weighted graph
    Detector->>Detector: local moving phase
    Detector->>Detector: refinement & aggregation
    Detector-->>UseCase: partition labels
    UseCase->>UseCase: compute cohesion & names
    UseCase-->>Controller: ClusterGraph
    Controller->>CLI: formatted output (JSON or text)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hopped through nodes and edges tonight,

Leiden nudged the clusters into light,
Folder names, symbols stitched in rhyme,
Cohesion counted, members kept in line,
A rabbit cheers — clusters found just right ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding Leiden cluster detection for a file dependency graph, which aligns with the substantial implementation across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/leiden-cluster-detection-rTrgU

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 687-707: The aggregation is using cluster.name as identity which
is not unique; update the file→cluster lookup and inter-aggregation to use the
cluster ID instead of name: in the file_to_cluster construction (originally
mapping m.as_str() -> c.name.as_str()) map to c.id.as_str() (and anywhere else
around the earlier 725-731 block), and change the inter key usage so you
aggregate on (cluster_id_from, cluster_id_to) (still &str references) when
processing graph.edges; ensure any comparisons (fc != tc) and later consumers of
inter expect cluster IDs rather than names.
- Around line 509-523: The code currently hard-codes cohesion: 1.0 for singleton
clusters when n < MIN_NODES_FOR_CLUSTERING which is wrong if there are external
edges; replace the literal with a real cohesion computation: for each created
Cluster (in the block that builds clusters from files when n <
MIN_NODES_FOR_CLUSTERING) compute cohesion based on the repository graph (e.g.,
internal_edges_for_member / total_edges_touching_member or call your existing
cohesion function if present) instead of 1.0; update the Cluster construction
(where id, name_cluster, repository_id, dominant_language, size, cohesion,
members are set) to set cohesion = compute_cohesion(&[path.clone()], &graph) or
equivalent so singletons reflect external connections.
- Around line 101-109: The loop runs local_moving_phase and refine_phase but
never collapses the partition into a coarser graph; add an aggregation step
(e.g., call a function like aggregate_partition or build_coarse_graph) after
refine_phase to produce a new_graph and updated partition mapping, replace graph
and partition with those coarse versions, then compute modularity on the
new_graph (using modularity(new_graph, &partition)) and continue; ensure the
aggregation function you call updates node IDs/weights and edge weights
appropriately so subsequent local_moving_phase/refine_phase operate on the
coarsened graph.
- Around line 458-469: The cross-cluster branch only increments the external
edge count for the source cluster, missing incoming edges for the destination
cluster; update the match arm handling (the (Some(&ci), Some(_)) case) so you
also look up the destination cluster index (e.g., cj via
file_to_cluster.get(&edge.to_file)), map to its id through id_by_index[cj], and
increment its external counter in stats as well
(stats.entry(dest_id.to_string()).and_modify(|(_, ext)| *ext += 1)); keep the
same behavior for the same-cluster arm (ci == cj) and ensure you reference
edges, file_to_cluster, id_by_index, and stats when making the change.
- Around line 431-433: The current truncation uses a byte slice
trimmed[..max_len].to_string() which can panic on UTF‑8 boundaries; change the
truncation in the cluster detection code that computes `trimmed`/`max_len` to
perform a character-aware truncation (i.e., operate on Unicode scalar values)
instead of byte indices — for example, create a new String from the first
max_len chars of `trimmed` (and ensure any length check uses character count),
replacing the byte-slice expression where it currently appears.

In `@src/cli/mod.rs`:
- Around line 23-25: The List and Get subcommands use the full OutputFormat enum
(via the format: OutputFormat field) but the help/doc restricts supported
formats to only text/json; create a narrow enum (e.g., OutputFormatTextJson with
variants Text and Json and deriving value_enum) and replace the format field
type in the List and Get subcommand structs to use OutputFormatTextJson (keep
default_value "text"), or alternatively update the help text to list all
OutputFormat variants; adjust any code that matches on OutputFormat to accept
the new type or convert it to the original OutputFormat if needed so behavior
and help text are consistent.

In `@src/connector/api/controller/clusters_controller.rs`:
- Around line 25-27: The match on OutputFormat in clusters_controller.rs
currently treats every non-JSON format as plain text, which silently downgrades
OutputFormat::Vimgrep; explicitly handle OutputFormat::Vimgrep (or return an
error) in the same match used in the List/Get handlers: add a match arm for
OutputFormat::Vimgrep that formats the cluster output into vimgrep-style lines,
or return an appropriate error (e.g., BadRequest/UnsupportedFormat) when
OutputFormat::Vimgrep is passed to the ClustersSubcommand::List/Get paths;
update both match sites (the one around the serde_json branch and the other at
lines ~82-84) to include OutputFormat::Vimgrep rather than falling through to
the text branch so vimgrep is correctly supported or explicitly rejected.
- Around line 22-23: Import anyhow::Context and annotate the fallible calls in
ClustersController so errors include call-site context before propagation: wrap
the use_case.detect(&repository).await call with .context("detecting clusters
for repository"), the cluster graph serialization call with
.context("serializing cluster graph"), the repository/file lookup in the “find
cluster for file” handler with .context(format!("finding cluster for file {} in
repository {}", file_path, repository)), the single-cluster serialization with
.context("serializing cluster"), and the architecture overview generation with
.context("generating architecture overview"); update the functions that call
these operations (e.g., the detect flow, the file-cluster lookup handler, and
overview handler) to return the annotated anyhow errors.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0a9e826-c0b1-4bd3-bf6b-e2a272ad06d4

📥 Commits

Reviewing files that changed from the base of the PR and between 6f7642c and d6a26d4.

📒 Files selected for processing (11)
  • Cargo.toml
  • src/application/use_cases/cluster_detection.rs
  • src/application/use_cases/mod.rs
  • src/cli/mod.rs
  • src/connector/api/container.rs
  • src/connector/api/controller/clusters_controller.rs
  • src/connector/api/controller/mod.rs
  • src/connector/api/router.rs
  • src/domain/models/cluster.rs
  • src/domain/models/mod.rs
  • src/lib.rs

Comment thread src/application/use_cases/cluster_detection.rs
Comment thread src/application/use_cases/cluster_detection.rs Outdated
Comment thread src/application/use_cases/cluster_detection.rs
Comment thread src/application/use_cases/cluster_detection.rs
Comment thread src/application/use_cases/cluster_detection.rs Outdated
Comment thread src/cli/mod.rs Outdated
Comment thread src/connector/api/controller/clusters_controller.rs Outdated
Comment thread src/connector/api/controller/clusters_controller.rs Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 12, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 3 file(s) based on 8 unresolved review comments.

Files modified:

  • src/application/use_cases/cluster_detection.rs
  • src/cli/mod.rs
  • src/connector/api/controller/clusters_controller.rs

Commit: 46699d7956fd93b409f429b1489bf2206a60364e

The changes have been pushed to the claude/leiden-cluster-detection-rTrgU branch.

Time taken: 10m 35s

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/application/use_cases/cluster_detection.rs (2)

27-35: Extract heuristic literals into named constants.

Non-obvious tuning values (reference-kind weights, 0.4 dominant-symbol threshold, and slug max length 30) are currently hardcoded across the file. Centralizing them as constants will make calibration and review safer.

As per coding guidelines, "Define named constants for any numeric value that isn't immediately obvious (e.g. default result counts, score thresholds, search depths)".

Also applies to: 413-417, 443-443

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 27 - 35, Extract
the magic numbers into named module-level constants and replace their literal
uses: create constants for the reference-kind weights (e.g., CALL_WEIGHT,
INHERITANCE_WEIGHT, IMPLEMENTATION_WEIGHT, TYPEREFERENCE_WEIGHT, IMPORT_WEIGHT,
DEFAULT_KIND_WEIGHT) and use them inside the kind_weight function; define
DOMINANT_SYMBOL_THRESHOLD (for the 0.4 check referenced around lines 413-417)
and SLUG_MAX_LENGTH (for the slug max length 30 used around line 443) and
replace the hardcoded literals with these constants; ensure constants are
documented with a short comment and used wherever the original numeric literals
appear so tuning is centralized.

768-769: Avoid rebuilding the dependency graph twice in architecture_overview.

architecture_overview calls detect (which already builds the graph) and then builds the graph again at Line 791. This duplicates expensive work and can produce inconsistent snapshots if data changes between calls.

Consider refactoring so one graph build is reused for both clustering and inter-cluster dependency aggregation.

Also applies to: 791-795

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 768 - 769,
architecture_overview is rebuilding the dependency graph twice by calling detect
(which already constructs the graph) and then reconstructing it again; refactor
so the graph is built once and reused: modify detect to either return the built
graph alongside its clustering result (e.g., return (clusters, graph) or a
struct containing both) or add a new private helper (e.g.,
build_dependency_graph) that architecture_overview calls once and then passes
the same graph into detect (or a new detect_with_graph function) for clustering
and later inter-cluster aggregation; update callers to consume the single graph
instance (replace the second graph construction around the code that currently
rebuilds the graph with the reused cg value) so clustering and dependency
aggregation operate on the same snapshot.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 341-344: The aggregation currently ignores intra-cluster edges by
only adding weights when cu != cv; change the logic so that when cu == cv you
still accumulate that weight as an internal/self-loop weight for the cluster
(for example by inserting/updating edge_weights.entry((cu, cu)) or maintaining
an internal_mass map keyed by cu and adding w), and ensure the aggregated graph
construction and subsequent modularity/gain computations (the code that consumes
edge_weights for coarse-level Leiden iterations) treat these self-loop/internal
weights as part of the cluster’s degree/mass used in modularity and delta-gain
calculations.

---

Nitpick comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 27-35: Extract the magic numbers into named module-level constants
and replace their literal uses: create constants for the reference-kind weights
(e.g., CALL_WEIGHT, INHERITANCE_WEIGHT, IMPLEMENTATION_WEIGHT,
TYPEREFERENCE_WEIGHT, IMPORT_WEIGHT, DEFAULT_KIND_WEIGHT) and use them inside
the kind_weight function; define DOMINANT_SYMBOL_THRESHOLD (for the 0.4 check
referenced around lines 413-417) and SLUG_MAX_LENGTH (for the slug max length 30
used around line 443) and replace the hardcoded literals with these constants;
ensure constants are documented with a short comment and used wherever the
original numeric literals appear so tuning is centralized.
- Around line 768-769: architecture_overview is rebuilding the dependency graph
twice by calling detect (which already constructs the graph) and then
reconstructing it again; refactor so the graph is built once and reused: modify
detect to either return the built graph alongside its clustering result (e.g.,
return (clusters, graph) or a struct containing both) or add a new private
helper (e.g., build_dependency_graph) that architecture_overview calls once and
then passes the same graph into detect (or a new detect_with_graph function) for
clustering and later inter-cluster aggregation; update callers to consume the
single graph instance (replace the second graph construction around the code
that currently rebuilds the graph with the reused cg value) so clustering and
dependency aggregation operate on the same snapshot.
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0d1d0edf-d213-4ef9-8e94-b93625af80c6

📥 Commits

Reviewing files that changed from the base of the PR and between d6a26d4 and 46699d7.

📒 Files selected for processing (3)
  • src/application/use_cases/cluster_detection.rs
  • src/cli/mod.rs
  • src/connector/api/controller/clusters_controller.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cli/mod.rs

Comment thread src/application/use_cases/cluster_detection.rs Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 12, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Cannot run autofix: This PR has merge conflicts.

Please resolve the conflicts with the base branch and try again.

Alternatively, use @coderabbitai resolve merge conflict to automatically resolve the conflicts.

claude and others added 2 commits April 12, 2026 20:37
Implements community detection on the file-level call graph using the
Leiden algorithm (Traag et al. 2019), which guarantees internally-connected
clusters unlike Louvain.

Key design points:
- Differentiated edge weights by reference kind (Call=1.0, Inheritance=0.8,
  Implementation=0.7, TypeReference=0.6, Import=0.5, Unknown=0.3).
- O(edges) batch cohesion computation: build file→cluster map once, walk
  edges once, classify as internal or external.
- Four-step cluster naming: common directory → dominant symbol (>40%) →
  frequent keyword from symbol names → "{dir}-{keyword}" slug.
- Fallback to singleton clusters when graph has fewer than 10 nodes.
- Three-phase Leiden loop (local moving → refinement → aggregation) capped
  at 50 iterations with a 1e-6 modularity-gain convergence threshold.

New public surface:
- domain: `Cluster`, `ClusterGraph`
- use case: `ClusterDetectionUseCase` with `detect`, `cluster_for_file`,
  `architecture_overview`
- CLI: `codesearch clusters list|get|overview`

All 125 existing tests continue to pass; 8 new unit tests cover the
algorithm helpers (kind_weight, renumber, leiden, slugify, naming).

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
Fixed 3 file(s) based on 8 unresolved review comments.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
@ArtemisMucaj ArtemisMucaj force-pushed the claude/leiden-cluster-detection-rTrgU branch from 46699d7 to 9062f84 Compare April 12, 2026 20:50
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/application/use_cases/cluster_detection.rs (1)

341-344: ⚠️ Potential issue | 🟠 Major

Aggregation still drops intra-cluster edge mass.

This keeps only cross-cluster weights, so the coarse graph no longer faithfully represents the prior partition’s modularity landscape.

Fix direction
-            if cu != cv {
-                let (lo, hi) = if cu < cv { (cu, cv) } else { (cv, cu) };
-                *edge_weights.entry((lo, hi)).or_insert(0.0) += w;
-            }
+            let (lo, hi) = if cu < cv { (cu, cv) } else { (cv, cu) };
+            *edge_weights.entry((lo, hi)).or_insert(0.0) += w;

Then ensure Graph/modularity/gain logic correctly accounts for (c,c) self-loop/internal mass (currently self-loops are not represented in modularity accumulation).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 341 - 344, The
current aggregation ignores intra-cluster edges by skipping when cu == cv;
change the logic that updates edge_weights so it also records self-loop/internal
mass for pairs where cu == cv (e.g., insert or increment entry keyed by (cu, cu)
using the same weight variable w instead of dropping it), and then update the
downstream Graph/modularity/gain accumulation to include these (c,c) self-loop
entries when computing coarse-node internal degree and modularity/gain (ensure
any places that iterate edge_weights treat (c,c) as internal mass rather than
skipping or assuming zero).
🧹 Nitpick comments (1)
src/application/use_cases/cluster_detection.rs (1)

1-922: Please split this file into smaller modules.

This implementation is now large enough that algorithm, naming, cohesion, and tests should be separated for maintainability.

As per coding guidelines: "Keep one logical concept per file. If a file grows beyond ~300 lines, split it."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 1 - 922, This
file is too large; split it into focused modules: move the graph and Leiden
algorithm (struct Graph, fn leiden, fn modularity, fn local_moving_phase, fn
refine_phase, fn aggregate_partition, fn renumber, and relevant constants
MAX_ITERATIONS/MIN_MODULARITY_GAIN) into a new module (e.g., algorithm.rs), move
naming logic (const STOP_WORDS, fn name_cluster, fn split_identifier, fn
slugify) into naming.rs, move cohesion logic (fn batch_cohesion) into
cohesion.rs, and keep the use-case wrapper (struct ClusterDetectionUseCase and
its methods detect, cluster_for_file, architecture_overview) in use_case.rs;
export or make pub the symbols referenced across modules (Graph,
composite_weight helper usage moved into algorithm or kept private), update
imports in each new module to bring in types like FileEdge, Cluster,
ClusterGraph, Language, Uuid, Path, and HashMap, and move tests into a tests
module mirroring these splits or into individual module tests, then add mod
declarations and re-export items from src/application/use_cases/mod.rs so
existing callers keep the same paths. Ensure function and type visibility
(pub(crate)/pub) is adjusted where cross-module references exist and update use
paths inside ClusterDetectionUseCase to call the moved functions (e.g.,
algorithm::leiden, naming::name_cluster, cohesion::batch_cohesion).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 413-416: The current dominant-symbol selection uses
sym_freq.iter().find(...) which is non-deterministic and uses >=
ceil(0.4*members) allowing exactly 40%; change to compute a strict integer
threshold and select the symbol deterministically: compute threshold as the
integer count that represents strictly more than 40% (e.g., threshold =
(members.len() * 40) / 100 and require count > threshold, or threshold =
(members.len() as f64 * 0.4).floor() as usize and require >), then replace
sym_freq.iter().find(...) with a deterministic max-by operation over sym_freq
(select the symbol with highest count and break ties deterministically, e.g., by
choosing the smallest/lowest symbol via Ord) and only accept it if its count >
threshold; refer to members.len, sym_freq, and dominant_sym in the change.

In `@src/connector/api/controller/clusters_controller.rs`:
- Around line 93-99: The None branch currently returns a human-readable string
while the Some(c) branch serializes to JSON for OutputFormat::Json; change the
None arm to produce JSON-compatible output when format == OutputFormat::Json
(e.g., serialize an object or null using serde_json::to_string_pretty) instead
of plain text. Locate the match on the cluster result in clusters_controller.rs
(the None => format!(...) arm and the Some(c) => match format {
OutputFormat::Json => serde_json::to_string_pretty(&c) ... } block) and update
the None branch to branch on the same `format` variable and call
serde_json::to_string_pretty with a JSON-safe representation (such as
serde_json::json!({"found": false, "file": file_path, "repository": repository})
or serde_json::to_string_pretty(&serde_json::Value::Null)) so machine consumers
receive valid JSON.

---

Duplicate comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 341-344: The current aggregation ignores intra-cluster edges by
skipping when cu == cv; change the logic that updates edge_weights so it also
records self-loop/internal mass for pairs where cu == cv (e.g., insert or
increment entry keyed by (cu, cu) using the same weight variable w instead of
dropping it), and then update the downstream Graph/modularity/gain accumulation
to include these (c,c) self-loop entries when computing coarse-node internal
degree and modularity/gain (ensure any places that iterate edge_weights treat
(c,c) as internal mass rather than skipping or assuming zero).

---

Nitpick comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 1-922: This file is too large; split it into focused modules: move
the graph and Leiden algorithm (struct Graph, fn leiden, fn modularity, fn
local_moving_phase, fn refine_phase, fn aggregate_partition, fn renumber, and
relevant constants MAX_ITERATIONS/MIN_MODULARITY_GAIN) into a new module (e.g.,
algorithm.rs), move naming logic (const STOP_WORDS, fn name_cluster, fn
split_identifier, fn slugify) into naming.rs, move cohesion logic (fn
batch_cohesion) into cohesion.rs, and keep the use-case wrapper (struct
ClusterDetectionUseCase and its methods detect, cluster_for_file,
architecture_overview) in use_case.rs; export or make pub the symbols referenced
across modules (Graph, composite_weight helper usage moved into algorithm or
kept private), update imports in each new module to bring in types like
FileEdge, Cluster, ClusterGraph, Language, Uuid, Path, and HashMap, and move
tests into a tests module mirroring these splits or into individual module
tests, then add mod declarations and re-export items from
src/application/use_cases/mod.rs so existing callers keep the same paths. Ensure
function and type visibility (pub(crate)/pub) is adjusted where cross-module
references exist and update use paths inside ClusterDetectionUseCase to call the
moved functions (e.g., algorithm::leiden, naming::name_cluster,
cohesion::batch_cohesion).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b3a3da9-6e2b-4833-917c-cf366c28e47f

📥 Commits

Reviewing files that changed from the base of the PR and between 46699d7 and 9062f84.

📒 Files selected for processing (11)
  • Cargo.toml
  • src/application/use_cases/cluster_detection.rs
  • src/application/use_cases/mod.rs
  • src/cli/mod.rs
  • src/connector/api/container.rs
  • src/connector/api/controller/clusters_controller.rs
  • src/connector/api/controller/mod.rs
  • src/connector/api/router.rs
  • src/domain/models/cluster.rs
  • src/domain/models/mod.rs
  • src/lib.rs
✅ Files skipped from review due to trivial changes (3)
  • Cargo.toml
  • src/application/use_cases/mod.rs
  • src/domain/models/cluster.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/domain/models/mod.rs
  • src/connector/api/controller/mod.rs
  • src/connector/api/router.rs
  • src/lib.rs

Comment thread src/application/use_cases/cluster_detection.rs Outdated
Comment thread src/connector/api/controller/clusters_controller.rs Outdated
claude added 2 commits April 12, 2026 21:03
…_partition

When cu == cv during aggregation, the previous code discarded the edge
weight entirely, making total_weight and node degrees in the coarse graph
inconsistent with the original graph. Modularity and delta-gain
computations on subsequent Leiden iterations therefore used a smaller 2m
denominator than the true graph mass, biasing partition moves.

Fix: route intra-cluster edges through edge_weights.entry((cu, cu)) and,
when flushing edge_weights to the new Graph, add self-loop weight to
total_weight and degree[u] (×2, both endpoints collapse to the same
super-node) without inserting into adj, so movement decisions are not
affected by spurious self-neighbour entries.

Also fix a stale ClusterOutputFormat re-export in lib.rs (the enum was
renamed to OutputFormatTextJson by the CodeRabbit auto-fix).

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
- Add named constants for all edge-weight magic numbers (CALL_WEIGHT,
  INHERITANCE_WEIGHT, IMPLEMENTATION_WEIGHT, TYPEREFERENCE_WEIGHT,
  IMPORT_WEIGHT, DEFAULT_KIND_WEIGHT) and use them in kind_weight()
  and composite_weight().
- Add DOMINANT_SYMBOL_THRESHOLD (0.4) and SLUG_MAX_LENGTH (30) constants
  and replace the bare literals in name_cluster().
- Extract private detect_internal() returning (ClusterGraph, FileGraph)
  so architecture_overview() reuses the already-built graph instead of
  calling build_graph() a second time.
- Fix unused enumerate variable (i → _) in singleton fallback path.

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/application/use_cases/cluster_detection.rs (1)

447-453: ⚠️ Potential issue | 🟡 Minor

Dominant-symbol selection is nondeterministic and uses >= instead of > threshold.

HashMap::iter().find() iterates in arbitrary order, so different runs may select different symbols when multiple meet the threshold. Also, ceil(0.4 * n) with >= admits exactly 40%, contradicting the docstring "one symbol accounts for >40% of members".

For deterministic output, select the symbol with the highest count and use a tie-breaker (e.g., lexicographic order). For strict ">40%", use integer threshold (members.len() * 2) / 5 and > threshold or >= threshold + 1.

Suggested fix for determinism and strict threshold
-    let threshold = (members.len() as f64 * DOMINANT_SYMBOL_THRESHOLD).ceil() as usize;
-    if let Some((&dominant_sym, _)) =
-        sym_freq.iter().find(|(_, &c)| c >= threshold)
-    {
+    // Strictly > 40% threshold
+    let min_count = (members.len() * 2) / 5 + 1;
+    if let Some((&dominant_sym, _)) = sym_freq
+        .iter()
+        .filter(|(_, &c)| c >= min_count)
+        .max_by(|(sa, ca), (sb, cb)| ca.cmp(cb).then_with(|| sa.cmp(sb)))
+    {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 447 - 453, The
dominant-symbol selection is nondeterministic and uses >= against a ceil-based
threshold; update the logic (in the block that computes threshold and inspects
sym_freq) to compute a strict integer threshold for “>40%” (e.g., let threshold
= (members.len() * 2) / 5 and require count > threshold or threshold + 1 as
appropriate) and pick the symbol with the maximum count deterministically by
scanning sym_freq to find the highest count and breaking ties lexicographically
(e.g., choose the smallest symbol string on equal counts), then call
slugify(dominant_sym, SLUG_MAX_LENGTH) and return that slug.
🧹 Nitpick comments (3)
src/application/use_cases/cluster_detection.rs (3)

598-601: Consider adding error context for the build_graph call.

Per coding guidelines for src/application/**/*.rs, use anyhow::Context to annotate errors with call-site context before propagating. This helps with debugging when graph construction fails.

Add context annotation
+use anyhow::Context;
+
         let graph = self
             .file_graph
             .build_graph(Some(&[repository_id.to_string()]), 1, false)
-            .await?;
+            .await
+            .context("failed to build file dependency graph for clustering")?;

Note: This requires changing the return type to use anyhow::Result or converting the error appropriately.

As per coding guidelines: "Use anyhow::Context in application and connector layers to annotate errors with call-site context before propagating."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 598 - 601, The
call to file_graph.build_graph(...) in cluster_detection.rs should annotate
errors with anyhow::Context before propagating; update the invocation of
self.file_graph.build_graph(Some(&[repository_id.to_string()]), 1, false).await
to use .context("building file graph for repository {repository_id}") (or
similar informative message) so the error carries call-site context, ensure the
Context trait is imported (anyhow::Context), and if the enclosing function's
return type isn't anyhow::Result, adjust it to return anyhow::Result or convert
the error to anyhow::Error so .context can be used and the enriched error is
propagated from the function.

504-512: Redundant match arms in filter_map.

Both the else if and else branches return Some('-'), making the condition unnecessary.

Simplify to a single else branch
         .filter_map(|c| {
             if c.is_alphanumeric() {
                 Some(c.to_ascii_lowercase())
-            } else if c == '-' || c == '_' {
-                Some('-')
             } else {
                 Some('-')
             }
         })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 504 - 512, The
closure passed to filter_map contains redundant arms returning Some('-') for
both the else-if and else branches; simplify the logic in the filter_map on that
iterator (the closure currently using |c| { if c.is_alphanumeric() {
Some(c.to_ascii_lowercase()) } else if c == '-' || c == '_' { Some('-') } else {
Some('-') } }) by collapsing the two non-alphanumeric branches into a single
else that returns Some('-'); optionally replace filter_map with map since the
closure always returns Some, but at minimum remove the duplicate else-if and
keep a single else branch.

632-654: Unused enumerate() index.

The enumerate() at line 634 provides an index that's discarded. You can remove it for clarity.

Remove unused enumerate
         let clusters: Vec<Cluster> = files
             .iter()
-            .enumerate()
-            .map(|(_, path)| {
+            .map(|path| {
                 let lang =
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 632 - 654,
Remove the unused enumerate by changing the map closure from map(|(_, path)| {
... }) to map(|path| { ... }) and adjust any references from path (which may be
&&str or &String) accordingly (e.g., use path.clone() or Path::new(path) as
before). Update calls using path inside the closure (Language::from_path,
file_to_edges.get(path), name_cluster(&[path.clone()], ...), members:
vec![path.clone()]) so they compile with the new closure parameter; leave the
rest of the Cluster construction (Uuid::new_v4(), cohesion calc, repository_id,
etc.) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 447-453: The dominant-symbol selection is nondeterministic and
uses >= against a ceil-based threshold; update the logic (in the block that
computes threshold and inspects sym_freq) to compute a strict integer threshold
for “>40%” (e.g., let threshold = (members.len() * 2) / 5 and require count >
threshold or threshold + 1 as appropriate) and pick the symbol with the maximum
count deterministically by scanning sym_freq to find the highest count and
breaking ties lexicographically (e.g., choose the smallest symbol string on
equal counts), then call slugify(dominant_sym, SLUG_MAX_LENGTH) and return that
slug.

---

Nitpick comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 598-601: The call to file_graph.build_graph(...) in
cluster_detection.rs should annotate errors with anyhow::Context before
propagating; update the invocation of
self.file_graph.build_graph(Some(&[repository_id.to_string()]), 1, false).await
to use .context("building file graph for repository {repository_id}") (or
similar informative message) so the error carries call-site context, ensure the
Context trait is imported (anyhow::Context), and if the enclosing function's
return type isn't anyhow::Result, adjust it to return anyhow::Result or convert
the error to anyhow::Error so .context can be used and the enriched error is
propagated from the function.
- Around line 504-512: The closure passed to filter_map contains redundant arms
returning Some('-') for both the else-if and else branches; simplify the logic
in the filter_map on that iterator (the closure currently using |c| { if
c.is_alphanumeric() { Some(c.to_ascii_lowercase()) } else if c == '-' || c ==
'_' { Some('-') } else { Some('-') } }) by collapsing the two non-alphanumeric
branches into a single else that returns Some('-'); optionally replace
filter_map with map since the closure always returns Some, but at minimum remove
the duplicate else-if and keep a single else branch.
- Around line 632-654: Remove the unused enumerate by changing the map closure
from map(|(_, path)| { ... }) to map(|path| { ... }) and adjust any references
from path (which may be &&str or &String) accordingly (e.g., use path.clone() or
Path::new(path) as before). Update calls using path inside the closure
(Language::from_path, file_to_edges.get(path), name_cluster(&[path.clone()],
...), members: vec![path.clone()]) so they compile with the new closure
parameter; leave the rest of the Cluster construction (Uuid::new_v4(), cohesion
calc, repository_id, etc.) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd65a214-2f25-45d3-9c3b-77850a2bb2c1

📥 Commits

Reviewing files that changed from the base of the PR and between 9062f84 and 0b11403.

📒 Files selected for processing (2)
  • src/application/use_cases/cluster_detection.rs
  • src/lib.rs

claude added 2 commits April 12, 2026 21:33
detect_internal → create_clusters_with_graph (private helper)
detect          → create_clusters (public API)

The new names make it clear what the methods produce rather than
describing the algorithm used internally.

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
- Remove dead cluster_internal computation from local_moving_phase;
  only cluster_total was ever read.
- Collapse two identical arms in slugify filter_map into a single else branch;
  change filter_map to map since all chars now produce a value.
- Remove pointless .enumerate() in singleton fallback path where the index
  was immediately discarded.
- Make name_cluster private; it has no callers outside this module.
- batch_cohesion: replace two-pass init (pre-populate then fill) with
  HashMap::with_capacity + or_insert on first access during edge walk.
- cluster_for_file: use find_map + swap_remove instead of find with a
  nested any scan, eliminating the need to re-search members after
  finding the owning cluster.

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
src/application/use_cases/cluster_detection.rs (2)

358-365: ⚠️ Potential issue | 🟠 Major

Keep coarse-graph self-loop mass visible to modularity.

These self-loops update total_weight and degree, but they never enter adj, and modularity() only sums internal weight from adj. After the first aggregation pass, the convergence check and later optimization are evaluating a graph whose internal mass is partly invisible. Either represent loop weight in adjacency too, or track it separately and include it everywhere modularity/gain is computed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 358 - 365, The
self-loop weights are added to new_graph.total_weight and new_graph.degree but
are not represented in new_graph.adj, causing modularity() (which sums internal
weight from adj) to miss that mass; update the aggregation in the loop over
edge_weights (where u==v) to also record the self-loop weight in the adjacency
structure or add a dedicated self_loop_mass field on the graph type and then
modify modularity(), internal_weight/gain calculations, and any
convergence/optimization logic to include that self_loop_mass in both internal
weight and degree-based terms (reference symbols: edge_weights loop,
new_graph.total_weight, new_graph.degree, new_graph.adj, modularity()) so
modularity and movement gains see the same total internal mass.

431-433: ⚠️ Potential issue | 🟡 Minor

Make the dominant-symbol rule actually enforce >40%.

ceil(0.4 * n) with c >= threshold accepts exactly 40% of members, and find(...) stops at the first qualifying symbol instead of the highest-frequency one. Pick the max-count symbol first and then require count * 5 > members.len() * 2.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 431 - 433, The
dominant-symbol logic incorrectly accepts exactly 40% and picks the first
qualifying symbol; change it to pick the symbol with the maximum count from
sym_freq (e.g., use iter().max_by_key(|(_, &c)| c) to get the (&dominant_sym,
&count)) and then enforce the >40% rule with an integer-safe comparison like
count * 5 > members.len() * 2 (instead of using ceil and
>=/DOMINANT_SYMBOL_THRESHOLD). Ensure you replace the current find-based branch
that uses threshold and c >= threshold with this max-and-compare approach so the
highest-frequency symbol is chosen and strictly exceeds 40%.
src/connector/api/controller/clusters_controller.rs (1)

92-108: ⚠️ Potential issue | 🟠 Major

Preserve the requested output format in the None branch.

None bypasses the format match entirely, so --format json returns prose and --format vimgrep quietly returns non-vimgrep output. Branch on format here the same way as Some(c) so machine consumers keep a stable contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/connector/api/controller/clusters_controller.rs` around lines 92 - 108,
The None arm currently ignores the requested OutputFormat; change the None
branch to match on the same format enum as the Some(c) branch
(OutputFormat::Json, ::Vimgrep, ::Text) and produce format-consistent output:
for OutputFormat::Json return a JSON-serializable error/not-found structure (so
clients get machine-readable output), for OutputFormat::Vimgrep return the
vimgrep-style not-found response or bail as appropriate, and for
OutputFormat::Text keep the existing prose message; update the match around
result (and use file_path/repository in the not-found payload) so the formatting
logic is identical between None and Some(c).
🧹 Nitpick comments (2)
src/application/use_cases/cluster_detection.rs (2)

779-786: This lookup is still linear, not O(1).

find_map(... any(...)) still scans clusters and members on every call. If you want the advertised constant-time lookup, build a HashMap<&str, usize> once from cg.clusters or return that index alongside ClusterGraph.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 779 - 786, The
current lookup for file_path is still linear because cluster_idx is computed
with cg.clusters.iter().find_map(... any(...)), so build a true O(1) index
instead: after obtaining (mut cg, _) from create_clusters_with_graph(...)
construct a HashMap<String, usize> (or HashMap<&str, usize> if lifetimes permit)
by iterating cg.clusters.enumerate() and mapping each member path to its cluster
index, then use that map to get the cluster index for file_path and return
cg.clusters.swap_remove(index) if present; alternatively modify
create_clusters_with_graph to return this map alongside ClusterGraph to avoid
rebuilding it here.

1-874: Please split this module before it grows further.

This file now mixes graph primitives, Leiden phases, naming helpers, cohesion logic, the use case, and tests in one very large module. Extracting the algorithm core, naming/cohesion helpers, and tests into separate modules would make the clustering logic much easier to navigate and maintain. As per coding guidelines, **/*.rs: Keep one logical concept per file. If a file grows beyond ~300 lines, split it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/application/use_cases/cluster_detection.rs` around lines 1 - 874, The
file is too large and should be split into focused modules: extract the graph
primitives (struct Graph, composite_weight, kind_weight, constants
CALL_WEIGHT..DEFAULT_KIND_WEIGHT) into a graph module; extract the Leiden
implementation (functions leiden, local_moving_phase, refine_phase,
aggregate_partition, modularity, renumber, MAX_ITERATIONS, MIN_MODULARITY_GAIN)
into an algorithm or leiden module that depends on the graph module; extract
naming/cohesion helpers (name_cluster, split_identifier, slugify, STOP_WORDS,
batch_cohesion and related constants DOMINANT_SYMBOL_THRESHOLD, SLUG_MAX_LENGTH)
into a helpers or naming module; and keep ClusterDetectionUseCase in its own
use_case module which imports the above modules and only exposes the public API;
also move tests into a tests module/file. Ensure you adjust visibility (pub /
pub(crate)) so ClusterDetectionUseCase can call Graph and Leiden APIs, update
mod declarations and use statements accordingly, and run cargo check to fix any
import visibility issues.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 820-827: The overview aggregation is using raw edge.weight but
clustering uses composite_weight(edge), causing inconsistent ranking; update the
aggregation in cluster_detection.rs to sum composite_weight(edge) instead of
edge.weight (or rename the metric if you intentionally want raw weights).
Specifically, in the loop that builds inter (the HashMap<(&str, &str), f64>)
where you currently do *inter.entry((fc, tc)).or_insert(0.0) += edge.weight as
f64;, call the existing composite_weight function (or the equivalent weighting
helper used by the clustering algorithm) and add its f64 result to the entry so
the overview totals use the same weighting model as the clustering logic; keep
file_to_cluster and graph.edges checks unchanged.

---

Duplicate comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 358-365: The self-loop weights are added to new_graph.total_weight
and new_graph.degree but are not represented in new_graph.adj, causing
modularity() (which sums internal weight from adj) to miss that mass; update the
aggregation in the loop over edge_weights (where u==v) to also record the
self-loop weight in the adjacency structure or add a dedicated self_loop_mass
field on the graph type and then modify modularity(), internal_weight/gain
calculations, and any convergence/optimization logic to include that
self_loop_mass in both internal weight and degree-based terms (reference
symbols: edge_weights loop, new_graph.total_weight, new_graph.degree,
new_graph.adj, modularity()) so modularity and movement gains see the same total
internal mass.
- Around line 431-433: The dominant-symbol logic incorrectly accepts exactly 40%
and picks the first qualifying symbol; change it to pick the symbol with the
maximum count from sym_freq (e.g., use iter().max_by_key(|(_, &c)| c) to get the
(&dominant_sym, &count)) and then enforce the >40% rule with an integer-safe
comparison like count * 5 > members.len() * 2 (instead of using ceil and
>=/DOMINANT_SYMBOL_THRESHOLD). Ensure you replace the current find-based branch
that uses threshold and c >= threshold with this max-and-compare approach so the
highest-frequency symbol is chosen and strictly exceeds 40%.

In `@src/connector/api/controller/clusters_controller.rs`:
- Around line 92-108: The None arm currently ignores the requested OutputFormat;
change the None branch to match on the same format enum as the Some(c) branch
(OutputFormat::Json, ::Vimgrep, ::Text) and produce format-consistent output:
for OutputFormat::Json return a JSON-serializable error/not-found structure (so
clients get machine-readable output), for OutputFormat::Vimgrep return the
vimgrep-style not-found response or bail as appropriate, and for
OutputFormat::Text keep the existing prose message; update the match around
result (and use file_path/repository in the not-found payload) so the formatting
logic is identical between None and Some(c).

---

Nitpick comments:
In `@src/application/use_cases/cluster_detection.rs`:
- Around line 779-786: The current lookup for file_path is still linear because
cluster_idx is computed with cg.clusters.iter().find_map(... any(...)), so build
a true O(1) index instead: after obtaining (mut cg, _) from
create_clusters_with_graph(...) construct a HashMap<String, usize> (or
HashMap<&str, usize> if lifetimes permit) by iterating cg.clusters.enumerate()
and mapping each member path to its cluster index, then use that map to get the
cluster index for file_path and return cg.clusters.swap_remove(index) if
present; alternatively modify create_clusters_with_graph to return this map
alongside ClusterGraph to avoid rebuilding it here.
- Around line 1-874: The file is too large and should be split into focused
modules: extract the graph primitives (struct Graph, composite_weight,
kind_weight, constants CALL_WEIGHT..DEFAULT_KIND_WEIGHT) into a graph module;
extract the Leiden implementation (functions leiden, local_moving_phase,
refine_phase, aggregate_partition, modularity, renumber, MAX_ITERATIONS,
MIN_MODULARITY_GAIN) into an algorithm or leiden module that depends on the
graph module; extract naming/cohesion helpers (name_cluster, split_identifier,
slugify, STOP_WORDS, batch_cohesion and related constants
DOMINANT_SYMBOL_THRESHOLD, SLUG_MAX_LENGTH) into a helpers or naming module; and
keep ClusterDetectionUseCase in its own use_case module which imports the above
modules and only exposes the public API; also move tests into a tests
module/file. Ensure you adjust visibility (pub / pub(crate)) so
ClusterDetectionUseCase can call Graph and Leiden APIs, update mod declarations
and use statements accordingly, and run cargo check to fix any import visibility
issues.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a8040a9-ae66-4da3-868f-fdde7446cf17

📥 Commits

Reviewing files that changed from the base of the PR and between 0b11403 and 00509ef.

📒 Files selected for processing (2)
  • src/application/use_cases/cluster_detection.rs
  • src/connector/api/controller/clusters_controller.rs

Comment thread src/application/use_cases/cluster_detection.rs Outdated
1. architecture_overview used raw edge.weight for inter-cluster dependency
   ranking while clustering uses composite_weight(edge); now both use the
   same weighting model so the overview totals match the algorithm inputs.

2. modularity() missed self-loop mass: aggregate_partition stores intra-cluster
   edge weight in total_weight and degree but not adj, deflating q/total_weight.
   Add Graph::self_loop_weight, accumulate in aggregate_partition, and include
   it in q before dividing so all three terms are consistent.

3. Dominant-symbol heuristic used find (first-to-reach threshold) instead of
   max_by_key (highest count). Switch to max_by_key then enforce a strict >40%
   check via integer arithmetic (count * 5 > len * 2) to avoid float rounding
   and always pick the best-fit symbol. Remove now-unused DOMINANT_SYMBOL_THRESHOLD.

4. clusters_controller::get None arm always returned text regardless of the
   requested format. Match on (result, format) so None/Json returns a
   machine-readable JSON payload and None/Vimgrep bails consistently with
   the Some branch.

https://claude.ai/code/session_018vrHYDS8VcKRsqswrzk39D
@ArtemisMucaj ArtemisMucaj merged commit 7788b73 into main Apr 13, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants