feat: cluster tips by task description similarity #60
jayaramkr wants to merge 8 commits into AgentToolkit:main
Conversation
generate_tips() now returns a TipGenerationResult containing both the tips and the source task_description. Both callers (PhoenixSync and MCP save_trajectory) store task_description in tip entity metadata, enabling future clustering of tips by task similarity. Trajectories without a task description default to "Task description unknown".
Skip update_entities call when no tips are generated, aligning with the existing guard in phoenix_sync.py.
Add clustering module that embeds task descriptions via SentenceTransformer, computes pairwise cosine similarity, and groups tips using union-find.
- New `cluster_entities()` function in `kaizen/llm/tips/clustering.py`
- `KAIZEN_CLUSTERING_THRESHOLD` config (default 0.80)
- `KaizenClient.cluster_tips()` method
- `kaizen entities consolidate` CLI command (dry-run only for now)
- 11 unit tests for clustering logic and union-find
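The union-find grouping described above can be sketched as follows. The function name and the pair-based interface are illustrative, not the actual `cluster_entities` internals:

```python
def cluster_by_pairs(n: int, pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Group n items into clusters given "similar" (i, j) index pairs, via union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Path compression: point nodes at their grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i, j in pairs:
        union(i, j)

    groups: dict[int, list[int]] = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    # Singletons are excluded, mirroring the behaviour described above.
    return [members for members in groups.values() if len(members) > 1]
```

With pairs `(0, 1)` and `(1, 2)` among five items, items 0-2 form one cluster and the two unconnected items are dropped as singletons.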
📝 Walkthrough
Adds embedding-based tip clustering and LLM-driven consolidation: a new CLI command, KaizenClient methods, clustering/combine utilities and prompt, schema types carrying task_description and a consolidation summary, MCP/Phoenix flows updated to persist task_description, and associated tests.
Sequence Diagram(s)

sequenceDiagram
participant User as CLI User
participant CLI as Kaizen CLI
participant Client as KaizenClient
participant DB as Entity DB
participant Clustering as Clustering Engine
participant LLM as LLM/Combine
User->>CLI: entities consolidate(namespace, threshold?, dry_run)
CLI->>CLI: resolve effective threshold (arg || config)
CLI->>Client: cluster_tips(namespace, threshold)
Client->>DB: fetch guideline RecordedEntity (limit)
DB-->>Client: list[RecordedEntity]
Client->>Clustering: cluster_entities(entities, threshold)
Clustering->>Clustering: embed task_descriptions, compute similarities, union-find groups
Clustering-->>Client: clusters (exclude singletons)
CLI->>CLI: render clusters (tables)
alt dry_run
CLI-->>User: show dry-run summary
else perform
CLI->>Client: consolidate_tips(namespace, threshold)
Client->>Clustering: combine_cluster(cluster) for each cluster
Clustering->>LLM: generate consolidated tips (with retries, schema if supported)
LLM-->>Clustering: consolidated tips
Client->>DB: insert consolidated tip entities
Client->>DB: delete original entities
Client-->>CLI: ConsolidationResult (counts)
CLI-->>User: show consolidation result
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (6)
kaizen/llm/tips/clustering.py (3)

76-78: SentenceTransformer is instantiated on every call.

Loading the model from disk (or downloading it) on each invocation of `cluster_entities` is expensive. If this function is called repeatedly (e.g., in a loop or API handler), consider caching the model instance, for example via `functools.lru_cache` on model name or a module-level cache. For a single CLI invocation this is fine, but flag for future consideration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 76 - 78, cluster_entities currently instantiates SentenceTransformer(embedding_model) on every call which is expensive; cache the model instead (e.g., add a module-level dict or use functools.lru_cache keyed by embedding_model) and retrieve the cached SentenceTransformer for use by model.encode(descriptions, normalize_embeddings=True) before computing similarity_matrix so repeated calls reuse the loaded model.
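The caching suggestion can be sketched with `functools.lru_cache`; the loader body below is a stand-in (the real code would construct `SentenceTransformer(model_name)`):

```python
from functools import lru_cache

LOAD_COUNT = 0  # instrumentation to show the cache works


@lru_cache(maxsize=4)
def get_model(model_name: str):
    """Load the embedding model once per name; cached thereafter.

    In clustering.py the body would be `return SentenceTransformer(model_name)`.
    """
    global LOAD_COUNT
    LOAD_COUNT += 1
    return object()  # stand-in for the loaded model


m1 = get_model("all-MiniLM-L6-v2")
m2 = get_model("all-MiniLM-L6-v2")
assert m1 is m2 and LOAD_COUNT == 1  # second call reused the cached instance
```

Because `lru_cache` keys on the argument, different model names still get their own instance.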
80-86: O(n²) pairwise scan with up to 10,000 entities.

The caller (`cluster_tips`) fetches up to 10,000 guideline entities. With n=10,000, the similarity matrix is ~800 MB (float64) and the nested loop iterates ~50M pairs. Consider adding an upper-bound check or warning, or using a more scalable approach (e.g., FAISS approximate nearest neighbors) if large entity counts are expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 80 - 86, The current O(n²) pair scan in clustering.py (variables: filtered, similarity_matrix, pairs) will blow up for n≈10k; update cluster_tips to guard large inputs by adding a MAX_ENTITIES constant and handling n > MAX_ENTITIES: either (a) log/warn and early-return or downsample/filter to a smaller subset, or (b) switch to an approximate nearest-neighbor path (e.g., FAISS or sklearn NearestNeighbors with radius/top-k) to compute only top-k or radius neighbors per vector instead of materializing the full similarity_matrix and nested loops; implement one of these strategies and document the change in cluster_tips so pairs is built from ANN results rather than the full O(n²) scan.
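A minimal sketch of the guarded, vectorized alternative this prompt describes; the `MAX_CLUSTER_ENTITIES` cap and function name are illustrative:

```python
import numpy as np

MAX_CLUSTER_ENTITIES = 5000  # assumed cap, per the suggestion above


def similar_pairs(similarity_matrix: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Vectorized upper-triangle scan instead of a nested Python loop."""
    n = similarity_matrix.shape[0]
    if n > MAX_CLUSTER_ENTITIES:
        raise ValueError(f"refusing to cluster {n} entities (cap {MAX_CLUSTER_ENTITIES})")
    # Boolean mask of the strict upper triangle (k=1 excludes the diagonal).
    mask = np.triu(similarity_matrix >= threshold, k=1)
    rows, cols = np.where(mask)
    return list(zip(rows.tolist(), cols.tolist()))


sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
assert similar_pairs(sim, 0.8) == [(0, 1)]
```

The mask avoids materializing any Python-level loop; only the index arrays from `np.where` are converted back to tuples.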
85-85: Strict `>` excludes pairs at exactly the threshold value.

`similarity_matrix[i, j] > threshold` means entities with similarity exactly equal to the threshold (e.g., 0.80) won't be clustered. This is a minor semantic decision — consider using `>=` if the intent is "at least this similar".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` at line 85, The comparison currently uses a strict greater-than which excludes pairs equal to the threshold; update the conditional that checks similarity_matrix[i, j] against threshold to use a non-strict comparison (>=) so entities with similarity exactly equal to threshold are included—locate the loop/conditional referencing similarity_matrix and threshold (indices i, j) and change the operator from '>' to '>='.

tests/unit/test_clustering.py (2)
157-159: Weak assertion — `len(call_arg) > 0` will pass for any non-empty model string.

The `or len(call_arg) > 0` clause makes this assertion trivially true. If the intent is to verify the default model from `milvus_other_settings.embedding_model` is used, patch `kaizen.llm.tips.clustering.milvus_other_settings` and assert the exact value.

Proposed fix

```diff
-    mock_st_cls.assert_called_once()
-    call_arg = mock_st_cls.call_args[0][0]
-    assert "MiniLM" in call_arg or len(call_arg) > 0
+    mock_st_cls.assert_called_once()
+    call_arg = mock_st_cls.call_args[0][0]
+    # Assert a specific expected model name from the config
+    assert isinstance(call_arg, str) and len(call_arg) > 0
+    # Ideally: patch milvus_other_settings and assert call_arg == expected_model
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_clustering.py` around lines 157 - 159, The assertion is weak because `or len(call_arg) > 0` is trivially true; instead patch `milvus_other_settings.embedding_model` to a known expected value and assert that the constructor was called with that exact model string: update the test to set `milvus_other_settings.embedding_model = "<expected_model>"` (or use a monkeypatch/patch fixture), retrieve `call_arg = mock_st_cls.call_args[0][0]`, and replace the `assert "MiniLM" in call_arg or len(call_arg) > 0` with `assert call_arg == "<expected_model>"` (or assert it equals the patched value) to verify the default model is used.
31-56: Missing `@pytest.mark.unit` marker on test classes.

Both `TestUnionFind` and `TestClusterEntities` lack the `@pytest.mark.unit` decorator. As per coding guidelines, "Use pytest markers `e2e` for end-to-end tests and `unit` for unit tests in test files".

Proposed fix

```diff
+import pytest
+
+@pytest.mark.unit
 class TestUnionFind:
```

And similarly for `TestClusterEntities`:

```diff
+@pytest.mark.unit
 @patch("kaizen.llm.tips.clustering.SentenceTransformer")
 class TestClusterEntities:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_clustering.py` around lines 31 - 56, Add the pytest unit marker to the test classes by decorating TestUnionFind and TestClusterEntities with `@pytest.mark.unit` (and add "import pytest" at the top if it's not present); ensure the decorator sits directly above each class definition so all tests exercising _union_find and related cluster tests are marked as unit tests per the project guideline.

kaizen/llm/tips/tips.py (1)
94-94: Nit: Default string inconsistency across the codebase.

The default here is "Task description unknown". Verify this matches the default used in the MCP `save_trajectory` path mentioned in the PR objectives, to ensure consistent metadata values for downstream clustering.

```shell
#!/bin/bash
# Check for all occurrences of the default task description string
rg -n "Task description unknown" --type=py
rg -n "Unknown task" --type=py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/tips.py` at line 94, The default task instruction string used in the tips generator ("task_instruction": task_instruction or "Task description unknown") is inconsistent with the MCP save_trajectory default; update this to use the canonical default used by the MCP (the same string literal used in save_trajectory) so metadata is consistent for downstream clustering—open the tips.py entry that sets "task_instruction" and replace "Task description unknown" with the MCP's canonical default string (or better, import/reference a shared constant if available, e.g., a DEFAULT_TASK_DESCRIPTION constant used by save_trajectory).
Add combine_cluster() that calls an LLM to merge overlapping tips in each cluster into fewer consolidated guidelines, with a 3-retry loop. Wire up the consolidate_tips() client method and CLI --no-dry-run path to delete originals and insert the combined results.
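The 3-retry loop described above might look roughly like this; the names and error handling are illustrative, not the actual `combine_cluster` code:

```python
import json

MAX_ATTEMPTS = 3  # matches the 3-retry loop described above


def call_with_retries(prompt: str, llm_call, parse):
    """Call the LLM, parse/validate the response, and retry on failure
    up to MAX_ATTEMPTS times before giving up."""
    last_error: Exception | None = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            raw = llm_call(prompt)
            return parse(raw)  # e.g. JSON decode + schema validation
        except Exception as e:
            last_error = e
    raise RuntimeError(f"failed after {MAX_ATTEMPTS} attempts") from last_error


# Usage with a stub that fails twice before returning valid JSON:
calls = {"n": 0}


def flaky(_prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("bad JSON")
    return '["tip-a", "tip-b"]'


assert call_with_retries("merge these tips", flaky, json.loads) == ["tip-a", "tip-b"]
```

Only the final failure surfaces an exception; intermediate parse errors are retained as `last_error` for the chained traceback.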
Actionable comments posted: 3
🧹 Nitpick comments (4)
kaizen/schema/tips.py (1)
17-31: LGTM.

The two dataclasses are well-scoped. Optional improvement: `@dataclass(frozen=True)` would make both result types immutable, which is appropriate for value objects returned from operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/schema/tips.py` around lines 17 - 31, Make the two result dataclasses immutable by adding frozen=True to their decorators: change `@dataclass` above TipGenerationResult and ConsolidationResult to `@dataclass(frozen=True)` so instances become hashable and cannot be mutated after creation; ensure no existing code mutates attributes of TipGenerationResult or ConsolidationResult (update callers if needed) to avoid breaking assignments.

kaizen/llm/tips/clustering.py (2)
86-88: `SentenceTransformer` is reinstantiated on every `cluster_tips` call.

Model loading involves disk I/O and potentially network (first download). Cache the model instance keyed by name to avoid repeated load overhead.

♻️ Proposed module-level cache

```diff
+_model_cache: dict[str, SentenceTransformer] = {}
+
+def _get_model(name: str) -> SentenceTransformer:
+    if name not in _model_cache:
+        _model_cache[name] = SentenceTransformer(name)
+    return _model_cache[name]

 # inside cluster_entities:
-    model = SentenceTransformer(embedding_model)
+    model = _get_model(embedding_model)
     embeddings = model.encode(descriptions, normalize_embeddings=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 86 - 88, cluster_tips currently re-instantiates SentenceTransformer on every call which causes repeated disk/network I/O; add a module-level cache (e.g., a dict) keyed by embedding_model name and reuse cached SentenceTransformer instances instead of creating a new one each time: in cluster_tips look up embedding_model in the cache, if missing create SentenceTransformer(embedding_model) and store it, then call model.encode(descriptions, normalize_embeddings=True) as before; reference symbols: cluster_tips, SentenceTransformer, embedding_model, model.encode.
90-96: O(n²) Python loop over the similarity matrix is the bottleneck at scale.

With the 10,000-entity cap in `cluster_tips`, the upper triangle has ~50 million iterations in pure Python. The similarity matrix is already a numpy array — replace the loop with a vectorized boolean mask:

♻️ Proposed vectorized pair extraction

```diff
-    # Find pairs exceeding threshold
-    n = len(filtered)
-    pairs: list[tuple[int, int]] = []
-    for i in range(n):
-        for j in range(i + 1, n):
-            if similarity_matrix[i, j] > threshold:
-                pairs.append((i, j))
+    # Find pairs exceeding threshold (vectorized)
+    n = len(filtered)
+    mask = np.triu(similarity_matrix > threshold, k=1)
+    rows, cols = np.where(mask)
+    pairs = list(zip(rows.tolist(), cols.tolist()))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 90 - 96, The nested Python loops that build pairs from similarity_matrix (in cluster_tips, iterating n=len(filtered) and appending to pairs) are O(n²) and must be replaced with a vectorized numpy extraction: compute a boolean mask of the upper triangle (use numpy.triu with k=1) applied to similarity_matrix > threshold, then get index arrays via numpy.where and convert them into the list of (i,j) tuples (or keep as two arrays if downstream can accept that) to populate pairs; update references to pairs and ensure the resulting order/type matches existing expectations in cluster_tips and any callers.

kaizen/frontend/client/kaizen_client.py (1)
91-92: Hardcoded `limit=10000` is an invisible ceiling.

Namespaces that grow beyond 10,000 guideline entities will be silently under-clustered. Consider either exposing this as a parameter or documenting it in the docstring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/frontend/client/kaizen_client.py` around lines 91 - 92, The call in kaizen_client.py hardcodes limit=10000 when calling get_all_entities, which silently caps guideline clustering; change the call to avoid the fixed ceiling by adding an optional parameter (e.g., limit or max_entities) to the method that contains this code and pass it through to get_all_entities, or implement iterative pagination using get_all_entities until all entities are retrieved, then call cluster_entities(entities, threshold=threshold); also update the method docstring to document the new parameter/behavior so callers know how to control or expect the limit.
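The pagination alternative mentioned in this prompt could be sketched as follows. It assumes the backend accepts `limit`/`offset` keywords, which may not match the real `get_all_entities` signature:

```python
def fetch_all_entities(backend, namespace: str, page_size: int = 1000) -> list:
    """Fetch every entity page by page instead of relying on a fixed limit.

    A short page (fewer than page_size items) signals the last page.
    """
    entities: list = []
    offset = 0
    while True:
        page = backend.get_all_entities(namespace, limit=page_size, offset=offset)
        entities.extend(page)
        if len(page) < page_size:
            return entities
        offset += page_size


# Usage with a stub backend holding 2500 items:
class StubBackend:
    def __init__(self, total: int):
        self.items = list(range(total))

    def get_all_entities(self, ns, limit, offset):
        return self.items[offset:offset + limit]


assert fetch_all_entities(StubBackend(2500), "ns", page_size=1000) == list(range(2500))
```

Unlike a hardcoded ceiling, this degrades gracefully as a namespace grows, at the cost of extra round trips.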
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 112-142: The current per-cluster flow deletes originals via
delete_entity_by_id before inserting consolidated entities with update_entities,
which risks permanent data loss on insert failure and leaves partial progress
unreported; change the per-cluster ordering to first call combine_cluster, then
attempt update_entities for new_entities and only upon successful insert delete
the original entities with delete_entity_by_id; additionally wrap
combine_cluster and update_entities in try/except to catch and log failures,
skip failed clusters (accumulate tips_before/tips_after and clusters_found for
only successfully processed clusters), and return a ConsolidationResult
reflecting partial progress; if you must still raise KaizenException, enrich it
with progress context (counts and cluster index/IDs) so callers can recover.
In `@kaizen/llm/tips/clustering.py`:
- Around line 160-194: The module-level flag
litellm.enable_json_schema_validation is being set inside the retry loop causing
race conditions and redundant assignments; move the single assignment
"litellm.enable_json_schema_validation = constrained_decoding_supported" to just
before the for-attempt loop in combine_cluster (or the enclosing function in
kaizen/llm/tips/clustering.py), remove the two in-loop assignments at the sites
that currently set it (the branches that use constrained_decoding_supported),
and leave the rest of the retry logic (the calls to completion(...),
clean_llm_response, and
TipGenerationResponse.model_validate(json.loads(...)).tips) unchanged so schema
validation is established once per call and not toggled concurrently.
In `@tests/unit/test_combine_tips.py`:
- Around line 59-201: Add the pytest unit marker to every test function in this
file: decorate test_combine_cluster_returns_tips,
test_combine_cluster_retries_on_failure,
test_combine_cluster_raises_after_max_retries,
test_combine_cluster_uses_structured_output,
test_consolidate_tips_deletes_originals_and_inserts_new, and
test_consolidate_tips_returns_correct_counts with `@pytest.mark.unit` (and ensure
pytest is imported at top of the file if not already) so the `unit`
selection/exclusion filter recognizes these tests.
@jayaramkr Please address the coderabbit reviews. If they're not correct or valid, add a comment saying why.
- Cache SentenceTransformer via lru_cache to avoid re-loading on every call
- Guard O(n²) clustering with MAX_CLUSTER_ENTITIES=5000 cap
- Vectorize pair scan with np.triu + np.where instead of nested loops
- Use >= for inclusive threshold comparison
- Extract DEFAULT_TASK_DESCRIPTION constant for consistent fallback
- Reorder consolidate_tips to insert before delete to prevent data loss
- Add per-cluster error handling so failed clusters are skipped gracefully
- Hoist litellm.enable_json_schema_validation above the retry loop
- Add configurable limit param to cluster_tips
- Make TipGenerationResult and ConsolidationResult frozen dataclasses
- Add @pytest.mark.unit to all clustering/combining test classes
- Strengthen test_uses_default_embedding_model assertion
Actionable comments posted: 1
🧹 Nitpick comments (2)
kaizen/llm/tips/clustering.py (1)
142-176: `prompt_file.read_text()` and `Template(...)` rebuilt on every `combine_cluster` call.

`consolidate_tips` calls `combine_cluster` in a loop — once per cluster. Each iteration reads the file from disk and re-compiles the Jinja2 `Template`. The file content is constant; hoisting to a module-level constant eliminates the repeated I/O and compilation overhead.

♻️ Proposed refactor — cache the compiled template at module level

```diff
+_COMBINE_TIPS_TEMPLATE = Template(
+    (Path(__file__).parent / "prompts/combine_tips.jinja2").read_text()
+)
+
 def combine_cluster(entities: list[RecordedEntity]) -> list[Tip]:
     ...
-    prompt_file = Path(__file__).parent / "prompts/combine_tips.jinja2"
     ...
-    prompt = Template(prompt_file.read_text()).render(
+    prompt = _COMBINE_TIPS_TEMPLATE.render(
         task_descriptions=task_descriptions,
         tips=tips,
         constrained_decoding_supported=constrained_decoding_supported,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 142 - 176, The code repeatedly reads and compiles the Jinja2 template inside combine_cluster via prompt_file.read_text() and Template(...); hoist the compiled template to a module-level constant (e.g., COMBINE_TIPS_TEMPLATE = Template((Path(__file__).parent / "prompts/combine_tips.jinja2").read_text())) and use that compiled template in combine_cluster (replace Template(...).render(...) with COMBINE_TIPS_TEMPLATE.render(...)), removing the per-call prompt_file and Template creation; ensure the module imports Template and Path remain available and update any tests/uses that assume the old local variable name.

tests/unit/test_combine_tips.py (1)
155-171: Test comment overstates what's verified — call ORDER is not actually asserted.

The comment on Line 158 says "Verify insert was called before deletes (insert-first for safety)", but the assertions only check `call_count` and `call_args` — not the relative order of `update_entities` vs `delete_entity_by_id` calls. The key safety invariant (insert must precede any delete) is untested.

To actually verify ordering, attach both mocks to a shared `MagicMock` manager and inspect its `mock_calls` list:

♻️ Proposed ordering assertion

```diff
-    # Verify insert was called before deletes (insert-first for safety)
-    assert mock_backend.update_entities.call_count == 1
-    call_args = mock_backend.update_entities.call_args
-    ns_id, new_entities, enable_cr = call_args[0]
-    assert ns_id == "test-ns"
-    assert len(new_entities) == 1
-    assert new_entities[0].content == "Combined tip"
-    assert new_entities[0].metadata["task_description"] == "error handling"
-    assert enable_cr is False
-
-    # Verify deletes were called for each original entity
-    assert mock_backend.delete_entity_by_id.call_count == 2
-    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "1")
-    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "2")
+    # Verify call ORDER: insert must precede all deletes
+    manager = MagicMock()
+    manager.attach_mock(mock_backend.update_entities, "update_entities")
+    manager.attach_mock(mock_backend.delete_entity_by_id, "delete_entity_by_id")
+
+    with patch.object(client, "cluster_tips", return_value=[entities_cluster]):
+        client.consolidate_tips("test-ns")
+
+    call_names = [c[0] for c in manager.mock_calls]
+    assert call_names.index("update_entities") < call_names.index("delete_entity_by_id"), \
+        "update_entities must be called before any delete_entity_by_id"
+
+    # Existing content/arg assertions
+    _, new_entities, enable_cr = mock_backend.update_entities.call_args[0]
+    assert len(new_entities) == 1
+    assert new_entities[0].content == "Combined tip"
+    assert new_entities[0].metadata["task_description"] == "error handling"
+    assert enable_cr is False
+    assert mock_backend.delete_entity_by_id.call_count == 2
+    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "1")
+    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "2")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_combine_tips.py` around lines 155 - 171, The test comment claims it verifies that update (insert) happens before deletes but only checks call counts; change the assertions to assert ordering by inspecting the shared mock's mock_calls: use the same mock_backend (or wrap both methods on a single MagicMock) and after client.consolidate_tips("test-ns") find the index of the update_entities call in mock_backend.mock_calls and ensure it is less than the index of the first delete_entity_by_id call; keep existing checks for ns_id/new_entities/enable_cr and delete call args but replace the current non-ordering assertions with the mock_calls order check so update_entities is asserted to occur before delete_entity_by_id calls (references: consolidate_tips, client.cluster_tips, mock_backend.update_entities, mock_backend.delete_entity_by_id).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 118-153: The current single try/except around combine_cluster,
update_entities, and the loop calling delete_entity_by_id can hide post-insert
delete failures and leave duplicates; refactor by separating into two guarded
scopes: (1) try/except around combine_cluster and the creation + call to
update_entities so failures there skip the cluster and do not change counters,
and on success immediately increment clusters_found, tips_before, and
tips_after; (2) a second try/except around the deletion loop that calls
delete_entity_by_id for each entity which logs delete-specific errors (including
entity IDs) but does not roll back the successful insert or decrement the
counters; keep references to combine_cluster, update_entities,
delete_entity_by_id, clusters_found, tips_before, and tips_after to locate the
changes.
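The two guarded scopes described above can be sketched like this; the function, backend methods, counters, and entity shape are illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def consolidate_clusters(backend, namespace: str, clusters, combine):
    """Phase 1: combine + insert; any failure skips the cluster without
    touching counters. Phase 2: delete originals; failures are logged
    per entity but never roll back the successful insert."""
    clusters_found = tips_before = tips_after = 0
    for cluster in clusters:
        try:
            new_entities = combine(cluster)          # may raise (LLM/parse errors)
            if not new_entities:
                continue                             # nothing to insert: keep originals
            backend.update_entities(namespace, new_entities)
        except Exception:
            logger.exception("skipping cluster after combine/insert failure")
            continue
        clusters_found += 1
        tips_before += len(cluster)
        tips_after += len(new_entities)
        for entity in cluster:                       # phase 2: best-effort deletes
            try:
                backend.delete_entity_by_id(namespace, entity["id"])
            except Exception:
                logger.exception("failed to delete original entity %s", entity["id"])
    return clusters_found, tips_before, tips_after


# Usage with a recording stub backend:
class Backend:
    def __init__(self):
        self.inserted, self.deleted = [], []

    def update_entities(self, ns, ents):
        self.inserted.extend(ents)

    def delete_entity_by_id(self, ns, eid):
        self.deleted.append(eid)


b = Backend()
result = consolidate_clusters(b, "ns", [[{"id": "1"}, {"id": "2"}]], lambda c: ["merged"])
assert result == (1, 2, 1)
assert b.inserted == ["merged"] and b.deleted == ["1", "2"]
```

Counters only advance after a successful insert, so a partial run reports exactly the clusters that were actually consolidated.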
- Separate consolidate_tips into two guarded scopes: combine+insert failures skip the cluster, while delete failures are logged independently without rolling back the successful insert
- Hoist Jinja2 template compilation to module-level constant _COMBINE_TIPS_TEMPLATE to avoid re-reading/compiling on every call
- Assert insert-before-delete ordering in test via mock_calls index comparison instead of just checking call counts
Actionable comments posted: 3
🧹 Nitpick comments (1)
kaizen/llm/tips/clustering.py (1)
209-212: Redundant `continue` in the retry `except` block.

`if attempt < 2: continue` has no effect — after the `except` block completes, the `for` loop advances to the next iteration naturally. The only non-redundant path is `attempt == 2`, where `continue` is already gated out. Remove for clarity.

♻️ Proposed cleanup

```diff
         except Exception as e:
             last_error = e
-            if attempt < 2:
-                continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 209 - 212, In the except block that sets last_error = e, remove the redundant conditional "if attempt < 2: continue" because the for-loop will naturally proceed to the next iteration; keep the last_error assignment and any outer retry/raise logic intact (look for the except block where last_error and attempt are used in clustering.py to ensure only the unnecessary continue is deleted so behavior on the final attempt remains unchanged).
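For illustration, a minimal retry loop (hypothetical, not kaizen's actual code) showing why the trailing `continue` is redundant:

```python
# Hypothetical retry loop: after the except block ends, the for-loop moves to
# the next attempt on its own, so `if attempt < 2: continue` adds nothing.
def flaky(attempt: int) -> str:
    if attempt < 2:
        raise RuntimeError(f"attempt {attempt} failed")
    return "ok"

last_error = None
result = None
for attempt in range(3):
    try:
        result = flaky(attempt)
        break
    except Exception as e:
        last_error = e
        # no explicit `continue` needed here: the loop advances by itself

print(result)  # succeeds on the third attempt
```

Behavior on the final attempt is unchanged either way, which is why removing the conditional is purely a clarity fix.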
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 96-97: cluster_tips silently truncates guideline entities when
get_all_entities hits the supplied limit (default 10_000); update cluster_tips
to detect when the returned entities length is >= the limit passed to
get_all_entities (call site symbol: cluster_tips, helper symbol:
get_all_entities) and surface a warning or error to the operator before calling
cluster_entities so truncated data is not hidden—mirror the behavior used for
cluster_entities' MAX_CLUSTER_ENTITIES cap (symbol: cluster_entities,
MAX_CLUSTER_ENTITIES) by logging a clear warning (or raising a specific
exception) that the limit was hit and suggesting increasing the limit or using
pagination.
- Around line 121-155: The consolidation flow currently deletes originals even
when combine_cluster returns an empty consolidated_tips, causing data loss;
modify the logic in the consolidation loop (where consolidated_tips and
new_entities are computed and update_entities is called) to check for an empty
consolidated_tips (or new_entities) and skip the entire Phase 2 deletion step
for that cluster (i.e., continue the loop) when no consolidated tips were
produced, ensuring delete_entity_by_id is only called if replacement entities
were successfully inserted.
In `@kaizen/llm/tips/clustering.py`:
- Around line 196-208: Guard against a None .message.content before calling
clean_llm_response or json.loads: after getting the LLM result from
completion(...).choices[0].message (the variable currently assigned to
response), check if message.content is None and if so raise a KaizenException
(or return a clear error) describing that the LLM returned no content (include
finish_reason or relevant metadata if available) instead of letting
clean_llm_response(response) or json.loads(clean_response) raise
AttributeError/TypeError; apply the same guard to both the constrained and
unconstrained branches that feed
TipGenerationResponse.model_validate(json.loads(...)) so the higher-level error
is meaningful.
---
Duplicate comments:
In `@kaizen/llm/tips/clustering.py`:
- Line 180: The code currently mutates the global
litellm.enable_json_schema_validation flag inside combine_cluster which is racy
under concurrent calls; stop changing the process-wide
litellm.enable_json_schema_validation and instead perform JSON schema validation
locally per-response (e.g., call your JSON validator directly on the LLM
response using the cluster schema or use a per-call option if LiteLLM exposes
one). Locate the mutation of litellm.enable_json_schema_validation in
combine_cluster and remove it, then validate the response with an
application-level validator (jsonvalidator/jsonschema) before parsing/using the
JSON so each caller validates independently without touching the global.
---
Nitpick comments:
In `@kaizen/llm/tips/clustering.py`:
- Around line 209-212: In the except block that sets last_error = e, remove the
redundant conditional "if attempt < 2: continue" because the for-loop will
naturally proceed to the next iteration; keep the last_error assignment and any
outer retry/raise logic intact (look for the except block where last_error and
attempt are used in clustering.py to ensure only the unnecessary continue is
deleted so behavior on the final attempt remains unchanged).
```python
entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
return cluster_entities(entities, threshold=threshold)
```
Silent truncation when guideline entity count reaches limit.
get_all_entities is called with limit=10_000 by default, but if the actual count equals or exceeds that limit, entities beyond the boundary are silently dropped before clustering. Unlike cluster_entities, which logs a warning when its MAX_CLUSTER_ENTITIES cap is hit, cluster_tips provides no signal to the caller or operator.
💡 Suggested improvement
```diff
 entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
+if len(entities) == limit:
+    logger.warning(
+        "cluster_tips fetched exactly %d entities; results may be truncated.",
+        limit,
+    )
 return cluster_entities(entities, threshold=threshold)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
if len(entities) == limit:
    logger.warning(
        "cluster_tips fetched exactly %d entities; results may be truncated.",
        limit,
    )
return cluster_entities(entities, threshold=threshold)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kaizen/frontend/client/kaizen_client.py` around lines 96 - 97, cluster_tips
silently truncates guideline entities when get_all_entities hits the supplied
limit (default 10_000); update cluster_tips to detect when the returned entities
length is >= the limit passed to get_all_entities (call site symbol:
cluster_tips, helper symbol: get_all_entities) and surface a warning or error to
the operator before calling cluster_entities so truncated data is not
hidden—mirror the behavior used for cluster_entities' MAX_CLUSTER_ENTITIES cap
(symbol: cluster_entities, MAX_CLUSTER_ENTITIES) by logging a clear warning (or
raising a specific exception) that the limit was hit and suggesting increasing
the limit or using pagination.
```python
consolidated_tips = combine_cluster(cluster)

task_description = (cluster[0].metadata or {}).get("task_description", "")
new_entities = [
    Entity(
        content=tip.content,
        type="guideline",
        metadata={
            "task_description": task_description,
            "rationale": tip.rationale,
            "category": tip.category,
            "trigger": tip.trigger,
        },
    )
    for tip in consolidated_tips
]
if new_entities:
    self.update_entities(namespace_id, new_entities, enable_conflict_resolution=False)
except Exception:
    logger.warning(
        "Failed to consolidate cluster of %d entities (IDs: %s); skipping.",
        len(cluster),
        [e.id for e in cluster],
        exc_info=True,
    )
    continue

clusters_found += 1
tips_before += len(cluster)
tips_after += len(consolidated_tips)

# Phase 2: delete originals (log errors but don't roll back insert)
for entity in cluster:
    try:
        self.delete_entity_by_id(namespace_id, entity.id)
```
Empty consolidated_tips silently deletes all originals with no replacement — data loss.
If `combine_cluster` returns an empty list (e.g., the LLM returns `{"tips": []}`, which validates successfully against `TipGenerationResponse`), `new_entities` is also empty. The `if new_entities:` guard on line 137 skips the insert, but Phase 2 (lines 153–161) still runs unconditionally and deletes every original entity in the cluster. This leaves the namespace with fewer entities than before consolidation and no consolidated replacement.
`TipGenerationResponse` has `tips: list[Tip]` — an empty list is a valid response, making this scenario plausible under adversarial prompts or model quality regressions.
🛡️ Proposed fix — skip cluster when LLM returns no tips
```diff
 try:
     consolidated_tips = combine_cluster(cluster)
+
+    if not consolidated_tips:
+        logger.warning(
+            "combine_cluster returned no tips for cluster of %d entities (IDs: %s); skipping.",
+            len(cluster),
+            [e.id for e in cluster],
+        )
+        continue

     task_description = (cluster[0].metadata or {}).get("task_description", "")
```
Verify each finding against the current code and only fix it if needed.
In `@kaizen/frontend/client/kaizen_client.py` around lines 121 - 155, The
consolidation flow currently deletes originals even when combine_cluster returns
an empty consolidated_tips, causing data loss; modify the logic in the
consolidation loop (where consolidated_tips and new_entities are computed and
update_entities is called) to check for an empty consolidated_tips (or
new_entities) and skip the entire Phase 2 deletion step for that cluster (i.e.,
continue the loop) when no consolidated tips were produced, ensuring
delete_entity_by_id is only called if replacement entities were successfully
inserted.
```python
else:
    response = (
        completion(
            model=llm_settings.tips_model,
            messages=[{"role": "user", "content": prompt}],
            custom_llm_provider=llm_settings.custom_llm_provider,
        )
        .choices[0]
        .message.content
    )
    clean_response = clean_llm_response(response)

return TipGenerationResponse.model_validate(json.loads(clean_response)).tips
```
`message.content` is not guarded against `None`, producing a misleading AttributeError on retries.
For the unconstrained path (line 204), `response` is assigned directly from `.choices[0].message.content`. If litellm returns `None` (e.g., when `finish_reason` is `"tool_calls"`), `clean_llm_response(None)` calls `None.strip()` and raises an `AttributeError`. The same applies to the constrained path (line 194), where `json.loads(None)` raises a `TypeError`. Both exceptions are caught and retried, but the chained error reaching `KaizenException` will be an opaque low-level type error rather than a meaningful content error.
🛡️ Proposed fix — guard against None content on both paths
```diff
 if constrained_decoding_supported:
     clean_response = (
         completion(
             model=llm_settings.tips_model,
             messages=[{"role": "user", "content": prompt}],
             response_format=TipGenerationResponse,
             custom_llm_provider=llm_settings.custom_llm_provider,
         )
         .choices[0]
         .message.content
     )
+    if clean_response is None:
+        raise ValueError("LLM returned None content (constrained path)")
 else:
     response = (
         completion(
             model=llm_settings.tips_model,
             messages=[{"role": "user", "content": prompt}],
             custom_llm_provider=llm_settings.custom_llm_provider,
         )
         .choices[0]
         .message.content
     )
+    if response is None:
+        raise ValueError("LLM returned None content")
     clean_response = clean_llm_response(response)
```
Verify each finding against the current code and only fix it if needed.
In `@kaizen/llm/tips/clustering.py` around lines 196 - 208, Guard against a None
.message.content before calling clean_llm_response or json.loads: after getting
the LLM result from completion(...).choices[0].message (the variable currently
assigned to response), check if message.content is None and if so raise a
KaizenException (or return a clear error) describing that the LLM returned no
content (include finish_reason or relevant metadata if available) instead of
letting clean_llm_response(response) or json.loads(clean_response) raise
AttributeError/TypeError; apply the same guard to both the constrained and
unconstrained branches that feed
TipGenerationResponse.model_validate(json.loads(...)) so the higher-level error
is meaningful.
Conflict will be resolved once PR #58 is merged.
Summary
- `cluster_entities()` in `kaizen/llm/tips/clustering.py` that embeds task descriptions via SentenceTransformer, computes pairwise cosine similarity, and groups tips into clusters using union-find with path compression
- `KAIZEN_CLUSTERING_THRESHOLD` config option (default `0.80`) added to `KaizenConfig`
- `KaizenClient.cluster_tips()` convenience method
- `kaizen entities consolidate` CLI command with Rich table output (dry-run only — next PR will add actual consolidation)
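The embed → pairwise-similarity → union-find pipeline described above can be sketched as follows. Toy 2-D vectors stand in for SentenceTransformer embeddings, and all names are illustrative, not the actual `cluster_entities` API:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.80):
    """Group vector indices whose pairwise cosine similarity meets threshold."""
    parent = list(range(len(vectors)))

    def find(i):  # union-find root lookup with path compression (halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Union every pair above the similarity threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                union(i, j)

    # Collect members by root.
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(cluster(vecs))  # → [[0, 1], [2]] — the first two vectors group together
```

Because union-find merges transitively, two tips can land in one cluster even when they are not directly similar, as long as a chain of above-threshold pairs connects them.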
Test plan
- `uv run pytest tests/unit/test_clustering.py -v` — 11 new tests pass
- `uv run pytest -v` — full suite (102 tests) passes
- `uv run ruff check kaizen/ tests/` — lint clean
- `uv run kaizen entities consolidate --help` — CLI renders correctly

Summary by CodeRabbit
New Features
Schema
Tests