feat: cluster tips by task description similarity #60
jayaramkr wants to merge 8 commits into AgentToolkit:main
Conversation
generate_tips() now returns a TipGenerationResult containing both the tips and the source task_description. Both callers (PhoenixSync and MCP save_trajectory) store task_description in tip entity metadata, enabling future clustering of tips by task similarity. Trajectories without a task description default to "Task description unknown".
Skip update_entities call when no tips are generated, aligning with the existing guard in phoenix_sync.py.
Add clustering module that embeds task descriptions via SentenceTransformer, computes pairwise cosine similarity, and groups tips using union-find.
- New `cluster_entities()` function in `kaizen/llm/tips/clustering.py`
- `KAIZEN_CLUSTERING_THRESHOLD` config (default 0.80)
- `KaizenClient.cluster_tips()` method
- `kaizen entities consolidate` CLI command (dry-run only for now)
- 11 unit tests for clustering logic and union-find
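The union-find grouping described above can be sketched as follows. The function name and the pair-based interface are illustrative, not the actual `cluster_entities` internals:

```python
def cluster_by_pairs(n: int, pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Group n items into clusters given "similar" (i, j) index pairs, via union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Path compression: point nodes at their grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i, j in pairs:
        union(i, j)

    groups: dict[int, list[int]] = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    # Singletons are excluded, mirroring the behaviour described above.
    return [members for members in groups.values() if len(members) > 1]
```

With pairs `(0, 1)` and `(1, 2)` among five items, items 0-2 form one cluster and the two unconnected items are dropped as singletons.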
📝 Walkthrough
Adds embedding-based tip clustering and LLM-driven consolidation: a new CLI command, KaizenClient methods, clustering/combine utilities and prompt, schema types carrying task_description and a consolidation summary, MCP/Phoenix flows updated to persist task_description, and associated tests.
Sequence Diagram(s)

sequenceDiagram
participant User as CLI User
participant CLI as Kaizen CLI
participant Client as KaizenClient
participant DB as Entity DB
participant Clustering as Clustering Engine
participant LLM as LLM/Combine
User->>CLI: entities consolidate(namespace, threshold?, dry_run)
CLI->>CLI: resolve effective threshold (arg || config)
CLI->>Client: cluster_tips(namespace, threshold)
Client->>DB: fetch guideline RecordedEntity (limit)
DB-->>Client: list[RecordedEntity]
Client->>Clustering: cluster_entities(entities, threshold)
Clustering->>Clustering: embed task_descriptions, compute similarities, union-find groups
Clustering-->>Client: clusters (exclude singletons)
CLI->>CLI: render clusters (tables)
alt dry_run
CLI-->>User: show dry-run summary
else perform
CLI->>Client: consolidate_tips(namespace, threshold)
Client->>Clustering: combine_cluster(cluster) for each cluster
Clustering->>LLM: generate consolidated tips (with retries, schema if supported)
LLM-->>Clustering: consolidated tips
Client->>DB: insert consolidated tip entities
Client->>DB: delete original entities
Client-->>CLI: ConsolidationResult (counts)
CLI-->>User: show consolidation result
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (6)
kaizen/llm/tips/clustering.py (3)

76-78: SentenceTransformer is instantiated on every call.

Loading the model from disk (or downloading it) on each invocation of `cluster_entities` is expensive. If this function is called repeatedly (e.g., in a loop or API handler), consider caching the model instance, for example via `functools.lru_cache` on model name or a module-level cache. For a single CLI invocation this is fine, but flag for future consideration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 76 - 78, cluster_entities currently instantiates SentenceTransformer(embedding_model) on every call which is expensive; cache the model instead (e.g., add a module-level dict or use functools.lru_cache keyed by embedding_model) and retrieve the cached SentenceTransformer for use by model.encode(descriptions, normalize_embeddings=True) before computing similarity_matrix so repeated calls reuse the loaded model.
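The caching suggestion can be sketched with `functools.lru_cache`; the loader body below is a stand-in (the real code would construct `SentenceTransformer(model_name)`):

```python
from functools import lru_cache

LOAD_COUNT = 0  # instrumentation to show the cache works


@lru_cache(maxsize=4)
def get_model(model_name: str):
    """Load the embedding model once per name; cached thereafter.

    In clustering.py the body would be `return SentenceTransformer(model_name)`.
    """
    global LOAD_COUNT
    LOAD_COUNT += 1
    return object()  # stand-in for the loaded model


m1 = get_model("all-MiniLM-L6-v2")
m2 = get_model("all-MiniLM-L6-v2")
assert m1 is m2 and LOAD_COUNT == 1  # second call reused the cached instance
```

Because `lru_cache` keys on the argument, different model names still get their own instance.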
80-86: O(n²) pairwise scan with up to 10,000 entities.

The caller (`cluster_tips`) fetches up to 10,000 guideline entities. With n=10,000, the similarity matrix is ~800 MB (float64) and the nested loop iterates ~50M pairs. Consider adding an upper-bound check or warning, or using a more scalable approach (e.g., FAISS approximate nearest neighbors) if large entity counts are expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 80 - 86, The current O(n²) pair scan in clustering.py (variables: filtered, similarity_matrix, pairs) will blow up for n≈10k; update cluster_tips to guard large inputs by adding a MAX_ENTITIES constant and handling n > MAX_ENTITIES: either (a) log/warn and early-return or downsample/filter to a smaller subset, or (b) switch to an approximate nearest-neighbor path (e.g., FAISS or sklearn NearestNeighbors with radius/top-k) to compute only top-k or radius neighbors per vector instead of materializing the full similarity_matrix and nested loops; implement one of these strategies and document the change in cluster_tips so pairs is built from ANN results rather than the full O(n²) scan.
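A minimal sketch of the guarded, vectorized alternative this prompt describes; the `MAX_CLUSTER_ENTITIES` cap and function name are illustrative:

```python
import numpy as np

MAX_CLUSTER_ENTITIES = 5000  # assumed cap, per the suggestion above


def similar_pairs(similarity_matrix: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Vectorized upper-triangle scan instead of a nested Python loop."""
    n = similarity_matrix.shape[0]
    if n > MAX_CLUSTER_ENTITIES:
        raise ValueError(f"refusing to cluster {n} entities (cap {MAX_CLUSTER_ENTITIES})")
    # Boolean mask of the strict upper triangle (k=1 excludes the diagonal).
    mask = np.triu(similarity_matrix >= threshold, k=1)
    rows, cols = np.where(mask)
    return list(zip(rows.tolist(), cols.tolist()))


sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
assert similar_pairs(sim, 0.8) == [(0, 1)]
```

The mask avoids materializing any Python-level loop; only the index arrays from `np.where` are converted back to tuples.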
85-85: Strict `>` excludes pairs at exactly the threshold value.

`similarity_matrix[i, j] > threshold` means entities with similarity exactly equal to the threshold (e.g., 0.80) won't be clustered. This is a minor semantic decision — consider using `>=` if the intent is "at least this similar".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` at line 85, The comparison currently uses a strict greater-than which excludes pairs equal to the threshold; update the conditional that checks similarity_matrix[i, j] against threshold to use a non-strict comparison (>=) so entities with similarity exactly equal to threshold are included—locate the loop/conditional referencing similarity_matrix and threshold (indices i, j) and change the operator from '>' to '>='.

tests/unit/test_clustering.py (2)
157-159: Weak assertion — `len(call_arg) > 0` will pass for any non-empty model string.

The `or len(call_arg) > 0` clause makes this assertion trivially true. If the intent is to verify the default model from `milvus_other_settings.embedding_model` is used, patch `kaizen.llm.tips.clustering.milvus_other_settings` and assert the exact value.

Proposed fix

```diff
-    mock_st_cls.assert_called_once()
-    call_arg = mock_st_cls.call_args[0][0]
-    assert "MiniLM" in call_arg or len(call_arg) > 0
+    mock_st_cls.assert_called_once()
+    call_arg = mock_st_cls.call_args[0][0]
+    # Assert a specific expected model name from the config
+    assert isinstance(call_arg, str) and len(call_arg) > 0
+    # Ideally: patch milvus_other_settings and assert call_arg == expected_model
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_clustering.py` around lines 157 - 159, The assertion is weak because `or len(call_arg) > 0` is trivially true; instead patch `milvus_other_settings.embedding_model` to a known expected value and assert that the constructor was called with that exact model string: update the test to set `milvus_other_settings.embedding_model = "<expected_model>"` (or use a monkeypatch/patch fixture), retrieve `call_arg = mock_st_cls.call_args[0][0]`, and replace the `assert "MiniLM" in call_arg or len(call_arg) > 0` with `assert call_arg == "<expected_model>"` (or assert it equals the patched value) to verify the default model is used.
31-56: Missing `@pytest.mark.unit` marker on test classes.

Both `TestUnionFind` and `TestClusterEntities` lack the `@pytest.mark.unit` decorator. As per coding guidelines, "Use pytest markers `e2e` for end-to-end tests and `unit` for unit tests in test files".

Proposed fix

```diff
+import pytest
+
+@pytest.mark.unit
 class TestUnionFind:
```

And similarly for `TestClusterEntities`:

```diff
+@pytest.mark.unit
 @patch("kaizen.llm.tips.clustering.SentenceTransformer")
 class TestClusterEntities:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_clustering.py` around lines 31 - 56, Add the pytest unit marker to the test classes by decorating TestUnionFind and TestClusterEntities with `@pytest.mark.unit` (and add "import pytest" at the top if it's not present); ensure the decorator sits directly above each class definition so all tests exercising _union_find and related cluster tests are marked as unit tests per the project guideline.

kaizen/llm/tips/tips.py (1)
94-94: Nit: Default string inconsistency across the codebase.

The default here is "Task description unknown". Verify this matches the default used in the MCP `save_trajectory` path mentioned in the PR objectives, to ensure consistent metadata values for downstream clustering.

```shell
#!/bin/bash
# Check for all occurrences of the default task description string
rg -n "Task description unknown" --type=py
rg -n "Unknown task" --type=py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/tips.py` at line 94, The default task instruction string used in the tips generator ("task_instruction": task_instruction or "Task description unknown") is inconsistent with the MCP save_trajectory default; update this to use the canonical default used by the MCP (the same string literal used in save_trajectory) so metadata is consistent for downstream clustering—open the tips.py entry that sets "task_instruction" and replace "Task description unknown" with the MCP's canonical default string (or better, import/reference a shared constant if available, e.g., a DEFAULT_TASK_DESCRIPTION constant used by save_trajectory).
Add combine_cluster() that calls an LLM to merge overlapping tips in each cluster into fewer consolidated guidelines, with a 3-retry loop. Wire up the consolidate_tips() client method and CLI --no-dry-run path to delete originals and insert the combined results.
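The 3-retry loop described above might look roughly like this; the names and error handling are illustrative, not the actual `combine_cluster` code:

```python
import json

MAX_ATTEMPTS = 3  # matches the 3-retry loop described above


def call_with_retries(prompt: str, llm_call, parse):
    """Call the LLM, parse/validate the response, and retry on failure
    up to MAX_ATTEMPTS times before giving up."""
    last_error: Exception | None = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            raw = llm_call(prompt)
            return parse(raw)  # e.g. JSON decode + schema validation
        except Exception as e:
            last_error = e
    raise RuntimeError(f"failed after {MAX_ATTEMPTS} attempts") from last_error


# Usage with a stub that fails twice before returning valid JSON:
calls = {"n": 0}


def flaky(_prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("bad JSON")
    return '["tip-a", "tip-b"]'


assert call_with_retries("merge these tips", flaky, json.loads) == ["tip-a", "tip-b"]
```

Only the final failure surfaces an exception; intermediate parse errors are retained as `last_error` for the chained traceback.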
Actionable comments posted: 3
🧹 Nitpick comments (4)
kaizen/schema/tips.py (1)
17-31: LGTM.

The two dataclasses are well-scoped. Optional improvement: `@dataclass(frozen=True)` would make both result types immutable, which is appropriate for value objects returned from operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/schema/tips.py` around lines 17 - 31, Make the two result dataclasses immutable by adding frozen=True to their decorators: change `@dataclass` above TipGenerationResult and ConsolidationResult to `@dataclass(frozen=True)` so instances become hashable and cannot be mutated after creation; ensure no existing code mutates attributes of TipGenerationResult or ConsolidationResult (update callers if needed) to avoid breaking assignments.

kaizen/llm/tips/clustering.py (2)
86-88: `SentenceTransformer` is reinstantiated on every `cluster_tips` call.

Model loading involves disk I/O and potentially network (first download). Cache the model instance keyed by name to avoid repeated load overhead.

♻️ Proposed module-level cache

```diff
+_model_cache: dict[str, SentenceTransformer] = {}
+
+def _get_model(name: str) -> SentenceTransformer:
+    if name not in _model_cache:
+        _model_cache[name] = SentenceTransformer(name)
+    return _model_cache[name]

 # inside cluster_entities:
-    model = SentenceTransformer(embedding_model)
+    model = _get_model(embedding_model)
     embeddings = model.encode(descriptions, normalize_embeddings=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 86 - 88, cluster_tips currently re-instantiates SentenceTransformer on every call which causes repeated disk/network I/O; add a module-level cache (e.g., a dict) keyed by embedding_model name and reuse cached SentenceTransformer instances instead of creating a new one each time: in cluster_tips look up embedding_model in the cache, if missing create SentenceTransformer(embedding_model) and store it, then call model.encode(descriptions, normalize_embeddings=True) as before; reference symbols: cluster_tips, SentenceTransformer, embedding_model, model.encode.
90-96: O(n²) Python loop over the similarity matrix is the bottleneck at scale.

With the 10,000-entity cap in `cluster_tips`, the upper triangle has ~50 million iterations in pure Python. The similarity matrix is already a numpy array — replace the loop with a vectorized boolean mask:

♻️ Proposed vectorized pair extraction

```diff
-    # Find pairs exceeding threshold
-    n = len(filtered)
-    pairs: list[tuple[int, int]] = []
-    for i in range(n):
-        for j in range(i + 1, n):
-            if similarity_matrix[i, j] > threshold:
-                pairs.append((i, j))
+    # Find pairs exceeding threshold (vectorized)
+    n = len(filtered)
+    mask = np.triu(similarity_matrix > threshold, k=1)
+    rows, cols = np.where(mask)
+    pairs = list(zip(rows.tolist(), cols.tolist()))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 90 - 96, The nested Python loops that build pairs from similarity_matrix (in cluster_tips, iterating n=len(filtered) and appending to pairs) are O(n²) and must be replaced with a vectorized numpy extraction: compute a boolean mask of the upper triangle (use numpy.triu with k=1) applied to similarity_matrix > threshold, then get index arrays via numpy.where and convert them into the list of (i,j) tuples (or keep as two arrays if downstream can accept that) to populate pairs; update references to pairs and ensure the resulting order/type matches existing expectations in cluster_tips and any callers.

kaizen/frontend/client/kaizen_client.py (1)
91-92: Hardcoded `limit=10000` is an invisible ceiling.

Namespaces that grow beyond 10,000 guideline entities will be silently under-clustered. Consider either exposing this as a parameter or documenting it in the docstring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/frontend/client/kaizen_client.py` around lines 91 - 92, The call in kaizen_client.py hardcodes limit=10000 when calling get_all_entities, which silently caps guideline clustering; change the call to avoid the fixed ceiling by adding an optional parameter (e.g., limit or max_entities) to the method that contains this code and pass it through to get_all_entities, or implement iterative pagination using get_all_entities until all entities are retrieved, then call cluster_entities(entities, threshold=threshold); also update the method docstring to document the new parameter/behavior so callers know how to control or expect the limit.
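The pagination alternative mentioned in this prompt could be sketched as follows. It assumes the backend accepts `limit`/`offset` keywords, which may not match the real `get_all_entities` signature:

```python
def fetch_all_entities(backend, namespace: str, page_size: int = 1000) -> list:
    """Fetch every entity page by page instead of relying on a fixed limit.

    A short page (fewer than page_size items) signals the last page.
    """
    entities: list = []
    offset = 0
    while True:
        page = backend.get_all_entities(namespace, limit=page_size, offset=offset)
        entities.extend(page)
        if len(page) < page_size:
            return entities
        offset += page_size


# Usage with a stub backend holding 2500 items:
class StubBackend:
    def __init__(self, total: int):
        self.items = list(range(total))

    def get_all_entities(self, ns, limit, offset):
        return self.items[offset:offset + limit]


assert fetch_all_entities(StubBackend(2500), "ns", page_size=1000) == list(range(2500))
```

Unlike a hardcoded ceiling, this degrades gracefully as a namespace grows, at the cost of extra round trips.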
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 112-142: The current per-cluster flow deletes originals via
delete_entity_by_id before inserting consolidated entities with update_entities,
which risks permanent data loss on insert failure and leaves partial progress
unreported; change the per-cluster ordering to first call combine_cluster, then
attempt update_entities for new_entities and only upon successful insert delete
the original entities with delete_entity_by_id; additionally wrap
combine_cluster and update_entities in try/except to catch and log failures,
skip failed clusters (accumulate tips_before/tips_after and clusters_found for
only successfully processed clusters), and return a ConsolidationResult
reflecting partial progress; if you must still raise KaizenException, enrich it
with progress context (counts and cluster index/IDs) so callers can recover.
In `@kaizen/llm/tips/clustering.py`:
- Around line 160-194: The module-level flag
litellm.enable_json_schema_validation is being set inside the retry loop causing
race conditions and redundant assignments; move the single assignment
"litellm.enable_json_schema_validation = constrained_decoding_supported" to just
before the for-attempt loop in combine_cluster (or the enclosing function in
kaizen/llm/tips/clustering.py), remove the two in-loop assignments at the sites
that currently set it (the branches that use constrained_decoding_supported),
and leave the rest of the retry logic (the calls to completion(...),
clean_llm_response, and
TipGenerationResponse.model_validate(json.loads(...)).tips) unchanged so schema
validation is established once per call and not toggled concurrently.
In `@tests/unit/test_combine_tips.py`:
- Around line 59-201: Add the pytest unit marker to every test function in this
file: decorate test_combine_cluster_returns_tips,
test_combine_cluster_retries_on_failure,
test_combine_cluster_raises_after_max_retries,
test_combine_cluster_uses_structured_output,
test_consolidate_tips_deletes_originals_and_inserts_new, and
test_consolidate_tips_returns_correct_counts with `@pytest.mark.unit` (and ensure
pytest is imported at top of the file if not already) so the `unit`
selection/exclusion filter recognizes these tests.
@jayaramkr Please address the coderabbit reviews. If they're not correct or valid, add a comment saying why.
- Cache SentenceTransformer via lru_cache to avoid re-loading on every call
- Guard O(n²) clustering with MAX_CLUSTER_ENTITIES=5000 cap
- Vectorize pair scan with np.triu + np.where instead of nested loops
- Use >= for inclusive threshold comparison
- Extract DEFAULT_TASK_DESCRIPTION constant for consistent fallback
- Reorder consolidate_tips to insert before delete to prevent data loss
- Add per-cluster error handling so failed clusters are skipped gracefully
- Hoist litellm.enable_json_schema_validation above the retry loop
- Add configurable limit param to cluster_tips
- Make TipGenerationResult and ConsolidationResult frozen dataclasses
- Add @pytest.mark.unit to all clustering/combining test classes
- Strengthen test_uses_default_embedding_model assertion
Actionable comments posted: 1
🧹 Nitpick comments (2)
kaizen/llm/tips/clustering.py (1)
142-176: `prompt_file.read_text()` and `Template(...)` rebuilt on every `combine_cluster` call.

`consolidate_tips` calls `combine_cluster` in a loop — once per cluster. Each iteration reads the file from disk and re-compiles the Jinja2 `Template`. The file content is constant; hoisting to a module-level constant eliminates the repeated I/O and compilation overhead.

♻️ Proposed refactor — cache the compiled template at module level

```diff
+_COMBINE_TIPS_TEMPLATE = Template(
+    (Path(__file__).parent / "prompts/combine_tips.jinja2").read_text()
+)
+
 def combine_cluster(entities: list[RecordedEntity]) -> list[Tip]:
     ...
-    prompt_file = Path(__file__).parent / "prompts/combine_tips.jinja2"
     ...
-    prompt = Template(prompt_file.read_text()).render(
+    prompt = _COMBINE_TIPS_TEMPLATE.render(
         task_descriptions=task_descriptions,
         tips=tips,
         constrained_decoding_supported=constrained_decoding_supported,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 142 - 176, The code repeatedly reads and compiles the Jinja2 template inside combine_cluster via prompt_file.read_text() and Template(...); hoist the compiled template to a module-level constant (e.g., COMBINE_TIPS_TEMPLATE = Template((Path(__file__).parent / "prompts/combine_tips.jinja2").read_text())) and use that compiled template in combine_cluster (replace Template(...).render(...) with COMBINE_TIPS_TEMPLATE.render(...)), removing the per-call prompt_file and Template creation; ensure the module imports Template and Path remain available and update any tests/uses that assume the old local variable name.

tests/unit/test_combine_tips.py (1)
155-171: Test comment overstates what's verified — call ORDER is not actually asserted.

The comment on Line 158 says "Verify insert was called before deletes (insert-first for safety)", but the assertions only check `call_count` and `call_args` — not the relative order of `update_entities` vs `delete_entity_by_id` calls. The key safety invariant (insert must precede any delete) is untested.

To actually verify ordering, attach both mocks to a shared `MagicMock` manager and inspect its `mock_calls` list:

♻️ Proposed ordering assertion

```diff
-    # Verify insert was called before deletes (insert-first for safety)
-    assert mock_backend.update_entities.call_count == 1
-    call_args = mock_backend.update_entities.call_args
-    ns_id, new_entities, enable_cr = call_args[0]
-    assert ns_id == "test-ns"
-    assert len(new_entities) == 1
-    assert new_entities[0].content == "Combined tip"
-    assert new_entities[0].metadata["task_description"] == "error handling"
-    assert enable_cr is False
-
-    # Verify deletes were called for each original entity
-    assert mock_backend.delete_entity_by_id.call_count == 2
-    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "1")
-    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "2")
+    # Verify call ORDER: insert must precede all deletes
+    manager = MagicMock()
+    manager.attach_mock(mock_backend.update_entities, "update_entities")
+    manager.attach_mock(mock_backend.delete_entity_by_id, "delete_entity_by_id")
+
+    with patch.object(client, "cluster_tips", return_value=[entities_cluster]):
+        client.consolidate_tips("test-ns")
+
+    call_names = [c[0] for c in manager.mock_calls]
+    assert call_names.index("update_entities") < call_names.index("delete_entity_by_id"), \
+        "update_entities must be called before any delete_entity_by_id"
+
+    # Existing content/arg assertions
+    _, new_entities, enable_cr = mock_backend.update_entities.call_args[0]
+    assert len(new_entities) == 1
+    assert new_entities[0].content == "Combined tip"
+    assert new_entities[0].metadata["task_description"] == "error handling"
+    assert enable_cr is False
+    assert mock_backend.delete_entity_by_id.call_count == 2
+    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "1")
+    mock_backend.delete_entity_by_id.assert_any_call("test-ns", "2")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_combine_tips.py` around lines 155 - 171, The test comment claims it verifies that update (insert) happens before deletes but only checks call counts; change the assertions to assert ordering by inspecting the shared mock's mock_calls: use the same mock_backend (or wrap both methods on a single MagicMock) and after client.consolidate_tips("test-ns") find the index of the update_entities call in mock_backend.mock_calls and ensure it is less than the index of the first delete_entity_by_id call; keep existing checks for ns_id/new_entities/enable_cr and delete call args but replace the current non-ordering assertions with the mock_calls order check so update_entities is asserted to occur before delete_entity_by_id calls (references: consolidate_tips, client.cluster_tips, mock_backend.update_entities, mock_backend.delete_entity_by_id).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 118-153: The current single try/except around combine_cluster,
update_entities, and the loop calling delete_entity_by_id can hide post-insert
delete failures and leave duplicates; refactor by separating into two guarded
scopes: (1) try/except around combine_cluster and the creation + call to
update_entities so failures there skip the cluster and do not change counters,
and on success immediately increment clusters_found, tips_before, and
tips_after; (2) a second try/except around the deletion loop that calls
delete_entity_by_id for each entity which logs delete-specific errors (including
entity IDs) but does not roll back the successful insert or decrement the
counters; keep references to combine_cluster, update_entities,
delete_entity_by_id, clusters_found, tips_before, and tips_after to locate the
changes.
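The two guarded scopes described above can be sketched like this; the function, backend methods, counters, and entity shape are illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def consolidate_clusters(backend, namespace: str, clusters, combine):
    """Phase 1: combine + insert; any failure skips the cluster without
    touching counters. Phase 2: delete originals; failures are logged
    per entity but never roll back the successful insert."""
    clusters_found = tips_before = tips_after = 0
    for cluster in clusters:
        try:
            new_entities = combine(cluster)          # may raise (LLM/parse errors)
            if not new_entities:
                continue                             # nothing to insert: keep originals
            backend.update_entities(namespace, new_entities)
        except Exception:
            logger.exception("skipping cluster after combine/insert failure")
            continue
        clusters_found += 1
        tips_before += len(cluster)
        tips_after += len(new_entities)
        for entity in cluster:                       # phase 2: best-effort deletes
            try:
                backend.delete_entity_by_id(namespace, entity["id"])
            except Exception:
                logger.exception("failed to delete original entity %s", entity["id"])
    return clusters_found, tips_before, tips_after


# Usage with a recording stub backend:
class Backend:
    def __init__(self):
        self.inserted, self.deleted = [], []

    def update_entities(self, ns, ents):
        self.inserted.extend(ents)

    def delete_entity_by_id(self, ns, eid):
        self.deleted.append(eid)


b = Backend()
result = consolidate_clusters(b, "ns", [[{"id": "1"}, {"id": "2"}]], lambda c: ["merged"])
assert result == (1, 2, 1)
assert b.inserted == ["merged"] and b.deleted == ["1", "2"]
```

Counters only advance after a successful insert, so a partial run reports exactly the clusters that were actually consolidated.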
- Separate consolidate_tips into two guarded scopes: combine+insert failures skip the cluster, while delete failures are logged independently without rolling back the successful insert
- Hoist Jinja2 template compilation to module-level constant _COMBINE_TIPS_TEMPLATE to avoid re-reading/compiling on every call
- Assert insert-before-delete ordering in test via mock_calls index comparison instead of just checking call counts
Actionable comments posted: 3
🧹 Nitpick comments (1)
kaizen/llm/tips/clustering.py (1)
209-212: Redundant `continue` in the retry `except` block.

`if attempt < 2: continue` has no effect — after the `except` block completes, the `for` loop advances to the next iteration naturally. The only non-redundant path is `attempt == 2`, where `continue` is already gated out. Remove for clarity.

♻️ Proposed cleanup

```diff
         except Exception as e:
             last_error = e
-            if attempt < 2:
-                continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kaizen/llm/tips/clustering.py` around lines 209 - 212, In the except block that sets last_error = e, remove the redundant conditional "if attempt < 2: continue" because the for-loop will naturally proceed to the next iteration; keep the last_error assignment and any outer retry/raise logic intact (look for the except block where last_error and attempt are used in clustering.py to ensure only the unnecessary continue is deleted so behavior on the final attempt remains unchanged).
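For illustration, a minimal retry loop (hypothetical, not kaizen's actual code) showing why the trailing `continue` is redundant:

```python
# Hypothetical retry loop: after the except block ends, the for-loop moves to
# the next attempt on its own, so `if attempt < 2: continue` adds nothing.
def flaky(attempt: int) -> str:
    if attempt < 2:
        raise RuntimeError(f"attempt {attempt} failed")
    return "ok"

last_error = None
result = None
for attempt in range(3):
    try:
        result = flaky(attempt)
        break
    except Exception as e:
        last_error = e
        # no explicit `continue` needed here: the loop advances by itself

print(result)  # succeeds on the third attempt
```

Behavior on the final attempt is unchanged either way, which is why removing the conditional is purely a clarity fix.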
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@kaizen/frontend/client/kaizen_client.py`:
- Around line 96-97: cluster_tips silently truncates guideline entities when
get_all_entities hits the supplied limit (default 10_000); update cluster_tips
to detect when the returned entities length is >= the limit passed to
get_all_entities (call site symbol: cluster_tips, helper symbol:
get_all_entities) and surface a warning or error to the operator before calling
cluster_entities so truncated data is not hidden—mirror the behavior used for
cluster_entities' MAX_CLUSTER_ENTITIES cap (symbol: cluster_entities,
MAX_CLUSTER_ENTITIES) by logging a clear warning (or raising a specific
exception) that the limit was hit and suggesting increasing the limit or using
pagination.
- Around line 121-155: The consolidation flow currently deletes originals even
when combine_cluster returns an empty consolidated_tips, causing data loss;
modify the logic in the consolidation loop (where consolidated_tips and
new_entities are computed and update_entities is called) to check for an empty
consolidated_tips (or new_entities) and skip the entire Phase 2 deletion step
for that cluster (i.e., continue the loop) when no consolidated tips were
produced, ensuring delete_entity_by_id is only called if replacement entities
were successfully inserted.
In `@kaizen/llm/tips/clustering.py`:
- Around line 196-208: Guard against a None .message.content before calling
clean_llm_response or json.loads: after getting the LLM result from
completion(...).choices[0].message (the variable currently assigned to
response), check if message.content is None and if so raise a KaizenException
(or return a clear error) describing that the LLM returned no content (include
finish_reason or relevant metadata if available) instead of letting
clean_llm_response(response) or json.loads(clean_response) raise
AttributeError/TypeError; apply the same guard to both the constrained and
unconstrained branches that feed
TipGenerationResponse.model_validate(json.loads(...)) so the higher-level error
is meaningful.
---
Duplicate comments:
In `@kaizen/llm/tips/clustering.py`:
- Line 180: The code currently mutates the global
litellm.enable_json_schema_validation flag inside combine_cluster which is racy
under concurrent calls; stop changing the process-wide
litellm.enable_json_schema_validation and instead perform JSON schema validation
locally per-response (e.g., call your JSON validator directly on the LLM
response using the cluster schema or use a per-call option if LiteLLM exposes
one). Locate the mutation of litellm.enable_json_schema_validation in
combine_cluster and remove it, then validate the response with an
application-level validator (jsonvalidator/jsonschema) before parsing/using the
JSON so each caller validates independently without touching the global.
---
Nitpick comments:
In `@kaizen/llm/tips/clustering.py`:
- Around line 209-212: In the except block that sets last_error = e, remove the
redundant conditional "if attempt < 2: continue" because the for-loop will
naturally proceed to the next iteration; keep the last_error assignment and any
outer retry/raise logic intact (look for the except block where last_error and
attempt are used in clustering.py to ensure only the unnecessary continue is
deleted so behavior on the final attempt remains unchanged).
```python
entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
return cluster_entities(entities, threshold=threshold)
```
Silent truncation when guideline entity count reaches limit.
get_all_entities is called with limit=10_000 by default, but if the actual count equals or exceeds that limit, entities beyond the boundary are silently dropped before clustering. Unlike cluster_entities, which logs a warning when its MAX_CLUSTER_ENTITIES cap is hit, cluster_tips provides no signal to the caller or operator.
💡 Suggested improvement
```diff
 entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
+if len(entities) == limit:
+    logger.warning(
+        "cluster_tips fetched exactly %d entities; results may be truncated.",
+        limit,
+    )
 return cluster_entities(entities, threshold=threshold)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
entities = self.get_all_entities(namespace_id, filters={"type": "guideline"}, limit=limit)
if len(entities) == limit:
    logger.warning(
        "cluster_tips fetched exactly %d entities; results may be truncated.",
        limit,
    )
return cluster_entities(entities, threshold=threshold)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@kaizen/frontend/client/kaizen_client.py` around lines 96 - 97, cluster_tips
silently truncates guideline entities when get_all_entities hits the supplied
limit (default 10_000); update cluster_tips to detect when the returned entities
length is >= the limit passed to get_all_entities (call site symbol:
cluster_tips, helper symbol: get_all_entities) and surface a warning or error to
the operator before calling cluster_entities so truncated data is not
hidden—mirror the behavior used for cluster_entities' MAX_CLUSTER_ENTITIES cap
(symbol: cluster_entities, MAX_CLUSTER_ENTITIES) by logging a clear warning (or
raising a specific exception) that the limit was hit and suggesting increasing
the limit or using pagination.
```python
consolidated_tips = combine_cluster(cluster)

task_description = (cluster[0].metadata or {}).get("task_description", "")
new_entities = [
    Entity(
        content=tip.content,
        type="guideline",
        metadata={
            "task_description": task_description,
            "rationale": tip.rationale,
            "category": tip.category,
            "trigger": tip.trigger,
        },
    )
    for tip in consolidated_tips
]
if new_entities:
    self.update_entities(namespace_id, new_entities, enable_conflict_resolution=False)
except Exception:
    logger.warning(
        "Failed to consolidate cluster of %d entities (IDs: %s); skipping.",
        len(cluster),
        [e.id for e in cluster],
        exc_info=True,
    )
    continue

clusters_found += 1
tips_before += len(cluster)
tips_after += len(consolidated_tips)

# Phase 2: delete originals (log errors but don't roll back insert)
for entity in cluster:
    try:
        self.delete_entity_by_id(namespace_id, entity.id)
```
Empty consolidated_tips silently deletes all originals with no replacement — data loss.
If `combine_cluster` returns an empty list (e.g., the LLM returns `{"tips": []}`, which validates successfully against `TipGenerationResponse`), `new_entities` is also empty. The `if new_entities:` guard on line 137 skips the insert, but Phase 2 (lines 153–161) still runs unconditionally and deletes every original entity in the cluster. This leaves the namespace with fewer entities than before consolidation and no consolidated replacement.
`TipGenerationResponse` has `tips: list[Tip]` — an empty list is a valid response, making this scenario plausible under adversarial prompts or model quality regressions.
🛡️ Proposed fix — skip cluster when LLM returns no tips
```diff
 try:
     consolidated_tips = combine_cluster(cluster)
+
+    if not consolidated_tips:
+        logger.warning(
+            "combine_cluster returned no tips for cluster of %d entities (IDs: %s); skipping.",
+            len(cluster),
+            [e.id for e in cluster],
+        )
+        continue

     task_description = (cluster[0].metadata or {}).get("task_description", "")
```
Verify each finding against the current code and only fix it if needed.
In `@kaizen/frontend/client/kaizen_client.py` around lines 121 - 155, The
consolidation flow currently deletes originals even when combine_cluster returns
an empty consolidated_tips, causing data loss; modify the logic in the
consolidation loop (where consolidated_tips and new_entities are computed and
update_entities is called) to check for an empty consolidated_tips (or
new_entities) and skip the entire Phase 2 deletion step for that cluster (i.e.,
continue the loop) when no consolidated tips were produced, ensuring
delete_entity_by_id is only called if replacement entities were successfully
inserted.
```python
else:
    response = (
        completion(
            model=llm_settings.tips_model,
            messages=[{"role": "user", "content": prompt}],
            custom_llm_provider=llm_settings.custom_llm_provider,
        )
        .choices[0]
        .message.content
    )
    clean_response = clean_llm_response(response)

return TipGenerationResponse.model_validate(json.loads(clean_response)).tips
```
`message.content` is not guarded against `None`, producing a misleading AttributeError on retries.
For the unconstrained path (line 204), `response` is assigned directly from `.choices[0].message.content`. If litellm returns `None` (e.g., when `finish_reason` is `"tool_calls"`), `clean_llm_response(None)` calls `None.strip()` and raises an `AttributeError`. The same applies to the constrained path (line 194), where `json.loads(None)` raises a `TypeError`. Both exceptions are caught and retried, but the chained error reaching `KaizenException` will be an opaque low-level type error rather than a meaningful content error.
🛡️ Proposed fix — guard against None content on both paths
```diff
 if constrained_decoding_supported:
     clean_response = (
         completion(
             model=llm_settings.tips_model,
             messages=[{"role": "user", "content": prompt}],
             response_format=TipGenerationResponse,
             custom_llm_provider=llm_settings.custom_llm_provider,
         )
         .choices[0]
         .message.content
     )
+    if clean_response is None:
+        raise ValueError("LLM returned None content (constrained path)")
 else:
     response = (
         completion(
             model=llm_settings.tips_model,
             messages=[{"role": "user", "content": prompt}],
             custom_llm_provider=llm_settings.custom_llm_provider,
         )
         .choices[0]
         .message.content
     )
+    if response is None:
+        raise ValueError("LLM returned None content")
     clean_response = clean_llm_response(response)
```
Verify each finding against the current code and only fix it if needed.
In `@kaizen/llm/tips/clustering.py` around lines 196 - 208, Guard against a None
.message.content before calling clean_llm_response or json.loads: after getting
the LLM result from completion(...).choices[0].message (the variable currently
assigned to response), check if message.content is None and if so raise a
KaizenException (or return a clear error) describing that the LLM returned no
content (include finish_reason or relevant metadata if available) instead of
letting clean_llm_response(response) or json.loads(clean_response) raise
AttributeError/TypeError; apply the same guard to both the constrained and
unconstrained branches that feed
TipGenerationResponse.model_validate(json.loads(...)) so the higher-level error
is meaningful.
Conflict will be resolved once PR #58 is merged.
Summary
- `cluster_entities()` in `kaizen/llm/tips/clustering.py` that embeds task descriptions via SentenceTransformer, computes pairwise cosine similarity, and groups tips into clusters using union-find with path compression
- `KAIZEN_CLUSTERING_THRESHOLD` config option (default `0.80`) added to `KaizenConfig`
- `KaizenClient.cluster_tips()` convenience method
- `kaizen entities consolidate` CLI command with Rich table output (dry-run only — next PR will add actual consolidation)
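The embed → pairwise-similarity → union-find pipeline described above can be sketched as follows. Toy 2-D vectors stand in for SentenceTransformer embeddings, and all names are illustrative, not the actual `cluster_entities` API:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.80):
    """Group vector indices whose pairwise cosine similarity meets threshold."""
    parent = list(range(len(vectors)))

    def find(i):  # union-find root lookup with path compression (halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Union every pair above the similarity threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                union(i, j)

    # Collect members by root.
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(cluster(vecs))  # → [[0, 1], [2]] — the first two vectors group together
```

Because union-find merges transitively, two tips can land in one cluster even when they are not directly similar, as long as a chain of above-threshold pairs connects them.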
Test plan
- `uv run pytest tests/unit/test_clustering.py -v` — 11 new tests pass
- `uv run pytest -v` — full suite (102 tests) passes
- `uv run ruff check kaizen/ tests/` — lint clean
- `uv run kaizen entities consolidate --help` — CLI renders correctly

Summary by CodeRabbit
New Features
Schema
Tests