Multi repo hardening#1
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements multi-repository support for CodeMemory by introducing repo_id partitioning within a shared Neo4j database. Key changes include migrating global uniqueness constraints to composite ones, updating ingestion passes to stamp nodes with repository identifiers, and enhancing search functionality with over-fetching and structural reranking to prevent cross-repo result pollution. Additionally, a new agent-authored memory graph system is introduced, allowing for the creation, search, and management of memory nodes and relations. Feedback focuses on improving maintainability by replacing magic numbers with named constants in the reranking logic and ensuring documentation portability by using relative file paths.
| - `D:\code\codememory\src\codememory\ingestion\graph.py` | ||
| - `D:\code\codememory\src\codememory\server\app.py` | ||
| - `D:\code\codememory\src\codememory\server\tools.py` |
There was a problem hiding this comment.
The documentation contains absolute local file paths. For better portability and to make it useful for all contributors, these should be replaced with relative paths from the repository root. For example, D:\code\codememory\src\codememory\ingestion\graph.py should be src/codememory/ingestion/graph.py. This applies to other absolute paths in this document as well (lines 86, 560-565).
| - `D:\code\codememory\src\codememory\ingestion\graph.py` | |
| - `D:\code\codememory\src\codememory\server\app.py` | |
| - `D:\code\codememory\src\codememory\server\tools.py` | |
| - `src/codememory/ingestion/graph.py` | |
| - `src/codememory/server/app.py` | |
| - `src/codememory/server/tools.py` |
| calls_out, called_by, methods, file_imports, siblings | ||
| """ | ||
| active_repo = repo_id or self.repo_id | ||
| overfetch = limit * 3 if active_repo else limit |
There was a problem hiding this comment.
The overfetch multiplier 3 is a magic number. It would be better to define it as a named constant at the class level (e.g., RERANK_OVERFETCH_MULTIPLIER = 3) for better readability and easier configuration in the future.
| overfetch = limit * 3 if active_repo else limit | |
| overfetch = limit * self.RERANK_OVERFETCH_MULTIPLIER if active_repo else limit |
| for r in results: | ||
| structural = 0.0 | ||
| if r.get("calls_out"): | ||
| structural += 0.05 # has outgoing calls | ||
| if r.get("called_by"): | ||
| structural += 0.05 # is called by others — more central | ||
| if r.get("methods"): | ||
| structural += 0.03 # is a class with methods | ||
| r["final_score"] = r.get("score", 0.0) * 0.9 + structural | ||
| results.sort(key=lambda x: x["final_score"], reverse=True) | ||
| return results[:limit] |
There was a problem hiding this comment.
The reranking logic uses several magic numbers for calculating the structural bonus and final score (e.g., 0.9, 0.05, 0.03). To improve readability and maintainability, it would be better to define these as named constants at the class level (e.g., RERANK_VECTOR_WEIGHT = 0.9, RERANK_BONUS_CALLS_OUT = 0.05, etc.).
| for r in results: | |
| structural = 0.0 | |
| if r.get("calls_out"): | |
| structural += 0.05 # has outgoing calls | |
| if r.get("called_by"): | |
| structural += 0.05 # is called by others — more central | |
| if r.get("methods"): | |
| structural += 0.03 # is a class with methods | |
| r["final_score"] = r.get("score", 0.0) * 0.9 + structural | |
| results.sort(key=lambda x: x["final_score"], reverse=True) | |
| return results[:limit] | |
| for r in results: | |
| structural = 0.0 | |
| if r.get("calls_out"): | |
| structural += self.RERANK_BONUS_CALLS_OUT | |
| if r.get("called_by"): | |
| structural += self.RERANK_BONUS_CALLED_BY | |
| if r.get("methods"): | |
| structural += self.RERANK_BONUS_METHODS | |
| r["final_score"] = r.get("score", 0.0) * self.RERANK_VECTOR_WEIGHT + structural | |
| results.sort(key=lambda x: x["final_score"], reverse=True) | |
| return results[:limit] |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4dd83129bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| entity_names: List[str] = [] | ||
| with self.driver.session() as session: | ||
| for entity in normalized_entities: | ||
| embedding = self._get_memory_embedding_or_none(entity["embedding_text"]) |
There was a problem hiding this comment.
Preserve embeddings when observations are omitted
When create_memory_entities() updates an existing node with observations omitted/empty, the ON MATCH branch intentionally keeps prior m.observations and m.observation_text, but this line still computes a new embedding from the empty payload and then overwrites m.embedding. In that metadata-only update path, vector search will rank the entity using an embedding that no longer matches its stored observations, causing silent relevance regressions until a manual backfill is run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR hardens CodeMemory for multi-repository usage within a single Neo4j database by introducing repo_id partitioning, repo-scoped constraints, and repo-filtered retrieval (including graph-enriched semantic search and new memory CRUD/search MCP tools).
Changes:
- Add
repo_idscoping across ingestion passes, composite uniqueness constraints, and repo-filtered code search with over-fetch + reranking. - Introduce memory graph CRUD/search/backfill capabilities exposed via MCP tools and support them with Neo4j schema (constraints/indexes).
- Update CLI/status queries, tests, and documentation to propagate/describe repo scoping and new tools.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_server.py | Expands semantic search formatting expectations and adds MCP memory tool tests. |
| tests/test_graph.py | Adds repo-scoped builder fixture and tests for repo-filtering + memory graph operations. |
| tests/test_cli.py | Verifies repo-scoped status queries and ensures repo_root is passed into builder in CLI commands. |
| src/codememory/server/tools.py | Enhances semantic search output to include graph context fields and adds memory method passthroughs. |
| src/codememory/server/app.py | Passes repo_root into graph initialization, initializes memory schema, and adds MCP memory tools + formatting helpers. |
| src/codememory/ingestion/graph.py | Core multi-repo partitioning: repo_id, composite constraints, repo-scoped ingestion passes, graph-enriched semantic search + rerank, and full memory CRUD/search/backfill implementation. |
| src/codememory/cli.py | Makes status/search/deps/impact propagate repo_root and scopes status counts by repo_id. |
| README.md | Updates git command examples and lists new memory MCP tools. |
| docs/TROUBLESHOOTING.md | Updates git command troubleshooting steps to match new flags. |
| docs/plans/ultraplan-4.5.26.md | Adds a detailed implementation plan for repo partitioning/migration and retrieval changes. |
| docs/MULTI_CODEBASE_RESEARCH.md | Adds research notes on multi-codebase GraphRAG approaches and repo partitioning rationale. |
| docs/MULTI_CODEBASE_RESEARCH_REPORT.md | Adds distilled research report with recommendations and validated environment constraints. |
| docs/MCP_INTEGRATION.md | Updates git-sync example usage. |
| docs/GIT_GRAPH.md | Aligns git-init/git-sync examples and option lists with current CLI. |
| docs/FIELD_TEST_TEMPLATE.md | Updates field-test checklist to new git-sync workflow/flags. |
| docs/ARCHITECTURE.md | Documents repo_id partitioning identity model and retrieval behavior. |
| docs/API.md | Updates API docs for semantic_search() repo filtering, reranking, and memory search behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def setup_memory_schema(self): | ||
| """Create constraints and indexes for agent-authored memory entities.""" | ||
| if self.repo_id: | ||
| name_constraint = ( | ||
| "DROP CONSTRAINT memory_name_unique IF EXISTS", | ||
| "CREATE CONSTRAINT memory_repo_name_unique IF NOT EXISTS " | ||
| f"FOR (m:{self.MEMORY_LABEL}) REQUIRE (m.repo_id, m.name) IS UNIQUE", | ||
| ) | ||
| else: | ||
| name_constraint = ( | ||
| "CREATE CONSTRAINT memory_name_unique IF NOT EXISTS " | ||
| f"FOR (m:{self.MEMORY_LABEL}) REQUIRE m.name IS UNIQUE", | ||
| ) |
There was a problem hiding this comment.
When running with repo_id enabled, setup_memory_schema() drops the legacy unique constraint and creates the composite (repo_id, name) constraint, but it never backfills repo_id onto existing :Memory nodes. This can leave pre-migration Memory nodes unreachable by repo-scoped reads (search_memory_nodes/read_memory_graph filter by repo_id) and can also allow duplicate names to coexist (old node without repo_id + new node with repo_id). Consider invoking the existing _backfill_repo_id() (or a memory-only backfill) before applying the composite constraint when repo_id is set.
| "score": 0.92, # raw vector similarity (0–1) | ||
| "final_score": 0.94, # 0.9×vector_score + structural_bonus | ||
| "text": "def authenticate(username, password):..." | ||
| }, | ||
| ... | ||
| ] | ||
| ``` | ||
|
|
||
| - `final_score` is always present when `repo_id` filtering is active (via `_rerank_results()`). | ||
|
|
There was a problem hiding this comment.
The documentation claims final_score is only present when repo_id filtering is active, and that score is always raw vector similarity (0–1). In the implementation (KnowledgeGraphBuilder._rerank_results), final_score is added for every result (including when repo_id is not set), and when the vector path falls back to fulltext search the returned score comes from Neo4j fulltext scoring (not guaranteed to be 0–1). Please update the contract here to match actual behavior, or adjust the code to only add final_score / only rerank when repo filtering is active.
| """Tests for the KnowledgeGraphBuilder module.""" | ||
|
|
||
| import pytest | ||
| from pathlib import Path | ||
| from unittest.mock import Mock, patch | ||
|
|
||
| import pytest | ||
|
|
There was a problem hiding this comment.
Path is imported from pathlib but not used anywhere in this test module. Removing the unused import will keep the test file clean and avoid lint failures if unused-import checks are enabled.
No description provided.