Summary
In failure scenarios (especially around Ollama and Qdrant), the container can appear healthy while repeatedly failing index operations, logs become noisy from retry storms, and MCP clients see opaque timeouts or empty responses. This was particularly visible during Mountain runs where curl: (52) Empty reply from server appeared during reindex and the only reliable recovery path was docker restart.[file:40]
This issue focuses on making errors, health, and observability first-class so operators and users can quickly see what’s wrong and why.
Pain points from sessions
- Container health doesn’t reflect embedding / Qdrant failures
  docker ps may show the container as healthy even when:
  - Ollama is down or unreachable.
  - Qdrant is misconfigured or unreachable.
  - reindex is failing repeatedly.
- Retry storms flood logs with repeated errors
  - When Ollama is unreachable, the container logs many identical econnrefused lines.
  - This makes it hard to see other important signals in the logs.
- Embedding errors are not surfaced to MCP clients in a structured way
  - When the embedding backend fails during reindex, the MCP client often just sees a generic timeout or Empty reply from server, not a structured error payload that points to the root cause.[file:40]
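Files likely to change
- [...] /health endpoint.
- lib/elixir_nexus/embedding_model.ex: where embedding calls and retries are handled.
- lib/elixir_nexus/qdrant_client.ex: where Qdrant health and failures can be checked.
- lib/elixir_nexus/mcp_server.ex: where structured MCP errors can be returned.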
Proposed improvements
- Add a container /health endpoint
  - Implement GET /health that returns a JSON structure, e.g.:
    {
      "mcp": "healthy" | "degraded" | "unreachable",
      "qdrant": "healthy" | "degraded" | "unreachable",
      "ollama": "healthy" | "degraded" | "unreachable",
      "indexed_projects": N
    }
  - Use this endpoint in docker-compose healthchecks so that container health reflects the state of its dependencies.
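A minimal sketch of how the /health payload could be assembled, assuming each dependency is checked by a zero-arity probe function that returns :ok, {:degraded, reason}, or {:error, reason}. The module name HealthSketch, the probe convention, and the choice to answer 503 only when something is unreachable are all assumptions for illustration, not existing code.

```elixir
defmodule HealthSketch do
  # Hypothetical helper: turns probe results into the proposed /health payload.
  @services ["mcp", "qdrant", "ollama"]

  # Map one probe result onto the three proposed status strings.
  def status(:ok), do: "healthy"
  def status({:degraded, _reason}), do: "degraded"
  def status({:error, _reason}), do: "unreachable"

  # Run a probe defensively: a probe that raises, throws, exits, or returns
  # an unexpected value is reported as unreachable instead of crashing.
  def run(probe) when is_function(probe, 0) do
    status(probe.())
  rescue
    _exception -> "unreachable"
  catch
    _kind, _value -> "unreachable"
  end

  # Build the payload as a map; JSON encoding is left to the HTTP layer.
  def report(%{mcp: mcp, qdrant: qdrant, ollama: ollama}, indexed_projects) do
    %{
      "mcp" => run(mcp),
      "qdrant" => run(qdrant),
      "ollama" => run(ollama),
      "indexed_projects" => indexed_projects
    }
  end

  # Assumed policy: fail the healthcheck (503) only if a dependency is
  # unreachable, so a degraded dependency does not trigger restarts.
  def http_status(report) do
    if Enum.any?(@services, &(report[&1] == "unreachable")), do: 503, else: 200
  end
end
```

Real probes would also need their own timeouts, since a probe that hangs would hang the healthcheck itself. A Docker healthcheck that fails on HTTP error statuses could then rely on the status code alone.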
- Add backoff and log deduplication for retry loops
  - In embedding_model.ex and qdrant_client.ex, implement exponential backoff for repeated failures to contact Ollama or Qdrant.
  - Coalesce repeated errors into a single log entry like:
    "Ollama unreachable at $OLLAMA_URL – retried 47 times in 30s, last error: econnrefused".
  - Optionally log the first few failures at full detail, then summarize.
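The backoff and coalescing ideas could look roughly like this. It is a sketch only: the module name RetrySketch, the 200 ms base delay, the 30 s cap, and the "first three failures in full detail" threshold are assumed values, and the functions are kept pure so the caller decides when to sleep and when to emit the summary.

```elixir
defmodule RetrySketch do
  # All constants below are illustrative assumptions.
  @base_ms 200
  @max_ms 30_000
  @detailed_failures 3

  # Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@base_ms * Integer.pow(2, attempt), @max_ms)
  end

  # Fresh deduplication state for one outage.
  def new_state, do: %{count: 0, first_ms: nil, last_error: nil}

  # Record one failure observed at `now_ms`.
  # Returns {line_or_nil, state}: the first few failures yield a detailed
  # line to log, later ones are only counted.
  def record_failure(state, error, now_ms) do
    state = %{
      state
      | count: state.count + 1,
        first_ms: state.first_ms || now_ms,
        last_error: error
    }

    line =
      if state.count <= @detailed_failures do
        "attempt #{state.count} failed: #{inspect(error)}"
      else
        nil
      end

    {line, state}
  end

  # One coalesced line in the style proposed above; nil if nothing failed.
  def summary(%{count: 0}, _service, _url, _now_ms), do: nil

  def summary(state, service, url, now_ms) do
    seconds = div(now_ms - state.first_ms, 1000)

    "#{service} unreachable at #{url} – retried #{state.count} times " <>
      "in #{seconds}s, last error: #{format_error(state.last_error)}"
  end

  defp format_error(error) when is_atom(error) or is_binary(error), do: to_string(error)
  defp format_error(error), do: inspect(error)
end
```

A retry loop would call record_failure/3 on each failure, sleep for delay_ms/1, and emit summary/4 on a timer or on recovery. Adding random jitter to the delay is a common refinement that is left out here.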
- Return structured MCP errors for embedding failures
  - When the embedding backend fails during reindex, return a structured MCP error from mcp_server.ex instead of letting the call time out, e.g.:
    {
      "error_code": "embedding_backend_unreachable",
      "message": "Failed to contact embedding backend at $OLLAMA_URL",
      "partial_indexed_count": 123,
      "recoverable": true
    }
  - This allows MCP clients to provide actionable, user-friendly messages.
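As a rough illustration, a reindex loop could stop at the first embedding failure and return the payload above instead of running into a timeout. Everything named here is hypothetical: the McpErrorSketch module, the embed_fun callback returning {:ok, vector} or {:error, reason}, and the list of reasons treated as recoverable. How the map is wrapped in the actual MCP response is left open.

```elixir
defmodule McpErrorSketch do
  # Assumed set of transient reasons; anything else is reported as
  # not recoverable.
  @transient [:econnrefused, :timeout, :closed]

  # Build the structured payload proposed above.
  def embedding_error(reason, url, partial_indexed_count) do
    %{
      "error_code" => "embedding_backend_unreachable",
      "message" => "Failed to contact embedding backend at #{url}",
      "partial_indexed_count" => partial_indexed_count,
      "recoverable" => reason in @transient
    }
  end

  # Embed items one by one and stop at the first failure, reporting how many
  # items were indexed before it. `embed_fun` stands in for the real call.
  def reindex(items, embed_fun, url) when is_function(embed_fun, 1) do
    result =
      Enum.reduce_while(items, {:ok, 0}, fn item, {:ok, indexed} ->
        case embed_fun.(item) do
          {:ok, _vector} -> {:cont, {:ok, indexed + 1}}
          {:error, reason} -> {:halt, {:error, reason, indexed}}
        end
      end)

    case result do
      {:ok, indexed} -> {:ok, %{"indexed_count" => indexed}}
      {:error, reason, indexed} -> {:error, embedding_error(reason, url, indexed)}
    end
  end
end
```

Returning a tagged tuple keeps the decision about transport-level encoding in one place, so the server module can map {:error, payload} to whatever error envelope the MCP client expects.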
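- Log correlation IDs for long-running jobs
  - When reindex starts, generate a job or correlation ID and include it in all related logs (see issue #7, "Indexing lifecycle: async reindex, progress reporting, incremental updates, and per-project collections", for the job model and Indexer references).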
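One possible way to attach a correlation ID to every log line of a job is Elixir's per-process Logger metadata. The sketch below assumes a hypothetical CorrelationSketch module and a 16-character hex ID format. Neither comes from the codebase.

```elixir
defmodule CorrelationSketch do
  require Logger

  # Random 16-character lowercase hex ID (the format is an assumption).
  def new_job_id do
    8 |> :crypto.strong_rand_bytes() |> Base.encode16(case: :lower)
  end

  # Run `fun` with a fresh job_id in this process's Logger metadata, then
  # remove the key again even if `fun` raises.
  def with_job(fun) when is_function(fun, 1) do
    job_id = new_job_id()
    Logger.metadata(job_id: job_id)

    try do
      Logger.info("reindex started")
      fun.(job_id)
    after
      # Setting a key to nil removes it from the metadata.
      Logger.metadata(job_id: nil)
    end
  end
end
```

Two caveats apply. The ID only shows up in output if the Logger formatter is configured to print the job_id metadata key. Logger metadata is also per process, so any tasks spawned by the job would need the ID passed in and set again.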
- Document health and error surfaces
  - Add README sections explaining:
    - What /health checks.
    - What typical error codes (embedding_backend_unreachable, ollama_unreachable, etc.) look like.
    - How retry/backoff behavior works and where to find the relevant logs.
If this direction looks good, I can follow up with a PR proposal that:
- Introduces the /health endpoint and integrates it with Docker healthchecks.
- Adds backoff + log deduplication for Ollama/Qdrant retries in embedding_model.ex and qdrant_client.ex.
- Returns structured MCP errors for embedding failures via mcp_server.ex instead of allowing silent timeouts.