Errors, health, and observability: /health endpoint, backoff/log dedup, and structured MCP errors #9

@NikolaRHristov

Summary

In failure scenarios (especially around Ollama and Qdrant), the container can appear healthy while index operations fail repeatedly, retry storms flood the logs, and MCP clients see opaque timeouts or empty responses. This was particularly visible during Mountain runs, where "curl: (52) Empty reply from server" appeared during reindex and the only reliable recovery path was docker restart.[file:40]

This issue focuses on making errors, health, and observability first-class so operators and users can quickly see what’s wrong and why.

Pain points from sessions

  1. Container health doesn’t reflect embedding / Qdrant failures

    • docker ps may show the container as healthy even when:
      • Ollama is down or unreachable.
      • Qdrant is misconfigured or unreachable.
      • reindex is failing repeatedly.
  2. Retry storms flood logs with repeated errors

    • When Ollama is unreachable, the container logs many identical econnrefused lines.
    • This makes it hard to see other important signals in the logs.
  3. Embedding errors are not surfaced to MCP clients in a structured way

    • When the embedding backend fails during reindex, the MCP client often just sees a generic timeout or Empty reply from server, not a structured error payload that points to the root cause.[file:40]

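Files likely to change

    • embedding_model.ex and qdrant_client.ex (backoff and log deduplication).
    • mcp_server.ex (structured MCP errors).
    • docker-compose healthcheck configuration (to call /health).
    • README (health and error documentation).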

Proposed improvements

  1. Add a container /health endpoint

    • Implement GET /health that returns a JSON structure, e.g.:

      {
        "mcp": "healthy" | "degraded" | "unreachable",
        "qdrant": "healthy" | "degraded" | "unreachable",
        "ollama": "healthy" | "degraded" | "unreachable",
        "indexed_projects": N
      }
    • Use this endpoint in docker-compose healthchecks so that container health reflects the state of its dependencies.
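
As a rough, language-neutral illustration of the aggregation logic behind such an endpoint (the project itself is Elixir, so this Python is only a sketch), the following builds the JSON body proposed above from a set of per-dependency checks. The names `probe` and `build_health`, and the choice to answer 503 whenever any dependency is not healthy, are assumptions made for this example and not part of the existing codebase.

```python
import json
from typing import Callable, Dict, Tuple

STATUSES = ("healthy", "degraded", "unreachable")


def probe(check: Callable[[], str]) -> str:
    """Run one dependency check; any exception counts as unreachable."""
    try:
        status = check()
    except Exception:
        return "unreachable"
    # Unknown status strings are treated conservatively as degraded.
    return status if status in STATUSES else "degraded"


def build_health(
    checks: Dict[str, Callable[[], str]], indexed_projects: int
) -> Tuple[int, str]:
    """Return (http_status, json_body) for a hypothetical GET /health."""
    report = {name: probe(check) for name, check in checks.items()}
    all_healthy = all(status == "healthy" for status in report.values())
    report["indexed_projects"] = indexed_projects
    # Assumption: a non-2xx status lets `curl -f` style healthchecks fail.
    return (200 if all_healthy else 503), json.dumps(report)


def ollama_down() -> str:
    # Stand-in for a real connectivity check that fails.
    raise ConnectionRefusedError("econnrefused")


status, body = build_health(
    {"mcp": lambda: "healthy", "qdrant": lambda: "healthy", "ollama": ollama_down},
    indexed_projects=3,
)
```

Whether a degraded dependency should fail the container healthcheck or only be reported in the body is a design choice this sketch does not settle; it simply fails on anything other than healthy.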

  2. Add backoff and log deduplication for retry loops

    • In embedding_model.ex and qdrant_client.ex, implement exponential backoff for repeated failures to contact Ollama or Qdrant.
    • Coalesce repeated errors into a single log entry like:
      • "Ollama unreachable at $OLLAMA_URL – retried 47 times in 30s, last error: econnrefused".
    • Optionally log the first few failures at full detail, then summarize.
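
The two ideas in this item can be sketched independently of the Elixir modules named above. The Python below is a minimal illustration, not the project's implementation: `backoff_delay` computes a capped exponential delay with jitter, and `LogCoalescer` logs the first few failures in full and then emits one summary line in roughly the format suggested above. All names, default values, and the example URL are hypothetical.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    jitter: Callable[[], float] = random.random,
) -> float:
    """Delay before retry `attempt` (0-based): exponential, capped, jittered."""
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * jitter()


class LogCoalescer:
    """Log the first few failures verbatim, then summarize the rest on flush."""

    def __init__(self, emit: Callable[[str], None], verbose_first: int = 3) -> None:
        self.emit = emit
        self.verbose_first = verbose_first
        self.count = 0
        self.first_seen: Optional[float] = None
        self.last_error: Optional[str] = None

    def record(self, error: str, now: float) -> None:
        if self.count == 0:
            self.first_seen = now
        self.count += 1
        self.last_error = error
        if self.count <= self.verbose_first:
            self.emit(f"attempt {self.count} failed: {error}")

    def flush(self, target: str, now: float) -> None:
        """Emit one summary line if failures were suppressed, then reset."""
        if self.count > self.verbose_first and self.first_seen is not None:
            elapsed = now - self.first_seen
            self.emit(
                f"{target} - retried {self.count} times in {elapsed:.0f}s, "
                f"last error: {self.last_error}"
            )
        self.count, self.first_seen, self.last_error = 0, None, None


lines = []
coalescer = LogCoalescer(lines.append)
for i in range(47):
    # Simulated failures spread evenly over the first part of a 30s window.
    coalescer.record("econnrefused", now=i * 0.5)
coalescer.flush("Ollama unreachable at http://ollama:11434", now=30.0)
```

In a real retry loop the caller would sleep for `backoff_delay(attempt)` between attempts and call `flush` on success or on a fixed interval, so a long outage produces a handful of lines instead of one per attempt.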
  3. Return structured MCP errors for embedding failures

    • When the embedding backend fails during reindex, return a structured MCP error from mcp_server.ex instead of letting the call time out, e.g.:

      {
        "error_code": "embedding_backend_unreachable",
        "message": "Failed to contact embedding backend at $OLLAMA_URL",
        "partial_indexed_count": 123,
        "recoverable": true
      }
    • This allows MCP clients to provide actionable, user-friendly messages.
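
To make the shape of such a response concrete, here is a small Python sketch (again only illustrative, since mcp_server.ex is Elixir) that builds the payload above and wraps it in a JSON-RPC 2.0 error object, the message format MCP is built on. The helper names, the use of the error object's `data` field to carry the payload, and the code -32000 (from the range JSON-RPC reserves for implementation-defined server errors) are assumptions for this example, not the project's current behavior.

```python
import json
from typing import Any, Dict, Union


def embedding_error(
    url: str, partial_indexed_count: int, recoverable: bool = True
) -> Dict[str, Any]:
    """Build the structured payload proposed in this issue."""
    return {
        "error_code": "embedding_backend_unreachable",
        "message": f"Failed to contact embedding backend at {url}",
        "partial_indexed_count": partial_indexed_count,
        "recoverable": recoverable,
    }


def as_jsonrpc_error(
    request_id: Union[int, str], payload: Dict[str, Any], code: int = -32000
) -> str:
    """Wrap the payload in a JSON-RPC 2.0 error response (hypothetical mapping)."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {
                "code": code,
                "message": payload["message"],
                "data": payload,  # structured details for the client
            },
        }
    )


response = as_jsonrpc_error(7, embedding_error("http://ollama:11434", 123))
```

A client receiving this can branch on `error.data.error_code` and use `partial_indexed_count` and `recoverable` to decide whether to prompt for a retry, rather than reporting a bare timeout.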

  4. Log correlation IDs for long-running jobs

  5. Document health and error surfaces

    • Add README sections explaining:
      • What /health checks.
      • What typical error codes (embedding_backend_unreachable, ollama_unreachable, etc.) look like.
      • How retry/backoff behavior works and where to find the relevant logs.

If this direction looks good, I can follow up with a PR proposal that:

  • Introduces the /health endpoint and integrates it with Docker healthchecks.
  • Adds backoff + log deduplication for Ollama/Qdrant retries in embedding_model.ex and qdrant_client.ex.
  • Returns structured MCP errors for embedding failures via mcp_server.ex instead of allowing silent timeouts.
