Errors, health, and observability: `/health` endpoint, backoff/log dedup, and structured MCP errors #9

### Summary

In failure scenarios (especially around Ollama and Qdrant), the container can appear healthy while repeatedly failing index operations, logs become noisy from retry storms, and MCP clients see opaque timeouts or empty responses. This was particularly visible during Mountain runs where `curl: (52) Empty reply from server` appeared during `reindex` and the only reliable recovery path was `docker restart`.[file:40]

This issue focuses on making errors, health, and observability first-class so operators and users can quickly see what’s wrong and why.

### Pain points from sessions

1. **Container health doesn’t reflect embedding / Qdrant failures**
   - `docker ps` may show the container as healthy even when:
     - Ollama is down or unreachable.
     - Qdrant is misconfigured or unreachable.
     - `reindex` is failing repeatedly.

2. **Retry storms flood logs with repeated errors**
   - When Ollama is unreachable, the container logs many identical `econnrefused` lines.
   - This makes it hard to see other important signals in the logs.

3. **Embedding errors are not surfaced to MCP clients in a structured way**
   - When the embedding backend fails during `reindex`, the MCP client often just sees a generic timeout or `Empty reply from server`, not a structured error payload that points to the root cause.[file:40]

### Files likely to change

- Phoenix / Plug endpoint module (wherever HTTP routes are defined) to add a `/health` endpoint.
- [`lib/elixir_nexus/embedding_model.ex`](https://github.com/iksnerd/code-nexus/blob/main/lib/elixir_nexus/embedding_model.ex): where embedding calls and retries are handled.
- [`lib/elixir_nexus/qdrant_client.ex`](https://github.com/iksnerd/code-nexus/blob/main/lib/elixir_nexus/qdrant_client.ex): where Qdrant health and failures can be checked.
- [`lib/elixir_nexus/mcp_server.ex`](https://github.com/iksnerd/code-nexus/blob/main/lib/elixir_nexus/mcp_server.ex): where structured MCP errors can be returned.

### Proposed improvements

1. **Add a container `/health` endpoint**
   - Implement `GET /health` that returns a JSON structure in which `mcp`, `qdrant`, and `ollama` each take one of the values `"healthy"`, `"degraded"`, or `"unreachable"`, and `indexed_projects` is an integer count, e.g.:

     ```json
     {
       "mcp": "healthy",
       "qdrant": "healthy",
       "ollama": "unreachable",
       "indexed_projects": 3
     }
     ```

   - Use this endpoint in `docker-compose` healthchecks so that container health reflects the state of its dependencies.
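
   As a rough illustration of how the payload could be assembled, here is a minimal sketch. `HealthSketch`, the probe functions, and the choice to treat a timeout as `"degraded"` are all assumptions made for this example, not existing code; the real probes would call into `qdrant_client.ex` and `embedding_model.ex`.

   ```elixir
   defmodule HealthSketch do
     # Hypothetical module: names and status rules are illustrative only.

     # Classify one probe result. Mapping :timeout to "degraded" is an assumption.
     def classify(:ok), do: "healthy"
     def classify({:error, :timeout}), do: "degraded"
     def classify(_other), do: "unreachable"

     # Run each zero-arity probe and build the /health body as a plain map.
     def report(probes, indexed_projects) when is_map(probes) do
       probes
       |> Map.new(fn {name, probe} -> {Atom.to_string(name), classify(safe_call(probe))} end)
       |> Map.put("indexed_projects", indexed_projects)
     end

     # HTTP status for the response: 200 only when every component is healthy.
     def http_status(report) do
       statuses = report |> Map.delete("indexed_projects") |> Map.values()
       if Enum.all?(statuses, &(&1 == "healthy")), do: 200, else: 503
     end

     # A probe that raises is reported as an error instead of crashing the endpoint.
     defp safe_call(probe) do
       probe.()
     rescue
       error -> {:error, error}
     end
   end

   # Usage with stand-in probes (real ones would contact Qdrant and Ollama).
   probes = %{
     mcp: fn -> :ok end,
     qdrant: fn -> :ok end,
     ollama: fn -> {:error, :econnrefused} end
   }

   report = HealthSketch.report(probes, 3)
   # report["ollama"] is "unreachable", so HealthSketch.http_status(report) is 503.
   ```

   Returning 503 for any non-healthy component makes a `curl -f` style healthcheck fail as soon as a dependency is down. Whether `"degraded"` should also fail the check is a policy choice left open here.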

2. **Add backoff and log deduplication for retry loops**
   - In `embedding_model.ex` and `qdrant_client.ex`, implement exponential backoff for repeated failures to contact Ollama or Qdrant.
   - Coalesce repeated errors into a single log entry like:
     - `"Ollama unreachable at $OLLAMA_URL – retried 47 times in 30s, last error: econnrefused"`.
   - Optionally log the first few failures at full detail, then summarize.
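
   The backoff and coalescing logic above could look roughly like the following sketch. `RetrySketch`, its default delays (500 ms base, 30 s cap), and the "first 3 failures in full" threshold are assumptions chosen for illustration; jitter is left out to keep the example short.

   ```elixir
   defmodule RetrySketch do
     # Hypothetical helper: pure functions only, no processes or timers.

     # Delay before the given 1-based attempt: base * 2^(attempt - 1), capped at max_ms.
     # The exponent is clamped so very large attempt numbers stay cheap to compute.
     def backoff_ms(attempt, base_ms \\ 500, max_ms \\ 30_000)
         when is_integer(attempt) and attempt >= 1 do
       exponent = min(attempt - 1, 20)
       min(base_ms * Integer.pow(2, exponent), max_ms)
     end

     # Start a dedup window at the given timestamp (milliseconds).
     def new_window(now_ms), do: %{count: 0, started_ms: now_ms, last_error: nil}

     # Record one failure. Returns {:log, state} for the first `verbose` failures
     # (to be logged in full) and {:skip, state} for the rest.
     def record(state, error, verbose \\ 3) do
       state = %{state | count: state.count + 1, last_error: error}
       if state.count <= verbose, do: {:log, state}, else: {:skip, state}
     end

     # One summary line for the whole window, in the format proposed above.
     def summary(state, name, url, now_ms) do
       seconds = div(now_ms - state.started_ms, 1000)

       "#{name} unreachable at #{url} – retried #{state.count} times in #{seconds}s, " <>
         "last error: #{state.last_error}"
     end
   end

   # Usage: delays for the first four attempts.
   Enum.map(1..4, &RetrySketch.backoff_ms/1)
   # → [500, 1000, 2000, 4000]
   ```

   In a real retry loop the caller would sleep for `backoff_ms(attempt)` between attempts, log only when `record/3` returns `:log`, and emit `summary/4` once per window or when the backend recovers.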

3. **Return structured MCP errors for embedding failures**
   - When the embedding backend fails during `reindex`, return a structured MCP error from `mcp_server.ex` instead of letting the call time out, e.g.:

     ```json
     {
       "error_code": "embedding_backend_unreachable",
       "message": "Failed to contact embedding backend at $OLLAMA_URL",
       "partial_indexed_count": 123,
       "recoverable": true
     }
     ```

   - This allows MCP clients to provide actionable, user-friendly messages.
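
   One possible shape for building that payload is sketched below. `McpErrorSketch`, the list of reasons treated as "unreachable", and the fallback code `embedding_backend_error` are assumptions for this example; how the map is wrapped into the actual MCP response depends on `mcp_server.ex`.

   ```elixir
   defmodule McpErrorSketch do
     # Hypothetical mapping from low-level failure reasons to stable error codes.

     # Connection-level reasons treated as "backend unreachable" (an assumed list).
     @unreachable [:econnrefused, :nxdomain, :timeout]

     # Build the structured error map for a failed embedding call during reindex.
     def embedding_error(reason, url, partial_indexed_count) when reason in @unreachable do
       %{
         "error_code" => "embedding_backend_unreachable",
         "message" => "Failed to contact embedding backend at #{url}",
         "partial_indexed_count" => partial_indexed_count,
         "recoverable" => true
       }
     end

     # Anything else is reported under a generic, non-recoverable code.
     def embedding_error(reason, url, partial_indexed_count) do
       %{
         "error_code" => "embedding_backend_error",
         "message" => "Embedding backend at #{url} failed: #{inspect(reason)}",
         "partial_indexed_count" => partial_indexed_count,
         "recoverable" => false
       }
     end
   end

   # Usage: the URL here is a placeholder for whatever $OLLAMA_URL resolves to.
   McpErrorSketch.embedding_error(:econnrefused, "http://ollama:11434", 123)
   ```

   Keeping the mapping in one function means the set of error codes stays small and documentable, rather than leaking raw transport errors to clients.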

4. **Log correlation IDs for long-running jobs**
   - When `reindex` starts, generate a job or correlation ID and include it in all related logs (see Issue #7 for the job model and `Indexer` references).
   - This makes it easier to trace logs for a single indexing run when debugging.
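
   A minimal sketch of this idea using `Logger` metadata follows. `JobLogSketch` and the 16-character hex ID format are assumptions; the job model itself is out of scope here. Note that `Logger` metadata is per process, so work handed to other processes would need the ID passed along explicitly, and the log formatter has to be configured to print the `job_id` key.

   ```elixir
   defmodule JobLogSketch do
     # Hypothetical wrapper: attaches a job ID to all logs emitted by this process.
     require Logger

     # Short random hex ID; the format is an assumption, any unique token works.
     def new_job_id do
       :crypto.strong_rand_bytes(8) |> Base.encode16(case: :lower)
     end

     # Run `fun` with job_id set in this process's Logger metadata, then restore
     # the previous metadata even if `fun` raises. Returns {job_id, result}.
     def with_job(fun) when is_function(fun, 1) do
       job_id = new_job_id()
       previous = Logger.metadata()
       Logger.metadata(job_id: job_id)
       Logger.info("reindex started")

       try do
         {job_id, fun.(job_id)}
       after
         Logger.reset_metadata(previous)
       end
     end
   end

   # Usage: the function body stands in for the real reindex work.
   {job_id, result} = JobLogSketch.with_job(fn _job_id -> :done end)
   ```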

5. **Document health and error surfaces**
   - Add README sections explaining:
     - What `/health` checks.
     - What typical error codes (`embedding_backend_unreachable`, `ollama_unreachable`, etc.) look like.
     - How retry/backoff behavior works and where to find the relevant logs.

If this direction looks good, I can follow up with a PR proposal that:
- Introduces the `/health` endpoint and integrates it with Docker healthchecks.
- Adds backoff + log deduplication for Ollama/Qdrant retries in `embedding_model.ex` and `qdrant_client.ex`.
- Returns structured MCP errors for embedding failures via `mcp_server.ex` instead of allowing silent timeouts.
