ree2raz · Jun 6, 2026 · Jun 5, 2026
diff --git a/.prettierignore b/.prettierignore
@@ -0,0 +1,34 @@
+# Generated by underwriter.publish_scorecard — overwritten on every eval run
+web/public/eval-scorecard.json
+
+# Intentionally mixed `*emphasis*` / `_emphasis_` — do not normalise
+underwriter/docs/METHODOLOGY.md
+
+# Build artefacts
+node_modules/
+dist/
+build/
+__pycache__/
+.pytest_cache/
+.ruff_cache/
+
+# Lockfiles and large data
+package-lock.json
+uv.lock
+*.min.js
+*.min.css
+
+# Binary / non-text
+*.pdf
+*.png
+*.jpg
+*.jpeg
+*.gif
+*.ico
+*.webp
+*.mp4
+*.webm
+
+# Working artefacts
+PLAN.md
+.claude/
diff --git a/.prettierrc.json b/.prettierrc.json
@@ -0,0 +1,13 @@
+{
+  "printWidth": 80,
+  "tabWidth": 2,
+  "useTabs": false,
+  "semi": true,
+  "singleQuote": false,
+  "quoteProps": "as-needed",
+  "trailingComma": "all",
+  "bracketSpacing": true,
+  "bracketSameLine": false,
+  "arrowParens": "always",
+  "endOfLine": "lf"
+}
diff --git a/README.md b/README.md
diff --git a/beacon/README.md b/beacon/README.md
@@ -8,13 +8,14 @@ Architecture notes (ingestion flow, logging strategy, scaling, failure handling,
 schema decisions): [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).
 
 ## Components
-| Path | What |
-| --- | --- |
-| `llmobs/` | The SDK: `trace()` span → capture + **redact PII before egress** → bounded async queue → batched POST. Non-blocking, retry, circuit breaker, drop-with-counter. |
-| `beacon/gateway/` | FastAPI chat: SSE streaming, multi-provider, conversation persistence, cancel; read API for dashboards. |
-| `beacon/ingestion/` | FastAPI: validate + `x-api-key`, produce to Redpanda, return `202`; bad events → DLQ. |
-| `beacon/worker/` | Redpanda consumer → idempotent upsert into Postgres; poison → DLQ. |
-| `beacon/db/` | SQLAlchemy models + Alembic migrations. |
+
+| Path                | What                                                                                                                                                            |
+| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `llmobs/`           | The SDK: `trace()` span → capture + **redact PII before egress** → bounded async queue → batched POST. Non-blocking, retry, circuit breaker, drop-with-counter. |
+| `beacon/gateway/`   | FastAPI chat: SSE streaming, multi-provider, conversation persistence, cancel; read API for dashboards.                                                         |
+| `beacon/ingestion/` | FastAPI: validate + `x-api-key`, produce to Redpanda, return `202`; bad events → DLQ.                                                                           |
+| `beacon/worker/`    | Redpanda consumer → idempotent upsert into Postgres; poison → DLQ.                                                                                              |
+| `beacon/db/`        | SQLAlchemy models + Alembic migrations.                                                                                                                         |
 
 ## Run it locally
 
@@ -38,6 +39,7 @@ uv run uvicorn beacon.gateway.main:app --port 8000
 ```
 
 ### Smoke test: watch a redacted log land end-to-end
+
 ```bash
 # stream a chat turn (SSE)
 curl -N -X POST localhost:8000/chat -H 'content-type: application/json' \
@@ -51,13 +53,16 @@ curl -s localhost:8000/api/metrics/summary | jq
 ```
 
 ## Offline tests (no infra, no keys)
+
 ```bash
 uv run pytest beacon/tests
 ```
+
 Covers the redaction golden set, the SDK's non-blocking / retry / circuit-breaker /
 bounded-drop behaviour, and the tracer's event construction.
 
 ## Endpoints
+
 - `POST /chat` - SSE token stream (`meta` → `token`… → `done`).
 - `POST /conversations/{id}/cancel` - stop an in-flight stream.
 - `GET /models` - model catalog for the selector.

diff --git a/beacon/docs/ARCHITECTURE.md b/beacon/docs/ARCHITECTURE.md
@@ -84,14 +84,14 @@ flows the async path and is best-effort; losing a log never corrupts a chat.
   removes the conversation and its `messages` (the chat record, cascade), but
   **leaves `inference_logs` intact**. This is deliberate: the two write paths have
   different lifecycles. Chat state is user-owned and disposable; `inference_logs`
-  is an *append-only operational audit stream* (latency, cost, error, and PII-control
+  is an _append-only operational audit stream_ (latency, cost, error, and PII-control
   receipts that ops and finance rely on). If deleting a chat retroactively erased its
   logs, historical dashboards (p95 latency, cost-by-model, error rate for a past
   window) would silently change every time a user pruned history, which defeats the
   purpose of an audit trail. The logs already hold no raw content, only redacted
   previews, so retention is privacy-safe. The cost is a dangling
   `inference_logs.conversation_id` whose trace view 404s; `conversation_id` is
-  therefore an intentionally *soft* reference, not a foreign key. For a strict
+  therefore an intentionally _soft_ reference, not a foreign key. For a strict
   right-to-erasure requirement the documented next step is a soft-delete that nulls
   `conversation_id` and the previews while preserving the numeric metrics.
 - **`request_id` UNIQUE** is the idempotency key threaded end-to-end.
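Review note: the idempotency claim above can be illustrated with a small sketch. It assumes a simplified `inference_logs` table and uses in-memory SQLite in place of Postgres so it runs without infrastructure; the column set and the `record` helper are hypothetical, and only `request_id` and `conversation_id` are taken from the document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE inference_logs (
        request_id      TEXT NOT NULL UNIQUE,  -- idempotency key
        conversation_id TEXT,                  -- soft reference: no FOREIGN KEY
        latency_ms      INTEGER                -- hypothetical metric column
    )
    """
)


def record(event: dict) -> bool:
    """Insert one log row; a redelivered request_id is silently skipped."""
    cur = conn.execute(
        "INSERT INTO inference_logs (request_id, conversation_id, latency_ms) "
        "VALUES (:request_id, :conversation_id, :latency_ms) "
        "ON CONFLICT(request_id) DO NOTHING",
        event,
    )
    conn.commit()
    return cur.rowcount == 1  # True only when a new row was written


event = {"request_id": "req-1", "conversation_id": "conv-9", "latency_ms": 840}
first = record(event)   # new row
second = record(event)  # same event redelivered by the consumer
row_count = conn.execute("SELECT COUNT(*) FROM inference_logs").fetchone()[0]
```

Because `conversation_id` carries no foreign-key constraint, deleting the conversation elsewhere would leave this row in place, which matches the audit-stream behaviour described above.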

diff --git a/deploy/docker-compose.yml b/deploy/docker-compose.yml
@@ -114,7 +114,11 @@ services:
       redpanda:
         condition: service_healthy
     healthcheck:
-      test: ["CMD-SHELL", "/app/.venv/bin/python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/healthz')\" 2>/dev/null || exit 1"]
+      test:
+        [
+          "CMD-SHELL",
+          '/app/.venv/bin/python -c "import urllib.request; urllib.request.urlopen(''http://localhost:8000/healthz'')" 2>/dev/null || exit 1',
+        ]
       interval: 10s
       timeout: 5s
       retries: 5
@@ -134,7 +138,11 @@ services:
       redpanda:
         condition: service_healthy
     healthcheck:
-      test: ["CMD-SHELL", "/app/.venv/bin/python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8088/healthz')\" 2>/dev/null || exit 1"]
+      test:
+        [
+          "CMD-SHELL",
+          '/app/.venv/bin/python -c "import urllib.request; urllib.request.urlopen(''http://localhost:8088/healthz'')" 2>/dev/null || exit 1',
+        ]
       interval: 10s
       timeout: 5s
       retries: 5

diff --git a/deploy/k8s/gateway.yaml b/deploy/k8s/gateway.yaml
@@ -32,7 +32,15 @@ spec:
       containers:
         - name: gateway
           image: koverage-platform:latest
-          command: ["uvicorn", "beacon.gateway.main:app", "--host", "0.0.0.0", "--port", "8000"]
+          command:
+            [
+              "uvicorn",
+              "beacon.gateway.main:app",
+              "--host",
+              "0.0.0.0",
+              "--port",
+              "8000",
+            ]
           envFrom:
             - secretRef:
                 name: koverage-secrets

diff --git a/deploy/k8s/ingestion.yaml b/deploy/k8s/ingestion.yaml
@@ -20,7 +20,15 @@ spec:
       containers:
         - name: ingestion
           image: koverage-platform:latest
-          command: ["uvicorn", "beacon.ingestion.main:app", "--host", "0.0.0.0", "--port", "8088"]
+          command:
+            [
+              "uvicorn",
+              "beacon.ingestion.main:app",
+              "--host",
+              "0.0.0.0",
+              "--port",
+              "8088",
+            ]
           envFrom:
             - secretRef:
                 name: koverage-secrets

diff --git a/modal-app/README.md b/modal-app/README.md
@@ -40,15 +40,15 @@ evaluation harness and the chat UI's OSS path.
 
 ## Cost & latency (A10G)
 
-| Metric | Value |
-|---|---|
-| GPU | A10G (24 GB) |
-| Price | ~$1.10/hr (Modal, per-second billing) |
-| Cold start | ~1–3 min first call (weights from Volume + vLLM warmup) |
-| Warm latency (single-turn chat) | ~0.8–2 s per request |
-| Per-item eval latency | ~27 s (full multi-turn prompt on one A10G, cold-start amortised) |
-| Throughput | vLLM continuous batching; up to ~50 concurrent inputs |
-| Idle cost | $0 (scales to zero after 5 min idle) |
+| Metric                          | Value                                                            |
+| ------------------------------- | ---------------------------------------------------------------- |
+| GPU                             | A10G (24 GB)                                                     |
+| Price                           | ~$1.10/hr (Modal, per-second billing)                            |
+| Cold start                      | ~1–3 min first call (weights from Volume + vLLM warmup)          |
+| Warm latency (single-turn chat) | ~0.8–2 s per request                                             |
+| Per-item eval latency           | ~27 s (full multi-turn prompt on one A10G, cold-start amortised) |
+| Throughput                      | vLLM continuous batching; up to ~50 concurrent inputs            |
+| Idle cost                       | $0 (scales to zero after 5 min idle)                             |
 
 ## Deploy
 

diff --git a/modal-app/qwen_app.py b/modal-app/qwen_app.py
@@ -56,6 +56,7 @@
     timeout=10 * MINUTES,
     volumes={"/root/.cache/huggingface": hf_cache},
     min_containers=1,  # FIXED: Replaces the deprecated keep_warm=1
+    max_containers=5,  # Cap fan-out — keeps the eval under the account's infra rate limits
 )
 class VLLMServer:
     @modal.enter()

diff --git a/underwriter/README.md b/underwriter/README.md
@@ -3,22 +3,24 @@
 Evaluates models on the risks an AI liability insurer underwrites, then prices
 an **Insurability Index** and premium tier.
 
-| Axis | What it measures | Suite |
-| --- | --- | --- |
-| **Hallucination** | factual accuracy + resistance to confabulation (false-premise traps) | `factual` |
-| **Bias & Harmful** | stereotyping, harmful generalisations, demeaning content | `bias` |
-| **Content Safety** | jailbreak resistance **and over-refusal** (benign controls) | `jailbreak` |
-| **Sensitive-Data Disclosure** | system-prompt / token / PII leakage | `sensitive` |
+| Axis                          | What it measures                                                     | Suite       |
+| ----------------------------- | -------------------------------------------------------------------- | ----------- |
+| **Hallucination**             | factual accuracy + resistance to confabulation (false-premise traps) | `factual`   |
+| **Bias & Harmful**            | stereotyping, harmful generalisations, demeaning content             | `bias`      |
+| **Content Safety**            | jailbreak resistance **and over-refusal** (benign controls)          | `jailbreak` |
+| **Sensitive-Data Disclosure** | system-prompt / token / PII leakage                                  | `sensitive` |
 
 ## How it scores (the short version)
-- **Hybrid**: deterministic detectors (refusal, false-premise, PII/sentinel leak: leak detection reuses Beacon's `llmobs` redactor) **+** dual cross-provider LLM judges (GPT-4.1 + Gemini). Deterministic signals can override the judge (a leaked card number is a leak regardless of judge opinion).
-- **Dual judges + Cohen's κ**: both judges score every item on an anchored 0–4 rubric; we report per-judge risk and inter-rater agreement, and never let a model be its own sole judge.
+
+- **Hybrid**: deterministic detectors (refusal, false-premise, PII/sentinel leak: leak detection reuses Beacon's `llmobs` redactor) **+** dual cross-provider LLM judges (GPT-4.1 + Claude 3.5 Haiku, both disjoint from the models under test). Deterministic signals can override the judge (a leaked card number is a leak regardless of judge opinion).
+- **Dual judges + Cohen's κ / Gwet's AC1**: both judges score every item on an anchored 0–4 rubric; we report per-judge risk and inter-rater agreement (κ plus AC1, which stays stable under the skewed-prevalence paradox where κ degenerates), and never let a model be its own sole judge.
 - **Severity-weighted** risk per axis with **bootstrap 95% CIs**.
 - **Guardrail A/B**: every model runs guardrails-off and guardrails-on; the index delta is the risk reduction the guardrail buys.
 - **Insurability Index** = 100·(1 − weighted overall risk) → premium tier (Preferred / Standard / Substandard / Decline).
 - Full rationale + limitations: [`docs/METHODOLOGY.md`](docs/METHODOLOGY.md).
 
 ## Run it
+
 ```bash
 uv sync
 cp ../.env.example ../.env          # set OPENROUTER_API_KEY (reaches judges + frontier models)
@@ -32,18 +34,22 @@ uv run python -m underwriter.cli run --smoke
 # full live evaluation (all suites, guard off+on, dual judges) → runs/<ts>/{scorecard.json,pdf}
 uv run python -m underwriter.cli run
 ```
+
 The OSS model (Qwen3-8B, self-hosted on Modal/vLLM) joins the run matrix
 automatically once `MODAL_OSS_URL` is set; until then the harness runs on the
 configured frontier models. If the Modal endpoint is unreachable it falls back to
 `qwen/qwen3-8b` on OpenRouter so the run still completes.
 
 ## Offline tests (no API)
+
 ```bash
 uv run pytest underwriter/tests
 ```
+
 Covers the detectors, the risk-model overrides, and the statistics (weighted mean,
 bootstrap CI, Cohen's κ, premium tiers): judge verdicts are fixtures.
 
 ## Layout
+
 `datasets/` suites + cards · `scoring/` deterministic + judge + combine + aggregate ·
 `guardrails.py` toggleable layer · `runner.py` run matrix · `report.py` PDF + publish · `cli.py`.
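Review note on the "Cohen's κ / Gwet's AC1" bullet: the paradox it refers to is easy to reproduce. The sketch below computes unweighted κ and AC1 for two judges on a 0–4 rubric; the function name and the sample scores are made up for illustration, and whether the harness uses weighted or unweighted statistics is not stated in this diff.

```python
from collections import Counter


def agreement_stats(judge_a, judge_b, categories):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1), all unweighted."""
    n = len(judge_a)
    q = len(categories)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    freq_a = Counter(judge_a)
    freq_b = Counter(judge_b)

    # Cohen: chance agreement from the product of each judge's marginals.
    p_e_kappa = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)

    # Gwet: chance agreement from the mean prevalence of each category.
    pi = [(freq_a[c] + freq_b[c]) / (2 * n) for c in categories]
    p_e_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)

    # Note: kappa is undefined (division by zero) if both judges only ever
    # use one and the same category; this sketch does not guard against that.
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return p_o, kappa, ac1


# Made-up scores: 96 items both judges rate 0, plus 4 disagreements (0 vs 4).
judge_a = [0] * 96 + [4, 4, 0, 0]
judge_b = [0] * 96 + [0, 0, 4, 4]
p_o, kappa, ac1 = agreement_stats(judge_a, judge_b, categories=[0, 1, 2, 3, 4])
```

With these scores the judges agree on 96% of items, yet κ comes out slightly negative (about −0.02) because almost every item falls in one category, while AC1 stays near 0.96. That is the case for reporting both.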