34 changes: 34 additions & 0 deletions .prettierignore
@@ -0,0 +1,34 @@
# Generated by underwriter.publish_scorecard — overwritten on every eval run
web/public/eval-scorecard.json

# Intentionally mixed `*emphasis*` / `_emphasis_` — do not normalise
underwriter/docs/METHODOLOGY.md

# Build artefacts
node_modules/
dist/
build/
__pycache__/
.pytest_cache/
.ruff_cache/

# Lockfiles and large data
package-lock.json
uv.lock
*.min.js
*.min.css

# Binary / non-text
*.pdf
*.png
*.jpg
*.jpeg
*.gif
*.ico
*.webp
*.mp4
*.webm

# Working artefacts
PLAN.md
.claude/
13 changes: 13 additions & 0 deletions .prettierrc.json
@@ -0,0 +1,13 @@
{
"printWidth": 80,
"tabWidth": 2,
"useTabs": false,
"semi": true,
"singleQuote": false,
"quoteProps": "as-needed",
"trailingComma": "all",
"bracketSpacing": true,
"bracketSameLine": false,
"arrowParens": "always",
"endOfLine": "lf"
}
171 changes: 92 additions & 79 deletions README.md

Large diffs are not rendered by default.

19 changes: 12 additions & 7 deletions beacon/README.md
@@ -8,13 +8,14 @@ Architecture notes (ingestion flow, logging strategy, scaling, failure handling,
schema decisions): [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).

## Components
| Path | What |
| --- | --- |
| `llmobs/` | The SDK: `trace()` span → capture + **redact PII before egress** → bounded async queue → batched POST. Non-blocking, retry, circuit breaker, drop-with-counter. |
| `beacon/gateway/` | FastAPI chat: SSE streaming, multi-provider, conversation persistence, cancel; read API for dashboards. |
| `beacon/ingestion/` | FastAPI: validate + `x-api-key`, produce to Redpanda, return `202`; bad events → DLQ. |
| `beacon/worker/` | Redpanda consumer → idempotent upsert into Postgres; poison → DLQ. |
| `beacon/db/` | SQLAlchemy models + Alembic migrations. |

| Path | What |
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `llmobs/` | The SDK: `trace()` span → capture + **redact PII before egress** → bounded async queue → batched POST. Non-blocking, retry, circuit breaker, drop-with-counter. |
| `beacon/gateway/` | FastAPI chat: SSE streaming, multi-provider, conversation persistence, cancel; read API for dashboards. |
| `beacon/ingestion/` | FastAPI: validate + `x-api-key`, produce to Redpanda, return `202`; bad events → DLQ. |
| `beacon/worker/` | Redpanda consumer → idempotent upsert into Postgres; poison → DLQ. |
| `beacon/db/` | SQLAlchemy models + Alembic migrations. |
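
The "bounded queue, non-blocking, drop-with-counter" behaviour listed for `llmobs/` can be illustrated with a small sketch. This is not the SDK's code: `BoundedEventBuffer`, `offer`, and `drain` are hypothetical names, and a thread-safe `queue.Queue` stands in for whatever async queue the SDK actually uses. The point is only that a full buffer costs the caller a counter increment, never a wait.

```python
import queue
import threading


class BoundedEventBuffer:
    """Hypothetical stand-in for the SDK's bounded queue (not the llmobs API)."""

    def __init__(self, maxsize: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self._lock = threading.Lock()
        self.dropped = 0  # surfaced as a metric so drops are visible, not silent

    def offer(self, event: dict) -> bool:
        """Enqueue without blocking; count the event as dropped if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Pull up to max_batch events, e.g. for one batched POST."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = BoundedEventBuffer(maxsize=2)
accepted = [buffer.offer({"request_id": str(i)}) for i in range(3)]
batch = buffer.drain(max_batch=10)
```

Here the third event is rejected because the buffer holds two, so `buffer.dropped` ends at 1 while the caller never blocks. Retry and the circuit breaker would sit around the POST that consumes `batch` and are left out of this sketch.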

## Run it locally

@@ -38,6 +39,7 @@ uv run uvicorn beacon.gateway.main:app --port 8000
```

### Smoke test: watch a redacted log land end-to-end

```bash
# stream a chat turn (SSE)
curl -N -X POST localhost:8000/chat -H 'content-type: application/json' \
@@ -51,13 +53,16 @@ curl -s localhost:8000/api/metrics/summary | jq
```

## Offline tests (no infra, no keys)

```bash
uv run pytest beacon/tests
```

Covers the redaction golden set, the SDK's non-blocking / retry / circuit-breaker /
bounded-drop behaviour, and the tracer's event construction.

## Endpoints

- `POST /chat` - SSE token stream (`meta` → `token`… → `done`).
- `POST /conversations/{id}/cancel` - stop an in-flight stream.
- `GET /models` - model catalog for the selector.
4 changes: 2 additions & 2 deletions beacon/docs/ARCHITECTURE.md
@@ -84,14 +84,14 @@ flows the async path and is best-effort; losing a log never corrupts a chat.
removes the conversation and its `messages` (the chat record, cascade), but
**leaves `inference_logs` intact**. This is deliberate: the two write paths have
different lifecycles. Chat state is user-owned and disposable; `inference_logs`
is an *append-only operational audit stream* (latency, cost, error, and PII-control
is an _append-only operational audit stream_ (latency, cost, error, and PII-control
receipts that ops and finance rely on). If deleting a chat retroactively erased its
logs, historical dashboards (p95 latency, cost-by-model, error rate for a past
window) would silently change every time a user pruned history, which defeats the
purpose of an audit trail. The logs already hold no raw content, only redacted
previews, so retention is privacy-safe. The cost is a dangling
`inference_logs.conversation_id` whose trace view 404s; `conversation_id` is
therefore an intentionally *soft* reference, not a foreign key. For a strict
therefore an intentionally _soft_ reference, not a foreign key. For a strict
right-to-erasure requirement the documented next step is a soft-delete that nulls
`conversation_id` and the previews while preserving the numeric metrics.
- **`request_id` UNIQUE** is the idempotency key threaded end-to-end.
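
As a rough illustration of how a UNIQUE `request_id` makes redelivery harmless, the sketch below replays the same event twice against an in-memory SQLite table. It is an assumption-laden stand-in, not the worker's code: SQLite replaces Postgres, `record_event` and the `latency_ms` column are hypothetical, and the real schema lives in the Alembic migrations. The `ON CONFLICT (request_id) DO NOTHING` clause has the same shape in both databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE inference_logs (
        request_id      TEXT NOT NULL UNIQUE,  -- idempotency key
        conversation_id TEXT,                  -- soft reference: no FOREIGN KEY
        latency_ms      INTEGER                -- hypothetical metric column
    )
    """
)


def record_event(event: dict) -> None:
    """Insert the event once; a redelivered duplicate is silently ignored."""
    conn.execute(
        "INSERT INTO inference_logs (request_id, conversation_id, latency_ms) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (request_id) DO NOTHING",
        (event["request_id"], event["conversation_id"], event["latency_ms"]),
    )
    conn.commit()


event = {"request_id": "req-1", "conversation_id": "conv-9", "latency_ms": 840}
record_event(event)
record_event(event)  # simulated redelivery from the broker

row_count = conn.execute("SELECT COUNT(*) FROM inference_logs").fetchone()[0]
```

Because `conversation_id` carries no foreign key, the row would also survive deletion of the conversation it points at, which matches the soft-reference decision described above.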
12 changes: 10 additions & 2 deletions deploy/docker-compose.yml
@@ -114,7 +114,11 @@ services:
redpanda:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "/app/.venv/bin/python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/healthz')\" 2>/dev/null || exit 1"]
test:
[
"CMD-SHELL",
'/app/.venv/bin/python -c "import urllib.request; urllib.request.urlopen(''http://localhost:8000/healthz'')" 2>/dev/null || exit 1',
]
interval: 10s
timeout: 5s
retries: 5
@@ -134,7 +138,11 @@ services:
redpanda:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "/app/.venv/bin/python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8088/healthz')\" 2>/dev/null || exit 1"]
test:
[
"CMD-SHELL",
'/app/.venv/bin/python -c "import urllib.request; urllib.request.urlopen(''http://localhost:8088/healthz'')" 2>/dev/null || exit 1',
]
interval: 10s
timeout: 5s
retries: 5
10 changes: 9 additions & 1 deletion deploy/k8s/gateway.yaml
@@ -32,7 +32,15 @@ spec:
containers:
- name: gateway
image: koverage-platform:latest
command: ["uvicorn", "beacon.gateway.main:app", "--host", "0.0.0.0", "--port", "8000"]
command:
[
"uvicorn",
"beacon.gateway.main:app",
"--host",
"0.0.0.0",
"--port",
"8000",
]
envFrom:
- secretRef:
name: koverage-secrets
10 changes: 9 additions & 1 deletion deploy/k8s/ingestion.yaml
@@ -20,7 +20,15 @@ spec:
containers:
- name: ingestion
image: koverage-platform:latest
command: ["uvicorn", "beacon.ingestion.main:app", "--host", "0.0.0.0", "--port", "8088"]
command:
[
"uvicorn",
"beacon.ingestion.main:app",
"--host",
"0.0.0.0",
"--port",
"8088",
]
envFrom:
- secretRef:
name: koverage-secrets
18 changes: 9 additions & 9 deletions modal-app/README.md
@@ -40,15 +40,15 @@ evaluation harness and the chat UI's OSS path.

## Cost & latency (A10G)

| Metric | Value |
|---|---|
| GPU | A10G (24 GB) |
| Price | ~$1.10/hr (Modal, per-second billing) |
| Cold start | ~1–3 min first call (weights from Volume + vLLM warmup) |
| Warm latency (single-turn chat) | ~0.8–2 s per request |
| Per-item eval latency | ~27 s (full multi-turn prompt on one A10G, cold-start amortised) |
| Throughput | vLLM continuous batching; up to ~50 concurrent inputs |
| Idle cost | $0 (scales to zero after 5 min idle) |
| Metric | Value |
| ------------------------------- | ---------------------------------------------------------------- |
| GPU | A10G (24 GB) |
| Price | ~$1.10/hr (Modal, per-second billing) |
| Cold start | ~1–3 min first call (weights from Volume + vLLM warmup) |
| Warm latency (single-turn chat) | ~0.8–2 s per request |
| Per-item eval latency | ~27 s (full multi-turn prompt on one A10G, cold-start amortised) |
| Throughput | vLLM continuous batching; up to ~50 concurrent inputs |
| Idle cost | $0 (scales to zero after 5 min idle) |

## Deploy

1 change: 1 addition & 0 deletions modal-app/qwen_app.py
@@ -56,6 +56,7 @@
timeout=10 * MINUTES,
volumes={"/root/.cache/huggingface": hf_cache},
min_containers=1, # FIXED: Replaces the deprecated keep_warm=1
max_containers=5, # Cap fan-out — keeps the eval under the account's infra rate limits
)
class VLLMServer:
@modal.enter()
22 changes: 14 additions & 8 deletions underwriter/README.md
@@ -3,22 +3,24 @@
Evaluates models on the risks an AI liability insurer underwrites, then prices
an **Insurability Index** and premium tier.

| Axis | What it measures | Suite |
| --- | --- | --- |
| **Hallucination** | factual accuracy + resistance to confabulation (false-premise traps) | `factual` |
| **Bias & Harmful** | stereotyping, harmful generalisations, demeaning content | `bias` |
| **Content Safety** | jailbreak resistance **and over-refusal** (benign controls) | `jailbreak` |
| **Sensitive-Data Disclosure** | system-prompt / token / PII leakage | `sensitive` |
| Axis | What it measures | Suite |
| ----------------------------- | -------------------------------------------------------------------- | ----------- |
| **Hallucination** | factual accuracy + resistance to confabulation (false-premise traps) | `factual` |
| **Bias & Harmful** | stereotyping, harmful generalisations, demeaning content | `bias` |
| **Content Safety** | jailbreak resistance **and over-refusal** (benign controls) | `jailbreak` |
| **Sensitive-Data Disclosure** | system-prompt / token / PII leakage | `sensitive` |

## How it scores (the short version)
- **Hybrid**: deterministic detectors (refusal, false-premise, PII/sentinel leak: leak detection reuses Beacon's `llmobs` redactor) **+** dual cross-provider LLM judges (GPT-4.1 + Gemini). Deterministic signals can override the judge (a leaked card number is a leak regardless of judge opinion).
- **Dual judges + Cohen's κ**: both judges score every item on an anchored 0–4 rubric; we report per-judge risk and inter-rater agreement, and never let a model be its own sole judge.

- **Hybrid**: deterministic detectors (refusal, false-premise, PII/sentinel leak: leak detection reuses Beacon's `llmobs` redactor) **+** dual cross-provider LLM judges (GPT-4.1 + Claude 3.5 Haiku, both disjoint from the models under test). Deterministic signals can override the judge (a leaked card number is a leak regardless of judge opinion).
- **Dual judges + Cohen's κ / Gwet's AC1**: both judges score every item on an anchored 0–4 rubric; we report per-judge risk and inter-rater agreement (κ plus AC1, which stays paradox-resistant where κ degenerates), and never let a model be its own sole judge.
- **Severity-weighted** risk per axis with **bootstrap 95% CIs**.
- **Guardrail A/B**: every model runs guardrails-off and guardrails-on; the index delta is the risk reduction the guardrail buys.
- **Insurability Index** = 100·(1 − weighted overall risk) → premium tier (Preferred / Standard / Substandard / Decline).
- Full rationale + limitations: [`docs/METHODOLOGY.md`](docs/METHODOLOGY.md).
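
To make the κ / AC1 point concrete, here is a minimal sketch of both statistics for two judges on the 0–4 rubric. It assumes the unweighted (nominal) forms; the harness may use weighted variants, and the function names are illustrative rather than taken from `scoring/`. The example scores are made up to trigger the "κ paradox": the judges agree on 8 of 10 items, but nearly every score is 0, so κ's chance-agreement term is inflated.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which both judges gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa; NaN when chance agreement is exactly 1."""
    n = len(a)
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    if pe == 1.0:
        return float("nan")  # both judges used a single identical category
    return (po - pe) / (1 - pe)


def gwets_ac1(a, b, n_categories=5):
    """Unweighted Gwet's AC1 for a rubric with n_categories levels (0-4 -> 5)."""
    n = len(a)
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Average marginal proportion per category across the two judges.
    pi = {k: (ca[k] + cb[k]) / (2 * n) for k in set(ca) | set(cb)}
    pe = sum(p * (1 - p) for p in pi.values()) / (n_categories - 1)
    return (po - pe) / (1 - pe)


# Made-up scores: heavy skew toward 0, with one disagreement each way.
judge_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
judge_b = [0, 0, 0, 0, 0, 0, 0, 0, 4, 0]

kappa = cohens_kappa(judge_a, judge_b)
ac1 = gwets_ac1(judge_a, judge_b)
```

With these inputs observed agreement is 0.8, yet κ comes out at about −0.11 while AC1 is about 0.79, which is why reporting both gives a fairer picture on skewed suites.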

## Run it

```bash
uv sync
cp ../.env.example ../.env # set OPENROUTER_API_KEY (reaches judges + frontier models)
@@ -32,18 +34,22 @@ uv run python -m underwriter.cli run --smoke
# full live evaluation (all suites, guard off+on, dual judges) → runs/<ts>/{scorecard.json,pdf}
uv run python -m underwriter.cli run
```

The OSS model (Qwen3-8B, self-hosted on Modal/vLLM) joins the run matrix
automatically once `MODAL_OSS_URL` is set; until then the harness runs on the
configured frontier models. If the Modal endpoint is unreachable it falls back to
`qwen/qwen3-8b` on OpenRouter so the run still completes.
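
The selection logic described above could look roughly like the following. This is a sketch under stated assumptions, not the runner's code: `resolve_oss_target` and `endpoint_reachable` are hypothetical helpers, the returned dict shape is invented for illustration, and "reachable" is taken to mean that a plain HTTP request to the URL succeeds.

```python
import urllib.error
import urllib.request
from typing import Callable, Mapping, Optional

OPENROUTER_FALLBACK_MODEL = "qwen/qwen3-8b"  # fallback named in the README


def endpoint_reachable(url: str, timeout: float = 5.0) -> bool:
    """Assumed reachability check: any successful HTTP response counts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False


def resolve_oss_target(
    env: Mapping[str, str],
    probe: Callable[[str], bool] = endpoint_reachable,
) -> Optional[dict]:
    """Decide where the OSS model runs, or None if it is not in the matrix."""
    url = env.get("MODAL_OSS_URL")
    if not url:
        return None  # variable unset: frontier models only
    if probe(url):
        return {"provider": "modal", "base_url": url}
    return {"provider": "openrouter", "model": OPENROUTER_FALLBACK_MODEL}


# Exercise the three branches with a fake probe so no network is needed.
unset = resolve_oss_target({})
healthy = resolve_oss_target({"MODAL_OSS_URL": "https://example.invalid"}, probe=lambda _: True)
fallback = resolve_oss_target({"MODAL_OSS_URL": "https://example.invalid"}, probe=lambda _: False)
```

Injecting `probe` keeps the decision testable offline, in the same spirit as the fixture-based tests below.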

## Offline tests (no API)

```bash
uv run pytest underwriter/tests
```

Covers the detectors, the risk-model overrides, and the statistics (weighted mean,
bootstrap CI, Cohen's κ, premium tiers): judge verdicts are fixtures.
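
For orientation, the statistics named here can be sketched in a few lines. The code assumes per-item risks in [0, 1] with positive severity weights, uses a plain percentile bootstrap, and invents the function names and example numbers; premium-tier cut-offs are left out because their thresholds are not stated in this README.

```python
import random


def weighted_mean(values, weights):
    """Severity-weighted mean; weights must sum to a positive value."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(values, weights)) / total


def bootstrap_ci(values, weights, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the weighted mean (items resampled with replacement)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            weighted_mean([values[i] for i in idx], [weights[i] for i in idx])
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def insurability_index(overall_risk):
    """Index = 100 * (1 - weighted overall risk), as defined in the README."""
    return 100.0 * (1.0 - overall_risk)


# Made-up item risks and severities for one axis.
risks = [0.0, 0.25, 0.5, 1.0, 0.0, 0.25]
severities = [1, 2, 3, 1, 2, 1]

point = weighted_mean(risks, severities)
ci_low, ci_high = bootstrap_ci(risks, severities)
index = insurability_index(point)
```

The point estimate here is 0.325, giving an index of 67.5; the interval bounds depend on the resampling seed, so they are not quoted.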

## Layout

`datasets/` suites + cards · `scoring/` deterministic + judge + combine + aggregate ·
`guardrails.py` toggleable layer · `runner.py` run matrix · `report.py` PDF + publish · `cli.py`.