record_usage rejects ledger entry when budget exceeded — ledger diverges from actual LLM spend

**Discovered during:** dogfooding integration in [suspension](https://github.com/ryanwi/suspension) (v0.14.3), in a validation test of the budget gate on 2026-05-11.

## Behavior

`TokenBudgetTracker.record_usage()` checks the budget *before* incrementing state. When the check fails, the exception fires and **the actual usage is not recorded in the ledger** ([token_budget_tracker.py:103-114](https://github.com/ryanwi/agent-control-plane/blob/main/src/agent_control_plane/engine/token_budget_tracker.py#L103-L114)):

```python
configs = await self._repo.list_budget_configs(identity)
result = await self._check_budget_with_configs(configs, identity, usage)
if not result.allowed:
    raise TokenBudgetExhaustedError(...)  # exits here; nothing below runs, so nothing is written

for config in configs:
    await self._repo.increment_usage(...)
await self._repo.record_usage(session_id, usage, identity)
```

Both the per-config `increment_usage` and the ledger `record_usage` are gated by the check. When the check fails, neither row lands.

## Why this matters: README endorses the broken pattern

This isn't a niche consumer pattern colliding with strict SDK semantics — it's the SDK's own documented path producing a ledger that doesn't match reality.

`README.md:126` explicitly directs OpenAI-compatible consumers (i.e. most of them) to use post-call `record_usage` as the practical path, because pre-call tokenization isn't available without a separate tokenizer (see also #5). In that pattern — the *recommended* pattern — when the call already happened and the budget check fails, the consumer has paid for those tokens at the LLM provider, but the SDK's ledger never records them.

Two distinct claims the SDK makes:

1. *"Enforces budget caps"* — true; the exception fires.
2. *"Tracks per-tenant spend"* — partially true; tracks *allowed* spend, not *attempted* spend.

Consumers assuming the ledger matches their LLM bill will be wrong. The discrepancy is silent: nothing in the ledger surfaces the gap.

## Concrete repro from suspension

Validation test on 2026-05-11:

1. Lowered `default` tenant cap from $10.00 to $0.10
2. Ran `suspension generate` — agent's first LLM call returned successfully (~$0.114 of Opus tokens consumed at Anthropic)
3. `LLMClient.complete()` called `tracker.record_usage(None, identity, usage)`
4. SDK raised `TokenBudgetExhaustedError: Cost limit exceeded for daily budget ...: 0.114471 > 0.100000`
5. Restored cap to $10.00
6. `suspension budgets usage default` reports $0.066 used today — but actual Anthropic spend today is $0.066 + $0.114 = $0.180

The ledger underreports actual spend by exactly the cost of the over-budget call. At scale, the ledger becomes systematically inaccurate as a source of truth for cost reporting.
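The arithmetic in the steps above can be reproduced with a toy model. This is a sketch under stated assumptions, not the SDK's code: `CheckThenRecordLedger`, `BudgetExceeded`, and `simulate` are hypothetical names, the check is simplified to a per-call comparison, and the report does not say when the $0.066 was recorded relative to the rejected call, so the sketch records it first.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Hypothetical stand-in for TokenBudgetExhaustedError."""


class CheckThenRecordLedger:
    """Toy model of the current ordering: check first, write only if allowed."""

    def __init__(self, cap: Decimal) -> None:
        self.cap = cap
        self.entries: list[Decimal] = []

    def record_usage(self, cost: Decimal) -> None:
        # Simplified per-call check; the real check evaluates budget configs.
        if cost > self.cap:
            raise BudgetExceeded(f"{cost} > {self.cap}")
        self.entries.append(cost)

    def total(self) -> Decimal:
        return sum(self.entries, Decimal("0"))


def simulate() -> tuple[Decimal, Decimal]:
    ledger = CheckThenRecordLedger(cap=Decimal("10.00"))
    provider_bill: list[Decimal] = []

    def complete(cost: Decimal) -> None:
        provider_bill.append(cost)  # the provider bills once the call returns
        ledger.record_usage(cost)   # post-call recording; may raise

    complete(Decimal("0.066"))      # recorded normally under the $10.00 cap
    ledger.cap = Decimal("0.10")    # step 1: lower the cap
    try:
        complete(Decimal("0.114"))  # steps 2-4: billed, then rejected
    except BudgetExceeded:
        pass
    ledger.cap = Decimal("10.00")   # step 5: restore the cap
    return ledger.total(), sum(provider_bill, Decimal("0"))


ledger_total, provider_total = simulate()
print(ledger_total, provider_total, provider_total - ledger_total)
```

The gap between the two totals is the cost of the single rejected call. It matches the difference between what `suspension budgets usage default` reports and what Anthropic billed.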

## Test coverage analysis confirms low-risk transition

`tests/test_token_budget_tracker.py:189-201` (`test_record_exceeds_budget_raises`) asserts only that the exception is raised; it does not assert the *absence* of a record. `test_record_succeeds_under_limit` asserts `len(_usage_records) == 1` on the success path. Option 1 under "Possible fixes" (always record, then raise) would therefore need only one new assertion, that the ledger row exists even after the raise, and the success test stays green. No existing assertion breaks.
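The added assertion might look roughly like the following. It is a sketch only: `FakeTracker` is a hypothetical, synchronous stand-in with a simplified per-call check, the test name is invented, and where `_usage_records` lives in the real fixtures is not shown in this issue.

```python
class TokenBudgetExhaustedError(Exception):
    """Stand-in for the SDK exception of the same name."""


class FakeTracker:
    """Hypothetical stand-in using the record-then-raise ordering."""

    def __init__(self, cost_limit: float) -> None:
        self.cost_limit = cost_limit
        self._usage_records: list[float] = []

    def record_usage(self, cost: float) -> None:
        self._usage_records.append(cost)  # write first
        if cost > self.cost_limit:        # then enforce
            raise TokenBudgetExhaustedError(f"{cost} > {self.cost_limit}")


def test_record_exceeds_budget_still_records() -> None:
    tracker = FakeTracker(cost_limit=0.10)
    try:
        tracker.record_usage(0.114471)
    except TokenBudgetExhaustedError:
        pass  # existing expectation: the call raises
    else:
        raise AssertionError("expected TokenBudgetExhaustedError")
    # New expectation: the ledger row exists even though the call raised.
    assert len(tracker._usage_records) == 1


test_record_exceeds_budget_still_records()
```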

## No existing mitigation

`grep` finds no `record_usage_force`, `record_then_check`, or similar escape hatch. Consumers using the recommended post-call pattern have no way around the ledger drift.

## Possible fixes

1. **Always record, then optionally raise.** Record the usage first, then raise `TokenBudgetExhaustedError` if the check fails. Ledger matches reality; consumer still gets the exception. Slight semantic shift — the "record" is no longer atomic with the "permission" — but more honest about post-call enforcement and aligned with the README's own guidance.

2. **Separate methods.** Keep `record_usage` as the strict atomic version, add `record_usage_force()` that always records and returns `(recorded, budget_allowed)`. Consumers explicitly opt into the always-record semantics. Adds API surface; users reach for `record_usage` first and get burned before discovering the force variant.

3. **Document the gap.** Add to the README that "if you call `record_usage` post-call, blocked attempts won't appear in the ledger; rely on your LLM provider's billing API as the source of truth for actual spend." Pure docs fix; trap remains.

**Option 1 is the right default** given the README's own guidance. It's a behavior change but a small one — anyone using `record_usage` as a strict atomic pre-call gate is in the minority the README itself dismisses as impractical. Worth keeping the exception name and message identical so consumer code catching `TokenBudgetExhaustedError` doesn't need changes. CHANGELOG entry: *"behavior change: usage is now recorded before the budget check raises; the ledger reflects all attempts including over-budget ones."*
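A minimal sketch of the Option 1 ordering, under loud assumptions: `InMemoryRepo` and `RecordThenRaiseTracker` are hypothetical, the repository signatures are invented for illustration (the real ones are elided in the excerpt above), and the budget check is reduced to a single cumulative cost limit. The sketch computes the verdict from pre-write state, writes unconditionally, and only then raises.

```python
import asyncio
from dataclasses import dataclass, field
from decimal import Decimal


class TokenBudgetExhaustedError(Exception):
    """Stand-in for the SDK exception of the same name."""


@dataclass
class InMemoryRepo:
    """Hypothetical repository; the real signatures may differ."""

    cost_limit: Decimal
    spent: Decimal = Decimal("0")
    ledger: list = field(default_factory=list)

    async def increment_usage(self, cost: Decimal) -> None:
        self.spent += cost

    async def record_usage(self, session_id, cost: Decimal, identity: str) -> None:
        self.ledger.append((session_id, identity, cost))


class RecordThenRaiseTracker:
    def __init__(self, repo: InMemoryRepo) -> None:
        self._repo = repo

    async def record_usage(self, session_id, identity: str, cost: Decimal) -> None:
        # Decide against pre-write state so the verdict matches a check-first design.
        allowed = self._repo.spent + cost <= self._repo.cost_limit
        # Write unconditionally: the provider has already billed these tokens.
        await self._repo.increment_usage(cost)
        await self._repo.record_usage(session_id, cost, identity)
        if not allowed:
            raise TokenBudgetExhaustedError(
                f"{self._repo.spent} > {self._repo.cost_limit}"
            )


async def demo() -> tuple[bool, list]:
    repo = InMemoryRepo(cost_limit=Decimal("0.10"))
    tracker = RecordThenRaiseTracker(repo)
    try:
        await tracker.record_usage(None, "default", Decimal("0.114471"))
    except TokenBudgetExhaustedError:
        raised = True
    else:
        raised = False
    return raised, repo.ledger


raised, ledger = asyncio.run(demo())
```

One consequence worth deciding on deliberately: once an over-budget call is recorded, cumulative usage sits above the cap, so later calls in the same window keep raising. In a real repository the increments and the ledger write would presumably share one transaction so that a failure between them cannot leave the two out of step.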

This is **distinct from #5** (pre-call tokenizer): #5 is about preventing the spend before it happens; this is about *recording* the spend that already happened. Both are real, both worth addressing.

## Severity

Medium. Affects every consumer using post-call enforcement (which is most consumers without an in-process tokenizer — and which the README recommends). Failure mode is silent ledger drift — the kind of thing that doesn't matter until cost-reporting accuracy does, at which point it's been wrong for a while.
