Discovered during: dogfooding integration in suspension (v0.14.3). Validation test of the budget gate on 2026-05-11.
Behavior
TokenBudgetTracker.record_usage() checks the budget before incrementing state. When the check fails, the exception fires and the actual usage is not recorded in the ledger (token_budget_tracker.py:103-114):
    configs = await self._repo.list_budget_configs(identity)
    result = await self._check_budget_with_configs(configs, identity, usage)
    if not result.allowed:
        raise TokenBudgetExhaustedError(...)  # exits here; nothing is written
    for config in configs:
        await self._repo.increment_usage(...)
    await self._repo.record_usage(session_id, usage, identity)
Both the per-config increment_usage and the ledger record_usage are gated by the check. When the check fails, neither row lands.
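The ordering can be reproduced in isolation. The following is a minimal sketch, not the SDK's code: `InMemoryRepo` and `CheckFirstTracker` are hypothetical stand-ins, and the cap check is reduced to a single USD comparison.

```python
import asyncio
from dataclasses import dataclass, field


class TokenBudgetExhaustedError(Exception):
    """Stand-in for the SDK exception of the same name."""


@dataclass
class InMemoryRepo:
    """Hypothetical in-memory repository; not the SDK's real storage layer."""
    cap_usd: float
    spent_usd: float = 0.0
    ledger: list = field(default_factory=list)


class CheckFirstTracker:
    """Mimics the ordering described above: check, then increment and record."""

    def __init__(self, repo: InMemoryRepo) -> None:
        self._repo = repo

    async def record_usage(self, cost_usd: float) -> None:
        projected = self._repo.spent_usd + cost_usd
        if projected > self._repo.cap_usd:
            # Raising here skips both writes below.
            raise TokenBudgetExhaustedError(
                f"{projected:.6f} > {self._repo.cap_usd:.6f}"
            )
        self._repo.spent_usd += cost_usd
        self._repo.ledger.append(cost_usd)


async def demo() -> InMemoryRepo:
    repo = InMemoryRepo(cap_usd=0.10)
    tracker = CheckFirstTracker(repo)
    try:
        # By this point the provider call has already been billed.
        await tracker.record_usage(0.114471)
    except TokenBudgetExhaustedError:
        pass
    return repo


repo = asyncio.run(demo())
print(repo.ledger, repo.spent_usd)  # prints: [] 0.0
```

The exception fires as intended, but the ledger and the running total are both untouched, even though the spend is real.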
Why this matters: README endorses the broken pattern
This isn't a niche consumer pattern colliding with strict SDK semantics — it's the SDK's own documented path producing a ledger that doesn't match reality.
README.md:126 explicitly directs OpenAI-compatible consumers (i.e. most of them) to use post-call record_usage as the practical path, because pre-call tokenization isn't available without a separate tokenizer (see also #5). In that pattern — the recommended pattern — when the call already happened and the budget check fails, the consumer has paid for those tokens at the LLM provider, but the SDK's ledger never records them.
Two distinct claims the SDK makes:
- "Enforces budget caps" — true; the exception fires.
- "Tracks per-tenant spend" — partially true; tracks allowed spend, not attempted spend.
Consumers assuming the ledger matches their LLM bill will be wrong. The discrepancy is silent: nothing in the ledger surfaces the gap.
Concrete repro from suspension
Validation test on 2026-05-11:
- Lowered default tenant cap from $10.00 to $0.10
- Ran suspension generate: agent's first LLM call returned successfully (~$0.114 of Opus tokens consumed at Anthropic)
- LLMClient.complete() called tracker.record_usage(None, identity, usage)
- SDK raised TokenBudgetExhaustedError: Cost limit exceeded for daily budget ...: 0.114471 > 0.100000
- Restored cap to $10.00
- suspension budgets usage default reports $0.066 used today, but actual Anthropic spend today is $0.066 + $0.114 = $0.180
The ledger underreports actual spend by exactly the cost of the over-budget call. At scale, the ledger becomes systematically inaccurate as a source of truth for cost reporting.
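The drift arithmetic from the repro, worked through with the rounded figures above. The variable names are illustrative only; `Decimal` is used to avoid float noise in currency sums.

```python
from decimal import Decimal

ledger_reported = Decimal("0.066")  # what the SDK ledger showed for the day
blocked_call = Decimal("0.114")     # approximate cost of the rejected call

# What the provider actually billed: recorded spend plus the unrecorded call.
provider_spend = ledger_reported + blocked_call
drift = provider_spend - ledger_reported

print(provider_spend, drift)  # prints: 0.180 0.114
```

Each further over-budget call adds its full cost to the drift, so the gap only grows.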
Test coverage analysis confirms low-risk transition
tests/test_token_budget_tracker.py:189-201 (test_record_exceeds_budget_raises) asserts only that the exception is raised; it does not assert that no record was written. test_record_succeeds_under_limit asserts len(_usage_records) == 1 on the success path. So Option 1 (always-record-then-raise, under Possible fixes) would only need a new assertion that the ledger row exists even after the raise, and the success test stays green. No existing assertion breaks.
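A sketch of what the added assertion could look like. The real tests are not reproduced here: `FakeTracker` is a hypothetical stand-in with record-then-raise ordering, and only the `_usage_records` name is taken from the existing test file.

```python
import asyncio


class TokenBudgetExhaustedError(Exception):
    """Stand-in for the SDK exception."""


class FakeTracker:
    """Hypothetical tracker with record-then-raise ordering (Option 1)."""

    def __init__(self, cap_usd):
        self._cap_usd = cap_usd
        self._usage_records = []

    async def record_usage(self, cost_usd):
        self._usage_records.append(cost_usd)  # ledger write happens first
        if sum(self._usage_records) > self._cap_usd:
            raise TokenBudgetExhaustedError("Cost limit exceeded")


async def check_record_exceeds_budget_still_records():
    tracker = FakeTracker(cap_usd=0.10)
    raised = False
    try:
        await tracker.record_usage(0.114471)
    except TokenBudgetExhaustedError:
        raised = True
    assert raised  # existing expectation: the over-budget call raises
    # New expectation: the row exists even though the call raised.
    assert len(tracker._usage_records) == 1
    return len(tracker._usage_records)


async def check_record_succeeds_under_limit():
    tracker = FakeTracker(cap_usd=10.00)
    await tracker.record_usage(0.066)
    assert len(tracker._usage_records) == 1  # unchanged success-path assertion
    return len(tracker._usage_records)


over = asyncio.run(check_record_exceeds_budget_still_records())
under = asyncio.run(check_record_succeeds_under_limit())
```

Both checks pass against the stand-in, which matches the claim that the success-path assertion is unaffected by the reordering.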
No existing mitigation
grep finds no record_usage_force, record_then_check, or similar escape hatch. Consumers using the recommended post-call pattern have no way around the ledger drift.
Possible fixes
1. Always record, then optionally raise. Record the usage first, then raise TokenBudgetExhaustedError if the check fails. Ledger matches reality; consumer still gets the exception. Slight semantic shift (the "record" is no longer atomic with the "permission"), but more honest about post-call enforcement and aligned with the README's own guidance.
2. Separate methods. Keep record_usage as the strict atomic version, add record_usage_force() that always records and returns (recorded, budget_allowed). Consumers explicitly opt into the always-record semantics. Adds API surface; users reach for record_usage first and get burned before discovering the force variant.
3. Document the gap. Add to the README that "if you call record_usage post-call, blocked attempts won't appear in the ledger; rely on your LLM provider's billing API as the source of truth for actual spend." Pure docs fix; trap remains.
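For the separate-methods option, the return shape might look like the sketch below. `record_usage_force` does not exist in the SDK today; `ForceResult` and the simplified `Tracker` are assumptions made purely for illustration.

```python
import asyncio
from typing import NamedTuple


class ForceResult(NamedTuple):
    """Hypothetical return type: (recorded, budget_allowed)."""
    recorded: bool
    budget_allowed: bool


class Tracker:
    """Hypothetical tracker; only the proposed force variant is shown."""

    def __init__(self, cap_usd):
        self._cap_usd = cap_usd
        self._spent_usd = 0.0
        self._usage_records = []

    async def record_usage_force(self, cost_usd):
        # Decide against pre-call state, but never block the write.
        allowed = self._spent_usd + cost_usd <= self._cap_usd
        self._spent_usd += cost_usd
        self._usage_records.append(cost_usd)
        return ForceResult(recorded=True, budget_allowed=allowed)


result = asyncio.run(Tracker(cap_usd=0.10).record_usage_force(0.114471))
print(result)  # prints: ForceResult(recorded=True, budget_allowed=False)
```

The caller then has to inspect `budget_allowed` instead of catching an exception, which is part of the extra API surface this option carries.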
Option 1 is the right default given the README's own guidance. It's a behavior change but a small one — anyone using record_usage as a strict atomic pre-call gate is in the minority the README itself dismisses as impractical. Worth keeping the exception name and message identical so consumer code catching TokenBudgetExhaustedError doesn't need changes. CHANGELOG entry: "behavior change: usage is now recorded before the budget check raises; the ledger reflects all attempts including over-budget ones."
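A rough sketch of the Option 1 reordering, assuming the shape of the excerpt under Behavior. `FakeRepo` is a hypothetical stand-in, its method signatures are guesses, and the cap check is simplified to one USD comparison per config. The real change would live in TokenBudgetTracker.record_usage().

```python
import asyncio
from dataclasses import dataclass, field


class TokenBudgetExhaustedError(Exception):
    """Stand-in for the SDK exception; the name is kept identical on purpose."""


@dataclass
class FakeRepo:
    """Hypothetical repository; method names follow the excerpt, signatures are guesses."""
    caps: dict  # config name -> cap in USD
    used: dict = field(default_factory=dict)
    ledger: list = field(default_factory=list)

    async def list_budget_configs(self, identity):
        return list(self.caps)

    async def increment_usage(self, config, cost_usd):
        self.used[config] = self.used.get(config, 0.0) + cost_usd

    async def record_usage(self, session_id, cost_usd, identity):
        self.ledger.append((session_id, identity, cost_usd))


class RecordThenRaiseTracker:
    def __init__(self, repo):
        self._repo = repo

    async def record_usage(self, session_id, identity, cost_usd):
        configs = await self._repo.list_budget_configs(identity)
        # Evaluate the caps against pre-call state, but defer the raise.
        exceeded = [
            c for c in configs
            if self._repo.used.get(c, 0.0) + cost_usd > self._repo.caps[c]
        ]
        # Writes happen unconditionally: the spend has already occurred.
        for config in configs:
            await self._repo.increment_usage(config, cost_usd)
        await self._repo.record_usage(session_id, cost_usd, identity)
        if exceeded:
            raise TokenBudgetExhaustedError(
                f"Cost limit exceeded for {exceeded[0]} budget"
            )


async def demo():
    repo = FakeRepo(caps={"daily": 0.10})
    tracker = RecordThenRaiseTracker(repo)
    try:
        await tracker.record_usage(None, "default", 0.114471)
    except TokenBudgetExhaustedError as exc:
        return repo, str(exc)
    return repo, None


repo, message = asyncio.run(demo())
```

Consumer code that catches TokenBudgetExhaustedError keeps working unchanged; the only observable difference is that the ledger row and the per-config counters now exist after the raise.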
This is distinct from #5 (pre-call tokenizer): #5 is about preventing the spend before it happens; this is about recording the spend that already happened. Both are real, both worth addressing.
Severity
Medium. Affects every consumer using post-call enforcement (which is most consumers without an in-process tokenizer — and which the README recommends). Failure mode is silent ledger drift — the kind of thing that doesn't matter until cost-reporting accuracy does, at which point it's been wrong for a while.