Discovered during: dogfooding integration in suspension (v0.14.2)
Behavior
TokenBudgetTracker.check_budget(identity, usage) exists as a pre-call hook in addition to record_usage(). The intent is clear: check before the LLM call, fail fast if the budget is blown.
But to call check_budget(), the consumer needs to know the prompt token count before the call returns. OpenAI-compatible SDKs (the openai Python package against Claude/Gemini/Ollama/etc.) don't expose this. A consumer would have to:
- Bring in
tiktoken or another tokenizer
- Tokenize the full message list against the target model's tokenizer
- Estimate the response (max_tokens? historical average?)
- Convert to cost via pricing table
- Then call
check_budget()
Steps 1-4 are non-trivial and model-specific. For Claude via OpenAI-compatible API, there's no canonical tokenizer in the consumer's stack.
The practical consequence: the integration ends up soft-ceiling. record_usage() is called after the API call returns, which records actuals correctly but doesn't prevent the call. The local soft-ceiling check (in-memory _tokens_used >= budget) stays as defense in depth, but the SDK's check_budget() is unused.
Why this matters
The whole "pre-call budget enforcement" story is documented in the SDK README but not achievable without a tokenizer adapter. A consumer reading the docs sees check_budget(), assumes hard-ceiling enforcement is available, and only discovers the gap once they try to implement it.
Possible fixes
- Document the soft-ceiling reality. The README should say
record_usage() is the primary enforcement path for OpenAI-compatible SDK consumers, and check_budget() requires a separate tokenizer not provided.
- Ship a tokenizer adapter.
agent_control_plane.tokenizers.OpenAITokenizer(model_id) that wraps tiktoken with appropriate model-to-encoding mapping. Optional dep; the SDK doesn't currently depend on tiktoken.
check_budget_after_call(response) helper that accepts the OpenAI response object directly and computes usage from response.usage. Effectively a "pre-call check using the previous call's overshoot as a signal" — pragmatic compromise that doesn't require a tokenizer.
Option 1 is the honest move. Option 2 is the most useful. Option 3 is novel but might be the right shape for the post-call check that's actually achievable.
Concrete repro from suspension
suspension/llm.py:_record_cp_usage records post-call only. The local _tokens_used >= self.token_budget check at the top of complete() is the only pre-call gate.
Severity: Medium. Affects expectations more than functionality. A consumer who thinks they have a hard ceiling and only has a soft one is misled.
Discovered during: dogfooding integration in suspension (v0.14.2)
Behavior
TokenBudgetTracker.check_budget(identity, usage)exists as a pre-call hook in addition torecord_usage(). The intent is clear: check before the LLM call, fail fast if the budget is blown.But to call
check_budget(), the consumer needs to know the prompt token count before the call returns. OpenAI-compatible SDKs (theopenaiPython package against Claude/Gemini/Ollama/etc.) don't expose this. A consumer would have to:tiktokenor another tokenizercheck_budget()Steps 1-4 are non-trivial and model-specific. For Claude via OpenAI-compatible API, there's no canonical tokenizer in the consumer's stack.
The practical consequence: the integration ends up soft-ceiling.
record_usage()is called after the API call returns, which records actuals correctly but doesn't prevent the call. The local soft-ceiling check (in-memory_tokens_used >= budget) stays as defense in depth, but the SDK'scheck_budget()is unused.Why this matters
The whole "pre-call budget enforcement" story is documented in the SDK README but not achievable without a tokenizer adapter. A consumer reading the docs sees
check_budget(), assumes hard-ceiling enforcement is available, and only discovers the gap once they try to implement it.Possible fixes
record_usage()is the primary enforcement path for OpenAI-compatible SDK consumers, andcheck_budget()requires a separate tokenizer not provided.agent_control_plane.tokenizers.OpenAITokenizer(model_id)that wraps tiktoken with appropriate model-to-encoding mapping. Optional dep; the SDK doesn't currently depend on tiktoken.check_budget_after_call(response)helper that accepts the OpenAI response object directly and computes usage fromresponse.usage. Effectively a "pre-call check using the previous call's overshoot as a signal" — pragmatic compromise that doesn't require a tokenizer.Option 1 is the honest move. Option 2 is the most useful. Option 3 is novel but might be the right shape for the post-call check that's actually achievable.
Concrete repro from suspension
suspension/llm.py:_record_cp_usagerecords post-call only. The local_tokens_used >= self.token_budgetcheck at the top ofcomplete()is the only pre-call gate.Severity: Medium. Affects expectations more than functionality. A consumer who thinks they have a hard ceiling and only has a soft one is misled.