Pre-call budget check requires tokenizer that OpenAI-compatible SDKs don't expose

**Discovered during:** dogfooding integration in [suspension](https://github.com/ryanwi/suspension) (v0.14.2)

**Behavior**

`TokenBudgetTracker.check_budget(identity, usage)` exists as a pre-call hook in addition to `record_usage()`. The intent is clear: check before the LLM call, fail fast if the budget is blown.

But to call `check_budget()`, the consumer needs to know the *prompt token count* before the call returns. OpenAI-compatible SDKs (the `openai` Python package against Claude/Gemini/Ollama/etc.) don't expose this. A consumer would have to:

1. Bring in `tiktoken` or another tokenizer
2. Tokenize the full message list against the target model's tokenizer
3. Estimate the response (max_tokens? historical average?)
4. Convert to cost via pricing table
5. Then call `check_budget()`

Steps 1-4 are non-trivial and model-specific. For Claude via OpenAI-compatible API, there's no canonical tokenizer in the consumer's stack.

The practical consequence: the integration ends up soft-ceiling. `record_usage()` is called *after* the API call returns, which records actuals correctly but doesn't prevent the call. The local soft-ceiling check (in-memory `_tokens_used >= budget`) stays as defense in depth, but the SDK's `check_budget()` is unused.

**Why this matters**

The whole "pre-call budget enforcement" story is documented in the SDK README but not achievable without a tokenizer adapter. A consumer reading the docs sees `check_budget()`, assumes hard-ceiling enforcement is available, and only discovers the gap once they try to implement it.

**Possible fixes**

1. **Document the soft-ceiling reality.** The README should say `record_usage()` is the primary enforcement path for OpenAI-compatible SDK consumers, and `check_budget()` requires a separate tokenizer not provided.
2. **Ship a tokenizer adapter.** `agent_control_plane.tokenizers.OpenAITokenizer(model_id)` that wraps tiktoken with appropriate model-to-encoding mapping. Optional dep; the SDK doesn't currently depend on tiktoken.
3. **`check_budget_after_call(response)` helper** that accepts the OpenAI response object directly and computes usage from `response.usage`. Effectively a "pre-call check using the *previous* call's overshoot as a signal" — pragmatic compromise that doesn't require a tokenizer.

Option 1 is the honest move. Option 2 is the most useful. Option 3 is novel but might be the right shape for the post-call check that's actually achievable.

**Concrete repro from suspension**

[`suspension/llm.py:_record_cp_usage`](https://github.com/ryanwi/suspension/blob/main/suspension/llm.py) records post-call only. The local `_tokens_used >= self.token_budget` check at the top of `complete()` is the only pre-call gate.

**Severity:** Medium. Affects expectations more than functionality. A consumer who thinks they have a hard ceiling and only has a soft one is misled.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-call budget check requires tokenizer that OpenAI-compatible SDKs don't expose #5

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Pre-call budget check requires tokenizer that OpenAI-compatible SDKs don't expose #5

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions