Skip to content

Pre-call budget check requires tokenizer that OpenAI-compatible SDKs don't expose #5

@ryanwi

Description

@ryanwi

Discovered during: dogfooding integration in suspension (v0.14.2)

Behavior

TokenBudgetTracker.check_budget(identity, usage) exists as a pre-call hook in addition to record_usage(). The intent is clear: check before the LLM call, fail fast if the budget is blown.

But to call check_budget(), the consumer needs to know the prompt token count before the call returns. OpenAI-compatible SDKs (the openai Python package against Claude/Gemini/Ollama/etc.) don't expose this. A consumer would have to:

  1. Bring in tiktoken or another tokenizer
  2. Tokenize the full message list against the target model's tokenizer
  3. Estimate the response (max_tokens? historical average?)
  4. Convert to cost via pricing table
  5. Then call check_budget()

Steps 1-4 are non-trivial and model-specific. For Claude via OpenAI-compatible API, there's no canonical tokenizer in the consumer's stack.

The practical consequence: the integration ends up soft-ceiling. record_usage() is called after the API call returns, which records actuals correctly but doesn't prevent the call. The local soft-ceiling check (in-memory _tokens_used >= budget) stays as defense in depth, but the SDK's check_budget() is unused.

Why this matters

The whole "pre-call budget enforcement" story is documented in the SDK README but not achievable without a tokenizer adapter. A consumer reading the docs sees check_budget(), assumes hard-ceiling enforcement is available, and only discovers the gap once they try to implement it.

Possible fixes

  1. Document the soft-ceiling reality. The README should say record_usage() is the primary enforcement path for OpenAI-compatible SDK consumers, and check_budget() requires a separate tokenizer not provided.
  2. Ship a tokenizer adapter. agent_control_plane.tokenizers.OpenAITokenizer(model_id) that wraps tiktoken with appropriate model-to-encoding mapping. Optional dep; the SDK doesn't currently depend on tiktoken.
  3. check_budget_after_call(response) helper that accepts the OpenAI response object directly and computes usage from response.usage. Effectively a "pre-call check using the previous call's overshoot as a signal" — pragmatic compromise that doesn't require a tokenizer.

Option 1 is the honest move. Option 2 is the most useful. Option 3 is novel but might be the right shape for the post-call check that's actually achievable.

Concrete repro from suspension

suspension/llm.py:_record_cp_usage records post-call only. The local _tokens_used >= self.token_budget check at the top of complete() is the only pre-call gate.

Severity: Medium. Affects expectations more than functionality. A consumer who thinks they have a hard ceiling and only has a soft one is misled.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions