Cut agent wall-clock latency by overlapping tool execution with LLM streaming.
A production-grade reference implementation of eager tool calling — the pattern that dispatches each tool the moment its block finishes streaming, not after
message_stop.
ASCII fallback
Classic parallel tool calling:
stream : [==================================]
tools : [===========] ← idle during stream
total : ← stream + max(tool)
Eager tool calling:
stream : [==================================]
tool A : [=========] ← fires mid-stream
tool B : [=========] ← fires mid-stream, overlaps A
tool C : [=========]← fires at message_stop
total : [==================================] ← max(stream, max(tool))
Parallel tool calling overlaps tools with tools. Eager tool calling overlaps tools with generation itself.
Numbers below come from a synthetic, deterministic harness; run make bench to reproduce them locally.
Across 16 workloads (3 → 15 tools), eager beats parallel by 1.20× – 1.50× (median ~1.28×).
Parallel is the right baseline: modern frameworks (langchain.agents.create_agent,
OpenAI Agents SDK, Vercel AI SDK) already execute tool calls from one
assistant message concurrently. Eager's win comes from overlapping tools
with the stream itself — something parallel dispatch can't do.
Full table + repro details: bench/results.md.
| Workload | Sequential | Parallel | Eager | Speedup vs parallel |
|---|---|---|---|---|
| 3-tool analytics | 4.90s | 3.50s | 2.90s | 1.21× |
| 9-tool incident triage | 17.61s | 9.50s | 6.50s | 1.46× |
| 15-tool ad campaign | 30.42s | 11.50s | 8.80s | 1.31× |
These are lower bounds. The synthetic stream removes network jitter, tail latency, and provider-side variance, the things that make eager dispatch shine in production. Run make bench-live-anthropic (or make bench-live-openai) to spot-check against a real provider.
pip install eager-tools-core eager-tools-langgraph # once published
# or, from source:
git clone https://github.com/cloudthinker-ai/eager-tools && cd eager-tools && make sync

import asyncio, os
from langchain.agents import create_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from eager_tools_langgraph import eager_middleware

# Stand-in for a slow backend call; this is what the eager path actually runs.
class SlowTool:
    def __init__(self, name: str, delay: float = 2.0):
        self.name = name
        self.idempotent = True  # marks the tool as safe to dispatch eagerly
        self._delay = delay

    async def __call__(self, arguments):
        await asyncio.sleep(self._delay)  # simulate tool latency
        return {"name": self.name, "args": arguments, "ok": True}


# Schema-only declarations: the model sees these names, signatures, and docstrings.
@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return ""


@tool
def get_stock_price(ticker: str) -> str:
    """Get the current stock price for a ticker symbol."""
    return ""


@tool
def get_news(topic: str) -> str:
    """Get recent news on a topic."""
    return ""


# Map each declared tool name to the callable the eager path should run.
eager_tools = {
    "get_weather": SlowTool("get_weather"),
    "get_stock_price": SlowTool("get_stock_price"),
    "get_news": SlowTool("get_news"),
}


async def main():
    agent = create_agent(
        model=ChatAnthropic(model_name="claude-sonnet-4-5", timeout=60.0, stop=None),
        tools=[get_weather, get_stock_price, get_news],
        middleware=[eager_middleware(eager_tools)],
    )
    result = await agent.ainvoke({
        "messages": [HumanMessage(
            "Get the weather in NYC, the AAPL stock price, and recent AI news."
        )]
    })
    print(result["messages"][-1].content)

asyncio.run(main())

One middleware line wires eager dispatch into any create_agent call, with no
changes to your tools or prompt. Works with OpenAI too: swap ChatAnthropic
for ChatOpenAI. Runnable variants in examples/.
Modern agent APIs — Anthropic, OpenAI, Bedrock — let the model emit multiple tool_use blocks in one assistant message and run them in parallel. That moves the tool phase from sum of durations to max. Good, but insufficient.
The stream phase still happens first. Tools still wait for message_stop. A four-second model stream followed by 2.5s of parallel tool execution is 6.5 seconds of wall clock. Eager tool calling makes it 4 seconds — the tools run during the stream, not after it.
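The arithmetic above can be checked with a small back-of-the-envelope model. This is only a sketch, not part of the library: the function names and the seal times (the moment each tool block finishes streaming) are illustrative assumptions. It also shows why the eager figure is a best case: a block that seals late in the stream still pushes the total past the stream duration.

```python
def classic_wall_clock(stream_s: float, tool_durations: list[float]) -> float:
    """Classic parallel dispatch: all tools start only after message_stop."""
    return stream_s + max(tool_durations, default=0.0)


def eager_wall_clock(
    stream_s: float,
    seal_times: list[float],
    tool_durations: list[float],
) -> float:
    """Eager dispatch: each tool starts at its (assumed) seal time.

    The turn ends when both the stream and the slowest-finishing tool are done.
    """
    finishes = [seal + dur for seal, dur in zip(seal_times, tool_durations)]
    return max(stream_s, max(finishes, default=0.0))


# The example from the text: a 4 s stream and three 2.5 s tools.
tools = [2.5, 2.5, 2.5]
classic = classic_wall_clock(4.0, tools)

# Hypothetical seal times: all three blocks finish streaming by 1.5 s.
eager_early = eager_wall_clock(4.0, [0.5, 1.0, 1.5], tools)

# Same tools, but the last block seals at 3.0 s, so it finishes after the stream.
eager_late = eager_wall_clock(4.0, [0.5, 1.0, 3.0], tools)

print(classic, eager_early, eager_late)
```

Under these assumed seal times the classic path takes 6.5 s and the eager path 4.0 s, matching the figures in the text; with the late-sealing block the eager total is 5.5 s, still ahead of classic.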
See METHOD.md for the full mechanism: the seal event, the tool_call_id invariant, the runtime contract, and the edge cases.
For the per-block mechanism (chunks → buffer → seal → dispatch), see
docs/diagrams/seal-mechanism-flow.svg.
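As a rough illustration of the chunks → buffer → seal → dispatch flow, the sketch below runs that loop over a fake event stream. Everything here is hypothetical: the event names (block_start, block_delta, block_stop), the fake_stream and run_tool helpers, and the eager_dispatch function are simplified stand-ins chosen for this example, not the library's API or any provider's exact wire format. METHOD.md remains the authoritative description.

```python
import asyncio
import json


async def fake_stream():
    """Hypothetical event stream: two tool blocks with chunked JSON arguments."""
    events = [
        {"type": "block_start", "index": 0, "id": "call_a", "name": "get_weather"},
        {"type": "block_delta", "index": 0, "json": '{"city": '},
        {"type": "block_delta", "index": 0, "json": '"NYC"}'},
        {"type": "block_stop", "index": 0},
        {"type": "block_start", "index": 1, "id": "call_b", "name": "get_news"},
        {"type": "block_delta", "index": 1, "json": '{"topic": "AI"}'},
        {"type": "block_stop", "index": 1},
        {"type": "message_stop"},
    ]
    for event in events:
        await asyncio.sleep(0)  # yield so already-dispatched tools can progress
        yield event


async def run_tool(name, arguments):
    """Stand-in tool: echoes its input after yielding once."""
    await asyncio.sleep(0)
    return {"name": name, "args": arguments}


async def eager_dispatch(stream):
    buffers = {}       # block index -> (tool_call_id, tool name, argument chunks)
    tasks = {}         # tool_call_id -> running task
    sealed_order = []  # ids in the order their blocks sealed
    async for event in stream:
        kind = event["type"]
        if kind == "block_start":
            buffers[event["index"]] = (event["id"], event["name"], [])
        elif kind == "block_delta":
            buffers[event["index"]][2].append(event["json"])  # buffer the chunk
        elif kind == "block_stop":
            # Seal: the block is complete, so its arguments parse as valid JSON.
            call_id, name, chunks = buffers.pop(event["index"])
            arguments = json.loads("".join(chunks))
            # Dispatch now, while later blocks are still streaming.
            tasks[call_id] = asyncio.create_task(run_tool(name, arguments))
            sealed_order.append(call_id)
        elif kind == "message_stop":
            break
    results = await asyncio.gather(*tasks.values())
    # Key results by tool_call_id so they can be matched back to the message.
    return dict(zip(tasks.keys(), results)), sealed_order


results, sealed_order = asyncio.run(eager_dispatch(fake_stream()))
print(sealed_order)
print(results)
```

Keying the task table by the tool call id is one way to match each result back to the block that requested it once the assistant message is complete; whether this corresponds to the project's tool_call_id invariant is not something this sketch can confirm.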
- Fast tools (sub-50ms). Seal/dispatch overhead exceeds the latency saved.
- Sequentially dependent tools. If tool B needs tool A's result, the model won't emit B until A returns — no pipeline opportunity.
- Non-idempotent tools. Payments, destructive commands, outbound messages. Route these to the classic path via Tool.idempotent = False for blanket denial, or via a per-call gate callable for case-by-case decisions with parsed args visible (e.g. allow read_file but not under /etc/). See docs/hitl.md. The gate still gates the eager path; the underlying tool still runs at the framework's tool step for non-denied calls.
- Non-streaming backends. If your gateway buffers the full response, eager dispatch is impossible.
Long version with edge cases: docs/when-not-to-use.md.
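For the per-call gate mentioned in the list above, a gate along the lines of the read_file example might look like the sketch below. The signature (tool name plus parsed arguments, returning a bool) and the "path" argument key are assumptions made for illustration; the actual contract is documented in docs/hitl.md.

```python
from pathlib import PurePosixPath


def gate(tool_name: str, arguments: dict) -> bool:
    """Hypothetical gate: True allows eager dispatch, False defers the call.

    Only read_file is allowed eagerly, and never for paths under /etc.
    """
    if tool_name != "read_file":
        return False  # anything else waits for the classic path

    path = PurePosixPath(str(arguments.get("path", "")))
    if not path.is_absolute():
        return False  # refuse relative paths rather than guess a base directory
    if ".." in path.parts:
        return False  # refuse traversal that could escape into /etc
    # parts of "/etc/passwd" are ("/", "etc", "passwd")
    return path.parts[:2] != ("/", "etc")


print(gate("read_file", {"path": "/home/app/notes.txt"}))
print(gate("read_file", {"path": "/etc/passwd"}))
```

Defaulting to False for unknown tools keeps the gate conservative: a call that is not explicitly allowed simply loses the latency win rather than running early.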
Adapter PRs welcome — LlamaIndex, AutoGen, Vercel AI SDK, any provider that exposes a streaming response with per-block identifiers. Start from packages/eager-tools-core/ as the contract reference. See NEXT.md §3 for the extraction pattern.
Bug reports + design discussions happen in GitHub Discussions — issues are intentionally disabled to keep the signal-to-noise ratio high.
This pattern was extracted from production at CloudThinker, where it cuts median agent task latency by 50%. Internal codename: tool-call pipelining. External name: eager tool calling.
Read the full production story: Eager Tool Calling at CloudThinker.
MIT — see LICENSE.