py-spy / pprof, but for what's inside your prompt.
A diagnostic profiler for LLM agent context windows. It does not optimize or compress anything — it makes context waste visible and quantified so you can act on it.
In multi-turn agent loops, the full context is re-sent on every API call. A tool result added at turn 3 gets re-billed at turns 4, 5, 6 … Most of that is never read again.
Existing observability tools report a total token count — but never the composition or the waste. This invisible bloat drives three failures:
| Failure | Root cause |
|---|---|
| Cost | You pay repeatedly for dead weight sitting in context |
| Latency | Fatter context means slower first-token time on every call |
| Quality | "Context rot" — models degrade as the window fills with stale, irrelevant material |
ContextLens is the flamegraph for this: it decomposes the window, shows re-billing over turns, detects specific waste patterns, and prints the dollar cost of each with a concrete, one-line fix.
pip install contextlens-profilerOr from source:
git clone https://github.com/contextlens/contextlens
cd contextlens
pip install -e ".[dev]"Requirements: Python 3.11+. No API key required for analysis.
python examples/demo.pyThis simulates a 30-turn agent loop with canned data, prints a ranked waste report to the terminal, and writes examples/demo_report.html. Open that file in any browser to see the interactive D3 treemap.
# Terminal waste report
contextlens analyze trace.json
# Interactive HTML treemap report
contextlens report trace.json -o report.htmlExample terminal output:
+---------------------------------------------------------------------+
| ContextLens | Run demo-001 |
| Model: claude-3-5-sonnet-20241022 | Provider: anthropic | Turns: 30 |
+---------------------------------------------------------------------+
Context Composition by Region
---------------------------------------------------------------
Region Tokens Cost (USD) Share
assistant_message 11,490 $0.0345 ###....... 25.5%
tool_result 10,333 $0.0310 ##........ 22.9%
tool_schema 9,450 $0.0284 ##........ 21.0%
retrieved_content 5,805 $0.0174 #......... 12.9%
user_message 4,740 $0.0142 #......... 10.5%
system 3,240 $0.0097 #......... 7.2%
TOTAL 45,058 $0.1352
Re-billing: 45,058 tokens across 30 turns -> 43,185 (95.8%) re-billing waste ($0.1296)
Top Waste Findings
# Type Sev. Wasted Tokens Cost Fix (truncated)
1 [D] duplicate medium 7,084 $0.0213 Cache or externalize this content...
2 [R] redundant_ret medium 5,805 $0.0174 Use a re-ranker or tighter threshold...
3 [U] unused_schema low 3,150 $0.0095 Remove 'send_email' or inject dynamically...
...
import contextlens as cl
report = cl.analyze_file("trace.json")
print(f"Billed: {report.total_tokens_billed:,} tokens (${report.total_cost_usd:.4f})")
print(f"Recoverable: {report.recoverable_tokens:,} tokens (${report.recoverable_cost_usd:.4f})")
for finding in report.findings_by_severity():
print(f"[{finding.severity.upper():6}] {finding.kind.value:20} "
f"{finding.wasted_tokens:>7,} tok ${finding.wasted_cost_usd:.4f}")
print(f" Fix: {finding.fix}")
# Write the interactive HTML treemap
html = cl.render_html_report(report)
with open("report.html", "w") as f:
f.write(html)import anthropic
import contextlens as cl
client = anthropic.Anthropic()
with cl.capture_anthropic(client, model="claude-3-5-sonnet-20241022") as collector:
for turn in range(20):
client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system="You are a helpful assistant.",
messages=build_messages(turn), # your agent loop
)
trace = collector.build_trace()
collector.save("trace.json") # optional: persist for later
report = cl.analyze_trace(trace)import openai
import contextlens as cl
client = openai.OpenAI()
with cl.capture_openai(client, model="gpt-4o") as collector:
for turn in range(20):
client.chat.completions.create(
model="gpt-4o",
messages=build_messages(turn),
)
trace = collector.build_trace()
report = cl.analyze_trace(trace)import contextlens as cl
# If you already log your LLM requests as JSON dicts:
trace = cl.load_trace("my_existing_trace.json")
report = cl.analyze_trace(trace) +-----------------+ +-----------------+
| Your Agent Loop | | Saved JSON Trace|
| (Anthropic / | | (existing logs)|
| OpenAI SDK) | +-----------------+
+--------+--------+ |
| |
capture_anthropic() load_trace()
capture_openai() |
| |
v v
+--------------------+ +--------------------+
| TraceCollector |-->| Trace |
| (monkey-patches | | run_id, model, |
| SDK client) | | provider, |
+--------------------+ | List[TurnSnapshot] |
+--------+-----------+
|
+----------v----------+
| DECOMPOSE |
| decompose.py |
| |
| Classifies every |
| content block into |
| a Region: |
| - SYSTEM |
| - TOOL_SCHEMA |
| - TOOL_RESULT |
| - USER_MESSAGE |
| - ASSISTANT_MSG |
| - RETRIEVED_CONTENT|
| |
| Uses SHA-256 hash |
| for cross-turn |
| block identity |
+----------+----------+
|
+---------------------+---------------------+
| |
+----------v----------+ +----------------v-----------+
| RE-BILLING | | WASTE DETECTORS |
| rebilling.py | | detectors.py |
| | | |
| Groups blocks by | | 1. DUPLICATE |
| content_hash across | | Same block re-sent |
| all turns. | | verbatim N turns |
| | | |
| Per block: | | 2. NEAR_DUPLICATE |
| - turns_present | | Jaccard > 0.85 between |
| - cumul_tokens | | distinct blocks |
| - cumul_cost_usd | | |
| | | 3. STALE_TOOL_RESULT |
| Recoverable waste = | | Tool output never |
| token*(turns-1) | | referenced in later |
| for every block | | assistant message |
| seen > 1 turn | | |
+----------+----------+ | 4. UNUSED_TOOL_SCHEMA |
| | Tool defined every turn |
| | but never called |
| | |
| | 5. REDUNDANT_RETRIEVAL |
| | Chunk overlap < 15% |
| | with model output |
| +----------------+-----------+
| |
+------------------+-----------------------+
|
+----------v----------+
| analyzer.py |
| |
| Builds Report: |
| - region_summaries |
| - rebilling_entries|
| - findings |
| - total costs |
| - recoverable $$ |
+----------+----------+
|
+------------------+------------------+
| | |
+----------v-----+ +---------v------+ +--------v-------+
| CLI | | HTML Report | | Python Report |
| cli.py | | reporter.py | | Object (API) |
| | | | | |
| contextlens | | Single .html | | report.findings|
| analyze | | file, no server| | report.rebilling|
| | | D3 treemap + | | render_html() |
| contextlens | | stacked area | | |
| report -o x | | + findings tbl | | |
+----------------+ +----------------+ +----------------+
| Module | Responsibility |
|---|---|
models.py |
All dataclasses: Trace, TurnSnapshot, ContentBlock, Region, WasteKind, Finding, Report |
costs.py |
Pricing table for Anthropic + OpenAI models; CostModel with per-million USD rates |
tokenizer.py |
tiktoken for OpenAI (exact); char/4 heuristic for Anthropic (labeled approximation) |
capture.py |
Live SDK interception via context managers; load_trace() for offline JSON |
decompose.py |
Classifies raw request payloads into ContentBlock lists per turn; handles both Anthropic and OpenAI message schemas |
rebilling.py |
Groups blocks by content hash, computes cumulative re-billing cost, calculates recoverable waste |
detectors.py |
Four waste heuristics → Finding objects with severity, token count, USD cost, fix |
analyzer.py |
Orchestration: decompose → rebilling → detectors → Report |
cli.py |
Click CLI with Rich terminal output |
reporter.py |
Self-contained HTML report with inlined D3 v7 treemap + timeline |
raw_request dict (messages, system, tools)
|
| decompose_snapshot()
v
List[ContentBlock]
block_id : uuid12
region : Region.TOOL_RESULT
content : "{'status': 'ok', ...}"
token_count: 142
content_hash: "sha256:abc..."
tool_call_id: "tu_007"
|
| grouped across turns by content_hash
v
RebillingEntry
turns_present: 18
cumulative_tokens: 142 * 18 = 2556
cumulative_cost_usd: $0.0077
|
| compared against later assistant blocks
v
Finding (STALE_TOOL_RESULT, severity=medium)
wasted_tokens: 2556
wasted_cost_usd: $0.0077
fix: "Summarize immediately, drop raw result from context"
| Rule | Region assigned |
|---|---|
system parameter |
SYSTEM |
Items in tools[] / functions[] |
TOOL_SCHEMA |
Block with type: tool_result / role tool |
TOOL_RESULT |
Block with type: tool_use / has tool_calls |
ASSISTANT_MESSAGE |
Role user, text matches retrieval heuristic |
RETRIEVED_CONTENT |
Role user, plain text |
USER_MESSAGE |
Role assistant, no tool calls |
ASSISTANT_MESSAGE |
The retrieval heuristic fires when text is > 200 chars and the first 500 chars contain a marker such as retrieved:, chunk:, source:, document:, passage:, excerpt:.
Exact content match (SHA-256) across > 1 turn. Wasted tokens = token_count × (turns - 1).
Fix: Cache this content in the system prompt, use KV-cache-friendly structure, or send a compressed summary after the first use.
Pairs of unique blocks with Jaccard word-4-gram similarity > 0.85. Only checks blocks ≥ 50 tokens.
Fix: Consolidate into a single template with variable slots.
Tool result block whose keyword set has < 2 words in common with any assistant message from the same turn onward.
Fix: Immediately after the tool call, have the assistant emit a short summary, then drop the raw result from context on the next turn.
Tool defined in every turn's tools[] / functions[] array but with zero calls recorded across the entire trace.
Fix: Remove the schema, or inject it only when the agent enters the sub-flow that needs it.
Retrieval chunk classified as RETRIEVED_CONTENT whose keyword overlap with all subsequent assistant messages is < 15%, for chunks > 100 tokens.
Fix: Apply a re-ranker or raise the similarity score threshold before injecting chunks.
Default prices (USD, mid-2025) are in costs.py. Override globally or per-analysis:
from contextlens import CostModel, ModelPricing, analyze_trace
cm = CostModel(overrides={
"my-internal-model": ModelPricing(input_per_million=0.50, output_per_million=1.50),
})
report = analyze_trace(trace, cost_model=cm)| Provider | Method | Notes |
|---|---|---|
| OpenAI / GPT | tiktoken (exact) |
Uses encoding_for_model; falls back to cl100k_base |
| Anthropic / Claude | len(text) // 4 |
Labeled approximation; ±10–15% typical error |
| Unknown | len(text) // 4 |
Same fallback |
- Anthropic prompt caching awareness (cache breakpoint markers → deduct cached token cost)
- Per-turn diff view: what changed between turn N and N+1 highlighted in the HTML
- LangChain / LlamaIndex trace adapters
- Token budget watch mode:
contextlens watch --max-tokens 50000 --alert-pct 80 - Gemini / Vertex AI provider support
- Export findings to OpenTelemetry spans / OTLP
- VS Code extension (annotate source with per-call cost)
git clone https://github.com/contextlens/contextlens
cd contextlens
pip install -e ".[dev]"
# Develop
ruff check src/ tests/ # lint
ruff format src/ tests/ # format
mypy src/contextlens/ # type check
pytest # tests
python examples/demo.py # end-to-end smoke testHard constraint: ContextLens diagnoses — it does not compress, modify, or optimize prompts. Please keep that scope discipline in PRs.
Open issues, report bugs, or discuss the roadmap at GitHub Issues.
MIT — Copyright (c) 2024 ContextLens Contributors