Description
gemini-3-flash-preview seems to have a reproducible implicit caching dead zone where cached_content_token_count drops to 0 when the prompt size is between
~9K and ~17K tokens, despite a stable, byte-identical prefix growing via multi-turn conversation history.
Caching works correctly below ~8K (growing in ~2,048-token block increments) and reappears above ~18K, but the mid-range consistently reports zero cached tokens.
Additionally, once caching resumes above 18K, the cached amount locks to a plateau and only jumps in large ~8K-token steps (~8,192 ≈ 4×2048), staying fixed for many turns before the next jump:
| Plateau | Cached tokens | ≈ Blocks (×2048) | Holds across prompt range |
|---|---|---|---|
| 1 | ~16,391 | 8 | 18K → 25K (8 turns) |
| 2 | ~24,598 | 12 | 26K → 33K (8 turns) |
| 3 | ~32,806 | 16 | 34K+ |
Reproduction
Single-file repro using only google-genai:

```
pip install google-genai
export GEMINI_API_KEY=<key>
python gemini_cache_repro.py
```
What the script does:
- Fixed ~405-char system instruction, identical across all 35 turns
- Multi-turn conversation history grows each turn (prior user/model messages are prepended)
- Each turn pads the user message to hit a prompt-token target (1K → 35K in 1K steps)
- Model is asked to reply with a single word ("OK") to minimize output noise
- 2-second delay between calls for cache propagation
gemini_cache_repro.py:

```python
import argparse
import os
import time
from pathlib import Path

from google import genai
from google.genai import types

MODEL = "gemini-3-flash-preview"

SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions about world geography, "
    "capital cities, population statistics, and general knowledge. You provide "
    "accurate, factual information. When asked to reply with one word, comply "
    "exactly. You have deep knowledge of countries, capitals, major cities, "
    "rivers, mountain ranges, climate zones, and economic indicators. Always be "
    "concise and precise in your responses."
)

# Adaptive padding constants (empirical for this model)
TOKENS_PER_CHAR = 0.19
BASELINE_OVERHEAD = 240
TURN_OVERHEAD = 26
RESPONSE_TOKENS = 8


def load_env():
    """Load GEMINI_API_KEY / GOOGLE_API_KEY from a local .env file if present."""
    for p in [Path.cwd() / ".env", Path(__file__).resolve().parent / ".env"]:
        if p.exists():
            for line in p.read_text().splitlines():
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    k, v = line.split("=", 1)
                    k, v = k.strip(), v.strip().strip("\"'")
                    if k and k not in os.environ:
                        os.environ[k] = v
            break


def make_padding(turn, n_chars):
    """Deterministic filler text used to pad the user message to a size target."""
    base = (
        f"[Turn {turn}] Geography reference: countries, capitals, populations, "
        f"area, GDP, climate, languages, currency, time zones, coordinates. "
    )
    if n_chars <= 0:
        return ""
    return (base * (n_chars // len(base) + 1))[:n_chars]


def make_question(turn, padding_chars=0):
    q = f"Question {turn}: what is the capital of country number {turn}?"
    if padding_chars <= 0:
        return f"{q} Reply with one word: OK"
    return f"{q}\n\nReference:\n{make_padding(turn, padding_chars)}\n\nReply with one word: OK"


def extract_usage(resp):
    """Return (prompt_tokens, candidates_tokens, cached_tokens) from usage_metadata."""
    u = getattr(resp, "usage_metadata", None)
    if u is None:
        return 0, 0, 0
    d = u.model_dump() if hasattr(u, "model_dump") else (
        u.to_dict() if hasattr(u, "to_dict") else u.__dict__
    )

    def gi(keys):
        for k in keys:
            v = d.get(k)
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                return int(v)
        return 0

    return (
        gi(["prompt_token_count", "promptTokenCount"]),
        gi(["candidates_token_count", "candidatesTokenCount"]),
        gi(["cached_content_token_count", "cachedContentTokenCount"]),
    )


def main():
    ap = argparse.ArgumentParser(
        description="Reproduce Gemini implicit caching dead zone (~9K-17K)")
    ap.add_argument("--api_key", default=None)
    ap.add_argument("--start", type=int, default=1000)
    ap.add_argument("--stop", type=int, default=35000)
    ap.add_argument("--step", type=int, default=1000)
    ap.add_argument("--sleep", type=float, default=2.0)
    ap.add_argument("--max_tokens", type=int, default=16)
    args = ap.parse_args()

    load_env()
    api_key = args.api_key or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
    if not api_key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api_key")

    targets = list(range(args.start, args.stop + 1, args.step))
    client = genai.Client(api_key=api_key)

    print(f"Model: {MODEL}")
    print(f"System prompt: {len(SYSTEM_PROMPT)} chars (fixed across all turns)")
    print(f"Sweep: {args.start:,} → {args.stop:,} prompt tokens, step={args.step:,}")
    print(f"Turns: {len(targets)}, sleep: {args.sleep}s between calls")
    print(f"google-genai version: {genai.__version__}")
    print()
    print(f"{'Turn':>4} | {'Target':>6} | {'Prompt':>7} | {'Cached':>7} | "
          f"{'New':>7} | {'Cache%':>6} | {'Time':>6} | Notes")
    print("-" * 95)

    pad_plan = [0] * len(targets)
    pad_plan[0] = max(0, int((targets[0] - BASELINE_OVERHEAD) / TOKENS_PER_CHAR))

    history = []
    tpc, exp_resp = TOKENS_PER_CHAR, RESPONSE_TOKENS
    prev_prompt, prev_hchars = None, 0
    prev_qlen = len(make_question(1, pad_plan[0]))
    prev_cached = 0

    for i, target in enumerate(targets):
        turn = i + 1
        question = make_question(turn, pad_plan[i])

        # Build multi-turn contents: full prior history, then the new user turn
        contents = []
        for h in history:
            contents.append(types.UserContent(
                parts=[types.Part.from_text(text=h["q"])]))
            contents.append(types.ModelContent(
                parts=[types.Part.from_text(text=h["a"])]))
        contents.append(types.UserContent(
            parts=[types.Part.from_text(text=question)]))

        t0 = time.time()
        resp = client.models.generate_content(
            model=MODEL,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                temperature=0,
                max_output_tokens=args.max_tokens,
            ),
        )
        elapsed = time.time() - t0

        prompt, compl, cached = extract_usage(resp)
        ratio = cached / max(prompt, 1)

        note = ""
        if cached > prev_cached:
            note = f"+{cached - prev_cached:,} cached (≈{cached // 2048} blk)"
        elif cached == 0 and prev_cached > 0:
            note = "← DROPPED TO 0"
        prev_cached = cached

        print(f"{turn:>4} | {target:>6,} | {prompt:>7,} | {cached:>7,} | "
              f"{prompt - cached:>7,} | {ratio:>5.1%} | {elapsed:>5.2f}s | {note}")

        txt = getattr(resp, "text", "") or ""
        if not txt and hasattr(resp, "candidates") and resp.candidates:
            try:
                txt = resp.candidates[0].content.parts[0].text
            except Exception:
                txt = ""
        history.append({"q": question, "a": str(txt)})

        # Adaptive padding calibration: refine the tokens-per-char estimate and
        # plan how much padding the next turn needs to hit its prompt-token target
        hchars = sum(len(h["q"]) + len(h["a"]) for h in history)
        if prev_prompt is not None:
            dt = prompt - prev_prompt
            dc = (hchars - prev_hchars) + (len(question) - prev_qlen)
            if dt > 0 and dc > 0:
                tpc = min(0.6, max(0.05, 0.75 * tpc + 0.25 * (dt / dc)))
        if compl > 0:
            exp_resp = 0.7 * exp_resp + 0.3 * compl
        if i + 1 < len(targets):
            base_q = len(make_question(turn + 1, 0))
            needed = targets[i + 1] - prompt - exp_resp - base_q * tpc - TURN_OVERHEAD
            pad_plan[i + 1] = max(0, min(int(needed / tpc), 50000))
        prev_prompt, prev_hchars, prev_qlen = prompt, hchars, len(question)

        time.sleep(args.sleep)


if __name__ == "__main__":
    main()
```

Output

```
Model: gemini-3-flash-preview
System prompt: 405 chars (fixed across all turns)
Sweep: 1,000 → 35,000 prompt tokens, step=1,000
Turns: 35, sleep: 2.0s between calls
google-genai version: 1.59.0
Turn | Target | Prompt | Cached | New | Cache% | Time | Notes
-----------------------------------------------------------------------------------------------
1 | 1,000 | 994 | 0 | 994 | 0.0% | 1.73s |
2 | 2,000 | 2,146 | 0 | 2,146 | 0.0% | 1.00s |
3 | 3,000 | 3,124 | 0 | 3,124 | 0.0% | 1.28s |
4 | 4,000 | 4,028 | 2,037 | 1,991 | 50.6% | 1.33s | +2,037 cached (≈1 blk)
5 | 5,000 | 4,994 | 2,040 | 2,954 | 40.8% | 1.49s | +3 cached (≈1 blk)
6 | 6,000 | 6,003 | 4,085 | 1,918 | 68.0% | 1.77s | +2,045 cached (≈2 blk)
7 | 7,000 | 7,006 | 4,087 | 2,919 | 58.3% | 1.68s | +2 cached (≈3 blk)
8 | 8,000 | 7,996 | 6,134 | 1,862 | 76.7% | 1.46s | +2,047 cached (≈3 blk)
9 | 9,000 | 8,986 | 0 | 8,986 | 0.0% | 1.59s | ← DROPPED TO 0
10 | 10,000 | 10,006 | 0 | 10,006 | 0.0% | 1.37s |
11 | 11,000 | 10,999 | 0 | 10,999 | 0.0% | 1.47s |
12 | 12,000 | 11,982 | 0 | 11,982 | 0.0% | 1.53s |
13 | 13,000 | 12,979 | 0 | 12,979 | 0.0% | 1.63s |
14 | 14,000 | 13,979 | 0 | 13,979 | 0.0% | 1.51s |
15 | 15,000 | 14,978 | 0 | 14,978 | 0.0% | 1.40s |
16 | 16,000 | 15,975 | 0 | 15,975 | 0.0% | 1.81s |
17 | 17,000 | 16,973 | 0 | 16,973 | 0.0% | 1.93s |
18 | 18,000 | 17,972 | 16,391 | 1,581 | 91.2% | 1.34s | +16,391 cached (≈8 blk)
19 | 19,000 | 18,972 | 16,392 | 2,580 | 86.4% | 1.62s | +1 cached (≈8 blk)
20 | 20,000 | 19,971 | 16,393 | 3,578 | 82.1% | 1.80s | +1 cached (≈8 blk)
21 | 21,000 | 20,971 | 16,394 | 4,577 | 78.2% | 1.58s | +1 cached (≈8 blk)
22 | 22,000 | 21,970 | 16,395 | 5,575 | 74.6% | 2.11s | +1 cached (≈8 blk)
23 | 23,000 | 22,970 | 16,396 | 6,574 | 71.4% | 1.41s | +1 cached (≈8 blk)
24 | 24,000 | 23,970 | 16,397 | 7,573 | 68.4% | 1.98s | +1 cached (≈8 blk)
25 | 25,000 | 24,970 | 16,398 | 8,572 | 65.7% | 1.74s | +1 cached (≈8 blk)
26 | 26,000 | 25,970 | 24,598 | 1,372 | 94.7% | 1.41s | +8,200 cached (≈12 blk)
27 | 27,000 | 26,969 | 24,599 | 2,370 | 91.2% | 1.49s | +1 cached (≈12 blk)
28 | 28,000 | 27,968 | 24,600 | 3,368 | 88.0% | 1.78s | +1 cached (≈12 blk)
29 | 29,000 | 28,969 | 24,601 | 4,368 | 84.9% | 2.72s | +1 cached (≈12 blk)
30 | 30,000 | 29,970 | 24,602 | 5,368 | 82.1% | 1.80s | +1 cached (≈12 blk)
31 | 31,000 | 30,970 | 24,603 | 6,367 | 79.4% | 2.13s | +1 cached (≈12 blk)
32 | 32,000 | 31,969 | 24,603 | 7,366 | 77.0% | 2.11s |
33 | 33,000 | 32,968 | 24,604 | 8,364 | 74.6% | 2.03s | +1 cached (≈12 blk)
34 | 34,000 | 33,969 | 32,806 | 1,163 | 96.6% | 1.71s | +8,202 cached (≈16 blk)
35 | 35,000 | 34,970 | 32,807 | 2,163 | 93.8% | 2.20s | +1 cached (≈16 blk)
```

Analysis
The data shows four distinct behaviors:
1. Small context ~2K block growth (turns 4–8)
Cache grows in ~2,048-token increments (the documentation suggests 1,024-token granularity, though):
- Turn 4: +2,037 cached → ~2K
- Turn 6: +2,045 → ~4K
- Turn 8: +2,047 → ~6K
2. Dead zone — zero cached (turns 9–17)
At turn 9 (~9K prompt tokens), cached_content_token_count drops to 0 and stays there for 9 consecutive turns, through ~17K prompt tokens. The prefix remains fully stable throughout: new context is only appended at the end, so every earlier turn is byte-identical to the previous request.
3. Cache plateau behavior (turns 18+)
When caching resumes, it doesn't grow incrementally. Instead it locks to a fixed plateau and only jumps in large ~8,192-token steps (4 × 2048):
| Prompt range | Cached (locked) | ≈ Blocks | Duration |
|---|---|---|---|
| 18K – 25K | ~16,391 | 8 × 2048 | 8 turns |
| 26K – 33K | ~24,598 | 12 × 2048 | 8 turns |
| 34K+ | ~32,806 | 16 × 2048 | ongoing |
The cached amount stays essentially constant within each plateau (varying by only 1–7 tokens), then makes a single ~8K jump to the next level.
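As a quick sanity check on the block arithmetic (assuming the apparent 2,048-token block size, which is inferred from the data rather than documented), the plateau values from the run above line up like this:

```python
# Rough arithmetic on the observed cached_content_token_count plateau values,
# assuming a 2,048-token cache block (an inference from the data, not documented).
BLOCK = 2048
plateaus = [16391, 24598, 32806]  # cached tokens reported on turns 18, 26, 34

for prev, cur in zip([0] + plateaus, plateaus):
    print(f"{cur:>6} tokens ≈ {cur / BLOCK:.2f} blocks, "
          f"jump of {cur - prev:,} ≈ {(cur - prev) / BLOCK:.2f} blocks")
#  16391 tokens ≈ 8.00 blocks, jump of 16,391 ≈ 8.00 blocks
#  24598 tokens ≈ 12.01 blocks, jump of 8,207 ≈ 4.01 blocks
#  32806 tokens ≈ 16.02 blocks, jump of 8,208 ≈ 4.01 blocks
```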
4. Block size regime change
- Below 8K: cache grows in 1-block (~2,048-token) steps
- Above 18K: cache grows in 4-block (~8,192-token) steps
- 9K–17K: no caching at all
Expected behavior
cached_content_token_count should increase monotonically (or at minimum remain stable) as the prompt grows, since the prefix is byte-identical across turns. Specifically:
- There should be no dead zone where caching drops to 0 mid-conversation
- The transition from small-block to large-block caching should not involve losing all cached state
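To make the expected invariant concrete, here is a minimal sketch of a post-run check over the per-turn numbers the repro script already prints; check_monotonic_cache and the (turn, prompt, cached) row format are hypothetical helpers, not part of the script above:

```python
# Hypothetical post-processing of the per-turn (turn, prompt, cached) tuples
# collected from usage_metadata; flags any turn where the cached prefix shrinks.
def check_monotonic_cache(rows):
    """rows: list of (turn, prompt_tokens, cached_tokens) tuples."""
    violations = []
    prev_cached = 0
    for turn, prompt, cached in rows:
        if cached < prev_cached:
            violations.append(
                f"turn {turn}: cached dropped {prev_cached:,} -> {cached:,} "
                f"(prompt {prompt:,})"
            )
        prev_cached = max(prev_cached, cached)
    return violations

# On the run above, this flags turns 9-17, where cached drops from 6,134 to 0.
```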
Questions
- Why is there a ~9K–17K dead zone in the caching system?
- Why does the block granularity change from ~2K steps (below 8K) to ~8K steps (above 18K)?
- Is there any way to avoid the dead zone for applications whose prompts naturally fall in the 9K–17K range (for example, by pinning the prefix with explicit caching, as sketched below)?
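On the last question, one possible mitigation is to pin the stable prefix with an explicit cache via client.caches.create and pass cached_content at generation time, instead of relying on implicit caching. This is only a sketch: MODEL, SYSTEM_PROMPT, and question refer to the repro script above, stable_history is a placeholder for the prior turns, and whether explicit caching is available for gemini-3-flash-preview (and worth the TTL/storage cost) is untested here.

```python
# Sketch only: pin the stable conversation prefix with an explicit cache so the
# cache hit does not depend on implicit-cache block heuristics. `stable_history`
# is a placeholder for the byte-identical prior turns; explicit caches have a
# model-dependent minimum size and a TTL cost.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

cache = client.caches.create(
    model=MODEL,
    config=types.CreateCachedContentConfig(
        system_instruction=SYSTEM_PROMPT,
        contents=stable_history,  # prior user/model Content objects (placeholder)
        ttl="300s",
    ),
)

resp = client.models.generate_content(
    model=MODEL,
    contents=[types.UserContent(parts=[types.Part.from_text(text=question)])],
    config=types.GenerateContentConfig(
        cached_content=cache.name,  # system prompt + history come from the cache
        temperature=0,
    ),
)
```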
Environment
- Model: gemini-3-flash-preview
- SDK: google-genai==1.59.0
- API: Google AI Studio
- Python: 3.12
- OS: Windows 11 / WSL2