
gemini-3-flash-preview implicit caching dead zone: cached_content_token_count drops to 0 between ~9K-17K prompt tokens #2064

@port2077

Description

gemini-3-flash-preview appears to have a reproducible implicit caching dead zone: cached_content_token_count drops to 0 whenever the prompt size is between ~9K and ~17K tokens, even though the prompt prefix stays byte-identical as the multi-turn conversation history grows.

Caching works correctly below ~8K (growing in ~2,048-token block increments) and reappears above ~18K, but the mid-range consistently reports zero cached tokens.

Additionally, once caching resumes above 18K, the cached amount locks to a plateau and only jumps in large ~8K-token steps (~8,192 ≈ 4×2048), staying fixed for many turns before the next jump:

Plateau | Cached tokens | ≈ Blocks (×2048) | Holds across prompt range
      1 |       ~16,391 |                8 | 18K → 25K (8 turns)
      2 |       ~24,598 |               12 | 26K → 33K (8 turns)
      3 |       ~32,806 |               16 | 34K+
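
For reference, the metric in question is read directly from usage_metadata on each response. A minimal single-call sketch, assuming GEMINI_API_KEY is set (cached_content_token_count may be None when nothing is cached):

import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
resp = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Reply with one word: OK",
)
u = resp.usage_metadata
# The three usage fields referenced throughout this report.
print(u.prompt_token_count, u.candidates_token_count, u.cached_content_token_count or 0)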

Reproduction

Single-file repro using only google-genai:

pip install google-genai
export GEMINI_API_KEY=<key>
python gemini_cache_repro.py

What the script does:

  • Fixed ~405-char system instruction, identical across all 35 turns
  • Multi-turn conversation history grows each turn (prior user/model messages are prepended)
  • Each turn pads the user message to hit a prompt-token target (1K → 35K in 1K steps)
  • Model is asked to reply with a single word ("OK") to minimize output noise
  • 2-second delay between calls for cache propagation
gemini_cache_repro.py:
import argparse
import os
import time
from pathlib import Path

from google import genai
from google.genai import types

MODEL = "gemini-3-flash-preview"

SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions about world geography, "
    "capital cities, population statistics, and general knowledge. You provide "
    "accurate, factual information. When asked to reply with one word, comply "
    "exactly. You have deep knowledge of countries, capitals, major cities, "
    "rivers, mountain ranges, climate zones, and economic indicators. Always be "
    "concise and precise in your responses."
)

# Adaptive padding constants (empirical for this model)
TOKENS_PER_CHAR = 0.19
BASELINE_OVERHEAD = 240
TURN_OVERHEAD = 26
RESPONSE_TOKENS = 8


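# Load KEY=VALUE pairs from a nearby .env file without overriding variables already in the environment.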
def load_env():
    for p in [Path.cwd() / ".env", Path(__file__).resolve().parent / ".env"]:
        if p.exists():
            for line in p.read_text().splitlines():
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    k, v = line.split("=", 1)
                    k, v = k.strip(), v.strip().strip("\"'")
                    if k and k not in os.environ:
                        os.environ[k] = v
            break


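# Deterministic filler text used to pad the user message toward a prompt-token target.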
def make_padding(turn, n_chars):
    base = (
        f"[Turn {turn}] Geography reference: countries, capitals, populations, "
        f"area, GDP, climate, languages, currency, time zones, coordinates. "
    )
    if n_chars <= 0:
        return ""
    return (base * (n_chars // len(base) + 1))[:n_chars]


def make_question(turn, padding_chars=0):
    q = f"Question {turn}: what is the capital of country number {turn}?"
    if padding_chars <= 0:
        return f"{q} Reply with one word: OK"
    return f"{q}\n\nReference:\n{make_padding(turn, padding_chars)}\n\nReply with one word: OK"


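# Read prompt / candidates / cached token counts from usage_metadata, tolerating snake_case and camelCase field names.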
def extract_usage(resp):
    u = getattr(resp, "usage_metadata", None)
    if u is None:
        return 0, 0, 0
    d = u.model_dump() if hasattr(u, "model_dump") else (
        u.to_dict() if hasattr(u, "to_dict") else u.__dict__
    )
    def gi(keys):
        for k in keys:
            v = d.get(k)
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                return int(v)
        return 0
    return (
        gi(["prompt_token_count", "promptTokenCount"]),
        gi(["candidates_token_count", "candidatesTokenCount"]),
        gi(["cached_content_token_count", "cachedContentTokenCount"]),
    )


def main():
    ap = argparse.ArgumentParser(
        description="Reproduce Gemini implicit caching dead zone (~9K-17K)")
    ap.add_argument("--api_key", default=None)
    ap.add_argument("--start", type=int, default=1000)
    ap.add_argument("--stop", type=int, default=35000)
    ap.add_argument("--step", type=int, default=1000)
    ap.add_argument("--sleep", type=float, default=2.0)
    ap.add_argument("--max_tokens", type=int, default=16)
    args = ap.parse_args()

    load_env()
    api_key = args.api_key or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
    if not api_key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api_key")

    targets = list(range(args.start, args.stop + 1, args.step))
    client = genai.Client(api_key=api_key)

    print(f"Model: {MODEL}")
    print(f"System prompt: {len(SYSTEM_PROMPT)} chars (fixed across all turns)")
    print(f"Sweep: {args.start:,}{args.stop:,} prompt tokens, step={args.step:,}")
    print(f"Turns: {len(targets)}, sleep: {args.sleep}s between calls")
    print(f"google-genai version: {genai.__version__}")
    print()
    print(f"{'Turn':>4} | {'Target':>6} | {'Prompt':>7} | {'Cached':>7} | "
          f"{'New':>7} | {'Cache%':>6} | {'Time':>6} | Notes")
    print("-" * 95)

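    # Seed turn 1's padding from the rough chars-per-token heuristic; later turns are recalibrated after each response.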
    pad_plan = [0] * len(targets)
    pad_plan[0] = max(0, int((targets[0] - BASELINE_OVERHEAD) / TOKENS_PER_CHAR))

    history = []
    tpc, exp_resp = TOKENS_PER_CHAR, RESPONSE_TOKENS
    prev_prompt, prev_hchars = None, 0
    prev_qlen = len(make_question(1, pad_plan[0]))
    prev_cached = 0

    for i, target in enumerate(targets):
        turn = i + 1
        question = make_question(turn, pad_plan[i])

        # Build multi-turn contents
        contents = []
        for h in history:
            contents.append(types.UserContent(
                parts=[types.Part.from_text(text=h["q"])]))
            contents.append(types.ModelContent(
                parts=[types.Part.from_text(text=h["a"])]))
        contents.append(types.UserContent(
            parts=[types.Part.from_text(text=question)]))

        t0 = time.time()
        resp = client.models.generate_content(
            model=MODEL,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                temperature=0,
                max_output_tokens=args.max_tokens,
            ),
        )
        elapsed = time.time() - t0

        prompt, compl, cached = extract_usage(resp)
        ratio = cached / max(prompt, 1)

        note = ""
        if cached > prev_cached:
            note = f"+{cached - prev_cached:,} cached (≈{cached // 2048} blk)"
        elif cached == 0 and prev_cached > 0:
            note = "← DROPPED TO 0"
        prev_cached = cached

        print(f"{turn:>4} | {target:>6,} | {prompt:>7,} | {cached:>7,} | "
              f"{prompt - cached:>7,} | {ratio:>5.1%} | {elapsed:>5.2f}s | {note}")

        txt = getattr(resp, "text", "") or ""
        if not txt and hasattr(resp, "candidates") and resp.candidates:
            try:
                txt = resp.candidates[0].content.parts[0].text
            except Exception:
                txt = ""
        history.append({"q": question, "a": str(txt)})

        # Adaptive padding calibration
        hchars = sum(len(h["q"]) + len(h["a"]) for h in history)
        if prev_prompt is not None:
            dt = prompt - prev_prompt
            dc = (hchars - prev_hchars) + (len(question) - prev_qlen)
            if dt > 0 and dc > 0:
                tpc = min(0.6, max(0.05, 0.75 * tpc + 0.25 * (dt / dc)))
        if compl > 0:
            exp_resp = 0.7 * exp_resp + 0.3 * compl
        if i + 1 < len(targets):
            base_q = len(make_question(turn + 1, 0))
            needed = targets[i + 1] - prompt - exp_resp - base_q * tpc - TURN_OVERHEAD
            pad_plan[i + 1] = max(0, min(int(needed / tpc), 50000))

        prev_prompt, prev_hchars, prev_qlen = prompt, hchars, len(question)
        time.sleep(args.sleep)


if __name__ == "__main__":
    main()

Output

Model: gemini-3-flash-preview
System prompt: 405 chars (fixed across all turns)
Sweep: 1,000 → 35,000 prompt tokens, step=1,000
Turns: 35, sleep: 2.0s between calls
google-genai version: 1.59.0

Turn | Target |  Prompt |  Cached |     New | Cache% |   Time | Notes
-----------------------------------------------------------------------------------------------
   1 |  1,000 |     994 |       0 |     994 |  0.0% |  1.73s |
   2 |  2,000 |   2,146 |       0 |   2,146 |  0.0% |  1.00s |
   3 |  3,000 |   3,124 |       0 |   3,124 |  0.0% |  1.28s |
   4 |  4,000 |   4,028 |   2,037 |   1,991 | 50.6% |  1.33s | +2,037 cached (≈1 blk)
   5 |  5,000 |   4,994 |   2,040 |   2,954 | 40.8% |  1.49s | +3 cached (≈1 blk)
   6 |  6,000 |   6,003 |   4,085 |   1,918 | 68.0% |  1.77s | +2,045 cached (≈2 blk)
   7 |  7,000 |   7,006 |   4,087 |   2,919 | 58.3% |  1.68s | +2 cached (≈3 blk)
   8 |  8,000 |   7,996 |   6,134 |   1,862 | 76.7% |  1.46s | +2,047 cached (≈3 blk)
   9 |  9,000 |   8,986 |       0 |   8,986 |  0.0% |  1.59s | ← DROPPED TO 0
  10 | 10,000 |  10,006 |       0 |  10,006 |  0.0% |  1.37s |
  11 | 11,000 |  10,999 |       0 |  10,999 |  0.0% |  1.47s |
  12 | 12,000 |  11,982 |       0 |  11,982 |  0.0% |  1.53s |
  13 | 13,000 |  12,979 |       0 |  12,979 |  0.0% |  1.63s |
  14 | 14,000 |  13,979 |       0 |  13,979 |  0.0% |  1.51s |
  15 | 15,000 |  14,978 |       0 |  14,978 |  0.0% |  1.40s |
  16 | 16,000 |  15,975 |       0 |  15,975 |  0.0% |  1.81s |
  17 | 17,000 |  16,973 |       0 |  16,973 |  0.0% |  1.93s |
  18 | 18,000 |  17,972 |  16,391 |   1,581 | 91.2% |  1.34s | +16,391 cached (≈8 blk)
  19 | 19,000 |  18,972 |  16,392 |   2,580 | 86.4% |  1.62s | +1 cached (≈8 blk)
  20 | 20,000 |  19,971 |  16,393 |   3,578 | 82.1% |  1.80s | +1 cached (≈8 blk)
  21 | 21,000 |  20,971 |  16,394 |   4,577 | 78.2% |  1.58s | +1 cached (≈8 blk)
  22 | 22,000 |  21,970 |  16,395 |   5,575 | 74.6% |  2.11s | +1 cached (≈8 blk)
  23 | 23,000 |  22,970 |  16,396 |   6,574 | 71.4% |  1.41s | +1 cached (≈8 blk)
  24 | 24,000 |  23,970 |  16,397 |   7,573 | 68.4% |  1.98s | +1 cached (≈8 blk)
  25 | 25,000 |  24,970 |  16,398 |   8,572 | 65.7% |  1.74s | +1 cached (≈8 blk)
  26 | 26,000 |  25,970 |  24,598 |   1,372 | 94.7% |  1.41s | +8,200 cached (≈12 blk)
  27 | 27,000 |  26,969 |  24,599 |   2,370 | 91.2% |  1.49s | +1 cached (≈12 blk)
  28 | 28,000 |  27,968 |  24,600 |   3,368 | 88.0% |  1.78s | +1 cached (≈12 blk)
  29 | 29,000 |  28,969 |  24,601 |   4,368 | 84.9% |  2.72s | +1 cached (≈12 blk)
  30 | 30,000 |  29,970 |  24,602 |   5,368 | 82.1% |  1.80s | +1 cached (≈12 blk)
  31 | 31,000 |  30,970 |  24,603 |   6,367 | 79.4% |  2.13s | +1 cached (≈12 blk)
  32 | 32,000 |  31,969 |  24,603 |   7,366 | 77.0% |  2.11s |
  33 | 33,000 |  32,968 |  24,604 |   8,364 | 74.6% |  2.03s | +1 cached (≈12 blk)
  34 | 34,000 |  33,969 |  32,806 |   1,163 | 96.6% |  1.71s | +8,202 cached (≈16 blk)
  35 | 35,000 |  34,970 |  32,807 |   2,163 | 93.8% |  2.20s | +1 cached (≈16 blk)

Analysis

The data shows four distinct behaviors:

1. Small context ~2K block growth (turns 4–8)

Cache grows in ~2,048-token increments (the documented granularity would suggest 1,024, though):

  • Turn 4: +2,037 cached → ~2K
  • Turn 6: +2,045 → ~4K
  • Turn 8: +2,047 → ~6K

2. Dead zone — zero cached (turns 9–17)

At turn 9 (~9K prompt tokens), cached_content_token_count drops to 0 and stays there for 9 consecutive turns, through ~17K prompt tokens. The prefix grows only by appending new content at the end, so everything from previous turns remains byte-identical.

3. Cache plateau behavior (turns 18+)

When caching resumes, it doesn't grow incrementally. Instead it locks to a fixed plateau and only jumps in large ~8,192-token steps (4 × 2048):

Prompt range | Cached (locked) | ≈ Blocks  | Duration
18K – 25K    |         ~16,391 |  8 × 2048 | 8 turns
26K – 33K    |         ~24,598 | 12 × 2048 | 8 turns
34K+         |         ~32,806 | 16 × 2048 | ongoing

The cached amount stays essentially constant within each plateau (varying by only 1–7 tokens), then makes a single ~8K jump to the next level.
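
A quick arithmetic check of the block counts quoted above (each plateau sits a few tokens above an exact multiple of 2,048):

for cached in (16_391, 24_598, 32_806):
    blocks = cached / 2048
    print(f"{cached}: {blocks:.3f} blocks ≈ {round(blocks)} × 2048 = {round(blocks) * 2048}")
# 16391: 8.003 blocks, 24598: 12.011 blocks, 32806: 16.019 blocks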

4. Block size regime change

  • Below 8K: cache grows in 1-block (~2,048 token) steps
  • Above 18K: cache grows in 4-block (~8,192 token) steps
  • 9K–17K: no caching at all

Expected behavior

cached_content_token_count should increase monotonically (or at minimum remain stable) as the prompt grows, since the prefix is byte-identical across turns; a minimal check expressing this invariant is sketched after the list. Specifically:

  1. There should be no dead zone where caching drops to 0 mid-conversation
  2. The transition from small-block to large-block caching should not involve losing all cached state
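
Expressed as a check over the per-turn (prompt, cached) pairs logged above, using hypothetical helper naming, the invariant would be roughly:

def cache_regressions(rows):
    # rows: list of (prompt_tokens, cached_tokens) per turn, where the prompt prefix
    # is byte-identical across turns and only grows at the end.
    violations, best = [], 0
    for turn, (prompt, cached) in enumerate(rows, start=1):
        if cached < best:  # cached coverage went backwards despite an unchanged prefix
            violations.append((turn, prompt, cached, best))
        best = max(best, cached)
    return violations

# On the run above this flags turns 9–17 (cached falls from 6,134 back to 0).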

Questions

  1. Why is there a ~9K–17K dead zone in the caching system?
  2. Why does the block granularity change from ~2K steps (below 8K) to ~8K steps (above 18K)?
  3. Is there any way to avoid the dead zone for applications whose prompts naturally fall in the 9K–17K range? (One candidate workaround, explicit caching, is sketched below.)
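
For question 3, one workaround candidate (untested here) is explicit caching via client.caches.create, which pins the shared prefix rather than relying on implicit-cache granularity. A minimal sketch, assuming the preview model accepts explicit caches and the real prefix meets the explicit-cache minimum token count:

import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Stable prefix: the system instruction plus the byte-identical earlier turns.
stable_history = [
    types.UserContent(parts=[types.Part.from_text(text="Question 1: ...")]),
    types.ModelContent(parts=[types.Part.from_text(text="OK")]),
]

cache = client.caches.create(
    model="gemini-3-flash-preview",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a helpful assistant that answers questions about world geography ...",
        contents=stable_history,
        ttl="300s",
    ),
)

# Subsequent calls send only the new turn and reference the cache explicitly.
resp = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Question 2: what is the capital of country number 2? Reply with one word: OK",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.usage_metadata.cached_content_token_count)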

Environment

  • Model: gemini-3-flash-preview
  • SDK: google-genai==1.59.0
  • API: Google AI Studio
  • Python: 3.12
  • OS: Windows 11 / WSL2

Metadata

Labels

priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
