Description
gemini-3-flash-preview seems to have a reproducible implicit caching dead zone where cached_content_token_count drops to 0 when the prompt size is between
~9K and ~17K tokens, despite a stable, byte-identical prefix growing via multi-turn conversation history.
Caching works correctly below ~8K (growing in ~2,048-token block increments) and reappears above ~18K, but the mid-range consistently reports zero cached tokens.
Additionally, once caching resumes above 18K, the cached amount locks to a plateau and only jumps in large ~8K-token steps (~8,192 ≈ 4×2048), staying fixed for many turns before the next jump:
| Plateau | Cached tokens | ≈ Blocks (×2048) | Holds across prompt range |
|---|---|---|---|
| 1 | ~16,391 | 8 | 18K → 25K (8 turns) |
| 2 | ~24,598 | 12 | 26K → 33K (8 turns) |
| 3 | ~32,806 | 16 | 34K+ |
Reproduction
Single-file repro using only google-genai:

```
pip install google-genai
export GEMINI_API_KEY=<key>
python gemini_cache_repro.py
```
What the script does:
- Fixed ~405-char system instruction, identical across all 35 turns
- Multi-turn conversation history grows each turn (prior user/model messages are prepended)
- Each turn pads the user message to hit a prompt-token target (1K → 35K in 1K steps)
- Model is asked to reply with a single word ("OK") to minimize output noise
- 2-second delay between calls for cache propagation
gemini_cache_repro.py:

```python
import argparse
import os
import time
from pathlib import Path

from google import genai
from google.genai import types

MODEL = "gemini-3-flash-preview"

SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions about world geography, "
    "capital cities, population statistics, and general knowledge. You provide "
    "accurate, factual information. When asked to reply with one word, comply "
    "exactly. You have deep knowledge of countries, capitals, major cities, "
    "rivers, mountain ranges, climate zones, and economic indicators. Always be "
    "concise and precise in your responses."
)

# Adaptive padding constants (empirical for this model)
TOKENS_PER_CHAR = 0.19
BASELINE_OVERHEAD = 240
TURN_OVERHEAD = 26
RESPONSE_TOKENS = 8


def load_env():
    """Load GEMINI_API_KEY / GOOGLE_API_KEY from a local .env file if present."""
    for p in [Path.cwd() / ".env", Path(__file__).resolve().parent / ".env"]:
        if p.exists():
            for line in p.read_text().splitlines():
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    k, v = line.split("=", 1)
                    k, v = k.strip(), v.strip().strip("\"'")
                    if k and k not in os.environ:
                        os.environ[k] = v
            break


def make_padding(turn, n_chars):
    """Deterministic filler text used to pad the user message to a size target."""
    base = (
        f"[Turn {turn}] Geography reference: countries, capitals, populations, "
        f"area, GDP, climate, languages, currency, time zones, coordinates. "
    )
    if n_chars <= 0:
        return ""
    return (base * (n_chars // len(base) + 1))[:n_chars]


def make_question(turn, padding_chars=0):
    q = f"Question {turn}: what is the capital of country number {turn}?"
    if padding_chars <= 0:
        return f"{q} Reply with one word: OK"
    return f"{q}\n\nReference:\n{make_padding(turn, padding_chars)}\n\nReply with one word: OK"


def extract_usage(resp):
    """Return (prompt_tokens, candidates_tokens, cached_tokens) from usage_metadata."""
    u = getattr(resp, "usage_metadata", None)
    if u is None:
        return 0, 0, 0
    d = u.model_dump() if hasattr(u, "model_dump") else (
        u.to_dict() if hasattr(u, "to_dict") else u.__dict__
    )

    def gi(keys):
        for k in keys:
            v = d.get(k)
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                return int(v)
        return 0

    return (
        gi(["prompt_token_count", "promptTokenCount"]),
        gi(["candidates_token_count", "candidatesTokenCount"]),
        gi(["cached_content_token_count", "cachedContentTokenCount"]),
    )


def main():
    ap = argparse.ArgumentParser(
        description="Reproduce Gemini implicit caching dead zone (~9K-17K)")
    ap.add_argument("--api_key", default=None)
    ap.add_argument("--start", type=int, default=1000)
    ap.add_argument("--stop", type=int, default=35000)
    ap.add_argument("--step", type=int, default=1000)
    ap.add_argument("--sleep", type=float, default=2.0)
    ap.add_argument("--max_tokens", type=int, default=16)
    args = ap.parse_args()

    load_env()
    api_key = args.api_key or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
    if not api_key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api_key")

    targets = list(range(args.start, args.stop + 1, args.step))
    client = genai.Client(api_key=api_key)

    print(f"Model: {MODEL}")
    print(f"System prompt: {len(SYSTEM_PROMPT)} chars (fixed across all turns)")
    print(f"Sweep: {args.start:,} → {args.stop:,} prompt tokens, step={args.step:,}")
    print(f"Turns: {len(targets)}, sleep: {args.sleep}s between calls")
    print(f"google-genai version: {genai.__version__}")
    print()
    print(f"{'Turn':>4} | {'Target':>6} | {'Prompt':>7} | {'Cached':>7} | "
          f"{'New':>7} | {'Cache%':>6} | {'Time':>6} | Notes")
    print("-" * 95)

    pad_plan = [0] * len(targets)
    pad_plan[0] = max(0, int((targets[0] - BASELINE_OVERHEAD) / TOKENS_PER_CHAR))

    history = []
    tpc, exp_resp = TOKENS_PER_CHAR, RESPONSE_TOKENS
    prev_prompt, prev_hchars = None, 0
    prev_qlen = len(make_question(1, pad_plan[0]))
    prev_cached = 0

    for i, target in enumerate(targets):
        turn = i + 1
        question = make_question(turn, pad_plan[i])

        # Build multi-turn contents: full prior history, then the new user turn
        contents = []
        for h in history:
            contents.append(types.UserContent(
                parts=[types.Part.from_text(text=h["q"])]))
            contents.append(types.ModelContent(
                parts=[types.Part.from_text(text=h["a"])]))
        contents.append(types.UserContent(
            parts=[types.Part.from_text(text=question)]))

        t0 = time.time()
        resp = client.models.generate_content(
            model=MODEL,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                temperature=0,
                max_output_tokens=args.max_tokens,
            ),
        )
        elapsed = time.time() - t0

        prompt, compl, cached = extract_usage(resp)
        ratio = cached / max(prompt, 1)

        note = ""
        if cached > prev_cached:
            note = f"+{cached - prev_cached:,} cached (≈{cached // 2048} blk)"
        elif cached == 0 and prev_cached > 0:
            note = "← DROPPED TO 0"
        prev_cached = cached

        print(f"{turn:>4} | {target:>6,} | {prompt:>7,} | {cached:>7,} | "
              f"{prompt - cached:>7,} | {ratio:>5.1%} | {elapsed:>5.2f}s | {note}")

        txt = getattr(resp, "text", "") or ""
        if not txt and hasattr(resp, "candidates") and resp.candidates:
            try:
                txt = resp.candidates[0].content.parts[0].text
            except Exception:
                txt = ""
        history.append({"q": question, "a": str(txt)})

        # Adaptive padding calibration: refine the tokens-per-char estimate and
        # plan how much padding the next turn needs to hit its prompt-token target
        hchars = sum(len(h["q"]) + len(h["a"]) for h in history)
        if prev_prompt is not None:
            dt = prompt - prev_prompt
            dc = (hchars - prev_hchars) + (len(question) - prev_qlen)
            if dt > 0 and dc > 0:
                tpc = min(0.6, max(0.05, 0.75 * tpc + 0.25 * (dt / dc)))
        if compl > 0:
            exp_resp = 0.7 * exp_resp + 0.3 * compl
        if i + 1 < len(targets):
            base_q = len(make_question(turn + 1, 0))
            needed = targets[i + 1] - prompt - exp_resp - base_q * tpc - TURN_OVERHEAD
            pad_plan[i + 1] = max(0, min(int(needed / tpc), 50000))
        prev_prompt, prev_hchars, prev_qlen = prompt, hchars, len(question)

        time.sleep(args.sleep)


if __name__ == "__main__":
    main()
```

Output

```
Model: gemini-3-flash-preview
System prompt: 405 chars (fixed across all turns)
Sweep: 1,000 → 35,000 prompt tokens, step=1,000
Turns: 35, sleep: 2.0s between calls
google-genai version: 1.59.0
Turn | Target | Prompt | Cached | New | Cache% | Time | Notes
-----------------------------------------------------------------------------------------------
1 | 1,000 | 994 | 0 | 994 | 0.0% | 1.73s |
2 | 2,000 | 2,146 | 0 | 2,146 | 0.0% | 1.00s |
3 | 3,000 | 3,124 | 0 | 3,124 | 0.0% | 1.28s |
4 | 4,000 | 4,028 | 2,037 | 1,991 | 50.6% | 1.33s | +2,037 cached (≈1 blk)
5 | 5,000 | 4,994 | 2,040 | 2,954 | 40.8% | 1.49s | +3 cached (≈1 blk)
6 | 6,000 | 6,003 | 4,085 | 1,918 | 68.0% | 1.77s | +2,045 cached (≈2 blk)
7 | 7,000 | 7,006 | 4,087 | 2,919 | 58.3% | 1.68s | +2 cached (≈3 blk)
8 | 8,000 | 7,996 | 6,134 | 1,862 | 76.7% | 1.46s | +2,047 cached (≈3 blk)
9 | 9,000 | 8,986 | 0 | 8,986 | 0.0% | 1.59s | ← DROPPED TO 0
10 | 10,000 | 10,006 | 0 | 10,006 | 0.0% | 1.37s |
11 | 11,000 | 10,999 | 0 | 10,999 | 0.0% | 1.47s |
12 | 12,000 | 11,982 | 0 | 11,982 | 0.0% | 1.53s |
13 | 13,000 | 12,979 | 0 | 12,979 | 0.0% | 1.63s |
14 | 14,000 | 13,979 | 0 | 13,979 | 0.0% | 1.51s |
15 | 15,000 | 14,978 | 0 | 14,978 | 0.0% | 1.40s |
16 | 16,000 | 15,975 | 0 | 15,975 | 0.0% | 1.81s |
17 | 17,000 | 16,973 | 0 | 16,973 | 0.0% | 1.93s |
18 | 18,000 | 17,972 | 16,391 | 1,581 | 91.2% | 1.34s | +16,391 cached (≈8 blk)
19 | 19,000 | 18,972 | 16,392 | 2,580 | 86.4% | 1.62s | +1 cached (≈8 blk)
20 | 20,000 | 19,971 | 16,393 | 3,578 | 82.1% | 1.80s | +1 cached (≈8 blk)
21 | 21,000 | 20,971 | 16,394 | 4,577 | 78.2% | 1.58s | +1 cached (≈8 blk)
22 | 22,000 | 21,970 | 16,395 | 5,575 | 74.6% | 2.11s | +1 cached (≈8 blk)
23 | 23,000 | 22,970 | 16,396 | 6,574 | 71.4% | 1.41s | +1 cached (≈8 blk)
24 | 24,000 | 23,970 | 16,397 | 7,573 | 68.4% | 1.98s | +1 cached (≈8 blk)
25 | 25,000 | 24,970 | 16,398 | 8,572 | 65.7% | 1.74s | +1 cached (≈8 blk)
26 | 26,000 | 25,970 | 24,598 | 1,372 | 94.7% | 1.41s | +8,200 cached (≈12 blk)
27 | 27,000 | 26,969 | 24,599 | 2,370 | 91.2% | 1.49s | +1 cached (≈12 blk)
28 | 28,000 | 27,968 | 24,600 | 3,368 | 88.0% | 1.78s | +1 cached (≈12 blk)
29 | 29,000 | 28,969 | 24,601 | 4,368 | 84.9% | 2.72s | +1 cached (≈12 blk)
30 | 30,000 | 29,970 | 24,602 | 5,368 | 82.1% | 1.80s | +1 cached (≈12 blk)
31 | 31,000 | 30,970 | 24,603 | 6,367 | 79.4% | 2.13s | +1 cached (≈12 blk)
32 | 32,000 | 31,969 | 24,603 | 7,366 | 77.0% | 2.11s |
33 | 33,000 | 32,968 | 24,604 | 8,364 | 74.6% | 2.03s | +1 cached (≈12 blk)
34 | 34,000 | 33,969 | 32,806 | 1,163 | 96.6% | 1.71s | +8,202 cached (≈16 blk)
35 | 35,000 | 34,970 | 32,807 | 2,163 | 93.8% | 2.20s | +1 cached (≈16 blk)
```

Analysis
The data shows four distinct behaviors:
1. Small context ~2K block growth (turns 4–8)
Cache grows in ~2,048-token increments (the documentation suggests 1,024-token granularity, though):
- Turn 4: +2,037 cached → ~2K
- Turn 6: +2,045 → ~4K
- Turn 8: +2,047 → ~6K
2. Dead zone — zero cached (turns 9–17)
At turn 9 (~9K prompt tokens), cached_content_token_count drops to 0 and stays there for 9 consecutive turns, through ~17K prompt tokens. The prefix remains fully stable throughout: new context is only appended at the end, so every earlier turn is byte-identical to the previous request.
3. Cache plateau behavior (turns 18+)
When caching resumes, it doesn't grow incrementally. Instead it locks to a fixed plateau and only jumps in large ~8,192-token steps (4 × 2048):
| Prompt range | Cached (locked) | ≈ Blocks | Duration |
|---|---|---|---|
| 18K – 25K | ~16,391 | 8 × 2048 | 8 turns |
| 26K – 33K | ~24,598 | 12 × 2048 | 8 turns |
| 34K+ | ~32,806 | 16 × 2048 | ongoing |
The cached amount stays essentially constant within each plateau (varying by only 1–7 tokens), then makes a single ~8K jump to the next level.
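As a quick sanity check on the block arithmetic (assuming the apparent 2,048-token block size, which is inferred from the data rather than documented), the plateau values from the run above line up like this:

```python
# Rough arithmetic on the observed cached_content_token_count plateau values,
# assuming a 2,048-token cache block (an inference from the data, not documented).
BLOCK = 2048
plateaus = [16391, 24598, 32806]  # cached tokens reported on turns 18, 26, 34

for prev, cur in zip([0] + plateaus, plateaus):
    print(f"{cur:>6} tokens ≈ {cur / BLOCK:.2f} blocks, "
          f"jump of {cur - prev:,} ≈ {(cur - prev) / BLOCK:.2f} blocks")
#  16391 tokens ≈ 8.00 blocks, jump of 16,391 ≈ 8.00 blocks
#  24598 tokens ≈ 12.01 blocks, jump of 8,207 ≈ 4.01 blocks
#  32806 tokens ≈ 16.02 blocks, jump of 8,208 ≈ 4.01 blocks
```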
4. Block size regime change
- Below 8K: cache grows in 1-block (~2,048-token) steps
- Above 18K: cache grows in 4-block (~8,192-token) steps
- 9K–17K: no caching at all
Expected behavior
cached_content_token_count should increase monotonically (or at minimum remain stable) as the prompt grows, since the prefix is byte-identical across turns. Specifically:
- There should be no dead zone where caching drops to 0 mid-conversation
- The transition from small-block to large-block caching should not involve losing all cached state
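To make the expected invariant concrete, here is a minimal sketch of a post-run check over the per-turn numbers the repro script already prints; check_monotonic_cache and the (turn, prompt, cached) row format are hypothetical helpers, not part of the script above:

```python
# Hypothetical post-processing of the per-turn (turn, prompt, cached) tuples
# collected from usage_metadata; flags any turn where the cached prefix shrinks.
def check_monotonic_cache(rows):
    """rows: list of (turn, prompt_tokens, cached_tokens) tuples."""
    violations = []
    prev_cached = 0
    for turn, prompt, cached in rows:
        if cached < prev_cached:
            violations.append(
                f"turn {turn}: cached dropped {prev_cached:,} -> {cached:,} "
                f"(prompt {prompt:,})"
            )
        prev_cached = max(prev_cached, cached)
    return violations

# On the run above, this flags turns 9-17, where cached drops from 6,134 to 0.
```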
Questions
- Why is there a ~9K–17K dead zone in the caching system?
- Why does the block granularity change from ~2K steps (below 8K) to ~8K steps (above 18K)?
- Is there any way to avoid the dead zone for applications whose prompts naturally fall in the 9K–17K range (for example, by pinning the prefix with explicit caching, as sketched below)?
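On the last question, one possible mitigation is to pin the stable prefix with an explicit cache via client.caches.create and pass cached_content at generation time, instead of relying on implicit caching. This is only a sketch: MODEL, SYSTEM_PROMPT, and question refer to the repro script above, stable_history is a placeholder for the prior turns, and whether explicit caching is available for gemini-3-flash-preview (and worth the TTL/storage cost) is untested here.

```python
# Sketch only: pin the stable conversation prefix with an explicit cache so the
# cache hit does not depend on implicit-cache block heuristics. `stable_history`
# is a placeholder for the byte-identical prior turns; explicit caches have a
# model-dependent minimum size and a TTL cost.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

cache = client.caches.create(
    model=MODEL,
    config=types.CreateCachedContentConfig(
        system_instruction=SYSTEM_PROMPT,
        contents=stable_history,  # prior user/model Content objects (placeholder)
        ttl="300s",
    ),
)

resp = client.models.generate_content(
    model=MODEL,
    contents=[types.UserContent(parts=[types.Part.from_text(text=question)])],
    config=types.GenerateContentConfig(
        cached_content=cache.name,  # system prompt + history come from the cache
        temperature=0,
    ),
)
```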
Environment
- Model: gemini-3-flash-preview
- SDK: google-genai==1.59.0
- API: Google AI Studio
- Python: 3.12
- OS: Windows 11 / WSL2