diff --git a/.gitignore b/.gitignore
index 0a470be..4e9b44e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@ build/
 .token0_images/
 benchmarks/images/real/screenshot_real.png
 benchmarks/results/
+benchmarks/videos/
 *.db-journal
 .DS_Store
 .idea/
diff --git a/README.md b/README.md
index ff20a16..de34372 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ Your App → Token0 Proxy → [Analyze → Classify → Route → Transform →
          Database (logs every optimization decision + savings)
 ```
 
-Token0 applies **7 optimizations** automatically:
+Token0 applies **9 optimizations** automatically:
 
 ### Core Optimizations (Free Tier)
 
@@ -49,11 +49,15 @@ Token0 applies **7 optimizations** automatically:
 
 **7. Semantic Response Cache** — Cache responses for similar image+prompt pairs using perceptual image hashing. Repeated or similar queries cost 0 tokens. Effective on repetitive workloads (product classification, document processing).
 
+**8. QJL-Compressed Fuzzy Cache** — Similar (not just identical) images hit the cache using Quantized Johnson-Lindenstrauss random projection. Compresses 256-bit perceptual hashes to 128-bit binary signatures, matches via Hamming distance. Inspired by Google's TurboQuant (arXiv 2504.19874). **62% additional token savings** on image variations in benchmarks — similar product photos, re-scanned documents, and slightly different angles all hit cache.
+
+**9. Video Optimization** — Automatically extract keyframes from video at 1fps, deduplicate similar consecutive frames using QJL perceptual hashing, detect scene changes via pixel-level diff, and run each keyframe through the full image optimization pipeline. A 60-second video at 30fps (1,800 frames) reduces to ~10 keyframes before being sent to the LLM. **13-45% savings on local models; ~83% projected savings on GPT-4o.** Optional CLIP-based query-frame scoring (Layer 2) ranks frames by relevance to the user's prompt.
+
 ---
 
 ## Benchmarks
 
-We benchmarked Token0 against **4 vision models** on **5 real-world images** (not synthetic — actual photos, receipts, documents, and screenshots), plus cost projections using OpenAI and Anthropic's published token formulas.
+We benchmarked Token0 against **7 vision models** on **5 real-world images** (not synthetic — actual photos, receipts, documents, and screenshots) and **3 test videos**, plus cost projections using OpenAI and Anthropic's published token formulas.
 
 ### Real-World Image Test Suite
 
@@ -111,16 +115,55 @@ We benchmarked Token0 against **4 vision models** on **5 real-world images** (no
 | Screenshot (2066x766) | 618 | 244 | **60.5%** | **-3,744ms** | OCR route |
 | **Total** | **3,027** | **2,243** | **25.9%** | | |
 
-### Summary Across All Models
+### Image Benchmark Summary (7 Models)
+
+| Model | Params | Total Direct | Total Token0 | Savings | Notes |
+|---|---|---|---|---|---|
+| granite3.2-vision | 3B | 129,836 | 60,924 | **53.1%** | High-res image encoder |
+| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** | |
+| moondream | 1.7B | 16,457 | 10,240 | **37.8%** | |
+| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** | |
+| llava:7b | 7B | 13,384 | 8,701 | **35.0%** | |
+| gemma3:4b | 4B | 6,380 | 4,798 | **24.8%** | |
+| llama3.2-vision | 11B | 665 | 665 | **0%** | Ultra-efficient encoder: passthrough correct, no optimization needed |
+
+> The 0% savings on llama3.2-vision is expected and correct. This model uses ~8-27 tokens per image natively — far below what OCR text extraction would cost. Token0 detects this and correctly skips all lossy optimizations.
+
+### Video Benchmark Results
+
+Test setup: 3 videos (product showcase, document montage, mixed content), naive baseline = all frames at 1fps sent raw, Token0 = frame dedup + scene detection + per-frame image optimization.
+
+| Model | Naive Tokens | Token0 Tokens | Savings |
+|---|---|---|---|
+| gemma3:4b | 14,706 | 8,081 | **45.0%** |
+| llava:7b | 15,731 | 12,845 | **18.3%** |
+| llava-llama3 | 15,658 | 12,789 | **18.3%** |
+| minicpm-v | 7,428 | 6,447 | **13.2%** |
+| moondream | 12,288 | 11,714 | **4.7%** |
 
-| Model | Params | Total Direct | Total Token0 | Savings |
-|---|---|---|---|---|
-| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** |
-| moondream | 1.7B | 16,457 | 10,240 | **37.8%** |
-| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** |
-| llava:7b | 7B | 13,384 | 8,701 | **35.0%** |
+**Why moondream shows less video savings:** moondream uses a very small frame encoder — its per-frame token cost is already low, so frame dedup has less absolute impact than on higher-token models.
+
+### GPT-4o Video Extrapolation (ballpark)
+
+Using OpenAI's published tile formula (512px tiles, 170 tokens/tile):
+
+| Scenario | Naive | Token0 | Savings |
+|---|---|---|---|
+| 60s video, 30fps (1,800 frames → 1fps → 60 frames → dedup to ~10) | ~25,500 tokens | ~4,250 tokens | **~83%** |
+| Monthly cost at 10K videos/day (GPT-4o $2.50/1M tokens) | $19,125/mo | $3,188/mo | **$15,938/mo saved** |
+
+### Anthropic Video Extrapolation (ballpark)
+
+Using Anthropic's pixel formula (tokens ≈ width × height / 750):
+
+| Scenario | Naive | Token0 | Savings |
+|---|---|---|---|
+| 60s video, 1fps = 60 frames at 1280×720 | ~73,700 tokens | ~12,300 tokens | **~83%** |
+| Monthly cost at 1K videos/day (Claude Sonnet $3/1M tokens) | $6,633/mo | $1,107/mo | **$5,526/mo saved** |
 
-### GPT-4o Cost Projections (v1 vs v2)
+> These are linear extrapolations from the token formula + observed dedup ratios (60 frames → ~10 keyframes). Actual savings vary by content type — talking-head video deduplicates more aggressively than action scenes.
+
+### GPT-4o Image Cost Projections (v1 vs v2)
 
 Using OpenAI's published token formulas on real images:
 
@@ -150,11 +193,13 @@ Using OpenAI's published token formulas on real images:
 5. **Prompt-aware detail mode** drops simple queries from 1,105 → 85 tokens (92% savings) on GPT-4o.
 6. **Model cascade** routes simple tasks at 16.7x cheaper rates with equivalent quality.
 7. **Tile-optimized resize** cuts OpenAI costs by 44% on mid-size images (1280x720) with zero quality loss.
-8. **On cloud APIs, total savings reach 98.9%** when all optimizations are combined with model cascading.
+8. **On cloud APIs, total image savings reach 98.9%** when all optimizations are combined with model cascading.
+9. **Video deduplication collapses 60-frame clips to ~10 keyframes** — 13-45% savings on local models, ~83% projected on GPT-4o.
+10. **Model-aware OCR skip is critical** — ultra-efficient encoders like llama3.2-vision use <50 tokens/image; OCR text output would cost more, not less.
 
 ### Additional Test Coverage
 
-Token0 includes **103 unit tests** and benchmarks across multiple suites:
+Token0 includes **148 unit tests** and benchmarks across multiple suites:
 
 | Suite | Tests | What It Validates |
 |---|---|---|
@@ -166,6 +211,8 @@ Token0 includes **103 unit tests** and benchmarks across multiple suites:
 | `real` | 5 | Real-world photos, receipts, invoices, screenshots |
 | `streaming` | 7 | SSE streaming: format, content, stats, image optimization |
 | `litellm` | 10 | LiteLLM hook: passthrough, optimization, OCR, cascade, async |
+| `cache` | 23 | QJL fuzzy cache: perceptual hash, JL compression, Hamming distance, fuzzy match |
+| `video` | 22 | Frame extraction, QJL dedup, scene detection, CLIP scoring, full pipeline |
 
 ---
 
@@ -234,6 +281,26 @@ response = client.chat.completions.create(
 # response.token0.optimizations_applied = ["resize 4000x3000 → 1568x1176", "convert png → jpeg q=85"]
 ```
 
+### Video Support
+
+Send a video URL or base64-encoded video — Token0 automatically extracts keyframes, deduplicates, and optimizes before forwarding:
+
+```python
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What happens in this video?"},
+            {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,..."}}
+        ]
+    }],
+    extra_headers={"X-Provider-Key": "sk-..."}
+)
+# 1,800 raw frames → ~10 keyframes → optimized images → LLM
+# response.token0.tokens_saved = 21,250  (~83% on GPT-4o)
+```
+
 ### Streaming Support
 
 Token0 supports `stream=true` — images are optimized before streaming begins, then tokens flow word-by-word via SSE:
@@ -332,12 +399,16 @@ curl http://localhost:8000/v1/usage
 pip install token0[dev]
 ollama pull moondream
 
-# Run all suites
+# Run all image suites
 python -m benchmarks.run --model moondream --suite all
 
 # Run only real-world images
 python -m benchmarks.run --model llava:7b --suite real
 
+# Run video benchmarks (requires Ollama + real images in benchmarks/images/real/)
+python -m benchmarks.bench_video_models
+python -m benchmarks.bench_video_models --model llava:7b --model minicpm-v
+
 # Available suites: images, text, multi, turns, tasks, real, all
 # Available models: any Ollama vision model
 ```
diff --git a/benchmarks/bench_fuzzy_cache.py b/benchmarks/bench_fuzzy_cache.py
new file mode 100644
index 0000000..6b3e90d
--- /dev/null
+++ b/benchmarks/bench_fuzzy_cache.py
@@ -0,0 +1,231 @@
+"""Benchmark: QJL fuzzy cache vs exact-match-only cache.
+
+Demonstrates token savings from fuzzy cache hits on similar images.
+Simulates a real workload: same product/document photographed multiple
+times with slight variations (lighting, angle, compression artifacts).
+
+Usage:
+    python -m benchmarks.bench_fuzzy_cache
+"""
+
+import asyncio
+import time
+
+import numpy as np
+from PIL import Image
+
+from token0.optimization.cache import (
+    _hamming_distance,
+    _image_hash,
+    _jl_compress,
+    clear_fuzzy_index,
+    get_cached_response,
+    get_fuzzy_index_size,
+    make_cache_key,
+    set_cached_response,
+)
+from token0.storage.redis import MemoryCache
+
+# Estimated tokens per image (GPT-4o high detail, ~800x600)
+TOKENS_PER_IMAGE = 765
+COST_PER_TOKEN = 2.50 / 1_000_000  # GPT-4o input price
+
+
+def _make_base_image(seed: int, width=800, height=600) -> Image.Image:
+    """Create a unique base image (simulates a product photo or document)."""
+    rng = np.random.RandomState(seed=seed)
+    pixels = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
+    return Image.fromarray(pixels)
+
+
+def _add_variation(base: Image.Image, variation_seed: int, noise_level: int = 15) -> Image.Image:
+    """Add slight variation to an image (simulates re-photo, compression, etc.)."""
+    pixels = np.array(base)
+    rng = np.random.RandomState(seed=variation_seed)
+    noise = rng.randint(-noise_level, noise_level + 1, pixels.shape, dtype=np.int16)
+    noisy = np.clip(pixels.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+    return Image.fromarray(noisy)
+
+
+async def run_benchmark():
+    import token0.storage.redis as redis_mod
+
+    redis_mod._memory_cache.clear()
+    redis_mod.pool = MemoryCache()
+    clear_fuzzy_index()
+
+    print("=" * 80)
+    print("  QJL Fuzzy Cache Benchmark")
+    print("=" * 80)
+
+    # --- Setup: create base images and variations ---
+    num_unique_images = 20
+    variations_per_image = 5  # each base image has 5 slight variations
+    prompt = "describe this product image"
+
+    base_images = [_make_base_image(seed=i) for i in range(num_unique_images)]
+    variation_images = []
+    for i, base in enumerate(base_images):
+        for v in range(variations_per_image):
+            variation_images.append((i, _add_variation(base, variation_seed=i * 100 + v)))
+
+    total_requests = num_unique_images + len(variation_images)
+    print(f"\n  Setup: {num_unique_images} unique images, {variations_per_image} variations each")
+    print(f"  Total requests: {total_requests}")
+    print(f"  Tokens per image (GPT-4o): {TOKENS_PER_IMAGE}")
+
+    # --- Benchmark 1: Exact-match only ---
+    print("\n  --- Exact Match Only ---\n")
+    redis_mod._memory_cache.clear()
+    clear_fuzzy_index()
+
+    exact_hits = 0
+    exact_misses = 0
+    start = time.time()
+
+    # First pass: cache base images
+    for i, base in enumerate(base_images):
+        key = make_cache_key(base, prompt, "gpt-4o")
+        await set_cached_response(key, {"content": f"response_{i}"})
+
+    # Second pass: query with variations (exact match only)
+    for base_idx, var_img in variation_images:
+        key = make_cache_key(var_img, prompt, "gpt-4o")
+        result = await get_cached_response(key, fuzzy=False)
+        if result:
+            exact_hits += 1
+        else:
+            exact_misses += 1
+
+    exact_time = time.time() - start
+    exact_tokens_used = exact_misses * TOKENS_PER_IMAGE
+    exact_cost = exact_tokens_used * COST_PER_TOKEN
+
+    print(f"  Hits:   {exact_hits}/{len(variation_images)}")
+    print(f"  Misses: {exact_misses}/{len(variation_images)}")
+    print(f"  Tokens used: {exact_tokens_used:,}")
+    print(f"  Cost: ${exact_cost:.4f}")
+    print(f"  Time: {exact_time * 1000:.1f}ms")
+
+    # --- Benchmark 2: Fuzzy match (QJL) ---
+    print("\n  --- QJL Fuzzy Match ---\n")
+    redis_mod._memory_cache.clear()
+    clear_fuzzy_index()
+
+    fuzzy_hits = 0
+    fuzzy_misses = 0
+    start = time.time()
+
+    # First pass: cache base images
+    for i, base in enumerate(base_images):
+        key = make_cache_key(base, prompt, "gpt-4o")
+        await set_cached_response(key, {"content": f"response_{i}"})
+
+    # Second pass: query with variations (fuzzy match enabled)
+    for base_idx, var_img in variation_images:
+        key = make_cache_key(var_img, prompt, "gpt-4o")
+        result = await get_cached_response(key, fuzzy=True)
+        if result:
+            fuzzy_hits += 1
+        else:
+            fuzzy_misses += 1
+
+    fuzzy_time = time.time() - start
+    fuzzy_tokens_used = fuzzy_misses * TOKENS_PER_IMAGE
+    fuzzy_cost = fuzzy_tokens_used * COST_PER_TOKEN
+
+    print(f"  Hits:   {fuzzy_hits}/{len(variation_images)}")
+    print(f"  Misses: {fuzzy_misses}/{len(variation_images)}")
+    print(f"  Tokens used: {fuzzy_tokens_used:,}")
+    print(f"  Cost: ${fuzzy_cost:.4f}")
+    print(f"  Time: {fuzzy_time * 1000:.1f}ms")
+    print(f"  Fuzzy index size: {get_fuzzy_index_size()} entries")
+
+    # --- Hamming distance analysis ---
+    print("\n  --- Hamming Distance Analysis ---\n")
+    distances_similar = []
+    distances_different = []
+
+    for i, base in enumerate(base_images[:5]):
+        base_hash = _image_hash(base)
+        base_sig = _jl_compress(base_hash)
+
+        # Similar: variations of same base
+        for v in range(variations_per_image):
+            var = _add_variation(base, variation_seed=i * 100 + v)
+            var_hash = _image_hash(var)
+            var_sig = _jl_compress(var_hash)
+            distances_similar.append(_hamming_distance(base_sig, var_sig))
+
+        # Different: other base images
+        for j in range(5):
+            if i == j:
+                continue
+            other_hash = _image_hash(base_images[j])
+            other_sig = _jl_compress(other_hash)
+            distances_different.append(_hamming_distance(base_sig, other_sig))
+
+    print(
+        f"  Similar images:   avg={np.mean(distances_similar):.1f}, "
+        f"min={min(distances_similar)}, max={max(distances_similar)}"
+    )
+    print(
+        f"  Different images: avg={np.mean(distances_different):.1f}, "
+        f"min={min(distances_different)}, max={max(distances_different)}"
+    )
+
+    # --- Summary ---
+    print(f"\n  {'=' * 70}")
+    print("  SUMMARY")
+    print(f"  {'=' * 70}")
+    print(f"  {'':30s} {'Exact':>12s} {'Fuzzy (QJL)':>12s} {'Improvement':>12s}")
+    print(f"  {'-' * 30} {'-' * 12} {'-' * 12} {'-' * 12}")
+    print(
+        f"  {'Cache hits':30s} {exact_hits:>12d} {fuzzy_hits:>12d} "
+        f"{'+' + str(fuzzy_hits - exact_hits):>12s}"
+    )
+    print(
+        f"  {'Cache misses':30s} {exact_misses:>12d} {fuzzy_misses:>12d} "
+        f"{exact_misses - fuzzy_misses:>12d}"
+    )
+    print(
+        f"  {'Tokens used':30s} {exact_tokens_used:>12,} {fuzzy_tokens_used:>12,} "
+        f"{exact_tokens_used - fuzzy_tokens_used:>12,}"
+    )
+
+    if exact_tokens_used > 0:
+        savings_pct = (exact_tokens_used - fuzzy_tokens_used) / exact_tokens_used * 100
+        print(f"  {'Token savings':30s} {'':>12s} {'':>12s} {savings_pct:>11.1f}%")
+
+    print(
+        f"  {'Cost (GPT-4o)':30s} ${exact_cost:>11.4f} ${fuzzy_cost:>11.4f} "
+        f"${exact_cost - fuzzy_cost:>11.4f}"
+    )
+
+    # Scale projections
+    print("\n  At scale (100K images/day, 20% are variations):")
+    daily_variations = 20_000
+    exact_miss_rate = exact_misses / len(variation_images)
+    fuzzy_miss_rate = fuzzy_misses / len(variation_images)
+    daily_exact_tokens = daily_variations * TOKENS_PER_IMAGE * exact_miss_rate
+    daily_fuzzy_tokens = daily_variations * TOKENS_PER_IMAGE * fuzzy_miss_rate
+    monthly_exact = daily_exact_tokens * 30 * COST_PER_TOKEN
+    monthly_fuzzy = daily_fuzzy_tokens * 30 * COST_PER_TOKEN
+    print(f"  Exact-only monthly cost:  ${monthly_exact:,.2f}")
+    print(f"  Fuzzy cache monthly cost: ${monthly_fuzzy:,.2f}")
+    print(f"  Monthly savings:          ${monthly_exact - monthly_fuzzy:,.2f}")
+    print(f"  {'=' * 70}")
+
+    # Memory overhead
+    sig_bytes = get_fuzzy_index_size() * 16  # 16 bytes per signature
+    key_bytes = get_fuzzy_index_size() * 80  # ~80 bytes per cache key string
+    print(
+        f"\n  Memory overhead: {(sig_bytes + key_bytes) / 1024:.1f} KB "
+        f"for {get_fuzzy_index_size()} entries "
+        f"({sig_bytes} bytes signatures + {key_bytes} bytes keys)"
+    )
+    print(f"  At 1M entries: ~{(1_000_000 * 96) / 1024 / 1024:.1f} MB")
+
+
+if __name__ == "__main__":
+    asyncio.run(run_benchmark())
diff --git a/benchmarks/bench_video.py b/benchmarks/bench_video.py
new file mode 100644
index 0000000..e88a5eb
--- /dev/null
+++ b/benchmarks/bench_video.py
@@ -0,0 +1,167 @@
+"""Benchmark: Video optimization pipeline.
+
+Demonstrates token savings from video frame extraction, deduplication,
+and scene detection vs naive frame-by-frame approach.
+
+Usage:
+    python -m benchmarks.bench_video
+"""
+
+import time
+
+import cv2
+import numpy as np
+
+from token0.optimization.video import (
+    deduplicate_frames,
+    detect_scene_changes,
+    extract_frames,
+    process_video,
+)
+
+# GPT-4o token costs
+TOKENS_PER_FRAME_HIGH = 765  # high detail, typical 720p frame
+TOKENS_PER_FRAME_LOW = 85  # low detail
+COST_PER_TOKEN = 2.50 / 1_000_000
+
+
+def _create_benchmark_video(
+    duration_seconds: int = 30,
+    fps: float = 30.0,
+    width: int = 640,
+    height: int = 480,
+    num_scenes: int = 5,
+) -> str:
+    """Create a realistic benchmark video with multiple scenes."""
+    import tempfile
+
+    tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
+    writer = cv2.VideoWriter(tmp.name, fourcc, fps, (width, height))
+
+    total_frames = int(duration_seconds * fps)
+    frames_per_scene = total_frames // num_scenes
+
+    for scene_idx in range(num_scenes):
+        # Each scene has a unique base pattern
+        rng = np.random.RandomState(seed=scene_idx * 42)
+        base = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
+
+        for frame_idx in range(frames_per_scene):
+            # Add slight per-frame variation (simulates camera movement, lighting)
+            noise_rng = np.random.RandomState(seed=scene_idx * 10000 + frame_idx)
+            noise = noise_rng.randint(-5, 6, base.shape, dtype=np.int16)
+            frame = np.clip(base.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+            writer.write(frame)
+
+    writer.release()
+    return tmp.name
+
+
+def run_benchmark():
+    print("=" * 80)
+    print("  Video Optimization Benchmark")
+    print("=" * 80)
+
+    configs = [
+        {"duration": 10, "scenes": 3, "label": "10s video, 3 scenes"},
+        {"duration": 30, "scenes": 5, "label": "30s video, 5 scenes"},
+        {"duration": 60, "scenes": 8, "label": "60s video, 8 scenes"},
+    ]
+
+    for config in configs:
+        print(f"\n  --- {config['label']} ---\n")
+
+        # Create video
+        path = _create_benchmark_video(
+            duration_seconds=config["duration"],
+            num_scenes=config["scenes"],
+        )
+
+        total_frames = int(config["duration"] * 30)
+
+        # Naive approach: send every frame at 1fps
+        start = time.time()
+        naive_frames = extract_frames(path, fps=1.0, max_frames=1000)
+        naive_time = time.time() - start
+        naive_tokens = len(naive_frames) * TOKENS_PER_FRAME_HIGH
+
+        # Token0 approach: full pipeline
+        start = time.time()
+        optimized_images, stats = process_video(
+            path,
+            prompt="describe what happens in this video",
+            max_frames=32,
+        )
+        token0_time = time.time() - start
+        token0_tokens = len(optimized_images) * TOKENS_PER_FRAME_HIGH
+
+        # Full naive (every frame)
+        full_naive_tokens = total_frames * TOKENS_PER_FRAME_HIGH
+
+        print(f"  Total video frames (30fps):     {total_frames}")
+        print(f"  Naive 1fps frames:              {len(naive_frames)}")
+        print(f"  Token0 optimized frames:        {stats['frames_selected']}")
+        print(f"    - After extraction (1fps):    {stats['frames_extracted_at_1fps']}")
+        print(f"    - After dedup:                {stats['frames_after_dedup']}")
+        print(f"    - After scene detection:      {stats['frames_after_scene_detection']}")
+        print()
+
+        # Token comparison
+        print(f"  {'Method':<35s} {'Frames':>8s} {'Tokens':>10s} {'Cost':>10s}")
+        print(f"  {'-'*35} {'-'*8} {'-'*10} {'-'*10}")
+        print(
+            f"  {'All frames (30fps)':<35s} {total_frames:>8d} "
+            f"{full_naive_tokens:>10,} ${full_naive_tokens * COST_PER_TOKEN:>9.4f}"
+        )
+        print(
+            f"  {'Naive 1fps':<35s} {len(naive_frames):>8d} "
+            f"{naive_tokens:>10,} ${naive_tokens * COST_PER_TOKEN:>9.4f}"
+        )
+        print(
+            f"  {'Token0 (dedup + scene detect)':<35s} {len(optimized_images):>8d} "
+            f"{token0_tokens:>10,} ${token0_tokens * COST_PER_TOKEN:>9.4f}"
+        )
+
+        # Savings
+        savings_vs_30fps = (1 - token0_tokens / full_naive_tokens) * 100
+        savings_vs_1fps = (1 - token0_tokens / naive_tokens) * 100 if naive_tokens > 0 else 0
+
+        print()
+        print(f"  Savings vs 30fps:               {savings_vs_30fps:.1f}%")
+        print(f"  Savings vs naive 1fps:          {savings_vs_1fps:.1f}%")
+        print(f"  Processing time:                {token0_time * 1000:.0f}ms")
+
+    # Scale projections
+    print(f"\n  {'='*70}")
+    print("  SCALE PROJECTIONS (GPT-4o, 60s videos)")
+    print(f"  {'='*70}")
+    print()
+
+    # Use 60s video stats
+    path = _create_benchmark_video(duration_seconds=60, num_scenes=8)
+    _, stats = process_video(path, prompt="describe this video", max_frames=32)
+
+    frames_per_video_naive = 60  # 1fps
+    frames_per_video_token0 = stats["frames_selected"]
+
+    for daily_videos in [100, 1_000, 10_000]:
+        naive_daily_tokens = daily_videos * frames_per_video_naive * TOKENS_PER_FRAME_HIGH
+        token0_daily_tokens = daily_videos * frames_per_video_token0 * TOKENS_PER_FRAME_HIGH
+        naive_monthly = naive_daily_tokens * 30 * COST_PER_TOKEN
+        token0_monthly = token0_daily_tokens * 30 * COST_PER_TOKEN
+        savings = naive_monthly - token0_monthly
+
+        print(f"  {daily_videos:,} videos/day:")
+        print(f"    Naive 1fps:   ${naive_monthly:>10,.2f}/month")
+        print(f"    Token0:       ${token0_monthly:>10,.2f}/month")
+        print(f"    Savings:      ${savings:>10,.2f}/month")
+        print()
+
+    print(f"  Token0 frames per 60s video: {frames_per_video_token0}")
+    print(f"  Reduction: {frames_per_video_naive} → {frames_per_video_token0} frames")
+    print(f"  {'='*70}")
+
+
+if __name__ == "__main__":
+    run_benchmark()
diff --git a/benchmarks/bench_video_models.py b/benchmarks/bench_video_models.py
new file mode 100644
index 0000000..1296f01
--- /dev/null
+++ b/benchmarks/bench_video_models.py
@@ -0,0 +1,344 @@
+"""Benchmark: Video optimization against real Ollama vision models.
+
+Creates test videos from real images, then compares:
+  - Naive: send all frames at 1fps directly to the model
+  - Token0: extract keyframes, dedup, optimize, then send
+
+Usage:
+    python -m benchmarks.bench_video_models
+    python -m benchmarks.bench_video_models --model moondream
+    python -m benchmarks.bench_video_models --model llava:7b --model minicpm-v
+"""
+
+import argparse
+import asyncio
+import base64
+import io
+import os
+import tempfile
+import time
+
+import cv2
+import numpy as np
+from PIL import Image
+
+from token0.optimization.analyzer import analyze_image
+from token0.optimization.router import plan_optimization
+from token0.optimization.transformer import transform_image
+from token0.optimization.video import (
+    deduplicate_frames,
+    detect_scene_changes,
+    extract_frames,
+    process_video,
+)
+from token0.providers.ollama import OllamaProvider
+
+BENCHMARK_DIR = os.path.dirname(os.path.abspath(__file__))
+IMAGES_DIR = os.path.join(BENCHMARK_DIR, "images")
+REAL_DIR = os.path.join(IMAGES_DIR, "real")
+
+DEFAULT_MODELS = ["moondream", "llava:7b", "llava-llama3", "minicpm-v"]
+
+
+def _pil_to_data_uri(img: Image.Image, quality: int = 85) -> str:
+    buf = io.BytesIO()
+    img.save(buf, format="JPEG", quality=quality)
+    b64 = base64.b64encode(buf.getvalue()).decode()
+    return f"data:image/jpeg;base64,{b64}"
+
+
+def _create_video_from_images(
+    image_paths: list[str],
+    frames_per_image: int = 30,
+    fps: float = 30.0,
+    noise_level: int = 8,
+) -> str:
+    """Create a video by cycling through real images with slight variation.
+
+    Each image becomes a 'scene' lasting frames_per_image/fps seconds.
+    Slight noise is added per frame to simulate camera jitter.
+    """
+    images = []
+    for path in image_paths:
+        img = Image.open(path).convert("RGB").resize((640, 480))
+        images.append(np.array(img))
+
+    tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
+    writer = cv2.VideoWriter(tmp.name, fourcc, fps, (640, 480))
+
+    for scene_idx, base_pixels in enumerate(images):
+        # Convert RGB to BGR for OpenCV
+        base_bgr = cv2.cvtColor(base_pixels, cv2.COLOR_RGB2BGR)
+        for frame_idx in range(frames_per_image):
+            rng = np.random.RandomState(seed=scene_idx * 10000 + frame_idx)
+            noise = rng.randint(-noise_level, noise_level + 1, base_bgr.shape, dtype=np.int16)
+            frame = np.clip(base_bgr.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+            writer.write(frame)
+
+    writer.release()
+    total_frames = len(images) * frames_per_image
+    duration = total_frames / fps
+    print(f"  Created video: {len(images)} scenes, {total_frames} frames, {duration:.1f}s")
+    return tmp.name
+
+
+def _optimize_frame(pil_image: Image.Image, model: str) -> tuple[dict, list[str], int, int]:
+    """Run token0 optimization on a single frame. Returns (message_part, optimizations, before, after)."""
+    data_uri = _pil_to_data_uri(pil_image)
+    analysis, raw_bytes, pil_img = analyze_image(data_uri)
+    plan = plan_optimization(analysis, model)
+
+    if plan.use_ocr_route:
+        result = transform_image(plan, analysis, raw_bytes, pil_img)
+        part = {"type": "text", "text": f"[Extracted text from video frame]:\n{result['content']}"}
+    elif any([plan.resize, plan.recompress_jpeg, plan.force_detail_low]):
+        result = transform_image(plan, analysis, raw_bytes, pil_img)
+        detail = "low" if plan.force_detail_low else "auto"
+        part = {
+            "type": "image_url",
+            "image_url": {
+                "url": f"data:{result['media_type']};base64,{result['base64']}",
+                "detail": detail,
+            },
+        }
+    else:
+        part = {"type": "image_url", "image_url": {"url": data_uri, "detail": "auto"}}
+
+    return part, plan.reasons, plan.estimated_tokens_before, plan.estimated_tokens_after
+
+
+async def run_video_benchmark(
+    model: str, provider: OllamaProvider, video_path: str, video_name: str, prompt: str
+):
+    """Run a single video benchmark: naive vs token0 optimized."""
+    print(f'\n  [{video_name}] — "{prompt}"')
+
+    # --- Naive approach: extract at 1fps, send all frames ---
+    naive_frames = extract_frames(video_path, fps=1.0, max_frames=60)
+    print(f"    Naive: {len(naive_frames)} frames at 1fps")
+
+    # Build naive message (all frames as images)
+    naive_parts = [{"type": "text", "text": prompt}]
+    for _, frame in naive_frames:
+        naive_parts.append(
+            {
+                "type": "image_url",
+                "image_url": {"url": _pil_to_data_uri(frame), "detail": "auto"},
+            }
+        )
+
+    naive_messages = [{"role": "user", "content": naive_parts}]
+
+    print("    Sending naive to model...", end="", flush=True)
+    start = time.time()
+    try:
+        naive_resp = await provider.chat_completion(
+            model=model, messages=naive_messages, max_tokens=200
+        )
+        naive_latency = int((time.time() - start) * 1000)
+        print(f" {naive_latency}ms, {naive_resp.prompt_tokens} prompt tokens")
+    except Exception as e:
+        print(f" ERROR: {e}")
+        return None
+
+    # --- Token0 approach: process_video + per-frame optimization ---
+    optimized_images, video_stats = process_video(video_path, prompt=prompt, max_frames=32)
+    print(
+        f"    Token0: {video_stats['total_video_frames']} total → "
+        f"{video_stats['frames_extracted_at_1fps']} extracted → "
+        f"{video_stats['frames_after_dedup']} deduped → "
+        f"{video_stats['frames_selected']} selected"
+    )
+
+    # Optimize each selected frame through image pipeline
+    opt_parts = [{"type": "text", "text": prompt}]
+    all_optimizations = [
+        f"video: {video_stats['total_video_frames']} → {video_stats['frames_selected']} keyframes"
+    ]
+    total_before = 0
+    total_after = 0
+
+    for frame_img in optimized_images:
+        part, reasons, before, after = _optimize_frame(frame_img, model)
+        opt_parts.append(part)
+        all_optimizations.extend(reasons)
+        total_before += before
+        total_after += after
+
+    opt_messages = [{"role": "user", "content": opt_parts}]
+
+    print("    Sending optimized to model...", end="", flush=True)
+    start = time.time()
+    try:
+        opt_resp = await provider.chat_completion(
+            model=model, messages=opt_messages, max_tokens=200
+        )
+        opt_latency = int((time.time() - start) * 1000)
+        print(f" {opt_latency}ms, {opt_resp.prompt_tokens} prompt tokens")
+    except Exception as e:
+        print(f" ERROR: {e}")
+        return None
+
+    # Calculate savings
+    saved = naive_resp.prompt_tokens - opt_resp.prompt_tokens
+    pct = (saved / naive_resp.prompt_tokens * 100) if naive_resp.prompt_tokens > 0 else 0
+
+    return {
+        "video": video_name,
+        "prompt": prompt,
+        "naive_frames": len(naive_frames),
+        "token0_frames": len(optimized_images),
+        "naive_prompt_tokens": naive_resp.prompt_tokens,
+        "token0_prompt_tokens": opt_resp.prompt_tokens,
+        "tokens_saved": saved,
+        "savings_pct": round(pct, 1),
+        "naive_latency_ms": naive_latency,
+        "token0_latency_ms": opt_latency,
+        "latency_delta_ms": opt_latency - naive_latency,
+        "optimizations": all_optimizations,
+        "video_stats": video_stats,
+    }
+
+
+async def run_all_benchmarks(models: list[str]):
+    provider = OllamaProvider(base_url="http://localhost:11434/v1")
+
+    # --- Create test videos from real images ---
+    print("=" * 80)
+    print("  Video Optimization Benchmark — Real Models")
+    print("=" * 80)
+
+    # Check which real images exist
+    real_images = []
+    for name in [
+        "photo_nature.jpg",
+        "photo_street.jpg",
+        "receipt_real.jpg",
+        "document_invoice.png",
+        "screenshot_real.png",
+    ]:
+        path = os.path.join(REAL_DIR, name)
+        if os.path.exists(path):
+            real_images.append(path)
+
+    if not real_images:
+        print("  ERROR: No real images found in benchmarks/images/real/")
+        return
+
+    print(f"\n  Found {len(real_images)} real images for video creation")
+
+    # Video 1: Product showcase (nature + street photos, ~4s)
+    video1_path = _create_video_from_images(
+        [p for p in real_images if "photo" in p][:2],
+        frames_per_image=60,
+        fps=30.0,
+    )
+
+    # Video 2: Document scanning (receipt + invoice + screenshot, ~3s)
+    doc_images = [p for p in real_images if "receipt" in p or "document" in p or "screenshot" in p][
+        :3
+    ]
+    video2_path = _create_video_from_images(
+        doc_images,
+        frames_per_image=30,
+        fps=30.0,
+    )
+
+    # Video 3: Mixed content (all images, ~5s)
+    video3_path = _create_video_from_images(
+        real_images[:5],
+        frames_per_image=30,
+        fps=30.0,
+    )
+
+    test_cases = [
+        (video1_path, "photos_4s", "Describe what you see in this video"),
+        (video2_path, "documents_3s", "Extract any text or numbers visible in this video"),
+        (video3_path, "mixed_5s", "Summarize the content shown in this video"),
+    ]
+
+    # --- Also check for real video files in benchmarks/videos/ ---
+    videos_dir = os.path.join(BENCHMARK_DIR, "videos")
+    if os.path.isdir(videos_dir):
+        for fname in sorted(os.listdir(videos_dir)):
+            if fname.lower().endswith((".mp4", ".avi", ".mov", ".mkv", ".webm")):
+                vpath = os.path.join(videos_dir, fname)
+                vname = os.path.splitext(fname)[0]
+                test_cases.append((vpath, f"real_{vname}", "Describe what happens in this video"))
+                print(f"  Found real video: {fname}")
+
+
+    all_results = {}
+
+    for model in models:
+        print(f"\n{'=' * 80}")
+        print(f"  Model: {model}")
+        print(f"{'=' * 80}")
+
+        model_results = []
+        for video_path, video_name, prompt in test_cases:
+            result = await run_video_benchmark(model, provider, video_path, video_name, prompt)
+            if result:
+                model_results.append(result)
+
+        all_results[model] = model_results
+
+        # Print model summary
+        if model_results:
+            total_naive = sum(r["naive_prompt_tokens"] for r in model_results)
+            total_token0 = sum(r["token0_prompt_tokens"] for r in model_results)
+            total_saved = total_naive - total_token0
+            total_pct = (total_saved / total_naive * 100) if total_naive > 0 else 0
+
+            print(f"\n  --- {model} Summary ---")
+            print(
+                f"  {'Video':<20s} {'Naive Frames':>12s} {'T0 Frames':>10s} {'Naive Tokens':>13s} {'T0 Tokens':>10s} {'Saved':>8s}"
+            )
+            print(f"  {'-' * 20} {'-' * 12} {'-' * 10} {'-' * 13} {'-' * 10} {'-' * 8}")
+            for r in model_results:
+                print(
+                    f"  {r['video']:<20s} {r['naive_frames']:>12d} {r['token0_frames']:>10d} "
+                    f"{r['naive_prompt_tokens']:>13,} {r['token0_prompt_tokens']:>10,} {r['savings_pct']:>7.1f}%"
+                )
+            print(
+                f"  {'TOTAL':<20s} {'':>12s} {'':>10s} {total_naive:>13,} {total_token0:>10,} {total_pct:>7.1f}%"
+            )
+
+    # --- Grand summary across all models ---
+    print(f"\n{'=' * 80}")
+    print("  GRAND SUMMARY — All Models")
+    print(f"{'=' * 80}")
+    print(
+        f"\n  {'Model':<20s} {'Naive Tokens':>13s} {'T0 Tokens':>13s} {'Saved':>13s} {'Savings':>8s}"
+    )
+    print(f"  {'-' * 20} {'-' * 13} {'-' * 13} {'-' * 13} {'-' * 8}")
+
+    for model, results in all_results.items():
+        if results:
+            total_naive = sum(r["naive_prompt_tokens"] for r in results)
+            total_token0 = sum(r["token0_prompt_tokens"] for r in results)
+            total_saved = total_naive - total_token0
+            pct = (total_saved / total_naive * 100) if total_naive > 0 else 0
+            print(
+                f"  {model:<20s} {total_naive:>13,} {total_token0:>13,} {total_saved:>13,} {pct:>7.1f}%"
+            )
+
+    print(f"\n  {'=' * 80}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Video optimization benchmark against Ollama models"
+    )
+    parser.add_argument(
+        "--model", action="append", help="Ollama model(s) to test (can specify multiple)"
+    )
+    args = parser.parse_args()
+
+    models = args.model or DEFAULT_MODELS
+    asyncio.run(run_all_benchmarks(models))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/pyproject.toml b/pyproject.toml
index ce17fdc..20a9674 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "token0"
-version = "0.2.1"
+version = "0.3.0"
 description = "Open-source API proxy that makes vision LLM calls 5-10x cheaper"
 readme = "README.md"
 license = "Apache-2.0"
diff --git a/tests/test_cache.py b/tests/test_cache.py
new file mode 100644
index 0000000..99a806e
--- /dev/null
+++ b/tests/test_cache.py
@@ -0,0 +1,290 @@
+"""Tests for semantic cache with QJL fuzzy matching."""
+
+import numpy as np
+import pytest
+from PIL import Image
+
+from token0.optimization.cache import (
+    _hamming_distance,
+    _image_hash,
+    _jl_compress,
+    _prompt_hash,
+    clear_fuzzy_index,
+    get_cached_response,
+    get_fuzzy_index_size,
+    make_cache_key,
+    set_cached_response,
+)
+from token0.storage.redis import MemoryCache
+
+
+@pytest.fixture(autouse=True)
+def _init_cache():
+    """Initialize in-memory cache and clear fuzzy index for each test."""
+    import token0.storage.redis as redis_mod
+
+    redis_mod._memory_cache.clear()
+    redis_mod.pool = MemoryCache()
+    clear_fuzzy_index()
+    yield
+    clear_fuzzy_index()
+    redis_mod._memory_cache.clear()
+
+
+def _make_pil_image(width=800, height=600, color="red"):
+    """Create a PIL Image directly."""
+    return Image.new("RGB", (width, height), color=color)
+
+
+def _make_gradient_image(width=800, height=600, seed=0):
+    """Create a gradient image with unique content (not solid color)."""
+    rng = np.random.RandomState(seed=seed)
+    pixels = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
+    return Image.fromarray(pixels)
+
+
+def _make_slightly_different_image(base_seed=0, noise_level=10):
+    """Create an image that's visually similar but not identical to a gradient."""
+    base = _make_gradient_image(seed=base_seed)
+    pixels = np.array(base)
+    rng = np.random.RandomState(seed=123)
+    noise = rng.randint(-noise_level, noise_level + 1, pixels.shape, dtype=np.int16)
+    noisy = np.clip(pixels.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+    return Image.fromarray(noisy)
+
+
+class TestPerceptualHash:
+    def test_identical_images_same_hash(self):
+        """Identical images produce the same hash."""
+        img1 = _make_pil_image(800, 600, "red")
+        img2 = _make_pil_image(800, 600, "red")
+        assert _image_hash(img1) == _image_hash(img2)
+
+    def test_different_images_different_hash(self):
+        """Completely different images produce different hashes."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_gradient_image(seed=99)
+        assert _image_hash(img1) != _image_hash(img2)
+
+    def test_hash_is_hex_string(self):
+        """Hash is a valid hex string of expected length."""
+        img = _make_pil_image()
+        h = _image_hash(img)
+        assert isinstance(h, str)
+        int(h, 16)  # should not raise
+        assert len(h) == 64  # 256 bits / 4 = 64 hex chars
+
+
+class TestJLCompression:
+    def test_signature_size(self):
+        """Compressed signature is exactly 16 bytes (128 bits)."""
+        img = _make_pil_image()
+        h = _image_hash(img)
+        sig = _jl_compress(h)
+        assert isinstance(sig, bytes)
+        assert len(sig) == 16
+
+    def test_identical_hashes_same_signature(self):
+        """Same hash always produces same signature (deterministic)."""
+        img = _make_pil_image()
+        h = _image_hash(img)
+        sig1 = _jl_compress(h)
+        sig2 = _jl_compress(h)
+        assert sig1 == sig2
+
+    def test_similar_images_close_signatures(self):
+        """Visually similar images have small Hamming distance."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_slightly_different_image(base_seed=0, noise_level=5)
+
+        h1 = _image_hash(img1)
+        h2 = _image_hash(img2)
+        sig1 = _jl_compress(h1)
+        sig2 = _jl_compress(h2)
+
+        dist = _hamming_distance(sig1, sig2)
+        # Similar images should have small distance
+        assert dist < 30  # well under half of 128
+
+    def test_different_images_far_signatures(self):
+        """Very different images have large Hamming distance."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_gradient_image(seed=99)
+
+        sig1 = _jl_compress(_image_hash(img1))
+        sig2 = _jl_compress(_image_hash(img2))
+
+        dist = _hamming_distance(sig1, sig2)
+        # Very different images should be far apart
+        assert dist > 20
+
+
+class TestHammingDistance:
+    def test_identical_signatures_zero_distance(self):
+        """Identical signatures have distance 0."""
+        sig = b"\xff" * 16
+        assert _hamming_distance(sig, sig) == 0
+
+    def test_completely_different_max_distance(self):
+        """Opposite signatures have distance 128."""
+        sig_a = b"\x00" * 16
+        sig_b = b"\xff" * 16
+        assert _hamming_distance(sig_a, sig_b) == 128
+
+    def test_one_bit_flip(self):
+        """Single bit difference = distance 1."""
+        sig_a = b"\x00" * 16
+        sig_b = b"\x01" + b"\x00" * 15
+        assert _hamming_distance(sig_a, sig_b) == 1
+
+
+class TestExactCacheMatch:
+    @pytest.mark.asyncio
+    async def test_exact_match_hit(self):
+        """Exact key match returns cached response."""
+        img = _make_pil_image()
+        key = make_cache_key(img, "describe this", "gpt-4o")
+        response = {"model": "gpt-4o", "content": "A red rectangle"}
+
+        await set_cached_response(key, response)
+        result = await get_cached_response(key)
+
+        assert result is not None
+        assert result["content"] == "A red rectangle"
+
+    @pytest.mark.asyncio
+    async def test_exact_match_miss(self):
+        """Non-existent key returns None."""
+        img = _make_pil_image()
+        key = make_cache_key(img, "describe this", "gpt-4o")
+        result = await get_cached_response(key)
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_different_prompts_no_match(self):
+        """Different prompts don't match even for same image."""
+        img = _make_pil_image()
+        key1 = make_cache_key(img, "describe this", "gpt-4o")
+        key2 = make_cache_key(img, "classify this", "gpt-4o")
+
+        await set_cached_response(key1, {"content": "description"})
+        result = await get_cached_response(key2)
+
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_different_models_no_match(self):
+        """Different models don't share cache."""
+        img = _make_pil_image()
+        key1 = make_cache_key(img, "describe this", "gpt-4o")
+        key2 = make_cache_key(img, "describe this", "claude-sonnet-4-6")
+
+        await set_cached_response(key1, {"content": "gpt response"})
+        result = await get_cached_response(key2)
+
+        assert result is None
+
+
+class TestFuzzyCacheMatch:
+    @pytest.mark.asyncio
+    async def test_fuzzy_match_similar_image(self):
+        """Similar images hit cache via fuzzy matching."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_slightly_different_image(base_seed=0, noise_level=5)
+        prompt = "describe this image"
+
+        # Cache response for img1
+        key1 = make_cache_key(img1, prompt, "gpt-4o")
+        await set_cached_response(key1, {"content": "A gradient image"})
+
+        # Query with img2 (similar but not identical)
+        key2 = make_cache_key(img2, prompt, "gpt-4o")
+
+        # If hashes are identical, it's an exact match (still valid)
+        # If hashes differ, fuzzy should find it
+        result = await get_cached_response(key2)
+        assert result is not None
+        assert result["content"] == "A gradient image"
+
+    @pytest.mark.asyncio
+    async def test_fuzzy_no_match_very_different(self):
+        """Very different images don't fuzzy match."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_gradient_image(seed=99)
+        prompt = "describe this"
+
+        key1 = make_cache_key(img1, prompt, "gpt-4o")
+        await set_cached_response(key1, {"content": "image 1"})
+
+        key2 = make_cache_key(img2, prompt, "gpt-4o")
+        result = await get_cached_response(key2)
+
+        # Very different images should not fuzzy match
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_fuzzy_requires_same_prompt(self):
+        """Fuzzy match only works for same prompt, different image."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_slightly_different_image(base_seed=0, noise_level=5)
+
+        key1 = make_cache_key(img1, "describe this", "gpt-4o")
+        await set_cached_response(key1, {"content": "A gradient image"})
+
+        # Different prompt — should NOT match even with similar image
+        key2 = make_cache_key(img2, "classify this", "gpt-4o")
+        result = await get_cached_response(key2)
+        assert result is None
+
+    @pytest.mark.asyncio
+    async def test_fuzzy_disabled(self):
+        """fuzzy=False skips fuzzy matching."""
+        img1 = _make_gradient_image(seed=0)
+        img2 = _make_slightly_different_image(base_seed=0, noise_level=5)
+        prompt = "describe this"
+
+        key1 = make_cache_key(img1, prompt, "gpt-4o")
+        await set_cached_response(key1, {"content": "cached"})
+
+        key2 = make_cache_key(img2, prompt, "gpt-4o")
+
+        # If keys are different, fuzzy=False should miss
+        if key1 != key2:
+            result = await get_cached_response(key2, fuzzy=False)
+            assert result is None
+
+    @pytest.mark.asyncio
+    async def test_fuzzy_index_tracks_entries(self):
+        """Fuzzy index grows as entries are cached."""
+        assert get_fuzzy_index_size() == 0
+
+        img = _make_pil_image()
+        key = make_cache_key(img, "test", "gpt-4o")
+        await set_cached_response(key, {"content": "test"})
+
+        assert get_fuzzy_index_size() == 1
+
+    @pytest.mark.asyncio
+    async def test_fuzzy_index_no_duplicates(self):
+        """Re-caching same key doesn't duplicate in fuzzy index."""
+        img = _make_pil_image()
+        key = make_cache_key(img, "test", "gpt-4o")
+
+        await set_cached_response(key, {"content": "v1"})
+        await set_cached_response(key, {"content": "v2"})
+
+        assert get_fuzzy_index_size() == 1
+
+
+class TestPromptHash:
+    def test_normalized_whitespace(self):
+        """Extra whitespace is normalized."""
+        assert _prompt_hash("  hello   world  ") == _prompt_hash("hello world")
+
+    def test_case_insensitive(self):
+        """Case is normalized."""
+        assert _prompt_hash("Hello World") == _prompt_hash("hello world")
+
+    def test_different_prompts_different_hash(self):
+        """Different prompts produce different hashes."""
+        assert _prompt_hash("describe this") != _prompt_hash("classify this")
diff --git a/tests/test_video.py b/tests/test_video.py
new file mode 100644
index 0000000..d76b68b
--- /dev/null
+++ b/tests/test_video.py
@@ -0,0 +1,280 @@
+"""Tests for video optimization pipeline."""
+
+import tempfile
+
+import cv2
+import numpy as np
+from PIL import Image
+
+from token0.optimization.video import (
+    deduplicate_frames,
+    detect_scene_changes,
+    extract_frames,
+    process_video,
+    score_frames_by_relevance,
+)
+
+
+def _make_test_video(
+    num_frames: int = 30,
+    fps: float = 30.0,
+    width: int = 320,
+    height: int = 240,
+    scene_changes: list[int] | None = None,
+) -> str:
+    """Create a test video file. Returns path to temp file.
+
+    scene_changes: list of frame indices where the scene changes (color shifts).
+    """
+    tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
+    writer = cv2.VideoWriter(tmp.name, fourcc, fps, (width, height))
+
+    if scene_changes is None:
+        scene_changes = []
+
+    current_color = np.array([50, 100, 150], dtype=np.uint8)
+
+    for i in range(num_frames):
+        if i in scene_changes:
+            # Dramatic color shift for scene change
+            current_color = np.random.RandomState(seed=i).randint(0, 256, 3).astype(np.uint8)
+
+        # Create frame with slight variation (simulates real video)
+        frame = np.full((height, width, 3), current_color, dtype=np.uint8)
+        # Add minor noise so frames aren't perfectly identical
+        noise = np.random.RandomState(seed=i).randint(-3, 4, frame.shape, dtype=np.int16)
+        frame = np.clip(frame.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+        writer.write(frame)
+
+    writer.release()
+    return tmp.name
+
+
+def _make_diverse_video(num_frames: int = 60, fps: float = 30.0) -> str:
+    """Create a video with clearly different scenes."""
+    tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
+    writer = cv2.VideoWriter(tmp.name, fourcc, fps, (320, 240))
+
+    for i in range(num_frames):
+        rng = np.random.RandomState(seed=i // 10)  # same scene for 10 frames
+        frame = rng.randint(0, 256, (240, 320, 3), dtype=np.uint8)
+        # Add per-frame noise
+        noise = np.random.RandomState(seed=i * 100).randint(-5, 6, frame.shape, dtype=np.int16)
+        frame = np.clip(frame.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+        writer.write(frame)
+
+    writer.release()
+    return tmp.name
+
+
+class TestFrameExtraction:
+    def test_extract_frames_basic(self):
+        """Extract frames from a simple video."""
+        path = _make_test_video(num_frames=30, fps=30.0)
+        frames = extract_frames(path, fps=1.0)
+        # 30 frames at 30fps = 1 second → 1 frame at 1fps
+        assert len(frames) >= 1
+        assert all(isinstance(f[1], Image.Image) for f in frames)
+
+    def test_extract_frames_timestamps(self):
+        """Timestamps should be monotonically increasing."""
+        path = _make_test_video(num_frames=90, fps=30.0)
+        frames = extract_frames(path, fps=1.0)
+        timestamps = [t for t, _ in frames]
+        assert timestamps == sorted(timestamps)
+
+    def test_extract_frames_respects_max(self):
+        """Should not exceed max_frames."""
+        path = _make_test_video(num_frames=300, fps=30.0)
+        frames = extract_frames(path, fps=10.0, max_frames=5)
+        assert len(frames) <= 5
+
+    def test_extract_frames_pil_format(self):
+        """Extracted frames should be RGB PIL images."""
+        path = _make_test_video(num_frames=30, fps=30.0)
+        frames = extract_frames(path, fps=1.0)
+        for _, frame in frames:
+            assert isinstance(frame, Image.Image)
+            assert frame.mode == "RGB"
+
+    def test_extract_frames_higher_fps(self):
+        """Higher extraction fps should yield more frames."""
+        path = _make_test_video(num_frames=90, fps=30.0)
+        frames_1fps = extract_frames(path, fps=1.0, max_frames=100)
+        frames_5fps = extract_frames(path, fps=5.0, max_frames=100)
+        assert len(frames_5fps) > len(frames_1fps)
+
+
+class TestDeduplication:
+    def test_dedup_removes_similar_frames(self):
+        """Near-identical frames should be collapsed."""
+        path = _make_test_video(num_frames=90, fps=30.0)
+        frames = extract_frames(path, fps=10.0, max_frames=100)
+        deduped = deduplicate_frames(frames)
+        # Static video with tiny noise — most frames should be deduped
+        assert len(deduped) < len(frames)
+
+    def test_dedup_keeps_different_scenes(self):
+        """Frames from different scenes should be kept."""
+        path = _make_test_video(
+            num_frames=60,
+            fps=30.0,
+            scene_changes=[15, 30, 45],
+        )
+        frames = extract_frames(path, fps=2.0, max_frames=100)
+        deduped = deduplicate_frames(frames)
+        # Should keep at least one frame per scene (4 scenes)
+        assert len(deduped) >= 3
+
+    def test_dedup_empty_input(self):
+        """Empty input returns empty output."""
+        assert deduplicate_frames([]) == []
+
+    def test_dedup_single_frame(self):
+        """Single frame is always kept."""
+        img = Image.new("RGB", (320, 240), color="red")
+        result = deduplicate_frames([(0.0, img)])
+        assert len(result) == 1
+
+    def test_dedup_preserves_order(self):
+        """Deduplicated frames should maintain chronological order."""
+        path = _make_test_video(
+            num_frames=60,
+            fps=30.0,
+            scene_changes=[20, 40],
+        )
+        frames = extract_frames(path, fps=2.0, max_frames=100)
+        deduped = deduplicate_frames(frames)
+        timestamps = [t for t, _ in deduped]
+        assert timestamps == sorted(timestamps)
+
+
+class TestSceneDetection:
+    def test_scene_detection_finds_changes(self):
+        """Should detect scene boundaries."""
+        path = _make_test_video(
+            num_frames=90,
+            fps=30.0,
+            scene_changes=[30, 60],
+        )
+        frames = extract_frames(path, fps=2.0, max_frames=100)
+        scene_frames = detect_scene_changes(frames)
+        # Should keep frames around scene changes
+        assert len(scene_frames) >= 3  # at least first + 2 scene changes
+
+    def test_scene_detection_static_video(self):
+        """Static video should return few frames."""
+        path = _make_test_video(num_frames=90, fps=30.0)
+        frames = extract_frames(path, fps=2.0, max_frames=100)
+        scene_frames = detect_scene_changes(frames)
+        # Static = mostly same scene, should return very few
+        assert len(scene_frames) <= len(frames)
+
+    def test_scene_detection_preserves_first_last(self):
+        """First and last frames should always be kept."""
+        path = _make_test_video(num_frames=90, fps=30.0)
+        frames = extract_frames(path, fps=2.0, max_frames=100)
+        if len(frames) >= 3:
+            scene_frames = detect_scene_changes(frames)
+            assert scene_frames[0][0] == frames[0][0]  # first frame kept
+            assert scene_frames[-1][0] == frames[-1][0]  # last frame kept
+
+
+class TestCLIPScoring:
+    def test_scoring_without_clip(self):
+        """Without CLIP, should return uniform scores."""
+        frames = [
+            (0.0, Image.new("RGB", (320, 240), "red")),
+            (1.0, Image.new("RGB", (320, 240), "blue")),
+        ]
+        scored = score_frames_by_relevance(frames, "test prompt")
+        assert len(scored) == 2
+        # All scores should be 1.0 (uniform fallback) or actual CLIP scores
+        assert all(len(s) == 3 for s in scored)
+
+    def test_scoring_empty_input(self):
+        """Empty input returns empty output."""
+        assert score_frames_by_relevance([], "test") == []
+
+    def test_scoring_preserves_timestamps(self):
+        """Output should maintain chronological order."""
+        frames = [
+            (0.0, Image.new("RGB", (320, 240), "red")),
+            (1.0, Image.new("RGB", (320, 240), "blue")),
+            (2.0, Image.new("RGB", (320, 240), "green")),
+        ]
+        scored = score_frames_by_relevance(frames, "test")
+        timestamps = [s[0] for s in scored]
+        assert timestamps == sorted(timestamps)
+
+
+class TestProcessVideo:
+    def test_full_pipeline(self):
+        """Full pipeline returns images and stats."""
+        path = _make_test_video(num_frames=90, fps=30.0, scene_changes=[30, 60])
+        images, stats = process_video(path, prompt="describe this video")
+
+        assert len(images) > 0
+        assert all(isinstance(img, Image.Image) for img in images)
+        assert "total_video_frames" in stats
+        assert "frames_selected" in stats
+        assert "frame_reduction_pct" in stats
+        assert stats["frames_selected"] <= stats["frames_extracted_at_1fps"]
+
+    def test_pipeline_reduces_frames(self):
+        """Pipeline should reduce frame count significantly."""
+        path = _make_test_video(num_frames=150, fps=30.0)
+        images, stats = process_video(path, prompt="what is happening?")
+
+        # 150 frames at 30fps = 5 seconds
+        # At 1fps extraction = ~5 frames, dedup should collapse further
+        assert stats["frames_selected"] < stats["total_video_frames"]
+        assert stats["frame_reduction_pct"] > 50.0
+
+    def test_pipeline_diverse_video(self):
+        """Diverse video should keep more frames than static."""
+        static_path = _make_test_video(num_frames=90, fps=30.0)
+        diverse_path = _make_diverse_video(num_frames=90, fps=30.0)
+
+        static_images, static_stats = process_video(static_path)
+        diverse_images, diverse_stats = process_video(diverse_path)
+
+        # Diverse video should retain more frames
+        assert diverse_stats["frames_selected"] >= static_stats["frames_selected"]
+
+    def test_pipeline_stats_structure(self):
+        """Stats dict should have all expected keys."""
+        path = _make_test_video(num_frames=30, fps=30.0)
+        _, stats = process_video(path)
+
+        expected_keys = [
+            "video_duration_seconds",
+            "video_fps",
+            "total_video_frames",
+            "frames_extracted_at_1fps",
+            "frames_after_dedup",
+            "frames_after_scene_detection",
+            "frames_selected",
+            "frame_reduction_pct",
+            "clip_used",
+        ]
+        for key in expected_keys:
+            assert key in stats, f"Missing key: {key}"
+
+    def test_pipeline_max_frames_cap(self):
+        """Pipeline should respect max_frames cap."""
+        path = _make_diverse_video(num_frames=300, fps=30.0)
+        images, stats = process_video(path, max_frames=5)
+        assert len(images) <= 5
+
+    def test_pipeline_empty_video(self):
+        """Empty/invalid video should return empty results gracefully."""
+        tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+        tmp.write(b"not a video")
+        tmp.flush()
+
+        images, stats = process_video(tmp.name)
+        assert len(images) == 0
+        assert stats["frames_selected"] == 0
diff --git a/token0/api/v1/chat.py b/token0/api/v1/chat.py
index bd0fbf5..83ded00 100644
--- a/token0/api/v1/chat.py
+++ b/token0/api/v1/chat.py
@@ -33,6 +33,7 @@
     plan_optimization,
 )
 from token0.optimization.transformer import transform_image
+from token0.optimization.video import process_video
 from token0.providers.anthropic import AnthropicProvider
 from token0.providers.base import BaseProvider, get_cost_per_token
 from token0.providers.google import GoogleProvider
@@ -100,6 +101,81 @@ def _optimize_messages(request: ChatRequest, prompt_detail: str):
         for part in msg.content:
             if part.type == "text":
                 optimized_parts.append({"type": "text", "text": part.text})
+            elif part.type == "video_url" and part.video_url and request.token0_optimize:
+                # Video optimization: extract keyframes, dedup, optimize each
+                prompt_text = extract_prompt_text(request.messages)
+                video_frames, video_stats = process_video(
+                    part.video_url.url,
+                    prompt=prompt_text,
+                )
+                optimizations_applied.append(
+                    f"video: {video_stats['total_video_frames']} frames → "
+                    f"{video_stats['frames_selected']} keyframes "
+                    f"({video_stats['frame_reduction_pct']}% reduction)"
+                )
+                # Each keyframe goes through the image optimization pipeline
+                for frame_img in video_frames:
+                    import base64 as b64mod
+                    import io as iomod
+
+                    buf = iomod.BytesIO()
+                    frame_img.save(buf, format="JPEG", quality=85)
+                    frame_bytes = buf.getvalue()
+                    frame_b64 = b64mod.b64encode(frame_bytes).decode()
+                    frame_data = f"data:image/jpeg;base64,{frame_b64}"
+
+                    frame_analysis, frame_raw, frame_pil = analyze_image(frame_data)
+                    if first_pil_image is None:
+                        first_pil_image = frame_pil
+
+                    frame_plan = plan_optimization(
+                        frame_analysis,
+                        request.model,
+                        detail_override=request.token0_detail_override,
+                        prompt_detail=prompt_detail,
+                        enable_cascade=request.token0_enable_cascade,
+                    )
+                    plans.append(frame_plan)
+                    total_tokens_before += frame_plan.estimated_tokens_before
+                    total_tokens_after += frame_plan.estimated_tokens_after
+
+                    if frame_plan.use_ocr_route:
+                        result = transform_image(frame_plan, frame_analysis, frame_raw, frame_pil)
+                        optimized_parts.append(
+                            {
+                                "type": "text",
+                                "text": f"[Extracted text from video frame]:\n{result['content']}",
+                            }
+                        )
+                    elif any(
+                        [frame_plan.resize, frame_plan.recompress_jpeg, frame_plan.force_detail_low]
+                    ):
+                        result = transform_image(frame_plan, frame_analysis, frame_raw, frame_pil)
+                        detail = "low" if frame_plan.force_detail_low else "auto"
+                        optimized_parts.append(
+                            {
+                                "type": "image_url",
+                                "image_url": {
+                                    "url": f"data:{result['media_type']};base64,{result['base64']}",
+                                    "detail": detail,
+                                },
+                            }
+                        )
+                    else:
+                        optimized_parts.append(
+                            {
+                                "type": "image_url",
+                                "image_url": {
+                                    "url": frame_data,
+                                    "detail": "auto",
+                                },
+                            }
+                        )
+                # Also estimate tokens for all dropped frames (what it would have cost)
+                tokens_per_frame_avg = 765  # GPT-4o high detail estimate
+                dropped_frames = video_stats["total_video_frames"] - video_stats["frames_selected"]
+                total_tokens_before += dropped_frames * tokens_per_frame_avg
+
             elif part.type == "image_url" and part.image_url and request.token0_optimize:
                 image_data = part.image_url.url
                 analysis, raw_bytes, pil_image = analyze_image(image_data)
diff --git a/token0/models/request.py b/token0/models/request.py
index 560a3de..83dc714 100644
--- a/token0/models/request.py
+++ b/token0/models/request.py
@@ -6,10 +6,15 @@ class ImageUrl(BaseModel):
     detail: str | None = None  # "low", "high", "auto"
 
 
+class VideoUrl(BaseModel):
+    url: str
+
+
 class ContentPart(BaseModel):
-    type: str  # "text" or "image_url"
+    type: str  # "text", "image_url", or "video_url"
     text: str | None = None
     image_url: ImageUrl | None = None
+    video_url: VideoUrl | None = None
 
 
 class Message(BaseModel):
diff --git a/token0/optimization/cache.py b/token0/optimization/cache.py
index 2c8e475..692fff8 100644
--- a/token0/optimization/cache.py
+++ b/token0/optimization/cache.py
@@ -3,18 +3,41 @@
 Uses perceptual image hash + prompt text to create a cache key.
 If a similar request was seen before, return the cached response (0 tokens).
 
+v0.3: QJL-compressed fuzzy matching — similar (not just identical) images
+can hit the cache using Hamming distance on compressed binary signatures.
+Inspired by Google's TurboQuant (arXiv 2504.19874) QJL technique.
+
 Works in both lite mode (in-memory dict) and full mode (Redis).
 """
 
 import hashlib
 import json
+import logging
 
+import numpy as np
 from PIL import Image
 
 from token0.storage.redis import get_redis
 
+logger = logging.getLogger("token0.cache")
+
+# ---------------------------------------------------------------------------
+# JL projection matrix — fixed per process, seeded for determinism
+# Projects 256-dim perceptual hash into 128-dim compressed signature
+# ---------------------------------------------------------------------------
+_JL_DIM = 128  # compressed signature size (bits)
+_HASH_SIZE = 16  # perceptual hash grid (16x16 = 256 bits)
+_HASH_DIM = _HASH_SIZE * _HASH_SIZE  # 256
+_HAMMING_THRESHOLD = 18  # max Hamming distance for fuzzy match (~14% of 128)
+
+_rng = np.random.RandomState(seed=42)
+_JL_MATRIX = _rng.randn(_JL_DIM, _HASH_DIM).astype(np.float32)
+
+# In-memory fuzzy index: model -> list of (signature_bytes, cache_key)
+_fuzzy_index: dict[str, list[tuple[bytes, str]]] = {}
 
-def _image_hash(pil_image: Image.Image, hash_size: int = 16) -> str:
+
+def _image_hash(pil_image: Image.Image, hash_size: int = _HASH_SIZE) -> str:
     """Compute a perceptual hash (average hash) for an image.
 
     Perceptual hashes are similar for visually similar images,
@@ -30,6 +53,37 @@ def _image_hash(pil_image: Image.Image, hash_size: int = 16) -> str:
     return hex(int(bits, 2))[2:].zfill(hash_size * hash_size // 4)
 
 
+def _hash_to_vector(hash_hex: str) -> np.ndarray:
+    """Convert a hex perceptual hash to a numpy float vector (+1/-1)."""
+    hash_int = int(hash_hex, 16)
+    bits = format(hash_int, f"0{_HASH_DIM}b")
+    return np.array([1.0 if b == "1" else -1.0 for b in bits], dtype=np.float32)
+
+
+def _jl_compress(hash_hex: str) -> bytes:
+    """Compress a perceptual hash to a QJL binary signature.
+
+    1. Convert hash to +1/-1 vector (256-dim)
+    2. Project through random JL matrix (256 -> 128)
+    3. Take sign bits -> 128-bit signature (16 bytes)
+    """
+    vec = _hash_to_vector(hash_hex)
+    projected = _JL_MATRIX @ vec
+    # Sign quantization: 1 if positive, 0 if negative
+    sign_bits = (projected > 0).astype(np.uint8)
+    # Pack into bytes (128 bits -> 16 bytes)
+    return np.packbits(sign_bits).tobytes()
+
+
+def _hamming_distance(sig_a: bytes, sig_b: bytes) -> int:
+    """Compute Hamming distance between two binary signatures."""
+    a = np.frombuffer(sig_a, dtype=np.uint8)
+    b = np.frombuffer(sig_b, dtype=np.uint8)
+    # XOR and popcount
+    xor = np.bitwise_xor(a, b)
+    return sum(bin(byte).count("1") for byte in xor)
+
+
 def _prompt_hash(prompt: str) -> str:
     """Hash the prompt text. Normalize whitespace and case."""
     normalized = " ".join(prompt.lower().strip().split())
@@ -43,15 +97,67 @@ def make_cache_key(pil_image: Image.Image, prompt: str, model: str) -> str:
     return f"token0:cache:{model}:{img_h}:{prompt_h}"
 
 
-async def get_cached_response(cache_key: str) -> dict | None:
-    """Look up a cached response. Returns parsed dict or None."""
+def _parse_cache_key(cache_key: str) -> tuple[str, str, str]:
+    """Extract (model, img_hash, prompt_hash) from a cache key."""
+    parts = cache_key.split(":")
+    # token0:cache:{model}:{img_hash}:{prompt_hash}
+    return parts[2], parts[3], parts[4]
+
+
+async def get_cached_response(
+    cache_key: str,
+    fuzzy: bool = True,
+) -> dict | None:
+    """Look up a cached response. Tries exact match first, then fuzzy.
+
+    Returns parsed dict or None.
+    """
     cache = get_redis()
+
+    # Step 1: Exact match (O(1), fast)
     try:
         cached = await cache.get(cache_key)
         if cached:
             return json.loads(cached)
     except Exception:
         pass
+
+    # Step 2: Fuzzy match via QJL signatures
+    if not fuzzy:
+        return None
+
+    try:
+        model, img_hash, prompt_hash = _parse_cache_key(cache_key)
+        query_sig = _jl_compress(img_hash)
+
+        # Search fuzzy index for this model
+        candidates = _fuzzy_index.get(model, [])
+        best_key = None
+        best_distance = _HAMMING_THRESHOLD + 1
+
+        for sig, stored_key in candidates:
+            # Must match same prompt hash (fuzzy is image-only)
+            _, _, stored_prompt_hash = _parse_cache_key(stored_key)
+            if stored_prompt_hash != prompt_hash:
+                continue
+
+            dist = _hamming_distance(query_sig, sig)
+            if dist < best_distance:
+                best_distance = dist
+                best_key = stored_key
+
+        if best_key is not None:
+            cached = await cache.get(best_key)
+            if cached:
+                logger.info(
+                    "fuzzy cache hit: hamming_distance=%d (threshold=%d)",
+                    best_distance,
+                    _HAMMING_THRESHOLD,
+                )
+                return json.loads(cached)
+    except Exception:
+        logger.debug("fuzzy cache lookup failed", exc_info=True)
+
     return None
 
 
@@ -60,9 +166,30 @@ async def set_cached_response(
     response: dict,
     ttl_seconds: int = 3600,
 ) -> None:
-    """Cache a response. Default TTL: 1 hour."""
+    """Cache a response and index its QJL signature for fuzzy matching."""
     cache = get_redis()
     try:
         await cache.set(cache_key, json.dumps(response), ex=ttl_seconds)
+
+        # Add to fuzzy index
+        model, img_hash, _ = _parse_cache_key(cache_key)
+        sig = _jl_compress(img_hash)
+
+        if model not in _fuzzy_index:
+            _fuzzy_index[model] = []
+
+        # Avoid duplicate entries for same key
+        _fuzzy_index[model] = [(s, k) for s, k in _fuzzy_index[model] if k != cache_key]
+        _fuzzy_index[model].append((sig, cache_key))
     except Exception:
         pass  # cache failures shouldn't break the request
+
+
+def clear_fuzzy_index() -> None:
+    """Clear the in-memory fuzzy index. Useful for tests."""
+    _fuzzy_index.clear()
+
+
+def get_fuzzy_index_size() -> int:
+    """Return total entries across all models in the fuzzy index."""
+    return sum(len(entries) for entries in _fuzzy_index.values())
diff --git a/token0/optimization/router.py b/token0/optimization/router.py
index 40ad94d..7defa46 100644
--- a/token0/optimization/router.py
+++ b/token0/optimization/router.py
@@ -50,6 +50,13 @@ def get_provider_from_model(model: str) -> str:
         "yi-vl",
         "llava-phi",
         "nanollava",
+        "llama3.2-vision",
+        "llama3.2",
+        "gemma3",
+        "granite3.2",
+        "granite3",
+        "qwen2.5vl",
+        "qwen3-vl",
     )
     if any(k in model_lower for k in ollama_models):
         return "ollama"
@@ -144,18 +151,38 @@ def plan_optimization(
 
     # --- Optimization 1: OCR Route ---
     if analysis.is_mostly_text:
-        plan.use_ocr_route = True
-        plan.reasons.append(
-            f"text_density={analysis.text_density:.2f} "
-            f"> {settings.text_density_threshold} — OCR route"
-        )
-        plan.estimated_tokens_before = _estimate_tokens(analysis, provider, "high")
-        plan.estimated_tokens_after = 200
-        # Still suggest cascade for OCR route (text processing is simple)
-        if enable_cascade and model in MODEL_CASCADE:
-            plan.recommended_model = MODEL_CASCADE[model]
-            plan.reasons.append(f"cascade → {plan.recommended_model} (text task)")
-        return plan
+        estimated_image_tokens = _estimate_tokens(analysis, provider, "high")
+        estimated_ocr_tokens = 200  # typical OCR text extraction output
+
+        # Some Ollama models encode images with extremely few tokens — far fewer
+        # than our estimation formula predicts. For these, OCR text output
+        # (200-700 tokens) costs MORE than just sending the image.
+        # Only skip OCR for models confirmed to use <50 tokens per image.
+        _ultra_efficient_models = ("llama3.2-vision", "llama3.2")
+        is_ultra_efficient = any(k in model.lower() for k in _ultra_efficient_models)
+        if provider == "ollama" and is_ultra_efficient:
+            plan.reasons.append(
+                f"text_density={analysis.text_density:.2f} but Ollama model — "
+                f"skip OCR (image tokens ~{estimated_image_tokens} likely cheaper than OCR text)"
+            )
+        elif estimated_image_tokens <= estimated_ocr_tokens:
+            plan.reasons.append(
+                f"text_density={analysis.text_density:.2f} but image tokens "
+                f"({estimated_image_tokens}) ≤ OCR estimate ({estimated_ocr_tokens}) — skip OCR"
+            )
+        else:
+            plan.use_ocr_route = True
+            plan.reasons.append(
+                f"text_density={analysis.text_density:.2f} "
+                f"> {settings.text_density_threshold} — OCR route"
+            )
+            plan.estimated_tokens_before = estimated_image_tokens
+            plan.estimated_tokens_after = estimated_ocr_tokens
+            # Still suggest cascade for OCR route (text processing is simple)
+            if enable_cascade and model in MODEL_CASCADE:
+                plan.recommended_model = MODEL_CASCADE[model]
+                plan.reasons.append(f"cascade → {plan.recommended_model} (text task)")
+            return plan
 
     # --- Optimization 2: Prompt-Aware Detail Mode (NEW) ---
     if provider == "openai" and detail_override != "high":
diff --git a/token0/optimization/video.py b/token0/optimization/video.py
new file mode 100644
index 0000000..5ee1030
--- /dev/null
+++ b/token0/optimization/video.py
@@ -0,0 +1,347 @@
+"""Video optimization — extract keyframes, deduplicate, and optimize.
+
+Layer 1: Frame sampling + perceptual hash dedup (no ML dependencies)
+Layer 2: CLIP-based query-frame relevance scoring (optional, needs sentence-transformers)
+
+Accepts video as base64 data URI or file path. Extracts keyframes,
+deduplicates similar frames using QJL fuzzy matching, optionally scores
+frames against the user's prompt, and returns optimized PIL images.
+"""
+
+import base64
+import logging
+import tempfile
+
+import cv2
+import numpy as np
+from PIL import Image
+
+from token0.optimization.cache import (
+    _hamming_distance,
+    _image_hash,
+    _jl_compress,
+)
+
+logger = logging.getLogger("token0.video")
+
+# Try to import CLIP for Layer 2 (optional)
+_clip_model = None
+_clip_preprocess = None
+_clip_available = False
+
+try:
+    import clip
+    import torch
+
+    _clip_available = True
+except ImportError:
+    pass
+
+# ---------------------------------------------------------------------------
+# Constants
+# ---------------------------------------------------------------------------
+DEFAULT_FPS = 1  # extract 1 frame per second
+MAX_FRAMES = 32  # hard cap on frames sent to LLM
+DEDUP_HAMMING_THRESHOLD = 12  # tighter than cache (more aggressive dedup for consecutive frames)
+MIN_SCENE_CHANGE_THRESHOLD = 15.0  # minimum pixel difference for scene change
+
+
+def _decode_video_input(video_input: str) -> str:
+    """Decode video input to a temp file path. Accepts base64 data URI or file path."""
+    if video_input.startswith("data:"):
+        # data:video/mp4;base64,...
+        _, b64_data = video_input.split(",", 1)
+        raw_bytes = base64.b64decode(b64_data)
+        tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+        tmp.write(raw_bytes)
+        tmp.flush()
+        return tmp.name
+    elif video_input.startswith(("http://", "https://")):
+        # URL — download first
+        import httpx
+
+        resp = httpx.get(video_input, timeout=30)
+        resp.raise_for_status()
+        tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+        tmp.write(resp.content)
+        tmp.flush()
+        return tmp.name
+    else:
+        # Assume file path
+        return video_input
+
+
+def extract_frames(
+    video_path: str,
+    fps: float = DEFAULT_FPS,
+    max_frames: int = MAX_FRAMES,
+) -> list[tuple[float, Image.Image]]:
+    """Extract frames from video at given fps.
+
+    Returns list of (timestamp_seconds, PIL.Image) tuples.
+    """
+    cap = cv2.VideoCapture(video_path)
+    if not cap.isOpened():
+        logger.warning("Cannot open video: %s", video_path)
+        return []
+
+    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
+    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    duration = total_frames / video_fps if video_fps > 0 else 0
+
+    # Calculate frame interval
+    frame_interval = max(1, int(video_fps / fps))
+
+    frames = []
+    frame_idx = 0
+
+    while cap.isOpened() and len(frames) < max_frames:
+        ret, frame = cap.read()
+        if not ret:
+            break
+
+        if frame_idx % frame_interval == 0:
+            # Convert BGR (OpenCV) to RGB (PIL)
+            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+            pil_image = Image.fromarray(rgb_frame)
+            timestamp = frame_idx / video_fps
+            frames.append((timestamp, pil_image))
+
+        frame_idx += 1
+
+    cap.release()
+
+    logger.info(
+        "extracted %d frames from %.1fs video (%.1f fps, interval=%d)",
+        len(frames),
+        duration,
+        video_fps,
+        frame_interval,
+    )
+    return frames
+
+
+def deduplicate_frames(
+    frames: list[tuple[float, Image.Image]],
+    hamming_threshold: int = DEDUP_HAMMING_THRESHOLD,
+) -> list[tuple[float, Image.Image]]:
+    """Remove near-duplicate frames using QJL perceptual hash signatures.
+
+    Keeps the first frame from each group of similar consecutive frames.
+    """
+    if not frames:
+        return []
+
+    kept = [frames[0]]
+    prev_sig = _jl_compress(_image_hash(frames[0][1]))
+
+    for timestamp, frame in frames[1:]:
+        sig = _jl_compress(_image_hash(frame))
+        dist = _hamming_distance(prev_sig, sig)
+
+        if dist > hamming_threshold:
+            kept.append((timestamp, frame))
+            prev_sig = sig
+        else:
+            logger.debug(
+                "dedup: dropped frame at %.1fs (hamming=%d, threshold=%d)",
+                timestamp,
+                dist,
+                hamming_threshold,
+            )
+
+    logger.info(
+        "dedup: %d → %d frames (removed %d duplicates)",
+        len(frames),
+        len(kept),
+        len(frames) - len(kept),
+    )
+    return kept
+
+
+def detect_scene_changes(
+    frames: list[tuple[float, Image.Image]],
+    threshold: float = MIN_SCENE_CHANGE_THRESHOLD,
+) -> list[tuple[float, Image.Image]]:
+    """Detect scene changes by pixel-level difference between consecutive frames.
+
+    Returns frames at scene boundaries + first and last frame.
+    """
+    if len(frames) <= 2:
+        return frames
+
+    kept = [frames[0]]  # always keep first frame
+
+    for i in range(1, len(frames)):
+        prev_arr = np.array(frames[i - 1][1].resize((160, 120))).astype(np.float32)
+        curr_arr = np.array(frames[i][1].resize((160, 120))).astype(np.float32)
+        diff = np.mean(np.abs(curr_arr - prev_arr))
+
+        if diff > threshold:
+            kept.append(frames[i])
+
+    # Always keep last frame
+    if kept[-1][0] != frames[-1][0]:
+        kept.append(frames[-1])
+
+    logger.info(
+        "scene detection: %d → %d frames",
+        len(frames),
+        len(kept),
+    )
+    return kept
+
+
+def _load_clip_model():
+    """Lazy-load CLIP model on first use."""
+    global _clip_model, _clip_preprocess
+    if _clip_model is None and _clip_available:
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        _clip_model, _clip_preprocess = clip.load("ViT-B/32", device=device)
+        logger.info("CLIP model loaded on %s", device)
+
+
+def score_frames_by_relevance(
+    frames: list[tuple[float, Image.Image]],
+    prompt: str,
+    top_k: int | None = None,
+) -> list[tuple[float, Image.Image, float]]:
+    """Score frames by relevance to the prompt using CLIP.
+
+    Returns list of (timestamp, frame, score) sorted by score descending.
+    Falls back to uniform scoring if CLIP is not available.
+    """
+    if not frames:
+        return []
+
+    if not _clip_available:
+        # Fallback: uniform scores, keep all
+        logger.debug("CLIP not available, using uniform frame scores")
+        return [(t, f, 1.0) for t, f in frames]
+
+    _load_clip_model()
+    device = next(_clip_model.parameters()).device
+
+    # Encode prompt
+    text_tokens = clip.tokenize([prompt]).to(device)
+    with torch.no_grad():
+        text_features = _clip_model.encode_text(text_tokens)
+        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
+
+    # Encode frames
+    scored = []
+    for timestamp, frame in frames:
+        image_input = _clip_preprocess(frame).unsqueeze(0).to(device)
+        with torch.no_grad():
+            image_features = _clip_model.encode_image(image_input)
+            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
+
+        similarity = (text_features @ image_features.T).item()
+        scored.append((timestamp, frame, similarity))
+
+    # Sort by score descending
+    scored.sort(key=lambda x: x[2], reverse=True)
+
+    if top_k is not None:
+        scored = scored[:top_k]
+
+    # Re-sort by timestamp for chronological order
+    scored.sort(key=lambda x: x[0])
+
+    return scored
+
+
+def process_video(
+    video_input: str,
+    prompt: str = "",
+    fps: float = DEFAULT_FPS,
+    max_frames: int = MAX_FRAMES,
+    deduplicate: bool = True,
+    use_scene_detection: bool = True,
+    use_clip_scoring: bool = True,
+) -> tuple[list[Image.Image], dict]:
+    """Full video optimization pipeline.
+
+    Returns (list of optimized PIL images, stats dict).
+
+    Pipeline:
+    1. Decode video input (base64/URL/path) → temp file
+    2. Extract frames at configured fps
+    3. Deduplicate similar frames (QJL perceptual hash)
+    4. Detect scene changes
+    5. Score by prompt relevance (CLIP, if available)
+    6. Cap at max_frames
+    """
+    # Step 1: Decode input
+    video_path = _decode_video_input(video_input)
+
+    # Step 2: Extract frames
+    all_frames = extract_frames(video_path, fps=fps, max_frames=max_frames * 4)
+    total_extracted = len(all_frames)
+
+    if not all_frames:
+        return [], {
+            "total_video_frames": 0,
+            "frames_extracted": 0,
+            "frames_after_dedup": 0,
+            "frames_after_scene_detection": 0,
+            "frames_selected": 0,
+            "frame_reduction_pct": 0.0,
+            "clip_used": False,
+        }
+
+    # Get video metadata
+    cap = cv2.VideoCapture(video_path)
+    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
+    total_video_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    duration = total_video_frames / video_fps if video_fps > 0 else 0
+    cap.release()
+
+    # Step 3: Deduplicate
+    if deduplicate:
+        frames = deduplicate_frames(all_frames)
+    else:
+        frames = all_frames
+    frames_after_dedup = len(frames)
+
+    # Step 4: Scene detection
+    if use_scene_detection and len(frames) > max_frames:
+        frames = detect_scene_changes(frames)
+    frames_after_scene = len(frames)
+
+    # Step 5: CLIP scoring (trim to max_frames by relevance)
+    clip_used = False
+    if use_clip_scoring and _clip_available and prompt and len(frames) > max_frames:
+        scored = score_frames_by_relevance(frames, prompt, top_k=max_frames)
+        frames = [(t, f) for t, f, s in scored]
+        clip_used = True
+    elif len(frames) > max_frames:
+        # Uniform sampling to max_frames
+        indices = np.linspace(0, len(frames) - 1, max_frames, dtype=int)
+        frames = [frames[i] for i in indices]
+
+    # Extract just the PIL images (drop timestamps)
+    images = [f for _, f in frames]
+
+    stats = {
+        "video_duration_seconds": round(duration, 1),
+        "video_fps": round(video_fps, 1),
+        "total_video_frames": total_video_frames,
+        "frames_extracted_at_1fps": total_extracted,
+        "frames_after_dedup": frames_after_dedup,
+        "frames_after_scene_detection": frames_after_scene,
+        "frames_selected": len(images),
+        "frame_reduction_pct": round((1 - len(images) / max(total_video_frames, 1)) * 100, 1),
+        "clip_used": clip_used,
+    }
+
+    logger.info(
+        "video pipeline: %d total → %d extracted → %d deduped → %d selected (%.1f%% reduction)",
+        total_video_frames,
+        total_extracted,
+        frames_after_dedup,
+        len(images),
+        stats["frame_reduction_pct"],
+    )
+
+    return images, stats