Pritom14 · Mar 30, 2026
diff --git a/.gitignore b/.gitignore
@@ -16,6 +16,7 @@ build/
 .token0_images/
 benchmarks/images/real/screenshot_real.png
 benchmarks/results/
+benchmarks/videos/
 *.db-journal
 .DS_Store
 .idea/

diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@ Your App → Token0 Proxy → [Analyze → Classify → Route → Transform →
          Database (logs every optimization decision + savings)
 ```
 
-Token0 applies **7 optimizations** automatically:
+Token0 applies **9 optimizations** automatically:
 
 ### Core Optimizations (Free Tier)
 
@@ -49,11 +49,15 @@ Token0 applies **7 optimizations** automatically:
 
 **7. Semantic Response Cache** — Cache responses for similar image+prompt pairs using perceptual image hashing. Repeated or similar queries cost 0 tokens. Effective on repetitive workloads (product classification, document processing).
 
+**8. QJL-Compressed Fuzzy Cache** — Similar (not just identical) images hit the cache using Quantized Johnson-Lindenstrauss random projection. Compresses 256-bit perceptual hashes to 128-bit binary signatures, matches via Hamming distance. Inspired by Google's TurboQuant (arXiv 2504.19874). **62% additional token savings** on image variations in benchmarks — similar product photos, re-scanned documents, and slightly different angles all hit cache.
+
+**9. Video Optimization** — Automatically extract keyframes from video at 1fps, deduplicate similar consecutive frames using QJL perceptual hashing, detect scene changes via pixel-level diff, and run each keyframe through the full image optimization pipeline. A 60-second video at 30fps (1,800 frames) is sampled to 60 frames at 1fps and then reduced to ~10 keyframes before being sent to the LLM. **4.7-45% savings on local models in benchmarks; ~83% projected savings on GPT-4o.** Optional CLIP-based query-frame scoring (Layer 2) ranks frames by relevance to the user's prompt.
+
 ---
 
 ## Benchmarks
 
-We benchmarked Token0 against **4 vision models** on **5 real-world images** (not synthetic — actual photos, receipts, documents, and screenshots), plus cost projections using OpenAI and Anthropic's published token formulas.
+We benchmarked Token0 against **7 vision models** on **5 real-world images** (not synthetic: actual photos, receipts, documents, and screenshots) and 5 of those models on **3 test videos**, plus cost projections using OpenAI's and Anthropic's published token formulas.
 
 ### Real-World Image Test Suite
 
@@ -111,16 +115,55 @@ We benchmarked Token0 against **4 vision models** on **5 real-world images** (no
 | Screenshot (2066x766) | 618 | 244 | **60.5%** | **-3,744ms** | OCR route |
 | **Total** | **3,027** | **2,243** | **25.9%** | | |
 
-### Summary Across All Models
+### Image Benchmark Summary (7 Models)
+
+| Model | Params | Total Direct | Total Token0 | Savings | Notes |
+|---|---|---|---|---|---|
+| granite3.2-vision | 3B | 129,836 | 60,924 | **53.1%** | High-res image encoder |
+| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** | |
+| moondream | 1.7B | 16,457 | 10,240 | **37.8%** | |
+| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** | |
+| llava:7b | 7B | 13,384 | 8,701 | **35.0%** | |
+| gemma3:4b | 4B | 6,380 | 4,798 | **24.8%** | |
+| llama3.2-vision | 11B | 665 | 665 | **0%** | Ultra-efficient encoder: passthrough correct, no optimization needed |
+
+> The 0% savings on llama3.2-vision is expected and correct. This model uses ~8-27 tokens per image natively — far below what OCR text extraction would cost. Token0 detects this and correctly skips all lossy optimizations.
+
+### Video Benchmark Results
+
+Test setup: 3 videos (product showcase, document montage, mixed content). Naive baseline: every frame sampled at 1fps, sent raw. Token0: frame dedup + scene detection + per-frame image optimization.
+
+| Model | Naive Tokens | Token0 Tokens | Savings |
+|---|---|---|---|
+| gemma3:4b | 14,706 | 8,081 | **45.0%** |
+| llava:7b | 15,731 | 12,845 | **18.3%** |
+| llava-llama3 | 15,658 | 12,789 | **18.3%** |
+| minicpm-v | 7,428 | 6,447 | **13.2%** |
+| moondream | 12,288 | 11,714 | **4.7%** |
 
-| Model | Params | Total Direct | Total Token0 | Savings |
-|---|---|---|---|---|
-| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** |
-| moondream | 1.7B | 16,457 | 10,240 | **37.8%** |
-| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** |
-| llava:7b | 7B | 13,384 | 8,701 | **35.0%** |
+**Why moondream shows less video savings:** moondream uses a very small frame encoder — its per-frame token cost is already low, so frame dedup has less absolute impact than on higher-token models.
+
+### GPT-4o Video Extrapolation (ballpark)
+
+Using OpenAI's published tile formula (512px tiles, 170 tokens/tile):
+
+| Scenario | Naive | Token0 | Savings |
+|---|---|---|---|
+| 60s video, 30fps (1,800 frames → 1fps → 60 frames → dedup to ~10) | ~25,500 tokens | ~4,250 tokens | **~83%** |
+| Monthly cost at 10K videos/day (GPT-4o $2.50/1M tokens) | $19,125/mo | $3,188/mo | **$15,938/mo saved** |
+
+### Anthropic Video Extrapolation (ballpark)
+
+Using Anthropic's pixel formula (tokens ≈ width × height / 750):
+
+| Scenario | Naive | Token0 | Savings |
+|---|---|---|---|
+| 60s video, 1fps = 60 frames at 1280×720 | ~73,700 tokens | ~12,300 tokens | **~83%** |
+| Monthly cost at 1K videos/day (Claude Sonnet $3/1M tokens) | $6,633/mo | $1,107/mo | **$5,526/mo saved** |
 
-### GPT-4o Cost Projections (v1 vs v2)
+> These are linear extrapolations from the token formula + observed dedup ratios (60 frames → ~10 keyframes). Actual savings vary by content type — talking-head video deduplicates more aggressively than action scenes.
+
+### GPT-4o Image Cost Projections (v1 vs v2)
 
 Using OpenAI's published token formulas on real images:
 
@@ -150,11 +193,13 @@ Using OpenAI's published token formulas on real images:
 5. **Prompt-aware detail mode** drops simple queries from 1,105 → 85 tokens (92% savings) on GPT-4o.
 6. **Model cascade** routes simple tasks at 16.7x cheaper rates with equivalent quality.
 7. **Tile-optimized resize** cuts OpenAI costs by 44% on mid-size images (1280x720) with zero quality loss.
-8. **On cloud APIs, total savings reach 98.9%** when all optimizations are combined with model cascading.
+8. **On cloud APIs, total image savings reach 98.9%** when all optimizations are combined with model cascading.
+9. **Video deduplication collapses 60-frame clips to ~10 keyframes**, saving 4.7-45% on local models and a projected ~83% on GPT-4o.
+10. **Model-aware OCR skip is critical** — ultra-efficient encoders like llama3.2-vision use <50 tokens/image; OCR text output would cost more, not less.
 
 ### Additional Test Coverage
 
-Token0 includes **103 unit tests** and benchmarks across multiple suites:
+Token0 includes **148 unit tests** and benchmarks across multiple suites:
 
 | Suite | Tests | What It Validates |
 |---|---|---|
@@ -166,6 +211,8 @@ Token0 includes **103 unit tests** and benchmarks across multiple suites:
 | `real` | 5 | Real-world photos, receipts, invoices, screenshots |
 | `streaming` | 7 | SSE streaming: format, content, stats, image optimization |
 | `litellm` | 10 | LiteLLM hook: passthrough, optimization, OCR, cascade, async |
+| `cache` | 23 | QJL fuzzy cache: perceptual hash, JL compression, Hamming distance, fuzzy match |
+| `video` | 22 | Frame extraction, QJL dedup, scene detection, CLIP scoring, full pipeline |
 
 ---
 
@@ -234,6 +281,26 @@ response = client.chat.completions.create(
 # response.token0.optimizations_applied = ["resize 4000x3000 → 1568x1176", "convert png → jpeg q=85"]
 ```
 
+### Video Support
+
+Send a video URL or base64-encoded video — Token0 automatically extracts keyframes, deduplicates, and optimizes before forwarding:
+
+```python
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What happens in this video?"},
+            {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,..."}}
+        ]
+    }],
+    extra_headers={"X-Provider-Key": "sk-..."}
+)
+# 1,800 raw frames → ~10 keyframes → optimized images → LLM
+# response.token0.tokens_saved = 21,250  (~83% on GPT-4o)
+```
+
 ### Streaming Support
 
 Token0 supports `stream=true` — images are optimized before streaming begins, then tokens flow word-by-word via SSE:
@@ -332,12 +399,16 @@ curl http://localhost:8000/v1/usage
 pip install token0[dev]
 ollama pull moondream
 
-# Run all suites
+# Run all image suites
 python -m benchmarks.run --model moondream --suite all
 
 # Run only real-world images
 python -m benchmarks.run --model llava:7b --suite real
 
+# Run video benchmarks (requires Ollama + real images in benchmarks/images/real/)
+python -m benchmarks.bench_video_models
+python -m benchmarks.bench_video_models --model llava:7b --model minicpm-v
+
 # Available suites: images, text, multi, turns, tasks, real, all
 # Available models: any Ollama vision model
 ```
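The README text above describes the QJL fuzzy cache as a random projection from a 256-bit perceptual hash to a 128-bit binary signature, compared by Hamming distance. The sketch below illustrates that idea in plain Python. It is not the code in `token0.optimization.cache` (this diff does not show `_jl_compress` or `_hamming_distance`); the names `jl_compress`, `hamming_distance`, `PROJECTION`, and the fixed seeds are assumptions made for illustration.

```python
import random

HASH_BITS = 256  # size of the perceptual hash, per the README
SIG_BITS = 128   # size of the compressed signature, per the README

# Assumption: a fixed Gaussian projection matrix shared by every hash,
# so that signatures of different images are comparable.
_proj_rng = random.Random(0)
PROJECTION = [
    [_proj_rng.gauss(0.0, 1.0) for _ in range(HASH_BITS)]
    for _ in range(SIG_BITS)
]


def jl_compress(hash_bits):
    """Project a list of 0/1 hash bits and keep only the sign of each projection."""
    centered = [2 * b - 1 for b in hash_bits]  # map {0, 1} to {-1, +1}
    signature = 0
    for row in PROJECTION:
        dot = sum(w * x for w, x in zip(row, centered))
        signature = (signature << 1) | (1 if dot >= 0 else 0)
    return signature  # fits in SIG_BITS bits (16 bytes)


def hamming_distance(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")


# Demo: a hash with a few flipped bits should land much closer to the
# original signature than an unrelated hash does.
demo_rng = random.Random(1)
base = [demo_rng.randint(0, 1) for _ in range(HASH_BITS)]
near = list(base)
for i in demo_rng.sample(range(HASH_BITS), 8):
    near[i] ^= 1
far = [demo_rng.randint(0, 1) for _ in range(HASH_BITS)]

sig_base = jl_compress(base)
d_near = hamming_distance(sig_base, jl_compress(near))
d_far = hamming_distance(sig_base, jl_compress(far))
print(f"near-duplicate distance: {d_near}, unrelated distance: {d_far}")
```

A fuzzy lookup would then treat any stored signature within some Hamming threshold as a match; the threshold Token0 actually uses is not visible in this diff.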

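The GPT-4o and Anthropic extrapolation tables in the README hunk above are plain arithmetic. The sketch below reproduces them under two assumptions: each GPT-4o frame costs 85 base tokens plus 2 tiles at 170 tokens (425 per frame, which is what ~25,500 tokens for 60 frames implies; the README does not state the frame resolution), and a month has 30 days. The function names are illustrative and not part of Token0.

```python
def gpt4o_frame_tokens(tiles, base=85, per_tile=170):
    """High-detail image cost under OpenAI's tile formula."""
    return base + tiles * per_tile


def anthropic_frame_tokens(width, height):
    """Anthropic's approximate image cost: width * height / 750."""
    return width * height / 750


def monthly_cost(tokens_per_video, videos_per_day, usd_per_million, days=30):
    """Input-token cost in USD for a month of traffic."""
    return tokens_per_video * videos_per_day * days * usd_per_million / 1_000_000


def savings_pct(naive, optimized):
    return (naive - optimized) / naive * 100


NAIVE_FRAMES = 60  # 60-second clip sampled at 1fps
KEYFRAMES = 10     # observed dedup ratio quoted in the README

# GPT-4o: assumed 2 tiles per frame (see the note above).
gpt_naive = NAIVE_FRAMES * gpt4o_frame_tokens(tiles=2)
gpt_token0 = KEYFRAMES * gpt4o_frame_tokens(tiles=2)
gpt_naive_monthly = monthly_cost(gpt_naive, 10_000, 2.50)
gpt_token0_monthly = monthly_cost(gpt_token0, 10_000, 2.50)

# Anthropic: 1280x720 frames; the README rounds these totals before costing.
claude_naive = NAIVE_FRAMES * anthropic_frame_tokens(1280, 720)
claude_token0 = KEYFRAMES * anthropic_frame_tokens(1280, 720)

print(f"GPT-4o tokens: {gpt_naive:,} -> {gpt_token0:,} "
      f"({savings_pct(gpt_naive, gpt_token0):.1f}% saved)")
print(f"GPT-4o monthly: ${gpt_naive_monthly:,.2f} -> ${gpt_token0_monthly:,.2f}")
print(f"Anthropic tokens: {claude_naive:,.0f} -> {claude_token0:,.0f}")
```

Because both the naive and Token0 figures use the same per-frame cost, the savings percentage depends only on the 60 to 10 frame ratio, which is why both providers show the same ~83%.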
diff --git a/benchmarks/bench_fuzzy_cache.py b/benchmarks/bench_fuzzy_cache.py
@@ -0,0 +1,231 @@
+"""Benchmark: QJL fuzzy cache vs exact-match-only cache.
+
+Demonstrates token savings from fuzzy cache hits on similar images.
+Simulates a real workload: same product/document photographed multiple
+times with slight variations (lighting, angle, compression artifacts).
+
+Usage:
+    python -m benchmarks.bench_fuzzy_cache
+"""
+
+import asyncio
+import time
+
+import numpy as np
+from PIL import Image
+
+from token0.optimization.cache import (
+    _hamming_distance,
+    _image_hash,
+    _jl_compress,
+    clear_fuzzy_index,
+    get_cached_response,
+    get_fuzzy_index_size,
+    make_cache_key,
+    set_cached_response,
+)
+from token0.storage.redis import MemoryCache
+
+# Estimated tokens per image (GPT-4o high detail, 800x600: 4 tiles x 170 + 85 base = 765)
+TOKENS_PER_IMAGE = 765
+COST_PER_TOKEN = 2.50 / 1_000_000  # GPT-4o input price
+
+
+def _make_base_image(seed: int, width=800, height=600) -> Image.Image:
+    """Create a unique base image (simulates a product photo or document)."""
+    rng = np.random.RandomState(seed=seed)
+    pixels = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
+    return Image.fromarray(pixels)
+
+
+def _add_variation(base: Image.Image, variation_seed: int, noise_level: int = 15) -> Image.Image:
+    """Add slight variation to an image (simulates re-photo, compression, etc.)."""
+    pixels = np.array(base)
+    rng = np.random.RandomState(seed=variation_seed)
+    noise = rng.randint(-noise_level, noise_level + 1, pixels.shape, dtype=np.int16)
+    noisy = np.clip(pixels.astype(np.int16) + noise, 0, 255).astype(np.uint8)
+    return Image.fromarray(noisy)
+
+
+async def run_benchmark():
+    import token0.storage.redis as redis_mod
+
+    redis_mod._memory_cache.clear()
+    redis_mod.pool = MemoryCache()
+    clear_fuzzy_index()
+
+    print("=" * 80)
+    print("  QJL Fuzzy Cache Benchmark")
+    print("=" * 80)
+
+    # --- Setup: create base images and variations ---
+    num_unique_images = 20
+    variations_per_image = 5  # each base image has 5 slight variations
+    prompt = "describe this product image"
+
+    base_images = [_make_base_image(seed=i) for i in range(num_unique_images)]
+    variation_images = []
+    for i, base in enumerate(base_images):
+        for v in range(variations_per_image):
+            variation_images.append((i, _add_variation(base, variation_seed=i * 100 + v)))
+
+    total_requests = num_unique_images + len(variation_images)
+    print(f"\n  Setup: {num_unique_images} unique images, {variations_per_image} variations each")
+    print(f"  Total requests: {total_requests}")
+    print(f"  Tokens per image (GPT-4o): {TOKENS_PER_IMAGE}")
+
+    # --- Benchmark 1: Exact-match only ---
+    print("\n  --- Exact Match Only ---\n")
+    redis_mod._memory_cache.clear()
+    clear_fuzzy_index()
+
+    exact_hits = 0
+    exact_misses = 0
+    start = time.time()
+
+    # First pass: cache base images
+    for i, base in enumerate(base_images):
+        key = make_cache_key(base, prompt, "gpt-4o")
+        await set_cached_response(key, {"content": f"response_{i}"})
+
+    # Second pass: query with variations (exact match only)
+    for base_idx, var_img in variation_images:
+        key = make_cache_key(var_img, prompt, "gpt-4o")
+        result = await get_cached_response(key, fuzzy=False)
+        if result:
+            exact_hits += 1
+        else:
+            exact_misses += 1
+
+    exact_time = time.time() - start
+    exact_tokens_used = exact_misses * TOKENS_PER_IMAGE
+    exact_cost = exact_tokens_used * COST_PER_TOKEN
+
+    print(f"  Hits:   {exact_hits}/{len(variation_images)}")
+    print(f"  Misses: {exact_misses}/{len(variation_images)}")
+    print(f"  Tokens used: {exact_tokens_used:,}")
+    print(f"  Cost: ${exact_cost:.4f}")
+    print(f"  Time: {exact_time * 1000:.1f}ms")
+
+    # --- Benchmark 2: Fuzzy match (QJL) ---
+    print("\n  --- QJL Fuzzy Match ---\n")
+    redis_mod._memory_cache.clear()
+    clear_fuzzy_index()
+
+    fuzzy_hits = 0
+    fuzzy_misses = 0
+    start = time.time()
+
+    # First pass: cache base images
+    for i, base in enumerate(base_images):
+        key = make_cache_key(base, prompt, "gpt-4o")
+        await set_cached_response(key, {"content": f"response_{i}"})
+
+    # Second pass: query with variations (fuzzy match enabled)
+    for base_idx, var_img in variation_images:
+        key = make_cache_key(var_img, prompt, "gpt-4o")
+        result = await get_cached_response(key, fuzzy=True)
+        if result:
+            fuzzy_hits += 1
+        else:
+            fuzzy_misses += 1
+
+    fuzzy_time = time.time() - start
+    fuzzy_tokens_used = fuzzy_misses * TOKENS_PER_IMAGE
+    fuzzy_cost = fuzzy_tokens_used * COST_PER_TOKEN
+
+    print(f"  Hits:   {fuzzy_hits}/{len(variation_images)}")
+    print(f"  Misses: {fuzzy_misses}/{len(variation_images)}")
+    print(f"  Tokens used: {fuzzy_tokens_used:,}")
+    print(f"  Cost: ${fuzzy_cost:.4f}")
+    print(f"  Time: {fuzzy_time * 1000:.1f}ms")
+    print(f"  Fuzzy index size: {get_fuzzy_index_size()} entries")
+
+    # --- Hamming distance analysis ---
+    print("\n  --- Hamming Distance Analysis ---\n")
+    distances_similar = []
+    distances_different = []
+
+    for i, base in enumerate(base_images[:5]):
+        base_hash = _image_hash(base)
+        base_sig = _jl_compress(base_hash)
+
+        # Similar: variations of same base
+        for v in range(variations_per_image):
+            var = _add_variation(base, variation_seed=i * 100 + v)
+            var_hash = _image_hash(var)
+            var_sig = _jl_compress(var_hash)
+            distances_similar.append(_hamming_distance(base_sig, var_sig))
+
+        # Different: other base images
+        for j in range(5):
+            if i == j:
+                continue
+            other_hash = _image_hash(base_images[j])
+            other_sig = _jl_compress(other_hash)
+            distances_different.append(_hamming_distance(base_sig, other_sig))
+
+    print(
+        f"  Similar images:   avg={np.mean(distances_similar):.1f}, "
+        f"min={min(distances_similar)}, max={max(distances_similar)}"
+    )
+    print(
+        f"  Different images: avg={np.mean(distances_different):.1f}, "
+        f"min={min(distances_different)}, max={max(distances_different)}"
+    )
+
+    # --- Summary ---
+    print(f"\n  {'=' * 70}")
+    print("  SUMMARY")
+    print(f"  {'=' * 70}")
+    print(f"  {'':30s} {'Exact':>12s} {'Fuzzy (QJL)':>12s} {'Improvement':>12s}")
+    print(f"  {'-' * 30} {'-' * 12} {'-' * 12} {'-' * 12}")
+    print(
+        f"  {'Cache hits':30s} {exact_hits:>12d} {fuzzy_hits:>12d} "
+        f"{'+' + str(fuzzy_hits - exact_hits):>12s}"
+    )
+    print(
+        f"  {'Cache misses':30s} {exact_misses:>12d} {fuzzy_misses:>12d} "
+        f"{exact_misses - fuzzy_misses:>12d}"
+    )
+    print(
+        f"  {'Tokens used':30s} {exact_tokens_used:>12,} {fuzzy_tokens_used:>12,} "
+        f"{exact_tokens_used - fuzzy_tokens_used:>12,}"
+    )
+
+    if exact_tokens_used > 0:
+        savings_pct = (exact_tokens_used - fuzzy_tokens_used) / exact_tokens_used * 100
+        print(f"  {'Token savings':30s} {'':>12s} {'':>12s} {savings_pct:>11.1f}%")
+
+    print(
+        f"  {'Cost (GPT-4o)':30s} ${exact_cost:>11.4f} ${fuzzy_cost:>11.4f} "
+        f"${exact_cost - fuzzy_cost:>11.4f}"
+    )
+
+    # Scale projections
+    print("\n  At scale (100K images/day, 20% are variations):")
+    daily_variations = 20_000
+    exact_miss_rate = exact_misses / len(variation_images)
+    fuzzy_miss_rate = fuzzy_misses / len(variation_images)
+    daily_exact_tokens = daily_variations * TOKENS_PER_IMAGE * exact_miss_rate
+    daily_fuzzy_tokens = daily_variations * TOKENS_PER_IMAGE * fuzzy_miss_rate
+    monthly_exact = daily_exact_tokens * 30 * COST_PER_TOKEN
+    monthly_fuzzy = daily_fuzzy_tokens * 30 * COST_PER_TOKEN
+    print(f"  Exact-only monthly cost:  ${monthly_exact:,.2f}")
+    print(f"  Fuzzy cache monthly cost: ${monthly_fuzzy:,.2f}")
+    print(f"  Monthly savings:          ${monthly_exact - monthly_fuzzy:,.2f}")
+    print(f"  {'=' * 70}")
+
+    # Memory overhead
+    sig_bytes = get_fuzzy_index_size() * 16  # 16 bytes per signature
+    key_bytes = get_fuzzy_index_size() * 80  # ~80 bytes per cache key string
+    print(
+        f"\n  Memory overhead: {(sig_bytes + key_bytes) / 1024:.1f} KB "
+        f"for {get_fuzzy_index_size()} entries "
+        f"({sig_bytes} bytes signatures + {key_bytes} bytes keys)"
+    )
+    print(f"  At 1M entries: ~{(1_000_000 * (16 + 80)) / 1024 / 1024:.1f} MB")
+
+
+if __name__ == "__main__":
+    asyncio.run(run_benchmark())
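
The benchmark above counts any non-empty fuzzy result as a hit, without checking that the cached response belongs to the same base image as the query. Below is a minimal sketch of how hits could be split into correct matches and false positives. It assumes `get_cached_response` returns the same `{"content": "response_<i>"}` dict that was stored; that return shape is an assumption, since the cache module is not shown in this diff. `classify_result` is a hypothetical helper, not part of the benchmark.

```python
from collections import Counter


def classify_result(result, base_idx):
    """Label one lookup as 'miss', 'correct', or 'false_positive'."""
    if not result:
        return "miss"
    if result.get("content") == f"response_{base_idx}":
        return "correct"
    return "false_positive"


# Stand-in data. In the benchmark these pairs would be
# (base_idx, await get_cached_response(key, fuzzy=True)).
observed = [
    (0, {"content": "response_0"}),  # matched its own base image
    (1, {"content": "response_7"}),  # matched a different base image
    (2, None),                       # no match
]

tally = Counter(classify_result(result, idx) for idx, result in observed)
print(dict(tally))
```

Reporting false positives alongside hits would show whether the token savings come from genuinely similar images or from an overly loose Hamming threshold.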