Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -16,6 +16,7 @@ build/
.token0_images/
benchmarks/images/real/screenshot_real.png
benchmarks/results/
benchmarks/videos/
*.db-journal
.DS_Store
.idea/
97 changes: 84 additions & 13 deletions README.md
@@ -29,7 +29,7 @@ Your App → Token0 Proxy → [Analyze → Classify → Route → Transform →
Database (logs every optimization decision + savings)
```

Token0 applies **7 optimizations** automatically:
Token0 applies **9 optimizations** automatically:

### Core Optimizations (Free Tier)

@@ -49,11 +49,15 @@ Token0 applies **7 optimizations** automatically:

**7. Semantic Response Cache** — Cache responses for similar image+prompt pairs using perceptual image hashing. Repeated or similar queries cost 0 tokens. Effective on repetitive workloads (product classification, document processing).

**8. QJL-Compressed Fuzzy Cache** — Similar (not just identical) images hit the cache using Quantized Johnson-Lindenstrauss random projection. Compresses 256-bit perceptual hashes to 128-bit binary signatures, matches via Hamming distance. Inspired by Google's TurboQuant (arXiv 2504.19874). **62% additional token savings** on image variations in benchmarks — similar product photos, re-scanned documents, and slightly different angles all hit cache.

**9. Video Optimization** — Automatically extract keyframes from video at 1fps, deduplicate similar consecutive frames using QJL perceptual hashing, detect scene changes via pixel-level diff, and run each keyframe through the full image optimization pipeline. A 60-second video at 30fps (1,800 frames) reduces to ~10 keyframes before being sent to the LLM. **4.7-45% savings on local models in the video benchmarks; ~83% projected savings on GPT-4o.** Optional CLIP-based query-frame scoring (Layer 2) ranks frames by relevance to the user's prompt.
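
To make the idea behind optimization 8 concrete, here is a minimal sketch of a sign-quantized Johnson-Lindenstrauss projection that turns a 256-bit hash into a 128-bit signature and compares signatures by Hamming distance. It is an illustration only, not Token0's implementation: `jl_signature`, `hamming`, the fixed seeds, and the ±1 encoding of hash bits are all assumptions made for this example.

```python
import numpy as np

HASH_BITS = 256  # perceptual hash size, as stated in the description
SIG_BITS = 128   # compressed signature size, as stated in the description

# Fixed Gaussian projection matrix (hypothetical seed). It must be the same
# for every image, otherwise signatures are not comparable.
_PROJECTION = np.random.default_rng(0).standard_normal((SIG_BITS, HASH_BITS))


def jl_signature(hash_bits: np.ndarray) -> np.ndarray:
    """Project a 0/1 hash vector and keep only the sign of each coordinate."""
    centered = hash_bits.astype(np.float64) * 2.0 - 1.0  # map {0,1} -> {-1,+1}
    return (_PROJECTION @ centered > 0).astype(np.uint8)


def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count the bit positions where two signatures differ."""
    return int(np.count_nonzero(a != b))


rng = np.random.default_rng(1)
base = rng.integers(0, 2, HASH_BITS, dtype=np.uint8)
near = base.copy()
near[:8] ^= 1  # flip 8 of 256 bits to mimic a slightly different photo
far = rng.integers(0, 2, HASH_BITS, dtype=np.uint8)  # unrelated image

d_near = hamming(jl_signature(base), jl_signature(near))
d_far = hamming(jl_signature(base), jl_signature(far))
print(f"similar: {d_near} bits differ, unrelated: {d_far} bits differ")
```

A cache lookup would then treat two images as a fuzzy match when their signature distance falls under some threshold. The threshold itself has to be tuned on real data and is not specified here.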

---

## Benchmarks

We benchmarked Token0 against **4 vision models** on **5 real-world images** (not synthetic — actual photos, receipts, documents, and screenshots), plus cost projections using OpenAI and Anthropic's published token formulas.
We benchmarked Token0 against **7 vision models** on **5 real-world images** (not synthetic — actual photos, receipts, documents, and screenshots) and **3 test videos**, plus cost projections using OpenAI and Anthropic's published token formulas.

### Real-World Image Test Suite

@@ -111,16 +115,55 @@ We benchmarked Token0 against **4 vision models** on **5 real-world images** (no
| Screenshot (2066x766) | 618 | 244 | **60.5%** | **-3,744ms** | OCR route |
| **Total** | **3,027** | **2,243** | **25.9%** | | |

### Summary Across All Models
### Image Benchmark Summary (7 Models)

| Model | Params | Total Direct | Total Token0 | Savings | Notes |
|---|---|---|---|---|---|
| granite3.2-vision | 3B | 129,836 | 60,924 | **53.1%** | High-res image encoder |
| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** | |
| moondream | 1.7B | 16,457 | 10,240 | **37.8%** | |
| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** | |
| llava:7b | 7B | 13,384 | 8,701 | **35.0%** | |
| gemma3:4b | 4B | 6,380 | 4,798 | **24.8%** | |
| llama3.2-vision | 11B | 665 | 665 | **0%** | Ultra-efficient encoder: passthrough correct, no optimization needed |

> The 0% savings on llama3.2-vision is expected and correct. This model uses ~8-27 tokens per image natively — far below what OCR text extraction would cost. Token0 detects this and correctly skips all lossy optimizations.

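The skip decision for efficient encoders comes down to comparing two token estimates. The sketch below is one way such a check could look. The function names and the 4-characters-per-token heuristic are assumptions for illustration, not Token0's actual logic; the 27-token and 765-token figures are the per-image costs quoted elsewhere in this PR.

```python
def estimate_text_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def should_route_to_ocr(native_image_tokens: int, ocr_text: str) -> bool:
    """Send OCR text only when it is cheaper than sending the image itself."""
    return estimate_text_tokens(ocr_text) < native_image_tokens


# 40 receipt lines of 13 characters each: 520 characters, roughly 130 tokens.
receipt_text = "ITEM 1  4.99\n" * 40

print(should_route_to_ocr(27, receipt_text))   # efficient encoder: image is cheaper
print(should_route_to_ocr(765, receipt_text))  # tile-based encoder: OCR is cheaper
```

With an encoder that spends about 27 tokens on the image, the roughly 130 tokens of extracted text would increase cost, so the image passes through untouched.
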
### Video Benchmark Results

Test setup: 3 videos (product showcase, document montage, mixed content). The naive baseline sends every frame sampled at 1fps to the model unmodified. Token0 applies frame dedup, scene detection, and per-frame image optimization.

| Model | Naive Tokens | Token0 Tokens | Savings |
|---|---|---|---|
| gemma3:4b | 14,706 | 8,081 | **45.0%** |
| llava:7b | 15,731 | 12,845 | **18.3%** |
| llava-llama3 | 15,658 | 12,789 | **18.3%** |
| minicpm-v | 7,428 | 6,447 | **13.2%** |
| moondream | 12,288 | 11,714 | **4.7%** |

| Model | Params | Total Direct | Total Token0 | Savings |
|---|---|---|---|---|
| minicpm-v | 8B | 10,877 | 6,276 | **42.3%** |
| moondream | 1.7B | 16,457 | 10,240 | **37.8%** |
| llava-llama3 | 8B | 13,365 | 8,486 | **36.5%** |
| llava:7b | 7B | 13,384 | 8,701 | **35.0%** |
**Why moondream shows less video savings:** moondream uses a very small frame encoder — its per-frame token cost is already low, so frame dedup has less absolute impact than on higher-token models.

### GPT-4o Video Extrapolation (ballpark)

Using OpenAI's published tile formula (512px tiles, 170 tokens per tile plus 85 base tokens per image):

| Scenario | Naive | Token0 | Savings |
|---|---|---|---|
| 60s video, 30fps (1,800 frames → 1fps → 60 frames → dedup to ~10) | ~25,500 tokens | ~4,250 tokens | **~83%** |
| Monthly cost at 10K videos/day (GPT-4o $2.50/1M tokens) | $19,125/mo | $3,188/mo | **$15,938/mo saved** |
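
The figures in this table can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: each frame occupies 2 tiles, which is what 25,500 tokens over 60 frames (425 tokens per frame) implies once the 85 base tokens of OpenAI's high-detail formula are included. The helper names are hypothetical.

```python
BASE_TOKENS = 85        # fixed per-image cost in OpenAI's high-detail formula
TOKENS_PER_TILE = 170   # cost of each 512px tile
TILES_PER_FRAME = 2     # assumption inferred from 25,500 / 60 = 425 tokens per frame


def frame_tokens(tiles: int) -> int:
    """Tokens for one frame: base cost plus per-tile cost."""
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def monthly_cost(tokens_per_video: int, videos_per_day: int,
                 usd_per_million: float, days: int = 30) -> float:
    """Linear extrapolation of input-token cost over a month."""
    return tokens_per_video * videos_per_day * days * usd_per_million / 1_000_000


naive = 60 * frame_tokens(TILES_PER_FRAME)      # 60 frames sampled at 1fps
optimized = 10 * frame_tokens(TILES_PER_FRAME)  # ~10 keyframes after dedup
savings = 1 - optimized / naive

print(naive, optimized, f"{savings:.1%}")
print(monthly_cost(naive, 10_000, 2.50), monthly_cost(optimized, 10_000, 2.50))
```

The monthly figures come out to $19,125 and $3,187.50, which matches the table after rounding.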

### Anthropic Video Extrapolation (ballpark)

Using Anthropic's pixel formula (tokens ≈ width × height / 750):

| Scenario | Naive | Token0 | Savings |
|---|---|---|---|
| 60s video, 1fps = 60 frames at 1280×720 | ~73,700 tokens | ~12,300 tokens | **~83%** |
| Monthly cost at 1K videos/day (Claude Sonnet $3/1M tokens) | $6,633/mo | $1,107/mo | **$5,526/mo saved** |
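
The same check for the Anthropic row, as a small sketch. The function name is hypothetical; the formula, frame size, frame counts, and price are the ones stated above.

```python
def anthropic_image_tokens(width: int, height: int) -> float:
    """Approximate image tokens using the pixel formula quoted above."""
    return width * height / 750


per_frame = anthropic_image_tokens(1280, 720)
naive = 60 * per_frame       # 60 frames sampled at 1fps
optimized = 10 * per_frame   # ~10 keyframes after dedup

# 1K videos/day for 30 days at $3 per 1M input tokens.
naive_monthly = naive * 1_000 * 30 * 3.0 / 1_000_000
optimized_monthly = optimized * 1_000 * 30 * 3.0 / 1_000_000

print(round(per_frame, 1), round(naive), round(optimized))
print(round(naive_monthly, 2), round(optimized_monthly, 2))
```

Unrounded, the token counts are 73,728 and 12,288. The table rounds them to ~73,700 and ~12,300 before pricing, which is why its dollar amounts differ from the unrounded result by a few dollars.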

### GPT-4o Cost Projections (v1 vs v2)
> These are linear extrapolations from the token formula + observed dedup ratios (60 frames → ~10 keyframes). Actual savings vary by content type — talking-head video deduplicates more aggressively than action scenes.

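The dedup ratio depends on how often consecutive frames look alike. The sketch below shows the general shape of consecutive-frame deduplication. It swaps in a plain average-hash signature for the QJL hashing and pixel-diff scene detection that Token0 describes, and `frame_signature`, `dedup_frames`, and the `max_distance` threshold are hypothetical names and values chosen for illustration.

```python
import numpy as np


def frame_signature(frame: np.ndarray, size: int = 8) -> np.ndarray:
    """Average-hash style signature: block means thresholded at their overall mean."""
    h, w = frame.shape[:2]
    gray = frame.reshape(h, w, -1).mean(axis=2)      # collapse color channels
    gray = gray[: h - h % size, : w - w % size]      # crop to a multiple of `size`
    blocks = gray.reshape(size, gray.shape[0] // size,
                          size, gray.shape[1] // size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).ravel()


def dedup_frames(frames, max_distance: int = 5) -> list[int]:
    """Keep a frame only if it differs enough from the last kept frame."""
    kept, last_sig = [], None
    for idx, frame in enumerate(frames):
        sig = frame_signature(frame)
        if last_sig is None or np.count_nonzero(sig != last_sig) > max_distance:
            kept.append(idx)
            last_sig = sig
    return kept


# Two synthetic "scenes", each repeated three times with slight pixel jitter.
scene_a = np.zeros((64, 64), dtype=np.uint8)
scene_a[:, :32] = 200   # bright left half
scene_b = np.zeros((64, 64), dtype=np.uint8)
scene_b[:32, :] = 200   # bright top half

rng = np.random.default_rng(0)


def jitter(frame: np.ndarray) -> np.ndarray:
    noise = rng.integers(-3, 4, frame.shape)
    return np.clip(frame.astype(np.int16) + noise, 0, 255).astype(np.uint8)


frames = [jitter(scene_a) for _ in range(3)] + [jitter(scene_b) for _ in range(3)]
print(dedup_frames(frames))  # [0, 3]: one keyframe per scene
```

A static talking-head clip behaves like one long run of near-identical frames, so most of it collapses into a single keyframe, while fast action keeps crossing the threshold.
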
### GPT-4o Image Cost Projections (v1 vs v2)

Using OpenAI's published token formulas on real images:

@@ -150,11 +193,13 @@ Using OpenAI's published token formulas on real images:
5. **Prompt-aware detail mode** drops simple queries from 1,105 → 85 tokens (92% savings) on GPT-4o.
6. **Model cascade** routes simple tasks at 16.7x cheaper rates with equivalent quality.
7. **Tile-optimized resize** cuts OpenAI costs by 44% on mid-size images (1280x720) with zero quality loss.
8. **On cloud APIs, total savings reach 98.9%** when all optimizations are combined with model cascading.
8. **On cloud APIs, total image savings reach 98.9%** when all optimizations are combined with model cascading.
9. **Video deduplication collapses 60-frame clips to ~10 keyframes** — 4.7-45% savings on local models in the video benchmarks, ~83% projected on GPT-4o.
10. **Model-aware OCR skip is critical** — ultra-efficient encoders like llama3.2-vision use <50 tokens/image; OCR text output would cost more, not less.

### Additional Test Coverage

Token0 includes **103 unit tests** and benchmarks across multiple suites:
Token0 includes **148 unit tests** and benchmarks across multiple suites:

| Suite | Tests | What It Validates |
|---|---|---|
@@ -166,6 +211,8 @@ Token0 includes **103 unit tests** and benchmarks across multiple suites:
| `real` | 5 | Real-world photos, receipts, invoices, screenshots |
| `streaming` | 7 | SSE streaming: format, content, stats, image optimization |
| `litellm` | 10 | LiteLLM hook: passthrough, optimization, OCR, cascade, async |
| `cache` | 23 | QJL fuzzy cache: perceptual hash, JL compression, Hamming distance, fuzzy match |
| `video` | 22 | Frame extraction, QJL dedup, scene detection, CLIP scoring, full pipeline |

---

@@ -234,6 +281,26 @@ response = client.chat.completions.create(
# response.token0.optimizations_applied = ["resize 4000x3000 → 1568x1176", "convert png → jpeg q=85"]
```

### Video Support

Send a video URL or base64-encoded video — Token0 automatically extracts keyframes, deduplicates, and optimizes before forwarding:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What happens in this video?"},
            {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,..."}}
        ]
    }],
    extra_headers={"X-Provider-Key": "sk-..."}
)
# 1,800 raw frames → ~10 keyframes → optimized images → LLM
# response.token0.tokens_saved = 21,250 (~83% on GPT-4o)
```

### Streaming Support

Token0 supports `stream=true` — images are optimized before streaming begins, then tokens flow word-by-word via SSE:
@@ -332,12 +399,16 @@ curl http://localhost:8000/v1/usage
pip install token0[dev]
ollama pull moondream

# Run all suites
# Run all image suites
python -m benchmarks.run --model moondream --suite all

# Run only real-world images
python -m benchmarks.run --model llava:7b --suite real

# Run video benchmarks (requires Ollama + real images in benchmarks/images/real/)
python -m benchmarks.bench_video_models
python -m benchmarks.bench_video_models --model llava:7b --model minicpm-v

# Available suites: images, text, multi, turns, tasks, real, all
# Available models: any Ollama vision model
```
231 changes: 231 additions & 0 deletions benchmarks/bench_fuzzy_cache.py
@@ -0,0 +1,231 @@
"""Benchmark: QJL fuzzy cache vs exact-match-only cache.

Demonstrates token savings from fuzzy cache hits on similar images.
Simulates a real workload: same product/document photographed multiple
times with slight variations (lighting, angle, compression artifacts).

Usage:
    python -m benchmarks.bench_fuzzy_cache
"""

import asyncio
import time

import numpy as np
from PIL import Image

from token0.optimization.cache import (
    _hamming_distance,
    _image_hash,
    _jl_compress,
    clear_fuzzy_index,
    get_cached_response,
    get_fuzzy_index_size,
    make_cache_key,
    set_cached_response,
)
from token0.storage.redis import MemoryCache

# Estimated tokens per image (GPT-4o high detail, ~800x600)
TOKENS_PER_IMAGE = 765
COST_PER_TOKEN = 2.50 / 1_000_000 # GPT-4o input price


def _make_base_image(seed: int, width=800, height=600) -> Image.Image:
    """Create a unique base image (simulates a product photo or document)."""
    rng = np.random.RandomState(seed=seed)
    pixels = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
    return Image.fromarray(pixels)


def _add_variation(base: Image.Image, variation_seed: int, noise_level: int = 15) -> Image.Image:
    """Add slight variation to an image (simulates re-photo, compression, etc.)."""
    pixels = np.array(base)
    rng = np.random.RandomState(seed=variation_seed)
    noise = rng.randint(-noise_level, noise_level + 1, pixels.shape, dtype=np.int16)
    noisy = np.clip(pixels.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    return Image.fromarray(noisy)


async def run_benchmark():
    import token0.storage.redis as redis_mod

    redis_mod._memory_cache.clear()
    redis_mod.pool = MemoryCache()
    clear_fuzzy_index()

    print("=" * 80)
    print(" QJL Fuzzy Cache Benchmark")
    print("=" * 80)

    # --- Setup: create base images and variations ---
    num_unique_images = 20
    variations_per_image = 5  # each base image has 5 slight variations
    prompt = "describe this product image"

    base_images = [_make_base_image(seed=i) for i in range(num_unique_images)]
    variation_images = []
    for i, base in enumerate(base_images):
        for v in range(variations_per_image):
            variation_images.append((i, _add_variation(base, variation_seed=i * 100 + v)))

    total_requests = num_unique_images + len(variation_images)
    print(f"\n Setup: {num_unique_images} unique images, {variations_per_image} variations each")
    print(f" Total requests: {total_requests}")
    print(f" Tokens per image (GPT-4o): {TOKENS_PER_IMAGE}")

    # --- Benchmark 1: Exact-match only ---
    print("\n --- Exact Match Only ---\n")
    redis_mod._memory_cache.clear()
    clear_fuzzy_index()

    exact_hits = 0
    exact_misses = 0
    start = time.time()

    # First pass: cache base images
    for i, base in enumerate(base_images):
        key = make_cache_key(base, prompt, "gpt-4o")
        await set_cached_response(key, {"content": f"response_{i}"})

    # Second pass: query with variations (exact match only)
    for base_idx, var_img in variation_images:
        key = make_cache_key(var_img, prompt, "gpt-4o")
        result = await get_cached_response(key, fuzzy=False)
        if result:
            exact_hits += 1
        else:
            exact_misses += 1

    exact_time = time.time() - start
    exact_tokens_used = exact_misses * TOKENS_PER_IMAGE
    exact_cost = exact_tokens_used * COST_PER_TOKEN

    print(f" Hits: {exact_hits}/{len(variation_images)}")
    print(f" Misses: {exact_misses}/{len(variation_images)}")
    print(f" Tokens used: {exact_tokens_used:,}")
    print(f" Cost: ${exact_cost:.4f}")
    print(f" Time: {exact_time * 1000:.1f}ms")

    # --- Benchmark 2: Fuzzy match (QJL) ---
    print("\n --- QJL Fuzzy Match ---\n")
    redis_mod._memory_cache.clear()
    clear_fuzzy_index()

    fuzzy_hits = 0
    fuzzy_misses = 0
    start = time.time()

    # First pass: cache base images
    for i, base in enumerate(base_images):
        key = make_cache_key(base, prompt, "gpt-4o")
        await set_cached_response(key, {"content": f"response_{i}"})

    # Second pass: query with variations (fuzzy match enabled)
    for base_idx, var_img in variation_images:
        key = make_cache_key(var_img, prompt, "gpt-4o")
        result = await get_cached_response(key, fuzzy=True)
        if result:
            fuzzy_hits += 1
        else:
            fuzzy_misses += 1

    fuzzy_time = time.time() - start
    fuzzy_tokens_used = fuzzy_misses * TOKENS_PER_IMAGE
    fuzzy_cost = fuzzy_tokens_used * COST_PER_TOKEN

    print(f" Hits: {fuzzy_hits}/{len(variation_images)}")
    print(f" Misses: {fuzzy_misses}/{len(variation_images)}")
    print(f" Tokens used: {fuzzy_tokens_used:,}")
    print(f" Cost: ${fuzzy_cost:.4f}")
    print(f" Time: {fuzzy_time * 1000:.1f}ms")
    print(f" Fuzzy index size: {get_fuzzy_index_size()} entries")

    # --- Hamming distance analysis ---
    print("\n --- Hamming Distance Analysis ---\n")
    distances_similar = []
    distances_different = []

    for i, base in enumerate(base_images[:5]):
        base_hash = _image_hash(base)
        base_sig = _jl_compress(base_hash)

        # Similar: variations of same base
        for v in range(variations_per_image):
            var = _add_variation(base, variation_seed=i * 100 + v)
            var_hash = _image_hash(var)
            var_sig = _jl_compress(var_hash)
            distances_similar.append(_hamming_distance(base_sig, var_sig))

        # Different: other base images
        for j in range(5):
            if i == j:
                continue
            other_hash = _image_hash(base_images[j])
            other_sig = _jl_compress(other_hash)
            distances_different.append(_hamming_distance(base_sig, other_sig))

    print(
        f" Similar images: avg={np.mean(distances_similar):.1f}, "
        f"min={min(distances_similar)}, max={max(distances_similar)}"
    )
    print(
        f" Different images: avg={np.mean(distances_different):.1f}, "
        f"min={min(distances_different)}, max={max(distances_different)}"
    )

    # --- Summary ---
    print(f"\n {'=' * 70}")
    print(" SUMMARY")
    print(f" {'=' * 70}")
    print(f" {'':30s} {'Exact':>12s} {'Fuzzy (QJL)':>12s} {'Improvement':>12s}")
    print(f" {'-' * 30} {'-' * 12} {'-' * 12} {'-' * 12}")
    print(
        f" {'Cache hits':30s} {exact_hits:>12d} {fuzzy_hits:>12d} "
        f"{'+' + str(fuzzy_hits - exact_hits):>12s}"
    )
    print(
        f" {'Cache misses':30s} {exact_misses:>12d} {fuzzy_misses:>12d} "
        f"{exact_misses - fuzzy_misses:>12d}"
    )
    print(
        f" {'Tokens used':30s} {exact_tokens_used:>12,} {fuzzy_tokens_used:>12,} "
        f"{exact_tokens_used - fuzzy_tokens_used:>12,}"
    )

    if exact_tokens_used > 0:
        savings_pct = (exact_tokens_used - fuzzy_tokens_used) / exact_tokens_used * 100
        print(f" {'Token savings':30s} {'':>12s} {'':>12s} {savings_pct:>11.1f}%")

    print(
        f" {'Cost (GPT-4o)':30s} ${exact_cost:>11.4f} ${fuzzy_cost:>11.4f} "
        f"${exact_cost - fuzzy_cost:>11.4f}"
    )

    # Scale projections
    print("\n At scale (100K images/day, 20% are variations):")
    daily_variations = 20_000
    exact_miss_rate = exact_misses / len(variation_images)
    fuzzy_miss_rate = fuzzy_misses / len(variation_images)
    daily_exact_tokens = daily_variations * TOKENS_PER_IMAGE * exact_miss_rate
    daily_fuzzy_tokens = daily_variations * TOKENS_PER_IMAGE * fuzzy_miss_rate
    monthly_exact = daily_exact_tokens * 30 * COST_PER_TOKEN
    monthly_fuzzy = daily_fuzzy_tokens * 30 * COST_PER_TOKEN
    print(f" Exact-only monthly cost: ${monthly_exact:,.2f}")
    print(f" Fuzzy cache monthly cost: ${monthly_fuzzy:,.2f}")
    print(f" Monthly savings: ${monthly_exact - monthly_fuzzy:,.2f}")
    print(f" {'=' * 70}")

    # Memory overhead
    sig_bytes = get_fuzzy_index_size() * 16  # 16 bytes per signature
    key_bytes = get_fuzzy_index_size() * 80  # ~80 bytes per cache key string
    print(
        f"\n Memory overhead: {(sig_bytes + key_bytes) / 1024:.1f} KB "
        f"for {get_fuzzy_index_size()} entries "
        f"({sig_bytes} bytes signatures + {key_bytes} bytes keys)"
    )
    print(f" At 1M entries: ~{(1_000_000 * 96) / 1024 / 1024:.1f} MB")


if __name__ == "__main__":
    asyncio.run(run_benchmark())