secedastudios/mensch

Mensch — European Film Incentives Intelligence

A self-updating Rust service that tracks European film production incentives. It crawls 50+ film commission websites, extracts structured data using Claude, stores everything in a SurrealDB knowledge graph with vector embeddings, and exposes a streaming chat interface powered by hybrid RAG.

Ask it things like:

  • "What's the best rebate for a €2.5M feature shooting in Germany?"
  • "Compare Ireland Section 481 vs. UK AVEC for a $10M co-production"
  • "Which European countries have over 30% rebates and no cultural test?"

How it works

50+ European film commission websites
          │
          ▼
  Fetch HTML → follow sub-page links if landing page is thin
          │
          ▼
  Claude extracts structured JSON
  (programs, rebate %, requirements, future changes, org details)
          │
          ├─ If scrape yields nothing → Claude bootstraps from training knowledge
          │                             (marked as unverified, overwritten when scrape improves)
          │
          ▼
  SurrealDB knowledge graph:
  ├─ incentive_program  (rebate %, thresholds, status, HNSW vector index)
  ├─ requirement        (linked to program)
  ├─ future_change      (linked to program)
  ├─ organization       (administered_by graph edge)
  ├─ country            (offers graph edge)
  └─ data_source        (crawl registry, failure tracking, auto-heal state)
          │
          ▼
  After each run:
  ├─ Failed URLs → crawl domain root, ask Claude to find correct page → retry
  ├─ Consecutive failures (10×) → auto-disable source
  └─ Claude suggests new sources not yet in the registry → added automatically
          │
          ▼
  User asks a question via chat
          │
          ▼
  Claude generates a structured SearchPlan
  (semantic query + country filters + program type + budget)
          │
          ├─► Vector similarity search (HNSW cosine, BGE-large-en-v1.5 local embeddings)
          └─► Structured fallback (SurrealQL filters on country, type, rebate %)
                    │
                    ▼
             Graph expansion: fetch requirements,
             future changes, org names per result
                    │
                    ▼
             Claude streams the answer with
             calculated EUR amounts and source citations

Ingestion runs on startup (controlled by INGESTION_ON_STARTUP) and then on a schedule (INGESTION_INTERVAL_HOURS, default 24 hours). Schema and seed data are applied automatically on boot, so the database schema stays in sync with the running code.
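The retrieval step in the diagram can be sketched in plain Rust. This is a minimal illustration, not the service's code: `Program`, `SearchPlan`, `cosine`, and `search` are hypothetical names, a brute-force cosine scan stands in for the HNSW index, and an in-memory filter stands in for the SurrealQL query. It also assumes that the structured fallback runs only when the vector branch returns nothing and that plan filters apply to both branches; the README does not specify either.

```rust
/// Hypothetical, simplified view of a stored incentive program.
#[derive(Debug, Clone, PartialEq)]
pub struct Program {
    pub slug: String,
    pub country: String,     // country code such as "de"
    pub rebate_pct: f32,
    pub embedding: Vec<f32>, // 1024-dim in the service; tiny vectors work here
}

/// Hypothetical subset of the SearchPlan that Claude generates.
pub struct SearchPlan {
    pub query_embedding: Option<Vec<f32>>,
    pub countries: Vec<String>,
    pub min_rebate_pct: Option<f32>,
}

/// Cosine similarity; returns 0.0 for empty, mismatched, or zero vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Structured filters: an empty country list or a missing minimum matches everything.
fn matches_filters(p: &Program, plan: &SearchPlan) -> bool {
    let country_ok = plan.countries.is_empty() || plan.countries.iter().any(|c| c == &p.country);
    let rebate_ok = plan.min_rebate_pct.map_or(true, |min| p.rebate_pct >= min);
    country_ok && rebate_ok
}

/// Vector ranking first; if there is no query embedding or no positive-scoring
/// hit, fall back to the filtered list ordered by rebate percentage.
pub fn search<'a>(programs: &'a [Program], plan: &SearchPlan, k: usize) -> Vec<&'a Program> {
    if let Some(query) = &plan.query_embedding {
        let mut scored: Vec<(f32, &Program)> = programs
            .iter()
            .filter(|p| matches_filters(p, plan))
            .map(|p| (cosine(query, &p.embedding), p))
            .filter(|(score, _)| *score > 0.0)
            .collect();
        scored.sort_by(|a, b| b.0.total_cmp(&a.0));
        if !scored.is_empty() {
            return scored.into_iter().take(k).map(|(_, p)| p).collect();
        }
    }
    let mut hits: Vec<&Program> = programs.iter().filter(|p| matches_filters(p, plan)).collect();
    hits.sort_by(|a, b| b.rebate_pct.total_cmp(&a.rebate_pct));
    hits.truncate(k);
    hits
}
```

In the service, the graph expansion step would then fetch requirements, future changes, and organization names for each returned program before the answer is streamed.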

Requirements

  • Rust 1.85+ (edition 2024)
  • Docker + Docker Compose (for SurrealDB)
  • Anthropic API key — Claude is used for extraction, bootstrap, URL healing, source discovery, and chat
  • cargo-watch (optional, for make dev hot reload): cargo install cargo-watch

Embeddings run locally by default using BAAI/bge-large-en-v1.5 via the candle crate (~1.3 GB, downloaded once to ~/.cache/huggingface), so no embedding API key is required. OpenAI and Voyage AI are also supported through EMBEDDING_PROVIDER and EMBEDDING_API_KEY.

Quick start

# 1. Clone and enter the directory
git clone <repo> mensch && cd mensch

# 2. Copy and fill in the env file
cp .env-example .env
# Required: set ANTHROPIC_API_KEY
# Everything else has sensible defaults

# 3. Start SurrealDB
make services

# 4. Seed the database (schema + initial data sources)
make db-init

# 5. Run
make run
# or with hot reload:
make dev

Open http://localhost:3000. On first boot the app runs ingestion, which takes 5–10 minutes because it crawls all sources and loads the local embedding model for the first time. Subsequent starts are fast (model cached, data already in DB).

Configuration

All configuration is via environment variables (.env in development).

| Variable | Default | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | required | Anthropic API key |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Claude model |
| EMBEDDING_PROVIDER | local | local, openai, voyage, or none |
| EMBEDDING_API_KEY | (empty) | API key for openai or voyage providers |
| SURREAL_URL | localhost:8000 | SurrealDB host:port |
| SURREAL_USER | root | SurrealDB username |
| SURREAL_PASS | root | SurrealDB password |
| SURREAL_NS | mensch | SurrealDB namespace |
| SURREAL_DB | mensch | SurrealDB database name |
| API_KEY | (empty) | Pre-shared key for API/WebSocket auth (leave empty to disable) |
| PORT | 3000 | HTTP port |
| RUST_LOG | info | Log level (debug, info, warn, error) |
| INGESTION_ON_STARTUP | true | Run a full scrape on boot |
| INGESTION_INTERVAL_HOURS | 24 | Hours between scheduled scrape runs |
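As a rough sketch of how these variables and defaults might be loaded, the snippet below reads a subset of them into a typed struct. The `Config` struct and `from_lookup` function are hypothetical names chosen for illustration; only the variable names and default values come from the table above. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical typed view of a subset of the documented variables.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub anthropic_api_key: String,
    pub anthropic_model: String,
    pub embedding_provider: String,
    pub surreal_url: String,
    pub api_key: Option<String>, // None means auth is disabled
    pub port: u16,
    pub ingestion_on_startup: bool,
    pub ingestion_interval: Duration,
}

impl Config {
    /// `lookup` abstracts the environment; pass `|k| std::env::var(k).ok()`
    /// to read the real process environment.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        // Fall back to the documented default when a variable is unset.
        let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

        // The only variable without a default.
        let anthropic_api_key = lookup("ANTHROPIC_API_KEY")
            .filter(|v| !v.is_empty())
            .ok_or("ANTHROPIC_API_KEY is required")?;

        let port = get("PORT", "3000")
            .parse::<u16>()
            .map_err(|e| format!("PORT: {e}"))?;
        let hours = get("INGESTION_INTERVAL_HOURS", "24")
            .parse::<u64>()
            .map_err(|e| format!("INGESTION_INTERVAL_HOURS: {e}"))?;

        Ok(Self {
            anthropic_api_key,
            anthropic_model: get("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
            embedding_provider: get("EMBEDDING_PROVIDER", "local"),
            surreal_url: get("SURREAL_URL", "localhost:8000"),
            // An empty API_KEY is treated the same as an unset one.
            api_key: lookup("API_KEY").filter(|v| !v.is_empty()),
            port,
            // Assumption: only the literal string "true" enables startup ingestion.
            ingestion_on_startup: get("INGESTION_ON_STARTUP", "true") == "true",
            ingestion_interval: Duration::from_secs(hours * 3600),
        })
    }
}
```

Failing fast on a missing ANTHROPIC_API_KEY mirrors the table, where it is the only variable marked as required.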

Ingestion pipeline

Each ingestion run:

  1. Fetch: HTTP GET with browser-like headers; follows sub-page links if the landing page has fewer than 500 characters of content
  2. Extract: Claude reads the text and returns structured JSON (program name, rebate %, requirements, future changes, administering org)
  3. Bootstrap: if scraping yields 0 programs (for example, on a JS-heavy site), Claude populates the record from training knowledge, marked as unverified with a note to verify at the official URL
  4. Embed: BGE-large-en-v1.5 generates a 1024-dim vector for semantic search
  5. Heal: 404 and DNS failures trigger a crawl of the domain root, after which Claude picks the replacement URL and the registry is updated; TLS errors disable the source
  6. Discover: Claude reviews the full source list and suggests new sources not yet tracked; they're added automatically for the next run
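The per-source decisions in the steps above can be sketched as a small state machine. All type and function names here (`FetchError`, `NextStep`, `SourceState`, `needs_bootstrap`) are hypothetical. The 500-character threshold and the limit of 10 consecutive failures come from this README; the counter resetting on success and the handling of other error kinds are assumptions.

```rust
/// Threshold from the Fetch step: shorter pages count as "thin".
pub const THIN_PAGE_CHARS: usize = 500;
/// Consecutive-failure limit from the "How it works" diagram.
pub const MAX_CONSECUTIVE_FAILURES: u32 = 10;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchError {
    NotFound, // HTTP 404
    Dns,
    Tls,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    FollowSubPages, // landing page too thin, crawl linked pages
    Extract,        // hand the text to the extraction step
    Heal,           // crawl the domain root and look for a replacement URL
    Disable,        // stop crawling this source
    RetryNextRun,   // assumption: other errors just wait for the next run
}

#[derive(Debug)]
pub struct SourceState {
    pub consecutive_failures: u32,
    pub enabled: bool,
}

impl SourceState {
    pub fn new() -> Self {
        Self { consecutive_failures: 0, enabled: true }
    }

    /// Decide what to do after one fetch attempt and update failure tracking.
    pub fn on_fetch(&mut self, result: Result<&str, FetchError>) -> NextStep {
        match result {
            Ok(text) => {
                // Assumption: any successful fetch resets the failure streak.
                self.consecutive_failures = 0;
                if text.chars().count() < THIN_PAGE_CHARS {
                    NextStep::FollowSubPages
                } else {
                    NextStep::Extract
                }
            }
            Err(err) => {
                self.consecutive_failures += 1;
                let too_many = self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES;
                if err == FetchError::Tls || too_many {
                    self.enabled = false;
                    return NextStep::Disable;
                }
                match err {
                    FetchError::NotFound | FetchError::Dns => NextStep::Heal,
                    _ => NextStep::RetryNextRun,
                }
            }
        }
    }
}

/// Bootstrap step: use model knowledge when extraction found no programs.
pub fn needs_bootstrap(extracted_programs: usize) -> bool {
    extracted_programs == 0
}
```

Keeping the decision logic free of I/O like this makes it straightforward to unit test separately from the HTTP client and the Claude calls.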

Re-running ingestion is always safe — all writes are upserts.
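To illustrate why upserts make re-runs safe, here is a minimal in-memory sketch. The `Store` and `ProgramRecord` types are hypothetical, and a `HashMap` keyed by slug stands in for the SurrealDB table; the real schema is not shown in this README.

```rust
use std::collections::HashMap;

/// Hypothetical, pared-down program record.
#[derive(Debug, Clone, PartialEq)]
pub struct ProgramRecord {
    pub slug: String,
    pub rebate_pct: f32,
    pub verified: bool, // false for records bootstrapped from model knowledge
}

/// In-memory stand-in for the program table, keyed by slug.
#[derive(Debug, Default)]
pub struct Store {
    programs: HashMap<String, ProgramRecord>,
}

impl Store {
    /// Insert the record, or overwrite the existing one with the same slug.
    pub fn upsert(&mut self, record: ProgramRecord) {
        self.programs.insert(record.slug.clone(), record);
    }

    pub fn len(&self) -> usize {
        self.programs.len()
    }

    pub fn get(&self, slug: &str) -> Option<&ProgramRecord> {
        self.programs.get(slug)
    }
}
```

Writing the same record twice leaves exactly one row, and a later scraped record simply replaces an earlier bootstrapped one with the same slug.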

# Watch ingestion in detail
RUST_LOG=debug make run

Database management

make db-init     # drop + reapply schema + reseed (wipes all extracted data)
make db-seed     # reapply schema + seed without dropping (safe on running DB)
make db-schema   # apply schema changes only

The Makefile reads SURREAL_* env vars from your shell or .env. Defaults match the app's defaults (ns=mensch, db=mensch).

Production deployment

cp .env-example .env   # fill in real keys
make up                # docker compose up -d (app + SurrealDB)
make down              # stop
docker compose logs -f # follow logs

Data is persisted in a named Docker volume (surreal-data). The app container waits for SurrealDB to be healthy before starting.

API

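When API_KEY is set, all /api/* routes require an X-Api-Key: <value> header matching it. The WebSocket accepts the key as a ?key= query parameter, which browser clients must use because the browser WebSocket API cannot set custom request headers. Leaving API_KEY empty disables authentication.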
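The key check itself might look like the following sketch. `key_matches` is a hypothetical helper, not a function from the codebase: `configured` is the value of API_KEY and `presented` is whatever the client sent, either the X-Api-Key header or the ?key= query parameter. The constant-time comparison is an added precaution, not something the README states the service does.

```rust
/// Returns true when the request may proceed.
/// An unset or empty configured key means authentication is disabled.
pub fn key_matches(configured: Option<&str>, presented: Option<&str>) -> bool {
    let expected = match configured {
        None | Some("") => return true, // auth disabled
        Some(key) => key,
    };
    match presented {
        Some(given) => constant_time_eq(given.as_bytes(), expected.as_bytes()),
        None => false,
    }
}

/// Compare two byte strings without stopping at the first mismatch,
/// so timing does not reveal how many leading bytes were correct.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}
```

A handler for an /api/* route would pass the header value as `presented`, while the WebSocket upgrade handler would pass the `key` query parameter.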

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Chat UI |
| GET | /healthcheck | Service health, DB status, system metrics |
| WS | /ws/chat | Streaming chat (WebSocket) |
| GET | /api/programs | List programs (?country=de&min_rebate=20&limit=50) |
| GET | /api/programs/:slug | Single program with requirements and future changes |
| GET | /api/countries | All tracked countries |
| GET | /api/changes | Upcoming/future changes across all programs |
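The query parameters on /api/programs can be modelled roughly as below. This is a sketch under assumptions: `ProgramQuery`, `ProgramSummary`, `parse_query`, and `apply` are hypothetical names, the default limit of 50 is borrowed from the example URL and may differ from the real default, and the hand-rolled parser skips percent-decoding, which a real handler would get from its web framework.

```rust
/// Hypothetical parsed form of ?country=de&min_rebate=20&limit=50.
#[derive(Debug, PartialEq)]
pub struct ProgramQuery {
    pub country: Option<String>,
    pub min_rebate: Option<f32>,
    pub limit: usize,
}

/// Hypothetical row returned by the list endpoint.
#[derive(Debug, PartialEq)]
pub struct ProgramSummary {
    pub slug: &'static str,
    pub country: &'static str,
    pub rebate_pct: f32,
}

/// Parse a raw query string; unknown keys and unparsable values are ignored.
pub fn parse_query(query_string: &str) -> ProgramQuery {
    // Assumed default limit, taken from the example URL.
    let mut query = ProgramQuery { country: None, min_rebate: None, limit: 50 };
    for pair in query_string.trim_start_matches('?').split('&') {
        let Some((key, value)) = pair.split_once('=') else { continue };
        match key {
            "country" => query.country = Some(value.to_ascii_lowercase()),
            "min_rebate" => query.min_rebate = value.parse().ok(),
            "limit" => query.limit = value.parse().unwrap_or(query.limit),
            _ => {}
        }
    }
    query
}

/// Apply the filters in memory; the service would express these in SurrealQL.
pub fn apply<'a>(query: &ProgramQuery, all: &'a [ProgramSummary]) -> Vec<&'a ProgramSummary> {
    all.iter()
        .filter(|p| query.country.as_deref().map_or(true, |c| c == p.country))
        .filter(|p| query.min_rebate.map_or(true, |min| p.rebate_pct >= min))
        .take(query.limit)
        .collect()
}
```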

WebSocket protocol

Client → server:

{ "type": "init", "session_id": null, "project_context": { "budget_usd": 2500000, "film_type": "feature" } }
{ "type": "message", "content": "What rebates are available in Germany?" }

Server → client:

{ "type": "session", "session_id": "abc123" }
{ "type": "sources", "sources": [ ... ] }
{ "type": "token", "content": "The " }
{ "type": "done" }
{ "type": "error", "message": "..." }

project_context is included in every LLM call — set budget_usd, film_type, shoot_country, or shoot_city to get personalised calculations.
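On the client side, the server frames can be folded into a final answer as sketched below. The `ServerFrame` enum mirrors the five message types listed above, but the type names and the `collect` function are hypothetical. The payload of the sources frame is not specified in this README, so it is left out, and JSON decoding (for example with serde) is outside the scope of the sketch.

```rust
/// One variant per server-to-client message type.
#[derive(Debug, Clone, PartialEq)]
pub enum ServerFrame {
    Session { session_id: String },
    Sources, // payload omitted: its shape is not documented here
    Token { content: String },
    Done,
    Error { message: String },
}

/// Accumulated result of one chat turn.
#[derive(Debug, Default, PartialEq)]
pub struct Answer {
    pub session_id: Option<String>,
    pub text: String,
    pub complete: bool, // true once a "done" frame was seen
}

/// Fold a stream of frames into an answer. Stops at the first "done"
/// frame and returns the server's message as Err on an "error" frame.
pub fn collect(frames: impl IntoIterator<Item = ServerFrame>) -> Result<Answer, String> {
    let mut answer = Answer::default();
    for frame in frames {
        match frame {
            ServerFrame::Session { session_id } => answer.session_id = Some(session_id),
            ServerFrame::Sources => {}
            ServerFrame::Token { content } => answer.text.push_str(&content),
            ServerFrame::Done => {
                answer.complete = true;
                return Ok(answer);
            }
            ServerFrame::Error { message } => return Err(message),
        }
    }
    // The stream ended without "done"; return what arrived, marked incomplete.
    Ok(answer)
}
```

A client that wants to resume a conversation could store the returned `session_id` and send it back in its next init message in place of null.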

Development

make test    # cargo test (in-memory SurrealDB, no running instance needed)
make lint    # cargo clippy + fmt check
make watch   # cargo watch with debug logging (requires running SurrealDB)

Stack

| Layer | Technology |
| --- | --- |
| Language | Rust (edition 2024) |
| Web framework | Axum 0.8 |
| Database | SurrealDB 3 (document + graph + vector) |
| LLM | Anthropic Claude (extraction, bootstrap, healing, discovery, chat) |
| Embeddings | BGE-large-en-v1.5 via candle (local, default) · OpenAI · Voyage AI |
| Scraping | reqwest + scraper (HTML → text, sub-page following) |
| Templating | Askama (server-side HTML) |
| Async runtime | Tokio |
| Auth | JWT-ready · pre-shared API key |

About

A film financing incentive research assistant
