Built to survive IKEA.
Phone on a tripod. AI watches. You build. Fully open-source DIY assistant — no cloud, no subscription, no hands required.
You're 45 minutes into an IKEA KALLAX build. You need to turn a cam lock. Your hands are covered in sawdust. You have no idea which direction "clockwise" is anymore.
You could:
- Take off your gloves, unlock your phone, google it, get sawdust on the screen
- Or just say "hey assistant, which way does this go?" and keep building
That's it. That's the whole product.
```
┌─────────────────────────────────────────────────────────────┐
│ Phone (browser) → /ws/intake → LLaVA (Ollama)               │
│                                      ↓                      │
│ Microphone → faster-whisper → Intent Router                 │
│                                      ↓                      │
│                          Session Manager (steps)            │
│                                      ↓                      │
│                          /ws/analysis → Browser             │
└─────────────────────────────────────────────────────────────┘
```
- Your phone camera streams frames over WebSocket to the server
- Your voice is captured via `faster-whisper` with `silero-vad` gating
- Say "hey assistant" — the hotword wakes it
- LLaVA (running locally via Ollama) analyzes the frame + your current step
- Guidance pushes back to your phone instantly
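A rough sketch of that loop in FastAPI terms, with `analyze_frame` and the client registry as stand-ins for the repo's actual internals:

```python
# Sketch only: analyze_frame and the client set are illustrative, not the repo's API.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
analysis_clients: set[WebSocket] = set()

async def analyze_frame(frame: bytes) -> dict:
    # Placeholder: the real pipeline sends the frame + active step to LLaVA via Ollama.
    return {"guidance": "..."}

@app.websocket("/ws/intake")
async def intake(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            frame = await ws.receive_bytes()      # JPEG frame from the phone camera
            guidance = await analyze_frame(frame)
            for client in analysis_clients:       # fan out to every listening browser
                await client.send_json(guidance)
    except WebSocketDisconnect:
        pass

@app.websocket("/ws/analysis")
async def analysis(ws: WebSocket):
    await ws.accept()
    analysis_clients.add(ws)
    try:
        while True:
            await ws.receive_text()               # keepalive; guidance is pushed from intake
    except WebSocketDisconnect:
        analysis_clients.discard(ws)
```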
Zero cloud. Zero subscriptions. Everything runs on your home rig.
All inference runs locally via Ollama. No image ever leaves your network. Not even a ping to an external API.
Uses `silero-vad` instead of `webrtcvad`: significantly more robust in reverberant workshops with background noise (fans, music, power tools). Sensitivity is configurable from 0 (strict) to 3 (permissive); one possible threshold mapping is sketched after the feature list below.
```bash
# Crank sensitivity for a loud workshop
VAD_SENSITIVITY=3 uvicorn main:app
```

- Mic-hot indicator pulses green in the phone browser when voice is actively detected
- Safety banner auto-triggers and pauses the session if LLaVA flags a hazard
- Step navigation entirely by voice: "next step," "go back," "pause"
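One plausible way the 0–3 scale could map onto silero-vad's speech-probability threshold. The mapping values here are assumptions, not the repo's actual constants:

```python
# Assumed mapping from VAD_SENSITIVITY (0-3) to a silero-vad probability threshold.
from silero_vad import load_silero_vad, get_speech_timestamps

# Higher sensitivity = lower threshold = more audio accepted as speech.
THRESHOLDS = {0: 0.9, 1: 0.7, 2: 0.5, 3: 0.3}

model = load_silero_vad()

def detect_speech(wav, sensitivity: int = 2):
    """Return speech segments for a 16 kHz mono audio tensor."""
    return get_speech_timestamps(
        wav, model,
        threshold=THRESHOLDS[sensitivity],
        sampling_rate=16000,
    )
```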
The intelligence lives in packs/. Each pack is one Python file that defines the domain knowledge for a specific task type. Swap packs without restarting the server.
- Python 3.11+
- Ollama installed and running
- A GPU (CPU works, just slow — LLaVA is chunky)
```bash
# 1. Pull the vision model
ollama pull llava:13b

# 2. Clone and install
git clone https://github.com/ninja-otaku/handsfree-ai
cd handsfree-ai
pip install -r requirements.txt

# 3. Configure
cp .env.example .env
# Edit .env: set OLLAMA_HOST, WHISPER_MODEL, etc.

# 4. Run
uvicorn main:app --host 0.0.0.0 --port 8000

# 5. Open on your phone
# http://YOUR_RIG_IP:8000
# Mount phone on tripod. Point at your work. Say "hey assistant".
```

All settings live in `.env`:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama instance URL |
| `VISION_MODEL` | `llava:13b` | Any Ollama multimodal model |
| `WHISPER_MODEL` | `tiny` | `tiny` / `base` / `small` / `medium` |
| `HOTWORD` | `hey assistant` | Wake phrase (case-insensitive) |
| `VAD_SENSITIVITY` | `2` | 0 = strict, 3 = permissive |
| `ACTIVE_PACK` | `ikea` | Pack name (filename without `.py`) |
| `PORT` | `8000` | Server port |
| `TLS_ENABLED` | `false` | Enable HTTPS (required for camera on non-localhost) |
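`config.py` holds the Pydantic settings; here's a minimal sketch of how the table above could map onto a `pydantic-settings` class, assuming field names mirror the variable names (the repo's actual class may differ):

```python
# Sketch only: field names inferred from the table above; config.py may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    ollama_host: str = "http://localhost:11434"
    vision_model: str = "llava:13b"
    whisper_model: str = "tiny"       # tiny / base / small / medium
    hotword: str = "hey assistant"
    vad_sensitivity: int = 2          # 0 = strict, 3 = permissive
    active_pack: str = "ikea"
    port: int = 8000
    tls_enabled: bool = False

settings = Settings()  # environment variables override values from .env
```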
Camera note: Browsers require HTTPS to access `getUserMedia` from non-localhost origins. Set `TLS_ENABLED=true` and provide cert paths, or proxy behind nginx with a self-signed cert for LAN use.
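If you'd rather skip the nginx proxy, uvicorn can terminate TLS itself. A sketch, with placeholder cert paths:

```python
# Sketch: uvicorn serving HTTPS directly so the phone browser grants camera access.
# Cert paths are placeholders; point them at your own (e.g. self-signed) pair.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8000,
        ssl_certfile="certs/lan.crt",
        ssl_keyfile="certs/lan.key",
    )
```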
This is the growth engine of the project.
```
packs/
├── schema.json        # JSON Schema — all packs validated against this
├── base_pack.py       # Abstract base class
├── PACK_TEMPLATE.py   # 58-line annotated template ← start here
├── ikea.py            # IKEA furniture assembly
└── your_domain.py     # ← your contribution
```
A pack is a Python class that does two things:
```python
class Pack(BasePack):
    metadata = {
        "name": "IKEA Assembly",
        "version": "1.0.0",
        "domain": "furniture",
        "description": "...",
        "safety_keywords": ["tip over", "two person", "sharp edge"]
    }

    def system_prompt(self) -> str:
        return """You are an IKEA assembly assistant.
        ...your domain knowledge here...
        """
```

That's it. The framework handles validation, safety interrupts, session state, WebSocket broadcasting, and voice routing automatically.
```bash
curl -X POST http://localhost:8000/packs/activate \
  -H "Content-Type: application/json" \
  -d '{"pack": "ikea"}'
```

(The server side of this call is sketched after the list below.) Pack ideas we'd love to see:

- `3d_printing` — layer adhesion, support removal, bed leveling
- `car_maintenance` — oil changes, brake pads, filter swaps
- `electronics_repair` — soldering guidance, component ID, ESD warnings
- `cooking` — mise en place timing, temperature checks, technique cues
- `woodworking` — joint alignment, grain direction, finishing sequences
- `plumbing` — valve directions, fitting types, leak checks
```bash
# 1. Copy the template
cp packs/PACK_TEMPLATE.py packs/your_domain.py

# 2. Fill in metadata and system_prompt()
#    The template has 6 clearly marked TODOs

# 3. Validate
python -c "from packs.your_domain import Pack; p=Pack(); p.validate(); print('OK')"

# 4. Test live
curl -X POST http://localhost:8000/packs/activate \
  -d '{"pack": "your_domain"}'

# 5. Open a PR
```

The only requirement: your `system_prompt()` must tell LLaVA how to reason about your domain's safety keywords. Everything else is flexible.
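For a concrete picture, here's what a hypothetical `woodworking` pack might look like, following the same shape as `ikea.py`. The keywords and prompt text are illustrative, not shipped with the repo:

```python
# packs/woodworking.py — illustrative example, not part of the repo.
from packs.base_pack import BasePack

class Pack(BasePack):
    metadata = {
        "name": "Woodworking",
        "version": "0.1.0",
        "domain": "woodworking",
        "description": "Joint alignment, grain direction, finishing sequences",
        "safety_keywords": ["kickback", "blade exposed", "no push stick"],
    }

    def system_prompt(self) -> str:
        return """You are a woodworking assistant watching through a camera.
        Track joint alignment and grain direction for the current step.
        If you see a safety keyword condition (kickback risk, exposed blade,
        missing push stick), say so immediately and tell the user to stop.
        """
```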
```
handsfree-ai/
├── main.py                    # FastAPI app, WebSocket hub, voice loop
├── config.py                  # Pydantic settings
├── intake/
│   └── voice_intake.py        # faster-whisper + silero-vad daemon
├── engine/
│   └── session_manager.py     # Step state machine (IDLE/ACTIVE/PAUSED/COMPLETED)
├── providers/
│   └── ollama_vision.py       # LLaVA inference + JSON extraction
├── packs/
│   ├── schema.json
│   ├── base_pack.py
│   ├── PACK_TEMPLATE.py
│   └── ikea.py
└── static/
    └── index.html             # Phone-optimized PWA UI
```
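`engine/session_manager.py` is the step state machine; here is a minimal sketch of the four states wired to the voice commands from earlier, with the transition rules assumed:

```python
# Sketch of the IDLE/ACTIVE/PAUSED/COMPLETED state machine (transitions assumed).
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    ACTIVE = auto()
    PAUSED = auto()
    COMPLETED = auto()

class SessionManager:
    def __init__(self, total_steps: int):
        self.state = State.IDLE
        self.step = 0
        self.total_steps = total_steps

    def handle_command(self, command: str) -> None:
        """Route voice intents like 'next step', 'go back', 'pause'."""
        if command == "next step" and self.state == State.ACTIVE:
            self.step += 1
            if self.step >= self.total_steps:
                self.state = State.COMPLETED
        elif command == "go back" and self.state == State.ACTIVE:
            self.step = max(0, self.step - 1)
        elif command == "pause":
            self.state = State.PAUSED
        elif command == "resume" and self.state == State.PAUSED:
            self.state = State.ACTIVE  # "resume" intent assumed for symmetry
```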
| Layer | Choice | Why |
|---|---|---|
| Vision | LLaVA 13B via Ollama | Free, local, surprisingly capable |
| Speech-to-text | faster-whisper | Up to 4× faster than openai-whisper at the same accuracy |
| VAD | silero-vad | Survives workshops; webrtcvad doesn't |
| Backend | FastAPI + WebSockets | Async pub/sub for multi-client broadcast |
| Frontend | Vanilla JS | No build step — phone browser just works |
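The vision layer ultimately boils down to one call against Ollama's `/api/generate` endpoint with a base64-encoded frame. A sketch; the prompt assembly here is illustrative, not the repo's actual code:

```python
# Sketch: one LLaVA inference round trip via Ollama's HTTP API.
import base64
import httpx

def analyze(frame_jpeg: bytes, system_prompt: str, step_text: str) -> str:
    resp = httpx.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava:13b",
            "prompt": f"{system_prompt}\nCurrent step: {step_text}\nWhat should the user do?",
            "images": [base64.b64encode(frame_jpeg).decode()],  # base64 JPEG frame
            "stream": False,
        },
        timeout=120.0,  # 13B vision models can be slow on CPU
    )
    resp.raise_for_status()
    return resp.json()["response"]
```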
- v1.1 — Bluetooth foot pedal as hardware "next step" trigger
- v1.1 — Pack marketplace index (community-submitted `schema.json` registry)
- v1.2 — Step photo capture — auto-photograph each completed step for your records
- v1.2 — LLaVA model hot-swap without server restart
- v2.0 — Offline STT fallback (Vosk) for fully air-gapped use
MIT. Build whatever you want. A star is appreciated.