handsfree-ai

Built to survive IKEA.

Phone on a tripod. AI watches. You build. Fully open-source DIY assistant — no cloud, no subscription, no hands required.

License: MIT · Python 3.11+ · Ollama


The Problem

You're 45 minutes into an IKEA KALLAX build. You need to flip a cam lock. Your hands are covered in sawdust. You have no idea which direction "clockwise" is anymore.

You could:

  • Take off your gloves, unlock your phone, google it, get sawdust on the screen
  • Or just say "hey assistant, which way does this go?" and keep building

That's it. That's the whole product.


How It Works

┌─────────────────────────────────────────────────────────────┐
│  Phone (browser)  →  /ws/intake  →  LLaVA (Ollama)         │
│                                          ↓                   │
│  Microphone  →  faster-whisper  →  Intent Router            │
│                                          ↓                   │
│                              Session Manager (steps)         │
│                                          ↓                   │
│                              /ws/analysis  →  Browser        │
└─────────────────────────────────────────────────────────────┘
  1. Your phone camera streams frames over WebSocket to the server
  2. Your voice is captured via faster-whisper with silero-vad gating
  3. Say "hey assistant" — the hotword wakes it
  4. LLaVA (running locally via Ollama) analyzes the frame + your current step
  5. Guidance pushes back to your phone instantly

Zero cloud. Zero subscriptions. Everything runs on your home rig.
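
Step 4 is a plain HTTP request to the Ollama instance on your LAN. A rough sketch of what that frame-analysis call could look like, using Ollama's standard /api/generate endpoint (the function, prompt wording, and error handling are illustrative, not the repo's actual provider code):

# sketch: send one camera frame + the current step to the local LLaVA model
import base64
import requests

OLLAMA_HOST = "http://localhost:11434"  # same value as OLLAMA_HOST in .env

def analyze_frame(jpeg_bytes: bytes, current_step: str) -> str:
    payload = {
        "model": "llava:13b",
        "prompt": f"The builder is on this step: {current_step}. "
                  "Describe what you see and what they should do next.",
        "images": [base64.b64encode(jpeg_bytes).decode()],  # Ollama takes base64 images
        "stream": False,
    }
    resp = requests.post(f"{OLLAMA_HOST}/api/generate", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]  # the model's guidance, pushed back to the phone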


Features

Completely Private

All inference runs locally via Ollama. No image ever leaves your network. Not even a ping to an external API.

Built for Noisy Environments

Uses silero-vad instead of webrtcvad — significantly more robust in reverberant workshops with background noise (fans, music, power tools). Configurable sensitivity from 0 (strict) to 3 (permissive).

# Crank sensitivity for a loud workshop
VAD_SENSITIVITY=3 python -m uvicorn main:app
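
Internally the sensitivity setting boils down to a speech-probability threshold for silero-vad. A rough sketch of that mapping with the silero-vad package (the threshold values here are illustrative guesses, not the project's actual numbers):

# sketch: map VAD_SENSITIVITY (0-3) onto a silero-vad speech threshold
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

# 0 = strict (high threshold), 3 = permissive (low threshold) -- illustrative values
THRESHOLDS = {0: 0.85, 1: 0.70, 2: 0.50, 3: 0.35}

model = load_silero_vad()
wav = read_audio("workshop_clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, threshold=THRESHOLDS[3], sampling_rate=16000)
print(f"{len(segments)} speech segments survived the workshop noise")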

Actually Hands-Free

  • Mic hot indicator pulses green on the phone browser when voice is actively detected
  • Safety banner auto-triggers and pauses the session if LLaVA flags a hazard
  • Step navigation entirely by voice: "next step," "go back," "pause"
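
Under the hood, voice navigation only needs to map a transcribed phrase onto a session action. A minimal sketch of that kind of routing (command table and function are illustrative, not the project's actual Intent Router):

# sketch: route a transcribed phrase to a session action
COMMANDS = {
    "next step": "advance",
    "go back": "rewind",
    "pause": "pause",
}

def route(transcript: str) -> str | None:
    text = transcript.lower()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            return action
    return None  # no command found: treat it as a free-form question for LLaVA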

Expandable via Packs

The intelligence lives in packs/. Each pack is one Python file that defines the domain knowledge for a specific task type. Swap packs without restarting the server.


Quickstart

Prerequisites

  • Python 3.11+
  • Ollama installed and running
  • A GPU (CPU works, just slow — LLaVA is chunky)
# 1. Pull the vision model
ollama pull llava:13b

# 2. Clone and install
git clone https://github.com/ninja-otaku/handsfree-ai
cd handsfree-ai
pip install -r requirements.txt

# 3. Configure
cp .env.example .env
# Edit .env: set OLLAMA_HOST, WHISPER_MODEL, etc.

# 4. Run
uvicorn main:app --host 0.0.0.0 --port 8000

# 5. Open on your phone
# http://YOUR_RIG_IP:8000
# Mount phone on tripod. Point at your work. Say "hey assistant".

Configuration

All settings live in .env:

Variable          Default                  Description
OLLAMA_HOST       http://localhost:11434   Ollama instance URL
VISION_MODEL      llava:13b                Any Ollama multimodal model
WHISPER_MODEL     tiny                     tiny / base / small / medium
HOTWORD           hey assistant            Wake phrase (case-insensitive)
VAD_SENSITIVITY   2                        0 = strict, 3 = permissive
ACTIVE_PACK       ikea                     Pack name (filename without .py)
PORT              8000                     Server port
TLS_ENABLED       false                    Enable HTTPS (required for camera on non-localhost)

Camera note: Browsers require HTTPS to access getUserMedia from non-localhost origins. Set TLS_ENABLED=true and provide cert paths, or proxy behind nginx with a self-signed cert for LAN use.
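
config.py holds the Pydantic settings (see Project Structure below), so the table above roughly corresponds to a settings class like this sketch (defaults mirror the table; the class layout itself is an assumption):

# sketch: .env-backed settings, mirroring the table above
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    ollama_host: str = "http://localhost:11434"
    vision_model: str = "llava:13b"
    whisper_model: str = "tiny"
    hotword: str = "hey assistant"
    vad_sensitivity: int = 2
    active_pack: str = "ikea"
    port: int = 8000
    tls_enabled: bool = False

settings = Settings()  # reads OLLAMA_HOST, VISION_MODEL, ... from .env or the environment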


The Packs Architecture

This is the growth engine of the project.

packs/
├── schema.json          # JSON Schema — all packs validated against this
├── base_pack.py         # Abstract base class
├── PACK_TEMPLATE.py     # 58-line annotated template ← start here
├── ikea.py              # IKEA furniture assembly
└── your_domain.py       # ← your contribution

A pack is a Python class that does two things:

class Pack(BasePack):
    metadata = {
        "name": "IKEA Assembly",
        "version": "1.0.0",
        "domain": "furniture",
        "description": "...",
        "safety_keywords": ["tip over", "two person", "sharp edge"]
    }

    def system_prompt(self) -> str:
        return """You are an IKEA assembly assistant.
        ...your domain knowledge here...
        """

That's it. The framework handles validation, safety interrupts, session state, WebSocket broadcasting, and voice routing automatically.
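
Validation is driven by packs/schema.json. A minimal sketch of what BasePack.validate() could amount to, assuming jsonschema is used to check the metadata dict (the real base_pack.py may differ):

# sketch: validate a pack's metadata against packs/schema.json
import json
import pathlib
import jsonschema

class BasePack:
    metadata: dict = {}

    def system_prompt(self) -> str:
        raise NotImplementedError

    def validate(self) -> None:
        schema_path = pathlib.Path(__file__).parent / "schema.json"
        schema = json.loads(schema_path.read_text())
        jsonschema.validate(instance=self.metadata, schema=schema)  # raises on a bad pack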

Activate a pack via API

curl -X POST http://localhost:8000/packs/activate \
  -H "Content-Type: application/json" \
  -d '{"pack": "ikea"}'

Domains that need packs (PRs welcome)

  • 3d_printing — layer adhesion, support removal, bed leveling
  • car_maintenance — oil changes, brake pads, filter swaps
  • electronics_repair — soldering guidance, component ID, ESD warnings
  • cooking — mise en place timing, temperature checks, technique cues
  • woodworking — joint alignment, grain direction, finishing sequences
  • plumbing — valve directions, fitting types, leak checks

Contributing a Pack

# 1. Copy the template
cp packs/PACK_TEMPLATE.py packs/your_domain.py

# 2. Fill in metadata and system_prompt()
# The template has 6 clearly marked TODOs

# 3. Validate
python -c "from packs.your_domain import Pack; p=Pack(); p.validate(); print('OK')"

# 4. Test live
curl -X POST http://localhost:8000/packs/activate \
  -H "Content-Type: application/json" \
  -d '{"pack": "your_domain"}'

# 5. Open a PR

The only requirement: your system_prompt() must tell LLaVA how to reason about your domain's safety keywords. Everything else is flexible.
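
For instance, a hypothetical packs/woodworking.py could look like this (entirely illustrative; only the structure matches the template):

# packs/woodworking.py -- hypothetical example, not in the repo
from packs.base_pack import BasePack

class Pack(BasePack):
    metadata = {
        "name": "Woodworking",
        "version": "0.1.0",
        "domain": "woodworking",
        "description": "Joint alignment, grain direction, finishing sequences",
        "safety_keywords": ["kickback", "blade guard", "push stick"]
    }

    def system_prompt(self) -> str:
        return """You are a woodworking assistant watching a live camera feed.
        Watch for kickback risk, a missing blade guard, or hands where a
        push stick should be; if you see any of these, tell the user to stop
        and pause the session. Otherwise, guide them through the current step."""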


Project Structure

handsfree-ai/
├── main.py                    # FastAPI app, WebSocket hub, voice loop
├── config.py                  # Pydantic settings
├── intake/
│   └── voice_intake.py        # faster-whisper + silero-vad daemon
├── engine/
│   └── session_manager.py     # Step state machine (IDLE/ACTIVE/PAUSED/COMPLETED)
├── providers/
│   └── ollama_vision.py       # LLaVA inference + JSON extraction
├── packs/
│   ├── schema.json
│   ├── base_pack.py
│   ├── PACK_TEMPLATE.py
│   └── ikea.py
└── static/
    └── index.html             # Phone-optimized PWA UI

Tech Stack

Layer            Choice                  Why
Vision           LLaVA 13B via Ollama    Free, local, surprisingly capable
Speech-to-text   faster-whisper          4× faster than openai-whisper, same accuracy
VAD              silero-vad              Survives workshops; webrtcvad doesn't
Backend          FastAPI + WebSockets    Async pub/sub for multi-client broadcast
Frontend         Vanilla JS              No build step; phone browser just works
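
The "async pub/sub for multi-client broadcast" row boils down to a small hub that fans analysis results out to every connected browser. A minimal sketch of that pattern with FastAPI (illustrative; the real hub in main.py also handles /ws/intake and the voice loop):

# sketch: fan analysis results out to every connected /ws/analysis client
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws/analysis")
async def analysis_ws(ws: WebSocket):
    await ws.accept()
    clients.add(ws)
    try:
        while True:
            await ws.receive_text()  # keep the socket open; guidance flows the other way
    except WebSocketDisconnect:
        clients.discard(ws)

async def broadcast(message: str) -> None:
    for ws in list(clients):
        await ws.send_text(message)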

Roadmap

  • v1.1 — Bluetooth footpedal as hardware "next step" trigger
  • v1.1 — Pack marketplace index (community-submitted schema.json registry)
  • v1.2 — Step photo capture — auto-photograph each completed step for your records
  • v1.2 — LLaVA model hot-swap without server restart
  • v2.0 — Offline STT fallback (Vosk) for fully air-gapped use

License

MIT. Build whatever you want. A star is appreciated.
