memenow/voice-layer

VoiceLayer

VoiceLayer is a local-first voice composition layer for Ubuntu desktop workflows. It combines fast dictation, structured text composition, rewrite, and translation into a single daemon, CLI/TUI, and host-injection stack.

Scope

VoiceLayer is designed for:

  • Browser text areas and document editors
  • IDE input surfaces and comment fields
  • Terminal and TUI applications such as tmux, Neovim, Claude Code, and Codex CLI
  • Drafting workflows that need preview and confirmation before insertion

VoiceLayer is not designed as:

  • A traditional IME candidate window
  • A subtitle-only transcriber
  • A browser-only extension
  • A cloud-only voice assistant

Architecture

  • crates/voicelayer-core: shared domain types and injection planning
  • crates/voicelayer-doc-test-utils: dev-only helpers shared by the workspace's repository-wide markdown guard tests
  • crates/voicelayerd: Unix-socket daemon and /v1 control API
  • crates/vl: CLI/TUI entry point and operator tooling
  • crates/vl-desktop: interactive GUI shell that talks to the daemon over the same socket
  • python/voicelayer_orchestrator: JSON-RPC worker protocol and provider orchestration entry point
  • systemd/: user-service templates for the daemon and the optional persistent whisper-server
  • scripts/install.sh: one-shot installer that builds release binaries and seeds ~/.local/bin/, ~/.config/systemd/user/, and ~/.config/voicelayer/
  • docs/: architecture, host strategy, and operations documentation
  • openapi/: local API contract

Current Status

Shipped today:

  • Rust workspace with voicelayer-core, voicelayerd, vl, and vl-desktop
  • /v1 control API over a Unix domain socket with Server-Sent Events at /v1/events/stream
  • One-shot and fixed-duration segmented live dictation (POST /v1/sessions/dictation with segmentation.mode = one_shot | fixed), with per-segment segment_recorded / segment_transcribed events and a concatenated transcript on stop
  • vl dictation foreground-ptt alternate-screen panel with hold-to-record, transcript scrolling, clipboard restore, and tmux / WezTerm / Kitty targets
  • vl-desktop GUI overlay that shares the same socket, session state, and event stream as the CLI
  • Real ASR via whisper.cpp: one-shot whisper-cli plus an optional persistent whisper-server endpoint (with autostart) for warm-model reuse
  • Optional silero-vad pre-pass inside the Python worker that trims non-speech before whisper
  • Optional Xiaomi MiMo-V2.5-ASR backend (CUDA-only, opt-in via provider_id) for multilingual and quality-priority transcription, selectable per-request on /v1/transcriptions and per-session on the dictation pipeline (vl dictation start --provider-id mimo_v2_5_asr, vl record-transcribe --provider-id ..., vl dictation foreground-ptt --provider-id ...); see docs/guides/local-asr-provider.md
  • Real LLM integration via OpenAI-compatible chat completions, with optional llama-server autostart for local endpoints
  • Live Rust↔Python stdio JSON-RPC bridge through the uv-managed project environment
  • systemd user units for voicelayerd and the optional whisper-server, plus scripts/install.sh
  • vl doctor surfaces recorder diagnostics, whisper mode (cli / server / unconfigured), LLM reachability, portal support, and systemd unit state

Not yet implemented (documented and scoped):

  • GNOME portal hotkey binding beyond availability probing
  • AT-SPI writable target discovery
  • Always-on background microphone and mid-utterance partial transcripts
  • VAD-driven segmentation boundaries at the recorder layer (fixed-duration segmentation is shipped; adaptive VAD-driven segmentation is a later stage)
  • .deb packaging

Development

Requirements

  • Rust 1.88+
  • Python 3.12+
  • uv 0.11+
  • Ubuntu with PipeWire

Verification Chain

The authoritative commands every change must pass before merge (also mirrored in CLAUDE.md):

cargo fmt --all
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all
uv sync --group dev
uv run ruff check python tests/python
uv run ruff format --check python tests/python
uv run pytest -q tests/python

Python commands in this repository should always run through uv.

Worker Runtime

The daemon and CLI launch the Python worker from the project-managed environment. Resolution order is:

  1. VOICELAYER_PROJECT_ROOT/.venv/bin/python -m voicelayer_orchestrator.worker
  2. uv run --project <project_root> python -m voicelayer_orchestrator.worker

If you run the daemon outside the repository root, set VOICELAYER_PROJECT_ROOT explicitly. When a local LLM endpoint is configured, vl doctor also probes endpoint reachability through /v1/models. If VOICELAYER_LLM_AUTO_START=true, the worker can also auto-launch llama-server for local endpoints.
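
The resolution order above can be sketched in Python as follows. This is a minimal illustration, not the daemon's actual code: `resolve_worker_command` is a hypothetical helper name, and falling back to the current directory when VOICELAYER_PROJECT_ROOT is unset is an assumption.

```python
import os
from pathlib import Path


def resolve_worker_command(project_root: str) -> list[str]:
    """Return the argv that would start the worker for a given project root."""
    module = "voicelayer_orchestrator.worker"
    venv_python = Path(project_root) / ".venv" / "bin" / "python"
    if venv_python.is_file():
        # Step 1: use the interpreter from the project-managed environment.
        return [str(venv_python), "-m", module]
    # Step 2: let uv resolve the environment for the project.
    return ["uv", "run", "--project", str(project_root), "python", "-m", module]


if __name__ == "__main__":
    # Assumption: default to the current directory when the variable is unset.
    root = os.environ.get("VOICELAYER_PROJECT_ROOT", os.getcwd())
    print(resolve_worker_command(root))
```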

Run the Daemon

cargo run -p vl -- daemon run --project-root "$(pwd)"

By default the daemon listens on:

$XDG_RUNTIME_DIR/voicelayer/daemon.sock
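
A client that needs the same default can derive it from the environment. The sketch below uses a hypothetical helper, `default_socket_path`; raising when XDG_RUNTIME_DIR is unset is an assumption, since the daemon's behavior in that case is not described here.

```python
import os
from pathlib import Path


def default_socket_path(env=None) -> Path:
    """Build $XDG_RUNTIME_DIR/voicelayer/daemon.sock from an environment mapping."""
    env = os.environ if env is None else env
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if not runtime_dir:
        # Assumption: fail loudly rather than guess a fallback directory.
        raise RuntimeError("XDG_RUNTIME_DIR is not set")
    return Path(runtime_dir) / "voicelayer" / "daemon.sock"
```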

Inspect the Environment

cargo run -p vl -- doctor

Inspect Providers

cargo run -p vl -- providers

Configure a Local LLM Endpoint

See docs/guides/local-llm-provider.md for the llama.cpp server path and environment variables.

Configure a Local ASR Provider

See docs/guides/local-asr-provider.md for the whisper.cpp file transcription path, the optional persistent whisper-server deployment, and the optional silero-vad pre-pass.

Launch the Desktop Shell

See docs/guides/desktop.md for vl-desktop usage and the two client-side environment variables (VOICELAYER_VL_BIN, VOICELAYER_LOG).

Install as a systemd User Service

See docs/guides/systemd.md for scripts/install.sh, the voicelayerd unit, and the optional dedicated voicelayer-whisper-server unit (with a Docker drop-in).

Render a Bracketed Paste Payload

cargo run -p vl -- print-bracketed-paste "Analyze the current repository authentication flow."
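
For orientation, bracketed paste wraps the text between the terminal control sequences ESC [ 200 ~ and ESC [ 201 ~. The Python sketch below shows that framing only; `bracketed_paste` is a hypothetical helper, and stripping embedded end markers is an assumption about sensible sanitizing, not a description of what vl prints.

```python
PASTE_START = "\x1b[200~"
PASTE_END = "\x1b[201~"


def bracketed_paste(text: str) -> str:
    """Wrap text in bracketed-paste markers."""
    sanitized = text
    # Assumption: remove embedded end markers so the payload cannot close
    # paste mode early. Loop because one removal can expose another marker.
    while PASTE_END in sanitized:
        sanitized = sanitized.replace(PASTE_END, "")
    return f"{PASTE_START}{sanitized}{PASTE_END}"
```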

Transcribe a Local Audio File

cargo run -p vl -- transcribe-file /path/to/sample.wav --language auto

Record and Transcribe a Short Clip

cargo run -p vl -- record-transcribe --duration-seconds 8 --language auto

The CLI prefers pw-record wrapped in timeout --signal=INT and falls back to arecord. Internally, it reuses the same daemon-side dictation capture flow that the UI and hotkey layer call.
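
The recorder preference can be sketched as a small argv builder. This is an illustration under stated assumptions: `recorder_command` is a hypothetical name, and the 16 kHz mono 16-bit format flags are chosen here to match the WAV input conventionally fed to whisper.cpp, not taken from the VoiceLayer source.

```python
import shutil


def recorder_command(output_path: str, duration_seconds: int, which=shutil.which) -> list[str]:
    """Pick a recorder: pw-record under timeout first, then arecord."""
    if which("pw-record") and which("timeout"):
        # SIGINT lets pw-record stop cleanly once the duration elapses.
        return [
            "timeout", "--signal=INT", str(duration_seconds),
            "pw-record", "--rate", "16000", "--channels", "1", "--format", "s16",
            output_path,
        ]
    if which("arecord"):
        # arecord has its own duration flag, so no timeout wrapper is needed.
        return [
            "arecord", "-d", str(duration_seconds),
            "-f", "S16_LE", "-r", "16000", "-c", "1",
            output_path,
        ]
    raise RuntimeError("neither pw-record nor arecord is available")
```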

The daemon exposes a live dictation session flow:

  • POST /v1/sessions/dictation starts recording
  • POST /v1/sessions/dictation/stop stops recording and returns the transcript
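
Because the control API is served over a Unix domain socket, a plain HTTP client needs a small adapter. The sketch below uses only the Python standard library; `UnixHTTPConnection` and `post_json` are hypothetical names, and it assumes that the endpoints accept and return JSON. Field names inside the bodies are not shown because they are defined by the contract in openapi/.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 10.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def post_json(socket_path: str, path: str, body: dict, timeout: float = 10.0):
    """POST a JSON body and return (status, decoded JSON or None)."""
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    try:
        payload = json.dumps(body).encode("utf-8")
        conn.request("POST", path, body=payload,
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        raw = response.read()
        return response.status, (json.loads(raw) if raw else None)
    finally:
        conn.close()
```

With the daemon running, `post_json(socket_path, "/v1/sessions/dictation", {...})` would start a session; the response shape is whatever the daemon returns.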

The request body's segmentation field selects between one-shot and fixed-duration segmented capture:

  • {"mode": "one_shot"} (default) records a single WAV from start to stop and transcribes it once.
  • {"mode": "fixed", "segment_secs": N} rolls the recorder every N seconds; each finalized chunk is transcribed in the background and the per-segment events surface on /v1/events/stream (dictation.segment_recorded, dictation.segment_transcribed) while stop returns the concatenated transcript.
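
The two segmentation modes and the per-segment events can be exercised from a client roughly as follows. This is a sketch under assumptions: `segmentation_body` and `iter_sse_events` are hypothetical helpers, only the segmentation field is shown in the request body, and whether event names arrive in the SSE event field or inside the data payload is not specified here. The parser follows the generic Server-Sent Events framing (event and data fields, blank line as the dispatch point).

```python
def segmentation_body(mode: str = "one_shot", segment_secs=None) -> dict:
    """Build the segmentation part of a POST /v1/sessions/dictation body."""
    if mode == "one_shot":
        return {"segmentation": {"mode": "one_shot"}}
    if mode == "fixed":
        if segment_secs is None or segment_secs <= 0:
            raise ValueError("fixed segmentation needs a positive segment_secs")
        return {"segmentation": {"mode": "fixed", "segment_secs": segment_secs}}
    raise ValueError(f"unknown segmentation mode: {mode}")


def iter_sse_events(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line dispatches the event collected so far.
            if data:
                yield (event or "message", "\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```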

The vl CLI exercises that control plane directly:

cargo run -p vl -- dictation start --backend pipewire --language auto
cargo run -p vl -- dictation stop <session-id>

The vl dictation foreground-ptt mode uses an alternate-screen status panel instead of streaming JSON on each state transition. The panel shows:

  • current dictation status
  • active session ID
  • last completed session ID
  • last transcript preview
  • last injection result
  • last error
  • recent events

Panel controls:

  • j / k or Up / Down to scroll the full transcript view
  • PageUp / PageDown for larger transcript jumps
  • c to copy the last completed transcript to the system clipboard on demand
  • r to restore the saved text clipboard backup after the tool has overwritten the clipboard
  • i to re-apply the last injection target
  • s to save the last transcript to a timestamped text file
  • d to discard the last transcript from the panel
  • Esc to exit

If you also want a clipboard fallback after each completed dictation:

cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto --copy-on-stop

This writes the finished transcript to the system clipboard before any optional terminal-target injection.

You can also set the default stop behavior at launch, so that a completed dictation is handled without extra keystrokes in the panel:

cargo run -p vl -- dictation foreground-ptt \
  --default-stop-action inject \
  --restore-clipboard-on-exit \
  --save-dir ~/Documents/voice-layer

Available default stop actions are:

  • none
  • copy
  • inject
  • save

VoiceLayer can also persist these defaults in a local config file:

cargo run -p vl -- config path
cargo run -p vl -- config init-defaults
cargo run -p vl -- config show
cargo run -p vl -- config set foreground_ptt.default_stop_action inject

The config file lives at:

~/.config/voicelayer/config.toml

For terminal-focused fallback usage, vl also provides a foreground raw-terminal mode:

cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto

When the terminal reports key release events, this behaves like hold-to-record. When release events are not available, it degrades to:

  • first key press starts dictation
  • second key press stops dictation
  • Esc exits the mode
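
The two behaviors amount to a small state machine. The following Python sketch is illustrative only: `PttController` and the event names "press", "release", and "esc" are hypothetical, and treating repeated presses in hold mode as key repeat is an assumption.

```python
class PttController:
    """Map key events to dictation actions for hold and toggle modes."""

    def __init__(self, has_release_events: bool):
        self.has_release_events = has_release_events
        self.recording = False

    def handle(self, event: str):
        """Return "start", "stop", "exit", or None for a key event."""
        if event == "esc":
            return "exit"
        if event == "press":
            if not self.recording:
                self.recording = True
                return "start"
            if not self.has_release_events:
                # Toggle mode: the second press stops dictation.
                self.recording = False
                return "stop"
            return None  # hold mode: ignore key repeat while held
        if event == "release" and self.has_release_events and self.recording:
            # Hold mode: releasing the key stops dictation.
            self.recording = False
            return "stop"
        return None
```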

If you run the controller inside tmux and want the transcript pasted into another pane:

cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto --tmux-target-pane %2

This uses tmux set-buffer plus tmux paste-buffer -dpr -t <pane>. The controller refuses to paste into the same pane that is currently running foreground-ptt.
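
As a rough sketch, the two tmux invocations and the same-pane guard could be assembled like this. `tmux_paste_commands` is a hypothetical helper; the "--" separator before the transcript and the use of the TMUX_PANE environment variable to identify the controller's own pane are assumptions added for illustration.

```python
import os


def tmux_paste_commands(transcript: str, target_pane: str, current_pane=None) -> list[list[str]]:
    """Build the set-buffer and paste-buffer argv lists for a target pane."""
    if current_pane is None:
        # tmux exports TMUX_PANE inside each pane.
        current_pane = os.environ.get("TMUX_PANE")
    if current_pane is not None and target_pane == current_pane:
        raise ValueError("refusing to paste into the pane running foreground-ptt")
    return [
        # Assumption: "--" guards transcripts that begin with a dash.
        ["tmux", "set-buffer", "--", transcript],
        # -d deletes the buffer after pasting, -p requests bracketed paste,
        # -r keeps line feeds as they are.
        ["tmux", "paste-buffer", "-dpr", "-t", target_pane],
    ]
```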

If you omit --tmux-target-pane while running inside tmux:

  • zero candidate panes: no tmux injection is attempted
  • one candidate pane: it is selected automatically
  • multiple candidate panes: vl prompts you to choose a target pane before entering raw mode
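
The three cases can be written as a short selection function. This is a minimal sketch: `choose_target_pane` and its `prompt` callback are hypothetical, and how vl enumerates candidate panes is not covered.

```python
def choose_target_pane(candidates: list[str], prompt=None):
    """Return the pane to paste into, or None when there is no candidate."""
    if not candidates:
        return None  # zero candidates: no tmux injection is attempted
    if len(candidates) == 1:
        return candidates[0]  # one candidate: selected automatically
    if prompt is None:
        raise ValueError("multiple candidate panes require a prompt callback")
    # Multiple candidates: ask the user before entering raw mode.
    choice = prompt(candidates)
    if choice not in candidates:
        raise ValueError(f"not a candidate pane: {choice!r}")
    return choice
```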

For terminal-specific explicit targets outside tmux:

cargo run -p vl -- dictation foreground-ptt --wezterm-target-pane-id 12
cargo run -p vl -- dictation foreground-ptt --kitty-match 'title:Output'

These routes are explicit-only:

  • WezTerm uses wezterm cli send-text --pane-id
  • Kitty uses kitten @ send-text --match ... --stdin --bracketed-paste auto
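
The two explicit routes can be sketched as argv builders that feed the transcript on standard input. The helper names are hypothetical, and sending the text to wezterm via stdin (rather than as an argument) is an assumption; the Kitty flags are the ones listed above.

```python
import subprocess


def wezterm_command(pane_id: int) -> list[str]:
    """argv for wezterm cli send-text aimed at one pane."""
    return ["wezterm", "cli", "send-text", "--pane-id", str(pane_id)]


def kitty_command(match: str) -> list[str]:
    """argv for kitten @ send-text with the flags listed above."""
    return ["kitten", "@", "send-text", "--match", match,
            "--stdin", "--bracketed-paste", "auto"]


def send_transcript(argv: list[str], transcript: str, run=subprocess.run):
    """Run the adapter command with the transcript on stdin."""
    return run(argv, input=transcript.encode("utf-8"), check=True)
```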

VoiceLayer does not auto-discover WezTerm or Kitty targets yet.

Inspect Global Shortcuts Portal Support

cargo run -p vl -- hotkeys portal-status

This checks whether the current desktop session exposes org.freedesktop.portal.GlobalShortcuts.
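
One way to perform a similar check by hand is to introspect the desktop portal object with gdbus and look for the interface name. The Python sketch below assumes gdbus is installed and that its text output lists interfaces as "interface NAME {"; it does not describe how vl itself probes the portal, and both function names are hypothetical.

```python
import subprocess

PORTAL_INTERFACE = "org.freedesktop.portal.GlobalShortcuts"


def has_global_shortcuts(introspection: str) -> bool:
    """Return True if the introspection text declares the GlobalShortcuts interface."""
    for line in introspection.splitlines():
        if line.split()[:2] == ["interface", PORTAL_INTERFACE]:
            return True
    return False


def probe_portal(run=subprocess.run) -> bool:
    """Introspect the portal object on the session bus and check the interface."""
    result = run(
        ["gdbus", "introspect", "--session",
         "--dest", "org.freedesktop.portal.Desktop",
         "--object-path", "/org/freedesktop/portal/desktop"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and has_global_shortcuts(result.stdout)
```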

Product Defaults

  • Desktop target: Ubuntu GNOME Wayland
  • Local ASR baseline: whisper.cpp
  • Local LLM baseline: Gemma 4 via llama.cpp-compatible deployment
  • GUI insertion priority: AT-SPI, then clipboard, then keyboard simulation fallback
  • Terminal insertion priority: bracketed paste, then terminal-specific adapters
  • Preview surface: CLI/TUI first, GUI preview later

License

The repository is intended to ship under the Apache License 2.0.
