VoiceLayer is a local-first voice composition layer for Ubuntu desktop workflows. It combines fast dictation, structured text composition, rewrite, and translation into a single daemon, CLI/TUI, and host-injection stack.
VoiceLayer is designed for:
- Browser text areas and document editors
- IDE input surfaces and comment fields
- Terminal and TUI applications such as tmux, Neovim, Claude Code, and Codex CLI
- Drafting workflows that need preview and confirmation before insertion
VoiceLayer is not designed as:
- A traditional IME candidate window
- A subtitle-only transcriber
- A browser-only extension
- A cloud-only voice assistant
- `crates/voicelayer-core`: shared domain types and injection planning
- `crates/voicelayer-doc-test-utils`: dev-only helpers shared by the workspace's repository-wide markdown guard tests
- `crates/voicelayerd`: Unix-socket daemon and `/v1` control API
- `crates/vl`: CLI/TUI entry point and operator tooling
- `crates/vl-desktop`: interactive GUI shell that talks to the daemon over the same socket
- `python/voicelayer_orchestrator`: JSON-RPC worker protocol and provider orchestration entry point
- `systemd/`: user-service templates for the daemon and the optional persistent `whisper-server`
- `scripts/install.sh`: one-shot installer that builds release binaries and seeds `~/.local/bin/`, `~/.config/systemd/user/`, and `~/.config/voicelayer/`
- `docs/`: architecture, host strategy, and operations documentation
- `openapi/`: local API contract
Shipped today:
- Rust workspace with `voicelayer-core`, `voicelayerd`, `vl`, and `vl-desktop`
- `/v1` control API over a Unix domain socket with Server-Sent Events at `/v1/events/stream`
- One-shot and fixed-duration segmented live dictation (`POST /v1/sessions/dictation` with `segmentation.mode = one_shot | fixed`), with per-segment `segment_recorded` / `segment_transcribed` events and a concatenated transcript on stop
- `vl dictation foreground-ptt` alternate-screen panel with hold-to-record, transcript scrolling, clipboard restore, and tmux / WezTerm / Kitty targets
- `vl-desktop` GUI overlay that shares the same socket, session state, and event stream as the CLI
- Real ASR via `whisper.cpp`: one-shot `whisper-cli` plus an optional persistent `whisper-server` endpoint (with autostart) for warm-model reuse
- Optional silero-vad pre-pass inside the Python worker that trims non-speech before whisper
- Optional Xiaomi MiMo-V2.5-ASR backend (CUDA-only, opt-in via `provider_id`) for multilingual and quality-priority transcription, selectable per-request on `/v1/transcriptions` and per-session on the dictation pipeline (`vl dictation start --provider-id mimo_v2_5_asr`, `vl record-transcribe --provider-id ...`, `vl dictation foreground-ptt --provider-id ...`); see docs/guides/local-asr-provider.md
- Real LLM integration via OpenAI-compatible chat completions, with optional `llama-server` autostart for local endpoints
- Live Rust↔Python stdio JSON-RPC bridge through the `uv`-managed project environment
- systemd user units for `voicelayerd` and the optional `whisper-server`, plus `scripts/install.sh`
- `vl doctor` surfaces recorder diagnostics, whisper mode (cli/server/unconfigured), LLM reachability, portal support, and systemd unit state
Not yet implemented (documented and scoped):
- GNOME portal hotkey binding beyond availability probing
- AT-SPI writable target discovery
- Always-on background microphone and mid-utterance partial transcripts
- VAD-driven segmentation boundaries at the recorder layer (fixed-duration segmentation is shipped; adaptive VAD-driven segmentation is a later stage)
- `.deb` packaging
- Rust 1.88+
- Python 3.12+
- `uv` 0.11+
- Ubuntu with PipeWire
The authoritative commands every change must pass before merge (also mirrored in CLAUDE.md):
```shell
cargo fmt --all
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all
uv sync --group dev
uv run ruff check python tests/python
uv run ruff format --check python tests/python
uv run pytest -q tests/python
```

Python commands in this repository should always run through uv.
The daemon and CLI launch the Python worker from the project-managed environment. Resolution order is:
1. `VOICELAYER_PROJECT_ROOT/.venv/bin/python -m voicelayer_orchestrator.worker`
2. `uv run --project <project_root> python -m voicelayer_orchestrator.worker`
If you run the daemon outside the repository root, set `VOICELAYER_PROJECT_ROOT` explicitly.
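The fallback between the two launch paths can be sketched in shell. This is illustrative only; `resolve_worker_cmd` is a hypothetical helper, not part of the daemon:

```shell
#!/bin/sh
# Sketch of the worker resolution order described above: prefer the
# project-managed venv interpreter, otherwise fall back to `uv run`.
# `resolve_worker_cmd` is a hypothetical helper for illustration.
resolve_worker_cmd() {
  project_root="$1"
  venv_python="$project_root/.venv/bin/python"
  if [ -x "$venv_python" ]; then
    echo "$venv_python -m voicelayer_orchestrator.worker"
  else
    echo "uv run --project $project_root python -m voicelayer_orchestrator.worker"
  fi
}

resolve_worker_cmd "${VOICELAYER_PROJECT_ROOT:-$PWD}"
```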
When a local LLM endpoint is configured, `vl doctor` also probes endpoint reachability through `/v1/models`.
If `VOICELAYER_LLM_AUTO_START=true`, the worker can also auto-launch `llama-server` for local endpoints.
```shell
cargo run -p vl -- daemon run --project-root "$(pwd)"
```

By default the daemon listens on:

```
$XDG_RUNTIME_DIR/voicelayer/daemon.sock
```
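Since the daemon speaks HTTP over that Unix socket, you can probe it with plain `curl`. A minimal sketch, assuming only the `/v1/events/stream` SSE endpoint mentioned earlier:

```shell
#!/bin/sh
# Probe the daemon over its Unix socket. The socket path follows the
# default shown above; the SSE tap is cut off after two seconds.
SOCK="$XDG_RUNTIME_DIR/voicelayer/daemon.sock"
if [ -S "$SOCK" ]; then
  # curl can speak HTTP over a Unix domain socket directly.
  curl -sN --max-time 2 --unix-socket "$SOCK" http://localhost/v1/events/stream | head -n 5
else
  echo "daemon socket not found at $SOCK"
fi
```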
```shell
cargo run -p vl -- doctor
cargo run -p vl -- providers
```

See docs/guides/local-llm-provider.md for the llama.cpp server path and environment variables.
See docs/guides/local-asr-provider.md for the whisper.cpp file transcription path, the optional persistent `whisper-server` deployment, and the optional silero-vad pre-pass.
See docs/guides/desktop.md for `vl-desktop` usage and the two client-side environment variables (`VOICELAYER_VL_BIN`, `VOICELAYER_LOG`).
See docs/guides/systemd.md for `scripts/install.sh`, the `voicelayerd` unit, and the optional dedicated `voicelayer-whisper-server` unit (with a Docker drop-in).
```shell
cargo run -p vl -- print-bracketed-paste "Analyze the current repository authentication flow."
cargo run -p vl -- transcribe-file /path/to/sample.wav --language auto
cargo run -p vl -- record-transcribe --duration-seconds 8 --language auto
```

The CLI prefers `pw-record` with `timeout --signal=INT` and falls back to `arecord`.
Internally this reuses the same daemon-side dictation capture flow the UI and hotkey layer call.
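The recorder preference can be sketched as a small selection helper. This is illustrative only; `pick_recorder`, the eight-second duration, and the exact recorder flags are placeholders, not the CLI's real implementation:

```shell
#!/bin/sh
# Sketch of the recorder fallback described above: prefer pw-record
# (stopped via SIGINT from timeout), else arecord. Flags are illustrative.
pick_recorder() {
  if command -v pw-record >/dev/null 2>&1; then
    echo "timeout --signal=INT 8 pw-record capture.wav"
  elif command -v arecord >/dev/null 2>&1; then
    echo "arecord -d 8 -f cd capture.wav"
  else
    echo "no supported recorder found"
  fi
}

pick_recorder
```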
The daemon exposes a live dictation session flow:
- `POST /v1/sessions/dictation` starts recording
- `POST /v1/sessions/dictation/stop` stops recording and returns the transcript
The request body's `segmentation` field selects between one-shot and fixed-duration segmented capture:

- `{"mode": "one_shot"}` (default) records a single WAV from start to stop and transcribes it once.
- `{"mode": "fixed", "segment_secs": N}` rolls the recorder every `N` seconds; each finalized chunk is transcribed in the background, and the per-segment events surface on `/v1/events/stream` (`dictation.segment_recorded`, `dictation.segment_transcribed`) while stop returns the concatenated transcript.
The vl CLI exercises that control plane directly:
```shell
cargo run -p vl -- dictation start --backend pipewire --language auto
cargo run -p vl -- dictation stop <session-id>
```

`foreground-ptt` uses an alternate-screen status panel instead of streaming JSON on each transition.
The panel shows:
- current dictation status
- active session ID
- last completed session ID
- last transcript preview
- last injection result
- last error
- recent events
Panel controls:
- `j`/`k` or Up / Down to scroll the full transcript view
- `PageUp`/`PageDown` for larger transcript jumps
- `c` to copy the last completed transcript to the system clipboard on demand
- `r` to restore the saved text clipboard backup after the tool has overwritten the clipboard
- `i` to re-apply the last injection target
- `s` to save the last transcript to a timestamped text file
- `d` to discard the last transcript from the panel
- `Esc` to exit
If you also want a clipboard fallback after each completed dictation:
```shell
cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto --copy-on-stop
```

This writes the finished transcript to the system clipboard before any optional terminal-target injection.
You can change the default stop behavior without leaving the panel:
```shell
cargo run -p vl -- dictation foreground-ptt \
  --default-stop-action inject \
  --restore-clipboard-on-exit \
  --save-dir ~/Documents/voice-layer
```

Available default stop actions are:

- `none`
- `copy`
- `inject`
- `save`
VoiceLayer can also persist these defaults in a local config file:
```shell
cargo run -p vl -- config path
cargo run -p vl -- config init-defaults
cargo run -p vl -- config show
cargo run -p vl -- config set foreground_ptt.default_stop_action inject
```

The config file lives at:

```
~/.config/voicelayer/config.toml
```
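A minimal config matching the `config set` example above might look like this (only the `foreground_ptt.default_stop_action` key is confirmed by the commands in this document; any other keys would be assumptions):

```toml
# ~/.config/voicelayer/config.toml
[foreground_ptt]
default_stop_action = "inject"
```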
For terminal-focused fallback usage, vl also provides a foreground raw-terminal mode:
```shell
cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto
```

When the terminal reports key release events, this behaves like hold-to-record. When release events are not available, it degrades to:

- first key press starts dictation
- second key press stops dictation
- `Esc` exits the mode
If you run the controller inside tmux and want the transcript pasted into another pane:
```shell
cargo run -p vl -- dictation foreground-ptt --backend pipewire --language auto --tmux-target-pane %2
```

This uses `tmux set-buffer` plus `tmux paste-buffer -dpr -t <pane>`.
The controller refuses to paste into the same pane that is currently running foreground-ptt.
If you omit --tmux-target-pane while running inside tmux:
- zero candidate panes: no tmux injection is attempted
- one candidate pane: it is selected automatically
- multiple candidate panes: `vl` prompts you to choose a target pane before entering raw mode
For terminal-specific explicit targets outside tmux:
```shell
cargo run -p vl -- dictation foreground-ptt --wezterm-target-pane-id 12
cargo run -p vl -- dictation foreground-ptt --kitty-match 'title:Output'
```

These routes are explicit-only:

- WezTerm uses `wezterm cli send-text --pane-id`
- Kitty uses `kitten @ send-text --match ... --stdin --bracketed-paste auto`
VoiceLayer does not auto-discover WezTerm or Kitty targets yet.
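Because these routes are explicit-only, a wrapper script has to decide which one applies itself. A sketch (`pick_route` is a hypothetical helper; the pane id and title match are placeholders taken from the examples above):

```shell
#!/bin/sh
# Print which explicit terminal route would be used, based purely on
# which terminal's CLI is installed. Nothing is actually sent.
pick_route() {
  if command -v wezterm >/dev/null 2>&1; then
    echo "wezterm cli send-text --pane-id 12"
  elif command -v kitten >/dev/null 2>&1; then
    echo "kitten @ send-text --match 'title:Output' --stdin --bracketed-paste auto"
  else
    echo "no explicit terminal route available"
  fi
}

pick_route
```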
```shell
cargo run -p vl -- hotkeys portal-status
```

This checks whether the current desktop session exposes `org.freedesktop.portal.GlobalShortcuts`.
- Desktop target: Ubuntu GNOME Wayland
- Local ASR baseline: `whisper.cpp`
- Local LLM baseline: Gemma 4 via a `llama.cpp`-compatible deployment
- GUI insertion priority: AT-SPI, then clipboard, then keyboard simulation fallback
- Terminal insertion priority: bracketed paste, then terminal-specific adapters
- Preview surface: CLI/TUI first, GUI preview later
The repository is intended to ship under the Apache License 2.0.