Voice Agent

License: Apache 2.0 Platform: Android Hardware: Snapdragon

On-device voice assistant for Android. Push-to-talk text mode uses Android SpeechRecognizer; voice-conversation mode feeds raw audio directly into Gemma's multimodal input (fully on-device, no cloud STT). An opt-in vision channel in voice-conversation mode lets the user ask spoken questions about the camera view ("what is this?") — a single CameraX frame is threaded into pass 2 of the agent loop alongside the transcript. On-device LLM is Gemma 4 E4B (default) via LiteRT-LM, running on the device's GPU. TTS uses ElevenLabs cloud streaming (default when an API key is configured) with Android TextToSpeech as an automatic fallback when the cloud path is unreachable. Agentic tool-calling (time, arithmetic, navigation, plus tiered memory tools), durable Room-backed memory store, and a Compose chat UI. The underlying LLM is swappable at runtime via the in-app picker or at build time via engine/ModelProfile.kt.

Status: early. Push-to-talk, text chat, and a hands-free voice-conversation mode (on-device transcription + optional camera input + TTS replies) work end-to-end on a Snapdragon 8 Elite. Barge-in works during both THINKING and SPEAKING: THINKING-phase interrupts cancel immediately; SPEAKING-phase interrupts are filtered by an on-device smart-turn classifier (backchannel vs. real interrupt), with a 2 s sustained-voicing fast-path as a fallback when the model file isn't present. iOS port is deferred.

Screenshots

  • Chat home
  • Tool calling
  • Voice-conversation mode
  • TTS settings dialog
  • Optional features dialog

What it does

  • Voice-conversation mode (hands-free): mic stays open through the assistant's reply; barge-in works mid-TTS, filtered by an on-device smart-turn classifier so backchannels like "uh-huh" don't interrupt.
  • Push-to-talk text mode: tap-to-talk via Android SpeechRecognizer; silent text reply.
  • Optional vision channel: in voice-conversation mode, toggle the camera on and ask spoken questions about the lens view ("what is this?"); a single CameraX frame is threaded into the LLM alongside the transcript.
  • Multilingual + code-switched input: in voice-conversation mode the audio goes directly into Gemma 4's multimodal input, so the model handles the language and accent natively — including switching languages mid-conversation or mixing languages within a single utterance. ML Kit on-device language ID picks the matching TTS voice for each reply.
  • Multilingual TTS: language detected from each reply via on-device ML Kit; voice routed to ElevenLabs (default) with Android TextToSpeech as an automatic fallback.
  • In-app TTS settings: Tune-icon button in the top app bar opens a dialog to enter an ElevenLabs API key, switch backend (ElevenLabs / system), pick an ElevenLabs voice (4 multilingual presets), and pick an ElevenLabs model (eleven_v3 default for expressiveness, or eleven_turbo_v2_5 for low-latency).
  • In-app model downloader: a CloudDownload icon in the top bar opens an optional-features dialog with one-tap downloads for the smart-turn classifier (~8.7 MB) and the sentence encoder (~487 MB); a "Download Gemma 4 E4B" CTA appears on first launch when the model file is missing.
  • Agentic tool-calling: the model can call built-in tools (getCurrentTime, calculate, navigateTo) and memory tools (rememberFact, forgetFact, noteThis, recallFact) mid-turn.
  • Tiered persistent memory: facts (durable, key-indexed), episodes (per-session summaries), and working notes (in-process). Optional on-device sentence encoder enables semantic top-k retrieval; falls back to recency without it.
  • Model picker: switch between Gemma 4 E4B and E2B at runtime from the chat header; choice persists across launches.
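The barge-in behavior described above reduces to a small decision rule: THINKING-phase speech always cancels, SPEAKING-phase speech defers to the smart-turn classifier when its model file is present, and otherwise to the 2 s sustained-voicing fast-path. A minimal sketch in Python (the names `Phase`, `should_interrupt`, and `classifier_verdict` are hypothetical illustrations, not the app's actual API):

```python
from enum import Enum, auto

class Phase(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()

def should_interrupt(phase, voiced_ms=0, classifier_verdict=None):
    """Decide whether detected user speech should cancel the assistant's turn.

    phase              -- current conversation phase
    voiced_ms          -- milliseconds of sustained voicing detected so far
    classifier_verdict -- smart-turn model output: True = real interrupt,
                          False = backchannel, None = model file absent
    """
    if phase is Phase.THINKING:
        return True                       # THINKING-phase interrupts cancel immediately
    if phase is Phase.SPEAKING:
        if classifier_verdict is not None:
            return classifier_verdict     # semantic backchannel filtering
        return voiced_ms >= 2000          # fallback: 2 s sustained-voicing fast-path
    return False                          # LISTENING uses normal endpointing, not barge-in
```

Under this rule, "uh-huh" during SPEAKING (classified as a backchannel) is dropped, while the same utterance during THINKING would still cancel generation.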

Hardware requirements

LLM inference uses LiteRT-LM's GPU/OpenCL backend. That means Qualcomm Snapdragon only for now — the runtime relies on libOpenCL.so from Qualcomm's vendor namespace. Verified device: Samsung Galaxy S26+ (Snapdragon 8 Elite). Other Snapdragon 8 Gen 2 / Gen 3 / Elite devices should work; non-Snapdragon devices (Tensor, Exynos, Dimensity) do not — see CLAUDE.md for the device matrix.

The APK is arm64-v8a only. Non-arm64 devices cannot install it.

The model itself is not bundled in the APK. The app downloads it on first launch (or you can adb push it — see Setup). Default model is Gemma 4 E4B (~3.4 GB on disk); the smaller E2B variant (~2.4 GB) is also supported and selectable from the in-app picker.

Repo layout

voice-agent/
  android/      # Kotlin / Jetpack Compose app
  docs/         # architecture diagrams (SVG)
  LICENSES/     # third-party license texts (fonts, vendored binaries)
  CLAUDE.md     # architecture, constraints, gotchas
  LICENSE       # Apache 2.0
  NOTICE        # third-party attribution
  README.md

Setup

The app needs two things to run end-to-end: an installed APK and a Gemma model file on the device. The model can be downloaded directly in the app (recommended) or pushed via adb (power-user path). ElevenLabs cloud TTS, the smart-turn classifier, and the sentence encoder are all optional — each fails open, degrading gracefully when not present.

Build the APK

Prereqs: Android Studio (bundles the JetBrains Runtime 21 needed by the Gradle daemon), Android SDK 36, a connected Snapdragon-class Android device.

cd android
~/.gradle/wrapper/dists/gradle-8.13-bin/*/gradle-8.13/bin/gradle :app:installDebug

The first build pulls down the Gradle wrapper and dependencies (~5 min on a cold cache). Subsequent incremental builds finish in seconds. See CLAUDE.md for full Gradle command syntax and a note on the JBR path override if your Android Studio isn't at /Applications/Android Studio.app.

Get the LLM model onto the device (required)

Recommended: on first launch the app shows a "Download Gemma 4 E4B (3.4 GB)" prompt with a progress bar — tap Download and wait. A cellular-network warning appears if the active connection is metered.

Alternative (power users): download gemma-4-E4B-it.litertlm from Hugging Face (litert-community/gemma-4-E4B-it-litert-lm) and push it:

adb push gemma-4-E4B-it.litertlm \
  /sdcard/Android/data/io.github.chasedreaminfinity.voiceagent/files/

The app auto-loads the model on launch. E2B (~2.4 GB) is also supported; use the same directory and switch via the in-app model picker.
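Since both variants live in the same files directory, model selection at launch amounts to picking one .litertlm file. A hedged sketch of that logic (the function name and fallback order are hypothetical, not the app's actual loader):

```python
def pick_model_file(files, preferred="gemma-4-E4B-it.litertlm"):
    """Pick which on-device model file to load: the preferred profile when
    present, otherwise any other .litertlm file (e.g. the E2B variant),
    otherwise None (the app then shows the download CTA)."""
    candidates = [f for f in files if f.endswith(".litertlm")]
    if preferred in candidates:
        return preferred
    return candidates[0] if candidates else None
```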

Add an ElevenLabs API key (optional, enables cloud TTS)

Without a key, every TTS turn falls back to Android's built-in TextToSpeech and the app remains fully functional. To enable the higher-quality cloud path:

Recommended (no rebuild): tap the Tune icon → paste your key into the API key field → tap Apply. The key is stored in app-private SharedPreferences (allowBackup=false) and takes effect on the next TTS turn.

Developer path: add a single line to android/local.properties (gitignored) and rebuild:

ELEVENLABS_API_KEY=sk_...

The build emits the key as BuildConfig.ELEVENLABS_API_KEY. If the in-app pref is also set, it takes precedence. The key never leaves the device except as the xi-api-key HTTP header on requests to api.elevenlabs.io.
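The precedence rule is simple: a non-blank in-app preference wins, otherwise the build-time value is used, otherwise the key is empty and the TTS router falls back to system TTS. A one-function sketch (hypothetical helper name, assuming the behavior stated above):

```python
def effective_api_key(pref_key, buildconfig_key):
    """Resolve the ElevenLabs key: in-app preference beats BuildConfig;
    empty string means 'no key' and the ElevenLabs path fails open."""
    if pref_key and pref_key.strip():
        return pref_key.strip()
    return buildconfig_key or ""
```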

Install the smart-turn classifier (optional, semantic backchannel filtering)

Without this file, mid-TTS barge-in still works via a 2-second sustained-voicing fast-path, but short utterances during the assistant's reply are dropped as backchannels regardless of intent.

Recommended: tap the CloudDownload icon in the top bar → Download next to "Smart-turn classifier (~8.7 MB)".

Alternative:

# ~8.7 MB int8 ONNX from pipecat-ai/smart-turn-v3 on Hugging Face
adb push smart-turn-v3.1-cpu.onnx \
  /sdcard/Android/data/io.github.chasedreaminfinity.voiceagent/files/smart_turn/model.onnx

The Whisper mel filterbank (63 KB) needed for preprocessing ships in the APK under assets/smart_turn/; only the model file needs to be pushed.

Install the sentence encoder (optional, enables semantic memory retrieval)

Memory retrieval falls back to recency-only when the encoder isn't present. To enable semantic top-k retrieval over rememberFact entries:

Recommended: tap the CloudDownload icon → Download next to "Sentence encoder (~487 MB)". The downloader fetches model.onnx (~470 MB) and tokenizer.json (~17 MB) from Xenova/paraphrase-multilingual-MiniLM-L12-v2 on Hugging Face (a pre-exported ONNX of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).

Alternative:

# Same files as the in-app downloader (Xenova/paraphrase-multilingual-MiniLM-L12-v2 ONNX).
# The runtime was validated against this specific export; other ONNX exports of the
# same upstream model may differ in op-set or output names and aren't guaranteed to work.
adb push model.onnx \
  /sdcard/Android/data/io.github.chasedreaminfinity.voiceagent/files/embeddings/model.onnx
adb push tokenizer.json \
  /sdcard/Android/data/io.github.chasedreaminfinity.voiceagent/files/embeddings/tokenizer.json

The encoder loads lazily on first use. New rememberFact writes get embedded automatically; existing facts are backfilled in the background once the encoder is loaded.
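The retrieval fallback described above can be sketched in a few lines: with the encoder present, facts are ranked by cosine similarity to the query embedding; without it, by recency. This is an illustrative simplification (plain lists instead of the app's Room-backed store; function and field names are hypothetical):

```python
def recall(query_vec, facts, encoder_available, k=3):
    """facts: list of (text, embedding, timestamp) tuples.
    Returns the top-k fact texts, semantically ranked when possible."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0

    if encoder_available and query_vec is not None:
        ranked = sorted(facts, key=lambda f: cos(query_vec, f[1]), reverse=True)
    else:
        ranked = sorted(facts, key=lambda f: f[2], reverse=True)  # recency fallback
    return [f[0] for f in ranked[:k]]
```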

Using the app

First launch. The app requests microphone and camera permissions up front. Either denial degrades the corresponding feature gracefully (the camera-toggle button hides; the voice modes show an error toast). If the LLM model file isn't on the device yet, the chat home screen shows a "Download Gemma 4 E4B (3.4 GB)" button — tap it to fetch the model in-app, or push it via adb (see Setup above).

Push-to-talk text mode is the default. Tap the mic icon next to the text input, speak, then watch the response render as text. The reply is silent.

Voice-conversation mode is hands-free: tap the voice-mode icon next to the input to enter, then just talk — voice activity detection handles endpointing. The phase pill at the bottom shows Listening / Thinking / Speaking. Tap the same icon again to exit. Barge-in works during both Thinking (immediate cancel) and Speaking (smart-turn-classified — say "stop" to interrupt; backchannels like "yeah" / "uh-huh" don't).

Vision channel. Inside voice-conversation mode, tap the camera icon to toggle a live preview. The next utterance captures a single frame at endpoint and threads it into the LLM alongside the transcript — useful for "what is this?" / "is this a Granny Smith?" questions. A second icon flips between front and back camera.
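The two-pass shape described above — transcribe first, then reason, with the captured frame joining only the second pass — can be sketched as follows. The `llm` callable here is a hypothetical stand-in, not the actual LiteRT-LM API:

```python
def run_turn(llm, audio, frame=None):
    """Two-pass agent turn (sketch). Pass 1 feeds raw audio into the
    multimodal LLM to get a transcript; pass 2 reasons over that transcript,
    with the single CameraX frame (if any) threaded in alongside it."""
    transcript = llm(audio=audio)        # pass 1: on-device transcription
    inputs = {"text": transcript}
    if frame is not None:
        inputs["image"] = frame          # vision channel joins pass 2 only
    reply = llm(**inputs)                # pass 2: answer / tool calls
    return transcript, reply
```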

Switch models. Tap the model name in the chat header to pick between Gemma 4 E4B and E2B. The chat clears (cross-model context isn't safe to carry); the choice persists across launches.

TTS settings. Tap the Tune icon in the top bar to:

  • Enter or update your ElevenLabs API key (stored in app-private storage; no rebuild needed).
  • Switch between ElevenLabs (with auto-fallback to system TTS) and system-only.
  • Pick an ElevenLabs voice from four multilingual presets.
  • Pick the ElevenLabs model (eleven_v3 for expressiveness, ~690 ms TTFB; or eleven_turbo_v2_5 for low-latency, ~350 ms TTFB).

The backend choice is session-scoped (resets to ElevenLabs on every app launch); API key, voice, and model all persist.

Optional features. Tap the CloudDownload icon in the top bar to open the optional-features dialog. From there you can download the smart-turn classifier (~8.7 MB) for semantic backchannel filtering and the sentence encoder (~487 MB) for semantic memory retrieval — both install without leaving the app or restarting.

Tool calls happen mid-conversation. Asking "what time is it" triggers getCurrentTime; "calculate 17% of 240" triggers calculate; "navigate to nearest coffee shop" opens turn-by-turn directions in Maps; saying "remember my favorite food is Thai curry" triggers rememberFact. The next session sees the fact via the memory context block. Tapping the + icon in the header clears the chat and summarizes the session into an episode for future recall.
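The general shape of mid-turn tool calling — the model names a tool and arguments, the agent loop dispatches and feeds the result back — can be sketched with a tiny registry. This is a hedged illustration only: the actual tool-call dispatch is owned by the LiteRT-LM SDK, and the percent-of parser below covers just one form the real arithmetic tool handles:

```python
import re

def calculate(expression):
    """Minimal percent-of parser: '17% of 240' -> 40.8. Illustrative only."""
    m = re.fullmatch(r"\s*([\d.]+)%\s+of\s+([\d.]+)\s*", expression)
    if not m:
        raise ValueError(f"unsupported expression: {expression!r}")
    return float(m.group(1)) * float(m.group(2)) / 100.0

# Hypothetical registry: tool name -> callable taking the model's arguments.
TOOLS = {"calculate": lambda args: calculate(args["expression"])}

def dispatch(name, args):
    """Look up and run a tool; the result is fed back into the turn."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](args)}
```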

Test

cd android
~/.gradle/wrapper/dists/gradle-8.13-bin/*/gradle-8.13/bin/gradle testDebugUnitTest

216 JVM unit tests, no emulator or device required. Coverage areas:

  • AgentEngine ↔ LlmEngine wiring (text, audio two-pass, audio+image two-pass) and WAV header construction (LlmEngineTest).
  • HTTP Range-resume downloader (DownloaderTest): Content-Range parsing, RFC 7233 * total, malformed inputs.
  • TTS sentence-splitting, including clause-level chunking (SentenceExtractorTest).
  • TTS routing layer: per-turn sticky fallback, setOnDone forwarding (RoutingTtsEngineTest).
  • ElevenLabs engine (ElevenLabsTtsEngineTest): blank-key fail-open, JSON body shape including the default eleven_v3 model id, idempotent setVoiceId / setModelId / setApiKey, carry-byte streaming for odd-byte chunked reads.
  • Arithmetic-tool parser and BuiltinTools shape.
  • Memory store tier CRUD, MemoryTools normalization, embedding index + semantic retrieval, JSON→Room migration.

(Tool-call dispatch itself is owned by the LiteRT-LM SDK and is exercised on-device.)
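As an example of what the downloader tests cover, here is a hedged sketch of Content-Range parsing for resumable downloads — RFC 7233 allows the total length to be `*` when unknown, and malformed headers must be rejected rather than trusted. The function name and return shape are illustrative, not the repo's actual Kotlin implementation:

```python
import re

def parse_content_range(header):
    """Parse 'bytes start-end/total' (RFC 7233, section 4.2).
    Returns (start, end, total); total is None for 'bytes a-b/*'.
    Raises ValueError on anything malformed or inconsistent."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", header.strip())
    if not m:
        raise ValueError(f"malformed Content-Range: {header!r}")
    start, end = int(m.group(1)), int(m.group(2))
    if end < start:
        raise ValueError(f"inverted range: {header!r}")
    total = None if m.group(3) == "*" else int(m.group(3))
    if total is not None and end >= total:
        raise ValueError(f"range exceeds total: {header!r}")
    return start, end, total
```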

Architecture

See CLAUDE.md for the full architectural overview, layer responsibilities, and the non-obvious constraints (Kotlin 2.3 requirement, GPU backend selection, 16 KB native-lib alignment, IME layout pattern, TTS threading rules, etc.).

Third-party assets

This repo bundles two open-source fonts under the SIL Open Font License 1.1:

  • Inter by Rasmus Andersson — LICENSES/Inter-OFL.txt
  • Fraunces by Phaedra Charles — LICENSES/Fraunces-OFL.txt

It also vendors one binary into the source tree — libc++_shared.so (LLVM C++ runtime, Apache 2.0 + LLVM exceptions) — required by DJL's tokenizer JNI; see LICENSES/libcxx_shared-NOTICE.md for provenance and replacement instructions.

Runtime dependencies pulled via Gradle (LiteRT-LM, ONNX Runtime, android-vad, Compose, Room, CameraX, kotlinx, DJL tokenizers, ML Kit Language ID, JTransforms) are listed with their respective licenses in NOTICE. The model weights (Gemma 4, smart-turn, sentence encoder) are not bundled in this repo or in the APK — the app's in-app downloader fetches them on demand. See NOTICE for upstream attribution and licenses.

License

Apache License 2.0 — see LICENSE. Third-party attribution lives in NOTICE and per-asset texts under LICENSES/.
