SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization
Paper (PDF) | ACM Digital Library | Project Page | Google Research Blog
Artem Dementyev*, Dimitri Kanevsky, Samuel J. Yang, Mathieu Parvaix, Chiong Lai, Alex Olwal*
Official code release for SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization, published at CHI 2025.
Video 4:24 | Presentation 9:30
*First and last author contributed equally to this work
Mobile speech-to-text apps have a fundamental limitation in group conversations: they transcribe everything into a single undifferentiated stream, making it hard to follow who said what. SpeechCompass addresses this by adding a spatial dimension — using multiple microphones to localize speakers in real time and overlay directional guidance on live captions.
The system is designed with accessibility in mind, particularly for people who are hard of hearing. Rather than relying on machine learning approaches that require video, speaker embeddings, or high compute, SpeechCompass uses classical DSP (GCC-PHAT + kernel density estimation) that runs on a low-power embedded microcontroller with low latency and no voice data retention.
The Android app offers multiple ways to display speaker direction alongside captions:
- Colored text — each speaker gets a distinct color
- Directional arrows and glyphs — indicate where speech is coming from
- Radar minimap — a persistent spatial overview of active speakers
- Edge indicators — subtle screen-edge cues for peripheral awareness
- Speech suppression — filter out speech from a specific direction
- Localization accuracy: 11°–22° average error at normal conversational volume (60–65 dB), comparable to human localization ability
- Diarization: 4-microphone configuration achieves 23–35% relative improvement in Diarization Error Rate (DER) over a 3-microphone setup across varying SNR conditions
A survey of 263 frequent captioning users identified speaker distinction as the most significant unmet need. In a follow-up prototype study with 8 frequent users, colored text and directional arrows were the preferred visualizations, and all participants agreed that directional guidance was valuable for group conversations.
SpeechCompass combines a custom hardware phone case with lightweight on-device processing:
- A 4-microphone phone case sends audio to an STM32 L5 microcontroller, which runs GCC-PHAT localization and streams azimuth angles to the phone over USB
- The Android app uses the phone's built-in microphone for speech recognition (ASR) and receives speaker direction from the case — keeping voice data local and processing costs low
- The DSP algorithms are written in portable C11 and can also run on phones with 2+ built-in microphones, providing 180° localization without additional hardware
| Component | Description |
|---|---|
hardware/README.md |
PCB schematics for the custom 4-microphone phone case |
firmware/README.md |
STM32 L5 firmware (GCC-PHAT localization → USB output) |
dsp/README.md |
Platform-agnostic C localization and beamforming algorithms, with Bazel unit tests |
android/README.md |
Android Studio app (ASR + directional visualization) |
Each component can be used independently — in particular, the DSP algorithms can be built and tested with Bazel without any hardware.
@inproceedings{dementyev2025speechcompass,
title={SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization},
author={Dementyev, Artem and Kanevsky, Dimitri and Yang, Samuel and Parvaix, Mathieu and Lai, Chiong and Olwal, Alex},
booktitle={Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems},
year={2025}
}SpeechCompass builds on LiveLocalizer (UIST 2023), which first demonstrated microphone-array localization augmenting mobile speech-to-text. The same hardware can run the SpeechCompass firmware.
Dementyev, A., Kanevsky, D., Yang, S., Parvaix, M., Lai, C., and Olwal, A. "LiveLocalizer: Augmenting Mobile Speech-to-Text with Microphone Arrays, Optimized Localization and Beamforming." UIST 2023 Adjunct, San Francisco, CA. ACM DL
We thank Sagar Savla, Dmitrii Votintcev, Pascal Getreuer, Richard Lyon, Alex Huang, Shao-Fu Shih, Chet Gnegy, Shaun Kane, James Landay, Malcolm Slaney, Meredith Morris, Carson Lau, Ngan Nguyen, Mei Lu, Don Barnett, Ryan Geraghty, and Sanjay Batra for their contributions and support.
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.


