fix: resample TTS audio to device native rate to eliminate crackling #204
Ded1nk wants to merge 1 commit into
TTS outputs at 22050 Hz but most audio devices run at 44100 Hz (or another native rate). PortAudio's built-in sample rate conversion is low quality and causes audible crackling/distortion artifacts.

Fix: use soxr (high-quality SRC library) to resample audio to the output device's native sample rate before playback.

Also fixes a secondary bug where measure_percentage_spoken() was opening a second concurrent OutputStream filled with zeros alongside the sd.play() call. Having two concurrent output streams caused audio mixing artifacts and scheduling jitter. Replaced with a single callback-driven OutputStream that both plays audio and tracks playback position accurately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
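The single-stream approach described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: `PlaybackTracker` and its attribute names are hypothetical, and the class only implements the callback logic (with the argument order `sounddevice` passes to an `OutputStream` callback) so it can be exercised without audio hardware.

```python
import threading

import numpy as np


class PlaybackTracker:
    """Hypothetical helper: feeds mono audio to an output callback and counts frames."""

    def __init__(self, audio: np.ndarray) -> None:
        self.audio = np.asarray(audio, dtype=np.float32).reshape(-1)
        self.position = 0  # frames handed to the device so far
        self.finished = threading.Event()  # set once the buffer is exhausted

    def callback(self, outdata: np.ndarray, frames: int, time_info, status) -> None:
        """Fill `outdata` (shape: frames x channels) with the next chunk of audio."""
        n = min(frames, len(self.audio) - self.position)
        outdata[:n, 0] = self.audio[self.position : self.position + n]
        outdata[n:] = 0.0  # pad the tail of the last block with silence
        self.position += n
        if n < frames:
            self.finished.set()
```

With `sounddevice`, an instance would presumably be wired up as `sd.OutputStream(samplerate=device_rate, channels=1, callback=tracker.callback)`, where `device_rate` stands for the output device's native rate, and the caller would wait on `tracker.finished` instead of opening a second stream. A real callback would typically also raise `sd.CallbackStop` once the buffer is drained.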
📝 Walkthrough
This PR reworks the audio playback system in
Changes
Audio playback system refactoring
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/glados/audio_io/sounddevice_io.py`:
- Around line 159-160: _playback_position is tracked in device (resampled)
frames while total_samples is from the original TTS buffer, so progress must be
scaled by the sample-rate ratio; update the progress calculation wherever you
compute percent/completion (references: _playback_position, _playback_audio,
_playback_sample_rate, and total_samples) to multiply or divide by
(_playback_sample_rate / original_sample_rate) (or equivalent inverse) so the
units match before computing percentage, and apply the same change in the other
occurrence noted around the block that handles interruptions (the second spot
mentioned near lines 224-226).
📒 Files selected for processing (2)
- pyproject.toml
- src/glados/audio_io/sounddevice_io.py
self._playback_audio = audio
self._playback_sample_rate = sample_rate
Scale playback progress by sample-rate ratio.
_playback_position is counted in resampled device frames, but total_samples still comes from the original TTS buffer. When the rates differ, interruptions report the wrong percentage back to src/glados/core/speech_player.py—e.g. 22.05 kHz → 44.1 kHz will read roughly 100% at half the utterance.
💡 Proposed fix
  progress = getattr(self, "_playback_position", total_samples)
  self._is_playing = False
- percentage_played = min(int(progress / total_samples * 100), 100) if total_samples > 0 else 100
+ if total_samples > 0 and sample_rate and self._playback_sample_rate > 0:
+     played_seconds = progress / self._playback_sample_rate
+     total_seconds = total_samples / sample_rate
+     percentage_played = min(int(played_seconds / total_seconds * 100), 100)
+ else:
+     percentage_played = 100
  return interrupted, percentage_played

Also applies to: 224-226
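To make the unit mismatch concrete, here is a small standalone sketch of the seconds-based calculation. The function name and parameter names are hypothetical and chosen for illustration; only the arithmetic mirrors the proposed fix.

```python
def percentage_played(
    progress_frames: int,
    playback_rate: int,
    total_samples: int,
    source_rate: int,
) -> int:
    """Return playback progress (0-100) with both counts converted to seconds."""
    if total_samples <= 0 or source_rate <= 0 or playback_rate <= 0:
        return 100  # nothing meaningful to measure; treat as fully played
    played_seconds = progress_frames / playback_rate  # device (resampled) frames
    total_seconds = total_samples / source_rate  # original TTS samples
    return min(int(played_seconds / total_seconds * 100), 100)


# One second of 22050 Hz TTS audio, interrupted after 22050 device frames at 44100 Hz.
print(percentage_played(22050, 44100, 22050, 22050))  # 50
```

Dividing the raw counts instead (22050 / 22050) would report 100% at this point, which is the error described in the comment.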
make the fix. Also, can we avoid the new soxr dependency?
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
cat pyproject.toml
Length of output: 5630
Happy to make both changes. Here's the plan:
- Remove `soxr>=0.3.0` from `pyproject.toml`
- Drop the `import soxr` and replace `soxr.resample(...)` with a private `_resample` helper using `np.interp` (already available, no new dependency)
- Fix `measure_percentage_spoken` to convert both progress and total_samples to seconds before computing percentage
Note on resampling quality:
`np.interp` (linear interpolation) is simpler than soxr's HQ sinc filter. For typical TTS use cases, especially integer upsampling ratios like 22 050 → 44 100 Hz, it is usually adequate, although linear interpolation attenuates high frequencies slightly and does not fully suppress imaging above the original Nyquist frequency. If higher quality becomes important later, `scipy.signal.resample_poly` (scipy is a common optional dep) can be swapped in with a one-line change.
🧬 Code Graph Analysis Results
src/glados/audio_io/sounddevice_io.py

Lines 5-12 (relevant to avoiding the new soxr dependency):

import soxr
from loguru import logger
import numpy as np
from numpy.typing import NDArray
import sounddevice as sd  # type: ignore
from . import VAD

Lines ~104-137 (resampling uses soxr, likely what the user wants to remove):

# Resample to device native rate if needed to avoid low-quality SRC in PortAudio
device_rate = int(sd.query_devices(kind="output")["default_samplerate"])
if sample_rate != device_rate:
    audio = soxr.resample(audio, sample_rate, device_rate, quality="HQ")
    sample_rate = device_rate

Lines ~60-84 (VAD chunk sizing; relevant if the "fix" relates to VAD compatibility):

self.input_stream = sd.InputStream(
    samplerate=self.SAMPLE_RATE,
    channels=1,
    callback=audio_callback,
    blocksize=int(self.SAMPLE_RATE * self.VAD_SIZE / 1000),
)

src/glados/audio_io/vad.py
Lines ~41-88 (VAD constraints that must match microphone chunk sizing / sample rate):

def __call__(self, audio_sample: NDArray[np.float32], sample_rate: int = SAMPLE_RATE) -> NDArray[np.float32]:
    """Process a batch of audio samples and return the VAD output."""
    if sample_rate != 16000:
        raise ValueError("silero_vad_16k_op15.onnx only supports 16000 Hz audio.")
    num_samples = 512
    if audio_sample.shape[-1] != num_samples:
        raise ValueError(
            f"Provided number of samples is {audio_sample.shape[-1]} "
            f"(Supported values: 256 for 8000 sample rate, 512 for 16000)"
        )

Lines ~88-116 (context/stateful inference details):

batch_size = audio_sample.shape[0]
context_size = 64 if sample_rate == 16000 else 32
if not self._last_batch_size:
    self.reset_states(batch_size)
if (self._last_sr) and (self._last_sr != sample_rate):
    self.reset_states(batch_size)
if (self._last_batch_size) and (self._last_batch_size != batch_size):
    self.reset_states(batch_size)
if not len(self._context):
    self._context = np.zeros((batch_size, context_size), dtype=np.float32)
audio_sample = np.concatenate([self._context, audio_sample], axis=1)
ort_inputs = {
    "input": audio_sample.astype(np.float32),
    "state": self._state,
    "sr": np.array(sample_rate, dtype=np.int64),
}
ort_outs = self.ort_sess.run(None, ort_inputs)
out, state = ort_outs

src/glados/audio_io/__init__.py
Lines 26-34 (method signatures for reference while adjusting behavior):

def start_listening(self) -> None: ...
def stop_listening(self) -> None: ...
def start_speaking(
    self, audio_data: NDArray[np.float32], sample_rate: int | None = None, text: str = ""
) -> None: ...
def measure_percentage_spoken(self, total_samples: int, sample_rate: int | None = None) -> tuple[bool, int]: ...
def check_if_speaking(self) -> bool: ...
def stop_speaking(self) -> None: ...
def get_sample_queue(self) -> queue.Queue[tuple[NDArray[np.float32], bool]]: ...
@Ded1nk Thanks for the PR. Can you modify this to not add a new dependency? A strong goal of the project is minimal dependencies, and this adds soxr.

indeed it does. this will be my first ever commit to a public project, lets see if i can do it. big thanks for helping me on my journey helping you. :)
Problem

The TTS model (`glados.onnx`) outputs audio at 22050 Hz, but most audio devices default to 44100 Hz (or another native rate). When `sounddevice` opens a stream at 22050 Hz on a 44100 Hz device, PortAudio performs internal sample rate conversion, and its built-in SRC is low quality, producing audible crackling and distortion artifacts.

Fix

Use `soxr` to resample TTS audio to the output device's native sample rate before playback, bypassing PortAudio's SRC entirely.

`soxr` is a well-established, high-quality SRC library (used in FFmpeg, librosa, etc.) and is available on all platforms.

Secondary fix: dual-stream monitoring bug

The original `measure_percentage_spoken()` called `sd.play()` to start audio and then opened a second concurrent `sd.OutputStream` with a callback that filled output with zeros, just to count frames for progress tracking. On macOS CoreAudio (and likely other backends), having two concurrent output streams caused audio mixing artifacts and scheduling jitter.

Fixed by: using a single callback-driven `sd.OutputStream` that both feeds audio data AND increments `_playback_position`. `measure_percentage_spoken()` now simply polls a `threading.Event` set when the stream finishes.

Testing

Verified on macOS with headphones defaulting to 44100 Hz. Audio went from clearly distorted/crackling to crystal clear after this change.

Dependencies

Added `soxr>=0.3.0` to `pyproject.toml` dependencies.