fix: resample TTS audio to device native rate to eliminate crackling#204

Open
Ded1nk wants to merge 1 commit into
dnhkng:mainfrom
Ded1nk:fix/audio-crackle-samplerate

@Ded1nk Ded1nk commented May 27, 2026

Problem

The TTS model (glados.onnx) outputs audio at 22050 Hz, but most audio devices default to 44100 Hz (or another native rate). When sounddevice opens a stream at 22050 Hz on a 44100 Hz device, PortAudio performs internal sample rate conversion — and its built-in SRC is low quality, producing audible crackling and distortion artifacts.

Fix

Use soxr to resample TTS audio to the output device's native sample rate before playback, bypassing PortAudio's SRC entirely:

device_rate = int(sd.query_devices(kind="output")["default_samplerate"])
if sample_rate != device_rate:
    audio = soxr.resample(audio, sample_rate, device_rate, quality="HQ")
    sample_rate = device_rate

soxr is a well-established, high-quality SRC library (used in FFmpeg, librosa, etc.) and is available on all platforms.

Secondary fix: dual-stream monitoring bug

The original measure_percentage_spoken() called sd.play() to start audio and then opened a second concurrent sd.OutputStream with a callback that filled output with zeros — just to count frames for progress tracking. On macOS CoreAudio (and likely other backends), having two concurrent output streams caused audio mixing artifacts and scheduling jitter.

The fix uses a single callback-driven sd.OutputStream that both feeds audio data and increments _playback_position. measure_percentage_spoken() now simply polls a threading.Event that is set when the stream finishes.
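The single-stream idea can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the PlaybackState class and its attribute names are hypothetical, and the callback only mirrors the (outdata, frames, time, status) shape that sd.OutputStream expects, so it can be exercised without opening a real device.

```python
import threading

import numpy as np


class PlaybackState:
    """Hypothetical holder for the state one output stream needs."""

    def __init__(self, audio: np.ndarray) -> None:
        self.audio = audio.astype(np.float32).reshape(-1)  # mono float32 buffer
        self.position = 0               # frames already handed to the device
        self.done = threading.Event()   # set once the buffer is exhausted

    def callback(self, outdata, frames, time_info, status) -> None:
        """Fill one output block and advance the playback position."""
        chunk = self.audio[self.position:self.position + frames]
        outdata[:len(chunk), 0] = chunk
        outdata[len(chunk):, 0] = 0.0   # pad the final block with silence
        self.position += len(chunk)
        if self.position >= len(self.audio):
            # A real stream callback would typically also raise
            # sd.CallbackStop here so PortAudio stops the stream.
            self.done.set()
```

In real use, something like sd.OutputStream(samplerate=rate, channels=1, callback=state.callback) would drive this callback, and the caller could wait on state.done with a timeout instead of opening a second stream just to count frames.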

Testing

Verified on macOS with headphones defaulting to 44100 Hz. Audio went from clearly distorted/crackling to crystal clear after this change.

Dependencies

Added soxr>=0.3.0 to pyproject.toml dependencies.

Summary by CodeRabbit

  • Chores

    • Added audio processing library to project dependencies
  • Refactor

    • Enhanced audio playback with automatic sample rate conversion to match system output device specifications
    • Improved playback state tracking and completion detection for more reliable audio behavior

Review Change Stack

TTS outputs at 22050Hz but most audio devices run at 44100Hz (or
another native rate). PortAudio's built-in sample rate conversion is
low quality and causes audible crackling/distortion artifacts.

Fix: use soxr (high-quality SRC library) to resample audio to the
output device's native sample rate before playback.

Also fixes a secondary bug where measure_percentage_spoken() was
opening a second concurrent OutputStream filled with zeros alongside
the sd.play() call. Having two concurrent output streams caused audio
mixing artifacts and scheduling jitter. Replaced with a single
callback-driven OutputStream that both plays audio and tracks
playback position accurately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai Bot commented May 27, 2026

Contributor
Walkthrough

This PR reworks the audio playback system in SoundDeviceAudioIO to use soxr for high-quality resampling and replaces simple sd.play() calls with a custom sd.OutputStream callback and explicit state-based completion tracking.

Changes

Audio playback system refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Dependency setup for resampling | pyproject.toml, src/glados/audio_io/sounddevice_io.py | soxr>=0.3.0 is added as a runtime dependency and imported for high-quality audio resampling. |
| Playback state infrastructure | src/glados/audio_io/sounddevice_io.py | New state fields track the active output stream (_output_stream), buffered audio data (_playback_audio), playback position (_playback_position), and completion signal (_playback_done). |
| Audio processing and stream initialization | src/glados/audio_io/sounddevice_io.py | start_speaking now converts input to mono float32, conditionally resamples to the device's native rate using soxr (HQ quality), buffers the audio with position tracking, and initializes an sd.OutputStream with a callback that writes audio chunks and signals completion. |
| Playback progress and completion monitoring | src/glados/audio_io/sounddevice_io.py | measure_percentage_spoken switches from a callback-based progress counter to polling _is_playing and _playback_done, then computes percentage from the final buffered playback position. |
| Playback stream termination | src/glados/audio_io/sounddevice_io.py | stop_speaking explicitly stops and closes the active _output_stream and sets completion flags, replacing the prior sd.stop()-based interruption. |

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A stream of audio doth flow so true,
With soxr's magic, resampled anew,
No more callbacks left hanging about,
Just buffers and states, completion rings out!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: resampling TTS audio to match device native sample rate to fix crackling issues. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/glados/audio_io/sounddevice_io.py`:
- Around line 159-160: _playback_position is tracked in device (resampled)
frames while total_samples is from the original TTS buffer, so progress must be
scaled by the sample-rate ratio; update the progress calculation wherever you
compute percent/completion (references: _playback_position, _playback_audio,
_playback_sample_rate, and total_samples) to multiply or divide by
(_playback_sample_rate / original_sample_rate) (or equivalent inverse) so the
units match before computing percentage, and apply the same change in the other
occurrence noted around the block that handles interruptions (the second spot
mentioned near lines 224-226).
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88dc14cc-e4a3-4e19-b3e4-ff6a32d1bfc9

📥 Commits

Reviewing files that changed from the base of the PR and between 878b144 and 187c272.

📒 Files selected for processing (2)
  • pyproject.toml
  • src/glados/audio_io/sounddevice_io.py

Comment on lines +159 to +160
self._playback_audio = audio
self._playback_sample_rate = sample_rate

@coderabbitai coderabbitai Bot May 27, 2026

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scale playback progress by sample-rate ratio.

_playback_position is counted in resampled device frames, but total_samples still comes from the original TTS buffer. When the rates differ, interruptions report the wrong percentage back to src/glados/core/speech_player.py—e.g. 22.05 kHz → 44.1 kHz will read roughly 100% at half the utterance.

💡 Proposed fix
         progress = getattr(self, "_playback_position", total_samples)
         self._is_playing = False
-        percentage_played = min(int(progress / total_samples * 100), 100) if total_samples > 0 else 100
+        if total_samples > 0 and sample_rate and self._playback_sample_rate > 0:
+            played_seconds = progress / self._playback_sample_rate
+            total_seconds = total_samples / sample_rate
+            percentage_played = min(int(played_seconds / total_seconds * 100), 100)
+        else:
+            percentage_played = 100
         return interrupted, percentage_played

Also applies to: 224-226
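To make the unit mismatch concrete, here is a small sketch of the seconds-based calculation the proposed fix describes. The function name percentage_played and its parameter names are illustrative assumptions, not identifiers from the PR; only the arithmetic follows the diff above.

```python
def percentage_played(
    position_frames: int,   # frames played, counted at the device (resampled) rate
    playback_rate: int,     # device sample rate in Hz
    total_samples: int,     # length of the original TTS buffer
    source_rate: int,       # original TTS sample rate in Hz
) -> int:
    """Return playback progress as an integer percentage in [0, 100]."""
    if total_samples <= 0 or source_rate <= 0 or playback_rate <= 0:
        return 100
    # Convert both quantities to seconds so the units match.
    played_seconds = position_frames / playback_rate
    total_seconds = total_samples / source_rate
    return min(int(played_seconds / total_seconds * 100), 100)


# One second of 22050 Hz TTS audio, played on a 44100 Hz device.
# After 22050 device frames only half a second has been played.
naive = min(int(22050 / 22050 * 100), 100)             # 100, the reported bug
fixed = percentage_played(22050, 44100, 22050, 22050)  # 50
```

Comparing durations rather than frame counts keeps the result correct for any rate pair, not only the 2x case.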

Owner

make the fix. Also, can we avoid the new soxr dependency?

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat pyproject.toml

Length of output: 5630



Happy to make both changes. Here's the plan:

  • Remove soxr>=0.3.0 from pyproject.toml
  • Drop the import soxr and replace soxr.resample(...) with a private _resample helper using np.interp (already available — no new dependency)
  • Fix measure_percentage_spoken to convert both progress and total_samples to seconds before computing percentage

Note on resampling quality: np.interp (linear interpolation) is simpler than soxr's HQ sinc filter. For typical TTS use-cases — especially integer upsampling ratios like 22 050 → 44 100 Hz — it is audibly transparent. If higher quality becomes important later, scipy.signal.resample_poly (scipy is a common optional dep) can be swapped in with a one-line change.


🧬 Code Graph Analysis Results

src/glados/audio_io/sounddevice_io.py

Lines 5-12 (relevant to avoiding the new soxr dependency):

import soxr

from loguru import logger
import numpy as np
from numpy.typing import NDArray
import sounddevice as sd  # type: ignore

from . import VAD

Lines ~104-137 (resampling uses soxr, likely what the user wants to remove):

        # Resample to device native rate if needed to avoid low-quality SRC in PortAudio
        device_rate = int(sd.query_devices(kind="output")["default_samplerate"])
        if sample_rate != device_rate:
            audio = soxr.resample(audio, sample_rate, device_rate, quality="HQ")
            sample_rate = device_rate

Lines ~60-84 (VAD chunk sizing; relevant if the “fix” relates to VAD compatibility):

            self.input_stream = sd.InputStream(
                samplerate=self.SAMPLE_RATE,
                channels=1,
                callback=audio_callback,
                blocksize=int(self.SAMPLE_RATE * self.VAD_SIZE / 1000),
            )

src/glados/audio_io/vad.py

Lines ~41-88 (VAD constraints that must match microphone chunk sizing / sample rate):

    def __call__(self, audio_sample: NDArray[np.float32], sample_rate: int = SAMPLE_RATE) -> NDArray[np.float32]:
        """Process a batch of audio samples and return the VAD output."""
        if sample_rate != 16000:
            raise ValueError("silero_vad_16k_op15.onnx only supports 16000 Hz audio.")
        num_samples = 512

        if audio_sample.shape[-1] != num_samples:
            raise ValueError(
                f"Provided number of samples is {audio_sample.shape[-1]} "
                f"(Supported values: 256 for 8000 sample rate, 512 for 16000)"
            )

Lines ~88-116 (context/stateful inference details):

        batch_size = audio_sample.shape[0]
        context_size = 64 if sample_rate == 16000 else 32

        if not self._last_batch_size:
            self.reset_states(batch_size)
        if (self._last_sr) and (self._last_sr != sample_rate):
            self.reset_states(batch_size)
        if (self._last_batch_size) and (self._last_batch_size != batch_size):
            self.reset_states(batch_size)

        if not len(self._context):
            self._context = np.zeros((batch_size, context_size), dtype=np.float32)

        audio_sample = np.concatenate([self._context, audio_sample], axis=1)

        ort_inputs = {
            "input": audio_sample.astype(np.float32),
            "state": self._state,
            "sr": np.array(sample_rate, dtype=np.int64),
        }
        ort_outs = self.ort_sess.run(None, ort_inputs)
        out, state = ort_outs

src/glados/audio_io/__init__.py

Lines 26-34 (method signatures for reference while adjusting behavior):

def start_listening(self) -> None: ...
def stop_listening(self) -> None: ...
def start_speaking(
        self, audio_data: NDArray[np.float32], sample_rate: int | None = None, text: str = ""
    ) -> None: ...
def measure_percentage_spoken(self, total_samples: int, sample_rate: int | None = None) -> tuple[bool, int]: ...
def check_if_speaking(self) -> bool: ...
def stop_speaking(self) -> None: ...
def get_sample_queue(self) -> queue.Queue[tuple[NDArray[np.float32], bool]]: ...

dnhkng commented May 31, 2026

Owner

@Ded1nk Thanks for the PR.

Can you modify this to not add a new dependency? A strong goal of the project is minimal dependencies, and this adds soxr.

Ded1nk commented May 31, 2026

Author

Indeed it does. This will be my first ever commit to a public project, so let's see if I can do it. Big thanks for helping me on my journey of helping you. :)