Agentic TTS: speech scripting pass, mood-aware voice selection, and optional narration #33

glitchbunny0 · 2026-05-09T15:09:58Z

After getting the TTS pipeline merged (#29), I've been thinking about extending the agentic approach into the audio realm. The app already has mood detection via the Director pass (active_moods), and the TTS layer already structures speech as SpeakableChunk with an emotion field per chunk — so there's a foundation to build on.

The general idea: treat speech generation as a first-class agentic pass, not just a post-hoc synthesis step. This would be entirely opt-in and out of the way, like the existing TTS toggle.

1. Speech scripting agent pass

Instead of pure regex extraction, add an optional agent pass that writes a speech script tailored to the TTS model. This pass would:

- Extract dialogue with proper emotional annotations ([laugh], [sigh], [whisper], etc.)
- Add non-verbal cues based on context (the Director's mood data could feed into this)
- Tag prosody hints that backends like ElevenLabs or Fish Speech can actually use

The regex extractor stays as the default (zero-latency, zero-cost). The agent pass would be opt-in, like the Writer/Editor passes.
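A minimal sketch of how the opt-in pass could sit next to the regex default. Everything here is an assumption for illustration: the `SpeakableChunk` fields are taken from the names mentioned in this thread, the body of `regex_extract` is a stand-in for the real extractor, and `build_script` plus the `agent` callable are hypothetical names. Falling back to regex when the agent fails is also my assumption, not something the app does today.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SpeakableChunk:
    # Field names follow the thread; the real class may differ.
    text: str
    emotion: Optional[str] = None
    pause_before_ms: int = 0


QUOTE_RE = re.compile(r'"([^"]+)"')


def regex_extract(message: str) -> list[SpeakableChunk]:
    """Stand-in for the existing extractor: pull out double-quoted dialogue."""
    return [SpeakableChunk(text=m.group(1)) for m in QUOTE_RE.finditer(message)]


# Hypothetical agent signature: (message text, active moods) -> scripted chunks.
ScriptAgent = Callable[[str, list[str]], list[SpeakableChunk]]


def build_script(
    message: str,
    active_moods: list[str],
    agent: Optional[ScriptAgent] = None,
) -> list[SpeakableChunk]:
    """Use the agent pass only when one is configured; otherwise use regex."""
    if agent is None:
        return regex_extract(message)  # default path: no extra latency or cost
    try:
        chunks = agent(message, active_moods)
    except Exception:
        # Assumed policy: a failing agent should never block speech.
        return regex_extract(message)
    return chunks or regex_extract(message)
```

With `agent=None` the behaviour is identical to today's regex path, so the toggle stays out of the way for anyone who does not enable it.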

2. Mood-aware voice selection

Some backends (ElevenLabs, Fish Speech) support voice cloning or multiple voices. If the Director detects a mood shift, the TTS layer could pick a voice variant that matches — softer voice for tender scenes, sharper for tense ones. The SpeakableChunk.emotion field is already there; this would connect it to voice selection.
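Roughly, the lookup could be as small as the sketch below. The voice IDs, the mood names, and the `select_voice` helper are all hypothetical; real voice IDs would come from whichever backend is configured, and I'm assuming the chunk's own emotion should win over the scene-level moods.

```python
from typing import Optional

# Hypothetical mood -> voice variant table; real IDs are backend-specific.
VOICE_VARIANTS = {
    "tender": "voice_soft",
    "tense": "voice_sharp",
}
DEFAULT_VOICE = "voice_neutral"


def select_voice(
    chunk_emotion: Optional[str],
    active_moods: list[str],
    variants: dict[str, str] = VOICE_VARIANTS,
    default: str = DEFAULT_VOICE,
) -> str:
    """Pick a voice variant: chunk emotion first, then the Director's moods."""
    candidates = ([chunk_emotion] if chunk_emotion else []) + list(active_moods)
    for mood in candidates:
        voice = variants.get(mood.lower())
        if voice:
            return voice
    return default  # no matching variant: keep the character's normal voice
```

Falling back to the default voice means characters without configured variants behave exactly as they do now.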

3. Optional narrator voice for non-speech parts

A niche one, but: some people might want a narrator voice reading the action beats and scene descriptions between dialogue lines. The regex extractor already splits text into speech/non-speech chunks. A second voice could read the non-speech parts, creating a full audiobook-like experience. Fully optional, probably a per-character setting.
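As a sketch of what that split might look like: the `Segment` type, the `split_with_narrator` helper, and the quote-matching regex below are hypothetical stand-ins, not the app's actual extractor. The assumption is that narration is simply skipped when no narrator voice is set.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    is_speech: bool
    voice: str


QUOTED = re.compile(r'"[^"]*"')


def split_with_narrator(
    message: str,
    character_voice: str,
    narrator_voice: Optional[str] = None,
) -> list[Segment]:
    """Interleave dialogue and narration; drop narration if no narrator is set."""
    segments: list[Segment] = []
    pos = 0
    for match in QUOTED.finditer(message):
        narration = message[pos:match.start()].strip()
        if narration and narrator_voice:
            segments.append(Segment(narration, False, narrator_voice))
        spoken = match.group(0).strip('"').strip()
        if spoken:
            segments.append(Segment(spoken, True, character_voice))
        pos = match.end()
    tail = message[pos:].strip()
    if tail and narrator_voice:
        segments.append(Segment(tail, False, narrator_voice))
    return segments
```

Leaving `narrator_voice` unset yields dialogue-only output, which matches the current behaviour.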

Thoughts? Would love to hear if others see value in this direction, or if there are better ways to approach it.

OrbFrontend (Maintainer) · 2026-05-09T15:40:32Z

I see the vision. Looks like it's gonna be related to #30, so everyone will need to be on the same page.

glitchbunny0 (Author) · 2026-05-09T16:48:08Z

Good call on linking this to #30 — the secondary agents work is exactly the right foundation for this.

The speech scripting pass would slot in naturally as a secondary agent: it runs after the Director, reads the mood metadata, and produces TTS-optimized output. The regex extractor stays as the default path (zero cost/latency), and the agent pass is opt-in the same way Writer/Editor are.

I'd be happy to align the TTS agent interface with whatever seam design comes out of #30, so we're not building against a moving target. If it helps to have a concrete use case for testing the plugin architecture, the speech scripting pass could be a good candidate — it's self-contained, has clear inputs (mood, text) and outputs (scripted chunks), and doesn't need to touch core rendering.
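To make "clear inputs and outputs" concrete, here is one possible shape for the contract. This is only a sketch: `SpeechScriptAgent`, `ScriptedChunk`, and `CueTaggingAgent` are hypothetical names, the mood names in `CUES` are made up, and the real seam should follow whatever #30 settles on.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ScriptedChunk:
    text: str
    emotion: Optional[str] = None


@runtime_checkable
class SpeechScriptAgent(Protocol):
    """Hypothetical contract: mood + text in, scripted chunks out."""

    def script(self, text: str, moods: list[str]) -> list[ScriptedChunk]: ...


class CueTaggingAgent:
    """Toy implementation with no LLM call: prefix a cue for the first mood."""

    # Made-up mood names mapped to the annotation tags from the proposal.
    CUES = {"amused": "[laugh]", "weary": "[sigh]", "secretive": "[whisper]"}

    def script(self, text: str, moods: list[str]) -> list[ScriptedChunk]:
        mood = moods[0] if moods else None
        cue = self.CUES.get(mood, "")
        spoken = f"{cue} {text}".strip()
        return [ScriptedChunk(text=spoken, emotion=mood)]
```

A toy agent like this could double as a test fixture for the plugin architecture, since it exercises the interface without needing a model.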

OrbFrontend (Maintainer) · 2026-05-10T12:53:42Z

I just had an idea regarding TTS. What if the user could click on a line of dialogue in the chat window and the TTS would speak only that line? This lets the user read and listen at their own pace. Wdyt?

glitchbunny0 (Author) · 2026-05-10T18:20:16Z

Implemented your click-to-speak idea and expanded it with karaoke highlighting in PR #40.

How it works:

Click any quoted dialogue line → it highlights and speaks just that chunk. When you hit the full speak button, chunks play sequentially and each line highlights as it's spoken — so you get karaoke-style following.

Architecture:

- New GET /messages/{id}/chunks endpoint returns chunk metadata (text, emotion, pause_before_ms) using the existing regex_extract()
- New POST /messages/{id}/speak-chunk synthesizes a single chunk with per-chunk caching
- Frontend plays chunks one-by-one via a queue player: each chunk fetches its own audio, highlights the corresponding <span class="quoted">, then advances on onended
- Fallback to monolithic /speak for narration-only messages (no dialogue lines)
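The queue player's control flow, modelled in Python for illustration only (the real player lives in the browser). `ChunkQueuePlayer` and its three callbacks are hypothetical, as is addressing chunks by index; `play` is assumed to block until playback finishes, standing in for the onended handler.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChunkQueuePlayer:
    # Hypothetical hooks; in the browser these would be a fetch call,
    # an audio element, and a CSS class toggle on the chunk's span.
    fetch_audio: Callable[[int], bytes]
    play: Callable[[bytes], None]  # assumed to return when playback ends
    set_highlight: Callable[[int, bool], None]

    def play_all(self, chunk_count: int) -> None:
        """Play chunks in order, highlighting only the one being spoken."""
        for index in range(chunk_count):
            audio = self.fetch_audio(index)
            self.set_highlight(index, True)
            try:
                self.play(audio)
            finally:
                # Clear the highlight even if playback raises.
                self.set_highlight(index, False)
```

Click-to-speak is then the single-chunk case: fetch, highlight, play, and clear for one index.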

Why sequential instead of one file:

The original code concatenated all chunks into one MP3. That makes per-chunk highlighting impossible — you'd need timestamp tracking per chunk, which is fragile across different TTS adapters and encodings. Playing chunks individually makes the highlight just a CSS class toggle on the current chunk's span. Cached chunks play instantly on repeat so the inter-chunk latency only hits the first playthrough.
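For reference, per-chunk caching can be as simple as the sketch below. This is not the code from the PR: `ChunkAudioCache`, the in-memory dict, and keying on voice plus text are assumptions chosen to show why repeat plays skip synthesis.

```python
import hashlib
from typing import Callable


class ChunkAudioCache:
    """In-memory cache keyed by (voice, text); a sketch, not the PR's code."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize  # hypothetical adapter call: (text, voice) -> audio
        self._store: dict[str, bytes] = {}
        self.misses = 0

    @staticmethod
    def key(text: str, voice: str) -> str:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str) -> bytes:
        k = self.key(text, voice)
        if k not in self._store:
            self.misses += 1  # only the first playthrough pays for synthesis
            self._store[k] = self._synthesize(text, voice)
        return self._store[k]
```

Including the voice in the key matters if mood-aware voice selection lands later, since the same line could then be synthesized with different voices.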

Draft PR is up for review: #40
