Agentic TTS: speech scripting pass, mood-aware voice selection, and optional narration #33
Replies: 4 comments
I see the vision. Looks like it's gonna be related to #30, so everyone will need to be on the same page.
Good call on linking this to #30. The secondary agents work is exactly the right foundation for this. The speech scripting pass would slot in naturally as a secondary agent: it runs after the Director, reads the mood metadata, and produces TTS-optimized output. The regex extractor stays as the default path (zero cost/latency), and the agent pass is opt-in the same way Writer/Editor are.

I'd be happy to align the TTS agent interface with whatever seam design comes out of #30, so we're not building against a moving target. If it helps to have a concrete use case for testing the plugin architecture, the speech scripting pass could be a good candidate: it's self-contained, has clear inputs (mood, text) and outputs (scripted chunks), and doesn't need to touch core rendering.
I just had an idea regarding TTS. What if the user could click on a line of dialogue in the chat window and the TTS would speak only that sentence? This lets the user read and listen at their own pace. Wdyt?
Implemented your click-to-speak idea and expanded it with karaoke highlighting in PR #40.

How it works: Click any quoted dialogue line → it highlights and speaks just that chunk. When you hit the full speak button, chunks play sequentially and each line highlights as it's spoken, so you get karaoke-style following.

Architecture:
Why sequential instead of one file: The original code concatenated all chunks into one MP3. That rules out per-chunk highlighting unless you track timestamps per chunk, which is fragile across different TTS adapters and encodings. Playing chunks individually makes the highlight just a CSS class toggle on the current chunk's span. Cached chunks play instantly on repeat, so the inter-chunk latency only hits the first playthrough.

Draft PR is up for review: #40
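To make the sequential approach concrete, here is a minimal sketch of the control flow, not the code from PR #40. Everything in it is an assumption for illustration: `ChunkPlayer`, `synthesize`, `play`, and `on_highlight` are hypothetical names, playback is modeled as a blocking call, and the cache is keyed by chunk text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ChunkPlayer:
    """Plays chunks one at a time so the UI can highlight the active one."""

    synthesize: Callable[[str], bytes]             # hypothetical TTS adapter call
    play: Callable[[bytes], None]                  # hypothetical blocking playback
    on_highlight: Callable[[Optional[int]], None]  # UI hook; None clears the highlight
    _cache: Dict[str, bytes] = field(default_factory=dict)

    def _audio_for(self, text: str) -> bytes:
        # Synthesize only on a cache miss; repeats reuse the stored audio.
        if text not in self._cache:
            self._cache[text] = self.synthesize(text)
        return self._cache[text]

    def speak_one(self, chunks: List[str], index: int) -> None:
        """Click-to-speak: highlight and play a single chunk."""
        self.on_highlight(index)
        try:
            self.play(self._audio_for(chunks[index]))
        finally:
            # Clear the highlight even if playback raises.
            self.on_highlight(None)

    def speak_all(self, chunks: List[str]) -> None:
        """Full speak button: the same per-chunk path, run in order."""
        for index in range(len(chunks)):
            self.speak_one(chunks, index)
```

Because click-to-speak and the full speak button share `speak_one`, the highlight never needs timestamps: it follows whichever chunk is currently playing.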
After getting the TTS pipeline merged (#29), I've been thinking about extending the agentic approach into the audio realm. The app already has mood detection via the Director pass (`active_moods`), and the TTS layer already structures speech as `SpeakableChunk` with an `emotion` field per chunk, so there's a foundation to build on.

The general idea: treat speech generation as a first-class agentic pass, not just a post-hoc synthesis step. This would be entirely opt-in and out of the way, like the existing TTS toggle.
1. Speech scripting agent pass
Instead of pure regex extraction, add an optional agent pass that writes a speech script tailored to the TTS model. This pass would, among other things, use inline cue tags (`[laugh]`, `[sigh]`, `[whisper]`, etc.).

The regex extractor stays as the default (zero-latency, zero-cost). The agent pass would be opt-in, like the Writer/Editor passes.
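A rough sketch of how the default and opt-in paths could sit side by side. Only `SpeakableChunk`, its `emotion` field, and `active_moods` come from the existing app; the field types, `regex_extract`, `build_speech_chunks`, and `toy_agent` are hypothetical, and falling back to the regex path when the agent fails is my assumption rather than settled design.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class SpeakableChunk:
    text: str
    emotion: Optional[str] = None  # field named in the post; the type is assumed


# Matches dialogue in straight or curly double quotes.
_DIALOGUE = re.compile(r'"([^"]+)"|“([^”]+)”')

# An agent takes (text, active_moods) and returns scripted chunks.
ScriptingAgent = Callable[[str, Sequence[str]], List[SpeakableChunk]]


def regex_extract(text: str) -> List[SpeakableChunk]:
    """Default path: pull quoted dialogue with no model call."""
    return [SpeakableChunk(m.group(1) or m.group(2))
            for m in _DIALOGUE.finditer(text)]


def build_speech_chunks(text: str, active_moods: Sequence[str],
                        agent: Optional[ScriptingAgent] = None) -> List[SpeakableChunk]:
    """Use the agent only when one is supplied; otherwise stay on the regex path."""
    if agent is None:
        return regex_extract(text)
    try:
        chunks = agent(text, active_moods)
    except Exception:
        # Assumed behavior: a failed agent call should not silence TTS.
        return regex_extract(text)
    return chunks or regex_extract(text)


def toy_agent(text: str, active_moods: Sequence[str]) -> List[SpeakableChunk]:
    """Stand-in for the LLM pass: tags each chunk using the first active mood."""
    mood = active_moods[0] if active_moods else None
    prefix = "[sigh] " if mood == "sad" else ""
    return [SpeakableChunk(prefix + c.text, mood) for c in regex_extract(text)]
```

With no agent configured, `build_speech_chunks` is just the regex extractor, so users who never enable the pass pay nothing for it.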
2. Mood-aware voice selection
Some backends (ElevenLabs, Fish Speech) support voice cloning or multiple voices. If the Director detects a mood shift, the TTS layer could pick a voice variant that matches: a softer voice for tender scenes, a sharper one for tense ones. The `SpeakableChunk.emotion` field is already there; this would connect it to voice selection.

3. Optional narrator voice for non-speech parts
A niche one, but: some people might want a narrator voice reading the action beats and scene descriptions between dialogue lines. The regex extractor already splits text into speech/non-speech chunks. A second voice could read the non-speech parts, creating a full audiobook-like experience. Fully optional, probably a per-character setting.
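Ideas 2 and 3 both come down to picking a voice per chunk, so here is one hedged sketch covering both. All names are hypothetical (`Chunk`, `VoiceConfig`, `pick_voice`, `plan`, and the voice ids), the quote-based splitter is a simplified stand-in for the app's regex extractor, and it assumes balanced straight double quotes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Chunk:
    text: str
    is_speech: bool
    emotion: Optional[str] = None


@dataclass
class VoiceConfig:
    default_voice: str
    mood_voices: Dict[str, str]           # emotion -> voice variant id
    narrator_voice: Optional[str] = None  # None means narration is not read


def split_chunks(text: str, emotion: Optional[str] = None) -> List[Chunk]:
    """Split on double quotes: odd segments are dialogue, even ones narration."""
    chunks = []
    for i, part in enumerate(text.split('"')):
        part = part.strip()
        if part:
            is_speech = i % 2 == 1
            chunks.append(Chunk(part, is_speech, emotion if is_speech else None))
    return chunks


def pick_voice(chunk: Chunk, cfg: VoiceConfig) -> Optional[str]:
    """Narration gets the narrator voice; speech gets the mood variant if any."""
    if not chunk.is_speech:
        return cfg.narrator_voice
    return cfg.mood_voices.get(chunk.emotion or "", cfg.default_voice)


def plan(text: str, emotion: Optional[str], cfg: VoiceConfig) -> List[Tuple[str, str]]:
    """Return (voice, text) pairs in reading order, skipping unvoiced chunks."""
    result = []
    for chunk in split_chunks(text, emotion):
        voice = pick_voice(chunk, cfg)
        if voice is not None:
            result.append((voice, chunk.text))
    return result
```

Keeping `narrator_voice` optional on the config matches the per-character, fully optional setting: leaving it unset reproduces today's dialogue-only behavior, and an unmapped emotion falls back to the character's default voice.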
Thoughts? Would love to hear if others see value in this direction, or if there are better ways to approach it.