feat: Add SSML support by Tgenz1213 · Pull Request #302 · hexgrad/kokoro

Tgenz1213 · 2026-03-09T15:29:09Z

Summary

Adds a pure pre-processing SSML layer to kokoro-js. All tags are resolved before the model sees the input. Plain-text inputs pay only for a single `<` check and never enter the SSML path, and the model pipeline is unchanged.

Supported tags

| Tag | Purpose |
| --- | --- |
| `<phoneme alphabet="ipa" ph="...">` | Bypass G2P with exact IPA pronunciation |
| `<break time="500ms"/>` | Insert a timed silence (ms or s) |
| `<sub alias="...">` | Substitute spoken text while preserving display text |
| `<say-as interpret-as="characters">` | Spell out letter-by-letter (e.g. SQL → S. Q. L.) |
| `<say-as interpret-as="ordinal">` | Read as ordinal (e.g. 3 → "third") |
| `<say-as interpret-as="number">` | Read as cardinal number |

Malformed or unknown tags are read as plain text; the parser never throws.

Example

```javascript
const audio = await tts.generate(
  `The <phoneme alphabet="ipa" ph="wˈɜːld">world</phoneme> record, ` +
  `set by <sub alias="World Wide Web Consortium">W3C</sub>, ` +
  `was broken in <say-as interpret-as="ordinal">3</say-as> attempts.` +
  `<break time="500ms"/>` +
  `<say-as interpret-as="characters">SQL</say-as> was also involved.`,
  { voice: "af_heart" }
);
```

Implementation notes

Architecture

New src/ssml.js: Lightweight regex parser with hasSSML(), parseSSML(), splitAtBreaks(), and parseBreakMs(). The < guard in hasSSML() means the SSML path is never entered for plain text.
phonemize() now dispatches to phonemizeSSML() when hasSSML(text) is true. Each segment type is handled independently with no double-normalization.
<break> tags are extracted one level up in generate()/stream() via splitAtBreaks() — the phonemize layer never sees them. This clean separation means phonemize() requires no knowledge of audio timing.

Audio utilities

generateSilence(ms): Produces a zeroed RawAudio of the requested duration at 24 kHz.
concatAudio(chunks): Concatenates audio segments with an 8 ms linear cross-fade at each boundary to prevent audible pops.
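The sample-count arithmetic behind generateSilence() can be sketched as follows. This is a minimal illustration, not the PR's code: the plain `{ audio, sampling_rate }` object is a hypothetical stand-in for RawAudio, and the rounding and clamping rules are assumptions.

```javascript
// Sample rate taken from the "24 kHz" figure in the description above.
const SAMPLE_RATE = 24000;

// Hypothetical stand-in for RawAudio: a samples buffer plus its rate.
function generateSilence(durationMs) {
  // Assumption: negative durations clamp to zero, fractions round to whole samples.
  const numSamples = Math.max(0, Math.round((durationMs / 1000) * SAMPLE_RATE));
  // Float32Array is zero-initialised, so no explicit fill is needed.
  return { audio: new Float32Array(numSamples), sampling_rate: SAMPLE_RATE };
}
```

Under these assumptions a 500 ms break is 12,000 samples and a 1.5 s break is 36,000 samples.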

`<break>` in `stream()`

Breaks yield { text: '', phonemes: '', audio: silence } to preserve the { text, phonemes, audio } shape that consumers iterate over.
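A rough sketch of how a consumer-facing generator could interleave silence with speech while keeping one item shape. The segment objects (`{ type: "text", text }` and `{ type: "break", ms }`), the function name `streamWithBreaks`, and the injected `synthesize` and `makeSilence` callbacks are all hypothetical; the real stream() internals may differ.

```javascript
// Hypothetical helper: yields { text, phonemes, audio } for every segment,
// so consumers never need a special case for breaks.
async function* streamWithBreaks(segments, synthesize, makeSilence) {
  for (const seg of segments) {
    if (seg.type === "break") {
      // A break carries no text or phonemes, only silent audio.
      yield { text: "", phonemes: "", audio: makeSilence(seg.ms) };
    } else {
      // synthesize() is assumed to resolve to { text, phonemes, audio }.
      yield await synthesize(seg.text);
    }
  }
}
```

A consumer can then iterate with `for await (const { text, phonemes, audio } of ...)` and treat an empty `text` as a pause.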

eSpeak IPA notation

The ph attribute requires eSpeak IPA, not standard (broad) IPA. The critical difference: stress marks (ˈ ˌ) must appear immediately before the stressed vowel, not before the syllable onset consonant. Placing ˈ before a consonant causes it to be vocalized as a phoneme (heard as an "ah" sound).

✅ eSpeak: wˈɜːld (ˈ before vowel ɜ)
❌ Standard: ˈwɜːld (ˈ before onset consonant w)

Use espeak-ng --ipa -q -v en-us "word" to discover the correct notation for any word.
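The placement rule lends itself to a quick sanity check before passing a `ph` value to the model. The helper below is a hypothetical lint aid, not part of the PR; its vowel set is a small assumed subset of IPA and would need extending for other languages.

```javascript
const STRESS_MARKS = new Set(["ˈ", "ˌ"]);
// Assumption: a deliberately small set of IPA vowel symbols; extend as needed.
const IPA_VOWELS = new Set([..."aeiouyɑɐɒæəɚɛɜɝɪɔʊʌø"]);

// Returns the indices of stress marks that are NOT directly followed by a vowel.
function misplacedStressMarks(ph) {
  const chars = [...ph];
  const bad = [];
  chars.forEach((ch, i) => {
    if (STRESS_MARKS.has(ch) && !IPA_VOWELS.has(chars[i + 1])) bad.push(i);
  });
  return bad;
}
```

With this check, `wˈɜːld` produces no findings, while `ˈwɜːld` is flagged at index 0 because the mark precedes the consonant `w`.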

Partially addresses #36

normalize_text was previously package-private. Exporting it allows the upcoming SSML layer to call it on individual text sub-segments without going through the full phonemize() pipeline (which would double-normalize).

Add a dedicated describe("normalize_text") block in phonemize.test.js covering all normalization rules directly: quotes, brackets, whitespace, abbreviations, numbers, currency, possessives, and hyphenated initials. Previously these were only tested indirectly via phonemize() output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

generateSilence(durationMs) produces a RawAudio filled with zeros.

concatAudio(audios) joins an array of RawAudio segments with an 8 ms linear cross-fade at each boundary. The fade-out tail of the preceding segment and the fade-in head of the following segment are summed over the overlap window, reducing click/pop artifacts at splice points. Silence segments are handled naturally: 0 * ramp = 0, so silence-to-speech and speech-to-silence boundaries are just a fade-in and fade-out respectively.

Both utilities are module-private; they will be used by the <break> tag implementation to stitch speech and silence segments together.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
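The overlap-and-sum cross-fade described in this commit could look roughly like the sketch below. It operates on bare Float32Array buffers rather than RawAudio, and the exact ramp shape and the handling of segments shorter than the fade window are assumptions.

```javascript
const SAMPLE_RATE = 24000;
// 8 ms at 24 kHz is 192 samples.
const FADE_SAMPLES = Math.round(0.008 * SAMPLE_RATE);

// Joins Float32Array chunks, overlapping each boundary by up to FADE_SAMPLES.
function concatAudio(chunks) {
  if (chunks.length === 0) return new Float32Array(0);
  let out = Float32Array.from(chunks[0]);
  for (let i = 1; i < chunks.length; i++) {
    const next = chunks[i];
    // Assumption: short segments shrink the fade window instead of failing.
    const overlap = Math.min(FADE_SAMPLES, out.length, next.length);
    const start = out.length - overlap;
    const joined = new Float32Array(out.length + next.length - overlap);
    joined.set(out.subarray(0, start), 0);
    for (let j = 0; j < overlap; j++) {
      const t = (j + 1) / (overlap + 1); // linear fade-in weight for `next`
      // Fading tail of the previous chunk plus rising head of the next one.
      joined[start + j] = out[start + j] * (1 - t) + next[j] * t;
    }
    joined.set(next.subarray(overlap), out.length);
    out = joined;
  }
  return out;
}
```

Because the two ramps sum to 1, a constant signal passes through a boundary unchanged, and a silent neighbour reduces the operation to a plain fade-in or fade-out. Each boundary shortens the total length by the overlap.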

src/ssml.js exports three functions:

- hasSSML(text): O(n) fast guard; checks for '<' before parsing.
- parseSSML(text): Single-pass regex parser returning a typed segment array: text | phoneme | break | sub | say-as.
- splitAtBreaks(text): Extracts only <break> boundaries, leaving other SSML tags intact in text segments for the phonemize layer to handle.

Parser is intentionally non-throwing: unknown tag names and malformed or missing required attributes degrade gracefully to plain-text segments so synthesis always produces output. Input like '2 < 3' that contains '<' but no valid tag is left unchanged.

parseBreakMs() parses 'Nms' and 'Ns'/'N.Ns' time attribute values.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
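To make the guard, the time parsing, and the break splitting concrete, here is a minimal sketch. The function names come from the commit message, but the regexes, the returned segment shape (`{ type, text }` and `{ type, ms }`), and the choice to leave a `<break>` with an unparseable time inside the surrounding text are assumptions rather than the PR's actual behavior.

```javascript
// Cheap guard: plain text without '<' never reaches the parser.
function hasSSML(text) {
  return text.includes("<");
}

// Parses 'Nms', 'Ns' and 'N.Ns'; returns milliseconds, or null if malformed.
function parseBreakMs(value) {
  const m = /^(\d+(?:\.\d+)?)(ms|s)$/.exec(String(value).trim());
  if (!m) return null;
  const n = parseFloat(m[1]);
  return Math.round(m[2] === "ms" ? n : n * 1000);
}

// Splits only at well-formed <break time="..."/> tags; other tags stay in the text.
function splitAtBreaks(text) {
  const re = /<break\s+time\s*=\s*(["'])(.*?)\1\s*\/>/g;
  const segments = [];
  let last = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    const ms = parseBreakMs(m[2]);
    if (ms === null) continue; // assumption: malformed breaks remain plain text
    if (m.index > last) segments.push({ type: "text", text: text.slice(last, m.index) });
    segments.push({ type: "break", ms });
    last = m.index + m[0].length;
  }
  if (last < text.length) segments.push({ type: "text", text: text.slice(last) });
  return segments;
}
```

Note that nothing here throws: input such as `2 < 3` passes the guard but comes back as a single untouched text segment.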

phonemize() now checks hasSSML() before entering the plain-text path. When SSML is detected it delegates to phonemizeSSML(), which calls parseSSML() and processes each segment type individually:

- <phoneme alphabet='ipa' ph='...'>: ipa value injected directly; G2P is bypassed for the tagged word entirely.
- <sub alias='...'>text</sub>: alias text is normalized then phonemized; display text is never spoken.
- <say-as interpret-as='characters'>: expandCharacters() converts the content to dot-separated uppercase (e.g. 'SQL' → 'S.Q.L.'); normalize_text then converts inter-cap dots to hyphens, which eSpeak reads letter-by-letter.
- <say-as interpret-as='ordinal'>: expandOrdinal() converts to ordinal suffix form ('42nd', 'first', 'twelfth'); normalize_text and eSpeak handle the rest.
- <say-as interpret-as='number'>: passed through normalize_text unchanged; existing number expansion already handles it.
- <break>: skipped here; handled at the audio level in kokoro.js.

runEspeak() extracts the eSpeak call + post-processing pipeline so phonemizeSSML can invoke it on already-normalized sub-segments without triggering a second normalize_text pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
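The two say-as expansion helpers might be sketched as below. Only the function names and the example outputs ('S.Q.L.', 'first', 'twelfth', '42nd') come from the commit message; the cutoff at which words give way to digit-plus-suffix form, the whitespace handling, and the fallback for non-numeric input are assumptions.

```javascript
// Assumption: ordinals 1 through 12 are spelled out, larger ones use a suffix.
const SMALL_ORDINALS = [
  "first", "second", "third", "fourth", "fifth", "sixth",
  "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
];

// 'SQL' -> 'S.Q.L.' (uppercase, one dot after every non-space character).
function expandCharacters(text) {
  return [...text.toUpperCase()]
    .filter((ch) => /\S/.test(ch))
    .map((ch) => ch + ".")
    .join("");
}

// '3' -> 'third', '42' -> '42nd'; non-numeric input is returned unchanged.
function expandOrdinal(text) {
  const n = Number.parseInt(text, 10);
  if (Number.isNaN(n) || n < 0) return text;
  if (n >= 1 && n <= SMALL_ORDINALS.length) return SMALL_ORDINALS[n - 1];
  const lastTwo = n % 100;
  // 11, 12 and 13 always take 'th' (111th, 212th), regardless of the last digit.
  const suffix =
    lastTwo >= 11 && lastTwo <= 13 ? "th" : ["th", "st", "nd", "rd"][n % 10] || "th";
  return `${n}${suffix}`;
}
```

Returning the input unchanged on a parse failure matches the non-throwing spirit of the parser: a bad say-as value is still spoken as whatever text it contains.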

generate() and stream() now handle <break time='...'/> tags.

- generate(): if the input string contains SSML and a <break> tag, splitAtBreaks() partitions it into alternating text/break segments. Each text segment is phonemized and synthesized independently; each break segment becomes a silence RawAudio via generateSilence(). All segments are joined by concatAudio() with 8 ms cross-fades, and a single RawAudio is returned. Inputs with no <break> tags follow the original single-pass code path unchanged.
- stream(): the same splitAtBreaks() pass is applied to plain-string inputs containing <break>. Silence segments are yielded as { text: '', phonemes: '', audio: silence } between the normal sentence yields so that consumers receive a consistent { text, phonemes, audio } shape. TextSplitterStream inputs are unaffected.

Silence duration tests added to ssml.test.js (verify sample count arithmetic against SAMPLE_RATE for 500ms / 1s / 1.5s / 0ms).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The <phoneme ph="..."> attribute requires eSpeak IPA notation, not standard (broad) IPA. The key differences are:

- Stress marks must appear immediately before the stressed vowel, not before the syllable onset. Writing the stress mark before a consonant onset causes the model to vocalize it as a separate phoneme (typically heard as an "ah" sound) rather than applying stress.
- The English rhotic is eSpeak's turned r (ɹ, U+0279), not plain r.

Updated:

- README.md: add SSML section with tag reference table, eSpeak IPA explanation, stress mark placement rules, and espeak-ng discovery command
- src/ssml.js: expand module-level JSDoc with eSpeak IPA format notes and correct/incorrect stress mark placement examples
- src/phonemize.js: update phonemizeSSML JSDoc with the same warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a preprocessing SSML (Speech Synthesis Markup Language) layer to kokoro-js, partially addressing Issue #36. All SSML tags are resolved before the model sees the input, with zero overhead for plain-text inputs and no changes to the model pipeline. Supported tags include <phoneme>, <break>, <sub>, and <say-as> (with characters, ordinal, and number interpret-as modes).

Changes:

New src/ssml.js module with a regex-based SSML parser (hasSSML, parseSSML, splitAtBreaks, parseBreakMs) that gracefully degrades malformed/unknown tags to plain text.
Extended src/phonemize.js with SSML-aware phonemization (phonemizeSSML, runEspeak, expandCharacters, expandOrdinal) and exported normalize_text for testing. The phonemize() entry point now delegates to the SSML path when tags are detected.
Updated src/kokoro.js with generateSilence/concatAudio audio utilities and <break> tag handling in both generate() and stream() methods, splitting text at break points and concatenating with 8ms linear cross-fades.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `kokoro.js/src/ssml.js` | New regex-based SSML parser with typed segment output, break time parsing, and graceful fallback for unknown/malformed tags |
| `kokoro.js/src/phonemize.js` | Adds SSML-aware phonemization path, refactors eSpeak invocation into reusable `runEspeak`, adds ordinal/character expansion helpers, exports `normalize_text` |
| `kokoro.js/src/kokoro.js` | Adds `generateSilence`/`concatAudio` utilities, integrates `<break>` handling into `generate()` and `stream()` methods |
| `kokoro.js/tests/ssml.test.js` | Comprehensive tests for SSML parsing, break time conversion, phonemization with all supported tags, and silence sizing math |
| `kokoro.js/tests/phonemize.test.js` | New unit tests for `normalize_text` covering quotes, whitespace, abbreviations, numbers, currency, possessives, and hyphenated initials |
| `kokoro.js/README.md` | Documents supported SSML tags, eSpeak IPA notation requirements, and usage examples |

nptrainor · 2026-04-08T08:08:14Z

Thank you for your work.
Any comments/progress on this?
This would be seriously useful across virtually every use case.

Tgenz1213 · 2026-04-08T13:48:34Z

> Thank you for your work. Any comments/progress on this? This would be seriously useful across virtually every use case.

I haven't heard any feedback yet.

You can try it out via NPM.

I haven't uploaded any of my side projects to Github recently, but it works well where I've tried it.

Tgenz1213 and others added 6 commits March 9, 2026 09:31

Copilot AI review requested due to automatic review settings March 9, 2026 15:29

Copilot started reviewing on behalf of Tgenz1213 March 9, 2026 15:29

Copilot AI reviewed Mar 9, 2026

Comment thread kokoro.js/src/kokoro.js

Tgenz1213 added 2 commits March 9, 2026 11:33

Add splitTextIntoChunks function and related tests for SSML text segmentation

5ccc163

Fix type annotation for attributes in parseAttrs function

b15abb9

drewster99 mentioned this pull request Mar 17, 2026

Add SSML or basic emphasis support mlalma/kokoro-ios#32

Open

Tgenz1213 wants to merge 8 commits into hexgrad:main from Tgenz1213:feat/add-ssml-support


3 participants
