feat: Add SSML support #302

Open
Tgenz1213 wants to merge 8 commits into hexgrad:main from Tgenz1213:feat/add-ssml-support

Conversation

@Tgenz1213

Summary

Adds a pure pre-processing SSML layer to kokoro-js. All tags are resolved before the model sees the input. There is zero overhead for plain-text inputs and no changes to the model pipeline.

Supported tags

| Tag | Purpose |
| --- | --- |
| `<phoneme alphabet="ipa" ph="...">` | Bypass G2P with exact IPA pronunciation |
| `<break time="500ms"/>` | Insert a timed silence (ms or s) |
| `<sub alias="...">` | Substitute spoken text while preserving display text |
| `<say-as interpret-as="characters">` | Spell out letter-by-letter (e.g. SQL → S. Q. L.) |
| `<say-as interpret-as="ordinal">` | Read as ordinal (e.g. 3 → "third") |
| `<say-as interpret-as="number">` | Read as cardinal number |

Malformed or unknown tags are read as plain text; the parser never throws.

Example

const audio = await tts.generate(
  `The <phoneme alphabet="ipa" ph="wˈɜːld">world</phoneme> record, ` +
  `set by <sub alias="World Wide Web Consortium">W3C</sub>, ` +
  `was broken in <say-as interpret-as="ordinal">3</say-as> attempts.` +
  `<break time="500ms"/>` +
  `<say-as interpret-as="characters">SQL</say-as> was also involved.`,
  { voice: "af_heart" }
);

Implementation notes

Architecture

  • New src/ssml.js: Lightweight regex parser with hasSSML(), parseSSML(), splitAtBreaks(), and parseBreakMs(). The < guard in hasSSML() means the SSML path is never entered for plain text.
  • phonemize() now dispatches to phonemizeSSML() when hasSSML(text) is true. Each segment type is handled independently with no double-normalization.
  • <break> tags are extracted one level up in generate()/stream() via splitAtBreaks() — the phonemize layer never sees them. This clean separation means phonemize() requires no knowledge of audio timing.
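The guard and the break-level split described above can be sketched roughly as follows. This is an illustrative approximation, not the code in src/ssml.js: the function names come from the PR, but the regexes and the `{ type, text, time }` segment shape are assumptions made for the example.

```javascript
// Hypothetical sketch; the real src/ssml.js may differ in every detail.

// Cheap guard: text without '<' never reaches the SSML path.
function hasSSML(text) {
  return text.includes("<");
}

// Split on self-closing <break .../> tags only. Every other tag stays
// inside the surrounding text segment for the phonemize layer to handle.
function splitAtBreaks(text) {
  const breakTag = /<break\b([^<>]*?)\/>/g;
  const segments = [];
  let last = 0;
  for (const match of text.matchAll(breakTag)) {
    if (match.index > last) {
      segments.push({ type: "text", text: text.slice(last, match.index) });
    }
    // Pull out the raw time="..." value; null when the attribute is absent.
    const time = /\btime\s*=\s*(["'])(.*?)\1/.exec(match[1]);
    segments.push({ type: "break", time: time ? time[2] : null });
    last = match.index + match[0].length;
  }
  if (last < text.length) {
    segments.push({ type: "text", text: text.slice(last) });
  }
  return segments;
}
```

With this shape, an input such as `2 < 3` passes the guard but comes back as a single unchanged text segment, because no `<break>` tag matches.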

Audio utilities

  • generateSilence(ms): Produces a zeroed RawAudio of the requested duration at 24 kHz.
  • concatAudio(chunks): Concatenates audio segments with an 8 ms linear cross-fade at each boundary to prevent audible pops.
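As a rough illustration of the cross-fade, the sketch below overlaps the tail of one segment with the head of the next and sums them under complementary linear ramps. It assumes plain `Float32Array` chunks rather than `RawAudio`, and the function name, ramp shape, and the fact that the output is shortened by the overlap are all assumptions, not confirmed details of `concatAudio()`.

```javascript
const SAMPLE_RATE = 24000; // Hz, as stated in the PR description
const FADE_MS = 8;

// Hypothetical stand-in for concatAudio(); expects Float32Array chunks.
function concatWithCrossFade(chunks, sampleRate = SAMPLE_RATE) {
  if (chunks.length === 0) return new Float32Array(0);
  const fade = Math.round((FADE_MS / 1000) * sampleRate); // 192 samples at 24 kHz
  let out = Float32Array.from(chunks[0]);
  for (const next of chunks.slice(1)) {
    // Never overlap more samples than either side actually has.
    const n = Math.min(fade, out.length, next.length);
    const joined = new Float32Array(out.length + next.length - n);
    joined.set(out, 0);
    const start = out.length - n;
    for (let i = 0; i < n; i++) {
      const ramp = (i + 1) / (n + 1); // rises linearly across the overlap
      // Fade-out tail of the previous chunk plus fade-in head of the next.
      joined[start + i] = out[start + i] * (1 - ramp) + next[i] * ramp;
    }
    joined.set(next.subarray(n), out.length);
    out = joined;
  }
  return out;
}
```

When one side of the boundary is silence, the sum reduces to a plain fade-in or fade-out of the other side, which matches the behaviour described for break segments.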

<break> in stream()

Breaks yield { text: '', phonemes: '', audio: silence } to preserve the { text, phonemes, audio } shape that consumers iterate over.
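A minimal sketch of that yield shape is shown below. The `segments` array and the `synthesize` callback are hypothetical stand-ins for the output of `splitAtBreaks()` and the model call, and the audio is a bare `Float32Array` rather than `RawAudio`; none of this is the actual kokoro-js `stream()` implementation.

```javascript
// Hypothetical sketch: `segments` mimics pre-split input where break
// durations are already in milliseconds; `synthesize` mimics the model call
// and must resolve to an object with text, phonemes, and audio keys.
async function* streamWithBreaks(segments, synthesize, sampleRate = 24000) {
  for (const seg of segments) {
    if (seg.type === "break") {
      const samples = Math.round((seg.ms / 1000) * sampleRate);
      // Same keys as a speech chunk, so consumers need no special case.
      yield { text: "", phonemes: "", audio: new Float32Array(samples) };
    } else {
      yield await synthesize(seg.text);
    }
  }
}
```

A consumer that does `for await (const { text, phonemes, audio } of ...)` can then destructure every chunk the same way, whether it carries speech or silence.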

eSpeak IPA notation

The ph attribute requires eSpeak IPA, not standard (broad) IPA. The critical difference: stress marks (ˈ ˌ) must appear immediately before the stressed vowel, not before the syllable onset consonant. Placing ˈ before a consonant causes the mark itself to be vocalized as a phoneme (heard as an "ah" sound) instead of applying stress.

  • eSpeak: wˈɜːld (ˈ before vowel ɜ)
  • Standard: ˈwɜːld (ˈ before onset consonant w)

Use espeak-ng --ipa -q -v en-us "word" to discover the correct notation for any word.
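For transcriptions copied from a dictionary, the placement rule can be approximated mechanically. The helper below is a hypothetical convenience, not part of this PR: it moves each stress mark forward to sit directly before the next vowel, using a small, non-exhaustive vowel set chosen for this example. Output should still be checked against espeak-ng.

```javascript
// Assumed, non-exhaustive vowel inventory for English IPA.
const VOWELS = new Set([..."aeiouɑɐɒæɔəɚɛɜɝɪʊʌ"]);
const STRESS = new Set(["ˈ", "ˌ"]);

// Hypothetical helper: shift stress marks from the syllable onset to the
// position immediately before the following vowel.
function toEspeakStress(ipa) {
  const out = [];
  let pending = null; // stress mark waiting for the next vowel
  for (const ch of [...ipa]) {
    if (STRESS.has(ch)) {
      pending = ch;
      continue;
    }
    if (pending && VOWELS.has(ch)) {
      out.push(pending);
      pending = null;
    }
    out.push(ch);
  }
  if (pending) out.push(pending); // no vowel followed; keep the mark rather than drop it
  return out.join("");
}
```

Input that already follows the eSpeak convention passes through unchanged, so the helper is safe to apply unconditionally.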

Partially addresses #36

Tgenz1213 and others added 6 commits March 9, 2026 09:31
normalize_text was previously package-private. Exporting it allows the
upcoming SSML layer to call it on individual text sub-segments without
going through the full phonemize() pipeline (which would double-normalize).

Add a dedicated describe("normalize_text") block in phonemize.test.js
covering all normalization rules directly: quotes, brackets, whitespace,
abbreviations, numbers, currency, possessives, and hyphenated initials.
Previously these were only tested indirectly via phonemize() output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

generateSilence(durationMs) produces a RawAudio filled with zeros.

concatAudio(audios) joins an array of RawAudio segments with an 8 ms
linear cross-fade at each boundary. The fade-out tail of the preceding
segment and the fade-in head of the following segment are summed over
the overlap window, reducing click/pop artifacts at splice points.
Silence segments are handled naturally: 0 * ramp = 0, so silence-to-speech
and speech-to-silence boundaries are just a fade-in and fade-out respectively.

Both utilities are module-private; they will be used by the <break> tag
implementation to stitch speech and silence segments together.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ssml.js exports three functions:

  hasSSML(text)       — O(n) fast guard; checks for '<' before parsing.
  parseSSML(text)     — Single-pass regex parser returning a typed segment
                        array: text | phoneme | break | sub | say-as.
  splitAtBreaks(text) — Extracts only <break> boundaries, leaving other
                        SSML tags intact in text segments for the phonemize
                        layer to handle.

Parser is intentionally non-throwing: unknown tag names and malformed or
missing required attributes degrade gracefully to plain-text segments so
synthesis always produces output. Input like '2 < 3' that contains '<'
but no valid tag is left unchanged.

parseBreakMs() parses 'Nms' and 'Ns'/'N.Ns' time attribute values.
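A possible shape for that conversion is sketched here. The accepted formats ('Nms', 'Ns', 'N.Ns') come from the commit message; the regex and the choice to return 0 for unparseable values are assumptions made for the example.

```javascript
// Hypothetical sketch of a time-attribute parser returning milliseconds.
function parseBreakMs(value) {
  const m = /^\s*(\d+(?:\.\d+)?)(ms|s)\s*$/.exec(String(value ?? ""));
  if (!m) return 0; // assumed fallback: an unparseable value means no pause
  const n = Number(m[1]);
  return Math.round(m[2] === "ms" ? n : n * 1000);
}
```

Returning 0 instead of throwing keeps the behaviour in line with the non-throwing design described above.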

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

phonemize() now checks hasSSML() before entering the plain-text path.
When SSML is detected it delegates to phonemizeSSML(), which calls
parseSSML() and processes each segment type individually:

  <phoneme alphabet='ipa' ph='...'>  — ipa value injected directly;
      G2P is bypassed for the tagged word entirely.

  <sub alias='...'>text</sub>  — alias text is normalized then
      phonemized; display text is never spoken.

  <say-as interpret-as='characters'>  — expandCharacters() converts
      the content to dot-separated uppercase (e.g. 'SQL' → 'S.Q.L.');
      normalize_text then converts inter-cap dots to hyphens, which
      eSpeak reads letter-by-letter.

  <say-as interpret-as='ordinal'>  — expandOrdinal() converts to
      ordinal suffix form ('42nd', 'first', 'twelfth'); normalize_text
      and eSpeak handle the rest.

  <say-as interpret-as='number'>  — passed through normalize_text
      unchanged; existing number expansion already handles it.

  <break>  — skipped here; handled at the audio level in kokoro.js.

runEspeak() extracts the eSpeak call + post-processing pipeline so
phonemizeSSML can invoke it on already-normalized sub-segments without
triggering a second normalize_text pass.
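The two say-as helpers can be illustrated with a simplified sketch. `expandCharacters` below follows the 'SQL' → 'S.Q.L.' example, with the added assumption that non-alphanumeric characters are dropped. `ordinalSuffix` is a hypothetical, narrower stand-in for `expandOrdinal()` that only produces the numeric suffix form ('42nd') and does not attempt word forms such as 'first'.

```javascript
// Hypothetical sketch: dot-separated uppercase letters and digits.
function expandCharacters(text) {
  return [...text]
    .filter((ch) => /[a-z0-9]/i.test(ch)) // assumption: skip spaces and punctuation
    .map((ch) => ch.toUpperCase() + ".")
    .join("");
}

// Hypothetical sketch: English ordinal suffix for a non-negative integer.
function ordinalSuffix(n) {
  const lastTwo = n % 100;
  if (lastTwo >= 11 && lastTwo <= 13) return `${n}th`; // 11th, 12th, 13th
  switch (n % 10) {
    case 1: return `${n}st`;
    case 2: return `${n}nd`;
    case 3: return `${n}rd`;
    default: return `${n}th`;
  }
}
```

Either result would then go through normalization and eSpeak as described above, rather than being spoken directly.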

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

generate() and stream() now handle <break time='...'/> tags.

generate(): if the input string contains SSML and a <break> tag,
splitAtBreaks() partitions it into alternating text/break segments.
Each text segment is phonemized and synthesized independently; each
break segment becomes a silence RawAudio via generateSilence(). All
segments are joined by concatAudio() with 8 ms cross-fades, and a
single RawAudio is returned. Inputs with no <break> tags follow the
original single-pass code path unchanged.

stream(): the same splitAtBreaks() pass is applied to plain-string
inputs containing <break>. Silence segments are yielded as
{ text: '', phonemes: '', audio: silence } between the normal sentence
yields so that consumers receive a consistent { text, phonemes, audio }
shape. TextSplitterStream inputs are unaffected.

Silence duration tests added to ssml.test.js (verify sample count
arithmetic against SAMPLE_RATE for 500ms / 1s / 1.5s / 0ms).
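The sample-count arithmetic those tests check is simple enough to sketch. The version below assumes a 24 kHz sample rate (as stated elsewhere in this PR) and returns a bare `Float32Array` instead of `RawAudio`; the rounding and the clamp to zero are assumptions.

```javascript
const SAMPLE_RATE = 24000; // Hz

// Number of samples needed for a pause of `ms` milliseconds.
function silenceSamples(ms, sampleRate = SAMPLE_RATE) {
  return Math.max(0, Math.round((ms / 1000) * sampleRate));
}

// Hypothetical stand-in for generateSilence(); typed arrays start zero-filled.
function generateSilence(ms) {
  return new Float32Array(silenceSamples(ms));
}
```

Under these assumptions, 500 ms corresponds to 12,000 samples, 1 s to 24,000, and 1.5 s to 36,000.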

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The <phoneme ph="..."> attribute requires eSpeak IPA notation, not
standard (broad) IPA. The key differences are:

- Stress marks must appear immediately before the stressed vowel, not
  before the syllable onset. Writing the stress mark before a consonant
  onset causes the model to vocalize it as a separate phoneme (typically
  heard as an "ah" sound) rather than applying stress.

- The English rhotic is eSpeak's turned r (ɹ, U+0279), not plain r.

Updated:
- README.md: add SSML section with tag reference table, eSpeak IPA
  explanation, stress mark placement rules, and espeak-ng discovery command
- src/ssml.js: expand module-level JSDoc with eSpeak IPA format notes
  and correct/incorrect stress mark placement examples
- src/phonemize.js: update phonemizeSSML JSDoc with the same warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 9, 2026 15:29

Copilot AI left a comment

Pull request overview

This PR adds a preprocessing SSML (Speech Synthesis Markup Language) layer to kokoro-js, partially addressing Issue #36. All SSML tags are resolved before the model sees the input, with zero overhead for plain-text inputs and no changes to the model pipeline. Supported tags include <phoneme>, <break>, <sub>, and <say-as> (with characters, ordinal, and number interpret-as modes).

Changes:

  • New src/ssml.js module with a regex-based SSML parser (hasSSML, parseSSML, splitAtBreaks, parseBreakMs) that gracefully degrades malformed/unknown tags to plain text.
  • Extended src/phonemize.js with SSML-aware phonemization (phonemizeSSML, runEspeak, expandCharacters, expandOrdinal) and exported normalize_text for testing. The phonemize() entry point now delegates to the SSML path when tags are detected.
  • Updated src/kokoro.js with generateSilence/concatAudio audio utilities and <break> tag handling in both generate() and stream() methods, splitting text at break points and concatenating with 8ms linear cross-fades.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| kokoro.js/src/ssml.js | New regex-based SSML parser with typed segment output, break time parsing, and graceful fallback for unknown/malformed tags |
| kokoro.js/src/phonemize.js | Adds SSML-aware phonemization path, refactors eSpeak invocation into reusable runEspeak, adds ordinal/character expansion helpers, exports normalize_text |
| kokoro.js/src/kokoro.js | Adds generateSilence/concatAudio utilities, integrates `<break>` handling into generate() and stream() methods |
| kokoro.js/tests/ssml.test.js | Comprehensive tests for SSML parsing, break time conversion, phonemization with all supported tags, and silence sizing math |
| kokoro.js/tests/phonemize.test.js | New unit tests for normalize_text covering quotes, whitespace, abbreviations, numbers, currency, possessives, and hyphenated initials |
| kokoro.js/README.md | Documents supported SSML tags, eSpeak IPA notation requirements, and usage examples |

Comment thread kokoro.js/src/kokoro.js
@nptrainor

nptrainor commented Apr 8, 2026

Thank you for your work.
Any comments/progress on this?
This would be seriously useful across virtually every use case.

@Tgenz1213
Author

> Thank you for your work. Any comments/progress on this? This would be seriously useful across virtually every use case.

I haven't heard any feedback yet.

You can try it out via NPM.

I haven't uploaded any of my side projects to Github recently, but it works well where I've tried it.


3 participants