feat: Add SSML support #302
normalize_text was previously package-private. Exporting it allows the
upcoming SSML layer to call it on individual text sub-segments without
going through the full phonemize() pipeline (which would double-normalize).
Add a dedicated describe("normalize_text") block in phonemize.test.js
covering all normalization rules directly: quotes, brackets, whitespace,
abbreviations, numbers, currency, possessives, and hyphenated initials.
Previously these were only tested indirectly via phonemize() output.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
generateSilence(durationMs) produces a RawAudio filled with zeros.
concatAudio(audios) joins an array of RawAudio segments with an 8 ms linear cross-fade at each boundary. The fade-out tail of the preceding segment and the fade-in head of the following segment are summed over the overlap window, reducing click/pop artifacts at splice points.
Silence segments are handled naturally: 0 * ramp = 0, so silence-to-speech and speech-to-silence boundaries are just a fade-in and fade-out respectively.
Both utilities are module-private; they will be used by the <break> tag implementation to stitch speech and silence segments together.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
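For readers who want to see the cross-fade arithmetic, here is a minimal sketch of the two utilities. It is not the code from this PR: it uses plain Float32Array buffers instead of RawAudio, and the constants SAMPLE_RATE and FADE_MS are assumptions taken from the figures quoted in this thread (24 kHz, 8 ms). In this sketch each boundary shortens the output by the overlap length; the PR's implementation may handle that differently.

```javascript
// Assumed constants: 24 kHz output and an 8 ms cross-fade, as quoted in this PR.
const SAMPLE_RATE = 24000;
const FADE_MS = 8;

// Returns a zero-filled buffer of the requested duration.
function generateSilence(durationMs) {
  const length = Math.max(0, Math.round((durationMs / 1000) * SAMPLE_RATE));
  return new Float32Array(length);
}

// Joins buffers, overlapping each boundary by up to FADE_MS with linear ramps.
function concatAudio(chunks) {
  const fade = Math.round((FADE_MS / 1000) * SAMPLE_RATE); // 192 samples
  let out = new Float32Array(0);
  for (const chunk of chunks) {
    if (out.length === 0) {
      out = Float32Array.from(chunk);
      continue;
    }
    // Never overlap more samples than either neighbour can supply.
    const n = Math.min(fade, out.length, chunk.length);
    const start = out.length - n;
    const joined = new Float32Array(out.length + chunk.length - n);
    joined.set(out.subarray(0, start), 0);
    for (let i = 0; i < n; i++) {
      const t = (i + 1) / (n + 1); // fade-in weight, strictly between 0 and 1
      // Fade-out tail of the previous chunk plus fade-in head of the next one.
      joined[start + i] = out[start + i] * (1 - t) + chunk[i] * t;
    }
    joined.set(chunk.subarray(n), out.length);
    out = joined;
  }
  return out;
}
```

Because the silence samples are zero, a speech-to-silence boundary reduces to a plain fade-out of the speech tail, which matches the "0 * ramp = 0" remark above.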
src/ssml.js exports three functions:
hasSSML(text) — O(n) fast guard; checks for '<' before parsing.
parseSSML(text) — Single-pass regex parser returning a typed segment
array: text | phoneme | break | sub | say-as.
splitAtBreaks(text) — Extracts only <break> boundaries, leaving other
SSML tags intact in text segments for the phonemize
layer to handle.
Parser is intentionally non-throwing: unknown tag names and malformed or
missing required attributes degrade gracefully to plain-text segments so
synthesis always produces output. Input like '2 < 3' that contains '<'
but no valid tag is left unchanged.
parseBreakMs() parses 'Nms' and 'Ns'/'N.Ns' time attribute values.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
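As a rough illustration of the guard and the time-attribute parsing described in this commit, the sketch below shows one way they could look. The function names come from the commit message, but the bodies, the regex, and the choice to return null for malformed values are assumptions, not the code in src/ssml.js.

```javascript
// Cheap guard (assumed shape): plain text without '<' never reaches the parser.
// A string such as '2 < 3' passes the guard; the parser then has to decide
// that it contains no valid tag.
function hasSSML(text) {
  return text.includes("<");
}

// Parses 'Nms', 'Ns' and 'N.Ns' into milliseconds.
// Returning null for malformed values is an assumption of this sketch.
function parseBreakMs(value) {
  const match = /^\s*(\d+(?:\.\d+)?)(ms|s)\s*$/.exec(String(value ?? ""));
  if (!match) return null;
  const amount = parseFloat(match[1]);
  return match[2] === "ms" ? amount : amount * 1000;
}
```

Returning a sentinel rather than throwing keeps the sketch in line with the non-throwing design stated above: a caller can fall back to treating the tag as plain text.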
phonemize() now checks hasSSML() before entering the plain-text path.
When SSML is detected it delegates to phonemizeSSML(), which calls
parseSSML() and processes each segment type individually:
<phoneme alphabet='ipa' ph='...'> — ipa value injected directly;
G2P is bypassed for the tagged word entirely.
<sub alias='...'>text</sub> — alias text is normalized then
phonemized; display text is never spoken.
<say-as interpret-as='characters'> — expandCharacters() converts
the content to dot-separated uppercase (e.g. 'SQL' → 'S.Q.L.');
normalize_text then converts inter-cap dots to hyphens, which
eSpeak reads letter-by-letter.
<say-as interpret-as='ordinal'> — expandOrdinal() converts to
ordinal suffix form ('42nd', 'first', 'twelfth'); normalize_text
and eSpeak handle the rest.
<say-as interpret-as='number'> — passed through normalize_text
unchanged; existing number expansion already handles it.
<break> — skipped here; handled at the audio level in kokoro.js.
runEspeak() extracts the eSpeak call + post-processing pipeline so
phonemizeSSML can invoke it on already-normalized sub-segments without
triggering a second normalize_text pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
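To make the two say-as expansions concrete, here is a hedged sketch. Only the function names and the sample outputs ('S.Q.L.', '42nd', 'first', 'twelfth') come from the commit message; the word table, the 1 to 12 cut-off for spelled-out ordinals, and the pass-through for non-numeric input are assumptions made for illustration.

```javascript
// 'SQL' -> 'S.Q.L.': uppercase each non-space character and follow it with a dot.
function expandCharacters(text) {
  return [...text]
    .filter((ch) => /\S/.test(ch))
    .map((ch) => ch.toUpperCase() + ".")
    .join("");
}

// Assumed cut-off: 1 to 12 are spelled out, larger numbers get a suffix.
const ORDINAL_WORDS = [
  "first", "second", "third", "fourth", "fifth", "sixth",
  "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
];

function expandOrdinal(text) {
  const trimmed = text.trim();
  // Degrade gracefully: leave anything that is not a plain integer untouched.
  if (!/^\d+$/.test(trimmed)) return text;
  const n = parseInt(trimmed, 10);
  if (n >= 1 && n <= 12) return ORDINAL_WORDS[n - 1];
  const lastTwo = n % 100;
  const last = n % 10;
  let suffix = "th";
  // 11, 12 and 13 (and 111, 112, ...) always take 'th'.
  if (lastTwo < 11 || lastTwo > 13) {
    if (last === 1) suffix = "st";
    else if (last === 2) suffix = "nd";
    else if (last === 3) suffix = "rd";
  }
  return `${n}${suffix}`;
}
```

Both helpers return text that still needs the normalize_text and eSpeak steps described above; they only rewrite the surface form.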
generate() and stream() now handle <break time='...'/> tags.
generate(): if the input string contains SSML and a <break> tag,
splitAtBreaks() partitions it into alternating text/break segments.
Each text segment is phonemized and synthesized independently; each
break segment becomes a silence RawAudio via generateSilence(). All
segments are joined by concatAudio() with 8 ms cross-fades, and a
single RawAudio is returned. Inputs with no <break> tags follow the
original single-pass code path unchanged.
stream(): the same splitAtBreaks() pass is applied to plain-string
inputs containing <break>. Silence segments are yielded as
{ text: '', phonemes: '', audio: silence } between the normal sentence
yields so that consumers receive a consistent { text, phonemes, audio }
shape. TextSplitterStream inputs are unaffected.
Silence duration tests added to ssml.test.js (verify sample count
arithmetic against SAMPLE_RATE for 500ms / 1s / 1.5s / 0ms).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
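The partitioning step can be pictured with the sketch below. It is an illustration only: the segment shape ({ type, text } and { type, ms }), the regex, and the choice to drop whitespace-only text segments are assumptions, not the behaviour of splitAtBreaks() in src/ssml.js.

```javascript
// Matches <break time="500ms"/> or <break time='1.5s'/> (assumed syntax subset).
const BREAK_TAG = /<break\s+time\s*=\s*(["'])(\d+(?:\.\d+)?)(ms|s)\1\s*\/?>/g;

// Splits text into alternating text/break segments. Other tags such as <sub>
// stay inside the text segments for the phonemize layer to handle.
function splitAtBreaks(text) {
  const segments = [];
  let cursor = 0;
  for (const match of text.matchAll(BREAK_TAG)) {
    const before = text.slice(cursor, match.index);
    if (before.trim()) segments.push({ type: "text", text: before });
    const amount = parseFloat(match[2]);
    segments.push({ type: "break", ms: match[3] === "ms" ? amount : amount * 1000 });
    cursor = match.index + match[0].length;
  }
  const rest = text.slice(cursor);
  if (rest.trim()) segments.push({ type: "text", text: rest });
  return segments;
}
```

A malformed tag such as <break time="soon"/> does not match the pattern, so it stays in the surrounding text segment instead of raising an error.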
The <phoneme ph="..."> attribute requires eSpeak IPA notation, not standard (broad) IPA. The key differences are:
- Stress marks must appear immediately before the stressed vowel, not before the syllable onset. Writing the stress mark before a consonant onset causes the model to vocalize it as a separate phoneme (typically heard as an "ah" sound) rather than applying stress.
- The English rhotic is eSpeak's turned r (ɹ, U+0279), not plain r.
Updated:
- README.md: add SSML section with tag reference table, eSpeak IPA explanation, stress mark placement rules, and espeak-ng discovery command
- src/ssml.js: expand module-level JSDoc with eSpeak IPA format notes and correct/incorrect stress mark placement examples
- src/phonemize.js: update phonemizeSSML JSDoc with the same warning
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
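The stress-mark rule lends itself to a quick sanity check on ph values. The helper below is hypothetical and not part of this PR; its vowel set is a deliberately partial assumption covering common English eSpeak output, so it can raise false alarms on symbols it does not know.

```javascript
// Partial vowel inventory (assumption): extend as needed for other voices.
const VOWELS = new Set([..."aeiouɑæɐɒɔəɚɛɜɝɪʊʌᵻ"]);
const PRIMARY_STRESS = "\u02C8"; // ˈ
const SECONDARY_STRESS = "\u02CC"; // ˌ

// Returns the indices of stress marks that are not directly followed by a vowel.
function misplacedStressMarks(ph) {
  const chars = [...ph];
  const misplaced = [];
  chars.forEach((ch, i) => {
    const isStress = ch === PRIMARY_STRESS || ch === SECONDARY_STRESS;
    if (isStress && !VOWELS.has(chars[i + 1])) misplaced.push(i);
  });
  return misplaced;
}
```

With this sketch, wˈɜːld yields an empty list, while ˈwɜːld reports index 0 because the mark sits before the onset consonant w.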
Pull request overview
This PR adds a preprocessing SSML (Speech Synthesis Markup Language) layer to kokoro-js, partially addressing Issue #36. All SSML tags are resolved before the model sees the input, with zero overhead for plain-text inputs and no changes to the model pipeline. Supported tags include <phoneme>, <break>, <sub>, and <say-as> (with characters, ordinal, and number interpret-as modes).
Changes:
- New src/ssml.js module with a regex-based SSML parser (hasSSML, parseSSML, splitAtBreaks, parseBreakMs) that gracefully degrades malformed/unknown tags to plain text.
- Extended src/phonemize.js with SSML-aware phonemization (phonemizeSSML, runEspeak, expandCharacters, expandOrdinal) and exported normalize_text for testing. The phonemize() entry point now delegates to the SSML path when tags are detected.
- Updated src/kokoro.js with generateSilence/concatAudio audio utilities and <break> tag handling in both generate() and stream() methods, splitting text at break points and concatenating with 8 ms linear cross-fades.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kokoro.js/src/ssml.js | New regex-based SSML parser with typed segment output, break time parsing, and graceful fallback for unknown/malformed tags |
| kokoro.js/src/phonemize.js | Adds SSML-aware phonemization path, refactors eSpeak invocation into reusable runEspeak, adds ordinal/character expansion helpers, exports normalize_text |
| kokoro.js/src/kokoro.js | Adds generateSilence/concatAudio utilities, integrates <break> handling into generate() and stream() methods |
| kokoro.js/tests/ssml.test.js | Comprehensive tests for SSML parsing, break time conversion, phonemization with all supported tags, and silence sizing math |
| kokoro.js/tests/phonemize.test.js | New unit tests for normalize_text covering quotes, whitespace, abbreviations, numbers, currency, possessives, and hyphenated initials |
| kokoro.js/README.md | Documents supported SSML tags, eSpeak IPA notation requirements, and usage examples |
Thank you for your work.
I haven't heard any feedback yet. You can try it out via npm. I haven't uploaded any of my side projects to GitHub recently, but it works well where I've tried it.
Summary
Adds a pure pre-processing SSML layer to kokoro-js. All tags are resolved before the model sees the input. There is zero overhead for plain-text inputs and no changes to the model pipeline.
Supported tags
- <phoneme alphabet="ipa" ph="...">
- <break time="500ms"/>
- <sub alias="...">
- <say-as interpret-as="characters">
- <say-as interpret-as="ordinal">
- <say-as interpret-as="number">
Malformed or unknown tags are read as plain text; the parser never throws.
Example
Implementation notes
Architecture
- src/ssml.js: Lightweight regex parser with hasSSML(), parseSSML(), splitAtBreaks(), and parseBreakMs(). The < guard in hasSSML() means the SSML path is never entered for plain text.
- phonemize() now dispatches to phonemizeSSML() when hasSSML(text) is true. Each segment type is handled independently with no double-normalization.
- <break> tags are extracted one level up in generate()/stream() via splitAtBreaks(), so the phonemize layer never sees them. This separation means phonemize() requires no knowledge of audio timing.
Audio utilities
- generateSilence(ms): Produces a zeroed RawAudio of the requested duration at 24 kHz.
- concatAudio(chunks): Concatenates audio segments with an 8 ms linear cross-fade at each boundary to prevent audible pops.
<break> in stream()
Breaks yield { text: '', phonemes: '', audio: silence } to preserve the { text, phonemes, audio } shape that consumers iterate over.
eSpeak IPA notation
The ph attribute requires eSpeak IPA, not standard (broad) IPA. The critical difference: stress marks (ˈ ˌ) must appear immediately before the stressed vowel, not before the syllable onset consonant. Placing ˈ before a consonant causes it to be vocalized as a phoneme (heard as an "ah" sound).
- Correct: wˈɜːld (ˈ before vowel ɜ)
- Incorrect: ˈwɜːld (ˈ before onset consonant w)
Use espeak-ng --ipa -q -v en-us "word" to discover the correct notation for any word.
Partially addresses #36