WIP: YPE-1824 - add Bible HTML transformer to core package#198
WIP: YPE-1824 - add Bible HTML transformer to core package#198cameronapak wants to merge 9 commits intomainfrom
Conversation
Move transformBibleHtml from UI to @youversion/platform-core as a runtime-agnostic function with injected DOM adapters. UI wrapper now delegates to core, keeping getFootnoteMarker and font constants. Refs: YPE-1814
|
Greptile SummaryThis PR moves Bible HTML transformation logic from the Key concerns to address before merging:
Confidence Score: 2/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant App
participant Verse.Html
participant transformBibleHtmlForBrowser
participant BibleTextHtml
participant getVerseHtmlFromDom
participant VerseFootnoteButton
App->>Verse.Html: html (raw API HTML)
Verse.Html->>transformBibleHtmlForBrowser: html
Note over transformBibleHtmlForBrowser: wrapVerseContent()<br/>assignFootnoteKeys()<br/>replaceFootnotesWithAnchors()<br/>addNbspToVerseLabels()<br/>fixIrregularTables()
transformBibleHtmlForBrowser-->>Verse.Html: { html: transformedHtml }
Verse.Html->>BibleTextHtml: html=transformedHtml
Note over BibleTextHtml: useLayoutEffect sets innerHTML<br/>queries [data-verse-footnote]
BibleTextHtml->>BibleTextHtml: 1st pass: build notesByKey map<br/>(reads data-verse-footnote-content)
BibleTextHtml->>getVerseHtmlFromDom: verseNum (once per anchor, not per verse)
getVerseHtmlFromDom-->>BibleTextHtml: cloned+sanitized verse HTML
BibleTextHtml->>BibleTextHtml: setFootnoteData([...])
loop For each footnote anchor
BibleTextHtml->>VerseFootnoteButton: portal into anchor element<br/>notes[], verseHtml, hasVerseContext
VerseFootnoteButton-->>App: Popover with footnote content
end
|
packages/core/src/bible.ts
Outdated
| async getPassage( | ||
| versionId: number, | ||
| usfm: string, | ||
| format: 'html', | ||
| include_headings: boolean | undefined, | ||
| include_notes: boolean | undefined, | ||
| transform: TransformBibleHtmlOptions, | ||
| ): Promise<TransformedBiblePassage>; |
There was a problem hiding this comment.
Overload forces explicit
undefined for optional booleans
The second overload signature requires callers to spell out undefined for include_headings and include_notes whenever they want to use transform:
// Only way to reach the TransformedBiblePassage overload:
await bibleClient.getPassage(id, usfm, 'html', undefined, undefined, transform);This is ergonomically awkward. An options-object signature for the transform variant — or at minimum making include_headings / include_notes truly optional with ? — would improve the call-site experience. This could also be addressed by accepting transform as part of an options bag (e.g. { transform, include_headings, include_notes }) so callers don't need positional undefined placeholders.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Adds a new attribute to store the footnote content directly on the element, improving accessibility and enabling easier styling of footnote content.
Adds `transformBibleHtmlForNode` to the core package, utilizing Linkedom for DOM manipulation in Node.js environments. This allows for consistent Bible HTML processing across server and browser. Also refactors `transformBibleHtml` to accept DOM adapters, enabling runtime-agnostic transformations. The browser-specific transformation is now handled by `transformBibleHtmlForBrowser`. The `isomorphic-dompurify` dependency is removed from `@youversion/platform-ui`, as sanitization is no longer handled at the UI layer but is now implicitly managed by the core transformer.
| it('should preserve footnote HTML structure in data-verse-footnote-content', () => { | ||
| const html = ` | ||
| <div> | ||
| <div class="p"> | ||
| <span class="yv-v" v="1"></span><span class="yv-vlbl">1</span>Text<span class="yv-n f"><span class="ft"><em>Emphasized</em> note</span></span>.</div> | ||
| </div> | ||
| `; | ||
|
|
||
| const result = transformBibleHtml(html, createAdapters()); | ||
|
|
||
| const doc = new DOMParser().parseFromString(result.html, 'text/html'); | ||
| const anchor = doc.querySelector('[data-verse-footnote="1"]'); | ||
| expect(anchor).not.toBeNull(); | ||
| const content = anchor!.getAttribute('data-verse-footnote-content'); |
There was a problem hiding this comment.
Test name does not match what is actually asserted
The test is titled 'should not call sanitize when omitted', but the assertion checks that parseHtml was called (expect(called).toBe(true)). The test does not actually verify that the sanitize function is bypassed when omitted; it only confirms the pipeline still executes. A call to a non-existent sanitize would throw an error and also cause the test to pass (the error would surface before the assertion), so the title can mislead readers into thinking this is a no-sanitize-path coverage test.
Consider renaming the test and adding an explicit guard or spy:
| it('should preserve footnote HTML structure in data-verse-footnote-content', () => { | |
| const html = ` | |
| <div> | |
| <div class="p"> | |
| <span class="yv-v" v="1"></span><span class="yv-vlbl">1</span>Text<span class="yv-n f"><span class="ft"><em>Emphasized</em> note</span></span>.</div> | |
| </div> | |
| `; | |
| const result = transformBibleHtml(html, createAdapters()); | |
| const doc = new DOMParser().parseFromString(result.html, 'text/html'); | |
| const anchor = doc.querySelector('[data-verse-footnote="1"]'); | |
| expect(anchor).not.toBeNull(); | |
| const content = anchor!.getAttribute('data-verse-footnote-content'); | |
| it('should still run parseHtml when sanitize is not provided', () => { | |
| const html = '<div>Test</div>'; | |
| let called = false; | |
| transformBibleHtml(html, { | |
| parseHtml: (h) => { | |
| called = true; | |
| return new DOMParser().parseFromString(h, 'text/html'); | |
| }, | |
| serializeHtml: (doc) => doc.body.innerHTML, | |
| }); | |
| expect(called).toBe(true); | |
| }); |
packages/ui/src/components/verse.tsx
Outdated
| const transformedData = useMemo<TransformedBibleHtml>( | ||
| () => transformBibleHtmlForBrowser(html), | ||
| [html], | ||
| ); |
There was a problem hiding this comment.
🔴 HTML sanitization (DOMPurify) removed — unsanitized HTML rendered via dangerouslySetInnerHTML
This PR removes isomorphic-dompurify from packages/ui/package.json and eliminates all HTML sanitization from the transformation pipeline. The old transformBibleHtml in verse-html-utils.ts ran DOMPurify.sanitize(html, DOMPURIFY_CONFIG) before any processing. The new transformBibleHtmlForBrowser in core performs no sanitization at all.
The unsanitized HTML is then rendered at three injection points in verse.tsx:
contentRef.current.innerHTML = html(line 178)dangerouslySetInnerHTML={{ __html: verseHtml }}(line 106)dangerouslySetInnerHTML={{ __html: note }}(line 120)
This violates the repository's explicit review guideline at docs/review-guidelines.md:19: "Is XSS protection properly implemented (no dangerouslySetInnerHTML without sanitization)?"
Additionally, the biome-ignore comments on lines 105 and 119 still claim "HTML has been run through DOMPurify and is safe" — this is now factually incorrect since DOMPurify was removed. The XSS protection tests (script tag removal, event handler stripping, javascript: URL removal) were also deleted from verse.test.tsx.
Was this helpful? React with 👍 or 👎 to provide feedback.
Check for the existence of `globalThis.DOMParser` before attempting to use it. This prevents errors in environments where the DOMParser might not be available.
This commit introduces linkedom as a dependency for the `@youversion/platform-core` package, enabling the `transformBibleHtmlForNode` function to correctly process and transform Bible HTML in Node.js environments. The `transformBibleHtmlForNode` function now correctly wraps HTML in `<html><body>` tags before parsing to ensure proper serialization of `doc.body.innerHTML`. Dependency updates include adding `linkedom` and upgrading `htmlparser2` and `entities` to their latest compatible versions. Test cases for `bible-html-transformer.node.test.ts` have been adjusted to accommodate potential attribute ordering differences in `linkedom`'s serialization and to correctly assert the encoded non-breaking space.
The regular expression for determining if a space is needed before a character was too restrictive and missed cases where apostrophes should be preceded by a space in the transformed HTML. This commit expands the character set to include single quotes, ensuring proper rendering.
Ensure that the bible-html-transformer does not crash if a footnote element is missing its key attribute.
The `transformBibleHtml` function and its variants have been refactored to embed footnote data directly within the HTML as `data-verse-footnote` and `data-verse-footnote-content` attributes. This eliminates the need for the separate `notes` property in the return type, simplifying the API and making the output HTML self-contained. Tests have been updated to reflect these changes, focusing on the presence and content of the new data attributes in the transformed HTML. The `VerseNotes` type has been removed from the core package.
| export function transformBibleHtmlForNode(html: string): TransformedBibleHtml { | ||
| // eslint-disable-next-line @typescript-eslint/no-require-imports | ||
| const { DOMParser } = require('linkedom') as { | ||
| DOMParser: new () => { parseFromString(html: string, type: string): Document }; | ||
| }; | ||
|
|
||
| return transformBibleHtml(html, { | ||
| parseHtml: (h: string) => | ||
| new DOMParser().parseFromString(`<html><body>${h}</body></html>`, 'text/html'), | ||
| serializeHtml: (doc: Document) => doc.body.innerHTML, | ||
| }); | ||
| } |
There was a problem hiding this comment.
require() throws ReferenceError in ESM environments
packages/core/package.json declares "type": "module", making this package native ESM. When any bundler or Node.js runtime resolves the "import" export condition, require is not defined in module scope and calling transformBibleHtmlForNode will throw:
ReferenceError: require is not defined in ES module scope
This will affect Next.js App Router (uses the "import" condition on the server), Remix, Astro SSR, and any ESM-first environment.
Replace the synchronous require() with a dynamic import to make it ESM-compatible:
export async function transformBibleHtmlForNode(html: string): Promise<TransformedBibleHtml> {
const { DOMParser } = await import('linkedom');
return transformBibleHtml(html, {
parseHtml: (h: string) =>
new DOMParser().parseFromString(`<html><body>${h}</body></html>`, 'text/html'),
serializeHtml: (doc: Document) => doc.body.innerHTML,
});
}Note this also changes the return type to Promise<TransformedBibleHtml>, which will require callers to await the result — a breaking change from the current synchronous API.
The node test (bible-html-transformer.node.test.ts) also needs to be updated to await the call.
| export function transformBibleHtmlForBrowser(html: string): TransformedBibleHtml { | ||
| if (typeof globalThis.DOMParser === 'undefined') { | ||
| return { html }; | ||
| } | ||
|
|
||
| return transformBibleHtml(html, { | ||
| parseHtml: (h) => new DOMParser().parseFromString(h, 'text/html'), | ||
| serializeHtml: (doc) => doc.body.innerHTML, | ||
| }); | ||
| } |
There was a problem hiding this comment.
🔴 Core package exports browser-only DOMParser API, violating package boundary rules
transformBibleHtmlForBrowser directly uses the browser-only globalThis.DOMParser (line 287) and new DOMParser() (line 292). The core package's AGENTS.md explicitly states: "❌ Don't: Import React, window, document, or browser-only APIs directly". DOMParser is a browser-only API that doesn't exist in Node.js. While the function guards with a runtime check, the rule prohibits having browser APIs in core at all.
Similarly, transformBibleHtmlForNode at line 308 uses require('linkedom'), which is a Node.js-specific CJS API in a package with "type": "module". The core package should only export the generic transformBibleHtml with its adapter pattern; the environment-specific convenience functions belong in the respective consuming packages (UI for browser, a server package for Node).
Prompt for agents
Move `transformBibleHtmlForBrowser` out of packages/core/src/bible-html-transformer.ts and into the UI package (e.g., packages/ui/src/lib/verse-html-utils.ts or a new file), since it uses browser-only DOMParser. Similarly, move `transformBibleHtmlForNode` (lines 306-317) to a server-side package or keep it as documentation/example only. The core package should only export the generic `transformBibleHtml` function and its types (`TransformBibleHtmlOptions`, `TransformedBibleHtml`). Update packages/core/src/index.ts to stop exporting the environment-specific functions, and update packages/ui/src/components/verse.tsx to import from the local wrapper instead of @youversion/platform-core.
Was this helpful? React with 👍 or 👎 to provide feedback.
| type VerseNotes, | ||
| } from '@/lib/verse-html-utils'; | ||
| import { type FontFamily, getFootnoteMarker } from '@/lib/verse-html-utils'; | ||
| import { transformBibleHtmlForBrowser } from '@youversion/platform-core'; |
There was a problem hiding this comment.
🔴 UI component imports directly from @youversion/platform-core instead of using local re-exports
verse.tsx:20 imports transformBibleHtmlForBrowser directly from @youversion/platform-core. The UI package's AGENTS.md explicitly states: "❌ Don't: Import from @youversion/platform-core directly (except re-exports in index.ts)". No other component file in packages/ui/src/components/ imports from @youversion/platform-core — this is the only instance, breaking the established pattern. The import should go through a local re-export or utility file.
Was this helpful? React with 👍 or 👎 to provide feedback.
The `getFootnoteMarker` function was moved from `verse-html-utils.ts` to `verse.tsx`. This is because the logic is specifically related to rendering footnote markers within the Verse component and does not have broader utility for the entire UI package. This change adheres to the package boundary guidelines by keeping related logic localized.
| function getVerseHtmlFromDom(container: HTMLElement, verseNum: string): string { | ||
| const wrappers = container.querySelectorAll(`.yv-v[v="${verseNum}"]`); | ||
| if (!wrappers.length) return ''; | ||
|
|
||
| const parts: string[] = []; | ||
| let noteIdx = 0; | ||
|
|
||
| wrappers.forEach((wrapper, i) => { | ||
| if (i > 0) parts.push(' '); | ||
| const clone = wrapper.cloneNode(true) as Element; | ||
| clone.querySelectorAll('.yv-h, .yv-vlbl').forEach((el) => el.remove()); | ||
| clone.querySelectorAll('[data-verse-footnote]').forEach((anchor) => { | ||
| const sup = document.createElement('sup'); | ||
| sup.className = 'yv:text-muted-foreground'; | ||
| sup.textContent = getFootnoteMarker(noteIdx++); | ||
| anchor.replaceWith(sup); | ||
| }); | ||
| parts.push(clone.innerHTML); | ||
| }); | ||
|
|
||
| return parts.join(''); | ||
| } |
There was a problem hiding this comment.
getVerseHtmlFromDom outputs unsanitized clone.innerHTML into dangerouslySetInnerHTML
getVerseHtmlFromDom clones verse wrapper nodes from the live DOM and returns their raw innerHTML as a string. That string is then set via dangerouslySetInnerHTML={{ __html: verseHtml }} (line 120). The clone preserves any HTML attributes and child content exactly as they appear in the DOM — including any injected markup if the API response ever contained unexpected HTML.
More broadly, this PR removes isomorphic-dompurify and deletes all four XSS security tests (should remove script tags, should remove inline event handlers (onerror), should remove inline event handlers (onclick), should remove javascript: URLs). The new justification comment "comes from our YouVersion APIs and is safe" is an operational assumption, not a technical guarantee. If a Bible translation licensed from a third-party publisher contained crafted footnote content, or if the API were misconfigured, this path would execute arbitrary HTML in the user's browser.
Consider either re-introducing a lightweight sanitization pass (e.g. strip event-handler attributes from the cloned tree before serializing) or adding a note acknowledging this risk has been deliberately accepted, so it is not silently re-reported by security scanners.
| // Second pass: create one entry per anchor (each anchor gets its own portal) | ||
| const result: VerseFootnoteData[] = []; | ||
| anchors.forEach((el) => { | ||
| const verseNum = el.getAttribute('data-verse-footnote'); | ||
| if (verseNum) result.push({ verseNum, el }); | ||
| if (!verseNum) return; | ||
| const allNotes = notesByKey.get(verseNum) || []; | ||
| const hasVerseContext = el.closest('.yv-v[v]') !== null; | ||
| const verseHtml = hasVerseContext ? getVerseHtmlFromDom(contentRef.current!, verseNum) : ''; | ||
| result.push({ verseNum, el, notes: allNotes, verseHtml, hasVerseContext }); | ||
| }); |
There was a problem hiding this comment.
getVerseHtmlFromDom called once per anchor instead of once per verse
In the second pass, getVerseHtmlFromDom is invoked for every footnote anchor that has a verse context. If verse 1 has three footnotes, three separate querySelectorAll + cloneNode + querySelectorAll operations are performed to build the identical verseHtml string. For footnote-dense passages (e.g. Psalms) this creates an O(N²) DOM clone loop inside a useLayoutEffect.
The verseHtml depends only on verseNum, so it can be computed once per unique key using the notesByKey map that's already available from the first pass:
// After the first pass, compute verseHtml per unique key
const verseHtmlByKey = new Map<string, string>();
notesByKey.forEach((_, key) => {
const sample = contentRef.current!.querySelector(`[data-verse-footnote="${key}"]`);
const hasContext = sample?.closest('.yv-v[v]') !== null;
verseHtmlByKey.set(key, hasContext ? getVerseHtmlFromDom(contentRef.current!, key) : '');
});
// Then in the second pass:
const verseHtml = verseHtmlByKey.get(verseNum) ?? '';| export function transformBibleHtmlForNode(html: string): TransformedBibleHtml { | ||
| // eslint-disable-next-line @typescript-eslint/no-require-imports | ||
| const { DOMParser } = require('linkedom') as { | ||
| DOMParser: new () => { parseFromString(html: string, type: string): Document }; | ||
| }; | ||
|
|
||
| return transformBibleHtml(html, { | ||
| parseHtml: (h: string) => | ||
| new DOMParser().parseFromString(`<html><body>${h}</body></html>`, 'text/html'), | ||
| serializeHtml: (doc: Document) => doc.body.innerHTML, | ||
| }); | ||
| } |
There was a problem hiding this comment.
transformBibleHtmlForNode signature incompatible with future ESM-safe fix
The previous thread flagged that require('linkedom') throws ReferenceError in ESM environments. The correct fix is to replace it with await import('linkedom'), but that would force transformBibleHtmlForNode to return Promise<TransformedBibleHtml> instead of TransformedBibleHtml. Every call site in the node tests — and any user code — calls the function synchronously:
const result = transformBibleHtmlForNode(html); // currently syncChanging to async without a major-version bump would be a silent breaking change. This is a good moment to decide on the intended contract before the API is published. Options:
- Ship the function as
asyncnow and update all callers/tests toawaitthe result. - Keep it synchronous but document that it only supports CJS environments and add a runtime check for
typeof require. - Accept a
DOMParser-like factory as a parameter (matching thetransformBibleHtmlpattern) so the function stays sync and callers provide the linkedom parser.
|
|
||
| const FOOTNOTE_KEY_ATTR = 'data-footnote-key'; | ||
|
|
||
| const NEEDS_SPACE_BEFORE = /^[^\s.,;:!?)}\]'"»›]/; |
There was a problem hiding this comment.
NEEDS_SPACE_BEFORE missing ASCII apostrophe causes spurious space before possessives
The regex character class excludes ' (U+2019 RIGHT SINGLE QUOTATION MARK) and " (U+201C LEFT DOUBLE QUOTATION MARK), but the straight ASCII apostrophe ' (U+0027) is absent from the negated list. Any text node whose first character is a plain apostrophe — possessives like 's, contractions like 'll, 've — will match the regex and set nextNeedsSpace = true, inserting a spurious space after the footnote anchor.
For example God[fn]'s word would become God [fn] 's word instead of God[fn]'s word.
| const NEEDS_SPACE_BEFORE = /^[^\s.,;:!?)}\]'"»›]/; | |
| const NEEDS_SPACE_BEFORE = /^[^\s.,;:!?)}\]'"'»›]/; |
https://lifechurch.atlassian.net/browse/YPE-1814