Fix extractPlainText() dropping content from Code, Math, and RawInline nodes#86
Conversation
…e nodes HeadingIdTracker::extractPlainText() recursed into Code, Math, and RawInline nodes as generic Node children but never called getContent() on them, so their text was silently dropped from TOC labels and heading IDs. Add explicit handling for these content-bearing inline nodes before the generic Node fallback, so their is appended to the plain-text output. Fixes: heading TOC entries and IDs for headings containing inline code spans, math, or raw inline markup (e.g. '## The `foo` function' now yields text 'The foo function').
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #86 +/- ##
=========================================
Coverage 93.86% 93.86%
- Complexity 2130 2133 +3
=========================================
Files 77 77
Lines 5704 5706 +2
=========================================
+ Hits 5354 5356 +2
Misses 350 350 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This pull request fixes a bug in HeadingIdTracker::extractPlainText() where inline nodes containing text in a $content property (Code, Math, and RawInline) were being silently dropped during plain text extraction. This affected table of contents entry labels and auto-generated heading anchor IDs.
Changes:
- Added explicit handling for
Code,Math, andRawInlinenodes in theextractPlainText()method - Added test cases for inline code and inline math in TOC entries
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/Renderer/HeadingIdTracker.php | Added imports for Code, Math, and RawInline classes; added conditional branch to extract content from these node types |
| tests/TestCase/Extension/TableOfContentsExtensionTest.php | Added two test methods verifying that inline code and inline math content is properly included in TOC entries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| $text .= $child->getContent(); | ||
| } elseif ($child instanceof SoftBreak || $child instanceof HardBreak) { | ||
| $text .= ' '; | ||
| } elseif ($child instanceof Code || $child instanceof Math || $child instanceof RawInline) { |
There was a problem hiding this comment.
RawInline content should not be included in plain text extraction. The PlainTextRenderer (src/Renderer/PlainTextRenderer.php:139) explicitly skips RawInline nodes with the comment "Skip raw inlines (format-specific)". Raw inline content is format-specific (e.g., HTML, LaTeX) and should not be part of heading IDs or TOC entries. Consider removing RawInline from this condition and only including Code and Math.
| } elseif ($child instanceof Code || $child instanceof Math || $child instanceof RawInline) { | |
| } elseif ($child instanceof Code || $child instanceof Math) { |
| } elseif ($child instanceof Code || $child instanceof Math || $child instanceof RawInline) { | ||
| $text .= $child->getContent(); |
There was a problem hiding this comment.
Consider whether Symbol and FootnoteRef inline nodes should also be handled in plain text extraction. The PlainTextRenderer renders Symbol as ':name:' and FootnoteRef as '[label]'. Symbol content (e.g., ':heart:' in a heading like '## I :heart: Djot') could be meaningful for heading IDs and TOC entries, though FootnoteRef references are probably better excluded. Review the PlainTextRenderer implementation at src/Renderer/PlainTextRenderer.php:135-136 for reference.
| $toc = $tocExtension->getToc(); | ||
|
|
||
| $this->assertSame('Equation E=mc^2', $toc[0]['text']); | ||
| } |
There was a problem hiding this comment.
The test coverage is incomplete - RawInline is included in the fix but there's no test case demonstrating its behavior. Given that RawInline content is format-specific and the PlainTextRenderer explicitly skips it, a test would help clarify the intended behavior. However, as noted in another comment, RawInline probably should not be included in the plain text extraction at all.
| } | |
| } | |
| public function testRawInlineIsExcludedFromTocEntry(): void | |
| { | |
| $converter = new DjotConverter(); | |
| $tocExtension = new TableOfContentsExtension(); | |
| $converter->addExtension($tocExtension); | |
| // Djot raw inline: {=html}<span>raw</span> | |
| $converter->convert("## Hello {=html}<span>raw</span>\n"); | |
| $toc = $tocExtension->getToc(); | |
| // PlainTextRenderer skips RawInline content, so only "Hello" should remain | |
| $this->assertSame('Hello', $toc[0]['text']); | |
| $this->assertSame('Hello', $toc[0]['id']); | |
| } |
|
|
||
| $toc = $tocExtension->getToc(); | ||
|
|
||
| $this->assertSame('Equation E=mc^2', $toc[0]['text']); |
There was a problem hiding this comment.
The inline math test should also verify the generated heading ID, similar to the inline code test on line 339. This would ensure that math content is properly included in both the TOC text and the auto-generated ID. The expected ID for 'Equation E=mc^2' would be 'Equation-E=mc^2' based on the normalizeId implementation which only removes '#' and replaces whitespace with dashes.
| $this->assertSame('Equation E=mc^2', $toc[0]['text']); | |
| $this->assertSame('Equation E=mc^2', $toc[0]['text']); | |
| $this->assertSame('Equation-E=mc^2', $toc[0]['id']); |
|
what about those comments about inline? |
Merged and released! |
Problem
HeadingIdTracker::extractPlainText()recursively walks a heading's child nodes but only reads text fromText,SoftBreak, andHardBreak. Inline nodes that carry their value in a$contentproperty —Code,Math, andRawInline— are hit by the genericNodefallback, recursed into, and find noTextchildren, so their content is silently dropped.This affects TOC entry labels, auto-generated heading anchor IDs, and anything that calls
getPlainText()on a heading.Reproducer:
Fix
Add an explicit
elseifbranch forCode,Math, andRawInlinebefore the genericNodefallback inextractPlainText():Changes
Code,Math,RawInline; add the new branch inextractPlainText()testInlineCodeTextIsIncludedInTocEntry()andtestInlineMathTextIsIncludedInTocEntry()