[BUG: Output] ListItem html logic strips all internal dashes/hyphens from the first formatted line.

## 📝 Describe the Output Issue

There is a bug in marker/schema/blocks/listitem.py inside the replace_bullets function that causes the silent loss of valid dashes and hyphens located inside the text of a recognized list item.

When a paragraph is classified as a ListItem (for example, dialogue lines that start with an em-dash —), replace_bullets attempts to remove the original bullet character from the OCR HTML so markdownify can generate a clean markdown list.

However, re.sub is used without count=1:

python
bullet_pattern = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](< >)"
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html)
Because the regex looks for any dash surrounded by spaces, a greedy global re.sub ends up stripping every single valid hyphen or em-dash surrounded by spaces in the entire first line, not just the leading bullet item.

Example: 
Original OCR text: — I saw him — he interrupted. 
Expected Markdown output: - I saw him — he interrupted. 
Actual Markdown output: - I saw him he interrupted. (The internal em-dash is completely lost).

Proposed Fix: Simply add count=1 to the regex replacement in listitem.py so it only targets the leading bullet marker:

python
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html, count=1)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG: Output] ListItem html logic strips all internal dashes/hyphens from the first formatted line. #1024

📝 Describe the Output Issue

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG: Output] ListItem html logic strips all internal dashes/hyphens from the first formatted line. #1024

Description

📝 Describe the Output Issue

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions