[BUG: Output] ListItem html logic strips all internal dashes/hyphens from the first formatted line. #1024

@vanderalex

Description

📝 Describe the Output Issue

There is a bug in marker/schema/blocks/listitem.py inside the replace_bullets function that causes the silent loss of valid dashes and hyphens located inside the text of a recognized list item.

When a paragraph is classified as a ListItem (for example, dialogue lines that start with an em-dash —), replace_bullets attempts to remove the original bullet character from the OCR HTML so markdownify can generate a clean markdown list.

However, re.sub is used without count=1:

python
bullet_pattern = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html)

The pattern matches any bullet-like character, including a hyphen, en-dash, or em-dash, that follows the start of the string, a space, a newline, or an HTML tag and is itself followed by whitespace. Because re.sub replaces every match by default, it strips every hyphen or dash surrounded by spaces in the entire first line, not just the leading bullet marker.

Example:
Original OCR text: — I saw him — he interrupted.
Expected Markdown output: - I saw him — he interrupted.
Actual Markdown output: - I saw him he interrupted. (The internal em-dash is completely lost).
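
The example above can be reproduced outside marker with a short script. This is only a minimal sketch: the pattern is reconstructed from the snippet in this report (the authoritative version is the one in listitem.py), and strip_bullets_all is a hypothetical helper name, not a marker function.

```python
import re

# Pattern as reconstructed from this report; the real one lives in
# marker/schema/blocks/listitem.py and may differ.
BULLET_PATTERN = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"


def strip_bullets_all(html: str) -> str:
    """Hypothetical helper mimicking the reported call: no count argument,
    so re.sub replaces every match in the string."""
    return re.sub(BULLET_PATTERN, r"\1\2", html)


line = "— I saw him — he interrupted."
result = strip_bullets_all(line)

# Both the leading em-dash and the internal one are removed.
print(repr(result))
print("internal em-dash kept:", "—" in result)
```

Only the dash characters are dropped; the surrounding whitespace captured by the two groups is written back, which is why the loss is easy to miss in the rendered markdown.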

Proposed Fix: Simply add count=1 to the regex replacement in listitem.py so it only targets the leading bullet marker:

python
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html, count=1)
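
A quick way to check the proposed change in isolation is sketched here. It assumes the same reconstructed pattern as in this report (the authoritative version is in listitem.py), and strip_leading_bullet is a hypothetical helper name used only for illustration.

```python
import re

# Pattern as reconstructed from this report; the real one lives in
# marker/schema/blocks/listitem.py and may differ.
BULLET_PATTERN = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"


def strip_leading_bullet(html: str) -> str:
    """Hypothetical helper applying the proposed fix: count=1 limits
    re.sub to the first (leftmost) match."""
    return re.sub(BULLET_PATTERN, r"\1\2", html, count=1)


fixed = strip_leading_bullet("— I saw him — he interrupted.")
wrapped = strip_leading_bullet("<p>• first - second</p>")

# The leading marker is removed; later dashes are left alone.
print(repr(fixed))
print(repr(wrapped))
```

One caveat worth keeping in mind: count=1 removes the first match wherever it occurs, so this only targets the leading marker when the line actually begins with one. A line with no leading bullet character would still lose its first space-delimited dash.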
