📝 Describe the Output Issue
There is a bug in marker/schema/blocks/listitem.py inside the replace_bullets function that causes the silent loss of valid dashes and hyphens located inside the text of a recognized list item.
When a paragraph is classified as a ListItem (for example, dialogue lines that start with an em-dash —), replace_bullets attempts to remove the original bullet character from the OCR HTML so markdownify can generate a clean markdown list.
However, re.sub is used without count=1:
```python
bullet_pattern = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html)
```
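The pattern matches any bullet character, including a hyphen, en-dash, or em-dash, that follows a space (or the start of the string, a newline, or an HTML tag) and precedes whitespace. Because re.sub replaces every match by default (count=0), it strips every dash surrounded by spaces in the entire first line, not just the leading bullet marker.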
Example:
Original OCR text: — I saw him — he interrupted.
Expected Markdown output: - I saw him — he interrupted.
Actual Markdown output: - I saw him he interrupted. (The internal em-dash is completely lost).
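The example can be reproduced outside marker with a minimal sketch. It assumes the pattern captures the whitespace after the bullet as a second group (which the `\1\2` replacement implies) and uses a plain string in place of first_block.html; `strip_bullets_global` is a hypothetical helper name, not part of marker.

```python
import re

# Assumed shape of the pattern: a bullet character preceded by start-of-string,
# a newline/space, or an HTML tag, and followed by whitespace (group 2).
BULLET_PATTERN = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"


def strip_bullets_global(html: str) -> str:
    """Hypothetical stand-in for the current behavior: replace every match."""
    return re.sub(BULLET_PATTERN, r"\1\2", html)


line = "— I saw him — he interrupted."
result = strip_bullets_global(line)
print(repr(result))  # both em-dashes are gone, only whitespace remains
```

Both the leading marker and the internal em-dash match the pattern, so both are removed and only the surrounding whitespace is left behind.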
Proposed Fix: Simply add count=1 to the regex replacement in listitem.py so it only targets the leading bullet marker:
```python
first_block.html = re.sub(bullet_pattern, r"\1\2", first_block.html, count=1)
```
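A quick way to check the proposed change in isolation is sketched here. It assumes the same pattern shape as above (two capture groups, per the `\1\2` replacement) and operates on plain strings; `strip_leading_bullet` is a hypothetical helper name, not part of marker.

```python
import re

# Assumed pattern shape: group 1 is the prefix (start, newline/space, or tag),
# group 2 is the whitespace that follows the bullet character.
BULLET_PATTERN = r"(^|[\n ]|<[^>]*>)[•●○ഠ ം◦■▪▫–—-](\s)"


def strip_leading_bullet(html: str) -> str:
    """Hypothetical helper: remove only the first bullet match."""
    return re.sub(BULLET_PATTERN, r"\1\2", html, count=1)


fixed = strip_leading_bullet("— I saw him — he interrupted.")
print(repr(fixed))  # the internal em-dash is kept

# Caveat: with no leading bullet, the first match is an internal dash.
no_bullet = strip_leading_bullet("I saw him — he interrupted.")
print(repr(no_bullet))
```

With count=1 the internal em-dash survives whenever the line starts with a bullet marker. One possible gap, shown by the second call: if the first line has no leading marker, the first match is an internal dash and it is still removed, so a stricter follow-up might anchor the pattern to the start of the text.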