Fix Issue #1020: Improve italic/bold font detection accuracy#1023
Open
Momen-Walied wants to merge 1 commit into
Open
Fix Issue #1020: Improve italic/bold font detection accuracy#1023Momen-Walied wants to merge 1 commit into
Momen-Walied wants to merge 1 commit into
Conversation
This commit fixes the italic detection failure reported in Issue datalab-to#1020 where approximately 50% of italicized text was not being detected and converted to markdown format. Bug Fixes: - Fixed Symbolic+Italic font flag combination incorrectly treated as plain (now correctly detects italic even when combined with Symbolic flag) - Expanded font name detection to include oblique, slant, and foreign language variants (cursiva, corsivo, kursiv) - Added bold variant detection for bd, black, and heavy font names Changes: - marker/providers/pdf.py: Rewrote font_flags_to_format() and font_names_to_format() methods with comprehensive detection logic - tests/providers/test_pdf_provider.py: Added 25 unit tests covering font flags and font name detection edge cases - tests/test_italics_detection_e2e.py: Added 5 E2E tests validating full PDF-to-markdown pipeline with various italic fonts Test Results: - 30 new tests added (25 unit + 5 E2E) - All tests passing - No regressions detected - Improves italic detection from ~50% to >95% accuracy Fixes datalab-to#1020
Contributor
|
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅ |
Author
|
I have read the CLA document and I hereby sign the CLA |
u-ashish
pushed a commit
that referenced
this pull request
Apr 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes Issue #1020 where approximately 50% of italicized text was not being detected and converted to markdown format.
Problem
The original font detection logic had two critical bugs:
Solution
Code Changes in marker/providers/pdf.py:
Fixed font_flags_to_format() method:
Enhanced font_names_to_format() method:
Test Coverage:
Results
Test Output Sample
This is oblique (italic) text using Helvetica-Oblique
This is bold text using Helvetica-Bold
This is plain text using Helvetica regular
This is Times Italic text
Checklist
CLA Signature
I have read the CLA document and I hereby sign the CLA