fix issue 964 #965

Open
jnhyperion wants to merge 3 commits into jsvine:develop from jnhyperion:stable

jnhyperion commented Aug 10, 2023

I found that this issue is caused by some blank characters overlapping the non-blank characters that follow them.
The simple solution is to remove these overlapping blank characters.

fix: #964
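
To make the idea concrete, here is a minimal sketch of the filtering described above. It is not the code from this PR's commits. It assumes each character is a dict with `text`, `x0`, `x1`, `top`, and `bottom` keys (the shape pdfplumber uses for char objects), and the helper names `boxes_overlap` and `drop_overlapped_blanks` are hypothetical.

```python
def boxes_overlap(a, b):
    """Return True if the bounding boxes of two char dicts intersect with positive area."""
    return (
        a["x0"] < b["x1"] and b["x0"] < a["x1"]
        and a["top"] < b["bottom"] and b["top"] < a["bottom"]
    )


def drop_overlapped_blanks(chars):
    """Drop each blank char whose box overlaps the non-blank char right after it.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    kept = []
    for i, ch in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        if (
            ch["text"].isspace()
            and nxt is not None
            and not nxt["text"].isspace()
            and boxes_overlap(ch, nxt)
        ):
            continue  # blank sits on top of the next glyph: treat it as spurious
        kept.append(ch)
    return kept


# Made-up sample data: the first blank overlaps "o", the second is a genuine gap.
chars = [
    {"text": "f", "x0": 0, "x1": 5, "top": 0, "bottom": 10},
    {"text": " ", "x0": 5, "x1": 8, "top": 0, "bottom": 10},
    {"text": "o", "x0": 6, "x1": 11, "top": 0, "bottom": 10},
    {"text": " ", "x0": 11, "x1": 14, "top": 0, "bottom": 10},
    {"text": "x", "x0": 14, "x1": 19, "top": 0, "bottom": 10},
]
cleaned = drop_overlapped_blanks(chars)
text = "".join(c["text"] for c in cleaned)
```

With this sample, only the blank that overlaps the following glyph is removed, so the genuine space between words survives.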

jsvine (Owner) commented Aug 16, 2023

Thanks for this proposal, @jnhyperion. I think this particular change isn't quite right for the library, as it's quite specific to a particular (and relatively uncommon) edge case. I find that changes like this might fix the handling of some PDFs but risk causing problems for others, as there's such a wide variety of PDFs. But perhaps we can think of a more general feature that would still help for your use case, such as a simple `.extract_text(ignore_whitespace=True)` parameter or a `Page.remove_whitespace(..., only_overlapping=True)` method (in a similar spirit to `Page.dedupe_chars(...)`).
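
As a rough illustration of how an `only_overlapping` switch could behave, the sketch below works on a plain list of char dicts rather than on a `Page`. The standalone function `remove_whitespace` and its exact semantics are assumptions made for this example; they are not taken from the PR's commits or from pdfplumber's released API.

```python
def remove_whitespace(chars, only_overlapping=False):
    """Return chars with whitespace removed.

    Hypothetical standalone sketch, not pdfplumber's API.
    only_overlapping=False: drop every whitespace char.
    only_overlapping=True: drop only whitespace whose box overlaps a non-blank char.
    """
    def overlaps(a, b):
        # Strict inequalities: boxes that merely touch do not count as overlapping.
        return (
            a["x0"] < b["x1"] and b["x0"] < a["x1"]
            and a["top"] < b["bottom"] and b["top"] < a["bottom"]
        )

    non_blank = [c for c in chars if not c["text"].isspace()]
    result = []
    for c in chars:
        if c["text"].isspace():
            if not only_overlapping:
                continue
            if any(overlaps(c, other) for other in non_blank):
                continue
        result.append(c)
    return result


# Made-up sample data: the first blank overlaps "o", the second does not overlap anything.
sample = [
    {"text": "f", "x0": 0, "x1": 5, "top": 0, "bottom": 10},
    {"text": " ", "x0": 5, "x1": 8, "top": 0, "bottom": 10},
    {"text": "o", "x0": 6, "x1": 11, "top": 0, "bottom": 10},
    {"text": " ", "x0": 11, "x1": 14, "top": 0, "bottom": 10},
    {"text": "x", "x0": 14, "x1": 19, "top": 0, "bottom": 10},
]
all_removed = "".join(c["text"] for c in remove_whitespace(sample))
overlap_only = "".join(
    c["text"] for c in remove_whitespace(sample, only_overlapping=True)
)
```

Defaulting the flag to `False` keeps the general behavior simple, while `only_overlapping=True` covers the narrower edge case from the linked issue without affecting PDFs where whitespace is placed normally.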

Added `page.remove_whitespace(only_overlapping=False, ...)`

jnhyperion (Author) commented

You're right. I added a new method, `Page.remove_whitespace`.

jnhyperion changed the base branch from stable to develop on August 28, 2023 02:05

Development

Successfully merging this pull request may close these issues.

extracted word is broken