Describe the bug
Fetched the following paper from Cell: https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930277-6
It's open access.
Extracting text from it crashes the process; memory metrics passed 6 GB before the crash.
Code to reproduce the problem
import pdfplumber

# pdf_path: path to a local copy of the PDF linked in this report
text = []
with pdfplumber.open(pdf_path) as pdf:
    for page_plumb in pdf.pages:
        page_text = page_plumb.extract_text()
        if page_text:
            text.append(page_text)
extracted_text = "\n".join(text)
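For comparison, here is a variant of the snippet above that drops each page's cached objects right after its text is extracted. It is only a sketch: it assumes `Page.flush_cache()` is available in the installed pdfplumber version (newer releases also document `Page.close()`), and it may not help if a single page is what exhausts memory. The helper names `iter_page_text` and `extract_text_low_memory` are made up for illustration.

```python
from typing import Iterable, Iterator


def iter_page_text(pages: Iterable) -> Iterator[str]:
    """Yield non-empty page text, flushing each page's cache after use."""
    for page in pages:
        page_text = page.extract_text()
        # Assumption: flush_cache() releases the page's cached parsed objects.
        page.flush_cache()
        if page_text:
            yield page_text


def extract_text_low_memory(pdf_path: str) -> str:
    # Imported here so the helper above can be exercised without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(iter_page_text(pdf.pages))
```

If memory still grows without bound with this variant, the cost is probably inside a single page rather than accumulated across pages.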
PDF file
https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930277-6
Expected behavior
The process should not crash on a 40 MB PDF.
Text extraction from a 40 MB PDF should need less than 6 GB of RAM.
Actual behavior
Crashed, presumably due to memory consumption.
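To check the memory hypothesis and narrow down which page triggers the growth, peak memory could be logged after each page. The sketch below uses the standard-library `resource` module (Unix only) and assumes `ru_maxrss` is reported in kilobytes, as it is on Linux. `peak_rss_mb` and `log_peak_memory` are hypothetical helper names, not part of pdfplumber.

```python
import resource


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB (Linux: ru_maxrss is in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def log_peak_memory(pages, report=print) -> None:
    """Extract text page by page and report the peak RSS seen so far."""
    for number, page in enumerate(pages, start=1):
        page.extract_text()
        report(f"page {number}: peak RSS {peak_rss_mb():.0f} MB")
```

Calling `log_peak_memory(pdf.pages)` inside the `with pdfplumber.open(pdf_path) as pdf:` block would show the last page reached before the crash and how quickly the peak rises.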
Environment
- pdfplumber version: 0.11.6
- Python version: 3.10
- OS: Linux (Ubuntu and Amazon Linux)