Crashes process (probably memory) on a modest PDF #1339

@meirqed

Describe the bug

Fetched the following paper from Cell: https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930277-6
It's open access.
Extracting its text crashes the process; memory usage passed 6 GB before the crash.

Code to reproduce the problem

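        import pdfplumber

        # Local copy of the PDF linked below (the file name is a placeholder).
        pdf_path = "paper.pdf"

        text = []
        with pdfplumber.open(pdf_path) as pdf:
            for page_plumb in pdf.pages:
                page_text = page_plumb.extract_text()
                if page_text:
                    text.append(page_text)
        extracted_text = "\n".join(text)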
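
A possible variant of the loop above, as a minimal sketch: pdfplumber pages cache their parsed objects, and the README describes `Page.close()` as a way to release that cache when parsing large PDFs. The helper name `extract_text_low_memory` and the `open_pdf` parameter are hypothetical, introduced here only so the loop can be exercised without a real PDF. Whether this avoids the crash for this particular file is untested; if the memory is consumed while processing a single page, releasing the cache between pages will not help.

```python
def extract_text_low_memory(pdf_path, open_pdf=None):
    """Extract text page by page, releasing each page's cache afterwards.

    `open_pdf` is a hypothetical hook for testing; by default it is
    pdfplumber.open.
    """
    if open_pdf is None:
        # Imported lazily so the helper can be tried with a stub opener.
        import pdfplumber
        open_pdf = pdfplumber.open

    parts = []
    with open_pdf(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                parts.append(page_text)
            # Drop the page's cached layout and object data before moving on.
            page.close()
    return "\n".join(parts)
```

Usage would be `extracted_text = extract_text_low_memory(pdf_path)`, which produces the same joined string as the original snippet.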

PDF file

https://www.cell.com/action/showPdf?pii=S0092-8674%2819%2930277-6

Expected behavior

No crash on a 40 MB PDF.
Text extraction from a 40 MB PDF should need less than 6 GB of RAM.

Actual behavior

Crashed, presumably due to memory consumption.
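
To check whether memory really is the cause, and which page triggers the growth, something like the following could log the process's peak resident memory after each page. This is a diagnostic sketch using only the standard library's `resource` module (Unix only); `peak_rss_mb` and `trace_pages` are hypothetical helper names, not part of pdfplumber.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def trace_pages(pages, extract):
    """Yield (page_index, text, peak_rss_mb) after extracting each page."""
    for index, page in enumerate(pages):
        text = extract(page)
        yield index, text, peak_rss_mb()
```

With pdfplumber this might be driven as `for i, _, mb in trace_pages(pdf.pages, lambda p: p.extract_text()): print(i, mb)`. The last page index printed before the crash, and the jump in the reported peak, would narrow down where the memory goes.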

Environment

  • pdfplumber version: 0.11.6
  • Python version: 3.10
  • OS: Linux (Ubuntu and Amazon Linux)
