feat: page_batch_size #1039

Open: superniker wants to merge 2 commits into datalab-to:master from superniker:feat/stream-pages

@superniker

See superniker/marker for details

Problem: DocumentBuilder loads ALL page images (lowres + highres) into
memory simultaneously via provider.get_images().  A 97-page PDF at 192 DPI
consumes ~13 GB RSS, triggering OOM on machines with less RAM.

Root cause: providers/pdf.py:get_images() is a list comprehension that
renders every page before returning.  Combined with JPEG→RGB decode
expansion (~50×), even modest PDFs exhaust memory.
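
As a rough check on the scale involved, the size of one uncompressed RGB render can be estimated from page dimensions and DPI. This is a minimal sketch, not code from the PR: `raw_rgb_bytes` is a hypothetical helper, and the US Letter page size is an assumption (the description does not state page dimensions, so the exact per-page figure will differ).

```python
def raw_rgb_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes needed for an uncompressed 8-bit RGB render of one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel (R, G, B)


# Assumed page size: US Letter (8.5 x 11 in); the PR does not state one.
per_page = raw_rgb_bytes(8.5, 11, 192)
print(f"{per_page / 1e6:.1f} MB per page")
print(f"{97 * per_page / 1e9:.2f} GB for 97 pages (high-res renders only)")
```

The estimate covers only the decoded high-res pixel buffers; low-res copies and any intermediate buffers come on top of it.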

Solution: three changes that together enable a ~100× memory reduction.

1. PageGroup.compress_images()
   After the builder stage, convert PIL Images to JPEG bytes (~100 KB/page
   vs ~13 MB raw).  Reduces 97 pages from 2.5 GB → ~20 MB.

2. PageGroup.get_image() auto-decompress
   If the image is stored as bytes, decompress it on access (the first
   version also cached the result; the revision described below does not).
   Transparent to the ~20 existing call sites in the table, equation,
   debug, and LLM processors.

3. DocumentBuilder.page_batch_size (default 0 = all-in-memory)
   When >0, process N pages at a time through layout → line → OCR, then
   compress their images before loading the next batch.  Peak memory is
   O(batch_size) instead of O(total_pages).
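
The batching idea in item 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the PR's implementation: `iter_page_batches` and `process_document` are hypothetical names, and `render`, `analyze`, and `compress` are caller-supplied stand-ins for page rendering, the layout → line → OCR stages, and image compression.

```python
from typing import Iterator, List, Sequence


def iter_page_batches(page_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield page ids in chunks; batch_size <= 0 means one chunk with everything."""
    if batch_size <= 0:
        yield list(page_ids)
        return
    for start in range(0, len(page_ids), batch_size):
        yield list(page_ids[start:start + batch_size])


def process_document(page_ids, batch_size, render, analyze, compress):
    """Hypothetical driver: render -> analyze -> compress, one batch at a time."""
    compressed = {}
    peak_live = 0  # largest number of raw images held at once
    for batch in iter_page_batches(page_ids, batch_size):
        images = {pid: render(pid) for pid in batch}  # only this batch is decoded
        analyze(images)  # stand-in for the layout/line/OCR stages
        peak_live = max(peak_live, len(images))
        for pid, img in images.items():
            compressed[pid] = compress(img)
        del images  # drop raw images before the next batch is rendered
    return compressed, peak_live
```

With `batch_size=0` the loop runs once over every page, which mirrors the backward-compatible default; with `batch_size=10`, at most ten raw images are alive at any point.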

Usage:
  # CLI
  marker_single large.pdf --page_batch_size 10

  # Python API
  converter = PdfConverter(config={'page_batch_size': 10}, ...)

Backward compatible: page_batch_size=0 preserves existing behaviour.
No changes to LayoutBuilder, LineBuilder, OcrBuilder, or any processor.

The previous version cached PIL Images back to page.lowres_image and
page.highres_image after decompression, causing all images to accumulate
in memory again as downstream processors called get_image().

Now get_image() decompresses the bytes on the fly and returns a fresh PIL
Image without caching it.  The caller uses the image and lets it be
garbage-collected.  This is slightly slower (~5 ms per decompress) but
keeps memory at O(batch_size).
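
The decompress-without-caching pattern described above can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: `PageImageSlot` is a hypothetical class, not marker's PageGroup, and `zlib` (lossless, standard library) is swapped in for JPEG encoding so the example runs without Pillow.

```python
import zlib
from typing import Optional


class PageImageSlot:
    """Hypothetical holder for one page image; not marker's PageGroup."""

    def __init__(self, raw: bytes):
        self._raw: Optional[bytes] = raw      # uncompressed pixel data
        self._packed: Optional[bytes] = None  # compressed copy, once created

    def compress(self) -> None:
        """Swap the raw buffer for a compressed copy and release the raw one."""
        if self._raw is not None:
            self._packed = zlib.compress(self._raw)  # stand-in for JPEG encoding
            self._raw = None

    def get_image(self) -> bytes:
        """Return pixel data; decompress on each call and never store the result."""
        if self._raw is not None:
            return self._raw
        assert self._packed is not None
        return zlib.decompress(self._packed)


slot = PageImageSlot(bytes(1_000_000))  # dummy all-zero "image"
slot.compress()
pixels = slot.get_image()  # fresh buffer; the slot still holds only packed bytes
```

Because `get_image()` never writes the decompressed data back onto the slot, the only long-lived copy is the compressed one; the cost is one decompression per access.
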
@github-actions
Contributor

github-actions Bot commented May 29, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request

@superniker
Author

I'm submitting a PR as well.

superniker force-pushed the feat/stream-pages branch from e46587c to 5493412 on May 29, 2026 14:04