feat: page_batch_size #1039
Open
superniker wants to merge 2 commits into
Problem: DocumentBuilder loads ALL page images (lowres + highres) into
memory simultaneously via provider.get_images(). A 97-page PDF at 192 DPI
consumes ~13 GB RSS, triggering OOM on machines with less RAM.
Root cause: providers/pdf.py:get_images() is a list comprehension that
renders every page before returning. Combined with JPEG→RGB decode
expansion (~50×), even modest PDFs exhaust memory.
Solution: three changes that together enable a 100× memory reduction:
1. PageGroup.compress_images()
After the builder stage, convert PIL Images to JPEG bytes (~100 KB/page
vs ~13 MB raw). Reduces 97 pages from 2.5 GB → ~20 MB.
2. PageGroup.get_image() auto-decompress
If image is stored as bytes, decompress on first access and cache the
result. Completely transparent to ~20 existing call sites in table,
equation, debug, and LLM processors.
3. DocumentBuilder.page_batch_size (default 0 = all-in-memory)
When >0, process N pages at a time through layout → line → OCR, then
compress their images before loading the next batch. Peak memory is
O(batch_size) instead of O(total_pages).
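The batching idea in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `iter_batches`, `build_in_batches`, `render`, and `process` are hypothetical names, raw page images are modelled as plain `bytes`, and `zlib` stands in for the JPEG encoding step so the sketch runs with only the standard library.

```python
import zlib
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def iter_batches(page_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield page ids in chunks; batch_size <= 0 yields everything at once."""
    if batch_size <= 0:
        yield list(page_ids)
        return
    for start in range(0, len(page_ids), batch_size):
        yield list(page_ids[start:start + batch_size])


def build_in_batches(
    page_ids: Sequence[int],
    batch_size: int,
    render: Callable[[int], bytes],         # hypothetical: renders one page to raw pixels
    process: Callable[[int, bytes], None],  # hypothetical: layout -> line -> OCR for one page
) -> Tuple[Dict[int, bytes], int]:
    """Process pages batch by batch, keeping only one batch of raw images alive."""
    compressed: Dict[int, bytes] = {}
    peak_raw_pages = 0
    for batch in iter_batches(page_ids, batch_size):
        # Only the current batch is rendered to raw pixels.
        raw = {pid: render(pid) for pid in batch}
        peak_raw_pages = max(peak_raw_pages, len(raw))
        for pid, pixels in raw.items():
            process(pid, pixels)
            # Stand-in for JPEG encoding: keep a compact copy, drop the raw one.
            compressed[pid] = zlib.compress(pixels)
        del raw  # release raw images before the next batch is rendered
    return compressed, peak_raw_pages
```

With 97 pages and a batch size of 10, the sketch holds at most 10 raw images at any time, while every page still ends up with a compressed copy for downstream use.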
Usage:
# CLI
marker_single large.pdf --page_batch_size 10
# Python API
converter = PdfConverter(config={'page_batch_size': 10}, ...)
Backward compatible: page_batch_size=0 preserves existing behaviour.
No changes to LayoutBuilder, LineBuilder, OcrBuilder, or any processor.
The previous version cached PIL Images back to page.lowres_image and page.highres_image after decompression, which caused all images to accumulate in memory again as downstream processors called get_image(). Now get_image() decompresses the bytes on the fly and returns a fresh PIL Image without caching it. Each returned image is used and then garbage-collected. This is slightly slower (~5 ms per decompress) but keeps memory at O(batch_size).
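The difference between caching and decoding on the fly can be illustrated with a small sketch. It is not the PageGroup implementation: `PageSketch`, its fields, and `compress_image` are hypothetical, a `bytearray` stands in for a PIL Image, and `zlib` (lossless) stands in for JPEG (lossy) so the example needs only the standard library.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSketch:
    raw: Optional[bytearray] = None      # stand-in for a decoded PIL Image
    compressed: Optional[bytes] = None   # stand-in for JPEG bytes

    def compress_image(self) -> None:
        """Replace the raw image with a compressed copy."""
        if self.raw is not None:
            self.compressed = zlib.compress(bytes(self.raw))
            self.raw = None

    def get_image(self) -> bytearray:
        """Return the image; decode compressed data without storing the result."""
        if self.raw is not None:
            return self.raw
        if self.compressed is None:
            raise ValueError("page has no image")
        # A fresh object on every call: nothing is written back to self.raw,
        # so the decoded copy can be freed once the caller drops it.
        return bytearray(zlib.decompress(self.compressed))
```

Because `get_image()` never assigns to `self.raw` after compression, repeated calls from downstream processors pay the decode cost each time but do not rebuild the full set of raw images in memory.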
Author
I'd like to submit a PR as well.
e46587c to 5493412
See superniker/marker for details