Describe the bug
page.to_image() rasterizes through pdfplumber.display.get_page_image, which opens a pypdfium2.PdfDocument and loads the page without calling PdfDocument.init_forms(). PDFium draws filled AcroForm field content in its form layer (FPDF_FFLDraw), and pypdfium2 only draws that layer when a form environment exists. init_forms() creates that environment; it must be called after the document is opened and before any page is loaded.
As a result, filled form field text can be missing from the PIL bitmap even though PDF viewers show it. The problem is in PDFium rendering, not in pdfminer text extraction (e.g. .chars).
Have you tried repairing the PDF?
Yes. pdfplumber.open(..., repair=True) does not fix this: the PDF is not treated as malformed for this path. The gap is that get_page_image never initializes the PDFium form environment before render().
Code to reproduce the problem
import pdfplumber

with pdfplumber.open("filled_form.pdf") as pdf:
    im = pdf.pages[0].to_image(resolution=150).original
    im.save("out.png")
(Also reproducible with repair=True on the same file.)
PDF file
filled_form.pdf
Expected behavior
out.png should include the visible filled field values, consistent with a typical PDF viewer, when PDFium supports the form.
Actual behavior
out.png omits the filled field text; the rest of the page rasterizes as usual.
Screenshots
If applicable, attach:
Issue:
Expected:
Environment
- pdfplumber version: 0.11.9
- OS: macOS
Additional context
Likely fix: In get_page_image, after successfully opening pypdfium2.PdfDocument, call pdfium_doc.init_forms() before pdfium_doc.get_page(page_ix), matching pypdfium2’s documented order.
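A minimal sketch of that call order, written as a standalone helper rather than as a patch to pdfplumber. The names render_with_forms and open_document are hypothetical and exist only for illustration; open_document is there so the order of calls can be checked without a real PDF. may_draw_forms=True is passed explicitly for clarity, although I believe it is already the default in pypdfium2's render().

```python
from typing import Any, Callable, Optional


def render_with_forms(
    path: Any,
    page_ix: int = 0,
    resolution: int = 150,
    open_document: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Render one page with the PDFium form environment initialized.

    Hypothetical helper, not part of pdfplumber. `open_document` is an
    illustrative hook; by default it is pypdfium2.PdfDocument.
    """
    if open_document is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        import pypdfium2

        open_document = pypdfium2.PdfDocument

    doc = open_document(path)
    # Must happen after opening the document and before loading any page.
    doc.init_forms()
    page = doc.get_page(page_ix)
    # PDF user space is 72 units per inch, so scale = resolution / 72.
    bitmap = page.render(scale=resolution / 72, may_draw_forms=True)
    return bitmap.to_pil()
```

Rendering filled_form.pdf with render_with_forms("filled_form.pdf").save("out_with_forms.png") and comparing the result against out.png should show whether the missing init_forms() call is the whole story for a given file.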
Related (not duplicate): #120 is about form values not appearing in CLI / extraction; the README documents AcroForm via pdfminer. This report is specifically about to_image() / get_page_image rasterization.