OCR System for Extracting Text from Scanned PDF Documents using PaddleOCR and Streamlit
Updated Jun 22, 2026 - Python
High-fidelity OCR and pre-RAG pipeline processor featuring: (1) Tesseract OCR; (2) built-in cross-line dehyphenation with real-word verification; (3) support for TIFF series and JPEG 2000 (JPX) for high-fidelity PDF sources, with significant size savings. Morphic assists in pre-RAG PDF preparation for analysis, large-scale ingestion, and agentic analysis.
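The description above mentions cross-line dehyphenation with real-word verification but does not show how it works. The following is only a minimal sketch of the general technique, not Morphic's actual implementation: the function name `dehyphenate`, the regular expressions, and the tiny `KNOWN_WORDS` vocabulary are all hypothetical stand-ins. A word split by a hyphen at a line end is rejoined only when the joined form (or the hyphenated form) appears in the vocabulary.

```python
import re

# Hypothetical stand-in vocabulary; a real pipeline would load a full word list.
KNOWN_WORDS = {"information", "document", "well-known"}

# Letters followed by a hyphen at the very end of a line.
_TRAILING = re.compile(r"([A-Za-z]+)-$")
# Letters at the very start of the following line.
_LEADING = re.compile(r"^([A-Za-z]+)")


def dehyphenate(lines, vocabulary=KNOWN_WORDS):
    """Rejoin words split across line ends, but only if the result is a known word."""
    result = []
    for raw in lines:
        line = raw.strip()
        merged_into_previous = False
        if result:
            tail = _TRAILING.search(result[-1])
            head = _LEADING.match(line)
            if tail and head:
                joined = tail.group(1) + head.group(1)
                hyphenated = tail.group(1) + "-" + head.group(1)
                if joined.lower() in vocabulary:
                    merged = joined        # hyphen was a line-break artifact
                elif hyphenated.lower() in vocabulary:
                    merged = hyphenated    # hyphen belongs to the word
                else:
                    merged = None          # unverified: leave the text untouched
                if merged is not None:
                    # Move the completed word onto the previous line.
                    result[-1] = result[-1][:tail.start()] + merged
                    line = line[head.end():].lstrip()
                    merged_into_previous = True
        # Drop a line only if the merge consumed all of it.
        if line or not merged_into_previous:
            result.append(line)
    return result


if __name__ == "__main__":
    sample = ["The infor-", "mation is in a well-", "known docu-", "ment about pre-", "xyz text"]
    for fixed in dehyphenate(sample):
        print(fixed)
```

Leaving unverified joins untouched is a deliberately conservative choice: it avoids inventing non-words from OCR noise, at the cost of missing splits whose joined form is absent from the vocabulary.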