This project automates the process of downloading PDF contracts from a public source, converting them to images, extracting text using OCR (Optical Character Recognition), and transforming the extracted text into a structured format for further analysis. It is designed for large-scale document processing and text mining tasks.
- Automated PDF Download: Fetches contract PDFs from a list of URLs.
- PDF to Image Conversion: Converts each page of the PDF to an image using Poppler.
- OCR Text Extraction: Uses Tesseract OCR to extract text from images.
- Text Structuring: Processes and structures the extracted text for downstream analysis.
- Batch Processing: Handles large numbers of files efficiently.
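The Text Structuring feature is not detailed here. The repository name hints at a bag-of-words representation, so the following is only a minimal sketch under that assumption. The tokenizer pattern, the function names `bag_of_words` and `to_rows`, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of two or more letters (accented letters
# included), which drops digits, punctuation and single stray characters.
TOKEN_RE = re.compile(r"[^\W\d_]{2,}")


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens in a piece of OCR output."""
    return Counter(TOKEN_RE.findall(text.lower()))


def to_rows(doc_id: str, counts: Counter) -> list[tuple[str, str, int]]:
    """Flatten counts into (document, token, count) rows, most frequent first."""
    return [(doc_id, token, n) for token, n in counts.most_common()]


# Made-up sample text standing in for one OCR'd contract.
sample = "Contrato de obra. El contrato rige desde 2021."
counts = bag_of_words(sample)
print(to_rows("c1", counts)[0])
```

Rows in this long format are easy to write to a CSV file or load into a data frame for later analysis.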
- src/ – Main Python scripts.
- data/ – Input data files (e.g., Excel, CSV, TXT).
- pdf/ – Downloaded PDF files.
- text/ – Output text files from OCR.
- temp/ – Temporary image files used during processing.
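If the working folders do not exist yet, they could be created up front. This is a small sketch using only the standard library; the helper name `ensure_layout` is hypothetical, and the folder names are taken from the layout above.

```python
from pathlib import Path

# Working folders from the layout above; src/ ships with the repository.
WORK_DIRS = ("data", "pdf", "text", "temp")


def ensure_layout(root: Path) -> list[Path]:
    """Create any missing working directories under root and return them."""
    paths = [root / name for name in WORK_DIRS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_layout(Path("."))` from the repository root leaves existing folders untouched.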
- Clone the repository:
  git clone https://github.com/Tooruogata/conosce-pdf-ocr-bow.git
  cd conosce-pdf-ocr-bow
- Build the Docker image and start a container, with $repopath set to the absolute path of the cloned repository:
  docker build -t contract-ocr:latest -f .devcontainer/Dockerfile .
  docker run -dit --name contract-ocr-dev -v "$repopath:/workspace" -w /workspace contract-ocr:latest
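- Place your input data files (e.g., Excel with contract URLs) in the data/ directory.
- Run the main notebook, src/notebook.ipynb. It is a Jupyter notebook, so the python command cannot run it directly; open it in Jupyter or, assuming Jupyter is installed in the container, execute it from the command line:
  jupyter nbconvert --to notebook --execute src/notebook.ipynb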
- The script will:
  - Download missing PDFs to pdf/
  - Convert PDFs to images in temp/
  - Extract text to text/
  - Output structured data in data/
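The first three steps above can be sketched roughly as follows. This is not the notebook's actual code. It assumes the `pdf2image` and `pytesseract` packages as Python wrappers around Poppler and Tesseract, a 300 dpi render, and a hypothetical file naming scheme; every function name is illustrative.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

PDF_DIR, TEMP_DIR, TEXT_DIR = Path("pdf"), Path("temp"), Path("text")


def pdf_name(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    return Path(urlparse(url).path).name


def missing_urls(urls: list[str], pdf_dir: Path) -> list[str]:
    """Keep only the URLs whose PDF is not already present in pdf_dir."""
    return [u for u in urls if not (pdf_dir / pdf_name(u)).exists()]


def download_missing(urls: list[str], pdf_dir: Path = PDF_DIR) -> None:
    """Download each PDF that is not on disk yet."""
    pdf_dir.mkdir(exist_ok=True)
    for url in missing_urls(urls, pdf_dir):
        urlretrieve(url, str(pdf_dir / pdf_name(url)))


def ocr_pdf(pdf_path: Path, temp_dir: Path = TEMP_DIR,
            text_dir: Path = TEXT_DIR) -> Path:
    """Render each page to a PNG in temp_dir, OCR it, write one text file."""
    # Imported here so the download helpers work without these packages.
    from pdf2image import convert_from_path  # assumed wrapper; needs Poppler
    import pytesseract                       # assumed wrapper; needs Tesseract

    temp_dir.mkdir(exist_ok=True)
    text_dir.mkdir(exist_ok=True)
    pages = []
    for i, image in enumerate(convert_from_path(str(pdf_path), dpi=300), start=1):
        # Hypothetical naming scheme: <pdf stem>_p001.png, _p002.png, ...
        image.save(temp_dir / f"{pdf_path.stem}_p{i:03d}.png")
        pages.append(pytesseract.image_to_string(image))
    out_path = text_dir / f"{pdf_path.stem}.txt"
    out_path.write_text("\n".join(pages), encoding="utf-8")
    return out_path
```

Checking for an existing file before each download is what makes reruns cheap: only PDFs absent from pdf/ are fetched again. Note that `pytesseract.image_to_string` uses English by default, so contracts in another language would need its `lang` argument and the matching Tesseract language data.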
- Ensure the paths in the script match your folder structure.
- For large-scale processing, monitor disk space in temp/ and text/.
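One way to keep an eye on disk usage during a long run is a pair of small standard-library helpers like the ones below. The function names and the idea of a minimum free-space threshold are assumptions, not part of the project.

```python
import shutil
from pathlib import Path


def dir_size_bytes(directory: Path) -> int:
    """Total size of all regular files under directory (0 if it is missing)."""
    if not directory.is_dir():
        return 0
    return sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())


def has_free_space(path: Path, min_free_bytes: int) -> bool:
    """True when the filesystem holding path has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

For example, a batch loop could call `has_free_space(Path("temp"), 2 * 1024**3)` before converting the next PDF and pause when less than about 2 GiB remains; the threshold is arbitrary.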