seace-contract-ocr-bow

Overview

This project automates the process of downloading PDF contracts from a public source, converting them to images, extracting text using OCR (Optical Character Recognition), and transforming the extracted text into a structured format for further analysis. It is designed for large-scale document processing and text mining tasks.

Features

- Automated PDF Download: Fetches contract PDFs from a list of URLs.
- PDF to Image Conversion: Converts each page of the PDF to an image using Poppler.
- OCR Text Extraction: Uses Tesseract OCR to extract text from images.
- Text Structuring: Processes and structures the extracted text for downstream analysis.
- Batch Processing: Handles large numbers of files efficiently.
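
The "bow" in the project name suggests that the structured output is a bag-of-words representation. The README does not confirm this, so the following is only a minimal sketch under that assumption. The function name `bag_of_words` and the `min_length` parameter are hypothetical and are not taken from the project's code.

```python
import re
from collections import Counter


def bag_of_words(text: str, min_length: int = 3) -> Counter:
    """Count lowercase word tokens of at least `min_length` letters."""
    # [^\W\d_]+ matches runs of letters only (accented letters included),
    # so digits, underscores and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length)


sample = "CONTRATO de servicio. El contrato N° 12 rige el servicio."
counts = bag_of_words(sample)
print(counts.most_common(3))
```

Short tokens are filtered because OCR output tends to contain many one- or two-character fragments; the threshold is a guess and should be tuned on real output.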

Folder Structure

- src/ – Main Python scripts.
- data/ – Input data files (e.g., Excel, CSV, TXT).
- pdf/ – Downloaded PDF files.
- text/ – Output text files from OCR.
- temp/ – Temporary image files used during processing.
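
If any of these folders is missing from a fresh checkout (empty directories are often not tracked by Git), a small helper along the following lines could create them. This is an illustrative sketch; `ensure_layout` is a hypothetical name and is not part of the project.

```python
from pathlib import Path

# Folder names as listed in the README.
FOLDERS = ("src", "data", "pdf", "text", "temp")


def ensure_layout(root: Path) -> list[Path]:
    """Create any missing project folders under `root` and return all of them."""
    folders = []
    for name in FOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders
```

Calling `ensure_layout(Path.cwd())` from the repository root would be the typical use.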

Installation

Clone the repository:

git clone https://github.com/Tooruogata/conosce-pdf-ocr-bow.git
cd conosce-pdf-ocr-bow

Build the Docker image and start a container ($repopath must hold the absolute path of the cloned repository):

docker build -t contract-ocr:latest -f .devcontainer/Dockerfile .
docker run -dit --name contract-ocr-dev -v "$repopath:/workspace" -w /workspace contract-ocr:latest

Usage

Place your input data files (e.g., Excel with contract URLs) in the data/ directory.
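Run the main notebook. A .ipynb file cannot be executed with the python command directly; one option, assuming Jupyter is installed in the environment, is:
```
jupyter nbconvert --to notebook --execute src/notebook.ipynb
```
The notebook will: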
- Download missing PDFs to pdf/
- Convert PDFs to images in temp/
- Extract text to text/
- Output structured data in data/
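
The first three steps above can be sketched roughly as follows. This is not the notebook's actual code: it assumes the `pdf2image` and `pytesseract` packages (wrappers around Poppler and Tesseract), assumes the contracts are in Spanish (`lang="spa"`), and all function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def pdf_name(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = Path(urlparse(url).path).name
    return name if name.lower().endswith(".pdf") else f"{name}.pdf"


def missing_pdfs(urls, pdf_dir: Path):
    """Return (url, target) pairs whose target file is not in pdf_dir yet."""
    pairs = [(url, pdf_dir / pdf_name(url)) for url in urls]
    return [(url, target) for url, target in pairs if not target.exists()]


def ocr_pdf(pdf_path: Path, temp_dir: Path, text_dir: Path, lang: str = "spa") -> Path:
    """Render each page to temp_dir, OCR it, and write one text file per PDF."""
    # Third-party imports are local so the helpers above work without them.
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the Tesseract binary

    page_texts = []
    for number, page in enumerate(convert_from_path(str(pdf_path), dpi=200), start=1):
        page.save(temp_dir / f"{pdf_path.stem}_{number:03d}.png")
        page_texts.append(pytesseract.image_to_string(page, lang=lang))
    out_path = text_dir / f"{pdf_path.stem}.txt"
    out_path.write_text("\n".join(page_texts), encoding="utf-8")
    return out_path


def run(urls, root: Path = Path(".")) -> None:
    """Download PDFs that are not present yet, then OCR every PDF in pdf/."""
    pdf_dir, temp_dir, text_dir = root / "pdf", root / "temp", root / "text"
    for url, target in missing_pdfs(urls, pdf_dir):
        urlretrieve(url, str(target))
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        ocr_pdf(pdf_path, temp_dir, text_dir)
```

Checking for existing files before downloading is what makes a rerun cheap: only URLs without a local copy are fetched again.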

Notes

Ensure the paths in the script match your folder structure.
For large-scale processing, monitor disk space in temp/ and text/.
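
One way to keep an eye on disk usage is a few standard-library helpers like the ones below. This is a sketch only; the function names are hypothetical, and clearing temp/ assumes the page images are no longer needed once the text has been extracted.

```python
import shutil
from pathlib import Path


def free_gib(path: Path) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def folder_size_bytes(folder: Path) -> int:
    """Total size of all regular files below `folder`."""
    return sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())


def clear_temp(temp_dir: Path) -> int:
    """Delete the files directly inside `temp_dir`; return how many were removed."""
    removed = 0
    for f in temp_dir.iterdir():
        if f.is_file():
            f.unlink()
            removed += 1
    return removed
```

For example, a batch loop could call `free_gib(Path("temp"))` before each PDF and run `clear_temp(Path("temp"))` when the value falls below a chosen threshold.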
