GitHub - thokchomthoithoibasingh-create/OCR_Project: OCR System for Extracting Text from Scanned PDF Documents using PaddleOCR and Streamlit

OCR_Project

OCR_Project is a lightweight OCR processing toolkit focused on extracting text and simple table structures from image and PDF inputs. It provides preprocessing, PDF handling, table detection, and flexible output writing to text files.

Key features

Preprocessing pipeline for image cleanup and enhancement
PDF handling and image extraction
OCR text extraction and simple table detection
Batch CLI and a Streamlit demo app for interactive use
Simple input/output folder conventions for batch processing

Requirements

Python 3.8+
See requirements.txt for full dependency list

Installation

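Create and activate a virtual environment (recommended). The activation command below is for macOS and Linux; on Windows, run .venv\Scripts\activate instead: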

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Running the project

There are a few entry points depending on the workflow you want:

Batch CLI (basic):

python main.py

Streamlit demo (interactive):

streamlit run streamlit_app.py

Generate sample test assets (helper script):

python generate_test_assets.py

Project layout

input/ — put your input images or PDFs here.
output/ — generated outputs and text files are written here.
ocr_engine.py — main OCR and text extraction logic.
preprocess.py — image preprocessing utilities.
pdf_handler.py — PDF-to-image conversion and handling.
table_detector.py — simple heuristics for detecting tables.
output_writer.py — formatting and writing extracted text to files.
streamlit_app.py — interactive demo UI.
generate_test_assets.py — creates sample inputs for testing.
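
The layout above describes table_detector.py only as "simple heuristics". As an illustration of what such a heuristic can look like, here is a minimal sketch that groups recognised words into rows by vertical position and guesses "table" when several rows have the same number of cells. The `Word` tuple shape, the function names, and the thresholds are all assumptions made for this example; they are not taken from the repository's code.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical input shape: one (x, y, text) tuple per recognised word,
# where x and y are the top-left pixel coordinates of its bounding box.
Word = Tuple[float, float, str]


def group_into_rows(words: List[Word], y_tolerance: float = 10.0) -> List[List[Word]]:
    """Cluster words whose y lies within y_tolerance of the row's first word."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):
        if rows and abs(word[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right.
    return [sorted(row, key=lambda w: w[0]) for row in rows]


def looks_like_table(words: List[Word], min_rows: int = 3, min_cols: int = 2) -> bool:
    """Guess 'table' when at least min_rows rows share one cell count >= min_cols."""
    rows = group_into_rows(words)
    counts = Counter(len(row) for row in rows if len(row) >= min_cols)
    return bool(counts) and max(counts.values()) >= min_rows
```

A grid of three rows with three words each would be flagged as a table, while a column of single words would not. Real scans usually need a tolerance tuned to the image resolution.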

Usage notes

Place input files in input/ and run python main.py to process them as a batch. Results are written under output/text_outputs/.
The Streamlit app exposes the same processing pipeline for trying different preprocessing options and inspecting results.
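
The folder convention above can be sketched as a mapping from each input file to a text file under output/text_outputs/. This is only an illustration: the `plan_batch` helper, the set of supported extensions, and the one-`.txt`-per-input naming scheme are assumptions, and main.py may name or group its outputs differently.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Assumed set of supported extensions; the real list may differ.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def plan_batch(input_dir: Path, output_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, target) pairs: <input_dir>/<name>.<ext> -> <output_dir>/<name>.txt."""
    for source in sorted(input_dir.iterdir()):
        if source.is_file() and source.suffix.lower() in SUPPORTED:
            yield source, output_dir / f"{source.stem}.txt"


if __name__ == "__main__":
    # Print the planned mapping without running any OCR.
    for source, target in plan_batch(Path("input"), Path("output/text_outputs")):
        print(f"{source} -> {target}")
```

Listing the planned pairs before running the pipeline is a cheap way to confirm which files in input/ would be picked up.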

Testing

There is no automated test suite included; to manually test, run the generator and then process the generated assets:

python generate_test_assets.py
python main.py
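
After the two commands above, a short script can stand in for the missing automated check by confirming that every input produced a non-empty text file. This is a sketch under assumptions: the `missing_outputs` helper is hypothetical, and it assumes each input maps to output/text_outputs/<name>.txt, which may not match the names the pipeline actually writes.

```python
from pathlib import Path
from typing import List


def missing_outputs(input_dir: Path, text_dir: Path) -> List[str]:
    """Return names of input files with no non-empty <stem>.txt in text_dir."""
    missing = []
    for source in sorted(input_dir.iterdir()):
        # Skip sub-directories and hidden files such as .gitkeep.
        if not source.is_file() or source.name.startswith("."):
            continue
        target = text_dir / f"{source.stem}.txt"  # assumed naming scheme
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(source.name)
    return missing


if __name__ == "__main__":
    problems = missing_outputs(Path("input"), Path("output/text_outputs"))
    print("all inputs have output" if not problems else f"missing or empty: {problems}")
```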

Windows

On Windows you can use the provided run.bat to run the main pipeline.

Contributing

Contributions are welcome. Open an issue or a PR describing the change you propose. Keep changes focused and include small, testable commits.

License

This repository does not include a license file. If you plan to publish or share this project, add a LICENSE file to clarify terms.

Contact

If you have questions about the code, open an issue in this repository.
