OCR_Project
OCR_Project is a lightweight OCR processing toolkit focused on extracting text and simple table structures from image and PDF inputs. It provides preprocessing, PDF handling, table detection, and flexible output writing to text files.
Key features
- Preprocessing pipeline for image cleanup and enhancement
- PDF handling and image extraction
- OCR text extraction and simple table detection
- CLI and Streamlit demo app for interactive use
- Simple input/output folder conventions for batch processing
Requirements
- Python 3.8+
- See requirements.txt for the full dependency list
Installation
- Create and activate a virtual environment (recommended):
    python -m venv .venv
    source .venv/bin/activate
  On Windows, activate with .venv\Scripts\activate instead.
- Install dependencies:
    pip install -r requirements.txt
Usage
There are a few entry points depending on the workflow you want:
- Batch CLI (basic):
    python main.py
- Streamlit demo (interactive):
    streamlit run streamlit_app.py
- Generate sample test assets (helper script):
    python generate_test_assets.py
Project layout
- input/: put your input images or PDFs here.
- output/: generated outputs and text files are written here.
- ocr_engine.py: main OCR and text extraction logic.
- preprocess.py: image preprocessing utilities.
- pdf_handler.py: PDF-to-image conversion and handling.
- table_detector.py: simple heuristics for detecting tables.
- output_writer.py: formatting and writing extracted text to files.
- streamlit_app.py: interactive demo UI.
- generate_test_assets.py: creates sample inputs for testing.
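To give a feel for what a "simple heuristic" for table detection can look like, here is a minimal sketch. It is not taken from table_detector.py; the function names and the rule itself (consecutive OCR text lines that split into the same number of columns on runs of two or more spaces) are assumptions made for illustration only.

```python
import re

# Hypothetical helper, not the project's implementation.
# Assumption: columns in OCR text are separated by two or more whitespace characters.
_COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one OCR text line into cells on runs of 2+ whitespace characters."""
    return [cell for cell in _COLUMN_GAP.split(line.strip()) if cell]


def find_table_blocks(lines, min_rows=2, min_cols=2):
    """Return (start, end) index pairs (end exclusive) for runs of consecutive
    lines that all split into the same number of columns."""
    blocks = []
    start = None
    width = 0
    for i, line in enumerate(lines):
        cols = len(split_columns(line))
        if cols >= min_cols and cols == width:
            continue  # still inside the current candidate block
        # This line breaks the run: keep the open block if it is long enough.
        if start is not None and i - start >= min_rows:
            blocks.append((start, i))
        # Open a new candidate block if this line looks like a table row.
        if cols >= min_cols:
            start, width = i, cols
        else:
            start, width = None, 0
    # Close a block that runs to the end of the input.
    if start is not None and len(lines) - start >= min_rows:
        blocks.append((start, len(lines)))
    return blocks
```

A rule like this only works when the OCR output preserves horizontal spacing; image-based approaches (for example, detecting ruling lines) are a different technique altogether.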
Notes
- Place input files in input/ and run python main.py to process them in batch; results are written to output/text_outputs/.
- The Streamlit app exposes the same processing pipeline, so you can try different preprocessing options and inspect the results.
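The folder convention above can be sketched as a small planning step that pairs each input file with a text file under output/text_outputs/. This is an illustration, not the logic in main.py: the function name, the list of accepted extensions, and the "same stem, .txt suffix" naming are all assumptions.

```python
from pathlib import Path

# Assumed set of accepted extensions; the real list lives in the project code.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def plan_batch(input_dir="input", output_dir="output/text_outputs"):
    """Pair each supported file in input_dir with a .txt path in output_dir.

    Hypothetical helper: output naming (input stem + ".txt") is an assumption.
    """
    in_root = Path(input_dir)
    out_root = Path(output_dir)
    if not in_root.is_dir():
        return []  # nothing to process if the input folder is missing
    pairs = []
    for src in sorted(in_root.iterdir()):
        if src.is_file() and src.suffix.lower() in SUPPORTED_SUFFIXES:
            pairs.append((src, out_root / f"{src.stem}.txt"))
    return pairs
```

Calling plan_batch() from the repository root and printing the pairs is a cheap way to see which files a batch run would pick up before any OCR work starts.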
Testing
There is no automated test suite included. To test manually, run the generator and then process the generated assets:
    python generate_test_assets.py
    python main.py
On Windows, you can use the provided run.bat to run the main pipeline.
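Since there is no automated suite, a quick check after the manual run can be scripted. The sketch below lists the text files under output/text_outputs/ and flags empty ones. The function names are hypothetical, and it assumes the outputs are UTF-8 .txt files in that folder.

```python
from pathlib import Path


def summarize_outputs(output_dir="output/text_outputs"):
    """Return (file name, character count) for every .txt file in output_dir.

    Hypothetical helper: assumes outputs are UTF-8 text files.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []  # the pipeline has not produced any output folder yet
    return [
        (path.name, len(path.read_text(encoding="utf-8", errors="replace")))
        for path in sorted(root.glob("*.txt"))
    ]


def empty_outputs(summary):
    """Names of output files that contain no text at all."""
    return [name for name, chars in summary if chars == 0]
```

An empty result from summarize_outputs(), or any names returned by empty_outputs(), is a hint that the run failed or that OCR found no text in a given input.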
Contributions are welcome. Open an issue or a PR describing the change you propose. Keep changes focused and include small, testable commits.
This repository does not include a license file. If you plan to publish or share this project, add a LICENSE file to clarify terms.
If you have questions about the code, open an issue in this repository.