feat: implement OCR regression harness with golden dataset and CI integration #521

Open
bytebinders wants to merge 3 commits into Pulsefy:main from bytebinders:main
@bytebinders
Contributor

✅ PR Description:

What was done:

  • Modular Harness: Developed a full-featured OCR regression harness in app/ai-service/regression_harness/.
  • Golden Dataset: Created a structured repository for "golden" documents and ground truth values in regression_harness/dataset/.
  • Automated Evaluation: Implemented evaluator.py to compare actual OCR output against expected fields, supporting text normalization, error classification, and confidence tracking.
  • CLI & Reporting: Added cli.py to run suites locally, providing human-readable console summaries and machine-readable JSON reports for CI artifacts.
  • CI/CD Integration: Integrated a new GitHub Actions workflow .github/workflows/ocr-regression.yml to trigger automatically on OCR-related changes.
  • Documentation: Provided a comprehensive README.md for tool usage, adding new samples, and maintenance.
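
For readers without the diff open, the comparison step described in the "Automated Evaluation" bullet might look roughly like the sketch below. This is an illustration only, not the code in evaluator.py: the function names `normalize_text`, `classify_field`, and `evaluate_sample`, the error labels, and the normalization rules are all assumptions.

```python
import re
import unicodedata
from typing import Optional


def normalize_text(value: str) -> str:
    """Hypothetical normalization: Unicode NFKC, collapsed whitespace, casefold."""
    value = unicodedata.normalize("NFKC", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value.casefold()


def classify_field(expected: str, actual: Optional[str]) -> str:
    """Return an assumed error label for a single field comparison."""
    if actual is None or not actual.strip():
        return "missing"
    if actual == expected:
        return "exact_match"
    if normalize_text(actual) == normalize_text(expected):
        return "normalized_match"
    return "mismatch"


def evaluate_sample(expected: dict, actual: dict) -> dict:
    """Compare every expected field against the OCR output for one sample."""
    return {
        name: classify_field(value, actual.get(name))
        for name, value in expected.items()
    }
```

Keeping exact and normalized matches as separate labels would let a report show when a model change only altered casing or spacing rather than content.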

Why it was done:
To establish a reliable, low-maintenance testing infrastructure that ensures OCR extraction accuracy is preserved as the AI models, prompts, or preprocessing steps evolve.

How it was verified:

  • Verified the integrity of the data models and evaluation logic.
  • Validated the CLI's reporting capabilities (JSON and Console).
  • Verified the GitHub Actions configuration for environment dependencies (Tesseract-OCR).
  • Performed a local sanity check of the directory structure and file system operations.
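
As a rough picture of the two report formats mentioned above (JSON and console), the sketch below aggregates per-field outcomes into one dictionary that can be both serialized and summarized. The names `build_report` and `console_summary`, the report keys, and the outcome labels (`exact_match`, `normalized_match`, `mismatch`) are assumptions for illustration; the actual cli.py output may be shaped differently.

```python
import json
from collections import Counter


def build_report(results: dict) -> dict:
    """Aggregate outcomes (sample id -> field name -> label) into a report dict."""
    counts = Counter(
        label for fields in results.values() for label in fields.values()
    )
    total = sum(counts.values())
    # Assumed convention: exact and normalized matches both count as passes.
    passed = counts["exact_match"] + counts["normalized_match"]
    return {
        "total_fields": total,
        "passed_fields": passed,
        "accuracy": passed / total if total else 0.0,
        "by_label": dict(counts),
        "samples": results,
    }


def console_summary(report: dict) -> str:
    """One-line human-readable summary of the report."""
    return (
        f"{report['passed_fields']}/{report['total_fields']} fields passed "
        f"({report['accuracy']:.1%})"
    )


if __name__ == "__main__":
    demo = {"sample_a": {"total": "exact_match", "date": "mismatch"}}
    report = build_report(demo)
    print(console_summary(report))
    print(json.dumps(report, indent=2))  # machine-readable form for CI artifacts
```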

Summary of Work:

  1. Models: models.py defines the schema for samples and results.
  2. Evaluator: evaluator.py implements the logic for field comparison and IoU calculation.
  3. CLI: cli.py provides the interface to run tests and export results.
  4. Dataset: Established ground_truth.json with sample documents.
  5. CI/CD: Added ocr-regression.yml for automated regression testing.
  6. Documentation: Created README.md.
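
The description does not say what the IoU in item 2 is computed over. Assuming it means intersection over union of axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)`, a minimal sketch would be the following; the `Box` alias and the `iou` function name are hypothetical and not taken from evaluator.py.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes, in [0.0, 1.0]."""
    # Corners of the overlapping rectangle, if one exists.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate (zero-area) boxes yield 0.0 rather than a division error.
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, giving an IoU of 1/7.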

Required line:
Closes #464

@drips-wave

drips-wave Bot commented May 29, 2026

@bytebinders Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@bytebinders
Contributor Author

@Cedarich, could you drop a few more GitHub issues for me to work on? I’m ready for more 🔧🙂

@Cedarich
Contributor

Please fix the workflow.

@bytebinders
Contributor Author

@Cedarich the issue was a missing file, sampl_001.png, but I have fixed it.

@Cedarich
Contributor

Fix workflow

Development

Successfully merging this pull request may close these issues.

OCR Accuracy Regression Harness (Golden Inputs)

2 participants