Add OCR text verification to prevent false positive completions #46
Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
@copilot please add tests for this new functionality and also make sure that the web UI reflects a failing OCR status (the OCR status gets updated properly in the DB AND the UI)
…ssing tests Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/a2c6fc77-605f-48ae-b6b2-579df2e70e6c
Done in commit Tests (
DB: Added an
UI: The dashboard card now shows a
Currently, the OCR service only trusts the exit code from OCRmyPDF to determine if OCR processing was successful. This can lead to false positives where OCR appears to complete successfully (exit code 0) but no actual text was extracted from the document.
Problem
OCRmyPDF can return exit code 0 in cases where:
In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text extraction occurred.
Solution
This PR adds a verification step after successful OCR completion and persists the result to the database and web UI:
Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.
Improved status logic: OCRStatus.COMPLETED, OCRStatus.FAILED, OCRStatus.OUTPUT_ERROR
OCR status persisted to database: A new ocr_status column is added to the scanneddata table via migration, and the OCR service writes the status name (e.g. COMPLETED, FAILED, DPI_ERROR) after each job.
Web UI reflects OCR failures: Dashboard cards display a ⚠️ warning indicator with a human-readable OCR error message (e.g. "OCR: No text found", "OCR: Image DPI too low") when OCR encountered a problem. This is visible on initial page load and updates live via Server-Sent Events.
Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
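The verification step described above can be sketched roughly as follows. This is a minimal illustration, not the code from ocr_service/main.py: the OCRStatus enum here is a hypothetical stand-in for the project's own, the text extractor is passed in as a parameter rather than imported, and the mapping of "no text" to FAILED and "unreadable output" to OUTPUT_ERROR is an assumption, since the PR text does not spell out which condition yields which status.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger("ocr_verification_sketch")


class OCRStatus(Enum):
    # Hypothetical stand-in for the project's status enum.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def verify_ocr_output(
    exit_code: int,
    output_path: str,
    extract_text: Callable[[str], str],
) -> OCRStatus:
    """Decide the final status from the exit code plus a text check."""
    if exit_code != 0:
        # A non-zero exit code is never treated as success.
        return OCRStatus.FAILED
    try:
        text = extract_text(output_path)
    except Exception:
        # Assumption: an unreadable output file maps to OUTPUT_ERROR.
        logger.exception("Could not read OCR output %s", output_path)
        return OCRStatus.OUTPUT_ERROR
    char_count = len(text.strip()) if text else 0
    logger.info("OCR verification extracted %d characters", char_count)
    # Assumption: zero extractable characters counts as a failed run.
    return OCRStatus.COMPLETED if char_count > 0 else OCRStatus.FAILED
```

Passing the extractor as an argument keeps the sketch easy to test with stubs; the real service presumably calls its extract_text() helper directly.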
Changes
ocr_service/main.py: Added import for extract_text, implemented verification logic, added if __name__ == "__main__": guard for testability, and persists ocr_status to the database
scansynclib/scansynclib/sqlite_wrapper.py: Added migration to add ocr_status column to scanneddata table
web_service/src/main.py: Added ocr_status to SSE payload
web_service/src/static/js/dashboard.js: Added getOcrStatusText() helper and OCR status warning display in document cards
tests/test_ocr_verification.py: Added unit tests covering text extraction scenarios and 9 comprehensive tests for start_processing covering all OCR outcome scenarios
Example Impact
Before this change, a blank PDF page would result in:
After this change:
The dashboard card also shows: ⚠️ OCR: No text found
This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred, and the failure reason is visible in the database and web UI.
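For readers who want a feel for the database side, here is a small sketch of an idempotent migration plus a status write, using Python's built-in sqlite3 module. The table name scanneddata and column name ocr_status come from this PR; the function names, the id and file_name columns, and the TEXT column type are assumptions made for illustration and may not match sqlite_wrapper.py.

```python
import sqlite3


def ensure_ocr_status_column(conn: sqlite3.Connection) -> None:
    """Add the ocr_status column only if it is not already present."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scanneddata)")}
    if "ocr_status" not in columns:
        # Assumption: the status name is stored as plain text.
        conn.execute("ALTER TABLE scanneddata ADD COLUMN ocr_status TEXT")
        conn.commit()


def save_ocr_status(conn: sqlite3.Connection, doc_id: int, status_name: str) -> None:
    """Write the status name (e.g. COMPLETED, FAILED, DPI_ERROR) for one document."""
    conn.execute(
        "UPDATE scanneddata SET ocr_status = ? WHERE id = ?",
        (status_name, doc_id),
    )
    conn.commit()


def demo() -> str:
    """Run the migration twice on an in-memory database and store a status."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema; the real table has more columns.
    conn.execute("CREATE TABLE scanneddata (id INTEGER PRIMARY KEY, file_name TEXT)")
    conn.execute("INSERT INTO scanneddata (id, file_name) VALUES (1, 'blank.pdf')")
    ensure_ocr_status_column(conn)
    ensure_ocr_status_column(conn)  # second call is a no-op
    save_ocr_status(conn, 1, "FAILED")
    row = conn.execute("SELECT ocr_status FROM scanneddata WHERE id = 1").fetchone()
    conn.close()
    return row[0]
```

Checking the existing columns before altering the table lets the migration run safely on every startup, which is one common way to handle schema changes in SQLite.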