Add OCR text verification to prevent false positive completions by Copilot · Pull Request #46 · maxi07/ScanSync

Copilot · 2025-08-28T14:18:17Z

Currently, the OCR service trusts only the exit code from OCRmyPDF to decide whether OCR processing succeeded. This can produce false positives: OCR appears to complete successfully (exit code 0), but no text was actually extracted from the document.

Problem

OCRmyPDF can return exit code 0 in cases where:

The input document contains only blank pages
The image quality is too poor for text recognition
The document contains only images with no readable text

In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text extraction occurred.

Solution

This PR adds a verification step after successful OCR completion, persists the result to the database, and surfaces it in the web UI:

Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.
Improved status logic:
- If text is found → OCRStatus.COMPLETED
- If no text or only whitespace is found → OCRStatus.FAILED
- If the OCR output file is missing → OCRStatus.OUTPUT_ERROR
OCR status persisted to database: A new ocr_status column is added to the scanneddata table via migration, and the OCR service writes the status name (e.g. COMPLETED, FAILED, DPI_ERROR) after each job.
Web UI reflects OCR failures: Dashboard cards display a ⚠️ warning indicator with a human-readable OCR error message (e.g. "OCR: No text found", "OCR: Image DPI too low") when OCR encountered a problem. This is visible on initial page load and updates live via Server-Sent Events.
Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
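
The status logic above can be sketched roughly as follows. This is a minimal illustration, not the code from ocr_service/main.py: the function name verify_ocr_output, the logger name, and the way extract_text is passed in as a callable are assumptions made so the example is self-contained. Only the three status names and the "no text found" log message come from this PR.

```python
import logging
from enum import Enum, auto
from pathlib import Path
from typing import Callable

logger = logging.getLogger("ocr_service")  # logger name is an assumption


class OCRStatus(Enum):
    # Only the members relevant to the verification step are shown here.
    COMPLETED = auto()
    FAILED = auto()
    OUTPUT_ERROR = auto()


def verify_ocr_output(output_path: Path,
                      extract_text: Callable[[str], str]) -> OCRStatus:
    """Decide the final status after OCR exited with code 0.

    `extract_text` stands in for the project's existing helper; its real
    signature may differ (hypothetical here).
    """
    if not output_path.is_file():
        logger.error("OCR output file is missing: %s", output_path)
        return OCRStatus.OUTPUT_ERROR

    text = extract_text(str(output_path)) or ""
    stripped = text.strip()
    if not stripped:
        # Empty or whitespace-only output counts as a failed OCR run.
        logger.warning("OCR verification failed: no text found in OCR output file")
        return OCRStatus.FAILED

    logger.info("OCR verification passed: %d characters extracted", len(stripped))
    return OCRStatus.COMPLETED
```

Passing the extractor in as a parameter is a choice made for this sketch; it keeps the check easy to exercise without a real PDF.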

Changes

ocr_service/main.py: Added import for extract_text, implemented verification logic, added an if __name__ == "__main__": guard for testability, and persisted ocr_status to the database
scansynclib/scansynclib/sqlite_wrapper.py: Added migration to add ocr_status column to scanneddata table
web_service/src/main.py: Added ocr_status to SSE payload
web_service/src/static/js/dashboard.js: Added getOcrStatusText() helper and OCR status warning display in document cards
tests/test_ocr_verification.py: Added unit tests covering text extraction scenarios and 9 comprehensive tests for start_processing covering all OCR outcome scenarios
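
For readers unfamiliar with this kind of migration, the ocr_status column change might look something like the sketch below. It is an assumption-laden illustration rather than the code in sqlite_wrapper.py: the helper names, the TEXT column type, and the id key used in the update are hypothetical. Only the table name scanneddata and the column name ocr_status come from this PR.

```python
import sqlite3


def add_ocr_status_column(conn: sqlite3.Connection) -> bool:
    """Add ocr_status to scanneddata if it is not there yet.

    Returns True when the column was added, False when it already existed,
    so the migration can safely run on every startup.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scanneddata)")}
    if "ocr_status" in columns:
        return False
    conn.execute("ALTER TABLE scanneddata ADD COLUMN ocr_status TEXT")
    conn.commit()
    return True


def set_ocr_status(conn: sqlite3.Connection, doc_id: int, status_name: str) -> None:
    """Store the status name (e.g. "COMPLETED", "FAILED", "DPI_ERROR")."""
    # The `id` key column is an assumption about the table layout.
    conn.execute(
        "UPDATE scanneddata SET ocr_status = ? WHERE id = ?",
        (status_name, doc_id),
    )
    conn.commit()
```

Existing rows keep a NULL ocr_status after such a migration, so any reader of the column would need to treat NULL as "unknown" rather than as a failure.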

Example Impact

Before this change, a blank PDF page would result in:

OCR processing completed: blank_page.pdf
Status: COMPLETED

After this change:

OCR processing completed: blank_page.pdf
OCR verification failed: no text found in OCR output file
Status: FAILED

The dashboard card also shows: ⚠️ OCR: No text found

This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred, and the failure reason is visible in the database and web UI.


Copilot · 2025-08-28T14:18:19Z

@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

maxi07 · 2026-03-23T17:10:21Z

@copilot please add tests that will test this new functionality and also make sure that the web ui reflects a failing OCR status (OCR status gets updated properly in the db AND the ui)


Copilot · 2026-03-23T17:26:39Z

@copilot please add tests that will test this new functionality and also make sure that the web ui reflects a failing OCR status (OCR status gets updated properly in the db AND the ui)

Done in commit 7aa489c. Here's what was added:

Tests (tests/test_ocr_verification.py): Added 9 new unit tests in TestStartProcessing that mock all external dependencies and cover every outcome of start_processing:

OCR success with text → COMPLETED
OCR success with no/whitespace-only text → FAILED
OCR success but output file missing → OUTPUT_ERROR
Non-zero exit code → FAILED
UnsupportedImageFormatError → UNSUPPORTED
DpiError → DPI_ERROR
Forwarded to upload queue on success
DB final update includes both file_status and ocr_status
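
To illustrate the mocking approach, here is a rough sketch of how a few of these outcomes could be tested. It is not the contents of tests/test_ocr_verification.py: classify_ocr_run is a hypothetical, heavily simplified stand-in for start_processing, and the two exception classes are local stand-ins for the OCRmyPDF exceptions so that the example runs without extra dependencies. The missing-output-file case is left out for brevity.

```python
from unittest.mock import Mock


class DpiError(Exception):
    """Local stand-in for the OCRmyPDF exception of the same name."""


class UnsupportedImageFormatError(Exception):
    """Local stand-in for the OCRmyPDF exception of the same name."""


def classify_ocr_run(run_ocr, extract_text) -> str:
    """Hypothetical, simplified stand-in for the outcome logic under test."""
    try:
        exit_code = run_ocr()
    except UnsupportedImageFormatError:
        return "UNSUPPORTED"
    except DpiError:
        return "DPI_ERROR"
    if exit_code != 0:
        return "FAILED"
    # Exit code 0 is only trusted when some non-whitespace text was extracted.
    return "COMPLETED" if extract_text().strip() else "FAILED"


def test_success_with_text_is_completed():
    assert classify_ocr_run(Mock(return_value=0), Mock(return_value="Invoice 42")) == "COMPLETED"


def test_whitespace_only_text_is_failed():
    assert classify_ocr_run(Mock(return_value=0), Mock(return_value=" \n ")) == "FAILED"


def test_nonzero_exit_code_skips_verification():
    extract_text = Mock()
    assert classify_ocr_run(Mock(return_value=2), extract_text) == "FAILED"
    extract_text.assert_not_called()


def test_dpi_error_maps_to_dpi_error():
    run_ocr = Mock(side_effect=DpiError("DPI too low"))
    assert classify_ocr_run(run_ocr, Mock()) == "DPI_ERROR"


def test_unsupported_format_maps_to_unsupported():
    run_ocr = Mock(side_effect=UnsupportedImageFormatError("bad format"))
    assert classify_ocr_run(run_ocr, Mock()) == "UNSUPPORTED"
```

Functions named this way would be collected by pytest, which is the runner used in this repository according to the commands in the firewall warning.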

DB: Added an ocr_status column migration to scanneddata in sqlite_wrapper.py. The OCR service now writes the status name (e.g. "COMPLETED", "FAILED", "DPI_ERROR") to this column at the end of each OCR job.

UI: The dashboard card now shows a ⚠️ warning line with a human-readable OCR error message when OCR encountered a problem (e.g. "OCR: No text found", "OCR: Image DPI too low"). This updates both on initial page load and live via SSE when a document is processed.
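
The path from status name to dashboard message could be sketched as below. The real helper, getOcrStatusText(), lives in dashboard.js; this Python version only illustrates the idea in the same language as the other examples. The mapping of FAILED to "OCR: No text found", the generic fallback text, and the payload field names other than ocr_status are assumptions. Only the two message strings and the "data: ...\n\n" Server-Sent Events framing are known.

```python
import json
from typing import Optional

# Only these two messages appear in this PR; which status maps to which
# message is an assumption (FAILED can also mean a non-zero exit code).
OCR_STATUS_MESSAGES = {
    "FAILED": "OCR: No text found",
    "DPI_ERROR": "OCR: Image DPI too low",
}


def get_ocr_status_text(ocr_status: Optional[str]) -> Optional[str]:
    """Return a warning message, or None when there is nothing to warn about."""
    if not ocr_status or ocr_status == "COMPLETED":
        return None
    # Unknown statuses fall back to a generic label (hypothetical behaviour).
    return OCR_STATUS_MESSAGES.get(ocr_status, f"OCR: {ocr_status}")


def format_sse_event(doc_id: int, file_status: str,
                     ocr_status: Optional[str]) -> str:
    """Serialize one document update as a Server-Sent Events message."""
    payload = {"id": doc_id, "file_status": file_status, "ocr_status": ocr_status}
    # An SSE message is a "data:" line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"
```

A client receiving such an event would parse the JSON after the "data: " prefix and show the warning line only when the helper returns a message.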

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

redis
- Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/test_homepage.py --ignore=tests/test_advanced_settings_web.py ****** (dns block)
- Triggering command: /usr/bin/python python -m pytest tests/test_settings.py (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Initial plan

6702ca4

Copilot AI assigned Copilot and maxi07 Aug 28, 2025

Copilot started work on behalf of maxi07 August 28, 2025 14:18

Add OCR verification to check for text extraction after OCR completion

e439182

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add verification for OCR after our step~~ Add OCR text verification to prevent false positive completions Aug 28, 2025

Copilot AI requested a review from maxi07 August 28, 2025 14:28

Copilot finished work on behalf of maxi07 August 28, 2025 14:28

maxi07 marked this pull request as ready for review March 23, 2026 17:09

Copilot started work on behalf of maxi07 March 23, 2026 17:10

Add OCR status persistence, UI display, and comprehensive start_processing tests

7aa489c

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/a2c6fc77-605f-48ae-b6b2-579df2e70e6c

Copilot finished work on behalf of maxi07 March 23, 2026 17:27

Copilot wants to merge 3 commits into main from copilot/fix-41
