Add OCR text verification to prevent false positive completions#46

Open
Copilot wants to merge 3 commits into main from copilot/fix-41


Copilot AI commented Aug 28, 2025

Currently, the OCR service relies solely on the OCRmyPDF exit code to decide whether OCR processing succeeded. This can produce false positives: OCR appears to complete successfully (exit code 0) even though no text was extracted from the document.

Problem

OCRmyPDF can return exit code 0 in cases where:

  • The input document contains only blank pages
  • The image quality is too poor for text recognition
  • The document contains only images with no readable text

In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text extraction occurred.

Solution

This PR adds a verification step after OCRmyPDF exits successfully, persists the resulting status to the database, and surfaces it in the web UI:

  1. Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.

  2. Improved status logic:

    • If text is found → OCRStatus.COMPLETED
    • If no text or only whitespace is found → OCRStatus.FAILED
    • If the OCR output file is missing → OCRStatus.OUTPUT_ERROR
  3. OCR status persisted to database: A new ocr_status column is added to the scanneddata table via migration, and the OCR service writes the status name (e.g. COMPLETED, FAILED, DPI_ERROR) after each job.

  4. Web UI reflects OCR failures: Dashboard cards display a ⚠️ warning indicator with a human-readable OCR error message (e.g. "OCR: No text found", "OCR: Image DPI too low") when OCR encountered a problem. This is visible on initial page load and updates live via Server-Sent Events.

  5. Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
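The verification and status logic from steps 1, 2 and 5 can be sketched roughly as follows. This is a minimal illustration, not the code in ocr_service/main.py: the OCRStatus enum here is a stand-in containing only the members named in this PR, and verify_ocr_output and its extract parameter are hypothetical names (the real service calls the existing extract_text() helper, whose signature is not shown in this description).

```python
import logging
import os
from enum import Enum
from typing import Callable

logger = logging.getLogger("ocr_verification_sketch")


class OCRStatus(Enum):
    # Stand-in enum: only the members mentioned in this PR.
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    OUTPUT_ERROR = "OUTPUT_ERROR"


def verify_ocr_output(output_path: str, extract: Callable[[str], str]) -> OCRStatus:
    """Decide the final status after the OCR tool exited with code 0.

    `extract` is a hypothetical callable that returns the text of a PDF;
    it stands in for the project's extract_text() helper.
    """
    if not os.path.exists(output_path):
        # Exit code 0, but no output file was produced.
        return OCRStatus.OUTPUT_ERROR

    text = (extract(output_path) or "").strip()
    logger.info("OCR verification extracted %d characters", len(text))

    # Whitespace-only output counts as "no text found".
    return OCRStatus.COMPLETED if text else OCRStatus.FAILED
```

Passing the extractor in as a parameter is only a convenience for the sketch: it lets the three outcomes be exercised without a real PDF.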

Changes

  • ocr_service/main.py: Imported extract_text, implemented the verification logic, added an if __name__ == "__main__": guard for testability, and persisted ocr_status to the database
  • scansynclib/scansynclib/sqlite_wrapper.py: Added migration to add ocr_status column to scanneddata table
  • web_service/src/main.py: Added ocr_status to SSE payload
  • web_service/src/static/js/dashboard.js: Added getOcrStatusText() helper and OCR status warning display in document cards
  • tests/test_ocr_verification.py: Added unit tests for the text extraction scenarios, plus 9 tests for start_processing that cover every OCR outcome
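
For illustration, an idempotent column migration of the kind described for sqlite_wrapper.py might look like the sketch below. The function name add_ocr_status_column and the TEXT column type are assumptions; only the table name scanneddata and the column name ocr_status come from this PR.

```python
import sqlite3


def add_ocr_status_column(conn: sqlite3.Connection) -> bool:
    """Add ocr_status to scanneddata if it is missing.

    Returns True if the column was added, False if it already existed.
    The TEXT type is an assumption; the PR only names the column.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(scanneddata)")]
    if "ocr_status" in columns:
        return False
    conn.execute("ALTER TABLE scanneddata ADD COLUMN ocr_status TEXT")
    conn.commit()
    return True
```

Checking the existing columns first makes the migration safe to run on every startup, since SQLite raises an error when ALTER TABLE adds a column that already exists.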

Example Impact

Before this change, a blank PDF page would result in:

OCR processing completed: blank_page.pdf
Status: COMPLETED

After this change:

OCR processing completed: blank_page.pdf
OCR verification failed: no text found in OCR output file
Status: FAILED

The dashboard card also shows: ⚠️ OCR: No text found

This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred, and the failure reason is visible in the database and web UI.
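
The mapping from a stored status name to the dashboard warning can be pictured with the small sketch below. The actual helper, getOcrStatusText() in dashboard.js, is JavaScript and is not reproduced here; this Python version is only an illustration. The two message strings come from this description, while the pairing of each message with a status name, the generic fallback text, and the name ocr_status_text are assumptions.

```python
from typing import Optional

# Only these two messages appear in the PR description; which status name
# each one belongs to is an assumption made for this sketch.
OCR_STATUS_TEXT = {
    "FAILED": "OCR: No text found",
    "DPI_ERROR": "OCR: Image DPI too low",
}


def ocr_status_text(status: Optional[str]) -> Optional[str]:
    """Return a warning message for a status name, or None if no warning applies."""
    if not status or status == "COMPLETED":
        # Nothing to warn about: OCR succeeded or no status was recorded.
        return None
    # Assumed fallback: show the raw status name for unmapped statuses.
    return OCR_STATUS_TEXT.get(status, f"OCR: {status}")
```

Returning None for COMPLETED and for missing values keeps rows written before the migration, which have no ocr_status, from showing a warning.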



Copilot AI commented Aug 28, 2025

@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add verification for OCR after our step" to "Add OCR text verification to prevent false positive completions" on Aug 28, 2025
Copilot AI requested a review from maxi07 August 28, 2025 14:28
@maxi07 maxi07 marked this pull request as ready for review March 23, 2026 17:09

maxi07 commented Mar 23, 2026

@copilot please add tests that cover this new functionality, and also make sure that the web UI reflects a failing OCR status (the OCR status should be updated properly in the DB and the UI)


Copilot AI commented Mar 23, 2026

Done in commit 7aa489c. Here's what was added:

Tests (tests/test_ocr_verification.py): Added 9 new unit tests in TestStartProcessing that mock all external dependencies and cover every outcome of start_processing:

  • OCR success with text → COMPLETED
  • OCR success with no/whitespace-only text → FAILED
  • OCR success but output file missing → OUTPUT_ERROR
  • Non-zero exit code → FAILED
  • UnsupportedImageFormatError → UNSUPPORTED
  • DpiError → DPI_ERROR
  • Forwarded to upload queue on success
  • DB final update includes both file_status and ocr_status
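
The mocking pattern behind these tests can be sketched as below. This is not the contents of tests/test_ocr_verification.py: run_job is a hypothetical, simplified stand-in for start_processing that takes its external dependencies as callables and returns a status name, and it covers only four of the outcomes listed above.

```python
import unittest
from unittest import mock


def run_job(run_ocr, output_exists, extract):
    """Hypothetical stand-in for start_processing; returns a status name."""
    if run_ocr() != 0:
        return "FAILED"  # non-zero exit code
    if not output_exists():
        return "OUTPUT_ERROR"  # exit code 0 but no output file
    return "COMPLETED" if extract().strip() else "FAILED"


class TestRunJob(unittest.TestCase):
    def test_success_with_text(self):
        status = run_job(mock.Mock(return_value=0),
                         mock.Mock(return_value=True),
                         mock.Mock(return_value="Invoice 42"))
        self.assertEqual(status, "COMPLETED")

    def test_success_with_whitespace_only(self):
        status = run_job(mock.Mock(return_value=0),
                         mock.Mock(return_value=True),
                         mock.Mock(return_value="  \n"))
        self.assertEqual(status, "FAILED")

    def test_output_file_missing(self):
        extract = mock.Mock()
        status = run_job(mock.Mock(return_value=0),
                         mock.Mock(return_value=False),
                         extract)
        self.assertEqual(status, "OUTPUT_ERROR")
        extract.assert_not_called()  # nothing to extract from

    def test_nonzero_exit_code(self):
        extract = mock.Mock()
        status = run_job(mock.Mock(return_value=2),
                         mock.Mock(return_value=True),
                         extract)
        self.assertEqual(status, "FAILED")
        extract.assert_not_called()  # verification is skipped
```

Because every dependency is a Mock, the tests need neither OCRmyPDF nor a running redis or database.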

DB: Added an ocr_status column migration to scanneddata in sqlite_wrapper.py. The OCR service now writes the status name (e.g. "COMPLETED", "FAILED", "DPI_ERROR") to this column at the end of each OCR job.

UI: The dashboard card now shows a ⚠️ warning line with a human-readable OCR error message when OCR encountered a problem (e.g. "OCR: No text found", "OCR: Image DPI too low"). This updates both on initial page load and live via SSE when a document is processed.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • redis
    • Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/test_homepage.py --ignore=tests/test_advanced_settings_web.py ****** (dns block)
    • Triggering command: /usr/bin/python python -m pytest tests/test_settings.py (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
