Add a Turkish-language OCR benchmark harness #2

@ada-cinar

Right now we claim Turkish-focused accuracy in the README, but there are no numbers behind it. We should add a small, reproducible benchmark.

What to build

  • Run OpenCR over a fixed set of public-domain Turkish PDFs (~50 pages total)
  • Measure Word Error Rate and Character Error Rate against gold-standard transcripts
  • Run the same fixtures through Tesseract, Surya, PaddleOCR, and Marker for comparison
  • Publish the resulting table at benchmarks/RESULTS.md and link it from the README
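
As a starting point for the metrics, here is a minimal sketch of WER and CER computed as Levenshtein edit distance divided by reference length. The function names (edit_distance, normalize, cer, wer) are hypothetical and not part of OpenCR, and the NFC normalization step is an assumption about how transcripts should be compared, not a settled project decision.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumption: NFC-normalize so composed and decomposed forms of
    # Turkish letters such as ş or ğ compare as equal.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference characters."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / reference words (whitespace tokens)."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Case folding is deliberately left out: Python's str.lower() is not locale-aware, so it mishandles the Turkish dotted and dotless I, and whether to fold case at all is a choice the benchmark should state explicitly.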

Where things live

Fixtures and gold transcripts go under benchmarks/fixtures/, the runner script at benchmarks/run.py, and comparison tooling under benchmarks/compare/.
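
To make the layout concrete, here is a rough sketch of two helpers the runner might use. It assumes each fixture PDF has its gold transcript stored next to it with the same name and a .txt extension; that naming convention, and the helper names pair_fixtures and render_results_table, are illustrative guesses rather than anything decided in this issue.

```python
from pathlib import Path


def pair_fixtures(fixtures_dir):
    """Return sorted (pdf_path, gold_path) pairs from a fixtures directory.

    Raises FileNotFoundError if any PDF lacks a gold transcript.
    """
    root = Path(fixtures_dir)
    pairs, missing = [], []
    for pdf in sorted(root.glob("*.pdf")):
        gold = pdf.with_suffix(".txt")  # hypothetical naming convention
        if gold.is_file():
            pairs.append((pdf, gold))
        else:
            missing.append(pdf.name)
    if missing:
        raise FileNotFoundError(
            "no gold transcript for: " + ", ".join(missing)
        )
    return pairs


def render_results_table(rows):
    """Render (engine, wer, cer) tuples as a markdown table string."""
    lines = ["| Engine | WER | CER |", "| --- | --- | --- |"]
    for engine, wer_value, cer_value in rows:
        lines.append(f"| {engine} | {wer_value:.3f} | {cer_value:.3f} |")
    return "\n".join(lines) + "\n"
```

Failing loudly on a missing transcript keeps a half-curated fixture set from silently shrinking the benchmark. The string from render_results_table could be written to benchmarks/RESULTS.md once real measurements exist.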

Why

Even informal numbers are more useful than the silence we have now. This is also a great way for new contributors to help — no model code needed, mostly careful PDF curation and a bit of scripting.

Good first issue.
