What is this?

An extractor for PDF files that converts each document to:

PNG files (one per page)
ALTO XML files (an OCR layout format that Tesseract can write directly)
JSON files (a direct conversion of the ALTO XML)

How does this work?

It uses some shell tools to do the heavy lifting:

ImageMagick for PDF to PNG conversion
Tesseract OCR for PNG to ALTO XML conversion
xsltproc and a stylesheet for XML to JSON conversion
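
The three steps above can be sketched as shell functions. This is a minimal illustration, not the contents of run.sh: the function names, the 300 dpi density, the page-%03d.png naming and the alto2json.xsl stylesheet name are all assumptions. It also assumes ImageMagick 6 (where the command is convert; version 7 uses magick) and a Tesseract build that ships the alto output config.

```shell
#!/bin/sh
# Sketch of the three conversion steps. Flags, paths and the stylesheet
# name are assumptions for illustration, not taken from run.sh.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: PDF -> one PNG per page (ImageMagick; use `magick` on version 7).
pdf_to_png() {
    pdf=$1
    outdir=$2
    run convert -density 300 "$pdf" "$outdir/page-%03d.png"
}

# Step 2: PNG -> ALTO XML. Tesseract appends .xml to the output base name.
png_to_alto() {
    png=$1
    run tesseract "$png" "${png%.png}" alto
}

# Step 3: ALTO XML -> JSON via an XSLT stylesheet (hypothetical file name).
alto_to_json() {
    xml=$1
    stylesheet=${2:-alto2json.xsl}
    run xsltproc -o "${xml%.xml}.json" "$stylesheet" "$xml"
}
```

Calling pdf_to_png on a PDF and an existing output directory, then png_to_alto and alto_to_json on each resulting file, would chain the conversions. Setting DRY_RUN=1 prints each command instead of running it, which is a convenient way to check the paths first.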

How do I run this?

The wrapper script can either run the commands locally, using packages installed on your machine, or act as a front end to a Docker image that has all the required packages pre-installed.

./run.sh -h
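
This README does not say how the wrapper chooses between the two modes, so the following is only one hypothetical approach: check whether the required tools are on the PATH and fall back to Docker if they are not. The function names are assumptions, and the real script may instead rely on a command-line option.

```shell
#!/bin/sh
# Hypothetical sketch: pick local tools when available, else Docker.
# Function names are illustrative and not taken from run.sh.

# Succeed only if every named tool is found on the PATH.
have_local_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || return 1
    done
    return 0
}

# Print "local", "docker" or "none" depending on what is installed.
choose_mode() {
    if have_local_tools "$@"; then
        echo local
    elif command -v docker >/dev/null 2>&1; then
        echo docker
    else
        echo none
    fi
}
```

For example, choose_mode convert tesseract xsltproc would report local only when all three tools named in this README are installed.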
