GitHub - afriedman412/petey-app: The easy PDF extractor.

Petey: A framework for PDF data extraction.

Download for Mac and Windows | Run with Docker | Try the live demo

The PDF was invented in 1991 by Adobe co-founder John Warnock. While the future was digital, the present was analog, and the new format was intended as a bridge between the two eras. PDFs were designed to be a universal container for the printed page that would look the same on any screen or printer. A PDF created in California should look identical when printed in London.

Warnock achieved his goal, and today PDFs are ubiquitous. According to The Economist, there are over 2.4 trillion PDF-formatted documents across the world's computers. They are your bank statement after you go paperless. They are the only existing copy of a newspaper article. They are the textbook for an online class. They are the forms you fill out when you join a gym. They are everywhere.

The secret to their flexibility is that they have almost no internal rules. A PDF is just a list of items (words, characters, shapes, images) and their coordinates on the page. It carries no information about how those items relate to one another. Two words that appear next to each other in print could be 1,000 lines apart in the file itself. (If you have ever tried to highlight a line in a PDF and ended up selecting words halfway across the page, that's why.)
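To make the "list of items and coordinates" idea concrete, here is a minimal sketch of how reading order can be rebuilt from positions alone. The `Word` class, the sample data, and the `line_tolerance` value are all invented for illustration; real parsers such as pdfplumber and PyMuPDF handle many more cases (columns, rotated text, tables).

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def reading_order(words, line_tolerance=3.0):
    """Group words into lines by vertical position, then sort each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if it sits close to that line's first word.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words in an arbitrary storage order, as they might appear inside a PDF.
scrambled = [
    Word("statement", 140, 50.5),
    Word("Balance:", 72, 80),
    Word("bank", 100, 50),
    Word("$1,204.18", 130, 80.4),
    Word("Your", 72, 50.2),
]

page_lines = reading_order(scrambled)
print(page_lines)  # ['Your bank statement', 'Balance: $1,204.18']
```

Even this toy version needs a tolerance for slightly misaligned baselines, which hints at why general-purpose extraction is hard.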

This makes PDFs a nightmare to work with.

But people have been working on PDF extraction for a long time. Open-source tools like pdfplumber and PyMuPDF can do almost everything, and the commercial options are even better. AI can finish the job, and it's cheap and fast enough for everyone to use. The only thing standing between you and the data in your PDF is something to bring it all together.

What is Petey?

Petey is a framework for PDF data extraction. It wires the PDF parser of your choice to the LLM of your choice, and, with input from the user, pulls the data out of your PDF document.
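The wiring described above can be pictured with a short sketch. This is not Petey's actual code or API: `run_extraction`, `stub_parser`, and `stub_extractor` are hypothetical names, and the stubs stand in for a real parser and a real LLM call so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: any parser and any extractor with these shapes can be swapped in.
Parser = Callable[[str], str]  # PDF path -> page text
Extractor = Callable[[str, List[str]], Dict[str, str]]  # text + field names -> values


def run_extraction(pdf_path: str, fields: List[str], parse: Parser, extract: Extractor) -> Dict[str, str]:
    """Parse the document, ask the extractor for the requested fields, and normalize the result."""
    text = parse(pdf_path)
    record = extract(text, fields)
    # Keep only the fields the user asked for; missing ones become empty strings.
    return {field: record.get(field, "") for field in fields}


def stub_parser(pdf_path: str) -> str:
    # Stand-in for a real parser; returns fixed text instead of reading the file.
    return "Invoice 1042\nTotal due: $310.00"


def stub_extractor(text: str, fields: List[str]) -> Dict[str, str]:
    # Stand-in for an LLM call: match lines that start with a field name.
    found = {}
    for line in text.splitlines():
        for field in fields:
            if line.lower().startswith(field.lower()):
                found[field] = line[len(field):].strip(" :")
    return found


result = run_extraction("invoice.pdf", ["invoice", "total due", "vendor"], stub_parser, stub_extractor)
print(result)
```

Passing the parser and extractor in as plain callables is one way to keep the two choices independent, so either can be replaced without touching the other.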

What isn't Petey?

Petey is not a parser or an LLM itself. It outsources most of its tasks to other services, whether they are open-source or commercial. You bring your own API keys to Petey, and you pay those providers for everything Petey does to your documents. Typical cost is 1-5 cents per page.

Get started

Web (no install)

Try the interactive demos or go straight to the extractor.

Desktop app

Download the latest release for Mac or Windows. Open the app, add an API key in Settings, and you're ready to go.

Docker

docker run -p 8080:8080 afriedman412/petey

Open http://localhost:8080.

Python CLI

pip install petey
petey extract --schema your_schema.yaml document.pdf
petey extract --schema your_schema.yaml folder/ -o results.csv

See petey extract --help for all options.

API keys

Petey connects to external services for parsing and extraction. You'll need at least one API key to run extractions. See the API key setup guide for step-by-step instructions.

Provider	What it does	Cost
OpenAI	LLM extraction	~$0.01-0.05/page
Anthropic	LLM extraction	~$0.01-0.04/page
Datalab (Marker)	AI-powered parsing	~$0.005/page
PyMuPDF	Built-in parsing	Free
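The per-page rates in the table above can be combined into a rough budget for a job. The sketch below is only arithmetic on those approximate figures; the `RATES` dictionary and `estimate_cost` function are illustrative and not part of Petey, and actual provider pricing will vary.

```python
# Approximate (low, high) cost in dollars per page, copied from the table above.
RATES = {
    "OpenAI": (0.01, 0.05),
    "Anthropic": (0.01, 0.04),
    "Datalab (Marker)": (0.005, 0.005),
    "PyMuPDF": (0.0, 0.0),
}


def estimate_cost(pages, parser, llm):
    """Return a (low, high) dollar estimate for parsing plus LLM extraction."""
    parser_low, parser_high = RATES[parser]
    llm_low, llm_high = RATES[llm]
    low = round(pages * (parser_low + llm_low), 2)
    high = round(pages * (parser_high + llm_high), 2)
    return low, high


# A 200-page batch with the free built-in parser, then with AI-powered parsing.
builtin_estimate = estimate_cost(200, "PyMuPDF", "Anthropic")
marker_estimate = estimate_cost(200, "Datalab (Marker)", "OpenAI")
print(builtin_estimate, marker_estimate)
```

The parser and LLM rates are added together because a page typically passes through both steps.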
