Petey: A framework for PDF data extraction.
Download for Mac and Windows | Run with Docker | Try the live demo
The PDF was invented in 1991 by Adobe co-founder John Warnock. While the future was digital, the present was analog, and the new format was intended as a bridge between the two eras. PDFs were designed to be a universal container for the printed page that would look the same on any screen or printer. A PDF generated in California should be identical to a version printed in London.
Warnock achieved his goal, and today PDFs are ubiquitous. According to The Economist, there are over 2.4 trillion PDF-formatted documents across the world's computers. They are your bank statement after you go paperless. They are the only existing copy of a newspaper article. They are the textbook for an online class. They are the forms you fill out when you join a gym. They are everywhere.
The secret to their flexibility is that they have almost no rules internally. A PDF is just a list of items (words, characters, shapes, images) and their coordinates on the page. It stores no information about how those items relate to one another. Two words that appear next to each other in print could be 1,000 lines apart in the file. (If you have ever tried to highlight a line in a PDF and ended up selecting words halfway across the page, that's why.)
This makes PDFs a nightmare to work with.
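To make the problem concrete, here is a minimal sketch of what rebuilding reading order from bare coordinates can look like. The `items` list and the `reading_order` helper are made up for illustration; real parsers such as pdfplumber and PyMuPDF handle far more cases (columns, rotated text, tables) than this does.

```python
# Hypothetical page items: (text, x, y), with y measured from the top of the page.
# They are deliberately stored out of reading order, as a PDF might store them.
items = [
    ("fox", 150.0, 100.2),
    ("Total:", 50.0, 130.0),
    ("The", 50.0, 100.0),
    ("$42.00", 120.0, 130.1),
    ("quick", 90.0, 99.8),
]


def reading_order(items, line_tolerance=2.0):
    """Group items into lines by similar y, then sort each line left to right."""
    lines = []  # each entry: [y of the line's first item, [(x, text), ...]]
    for text, x, y in sorted(items, key=lambda item: item[2]):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            # Close enough vertically to count as the same printed line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(text for _, text in sorted(words)) for _, words in lines]


print(reading_order(items))  # ['The quick fox', 'Total: $42.00']
```

Even this toy version needs a tolerance value to decide what counts as "the same line", and that kind of guess is where real documents start to break simple approaches.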
But people have been working on PDF extraction for a long time. Open-source tools like pdfplumber and PyMuPDF can do almost everything, and the commercial options are even better. AI can finish the job, and it's cheap and fast enough for everyone to use. The only thing standing between you and the data in your PDF is something to bring it all together.
Petey is a framework for PDF data extraction. It wires the PDF parser of your choice to the LLM of your choice, and, with input from the user, pulls the data out of your PDF document.
Petey is not a parser or an LLM itself. It outsources most of its tasks to other services, whether they are open-source or commercial. You bring your own API keys to Petey, and you pay for everything Petey does to your documents. Typical cost is 1-5 cents per page.
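As a rough illustration of the "bring your own parser and LLM" idea, the sketch below wires two interchangeable functions together around a schema. Every name here (`extract`, `fake_parser`, `fake_llm`, the schema fields) is hypothetical and is not Petey's actual API; the stand-ins exist only so the example runs without API keys or a real PDF.

```python
from typing import Callable

# All names below are hypothetical; this is not Petey's actual API.
Parser = Callable[[str], str]            # PDF path -> page text
Extractor = Callable[[str, dict], dict]  # (text, schema) -> extracted fields


def extract(pdf_path: str, schema: dict, parser: Parser, llm: Extractor) -> dict:
    """Parse the PDF to text, then ask the LLM for the fields in the schema."""
    text = parser(pdf_path)
    record = llm(text, schema)
    # Keep only the fields the schema asked for, filling gaps with None.
    return {field: record.get(field) for field in schema}


# Stand-ins for a real parser and a real LLM call.
def fake_parser(pdf_path: str) -> str:
    return "Invoice 1001 Total: $42.00"


def fake_llm(text: str, schema: dict) -> dict:
    words = text.split()
    return {"invoice_number": words[1], "total": words[-1], "extra": "ignored"}


schema = {"invoice_number": "string", "total": "string", "due_date": "string"}
print(extract("invoice.pdf", schema, fake_parser, fake_llm))
# {'invoice_number': '1001', 'total': '$42.00', 'due_date': None}
```

Because the parser and the LLM are plain callables in this sketch, swapping one service for another means passing a different function rather than changing the pipeline.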
Try the interactive demos or go straight to the extractor.
Download the latest release for Mac or Windows. Open the app, add an API key in Settings, and you're ready to go.
docker run -p 8080:8080 afriedman412/petey
Open http://localhost:8080.
pip install petey
petey extract --schema your_schema.yaml document.pdf
petey extract --schema your_schema.yaml folder/ -o results.csv
See `petey extract --help` for all options.
Petey connects to external services for parsing and extraction. You'll need at least one API key to run extractions. See the API key setup guide for step-by-step instructions.
| Provider | What it does | Cost |
|---|---|---|
| OpenAI | LLM extraction | ~$0.01-0.05/page |
| Anthropic | LLM extraction | ~$0.01-0.04/page |
| Datalab (Marker) | AI-powered parsing | ~$0.005/page |
| PyMuPDF | Built-in parsing | Free |
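
For budgeting, the per-page figures in the table can be turned into a quick estimate. The sketch below assumes that parsing and LLM costs simply add up per page, which the table does not state, and the `estimate_cost` helper is hypothetical rather than part of Petey. Actual bills depend on each provider's current pricing and on how dense your pages are.

```python
# Approximate per-page costs in USD, copied from the table above as (low, high).
COST_PER_PAGE = {
    "OpenAI": (0.01, 0.05),
    "Anthropic": (0.01, 0.04),
    "Datalab (Marker)": (0.005, 0.005),
    "PyMuPDF": (0.0, 0.0),
}


def estimate_cost(pages: int, parser: str, llm: str) -> tuple[float, float]:
    """Return a (low, high) USD estimate for parsing plus LLM extraction.

    Assumes the parser and LLM costs add up per page.
    """
    parse_low, parse_high = COST_PER_PAGE[parser]
    llm_low, llm_high = COST_PER_PAGE[llm]
    low = pages * (parse_low + llm_low)
    high = pages * (parse_high + llm_high)
    return round(low, 2), round(high, 2)


print(estimate_cost(200, "PyMuPDF", "OpenAI"))              # (2.0, 10.0)
print(estimate_cost(200, "Datalab (Marker)", "Anthropic"))  # (3.0, 9.0)
```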
