This project must declare MPL-2.0-or-later for platform/tooling compatibility.
Philosophy: Palimpsest. The Palimpsest-MPL (PMPL) text is provided in license/PMPL-1.0.txt, and the canonical source is the palimpsest-license repository.
Docudactyl is a multi-language PDF text extraction and transformation toolkit designed for:
- Parallel processing of large document collections
- Data analysis on extracted content
- Machine-readable transformation to Scheme (S-expressions)
- Interactive exploration via terminal UI
The tool recovers text from poorly redacted PDF files where visual redaction (black rectangles) was applied without removing the underlying text data from the PDF stream.
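
To illustrate why a drawn rectangle alone does not redact anything, here is a minimal sketch that finds text blocks whose bounding boxes sit under a redaction rectangle. It is plain Julia and does not use Docudactyl: the `Box` type, the `overlaps` function, and all coordinates are hypothetical and chosen only for illustration.

```julia
# Hypothetical illustration; none of these names come from Docudactyl.
struct Box
    x0::Float64
    y0::Float64
    x1::Float64
    y1::Float64
end

# Two axis-aligned boxes overlap when they intersect on both axes.
overlaps(a::Box, b::Box) =
    a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1

# Text blocks as (text, bounds) pairs, as an extractor might report them.
blocks = [
    ("Name:", Box(72, 72, 110, 84)),
    ("Jane Doe", Box(115, 72, 170, 84)),
    ("Date:", Box(72, 90, 105, 102)),
]

# A black rectangle drawn over the second block (made-up coordinates).
redaction = Box(112, 70, 175, 86)

# Text that is still present in the stream but covered by the rectangle.
hidden = [text for (text, bounds) in blocks if overlaps(bounds, redaction)]
println(hidden)
```

The covered block is still an ordinary text block with ordinary coordinates, which is why a text extractor can read it regardless of what is painted on top.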
┌─────────────────────────────────────────────────────────────────┐
│                           Docudactyl                            │
├─────────────────┬─────────────────────┬─────────────────────────┤
│ Julia           │ OCaml               │ Ada                     │
│ (Extraction)    │ (Transformer)       │ (TUI)                   │
├─────────────────┼─────────────────────┼─────────────────────────┤
│ • PDF parsing   │ • PDF → Scheme      │ • Interactive viewer    │
│ • Parallel proc │ • JSON → Scheme     │ • Document navigation   │
│ • Data analysis │ • S-expr generation │ • Search interface      │
│ • Export (CSV,  │ • Accessor funcs    │ • Analysis display      │
│   JSON, Scheme) │                     │                         │
└─────────────────┴─────────────────────┴─────────────────────────┘

# Install dependencies
just deps
# Build all components
just build
# Extract text from a PDF
just extract document.pdf -o output.json -f json
# Transform to Scheme
just transform output.json output.scm
# Or run the full pipeline
just pipeline document.pdf
# Launch the TUI viewer
just tui output.json

The Julia component provides high-performance PDF text extraction with parallel processing support.

using Docudactyl
# Single file extraction
result = extract_text("document.pdf")
if result.success
    println(text_content(result.document))
end
# Parallel batch processing
using Distributed
addprocs(4)
@everywhere using Docudactyl
files = readdir("pdfs/", join=true)
results = parallel_extract(files)
# Analysis
stats = analyze_content(result.document)
println("Words: $(stats.total_words)")
println("Unique: $(stats.unique_words)")
# Export to Scheme
export_scheme(result.document, "output.scm")

- Parallel Processing: Distribute extraction across multiple cores
- Position Preservation: Maintains text block coordinates for layout reconstruction
- Multiple Formats: Export to CSV, JSON, plain text, or Scheme
- Statistical Analysis: Word frequency, coverage estimation, comparative analysis
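
As a rough idea of what the word-frequency part of the statistical analysis involves, the sketch below counts total and unique words in a string. It is a self-contained illustration, not the implementation of `analyze_content`: the `word_stats` function is hypothetical, and only the field names `total_words` and `unique_words` are borrowed from the example above.

```julia
# Hypothetical helper; this is not Docudactyl's analyze_content.
function word_stats(text::AbstractString)
    # Lowercase, then split on runs of characters that are neither letters nor digits.
    words = split(lowercase(text), r"[^\p{L}\p{N}]+"; keepempty=false)

    # Count occurrences of each distinct word.
    freq = Dict{String,Int}()
    for w in words
        key = String(w)
        freq[key] = get(freq, key, 0) + 1
    end

    return (total_words = length(words),
            unique_words = length(freq),
            frequencies = freq)
end

stats = word_stats("The cat saw the other cat.")
println("Words: $(stats.total_words)")
println("Unique: $(stats.unique_words)")
```

Lowercasing before counting treats "The" and "the" as one word; a real analysis might also want stemming or stop-word removal, which this sketch leaves out.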
The OCaml component transforms extracted documents into machine-readable Scheme (S-expression) format.
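
The mapping from a nested document structure to S-expressions can be sketched in a few lines. The example below is written in Julia to match the other examples here and is only an assumption about the general approach; the actual OCaml code in scheme_emit.ml may work differently. The `to_sexpr` function is hypothetical, and it handles only strings, numbers, key/value pairs, and vectors.

```julia
# Hypothetical sketch; not the code of docudactyl-scm or export_scheme.

# Strings become quoted Scheme strings with backslashes and quotes escaped.
function to_sexpr(s::AbstractString)
    escaped = replace(replace(s, "\\" => "\\\\"), "\"" => "\\\"")
    return "\"" * escaped * "\""
end

# Numbers are written as-is.
to_sexpr(n::Real) = string(n)

# A key with a scalar value becomes a dotted pair: (key . value).
# A key with a vector value becomes a tagged list: (key item item ...).
function to_sexpr(p::Pair{Symbol})
    key, value = p
    if value isa Union{AbstractString,Real}
        return "($key . $(to_sexpr(value)))"
    end
    return "($key " * join(map(to_sexpr, value), " ") * ")"
end

# A vector of pairs becomes an association list.
to_sexpr(v::AbstractVector) = "(" * join(map(to_sexpr, v), " ") * ")"

# Made-up page data for illustration.
page = [
    Symbol("page-number") => 1,
    :dimensions => [:width => 612.0, :height => 792.0],
    :lines => ["First line of text", "Second line of text"],
]
println(to_sexpr(page))
```

Dotted pairs for scalar fields keep the output readable as an association list, so a Scheme consumer can look fields up with `assq` or `assoc`.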
# From PDF directly
docudactyl-scm document.pdf -o document.scm
# From Julia JSON output
docudactyl-scm extracted.json -o extracted.scm
# Minimal output (data only, no helper functions)
docudactyl-scm --minimal document.pdf

The Scheme output includes the document structure and accessor functions:

(define docudactyl-document
  `((metadata
     (filepath . "document.pdf")
     (sha256 . "abc123...")
     (extracted-at . "2025-12-27T10:00:00Z")
     (pdf-metadata
      (title . "Example Document")
      (author . "Unknown")))
    (statistics
     (total-pages . 10)
     (total-words . 5000))
    (pages
     ((page-number . 1)
      (dimensions (width . 612.0) (height . 792.0))
      (lines
       "First line of text"
       "Second line of text")
      (blocks
       ((text . "First")
        (bounds (x0 . 72.0) (y0 . 72.0) (x1 . 100.0) (y1 . 84.0))
        (font-size . 12.0)))))))
;; Accessor functions
(define (docudactyl-get-pages doc) ...)
(define (docudactyl-page-text page) ...)
(define (docudactyl-document-text doc) ...)

The Ada component provides an interactive terminal interface for document exploration.

# Launch TUI with a document
docudactyl-tui extracted.json

The project uses just as a task runner.

# Show all recipes
just
# Build
just build # Build all components
just build-julia # Build Julia only
just build-ocaml # Build OCaml only
just build-ada # Build Ada only
# Run
just extract <pdf> # Extract text from PDF
just transform <in> # Transform to Scheme
just tui [file] # Launch TUI
just pipeline <pdf> # Full extraction pipeline
just repl # Julia REPL
# Parallel processing
just parallel-extract <dir> [workers]
# Test
just test # Run all tests
just test-julia # Julia tests only
just test-ocaml # OCaml tests only
# Dependencies
just deps # Install all dependencies
just deps-julia # Julia dependencies
just deps-ocaml # OCaml dependencies
# Utilities
just loc # Count lines of code
just clean # Clean build artifacts

docudactyl/
├── src/
│   ├── julia/                  # Julia extraction engine
│   │   ├── Project.toml        # Package manifest
│   │   ├── Docudactyl.jl       # Main module
│   │   ├── extract.jl          # PDF extraction
│   │   ├── parallel.jl         # Parallel processing
│   │   ├── analysis.jl         # Data analysis
│   │   ├── export.jl           # Export formats
│   │   └── cli.jl              # Command-line interface
│   │
│   ├── ocaml/                  # OCaml transformer
│   │   ├── dune-project        # Dune project
│   │   ├── dune                # Build config
│   │   ├── docudactyl_scm.ml   # Main entry
│   │   ├── document_types.ml   # Type definitions
│   │   ├── pdf_parser.ml       # PDF/JSON parsing
│   │   └── scheme_emit.ml      # Scheme generation
│   │
│   └── ada/                    # Ada TUI
│       ├── docudactyl.gpr      # GNAT project
│       ├── docudactyl_tui.adb  # Main TUI
│       ├── terminal_utils.ads  # Terminal handling
│       ├── terminal_utils.adb
│       ├── document_model.ads  # Document types
│       └── document_model.adb
│
├── examples/                   # Example documents
├── docs/                       # Documentation
├── justfile                    # Task runner
└── README.adoc                 # This file

Verify that redaction was properly applied:

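result = extract_text("redacted.pdf")
if result.success
    text = text_content(result.document)
    # Any extractable text is reported; compare it with the visible content
    # to see which passages were meant to be redacted.
    if !isempty(text)
        println("WARNING: Text found under redactions!")
        println(text)
    end
end

Analyze a collection of documents:
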
results = parallel_extract_dir("documents/"; recursive=true)
summary = batch_process(results; output_dir="analysis/", format=:csv)
println("Processed $(summary.total) documents in $(summary.extraction_time_s)s")

Process extracted documents in Guile Scheme:

(load "document.scm")
;; Get all text
(define text (docudactyl-document-text docudactyl-document))
;; Search for patterns
(define matches (docudactyl-search docudactyl-document "confidential"))
;; Access specific pages
(define page-1 (docudactyl-get-page docudactyl-document 1))
(display (docudactyl-page-text page-1))

This tool is designed for:
- Document analysis and archival review
- Research and verification of redaction practices
- Accessibility improvements for PDF content
- Data recovery from improperly redacted public documents
Important: This tool does not bypass encryption, OCR hidden images, or circumvent actual security measures. It only extracts text that remains in the PDF stream but is visually obscured.
This is a Tier 1 RSR project using:
- Julia: Batch processing (per RSR allowance)
- OCaml: Language-specific tooling (AffineScript compiler category)
- Ada: Safety-critical systems

- Inspired by unredact