hyperpolymath/docudactyl

Docudactyl

1. License & Philosophy

This project must declare MPL-2.0-or-later for platform/tooling compatibility.

Philosophy: Palimpsest. The Palimpsest-MPL (PMPL) text is provided in license/PMPL-1.0.txt, and the canonical source is the palimpsest-license repository.

2. Overview

Docudactyl is a multi-language PDF text extraction and transformation toolkit designed for:

  • Parallel processing of large document collections

  • Data analysis on extracted content

  • Machine-readable transformation to Scheme (S-expressions)

  • Interactive exploration via terminal UI

The tool recovers text from poorly redacted PDF files where visual redaction (black rectangles) was applied without removing the underlying text data from the PDF stream.
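
To see why such text is recoverable, recall that a PDF page's content stream is a sequence of drawing operators: text is painted with operators such as `Tj` between `BT` and `ET`, and a black box is just a filled rectangle (`re` followed by `f`) painted on top. The following is a minimal Python sketch, not Docudactyl's implementation. It assumes an already-decompressed content stream and only simple literal strings; the sample stream and the `show_text_strings` helper are hypothetical, and a real extractor needs a full PDF parser.

```python
import re

# Hypothetical, already-decompressed page content stream.
# The text is painted first; the "redaction" is a black rectangle drawn on top.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Account holder: Jane Doe) Tj
ET
0 0 0 rg
70 715 200 16 re
f
"""

# Matches a simple literal string followed by the Tj (show text) operator.
# Assumption: no nested parentheses, escape sequences, or TJ arrays.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def show_text_strings(stream: bytes) -> list[str]:
    """Return the literal strings shown with Tj, ignoring all other operators."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


recovered = show_text_strings(CONTENT_STREAM)
print(recovered)
```

The rectangle operators never touch the text operators, so the string is still present in the stream even though a viewer would show only a black box.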

2.1. Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Docudactyl                                │
├─────────────────┬─────────────────────┬─────────────────────────┤
│     Julia       │       OCaml         │         Ada             │
│  (Extraction)   │   (Transformer)     │        (TUI)            │
├─────────────────┼─────────────────────┼─────────────────────────┤
│ • PDF parsing   │ • PDF → Scheme      │ • Interactive viewer    │
│ • Parallel proc │ • JSON → Scheme     │ • Document navigation   │
│ • Data analysis │ • S-expr generation │ • Search interface      │
│ • Export (CSV,  │ • Accessor funcs    │ • Analysis display      │
│   JSON, Scheme) │                     │                         │
└─────────────────┴─────────────────────┴─────────────────────────┘

3. Quick Start

# Install dependencies
just deps

# Build all components
just build

# Extract text from a PDF
just extract document.pdf -o output.json -f json

# Transform to Scheme
just transform output.json output.scm

# Or run the full pipeline
just pipeline document.pdf

# Launch the TUI viewer
just tui output.json

4. Components

4.1. Julia: Core Extraction Engine

The Julia component provides high-performance PDF text extraction with parallel processing support.

using Docudactyl

# Single file extraction
result = extract_text("document.pdf")
if result.success
    println(text_content(result.document))
end

# Parallel batch processing
using Distributed
addprocs(4)
@everywhere using Docudactyl

files = readdir("pdfs/", join=true)
results = parallel_extract(files)

# Analysis
stats = analyze_content(result.document)
println("Words: $(stats.total_words)")
println("Unique: $(stats.unique_words)")

# Export to Scheme
export_scheme(result.document, "output.scm")

4.1.1. Features

  • Parallel Processing: Distribute extraction across multiple cores

  • Position Preservation: Maintains text block coordinates for layout reconstruction

  • Multiple Formats: Export to CSV, JSON, plain text, or Scheme

  • Statistical Analysis: Word frequency, coverage estimation, comparative analysis
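
As an illustration of the statistical analysis feature, the Python sketch below computes total and unique word counts plus a small frequency table from plain text. The tokenization rule (lowercased runs of letters, digits, and apostrophes) and the `word_stats` helper are assumptions made for this example; they are not taken from Docudactyl's Julia code.

```python
import re
from collections import Counter


def word_stats(text: str, top_n: int = 2) -> dict:
    """Count words in text using a simple, assumed tokenization rule."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    freq = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(freq),
        "top_words": freq.most_common(top_n),
    }


sample = "The report names the witness. The witness declined."
stats = word_stats(sample)
print(stats["total_words"], stats["unique_words"], stats["top_words"])
```

The field names mirror `total_words` and `unique_words` from the Julia example above; a different tokenizer would give different counts for the same document.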

4.2. OCaml: Scheme Transformer

The OCaml component transforms extracted documents into machine-readable Scheme (S-expression) format.

# From PDF directly
docudactyl-scm document.pdf -o document.scm

# From Julia JSON output
docudactyl-scm extracted.json -o extracted.scm

# Minimal output (data only, no helper functions)
docudactyl-scm --minimal document.pdf

4.2.1. Output Format

The Scheme output includes the document structure and accessor functions:

(define docudactyl-document
  `((metadata
     (filepath . "document.pdf")
     (sha256 . "abc123...")
     (extracted-at . "2025-12-27T10:00:00Z")
     (pdf-metadata
       (title . "Example Document")
       (author . "Unknown")))

    (statistics
     (total-pages . 10)
     (total-words . 5000))

    (pages
      ((page-number . 1)
       (dimensions (width . 612.0) (height . 792.0))
       (lines
         "First line of text"
         "Second line of text")
       (blocks
         ((text . "First")
          (bounds (x0 . 72.0) (y0 . 72.0) (x1 . 100.0) (y1 . 84.0))
          (font-size . 12.0)))))))

;; Accessor functions
(define (docudactyl-get-pages doc) ...)
(define (docudactyl-page-text page) ...)
(define (docudactyl-document-text doc) ...)
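
To make the mapping from JSON-like data to this S-expression layout concrete, here is a small Python sketch. It assumes the convention visible in the sample above: scalar fields become dotted pairs, nested records become a key followed by their fields, and lists become a key followed by their items. The `to_sexpr` helper is hypothetical and is not the OCaml transformer's implementation.

```python
def atom(value) -> str:
    """Render a scalar as a Scheme literal."""
    if isinstance(value, bool):
        return "#t" if value else "#f"
    if isinstance(value, (int, float)):
        return repr(value)
    # Strings: escape backslashes and double quotes.
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def fields(record: dict) -> list[str]:
    """Render each key/value pair of a record as one S-expression."""
    return [to_sexpr(key, value) for key, value in record.items()]


def to_sexpr(key: str, value) -> str:
    """Render one named field, recursing into nested records and lists."""
    name = key.replace("_", "-")  # Scheme-style identifiers
    if isinstance(value, dict):
        return "(" + " ".join([name, *fields(value)]) + ")"
    if isinstance(value, list):
        items = [
            "(" + " ".join(fields(v)) + ")" if isinstance(v, dict) else atom(v)
            for v in value
        ]
        return "(" + " ".join([name, *items]) + ")"
    return f"({name} . {atom(value)})"


page = {
    "page_number": 1,
    "dimensions": {"width": 612.0, "height": 792.0},
    "lines": ["First line of text", "Second line of text"],
}
print(to_sexpr("pages", [page]))
```

The printed form is a single line; a real emitter would also handle indentation and any Scheme-specific escaping rules beyond the two shown here.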

4.3. Ada: Terminal UI

The Ada component provides an interactive terminal interface for document exploration.

# Launch TUI with a document
docudactyl-tui extracted.json

4.3.1. Keybindings

Key       Action
q, Esc    Quit application
?         Show help
/         Search in document
j, Down   Scroll down
k, Up     Scroll up
n, PgDn   Next page
p, PgUp   Previous page
g         Go to first page
G         Go to last page
Tab       Switch view mode
a         Show analysis

5. Justfile Recipes

The project uses just as a task runner.

# Show all recipes
just

# Build
just build            # Build all components
just build-julia      # Build Julia only
just build-ocaml      # Build OCaml only
just build-ada        # Build Ada only

# Run
just extract <pdf>    # Extract text from PDF
just transform <in>   # Transform to Scheme
just tui [file]       # Launch TUI
just pipeline <pdf>   # Full extraction pipeline
just repl             # Julia REPL

# Parallel processing
just parallel-extract <dir> [workers]

# Test
just test             # Run all tests
just test-julia       # Julia tests only
just test-ocaml       # OCaml tests only

# Dependencies
just deps             # Install all dependencies
just deps-julia       # Julia dependencies
just deps-ocaml       # OCaml dependencies

# Utilities
just loc              # Count lines of code
just clean            # Clean build artifacts

6. Directory Structure

docudactyl/
├── src/
│   ├── julia/               # Julia extraction engine
│   │   ├── Project.toml     # Package manifest
│   │   ├── Docudactyl.jl    # Main module
│   │   ├── extract.jl       # PDF extraction
│   │   ├── parallel.jl      # Parallel processing
│   │   ├── analysis.jl      # Data analysis
│   │   ├── export.jl        # Export formats
│   │   └── cli.jl           # Command-line interface
│   │
│   ├── ocaml/               # OCaml transformer
│   │   ├── dune-project     # Dune project
│   │   ├── dune             # Build config
│   │   ├── docudactyl_scm.ml    # Main entry
│   │   ├── document_types.ml    # Type definitions
│   │   ├── pdf_parser.ml        # PDF/JSON parsing
│   │   └── scheme_emit.ml       # Scheme generation
│   │
│   └── ada/                 # Ada TUI
│       ├── docudactyl.gpr   # GNAT project
│       ├── docudactyl_tui.adb   # Main TUI
│       ├── terminal_utils.ads   # Terminal handling
│       ├── terminal_utils.adb
│       ├── document_model.ads   # Document types
│       └── document_model.adb
│
├── examples/                # Example documents
├── docs/                    # Documentation
├── justfile                 # Task runner
└── README.adoc              # This file

7. Requirements

7.1. System Dependencies

  • Julia 1.9+ with packages: PDFIO, DataFrames, CSV, JSON3, ArgParse

  • OCaml 4.14+ with packages: camlpdf, yojson, cmdliner, dune

  • Ada GNAT (GCC) with gprbuild

7.2. Installation via Guix

guix shell -D -f guix.scm
just deps
just build

8. Use Cases

8.1. Document Verification

Verify that redaction was properly applied:

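result = extract_text("redacted.pdf")
if result.success
    text = text_content(result.document)
    # Note: this flags any extractable text, including text that was never
    # redacted. Review the output to see whether redacted passages appear.
    if !isempty(text)
        println("WARNING: Text found under redactions!")
        println(text)
    end
end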
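
A stricter check compares each extracted text block against the rectangles that were drawn as redactions. The Python sketch below assumes block bounds in the `x0`/`y0`/`x1`/`y1` form shown in the Scheme output, and a separately obtained list of redaction rectangles. That list is a hypothetical input: this README does not describe whether or how Docudactyl exposes it.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles share a region of positive area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def blocks_under_redactions(blocks, redactions):
    """Return the text of every block whose bounds overlap any redaction."""
    return [
        text
        for text, bounds in blocks
        if any(overlaps(bounds, r) for r in redactions)
    ]


# Hypothetical sample data: two text blocks and one black box over the name.
blocks = [
    ("Name:", Rect(72.0, 72.0, 100.0, 84.0)),
    ("Jane Doe", Rect(105.0, 72.0, 160.0, 84.0)),
]
redactions = [Rect(104.0, 70.0, 162.0, 86.0)]
print(blocks_under_redactions(blocks, redactions))
```

Only blocks that intersect a redaction rectangle are reported, so ordinary visible text no longer triggers a warning.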

8.2. Batch Analysis

Analyze a collection of documents:

results = parallel_extract_dir("documents/"; recursive=true)
summary = batch_process(results; output_dir="analysis/", format=:csv)
println("Processed $(summary.total) documents in $(summary.extraction_time_s)s")
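
When processing a large collection, it helps if one unreadable file does not abort the whole batch. The Python sketch below shows that pattern with a thread pool and a per-file result carrying a `success` flag, loosely mirroring the `result.success` field used in the Julia examples. The `extract_one` stand-in only checks for the `%PDF-` file header; it and `extract_all` are hypothetical names, not part of Docudactyl.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Result:
    path: str
    success: bool
    detail: str


def extract_one(path: str) -> Result:
    """Stand-in for real extraction: read the file and check its header."""
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        return Result(path, False, str(exc))
    if not data.startswith(b"%PDF-"):
        return Result(path, False, "missing %PDF- header")
    return Result(path, True, f"{len(data)} bytes")


def extract_all(paths, workers: int = 4) -> list[Result]:
    """Process every path; failures are recorded per file, not raised."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, paths))  # keeps input order


# Demonstration with temporary files: one valid header, one invalid, one missing.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "good.pdf")
    good.write_bytes(b"%PDF-1.7\n")
    bad = Path(tmp, "bad.pdf")
    bad.write_bytes(b"not a pdf")
    missing = Path(tmp, "missing.pdf")
    results = extract_all([str(good), str(bad), str(missing)])

for r in results:
    print(Path(r.path).name, r.success, r.detail)
```

Threads are enough for this I/O-bound stand-in; CPU-heavy extraction would more likely use separate processes, as the Julia component does with `Distributed`.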

8.3. Lisp Processing

Process extracted documents in Guile Scheme:

(load "document.scm")

;; Get all text
(define text (docudactyl-document-text docudactyl-document))

;; Search for patterns
(define matches (docudactyl-search docudactyl-document "confidential"))

;; Access specific pages
(define page-1 (docudactyl-get-page docudactyl-document 1))
(display (docudactyl-page-text page-1))

9. Ethical Use

This tool is designed for:

  • Document analysis and archival review

  • Research and verification of redaction practices

  • Accessibility improvements for PDF content

  • Data recovery from improperly redacted public documents

Important: This tool does not bypass encryption, run OCR on hidden images, or circumvent actual security measures. It only extracts text that remains in the PDF stream but is visually obscured.

10. RSR Compliance

This is a Tier 1 RSR project using:

  • Julia: Batch processing (per RSR allowance)

  • OCaml: Language-specific tooling (AffineScript compiler category)

  • Ada: Safety-critical systems

11. License

SPDX-License-Identifier: MPL-2.0-or-later