Docudactyl

1. License & Philosophy

This project must declare MPL-2.0-or-later for platform/tooling compatibility.

Philosophy: Palimpsest. The Palimpsest-MPL (PMPL) text is provided in license/PMPL-1.0.txt, and the canonical source is the palimpsest-license repository.

2. Overview

Docudactyl is a multi-language PDF text extraction and transformation toolkit designed for:

Parallel processing of large document collections
Data analysis on extracted content
Machine-readable transformation to Scheme (S-expressions)
Interactive exploration via terminal UI

The tool recovers text from poorly redacted PDF files where visual redaction (black rectangles) was applied without removing the underlying text data from the PDF stream.
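
To illustrate why such text is recoverable, the following Python sketch scans a hand-written, uncompressed page content stream in which text is drawn and then covered by a filled black rectangle. The stream, the regular expressions, and the function names are illustrative assumptions, not Docudactyl's implementation (the extraction engine is written in Julia). Real PDFs usually compress their streams and use font-specific encodings, which this sketch ignores.

```python
import re

# Hypothetical, uncompressed content stream: the text is drawn first,
# then a filled black rectangle is painted over the same area.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Account number: 12345) Tj
ET
0 0 0 rg
70 715 200 16 re
f
"""

# A literal string operand followed by the Tj (show text) operator.
TJ_PATTERN = re.compile(rb"\((.*?)\)\s*Tj")

# A rectangle path (re) that is immediately filled (f).
FILLED_RECT_PATTERN = re.compile(rb"\bre\s+f\b")


def show_text_operands(stream: bytes) -> list[str]:
    """Return the literal strings passed to Tj (no escape or encoding handling)."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


def has_filled_rectangle(stream: bytes) -> bool:
    """Report whether the stream draws a rectangle and then fills it."""
    return FILLED_RECT_PATTERN.search(stream) is not None


recovered = show_text_operands(CONTENT_STREAM)
covered = has_filled_rectangle(CONTENT_STREAM)
```

The rectangle only paints over the glyphs. The string operand of `Tj` is still present in the stream, so `recovered` contains the covered text even though `covered` is true.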

2.1. Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Docudactyl                                │
├─────────────────┬─────────────────────┬─────────────────────────┤
│     Julia       │       OCaml         │         Ada             │
│  (Extraction)   │   (Transformer)     │        (TUI)            │
├─────────────────┼─────────────────────┼─────────────────────────┤
│ • PDF parsing   │ • PDF → Scheme      │ • Interactive viewer    │
│ • Parallel proc │ • JSON → Scheme     │ • Document navigation   │
│ • Data analysis │ • S-expr generation │ • Search interface      │
│ • Export (CSV,  │ • Accessor funcs    │ • Analysis display      │
│   JSON, Scheme) │                     │                         │
└─────────────────┴─────────────────────┴─────────────────────────┘

3. Quick Start

# Install dependencies
just deps

# Build all components
just build

# Extract text from a PDF
just extract document.pdf -o output.json -f json

# Transform to Scheme
just transform output.json output.scm

# Or run the full pipeline
just pipeline document.pdf

# Launch the TUI viewer
just tui output.json

4. Components

4.1. Julia: Core Extraction Engine

The Julia component provides high-performance PDF text extraction with parallel processing support.

using Docudactyl

# Single file extraction
result = extract_text("document.pdf")
if result.success
    println(text_content(result.document))
end

# Parallel batch processing
using Distributed
addprocs(4)
@everywhere using Docudactyl

files = readdir("pdfs/", join=true)
results = parallel_extract(files)

# Analysis
stats = analyze_content(result.document)
println("Words: $(stats.total_words)")
println("Unique: $(stats.unique_words)")

# Export to Scheme
export_scheme(result.document, "output.scm")

4.1.1. Features

Parallel Processing: Distribute extraction across multiple cores
Position Preservation: Maintains text block coordinates for layout reconstruction
Multiple Formats: Export to CSV, JSON, plain text, or Scheme
Statistical Analysis: Word frequency, coverage estimation, comparative analysis
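
As a rough illustration of the word-frequency part of the analysis, the Python sketch below counts total and unique words in a string. The key names `total_words` and `unique_words` mirror the fields printed in the Julia example above; the tokenization rule and the `word_stats` function are assumptions made for this sketch and may differ from what `analyze_content` does.

```python
import re
from collections import Counter


def word_stats(text: str, top_n: int = 3) -> dict:
    """Count words case-insensitively; a word is a run of letters, digits, or apostrophes."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "most_common": counts.most_common(top_n),
    }


stats = word_stats("The cat and the hat")
```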

4.2. OCaml: Scheme Transformer

The OCaml component transforms extracted documents into machine-readable Scheme (S-expression) format.

# From PDF directly
docudactyl-scm document.pdf -o document.scm

# From Julia JSON output
docudactyl-scm extracted.json -o extracted.scm

# Minimal output (data only, no helper functions)
docudactyl-scm --minimal document.pdf

4.2.1. Output Format

The Scheme output includes the document structure and accessor functions:

(define docudactyl-document
  `((metadata
     (filepath . "document.pdf")
     (sha256 . "abc123...")
     (extracted-at . "2025-12-27T10:00:00Z")
     (pdf-metadata
       (title . "Example Document")
       (author . "Unknown")))

    (statistics
     (total-pages . 10)
     (total-words . 5000))

    (pages
      ((page-number . 1)
       (dimensions (width . 612.0) (height . 792.0))
       (lines
         "First line of text"
         "Second line of text")
       (blocks
         ((text . "First")
          (bounds (x0 . 72.0) (y0 . 72.0) (x1 . 100.0) (y1 . 84.0))
          (font-size . 12.0)))))))

;; Accessor functions
(define (docudactyl-get-pages doc) ...)
(define (docudactyl-page-text page) ...)
(define (docudactyl-document-text doc) ...)
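
The mapping from a nested document to the association-list layout shown above can be sketched in a few lines of Python. This is only an illustration of the output shape: the real transformer is the OCaml component, and the input key names, the function names, and the string-escaping rule (borrowed from JSON) are assumptions.

```python
import json


def atom(value) -> str:
    """Render a scalar: booleans as #t/#f, numbers as-is, everything else as a quoted string."""
    if isinstance(value, bool):
        return "#t" if value else "#f"
    if isinstance(value, (int, float)):
        return repr(value)
    return json.dumps(str(value), ensure_ascii=False)


def element(value) -> str:
    """Render a list element: mappings become nested alists, scalars become atoms."""
    if isinstance(value, dict):
        return "(" + alist_body(value) + ")"
    return atom(value)


def entry(key: str, value) -> str:
    """Render one key: scalars as dotted pairs, mappings and lists as (key item ...)."""
    if isinstance(value, dict):
        return "(" + " ".join([key, alist_body(value)]) + ")"
    if isinstance(value, list):
        return "(" + " ".join([key, *(element(v) for v in value)]) + ")"
    return f"({key} . {atom(value)})"


def alist_body(mapping: dict) -> str:
    """Render all entries of a mapping, separated by spaces."""
    return " ".join(entry(k, v) for k, v in mapping.items())


def to_scheme(document: dict) -> str:
    """Render a whole document as a single association list."""
    return "(" + alist_body(document) + ")"


example = {
    "metadata": {"filepath": "document.pdf"},
    "pages": [{"page-number": 1, "lines": ["First line of text"]}],
}
scheme_text = to_scheme(example)
```

Scalars become dotted pairs, while nested mappings and lists become proper lists headed by their key, which matches the `metadata`, `pages`, and `lines` entries in the sample output.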

4.3. Ada: Terminal UI

The Ada component provides an interactive terminal interface for document exploration.

# Launch TUI with a document
docudactyl-tui extracted.json

4.3.1. Keybindings

Key	Action
`q`, `Esc`	Quit application
`?`	Show help
`/`	Search in document
`j`, `Down`	Scroll down
`k`, `Up`	Scroll up
`n`, `PgDn`	Next page
`p`, `PgUp`	Previous page
`g`	Go to first page
`G`	Go to last page
`Tab`	Switch view mode
`a`	Show analysis

5. Justfile Recipes

The project uses just as a task runner.

# Show all recipes
just

# Build
just build            # Build all components
just build-julia      # Build Julia only
just build-ocaml      # Build OCaml only
just build-ada        # Build Ada only

# Run
just extract <pdf>    # Extract text from PDF
just transform <in>   # Transform to Scheme
just tui [file]       # Launch TUI
just pipeline <pdf>   # Full extraction pipeline
just repl             # Julia REPL

# Parallel processing
just parallel-extract <dir> [workers]

# Test
just test             # Run all tests
just test-julia       # Julia tests only
just test-ocaml       # OCaml tests only

# Dependencies
just deps             # Install all dependencies
just deps-julia       # Julia dependencies
just deps-ocaml       # OCaml dependencies

# Utilities
just loc              # Count lines of code
just clean            # Clean build artifacts

6. Directory Structure

docudactyl/
├── src/
│   ├── julia/               # Julia extraction engine
│   │   ├── Project.toml     # Package manifest
│   │   ├── Docudactyl.jl    # Main module
│   │   ├── extract.jl       # PDF extraction
│   │   ├── parallel.jl      # Parallel processing
│   │   ├── analysis.jl      # Data analysis
│   │   ├── export.jl        # Export formats
│   │   └── cli.jl           # Command-line interface
│   │
│   ├── ocaml/               # OCaml transformer
│   │   ├── dune-project     # Dune project
│   │   ├── dune             # Build config
│   │   ├── docudactyl_scm.ml    # Main entry
│   │   ├── document_types.ml    # Type definitions
│   │   ├── pdf_parser.ml        # PDF/JSON parsing
│   │   └── scheme_emit.ml       # Scheme generation
│   │
│   └── ada/                 # Ada TUI
│       ├── docudactyl.gpr   # GNAT project
│       ├── docudactyl_tui.adb   # Main TUI
│       ├── terminal_utils.ads   # Terminal handling
│       ├── terminal_utils.adb
│       ├── document_model.ads   # Document types
│       └── document_model.adb
│
├── examples/                # Example documents
├── docs/                    # Documentation
├── justfile                 # Task runner
└── README.adoc              # This file

7. Requirements

7.1. System Dependencies

Julia 1.9+ with packages: PDFIO, DataFrames, CSV, JSON3, ArgParse
OCaml 4.14+ with packages: camlpdf, yojson, cmdliner, dune
Ada GNAT (GCC) with gprbuild

7.2. Installation via Guix

guix shell -D -f guix.scm
just deps
just build

8. Use Cases

8.1. Document Verification

Verify that redaction was properly applied:

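using Docudactyl

result = extract_text("redacted.pdf")
if result.success
    text = text_content(result.document)
    # Any extractable text triggers the warning, including text that was
    # never redacted; compare the output against the visible page.
    if !isempty(text)
        println("WARNING: Extractable text found; check it against the redacted regions.")
        println(text)
    end
end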
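
A stricter check would report only the text blocks whose bounds overlap a known redaction rectangle. The Python sketch below shows that geometric test under stated assumptions: block bounds use the `x0`/`y0`/`x1`/`y1` layout from the output format in section 4.2.1, the redaction rectangles are supplied by the caller in the same coordinate system, and `Box`, `overlaps`, and `blocks_under_redactions` are hypothetical names, not part of the Docudactyl API.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle with x0 < x1 and y0 < y1."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share interior area (touching edges do not count)."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def blocks_under_redactions(blocks, redactions):
    """Return the text of every (text, Box) block that overlaps any redaction box."""
    return [text for text, box in blocks if any(overlaps(box, r) for r in redactions)]


blocks = [
    ("First", Box(72.0, 72.0, 100.0, 84.0)),
    ("Visible", Box(72.0, 100.0, 120.0, 112.0)),
]
redactions = [Box(70.0, 70.0, 105.0, 86.0)]
hidden = blocks_under_redactions(blocks, redactions)
```

Here only the block "First" lies under the redaction rectangle, so it is the only text reported.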

8.2. Batch Analysis

Analyze a collection of documents:

using Docudactyl

# Extract every PDF under documents/ (including subdirectories), then summarize
results = parallel_extract_dir("documents/"; recursive=true)
summary = batch_process(results; output_dir="analysis/", format=:csv)
println("Processed $(summary.total) documents in $(summary.extraction_time_s)s")

8.3. Lisp Processing

Process extracted documents in Guile Scheme:

(load "document.scm")

;; Get all text
(define text (docudactyl-document-text docudactyl-document))

;; Search for patterns
(define matches (docudactyl-search docudactyl-document "confidential"))

;; Access specific pages
(define page-1 (docudactyl-get-page docudactyl-document 1))
(display (docudactyl-page-text page-1))

9. Ethical Use

This tool is designed for:

Document analysis and archival review
Research and verification of redaction practices
Accessibility improvements for PDF content
Data recovery from improperly redacted public documents

Important: This tool does not bypass encryption, run OCR on hidden images, or circumvent actual security measures. It only extracts text that remains in the PDF stream but is visually obscured.

10. RSR Compliance

This is a Tier 1 RSR project using:

Julia: Batch processing (per RSR allowance)
OCaml: Language-specific tooling (AffineScript compiler category)
Ada: Safety-critical systems

11. License

SPDX-License-Identifier: MPL-2.0-or-later

hyperpolymath/docudactyl
