EntiGraph

Synthetic corpus expansion via entity graph traversal. Turn a small corpus into training-ready data.

Based on Stanford's "Synthetic Continued Pretraining" (ICLR 2025 Oral). Takes any document and expands it 5-50x by extracting entities, generating all combinatorial pairs and sampled triples, then producing structured multi-section essays that reframe the source text through each entity's perspective.

Why EntiGraph?

When your corpus is too small for continued pretraining (CPT), you need more tokens. Simple rephrasing saturates at ~38M tokens. EntiGraph keeps scaling to 455M tokens with no observed plateau, because its diversity comes from combinatorial entity graph traversal rather than prompt tricks.

Install

pip install git+https://github.com/BogdanAlRa/entigraph.git

# With PDF support
pip install "entigraph[pdf] @ git+https://github.com/BogdanAlRa/entigraph.git"

# With everything
pip install "entigraph[all] @ git+https://github.com/BogdanAlRa/entigraph.git"

Quick Start

CLI

# Single file
entigraph book.txt --provider openai --api-key sk-xxx

# PDF with Gemini
entigraph paper.pdf --provider gemini --api-key AIza...

# Directory of files
entigraph ./corpus/ --provider cerebras --api-key csk-xxx

# Use env var instead of flag
export OPENAI_API_KEY=sk-xxx
entigraph book.txt

# Resume interrupted job (state saved after every 20 items)
entigraph big-book.txt --resume

# Check progress
entigraph --status -o ./output

API Server

entigraph --serve --port 8000

# Submit a document
curl -X POST http://localhost:8000/expand \
  -F file=@book.pdf \
  -F provider=openai \
  -F api_key=sk-xxx

# Check progress
curl http://localhost:8000/status/{job_id}

# Download expanded corpus
curl http://localhost:8000/result/{job_id} -o expanded.txt

# Download extracted entities
curl http://localhost:8000/entities/{job_id} -o entities.json

Python Library

from entigraph import EntiGraphExpander

e = EntiGraphExpander(provider="gemini", api_key="AIza...")
result = e.process_file("book.pdf")

print(f"Expansion: {result['expansion_ratio']}")
print(f"Entities: {result['entities']}")
print(f"Pairs: {result['pairs_generated']}, Triples: {result['triples_generated']}")

How It Works

Phase 1: Entity Extraction

For each ~20K character chunk, an LLM extracts all entities (people, places, concepts, frameworks, methods, etc.) with the instruction to "exhaust as many entities as possible."
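
As a rough illustration of the chunking step, the sketch below splits a document into fixed 20,000-character windows and merges the entity lists returned for each chunk. The function names, the hard character boundaries, and the case-insensitive de-duplication are assumptions made for illustration; this README does not say how EntiGraph cuts chunks or merges entities, and the LLM call itself is left out.

```python
def chunk_text(text: str, chunk_size: int = 20_000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def merge_entities(per_chunk: list[list[str]]) -> list[str]:
    """Merge per-chunk entity lists, dropping case-insensitive duplicates
    while keeping first-seen order and spelling."""
    seen: set[str] = set()
    merged: list[str] = []
    for entities in per_chunk:
        for name in entities:
            cleaned = name.strip()
            key = cleaned.casefold()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


chunks = chunk_text("x" * 45_000)
print([len(c) for c in chunks])  # [20000, 20000, 5000]

# Hypothetical per-chunk results standing in for LLM output.
entities = merge_entities([
    ["Ada Lovelace", "Analytical Engine"],
    ["analytical engine", "Babbage"],
])
print(entities)  # ['Ada Lovelace', 'Analytical Engine', 'Babbage']
```

De-duplicating before Phase 2 matters because every extra entity adds n-1 new pairs, so duplicates would inflate the work queue with near-identical prompts.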

Phase 2: Combinatorial Work Queue

All pairs C(n,2) are generated and shuffled (seed=42)
Triples C(n,3): all for n<=30, sampled proportionally for larger sets
This combinatorial explosion is what drives diversity
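
The queue construction described above can be sketched with the standard library. This is a minimal sketch, not the package's actual code: `build_work_queue` and `triple_budget` are hypothetical names, and because the README does not define "sampled proportionally", the default budget for large entity sets (one triple per pair) is purely an assumption.

```python
import itertools
import math
import random


def build_work_queue(entities, seed=42, full_triple_limit=30, triple_budget=None):
    """Return (pairs, triples), both shuffled reproducibly from `seed`."""
    rng = random.Random(seed)
    n = len(entities)

    # All C(n, 2) pairs, shuffled.
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)

    if n <= full_triple_limit:
        # Small sets: enumerate every C(n, 3) triple.
        triples = list(itertools.combinations(entities, 3))
    else:
        # Large sets: draw distinct triples without enumerating all of them.
        # Assumption: the default budget equals the number of pairs.
        budget = len(pairs) if triple_budget is None else triple_budget
        budget = min(budget, math.comb(n, 3))
        chosen = set()
        while len(chosen) < budget:
            chosen.add(tuple(sorted(rng.sample(range(n), 3))))
        triples = [tuple(entities[i] for i in idx) for idx in sorted(chosen)]
    rng.shuffle(triples)
    return pairs, triples


pairs, triples = build_work_queue([f"entity_{i}" for i in range(5)])
print(len(pairs), len(triples))  # 10 10
```

Using a private `random.Random(seed)` instead of the global generator keeps the queue order identical across runs, which is what makes a fixed seed useful for resuming an interrupted job.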

Phase 3: Structured Essay Generation

For each pair (E_i, E_j), the LLM generates 3 sections:

Source text rephrased emphasizing entity 1
Source text rephrased emphasizing entity 2
Analysis of interaction between both entities

For each triple (E_i, E_j, E_k), the LLM generates 4 sections:

Source text rephrased through each entity's lens (sections 1-3)
Three-way interaction analysis (section 4)

The full source document is passed in every call, so the model reframes existing knowledge rather than generating content de novo. This grounding acts as the implicit quality filter.
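
To make the section layout concrete, here is a sketch of a prompt builder for pairs and triples. The function name and the prompt wording are hypothetical; the README only specifies the section structure and that the full source document accompanies every call.

```python
def build_prompt(source_text: str, entities: tuple[str, ...]) -> str:
    """Build an essay prompt for an entity pair (3 sections) or triple (4 sections)."""
    if len(entities) not in (2, 3):
        raise ValueError("expected a pair or a triple of entities")

    # One rephrasing section per entity.
    sections = [
        f"{i}. Rephrase the source text with emphasis on {name}."
        for i, name in enumerate(entities, start=1)
    ]
    # Final section: interaction analysis across all entities in the tuple.
    scope = "both entities" if len(entities) == 2 else "all three entities"
    sections.append(
        f"{len(entities) + 1}. Analyze the interaction between {scope}: "
        f"{', '.join(entities)}."
    )

    return "\n".join([
        "Source document:",
        source_text,  # the full document is included in every call
        "",
        f"Write an essay with {len(sections)} sections:",
        *sections,
        "Use only information found in the source document.",
    ])


print(build_prompt("<full source text>", ("Entity A", "Entity B")))
```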

Supported Providers (BYOK)

Provider	Default Model	Env Var
`openai`	gpt-4.1-mini	`OPENAI_API_KEY`
`gemini`	gemini-2.5-flash	`GOOGLE_API_KEY`
`anthropic`	claude-sonnet-4-6	`ANTHROPIC_API_KEY`
`cerebras`	qwen-3-235b	`CEREBRAS_API_KEY`
`groq`	llama-3.3-70b	`GROQ_API_KEY`
`openrouter`	gemini-2.5-flash:free	`OPENROUTER_API_KEY`
`custom`	(you specify)	(you specify)

Output

{name}_expanded.txt -- original source + all generated essays (plain text, ready for CPT)
{name}_entities.json -- extracted entities for inspection
.state-{id}.json -- progress state (enables resume)
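
The resume behavior can be pictured with a small checkpointing loop. This is a sketch under assumptions: the state file layout (a single `completed` counter) and the function name are hypothetical and may not match what EntiGraph actually writes; only the save interval of 20 items comes from this README.

```python
import json
import os
import tempfile
from pathlib import Path


def run_with_checkpoints(items, handle, state_path, save_every=20):
    """Process items in order, recording progress every `save_every` items.

    Returns the number of items handled in this run.
    """
    state_path = Path(state_path)
    start = 0
    if state_path.exists():
        start = json.loads(state_path.read_text())["completed"]

    for index in range(start, len(items)):
        handle(items[index])
        completed = index + 1
        if completed % save_every == 0 or completed == len(items):
            # Write to a temp file first so a crash never leaves a half-written state.
            tmp_path = state_path.with_suffix(".tmp")
            tmp_path.write_text(json.dumps({"completed": completed}))
            os.replace(tmp_path, state_path)
    return len(items) - start


def crash_at_45(item):
    if item == 45:
        raise RuntimeError("simulated interruption")


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "state.json"
    work = list(range(100))
    try:
        run_with_checkpoints(work, crash_at_45, state_file)
    except RuntimeError:
        pass
    saved = json.loads(state_file.read_text())["completed"]
    resumed = run_with_checkpoints(work, lambda item: None, state_file)

print(saved, resumed)  # 40 60
```

In this sketch a resumed run repeats up to 19 items that finished after the last checkpoint, so the per-item handler should be safe to run twice.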

Options

Flag	Default	Description
`--provider`	openai	LLM provider
`--api-key`	(env var)	API key
`--model`	(provider default)	Override model
`--workers`	4	Concurrent API calls
`--max-triples`	0 (paper default)	Cap triple count
`--seed`	42	Random seed for shuffling
`--output`	./output	Output directory
`--resume`	false	Resume interrupted job
`--serve`	false	Start API server

Paper Reference

Yang, Z., Band, N., Li, S., Candès, E., Hashimoto, T. "Synthetic Continued Pretraining." ICLR 2025 (Oral).

Source: 265 articles, 1.3M tokens

Expanded: 455M tokens (350x)

Result: Log-linear scaling on QA, no plateau observed

Key insight: Entity graph traversal >> prompt-based rephrasing

License

MIT
