
BogdanAlRa/entigraph

EntiGraph

Synthetic corpus expansion via entity graph traversal. Turn a small corpus into training-ready data.

Based on Stanford's "Synthetic Continued Pretraining" (ICLR 2025 Oral). Takes any document and expands it 5-50x by extracting entities, generating all combinatorial pairs and sampled triples, then producing structured multi-section essays that reframe the source text through each entity's perspective.

Why EntiGraph?

When your corpus is too small for continued pretraining (CPT), you need more tokens. In the paper's experiments, simple rephrasing saturates at ~38M tokens, while EntiGraph keeps scaling past 455M tokens without a plateau. The difference is that its diversity comes from combinatorial entity graph traversal, not prompt tricks.

Install

pip install git+https://github.com/BogdanAlRa/entigraph.git

# With PDF support
pip install "entigraph[pdf] @ git+https://github.com/BogdanAlRa/entigraph.git"

# With everything
pip install "entigraph[all] @ git+https://github.com/BogdanAlRa/entigraph.git"

Quick Start

CLI

# Single file
entigraph book.txt --provider openai --api-key sk-xxx

# PDF with Gemini
entigraph paper.pdf --provider gemini --api-key AIza...

# Directory of files
entigraph ./corpus/ --provider cerebras --api-key csk-xxx

# Use env var instead of flag
export OPENAI_API_KEY=sk-xxx
entigraph book.txt

# Resume interrupted job (state saved after every 20 items)
entigraph big-book.txt --resume

# Check progress
entigraph --status -o ./output

API Server

# Start the server
entigraph --serve --port 8000

# Submit a document
curl -X POST http://localhost:8000/expand \
  -F file=@book.pdf \
  -F provider=openai \
  -F api_key=sk-xxx

# Check progress
curl http://localhost:8000/status/{job_id}

# Download expanded corpus
curl http://localhost:8000/result/{job_id} -o expanded.txt

# Download extracted entities
curl http://localhost:8000/entities/{job_id} -o entities.json

Python Library

from entigraph import EntiGraphExpander

e = EntiGraphExpander(provider="gemini", api_key="AIza...")
result = e.process_file("book.pdf")

print(f"Expansion: {result['expansion_ratio']}")
print(f"Entities: {result['entities']}")
print(f"Pairs: {result['pairs_generated']}, Triples: {result['triples_generated']}")

How It Works

Phase 1: Entity Extraction

For each ~20K character chunk, an LLM extracts all entities (people, places, concepts, frameworks, methods, etc.) with the instruction to "exhaust as many entities as possible."
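The chunking step can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name `chunk_text` and the choice to cut at paragraph breaks are assumptions; only the ~20K character chunk size comes from the description above.

```python
def chunk_text(text: str, chunk_chars: int = 20_000) -> list[str]:
    """Split text into chunks of at most chunk_chars characters.

    Assumption: prefer to cut at the last paragraph break (blank line)
    inside the window so paragraphs stay whole; otherwise cut hard.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Look for a paragraph boundary inside the current window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2  # keep the blank line with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk would then be sent to the LLM with the entity extraction instruction, and the per-chunk entity lists merged and deduplicated.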

Phase 2: Combinatorial Work Queue

  • All pairs C(n,2) are generated and shuffled (seed=42)
  • Triples C(n,3): all triples are generated for n<=30; for larger entity sets they are sampled proportionally
  • This combinatorial explosion is what drives diversity
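A rough sketch of how such a work queue could be built is shown below. The function name `build_work_queue` and its parameters are hypothetical, and the sampling rule for large entity sets (one sampled triple per pair) is an assumption made for illustration; the description above does not specify the exact proportion.

```python
import itertools
import math
import random


def build_work_queue(entities, seed=42, all_triples_up_to=30, max_triples=None):
    """Return (pairs, triples) as shuffled lists of entity tuples."""
    rng = random.Random(seed)
    entities = sorted(set(entities))  # fixed order keeps the shuffle reproducible

    # All C(n, 2) pairs, shuffled with the fixed seed.
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)

    n = len(entities)
    if n <= all_triples_up_to:
        # Small entity sets: enumerate every C(n, 3) triple.
        triples = list(itertools.combinations(entities, 3))
    else:
        # Assumption: sample as many triples as there are pairs unless capped.
        budget = len(pairs) if max_triples is None else max_triples
        budget = min(budget, math.comb(n, 3))
        seen = set()
        while len(seen) < budget:
            seen.add(tuple(sorted(rng.sample(entities, 3))))
        triples = sorted(seen)
    rng.shuffle(triples)
    return pairs, triples
```

For 10 entities this yields 45 pairs and 120 triples; at 40 entities the pair count is 780, while the full triple count would already be 9,880, which is why sampling matters for larger sets.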

Phase 3: Structured Essay Generation

For each pair (E_i, E_j), the LLM generates 3 sections:

  1. Source text rephrased emphasizing entity 1
  2. Source text rephrased emphasizing entity 2
  3. Analysis of interaction between both entities

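For each triple (E_i, E_j, E_k), the LLM generates 4 sections:

  1. Source text rephrased emphasizing entity 1
  2. Source text rephrased emphasizing entity 2
  3. Source text rephrased emphasizing entity 3
  4. Analysis of the three-way interaction among all three entities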

The full source document is passed in every call, so the model reframes existing knowledge rather than generating content de novo. This grounding acts as the implicit quality filter.
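To make the section structure concrete, here is a hedged sketch of how a prompt for one pair or triple might be assembled. The function `build_essay_prompt` and the prompt wording are illustrative assumptions, not the prompts the package or the paper actually uses.

```python
def build_essay_prompt(source_text: str, entities: tuple[str, ...]) -> str:
    """Build a prompt asking for one rephrasing per entity plus an interaction analysis."""
    if len(entities) not in (2, 3):
        raise ValueError("expected a pair or a triple of entities")

    # One rephrasing section per entity (sections 1-2 or 1-3).
    sections = [
        f"{i}. Rephrase the source text with emphasis on {name}."
        for i, name in enumerate(entities, start=1)
    ]
    # Final section: the interaction analysis.
    kind = "interaction between" if len(entities) == 2 else "three-way interaction among"
    sections.append(f"{len(entities) + 1}. Analyze the {kind} {', '.join(entities)}.")

    # The full source text is always included so the model stays grounded.
    return (
        "Using only the source text below, write an essay with these sections:\n"
        + "\n".join(sections)
        + "\n\n--- SOURCE TEXT ---\n"
        + source_text
    )
```

Calling `build_essay_prompt(text, ("Alice", "Bob"))` produces a three-section request; passing three entities produces four sections.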

Supported Providers (BYOK)

Provider    Default Model          Env Var
openai      gpt-4.1-mini           OPENAI_API_KEY
gemini      gemini-2.5-flash       GOOGLE_API_KEY
anthropic   claude-sonnet-4-6      ANTHROPIC_API_KEY
cerebras    qwen-3-235b            CEREBRAS_API_KEY
groq        llama-3.3-70b          GROQ_API_KEY
openrouter  gemini-2.5-flash:free  OPENROUTER_API_KEY
custom      (you specify)          (you specify)

Output

  • {name}_expanded.txt -- original source + all generated essays (plain text, ready for CPT)
  • {name}_entities.json -- extracted entities for inspection
  • .state-{id}.json -- progress state (enables resume)
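The resume behavior can be pictured with the sketch below: progress is written to a JSON state file after every 20 completed items, and a restarted run skips whatever the file already lists. The helper names and the `{"done": [...]}` schema are hypothetical; the real `.state-{id}.json` layout may differ.

```python
import json
import os
from pathlib import Path


def load_done(state_path: Path) -> set[int]:
    """Return the indices of already completed items (empty if no state file)."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text())["done"])


def save_done(state_path: Path, done: set[int]) -> None:
    """Write the state atomically so an interruption cannot leave a partial file."""
    tmp = state_path.with_name(state_path.name + ".tmp")
    tmp.write_text(json.dumps({"done": sorted(done)}))
    os.replace(tmp, state_path)


def run_with_resume(items, process, state_path: Path, save_every: int = 20) -> None:
    """Process items in order, skipping finished ones and checkpointing periodically."""
    done = load_done(state_path)
    for index, item in enumerate(items):
        if index in done:
            continue
        process(item)
        done.add(index)
        if len(done) % save_every == 0:
            save_done(state_path, done)
    save_done(state_path, done)
```

With this scheme an interruption loses at most the items finished since the last checkpoint; those are simply regenerated on resume.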

Options

Flag           Default             Description
--provider     openai              LLM provider
--api-key      (env var)           API key
--model        (provider default)  Override model
--workers      4                   Concurrent API calls
--max-triples  0 (paper default)   Cap triple count
--seed         42                  Random seed for shuffling
--output       ./output            Output directory
--resume       false               Resume interrupted job
--serve        false               Start API server
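As an illustration of what a worker count of 4 means in practice, the sketch below fans work items out to a bounded thread pool. Both `run_concurrently` and `fake_llm_call` are hypothetical names used only for this example; how the package actually schedules its API calls is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(items, call, workers: int = 4):
    """Apply call to every item using at most `workers` threads.

    Results are returned in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call, items))


def fake_llm_call(pair):
    # Hypothetical stand-in for a real provider request.
    first, second = pair
    return f"essay about {first} and {second}"


essays = run_concurrently([("A", "B"), ("A", "C"), ("B", "C")], fake_llm_call)
```

Threads suit this workload because each call spends most of its time waiting on the network; raising the worker count mainly trades speed against the provider's rate limits.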

Paper Reference

Yang, Z., Band, N., Li, S., Candès, E., and Hashimoto, T. "Synthetic Continued Pretraining." ICLR 2025 (Oral).

  • Source: 265 articles, 1.3M tokens
  • Expanded: 455M tokens (350x)
  • Result: Log-linear scaling on QA, no plateau observed
  • Key insight: Entity graph traversal >> prompt-based rephrasing

License

MIT
