Synthetic corpus expansion via entity graph traversal. Turn a small corpus into training-ready data.
Based on Stanford's "Synthetic Continued Pretraining" (ICLR 2025 Oral). Takes any document and expands it 5-50x by extracting entities, generating all combinatorial pairs and sampled triples, then producing structured multi-section essays that reframe the source text through each entity's perspective.
When your corpus is too small for continued pretraining (CPT), you need more tokens. Simple rephrasing saturates at ~38M tokens. EntiGraph keeps scaling past 455M tokens without a plateau because its diversity comes from combinatorial entity graph traversal, not prompt tricks.
Install from GitHub:

    pip install git+https://github.com/BogdanAlRa/entigraph.git
    # With PDF support
    pip install "entigraph[pdf] @ git+https://github.com/BogdanAlRa/entigraph.git"
    # With everything
    pip install "entigraph[all] @ git+https://github.com/BogdanAlRa/entigraph.git"

Run it from the command line:

    # Single file
    entigraph book.txt --provider openai --api-key sk-xxx
    # PDF with Gemini
    entigraph paper.pdf --provider gemini --api-key AIza...
    # Directory of files
    entigraph ./corpus/ --provider cerebras --api-key csk-xxx
    # Use env var instead of flag
    export OPENAI_API_KEY=sk-xxx
    entigraph book.txt
    # Resume interrupted job (state saved after every 20 items)
    entigraph big-book.txt --resume
    # Check progress
    entigraph --status -o ./output

Run it as an HTTP server:

    entigraph --serve --port 8000

With the server running:

    # Submit a document
    curl -X POST http://localhost:8000/expand \
      -F file=@book.pdf \
      -F provider=openai \
      -F api_key=sk-xxx
    # Check progress
    curl http://localhost:8000/status/{job_id}
    # Download expanded corpus
    curl http://localhost:8000/result/{job_id} -o expanded.txt
    # Download extracted entities
    curl http://localhost:8000/entities/{job_id} -o entities.json

Use it from Python:

    from entigraph import EntiGraphExpander

    e = EntiGraphExpander(provider="gemini", api_key="AIza...")
    result = e.process_file("book.pdf")
    print(f"Expansion: {result['expansion_ratio']}")
    print(f"Entities: {result['entities']}")
    print(f"Pairs: {result['pairs_generated']}, Triples: {result['triples_generated']}")

For each ~20K character chunk, an LLM extracts all entities (people, places, concepts, frameworks, methods, etc.) with the instruction to "exhaust as many entities as possible."
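The chunking step before entity extraction can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name `chunk_text` and the choice to split on paragraph breaks are assumptions; only the ~20K character limit comes from the description above.

```python
def chunk_text(text: str, max_chars: int = 20_000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: prefers paragraph boundaries and only cuts
    inside a paragraph when that paragraph alone exceeds the limit.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-split any single paragraph that is longer than the limit.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the LLM with the entity-extraction instruction, and the per-chunk entity lists merged.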
The n extracted entities are then combined:
- All C(n,2) pairs are generated and shuffled (seed=42)
- Triples C(n,3): all of them for n<=30, sampled proportionally for larger sets
- This combinatorial growth is what drives diversity
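The pair and triple enumeration can be sketched with the standard library. The function `build_combinations` and its `triple_budget` parameter are hypothetical. In particular, how the tool sizes its "proportional" sample for large entity sets is not specified here, so this sketch simply caps the sample at C(30,3) = 4060 triples as a stand-in.

```python
import itertools
import math
import random


def build_combinations(entities, seed=42, full_triple_limit=30, triple_budget=4060):
    """Return (pairs, triples) for a list of entity names.

    Assumption: for n > full_triple_limit, triples are sampled uniformly
    without replacement up to triple_budget. The real sampling rule may differ.
    """
    rng = random.Random(seed)
    n = len(entities)

    # All C(n, 2) pairs, shuffled deterministically.
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)

    if n <= full_triple_limit:
        # Small sets: enumerate every C(n, 3) triple.
        triples = list(itertools.combinations(entities, 3))
    else:
        # Large sets: draw distinct index triples instead of enumerating all.
        budget = min(triple_budget, math.comb(n, 3))
        chosen = set()
        while len(chosen) < budget:
            chosen.add(tuple(sorted(rng.sample(range(n), 3))))
        triples = [tuple(entities[i] for i in idx) for idx in sorted(chosen)]
    rng.shuffle(triples)
    return pairs, triples
```

With 40 entities this yields 780 pairs, which shows how quickly the number of generation calls grows relative to the entity count.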
For each pair (E_i, E_j), the LLM generates 3 sections:
- Source text rephrased emphasizing entity 1
- Source text rephrased emphasizing entity 2
- Analysis of interaction between both entities
For each triple (E_i, E_j, E_k), the LLM generates 4 sections:
- Sections 1-3: source text rephrased through each entity's lens
- Section 4: three-way interaction analysis
The full source document is passed in every call, so the model reframes existing knowledge rather than generating content de novo. This grounding acts as an implicit quality filter.
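A prompt for one pair or triple might be assembled as in the sketch below. The function `build_prompt` and the exact instruction wording are illustrative assumptions, not the prompts used by the tool or the paper. What the sketch preserves is the structure described above: one section per entity, one interaction section, and the full source document attached to every call.

```python
def build_prompt(document: str, entities: list[str]) -> str:
    """Build an essay prompt for a pair (3 sections) or triple (4 sections).

    Hypothetical wording; only the section layout follows the README.
    """
    if len(entities) not in (2, 3):
        raise ValueError("expected a pair or a triple of entities")

    # One rephrasing section per entity.
    sections = [
        f"{i}. Rephrase the source text with emphasis on {name}."
        for i, name in enumerate(entities, start=1)
    ]
    # Final section: interaction among all entities in the group.
    sections.append(
        f"{len(entities) + 1}. Analyze the interaction between "
        + ", ".join(entities)
        + "."
    )
    return (
        "Using only the source document below, write an essay with these sections:\n"
        + "\n".join(sections)
        + "\n\nSource document:\n"
        + document  # the full source is included in every call
    )
```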
| Provider | Default Model | Env Var |
|---|---|---|
| `openai` | gpt-4.1-mini | OPENAI_API_KEY |
| `gemini` | gemini-2.5-flash | GOOGLE_API_KEY |
| `anthropic` | claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| `cerebras` | qwen-3-235b | CEREBRAS_API_KEY |
| `groq` | llama-3.3-70b | GROQ_API_KEY |
| `openrouter` | gemini-2.5-flash:free | OPENROUTER_API_KEY |
| `custom` | (you specify) | (you specify) |
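The table can be read as a lookup from provider name to default model and environment variable. The sketch below shows one plausible way such a lookup and the flag-over-env-var precedence could work. `resolve_provider` is a hypothetical helper, and the `custom` provider is left out because its values are user-supplied.

```python
import os

# Values copied from the provider table; the dict itself is illustrative.
PROVIDERS = {
    "openai": ("gpt-4.1-mini", "OPENAI_API_KEY"),
    "gemini": ("gemini-2.5-flash", "GOOGLE_API_KEY"),
    "anthropic": ("claude-sonnet-4-6", "ANTHROPIC_API_KEY"),
    "cerebras": ("qwen-3-235b", "CEREBRAS_API_KEY"),
    "groq": ("llama-3.3-70b", "GROQ_API_KEY"),
    "openrouter": ("gemini-2.5-flash:free", "OPENROUTER_API_KEY"),
}


def resolve_provider(provider, api_key=None, model=None, environ=None):
    """Return (model, api_key), preferring explicit arguments over defaults."""
    environ = os.environ if environ is None else environ
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    default_model, env_var = PROVIDERS[provider]
    # An explicit key wins; otherwise fall back to the provider's env var.
    key = api_key or environ.get(env_var)
    if not key:
        raise RuntimeError(f"no API key given and {env_var} is not set")
    return model or default_model, key
```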
- `{name}_expanded.txt`: original source + all generated essays (plain text, ready for CPT)
- `{name}_entities.json`: extracted entities for inspection
- `.state-{id}.json`: progress state (enables resume)
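To illustrate how a progress state file can enable resume, here is a minimal checkpointing sketch. The state schema (`{"completed": N}`) and the function `run_with_checkpoints` are assumptions made for illustration; the actual contents of the tool's state file are not documented here. The interval of 20 items mirrors the CLI note about state being saved after every 20 items.

```python
import json
from pathlib import Path


def run_with_checkpoints(items, process, state_path, every=20):
    """Process items in order, recording progress every `every` items.

    Hypothetical sketch: a real implementation would also flush the
    generated text to the output file at each checkpoint, so work done
    since the last checkpoint is the only thing repeated after a crash.
    """
    state_path = Path(state_path)
    done = 0
    if state_path.exists():
        # Resume: skip everything already recorded as completed.
        done = json.loads(state_path.read_text())["completed"]

    results = []
    for index in range(done, len(items)):
        results.append(process(items[index]))
        completed = index + 1
        if completed % every == 0 or completed == len(items):
            state_path.write_text(json.dumps({"completed": completed}))
    return results
```

After an interruption at item 45, the state file would record 40 completed items, and the next run would restart from item 41.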
| Flag | Default | Description |
|---|---|---|
| `--provider` | openai | LLM provider |
| `--api-key` | (env var) | API key |
| `--model` | (provider default) | Override model |
| `--workers` | 4 | Concurrent API calls |
| `--max-triples` | 0 (paper default) | Cap triple count |
| `--seed` | 42 | Random seed for shuffling |
| `--output` | ./output | Output directory |
| `--resume` | false | Resume interrupted job |
| `--serve` | false | Start API server |
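As an illustration of what a `--workers` setting typically controls, the sketch below fans prompts out over a thread pool. `expand_all` and `call_llm` are hypothetical names, and this is not necessarily how the tool schedules its requests. Threads are a reasonable fit here because each call spends most of its time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_all(prompts, call_llm, workers=4):
    """Run call_llm over prompts with at most `workers` calls in flight.

    pool.map returns results in input order, so outputs line up with prompts.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_llm, prompts))
```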
Yang, Z., Band, N., Li, S., Candès, E., Hashimoto, T. "Synthetic Continued Pretraining." ICLR 2025 (Oral).
- Source: 265 articles, 1.3M tokens
- Expanded: 455M tokens (350x)
- Result: Log-linear scaling on QA, no plateau observed
- Key insight: Entity graph traversal >> prompt-based rephrasing
License: MIT