TheoremForge is an agentic workflow system for synthesizing formal mathematical data at scale. It combines multiple specialized agents to transform informal mathematical statements into formalized Lean 4 theorems with verified proofs.
```bash
git clone https://github.com/timechess/TheoremForge.git
cd TheoremForge
```

This project uses uv for fast dependency management.
Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create a virtual environment and install the dependencies:

```bash
uv venv
uv sync
```
⚠️ vLLM Note: This project uses vLLM to serve prover models, which may have CUDA compatibility issues. If you encounter problems, run:

```bash
bash scripts/install_vllm.sh
```
The verifier server requires a local Lean 4 installation with Mathlib.
Install elan (the Lean version manager):

```bash
curl https://elan.lean-lang.org/elan-init.sh -sSf | sh
source $HOME/.elan/env
```

Clone and build Mathlib:

```bash
git clone https://github.com/leanprover-community/mathlib4.git
cd mathlib4 && git checkout v4.21.0
# Build the project (this may take a while)
lake exe cache get
lake build
```

You should see `Build completed successfully` when finished.
After completing Configuration, start the server:

```bash
# The server is on port 8000
uv run run_lean_server
```

This project uses LeanExplore as the search service. Run the following command to download the cache files:
```bash
uv run leanexplore data fetch
```

Create a `.env` file in the project root:

```
CLOSEAI_API_KEY=YOUR_CLOSEAI_API_KEY
DATABASE_PATH=./theoremforge.db
```

Edit `config/gemini-3-flash.yaml` to match your setup:
- `ProverAgentConfig`: points to the local vLLM server (see the next section)
- Other agents: use CloseAI or any OpenAI-compatible API

You can modify `base_url` and `api_key` to use alternative providers.
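The scripts read these settings from the environment. A minimal sketch of loading the `.env` format shown above by hand (illustrative only — it assumes plain `KEY=VALUE` lines with no quoting; real projects typically use a library such as python-dotenv):

```python
import os

# Sketch: parse simple KEY=VALUE lines like the .env example above into
# os.environ. Assumes no quoting, export keywords, or multi-line values.
def load_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    os.environ.update(env)
    return env

cfg = load_env("CLOSEAI_API_KEY=YOUR_CLOSEAI_API_KEY\nDATABASE_PATH=./theoremforge.db")
print(cfg["DATABASE_PATH"])  # → ./theoremforge.db
```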
Download and serve the Goedel-Prover-V2-32B and ReForm-32B models.

```bash
mkdir -p model
# Download Goedel-Prover-V2-32B and ReForm-32B to the model/ directory
# Use huggingface-cli or git lfs
```

```bash
bash scripts/vllm_serve_model.sh \
  --model-name model/Goedel-Prover-V2-32B \
  --port 8002 \
  --gpu-ids 0,1

bash scripts/vllm_serve_model.sh \
  --model-name model/ReForm-32B \
  --port 8003 \
  --gpu-ids 2,3
```

Update `config/gemini-3-flash.yaml` with your chosen port and model path.
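The two serve commands differ only in model path, port, and GPU ids. If you serve more models, a small helper can generate the invocations; the script name and flags below are taken verbatim from the commands above, but the helper itself is a sketch, not part of the repo:

```python
# Sketch: generate `bash scripts/vllm_serve_model.sh` invocations from a
# table of (model path, port, GPU ids), matching the commands shown above.
MODELS = [
    ("model/Goedel-Prover-V2-32B", 8002, "0,1"),
    ("model/ReForm-32B", 8003, "2,3"),
]

def serve_command(model_path: str, port: int, gpu_ids: str) -> str:
    """Build one vllm_serve_model.sh invocation as a single line."""
    return (
        f"bash scripts/vllm_serve_model.sh "
        f"--model-name {model_path} --port {port} --gpu-ids {gpu_ids}"
    )

commands = [serve_command(*row) for row in MODELS]
for cmd in commands:
    print(cmd)
```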
The typical workflow consists of three main steps:
- Generate Dataset: Sample problems from DeepTheorem and DeepMath datasets
- Run Workflow: Process problems through the TheoremForge agentic workflow
- Extract Data: Extract training data for different tasks from workflow results
Use scripts/generate_dataset.py to sample problems from DeepTheorem and DeepMath datasets by difficulty distribution.
Example:

```bash
uv run python scripts/generate_dataset.py \
  --num_samples 1000 \
  --ratio 0.6:0.4 \
  --seed 42 \
  --output data/sampled_problems.jsonl
```

Parameters:

- `--num_samples`: Total number of problems to sample (required)
- `--ratio`: Dataset ratio in the format `deeptheorem:deepmath`, e.g., `0.6:0.4` (required)
- `--seed`: Random seed (default: 42)
- `--output`: Output file path (optional; prints the first sample if not specified)
The script samples problems while preserving the original difficulty distribution from each dataset.
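How a ratio like `0.6:0.4` maps to per-dataset sample counts can be sketched as follows (a plausible reading of the `--ratio` flag above, not the actual logic of `scripts/generate_dataset.py`):

```python
# Sketch: split a total sample budget across the two datasets according to
# a ratio string such as "0.6:0.4". Illustrative only.
def split_counts(num_samples: int, ratio: str) -> tuple[int, int]:
    """Return (deeptheorem_count, deepmath_count) for an `a:b` ratio string."""
    a, b = (float(x) for x in ratio.split(":"))
    first = round(num_samples * a / (a + b))
    return first, num_samples - first  # remainder goes to the second dataset

print(split_counts(1000, "0.6:0.4"))  # → (600, 400)
```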
Use scripts/run_workflow.py to process problems through the TheoremForge agentic workflow.
Example:

```bash
uv run python scripts/run_workflow.py \
  --config_path config/gemini-3-flash.yaml \
  --max_workers 4 \
  --input_file data/sampled_problems.jsonl \
  --export_file results/workflow_results.jsonl \
  --resume
```

Parameters:

- `--config_path`: Path to the configuration file (required)
- `--max_workers`: Maximum number of concurrent workers (required)
- `--input_file`: Input file with problems in JSONL format (required). Each line should be a JSON object with `id` and `nl_problem` fields.
- `--export_file`: Output file path for workflow results (required)
- `--resume`: Resume from checkpoint (optional). If specified, resumes from the last successful entry in the export file.
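The `--resume` behavior amounts to skipping inputs whose ids already appear in the export file. A minimal sketch of that bookkeeping (illustrative, not the script's actual implementation; ids follow the input format below):

```python
import json

# Sketch: collect the ids already present in a previous export file so a
# resumed run can skip them.
def completed_ids(export_lines):
    """Return the set of problem ids present in the export file lines."""
    return {json.loads(line)["id"] for line in export_lines if line.strip()}

previous = [
    '{"id": "problem_1", "success": true}',
    '{"id": "problem_2", "success": false}',
]
done = completed_ids(previous)
todo = [p for p in ["problem_1", "problem_2", "problem_3"] if p not in done]
print(todo)  # → ['problem_3']
```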
Input Format: Each line in the input file should be a JSON object:

```json
{"id": "problem_1", "nl_problem": "Prove that the sum of two even numbers is even."}
```

Output Format: Each line in the output file is a JSON object:

```json
{
  "id": "problem_1",
  "statement_id": "...",
  "formal_statement": "theorem sum_even : ...",
  "informal_statement": "Prove that the sum of two even numbers is even.",
  "formal_proof": "...",
  "success": true
}
```

Use scripts/extract_data.py to extract training data for different tasks from workflow results.
Example:

```bash
uv run python scripts/extract_data.py \
  --file results/workflow_results.jsonl
```

Parameters:

- `--file`: Path to the workflow results file (required)
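Before extracting, it can be useful to check how many workflow entries actually succeeded. A small sketch over the results format above (the `success` field comes from this README; the helper itself is illustrative):

```python
import json

# Sketch: count successful entries in a workflow results JSONL file, using
# the `success` field from the documented output format.
def success_rate(lines):
    """Return (number of successes, total number of records)."""
    records = [json.loads(line) for line in lines if line.strip()]
    wins = sum(1 for r in records if r.get("success"))
    return wins, len(records)

sample = [
    '{"id": "problem_1", "success": true}',
    '{"id": "problem_2", "success": false}',
]
print(success_rate(sample))  # → (1, 2)
```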
Output Files:
The script generates five JSONL files in the `results/` directory:

- `statement_formalization_data.jsonl`: Data for the statement formalization task. Fields: `informal_statement`, `retrieval_results`, `formal_statement`, `success`
- `premise_selection_data.jsonl`: Data for the premise selection task. Fields: `informal_statement`/`formal_statement`, `queries`, `results`, `success`
- `proof_generation_data.jsonl`: Data for the proof generation task. Fields: `formal_statement`, `retrieval_results`, `formal_proof`, `success`
- `proof_correction_data.jsonl`: Data for the proof correction task. Fields: `error_code`, `error_messages`, `valid_code`, `success`
- `proof_sketching_data.jsonl`: Data for the proof sketching task. Fields: `formal_statement`, `retrieval_results`, `informal_proof`, `proof_sketch`, `success`
Prerequisites:
- The Lean verifier server must be running (see Step 3: Lean Server)
- The database must contain trace information from the workflow run
- Ensure `DATABASE_PATH` in `.env` points to the correct database file
We have open-sourced our extracted data on Hugging Face: https://huggingface.co/datasets/timechess/theoremforge
```bibtex
@misc{tao2026theoremforgescalingformaldata,
  title={TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow},
  author={Yicheng Tao and Hongteng Xu},
  year={2026},
  journal={arXiv preprint arXiv:2601.17332}
}
```