TheoremForge is an agentic workflow system for synthesizing formal mathematical data at scale. It combines multiple specialized agents to transform informal mathematical statements into formalized Lean 4 theorems with verified proofs.
```bash
git clone https://github.com/timechess/TheoremForge.git
cd TheoremForge
```

This project uses uv for fast dependency management.
Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create a virtual environment and install the dependencies:

```bash
uv venv
uv sync
```
⚠️ vLLM Note: This project uses vLLM to serve prover models, which may have CUDA compatibility issues. If you encounter problems, run:

```bash
bash scripts/install_vllm.sh
```
The verifier server requires a local Lean 4 installation with Mathlib.
Install elan (the Lean version manager):

```bash
curl https://elan.lean-lang.org/elan-init.sh -sSf | sh
source $HOME/.elan/env
```

Clone and build Mathlib:

```bash
git clone https://github.com/leanprover-community/mathlib4.git
cd mathlib4 && git checkout v4.21.0
# Build the project (this may take a while)
lake exe cache get
lake build
```

You should see `Build completed successfully` when finished.
After completing Configuration, start the server:

```bash
# The server is on port 8000
uv run run_lean_server
```

This project uses LeanExplore as the search service. Run the following command to download the cache files:
```bash
uv run leanexplore data fetch
```

Create a `.env` file in the project root:

```
CLOSEAI_API_KEY=YOUR_CLOSEAI_API_KEY
DATABASE_PATH=./theoremforge.db
```

Edit `config/gemini-3-flash.yaml` to match your setup:
- `ProverAgentConfig`: points to the local vLLM server (see the next section)
- Other agents: use CloseAI or any OpenAI-compatible API

You can modify `base_url` and `api_key` to use alternative providers.
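The scripts read these settings from the environment. A minimal sketch of loading the `.env` format shown above by hand (illustrative only — it assumes plain `KEY=VALUE` lines with no quoting; real projects typically use a library such as python-dotenv):

```python
import os

# Sketch: parse simple KEY=VALUE lines like the .env example above into
# os.environ. Assumes no quoting, export keywords, or multi-line values.
def load_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    os.environ.update(env)
    return env

cfg = load_env("CLOSEAI_API_KEY=YOUR_CLOSEAI_API_KEY\nDATABASE_PATH=./theoremforge.db")
print(cfg["DATABASE_PATH"])  # → ./theoremforge.db
```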
Download and serve the Goedel-Prover-V2-32B and ReForm-32B models.

```bash
mkdir -p model
# Download Goedel-Prover-V2-32B and ReForm-32B to the model/ directory
# Use huggingface-cli or git lfs
```

```bash
bash scripts/vllm_serve_model.sh \
  --model-name model/Goedel-Prover-V2-32B \
  --port 8002 \
  --gpu-ids 0,1

bash scripts/vllm_serve_model.sh \
  --model-name model/ReForm-32B \
  --port 8003 \
  --gpu-ids 2,3
```

Update `config/gemini-3-flash.yaml` with your chosen port and model path.
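The two serve commands differ only in model path, port, and GPU ids. If you serve more models, a small helper can generate the invocations; the script name and flags below are taken verbatim from the commands above, but the helper itself is a sketch, not part of the repo:

```python
# Sketch: generate `bash scripts/vllm_serve_model.sh` invocations from a
# table of (model path, port, GPU ids), matching the commands shown above.
MODELS = [
    ("model/Goedel-Prover-V2-32B", 8002, "0,1"),
    ("model/ReForm-32B", 8003, "2,3"),
]

def serve_command(model_path: str, port: int, gpu_ids: str) -> str:
    """Build one vllm_serve_model.sh invocation as a single line."""
    return (
        f"bash scripts/vllm_serve_model.sh "
        f"--model-name {model_path} --port {port} --gpu-ids {gpu_ids}"
    )

commands = [serve_command(*row) for row in MODELS]
for cmd in commands:
    print(cmd)
```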
The typical workflow consists of three main steps:
- Generate Dataset: Sample problems from DeepTheorem and DeepMath datasets
- Run Workflow: Process problems through the TheoremForge agentic workflow
- Extract Data: Extract training data for different tasks from workflow results
Use scripts/generate_dataset.py to sample problems from DeepTheorem and DeepMath datasets by difficulty distribution.
Example:

```bash
uv run python scripts/generate_dataset.py \
  --num_samples 1000 \
  --ratio 0.6:0.4 \
  --seed 42 \
  --output data/sampled_problems.jsonl
```

Parameters:

- `--num_samples`: Total number of problems to sample (required)
- `--ratio`: Dataset ratio in the format `deeptheorem:deepmath`, e.g., `0.6:0.4` (required)
- `--seed`: Random seed (default: 42)
- `--output`: Output file path (optional; prints the first sample if not specified)
The script samples problems while preserving the original difficulty distribution from each dataset.
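How a ratio like `0.6:0.4` maps to per-dataset sample counts can be sketched as follows (a plausible reading of the `--ratio` flag above, not the actual logic of `scripts/generate_dataset.py`):

```python
# Sketch: split a total sample budget across the two datasets according to
# a ratio string such as "0.6:0.4". Illustrative only.
def split_counts(num_samples: int, ratio: str) -> tuple[int, int]:
    """Return (deeptheorem_count, deepmath_count) for an `a:b` ratio string."""
    a, b = (float(x) for x in ratio.split(":"))
    first = round(num_samples * a / (a + b))
    return first, num_samples - first  # remainder goes to the second dataset

print(split_counts(1000, "0.6:0.4"))  # → (600, 400)
```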
Use scripts/run_workflow.py to process problems through the TheoremForge agentic workflow.
Example:

```bash
uv run python scripts/run_workflow.py \
  --config_path config/gemini-3-flash.yaml \
  --max_workers 4 \
  --input_file data/sampled_problems.jsonl \
  --export_file results/workflow_results.jsonl \
  --resume
```

Parameters:

- `--config_path`: Path to the configuration file (required)
- `--max_workers`: Maximum number of concurrent workers (required)
- `--input_file`: Input file with problems in JSONL format (required). Each line should be a JSON object with `id` and `nl_problem` fields.
- `--export_file`: Output file path for workflow results (required)
- `--resume`: Resume from checkpoint (optional). If specified, resumes from the last successful entry in the export file.
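The `--resume` behavior amounts to skipping inputs whose ids already appear in the export file. A minimal sketch of that bookkeeping (illustrative, not the script's actual implementation; ids follow the input format below):

```python
import json

# Sketch: collect the ids already present in a previous export file so a
# resumed run can skip them.
def completed_ids(export_lines):
    """Return the set of problem ids present in the export file lines."""
    return {json.loads(line)["id"] for line in export_lines if line.strip()}

previous = [
    '{"id": "problem_1", "success": true}',
    '{"id": "problem_2", "success": false}',
]
done = completed_ids(previous)
todo = [p for p in ["problem_1", "problem_2", "problem_3"] if p not in done]
print(todo)  # → ['problem_3']
```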
Input Format: Each line in the input file should be a JSON object:

```json
{"id": "problem_1", "nl_problem": "Prove that the sum of two even numbers is even."}
```

Output Format: Each line in the output file is a JSON object:

```json
{
  "id": "problem_1",
  "statement_id": "...",
  "formal_statement": "theorem sum_even : ...",
  "informal_statement": "Prove that the sum of two even numbers is even.",
  "formal_proof": "...",
  "success": true
}
```

Use scripts/extract_data.py to extract training data for different tasks from workflow results.
Example:

```bash
uv run python scripts/extract_data.py \
  --file results/workflow_results.jsonl
```

Parameters:

- `--file`: Path to the workflow results file (required)
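Before extracting, it can be useful to check how many workflow entries actually succeeded. A small sketch over the results format above (the `success` field comes from this README; the helper itself is illustrative):

```python
import json

# Sketch: count successful entries in a workflow results JSONL file, using
# the `success` field from the documented output format.
def success_rate(lines):
    """Return (number of successes, total number of records)."""
    records = [json.loads(line) for line in lines if line.strip()]
    wins = sum(1 for r in records if r.get("success"))
    return wins, len(records)

sample = [
    '{"id": "problem_1", "success": true}',
    '{"id": "problem_2", "success": false}',
]
print(success_rate(sample))  # → (1, 2)
```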
Output Files:
The script generates five JSONL files in the `results/` directory:

- `statement_formalization_data.jsonl`: Data for the statement formalization task. Fields: `informal_statement`, `retrieval_results`, `formal_statement`, `success`
- `premise_selection_data.jsonl`: Data for the premise selection task. Fields: `informal_statement`/`formal_statement`, `queries`, `results`, `success`
- `proof_generation_data.jsonl`: Data for the proof generation task. Fields: `formal_statement`, `retrieval_results`, `formal_proof`, `success`
- `proof_correction_data.jsonl`: Data for the proof correction task. Fields: `error_code`, `error_messages`, `valid_code`, `success`
- `proof_sketching_data.jsonl`: Data for the proof sketching task. Fields: `formal_statement`, `retrieval_results`, `informal_proof`, `proof_sketch`, `success`
Prerequisites:
- The Lean verifier server must be running (see Step 3: Lean Server)
- The database must contain trace information from the workflow run
- Ensure `DATABASE_PATH` in `.env` points to the correct database file
We have open-sourced our extracted data on Hugging Face: https://huggingface.co/datasets/timechess/theoremforge
```bibtex
@misc{tao2026theoremforgescalingformaldata,
  title={TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow},
  author={Yicheng Tao and Hongteng Xu},
  year={2026},
  journal={arXiv preprint arXiv:2601.17332}
}
```