This repository is the official implementation of Agent0.
Agent0 is a fully autonomous framework that evolves high-performing agents from scratch without relying on any human-curated data. It employs a symbiotic competition between a Curriculum Agent (proposing tasks) and an Executor Agent (solving tasks with tools), driving a self-reinforcing cycle of improvement.
- Zero Data: Eliminates dependency on external data or human annotations.
- Co-Evolution: Establishes a competition where the curriculum agent creates increasingly difficult tasks and the executor agent learns to solve them.
- Tool Integration: Integrates external tools (e.g., code interpreter) to enhance problem-solving and pressure the curriculum agent to generate complex, tool-aware tasks.
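The co-evolution loop described above can be sketched conceptually. This is a toy illustration only, not the actual training code: the `Curriculum` and `Executor` classes and the difficulty/skill update rule are hypothetical stand-ins for the RL-trained agents.

```python
class Curriculum:
    """Toy curriculum agent: proposes tasks at its current difficulty."""
    def __init__(self):
        self.difficulty = 1

    def propose_task(self):
        return {"difficulty": self.difficulty}


class Executor:
    """Toy executor agent: solves tasks up to its current skill level."""
    def __init__(self):
        self.skill = 1

    def solve(self, task):
        return task["difficulty"] <= self.skill


def co_evolve(curriculum, executor, iterations=5):
    """One toy round of symbiotic competition per iteration:
    the curriculum hardens its tasks when the executor succeeds,
    and the executor improves when it fails."""
    for _ in range(iterations):
        task = curriculum.propose_task()
        if executor.solve(task):
            curriculum.difficulty += 1   # frontier pushed: propose harder tasks
        else:
            executor.skill += 1          # failure signal: executor learns
    return curriculum.difficulty, executor.skill
```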
Empirically, Agent0 substantially boosts the reasoning capabilities of the Qwen3-8B-Base model:
- +18% improvement on Mathematical Reasoning benchmarks.
- +24% improvement on General Reasoning benchmarks.
git clone https://github.com/aiming-lab/Agent0.git
cd Agent0/Agent0
# For curriculum training
conda create -n curriculum python=3.12
conda activate curriculum
cd curriculum_train/
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir
# For executor training
conda deactivate
conda create -n executor python=3.12
conda activate executor
cd ../
pip install -r requirements.txt
cd executor_train/
pip install -e verl
pip install "flash-attn==2.8.3" --no-build-isolation --no-cache-dir
You need to deploy a code sandbox service for subsequent code execution. Depending on your hardware (e.g., CPU memory), you can adjust the number of sandbox services or their configuration to meet higher concurrency requirements. Please refer to SandboxFusion for more details.
Here is a sample script for the sandbox setup. We deploy four sandbox services, each assigned its own IP address and port.
git clone https://github.com/bytedance/SandboxFusion.git
cd SandboxFusion
poetry install
make run-online
First, fill in lines 36-41 of curriculum_train/vllm_service_init/start_vllm_server_tool.py with the sandbox service info from Step 2.
SANDBOX_API_URLS = [
'IP1:PORT1/run_code',
'IP2:PORT2/run_code',
'IP3:PORT3/run_code',
'IP4:PORT4/run_code'
]
Then use the script below to train the curriculum agent. This step is relatively slow due to rollout costs and the concurrency limits of the sandbox service. If you would like to change max_turns, refer to generate_with_tool_use() in curriculum_train/vllm_service_init/start_vllm_server_tool.py.
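To spread concurrent requests across the deployed sandboxes, one option is a simple round-robin dispatcher over the configured endpoints. A minimal sketch; it assumes the /run_code endpoint accepts a JSON body with `code` and `language` fields, so verify the exact schema against SandboxFusion's documentation.

```python
import itertools
import json

# The same endpoints configured in start_vllm_server_tool.py.
SANDBOX_API_URLS = [
    'IP1:PORT1/run_code',
    'IP2:PORT2/run_code',
    'IP3:PORT3/run_code',
    'IP4:PORT4/run_code',
]

# Round-robin over the deployed sandboxes to balance concurrent load.
_url_cycle = itertools.cycle(SANDBOX_API_URLS)

def next_sandbox_url():
    """Return the next sandbox endpoint in round-robin order."""
    return next(_url_cycle)

def build_run_code_payload(code, language="python"):
    """Assumed request body for SandboxFusion's /run_code endpoint;
    check its documentation for the exact schema."""
    return json.dumps({"code": code, "language": language})

# Example (requires the `requests` package and running sandbox services):
# import requests
# resp = requests.post("http://" + next_sandbox_url(),
#                      data=build_run_code_payload("print(1 + 1)"),
#                      headers={"Content-Type": "application/json"})
```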
cd curriculum_train/
export STORAGE_PATH=""
export HUGGINGFACENAME=""
export WANDB_API_KEY=""
mkdir -p \
"$STORAGE_PATH/evaluation" \
"$STORAGE_PATH/models" \
"$STORAGE_PATH/generated_question" \
"$STORAGE_PATH/temp_results"
# Initialize first iteration with base model
bash scripts/curriculum_train.sh Qwen/Qwen3-4B-Base Qwen/Qwen3-4B-Base qwen3_4b_curriculum_v1
Then we construct the training data for the executor agent, filtering questions by their self-consistency scores.
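The filter keeps only questions of intermediate difficulty, mirroring the --min_score/--max_score flags passed to question_evaluate/upload.py below. The actual logic lives in that script; this is a hedged sketch with a hypothetical record layout.

```python
def self_consistency_score(answers):
    """Fraction of rollouts agreeing with the majority answer."""
    if not answers:
        return 0.0
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers)

def filter_questions(records, min_score=0.3, max_score=0.8):
    """Keep questions that are neither trivial for the executor
    (score above max_score) nor hopelessly ambiguous (score below
    min_score) -- the band that yields useful training signal."""
    kept = []
    for rec in records:
        score = self_consistency_score(rec["answers"])
        if min_score <= score <= max_score:
            kept.append(rec)
    return kept
```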
executor_agent_path=Qwen/Qwen3-4B-Base
curriculum_agent_path=${STORAGE_PATH}/models/qwen3_4b_curriculum_v1/global_step_5/actor/huggingface
experiment_name=qwen3_4b_executor_v1
export VLLM_DISABLE_COMPILE_CACHE=1
echo 'start generate question'
bash question_generate/question_generate.bash $curriculum_agent_path 1000 $experiment_name
echo 'start evaluate generated question'
bash question_evaluate/evaluate.sh $executor_agent_path $experiment_name
echo 'start upload'
LOCAL_DATA_PATH=$(python question_evaluate/upload.py --max_score 0.8 --min_score 0.3 --experiment_name ${experiment_name})
echo "training data saved to: ${LOCAL_DATA_PATH}"
Lastly, we implement the ADPO algorithm in executor_train/verl_tool/trainer/ppo/core_algos.py and update the executor agent with multi-turn RL based on VeRL-Tool.
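ADPO itself is defined in core_algos.py. As a generic illustration of the group-normalized advantage estimation common to this family of RL algorithms (this is not the actual ADPO objective), one might compute:

```python
def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize the rewards of a group of rollouts for the same
    question to zero mean / unit std (a GRPO-style baseline).
    ADPO's actual objective differs; see
    executor_train/verl_tool/trainer/ppo/core_algos.py."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against zero-variance groups (all rollouts identical)
    return [(r - mean) / (std + eps) for r in rewards]
```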
cd ../executor_train
# if you run out of CUDA memory during training, try setting use_dynamic_bs=False
bash examples/train/math_tir/train_qwen3_4b_adpo.sh
The checkpoints are saved in checkpoints/torl, so you can manually select a checkpoint and train the next iteration.
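To chain iterations, point the next run's curriculum_agent_path/executor_agent_path at a saved step. A small helper for locating the newest checkpoint, assuming the global_step_N directory naming shown in the curriculum path above:

```python
import re
from pathlib import Path

def latest_checkpoint(ckpt_root):
    """Return the global_step_N directory with the highest N under
    ckpt_root, or None if no checkpoints exist yet."""
    steps = []
    for d in Path(ckpt_root).glob("global_step_*"):
        m = re.fullmatch(r"global_step_(\d+)", d.name)
        if m and d.is_dir():
            steps.append((int(m.group(1)), d))
    if not steps:
        return None
    return max(steps)[1]

# Example: latest_checkpoint("checkpoints/torl/<experiment_name>")
```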
The framework is based on VeRL. We build on code from R-Zero and VeRL-Tool, and use SandboxFusion as the sandbox server. We thank the authors for releasing their code.
If you find this work helpful, please consider citing our paper:
@article{xia2025agent0,
title={Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning},
author={Xia, Peng and Zeng, Kaide and Liu, Jiaqi and Qin, Can and Wu, Fang and Zhou, Yiyang and Xiong, Caiming and Yao, Huaxiu},
journal={arXiv preprint arXiv:2511.16043},
year={2025}
}