Zixuan Wang1,2, Yuchen Yan1, Hongxing Li1, Teng Pan1,2, Dingming Li1, Ruiqing Zhang2,
Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University · 2Baidu Inc.
†Corresponding author
This repository contains the official implementation of BEACON, a milestone-guided policy learning framework that addresses two pathologies of trajectory-level RL on long-horizon language-agent tasks: credit misattribution (correct early actions penalized by terminal failure) and sample inefficiency (partial successes wasted under sparse rewards). The implementation is built on top of verl-agent; only the components contributed by this work are documented here.
- Consistent gains on three long-horizon benchmarks with a single set of hyperparameters ($\gamma=0.95$, $\lambda=1.0$).
- Horizon-dependent gains. On ALFWorld Long tasks, BEACON reaches 92.9% vs. 53.5% for GRPO. Relative gains over GRPO scale from +26.2% (Short) to +73.6% (Long).
- Recovers learning signal from partial successes. Effective sample utilization improves from 23.7% to 82.0% on ALFWorld.
- Outperforms behavior cloning. 91.4% vs. 43% for SFT on oracle trajectories — gains stem from policy optimization, not milestone imitation.
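
The relative-gain figure for Long tasks can be checked from the success rates quoted in the list. This is a minimal sketch, assuming that "relative gain" means (BEACON - GRPO) / GRPO; the function name is illustrative and not part of the repository.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# ALFWorld Long success rates quoted above: BEACON 92.9%, GRPO 53.5%.
long_gain = relative_gain(92.9, 53.5)
print(f"+{long_gain:.1f}%")  # +73.6%
```

Under this definition, the Long-task numbers reproduce the +73.6% reported above.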
Main results. BEACON outperforms GRPO and GiGPO across ALFWorld, ScienceWorld, and WebShop at both 1.5B and 7B scales.
BEACON operates in three stages:
- Trajectory partitioning. A milestone indicator $\Phi$ identifies verifiable subgoal-completion transitions, splitting each trajectory into segments at milestone boundaries. $\Phi$ is environment-defined and requires no learned model: ALFWorld uses object/state predicates, WebShop uses page-transition phases, and ScienceWorld exposes `subgoal_completed` directly.
- Temporal reward shaping. Within each completed segment, actions receive the shaped reward $r_t = R_{\text{ms}} \cdot \gamma^{t_k - t}$, where $R_{\text{ms}}$ is the milestone reward and $t_k$ is the step at which the segment's milestone is reached. This gives graduated positive credit to actions leading up to a milestone and converts partial successes into learning signal.
- Dual-scale advantage estimation. Trajectory-level advantage (GRPO-style) captures global task performance; segment-level advantage compares only among trajectories that reached the same milestone, isolating local action quality from variance in later segments. The two are combined as $\hat{A}_{i,t} = A^{\text{traj}}_i + \lambda \cdot A^{\text{seg}}_{i,t}$.
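
The second and third stages can be sketched in a few lines of Python. This is an illustration only, not the code in migpo/. The function names, the milestone reward `R_MS = 1.0`, the zero reward for unfinished segments, and the per-segment score used in the segment-level comparison (the shaped reward at the segment's first step) are all assumptions made for this example. Only $\gamma = 0.95$, $\lambda = 1.0$, the shaping formula, and the combination rule come from the text above.

```python
from collections import defaultdict
from statistics import mean, pstdev

GAMMA = 0.95   # discount factor, as stated in the README
LAMBDA = 1.0   # weight on the segment-level advantage, as stated in the README
R_MS = 1.0     # milestone reward magnitude (assumed value)


def shaped_rewards(num_steps, milestone_steps, gamma=GAMMA, r_ms=R_MS):
    """r_t = r_ms * gamma**(t_k - t) for every step t in a completed segment.

    milestone_steps holds the sorted step indices t_k at which a milestone
    fired. Steps after the last milestone sit in an unfinished segment and
    are left at 0 (an assumption of this sketch).
    """
    rewards = [0.0] * num_steps
    start = 0
    for t_k in milestone_steps:
        for t in range(start, t_k + 1):
            rewards[t] = r_ms * gamma ** (t_k - t)
        start = t_k + 1
    return rewards


def group_normalize(scores):
    """GRPO-style group normalization: (x - mean) / std, zeros if std == 0."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def dual_scale_advantages(returns, milestones, lengths, lam=LAMBDA, gamma=GAMMA):
    """Per-step advantage A_traj[i] + lam * A_seg[i][t] for a group of rollouts."""
    a_traj = group_normalize(returns)
    adv = [[a_traj[i]] * lengths[i] for i in range(len(returns))]

    # Group segments by milestone index k, so that each segment is compared
    # only with segments from trajectories that reached the same milestone.
    by_milestone = defaultdict(list)
    for i, steps in enumerate(milestones):
        start = 0
        for k, t_k in enumerate(steps):
            score = gamma ** (t_k - start)  # assumed score: faster is better
            by_milestone[k].append((i, start, t_k, score))
            start = t_k + 1

    for entries in by_milestone.values():
        normed = group_normalize([e[3] for e in entries])
        for (i, start, end, _), a_seg in zip(entries, normed):
            for t in range(start, end + 1):
                adv[i][t] += lam * a_seg
    return adv


# Three rollouts: one full success, one partial success, one failure.
rewards = shaped_rewards(5, [1, 3])
adv = dual_scale_advantages(
    returns=[1.0, 0.0, 0.0],
    milestones=[[1, 3], [2], []],
    lengths=[4, 4, 3],
)
```

In this toy group, the first segment of rollout 0 is compared only with the first segment of rollout 1. Rollout 2 never reached a milestone, so its steps receive only the trajectory-level term.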
At update time, BEACON automatically routes each batch based on which milestone field is present (trial_id for ALFWorld, milestone_achieved for WebShop, subgoal_completed for ScienceWorld), so a single training pipeline supports all three environments without environment-specific code paths in the trainer.
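
The routing rule can be illustrated with a small dispatcher. This is a hypothetical sketch: the field names come from the paragraph above, but the function name, the returned labels, and the use of a plain dict as the batch are assumptions, not the trainer's actual interface.

```python
# Milestone field -> environment label. The field names are taken from the
# README; the labels on the right are illustrative.
MILESTONE_FIELDS = {
    "trial_id": "alfworld",
    "milestone_achieved": "webshop",
    "subgoal_completed": "sciworld",
}


def route_batch(batch: dict) -> str:
    """Return the environment label implied by the batch's milestone field."""
    present = [env for field, env in MILESTONE_FIELDS.items() if field in batch]
    if len(present) != 1:
        raise ValueError(f"expected exactly one milestone field, found {present}")
    return present[0]


env = route_batch({"observations": ["..."], "subgoal_completed": [True, False]})
print(env)  # sciworld
```

Raising on zero or multiple matches is a design choice of this sketch; it makes an ambiguous batch fail loudly rather than silently picking an environment.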
migpo/ # BEACON core: advantage / step-reward computation and milestone detector
agent_system/ # ALFWorld, WebShop, and ScienceWorld environment integrations
examples/migpo_trainer/ # Paper-locked training scripts (one per environment)
Everything else is inherited from the upstream verl-agent framework.
conda create -n verl-agent python==3.12 -y
conda activate verl-agent
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5

Each environment below is best installed in its own dedicated conda environment to avoid dependency conflicts.

ALFWorld:
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
pip install vllm==0.8.5
# Download PDDL & Game files and the pre-trained MaskRCNN detector
alfworld-download -f

WebShop requires Python ≤ 3.10:
conda create -n verl-agent-webshop python==3.10 -y
conda activate verl-agent-webshop
cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all
cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2

ScienceWorld requires Java 1.8+ and Python ≤ 3.10:
conda create -n verl-agent-sciworld python==3.10 -y
conda activate verl-agent-sciworld
cd repo_root/
pip3 install torch==2.6.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
# Java via conda (not system-wide)
conda install -c conda-forge openjdk=11 -y
# ScienceWorld ships its own bundled JAR and the py4j bridge
pip install scienceworld

Variation indices used by our experiments are included at agent_system/environments/env_package/sciworld/variations_idx/.
Sanity check:
python -c "from scienceworld import ScienceWorldEnv; print('ScienceWorld import successful')"

Paper-locked training scripts (Qwen2.5-1.5B-Instruct, single 8-GPU node) live in examples/migpo_trainer/:
bash examples/migpo_trainer/run_alfworld.sh # ALFWorld
bash examples/migpo_trainer/run_webshop.sh # WebShop
bash examples/migpo_trainer/run_sciworld.sh # ScienceWorld

This codebase builds on verl-agent, which itself extends veRL. We thank the authors of those projects and the maintainers of the supported environments: ALFWorld, WebShop, and ScienceWorld.
@misc{wang2026milestoneguidedpolicylearninglonghorizon,
title = {Milestone-Guided Policy Learning for Long-Horizon Language Agents},
author = {Zixuan Wang and Yuchen Yan and Hongxing Li and Teng Pan and Dingming Li and Ruiqing Zhang and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
year = {2026},
eprint = {2605.06078},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.06078},
}