Ranked Return Regression for RL (R4)

This repository contains the code release for the paper "Reward Learning through Ranking Mean Squared Error".

Abstract

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating-based and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
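The training step described above (predict returns, soft-rank them, regress the ranks onto the ratings) can be illustrated with a minimal forward-pass sketch. This is not the repository's implementation: the code in src/lossFunctions uses fast_soft_sort or an OT-based operator, whereas the sketch below swaps in a simple pairwise-sigmoid soft rank. The function names, the temperature parameter, and the choice to turn tied ratings into shared average ranks are all assumptions made for illustration.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_ranks(returns, temperature=0.1):
    """Pairwise-sigmoid soft ranks (assumed stand-in for fast_soft_sort).

    Rank 1 is the lowest return and rank n the highest. As temperature
    approaches 0 the values approach the hard ranks.
    """
    ranks = []
    for r_i in returns:
        # Each term softly counts "r_i beats r_j"; the j == i term adds 0.5.
        total = sum(_sigmoid((r_i - r_j) / temperature) for r_j in returns)
        ranks.append(total + 0.5)
    return ranks


def midrank_targets(ratings):
    """Map discrete ratings to target ranks (assumption: ties share their average rank)."""
    targets = []
    for r in ratings:
        lower = sum(1 for q in ratings if q < r)
        equal = sum(1 for q in ratings if q == r)
        targets.append(lower + (equal + 1) / 2.0)
    return targets


def ranking_mse(predicted_returns, ratings, temperature=0.1):
    """Mean squared error between soft ranks of the returns and rating-derived targets."""
    ranks = soft_ranks(predicted_returns, temperature)
    targets = midrank_targets(ratings)
    return sum((a - b) ** 2 for a, b in zip(ranks, targets)) / len(ranks)


# Ratings encoded as 0 = "bad", 1 = "neutral", 2 = "good" (hypothetical encoding).
ratings = [0, 1, 2]
print(ranking_mse([-1.0, 0.0, 1.0], ratings))  # returns ordered like the ratings
print(ranking_mse([1.0, 0.0, -1.0], ratings))  # returns ordered against the ratings
```

The sketch only computes the loss value. In actual training the predicted returns would come from a reward model and the loss would be minimized with an automatic differentiation framework.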

Repository layout

src/main.py: Offline rMSE reward learning entry point
src/main_online.py: Online training with preference-based reward updates
src/optimization.py: Shared optimization logic (offline and online)
src/lossFunctions: rMSE losses (fast_soft_sort and OT variants)
src/configs: Environment configurations for offline and online runs

Setup

Offline reward learning (rMSE)

Run the offline rMSE training from the repo root:

python src/main.py --env_name hungrythirsty --num_files_to_read 1 --num_trajectories 1000

Common options:

--loss_func ranking_mse (fast_soft_sort) or --loss_func ranking_mse_ot (OT-based)
--ranking_assumption bin_ranking or full_ranking
--model mini|medium|large|large2

Artifacts are saved under ../checkpoints/ours by default.
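To compare the loss variants and ranking assumptions listed above, one option is to generate the commands with a small helper script. The sketch below is a hypothetical convenience script, not part of the repository. It uses only flag names and values that appear in this README, and it assumes that every combination of --loss_func and --ranking_assumption is accepted by src/main.py, which has not been verified.

```python
import itertools
import shlex

# Flag names and values are copied from this README; nothing else is assumed about the CLI.
BASE_CMD = [
    "python", "src/main.py",
    "--env_name", "hungrythirsty",
    "--num_files_to_read", "1",
    "--num_trajectories", "1000",
]
LOSS_FUNCS = ["ranking_mse", "ranking_mse_ot"]
RANKING_ASSUMPTIONS = ["bin_ranking", "full_ranking"]


def build_sweep(model="mini"):
    """Return one argv list per (loss function, ranking assumption) pair."""
    commands = []
    for loss_func, assumption in itertools.product(LOSS_FUNCS, RANKING_ASSUMPTIONS):
        commands.append(
            BASE_CMD
            + ["--loss_func", loss_func,
               "--ranking_assumption", assumption,
               "--model", model]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in build_sweep():
        print(shlex.join(cmd))
```

Each printed line can be run from the repo root, or passed to subprocess.run as an argv list.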

Online reward learning

Run the online preference-based training from the repo root:

python src/main_online.py --env_name walker_walk --task_name walk

Common options:

--reward_model_arch mini|medium|large|large2
--budget 100 --num_preferences 10 --num_generations 2000
--subtrajectory_length 50

Artifacts are saved under ../checkpoints/online by default.

Citation

If you use this code, please cite the paper:

Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor. "Reward Learning through Ranking Mean Squared Error." ICML 2026.
