Thank you for your excellent work on rStar2-Agent. I’ve been especially impressed by your contributions to LLM reasoning.
In your earlier work, rStar-Math, you combined MCTS with SFT and a process reward model (PRM) for step-level supervision, achieving strong performance comparable to models like o1-mini. In contrast, rStar2-Agent adopts GRPO with verifiable answer-only rewards.
I’m curious about the motivation behind this shift. Was the primary reason to avoid the risk of reward hacking from PRMs, as you mention in the paper? My intuition is that a well-trained, representative PRM could still be quite effective—especially since it provides fine-grained intermediate feedback that outcome-only rewards may miss.
Did you observe any specific reward hacking issues when using the PRM in rStar-Math? While GRPO-RoC addresses some limitations of outcome-only setups, I wonder whether there was a specific reason for replacing the seemingly effective PRM with this new approach.
Thank you again for your insights and for sharing the code and experiments.