
Inquiry about rStar2 paper #51

@qordmlwls

Thank you for your excellent work on rStar2-Agent. I’ve been especially impressed by your contributions to LLM reasoning.

In your earlier work, rStar-Math, you used MCTS with SFT and a reward model (PRM) for step-level supervision, achieving strong performance comparable to models like o1-mini. In contrast, rStar2-Agent adopts GRPO with verifiable answer-only rewards.

I’m curious about the motivation behind this shift. Was the primary reason to avoid the risk of reward hacking from PRMs, as you mention in the paper? My intuition is that a well-trained, representative PRM could still be quite effective, especially since it provides fine-grained feedback on intermediate steps that outcome-only rewards may miss.
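
To make sure I'm comparing the right things, here is my mental model of the two reward signals as a minimal Python sketch (all names are illustrative, not from your codebase):

```python
from typing import Callable, List

def verifiable_outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Answer-only reward in the GRPO-style setup: 1.0 iff the final
    answer matches the verifiable ground truth, else 0.0. No credit is
    assigned to intermediate reasoning steps."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def prm_step_rewards(steps: List[str],
                     prm_score: Callable[[str], float]) -> List[float]:
    """Step-level supervision in the rStar-Math style: a process reward
    model (abstracted here as a callable `prm_score`) scores each
    intermediate step. This gives fine-grained feedback that the
    outcome-only reward misses, but it is a learned signal that the
    policy could, in principle, exploit."""
    return [prm_score(step) for step in steps]
```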

Did you observe any concrete reward-hacking issues with the PRM in rStar-Math? While GRPO-RoC addresses some limitations of outcome-only setups, I wonder whether there was a specific reason to replace a PRM that had already proven effective with a new method.
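
For reference, this is how I currently understand the Resample-on-Correct downsampling step; please correct me if I've misread the paper. It's only a hedged sketch of my reading, with illustrative names and thresholds:

```python
import random
from typing import List, NamedTuple

class Rollout(NamedTuple):
    correct: bool
    tool_error_rate: float  # fraction of failed tool calls; illustrative proxy for trajectory quality

def roc_downsample(oversampled: List[Rollout], group_size: int) -> List[Rollout]:
    """My reading of the RoC idea: oversample a larger pool of rollouts,
    keep incorrect rollouts without quality filtering, but select only
    the cleanest correct rollouts, so positive advantages in GRPO come
    from high-quality successful trajectories."""
    correct = [r for r in oversampled if r.correct]
    incorrect = [r for r in oversampled if not r.correct]
    k = group_size // 2  # illustrative split, not the paper's exact ratio
    # Keep the cleanest correct rollouts (fewest tool-call errors).
    keep_correct = sorted(correct, key=lambda r: r.tool_error_rate)[:k]
    # Sample incorrect rollouts uniformly, with no quality filtering.
    keep_incorrect = random.sample(incorrect, min(k, len(incorrect)))
    return keep_correct + keep_incorrect
```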

Thank you again for your insights and for sharing the code and experiments.
