Confirmation over settings on clevr_math_sft and its RL version

Hi Tan, I am trying to reproduce your results using Qwen2-VL-2B. As written in your paper (https://arxiv.org/pdf/2503.20752):

> *Training Paradigms and Baselines* To assess the performance and generalization of different
> training strategies, we compare: (1) SFT-based methods—ANS-SFT, which fine-tunes on answer
> generation, and CoT-SFT, which uses supervised learning with CoT reasoning; and (2) RL-based
> methods—Reason-RFT-Zero, which applies RL without reasoning activation stage, and Reason-RFT,
> which uses limited CoT data for reasoning activation before RL training.

I've converted the full clevr_math_sft data (1.5K) into the specified CoT format and obtained the SFT model (stage 1). Which data should I use for RL (stage 2)? Should I convert the CoT SFT data into an RL version and train on that, or should I split the SFT data into two parts, one for SFT and one for RL (and if so, in what proportion)? And for the RL data format, should I follow Reason-RFT-Zero?
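
To make the two options concrete, here is a minimal sketch of what I have in mind. The field names (`image`, `problem`, `answer`, `cot`), the split fraction, and both helper functions are placeholders of mine, not taken from the repo or the paper.

```python
import random


def to_rl_record(record):
    """Option A: reuse an SFT sample for RL by dropping the CoT text.

    Assumes (hypothetically) that RL only needs the image, the question,
    and the final answer for a verifiable reward.
    """
    return {
        "image": record["image"],
        "problem": record["problem"],
        "answer": record["answer"],
    }


def split_sft_rl(records, sft_fraction, seed=0):
    """Option B: shuffle a copy of the data and split it into disjoint parts."""
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must be in [0, 1]")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = round(len(shuffled) * sft_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy stand-in for the converted CoT data.
    data = [
        {"image": f"img_{i}.png", "problem": "q", "answer": str(i), "cot": "..."}
        for i in range(10)
    ]
    # 0.5 is an arbitrary placeholder; the right proportion is what I am asking about.
    sft_part, rl_part = split_sft_rl(data, sft_fraction=0.5)
    rl_ready = [to_rl_record(r) for r in rl_part]
    print(len(sft_part), len(rl_ready))
```

With option A, every sample would be used in both stages; with option B, the two stages would see disjoint samples.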



