Issue with SFT Data and Abnormal Tool Call Behavior During RL Training #52

@CoverZhao

Thank you so much for open-sourcing rStar2! While trying to reproduce the results, I encountered a few issues:

  1. Where can I find the non-reasoning SFT data?
  2. If I don’t have any SFT data, can I use qwen3-4b directly instead of qwen3-4b-base?
  3. I tried running RL directly with qwen3-4b, but after around 50 training steps the number of tool calls suddenly increases: the model emits the same tool call several times in a row within a single turn, as in the following example:

.... Therefore, the smallest positive integer n is when k=1: n = 41*1 - 18 = 23. Wait, is that correct? Let me check.

Let me test n=23. Then 9*23 -2 = 207 -2 = 205. 7*23 +3 = 161 +3 = 164. Now, gcd(205, 164). Let's compute that. 205 divided by 164 is 1 with remainder 41. Then 164 divided by 41 is 4 with remainder 0. So gcd is 41. Which is greater than 1. So yes, n=23 works. But is there a smaller n?

Wait, maybe I made a mistake in the steps. Let's check for smaller n. Let's see, since n+18 must be a multiple of 41, the smallest n is 41 -18 = 23. But perhaps there's a smaller n where n+18 is a multiple of 41? Well, 41 is prime, so the next smaller multiple would be 41*0 = 0, which would mean n = -18, which is not positive. So 23 is indeed the smallest positive integer. Therefore, the answer is 23.

But wait, let me check n=23. Yes, as shown, the gcd is 41. So the answer is 23. So the smallest positive integer is 23.

<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n while b:\n a, b = b, a % b\n return a\n\nfor n in range(1, 100):\n a = 9 * n - 2\n b = 7 * n + 3\n if gcd(a, b) > 1:\n print(n)\n break", "input": ""}}
</tool_call>
<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n while b:\n a, b = b, a % b\n return a\n\nfor n in range(1, 100):\n a = 9 * n - 2\n b = 7 * n + 3\n if gcd(a, b) > 1:\n print(n)\n break", "input": ""}}
</tool_call>

<tool_call>
{"name": "python_code_with_standard_io", "arguments": {"code": "def gcd(a, b):\n while b:\n a, b = b, a % b\n return a\n\nfor n in range(1, 100):\n a = 9 * n - 2\n b = 7 * n + 3\n if gcd(a, b) > 1:\n print(n)\n break", "input": ""}}
</tool_call>user
<tool_response>
Tool call success
stdout: 23

execution time: 0.01s
</tool_response>
assistant
The Python code executed successfully and returned the result 23. This confirms that the smallest positive integer $ n $ for which $ 9n - 2 $ and $ 7n + 3 $ share a common factor greater than 1 is indeed 23. \boxed{23}


Could this behavior be caused by using qwen3-4b directly without SFT?
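
To quantify the repetition across training steps, one option is to count identical `<tool_call>` blocks in each rollout. The following is only a minimal sketch, assuming each response is available as a decoded string that uses the `<tool_call>…</tool_call>` tag format shown in the example above. The helper names `count_tool_calls` and `max_repeats` are hypothetical and are not part of rStar2.

```python
import json
import re
from collections import Counter

# Matches the payload between <tool_call> and </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def count_tool_calls(response: str) -> Counter:
    """Count how often each distinct tool call appears in one response."""
    counts = Counter()
    for payload in TOOL_CALL_RE.findall(response):
        try:
            # Normalize key order so identical calls compare equal.
            key = json.dumps(json.loads(payload), sort_keys=True)
        except json.JSONDecodeError:
            # Fall back to the raw text when the payload is not valid JSON.
            key = payload
        counts[key] += 1
    return counts


def max_repeats(response: str) -> int:
    """Return the highest repeat count of any single tool call (0 if none)."""
    return max(count_tool_calls(response).values(), default=0)
```

Logging `max_repeats` per rollout at each step would show whether the duplication appears abruptly around step 50 or grows gradually before then.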
