
Handle reasoning budget #20297

Merged
pwilkin merged 13 commits into ggml-org:master from pwilkin:reasoning-budget on Mar 11, 2026

Conversation

@pwilkin
Member

@pwilkin pwilkin commented Mar 9, 2026

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles a single case, 0, and the only thing it does is set enable_thinking to false.

This PR adds the following flags:

  • --reasoning on (short -rea on) - enable reasoning via kwargs on model
  • --reasoning off (short -rea off) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message to be appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch. When the grammar's opening trigger fires, tokens are counted down while we also watch for a disarm trigger; if the disarm trigger does not fire before the countdown runs out, the grammar is launched.

This allows setting a real token budget limit on reasoning for models. This also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now a different behavior than --disable-reasoning. Note: while possible, this isn't recommended since if a model was trained to only work with reasoning, it might exhibit aberrant behavior (for example, try to open an extra reasoning section).
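The delayed-launch mechanism described above can be sketched as a small state machine: arm a countdown on the opening trigger, cancel it on the disarm trigger, and otherwise launch when the countdown reaches zero. This is an illustrative Python sketch, not the PR's actual C++ implementation; all names (`DelayedLaunch`, `accept`) are hypothetical.

```python
# Illustrative sketch of the "delayed launch" mechanism: once the opening
# trigger fires, a countdown starts; if the disarm trigger is not seen
# before it reaches zero, the budget grammar kicks in.

class DelayedLaunch:
    def __init__(self, budget, open_trigger="<think>", disarm_trigger="</think>"):
        self.budget = budget
        self.open_trigger = open_trigger
        self.disarm_trigger = disarm_trigger
        self.remaining = None   # countdown inactive until the open trigger fires
        self.launched = False

    def accept(self, piece):
        """Feed one decoded token string; return True once the grammar should launch."""
        if self.launched:
            return True
        if self.remaining is None:
            if piece == self.open_trigger:
                self.remaining = self.budget  # arm the countdown
            return False
        if piece == self.disarm_trigger:
            self.remaining = None  # model closed its reasoning on its own
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.launched = True  # budget exhausted: force the close marker
        return self.launched
```

With a budget of 3, the grammar launches on the third reasoning token after `<think>`; if the model emits `</think>` first, the countdown is disarmed and nothing is forced.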

Supersedes #17750

@pwilkin pwilkin requested review from ggerganov and ngxson as code owners March 9, 2026 15:07
@pwilkin
Member Author

pwilkin commented Mar 9, 2026

AI disclosure: I used Claude Opus in making most of the changes, auditing and modifying the critical code myself.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Oh, I didn't mention it in the note but this of course entails support for multiple grammars for one server task since the tool grammar is still there.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Some interesting observations from early tests (on Qwen3.5 9B Q8_0):

  • full model humaneval is around 93%
  • non-reasoning (-dre) is around 88%
  • using reasoning_budget 1000 and 400 is actually pretty similar and improves to about 89%
  • however this relies on having a --reasoning-budget-message (I used " ... reasoning budget exceeded, need to answer."). Without that, performance drops to a terrible 79%.

Member

@ggerganov ggerganov left a comment


I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

Specifically the changes in common_sampler seem disproportionately large compared to what this brings to the existing logic. Look for ways to simplify.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

A lot of people have been requesting this, especially with the Qwen3.5 models that are seen as too verbose with their reasoning.

The changes in the sampler code are basically to the grammar sampler, since the idea is (a) to support more than one grammar simultaneously and (b) to support delayed grammar application (with token counting). Maybe this can be simplified by instead inserting another grammar sampler? Not sure how viable that would be.

@aldehir
Contributor

aldehir commented Mar 9, 2026

In my opinion, we need to think long term.

The grammar sampler is incredibly inefficient. We had to revert a change @ggerganov wanted to make that shifts the grammar to the start of the chain to support backend sampling.

Merging this will increase the reliance on the grammar sampler and make it more challenging to optimize in the future.

I'm of the opinion that a dedicated, simple, reasoning sampler that lives in common would be enough and can be used at the start of the chain--so long as it aligns with the grammar used (if any).

@ggerganov
Member

Yes, framing this as a reasoning sampler should definitely be explored.

@pwilkin pwilkin force-pushed the reasoning-budget branch from 1df4d24 to b201c80 Compare March 9, 2026 21:15
@pwilkin
Member Author

pwilkin commented Mar 9, 2026

@ggerganov @aldehir aight, reverted all the grammar changes and instead reimplemented it as a clean new reasoning parser.

I tested on cli, there is no noticeable overhead on generation (152 t/s both with and without the sampler).

@aldehir
Contributor

aldehir commented Mar 9, 2026

Unless @ggerganov thinks otherwise, I would put it under common until it reaches maturity before exposing it in the public API. I imagine there will be quite a bit of churn with all the models to support.

Other notes:

  • Is arm_immediately needed? Why not define an initial state instead?
  • Need to add a soft/hard cap to handle partial utf8 sequences. When I tested this with a grammar approach, I would often see incomplete utf8 codepoints. Instead, we could enforce a soft cap then continue until we hit a clean boundary or hit the hard cap.
  • Would like to see some vision for incorporating other reasoning budget strategies. For example, Nemotron Nano 2 (which they claim to also support in their 3-series).
3.4. Budget Control Evaluation
Nemotron Nano V2 allows users to specify how many thinking tokens the model may generate before
producing the final answer. The final answer is the portion of text typically shown to end users.
This feature is implemented by counting tokens after the model begins generating the <think>
token. Once the budget is reached, the inference setup attempts to insert a closing </think> tag.
Rather than inserting it immediately, we let the model finish its current sentence and place the
tag at the next newline. In extreme cases where no newline appears, the system enforces closure
within 500 tokens past the budget: if no newline occurs by the (budget + 500)th token, the </think>
tag is forcibly inserted.

https://arxiv.org/abs/2508.14444
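The quoted Nemotron scheme amounts to a soft cap (close at the next newline once the budget is hit) plus a hard cap at budget + 500 tokens. A minimal Python sketch of that boundary search, under those assumptions and with hypothetical names (`close_position`), might look like:

```python
# Sketch of Nemotron-style budget control: after the soft cap (the budget)
# is reached, keep scanning for a newline as a clean place to insert
# </think>; if none appears, force closure at the hard cap (budget + 500).

SOFT_SLACK = 500  # hard cap = budget + 500, per the quoted paper

def close_position(pieces, budget):
    """Return the index (in token strings) after which </think> should go."""
    for i, piece in enumerate(pieces):
        if i + 1 >= budget:
            # soft cap reached: close at the next newline boundary...
            if "\n" in piece:
                return i + 1
            # ...or force closure once the hard cap is hit
            if i + 1 >= budget + SOFT_SLACK:
                return i + 1
    return len(pieces)  # stream ended before any cap was reached
```

A real implementation would additionally need to respect token and UTF-8 codepoint boundaries, per the partial-sequence concern raised in the list above.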

Overall, I think this is a cleaner approach. It isolates the complexity rather than polluting the already complex grammar sampling logic.

Contributor

@aldehir aldehir left a comment


Need some tests around the apply/accept logic. I had some in my example, but feel free to improvise.

@CISC
Member

CISC commented Mar 9, 2026

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Funny, wonder what happened here: https://github.com/ggml-org/llama.cpp/actions/runs/22877806623/job/66373589713?pr=20297

GitHub merge running on Windows? :D

@pwilkin
Member Author

pwilkin commented Mar 10, 2026

Aight I got rid of the explosive terminology and fixed the newlines in the process :)

@pwilkin
Member Author

pwilkin commented Mar 10, 2026

Okay, UTF-8 and tests are done, think this one's ready.

@github-actions github-actions Bot added the testing (Everything test related) label Mar 10, 2026
@pwilkin pwilkin requested review from CISC, aldehir and ggerganov March 10, 2026 14:11
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
@aviallon
Contributor

@pwilkin should this be added to the recent API changes issue (#9291)?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 30, 2026

@pwilkin Wanted to report some interesting numbers: using GG's llama-eval with a fixed seed of 1234 on aime2025 with Qwen3.5 27B Q4 on a small sample of 30 questions (the same questions for both runs), plus a fixed seed on the server:

Unlimited reasoning: 3 out of 30 (10%) [Total accumulated time: 4915.2s]
reasoning budget=1000: 10 out of 30 (33%) [Total accumulated time: 4369.0s]

(message="... reasoning budget exhausted. Let's answer now.")

Stopping runaway reasoning in this case both saves time and improves the score.
I appreciate this command. 👍

@pritam-dey3

Quick feature suggestion related to llama-server. Is it possible to add an extra-body parameter in the chat completions endpoint to enable something like this?
If it is possible, then that would be helpful since I can decide on the budget for different tasks (and I already use this with openrouter). I assume since this is mostly grammar based, this should be possible. I would like to take a stab at this if @pwilkin thinks it is a reasonable thing to add to the server.

@aviallon
Contributor

aviallon commented Apr 4, 2026

Quick feature suggestion related to llama-server. Is it possible to add an extra-body parameter in the chat completions endpoint to enable something like this?
If it is possible, then that would be helpful since I can decide on the budget for different tasks (and I already use this with openrouter). I assume since this is mostly grammar based, this should be possible. I would like to take a stab at this if @pwilkin thinks it is a reasonable thing to add to the server.

You can already do that! Just add thinking_budget_tokens to your body
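Concretely, the per-request budget rides along in the request body of the chat completions endpoint. Below is a hedged sketch of building such a payload, using the `thinking_budget_tokens` field named in the comment above; note that later commits in this thread standardize on `reasoning_budget_tokens` in the sampling configuration, so check the current server docs for the authoritative field name.

```python
# Sketch of a chat completions request carrying a per-request reasoning
# budget via the extra body. Field name taken from the comment above;
# model name and budget value are illustrative.
import json

payload = {
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Solve: 17 * 23"}],
    "thinking_budget_tokens": 400,  # per-request reasoning budget
}

body = json.dumps(payload)
# POST `body` to the server's /v1/chat/completions endpoint with an HTTP
# client of your choice; the budget then applies to this request only.
```

This lets different tasks use different budgets without restarting the server with new CLI flags.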

ezturner added a commit to ezturner/llama.cpp that referenced this pull request Apr 22, 2026
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.

Issue: ggml-org#20429
Original PR: ggml-org#20297
pwilkin pushed a commit that referenced this pull request Apr 22, 2026

Labels

examples, server, testing
