
Handle reasoning budget #20297

Merged
pwilkin merged 13 commits into ggml-org:master from pwilkin:reasoning-budget on Mar 11, 2026

Conversation

@pwilkin
Member

@pwilkin pwilkin commented Mar 9, 2026

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles a single case, 0, and the only thing it does is set enable_thinking to false.

This PR adds the following flags:

  • --reasoning on (short -rea on) - enable reasoning via kwargs on model
  • --reasoning off (short -rea off) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message to be appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch. When the grammar's opening trigger fires, tokens are counted down while we also watch for a disarm trigger; if the disarm trigger does not fire before the countdown runs out, the grammar is launched.

This allows setting a real token budget limit on reasoning for models. This also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now a different behavior than --disable-reasoning. Note: while possible, this isn't recommended since if a model was trained to only work with reasoning, it might exhibit aberrant behavior (for example, try to open an extra reasoning section).
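The delayed-launch mechanism described above can be sketched as a small state machine: arm a countdown on the opening trigger, cancel it on the disarm trigger, and otherwise launch when the countdown reaches zero. This is an illustrative Python sketch, not the PR's actual C++ implementation; all names (`DelayedLaunch`, `accept`) are hypothetical.

```python
# Illustrative sketch of the "delayed launch" mechanism: once the opening
# trigger fires, a countdown starts; if the disarm trigger is not seen
# before it reaches zero, the budget grammar kicks in.

class DelayedLaunch:
    def __init__(self, budget, open_trigger="<think>", disarm_trigger="</think>"):
        self.budget = budget
        self.open_trigger = open_trigger
        self.disarm_trigger = disarm_trigger
        self.remaining = None   # countdown inactive until the open trigger fires
        self.launched = False

    def accept(self, piece):
        """Feed one decoded token string; return True once the grammar should launch."""
        if self.launched:
            return True
        if self.remaining is None:
            if piece == self.open_trigger:
                self.remaining = self.budget  # arm the countdown
            return False
        if piece == self.disarm_trigger:
            self.remaining = None  # model closed its reasoning on its own
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.launched = True  # budget exhausted: force the close marker
        return self.launched
```

With a budget of 3, the grammar launches on the third reasoning token after `<think>`; if the model emits `</think>` first, the countdown is disarmed and nothing is forced.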

Supersedes #17750

@pwilkin pwilkin requested review from ggerganov and ngxson as code owners March 9, 2026 15:07
@pwilkin
Member Author

pwilkin commented Mar 9, 2026

AI disclosure: I used Claude Opus in making most of the changes, auditing and modifying the critical code myself.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Oh, I didn't mention it in the note but this of course entails support for multiple grammars for one server task since the tool grammar is still there.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Some interesting observations from early tests (on Qwen3.5 9B Q8_0):

  • full model humaneval is around 93%
  • non-reasoning (-dre) is around 88%
  • using reasoning_budget 1000 and 400 is actually pretty similar and improves to about 89%
  • however this relies on having a --reasoning-budget-message (I used " ... reasoning budget exceeded, need to answer."). Without that, performance drops to a terrible 79%.

Member

@ggerganov ggerganov left a comment


I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

Specifically the changes in common_sampler seem disproportionately large compared to what this brings to the existing logic. Look for ways to simplify.

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

A lot of people have been requesting this, especially with the Qwen3.5 models that are seen as too verbose with their reasoning.

The changes in the sampler code are basically to the grammar sampler, since the idea is (a) to support more than one grammar simultaneously and (b) to support delayed grammar application (with token counting). Maybe this can be simplified by instead inserting another grammar sampler? Not sure how viable that would be.

@aldehir
Contributor

aldehir commented Mar 9, 2026

In my opinion, we need to think long term.

The grammar sampler is incredibly inefficient. We had to revert a change @ggerganov wanted to make that shifts the grammar to the start of the chain to support backend sampling.

Merging this will increase the reliance on the grammar sampler and make it more challenging to optimize in the future.

I'm of the opinion that a dedicated, simple, reasoning sampler that lives in common would be enough and can be used at the start of the chain--so long as it aligns with the grammar used (if any).

@ggerganov
Member

Yes, framing this as a reasoning sampler should definitely be explored.

@pwilkin pwilkin force-pushed the reasoning-budget branch from 1df4d24 to b201c80 Compare March 9, 2026 21:15
@pwilkin
Member Author

pwilkin commented Mar 9, 2026

@ggerganov @aldehir aight, reverted all the grammar changes and instead reimplemented it as a clean new reasoning parser.

I tested on cli, there is no noticeable overhead on generation (152 t/s both with and without the sampler).

@aldehir
Contributor

aldehir commented Mar 9, 2026

Unless @ggerganov thinks otherwise, I would put it under common until it reaches maturity before exposing it in the public API. I imagine there will be quite a bit of churn with all the models to support.

Other notes:

  • Is arm_immediately needed? Why not define an initial state instead?
  • Need to add a soft/hard cap to handle partial utf8 sequences. When I tested this with a grammar approach, I would often see incomplete utf8 codepoints. Instead, we could enforce a soft cap then continue until we hit a clean boundary or hit the hard cap.
  • Would like to see some vision for incorporating other reasoning budget strategies. For example, Nemotron Nano 2 (which they claim to also support in their 3-series).
3.4. Budget Control Evaluation
Nemotron Nano V2 allows users to specify how many thinking tokens the model may generate before
producing the final answer. The final answer is the portion of text typically shown to end users.
This feature is implemented by counting tokens after the model begins generating the <think>
token. Once the budget is reached, the inference setup attempts to insert a closing </think> tag.
Rather than inserting it immediately, we let the model finish its current sentence and place the
tag at the next newline. In extreme cases where no newline appears, the system enforces closure
within 500 tokens past the budget: if no newline occurs by the (budget + 500)th token, the </think>
tag is forcibly inserted.

https://arxiv.org/abs/2508.14444
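The quoted Nemotron scheme amounts to a soft cap (close at the next newline once the budget is hit) plus a hard cap at budget + 500 tokens. A minimal Python sketch of that boundary search, under those assumptions and with hypothetical names (`close_position`), might look like:

```python
# Sketch of Nemotron-style budget control: after the soft cap (the budget)
# is reached, keep scanning for a newline as a clean place to insert
# </think>; if none appears, force closure at the hard cap (budget + 500).

SOFT_SLACK = 500  # hard cap = budget + 500, per the quoted paper

def close_position(pieces, budget):
    """Return the index (in token strings) after which </think> should go."""
    for i, piece in enumerate(pieces):
        if i + 1 >= budget:
            # soft cap reached: close at the next newline boundary...
            if "\n" in piece:
                return i + 1
            # ...or force closure once the hard cap is hit
            if i + 1 >= budget + SOFT_SLACK:
                return i + 1
    return len(pieces)  # stream ended before any cap was reached
```

A real implementation would additionally need to respect token and UTF-8 codepoint boundaries, per the partial-sequence concern raised in the list above.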

Overall, I think this is a cleaner approach. It isolates the complexity rather than polluting the already complex grammar sampling logic.

Contributor

@aldehir aldehir left a comment


Need some tests around the apply/accept logic. I had some in my example, but feel free to improvise.

@CISC
Member

CISC commented Mar 9, 2026

@pwilkin
Member Author

pwilkin commented Mar 9, 2026

Funny, wonder what happened here: https://github.com/ggml-org/llama.cpp/actions/runs/22877806623/job/66373589713?pr=20297

GitHub merge running on Windows? :D

@pwilkin
Member Author

pwilkin commented Mar 10, 2026

Aight I got rid of the explosive terminology and fixed the newlines in the process :)

@pwilkin
Member Author

pwilkin commented Mar 10, 2026

Okay, UTF-8 and tests are done, think this one's ready.

@github-actions github-actions Bot added the testing (Everything test related) label Mar 10, 2026
@pwilkin pwilkin requested review from CISC, aldehir and ggerganov March 10, 2026 14:11
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
@aviallon
Contributor

@pwilkin should this be added to the recent API changes issue (#9291)?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 30, 2026

@pwilkin Wanted to report some interesting numbers: using GG's llama-eval with a fixed seed of 1234 on aime2025 with Qwen3.5 27B Q4 on a small sample of 30 questions (the same questions for both runs), plus a fixed seed on the server:

Unlimited reasoning: 3 out of 30 (10%) [Total accumulated time: 4915.2s]
reasoning budget=1000: 10 out of 30 (33%) [Total accumulated time: 4369.0s]

(message="... reasoning budget exhausted. Let's answer now.")

Stopping runaway reasoning in this case both saves time and improves the score.
I appreciate this command. 👍

@pritam-dey3

Quick feature suggestion related to llama-server. Is it possible to add an extra-body parameter in the chat completions endpoint to enable something like this?
If it is possible, then that would be helpful since I can decide on the budget for different tasks (and I already use this with openrouter). I assume since this is mostly grammar based, this should be possible. I would like to take a stab at this if @pwilkin thinks it is a reasonable thing to add to the server.

@aviallon
Contributor

aviallon commented Apr 4, 2026

Quick feature suggestion related to llama-server. Is it possible to add an extra-body parameter in the chat completions endpoint to enable something like this?
If it is possible, then that would be helpful since I can decide on the budget for different tasks (and I already use this with openrouter). I assume since this is mostly grammar based, this should be possible. I would like to take a stab at this if @pwilkin thinks it is a reasonable thing to add to the server.

You can already do that! Just add thinking_budget_tokens to your body
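Concretely, the per-request budget rides along in the request body of the chat completions endpoint. Below is a hedged sketch of building such a payload, using the `thinking_budget_tokens` field named in the comment above; note that later commits in this thread standardize on `reasoning_budget_tokens` in the sampling configuration, so check the current server docs for the authoritative field name.

```python
# Sketch of a chat completions request carrying a per-request reasoning
# budget via the extra body. Field name taken from the comment above;
# model name and budget value are illustrative.
import json

payload = {
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "Solve: 17 * 23"}],
    "thinking_budget_tokens": 400,  # per-request reasoning budget
}

body = json.dumps(payload)
# POST `body` to the server's /v1/chat/completions endpoint with an HTTP
# client of your choice; the budget then applies to this request only.
```

This lets different tasks use different budgets without restarting the server with new CLI flags.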

ezturner added a commit to ezturner/llama.cpp that referenced this pull request Apr 22, 2026
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.

Issue: ggml-org#20429
Original PR: ggml-org#20297
pwilkin pushed a commit that referenced this pull request Apr 22, 2026

Labels

examples, server, testing
