binary search for batch size by NoahOksuz · Pull Request #362 · p-e-w/heretic

NoahOksuz · 2026-06-07T13:51:47Z

When batch_size = 0 (auto), Heretic probes batch sizes exponentially and picks the size with the best measured throughput (tokens/s). Previously, if a probe hit CUDA OOM, the search stopped and kept the last successful power of two, even when a larger non-power-of-two batch would fit and run faster. I also saw #248, but I think this is a good solution for now.

This PR refines the search: after an OOM, it binary-searches between the last successful size and the first failing size, still choosing the batch size with the highest measured tokens/s.

Determining optimal batch size...
* Trying batch size 1... Ok (20 tokens/s)
* Trying batch size 2... Ok (38 tokens/s)
* ...
* Trying batch size 128... Ok (814 tokens/s)
* Trying batch size 256... Failed (CUDA out of memory...)
* Trying batch size 192... Ok (902 tokens/s)
* Trying batch size 224... Ok (905 tokens/s)
* Trying batch size 240... Ok (914 tokens/s)
* Trying batch size 248... Failed (CUDA out of memory...)
* Chosen batch size: 240
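
The two-phase search described above can be sketched roughly as follows. This is not the code from the PR: `find_batch_size`, the `measure` callback, and the `min_step` stopping gap are hypothetical names, and the gap of 8 is only a guess inferred from the sample log, where the search stops after 240 succeeds and 248 fails.

```python
from typing import Callable, Dict, Optional


def find_batch_size(
    measure: Callable[[int], Optional[float]],
    max_batch_size: int,
    min_step: int = 8,  # assumed stopping gap, guessed from the sample log
) -> int:
    """Pick the batch size with the best measured throughput.

    `measure(size)` is a hypothetical callback that returns tokens/s,
    or None if that batch size ran out of memory.
    """
    throughput: Dict[int, float] = {}

    # Phase 1: exponential probe (1, 2, 4, ...) until OOM or the cap.
    last_ok, first_fail = 0, None
    size = 1
    while size <= max_batch_size:
        tps = measure(size)
        if tps is None:
            first_fail = size
            break
        throughput[size] = tps
        last_ok = size
        size *= 2

    # Phase 2: binary search between the last success and the first failure.
    if first_fail is not None:
        lo, hi = last_ok, first_fail
        while hi - lo > min_step:
            mid = (lo + hi) // 2
            tps = measure(mid)
            if tps is None:
                hi = mid
            else:
                throughput[mid] = tps
                lo = mid

    if not throughput:
        raise RuntimeError("no batch size fit in memory")

    # Choose by measured throughput, not by size.
    return max(throughput, key=throughput.get)
```

With a fake `measure` that fits anything up to 245 prompts, this probes 1 through 256, then 192, 224, 240 and 248, which matches the order in the log above.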

Refine auto batch size with binary search after OOM. After the exponential probe hits CUDA OOM, binary-search between the last successful and first failed size to find higher-throughput batch sizes, still picking the size with the best measured tokens/s.

gemini-code-assist

Code Review

This pull request refactors the batch size determination logic into helper functions and introduces a binary search refinement mechanism to find the optimal batch size when an out-of-memory (OOM) error is encountered. Additionally, the default max_batch_size is increased from 128 to 1024. Feedback suggests using torch.OutOfMemoryError instead of torch.cuda.OutOfMemoryError to make OOM detection more robust across different accelerators, and adding defensive guards in _determine_batch_size to handle edge cases such as empty prompts or invalid maximum batch sizes.
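
As a rough illustration of the two review suggestions, the sketch below resolves the most general OOM exception class available and validates the probe inputs. The helper names `resolve_oom_error` and `validate_probe_inputs` are hypothetical, and the fallback to the builtin `MemoryError` when PyTorch is not installed is an assumption made only so the example runs anywhere. How the PR actually handles these edge cases is not shown in the thread.

```python
from typing import Type


def resolve_oom_error() -> Type[BaseException]:
    """Return the most general out-of-memory exception class available."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: without PyTorch, use the builtin.
        return MemoryError
    # Prefer the accelerator-agnostic name when the installed PyTorch
    # exposes it; otherwise fall back to the CUDA-specific one.
    return getattr(torch, "OutOfMemoryError", None) or torch.cuda.OutOfMemoryError


def validate_probe_inputs(num_prompts: int, max_batch_size: int) -> None:
    """Hypothetical guards for the edge cases named in the review."""
    if num_prompts <= 0:
        raise ValueError("cannot probe batch sizes without prompts")
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
```

A probe loop would then catch `resolve_oom_error()` instead of naming the CUDA-specific class directly.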

umran666 · 2026-06-07T15:47:18Z

Hey Noah, verified this on my GPU and it works great! I added the fixes for the code review comments (generalizing to OutOfMemoryError and adding defensive guards in _determine_batch_size). Feel free to pull them from my branch: https://github.com/umran666/heretic/tree/fix/binary-search-batch-size

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

p-e-w · 2026-06-08T12:21:44Z

The current approach is not ideal, but a binary search is not the right solution either IMO, at least not in this form. Here's the crux of the problem:

We aren't actually interested in finding the largest possible batch size. There is another constraint, and that is the size of the prompt dataset(s).

By default, the evaluation datasets contain 100 prompts each. That means choosing any batch size greater than 100 doesn't make sense, because we'll never run more than 100 prompts anyway.

In fact, the only batch sizes that we should consider are ceil(len(prompts) / n) for n = 1, 2, 3, .... Choosing any other size will always be dominated by the next smaller member of this set, because you need the same number of batches but are less robust against VRAM fluctuations.
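
A minimal sketch of that candidate set, with hypothetical helper names (`candidate_batch_sizes`, `num_batches`) that do not come from the Heretic codebase:

```python
from typing import List


def num_batches(num_prompts: int, batch_size: int) -> int:
    """Number of batches needed to process all prompts (integer ceil)."""
    return -(-num_prompts // batch_size)


def candidate_batch_sizes(num_prompts: int) -> List[int]:
    """Distinct values of ceil(num_prompts / n) for n = 1, 2, 3, ..."""
    sizes: List[int] = []
    for n in range(1, num_prompts + 1):
        size = -(-num_prompts // n)
        # The sequence is non-increasing, so comparing with the last
        # entry is enough to drop repeats.
        if not sizes or size != sizes[-1]:
            sizes.append(size)
    return sizes


sizes = candidate_batch_sizes(100)
print(sizes)
# [100, 50, 34, 25, 20, 17, 15, 13, 12, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# A batch size of 60 needs 2 batches, exactly like the candidate 50,
# so 60 only costs extra VRAM headroom.
print(num_batches(100, 60), num_batches(100, 50))  # 2 2
```

For 100 prompts this leaves only 19 sizes worth trying, and every other size between 1 and 100 needs the same number of batches as the largest candidate that does not exceed it.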

The correct search strategy is to start with a batch size of len(prompts), then proceed with len(prompts) / 2 if there is an OOM, followed by len(prompts) / 3 etc. But as noted in #248, this needs to happen on each call to generate (with caching), because otherwise, the locked-in batch size from the start of the run leads to a crash if some other process consumes VRAM in between.
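
One way to read this strategy is sketched below. Everything here is illustrative: `BatchPlan`, `generate_all`, and the `run_batch` callback are hypothetical names, the builtin `MemoryError` stands in for the framework's OOM exception, and "caching" is interpreted as remembering the last divisor that worked so later calls start from there.

```python
from typing import Callable, List, Sequence


class BatchPlan:
    """Remembers the number of batches (divisor n) that last succeeded."""

    def __init__(self) -> None:
        self.divisor = 1


def generate_all(
    prompts: Sequence[str],
    run_batch: Callable[[Sequence[str]], List[str]],
    plan: BatchPlan,
) -> List[str]:
    """Run all prompts, shrinking the batch size on OOM on every call."""
    if not prompts:
        return []
    total = len(prompts)
    n = plan.divisor
    while True:
        size = -(-total // n)  # ceil(total / n)
        try:
            outputs: List[str] = []
            for start in range(0, total, size):
                outputs.extend(run_batch(prompts[start:start + size]))
            plan.divisor = n  # cache what worked for the next call
            return outputs
        except MemoryError:  # stand-in for the framework's OOM error
            if size == 1:
                raise  # cannot shrink any further
            # Advance n until the batch size actually gets smaller.
            while -(-total // n) == size:
                n += 1
```

Because the fallback lives inside the call rather than in a one-time probe, a batch size that worked earlier can still be reduced later if another process takes VRAM in between. This sketch never grows the batch size again once it has shrunk; whether that is desirable is a design choice the thread leaves open.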

NoahOksuz added 2 commits June 7, 2026 13:38

Update main.py

a5adce7

Refine auto batch size with binary search after OOM. After the exponential probe hits CUDA OOM, binary-search between the last successful and first failed size to find higher-throughput batch sizes, still picking the size with the best measured tokens/s.

change max batch

6630dde

gemini-code-assist Bot reviewed Jun 7, 2026

Comment thread src/heretic/main.py

Comment thread src/heretic/main.py

NoahOksuz and others added 2 commits June 7, 2026 16:53

Update src/heretic/main.py

e96ca40

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/heretic/main.py

1e7f1e1

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

NoahOksuz wants to merge 4 commits into p-e-w:master from NoahOksuz:optibatchsize

3 participants

Conversation

NoahOksuz commented Jun 7, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

umran666 commented Jun 7, 2026

Uh oh!

p-e-w commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants