Early stopping monitors reduce inference cost by detecting when a reasoning model has reached sufficient confidence in its answer and then terminating generation early. Instead of waiting for the model to exhaust its token budget, these monitors analyze intermediate signals (entropy, confidence, answer stability) and inject a `</think>` token to force the model to emit its final answer.
All early stopping monitors are used via `stream_completion`:

```python
from interwhen import stream_completion
from interwhen.monitors import (
    EATMonitor,
    DEERMonitor,
    KstableAnswerMCQMonitor,
    KstableAnswerGame24Monitor,
)

answer = await stream_completion(
    prompt,
    llm_server=llm_server,
    monitors=(your_monitor,),
    add_delay=False,
    termination_requires_validation=False,
    async_execution=True,
)
```

`EATMonitor` uses entropy-based early stopping. At each `\n\nWait` token boundary, the monitor computes the entropy of the next token and tracks the exponential moving average (EMA) variance. When the EMA variance drops below a threshold `delta` (after a minimum number of steps), the model's uncertainty has stabilized, meaning it is confident enough to answer, and generation is stopped by appending `</think>`.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | — | Unique identifier for this monitor |
| `model_name` | str | — | HuggingFace model used to compute token entropy (e.g., `"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"`) |
| `alpha` | float | 0.2 | EMA smoothing factor; higher values weight recent entropy more |
| `delta` | float | 0.0001 | EMA variance threshold; generation stops when variance drops below this |
| `min_steps` | int | 4 | Minimum number of steps before early stopping can trigger |
| `answer_start_token` | str | `"</think>"` | Token that marks the transition from reasoning to answer |
| `async_execution` | bool | True | Whether to run verification asynchronously |
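The EMA-variance stopping rule described above can be sketched as follows. This is an illustrative reconstruction, not the library's implementation: it assumes one standard EMA/EMA-variance recurrence, and the monitor's exact update may differ.

```python
def should_stop(entropies, alpha=0.2, delta=0.0001, min_steps=4):
    """Sketch of an EMA-variance stopping rule over per-step token
    entropies (assumed recurrence, not interwhen's internals)."""
    ema, var = entropies[0], 0.0
    for step, x in enumerate(entropies[1:], start=2):
        diff = x - ema
        ema += alpha * diff                              # EMA of entropy
        var = (1 - alpha) * (var + alpha * diff * diff)  # EMA variance
        if step >= min_steps and var < delta:
            return True, step  # uncertainty has stabilized: stop here
    return False, len(entropies)
```

With constant entropies the variance stays at zero, so under this sketch stopping triggers as soon as `min_steps` is reached; oscillating entropies keep the variance high and generation continues.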
```python
EATMonitor(
    name="EAT_monitor",
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    alpha=0.2,
    delta=0.0002,
    min_steps=4,
    answer_start_token="</think>",
    async_execution=True,
)
```

`DEERMonitor` uses answer confidence to decide when to stop. At each `\n\nWait` token boundary, the monitor appends the answer start token and sends the draft answer to an LLM server to compute the geometric mean confidence of the generated answer tokens. When confidence exceeds a threshold, generation is stopped.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | — | Unique identifier for this monitor |
| `model_name` | dict | — | LLM server configuration for confidence computation |
| `threshold` | float | 0.995 | Geometric mean confidence threshold; stops when exceeded |
| `answer_start_token` | str | `"</think>"` | Token that marks the transition from reasoning to answer |
| `async_execution` | bool | True | Whether to run verification asynchronously |
| `logprobs` | int | 20 | Number of log-probabilities to request from the server |
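As a quick illustration of the geometric-mean criterion (a sketch of the math, not the library's code): the geometric mean of per-token probabilities equals the exponential of the mean per-token log-probability, which is what a server's returned logprobs give you directly.

```python
import math

def geometric_mean_confidence(logprobs):
    # geometric mean of per-token probabilities
    # = exp(average of per-token log-probabilities)
    return math.exp(sum(logprobs) / len(logprobs))

# DEER-style decision (sketch): stop once confidence exceeds the threshold
def confident_enough(logprobs, threshold=0.80):
    return geometric_mean_confidence(logprobs) > threshold
```

For example, four tokens each at probability 0.9 give a geometric mean of 0.9, which clears a 0.80 threshold but not the default 0.995.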
```python
DEERMonitor(
    name="DEER_monitor",
    model_name=earlystop_model,
    threshold=0.80,
    answer_start_token="</think>",
    async_execution=True,
)
```

`KstableAnswerMCQMonitor` detects when the model has converged on a multiple-choice answer by monitoring its reasoning trace. When the same normalized answer appears `k` consecutive times in lines containing the word "answer", the monitor concludes the model has stabilized and triggers an early stop.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | — | Unique identifier for this monitor |
| `k` | int | — | Number of consecutive identical answers required to trigger stop |
| `options` | dict | — | MCQ option mapping, e.g., `{"A": "Yes", "B": "No", "C": "2", "D": "4"}` |
| `answer_start_token` | str | `"</think>"` | Token that marks the transition from reasoning to answer |
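The k-stable rule can be sketched as below. This is an illustrative reconstruction assuming a simple normalization to option letters; the monitor's actual parsing of the trace may differ.

```python
import re

def kstable_answer(lines, options, k=3):
    """Return the option letter once it appears k consecutive times in
    lines mentioning 'answer' (sketch, not interwhen's parser)."""
    streak, last = 0, None
    for line in lines:
        if "answer" not in line.lower():
            continue
        match = re.search(r"\b([A-Z])\b", line)  # standalone option letter
        if not match or match.group(1) not in options:
            continue
        current = match.group(1)
        streak = streak + 1 if current == last else 1
        last = current
        if streak >= k:
            return current  # answer is stable: trigger early stop
    return None
```

A switch to a different letter resets the streak, so the model must repeat the *same* answer `k` times in a row before the stop fires.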
```python
KstableAnswerMCQMonitor(
    name="maze_kstable",
    k=3,
    options=options,
    answer_start_token="</think>",
)
```

`KstableAnswerGame24Monitor` applies the same concept as the MCQ variant, but is specialized for Game of 24. It monitors the reasoning trace for arithmetic expressions and triggers an early stop when the same normalized equation appears `k` consecutive times. Optionally, it validates that the equation uses exactly the expected input numbers.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str | — | Unique identifier for this monitor |
| `k` | int | — | Number of consecutive identical equations required to trigger stop |
| `expected_nums` | list or None | None | If set, validates the equation uses exactly these numbers |
| `answer_start_token` | str | `"</think>"` | Token that marks the transition from reasoning to answer |
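The optional number-validation step could look like the sketch below (illustrative only; the monitor's actual equation normalization may differ): the left-hand side of the candidate equation must use exactly the four input numbers, each exactly once.

```python
import re

def uses_expected_nums(equation, expected_nums):
    """Check that the left-hand side of the equation uses exactly the
    expected input numbers (sketch, not interwhen's validator)."""
    lhs = equation.split("=")[0]
    found = sorted(int(n) for n in re.findall(r"\d+", lhs))
    return found == sorted(expected_nums)
```

This rejects equations that drop a number or reuse one, a common failure mode in Game of 24 traces.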
```python
KstableAnswerGame24Monitor(
    name="game24_kstable",
    k=3,
    expected_nums=nums,
    answer_start_token="</think>",
)
```

Each example script runs a full evaluation loop: loading a dataset, building prompts, running inference with an early stopping monitor, and computing accuracy and token statistics.
```shell
# Game of 24 with EAT early stopping
python ./examples/EarlyStopping/game24_example.py -n 1

# Maze MCQ with K-Stable answer monitor
python ./examples/EarlyStopping/maze_example.py -n 1

# SpatialMap MCQ with K-Stable answer monitor
python ./examples/EarlyStopping/spatialmap_example.py -n 1
```

| Argument | Description | Default |
|---|---|---|
| `-n, --num_examples` | Number of examples to run | varies by task |
| `-m, --monitor` | Enable monitor (early stopping) | True |
| `-d, --debug` | Enable debug logging | False |
| `--main_model` | Main generation model | `Qwen/Qwen3-30B-A3B-Thinking-2507` |
| `--earlystop_model` | Auxiliary model for EAT/DEER | `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` |