[RyzenAI 1.7.1 / OGA 0.11.2] NaN logits at 3rd decode step — Qwen-2.5-1.5B-Instruct NPU 16K
Target : github.com/amd/RyzenAI-SW / issues
Severity : High — model unusable for generation >2 tokens
Component : onnxruntime-genai VitisAI EP / ryzenai-dynamic-dispatch
Summary
When generating text with amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K (OGA 0.11.2, NPU provider), all 151,936 logits become NaN at the 3rd decode step (after 2 correct tokens). All subsequent tokens are also NaN garbage. The first 2 generated tokens are always semantically correct. The model is completely unusable for any task requiring >2 output tokens.
Observed: "What is 2+2?" → "2+!!!!!!" (garbage from token 3)
Expected: "2+2=4" or "Two plus two equals four"
Environment
| Field | Value |
|---|---|
| Machine | AMD Ryzen AI 9 HX 370 |
| NPU | AMD XDNA2 (PCI VEN_1022&DEV_17F0) |
| OS | Windows 11 Pro Build 26200 |
| NPU Driver | 32.0.203.329 (dated 04/12/2025) |
| XRT Status | ✅ PASS (DPU_2_ELF loading verified) |
| Python | 3.12.10 |
| onnxruntime-genai-directml-ryzenai | 0.11.2 |
| onnxruntime-vitisai | 1.23.3 |
| onnxruntime_providers_ryzenai | 0.11.1 |
| ryzenai-dynamic-dispatch | 1.7.1 |
| AMD RyzenAI-SW release | v1.7.1 (2026-03-27) |
| Model | amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K |
| Snapshot hash | d83d847501eabe6301fcad8066363ffef775395f |
Prerequisites confirmed working
✅ 1 — DPU_2_ELF loads correctly
The NPU kernel DPU_2_ELF loads without error. Previous investigation confirmed:
- XRT PASS with ryzenai-dynamic-dispatch 1.7.1
- No "failed to load DPU" or "incompatible ELF" errors in ryzenai-server logs
- The model loads, initializes, and warms up normally
✅ 2 — Chat template applied (ChatML format)
A tokenizer_config.json overlay injects the correct ChatML template. Verified with tokenizer.apply_chat_template():
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
✅ 3 — First 2 tokens semantically correct
Step 1 logits (after token 1 = "2") are fully finite with expected distribution:
Top logits at step 1:
'+' (id=10): 26.000
' +' (id=489): 25.375
' plus' (id=5646): 23.625
'2' (id=17): 22.750
'plus' (id=3767): 20.875
Token 2 = "+" — semantically correct continuation of "2".
Bug — NaN logits at step 2
Step-by-step logit trace
import onnxruntime_genai as og
import numpy as np

model = og.Model(overlay_path)  # NPU provider (RyzenAI EP)
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=32)  # search options go through set_search_options()
prompt = apply_chat_template("What is 2+2?")  # ChatML format
tokens = tokenizer.encode(prompt)
gen = og.Generator(model, params)
gen.append_tokens(tokens)  # the prompt is fed once, via append_tokens()
| Step | generate_next_token() | get_next_tokens()[0] | Text | get_logits() NaN count |
|---|---|---|---|---|
| 1 | ✅ | 17 | "2" | 0 / 151936 (FINITE) |
| 2 | ✅ | 10 | "+" | 151936 / 151936 (ALL NaN) |
| 3 | ❌ | 0 | "!" | 151936 / 151936 (ALL NaN) |
| 4 | ❌ | 0 | "!" | 151936 / 151936 (ALL NaN) |
| 5 | ❌ | 0 | "!" | 151936 / 151936 (ALL NaN) |
NaN onset: at step 2, all 151,936 logits are NaN.
Once NaN appears, every subsequent step is also all NaN (consistent with a corrupted KV cache).
Token id=0 is selected by argmax over the all-NaN logits, so the output is consistently "!" (greedy fallback).
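The id=0 selection can be illustrated with a minimal sketch. It assumes the runtime's greedy search behaves like a plain strict greater-than scan; that is an assumption, since the actual OGA implementation was not inspected. The function name and the toy logit rows are hypothetical.

```python
import math


def greedy_argmax(logits):
    """Return the index of the largest value using a strict '>' scan."""
    best_index = 0
    best_value = logits[0]
    for i, value in enumerate(logits):
        # Any comparison involving NaN is False, so an all-NaN row never updates.
        if value > best_value:
            best_index, best_value = i, value
    return best_index


finite_row = [22.75, 26.0, 25.375]  # toy row, not the real 151,936-wide logits
nan_row = [math.nan] * 8            # stand-in for an all-NaN logit row

print(greedy_argmax(finite_row))  # 1
print(greedy_argmax(nan_row))     # 0
```

Under that assumption, an all-NaN row always resolves to index 0, which matches the repeated token id 0 in the trace.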
Reproducibility
- Occurs for all tested prompts (ChatML full, short "2+2?", tiny "OK")
- Occurs regardless of prompt length (4–29 prefix tokens)
- Occurs with or without system message
- Greedy decoding (top_k=1): NaN always at step 2
- Sampling (top_k=40, top_p=1.0, temp=0.7): NaN still at step 2 in >90% of runs; occasionally a lucky token path avoids it for 1–2 more steps (not reliable)
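For anyone re-checking the sampling figure, repeated runs could be tallied along these lines. This is only a sketch: the helper name is hypothetical, and the example list holds placeholder values for illustration, not measurements from this report. Values are 0-indexed steps in the sense of nan_first_step in the reproduction script, with None meaning no NaN was seen.

```python
from collections import Counter


def summarize_nan_onset(onset_steps):
    """Return the fraction of runs per NaN onset step (None = no NaN seen)."""
    counts = Counter(onset_steps)
    total = len(onset_steps)
    return {step: count / total for step, count in counts.items()}


# Placeholder values for illustration only, not measured data.
example_runs = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
rates = summarize_nan_onset(example_runs)
print(rates[1])  # 0.9
```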
max_tokens sweep results
| max_tokens | Output | Correct tokens |
|---|---|---|
| 1 | "2" | 1/1 ✅ |
| 2 | "2+" | 2/2 ✅ |
| 3 | "2+!" | 2/3 ❌ |
| 8 | "2+!!!!!!" | 2/8 ❌ |
| 16 | "2+!!!!!!!!!!!!!!" | 2/16 ❌ |
Provider isolation
CPU provider
config = og.Config(overlay_path)
config.clear_providers() # Use CPU only
model = og.Model(config) # FAILS at load time
Error:
Load model from (...)/model.onnx failed:
Node () Op (If) [TypeInferenceError] Graph attribute inferencing failed:
Fatal error: custom op registration (VitisAI EP) required
The model.onnx contains AMD VitisAI-specific If node subgraphs that cannot be parsed by the standard ORT CPU or DirectML EP. The model is architecturally NPU-only — there is no CPU fallback available.
DirectML provider
Same TypeInferenceError — cannot load without VitisAI EP registered.
Conclusion: The NaN bug is VitisAI EP / RyzenAI-specific. It cannot be reproduced or diagnosed on CPU/DML.
Minimal reproduction script
#!/usr/bin/env python3
"""
Minimal reproduction: NaN logits at decode step 2
AMD Qwen-2.5-1.5B-Instruct RyzenAI 1.7.1 NPU 16K
Requires: onnxruntime-genai-directml-ryzenai 0.11.2 (AMD RyzenAI SDK)
"""
import json

import numpy as np
import onnxruntime_genai as og

OVERLAY_PATH = r"C:\Users\Arthur Mougin\AI-lab\overlays\qwen25-15b-rai171-npu16k-template-fixed"
CHAT_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)


def run_repro(prompt_text: str, max_new_tokens: int = 6):
    model = og.Model(OVERLAY_PATH)
    tokenizer = og.Tokenizer(model)
    prompt = CHAT_TEMPLATE.format(prompt=prompt_text)
    tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    # max_length counts prompt tokens plus generated tokens.
    params.set_search_options(max_length=len(tokens) + max_new_tokens)
    gen = og.Generator(model, params)
    gen.append_tokens(tokens)  # the prompt is fed once, via append_tokens()

    results = []
    step = 0  # 0-indexed count of generate_next_token() calls
    while not gen.is_done():
        gen.generate_next_token()
        next_tok = int(gen.get_next_tokens()[0])
        # Inspect the logits available right after this call.
        logits = gen.get_logits()
        arr = np.array(logits[0][0])
        nan_count = int(np.isnan(arr).sum())
        finite_count = int(np.isfinite(arr).sum())
        results.append({
            "step": step,
            "token_id": next_tok,
            "token_text": tokenizer.decode([next_tok]),
            "nan_count": nan_count,
            "finite_count": finite_count,
            "all_nan": nan_count == len(arr),
        })
        step += 1

    return {
        "prompt": prompt_text,
        "steps": results,
        # First step whose logits are entirely NaN, or None if never.
        "nan_first_step": next((r["step"] for r in results if r["all_nan"]), None),
        "output": "".join(r["token_text"] for r in results),
    }


if __name__ == "__main__":
    result = run_repro("What is 2+2?")
    print(json.dumps(result, indent=2))
    print(f"\nNaN first at step: {result['nan_first_step']}")
    print(f"Output: {result['output']!r}")
Expected output with fixed runtime:
{ "nan_first_step": null, "output": "2+2=4" }
Actual output:
{ "nan_first_step": 1, "output": "2+!!!!!!" }
(Step 1, 0-indexed, is the 2nd call to generate_next_token(). The logits read after that call are the ones used to select the 3rd token, hence "NaN at decode step 3" in the title.)
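The mapping between the script's 0-indexed nan_first_step and the "3rd decode step" in the title is simple arithmetic, sketched here. The helper name is hypothetical. It relies on the behaviour seen in the trace, where the logits read after call N are the ones that select token N + 1.

```python
def first_garbage_token_position(nan_first_step: int) -> int:
    """Map a 0-indexed nan_first_step to the 1-indexed position of the first bad token."""
    call_number = nan_first_step + 1  # 0-indexed step -> 1-indexed generate_next_token() call
    return call_number + 1            # logits read after call N select token N + 1


print(first_garbage_token_position(1))  # 3
```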
Questions for AMD / maintainers
- Is this a known issue with OGA 0.11.2 + ryzenai-dynamic-dispatch 1.7.1 on XDNA2?
- Has this been fixed in an internal/development build? If so, is there a beta wheel or an ETA for public release?
- What is the expected mechanism? Is it an INT4 overflow in the NPU KV cache operations after step 2? A missing numerics stabilization (e.g., missing softmax temperature clamp)?
- Is there a workaround short of max_tokens ≤ 2? For example: KV cache reset after every step? Explicit logit clamping? A different SearchOptions configuration?
- Is the Qwen-2.5-3B or Qwen-2.5-7B NPU model affected by the same bug? (We have not tested these, but if 1.5B is specifically affected, it could indicate a model-size-specific quantization issue.)
- Does this affect any other models at RyzenAI 1.7.1? (e.g., Phi-3.5, Llama-3.2)
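Until a real workaround is known, the only mitigation we can think of is to stop generating as soon as the logits go non-finite, so callers at least get no garbage tokens. A minimal sketch of such a guard follows. It is not a fix. step_fn is a hypothetical callable that, in a real script, would wrap generate_next_token(), get_next_tokens() and get_logits(). Here it replays the token ids from the trace with shortened placeholder logit rows.

```python
import math


def generate_with_nan_guard(step_fn, max_new_tokens):
    """Collect token ids until step_fn reports a non-finite logit row.

    Returns (tokens, stopped_at), where stopped_at is the 0-indexed step at
    which non-finite logits were first seen, or None if none were seen.
    """
    tokens = []
    for step in range(max_new_tokens):
        token_id, logits_row = step_fn()
        tokens.append(token_id)
        if not all(math.isfinite(x) for x in logits_row):
            # These logits would select the next token, so stop before using them.
            return tokens, step
    return tokens, None


# Replay of the trace: token ids 17 and 10 are from the report,
# the two-element logit rows are placeholders for the full vocabulary.
replay = iter([
    (17, [22.75, 26.0]),
    (10, [math.nan, math.nan]),
    (0, [math.nan, math.nan]),
])
tokens, stopped_at = generate_with_nan_guard(lambda: next(replay), max_new_tokens=3)
print(tokens, stopped_at)  # [17, 10] 1
```

The guard keeps the token returned alongside the first NaN row, because the trace shows that token ("+") is still correct. Only the tokens selected from NaN logits are dropped.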
Additional context
- Model HF page: https://huggingface.co/amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K
- AMD RyzenAI-SW: https://github.com/amd/RyzenAI-SW
- AMD RyzenAI release v1.7.1: https://github.com/amd/RyzenAI-SW/releases/tag/v1.7.1
- OGA 0.11.2 wheel source: AMD internal distribution (not on public PyPI)
- The onnxruntime-genai-directml-ryzenai 0.7.0.1 on PyPI is an older version (likely 1.5.x era)
Reported by: Windows 11 Pro, DESKTOP-4F35QQ5, RyzenAI 9 HX 370 / XDNA2 | 2026-04-28