# [RyzenAI 1.7.1 / OGA 0.11.2] NaN logits at 3rd decode step — Qwen-2.5-1.5B-Instruct NPU 16K

**Target** : github.com/amd/RyzenAI-SW / issues  
**Severity** : High — model unusable for generation >2 tokens  
**Component** : onnxruntime-genai VitisAI EP / ryzenai-dynamic-dispatch  

---

## Summary

When generating text with `amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K` (OGA 0.11.2, NPU provider), **all 151,936 logits become NaN at the 3rd decode step**, after 2 semantically correct tokens. Every later step also returns all-NaN logits, so the output degenerates into repeated `"!"` (token id 0). The model is unusable for any task that needs more than 2 output tokens.

**Observed**: `"What is 2+2?" → "2+!!!!!!"` (garbage from token 3)  
**Expected**: `"2+2=4"` or `"Two plus two equals four"`

---

## Environment

| Field | Value |
|-------|-------|
| Machine | AMD Ryzen AI 9 HX 370 |
| NPU | AMD XDNA2 — PCI `VEN_1022&DEV_17F0` |
| OS | Windows 11 Pro Build 26200 |
| NPU Driver | `32.0.203.329` (dated 04/12/2025) |
| XRT Status | ✅ PASS (DPU_2_ELF loading verified) |
| Python | 3.12.10 |
| `onnxruntime-genai-directml-ryzenai` | **0.11.2** |
| `onnxruntime-vitisai` | **1.23.3** |
| `onnxruntime_providers_ryzenai` | **0.11.1** |
| `ryzenai-dynamic-dispatch` | **1.7.1** |
| AMD RyzenAI-SW release | **v1.7.1** (2026-03-27) |
| Model | `amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K` |
| Snapshot hash | `d83d847501eabe6301fcad8066363ffef775395f` |

---

## Prerequisites confirmed working

### ✅ 1 — DPU_2_ELF loads correctly

The NPU kernel DPU_2_ELF loads without error. Previous investigation confirmed:
- XRT PASS with `ryzenai-dynamic-dispatch 1.7.1`  
- No "failed to load DPU" or "incompatible ELF" errors in ryzenai-server logs
- The model loads, initializes, and warms up normally

### ✅ 2 — Chat template applied (ChatML format)

A `tokenizer_config.json` overlay injects the correct ChatML template. Verified with `tokenizer.apply_chat_template()`:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
```

### ✅ 3 — First 2 tokens semantically correct

Step 1 logits (after token 1 = `"2"`) are fully finite with expected distribution:
```
Top logits at step 1:
  '+' (id=10):  26.000
  ' +' (id=489): 25.375
  ' plus' (id=5646): 23.625
  '2' (id=17): 22.750
  'plus' (id=3767): 20.875
```
Token 2 = `"+"` — semantically correct continuation of `"2"`.
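
A listing like the one above can be derived from the raw logits array with a small helper. This is only a sketch: `top_k_logits` is a hypothetical name, and the array in the usage lines is synthetic data, not model output.

```python
import numpy as np

def top_k_logits(logits, k=5):
    """Return the k largest (token_id, logit) pairs, highest first."""
    arr = np.asarray(logits, dtype=np.float32).ravel()
    # argsort is ascending, so take the last k indices and reverse them.
    top_ids = np.argsort(arr)[-k:][::-1]
    return [(int(i), float(arr[i])) for i in top_ids]

# Synthetic 6-entry "vocabulary", for illustration only.
demo = [0.5, 3.0, -1.0, 2.0, 7.5, 1.0]
top3 = top_k_logits(demo, k=3)
print(top3)
```

Note that `np.argsort` places NaN values last, so on a NaN-contaminated array this helper would report NaN entries as the "top" logits. Checking `np.isnan(arr).any()` first avoids misreading such a listing.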

---

## Bug — NaN logits at step 2

### Step-by-step logit trace

```python
import onnxruntime_genai as og
import numpy as np

# overlay_path: model folder containing the tokenizer_config.json overlay
model = og.Model(overlay_path)   # NPU provider (RyzenAI EP)
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=32)

# apply_chat_template(): helper that builds the ChatML prompt shown earlier
prompt = apply_chat_template("What is 2+2?")
tokens = tokenizer.encode(prompt)

gen = og.Generator(model, params)
gen.append_tokens(tokens)   # the prompt is supplied here only
```

| Step | `generate_next_token()` | `get_next_tokens()[0]` | Text | `get_logits()` NaN count |
|------|------------------------|----------------------|------|--------------------------|
| 1 | ✅ | 17 | `"2"` | 0 / 151936 — **FINITE** |
| 2 | ✅ | 10 | `"+"` | **151936 / 151936 — ALL NaN** |
| 3 | ❌ | 0 | `"!"` | 151936 / 151936 — ALL NaN |
| 4 | ❌ | 0 | `"!"` | 151936 / 151936 — ALL NaN |
| 5 | ❌ | 0 | `"!"` | 151936 / 151936 — ALL NaN |

```
NaN onset: step 2, where all 151,936 logits are NaN.
Once NaN appears, every later step is also all-NaN (consistent with a corrupted KV cache; not verified).
argmax over all-NaN logits returns index 0, so token id=0 ("!") is emitted at every later step.
```
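
The `id=0` selection matches how NumPy's `argmax` treats NaN. Whether OGA's internal greedy search follows the same rule is an assumption here, not something confirmed from its source. A minimal check on a synthetic all-NaN vector of the reported vocabulary size:

```python
import numpy as np

VOCAB_SIZE = 151_936  # logit count reported in the trace

all_nan = np.full(VOCAB_SIZE, np.nan, dtype=np.float32)

# NumPy treats NaN as the maximum and returns its first occurrence,
# so an all-NaN vector always yields index 0.
picked = int(np.argmax(all_nan))
nan_count = int(np.isnan(all_nan).sum())
print(picked, nan_count)
```

This is why a NaN check on the logits is a more reliable failure signal than inspecting the emitted token alone.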

### Reproducibility

- Occurs for **all tested prompts** (ChatML full, short `"2+2?"`, tiny `"OK"`)  
- Occurs regardless of prompt length (4–29 prefix tokens)  
- Occurs with or without system message  
- **Greedy decoding (top_k=1)**: NaN **always** at step 2  
- **Sampling (top_k=40, top_p=1.0, temp=0.7)**: NaN still at step 2 in >90% of runs; occasionally a lucky token path avoids it for 1–2 more steps (not reliable)

### max_tokens sweep results

| max_tokens | Output | Correct tokens |
|------------|--------|---------------|
| 1 | `"2"` | 1/1 ✅ |
| 2 | `"2+"` | 2/2 ✅ |
| 3 | `"2+!"` | 2/3 ❌ |
| 8 | `"2+!!!!!!"` | 2/8 ❌ |
| 16 | `"2+!!!!!!!!!!!!!!"` | 2/16 ❌ |

---

## Provider isolation

### CPU provider

```python
config = og.Config(overlay_path)
config.clear_providers()  # Use CPU only
model = og.Model(config)  # FAILS at load time
```

**Error:**
```
Load model from (...)/model.onnx failed:
Node () Op (If) [TypeInferenceError] Graph attribute inferencing failed:
Fatal error: custom op registration (VitisAI EP) required
```

The model.onnx contains AMD VitisAI-specific `If` node subgraphs that cannot be parsed by the standard ORT CPU or DirectML EP. **The model is architecturally NPU-only** — there is no CPU fallback available.

### DirectML provider

Same `TypeInferenceError` — cannot load without VitisAI EP registered.

**Conclusion**: Because the model cannot load without the VitisAI EP, the NaN bug cannot be reproduced, or ruled out, on CPU or DirectML. It can only be diagnosed on the VitisAI EP / RyzenAI path.

---

## Minimal reproduction script

```python
#!/usr/bin/env python3
"""
Minimal reproduction — NaN logits at decode step 2
AMD Qwen-2.5-1.5B-Instruct RyzenAI 1.7.1 NPU 16K
Requires: onnxruntime-genai-directml-ryzenai 0.11.2 (AMD RyzenAI SDK)
"""
import onnxruntime_genai as og
import numpy as np
import json

OVERLAY_PATH = r"C:\Users\Arthur Mougin\AI-lab\overlays\qwen25-15b-rai171-npu16k-template-fixed"
CHAT_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)

def run_repro(prompt_text: str, max_new_tokens: int = 6):
    model = og.Model(OVERLAY_PATH)
    tokenizer = og.Tokenizer(model)
    
    prompt = CHAT_TEMPLATE.format(prompt=prompt_text)
    tokens = tokenizer.encode(prompt)
    
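    params = og.GeneratorParams(model)
    # Total length budget = prompt tokens + new tokens; the prompt itself
    # is supplied once, through append_tokens() below.
    params.set_search_options(max_length=len(tokens) + max_new_tokens)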
    
    gen = og.Generator(model, params)
    gen.append_tokens(tokens)
    
    results = []
    step = 0
    while not gen.is_done():
        gen.generate_next_token()
        next_tok = int(gen.get_next_tokens()[0])
        logits = gen.get_logits()
        arr = np.array(logits[0][0])
        nan_count = int(np.isnan(arr).sum())
        finite_count = int(np.isfinite(arr).sum())
        results.append({
            "step": step,
            "token_id": next_tok,
            "token_text": tokenizer.decode([next_tok]),
            "nan_count": nan_count,
            "finite_count": finite_count,
            "all_nan": nan_count == len(arr),
        })
        step += 1
    
    return {
        "prompt": prompt_text,
        "steps": results,
        "nan_first_step": next((r["step"] for r in results if r["all_nan"]), None),
        "output": "".join(r["token_text"] for r in results),
    }

if __name__ == "__main__":
    result = run_repro("What is 2+2?")
    print(json.dumps(result, indent=2))
    print(f"\nNaN first at step: {result['nan_first_step']}")
    print(f"Output: {result['output']!r}")
```

**Expected output with fixed runtime:**
```json
{ "nan_first_step": null, "output": "2+2=4" }
```

**Actual output:**
```json
{ "nan_first_step": 1, "output": "2+!!!!!!" }
```
*(`nan_first_step` is 0-indexed: step 1 is the 2nd call to `generate_next_token()`. The logits read after that call are the ones that select the 3rd token, hence "NaN at the 3rd decode step" in the title.)*

---

## Questions for AMD / maintainers

1. **Is this a known issue with OGA 0.11.2 + ryzenai-dynamic-dispatch 1.7.1 on XDNA2?**

2. **Has this been fixed in an internal/development build?** If so, is there a beta wheel or an ETA for public release?

3. **What is the likely root cause?** Is it an INT4 overflow in the NPU KV cache operations after step 2, or a missing numerical stabilization step (e.g., a missing softmax temperature clamp)?

4. **Is there a workaround short of `max_tokens≤2`?** For example: KV cache reset after every step? Explicit logit clamping? A different `SearchOptions` configuration?

5. **Is the Qwen-2.5-3B or Qwen-2.5-7B NPU model affected by the same bug?** (We have not tested these, but if 1.5B is specifically affected, it could indicate a model-size-specific quantization issue.)

6. **Does this affect any other models at RyzenAI 1.7.1?** (e.g., Phi-3.5, Llama-3.2)
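
On question 4, one diagnostic stopgap is to stop generating as soon as the logits turn non-finite, instead of emitting `"!"` tokens. The sketch below is not a fix and does not call the OGA API: `step_fn` is a hypothetical callable that runs one decode step and returns `(token_id, logits)`, and the fake step function exists only to make the example runnable.

```python
import math

def generate_until_nan(step_fn, max_new_tokens):
    """Collect tokens until the logits contain a non-finite value.

    step_fn is a hypothetical callable returning (token_id, logits)
    for one decode step; it stands in for the real generator loop.
    """
    tokens = []
    for _ in range(max_new_tokens):
        token_id, logits = step_fn()
        tokens.append(token_id)
        # Assumption: these logits feed the next token, so stop here.
        if not all(math.isfinite(x) for x in logits):
            return tokens, True
    return tokens, False

def make_fake_step():
    """Fake decode step: finite logits once, NaN afterwards."""
    script = iter([(17, [26.0, 25.375]),
                   (10, [math.nan, math.nan]),
                   (0, [math.nan, math.nan])])
    return lambda: next(script)

tokens, hit_nan = generate_until_nan(make_fake_step(), max_new_tokens=8)
print(tokens, hit_nan)
```

A guard like this only limits the damage to the caller; it does nothing about whatever produces the NaN values on the NPU side.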

---

## Additional context

- Model HF page: `https://huggingface.co/amd/Qwen-2.5_1.5B_Instruct_rai_1.7.1_npu_16K`
- AMD RyzenAI-SW: `https://github.com/amd/RyzenAI-SW`
- AMD RyzenAI release v1.7.1: `https://github.com/amd/RyzenAI-SW/releases/tag/v1.7.1`
- OGA 0.11.2 wheel source: AMD internal distribution (not on public PyPI)
- The `onnxruntime-genai-directml-ryzenai 0.7.0.1` on PyPI is an **older** version (likely 1.5.x era)

---

*Reported by: Windows 11 Pro, DESKTOP-4F35QQ5, Ryzen AI 9 HX 370 / XDNA2 | 2026-04-28*


*Issue #368*
