Summary
We are comparing nvidia/Alpamayo-R1-10B PyTorch/HuggingFace prefill against TensorRT-Edge-LLM prefill for the same multimodal sample.
After fixing a separate visual preprocessing issue on our side, we now have:
- matched visual regime
- exact-match `image_grid_thw`
- nearly matched `pixel_values` / visual outputs
- nearly matched `inputs_embeds`
- nearly matched `deepstack_embeds`
- exact-match `position_ids`
- exact-match `rope_deltas`
However, the text prefill hidden states and KV cache still diverge.
The main finding is:
- the first meaningful divergence appears at the `q_proj` / `k_proj` / `v_proj` outputs
- `q_norm` / `k_norm` magnify that drift
- the error in the attention output itself is still small in early layers
- the first major amplification happens at the layer 16 `mlp_down`
This suggests the issue is not:
- visual preprocessing anymore
- `position_ids` / `rope_deltas`
- runtime-capture cropping
- decode/generation length
- attention plugin output itself as the first source of error
Instead, it looks like a text prefill runtime mismatch in the TRT path, and the earliest visible source is the projection GEMM path before the attention plugin.
This issue was reproduced on the FP16 path (not FP8 KV cache / quantized runtime).
Environment
Precision / engine setup
- LLM path under test: FP16
- Visual path under test: FP16
- KV cache under test: FP16
- ONNX Runtime comparison used FP16 ONNX subgraph
- This report is not for FP8 KV cache / quantized runtime
Hardware
- Platform: NVIDIA Thor
- GPU:
- `nvidia-smi`:
  - Driver Version: 580.00
  - CUDA Version: 13.0
OS
- Ubuntu: 24.04.2 LTS
- Kernel: 6.8.12-tegra
- Arch: aarch64
CUDA / cuDNN / TensorRT
- `nvcc --version`: CUDA 13.0
- Build: `cuda_13.0.r13.0/compiler.36260728_0`
- cuDNN: `libcudnn9-cuda-13 9.12.0.46-1`
- TensorRT packages:
  - `libnvinfer10 10.13.2.6-1+cuda13.0`
  - `libnvinfer-plugin10 10.13.2.6-1+cuda13.0`
  - `libnvonnxparsers10 10.13.2.6-1+cuda13.0`
Python stack
- Python: 3.12.3
- PyTorch: 2.10.0+cu130
- Transformers: 4.57.1
- ONNX: 1.20.1
- ONNX Runtime: 1.22.0
- Available providers in this environment:
  - `CPUExecutionProvider`
  - `AzureExecutionProvider`
  - `CUDAExecutionProvider` is not available here
- NumPy: 1.26.4
- Pillow: 11.3.0
TensorRT-Edge-LLM
- commit: `8fe7fe1`
Model / setup
Model
- `nvidia/Alpamayo-R1-10B`
Input
- same 16 images
- same `ego_history_xyz.npy`
- same `ego_history_rot.npy`
Shapes
- prefill sequence length: 3006
- hidden size: 4096
- num heads: 32
- num KV heads: 8
- head dim: 128
Important matched metadata
- `position_ids`: exact match
- `rope_deltas`: exact match
- prompt prefill length: exact match
What we already ruled out
1. Visual preprocessing mismatch
We previously found a batched visual preprocessing issue caused by:
- shared pinned host resize buffer reuse
- async H2D copy overlap
After fixing that, visual outputs became nearly aligned with local PyTorch.
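For context, a minimal sketch of the hazard pattern and a safe variant, assuming PyTorch-style pinned staging buffers and CUDA streams; the buffer names, shapes, and stream setup below are illustrative, not the actual TensorRT-Edge-LLM preprocessing code:

```python
import torch

stream = torch.cuda.Stream()
# Single shared pinned staging buffer (shape/dtype are illustrative only).
pinned = torch.empty((3, 448, 448), dtype=torch.float16, pin_memory=True)

def upload_racy(resized_images):
    """Hazard pattern: the shared pinned buffer is overwritten by the next image
    while the previous async H2D copy may still be reading from it."""
    out = []
    for img in resized_images:                      # img: CPU float16 tensor
        pinned.copy_(img)                           # reuse the shared staging buffer
        with torch.cuda.stream(stream):
            d = torch.empty_like(pinned, device="cuda")
            d.copy_(pinned, non_blocking=True)      # async H2D copy on the side stream
        out.append(d)
    return out

def upload_safe(resized_images):
    """Safe variant: one staging buffer per image, and a stream sync before reuse."""
    out, stagings = [], []
    for img in resized_images:
        staging = img.pin_memory()                  # fresh pinned buffer per image
        stagings.append(staging)                    # keep host buffers alive until copies finish
        with torch.cuda.stream(stream):
            d = torch.empty_like(staging, device="cuda")
            d.copy_(staging, non_blocking=True)
        out.append(d)
    torch.cuda.current_stream().wait_stream(stream) # order consumers after the copies
    return out
```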
2. Positional metadata mismatch
We directly checked:
- `position_ids`
- `rope_deltas`
- RMSNorm epsilon
They all match.
3. KV export / runtime-capture packaging
We isolated pure prefill (no decode dependence) and still saw the mismatch, so this is not caused by KV export packaging or the `<traj_future_start>` crop logic.
4. Attention plugin output as the first source of error
For layer 0, we extracted a pre-plugin ONNX subgraph and compared PyTorch, ONNX Runtime, and TRT outputs (details in section E below).
We also re-derived the attention output from the same q/k/v tensors in Python.
This shows the attention output itself is not the first place where the mismatch appears.
Main evidence
A. Prefill final hidden state is already different
We compared TRT prefill final hidden against PyTorch post-norm final hidden.
Observed:
- mean abs diff: 0.4551
- max abs diff: 76.6875
- cosine: 0.6764
So the mismatch exists before KV export.
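All diff numbers in this report follow the same convention; a minimal sketch of how they can be reproduced, assuming the usual definitions of mean/max elementwise absolute difference and cosine similarity over the flattened tensors (the dump file names are placeholders):

```python
import numpy as np

def compare(a: np.ndarray, b: np.ndarray) -> dict:
    """Mean/max absolute difference and cosine similarity over flattened tensors."""
    a = a.astype(np.float32).ravel()
    b = b.astype(np.float32).ravel()
    diff = np.abs(a - b)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return {"mean_abs": float(diff.mean()),
            "max_abs": float(diff.max()),
            "cosine": cos}

# Usage (hypothetical dump names), e.g. for the final prefill hidden states:
# pt_hidden  = np.load("pt_final_hidden.npy")    # [seq, hidden]
# trt_hidden = np.load("trt_final_hidden.npy")
# print(compare(pt_hidden, trt_hidden))
```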
B. Prefill KV cache differs even in prompt-only prefill
With prompt-only prefill (no generated continuation dependence):
- overall mean abs diff: 0.021779
- max abs diff: 28.46875
- cosine: 0.998823
Layer trend:
- deeper layers diverge more
- V diverges slightly more than K
Worst layers:
- L35 = 0.1237
- L34 = 0.1051
- L33 = 0.0831
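A minimal sketch of the per-layer K/V comparison loop behind these numbers; the dump paths, shapes, and the 36-layer count are assumptions (the layer count is inferred from the worst-layer indices above):

```python
import numpy as np

# Hypothetical dump layout: {pt,trt}_kv/layer{i}_{k,v}.npy per layer,
# each with shape [num_kv_heads, seq, head_dim].
num_layers = 36  # assumption: L0..L35

per_layer = {}
for i in range(num_layers):
    diffs = {}
    for name in ("k", "v"):
        pt = np.load(f"pt_kv/layer{i}_{name}.npy").astype(np.float32)
        trt = np.load(f"trt_kv/layer{i}_{name}.npy").astype(np.float32)
        diffs[name] = float(np.abs(pt - trt).mean())
    per_layer[i] = diffs

# Rank layers by combined K/V drift to surface the worst offenders.
worst = sorted(per_layer.items(), key=lambda kv: kv[1]["k"] + kv[1]["v"], reverse=True)
for i, d in worst[:3]:
    print(f"L{i}: k={d['k']:.4f} v={d['v']:.4f}")
```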
C. Early-layer detailed inspection (layers 0-8)
Mean abs diff values:
Layer 0
- `input_ln`: 0.000155
- `q_proj`: 0.001625
- `k_proj`: 0.001691
- `v_proj`: 0.001396
- `q_norm`: 0.041313
- `k_norm`: 0.062339
- `attn_reshape`: 0.000683
- `hidden`: 0.003405
Layer 8
- `input_ln`: 0.002450
- `q_proj`: 0.007700
- `k_proj`: 0.008175
- `v_proj`: 0.007076
- `q_norm`: 0.070408
- `k_norm`: 0.071364
- `attn_reshape`: 0.001590
- `hidden`: 0.022988
Interpretation:
- drift starts small
- it is already visible at `q_proj`/`k_proj`/`v_proj`
- `q_norm`/`k_norm` make it more visible
- `attn_reshape` is still small in early layers
This does not look like “attention softmax/matmul explodes immediately”.
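For reference, a minimal sketch of the per-head RMSNorm we assume `q_norm` / `k_norm` refer to (Qwen3-style QK-norm over `head_dim`); the epsilon value and dump shapes are assumptions:

```python
import torch

def per_head_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Assumed Qwen3-style per-head RMSNorm: x is [..., num_heads, head_dim],
    weight is [head_dim]; each head is normalized over its own head_dim."""
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    x_normed = x.float() * torch.rsqrt(variance + eps)
    return (x_normed * weight.float()).to(x.dtype)

# Reproducing the q_norm row from a dumped q_proj output (hypothetical names/shapes):
# q = q_proj_out.view(seq_len, 32, 128)            # 32 heads x head_dim 128
# q_normed = per_head_rmsnorm(q, q_norm_weight)    # compare against the other runtime's dump
```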
D. Layer 16 is the first big amplification point
Mean abs diff values for layer 16:
- `input_ln`: 0.018001
- `q_proj`: 0.038064
- `k_proj`: 0.038681
- `v_proj`: 0.036913
- `q_norm`: 0.092055
- `k_norm`: 0.097936
- `attn_reshape`: 0.010392
- `o_proj`: 0.024097
- `post_attn_hidden`: 0.069224
- `mlp_mul`: 0.017146
- `mlp_down`: 0.148855
- `hidden`: 0.181274
Interpretation:
- the first large amplification is not at the attention output
- the first large amplification is at the layer 16 `mlp_down`
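For reference on where these capture points sit, a minimal sketch of the gated MLP we assume this backbone uses (Qwen-style SwiGLU); `mlp_mul` corresponds to the elementwise gate*up product and `mlp_down` to the `down_proj` output (weight layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def gated_mlp(hidden, gate_w, up_w, down_w):
    """Assumed Qwen-style SwiGLU MLP. hidden: [seq, hidden_size];
    gate_w/up_w: [intermediate, hidden_size]; down_w: [hidden_size, intermediate]."""
    gate = F.silu(hidden @ gate_w.T)   # gate_proj + activation
    up = hidden @ up_w.T               # up_proj
    mlp_mul = gate * up                # capture point: mlp_mul
    mlp_down = mlp_mul @ down_w.T      # capture point: mlp_down (down_proj output)
    return mlp_mul, mlp_down
```

One plausible reading of the jump at `mlp_down` is that `down_proj` reduces over a much wider intermediate dimension than the attention projections, so per-element drift in `mlp_mul` can accumulate inside a single GEMM reduction; this is an interpretation, not a confirmed root cause.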
E. 3-way comparison: PyTorch vs ORT vs TRT at layer 0 pre-plugin
For isolation, we extracted a layer-0 pre-plugin ONNX subgraph from the raw FP16 LLM ONNX and compared:
| tensor | ORT vs PyTorch | ORT vs TRT | PyTorch vs TRT |
|---|---|---|---|
| `input_ln` | 0.000155 | 1.44e-06 | - |
| `q_proj` | 0.000525 | 0.001495 | 0.001625 |
| `k_proj` | 0.000505 | 0.001581 | 0.001691 |
| `v_proj` | 0.000275 | 0.001358 | 0.001396 |
| `q_norm` | 0.012651 | 0.038417 | 0.041313 |
| `k_norm` | 0.018994 | 0.058296 | 0.062339 |
Interpretation:
- `input_ln` is effectively aligned
- the first nontrivial mismatch shows up at projection outputs
- ORT is consistently closer to PyTorch than TRT is
- this suggests the problem is not primarily in the exported graph definition
- it points more toward TRT runtime / GEMM path / prefill execution behavior
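For completeness, a minimal sketch of how such a pre-plugin subgraph comparison can be reproduced with `onnx.utils.extract_model` and ONNX Runtime; the tensor names and file paths below are placeholders, not the actual node names in our export:

```python
import numpy as np
import onnx.utils
import onnxruntime as ort

# Cut the layer-0 pre-plugin subgraph out of the full FP16 LLM ONNX.
# Input/output tensor names are placeholders for the real graph names.
onnx.utils.extract_model(
    "llm_fp16.onnx",
    "layer0_preplugin.onnx",
    input_names=["inputs_embeds"],
    output_names=["layer0_q_proj_out", "layer0_k_proj_out", "layer0_v_proj_out"],
)

sess = ort.InferenceSession("layer0_preplugin.onnx",
                            providers=["CPUExecutionProvider"])
inputs_embeds = np.load("inputs_embeds.npy").astype(np.float16)   # [1, 3006, 4096]
ort_q, ort_k, ort_v = sess.run(None, {"inputs_embeds": inputs_embeds})

# Compare against the runtime-captured TRT dump for the same tensor.
trt_q = np.load("trt_layer0_q_proj.npy")
print(np.abs(ort_q.astype(np.float32) - trt_q.astype(np.float32)).mean())
```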
F. RoPE is not introducing the mismatch
We also compared RoPE-applied q/k reconstructed from the same tensors.
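A minimal sketch of the rotary application used for this re-derivation, assuming the standard rotate-half formulation; the cos/sin tables stand in for whatever the model actually builds from the (matched) `position_ids`, and the shapes are assumptions:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    """Assumed rotate-half RoPE. q: [heads, seq, head_dim], k: [kv_heads, seq, head_dim],
    cos/sin: [seq, head_dim] derived from the matched position_ids."""
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```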
Post-RoPE q:
- ORT vs PyTorch: 0.012791
- ORT vs TRT: 0.038812
- PyTorch vs TRT: 0.041740
Post-RoPE k:
- ORT vs PyTorch: 0.019135
- ORT vs TRT: 0.058708
- PyTorch vs TRT: 0.062777
Interpretation:
- the mismatch pattern is already present before RoPE
- RoPE preserves the same mismatch ordering
- this does not look like a `position_ids` / rope-index bug
G. Attention output can be re-derived consistently
Using the same q/k/v tensors, we re-derived the layer-0 attention output in Python.
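A minimal sketch of that re-derivation, assuming causal GQA attention with the shapes listed above (32 query heads, 8 KV heads, head dim 128); the exact dump layout is an assumption:

```python
import torch
import torch.nn.functional as F

def rederive_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Re-derive the layer-0 attention output from dumped post-RoPE q/k/v.
    Assumed shapes: q [1, 32, seq, 128], k/v [1, 8, seq, 128]; causal mask,
    default softmax scale 1/sqrt(head_dim)."""
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)   # expand KV heads to match query heads (GQA)
    v = v.repeat_interleave(groups, dim=1)
    out = F.scaled_dot_product_attention(q.float(), k.float(), v.float(), is_causal=True)
    # Flatten back to [seq, hidden] to compare against the dumped attn_reshape tensor.
    return out.transpose(1, 2).reshape(q.shape[2], -1)
```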
Derived attention output:
- ORT vs PyTorch: 0.000124
- ORT vs TRT: 0.000662
- PyTorch vs TRT: 0.000682
Derived vs dumped `attn_reshape`:
- PyTorch: 1.91e-05
- TRT: 1.13e-05
Interpretation:
- the dumped attention output is consistent with the q/k/v tensors
- the attention plugin output itself is not the first place where the mismatch is created
Current conclusion
At this point, the strongest interpretation is:
- TensorRT-Edge-LLM and PyTorch receive effectively matched multimodal prefill inputs.
- The first meaningful drift appears at the `q_proj`/`k_proj`/`v_proj` outputs.
- `q_norm`/`k_norm` magnify that drift.
- Early attention output is still relatively stable.
- The first large amplification happens at the layer 16 `mlp_down`.
- Therefore, the issue appears to be in the TRT text prefill runtime path, with the earliest visible source at the projection GEMM path before the attention plugin.
Why this matters
This mismatch is large enough to change downstream expert/FM behavior.
Even when the handoff metadata matches:
- same `offset`
- same `position_ids`
- same `attention_mask`
- same `rope_deltas`
the resulting prompt KV cache still differs enough to affect replay and downstream outputs.
Questions / requested guidance
We would like guidance on the following:
- Is there any known discrepancy between PyTorch/HF and TensorRT-Edge-LLM in the prefill projection GEMM path for Qwen3-VL / Alpamayo-style text backbones?
- Are there known TRT tactic / accumulation / precision differences in prefill GEMMs that could explain:
  - `input_ln` being aligned
  - but `q_proj`/`k_proj`/`v_proj` already diverging?
- Is there a recommended way to force a more reference-like prefill path for debugging?
- Is there any known issue in the prefill path where fused runtime behavior differs from ONNX Runtime / PyTorch before the attention plugin?
Repro note
This was reproduced with:
- same sample
- same prompt length
- same multimodal alignment
- prompt-only prefill
- no dependence on decode continuation length
So this appears to be a true prefill runtime mismatch, not a packaging or stopping-condition issue.