Summary
We are comparing nvidia/Alpamayo-R1-10B PyTorch/HuggingFace prefill against TensorRT-Edge-LLM prefill for the same multimodal sample.
After fixing a separate visual preprocessing issue on our side, we now have:
- matched visual regime
- exact-match `image_grid_thw`
- nearly matched `pixel_values` / visual outputs
- nearly matched `inputs_embeds`
- nearly matched `deepstack_embeds`
- exact-match `position_ids`
- exact-match `rope_deltas`
However, the text prefill hidden states and KV cache still diverge.
The main finding is:
- the first meaningful divergence appears at the `q_proj` / `k_proj` / `v_proj` outputs
- `q_norm` / `k_norm` magnify that drift
- the error in the attention output itself is still small in early layers
- the first major amplification happens at the layer 16 `mlp_down`
This suggests the issue is not:
- visual preprocessing anymore
- `position_ids` / `rope_deltas`
- runtime-capture cropping
- decode/generation length
- attention plugin output itself as the first source of error
Instead, it looks like a text prefill runtime mismatch in the TRT path, and the earliest visible source is the projection GEMM path before the attention plugin.
This issue was reproduced on the FP16 path (not FP8 KV cache / quantized runtime).
Environment
Precision / engine setup
- LLM path under test: FP16
- Visual path under test: FP16
- KV cache under test: FP16
- ONNX Runtime comparison used FP16 ONNX subgraph
- This report is not for FP8 KV cache / quantized runtime
Hardware
- Platform: NVIDIA Thor
- GPU:
- `nvidia-smi`:
  - Driver Version: 580.00
  - CUDA Version: 13.0
OS
- Ubuntu: 24.04.2 LTS
- Kernel: 6.8.12-tegra
- Arch: aarch64
CUDA / cuDNN / TensorRT
- `nvcc --version`: CUDA 13.0
- Build: `cuda_13.0.r13.0/compiler.36260728_0`
- cuDNN: `libcudnn9-cuda-13 9.12.0.46-1`
- TensorRT packages:
  - `libnvinfer10 10.13.2.6-1+cuda13.0`
  - `libnvinfer-plugin10 10.13.2.6-1+cuda13.0`
  - `libnvonnxparsers10 10.13.2.6-1+cuda13.0`
Python stack
- Python: 3.12.3
- PyTorch: 2.10.0+cu130
- Transformers: 4.57.1
- ONNX: 1.20.1
- ONNX Runtime: 1.22.0
- Available providers in this environment:
  - `CPUExecutionProvider`
  - `AzureExecutionProvider`
  - `CUDAExecutionProvider` is not available here
- NumPy: 1.26.4
- Pillow: 11.3.0
TensorRT-Edge-LLM
- commit: `8fe7fe1`
Model / setup
Model
- `nvidia/Alpamayo-R1-10B`
Input
- same 16 images
- same `ego_history_xyz.npy`
- same `ego_history_rot.npy`
Shapes
- prefill sequence length: 3006
- hidden size: 4096
- num heads: 32
- num KV heads: 8
- head dim: 128
Important matched metadata
- `position_ids`: exact match
- `rope_deltas`: exact match
- prompt prefill length: exact match
What we already ruled out
1. Visual preprocessing mismatch
We previously found a batched visual preprocessing issue caused by:
- shared pinned host resize buffer reuse
- async H2D copy overlap
After fixing that, visual outputs became nearly aligned with local PyTorch.
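For context, a minimal sketch of the hazard pattern and a safe variant, assuming PyTorch-style pinned staging buffers and CUDA streams; the buffer names, shapes, and stream setup below are illustrative, not the actual TensorRT-Edge-LLM preprocessing code:

```python
import torch

stream = torch.cuda.Stream()
# Single shared pinned staging buffer (shape/dtype are illustrative only).
pinned = torch.empty((3, 448, 448), dtype=torch.float16, pin_memory=True)

def upload_racy(resized_images):
    """Hazard pattern: the shared pinned buffer is overwritten by the next image
    while the previous async H2D copy may still be reading from it."""
    out = []
    for img in resized_images:                      # img: CPU float16 tensor
        pinned.copy_(img)                           # reuse the shared staging buffer
        with torch.cuda.stream(stream):
            d = torch.empty_like(pinned, device="cuda")
            d.copy_(pinned, non_blocking=True)      # async H2D copy on the side stream
        out.append(d)
    return out

def upload_safe(resized_images):
    """Safe variant: one staging buffer per image, and a stream sync before reuse."""
    out, stagings = [], []
    for img in resized_images:
        staging = img.pin_memory()                  # fresh pinned buffer per image
        stagings.append(staging)                    # keep host buffers alive until copies finish
        with torch.cuda.stream(stream):
            d = torch.empty_like(staging, device="cuda")
            d.copy_(staging, non_blocking=True)
        out.append(d)
    torch.cuda.current_stream().wait_stream(stream) # order consumers after the copies
    return out
```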
2. Positional metadata mismatch
We directly checked:
- `position_ids`
- `rope_deltas`
- RMSNorm epsilon
They all match.
3. KV export / runtime-capture packaging
We isolated pure prefill (no decode dependence) and still saw the mismatch, so this is not caused by KV export packaging or the `<traj_future_start>` crop logic.
4. Attention plugin output as the first source of error
For layer 0, we extracted a pre-plugin ONNX subgraph and compared PyTorch, ONNX Runtime, and TRT outputs (details in section E below).
We also re-derived the attention output from the same q/k/v tensors in Python.
This shows the attention output itself is not the first place where the mismatch appears.
Main evidence
A. Prefill final hidden state is already different
We compared TRT prefill final hidden against PyTorch post-norm final hidden.
Observed:
- mean abs diff: 0.4551
- max abs diff: 76.6875
- cosine: 0.6764
So the mismatch exists before KV export.
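All diff numbers in this report follow the same convention; a minimal sketch of how they can be reproduced, assuming the usual definitions of mean/max elementwise absolute difference and cosine similarity over the flattened tensors (the dump file names are placeholders):

```python
import numpy as np

def compare(a: np.ndarray, b: np.ndarray) -> dict:
    """Mean/max absolute difference and cosine similarity over flattened tensors."""
    a = a.astype(np.float32).ravel()
    b = b.astype(np.float32).ravel()
    diff = np.abs(a - b)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return {"mean_abs": float(diff.mean()),
            "max_abs": float(diff.max()),
            "cosine": cos}

# Usage (hypothetical dump names), e.g. for the final prefill hidden states:
# pt_hidden  = np.load("pt_final_hidden.npy")    # [seq, hidden]
# trt_hidden = np.load("trt_final_hidden.npy")
# print(compare(pt_hidden, trt_hidden))
```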
B. Prefill KV cache differs even in prompt-only prefill
With prompt-only prefill (no generated continuation dependence):
- overall mean abs diff: 0.021779
- max abs diff: 28.46875
- cosine: 0.998823
Layer trend:
- deeper layers diverge more
- V diverges slightly more than K
Worst layers:
- L35 = 0.1237
- L34 = 0.1051
- L33 = 0.0831
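A minimal sketch of the per-layer K/V comparison loop behind these numbers; the dump paths, shapes, and the 36-layer count are assumptions (the layer count is inferred from the worst-layer indices above):

```python
import numpy as np

# Hypothetical dump layout: {pt,trt}_kv/layer{i}_{k,v}.npy per layer,
# each with shape [num_kv_heads, seq, head_dim].
num_layers = 36  # assumption: L0..L35

per_layer = {}
for i in range(num_layers):
    diffs = {}
    for name in ("k", "v"):
        pt = np.load(f"pt_kv/layer{i}_{name}.npy").astype(np.float32)
        trt = np.load(f"trt_kv/layer{i}_{name}.npy").astype(np.float32)
        diffs[name] = float(np.abs(pt - trt).mean())
    per_layer[i] = diffs

# Rank layers by combined K/V drift to surface the worst offenders.
worst = sorted(per_layer.items(), key=lambda kv: kv[1]["k"] + kv[1]["v"], reverse=True)
for i, d in worst[:3]:
    print(f"L{i}: k={d['k']:.4f} v={d['v']:.4f}")
```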
C. Early-layer detailed inspection (layers 0-8)
Mean abs diff values:
Layer 0
- `input_ln`: 0.000155
- `q_proj`: 0.001625
- `k_proj`: 0.001691
- `v_proj`: 0.001396
- `q_norm`: 0.041313
- `k_norm`: 0.062339
- `attn_reshape`: 0.000683
- `hidden`: 0.003405
Layer 8
- `input_ln`: 0.002450
- `q_proj`: 0.007700
- `k_proj`: 0.008175
- `v_proj`: 0.007076
- `q_norm`: 0.070408
- `k_norm`: 0.071364
- `attn_reshape`: 0.001590
- `hidden`: 0.022988
Interpretation:
- drift starts small
- it is already visible at `q_proj`/`k_proj`/`v_proj`
- `q_norm`/`k_norm` make it more visible
- `attn_reshape` is still small in early layers
This does not look like “attention softmax/matmul explodes immediately”.
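For reference, a minimal sketch of the per-head RMSNorm we assume `q_norm` / `k_norm` refer to (Qwen3-style QK-norm over `head_dim`); the epsilon value and dump shapes are assumptions:

```python
import torch

def per_head_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Assumed Qwen3-style per-head RMSNorm: x is [..., num_heads, head_dim],
    weight is [head_dim]; each head is normalized over its own head_dim."""
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    x_normed = x.float() * torch.rsqrt(variance + eps)
    return (x_normed * weight.float()).to(x.dtype)

# Reproducing the q_norm row from a dumped q_proj output (hypothetical names/shapes):
# q = q_proj_out.view(seq_len, 32, 128)            # 32 heads x head_dim 128
# q_normed = per_head_rmsnorm(q, q_norm_weight)    # compare against the other runtime's dump
```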
D. Layer 16 is the first big amplification point
Mean abs diff values for layer 16:
- `input_ln`: 0.018001
- `q_proj`: 0.038064
- `k_proj`: 0.038681
- `v_proj`: 0.036913
- `q_norm`: 0.092055
- `k_norm`: 0.097936
- `attn_reshape`: 0.010392
- `o_proj`: 0.024097
- `post_attn_hidden`: 0.069224
- `mlp_mul`: 0.017146
- `mlp_down`: 0.148855
- `hidden`: 0.181274
Interpretation:
- the first large amplification is not at the attention output
- the first large amplification is at the layer 16 `mlp_down`
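For reference on where these capture points sit, a minimal sketch of the gated MLP we assume this backbone uses (Qwen-style SwiGLU); `mlp_mul` corresponds to the elementwise gate*up product and `mlp_down` to the `down_proj` output (weight layouts are assumptions):

```python
import torch
import torch.nn.functional as F

def gated_mlp(hidden, gate_w, up_w, down_w):
    """Assumed Qwen-style SwiGLU MLP. hidden: [seq, hidden_size];
    gate_w/up_w: [intermediate, hidden_size]; down_w: [hidden_size, intermediate]."""
    gate = F.silu(hidden @ gate_w.T)   # gate_proj + activation
    up = hidden @ up_w.T               # up_proj
    mlp_mul = gate * up                # capture point: mlp_mul
    mlp_down = mlp_mul @ down_w.T      # capture point: mlp_down (down_proj output)
    return mlp_mul, mlp_down
```

One plausible reading of the jump at `mlp_down` is that `down_proj` reduces over a much wider intermediate dimension than the attention projections, so per-element drift in `mlp_mul` can accumulate inside a single GEMM reduction; this is an interpretation, not a confirmed root cause.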
E. 3-way comparison: PyTorch vs ORT vs TRT at layer 0 pre-plugin
For isolation, we extracted a layer-0 pre-plugin ONNX subgraph from the raw FP16 LLM ONNX and compared:
| tensor | ORT vs PyTorch | ORT vs TRT | PyTorch vs TRT |
|---|---|---|---|
| `input_ln` | 0.000155 | 1.44e-06 | - |
| `q_proj` | 0.000525 | 0.001495 | 0.001625 |
| `k_proj` | 0.000505 | 0.001581 | 0.001691 |
| `v_proj` | 0.000275 | 0.001358 | 0.001396 |
| `q_norm` | 0.012651 | 0.038417 | 0.041313 |
| `k_norm` | 0.018994 | 0.058296 | 0.062339 |
Interpretation:
- `input_ln` is effectively aligned
- the first nontrivial mismatch shows up at projection outputs
- ORT is consistently closer to PyTorch than TRT is
- this suggests the problem is not primarily in the exported graph definition
- it points more toward TRT runtime / GEMM path / prefill execution behavior
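For completeness, a minimal sketch of how such a pre-plugin subgraph comparison can be reproduced with `onnx.utils.extract_model` and ONNX Runtime; the tensor names and file paths below are placeholders, not the actual node names in our export:

```python
import numpy as np
import onnx.utils
import onnxruntime as ort

# Cut the layer-0 pre-plugin subgraph out of the full FP16 LLM ONNX.
# Input/output tensor names are placeholders for the real graph names.
onnx.utils.extract_model(
    "llm_fp16.onnx",
    "layer0_preplugin.onnx",
    input_names=["inputs_embeds"],
    output_names=["layer0_q_proj_out", "layer0_k_proj_out", "layer0_v_proj_out"],
)

sess = ort.InferenceSession("layer0_preplugin.onnx",
                            providers=["CPUExecutionProvider"])
inputs_embeds = np.load("inputs_embeds.npy").astype(np.float16)   # [1, 3006, 4096]
ort_q, ort_k, ort_v = sess.run(None, {"inputs_embeds": inputs_embeds})

# Compare against the runtime-captured TRT dump for the same tensor.
trt_q = np.load("trt_layer0_q_proj.npy")
print(np.abs(ort_q.astype(np.float32) - trt_q.astype(np.float32)).mean())
```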
F. RoPE is not introducing the mismatch
We also compared RoPE-applied q/k reconstructed from the same tensors.
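A minimal sketch of the rotary application used for this re-derivation, assuming the standard rotate-half formulation; the cos/sin tables stand in for whatever the model actually builds from the (matched) `position_ids`, and the shapes are assumptions:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    """Assumed rotate-half RoPE. q: [heads, seq, head_dim], k: [kv_heads, seq, head_dim],
    cos/sin: [seq, head_dim] derived from the matched position_ids."""
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```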
Post-RoPE q:
- ORT vs PyTorch: 0.012791
- ORT vs TRT: 0.038812
- PyTorch vs TRT: 0.041740
Post-RoPE k:
- ORT vs PyTorch: 0.019135
- ORT vs TRT: 0.058708
- PyTorch vs TRT: 0.062777
Interpretation:
- the mismatch pattern is already present before RoPE
- RoPE preserves the same mismatch ordering
- this does not look like a `position_ids` / rope-index bug
G. Attention output can be re-derived consistently
Using the same q/k/v tensors, we re-derived the layer-0 attention output in Python.
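A minimal sketch of that re-derivation, assuming causal GQA attention with the shapes listed above (32 query heads, 8 KV heads, head dim 128); the exact dump layout is an assumption:

```python
import torch
import torch.nn.functional as F

def rederive_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Re-derive the layer-0 attention output from dumped post-RoPE q/k/v.
    Assumed shapes: q [1, 32, seq, 128], k/v [1, 8, seq, 128]; causal mask,
    default softmax scale 1/sqrt(head_dim)."""
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)   # expand KV heads to match query heads (GQA)
    v = v.repeat_interleave(groups, dim=1)
    out = F.scaled_dot_product_attention(q.float(), k.float(), v.float(), is_causal=True)
    # Flatten back to [seq, hidden] to compare against the dumped attn_reshape tensor.
    return out.transpose(1, 2).reshape(q.shape[2], -1)
```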
Derived attention output:
- ORT vs PyTorch: 0.000124
- ORT vs TRT: 0.000662
- PyTorch vs TRT: 0.000682
Derived vs dumped `attn_reshape`:
- PyTorch: 1.91e-05
- TRT: 1.13e-05
Interpretation:
- the dumped attention output is consistent with the q/k/v tensors
- the attention plugin output itself is not the first place where the mismatch is created
Current conclusion
At this point, the strongest interpretation is:
- TensorRT-Edge-LLM and PyTorch receive effectively matched multimodal prefill inputs.
- The first meaningful drift appears at the `q_proj`/`k_proj`/`v_proj` outputs.
- `q_norm`/`k_norm` magnify that drift.
- Early attention output is still relatively stable.
- The first large amplification happens at the layer 16 `mlp_down`.
- Therefore, the issue appears to be in the TRT text prefill runtime path, with the earliest visible source at the projection GEMM path before the attention plugin.
Why this matters
This mismatch is large enough to change downstream expert/FM behavior.
Even when the handoff metadata matches:
- same `offset`
- same `position_ids`
- same `attention_mask`
- same `rope_deltas`
the resulting prompt KV cache still differs enough to affect replay and downstream outputs.
Questions / requested guidance
We would like guidance on the following:
- Is there any known discrepancy between PyTorch/HF and TensorRT-Edge-LLM in the prefill projection GEMM path for Qwen3-VL / Alpamayo-style text backbones?
- Are there known TRT tactic / accumulation / precision differences in prefill GEMMs that could explain:
  - `input_ln` being aligned
  - but `q_proj`/`k_proj`/`v_proj` already diverging?
- Is there a recommended way to force a more reference-like prefill path for debugging?
- Is there any known issue in the prefill path where fused runtime behavior differs from ONNX Runtime / PyTorch before the attention plugin?
Repro note
This was reproduced with:
- same sample
- same prompt length
- same multimodal alignment
- prompt-only prefill
- no dependence on decode continuation length
So this appears to be a true prefill runtime mismatch, not a packaging or stopping-condition issue.