Skip to content

Sync with Microsoft ONNX Runtime - 23052026#1103

Open
ai-fw-intg wants to merge 14 commits into
ovep-developfrom
sync_msft_23052026
Open

Sync with Microsoft ONNX Runtime - 23052026#1103
ai-fw-intg wants to merge 14 commits into
ovep-developfrom
sync_msft_23052026

Conversation

@ai-fw-intg
Copy link
Copy Markdown

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

adrianlizarraga and others added 14 commits May 21, 2026 15:52
microsoft#28499)

### Description

Fixes a heap out-of-bounds read vulnerability in `DynamicQuantizeMatMul`
and `MatMulIntegerToFloat` where a bias tensor with an incorrect number
of elements could cause memory reads beyond the allocated buffer.

## Changes

- **`dynamic_quantize_matmul.cc`**: Added element count validation for
the bias tensor in both the `ComputeCommon` path and the deferred bias
addition path (KleidiAI).
- **`matmul_integer_base.h`**: Added element count validation in the
KleidiAI pre-pack path, causing fallback to `ComputeCommon` (which then
rejects the invalid bias with a clear error).
- **Tests**: Added regression tests covering runtime bias mismatch,
initializer bias mismatch (KleidiAI fallback), and the generic
(non-KleidiAI) path for both operators.

## Why we validate element count, not shape (rank)

The validation checks `bias_tensor->Shape().Size() == N` (total element
count) rather than enforcing that the bias is strictly 1D. This is
intentional for several reasons:

1. **Backward compatibility with existing models.** It's possible that
some models may have bias tensors with shape `(1, N)` instead of `(N)`.
Enforcing rank == 1 would break these models at runtime. This exact
issue occurred with the GroupQueryAttention operator, which required
relaxing its shape validation in PR microsoft#28259.

2. **Consistent with ONNX standard practice.** Most official ONNX
operator schemas (Conv, ConvTranspose, DeformConv, Gemm,
LayerNormalization) do *not* validate bias shape in their schema's
`TypeAndShapeInferenceFunction`; they only document "1D" in the input
description text. `BatchNormalization` is the only exception.

3. **The kernel only needs N contiguous floats.** The compute
implementation accesses bias via raw data pointer
(`bias->Data<float>()`) and reads exactly `N` elements. It never indexes
into specific dimensions or assumes a particular rank. A bias of shape
`(N)`, `(1, N)`, or `(1, 1, N)` all work identically.

4. **Schema constraints cannot be relaxed without a version bump.** If
we added a strict rank check to the schema now and later discovered
models using `(1, N)`, fixing it would probably require a new opset
version (though we've never actually bumped the version for contrib ops
...).

## Motivation and Context

Without this fix, passing a bias tensor with fewer elements than `B`'s
last dimension causes the kernel to read past the end of the bias
buffer, potentially exposing sensitive memory contents or causing a
crash.
…crosoft#28279)

### Description

Add shape validation for the conv bias input (`B`) in
`WordConvEmbedding::Compute()` to prevent out-of-bounds heap reads when
a crafted model provides a bias tensor shorter than `num_filters`.

### Root Cause

`WordConvEmbedding::Compute` passes `b_conv.Data<float>()` directly to
`ComputeConvMaxPoolWithActivation`, which iterates over `num_filters` (=
`w_conv.shape[0]`) elements of the bias buffer. `ValidateInputShape`
only checks the sequence, conv weight, and char embedding shapes — the
bias shape is never validated. A model with `b_conv.shape[0] <
w_conv.shape[0]` causes the inner loop to read past the bias buffer, and
the leaked heap bytes propagate through tanh activation and max-pooling
into the output tensor.

### Fix

Add an inline check after `ValidateInputShape` that rejects bias tensors
whose shape is not `[num_filters]`:

```cpp
ORT_RETURN_IF_NOT(b_conv_shape.NumDimensions() == 1 && b_conv_shape[0] == w_conv_shape[0],
                  "WordConvEmbedding: conv bias B must be a 1-D tensor of length ",
                  w_conv_shape[0], ", but got shape ", b_conv_shape);
### Description

The CoreML \`GatherOpBuilder\` rejected rank-0 (scalar) \`indices\`
because CoreML's \`gather\` requires rank-1+ indices and the obvious
workaround would change the output rank — see the \`// Don't allow
scalar 'indices' input.\` comment at \`gather_op_builder.cc:90\`. This
PR performs the workaround internally:

\`\`\`
reshape(indices, shape=[1])     ->  indices_1d
gather(data, indices_1d, axis)  ->  data_shape with the gather axis = 1
squeeze(., axes=[axis])         ->  ONNX gather output shape
\`\`\`

…in both the MLProgram and NeuralNetwork emitters. The squeeze restores
the original ONNX output rank, so caller-visible Gather semantics are
unchanged. \`reshape\` is used rather than \`expand_dims\` because
CoreML internally pads scalars and \`expand_dims\` on the padded tensor
can push the apparent rank past the rank-5 limit on high-rank \`data\`.

Restrictions:
- \`data\` must have a fully static shape — we claim a static
intermediate shape between gather and squeeze.
- \`data\` rank capped at 4. The rank-5 case still trips CoreML's
compiler with \`Invalid rank: 6\`, so we keep the conservative bound.

Dynamic-shape and rank-5+ scalar Gather still falls back to CPU
(preserves the existing \`GatherWithScalarIndices\` test, whose data is
dynamic-shape).

Fixes microsoft#28180.

### Motivation

StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select
per-layer style codes with a scalar-index Gather. The resulting graph
alternates between Gather and the rest of the generator, splitting the
CoreML subgraph repeatedly.

On GFPGAN-1024 (`[1, 3, 512, 512]`), this PR moves all 16 scalar Gathers
off CPU and the model lands on a single CoreML partition.

**M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):**

| | Partitions | Mean | StdDev | P50 | P95 | P99 | Max |
|---|---|---|---|---|---|---|---|
| origin/main | 2 | 89.68 ms | 3.67 | 87.82 | 96.71 | 105.00 | 108.01 |
| **this PR** | **1** | **81.77 ms** | **1.85** | **80.97** | **85.98**
| **87.24** | **88.03** |

**Mean −8.8%, stddev −50%, P99 −17%, max −18.5%** — eliminating the
CPU↔CoreML round-trip on every scalar Gather both speeds up the steady
state and tightens the tail.

Striking secondary effect: the worst-case run with the fix (**88.03
ms**) is faster than the *mean* run without it (**89.68 ms**). Every
single fixed inference over n=300 lands below the unfixed average.

### Tests

Six new tests in
\`onnxruntime/test/providers/coreml/coreml_basic_test.cc\` covering
distinct code paths, exercised on both NeuralNetwork and MLProgram
emitters where the dtype is supported:

- \`GatherScalarIndicesAxis1\` — axis=1, mid-rank squeeze.
- \`GatherScalarIndicesAxis0\` — axis=0, leading-axis squeeze.
- \`GatherScalarIndicesNegativeAxis\` — axis=-1, exercises
\`HandleNegativeAxis\`.
- \`GatherScalarIndicesFloat16\` — fp16 data (MLProgram only, as per
\`HasSupportedInputsImpl\`).
- \`GatherScalarIndicesInt64Data\` — int64 data, both formats.
- \`GatherScalarIndicesRank4Data\` — rank-4 data, exercises the
supported maximum.

Each verifies CoreML output against the CPU EP reference and asserts
\`ExpectedEPNodeAssignment::All\`. The existing
\`GatherWithScalarIndices\` test (dynamic-shape data) is updated only in
its comment to reflect the new precise condition; it still exercises the
CPU fall-back as before.

All pass locally on macOS 26.3 / M3 Max.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Description
<!-- Describe your changes. -->
Added a WebGPU-specific Component Governance manifest for Dawn and
related dependencies.

Added documentation for the manifest scope, dependency classification,
and maintenance steps. Added a validation script to catch Dawn and DXC
pin drift.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
WebGPU builds depend on Dawn and related components that are not part of
vanilla ONNX Runtime builds.

Downstream WebGPU packaging needs ORT-owned metadata to generate
complete third-party notices without maintaining a duplicate dependency
inventory.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
microsoft#28277)

### Description

Validate `seqlens_k` values against `cos_cache.shape[0]` in
`GroupQueryAttention::Compute()` when `do_rotary` is enabled, to prevent
out-of-bounds reads in the rotary embedding lookup.

### Root Cause

`CheckRotaryCaches()` validates `cos_cache.shape[0] >=
total_sequence_length`, but runtime position IDs are derived from
`seqlens_k` (a separate, per-batch input). An attacker can set
`total_sequence_length` small enough to pass the guard while setting
`seqlens_k[b]` far beyond `cos_cache.shape[0]`, causing `position_id =
seqlens_k[b]` to index out of bounds into the cos/sin cache. The
resulting heap bytes are used as rotation values and propagate into the
inference output.

### Fix

Add an explicit bounds check in `Compute()` that rejects any
`seqlens_k[b] >= cos_cache.shape[0]` before position IDs are computed.
This is defense-in-depth alongside the existing `RunRotaryEmbedding`
position_ids validation added in microsoft#27597.

### Security

- **Impact:** Heap OOB read (CWE-125) — adjacent heap memory leaks into
inference output via cos/sin rotation values.
- **Attack vector:** Any GQA-based LLM serving endpoint (Llama, Phi,
Mistral) that accepts `seqlens_k` as an inference input. No model
modification required.

### Testing

Verified that crafted inputs with `seqlens_k` exceeding `cos_cache`
dimensions now return `INVALID_ARGUMENT` instead of silently producing
results containing leaked heap data.
### Description

Parallelizes both `GetIndices` and `ScatterData` in the CPU
`ScatterElements` implementation using `ThreadPool::TryParallelFor`.

**Key insight**: For ScatterElements with `axis=a`, work units
identified by coordinates orthogonal to the axis (`outer_size ×
inner_size`) are guaranteed to write to disjoint output elements—even
with reductions. This enables lock-free parallelization without
correctness concerns.

Changes:
- **`GetIndices`**: Index validation/normalization parallelized over the
flat index array
- **`ScatterData`**: Rewritten to decompose into `outer_size *
inner_size` independent work units, each processing `axis_size`
sequential scatter operations along the axis dimension
- Thread pool plumbed through `ScatterDataDispatchTarget` from
`OpKernelContext::GetOperatorThreadPool()`
- Training `GatherElementsGradImpl` passes `nullptr` (sequential
fallback preserved)

For the reported workload (`axis=0`, indices shape `[481385, 80]`): 80
independent parallel streams, each processing 481385
elements—well-suited for multi-core execution.

### Motivation and Context

The CPU `ScatterElements` kernel was entirely sequential—single-threaded
index conversion followed by single-threaded scatter—yielding ~761ms on
a 24-core ARM system for a workload that an optimized parallel
implementation handles in ~6ms (129× gap). The kernel showed zero
intra-op thread utilization in ORT profiling.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description

When a system has multiple Nvidia GPUs, then also multiple EpDevices for
the NVEP should be created.

### Motivation and Context

Fixes the following test failure on a multi-gpu system

```
[ RUN      ] NvExecutionProviderTest.LoadUnloadPluginLibrary
/home/stephan/projects/onnxruntime-winai/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc:359: Failure
Expected equality of these values:
  num_test_ep_devices
    Which is: 2
  1
Expected an OrtEpDevice to have been created by the test library.

[  FAILED  ] NvExecutionProviderTest.LoadUnloadPluginLibrary (0 ms)
```
… fix (microsoft#28607)

## Description

Follow-up to microsoft#28583. Addresses review feedback that landed after merge
(input validation, redundant memset, dead branches in
`PrePackComputeBias`) and fixes a pre-existing latent CUTLASS issue that
surfaced as a packaging pipeline failure once MoE GEMM kernels were
built with a multi-arch `CMAKE_CUDA_ARCHITECTURES` list spanning
pre-Ampere and Ampere+ targets.

## Summary of Changes

### Packaging pipeline build fix

| File | Change |
|------|--------|
|
`onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h`
| Replace the unconditional `static_assert(false, ...)` in the
pre-Ampere `#else` branch of `MoeFCGemm::operator()` with
`CUTLASS_NOT_IMPLEMENTED()` plus a comment explaining why this is safe.
|

Background: `moe_gemm_kernels_*.cu` instantiate `MoeFCGemm` through
`MoeGemmRunner<...>::dispatchToArch`, which contains *runtime* (not
`constexpr`) `if (sm_ >= 80 && sm_ < 90)` branches. NVCC therefore
instantiates the kernel for every requested device target, including
pre-Sm80 device compile passes. The old `static_assert(false, ...)`
fired on those passes whenever `CMAKE_CUDA_ARCHITECTURES` contained any
arch below 80 (e.g. the packaging pipeline list
`52-real;61-real;75-real;86-real;89-real;90-virtual`). Replacing it with
`CUTLASS_NOT_IMPLEMENTED()` lets NVCC emit a runtime trap stub for
pre-Sm80, while runtime dispatch in `MoeGemmRunner::dispatchToArch()`
already guarantees `sm_ >= 80` before the kernel is ever launched, so
the stub is unreachable in practice.

### Address PR microsoft#28583 post-merge review

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu` | Add
`ValidateScaledZP4BitBatchedArgs` (positive `experts`/`n`/`k_blocks`,
`experts ≤ 65535` for the `gridDim.z` limit) and call it from both
`LaunchQMoEScaledZP4BitBatched` overloads. Matches the validation style
of `LaunchQMoERepackFP4ColToRow`. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc`
(`PrePackSwizzleBlockScales`) | Remove the redundant `cudaMemsetAsync`
of the destination buffer. `QMoEBlockScaleInterleaveKernel`'s `(batch,
row, col) -> offset` map is a bijection over the padded output extent
and writes 0 for padded source positions, so every output byte is
already written. Comment explains the invariant. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc`
(`PrePackComputeBias`, 4-bit block-wise) | Add `ORT_ENFORCE` checks for
positive shape dims and an `INT_MAX/2` bound on `packed_k_blocks`
(parity with `PrePackSwizzleBlockScales` / `PrePackRepackFP4Weights`).
Drop the shadowed `bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_;`
locals in favour of `is_fp16_`. Replace the dead-branch ternary
`(is_fp16 \|\| is_bf16 ? 2 : 4)` with `sizeof(uint16_t)` and a
clarifying comment, and remove the unreachable `else ORT_THROW(...)`
(the QMoE type path is strictly FP16/BF16). |

## Testing

- Built locally with CUDA 12.8 against the failing CI arch list
(`-DCMAKE_CUDA_ARCHITECTURES="52-real;61-real;75-real;86-real;89-real;90-virtual"`)
and confirmed
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels_bf16_bf16.cu.o`
compiles cleanly (only an `sm_<75` deprecation warning, no
`static_assert` failure).
- Existing QMoE Python tests
(`onnxruntime/test/python/transformers/test_qmoe_cuda.py`,
`test_qmoe_cpu.py`) exercise the affected `PrePackSwizzleBlockScales` /
`PrePackComputeBias` paths under `--config Debug` builds and continue to
pass; the added `ORT_ENFORCE` checks only trigger on invalid shapes that
are not produced by the supported QMoE input contract.
- No behaviour change on supported devices: `dispatchToArch` already
gates `MoeFCGemm` behind `sm_ >= 80`, so the new
`CUTLASS_NOT_IMPLEMENTED()` stub is unreachable at runtime.

## Motivation and Context

Once microsoft#28583 enabled the MoE GEMM kernels as part of the contrib CUDA
build, packaging pipelines (which target a wide arch range to maximise
GPU coverage) started failing on the pre-Ampere device compile passes.
The kernel-side fix in this PR resolves the immediate breakage while
keeping the cmake-level binary-size optimisation (per-kernel arch
pinning, TensorRT-LLM style) as a follow-up — CMake's
`CUDA_ARCHITECTURES` is target/directory-scoped only, so the proper way
to restrict per-kernel archs is an OBJECT-library refactor, which is
intentionally not in scope here.

## Checklist

- [x] Tests added/updated (input validation covered by existing QMoE
tests; the new `ORT_ENFORCE` checks fail loudly on out-of-contract
shapes)
- [x] No documentation changes needed
- [x] No breaking changes
- [x] Local packaging-pipeline arch list verified to compile
### Description

The CUDA Attention kernel (`core/providers/cuda/llm/attention.cc`)
depends on contrib_ops internals (flash attention, memory efficient
attention, unfused attention helpers) but was compiled unconditionally.
When building with `--disable_contrib_ops`,
`GetAttentionKernelOptions()` is unavailable (guarded by `#ifndef
DISABLE_CONTRIB_OPS` in `cuda_kernel.h`), causing a compile error.

Changes:
- **`cmake/onnxruntime_providers_cuda.cmake`** — When contrib ops are
disabled (and not in CUDA minimal mode), include the
`contrib_ops/cuda/bert/` attention infrastructure files (flash
attention, memory efficient attention, unfused attention helpers, etc.)
so the ONNX domain Attention kernel can compile and link. Uses
`elseif(onnxruntime_DISABLE_CONTRIB_OPS AND NOT
onnxruntime_CUDA_MINIMAL)` to avoid including these files in CUDA
minimal builds where `llm/attention.cc` isn't compiled and
`cudnn_frontend.h` isn't available.
- **`onnxruntime/core/providers/cuda/cuda_execution_provider.h`** —
Remove `#ifndef DISABLE_CONTRIB_OPS` guards from the
`AttentionKernelOptions` include, `GetAttentionKernelOptions()` method,
and `attention_kernel_options_` member variable
- **`onnxruntime/core/providers/cuda/cuda_kernel.h`** — Remove `#ifndef
DISABLE_CONTRIB_OPS` guard from `GetAttentionKernelOptions()`

The CUDA Attention kernel and its underlying attention backends (flash,
memory efficient, unfused) are now always available in full CUDA builds
regardless of whether contrib ops are enabled. No changes are needed in
`cuda_execution_provider.cc` since the Attention kernel registrations
remain unconditional.

### Motivation and Context

Building onnxruntime with CUDA enabled and `--disable_contrib_ops`
fails:

```
error C2039: 'GetAttentionKernelOptions': is not a member of 'onnxruntime::cuda::Attention<float>'
```

This is a valid build configuration (useful for reducing compile time)
that should be supported. Rather than excluding the CUDA Attention
kernel when contrib ops are disabled, the necessary attention
infrastructure from `contrib_ops/cuda/bert/` is included in the build so
the ONNX domain Attention op retains full CUDA acceleration. The fix is
scoped to non-minimal CUDA builds only, since CUDA minimal builds use a
non-recursive glob that doesn't include `llm/attention.cc` and don't
have `cudnn_frontend` available.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#28578) (microsoft#28606)

## Description

Follow-up performance and correctness improvements to the MLAS quantized
KV-cache GEMM kernels introduced in microsoft#28578. These changes target the
AVX2, AVX512-VNNI, and NEON kernel files only.

### Changes

1. **Use embedded rounding in `QuantizeRowToU8` (AVX-512)**
Replace `_mm512_roundscale_ps` + `_mm512_cvtps_epi32` with a single
`_mm512_cvt_roundps_epi32` that combines round-to-nearest-even and
float-to-int32 in one instruction, saving a `vrndscaleps` per loop
iteration.

2. **Use int32 zero-point correction in VNNI dot products**
Perform the `dot - 128*sum(b)` zero-point correction in int32 before
converting to float. This avoids precision loss when operands exceed
2^24 (where float32 loses integer precision), preventing potential
catastrophic cancellation.

3. **Defer per-tensor scale in `FusedDotInt8` (AVX2 + AVX-512)**
Factor the constant per-tensor scale out of the inner loop: `sum(a*b*s)
= s * sum(a*b)`. Saves one `vmulps` per 8/16 elements in the hot path.

4. **Defer per-tensor scale in SVGemm and NEON dequantization**
- AVX2/AVX-512 `SVGemm`: accumulate unscaled dot products, multiply the
output row by the per-tensor scale once after the K loop.
- NEON: parameterize `DequantRow_Neon` with `apply_per_tensor_scale` to
skip per-element scaling during dequantization when using per-tensor
mode; scale the output row once after accumulation.
- Also: clarify AVX2 INT4 nibble extraction comment and use `uint32_t`
for the raw packed load.

### Motivation

The per-tensor quantization paths were previously applying a constant
scale factor on every element inside hot loops. By deferring the scalar
multiplication to after accumulation (using the distributive property),
we reduce instruction count in the inner loops without changing
numerical results (within normal FP reordering tolerance).

The int32 zero-point correction fix addresses a latent precision issue
in AVX512-VNNI paths that could manifest at large K dimensions (K >
~512).

### Testing

- `onnxruntime_mlas_test --gtest_filter=KVQuant.*` passes (Debug build,
x86-64).
- No new tests needed — existing `KVQuant.ShortExecute` exercises all
modified code paths across INT8/INT4 per-tensor/per-channel modes.


### Benchmark Results

Measured on Intel Xeon Platinum 8370C (8 cores, 16 threads, AVX-512 +
VNNI), Release build. Each benchmark uses `--benchmark_min_time=0.3s
--benchmark_repetitions=5`.

**QKGemm (query × K_cache^T) — INT8 per-tensor (S8_PerTensor,
QuantType:0)**

This is the path most improved by the deferred-scale optimization
(changes 3 and 4).

| Shape | Before (ns) | After (ns) | Speedup |
|---|---:|---:|---:|
| M=1, N=512, K=64 | 2,926 | 2,803 | 1.04x |
| M=1, N=512, K=128 | 5,914 | 5,074 | **1.17x** |
| M=1, N=2048, K=128 | 22,401 | 19,937 | **1.12x** |
| M=128, N=512, K=64 | 412,505 | 304,230 | **1.36x** |
| M=128, N=512, K=128 | 911,508 | 788,198 | **1.16x** |
| M=128, N=2048, K=64 | 1,662,547 | 1,242,441 | **1.34x** |
| M=128, N=2048, K=128 | 3,660,599 | 3,176,911 | **1.15x** |

**SVGemm (attn_probs × V_cache) — INT8 per-tensor (S8_PerTensor,
QuantType:0)**

| Shape | Before (ns) | After (ns) | Speedup |
|---|---:|---:|---:|
| M=1, N=64, K=512 | 4,707 | 4,122 | **1.14x** |
| M=1, N=64, K=2048 | 18,516 | 16,533 | **1.12x** |
| M=128, N=64, K=512 | 399,703 | 358,821 | **1.11x** |
| M=128, N=64, K=2048 | 1,633,807 | 1,423,984 | **1.15x** |
| M=128, N=128, K=512 | 775,205 | 761,527 | 1.02x |
| M=128, N=128, K=2048 | 3,086,642 | 2,979,566 | **1.04x** |

**Other quant types (S8_PerChannel, S4_PerTensor, S4_PerChannel) —
neutral**

Per-channel and INT4 paths are not affected by the deferred-scale
optimization. Representative M=128 results:

| Benchmark | QuantType | Before (ns) | After (ns) | Ratio |
|---|---|---:|---:|---:|
| QKGemm M=128, N=2048, K=128 | S8_PerChannel | 4,555,381 | 4,684,954 |
0.97x |
| QKGemm M=128, N=2048, K=128 | S4_PerTensor | 3,841,759 | 3,819,387 |
1.01x |
| QKGemm M=128, N=2048, K=128 | S4_PerChannel | 4,043,262 | 4,056,033 |
1.00x |
| SVGemm M=128, N=128, K=2048 | S8_PerChannel | 4,449,839 | 4,290,344 |
**1.04x** |
| SVGemm M=128, N=128, K=2048 | S4_PerTensor | 2,989,684 | 2,998,154 |
1.00x |
| SVGemm M=128, N=128, K=2048 | S4_PerChannel | 3,403,497 | 3,390,452 |
1.00x |

**Summary**: The INT8 per-tensor paths (the most common decode
configuration) see **12–36% QKGemm speedup** and **4–15% SVGemm
speedup** at representative shapes. Other quantization modes are neutral
within noise (±1–3%).
…osoft#28521)

## Summary
- Guard `QDQPropagationTransformer::PropagateDQForward` against
propagating a DQ whose data input is a constant (graph initializer or
`Constant` op output).
- Prevents a stale `QuantizeLinear` insertion that the S8-to-U8 weight
transformer fails to update, which silently clamps int8 negatives to
zero under `ORT_ENABLE_ALL`.
- Adds a regression test in `qdq_transformer_test.cc` covering both
constant-input shapes.

## Motivation
Fixes microsoft#28491.

Reported scenario: a model containing `Constant(int8) ->
DequantizeLinear -> Reshape` produces correct outputs under
`ORT_DISABLE_ALL` but wrong outputs (negatives clamped to 0) under
`ORT_ENABLE_ALL`. Root cause: `PropagateDQForward` inserts a `Q -> DQ`
pair after the Reshape, then `QDQS8ToU8Transformer` (or the avx2 weight
transformer) flips the upstream DQ from int8 to uint8 without touching
the freshly inserted `Q`. The orphaned uint8 `Q` then clamps the int8
weight's negative values to 0.

Propagating a DQ whose data is a constant weight has no benefit anyway:
the constant is folded, so there is no runtime tensor that downstream
nodes need re-quantized.

## Changes
- `onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc`:
inside `PropagateDQForward`'s per-DQ loop, after existing skip checks
and before any graph mutation, `continue` if
`dq_node.InputDefs()[QDQ::InputIndex::INPUT_ID]` is a graph initializer
(`graph_utils::NodeArgIsConstant`) or the output of a `Constant` op node
(`graph.GetProducerNode(...)->OpType() == "Constant"`). Both checks are
required because `NodeArgIsConstant` only handles the initializer case.
- `onnxruntime/test/optimizer/qdq_transformer_test.cc`: new test
`QDQPropagation_DQForward_ConstantInput_NoPropagation` with two cases —
DQ fed by an initializer, and DQ fed by an explicit `Constant` op node —
asserting no `QuantizeLinear` is inserted after the downstream Reshape.

`PropagateQBackward` is structurally not affected (its data input is a
live activation, not a constant), so it does not need a symmetric guard.

## Test Plan
- New
`QDQTransformerTests.QDQPropagation_DQForward_ConstantInput_NoPropagation`
(gtest) covers both code paths and would fail on `main`.
- Existing `QDQPropagation_*` tests use `MakeInput` (graph inputs, not
initializers) for their DQ data tensors, so the new guard does not
regress them.
- The reproducer from microsoft#28491 should now produce identical results
between `ORT_DISABLE_ALL` and `ORT_ENABLE_ALL`.
- CI will exercise `--gtest_filter=QDQTransformerTests.QDQPropagation*`
under `onnxruntime_test_all`.

Fixes microsoft#28491
microsoft#28519)

### Description
- Scale tile_v by 4x when subgroup is enabled and the vectorized
dimension has enough columns, improving data reuse.
- Gate the tile_v expansion on seq_length >= 16, where the prefill
benefit outweighs increased register pressure.
- Remove redundant zero-initialization of the state tile (WGSL default-
initializes vars to zero).

**Intel Panther Lake (xe-3lpg)**
| Model        | Prefill | Baseline (TPS) | Optimized (TPS) | Change |
| :----------- | ------: | -------------: | --------------: | -----: |
| Qwen3.5-0.8B |     128 |        1534.30 |         1681.00 |  9.56% |
| Qwen3.5-0.8B |    1024 |        3267.30 |         3917.60 | 19.90% |
| Qwen3.5-0.8B |    4096 |        2864.50 |         3563.40 | 24.40% |
| Qwen3.5-2B   |     128 |        1295.60 |         1344.60 |  3.78% |
| Qwen3.5-2B   |    1024 |        2177.10 |         2344.00 |  7.67% |
| Qwen3.5-2B   |    4096 |        1942.60 |         2247.00 | 15.67% |
| Qwen3.5-4B   |     128 |         701.30 |          736.20 |  4.98% |
| Qwen3.5-4B   |    1024 |         946.90 |         1036.30 |  9.44% |
| Qwen3.5-4B   |    4096 |         824.10 |          912.00 | 10.67% |

### Motivation and Context
See above.
…or in chained Reshape (microsoft#28455)

### Description

- **Optimizer fix** (`reshape_fusion.cc`): `FuseContiguousReshapes` now
explicitly bails out in the while loop when it encounters a `Reshape`
next-node that has `allowzero=1`. The fused node inherits attributes
from the first node in the chain (which has `allowzero=0` or no
`allowzero` attribute), so including an `allowzero=1` node in the fusion
would silently drop that attribute and cause zeros in the shape tensor
to be misinterpreted as "copy from input" at runtime instead of being
preserved as explicit zero dims. A complementary zero-dim guard (already
present) also prevents fusion when the inferred final output shape
contains any literal zero dimension.

- **New end-to-end execution test** (`graph_transform_test.cc`,
`ReshapeFusionContiguousReshapesWithZeroDimExecution`): Exercises the
exact scenario from the issue — `float[0,8,2] → Reshape([4,2,-1]) →
Reshape([0,0,4], allowzero=1)` — through a full `InferenceSession` run,
asserting the output shape is `(0,0,4)` and not `(0,8,4)`. The existing
`ReshapeFusionContiguousReshapesWithZeroDim` test only validated the
optimizer transformation; this test validates runtime correctness.

### Motivation and Context

A chained `Reshape` model where the second node uses `allowzero=1`
produced the wrong output shape when the fused shape contained zeros.
Example reproducer:

```python
# X: float[0, 8, 2]
n1 = Reshape(X, shape=[4, 2, -1])                # mid: [4, 2, 0]
n2 = Reshape(mid, shape=[0, 0, 4], allowzero=1)  # expected Y: [0, 0, 4]
# ORT returned: (0, 8, 4)  — wrong; reference returned: (0, 0, 4)
```

`FuseContiguousReshapes` merged both nodes into `Reshape(X, [0, 0, 4])`
with `allowzero=0` (inherited from n1), so the zeros were interpreted as
"copy dim from `X`" (`X[1]=8`), yielding `(0, 8, 4)`.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.