Sync with Microsoft ONNX Runtime - 23052026#1103
Open
ai-fw-intg wants to merge 14 commits into
Open
Conversation
microsoft#28499) ### Description Fixes a heap out-of-bounds read vulnerability in `DynamicQuantizeMatMul` and `MatMulIntegerToFloat` where a bias tensor with an incorrect number of elements could cause memory reads beyond the allocated buffer. ## Changes - **`dynamic_quantize_matmul.cc`**: Added element count validation for the bias tensor in both the `ComputeCommon` path and the deferred bias addition path (KleidiAI). - **`matmul_integer_base.h`**: Added element count validation in the KleidiAI pre-pack path, causing fallback to `ComputeCommon` (which then rejects the invalid bias with a clear error). - **Tests**: Added regression tests covering runtime bias mismatch, initializer bias mismatch (KleidiAI fallback), and the generic (non-KleidiAI) path for both operators. ## Why we validate element count, not shape (rank) The validation checks `bias_tensor->Shape().Size() == N` (total element count) rather than enforcing that the bias is strictly 1D. This is intentional for several reasons: 1. **Backward compatibility with existing models.** It's possible that some models may have bias tensors with shape `(1, N)` instead of `(N)`. Enforcing rank == 1 would break these models at runtime. This exact issue occurred with the GroupQueryAttention operator, which required relaxing its shape validation in PR microsoft#28259. 2. **Consistent with ONNX standard practice.** Most official ONNX operator schemas (Conv, ConvTranspose, DeformConv, Gemm, LayerNormalization) do *not* validate bias shape in their schema's `TypeAndShapeInferenceFunction`; they only document "1D" in the input description text. `BatchNormalization` is the only exception. 3. **The kernel only needs N contiguous floats.** The compute implementation accesses bias via raw data pointer (`bias->Data<float>()`) and reads exactly `N` elements. It never indexes into specific dimensions or assumes a particular rank. A bias of shape `(N)`, `(1, N)`, or `(1, 1, N)` all work identically. 4. **Schema constraints cannot be relaxed without a version bump.** If we added a strict rank check to the schema now and later discovered models using `(1, N)`, fixing it would probably require a new opset version (though we've never actually bumped the version for contrib ops ...). ## Motivation and Context Without this fix, passing a bias tensor with fewer elements than `B`'s last dimension causes the kernel to read past the end of the bias buffer, potentially exposing sensitive memory contents or causing a crash.
…crosoft#28279) ### Description Add shape validation for the conv bias input (`B`) in `WordConvEmbedding::Compute()` to prevent out-of-bounds heap reads when a crafted model provides a bias tensor shorter than `num_filters`. ### Root Cause `WordConvEmbedding::Compute` passes `b_conv.Data<float>()` directly to `ComputeConvMaxPoolWithActivation`, which iterates over `num_filters` (= `w_conv.shape[0]`) elements of the bias buffer. `ValidateInputShape` only checks the sequence, conv weight, and char embedding shapes — the bias shape is never validated. A model with `b_conv.shape[0] < w_conv.shape[0]` causes the inner loop to read past the bias buffer, and the leaked heap bytes propagate through tanh activation and max-pooling into the output tensor. ### Fix Add an inline check after `ValidateInputShape` that rejects bias tensors whose shape is not `[num_filters]`: ```cpp ORT_RETURN_IF_NOT(b_conv_shape.NumDimensions() == 1 && b_conv_shape[0] == w_conv_shape[0], "WordConvEmbedding: conv bias B must be a 1-D tensor of length ", w_conv_shape[0], ", but got shape ", b_conv_shape);
### Description The CoreML \`GatherOpBuilder\` rejected rank-0 (scalar) \`indices\` because CoreML's \`gather\` requires rank-1+ indices and the obvious workaround would change the output rank — see the \`// Don't allow scalar 'indices' input.\` comment at \`gather_op_builder.cc:90\`. This PR performs the workaround internally: \`\`\` reshape(indices, shape=[1]) -> indices_1d gather(data, indices_1d, axis) -> data_shape with the gather axis = 1 squeeze(., axes=[axis]) -> ONNX gather output shape \`\`\` …in both the MLProgram and NeuralNetwork emitters. The squeeze restores the original ONNX output rank, so caller-visible Gather semantics are unchanged. \`reshape\` is used rather than \`expand_dims\` because CoreML internally pads scalars and \`expand_dims\` on the padded tensor can push the apparent rank past the rank-5 limit on high-rank \`data\`. Restrictions: - \`data\` must have a fully static shape — we claim a static intermediate shape between gather and squeeze. - \`data\` rank capped at 4. The rank-5 case still trips CoreML's compiler with \`Invalid rank: 6\`, so we keep the conservative bound. Dynamic-shape and rank-5+ scalar Gather still falls back to CPU (preserves the existing \`GatherWithScalarIndices\` test, whose data is dynamic-shape). Fixes microsoft#28180. ### Motivation StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select per-layer style codes with a scalar-index Gather. The resulting graph alternates between Gather and the rest of the generator, splitting the CoreML subgraph repeatedly. On GFPGAN-1024 (`[1, 3, 512, 512]`), this PR moves all 16 scalar Gathers off CPU and the model lands on a single CoreML partition. **M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):** | | Partitions | Mean | StdDev | P50 | P95 | P99 | Max | |---|---|---|---|---|---|---|---| | origin/main | 2 | 89.68 ms | 3.67 | 87.82 | 96.71 | 105.00 | 108.01 | | **this PR** | **1** | **81.77 ms** | **1.85** | **80.97** | **85.98** | **87.24** | **88.03** | **Mean −8.8%, stddev −50%, P99 −17%, max −18.5%** — eliminating the CPU↔CoreML round-trip on every scalar Gather both speeds up the steady state and tightens the tail. Striking secondary effect: the worst-case run with the fix (**88.03 ms**) is faster than the *mean* run without it (**89.68 ms**). Every single fixed inference over n=300 lands below the unfixed average. ### Tests Six new tests in \`onnxruntime/test/providers/coreml/coreml_basic_test.cc\` covering distinct code paths, exercised on both NeuralNetwork and MLProgram emitters where the dtype is supported: - \`GatherScalarIndicesAxis1\` — axis=1, mid-rank squeeze. - \`GatherScalarIndicesAxis0\` — axis=0, leading-axis squeeze. - \`GatherScalarIndicesNegativeAxis\` — axis=-1, exercises \`HandleNegativeAxis\`. - \`GatherScalarIndicesFloat16\` — fp16 data (MLProgram only, as per \`HasSupportedInputsImpl\`). - \`GatherScalarIndicesInt64Data\` — int64 data, both formats. - \`GatherScalarIndicesRank4Data\` — rank-4 data, exercises the supported maximum. Each verifies CoreML output against the CPU EP reference and asserts \`ExpectedEPNodeAssignment::All\`. The existing \`GatherWithScalarIndices\` test (dynamic-shape data) is updated only in its comment to reflect the new precise condition; it still exercises the CPU fall-back as before. All pass locally on macOS 26.3 / M3 Max. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Description <!-- Describe your changes. --> Added a WebGPU-specific Component Governance manifest for Dawn and related dependencies. Added documentation for the manifest scope, dependency classification, and maintenance steps. Added a validation script to catch Dawn and DXC pin drift. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> WebGPU builds depend on Dawn and related components that are not part of vanilla ONNX Runtime builds. Downstream WebGPU packaging needs ORT-owned metadata to generate complete third-party notices without maintaining a duplicate dependency inventory. --------- Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
microsoft#28277) ### Description Validate `seqlens_k` values against `cos_cache.shape[0]` in `GroupQueryAttention::Compute()` when `do_rotary` is enabled, to prevent out-of-bounds reads in the rotary embedding lookup. ### Root Cause `CheckRotaryCaches()` validates `cos_cache.shape[0] >= total_sequence_length`, but runtime position IDs are derived from `seqlens_k` (a separate, per-batch input). An attacker can set `total_sequence_length` small enough to pass the guard while setting `seqlens_k[b]` far beyond `cos_cache.shape[0]`, causing `position_id = seqlens_k[b]` to index out of bounds into the cos/sin cache. The resulting heap bytes are used as rotation values and propagate into the inference output. ### Fix Add an explicit bounds check in `Compute()` that rejects any `seqlens_k[b] >= cos_cache.shape[0]` before position IDs are computed. This is defense-in-depth alongside the existing `RunRotaryEmbedding` position_ids validation added in microsoft#27597. ### Security - **Impact:** Heap OOB read (CWE-125) — adjacent heap memory leaks into inference output via cos/sin rotation values. - **Attack vector:** Any GQA-based LLM serving endpoint (Llama, Phi, Mistral) that accepts `seqlens_k` as an inference input. No model modification required. ### Testing Verified that crafted inputs with `seqlens_k` exceeding `cos_cache` dimensions now return `INVALID_ARGUMENT` instead of silently producing results containing leaked heap data.
### Description Parallelizes both `GetIndices` and `ScatterData` in the CPU `ScatterElements` implementation using `ThreadPool::TryParallelFor`. **Key insight**: For ScatterElements with `axis=a`, work units identified by coordinates orthogonal to the axis (`outer_size × inner_size`) are guaranteed to write to disjoint output elements—even with reductions. This enables lock-free parallelization without correctness concerns. Changes: - **`GetIndices`**: Index validation/normalization parallelized over the flat index array - **`ScatterData`**: Rewritten to decompose into `outer_size * inner_size` independent work units, each processing `axis_size` sequential scatter operations along the axis dimension - Thread pool plumbed through `ScatterDataDispatchTarget` from `OpKernelContext::GetOperatorThreadPool()` - Training `GatherElementsGradImpl` passes `nullptr` (sequential fallback preserved) For the reported workload (`axis=0`, indices shape `[481385, 80]`): 80 independent parallel streams, each processing 481385 elements—well-suited for multi-core execution. ### Motivation and Context The CPU `ScatterElements` kernel was entirely sequential—single-threaded index conversion followed by single-threaded scatter—yielding ~761ms on a 24-core ARM system for a workload that an optimized parallel implementation handles in ~6ms (129× gap). The kernel showed zero intra-op thread utilization in ORT profiling. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
When a system has multiple Nvidia GPUs, then also multiple EpDevices for
the NVEP should be created.
### Motivation and Context
Fixes the following test failure on a multi-gpu system
```
[ RUN ] NvExecutionProviderTest.LoadUnloadPluginLibrary
/home/stephan/projects/onnxruntime-winai/onnxruntime/test/providers/nv_tensorrt_rtx/nv_basic_test.cc:359: Failure
Expected equality of these values:
num_test_ep_devices
Which is: 2
1
Expected an OrtEpDevice to have been created by the test library.
[ FAILED ] NvExecutionProviderTest.LoadUnloadPluginLibrary (0 ms)
```
… fix (microsoft#28607) ## Description Follow-up to microsoft#28583. Addresses review feedback that landed after merge (input validation, redundant memset, dead branches in `PrePackComputeBias`) and fixes a pre-existing latent CUTLASS issue that surfaced as a packaging pipeline failure once MoE GEMM kernels were built with a multi-arch `CMAKE_CUDA_ARCHITECTURES` list spanning pre-Ampere and Ampere+ targets. ## Summary of Changes ### Packaging pipeline build fix | File | Change | |------|--------| | `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h` | Replace the unconditional `static_assert(false, ...)` in the pre-Ampere `#else` branch of `MoeFCGemm::operator()` with `CUTLASS_NOT_IMPLEMENTED()` plus a comment explaining why this is safe. | Background: `moe_gemm_kernels_*.cu` instantiate `MoeFCGemm` through `MoeGemmRunner<...>::dispatchToArch`, which contains *runtime* (not `constexpr`) `if (sm_ >= 80 && sm_ < 90)` branches. NVCC therefore instantiates the kernel for every requested device target, including pre-Sm80 device compile passes. The old `static_assert(false, ...)` fired on those passes whenever `CMAKE_CUDA_ARCHITECTURES` contained any arch below 80 (e.g. the packaging pipeline list `52-real;61-real;75-real;86-real;89-real;90-virtual`). Replacing it with `CUTLASS_NOT_IMPLEMENTED()` lets NVCC emit a runtime trap stub for pre-Sm80, while runtime dispatch in `MoeGemmRunner::dispatchToArch()` already guarantees `sm_ >= 80` before the kernel is ever launched, so the stub is unreachable in practice. ### Address PR microsoft#28583 post-merge review | File | Change | |------|--------| | `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu` | Add `ValidateScaledZP4BitBatchedArgs` (positive `experts`/`n`/`k_blocks`, `experts ≤ 65535` for the `gridDim.z` limit) and call it from both `LaunchQMoEScaledZP4BitBatched` overloads. Matches the validation style of `LaunchQMoERepackFP4ColToRow`. | | `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`PrePackSwizzleBlockScales`) | Remove the redundant `cudaMemsetAsync` of the destination buffer. `QMoEBlockScaleInterleaveKernel`'s `(batch, row, col) -> offset` map is a bijection over the padded output extent and writes 0 for padded source positions, so every output byte is already written. Comment explains the invariant. | | `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`PrePackComputeBias`, 4-bit block-wise) | Add `ORT_ENFORCE` checks for positive shape dims and an `INT_MAX/2` bound on `packed_k_blocks` (parity with `PrePackSwizzleBlockScales` / `PrePackRepackFP4Weights`). Drop the shadowed `bool is_fp16 = is_fp16_; bool is_bf16 = !is_fp16_;` locals in favour of `is_fp16_`. Replace the dead-branch ternary `(is_fp16 \|\| is_bf16 ? 2 : 4)` with `sizeof(uint16_t)` and a clarifying comment, and remove the unreachable `else ORT_THROW(...)` (the QMoE type path is strictly FP16/BF16). | ## Testing - Built locally with CUDA 12.8 against the failing CI arch list (`-DCMAKE_CUDA_ARCHITECTURES="52-real;61-real;75-real;86-real;89-real;90-virtual"`) and confirmed `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels_bf16_bf16.cu.o` compiles cleanly (only an `sm_<75` deprecation warning, no `static_assert` failure). - Existing QMoE Python tests (`onnxruntime/test/python/transformers/test_qmoe_cuda.py`, `test_qmoe_cpu.py`) exercise the affected `PrePackSwizzleBlockScales` / `PrePackComputeBias` paths under `--config Debug` builds and continue to pass; the added `ORT_ENFORCE` checks only trigger on invalid shapes that are not produced by the supported QMoE input contract. - No behaviour change on supported devices: `dispatchToArch` already gates `MoeFCGemm` behind `sm_ >= 80`, so the new `CUTLASS_NOT_IMPLEMENTED()` stub is unreachable at runtime. ## Motivation and Context Once microsoft#28583 enabled the MoE GEMM kernels as part of the contrib CUDA build, packaging pipelines (which target a wide arch range to maximise GPU coverage) started failing on the pre-Ampere device compile passes. The kernel-side fix in this PR resolves the immediate breakage while keeping the cmake-level binary-size optimisation (per-kernel arch pinning, TensorRT-LLM style) as a follow-up — CMake's `CUDA_ARCHITECTURES` is target/directory-scoped only, so the proper way to restrict per-kernel archs is an OBJECT-library refactor, which is intentionally not in scope here. ## Checklist - [x] Tests added/updated (input validation covered by existing QMoE tests; the new `ORT_ENFORCE` checks fail loudly on out-of-contract shapes) - [x] No documentation changes needed - [x] No breaking changes - [x] Local packaging-pipeline arch list verified to compile
### Description The CUDA Attention kernel (`core/providers/cuda/llm/attention.cc`) depends on contrib_ops internals (flash attention, memory efficient attention, unfused attention helpers) but was compiled unconditionally. When building with `--disable_contrib_ops`, `GetAttentionKernelOptions()` is unavailable (guarded by `#ifndef DISABLE_CONTRIB_OPS` in `cuda_kernel.h`), causing a compile error. Changes: - **`cmake/onnxruntime_providers_cuda.cmake`** — When contrib ops are disabled (and not in CUDA minimal mode), include the `contrib_ops/cuda/bert/` attention infrastructure files (flash attention, memory efficient attention, unfused attention helpers, etc.) so the ONNX domain Attention kernel can compile and link. Uses `elseif(onnxruntime_DISABLE_CONTRIB_OPS AND NOT onnxruntime_CUDA_MINIMAL)` to avoid including these files in CUDA minimal builds where `llm/attention.cc` isn't compiled and `cudnn_frontend.h` isn't available. - **`onnxruntime/core/providers/cuda/cuda_execution_provider.h`** — Remove `#ifndef DISABLE_CONTRIB_OPS` guards from the `AttentionKernelOptions` include, `GetAttentionKernelOptions()` method, and `attention_kernel_options_` member variable - **`onnxruntime/core/providers/cuda/cuda_kernel.h`** — Remove `#ifndef DISABLE_CONTRIB_OPS` guard from `GetAttentionKernelOptions()` The CUDA Attention kernel and its underlying attention backends (flash, memory efficient, unfused) are now always available in full CUDA builds regardless of whether contrib ops are enabled. No changes are needed in `cuda_execution_provider.cc` since the Attention kernel registrations remain unconditional. ### Motivation and Context Building onnxruntime with CUDA enabled and `--disable_contrib_ops` fails: ``` error C2039: 'GetAttentionKernelOptions': is not a member of 'onnxruntime::cuda::Attention<float>' ``` This is a valid build configuration (useful for reducing compile time) that should be supported. Rather than excluding the CUDA Attention kernel when contrib ops are disabled, the necessary attention infrastructure from `contrib_ops/cuda/bert/` is included in the build so the ONNX domain Attention op retains full CUDA acceleration. The fix is scoped to non-minimal CUDA builds only, since CUDA minimal builds use a non-recursive glob that doesn't include `llm/attention.cc` and don't have `cudnn_frontend` available. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#28578) (microsoft#28606) ## Description Follow-up performance and correctness improvements to the MLAS quantized KV-cache GEMM kernels introduced in microsoft#28578. These changes target the AVX2, AVX512-VNNI, and NEON kernel files only. ### Changes 1. **Use embedded rounding in `QuantizeRowToU8` (AVX-512)** Replace `_mm512_roundscale_ps` + `_mm512_cvtps_epi32` with a single `_mm512_cvt_roundps_epi32` that combines round-to-nearest-even and float-to-int32 in one instruction, saving a `vrndscaleps` per loop iteration. 2. **Use int32 zero-point correction in VNNI dot products** Perform the `dot - 128*sum(b)` zero-point correction in int32 before converting to float. This avoids precision loss when operands exceed 2^24 (where float32 loses integer precision), preventing potential catastrophic cancellation. 3. **Defer per-tensor scale in `FusedDotInt8` (AVX2 + AVX-512)** Factor the constant per-tensor scale out of the inner loop: `sum(a*b*s) = s * sum(a*b)`. Saves one `vmulps` per 8/16 elements in the hot path. 4. **Defer per-tensor scale in SVGemm and NEON dequantization** - AVX2/AVX-512 `SVGemm`: accumulate unscaled dot products, multiply the output row by the per-tensor scale once after the K loop. - NEON: parameterize `DequantRow_Neon` with `apply_per_tensor_scale` to skip per-element scaling during dequantization when using per-tensor mode; scale the output row once after accumulation. - Also: clarify AVX2 INT4 nibble extraction comment and use `uint32_t` for the raw packed load. ### Motivation The per-tensor quantization paths were previously applying a constant scale factor on every element inside hot loops. By deferring the scalar multiplication to after accumulation (using the distributive property), we reduce instruction count in the inner loops without changing numerical results (within normal FP reordering tolerance). The int32 zero-point correction fix addresses a latent precision issue in AVX512-VNNI paths that could manifest at large K dimensions (K > ~512). ### Testing - `onnxruntime_mlas_test --gtest_filter=KVQuant.*` passes (Debug build, x86-64). - No new tests needed — existing `KVQuant.ShortExecute` exercises all modified code paths across INT8/INT4 per-tensor/per-channel modes. ### Benchmark Results Measured on Intel Xeon Platinum 8370C (8 cores, 16 threads, AVX-512 + VNNI), Release build. Each benchmark uses `--benchmark_min_time=0.3s --benchmark_repetitions=5`. **QKGemm (query × K_cache^T) — INT8 per-tensor (S8_PerTensor, QuantType:0)** This is the path most improved by the deferred-scale optimization (changes 3 and 4). | Shape | Before (ns) | After (ns) | Speedup | |---|---:|---:|---:| | M=1, N=512, K=64 | 2,926 | 2,803 | 1.04x | | M=1, N=512, K=128 | 5,914 | 5,074 | **1.17x** | | M=1, N=2048, K=128 | 22,401 | 19,937 | **1.12x** | | M=128, N=512, K=64 | 412,505 | 304,230 | **1.36x** | | M=128, N=512, K=128 | 911,508 | 788,198 | **1.16x** | | M=128, N=2048, K=64 | 1,662,547 | 1,242,441 | **1.34x** | | M=128, N=2048, K=128 | 3,660,599 | 3,176,911 | **1.15x** | **SVGemm (attn_probs × V_cache) — INT8 per-tensor (S8_PerTensor, QuantType:0)** | Shape | Before (ns) | After (ns) | Speedup | |---|---:|---:|---:| | M=1, N=64, K=512 | 4,707 | 4,122 | **1.14x** | | M=1, N=64, K=2048 | 18,516 | 16,533 | **1.12x** | | M=128, N=64, K=512 | 399,703 | 358,821 | **1.11x** | | M=128, N=64, K=2048 | 1,633,807 | 1,423,984 | **1.15x** | | M=128, N=128, K=512 | 775,205 | 761,527 | 1.02x | | M=128, N=128, K=2048 | 3,086,642 | 2,979,566 | **1.04x** | **Other quant types (S8_PerChannel, S4_PerTensor, S4_PerChannel) — neutral** Per-channel and INT4 paths are not affected by the deferred-scale optimization. Representative M=128 results: | Benchmark | QuantType | Before (ns) | After (ns) | Ratio | |---|---|---:|---:|---:| | QKGemm M=128, N=2048, K=128 | S8_PerChannel | 4,555,381 | 4,684,954 | 0.97x | | QKGemm M=128, N=2048, K=128 | S4_PerTensor | 3,841,759 | 3,819,387 | 1.01x | | QKGemm M=128, N=2048, K=128 | S4_PerChannel | 4,043,262 | 4,056,033 | 1.00x | | SVGemm M=128, N=128, K=2048 | S8_PerChannel | 4,449,839 | 4,290,344 | **1.04x** | | SVGemm M=128, N=128, K=2048 | S4_PerTensor | 2,989,684 | 2,998,154 | 1.00x | | SVGemm M=128, N=128, K=2048 | S4_PerChannel | 3,403,497 | 3,390,452 | 1.00x | **Summary**: The INT8 per-tensor paths (the most common decode configuration) see **12–36% QKGemm speedup** and **4–15% SVGemm speedup** at representative shapes. Other quantization modes are neutral within noise (±1–3%).
…osoft#28521) ## Summary - Guard `QDQPropagationTransformer::PropagateDQForward` against propagating a DQ whose data input is a constant (graph initializer or `Constant` op output). - Prevents a stale `QuantizeLinear` insertion that the S8-to-U8 weight transformer fails to update, which silently clamps int8 negatives to zero under `ORT_ENABLE_ALL`. - Adds a regression test in `qdq_transformer_test.cc` covering both constant-input shapes. ## Motivation Fixes microsoft#28491. Reported scenario: a model containing `Constant(int8) -> DequantizeLinear -> Reshape` produces correct outputs under `ORT_DISABLE_ALL` but wrong outputs (negatives clamped to 0) under `ORT_ENABLE_ALL`. Root cause: `PropagateDQForward` inserts a `Q -> DQ` pair after the Reshape, then `QDQS8ToU8Transformer` (or the avx2 weight transformer) flips the upstream DQ from int8 to uint8 without touching the freshly inserted `Q`. The orphaned uint8 `Q` then clamps the int8 weight's negative values to 0. Propagating a DQ whose data is a constant weight has no benefit anyway: the constant is folded, so there is no runtime tensor that downstream nodes need re-quantized. ## Changes - `onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc`: inside `PropagateDQForward`'s per-DQ loop, after existing skip checks and before any graph mutation, `continue` if `dq_node.InputDefs()[QDQ::InputIndex::INPUT_ID]` is a graph initializer (`graph_utils::NodeArgIsConstant`) or the output of a `Constant` op node (`graph.GetProducerNode(...)->OpType() == "Constant"`). Both checks are required because `NodeArgIsConstant` only handles the initializer case. - `onnxruntime/test/optimizer/qdq_transformer_test.cc`: new test `QDQPropagation_DQForward_ConstantInput_NoPropagation` with two cases — DQ fed by an initializer, and DQ fed by an explicit `Constant` op node — asserting no `QuantizeLinear` is inserted after the downstream Reshape. `PropagateQBackward` is structurally not affected (its data input is a live activation, not a constant), so it does not need a symmetric guard. ## Test Plan - New `QDQTransformerTests.QDQPropagation_DQForward_ConstantInput_NoPropagation` (gtest) covers both code paths and would fail on `main`. - Existing `QDQPropagation_*` tests use `MakeInput` (graph inputs, not initializers) for their DQ data tensors, so the new guard does not regress them. - The reproducer from microsoft#28491 should now produce identical results between `ORT_DISABLE_ALL` and `ORT_ENABLE_ALL`. - CI will exercise `--gtest_filter=QDQTransformerTests.QDQPropagation*` under `onnxruntime_test_all`. Fixes microsoft#28491
microsoft#28519) ### Description - Scale tile_v by 4x when subgroup is enabled and the vectorized dimension has enough columns, improving data reuse. - Gate the tile_v expansion on seq_length >= 16, where the prefill benefit outweighs increased register pressure. - Remove redundant zero-initialization of the state tile (WGSL default- initializes vars to zero). **Intel Panther Lake (xe-3lpg)** | Model | Prefill | Baseline (TPS) | Optimized (TPS) | Change | | :----------- | ------: | -------------: | --------------: | -----: | | Qwen3.5-0.8B | 128 | 1534.30 | 1681.00 | 9.56% | | Qwen3.5-0.8B | 1024 | 3267.30 | 3917.60 | 19.90% | | Qwen3.5-0.8B | 4096 | 2864.50 | 3563.40 | 24.40% | | Qwen3.5-2B | 128 | 1295.60 | 1344.60 | 3.78% | | Qwen3.5-2B | 1024 | 2177.10 | 2344.00 | 7.67% | | Qwen3.5-2B | 4096 | 1942.60 | 2247.00 | 15.67% | | Qwen3.5-4B | 128 | 701.30 | 736.20 | 4.98% | | Qwen3.5-4B | 1024 | 946.90 | 1036.30 | 9.44% | | Qwen3.5-4B | 4096 | 824.10 | 912.00 | 10.67% | ### Motivation and Context See above.
…or in chained Reshape (microsoft#28455) ### Description - **Optimizer fix** (`reshape_fusion.cc`): `FuseContiguousReshapes` now explicitly bails out in the while loop when it encounters a `Reshape` next-node that has `allowzero=1`. The fused node inherits attributes from the first node in the chain (which has `allowzero=0` or no `allowzero` attribute), so including an `allowzero=1` node in the fusion would silently drop that attribute and cause zeros in the shape tensor to be misinterpreted as "copy from input" at runtime instead of being preserved as explicit zero dims. A complementary zero-dim guard (already present) also prevents fusion when the inferred final output shape contains any literal zero dimension. - **New end-to-end execution test** (`graph_transform_test.cc`, `ReshapeFusionContiguousReshapesWithZeroDimExecution`): Exercises the exact scenario from the issue — `float[0,8,2] → Reshape([4,2,-1]) → Reshape([0,0,4], allowzero=1)` — through a full `InferenceSession` run, asserting the output shape is `(0,0,4)` and not `(0,8,4)`. The existing `ReshapeFusionContiguousReshapesWithZeroDim` test only validated the optimizer transformation; this test validates runtime correctness. ### Motivation and Context A chained `Reshape` model where the second node uses `allowzero=1` produced the wrong output shape when the fused shape contained zeros. Example reproducer: ```python # X: float[0, 8, 2] n1 = Reshape(X, shape=[4, 2, -1]) # mid: [4, 2, 0] n2 = Reshape(mid, shape=[0, 0, 4], allowzero=1) # expected Y: [0, 0, 4] # ORT returned: (0, 8, 4) — wrong; reference returned: (0, 0, 4) ``` `FuseContiguousReshapes` merged both nodes into `Reshape(X, [0, 0, 4])` with `allowzero=0` (inherited from n1), so the zeros were interpreted as "copy dim from `X`" (`X[1]=8`), yielding `(0, 8, 4)`. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.