Sync with Microsoft ONNX Runtime - 21052026 #1099

Open: ai-fw-intg wants to merge 25 commits into ovep-develop from sync_msft_21052026

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

lhrios and others added 25 commits May 18, 2026 12:51
`indices` is built once and then only read during recursive calls to
`CheckIfSubtreesAreEqual`. However, it was passed by value, causing a
full copy on every recursive call. The parameter is now a `const&`.

## Data from the profiler:
The following data was collected with a model containing a single
TreeEnsembleClassifier node (5000 trees and 3.3 million nodes). The
loading time dropped from 18 minutes to about 4 seconds.

### After
<img width="1793" height="547" alt="Screenshot 2026-03-25 at 6 40 25 PM"
src="https://github.com/user-attachments/assets/d7c00335-8246-4bd1-9e4d-b0e956d48cdd"
/>


### Before
<img width="1763" height="548" alt="Screenshot 2026-03-25 at 6 40 40 PM"
src="https://github.com/user-attachments/assets/35683112-2919-4031-955c-922937f2df8f"
/>
…ft#28520)

## Summary

- Remove the `Subgroups` feature requirement from
`CanApplyFlashAttention`, enabling flash attention on devices without
subgroup support
- Generalize the Apple-specific shared-memory prefill path into a
`use_shm_path` flag that activates for Apple, NVIDIA, or any device
lacking subgroups
- Replace `is_apple` shader parameter with `use_shm_path` throughout the
WGSL template

## Motivation

Two issues exist on the current main branch:

1. **NVIDIA prefill produces incorrect results (regression from
microsoft#28511):** PR microsoft#28511 increased `max_k_step` to 32 for NVIDIA in C++, but
the shader's subgroup-based path only has `qk_1..qk_4` (16 hardcoded key
indices). When `sg_size=32` (e.g. RTX 5080), the loop steps by 32 but
only computes QK for keys 0-15, silently skipping keys 16-31. This
produces incorrect attention output for models like phi4.

2. **Flash attention prefill unavailable without Subgroups:**
`CanApplyFlashAttention` gates on
`context.HasFeature(wgpu::FeatureName::Subgroups)`, forcing devices
without subgroup support to fall back to the slower split-reduce
2-kernel path for prefill, even though the Apple shared-memory path in
the shader is fully subgroup-free.

This PR fixes both issues by routing Apple, NVIDIA, and no-subgroup
devices through the loop-based shared-memory path (`use_shm_path`),
which naturally handles any `max_k_step` value via `array<q_element_t,
max_k_step>` and loop iteration — no hardcoded key count.
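
The first issue can be pictured with a small host-side model. This is a plain Python sketch, not the WGSL shader: `keys_covered`, `TOTAL_KEYS`, and the parameter names are hypothetical, and the only facts taken from the description are the step of 32 and the 16 hardcoded key slots.

```python
def keys_covered(total_keys, k_step, keys_per_iteration):
    """Return the set of key indices that receive a QK computation.

    The outer loop advances by k_step; each iteration only handles the
    first keys_per_iteration keys of its window.
    """
    covered = set()
    for window_start in range(0, total_keys, k_step):
        for offset in range(min(k_step, keys_per_iteration)):
            key = window_start + offset
            if key < total_keys:
                covered.add(key)
    return covered


TOTAL_KEYS = 64  # hypothetical sequence length, for illustration only

# Subgroup path as described: step of 32, but only 16 hardcoded key slots.
hardcoded = keys_covered(TOTAL_KEYS, k_step=32, keys_per_iteration=16)
# Loop-based path: the inner loop runs over the whole step.
looped = keys_covered(TOTAL_KEYS, k_step=32, keys_per_iteration=32)

skipped = sorted(set(range(TOTAL_KEYS)) - hardcoded)
print("skipped by hardcoded path:", skipped)
print("skipped by loop-based path:", sorted(set(range(TOTAL_KEYS)) - looped))
```

In this model the hardcoded variant misses the upper half of every 32-key window, while the loop-based variant covers every key for any step size.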

## Test plan

- [x] Built ORT with WebGPU EP on Windows (Release, VS 2022)
- [x] Deployed and ran phi4-graph-prune model: output verified correct
("1+1 equals 2.")
- [x] Lint check passed (`lintrunner -a`)
### Description

- Add copyright headers to source files
- Enrich Python and NuGet package metadata
- Add ORT license files to packages
- Clean up readme files

### Motivation and Context

WebGPU plugin EP packaging improvements.

Note: Similar updates can be considered for the CUDA plugin EP, but this
PR is scoped to just the WebGPU EP for ease of cherry-picking into the
WebGPU plugin EP release branch.
…oft#28187)

### Description

Detect and reject recursive cycles in model local function definitions
during model loading, preventing stack overflow from unbounded recursion
during function inlining.

### Changes

**Call-graph construction and cycle detection** (`model_helpers.cc`,
`model_helpers.h`)
- `BuildLocalFunctionCallGraph()` builds an adjacency-list call graph
from model local functions using iterative subgraph traversal (no
recursion, safe against deeply nested subgraph attributes).
- `ValidateCallGraphAcyclic()` performs iterative DFS cycle detection.
Uses `find()` throughout (no `operator[]`) to prevent accidental map
insertions.
- `ValidateModelLocalFunctionAcyclic()` convenience wrapper.
- On cycle detection, returns a descriptive error showing the full cycle
path (e.g., `"local:first -> local:second -> local:first"`).
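
A minimal sketch of iterative DFS cycle detection over an adjacency-list call graph, written in Python for brevity. It is not the code in `model_helpers.cc`: `find_cycle` and `format_cycle` are hypothetical names, and the three-state marking is one common way to implement the technique.

```python
UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def find_cycle(call_graph):
    """Return a cycle as a list of names (first name repeated last), or None.

    call_graph maps each local function name to the names it calls.
    Uses an explicit stack, so deep call chains cannot overflow recursion.
    """
    state = {name: UNVISITED for name in call_graph}
    for root in call_graph:
        if state[root] != UNVISITED:
            continue
        state[root] = IN_PROGRESS
        path = [root]                                # functions currently being explored
        stack = [(root, iter(call_graph[root]))]     # (function, remaining callees)
        while stack:
            node, callees = stack[-1]
            descended = False
            for callee in callees:
                callee_state = state.get(callee)     # .get never inserts into the map
                if callee_state is None:
                    continue                         # not a model-local function
                if callee_state == IN_PROGRESS:
                    return path[path.index(callee):] + [callee]
                if callee_state == UNVISITED:
                    state[callee] = IN_PROGRESS
                    path.append(callee)
                    stack.append((callee, iter(call_graph[callee])))
                    descended = True
                    break
            if not descended:
                state[node] = DONE
                path.pop()
                stack.pop()
    return None


def format_cycle(cycle):
    """Render a cycle in the 'a -> b -> a' style used in the error message."""
    return " -> ".join(cycle)


graph = {"local:first": ["local:second"], "local:second": ["local:first"]}
print(format_cycle(find_cycle(graph)))
```

Because each stack entry keeps its own iterator, a function resumes scanning its remaining callees after a child finishes, which is what lets the diamond shape pass without a false positive.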

**Integration** (`model.cc`)
- Applied in both `Model` constructors that process local functions.

**Test coverage** (`function_test.cc`)

Integration tests (full model load):
- `RejectsSelfRecursiveLocalFunction` — function calls itself
- `RejectsMutuallyRecursiveLocalFunctions` — A→B→A cycle
- `RejectsRecursionThroughSubgraph` — recursion via subgraph attribute
(e.g., inside If node)
- `RejectsLongerCycle` — A→B→C→A cycle, verifies cycle path reports all
participants
- `RejectsMultipleIndependentCycles` — two disjoint cycles in one model
- `AcceptsAcyclicDiamond` — diamond shape (A→B, A→C, B→D, C→D), no false
positive
- `AcceptsTrivialSingleNodeFunction` — single-Identity-node function
passes validation

Unit tests (call graph validation directly):
- `CallGraphAcyclic_EmptyGraph` — empty graph
- `CallGraphAcyclic_SingleNodeNoCalls` — single function, no callees
- `CallGraphAcyclic_SelfCycle` — self-loop
- `CallGraphAcyclic_MutualCycle` — A↔B
- `CallGraphAcyclic_LongerCycle` — A→B→C→A
- `CallGraphAcyclic_DiamondNoCycle` — diamond, no false positive
- `CallGraphAcyclic_DeepChainNoCycle` — long acyclic chain
- `CallGraphAcyclic_MultipleIndependentCycles` — two independent cycles
- `CallGraphAcyclic_SharedCallsDiamondNoCycle` — shared callees, no
false positive

### Motivation

A malicious or malformed ONNX model with recursive local function
definitions would cause the runtime to recurse until stack overflow
during function inlining. This check fails model loading early with a
clear error message.

### Testing

- Incremental build succeeds
- All new integration and unit tests pass
…Compute (microsoft#28223)

### Description

Replaces the long-standing `// TODO: fix this checker later` comment in
`MaxpoolWithMask::Compute` with real input validation. Without these
checks, a mismatched mask silently causes out-of-bounds memory access.

**Changes:**
- **`contrib_ops/cpu/maxpool_with_mask.h`**: Added three
  `ORT_RETURN_IF_NOT` guards:
  - Mask must have the same number of dimensions as the input tensor
  - Mask N and C dimensions must be nonzero when input is non-empty
    (prevents modulo-by-zero in `total_mask_channels`)
  - Each spatial dimension (dim ≥ 2) of the mask must match the
    corresponding input dimension
- **`test/contrib_ops/maxpool_mask_test.cc`**: Added three failure-case
  tests:
  - `MaxPoolWithMask_SpatialDimMismatch`: mask spatial dims differ from
    input
  - `MaxPoolWithMask_DimCountMismatch`: mask rank differs from input rank
  - `MaxPoolWithMask_MaskEmptyBatchDim`: mask N=0 with non-empty input
    triggers the nonzero N/C guard

### Motivation and Context

The mask tensor is indexed using the input's spatial step size (`x_step
= height * width`, etc.), so a shape mismatch leads to silent
out-of-bounds reads. Additionally, `total_mask_channels = m_shape[0] *
m_shape[1]` is used as a modulo divisor in the per-channel offset
formula; if either dimension is zero while the input is non-empty, this
causes undefined behaviour (division by zero). The original code had a
commented-out check with a `TODO` acknowledging this gap; this PR closes
it.
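
The three guards can be sketched as a shape check. This is an illustrative Python version, not the `ORT_RETURN_IF_NOT` code itself; `validate_mask_shape` and its error strings are hypothetical, and shapes are assumed to be laid out as N, C, then spatial dimensions.

```python
def validate_mask_shape(x_shape, m_shape):
    """Return an error string if the mask shape is unusable, else None."""
    # Guard 1: same rank as the input.
    if len(m_shape) != len(x_shape):
        return "mask rank must match input rank"
    # Guard 2: N and C of the mask feed a modulo divisor, so they must be
    # nonzero whenever there is input data to index.
    input_is_empty = any(dim == 0 for dim in x_shape)
    if not input_is_empty and any(dim == 0 for dim in m_shape[:2]):
        return "mask N and C must be nonzero for non-empty input"
    # Guard 3: spatial dimensions (axis >= 2) must match exactly, because the
    # mask is indexed with the input's spatial step size.
    for axis in range(2, len(x_shape)):
        if m_shape[axis] != x_shape[axis]:
            return f"mask dim {axis} must match input dim {axis}"
    return None


print(validate_mask_shape([1, 3, 8, 8], [1, 3, 8, 4]))
```

Note that the sketch deliberately does not require the mask's N and C to equal the input's, since the description says they are only used through a modulo.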

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@microsoft.com>
### Description
As titled.


### Motivation and Context
whisper-small in int4-kquant-mixed is close to int8.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…icrosoft#28430)

### Description

Add a unit test that verifies `RegisterExecutionProviderLibrary` /
`UnregisterExecutionProviderLibrary` does not leak the library handle
(regression test for microsoft#28396).

`ProviderLibrary::Load()` loads the EP library and probes for the
`GetProvider` symbol. Most plugin EP libraries don't export it, so the
probe fails. Before microsoft#28396, `Load()` returned the error without calling
`Unload()`, leaking a refcount.

### Test approach

The test copies the EP library to a temporary directory with a unique
filename, ensuring it has never been loaded in the process. After
register + unregister, it checks that the library is fully unloaded
(refcount == 0).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…not a file (microsoft#28431)

### Description
A file output path is needed when:
- Writing the output model to a file (not to a buffer or write function)
- Writing initializers to an external file (needs the model path to
compute the external file location)

Otherwise, the file output path validation can be skipped.


### Motivation and Context
When compiling a model via the Compile API in a sandboxed environment,
`CreateEpContextModel()` would attempt to validate/generate a file output
path, even when the user explicitly set the output to a buffer via
`SetOutputModelBuffer()`. This caused `std::filesystem::exists()` to throw
an "Access is denied" exception on the dummy model path
`_MODEL_EDITOR_API_MODEL_`, because the sandbox restricts filesystem
access.
## Summary
- reject out-of-range `cache_indirection` beam indices in the CPU
beam-attention path before they are converted into past KV offsets
- keep `DecoderMaskedMultiHeadAttention` beam-width handling consistent
with the `cache_indirection` shape
- add CPU regression tests for `MultiHeadAttention` and
`DecoderMaskedMultiHeadAttention`

## Motivation
`MultiHeadAttention` and `DecoderMaskedMultiHeadAttention` on the CPU
provider could consume attacker-controlled `cache_indirection` values as
beam indices without validating that each element stayed within `[0,
beam_width)`. That let malformed models compute offsets past the past
key/value buffers. This change rejects invalid indices up front and adds
focused tests for the failure path.

## Key Changes
- add shared CPU validation in
`AttentionCPUBase::ApplyAttentionWithBeams` so the beam path fails
before any past-key or past-value reads occur
- report an `INVALID_ARGUMENT` error that identifies the offending beam
index and its position
- validate that an explicit decoder `beam_width` input matches
`cache_indirection` dimension 1 when both are present
- add contrib-op tests that exercise invalid cache indirection values on
the CPU execution provider
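
A rough sketch of the up-front range check, in Python rather than the C++ of `AttentionCPUBase::ApplyAttentionWithBeams`. The function name, the nested-list layout `[batch][beam][step]`, and the message wording are assumptions made for illustration; only the `[0, beam_width)` range and the dimension-1 consistency rule come from the description.

```python
def validate_cache_indirection(cache_indirection, beam_width):
    """Raise ValueError before any index is turned into a past-KV offset.

    cache_indirection is assumed to be nested lists indexed [batch][beam][step].
    """
    for batch, beams in enumerate(cache_indirection):
        # Dimension 1 must agree with the explicit beam width.
        if len(beams) != beam_width:
            raise ValueError(
                f"cache_indirection dimension 1 is {len(beams)}, expected beam_width {beam_width}"
            )
        for beam, steps in enumerate(beams):
            for step, index in enumerate(steps):
                if not 0 <= index < beam_width:
                    raise ValueError(
                        f"beam index {index} at [{batch}, {beam}, {step}] "
                        f"is outside [0, {beam_width})"
                    )


# Two beams, three steps; every entry must be 0 or 1.
validate_cache_indirection([[[0, 1, 0], [1, 1, 0]]], beam_width=2)
```

Checking the whole tensor first keeps the attention loops themselves free of per-read bounds logic, at the cost of one extra pass over the indices.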

## Testing
- `lintrunner -a`
- `cd build/Linux/Debug && make -j4
CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc.o
CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/decoder_masked_multihead_attention.cc.o
CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/multihead_attention_op_test.cc.o
CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/decoder_masked_multihead_attention_op_test.cc.o`
- full `onnxruntime_provider_test` relink/run was not completed locally

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…ors with >2^31 elements (microsoft#28386)

- [x] Fix `unary_elementwise_impl.cuh`: Change `CUDA_LONG` to `int64_t`
for `N` parameter and loop index in `_UnaryElementWise` kernel, and fix
`blocksPerGrid` calculation
- [x] Fix `cast_op.cu`: Change `CUDA_LONG` to `int64_t` for `N`
parameter and loop index in `CastKernelStd`, `CastKernelSat`, and
`CudaCastPairwiseKernel` kernels, and remove `static_cast<int>`
truncation
- [x] Use `size_t` for `pair_count` in CudaCastPairwise to avoid double
conversion (review feedback)
- [x] Rename test to `CastKernelCorrectness_ModerateSize` and add
`CastKernel_Int64IndexArithmetic_NoOverflow` host-side test (review
feedback)
- [x] Merge from main to resolve conflicts with Float8E8M0 tests
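
To see why a 32-bit element count breaks past 2^31 elements, here is a small Python illustration that emulates the narrowing with `ctypes`. It is not the CUDA code: `blocks_per_grid` and the thread and per-thread element counts are hypothetical values chosen for the example.

```python
import ctypes


def as_int32(value):
    """Emulate storing a count in a 32-bit signed integer (wraps silently)."""
    return ctypes.c_int32(value).value


def blocks_per_grid(n, threads_per_block=256, elements_per_thread=4):
    """Ceiling division of n by the number of elements one block handles."""
    per_block = threads_per_block * elements_per_thread
    return (n + per_block - 1) // per_block


n = 2**31 + 8                      # just over the 32-bit signed limit
truncated = as_int32(n)            # what a 32-bit count would hold
print("64-bit count:", n, "-> blocks:", blocks_per_grid(n))
print("32-bit count:", truncated, "-> blocks:", blocks_per_grid(truncated))
```

With the wrapped count the grid size comes out non-positive, so no element would be processed correctly; keeping the count and loop index in 64 bits avoids the wrap.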

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
)

### Description

Hardens the XNNPACK Gemm capability check against two SIGSEGV crashes
during graph partitioning: one when the optional `C` input is omitted,
one when `C` is a rank-0 (scalar) tensor. The check now guards the null
`C` arg before calling `Shape()`, and rejects rank-0 `C` so the node
falls back to the CPU EP cleanly.

Thanks @kadu-v, the minimal Python repros made the root cause easy to
confirm. Both reproduced as a hard crash on the first `InferenceSession`
construction.

### Motivation and Context

Fixes microsoft#28541
Fixes microsoft#28542

`Gemm::IsOnnxNodeSupported` dereferenced `C_arg->Shape()` without
checking whether `C_arg` was non-null, so any Gemm without the optional
bias segfaulted before the EP could decline the node. A rank-0 `C` then
survived the existing checks and reached XNNPACK's fully-connected path,
which doesn't implement scalar broadcast (there's already a TODO in that
file noting it). That's the second SIGSEGV.

### Changes

`onnxruntime/core/providers/xnnpack/math/gemm.cc`:
- Null-check `C_arg` before reading its shape. Absent `C` is valid per
the Gemm spec; treat it as "no bias".
- Reject `C` with rank 0 from `IsOnnxNodeSupported` so the node falls
through to CPU. Adding scalar broadcast support belongs with the TODO in
the fully-connected path, not in the capability check.
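
The two guards amount to the following decision, sketched here in Python. `NodeArg`, `gemm_bias_supported`, and the handling of an unknown shape are hypothetical stand-ins; the real check lives in `Gemm::IsOnnxNodeSupported` and covers more than the `C` input.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NodeArg:
    """Hypothetical stand-in for a graph input; shape is None when unknown."""
    shape: Optional[Tuple[int, ...]]


def gemm_bias_supported(c_arg: Optional[NodeArg]) -> bool:
    """Decide whether the optional C input is acceptable for this EP."""
    if c_arg is None:
        return True    # C omitted: valid Gemm, treated as "no bias"
    if c_arg.shape is None:
        return False   # assumption: decline when the shape is unknown
    if len(c_arg.shape) == 0:
        return False   # rank-0 scalar: no scalar broadcast, fall back to CPU
    return True


print(gemm_bias_supported(None), gemm_bias_supported(NodeArg(shape=())))
```

The important ordering is that the null check happens before anything reads the shape, which is the dereference that crashed.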

### Testing

Three regression tests in
`onnxruntime/test/providers/xnnpack/xnnpack_basic_test.cc`:
- `TestGemm_NoC_NoSegfault` builds a Gemm with the `C` input omitted.
- `TestGemm_ScalarC_NoSegfault` builds a Gemm with a rank-0 `C`.
- `TestGemm_EmptyC_NoSegfault` covers an empty-shape `C` edge case.

Each test loads an `InferenceSession` with the XNNPACK EP registered and
asserts no crash.

I also suspect `Gemm`'s constructor has pre-existing crashes when `A` or
`B` is 1-D, before the capability check even runs. Haven't reproduced
it. Can file a follow-up if useful.

Signed-off-by: Dhruvil <dhruvilparikh79@gmail.com>
## Description

Adds path traversal validation for sparse tensors with external data,
closing a gap where `SparseTensorProtoToDenseTensorProto` would read
external files without checking whether the path escapes the model
directory.

### Bug fix (pre-existing)

- **`CopySparseData` indices size check**: The `raw_data().size()` check
was wrong for external data (where `raw_data` is empty). Fixed by adding
a pre-unpack `raw_data` size guard for inline data and a post-unpack
`unpack_buffer` size check for all data sources.

### Tests

- **Security tests** (tensorutils_test.cc): Path traversal blocked
(values, indices), absolute path blocked (values, indices), zero-element
regression (zero dense elements, zero NNZ). All create escaping files
and assert specifically for `"escapes"` error.
- **Positive tests** (sparse_kernels_test.cc): 7 end-to-end tests for
legitimate sparse tensors with external data — external values, external
indices (INT64/INT32/INT16/INT8), both external (rank-1 and rank-2 COO).

### Known limitation (deferred)

ORT_MEM_ADDR in-memory external data for sparse tensors can trigger
arbitrary memory reads. This is a separate issue from path validation —
`LoadSparseInitializerOrtFormat` legitimately uses in-memory markers for
ORT-format models, so blanket rejection would break functionality.
Should be addressed in a separate PR.

## Motivation and Context

A malicious ONNX model could use `../` path traversal in sparse tensor
external data locations to read arbitrary files outside the model
directory. Dense tensors already had this validation; sparse tensors did
not.
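
As an illustration of the kind of check involved, the sketch below resolves an external-data location against the model directory and rejects anything that lands outside it. It is a Python approximation, not the ORT implementation: `resolve_external_data_path` is a hypothetical name, and the error text only borrows the word "escapes" that the tests assert on.

```python
import os


def resolve_external_data_path(model_dir, location):
    """Resolve location under model_dir, rejecting paths that escape it."""
    if os.path.isabs(location):
        raise ValueError(f"external data path '{location}' escapes the model directory")
    base = os.path.realpath(model_dir)
    candidate = os.path.realpath(os.path.join(base, location))
    # commonpath equals base only when candidate is base itself or lies below it.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"external data path '{location}' escapes the model directory")
    return candidate


model_dir = os.path.join(os.getcwd(), "model_dir")
print(resolve_external_data_path(model_dir, "weights/values.bin"))
```

Resolving the path before comparing matters: a location such as `sub/../../x` contains no leading `../` but still ends up outside the directory.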

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28538)

### Description

`MlasActivationTest.ExecuteShort` (`test_activation.cpp`) feeds NaN
inputs through `MlasActivation` and asserts the output matches the
expected value bit-for-bit. This change adds one accepted case: when the
expected value is a NaN, any NaN output passes.

Non-NaN comparisons are unchanged: a finite output where a NaN is
expected (or the reverse) still fails. Test-only change, no library
behavior impact.

Verified: `onnxruntime_mlas_test --gtest_filter=Activation.ShortExecute`
on SpacemiT K3 (riscv64, RVV VLEN=256), rv-gcc 15.2: FAILED before,
PASSED after (re-run x3). x86/x64 behavior unaffected.

### Motivation and Context

The bit-exact assertion (`Buffer[i].u == TestData[i][kind].u`)
implicitly assumes the input NaN payload survives the activation. For
kinds evaluated by floating-point arithmetic (LeakyRelu, `alpha * x`;
HardSigmoid, `alpha * x + beta`), that only holds on ISAs that propagate
NaN payloads (x86, ARM).

IEEE-754 does not require NaN payload propagation. RISC-V's `F`
extension mandates that any FP operation producing a NaN yields the
canonical quiet NaN (`0x7fc00000` for f32), discarding the payload. So
on riscv64 these kinds emit `0x7fc00000` for a NaN input, a correct
"NaN in → NaN out" result whose bit pattern simply differs from the
input, and the bit-exact check fails.

Accepting any NaN where a NaN is expected restores the test to the
portable IEEE-754 **contract.**
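
The relaxed comparison can be sketched on raw f32 bit patterns. This is a Python illustration with hypothetical helper names (`bits_to_f32`, `outputs_match`), not the test code in `test_activation.cpp`.

```python
import math
import struct


def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def outputs_match(actual_bits, expected_bits):
    """Bit-exact comparison, except that any NaN satisfies an expected NaN."""
    if math.isnan(bits_to_f32(expected_bits)):
        return math.isnan(bits_to_f32(actual_bits))
    return actual_bits == expected_bits


CANONICAL_QNAN = 0x7FC00000      # canonical quiet NaN for f32
NAN_WITH_PAYLOAD = 0x7FC12345    # a quiet NaN carrying a payload (illustrative)
ONE = 0x3F800000                 # 1.0f

print(outputs_match(CANONICAL_QNAN, NAN_WITH_PAYLOAD))  # NaN for NaN: accepted
print(outputs_match(ONE, NAN_WITH_PAYLOAD))             # finite for NaN: rejected
```

Everything that is not an expected NaN still goes through the bit-exact path, so signed zeros and finite values are compared exactly as before.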

Signed-off-by: qiurui144 <happyqiurui@163.com>
)

## Problem

Windows ML engineers need telemetry that answers: **"Which Execution
Providers and hardware device types are apps using for inference, and
how much?"**

Today, the inference telemetry has these gaps:

| Event | Gap |
|---|---|
| SessionCreation | No hardware device type (CPU/GPU/NPU), no vendor ID. Fires once; falls out of the 24h pipeline join window for long-lived sessions. |
| RuntimePerf | No EP type, no hardware device; only session_id, requires a join back to SessionCreation. |
| ExecutionProviderEvent | Only fires for DML. Irrelevant for QNN/OpenVINO/etc. |

The new Windows ML EP plugin platform (OrtEpDevice / OrtEpFactory /
OrtHardwareDevice) already has all the hardware metadata we need; we
just weren't surfacing it.

## What this PR does

### 1. New `EpDeviceUsage` ETW event

Emitted once per `(EP, hardware device)` tuple at session init and on
every `RuntimePerf` heartbeat (plus a destructor flush). Each event is
self-contained:

| Field | Example |
|---|---|
| executionProviderType | `QNNExecutionProvider` |
| hardwareDeviceType | `NPU` / `GPU` / `CPU` / `FPGA` / `UNKNOWN` |
| hardwareVendorId / hardwareDeviceId | `0x5143` / `0x0901` (PCI IDs) |
| hardwareVendor | `Qualcomm` |
| epVendor | `Qualcomm` |
| assignedNodeCount | `89` (count after graph partitioning) |
| totalRunsSinceLast / totalRunDurationSinceLast | session-level run counters |

This gives downstream consumers a trivial `GROUP BY
executionProviderType, hardwareDeviceType` without needing to join back
to `SessionCreation`. Works for long-lived sessions that span past the
24h pipeline join window.
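
As a rough picture of that downstream aggregation, the Python sketch below groups self-contained events by EP type and device type. The field names come from the table above; `summarize_usage`, the sample values, and the duration unit are assumptions, since the description does not state them.

```python
from collections import defaultdict


def summarize_usage(events):
    """Sum run counters per (executionProviderType, hardwareDeviceType)."""
    totals = defaultdict(lambda: {"runs": 0, "duration": 0})
    for event in events:
        key = (event["executionProviderType"], event["hardwareDeviceType"])
        totals[key]["runs"] += event["totalRunsSinceLast"]
        totals[key]["duration"] += event["totalRunDurationSinceLast"]
    return dict(totals)


# Made-up sample events; each one carries everything needed for grouping.
sample_events = [
    {"executionProviderType": "QNNExecutionProvider", "hardwareDeviceType": "NPU",
     "totalRunsSinceLast": 10, "totalRunDurationSinceLast": 500},
    {"executionProviderType": "QNNExecutionProvider", "hardwareDeviceType": "NPU",
     "totalRunsSinceLast": 5, "totalRunDurationSinceLast": 250},
    {"executionProviderType": "CPUExecutionProvider", "hardwareDeviceType": "CPU",
     "totalRunsSinceLast": 2, "totalRunDurationSinceLast": 40},
]
summary = summarize_usage(sample_events)
print(summary)
```

No lookup against a session-creation record is needed, which is the property the event design is after.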

### 2. `SessionCreation` enrichment

Added `hardwareDeviceTypes` and `hardwareVendorIds` (comma-separated,
positionally aligned with the existing `executionProviderIds`). Bumped
`schemaVersion` 0 -> 1.

## Implementation notes

- `LogEpDeviceUsage` added to the `Telemetry` interface with a no-op
default; `WindowsTelemetry` implements it via TraceLogging under the
existing `Microsoft.ML.ONNXRuntime` provider (no new provider GUID).
- `InferenceSession::PopulateEpDeviceInfo` runs after graph
partitioning. For EPs created via the V2 path
(`AppendExecutionProvider_V2` / `SetEpSelectionPolicy` /
`RegisterExecutionProviderLibrary`) it pulls full hardware metadata from
`IExecutionProvider::GetEpDevices()`. For legacy EPs it falls back to
`IExecutionProvider::GetDevice()` (OrtDevice type + vendor ID; no PCI
device ID).
- Heartbeat block in `Run()` and destructor flush in `~InferenceSession`
both emit `LogEpDeviceUsage` per entry.

## Testing

- Debug build with Ninja: clean build (1636 targets)
- `onnxruntime_test_all` (full suite): **1571 passed, 0 failed**, 3
skipped (CUDA-EP-gated, environment)
- No memory leaks reported

## Compatibility

- No public C API surface changes.
- `Telemetry::LogSessionCreation` virtual gains two `const std::string&`
parameters — all in-tree overrides are updated.
- `LogEpDeviceUsage` has a no-op default, so non-Windows platforms are
unaffected.

---------

Co-authored-by: Angela Serrano Brummett <angelser@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oft#28567)

### Description
Without this change, `./build.sh --config MinSizeRel --build_wasm
--skip_tests --target onnxruntime_webassembly
--compile_no_warning_as_error --parallel` fails on Asahi Linux:

```
-- Fetch protoc_binary from https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-linux-x86_32.zip

<snip>

/bin/sh: line 1: ../protoc_binary-src/bin/protoc: cannot execute binary file: Exec format error
```

The command succeeds with this change.
microsoft#28556)

### Description
This PR

* Fixes Linux Python wheel build failures that occur when pip installs
`mypy>=2.1`, which depends on `ast-serialize`, a Rust-based package.
Cargo failed to fetch the package because of network isolation. The fix
uses the internal Cargo feed URL and updates the config to point at the
Cargo repository URL of the internal Artifactory.



### Motivation and Context

Co-authored-by: Kusuma Padma Kavya Bandi <kusbandi@microsoft.com>
…as/lib/riscv64/sconv_depthwise_kernel_rvv.cpp:138:18: error: unused variable 'pad_bottom' (microsoft#28506)

…riscv64/sconv_depthwise_kernel_rvv.cpp:138:18: error: unused variable
'pad_bottom'

Correcting Compilation Errors:
onnxruntime/onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp:138:18:
error: unused variable 'pad_bottom'
…MM Refactor (microsoft#28467)

## Description

Update the `QMoE` contrib operator for the CUDA EP to support quantized
Mixture-of-Experts inference with INT4, INT8, FP4 (MXFP4 e2m1), FP8
(e4m3fn), and WFP4AFP8 (mixed FP4 weight × FP8 activation) quantization
formats.

This also refactors the existing MoE GEMM infrastructure to support TMA
warp-specialized grouped GEMM on Hopper (SM90), native MXFP4 on
Blackwell (SM120), and block-scaled tensor ops on SM100+, with automatic
fallback to dequantization on older architectures.

Note that this is adapted from the `TensorRT-LLM` MoE implementation;
`moe_qmoe.md` has a section describing the modifications.

## Summary of Changes

### New QMoE Operator

| File | Change |
|------|--------|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Register `QMoE` op schema (com.microsoft domain, opset 1) |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc/h` | QMoE CUDA kernel implementation with dynamic runner selection |
| `onnxruntime/contrib_ops/cuda/moe/qmoe_kernels.cu/h` | Softmax top-k router, sparse mixer, zero-point pre-packing kernels |
| `onnxruntime/contrib_ops/cuda/moe/moe_base.h` | Shared MoE base class updates for quantization attributes |
| `docs/contrib_ops/cuda/moe_qmoe.md` | Comprehensive operator documentation (inputs, attributes, quantization formats) |

### MoE GEMM Refactor

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_kernels.h` | Unified `CutlassMoeFCRunner` template with FP4/FP8/WFP4AFP8 specializations |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_template_dispatch.h` | Three-family dispatch: Ampere GemmGrouped, TMA warp-specialized, block-scaled tensor ops |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemm_profiler.cc/h` | MoE-specific GEMM tactic profiler for auto-tuning |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/common.h` | Shared MoE GEMM types and config structs |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/launchers/` | SM80/SM90/SM120 launcher instantiations (including generated .cu files) |

### CUTLASS Extensions

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/` | Grid dependency control, TMA copy traits, multi-mem copy operations |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/` | Mixed-input and gated GEMM collective builders for SM90 |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/` | Fused MoE kernel traits/routines, MoE problem visitors, gated GEMM kernels |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue/` | MoE finalize epilogue, per-row/per-col scale epilogues |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/system_barrier.h` | System barrier for multi-CTA synchronization |

### Common CUDA Utilities

- `onnxruntime/contrib_ops/cuda/llm/common/cuda_fp8_utils.cu/h` — FP8
conversion, quantization, dequantization kernels
- `onnxruntime/contrib_ops/cuda/llm/common/memory_utils.cu/h` — Device
memory transpose, permute, type conversion utilities
- `onnxruntime/contrib_ops/cuda/llm/common/cuda_type_utils.cuh` —
Unified type traits for half/bfloat16/float/fp8/fp4
- `onnxruntime/contrib_ops/cuda/llm/common/quantization.h` —
Quantization parameter structs and helpers
- `onnxruntime/contrib_ops/cuda/llm/common/reduce_kernel_utils.cuh` —
Warp/block reduction primitives
- `onnxruntime/contrib_ops/cuda/llm/kernels/quantization.cuh` — FP4/FP8
quantization kernels
- `onnxruntime/contrib_ops/cuda/llm/kernels/pre_quant_scale_kernel.cu/h`
— Pre-quantization scaling kernel

### GEMM Profiler Refactor

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/gemm_profiler.cc/h` | Refactored GEMM profiler interface for tactic selection |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_heuristic.cc/h` | Updated heuristics for new kernel families |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm_configs.h` | Extended GEMM config enums for TMA warp-specialized and gated configs |

### Build System

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Add `ENABLE_FP4`, `ENABLE_FP8`, `ENABLE_CUDA_FP4_QMOE`, `ORT_QUICK_BUILD`, `PLACEHOLDER_KERNELS` options |
| `cmake/external/cuda_configuration.cmake` | FP4/FP8 capability detection based on CUDA version and SM arch |
| `cmake/external/cutlass.cmake` | CUTLASS version bump |
| `cmake/onnxruntime_providers_cuda.cmake` | Add MoE GEMM source files and conditional FP4/FP8 kernel compilation |
| `cmake/onnxruntime_python.cmake` | Add `onnxruntime_pybind_quant.cc` for Python quantization bindings |

### Python Quantization Bindings

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_quant.cc` | C++ pybind module for MoE weight preprocessing (quantize, pack, preprocess) |
| `onnxruntime/python/tools/quantization/quant_utils.py` | FP4/FP8 quantization utilities |
| `setup.py` | Include new pybind module in package build |

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_qmoe_cuda.py` | INT4/INT8 QMoE tests (Phi3 topology, SwiGLU, blockwise, asymmetric) |
| `onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py` | MXFP4 QMoE tests |
| `onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py` | FP8 QMoE tests |
| `onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py` | WFP4AFP8 mixed-precision QMoE tests |
| `onnxruntime/test/python/transformers/test_moe_cuda.py` | Updated existing MoE tests for refactored infrastructure |
| `onnxruntime/test/contrib_ops/moe_test.cc` | C++ MoE unit tests updated |

### Existing MoE Refactor

- `onnxruntime/contrib_ops/cuda/moe/moe.cc/h` — Refactored to share base
with QMoE
- `onnxruntime/contrib_ops/cuda/moe/ft_moe/` →
`onnxruntime/contrib_ops/cuda/llm/moe_gemm/` — Relocated and rewritten
MoE GEMM kernels
- Removed old `cuda/quantization/moe_quantization.cc/h` in favor of new
`cuda/moe/moe_quantization.cc/h`

## Testing

- **INT4/INT8 QMoE**: `python -m pytest
onnxruntime/test/python/transformers/test_qmoe_cuda.py -v` (requires
CUDA GPU, SM75+)
- **FP4 QMoE**: `python -m pytest
onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py -v` (requires
SM120+ for native, falls back on older)
- **FP8 QMoE**: `python -m pytest
onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py -v` (requires
SM90+ for native)
- **WFP4AFP8 QMoE**: `python -m pytest
onnxruntime/test/python/transformers/test_qmoe_wfp4afp8_cuda.py -v`
(requires SM100+)
- **Existing MoE**: `python -m pytest
onnxruntime/test/python/transformers/test_moe_cuda.py -v`
- **C++ MoE tests**: Build with CUDA EP enabled, run
`onnxruntime_test_all --gtest_filter=*MoE*`
- All tests compare QMoE output against PyTorch reference
implementations with configurable tolerance

## Motivation and Context

Modern LLMs increasingly use Mixture-of-Experts architectures (e.g.,
Mixtral, DeepSeek, Phi-3.5-MoE) for efficient scaling. These models
benefit significantly from weight quantization to reduce memory
bandwidth and enable larger models on fewer GPUs. This PR:

1. **Adds native low-precision MoE support** — FP4 and FP8 quantized
weights avoid the dequantization overhead of INT4/INT8 on supported
hardware (Hopper, Blackwell).
2. **Introduces WFP4AFP8** — A novel mixed-precision mode where weights
are MXFP4 and activations are dynamically quantized to FP8, enabling 2×
weight compression with minimal accuracy loss on Blackwell GPUs.
3. **Refactors MoE GEMM infrastructure** — The previous
FasterTransformer-derived MoE GEMM code is replaced with a modern
CUTLASS 4.x-based dispatch system supporting three kernel families
across SM75–SM120+.
4. **Adds auto-tuning** — The GEMM profiler enables runtime tactic
selection for optimal performance across different expert sizes and
batch configurations.
… operators (microsoft#27677)

### Description

Hardens `TreeEnsemble` initialization against malformed/unvalidated ONNX
models by adding missing bounds checks and fixing existing ones.

**Validation additions (`tree_ensemble_attribute.h`)**
- Validate `base_values` / `base_values_as_tensor` size against
`n_targets_or_classes`: must be 0 or N (or ≤ 2 for binary classifiers)

**Validation additions/fixes (`tree_ensemble_common.h`)**
- Pre-loop: add `target_class_treeids.size() == limit` check alongside
existing `target_class_ids`, `target_class_nodeids`, and weights size
checks
- `nodes_featureids[i]`: validate original `int64_t` value is in `[0,
INT_MAX]` **before** narrowing cast to `int` — prevents large values
wrapping to positive ints and bypassing the old post-cast `>= 0` check;
original attribute value included in error message
- Per-row `ProcessTreeNodeLeave` calls (sections C/D/E, C2/D2/E2): pass
`x_data + (i+1)*stride` (per-row end) instead of `x_data + N*C` (global
tensor end); removes `x_data_end` from per-row lambda captures
- Error message for `target_class_ids` range check uses
`target_class_ids` (not the old hard-coded `target_ids`)
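
Two of these checks lend themselves to a short illustration. The Python below emulates the 32-bit narrowing with `ctypes` to show why the range check has to come before the cast, and spells out one reading of the `base_values` size rule. All function names are hypothetical, and the binary-classifier clause is an interpretation of "≤ 2 for binary classifiers" rather than the exact condition in the source.

```python
import ctypes

INT_MAX = 2**31 - 1


def narrow_to_int(value):
    """Emulate a narrowing cast from int64_t to a 32-bit int (wraps silently)."""
    return ctypes.c_int32(value).value


def checked_feature_id(raw_value):
    """Validate the original 64-bit value first, then narrow."""
    if not 0 <= raw_value <= INT_MAX:
        raise ValueError(f"nodes_featureids value {raw_value} is outside [0, {INT_MAX}]")
    return narrow_to_int(raw_value)


def base_values_size_ok(size, n_targets_or_classes, is_binary_classifier=False):
    """Accept 0 or N entries; assumed: binary classifiers may also use up to 2."""
    if size == 0 or size == n_targets_or_classes:
        return True
    return is_binary_classifier and size <= 2


wrapped = narrow_to_int(2**32 + 5)
print("2**32 + 5 narrows to", wrapped, "which would pass a post-cast >= 0 check")
```

The wrapped value looks like a perfectly ordinary small feature index, which is why only a check on the original attribute value can catch it.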

**Tests**
- Negative tests for out-of-range `target_ids`, negative
`nodes_featureids`, wrong-sized `base_values` (regressor and classifier,
binary/multi-class)

### Motivation and Context

Tree-building code assumed ONNX models had been pre-validated. These
changes prevent out-of-bounds memory access and wrap-around bugs when
loading unvalidated or adversarially crafted models.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
…28576)

## Description

This PR adds INT4/INT8 symmetric quantized KV cache support to the CPU
GroupQueryAttention contrib operator, enabling reduced memory bandwidth
during inference. The quantized path quantizes K/V values on write into
the present cache and performs dequantized-GEMM (QK and SV) during
attention computation, maintaining FP32 accumulation for accuracy.

Note that this is a baseline implementation. Further optimizations (such
as AVX2 and NEON kernels) will follow in a separate PR.
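
As a rough illustration of the quantize-on-write and dequantized-GEMM idea, the sketch below applies per-tensor symmetric INT8 quantization to one key row and computes a single QK dot product. The scale formula (`max|x| / 127`), the rounding mode, and the function names are assumptions for illustration; they are not the MLAS API, and Python floats are 64-bit rather than FP32.

```python
def quantize_symmetric_int8(values):
    """Per-tensor symmetric quantization: zero-point 0, range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantized_dot(query, quantized_key, scale):
    """Dot product of a float query with a quantized key row.

    Each key element is dequantized on the fly and the sum is
    accumulated in floating point.
    """
    acc = 0.0
    for q, k in zip(query, quantized_key):
        acc += q * (k * scale)
    return acc


key = [0.5, -1.0, 0.25, 0.75]
query = [1.0, 2.0, -1.0, 0.5]

q_key, key_scale = quantize_symmetric_int8(key)      # what would be stored in the cache
approx = dequantized_dot(query, q_key, key_scale)    # quantized-path result
exact = sum(q * k for q, k in zip(query, key))       # float reference
```

Comparing `approx` with `exact` gives a quick sense of the quantization error for a single row.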

## Summary of Changes

### MLAS Quantization Kernels

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/inc/mlas_qkv_quant.h` | New public API header
for INT4/INT8 KV-cache quantize, dequantize, and GEMM routines |
| `onnxruntime/core/mlas/lib/qkv_quant.cpp` | Portable reference
implementation of MlasKVQuantize, MlasQKGemm, MlasSVGemm,
MlasKVDequantize |
| `cmake/onnxruntime_mlas.cmake` | Register new source/header files in
the MLAS build |

### CPU GQA Operator Changes

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cpu/bert/attention_common.h` | Add
`StringToKVQuantizationType` helper for attribute parsing |
| `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h` | Add
`ConcatQuantStateChunkGQA`, `ToMlasKVQuantType`, quantized attention
base members, and `ApplyAttentionQuantized` implementation |
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc` | Extend
kernel registration for T_CACHE/T_KV_SCALE type constraints; add
quantization validation and dispatch to quantized path |
| `onnxruntime/contrib_ops/cpu/bert/group_query_attention_helper.h` |
Minor cleanup of helper logic |

### CUDA EP Guard

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Add
build-time guard for INT4 KV cache not enabled |

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/contrib_ops/group_query_attention_op_test.cc` | C++
unit tests for quantized KV cache (INT8/INT4, per-tensor/per-channel,
with/without past) |
| `onnxruntime/test/mlas/unittest/test_qkv_quant.cpp` | MLAS-level unit
tests for quantize/dequantize/GEMM kernels |
| `onnxruntime/test/python/transformers/test_gqa_cpu_quantized.py` |
Python integration tests validating end-to-end quantized GQA accuracy |

## Testing

- Run C++ tests: `./onnxruntime_test_all
--gtest_filter='*GroupQueryAttention*Quant*'`
- Run MLAS tests: `./onnxruntime_test_all --gtest_filter='*QKVQuant*'`
- Run Python tests: `pytest
onnxruntime/test/python/transformers/test_gqa_cpu_quantized.py -v`
- All existing GQA tests continue to pass (no behavioral change for
non-quantized paths)

## Motivation and Context

Quantized KV caches significantly reduce memory bandwidth requirements
for long-context LLM inference on CPU. The CUDA EP already supports
INT4/INT8 quantized KV caches; this PR brings parity to the CPU EP. The
MLAS kernels use the same packing conventions as the CUDA implementation
for model portability.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (new optional inputs/attributes, existing
behavior unchanged)
- [x] CI passes
…gin test pipelines (microsoft#28517)

## Summary

Five plugin test stage YAMLs reference `$(System.Debug)`, a predefined
ADO variable that is only set when "Enable system diagnostics" is
checked at queue time. When undefined, two failure modes occur:

- **bash command lines** (Linux WebGPU/CUDA stages): bash interprets
`$(System.Debug)` as command substitution and the step fails with `bash:
line 14: System.Debug: command not found`.
- **YAML `env:` blocks** (Mac/Win WebGPU + Win CUDA stages): the env var
is set to the literal string `$(System.Debug)`, meaningless to test code
that expects `true`/`false`/`1`/`0`.

## Fix

Add a derived pipeline-level variable to the two test pipeline entry
points:

```yaml
# Verbose-output flag for tests. Resolves to System.Debug when set
# (e.g. queue-time "Enable system diagnostics"), else 'false'.
- name: OrtTestVerbose
  value: $[ coalesce(variables['System.Debug'], 'false') ]
```

and switch the five stage references from `$(System.Debug)` to
`$(OrtTestVerbose)`.

This avoids redefining the predefined `System.Debug` variable while
preserving the "Enable system diagnostics" UI toggle — when that
checkbox is set, `System.Debug` is `'true'` at runtime and the derived
variable follows.

A compile-time `${{ if }}` template alternative was not used because
template expressions are evaluated at compile time and cannot see
queue-time runtime variables like `System.Debug`; only a runtime
expression (`$[ ]`) can react to the standard system-diagnostics
checkbox.
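
The fallback behaviour of the derived variable can be mimicked in a few lines of Python. This is only an emulation, assuming `coalesce` returns its first argument that is neither null nor an empty string; `resolve_ort_test_verbose` and the plain dict of variables are hypothetical stand-ins for the pipeline runtime.

```python
def coalesce(*values):
    """Return the first argument that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def resolve_ort_test_verbose(variables):
    """Mirror coalesce(variables['System.Debug'], 'false') over a plain dict."""
    return coalesce(variables.get("System.Debug"), "false")


# "Enable system diagnostics" not checked: System.Debug is undefined.
without_diagnostics = resolve_ort_test_verbose({})

# Checkbox set at queue time: the derived variable follows System.Debug.
with_diagnostics = resolve_ort_test_verbose({"System.Debug": "true"})
```

Either way the stages receive a concrete string, so neither the bash command-substitution failure nor the literal `$(System.Debug)` value can occur.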

## Files changed

| File | Change |
|---|---|
| `plugin-webgpu-test-pipeline.yml` | Add `OrtTestVerbose` derived
variable |
| `plugin-cuda-test-pipeline.yml` | Add `OrtTestVerbose` derived
variable |
| `stages/plugin-linux-webgpu-test-stage.yml` | `$(System.Debug)` →
`$(OrtTestVerbose)` |
| `stages/plugin-linux-cuda-test-stage.yml` | `$(System.Debug)` →
`$(OrtTestVerbose)` |
| `stages/plugin-mac-webgpu-test-stage.yml` | `$(System.Debug)` →
`$(OrtTestVerbose)` |
| `stages/plugin-win-webgpu-test-stage.yml` | `$(System.Debug)` →
`$(OrtTestVerbose)` |
| `stages/plugin-win-cuda-test-stage.yml` | `$(System.Debug)` →
`$(OrtTestVerbose)` |

Pipeline run verified ✅

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…titions (microsoft#28293)

### Description

Adds three MLProgram op builders (Identity, Ceil, Tile) to the CoreML EP
and a partition-quality heuristic that drops CoreML partitions
consisting entirely of trivial shape / cheap-elementwise ops.

No tracking issue; discovered via YOLOv10 partitioning analysis on Apple
Silicon.

### Empirical impact

YOLOv10n, M3 Max, MLProgram, batch 1, 1500-iter pooled:

|  | Partitions | Mean | StdDev | P99 |
|---|---|---|---|---|
| Without this patch | 4 | 3.798 ms | 0.867 | 6.608 |
| **With this patch** | **3** | **3.403 ms** | **0.636** | **5.957** |

**10.4% mean speedup, 26% stddev tightening.**
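
The headline percentages can be re-derived from the table as relative reductions. The short check below takes "speedup" to mean the fractional drop in mean latency, which is an interpretation of the text rather than something it states. Under that reading the mean reduction comes out at about 10.4% and the stddev reduction at roughly 26.6%, which the summary reports as 26%.

```python
before = {"mean": 3.798, "stddev": 0.867, "p99": 6.608}
after = {"mean": 3.403, "stddev": 0.636, "p99": 5.957}


def relative_reduction(old: float, new: float) -> float:
    """Fractional reduction of `new` relative to `old`."""
    return (old - new) / old


mean_gain = relative_reduction(before["mean"], after["mean"])
stddev_gain = relative_reduction(before["stddev"], after["stddev"])
p99_gain = relative_reduction(before["p99"], after["p99"])

print(f"mean: {mean_gain:.1%}, stddev: {stddev_gain:.1%}, p99: {p99_gain:.1%}")
```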

### Why both pieces are coupled

Adding the builders alone is net-negative on graphs where these ops sit
in isolated chains. Per-op CoreML dispatch overhead on M3 Max (32-op
chains on a 1×64×56×56 fp32 tensor, n=2997, MLProgram):

| op | CPU EP per op | CoreML EP per op |
|---|---|---|
| Identity | <1 µs | ~14 µs |
| Ceil | ~6 µs | ~12 µs |
| Tile | ~10 µs | ~10 µs |

A trivial-only partition pays ~50-100 µs of round-trip marshalling plus
~10 µs per op of CoreML dispatch, vs <1 µs each on CPU. These ops are
worth claiming only when sandwiched between compute-heavy ops, where the
round-trip is already paid for. The heuristic enforces that.
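
A back-of-the-envelope cost model makes the trade-off concrete. It simply adds the approximate per-op figures from the table to the quoted round-trip range; the additive model and the helper names are assumptions, and the Identity CPU cost uses 1 µs as an upper bound for "<1 µs".

```python
# Approximate per-op costs in microseconds, read off the table.
CPU_COST_US = {"Identity": 1.0, "Ceil": 6.0, "Tile": 10.0}
COREML_COST_US = {"Identity": 14.0, "Ceil": 12.0, "Tile": 10.0}

# Round-trip marshalling range quoted in the text (low, high).
ROUND_TRIP_US = (50.0, 100.0)


def cpu_chain_cost(ops):
    """Estimated cost of running the chain on the CPU EP."""
    return sum(CPU_COST_US[op] for op in ops)


def coreml_chain_cost(ops):
    """Estimated (low, high) cost of an isolated CoreML partition."""
    dispatch = sum(COREML_COST_US[op] for op in ops)
    return ROUND_TRIP_US[0] + dispatch, ROUND_TRIP_US[1] + dispatch


chain = ["Identity", "Ceil", "Tile"]
cpu_us = cpu_chain_cost(chain)
coreml_low_us, coreml_high_us = coreml_chain_cost(chain)
```

For this three-op chain the estimate is about 17 µs on CPU against 86-136 µs for an isolated CoreML partition, which is the regression the heuristic is meant to avoid.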

### Implementation

**New op builders.** `Identity` emits MIL `identity` (NN path uses
`LINEAR(α=1, β=0)`). `Ceil` joins the existing unary chain in
`UnaryOpBuilder`. `Tile` emits MIL `tile`; it overrides
`HasSupportedInputsImpl` to additionally accept INT32/INT64/BOOL (Tile
is shape-only data movement, so the default float-only filter rejected
it on common YOLO grid-index post-processing) and accepts a runtime
`reps` tensor in addition to a constant initializer.

**Heuristic.** `CoreMLExecutionProvider::GetCapability` now uses the
callback-taking overload of `CreateSupportedPartitions` (same as NNAPI
EP). A partition is kept iff at least one node is outside the trivial
set:

```
{Identity, Cast, Reshape, Squeeze, Unsqueeze, Flatten, Transpose, Tile, Ceil}
```

This lets the new builders absorb mid-chain trivial ops into existing
CoreML partitions (the win) without claiming isolated trivial chains
that would force a needless CPU→CoreML→CPU detour (the regression).
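
The keep/drop rule itself is small enough to sketch. The version below works on plain op-type strings rather than graph nodes, and `keep_partition` is a hypothetical name; the actual callback passed to `CreateSupportedPartitions` is C++ and may differ in detail.

```python
TRIVIAL_OPS = frozenset({
    "Identity", "Cast", "Reshape", "Squeeze", "Unsqueeze",
    "Flatten", "Transpose", "Tile", "Ceil",
})


def keep_partition(op_types):
    """Keep a partition iff at least one node is outside the trivial set."""
    return any(op not in TRIVIAL_OPS for op in op_types)


# Trivial ops between two Convs ride along with the compute anchor.
anchored = keep_partition(["Conv", "Identity", "Cast", "Reshape", "Conv"])

# An isolated trivial chain is dropped and falls back to CPU.
isolated = keep_partition(["Reshape", "Transpose", "Tile"])
```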

### Tests

`coreml_basic_test.cc` covers both halves of the heuristic.

Builder coverage (compute anchor present → claimed):
- `IdentityWithConvAnchor`, `CeilWithConvAnchor`, `TileWithConvAnchor`

Heuristic coverage:
- `ConvTrivialChainConvKeepsAllOnCoreML` — Conv → Identity → Cast →
Reshape → Conv stays in a single CoreML partition
- `TrivialOnlyChainIsNotClaimedByCoreML`,
`ReshapeOnlyChainIsNotClaimedByCoreML`,
`TransposeOnlyChainIsNotClaimedByCoreML`,
`TileOnlyIsNotClaimedByCoreML`, `CeilOnlyIsNotClaimedByCoreML`,
`MixedTrivialChainIsNotClaimedByCoreML` — each falls back to CPU

Trivial-only tests pin `graph_optimization_level = Default` so passes
like `IdentityElimination` / `CastElimination` cannot pre-empt the
heuristic; what is exercised is `GetCapability` itself.

All 28 CoreML EP tests pass locally on macOS 26.3 / M3 Max.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y InsertCastTransformer (microsoft#28268)

### Description
- Pass `on_partition_assignment_fn` into `InsertCastTransformer` so that
CPU assignments made during cast insertion are recorded with the same
callback used by `GraphPartitioner`.
- Avoid duplicated records in `ForceSingleNodeCPUFloat16ToFloat32`.

### Motivation and Context

- `InferenceSession::TransformGraph()` records EP-to-node assignments
via `on_partition_assignment_fn`, which is invoked by `GraphPartitioner`
during **step 4** (graph partitioning). However, **step 6**
(`InsertCastTransformer`) can silently assign nodes to the CPU EP after
this recording window has closed.
- Specifically, `InsertCastTransformer::ApplyImpl()` assigns FP16 nodes
that no registered EP supports to the CPU EP via
`node->SetExecutionProviderType(kCpuExecutionProvider)`. Because this
happens after `GraphPartitioner` has already run,
`on_partition_assignment_fn` is never invoked for these nodes, resulting
in an empty or incomplete `GetEpGraphAssignmentInfo()` for such models
(e.g. an FP16 Relu model with no FP16-capable EP).
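
The recording gap can be modelled with a toy two-step pipeline. Everything here is a simplified stand-in: `partition_graph` and `insert_cast_fallback` are hypothetical analogues of steps 4 and 6, and assignments are kept in a plain dict rather than on graph nodes.

```python
CPU_EP = "CPUExecutionProvider"


def partition_graph(assignments, claimed_by, on_assignment):
    """Step 4 analogue: assign nodes that some EP claims and record each one."""
    for node, ep in claimed_by.items():
        assignments[node] = ep
        on_assignment(node, ep)


def insert_cast_fallback(assignments, on_assignment=None):
    """Step 6 analogue: still-unassigned nodes fall back to the CPU EP.

    The callback fires only for nodes assigned in this step, so nodes
    recorded during partitioning are not recorded a second time.
    """
    for node, ep in assignments.items():
        if ep is None:
            assignments[node] = CPU_EP
            if on_assignment is not None:
                on_assignment(node, CPU_EP)


records = []


def record(node, ep):
    records.append((node, ep))


assignments = {"conv": None, "relu_fp16": None}
partition_graph(assignments, {"conv": "CoreMLExecutionProvider"}, record)
insert_cast_fallback(assignments, record)
```

Calling `insert_cast_fallback(assignments)` without the callback would still assign `relu_fp16` to CPU but leave it out of `records`, which corresponds to the incomplete assignment info described in the list.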