Sync with Microsoft ONNX Runtime - 20052026 by ai-fw-intg · Pull Request #1098 · intel/onnxruntime

ai-fw-intg · 2026-05-19T20:33:32Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

`indices` is built once and then only read during recursive calls to `CheckIfSubtreesAreEqual`. However, it was passed by value, causing a full copy on every recursive call. Changed to `const&`.

## Data from the profiler:

To collect the following data, a model with a single TreeEnsembleClassifier node (5000 trees and 3.3 million nodes) was used. The loading time dropped from 18 minutes to about 4 seconds.

### After

<img width="1793" height="547" alt="Screenshot 2026-03-25 at 6 40 25 PM" src="https://github.com/user-attachments/assets/d7c00335-8246-4bd1-9e4d-b0e956d48cdd" />

### Before

<img width="1763" height="548" alt="Screenshot 2026-03-25 at 6 40 40 PM" src="https://github.com/user-attachments/assets/35683112-2919-4031-955c-922937f2df8f" />
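The cost of passing a read-only container by value through a recursion can be illustrated with a small sketch. The `CountingTable` type, the `g_copies` counter, and both `Walk...` functions are hypothetical stand-ins, not the actual `CheckIfSubtreesAreEqual` code; they only count how often the copy constructor runs.

```cpp
#include <utility>
#include <vector>

int g_copies = 0;  // counts copy-constructor calls (illustration only)

// Hypothetical stand-in for a lookup table that is built once and then only read.
struct CountingTable {
  std::vector<int> data;
  explicit CountingTable(std::vector<int> d) : data(std::move(d)) {}
  CountingTable(const CountingTable& other) : data(other.data) { ++g_copies; }
};

// By value: every call, including each recursive one, copies the whole table.
int WalkByValue(CountingTable table, int depth) {
  return depth == 0 ? 0 : 1 + WalkByValue(table, depth - 1);
}

// By const reference: the same table is shared by all recursion levels.
int WalkByConstRef(const CountingTable& table, int depth) {
  return depth == 0 ? 0 : 1 + WalkByConstRef(table, depth - 1);
}
```

With a recursion depth of 3, `WalkByValue` performs four copies (one per call), while `WalkByConstRef` performs none. With millions of nodes, the per-call copy dominates the run time.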

…ft#28520)

## Summary

- Remove the `Subgroups` feature requirement from `CanApplyFlashAttention`, enabling flash attention on devices without subgroup support
- Generalize the Apple-specific shared-memory prefill path into a `use_shm_path` flag that activates for Apple, NVIDIA, or any device lacking subgroups
- Replace `is_apple` shader parameter with `use_shm_path` throughout the WGSL template

## Motivation

Two issues exist on the current main branch:

1. **NVIDIA prefill produces incorrect results (regression from microsoft#28511):** PR microsoft#28511 increased `max_k_step` to 32 for NVIDIA in C++, but the shader's subgroup-based path only has `qk_1..qk_4` (16 hardcoded key indices). When `sg_size=32` (e.g. RTX 5080), the loop steps by 32 but only computes QK for keys 0-15, silently skipping keys 16-31. This produces incorrect attention output for models like phi4.
2. **Flash attention prefill unavailable without Subgroups:** `CanApplyFlashAttention` gates on `context.HasFeature(wgpu::FeatureName::Subgroups)`, forcing devices without subgroup support to fall back to the slower split-reduce 2-kernel path for prefill, even though the Apple shared-memory path in the shader is fully subgroup-free.

This PR fixes both issues by routing Apple, NVIDIA, and no-subgroup devices through the loop-based shared-memory path (`use_shm_path`), which naturally handles any `max_k_step` value via `array<q_element_t, max_k_step>` and loop iteration, with no hardcoded key count.

## Test plan

- [x] Built ORT with WebGPU EP on Windows (Release, VS 2022)
- [x] Deployed and ran phi4-graph-prune model: output verified correct ("1+1 equals 2.")
- [x] Lint check passed (`lintrunner -a`)
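The routing rule and the skipped-key arithmetic described above can be modelled on the host side as a rough sketch. The `Vendor` enum, `UseShmPath`, and `KeysScored` are hypothetical names invented for this illustration; the real logic lives in the WebGPU EP and its WGSL template.

```cpp
#include <algorithm>

enum class Vendor { kApple, kNvidia, kOther };  // hypothetical vendor tag

// Apple, NVIDIA, or any device without subgroups takes the shared-memory path.
bool UseShmPath(Vendor vendor, bool has_subgroups) {
  return vendor == Vendor::kApple || vendor == Vendor::kNvidia || !has_subgroups;
}

// Counts how many keys get a QK score when the loop advances by `k_step`
// but each iteration only handles `keys_per_step` keys.
int KeysScored(int total_keys, int k_step, int keys_per_step) {
  int scored = 0;
  for (int k = 0; k < total_keys; k += k_step) {
    scored += std::min(keys_per_step, total_keys - k);
  }
  return scored;
}
```

Under this model, 64 keys with a step of 32 and 16 hardcoded keys per step yields only 32 scored keys, which mirrors the "keys 16-31 skipped" regression. When each step handles as many keys as it advances, all 64 are covered.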

### Description

- Add copyright headers to source files
- Enrich Python and NuGet package metadata
- Add ORT license files to packages
- Clean up readme files

### Motivation and Context

WebGPU plugin EP packaging improvements.

Note: Similar updates can be considered for the CUDA plugin EP, but this PR is scoped to just the WebGPU EP for ease of cherry-picking into the WebGPU plugin EP release branch.

…oft#28187)

### Description

Detect and reject recursive cycles in model local function definitions during model loading, preventing stack overflow from unbounded recursion during function inlining.

### Changes

**Call-graph construction and cycle detection** (`model_helpers.cc`, `model_helpers.h`)
- `BuildLocalFunctionCallGraph()` builds an adjacency-list call graph from model local functions using iterative subgraph traversal (no recursion, safe against deeply nested subgraph attributes).
- `ValidateCallGraphAcyclic()` performs iterative DFS cycle detection. Uses `find()` throughout (no `operator[]`) to prevent accidental map insertions.
- `ValidateModelLocalFunctionAcyclic()` convenience wrapper.
- On cycle detection, returns a descriptive error showing the full cycle path (e.g., `"local:first -> local:second -> local:first"`).

**Integration** (`model.cc`)
- Applied in both `Model` constructors that process local functions.

**Test coverage** (`function_test.cc`)

Integration tests (full model load):
- `RejectsSelfRecursiveLocalFunction`: function calls itself
- `RejectsMutuallyRecursiveLocalFunctions`: A→B→A cycle
- `RejectsRecursionThroughSubgraph`: recursion via subgraph attribute (e.g., inside If node)
- `RejectsLongerCycle`: A→B→C→A cycle, verifies cycle path reports all participants
- `RejectsMultipleIndependentCycles`: two disjoint cycles in one model
- `AcceptsAcyclicDiamond`: diamond shape (A→B, A→C, B→D, C→D), no false positive
- `AcceptsTrivialSingleNodeFunction`: single-Identity-node function passes validation

Unit tests (call graph validation directly):
- `CallGraphAcyclic_EmptyGraph`: empty graph
- `CallGraphAcyclic_SingleNodeNoCalls`: single function, no callees
- `CallGraphAcyclic_SelfCycle`: self-loop
- `CallGraphAcyclic_MutualCycle`: A↔B
- `CallGraphAcyclic_LongerCycle`: A→B→C→A
- `CallGraphAcyclic_DiamondNoCycle`: diamond, no false positive
- `CallGraphAcyclic_DeepChainNoCycle`: long acyclic chain
- `CallGraphAcyclic_MultipleIndependentCycles`: two independent cycles
- `CallGraphAcyclic_SharedCallsDiamondNoCycle`: shared callees, no false positive

### Motivation

A malicious or malformed ONNX model with recursive local function definitions would cause the runtime to recurse until stack overflow during function inlining. This check fails model loading early with a clear error message.

### Testing

- Incremental build succeeds
- All new integration and unit tests pass
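The iterative DFS with cycle-path reporting described above might look roughly like the following sketch. `CallGraph` and `FindCycle` are hypothetical names, not the signatures in `model_helpers.h`. The sketch uses an explicit stack instead of recursion, looks nodes up with `find()` only, and treats callees that are not keys of the graph as non-local functions (an assumption made for this example).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using CallGraph = std::map<std::string, std::vector<std::string>>;

// Returns an empty string if the graph is acyclic, otherwise a path such as
// "first -> second -> first".
std::string FindCycle(const CallGraph& graph) {
  enum class Color { kWhite, kGray, kBlack };  // unvisited, on stack, finished
  std::map<std::string, Color> color;
  for (const auto& entry : graph) color.emplace(entry.first, Color::kWhite);

  for (const auto& entry : graph) {
    const std::string& root = entry.first;
    if (color.find(root)->second != Color::kWhite) continue;

    // Each frame holds a function name and the index of its next callee.
    std::vector<std::pair<std::string, std::size_t>> stack;
    stack.emplace_back(root, std::size_t{0});
    color.find(root)->second = Color::kGray;

    while (!stack.empty()) {
      auto& [name, next] = stack.back();
      const std::vector<std::string>& callees = graph.find(name)->second;
      if (next == callees.size()) {  // all callees handled: node is finished
        color.find(name)->second = Color::kBlack;
        stack.pop_back();
        continue;
      }
      const std::string& callee = callees[next++];
      auto it = color.find(callee);
      if (it == color.end()) continue;  // not a local function: ignore
      if (it->second == Color::kGray) {
        // Back edge: the cycle starts at the callee's frame on the stack.
        std::string path;
        bool in_cycle = false;
        for (const auto& frame : stack) {
          if (frame.first == callee) in_cycle = true;
          if (in_cycle) path += frame.first + " -> ";
        }
        return path + callee;
      }
      if (it->second == Color::kWhite) {
        it->second = Color::kGray;
        stack.emplace_back(callee, std::size_t{0});
      }
    }
  }
  return "";
}
```

A mutual A↔B recursion yields a path that names both participants and closes on the first one, while a diamond (shared callee) is accepted because the shared node is already finished when it is reached the second time.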

…Compute (microsoft#28223)

### Description

Replaces the long-standing `// TODO: fix this checker later` comment in `MaxpoolWithMask::Compute` with real input validation. Without these checks, a mismatched mask silently causes out-of-bounds memory access.

**Changes:**
- **`contrib_ops/cpu/maxpool_with_mask.h`**: Added three `ORT_RETURN_IF_NOT` guards:
  - Mask must have the same number of dimensions as the input tensor
  - Mask N and C dimensions must be nonzero when input is non-empty (prevents modulo-by-zero in `total_mask_channels`)
  - Each spatial dimension (dim ≥ 2) of the mask must match the corresponding input dimension
- **`test/contrib_ops/maxpool_mask_test.cc`**: Added three failure-case tests:
  - `MaxPoolWithMask_SpatialDimMismatch`: mask spatial dims differ from input
  - `MaxPoolWithMask_DimCountMismatch`: mask rank differs from input rank
  - `MaxPoolWithMask_MaskEmptyBatchDim`: mask N=0 with non-empty input triggers the nonzero N/C guard

### Motivation and Context

The mask tensor is indexed using the input's spatial step size (`x_step = height * width`, etc.), so a shape mismatch leads to silent out-of-bounds reads. Additionally, `total_mask_channels = m_shape[0] * m_shape[1]` is used as a modulo divisor in the per-channel offset formula; if either dimension is zero while the input is non-empty, this causes undefined behaviour (division by zero). The original code had a commented-out check with a `TODO` acknowledging this gap; this PR closes it.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@microsoft.com>
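The three guards listed above can be sketched as a standalone shape check. `ValidateMaskShape` is a hypothetical helper that returns an error string instead of using `ORT_RETURN_IF_NOT`, and the rank-at-least-2 precondition is an assumption added so that the sketch can index N and C safely.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns an empty string when the mask shape is acceptable for the input shape.
std::string ValidateMaskShape(const std::vector<int64_t>& x_shape,
                              const std::vector<int64_t>& m_shape) {
  // Assumption for this sketch: the input has at least N and C dimensions.
  if (x_shape.size() < 2) return "input must have at least N and C dimensions";

  // Guard 1: same number of dimensions.
  if (m_shape.size() != x_shape.size()) return "mask rank must match input rank";

  // Guard 2: mask N and C must be nonzero for a non-empty input, because their
  // product is later used as a modulo divisor.
  int64_t x_size = 1;
  for (int64_t d : x_shape) x_size *= d;
  if (x_size != 0 && (m_shape[0] == 0 || m_shape[1] == 0)) {
    return "mask N and C must be nonzero when the input is non-empty";
  }

  // Guard 3: every spatial dimension (dim >= 2) must match the input.
  for (std::size_t i = 2; i < x_shape.size(); ++i) {
    if (m_shape[i] != x_shape[i]) {
      return "mask spatial dimension " + std::to_string(i) + " must match the input";
    }
  }
  return "";
}
```

Note that only the spatial dimensions are compared one-to-one; the mask's N and C may differ from the input's, since the description says the mask channel is selected with a modulo.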

### Description

As titled.

### Motivation and Context

whisper-small in int4-kquant-mixed is close to int8.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…icrosoft#28430)

### Description

Add a unit test that verifies `RegisterExecutionProviderLibrary` / `UnregisterExecutionProviderLibrary` does not leak the library handle (regression test for microsoft#28396).

`ProviderLibrary::Load()` loads the EP library and probes for the `GetProvider` symbol. Most plugin EP libraries don't export it, so the probe fails. Before microsoft#28396, `Load()` returned the error without calling `Unload()`, leaking a refcount.

### Test approach

The test copies the EP library to a temporary directory with a unique filename, ensuring it has never been loaded in the process. After register + unregister, it checks that the library is fully unloaded (refcount == 0).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…not a file (microsoft#28431)

### Description

A file output path is needed when:
- Writing the output model to a file (not to a buffer or write function)
- Writing initializers to an external file (needs the model path to compute the external file location)

Otherwise, the file output path validation can be skipped.

### Motivation and Context

When compiling a model via the Compile API in a sandboxed environment, CreateEpContextModel() would attempt to validate/generate a file output path, even when the user explicitly set the output to a buffer via SetOutputModelBuffer(). This caused std::filesystem::exists() to throw an "Access is denied" exception on the dummy model path _MODEL_EDITOR_API_MODEL_, because the sandbox restricts filesystem access.

## Summary

- reject out-of-range `cache_indirection` beam indices in the CPU beam-attention path before they are converted into past KV offsets
- keep `DecoderMaskedMultiHeadAttention` beam-width handling consistent with the `cache_indirection` shape
- add CPU regression tests for `MultiHeadAttention` and `DecoderMaskedMultiHeadAttention`

## Motivation

`MultiHeadAttention` and `DecoderMaskedMultiHeadAttention` on the CPU provider could consume attacker-controlled `cache_indirection` values as beam indices without validating that each element stayed within `[0, beam_width)`. That let malformed models compute offsets past the past key/value buffers. This change rejects invalid indices up front and adds focused tests for the failure path.

## Key Changes

- add shared CPU validation in `AttentionCPUBase::ApplyAttentionWithBeams` so the beam path fails before any past-key or past-value reads occur
- report an `INVALID_ARGUMENT` error that identifies the offending beam index and its position
- validate that an explicit decoder `beam_width` input matches `cache_indirection` dimension 1 when both are present
- add contrib-op tests that exercise invalid cache indirection values on the CPU execution provider

## Testing

- `lintrunner -a`
- `cd build/Linux/Debug && make -j4 CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc.o CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/decoder_masked_multihead_attention.cc.o CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/multihead_attention_op_test.cc.o CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/decoder_masked_multihead_attention_op_test.cc.o`
- full `onnxruntime_provider_test` relink/run was not completed locally

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
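The up-front range check on `cache_indirection` can be sketched as below. `ValidateBeamIndices` is a hypothetical helper working on a flat vector; the real validation lives in `AttentionCPUBase::ApplyAttentionWithBeams` and reports an `INVALID_ARGUMENT` status rather than a string.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns an empty string if every element lies in [0, beam_width); otherwise
// returns a message naming the offending beam index and its position.
std::string ValidateBeamIndices(const std::vector<int32_t>& cache_indirection,
                                int32_t beam_width) {
  for (std::size_t i = 0; i < cache_indirection.size(); ++i) {
    const int32_t beam = cache_indirection[i];
    if (beam < 0 || beam >= beam_width) {
      return "invalid beam index " + std::to_string(beam) + " at position " +
             std::to_string(i) + "; expected a value in [0, " +
             std::to_string(beam_width) + ")";
    }
  }
  return "";
}
```

Running the check before any offset is computed means a malformed value is rejected without ever touching the past key/value buffers.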

…ors with >2^31 elements (microsoft#28386)

- [x] Fix `unary_elementwise_impl.cuh`: Change `CUDA_LONG` to `int64_t` for `N` parameter and loop index in `_UnaryElementWise` kernel, and fix `blocksPerGrid` calculation
- [x] Fix `cast_op.cu`: Change `CUDA_LONG` to `int64_t` for `N` parameter and loop index in `CastKernelStd`, `CastKernelSat`, and `CudaCastPairwiseKernel` kernels, and remove `static_cast<int>` truncation
- [x] Use `size_t` for `pair_count` in CudaCastPairwise to avoid double conversion (review feedback)
- [x] Rename test to `CastKernelCorrectness_ModerateSize` and add `CastKernel_Int64IndexArithmetic_NoOverflow` host-side test (review feedback)
- [x] Merge from main to resolve conflicts with Float8E8M0 tests

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
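A host-side sketch of the index arithmetic may help show why 64-bit counts matter here. It assumes that `CUDA_LONG` is a 32-bit signed type, which the 2^31 threshold in the title suggests but which is not stated in this description. `FitsInInt32` and `BlocksPerGrid` are hypothetical helpers, not the functions touched by the PR.

```cpp
#include <cstdint>
#include <limits>

// True when an element count can be represented by a 32-bit signed index.
bool FitsInInt32(int64_t n) {
  return n <= std::numeric_limits<int32_t>::max();
}

// Ceiling division done entirely in 64-bit arithmetic, so the element count is
// never narrowed before the block count is computed.
int64_t BlocksPerGrid(int64_t n, int64_t elements_per_block) {
  return (n + elements_per_block - 1) / elements_per_block;
}
```

For a tensor with 2^31 + 1 elements and 1024 elements per block, the count no longer fits a 32-bit signed index, yet the 64-bit ceiling division still gives 2097153 blocks.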

@kadu-v

)

### Description

Hardens the XNNPACK Gemm capability check against two SIGSEGV crashes during graph partitioning: one when the optional `C` input is omitted, one when `C` is a rank-0 (scalar) tensor. The check now guards the null `C` arg before calling `Shape()`, and rejects rank-0 `C` so the node falls back to the CPU EP cleanly.

Thanks @kadu-v, the minimal Python repros made the root cause easy to confirm. Both reproduced as a hard crash on the first `InferenceSession` construction.

### Motivation and Context

Fixes microsoft#28541
Fixes microsoft#28542

`Gemm::IsOnnxNodeSupported` dereferenced `C_arg->Shape()` without checking whether `C_arg` was non-null, so any Gemm without the optional bias segfaulted before the EP could decline the node.

A rank-0 `C` then survived the existing checks and reached XNNPACK's fully-connected path, which doesn't implement scalar broadcast (there's already a TODO in that file noting it). That's the second SIGSEGV.

### Changes

`onnxruntime/core/providers/xnnpack/math/gemm.cc`:
- Null-check `C_arg` before reading its shape. Absent `C` is valid per the Gemm spec; treat it as "no bias".
- Reject `C` with rank 0 from `IsOnnxNodeSupported` so the node falls through to CPU. Adding scalar broadcast support belongs with the TODO in the fully-connected path, not in the capability check.

### Testing

Three regression tests in `onnxruntime/test/providers/xnnpack/xnnpack_basic_test.cc`:
- `TestGemm_NoC_NoSegfault` builds a Gemm with the `C` input omitted.
- `TestGemm_ScalarC_NoSegfault` builds a Gemm with a rank-0 `C`.
- `TestGemm_EmptyC_NoSegfault` covers an empty-shape `C` edge case.

Each test loads an `InferenceSession` with the XNNPACK EP registered and asserts no crash.

I also suspect `Gemm`'s constructor has pre-existing crashes when `A` or `B` is 1-D, before the capability check even runs. Haven't reproduced it. Can file a follow-up if useful.

Signed-off-by: Dhruvil <dhruvilparikh79@gmail.com>
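The two guards can be sketched with simplified stand-in types. `FakeShape`, `FakeNodeArg`, and `IsBiasSupported` are hypothetical and do not mirror the real `NodeArg` API; treating an unknown shape as unsupported is an assumption of this sketch, not something the PR description states.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for a node argument and its (possibly unknown) shape.
struct FakeShape {
  std::vector<int64_t> dims;
};
struct FakeNodeArg {
  const FakeShape* shape = nullptr;  // nullptr means the shape is unknown
};

// Decides whether the optional bias input C is acceptable for the fast path.
bool IsBiasSupported(const FakeNodeArg* c_arg) {
  if (c_arg == nullptr) return true;         // absent C is valid: no bias
  if (c_arg->shape == nullptr) return false;  // assumption: unknown shape is declined
  if (c_arg->shape->dims.empty()) return false;  // rank 0: no scalar broadcast support
  return true;
}
```

The order matters: the pointer is tested before anything is read through it, which is exactly the step the crashing code skipped.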

## Description

Adds path traversal validation for sparse tensors with external data, closing a gap where `SparseTensorProtoToDenseTensorProto` would read external files without checking whether the path escapes the model directory.

### Bug fix (pre-existing)

- **`CopySparseData` indices size check**: The `raw_data().size()` check was wrong for external data (where `raw_data` is empty). Fixed by adding a pre-unpack `raw_data` size guard for inline data and a post-unpack `unpack_buffer` size check for all data sources.

### Tests

- **Security tests** (tensorutils_test.cc): Path traversal blocked (values, indices), absolute path blocked (values, indices), zero-element regression (zero dense elements, zero NNZ). All create escaping files and assert specifically for `"escapes"` error.
- **Positive tests** (sparse_kernels_test.cc): 7 end-to-end tests for legitimate sparse tensors with external data: external values, external indices (INT64/INT32/INT16/INT8), both external (rank-1 and rank-2 COO).

### Known limitation (deferred)

ORT_MEM_ADDR in-memory external data for sparse tensors can trigger arbitrary memory reads. This is a separate issue from path validation: `LoadSparseInitializerOrtFormat` legitimately uses in-memory markers for ORT-format models, so blanket rejection would break functionality. Should be addressed in a separate PR.

## Motivation and Context

A malicious ONNX model could use `../` path traversal in sparse tensor external data locations to read arbitrary files outside the model directory. Dense tensors already had this validation; sparse tensors did not.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
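A purely lexical version of such an "escapes the model directory" check can be sketched with `std::filesystem`. `EscapesModelDir` is a hypothetical helper and is not ORT's validation routine. It does not resolve symlinks, which a production check may also need to consider.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Returns true when an external-data location, interpreted relative to the
// model directory, would point outside of it.
bool EscapesModelDir(const fs::path& external_location) {
  // Rooted paths (absolute, or carrying a root name or root directory) are
  // never relative to the model directory.
  if (external_location.has_root_path()) return true;

  // Collapse "." and "dir/.." components without touching the filesystem.
  const fs::path normalized = external_location.lexically_normal();

  // After normalization, any remaining ".." can only be a leading component,
  // which means the path climbs above the model directory.
  auto first = normalized.begin();
  return first != normalized.end() && *first == "..";
}
```

Under this sketch, `weights/../data.bin` is accepted because it normalizes to `data.bin`, while `../secret.bin` and `a/../../b` are rejected.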

…icrosoft#28538)

### Description

`MlasActivationTest.ExecuteShort` (`test_activation.cpp`) feeds NaN inputs through `MlasActivation` and asserts the output matches the expected value bit-for-bit. This change adds one accepted case: when the expected value is a NaN, any NaN output passes. Non-NaN comparisons are unchanged: a finite output where a NaN is expected (or the reverse) still fails.

Test-only change, no library behavior impact.

Verified: `onnxruntime_mlas_test --gtest_filter=Activation.ShortExecute` on SpacemiT K3 (riscv64, RVV VLEN=256), rv-gcc 15.2. FAILED before, PASSED after (re-run x3). x86/x64 behavior unaffected.

### Motivation and Context

The bit-exact assertion (`Buffer[i].u == TestData[i][kind].u`) implicitly assumes the input NaN payload survives the activation. For kinds evaluated by floating-point arithmetic, namely LeakyRelu (`alpha * x`) and HardSigmoid (`alpha * x + beta`), that only holds on ISAs that propagate NaN payloads (x86, ARM).

IEEE-754 does not require NaN payload propagation. RISC-V's `F` extension mandates that any FP operation producing a NaN yields the canonical quiet NaN (`0x7fc00000` for f32), discarding the payload. So on riscv64 these kinds emit `0x7fc00000` for a NaN input, a correct "NaN in → NaN out" result whose bit pattern simply differs from the input, and the bit-exact check fails. Accepting any NaN where a NaN is expected restores the test to the portable IEEE-754 **contract.**

Signed-off-by: qiurui144 <happyqiurui@163.com>
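The relaxed comparison can be sketched as follows. `FloatFromBits` and `MatchesExpected` are hypothetical helpers written for this illustration; the actual test compares union members rather than calling functions like these.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Builds a float from a raw 32-bit pattern, so NaN payloads can be chosen.
float FloatFromBits(uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Bit-exact comparison, except that any NaN is accepted where a NaN is expected.
bool MatchesExpected(float actual, float expected) {
  if (std::isnan(expected)) return std::isnan(actual);
  uint32_t actual_bits;
  uint32_t expected_bits;
  std::memcpy(&actual_bits, &actual, sizeof(actual_bits));
  std::memcpy(&expected_bits, &expected, sizeof(expected_bits));
  return actual_bits == expected_bits;
}
```

A canonical quiet NaN (`0x7fc00000`) now matches an expected NaN with a different payload, while a finite value where a NaN is expected, or a NaN where a finite value is expected, still fails.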

lhrios and others added 14 commits May 18, 2026 12:51

Merge remote-tracking branch 'origin/master' into sync_msft_20052026

e41e0b7

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel May 19, 2026 20:33

ai-fw-intg wants to merge 14 commits into ovep-develop from sync_msft_20052026


12 participants
