Sync with Microsoft ONNX Runtime - 20052026 #1098
Open
ai-fw-intg wants to merge 14 commits into
`indices` is built once and then only read during recursive calls to `CheckIfSubtreesAreEqual`. However, it was passed by value, causing a full copy on every recursive call. Changed to `const&`.

## Data from the profiler

To collect the following data, a model with a single TreeEnsembleClassifier node (5000 trees and 3.3 million nodes) was used. The loading time dropped from 18 minutes to about 4 seconds.

### After

<img width="1793" height="547" alt="Screenshot 2026-03-25 at 6 40 25 PM" src="https://github.com/user-attachments/assets/d7c00335-8246-4bd1-9e4d-b0e956d48cdd" />

### Before

<img width="1763" height="548" alt="Screenshot 2026-03-25 at 6 40 40 PM" src="https://github.com/user-attachments/assets/35683112-2919-4031-955c-922937f2df8f" />
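The cost difference described above can be illustrated with a minimal sketch. `CountingIndices`, `g_copies`, `WalkByValue`, and `WalkByConstRef` are hypothetical names invented for this illustration; they are not the types or functions used in the actual change.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical global counter: number of times the container was copied.
std::size_t g_copies = 0;

// Hypothetical stand-in for the index container built once before recursion.
struct CountingIndices {
  std::vector<int> data;
  explicit CountingIndices(std::size_t n) : data(n) {}
  CountingIndices(const CountingIndices& other) : data(other.data) { ++g_copies; }
};

// By value: the initial call and every recursive call copy the whole container.
std::size_t WalkByValue(CountingIndices indices, std::size_t depth) {
  return depth == 0 ? indices.data.size() : WalkByValue(indices, depth - 1);
}

// By const reference: all recursion levels share one container, no copies.
std::size_t WalkByConstRef(const CountingIndices& indices, std::size_t depth) {
  return depth == 0 ? indices.data.size() : WalkByConstRef(indices, depth - 1);
}
```

With a recursion depth proportional to tree size and a container holding millions of entries, the by-value variant does work proportional to depth times container size, which is consistent with the kind of slowdown reported above.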
…ft#28520)

## Summary

- Remove the `Subgroups` feature requirement from `CanApplyFlashAttention`, enabling flash attention on devices without subgroup support
- Generalize the Apple-specific shared-memory prefill path into a `use_shm_path` flag that activates for Apple, NVIDIA, or any device lacking subgroups
- Replace `is_apple` shader parameter with `use_shm_path` throughout the WGSL template

## Motivation

Two issues exist on the current main branch:

1. **NVIDIA prefill produces incorrect results (regression from microsoft#28511):** PR microsoft#28511 increased `max_k_step` to 32 for NVIDIA in C++, but the shader's subgroup-based path only has `qk_1..qk_4` (16 hardcoded key indices). When `sg_size=32` (e.g. RTX 5080), the loop steps by 32 but only computes QK for keys 0-15, silently skipping keys 16-31. This produces incorrect attention output for models like phi4.
2. **Flash attention prefill unavailable without Subgroups:** `CanApplyFlashAttention` gates on `context.HasFeature(wgpu::FeatureName::Subgroups)`, forcing devices without subgroup support to fall back to the slower split-reduce 2-kernel path for prefill, even though the Apple shared-memory path in the shader is fully subgroup-free.

This PR fixes both issues by routing Apple, NVIDIA, and no-subgroup devices through the loop-based shared-memory path (`use_shm_path`), which naturally handles any `max_k_step` value via `array<q_element_t, max_k_step>` and loop iteration, with no hardcoded key count.

## Test plan

- [x] Built ORT with WebGPU EP on Windows (Release, VS 2022)
- [x] Deployed and ran phi4-graph-prune model: output verified correct ("1+1 equals 2.")
- [x] Lint check passed (`lintrunner -a`)
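A rough host-side model of the two points above, written in C++ rather than WGSL. `DeviceInfo`, `UseShmPath`, and `KeysCovered` are hypothetical names for illustration only; the real selection logic lives in the WebGPU EP and its shader template and may differ in detail.

```cpp
#include <algorithm>

// Hypothetical device description; field names are assumptions.
struct DeviceInfo {
  bool is_apple;
  bool is_nvidia;
  bool has_subgroups;
};

// Routing rule as described in the summary: Apple, NVIDIA, or any device
// without subgroup support takes the loop-based shared-memory path.
bool UseShmPath(const DeviceInfo& d) {
  return d.is_apple || d.is_nvidia || !d.has_subgroups;
}

// Models the regression: the outer loop advances by `step` keys, but each
// iteration only scores `handled_per_step` of them.
int KeysCovered(int total_keys, int step, int handled_per_step) {
  int covered = 0;
  for (int k = 0; k < total_keys; k += step) {
    covered += std::min(handled_per_step, total_keys - k);
  }
  return covered;
}
```

In this model, `KeysCovered(64, 32, 16)` scores only half of the keys, while a path whose per-iteration work equals the step size, as a loop over `max_k_step` does, covers all of them.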
### Description

- Add copyright headers to source files
- Enrich Python and NuGet package metadata
- Add ORT license files to packages
- Clean up readme files

### Motivation and Context

WebGPU plugin EP packaging improvements.

Note: Similar updates can be considered for the CUDA plugin EP, but this PR is scoped to just the WebGPU EP for ease of cherry-picking into the WebGPU plugin EP release branch.
…oft#28187)

### Description

Detect and reject recursive cycles in model local function definitions during model loading, preventing stack overflow from unbounded recursion during function inlining.

### Changes

**Call-graph construction and cycle detection** (`model_helpers.cc`, `model_helpers.h`)

- `BuildLocalFunctionCallGraph()` builds an adjacency-list call graph from model local functions using iterative subgraph traversal (no recursion, safe against deeply nested subgraph attributes).
- `ValidateCallGraphAcyclic()` performs iterative DFS cycle detection. Uses `find()` throughout (no `operator[]`) to prevent accidental map insertions.
- `ValidateModelLocalFunctionAcyclic()` convenience wrapper.
- On cycle detection, returns a descriptive error showing the full cycle path (e.g., `"local:first -> local:second -> local:first"`).

**Integration** (`model.cc`)

- Applied in both `Model` constructors that process local functions.

**Test coverage** (`function_test.cc`)

Integration tests (full model load):

- `RejectsSelfRecursiveLocalFunction`: function calls itself
- `RejectsMutuallyRecursiveLocalFunctions`: A→B→A cycle
- `RejectsRecursionThroughSubgraph`: recursion via subgraph attribute (e.g., inside If node)
- `RejectsLongerCycle`: A→B→C→A cycle, verifies cycle path reports all participants
- `RejectsMultipleIndependentCycles`: two disjoint cycles in one model
- `AcceptsAcyclicDiamond`: diamond shape (A→B, A→C, B→D, C→D), no false positive
- `AcceptsTrivialSingleNodeFunction`: single-Identity-node function passes validation

Unit tests (call graph validation directly):

- `CallGraphAcyclic_EmptyGraph`: empty graph
- `CallGraphAcyclic_SingleNodeNoCalls`: single function, no callees
- `CallGraphAcyclic_SelfCycle`: self-loop
- `CallGraphAcyclic_MutualCycle`: A↔B
- `CallGraphAcyclic_LongerCycle`: A→B→C→A
- `CallGraphAcyclic_DiamondNoCycle`: diamond, no false positive
- `CallGraphAcyclic_DeepChainNoCycle`: long acyclic chain
- `CallGraphAcyclic_MultipleIndependentCycles`: two independent cycles
- `CallGraphAcyclic_SharedCallsDiamondNoCycle`: shared callees, no false positive

### Motivation

A malicious or malformed ONNX model with recursive local function definitions would cause the runtime to recurse until stack overflow during function inlining. This check fails model loading early with a clear error message.

### Testing

- Incremental build succeeds
- All new integration and unit tests pass
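The iterative DFS approach described above might look roughly like the following. This is a sketch under assumptions, not the code from `model_helpers.cc`: `CallGraph` and `FindCycle` are hypothetical names, and the error is returned as a plain string rather than a status object.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical adjacency list: function name -> names of functions it calls.
using CallGraph = std::unordered_map<std::string, std::vector<std::string>>;

// Returns "" if the graph is acyclic, otherwise a path such as "a -> b -> a".
std::string FindCycle(const CallGraph& graph) {
  enum class Color { kWhite, kGray, kBlack };  // unvisited, on stack, finished
  std::unordered_map<std::string, Color> color;
  for (const auto& entry : graph) color.emplace(entry.first, Color::kWhite);

  for (const auto& root : graph) {
    if (color.find(root.first)->second != Color::kWhite) continue;
    // Explicit stack of (function, index of the next callee to visit).
    std::vector<std::pair<std::string, std::size_t>> stack;
    stack.emplace_back(root.first, 0);
    color.find(root.first)->second = Color::kGray;

    while (!stack.empty()) {
      const std::string node = stack.back().first;
      const std::size_t next = stack.back().second;
      const auto callees = graph.find(node);  // find(), never operator[]
      if (callees == graph.end() || next >= callees->second.size()) {
        color.find(node)->second = Color::kBlack;
        stack.pop_back();
        continue;
      }
      ++stack.back().second;
      const std::string& callee = callees->second[next];
      const auto callee_color = color.find(callee);
      if (callee_color == color.end()) continue;  // not a local function
      if (callee_color->second == Color::kGray) {
        // Back edge: report the stack from the callee's frame onward.
        std::string path;
        bool in_cycle = false;
        for (const auto& frame : stack) {
          if (frame.first == callee) in_cycle = true;
          if (in_cycle) path += frame.first + " -> ";
        }
        return path + callee;
      }
      if (callee_color->second == Color::kWhite) {
        callee_color->second = Color::kGray;
        stack.emplace_back(callee, 0);
      }
    }
  }
  return "";
}
```

Because the traversal uses an explicit stack, its depth is bounded by heap memory rather than the call stack, which matters when the input model is untrusted.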
…Compute (microsoft#28223)

### Description

Replaces the long-standing `// TODO: fix this checker later` comment in `MaxpoolWithMask::Compute` with real input validation. Without these checks, a mismatched mask silently causes out-of-bounds memory access.

**Changes:**

- **`contrib_ops/cpu/maxpool_with_mask.h`**: Added three `ORT_RETURN_IF_NOT` guards:
  - Mask must have the same number of dimensions as the input tensor
  - Mask N and C dimensions must be nonzero when input is non-empty (prevents modulo-by-zero in `total_mask_channels`)
  - Each spatial dimension (dim ≥ 2) of the mask must match the corresponding input dimension
- **`test/contrib_ops/maxpool_mask_test.cc`**: Added three failure-case tests:
  - `MaxPoolWithMask_SpatialDimMismatch`: mask spatial dims differ from input
  - `MaxPoolWithMask_DimCountMismatch`: mask rank differs from input rank
  - `MaxPoolWithMask_MaskEmptyBatchDim`: mask N=0 with non-empty input triggers the nonzero N/C guard

### Motivation and Context

The mask tensor is indexed using the input's spatial step size (`x_step = height * width`, etc.), so a shape mismatch leads to silent out-of-bounds reads. Additionally, `total_mask_channels = m_shape[0] * m_shape[1]` is used as a modulo divisor in the per-channel offset formula; if either dimension is zero while the input is non-empty, this causes undefined behaviour (division by zero). The original code had a commented-out check with a `TODO` acknowledging this gap; this PR closes it.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@microsoft.com>
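The three guards listed above can be sketched as a standalone shape check. `ValidateMaskShape` is a hypothetical helper operating on plain dimension vectors; the real change uses `ORT_RETURN_IF_NOT` on tensor shapes inside the kernel. The rank-at-least-2 check is an extra assumption added here so the sketch never indexes out of range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns "" when the mask shape is acceptable, otherwise a short reason.
std::string ValidateMaskShape(const std::vector<int64_t>& x_dims,
                              const std::vector<int64_t>& m_dims) {
  // Guard 1: same number of dimensions.
  if (m_dims.size() != x_dims.size()) return "mask rank must match input rank";
  // Assumption for this sketch only: N and C dimensions must exist.
  if (x_dims.size() < 2) return "input must have N and C dimensions";

  // Guard 2: N and C of the mask are nonzero when the input is non-empty,
  // since their product is later used as a modulo divisor.
  int64_t x_size = 1;
  for (int64_t d : x_dims) x_size *= d;
  if (x_size != 0 && (m_dims[0] == 0 || m_dims[1] == 0)) {
    return "mask N and C must be nonzero for non-empty input";
  }

  // Guard 3: every spatial dimension (index >= 2) matches the input.
  for (std::size_t i = 2; i < x_dims.size(); ++i) {
    if (m_dims[i] != x_dims[i]) return "mask spatial dims must match input";
  }
  return "";
}
```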
### Description

As titled.

### Motivation and Context

whisper-small in int4-kquant-mixed is close to int8.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…icrosoft#28430)

### Description

Add a unit test that verifies `RegisterExecutionProviderLibrary` / `UnregisterExecutionProviderLibrary` does not leak the library handle (regression test for microsoft#28396).

`ProviderLibrary::Load()` loads the EP library and probes for the `GetProvider` symbol. Most plugin EP libraries don't export it, so the probe fails. Before microsoft#28396, `Load()` returned the error without calling `Unload()`, leaking a refcount.

### Test approach

The test copies the EP library to a temporary directory with a unique filename, ensuring it has never been loaded in the process. After register + unregister, it checks that the library is fully unloaded (refcount == 0).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…not a file (microsoft#28431)

### Description

A file output path is needed when:

- Writing the output model to a file (not to a buffer or write function)
- Writing initializers to an external file (needs the model path to compute the external file location)

Otherwise, the file output path validation can be skipped.

### Motivation and Context

When compiling a model via the Compile API in a sandboxed environment, `CreateEpContextModel()` would attempt to validate/generate a file output path, even when the user explicitly set the output to a buffer via `SetOutputModelBuffer()`. This caused `std::filesystem::exists()` to throw an "Access is denied" exception on the dummy model path `_MODEL_EDITOR_API_MODEL_`, because the sandbox restricts filesystem access.
## Summary

- reject out-of-range `cache_indirection` beam indices in the CPU beam-attention path before they are converted into past KV offsets
- keep `DecoderMaskedMultiHeadAttention` beam-width handling consistent with the `cache_indirection` shape
- add CPU regression tests for `MultiHeadAttention` and `DecoderMaskedMultiHeadAttention`

## Motivation

`MultiHeadAttention` and `DecoderMaskedMultiHeadAttention` on the CPU provider could consume attacker-controlled `cache_indirection` values as beam indices without validating that each element stayed within `[0, beam_width)`. That let malformed models compute offsets past the past key/value buffers. This change rejects invalid indices up front and adds focused tests for the failure path.

## Key Changes

- add shared CPU validation in `AttentionCPUBase::ApplyAttentionWithBeams` so the beam path fails before any past-key or past-value reads occur
- report an `INVALID_ARGUMENT` error that identifies the offending beam index and its position
- validate that an explicit decoder `beam_width` input matches `cache_indirection` dimension 1 when both are present
- add contrib-op tests that exercise invalid cache indirection values on the CPU execution provider

## Testing

- `lintrunner -a`
- `cd build/Linux/Debug && make -j4 CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/multihead_attention.cc.o CMakeFiles/onnxruntime_providers.dir/home/tlwu/onnxruntime/onnxruntime/contrib_ops/cpu/bert/decoder_masked_multihead_attention.cc.o CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/multihead_attention_op_test.cc.o CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/onnxruntime/onnxruntime/test/contrib_ops/decoder_masked_multihead_attention_op_test.cc.o`
- full `onnxruntime_provider_test` relink/run was not completed locally

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
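The up-front range check described above amounts to scanning the indirection buffer before any offset is derived from it. A minimal sketch, assuming a flat `int32_t` buffer; `ValidateCacheIndirection` and the message wording are hypothetical and do not reproduce the actual `INVALID_ARGUMENT` status text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns "" if every entry lies in [0, beam_width), otherwise a message
// naming the first offending beam index and its position.
std::string ValidateCacheIndirection(const std::vector<int32_t>& cache_indirection,
                                     int32_t beam_width) {
  for (std::size_t i = 0; i < cache_indirection.size(); ++i) {
    const int32_t beam = cache_indirection[i];
    if (beam < 0 || beam >= beam_width) {
      return "beam index " + std::to_string(beam) + " at position " +
             std::to_string(i) + " is outside [0, " +
             std::to_string(beam_width) + ")";
    }
  }
  return "";
}
```

Running the check once before the attention loop keeps the hot path free of per-element branches while still guaranteeing that no past key/value read uses an unvalidated index.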
…ors with >2^31 elements (microsoft#28386)

- [x] Fix `unary_elementwise_impl.cuh`: Change `CUDA_LONG` to `int64_t` for `N` parameter and loop index in `_UnaryElementWise` kernel, and fix `blocksPerGrid` calculation
- [x] Fix `cast_op.cu`: Change `CUDA_LONG` to `int64_t` for `N` parameter and loop index in `CastKernelStd`, `CastKernelSat`, and `CudaCastPairwiseKernel` kernels, and remove `static_cast<int>` truncation
- [x] Use `size_t` for `pair_count` in CudaCastPairwise to avoid double conversion (review feedback)
- [x] Rename test to `CastKernelCorrectness_ModerateSize` and add `CastKernel_Int64IndexArithmetic_NoOverflow` host-side test (review feedback)
- [x] Merge from main to resolve conflicts with Float8E8M0 tests

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
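The index arithmetic behind this fix can be checked on the host without a GPU. The sketch below assumes 256 threads per block and 4 elements per thread; those constants and the helper names are illustrative assumptions, not values taken from the kernels.

```cpp
#include <cstdint>
#include <limits>

// Assumed launch geometry for illustration only.
constexpr int64_t kThreadsPerBlock = 256;
constexpr int64_t kElementsPerThread = 4;

// Ceiling division in 64-bit, so the block count stays correct past 2^31.
int64_t BlocksPerGrid(int64_t n) {
  const int64_t per_block = kThreadsPerBlock * kElementsPerThread;
  return (n + per_block - 1) / per_block;
}

// First element handled by a block, computed entirely in 64-bit.
int64_t FirstElementOfBlock(int64_t block) {
  return block * kThreadsPerBlock * kElementsPerThread;
}

// True if the count could be stored in a 32-bit signed integer without loss.
bool FitsInInt32(int64_t n) {
  return n >= std::numeric_limits<int32_t>::min() &&
         n <= std::numeric_limits<int32_t>::max();
}
```

For an element count of 2^31 + 5, the last block starts at element 2^31, which no 32-bit signed index can represent; that is the situation the `int64_t` change addresses.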
)

### Description

Hardens the XNNPACK Gemm capability check against two SIGSEGV crashes during graph partitioning: one when the optional `C` input is omitted, one when `C` is a rank-0 (scalar) tensor. The check now guards the null `C` arg before calling `Shape()`, and rejects rank-0 `C` so the node falls back to the CPU EP cleanly.

Thanks @kadu-v, the minimal Python repros made the root cause easy to confirm. Both reproduced as a hard crash on the first `InferenceSession` construction.

### Motivation and Context

Fixes microsoft#28541
Fixes microsoft#28542

`Gemm::IsOnnxNodeSupported` dereferenced `C_arg->Shape()` without checking whether `C_arg` was non-null, so any Gemm without the optional bias segfaulted before the EP could decline the node. A rank-0 `C` then survived the existing checks and reached XNNPACK's fully-connected path, which doesn't implement scalar broadcast (there's already a TODO in that file noting it). That's the second SIGSEGV.

### Changes

`onnxruntime/core/providers/xnnpack/math/gemm.cc`:

- Null-check `C_arg` before reading its shape. Absent `C` is valid per the Gemm spec; treat it as "no bias".
- Reject `C` with rank 0 from `IsOnnxNodeSupported` so the node falls through to CPU. Adding scalar broadcast support belongs with the TODO in the fully-connected path, not in the capability check.

### Testing

Three regression tests in `onnxruntime/test/providers/xnnpack/xnnpack_basic_test.cc`:

- `TestGemm_NoC_NoSegfault` builds a Gemm with the `C` input omitted.
- `TestGemm_ScalarC_NoSegfault` builds a Gemm with a rank-0 `C`.
- `TestGemm_EmptyC_NoSegfault` covers an empty-shape `C` edge case.

Each test loads an `InferenceSession` with the XNNPACK EP registered and asserts no crash.

I also suspect `Gemm`'s constructor has pre-existing crashes when `A` or `B` is 1-D, before the capability check even runs. Haven't reproduced it. Can file a follow-up if useful.

Signed-off-by: Dhruvil <dhruvilparikh79@gmail.com>
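The guard order described above can be sketched with stand-in types. `FakeShape`, `FakeNodeArg`, and `IsBiasInputSupported` are hypothetical and do not mirror the ONNX Runtime `NodeArg` API; treating an unknown shape as unsupported is also an assumption of this sketch, not something stated in the change.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the graph types used by a capability check.
struct FakeShape {
  std::vector<int64_t> dims;
};
struct FakeNodeArg {
  const FakeShape* shape = nullptr;  // null when the shape is unknown
};

// Returns true when the optional bias input `C` is something the EP can take.
bool IsBiasInputSupported(const FakeNodeArg* c_arg) {
  if (c_arg == nullptr) return true;          // C omitted: valid, no bias
  if (c_arg->shape == nullptr) return false;  // assumption: decline unknown shape
  if (c_arg->shape->dims.empty()) return false;  // rank-0 C: fall back to CPU
  return true;
}
```

Checking for null before touching the shape is what removes the first crash; declining rank 0 keeps the node away from the code path that lacks scalar broadcast.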
## Description

Adds path traversal validation for sparse tensors with external data, closing a gap where `SparseTensorProtoToDenseTensorProto` would read external files without checking whether the path escapes the model directory.

### Bug fix (pre-existing)

- **`CopySparseData` indices size check**: The `raw_data().size()` check was wrong for external data (where `raw_data` is empty). Fixed by adding a pre-unpack `raw_data` size guard for inline data and a post-unpack `unpack_buffer` size check for all data sources.

### Tests

- **Security tests** (tensorutils_test.cc): Path traversal blocked (values, indices), absolute path blocked (values, indices), zero-element regression (zero dense elements, zero NNZ). All create escaping files and assert specifically for `"escapes"` error.
- **Positive tests** (sparse_kernels_test.cc): 7 end-to-end tests for legitimate sparse tensors with external data: external values, external indices (INT64/INT32/INT16/INT8), both external (rank-1 and rank-2 COO).

### Known limitation (deferred)

ORT_MEM_ADDR in-memory external data for sparse tensors can trigger arbitrary memory reads. This is a separate issue from path validation: `LoadSparseInitializerOrtFormat` legitimately uses in-memory markers for ORT-format models, so blanket rejection would break functionality. Should be addressed in a separate PR.

## Motivation and Context

A malicious ONNX model could use `../` path traversal in sparse tensor external data locations to read arbitrary files outside the model directory. Dense tensors already had this validation; sparse tensors did not.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
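One way to express the kind of check described above is a purely lexical test with `std::filesystem` (C++17). `EscapesModelDir` is a hypothetical helper, not the validation routine ONNX Runtime uses; it does not resolve symlinks, and the tests assume POSIX-style paths.

```cpp
#include <filesystem>
#include <string>

// Returns true if `location` (an external-data path taken from the model)
// would resolve outside `model_dir`. Lexical only: symlinks are not followed.
bool EscapesModelDir(const std::filesystem::path& model_dir,
                     const std::filesystem::path& location) {
  if (location.is_absolute()) return true;  // absolute paths are rejected
  const std::filesystem::path base = model_dir.lexically_normal();
  const std::filesystem::path target = (base / location).lexically_normal();
  const std::filesystem::path rel = target.lexically_relative(base);
  if (rel.empty()) return true;  // no relative form could be computed
  // After normalization, an escaping path starts with a ".." component.
  return rel.begin()->string() == "..";
}
```

Normalizing before comparing matters: a location such as `a/../../x` contains no leading `..`, yet still resolves to the parent of the model directory.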
…icrosoft#28538)

### Description

`MlasActivationTest.ExecuteShort` (`test_activation.cpp`) feeds NaN inputs through `MlasActivation` and asserts the output matches the expected value bit-for-bit. This change adds one accepted case: when the expected value is a NaN, any NaN output passes. Non-NaN comparisons are unchanged: a finite output where a NaN is expected (or the reverse) still fails. Test-only change, no library behavior impact.

Verified: `onnxruntime_mlas_test --gtest_filter=Activation.ShortExecute` on SpacemiT K3 (riscv64, RVV VLEN=256), rv-gcc 15.2. FAILED before, PASSED after (re-run x3). x86/x64 behavior unaffected.

### Motivation and Context

The bit-exact assertion (`Buffer[i].u == TestData[i][kind].u`) implicitly assumes the input NaN payload survives the activation. For kinds evaluated by floating-point arithmetic, such as LeakyRelu (`alpha * x`) and HardSigmoid (`alpha * x + beta`), that only holds on ISAs that propagate NaN payloads (x86, ARM).

IEEE-754 does not require NaN payload propagation. RISC-V's `F` extension mandates that any FP operation producing a NaN yields the canonical quiet NaN (`0x7fc00000` for f32), discarding the payload. So on riscv64 these kinds emit `0x7fc00000` for a NaN input, a correct "NaN in → NaN out" result whose bit pattern simply differs from the input, and the bit-exact check fails.

Accepting any NaN where a NaN is expected restores the test to the portable IEEE-754 **contract.**

Signed-off-by: qiurui144 <happyqiurui@163.com>
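The relaxed comparison described above can be sketched as follows. `BitsOf`, `FloatFromBits`, and `OutputMatches` are hypothetical helper names; the actual test compares union members rather than calling helpers like these.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern without violating aliasing rules.
uint32_t BitsOf(float value) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof bits);
  return bits;
}

float FloatFromBits(uint32_t bits) {
  float value;
  std::memcpy(&value, &bits, sizeof value);
  return value;
}

// Bit-exact comparison, except that any NaN is accepted where a NaN is expected.
bool OutputMatches(float actual, float expected) {
  if (std::isnan(expected)) return std::isnan(actual);
  return BitsOf(actual) == BitsOf(expected);
}
```

The non-NaN branch stays bit-exact, so distinctions such as `0.0f` versus `-0.0f` are still caught, and a finite output where a NaN was expected still fails.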
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.