Sync with Microsoft ONNX Runtime - 19052026 by ai-fw-intg · Pull Request #1095 · intel/onnxruntime

ai-fw-intg · 2026-05-18T20:32:59Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

`indices` is built once and then only read during recursive calls to `CheckIfSubtreesAreEqual`. However it was passed by value, causing a full copy on every recursive call. Changed to `const&`.

## Data from the profiler:

To collect the following data, a model with a single TreeEnsembleClassifier node (5000 trees and 3.3 million nodes) has been used. The loading time dropped from 18 minutes to about 4 seconds.

### After

<img width="1793" height="547" alt="Screenshot 2026-03-25 at 6 40 25 PM" src="https://github.com/user-attachments/assets/d7c00335-8246-4bd1-9e4d-b0e956d48cdd" />

### Before

<img width="1763" height="548" alt="Screenshot 2026-03-25 at 6 40 40 PM" src="https://github.com/user-attachments/assets/35683112-2919-4031-955c-922937f2df8f" />
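The cost of passing a read-only structure by value through a recursion can be illustrated with a small sketch. This is not the ONNX Runtime code: `CountingIndex`, `CopyCount`, `WalkByValue`, and `WalkByConstRef` are hypothetical names invented here, and the recursion is a plain countdown rather than a tree comparison. It only shows that a by-value parameter is copied once per call, while a `const&` parameter is never copied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical global copy counter (not part of ONNX Runtime).
inline std::size_t& CopyCount() {
  static std::size_t count = 0;
  return count;
}

// Hypothetical stand-in for a large lookup structure that records its copies.
struct CountingIndex {
  std::vector<int> data;
  explicit CountingIndex(std::size_t n) : data(n, 0) {}
  CountingIndex(const CountingIndex& other) : data(other.data) { ++CopyCount(); }
  CountingIndex& operator=(const CountingIndex&) = delete;
};

// "Before" shape: the parameter is taken by value, so every call,
// including each recursive one, copies the whole structure.
inline std::size_t WalkByValue(CountingIndex indices, int depth) {
  assert(depth >= 0);
  if (depth == 0) return indices.data.size();
  return WalkByValue(indices, depth - 1);
}

// "After" shape: a const reference gives read-only access without copying.
inline std::size_t WalkByConstRef(const CountingIndex& indices, int depth) {
  assert(depth >= 0);
  if (depth == 0) return indices.data.size();
  return WalkByConstRef(indices, depth - 1);
}
```

With a recursion depth of 3, the by-value variant performs four copies (one per call), and the reference variant performs none. With millions of nodes and a large `indices`, that per-call copy is the kind of overhead the profiler data above points to.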

…ft#28520)

## Summary

- Remove the `Subgroups` feature requirement from `CanApplyFlashAttention`, enabling flash attention on devices without subgroup support
- Generalize the Apple-specific shared-memory prefill path into a `use_shm_path` flag that activates for Apple, NVIDIA, or any device lacking subgroups
- Replace `is_apple` shader parameter with `use_shm_path` throughout the WGSL template

## Motivation

Two issues exist on the current main branch:

1. **NVIDIA prefill produces incorrect results (regression from microsoft#28511):** PR microsoft#28511 increased `max_k_step` to 32 for NVIDIA in C++, but the shader's subgroup-based path only has `qk_1..qk_4` (16 hardcoded key indices). When `sg_size=32` (e.g. RTX 5080), the loop steps by 32 but only computes QK for keys 0-15, silently skipping keys 16-31. This produces incorrect attention output for models like phi4.
2. **Flash attention prefill unavailable without Subgroups:** `CanApplyFlashAttention` gates on `context.HasFeature(wgpu::FeatureName::Subgroups)`, forcing devices without subgroup support to fall back to the slower split-reduce 2-kernel path for prefill, even though the Apple shared-memory path in the shader is fully subgroup-free.

This PR fixes both issues by routing Apple, NVIDIA, and no-subgroup devices through the loop-based shared-memory path (`use_shm_path`), which naturally handles any `max_k_step` value via `array<q_element_t, max_k_step>` and loop iteration, with no hardcoded key count.

## Test plan

- [x] Built ORT with WebGPU EP on Windows (Release, VS 2022)
- [x] Deployed and ran phi4-graph-prune model: output verified correct ("1+1 equals 2.")
- [x] Lint check passed (`lintrunner -a`)
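The two points in the motivation can be modeled on the host side with a small sketch. It is not the WGSL shader or the ORT C++ code: `DeviceInfo`, `UseShmPath`, and `CountComputedKeys` are hypothetical names, and the only figures taken from the description are the step of 32 and the 16 hardcoded key indices. The sketch assumes the path selection is a simple boolean OR and counts how many keys receive a QK value when the per-step key count is fixed versus tied to the step.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical device description (not an ORT or WebGPU type).
struct DeviceInfo {
  bool is_apple;
  bool is_nvidia;
  bool has_subgroups;
};

// Assumed selection rule: Apple, NVIDIA, or any device lacking subgroups
// takes the loop-based shared-memory path.
inline bool UseShmPath(const DeviceInfo& d) {
  return d.is_apple || d.is_nvidia || !d.has_subgroups;
}

// Counts the keys that get a QK value when the outer loop advances by `step`
// but each iteration handles at most `keys_per_step` keys.
inline int CountComputedKeys(int total_keys, int step, int keys_per_step) {
  assert(step > 0 && keys_per_step >= 0 && total_keys >= 0);
  int computed = 0;
  for (int base = 0; base < total_keys; base += step) {
    // Keys actually processed in this iteration.
    computed += std::min({keys_per_step, step, total_keys - base});
  }
  return computed;
}
```

For 64 keys, a step of 32 with 16 hardcoded indices covers only 32 keys, whereas a loop that handles `step` keys per iteration covers all 64. That is the gap the loop-based path closes.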

### Description

- Add copyright headers to source files
- Enrich Python and NuGet package metadata
- Add ORT license files to packages
- Clean up readme files

### Motivation and Context

WebGPU plugin EP packaging improvements.

Note: Similar updates can be considered for the CUDA plugin EP, but this PR is scoped to just the WebGPU EP for ease of cherry-picking into the WebGPU plugin EP release branch.

…oft#28187)

### Description

Detect and reject recursive cycles in model local function definitions during model loading, preventing stack overflow from unbounded recursion during function inlining.

### Changes

**Call-graph construction and cycle detection** (`model_helpers.cc`, `model_helpers.h`)

- `BuildLocalFunctionCallGraph()` builds an adjacency-list call graph from model local functions using iterative subgraph traversal (no recursion, safe against deeply nested subgraph attributes).
- `ValidateCallGraphAcyclic()` performs iterative DFS cycle detection. Uses `find()` throughout (no `operator[]`) to prevent accidental map insertions.
- `ValidateModelLocalFunctionAcyclic()` convenience wrapper.
- On cycle detection, returns a descriptive error showing the full cycle path (e.g., `"local:first -> local:second -> local:first"`).

**Integration** (`model.cc`)

- Applied in both `Model` constructors that process local functions.

**Test coverage** (`function_test.cc`)

Integration tests (full model load):

- `RejectsSelfRecursiveLocalFunction`: function calls itself
- `RejectsMutuallyRecursiveLocalFunctions`: A→B→A cycle
- `RejectsRecursionThroughSubgraph`: recursion via subgraph attribute (e.g., inside If node)
- `RejectsLongerCycle`: A→B→C→A cycle, verifies cycle path reports all participants
- `RejectsMultipleIndependentCycles`: two disjoint cycles in one model
- `AcceptsAcyclicDiamond`: diamond shape (A→B, A→C, B→D, C→D), no false positive
- `AcceptsTrivialSingleNodeFunction`: single-Identity-node function passes validation

Unit tests (call graph validation directly):

- `CallGraphAcyclic_EmptyGraph`: empty graph
- `CallGraphAcyclic_SingleNodeNoCalls`: single function, no callees
- `CallGraphAcyclic_SelfCycle`: self-loop
- `CallGraphAcyclic_MutualCycle`: A↔B
- `CallGraphAcyclic_LongerCycle`: A→B→C→A
- `CallGraphAcyclic_DiamondNoCycle`: diamond, no false positive
- `CallGraphAcyclic_DeepChainNoCycle`: long acyclic chain
- `CallGraphAcyclic_MultipleIndependentCycles`: two independent cycles
- `CallGraphAcyclic_SharedCallsDiamondNoCycle`: shared callees, no false positive

### Motivation

A malicious or malformed ONNX model with recursive local function definitions would cause the runtime to recurse until stack overflow during function inlining. This check fails model loading early with a clear error message.

### Testing

- Incremental build succeeds
- All new integration and unit tests pass
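The iterative DFS approach described above can be sketched as follows. This is a minimal illustration, not the code in `model_helpers.cc`: `CallGraph`, `VisitState`, and `FindCycle` are hypothetical names, the graph is a plain `std::map` (chosen here so traversal order is deterministic), and the function returns the cycle path as a string instead of a status object. Like the described implementation, it uses an explicit stack rather than recursion and looks nodes up with `find()` so that no map entry is created by accident.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adjacency list: function name -> names of functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

enum class VisitState { kInProgress, kDone };

// Returns an empty string when the graph is acyclic, otherwise a path
// such as "a -> b -> a" describing one cycle.
inline std::string FindCycle(const CallGraph& graph) {
  std::map<std::string, VisitState> state;
  for (const auto& entry : graph) {
    if (state.find(entry.first) != state.end()) continue;  // already explored
    // Explicit DFS stack of (function name, index of next callee to visit).
    std::vector<std::pair<std::string, std::size_t>> stack;
    stack.emplace_back(entry.first, 0);
    state.emplace(entry.first, VisitState::kInProgress);
    while (!stack.empty()) {
      const std::string node = stack.back().first;  // copy: stack may grow
      const std::size_t next = stack.back().second;
      const auto it = graph.find(node);  // find(), never operator[]
      if (it == graph.end() || next >= it->second.size()) {
        // All callees handled (or the callee is not a local function).
        auto st = state.find(node);
        assert(st != state.end());
        st->second = VisitState::kDone;
        stack.pop_back();
        continue;
      }
      ++stack.back().second;
      const std::string& callee = it->second[next];
      const auto seen = state.find(callee);
      if (seen == state.end()) {
        state.emplace(callee, VisitState::kInProgress);
        stack.emplace_back(callee, 0);
      } else if (seen->second == VisitState::kInProgress) {
        // Back edge: the callee is still on the stack, so report the cycle.
        std::string path;
        bool in_cycle = false;
        for (const auto& frame : stack) {
          if (frame.first == callee) in_cycle = true;
          if (in_cycle) path += frame.first + " -> ";
        }
        return path + callee;
      }
    }
  }
  return "";
}
```

A mutually recursive pair such as `{"a": ["b"], "b": ["a"]}` yields `"a -> b -> a"`, while a diamond (`a` calls `b` and `c`, both call `d`) yields an empty string because `d` is already marked done when it is reached the second time.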

lhrios and others added 5 commits May 18, 2026 12:51

Merge remote-tracking branch 'origin/master' into sync_msft_19052026

a983f4b

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel May 18, 2026 20:33


ai-fw-intg wants to merge 5 commits into ovep-develop from sync_msft_19052026


6 participants
