
Sync with Microsoft ONNX Runtime - 19052026#1095

Open
ai-fw-intg wants to merge 5 commits into ovep-develop from sync_msft_19052026


@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

lhrios and others added 5 commits May 18, 2026 12:51
`indices` is built once and then only read during recursive calls to
`CheckIfSubtreesAreEqual`. However, it was passed by value, causing a
full copy on every recursive call. It is now passed by `const&`.
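
The cost difference can be illustrated with a minimal sketch. It does not reproduce the real `CheckIfSubtreesAreEqual` signature: `CountingIndices`, `VisitByValue`, `VisitByConstRef`, and `g_copies` are hypothetical names introduced only to count copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical global counter: number of times the table was copied.
std::size_t g_copies = 0;

// Hypothetical stand-in for the read-only lookup table.
struct CountingIndices {
  std::vector<int> data;
  explicit CountingIndices(std::size_t n) : data(n, 0) {}
  CountingIndices(const CountingIndices& other) : data(other.data) { ++g_copies; }
};

// Pass by value: every call, including each recursive one, copies the table.
std::size_t VisitByValue(std::size_t depth, CountingIndices indices) {
  if (depth == 0) return indices.data.size();
  return VisitByValue(depth - 1, indices);
}

// Pass by const reference: the same table is shared by all calls.
std::size_t VisitByConstRef(std::size_t depth, const CountingIndices& indices) {
  if (depth == 0) return indices.data.size();
  return VisitByConstRef(depth - 1, indices);
}
```

In this sketch a recursion of depth `d` makes `d + 1` copies when passing by value and none when passing by `const&`, so with a large table the copy cost can dominate the traversal.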

## Data from the profiler:
The following data was collected with a model containing a single
TreeEnsembleClassifier node (5000 trees and 3.3 million nodes).
The loading time dropped from 18 minutes to about 4 seconds.

### After
<img width="1793" height="547" alt="Screenshot 2026-03-25 at 6 40 25 PM"
src="https://github.com/user-attachments/assets/d7c00335-8246-4bd1-9e4d-b0e956d48cdd"
/>


### Before
<img width="1763" height="548" alt="Screenshot 2026-03-25 at 6 40 40 PM"
src="https://github.com/user-attachments/assets/35683112-2919-4031-955c-922937f2df8f"
/>
…ft#28520)

## Summary

- Remove the `Subgroups` feature requirement from
`CanApplyFlashAttention`, enabling flash attention on devices without
subgroup support
- Generalize the Apple-specific shared-memory prefill path into a
`use_shm_path` flag that activates for Apple, NVIDIA, or any device
lacking subgroups
- Replace `is_apple` shader parameter with `use_shm_path` throughout the
WGSL template
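
As a rough sketch of the selection rule described in the summary (not the actual ORT code), the flag could be derived as follows. `DeviceInfo` and `UseShmPath` are hypothetical names; only the condition itself (Apple, NVIDIA, or no subgroup support) comes from the description above.

```cpp
#include <cassert>

// Hypothetical device description; the real code queries the WebGPU context.
struct DeviceInfo {
  bool is_apple;
  bool is_nvidia;
  bool has_subgroups;
};

// Assumed rule: take the shared-memory path for Apple, NVIDIA,
// or any device that lacks subgroup support.
bool UseShmPath(const DeviceInfo& device) {
  return device.is_apple || device.is_nvidia || !device.has_subgroups;
}
```

Under this rule, a device from another vendor that does support subgroups would stay on the subgroup-based path.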

## Motivation

Two issues exist on the current main branch:

1. **NVIDIA prefill produces incorrect results (regression from
microsoft#28511):** PR microsoft#28511 increased `max_k_step` to 32 for NVIDIA in C++, but
the shader's subgroup-based path only has `qk_1..qk_4` (16 hardcoded key
indices). When `sg_size=32` (e.g. RTX 5080), the loop steps by 32 but
only computes QK for keys 0-15, silently skipping keys 16-31. This
produces incorrect attention output for models like phi4.

2. **Flash attention prefill unavailable without Subgroups:**
`CanApplyFlashAttention` gates on
`context.HasFeature(wgpu::FeatureName::Subgroups)`, forcing devices
without subgroup support to fall back to the slower split-reduce
2-kernel path for prefill, even though the Apple shared-memory path in
the shader is fully subgroup-free.

This PR fixes both issues by routing Apple, NVIDIA, and no-subgroup
devices through the loop-based shared-memory path (`use_shm_path`),
which naturally handles any `max_k_step` value via `array<q_element_t,
max_k_step>` and loop iteration — no hardcoded key count.
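
The mismatch in issue 1 can be modeled on the CPU with a small sketch. This is not the WGSL shader: `KeysCoveredHardcoded` and `KeysCoveredLoop` are hypothetical helpers that only count how many keys receive a QK score when the outer loop advances by `max_k_step`.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the subgroup path has 16 fixed key slots (qk_1..qk_4).
constexpr std::size_t kHardcodedKeys = 16;

// Models the fixed-slot path: at most 16 keys are scored per step,
// even if the step itself is larger.
std::size_t KeysCoveredHardcoded(std::size_t total_keys, std::size_t max_k_step) {
  std::size_t covered = 0;
  for (std::size_t k_start = 0; k_start < total_keys; k_start += max_k_step) {
    for (std::size_t i = 0;
         i < kHardcodedKeys && i < max_k_step && k_start + i < total_keys; ++i) {
      ++covered;
    }
  }
  return covered;
}

// Models the loop-based path: every key inside the step is scored.
std::size_t KeysCoveredLoop(std::size_t total_keys, std::size_t max_k_step) {
  std::size_t covered = 0;
  for (std::size_t k_start = 0; k_start < total_keys; k_start += max_k_step) {
    for (std::size_t i = 0; i < max_k_step && k_start + i < total_keys; ++i) {
      ++covered;
    }
  }
  return covered;
}
```

With 64 keys and a step of 32, the fixed-slot model scores only half of the keys, while the loop-based model scores all of them; with a step of 16 the two agree.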

## Test plan

- [x] Built ORT with WebGPU EP on Windows (Release, VS 2022)
- [x] Deployed and ran phi4-graph-prune model: output verified correct
("1+1 equals 2.")
- [x] Lint check passed (`lintrunner -a`)
### Description

- Add copyright headers to source files
- Enrich Python and NuGet package metadata
- Add ORT license files to packages
- Clean up readme files

### Motivation and Context

WebGPU plugin EP packaging improvements.

Note: Similar updates can be considered for the CUDA plugin EP, but this
PR is scoped to just the WebGPU EP for ease of cherry-picking into the
WebGPU plugin EP release branch.
…oft#28187)

### Description

Detect and reject recursive cycles in model local function definitions
during model loading, preventing stack overflow from unbounded recursion
during function inlining.

### Changes

**Call-graph construction and cycle detection** (`model_helpers.cc`,
`model_helpers.h`)
- `BuildLocalFunctionCallGraph()` builds an adjacency-list call graph
from model local functions using iterative subgraph traversal (no
recursion, safe against deeply nested subgraph attributes).
- `ValidateCallGraphAcyclic()` performs iterative DFS cycle detection.
Uses `find()` throughout (no `operator[]`) to prevent accidental map
insertions.
- `ValidateModelLocalFunctionAcyclic()` convenience wrapper.
- On cycle detection, returns a descriptive error showing the full cycle
path (e.g., `"local:first -> local:second -> local:first"`).
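
The cycle check described above can be sketched as an iterative DFS over an adjacency list. This is a minimal illustration under assumptions, not the code in `model_helpers.cc`: `CallGraph`, `VisitState`, and `FindCycle` are hypothetical names, and the real functions return a status rather than an optional string.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical adjacency list: function name -> names of functions it calls.
using CallGraph = std::unordered_map<std::string, std::vector<std::string>>;

// A node absent from the state map has not been visited yet.
enum class VisitState { kInProgress, kDone };

// Returns a path such as "a -> b -> a" if a cycle exists, std::nullopt otherwise.
// Uses find() only, so no lookup can insert into a map by accident.
std::optional<std::string> FindCycle(const CallGraph& graph) {
  std::unordered_map<std::string, VisitState> state;
  for (const auto& entry : graph) {
    if (state.find(entry.first) != state.end()) continue;
    // Explicit stack of (function, index of the next callee to examine).
    std::vector<std::pair<std::string, std::size_t>> stack;
    stack.emplace_back(entry.first, 0);
    state.emplace(entry.first, VisitState::kInProgress);
    while (!stack.empty()) {
      auto& frame = stack.back();
      auto it = graph.find(frame.first);
      if (it == graph.end() || frame.second >= it->second.size()) {
        state.find(frame.first)->second = VisitState::kDone;  // all callees handled
        stack.pop_back();
        continue;
      }
      const std::string& callee = it->second[frame.second++];
      auto seen = state.find(callee);
      if (seen == state.end()) {
        state.emplace(callee, VisitState::kInProgress);
        stack.emplace_back(callee, 0);  // frame is not used after this point
      } else if (seen->second == VisitState::kInProgress) {
        // Back edge: the cycle is the stack suffix starting at callee.
        std::string path;
        bool in_cycle = false;
        for (const auto& f : stack) {
          if (f.first == callee) in_cycle = true;
          if (in_cycle) path += f.first + " -> ";
        }
        return path + callee;
      }
    }
  }
  return std::nullopt;
}
```

The explicit stack keeps the check itself safe from deep recursion, and the in-progress marker distinguishes a true back edge from a shared callee, which is why a diamond-shaped graph is not reported as a cycle.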

**Integration** (`model.cc`)
- Applied in both `Model` constructors that process local functions.

**Test coverage** (`function_test.cc`)

Integration tests (full model load):
- `RejectsSelfRecursiveLocalFunction` — function calls itself
- `RejectsMutuallyRecursiveLocalFunctions` — A→B→A cycle
- `RejectsRecursionThroughSubgraph` — recursion via subgraph attribute
(e.g., inside If node)
- `RejectsLongerCycle` — A→B→C→A cycle, verifies cycle path reports all
participants
- `RejectsMultipleIndependentCycles` — two disjoint cycles in one model
- `AcceptsAcyclicDiamond` — diamond shape (A→B, A→C, B→D, C→D), no false
positive
- `AcceptsTrivialSingleNodeFunction` — single-Identity-node function
passes validation

Unit tests (call graph validation directly):
- `CallGraphAcyclic_EmptyGraph` — empty graph
- `CallGraphAcyclic_SingleNodeNoCalls` — single function, no callees
- `CallGraphAcyclic_SelfCycle` — self-loop
- `CallGraphAcyclic_MutualCycle` — A↔B
- `CallGraphAcyclic_LongerCycle` — A→B→C→A
- `CallGraphAcyclic_DiamondNoCycle` — diamond, no false positive
- `CallGraphAcyclic_DeepChainNoCycle` — long acyclic chain
- `CallGraphAcyclic_MultipleIndependentCycles` — two independent cycles
- `CallGraphAcyclic_SharedCallsDiamondNoCycle` — shared callees, no
false positive

### Motivation

A malicious or malformed ONNX model with recursive local function
definitions would cause the runtime to recurse until stack overflow
during function inlining. This check fails model loading early with a
clear error message.

### Testing

- Incremental build succeeds
- All new integration and unit tests pass

6 participants