Optimize FP8 gemm on XPU #2

Open
xiangyuT wants to merge 3 commits into dev from feature/omni-fp8-xpu-dev-pr

Conversation

@xiangyuT
Collaborator

No description provided.

xiangyuT and others added 3 commits March 19, 2026 16:36
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Split the M dimension into chunks when oneDNN fails to create FP8 primitives
for large M values (e.g. WAN 2.2 14B FFN layers with M=32760). In benchmarks,
chunk_m=512 yields a 4-8% speedup over dequant+bf16 for FFN shapes.
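
The M-splitting idea can be sketched in plain Python. This is only an illustration of the row-chunking arithmetic, not the code in this PR: `chunk_bounds`, `chunked_matmul`, and `naive_matmul` are hypothetical names, the nested-list matmul stands in for the real FP8 kernel, and the detection of a failed oneDNN primitive creation is not modeled.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def chunk_bounds(m: int, chunk_m: int = 512) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges that cover [0, m)."""
    if chunk_m <= 0:
        raise ValueError("chunk_m must be positive")
    return [(start, min(start + chunk_m, m)) for start in range(0, m, chunk_m)]


def naive_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference (M x K) @ (K x N) product; a stand-in for the real kernel."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def chunked_matmul(
    a: Matrix,
    b: Matrix,
    chunk_m: int = 512,
    matmul: Callable[[Matrix, Matrix], Matrix] = naive_matmul,
) -> Matrix:
    """Multiply a @ b by running `matmul` on row slices of `a`.

    Each output row depends only on the matching row of `a`, so the
    per-chunk results can simply be concatenated along M.
    """
    out: Matrix = []
    for start, end in chunk_bounds(len(a), chunk_m):
        out.extend(matmul(a[start:end], b))
    return out
```

With M=32760 and chunk_m=512, `chunk_bounds` produces 64 ranges, the last of which covers the remaining 504 rows.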

Add COMFY_XPU_FP8_OMNI_LOG env var with 3 levels: 0=off, 1=misses only
(default), 2=verbose. Previously all logging was gated by a single bool.
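
A minimal sketch of how a three-level setting like this could be read and applied. Only the variable name, the three levels, and the default of 1 come from the commit message; the function names, the fallback on unparsable values, and the clamping of out-of-range values are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

LOG_OFF, LOG_MISSES, LOG_VERBOSE = 0, 1, 2


def read_log_level(environ: Optional[Mapping[str, str]] = None) -> int:
    """Parse COMFY_XPU_FP8_OMNI_LOG; default to 1 (misses only)."""
    env = os.environ if environ is None else environ
    raw = env.get("COMFY_XPU_FP8_OMNI_LOG", str(LOG_MISSES))
    try:
        level = int(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default level.
        return LOG_MISSES
    # Assumption: out-of-range values are clamped into 0..2.
    return max(LOG_OFF, min(LOG_VERBOSE, level))


def should_log(level: int, is_miss: bool) -> bool:
    """Level 2 logs everything, level 1 logs misses only, level 0 logs nothing."""
    if level >= LOG_VERBOSE:
        return True
    return level >= LOG_MISSES and is_miss
```

Passing a mapping to `read_log_level` instead of relying on `os.environ` keeps the parsing easy to exercise in isolation.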