
Add option: ulysses qkv_fusion #887

Merged
helloyongyang merged 11 commits into ModelTC:main from xiehaohx:xh/1
Feb 5, 2026

Conversation

@xiehaohx (Contributor) commented Feb 5, 2026

Add option: ulysses qkv_fusion

@gemini-code-assist (Contributor)

Summary of Changes

Hello @xiehaohx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Ulysses attention mechanism by adding a new use_tensor_fusion option. This option provides greater control over how Query, Key, and Value tensors are handled internally, allowing for flexible processing of these tensors either as a combined unit or individually. The changes primarily impact the tensor reshaping and communication logic within the attention computation, adapting it to support both fused and unfused QKV operations.
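
As a rough illustration, the choice boils down to the following (a minimal sketch assuming Q, K, V tensors of shape (seq_len, heads, hidden_dims); the helper name prepare_img_qkv is a placeholder, not the actual lightx2v code):

import torch

def prepare_img_qkv(img_q, img_k, img_v, use_tensor_fusion=False):
    if use_tensor_fusion:
        # Fused path: stack Q, K, V into one tensor so later permutes and
        # all-to-all exchanges can run once on the combined tensor.
        img_qkv = torch.stack([img_q, img_k, img_v], dim=1)  # (seq_len, 3, heads, hidden_dims)
        return (img_qkv,)
    # Unfused path: keep the three tensors separate and process them individually.
    return img_q, img_k, img_v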

Highlights

  • New Configuration Option: Introduced a new boolean parameter, use_tensor_fusion, to the apply function in ulysses_attn.py. This option controls whether Query, Key, and Value (QKV) tensors are processed in a fused manner.
  • Conditional QKV Handling: Implemented conditional logic throughout the attention mechanism to handle QKV tensors differently based on the use_tensor_fusion flag. This affects how QKV tensors are stacked, reshaped, and prepared for communication.
  • Parallel Communication Adaptation: Modified the communication pathways, including both FP8 and standard communication, to correctly process either a single fused QKV tensor or individual Q, K, and V tensors, especially when head parallelism is enabled.


Changelog
  • lightx2v/common/ops/attn/ulysses_attn.py
    • Added use_tensor_fusion parameter with a default value of False to the apply function signature.
    • Introduced an internal use_qkv_fusion variable, mirroring the use_tensor_fusion parameter, to drive conditional logic.
    • Refactored tensor stacking and reshaping operations for image QKV to conditionally create a fused img_qkv or separate img_q, img_k, img_v tensors.
    • Updated head-parallel processing logic to handle either fused img_qkv or individual img_q, img_k, img_v tensors during permutation and allocation of output tensors.
    • Modified FP8 and non-FP8 communication blocks to conditionally perform all_to_all_single operations on either the fused img_qkv tensor or the separate img_q, img_k, img_v tensors.
    • Adjusted the waiting mechanism for asynchronous communication operations to correctly await either a single fused QKV work item or multiple individual Q, K, V work items (both paths are sketched after this list).
    • Updated the final reshaping and assignment of shard_img_q, shard_img_k, shard_img_v to correctly extract values from either the fused output_qkv or the individual output_q, output_k, output_v tensors.
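
A rough sketch of the two communication paths listed above (torch.distributed.all_to_all_single is the real collective; the helper exchange_img_qkv, the tensor names, and the group argument are illustrative, not the lightx2v implementation):

import torch
import torch.distributed as dist

def exchange_img_qkv(tensors, use_qkv_fusion, group=None):
    if use_qkv_fusion:
        # Fused path: one asynchronous all-to-all on the stacked QKV tensor,
        # followed by a single wait.
        (img_qkv,) = tensors
        output_qkv = torch.empty_like(img_qkv)
        work = dist.all_to_all_single(output_qkv, img_qkv, group=group, async_op=True)
        work.wait()
        return (output_qkv,)
    # Unfused path: three asynchronous all-to-alls (img_q, img_k, img_v),
    # each awaited individually.
    outputs, works = [], []
    for t in tensors:
        out = torch.empty_like(t)
        works.append(dist.all_to_all_single(out, t, group=group, async_op=True))
        outputs.append(out)
    for work in works:
        work.wait()
    return tuple(outputs)
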
Activity
  • No human activity (comments, reviews, or approvals) has been recorded on this pull request yet.

@gemini-code-assist bot left a comment

Code Review

The pull request successfully introduces the use_tensor_fusion option to enable QKV fusion within the Ulysses attention mechanism, a feature that may offer performance benefits. However, the implementation introduces significant code duplication across the conditional blocks, particularly where use_qkv_fusion and use_fp8_comm interact, which increases complexity and reduces maintainability.

Comment on lines +110 to +119
if use_qkv_fusion:
    img_qkv = img_qkv.permute(3, 2, 1, 0, 4).contiguous()  # (shard_heads, world_size, img_qkv_len, 3, hidden_dims)
    output_qkv = torch.empty_like(img_qkv)
else:
    img_q = img_q.permute(2, 1, 0, 3).contiguous()  # (shard_heads, world_size, img_qkv_len, hidden_dims)
    img_k = img_k.permute(2, 1, 0, 3).contiguous()
    img_v = img_v.permute(2, 1, 0, 3).contiguous()
    output_q = torch.empty_like(img_q)
    output_k = torch.empty_like(img_k)
    output_v = torch.empty_like(img_v)

Severity: medium

This section, and several others throughout the apply method (e.g., lines 123-166, 168-181, and the subsequent for loop), introduces significant code duplication due to the nested if use_qkv_fusion: and else: blocks. This pattern makes the code harder to read, understand, and maintain.

Consider refactoring to reduce this duplication. For instance, you could prepare the tensors (e.g., img_qkv or individual img_q, img_k, img_v) and their corresponding output placeholders (output_qkv or output_q, output_k, output_v) in a unified manner before entering the communication and processing loops. This would allow the subsequent logic to operate on a consistent structure, regardless of whether QKV fusion is enabled, thereby reducing the need for repeated if/else checks.

For example, you could define lists of tensors to communicate and lists of output tensors, and then iterate over these lists in the communication and waiting phases.
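
For illustration, one way to realize that suggestion (a sketch of the reviewer's idea, not code from the PR; comm_group is a placeholder for the process group used by the attention layer):

# Build uniform lists once, regardless of the fusion mode.
tensors = [img_qkv] if use_qkv_fusion else [img_q, img_k, img_v]
outputs = [torch.empty_like(t) for t in tensors]

# A single communication/wait loop then covers both modes.
works = [
    torch.distributed.all_to_all_single(out, t, group=comm_group, async_op=True)
    for out, t in zip(outputs, tensors)
]
for work in works:
    work.wait()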

@helloyongyang helloyongyang merged commit 86bf5d4 into ModelTC:main Feb 5, 2026
1 check passed
