
[pull] main from NVIDIA:main #604

Merged

pull[bot] merged 7 commits into phu0ngng:main from NVIDIA:main on May 12, 2026

Conversation


pull[bot] commented May 12, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

ptrendx and others added 7 commits May 12, 2026 11:38
* Disable the RHT fusion for non-SM100 family devices

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
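
For context, SM100 refers to Blackwell-generation devices with compute capability 10.x. A minimal sketch of what such an architecture gate can look like, assuming a hypothetical helper name (TE's actual gating code may differ):

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (name illustrative, not TE's API): allow the RHT
// fusion only on SM100-family devices, i.e. compute capability 10.x.
inline bool rht_fusion_supported(int device_id) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    return false;  // be conservative if the device query fails
  }
  return prop.major == 10;  // SM100 family reports major version 10
}
```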

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the compilation error

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… buffer (#2900)

* [PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators

Introduces transformer_engine/pytorch/csrc/extensions/allocate.cpp with a
general-purpose bulk_allocate function: given parallel lists of shapes,
dtypes, and per-tensor byte alignments, it computes a packed layout, does
a single CUDA allocation, and returns at::from_blob views whose deleters
keep the backing buffer alive.

The three internal bulk_allocate_*_tensors helpers in cast.cpp are
refactored to call bulk_allocate instead of each owning a copy of the
make_torch_view lambda and the offset-computation loops (~120 lines
removed). The new function is also exposed via pybind11 so Python can
allocate packed CUDA buffers directly without going through a quantizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
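
A minimal sketch of the packed-layout scheme described above, assuming simplified argument types; the real bulk_allocate in allocate.cpp may differ in signature and error handling:

```cpp
#include <ATen/ATen.h>
#include <c10/core/ScalarType.h>

#include <cstdint>
#include <memory>
#include <vector>

// Sketch only: pack N tensors into one CUDA buffer and return views
// whose deleters keep that buffer alive.
std::vector<at::Tensor> bulk_allocate_sketch(
    const std::vector<std::vector<int64_t>>& shapes,
    const std::vector<at::ScalarType>& dtypes,
    const std::vector<size_t>& alignments) {
  // Pass 1: compute each tensor's byte offset in the packed layout.
  std::vector<size_t> offsets(shapes.size());
  size_t total = 0;
  for (size_t i = 0; i < shapes.size(); ++i) {
    const size_t align = alignments[i];
    total = (total + align - 1) / align * align;  // round up to alignment
    offsets[i] = total;
    size_t numel = 1;
    for (int64_t d : shapes[i]) numel *= static_cast<size_t>(d);
    total += numel * c10::elementSize(dtypes[i]);
  }

  // One CUDA allocation backs every view.
  auto buffer = std::make_shared<at::Tensor>(at::empty(
      {static_cast<int64_t>(total)}, at::device(at::kCUDA).dtype(at::kByte)));

  // Pass 2: build at::from_blob views; each deleter captures the
  // shared_ptr, so the backing buffer outlives all of its views.
  std::vector<at::Tensor> views;
  views.reserve(shapes.size());
  for (size_t i = 0; i < shapes.size(); ++i) {
    void* ptr = static_cast<uint8_t*>(buffer->data_ptr()) + offsets[i];
    views.push_back(at::from_blob(
        ptr, shapes[i], [buffer](void*) { /* hold buffer alive */ },
        at::device(at::kCUDA).dtype(dtypes[i])));
  }
  return views;
}
```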

* Bulk-allocate wgrads in grouped linear impls

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply review suggestions

Make optional args for device and alignment. Handle case where base data_ptr is unaligned. Align grouped linear wgrad buffers to 256B.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
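
A sketch of the unaligned-base fix mentioned above, with an illustrative helper name: the offset must be computed from the actual address, not just the running byte count, since the base data_ptr itself may not sit on the requested boundary (256 B for the grouped-linear wgrad buffers):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative helper: next offset (relative to base) whose absolute
// address is aligned, even when base itself is unaligned.
inline size_t aligned_offset(const void* base, size_t offset, size_t align) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(base) + offset;
  const uintptr_t aligned = (addr + align - 1) / align * align;
  return static_cast<size_t>(aligned - reinterpret_cast<uintptr_t>(base));
}
```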

* Nits from Claude

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect call to `bulk_allocate`

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix ambiguous return type

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use c10::Device consistently

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rability when processing the output of Flash Attn forward pass (#2825)

fix(CP, FA): fix a vulnerability in the FA-version conditional logic used when processing the output of the Flash Attention forward pass

Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
* Use fast unfused cast mxfp8 kernels by default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed dead code

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use fast kernel for full 32-element chunks only

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
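
MXFP8 scales tensors in 32-element blocks, so the fast path can assume every block is full. A minimal sketch of that dispatch rule, with hypothetical names:

```cpp
#include <cstddef>

// Hypothetical dispatch predicate: the fast unfused MXFP8 cast kernel
// assumes each scaling block holds all 32 elements, so it only applies
// when the row length divides evenly; otherwise use the general kernel.
constexpr size_t kMXFP8BlockSize = 32;

inline bool use_fast_mxfp8_cast(size_t row_length) {
  return row_length % kMXFP8BlockSize == 0;
}
```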

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed grid size overflow

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
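
One common shape of such a fix, shown here as a generic sketch rather than TE's actual kernel: size the grid in 64-bit arithmetic, clamp it to the CUDA limit, and cover the remainder with a grid-stride loop:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Generic sketch: a grid-stride loop keeps the kernel correct even when
// the element count exceeds what a single launch's grid can index.
__global__ void scale_kernel(float* data, size_t n) {
  const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    data[i] *= 2.0f;  // placeholder work
  }
}

void launch_scale(float* data, size_t n) {
  constexpr int kThreads = 256;
  const size_t blocks64 = (n + kThreads - 1) / kThreads;  // 64-bit math
  const size_t kMaxGridX = 2147483647;  // CUDA gridDim.x limit (2^31 - 1)
  const unsigned int blocks =
      static_cast<unsigned int>(std::min(blocks64, kMaxGridX));
  scale_kernel<<<blocks, kThreads>>>(data, n);
}
```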

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
* Build Docs fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix doxygen warnings

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* class doc fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
… (#2979)

add wait per multi-proc test cleanup

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Avoid CPU offload wait_event for validation

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
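
The pattern here, sketched generically (not TE's actual offload code): a validation-only check can poll the event non-blockingly instead of enqueuing a stream-wide wait that would add a false dependency:

```cpp
#include <cuda_runtime.h>

// Generic illustration: probe completion without stalling any stream.
// cudaEventQuery returns cudaSuccess once the event's work has finished,
// whereas cudaStreamWaitEvent would make a whole stream wait on it.
inline bool copy_finished(cudaEvent_t copy_done) {
  return cudaEventQuery(copy_done) == cudaSuccess;
}
```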

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators May 12, 2026
pull[bot] added the ⤵️ pull label May 12, 2026
pull[bot] merged commit 472ae55 into phu0ngng:main May 12, 2026
4 of 6 checks passed
pull[bot] had a problem deploying to github-pages May 12, 2026 22:33 (Failure)