
[pull] main from NVIDIA:main #604

Merged

pull[bot] merged 7 commits into phu0ngng:main from NVIDIA:main on May 12, 2026

Conversation


pull[bot] commented May 12, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

ptrendx and others added 7 commits May 12, 2026 11:38
* Disable the RHT fusion for non-SM100 family devices

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
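
For context, SM100 refers to Blackwell-generation devices with compute capability 10.x. A minimal sketch of what such an architecture gate can look like, assuming a hypothetical helper name (TE's actual gating code may differ):

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (name illustrative, not TE's API): allow the RHT
// fusion only on SM100-family devices, i.e. compute capability 10.x.
inline bool rht_fusion_supported(int device_id) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    return false;  // be conservative if the device query fails
  }
  return prop.major == 10;  // SM100 family reports major version 10
}
```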

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the compilation error

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… buffer (#2900)

* [PyTorch] Add bulk_allocate utility and use it in quantized tensor allocators

Introduces transformer_engine/pytorch/csrc/extensions/allocate.cpp with a
general-purpose bulk_allocate function: given parallel lists of shapes,
dtypes, and per-tensor byte alignments, it computes a packed layout, does
a single CUDA allocation, and returns at::from_blob views whose deleters
keep the backing buffer alive.

The three internal bulk_allocate_*_tensors helpers in cast.cpp are
refactored to call bulk_allocate instead of each owning a copy of the
make_torch_view lambda and the offset-computation loops (~120 lines
removed). The new function is also exposed via pybind11 so Python can
allocate packed CUDA buffers directly without going through a quantizer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
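
A minimal sketch of the packed-layout scheme described above, assuming simplified argument types; the real bulk_allocate in allocate.cpp may differ in signature and error handling:

```cpp
#include <ATen/ATen.h>
#include <c10/core/ScalarType.h>

#include <cstdint>
#include <memory>
#include <vector>

// Sketch only: pack N tensors into one CUDA buffer and return views
// whose deleters keep that buffer alive.
std::vector<at::Tensor> bulk_allocate_sketch(
    const std::vector<std::vector<int64_t>>& shapes,
    const std::vector<at::ScalarType>& dtypes,
    const std::vector<size_t>& alignments) {
  // Pass 1: compute each tensor's byte offset in the packed layout.
  std::vector<size_t> offsets(shapes.size());
  size_t total = 0;
  for (size_t i = 0; i < shapes.size(); ++i) {
    const size_t align = alignments[i];
    total = (total + align - 1) / align * align;  // round up to alignment
    offsets[i] = total;
    size_t numel = 1;
    for (int64_t d : shapes[i]) numel *= static_cast<size_t>(d);
    total += numel * c10::elementSize(dtypes[i]);
  }

  // One CUDA allocation backs every view.
  auto buffer = std::make_shared<at::Tensor>(at::empty(
      {static_cast<int64_t>(total)}, at::device(at::kCUDA).dtype(at::kByte)));

  // Pass 2: build at::from_blob views; each deleter captures the
  // shared_ptr, so the backing buffer outlives all of its views.
  std::vector<at::Tensor> views;
  views.reserve(shapes.size());
  for (size_t i = 0; i < shapes.size(); ++i) {
    void* ptr = static_cast<uint8_t*>(buffer->data_ptr()) + offsets[i];
    views.push_back(at::from_blob(
        ptr, shapes[i], [buffer](void*) { /* hold buffer alive */ },
        at::device(at::kCUDA).dtype(dtypes[i])));
  }
  return views;
}
```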

* Bulk-allocate wgrads in grouped linear impls

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply review suggestions

Make optional args for device and alignment. Handle case where base data_ptr is unaligned. Align grouped linear wgrad buffers to 256B.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
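
A sketch of the unaligned-base fix mentioned above, with an illustrative helper name: the offset must be computed from the actual address, not just the running byte count, since the base data_ptr itself may not sit on the requested boundary (256 B for the grouped-linear wgrad buffers):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative helper: next offset (relative to base) whose absolute
// address is aligned, even when base itself is unaligned.
inline size_t aligned_offset(const void* base, size_t offset, size_t align) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(base) + offset;
  const uintptr_t aligned = (addr + align - 1) / align * align;
  return static_cast<size_t>(aligned - reinterpret_cast<uintptr_t>(base));
}
```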

* Nits from Claude

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect call to `bulk_allocate`

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix ambiguous return type

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use c10::Device consistently

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rability when processing the output of Flash Attn forward pass (#2825)

fix(CP, FA): fix a vulnerability in the FA-version conditional logic used when processing the output of the Flash Attention forward pass

Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
* Use fast unfused cast mxfp8 kernels by default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed dead code

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use fast kernel for full 32-element chunks only

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
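
MXFP8 scales tensors in 32-element blocks, so the fast path can assume every block is full. A minimal sketch of that dispatch rule, with hypothetical names:

```cpp
#include <cstddef>

// Hypothetical dispatch predicate: the fast unfused MXFP8 cast kernel
// assumes each scaling block holds all 32 elements, so it only applies
// when the row length divides evenly; otherwise use the general kernel.
constexpr size_t kMXFP8BlockSize = 32;

inline bool use_fast_mxfp8_cast(size_t row_length) {
  return row_length % kMXFP8BlockSize == 0;
}
```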

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed grid size overflow

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
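
One common shape of such a fix, shown here as a generic sketch rather than TE's actual kernel: size the grid in 64-bit arithmetic, clamp it to the CUDA limit, and cover the remainder with a grid-stride loop:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Generic sketch: a grid-stride loop keeps the kernel correct even when
// the element count exceeds what a single launch's grid can index.
__global__ void scale_kernel(float* data, size_t n) {
  const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    data[i] *= 2.0f;  // placeholder work
  }
}

void launch_scale(float* data, size_t n) {
  constexpr int kThreads = 256;
  const size_t blocks64 = (n + kThreads - 1) / kThreads;  // 64-bit math
  const size_t kMaxGridX = 2147483647;  // CUDA gridDim.x limit (2^31 - 1)
  const unsigned int blocks =
      static_cast<unsigned int>(std::min(blocks64, kMaxGridX));
  scale_kernel<<<blocks, kThreads>>>(data, n);
}
```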

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
* Build Docs fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix doxygen warnings

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* class doc fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
… (#2979)

add wait per multi-proc test cleanup

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Avoid CPU offload wait_event for validation

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
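
The pattern here, sketched generically (not TE's actual offload code): a validation-only check can poll the event non-blockingly instead of enqueuing a stream-wide wait that would add a false dependency:

```cpp
#include <cuda_runtime.h>

// Generic illustration: probe completion without stalling any stream.
// cudaEventQuery returns cudaSuccess once the event's work has finished,
// whereas cudaStreamWaitEvent would make a whole stream wait on it.
inline bool copy_finished(cudaEvent_t copy_done) {
  return cudaEventQuery(copy_done) == cudaSuccess;
}
```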

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators May 12, 2026
pull[bot] added the ⤵️ pull label May 12, 2026
pull[bot] merged commit 472ae55 into phu0ngng:main May 12, 2026
4 of 6 checks passed
pull[bot] had a problem deploying to github-pages May 12, 2026 22:33 (Failure)