
TE-FL Upgrade: Synchronization with TE Release V2.14 #62

Open
lxd-cumt wants to merge 324 commits into flagos-ai:main from lxd-cumt:merge-to-main-20260509

Conversation

lxd-cumt (Collaborator) commented May 9, 2026

Merge upstream release_v2.14 into main

Integrates NVIDIA TransformerEngine upstream release v2.14 (304 commits, v2.9.0 → v2.14.0) into the TransformerEngine-FL fork via a tree-replacement strategy, preserving the custom plugin system while incorporating upstream enhancements.

Upstream Enhancements

Quantization & Precision

  • MXFP8 grouped GEMM with persistent quantization kernels and tensor-scaled FP8 support
  • NVFP4 grouped quantization with Hadamard transform for MoE workloads
  • QuantizedTensor support in FusedAdam optimizer for MXFP8/Float8 block scaling

Architecture Support

  • Blackwell (sm120) fused attention support with cuDNN ≥9.18.1
  • Deterministic training on Blackwell with cuDNN ≥9.18.1
  • Grouped GEMM cuBLAS bindings with bias support and tensor swizzling

Distributed Training

  • FSDP2 support with DTensor-aware optimizer states and allgather optimizations
  • Collective GEMM with FP8/MXFP8 for JAX
  • GroupedTensor torch ops for DDP and distributed optimizer

Operators

  • Fused RMSNorm dLN with add-through via cuDNN
  • MoE grouped MLP ops with split dBias and router kernel JAX bindings
  • Configurable philox rounds for stochastic rounding

FlagOS Features

Plugin System Preservation

  • Synced plugin OP API signatures with upstream csrc changes (fused_attn_fwd/bwd parameters, attention backend dispatch)
  • Patched newly introduced upstream CUDA hardcoding to use te_device_type() for multi-backend compatibility (a minimal sketch follows this list)
  • Fixed stale references to renamed upstream symbols (e.g., CPUOffloadEnabled → is_cpu_offload_enabled())
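
A minimal sketch of the device-abstraction patch pattern, with the hook stubbed so the snippet runs standalone (the real te_device_type() lives in the fork's plugin layer):

```python
import torch

def te_device_type():
    # Stand-in for the fork's plugin hook; a vendor backend would return
    # its own device type string (e.g. "musa") instead of "cuda".
    return "cuda"

# Upstream hardcoding:  torch.empty(16, device=torch.device("cuda"))
# Patched form:
buf = torch.empty(16, device=torch.device(te_device_type()))
```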

Verification

  • Build & import validation passed
  • Unit & integration tests passed
  • FlagScale end-to-end training test; summary follows:

Qwen3-32B, 16 layers, 20 iters, 1 node × 8 GPUs

| Config | Status | Avg Throughput (tokens/s/GPU) | Note |
| --- | --- | --- | --- |
| vendor-flash | PASS | 124.33 | |
| vendor-fused | PASS | 121.95 | |
| vendor-unfused | PASS | 108.35 | |
| flagos-flash | PASS | 94.20 | |
| flagos-fused | FAIL | | No fused attention backend support for the flagos backend |
| flagos-unfused | PASS | 65.01 | |
| reference-flash | PASS | 93.14 | |
| reference-fused | FAIL | | No fused attention backend support for the reference backend |
| reference-unfused | PASS | 66.89 | |

DeepSeek-V3 16BA3B, 18 layers with 1 MTP layer, 20 iters, 1 node × 8 GPUs; there is no flash-attn or fused-attn support for multi-latent attention

| Config | Status | Avg Throughput (tokens/s/GPU) |
| --- | --- | --- |
| vendor-unfused | PASS | 47.00 |
| flagos-unfused | PASS | 18.57 |
| reference-unfused | PASS | 20.68 |

tdophung and others added 30 commits November 18, 2025 12:23
…tation (NVIDIA#2394)

Signed-off-by: tdophung <tdophung@nvidia.com>
* Cache device tensors properly

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix annotation and add test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* skip nvfp4 test if not supported

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…LP with checkpoint flag (NVIDIA#2311)

* custom tests for selective activation checkpointing for layernorm mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add selective layernorm mlp to te.pytorch

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* update test and fix SLNMLP bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* implement slnmlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix tests pointed out by greptile app bot, still pass

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* minor formatting change in tests/pytorch/selective_layernorm_mlp/distributed/run_numerics.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove duplicate import in test/pytorch/selective_layernorm_mlp/test_recipe.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests, remove unused imports

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove unused paths in test_deffered_init

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix issue with zero_centered_gamma in test_numerics reference implementation

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make comparison.py more extensive, cleaner output

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix small typo in tests/pytorch/selective_layernorm_mlp/compare.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix typo by grepbot in compare.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make selective activation checkpointing optional in slnmlp via checkpoint flag

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add comments to clarify logic

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add checkpoint param to pytests, change compare.py to compare checkpoint=False vs checkpoint=True, skip cuda graph tests for checkpoint=True

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor tests to call modified LayerNormMLP

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor to implement selective activation checkpointing directly into LayerNormMLP, also fix bug to reach cleanup logic in fwd

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix skip explanation for cuda_graphs.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make _recompute deal with lists instead of tuples

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix MOST cuda graph failures by initializing identical quantizers during fwd. Float8CurrentScaling with bf16 and fp16 still fail with checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cuda graphs issue, all tests pass now

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix small logic bugs, clean up

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* integrate tests into main testing scripts

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* incorporate rng state tracking in checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix return type mismatches

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove checkpoint test from test_recipe, add separate test in test_numerics

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor typo fix

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clear up assertions in tests/pytorch/layernorm_mlp/test_selective_activation_checkpoint.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license and copyright info

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix lint issues in layernorm_mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix cpu_offload_v1 error

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* possibly fix recomputation in cuda graph bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip cuda graphs test for SLNMLP with SM>=10.0 and using delayed scaling

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo for setting IS_FIRST_FP8_MODULE

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

---------

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix test_current_device

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* refactor mxfp8_cast_only kernel

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

* fix ptx.cuh after format

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

---------

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
…A#2399)

Disable Flash attention in Userbuffers tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…VIDIA#2397)

* Avoid autogenerating docs for Python files with leading underscore

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not exclude __init__.py files from doc generation

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Minor CPU overhead changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cache per device

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jack <lityangweiguang@163.com>
* ci: Build and attach bdist wheels to release page

Signed-off-by: oliver könig <okoenig@nvidia.com>

* free up space

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Reduce months to check from 7 to 5

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/scripts/check_for_ngc_images.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update .github/actions/build-pytorch-wheel/build.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…on (NVIDIA#2103)

Signed-off-by: janbernloehr <jan@bernloehrs.de>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…IA#2392)

* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add warnings in DPA and MHA to warn users of changed defaults to BSHD instead of SBHD

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ns_offsets() (NVIDIA#2201)

* Remove unnecessary SWA calculation from _segment_ids_pos_to_seqlens_offsets

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add support for THD+CP+SWA through A2A comms

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* unblock the `padding`+`THD`+`CP(A2A)` with SWA case in A2A forward

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add proper support for thd

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* enable thd+cp tests as essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp+thd+a2a test to essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix comments from greptile

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add proper skip for flash attention

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the test to create separate tensors for flash and fused attention backend scenarios

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove redundant compare

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* simplify code

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add note for cu_seqlens_kv and cu_seqlens_kv_padded

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update tests/pytorch/attention/test_attention_with_cp.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the argument name

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…VIDIA#2401)

Only disable Flash Attention in Userbuffers test on A100

Signed-off-by: Tim Moon <tmoon@nvidia.com>
… work (NVIDIA#2416)

* Change order of arguments to make jax work

Signed-off-by: tdophung <tdophung@nvidia.com>

* make num_experts a tl.constexpr again

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
…#2414)

Add: NVTE_CUDA_ARCHS to README

Signed-off-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
Co-authored-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
* allow dp + fsdp and fixed sr_rng_state partitioning

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup for lint test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
remove linear redundant check

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* minor fix of torch view dtype

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* multi-tensor RHT amax, compiles

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* setup multi_tensor_quantize_nvfp4_impl

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* wire things up and run without crash

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* numerical test

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unit test passing

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* finish unit test of split quantize api

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* bump up padding to 64 for nvfp4 grouped quantize

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix stochastic rounding

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* lint

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* change error message

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* clean up

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* enable multi-amax without RHT

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix col-only quantize mode

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* improve benchmark script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add NCU example script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add larger test case

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add contiguous_data_and_scale check to bulk allocator

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unified naming and differentiate between group_ and multi_

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* move regular amax into multi_tensor.h

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Disentangle logic for split-quantize and general multi-tensor quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use size_t for split sections

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Suggestions from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…NVIDIA#2370)

* fix ci issue

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert back testing changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove quantizer copy + fused adam working

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix mxfp8 bug, god knows who created it

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/optimizers/fused_adam.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update comment

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
* fix backward_dw cuda graph order

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* add validation for num_layers_per_chunk

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* main

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* add

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…nstructor (NVIDIA#2421)

Do not initialize recipe state in base op class

Op attrs may not be set. Move recipe state initialization to linear op constructor.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
)

* Extend docs with quantizers/quantized_tensors/custom_recipe

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Bring structure, reduce redundant members

Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* line lengths

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* subtitle --- fix in many files:

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* a lot of small fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* torch_version() change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing module and fix warnings

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* removed trailing whitespace

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/api/pytorch.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Fix import

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix more imports

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix NumPy docstring parameter spacing and indentation

- Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
- Fix inconsistent indentation in cpu_offload.py docstring
- Modified 51 Python files across transformer_engine/pytorch

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
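
For illustration, the "param : type" spacing this commit standardizes looks like the following (hypothetical function, not from the diff):

```python
def scale(x, factor):
    """Scale a tensor.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor.
    factor : float
        Multiplicative factor.
    """
    return x * factor
```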

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Oleg-Goncharov and others added 18 commits April 2, 2026 23:14
* Enabled persistency with WorkID Query feature

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added a struct with tunable parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added persistency with static scheduling

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmarking

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed out-of-boundary error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Tuned kernel parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 2

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 3

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the dynamic (WorkID Query) persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for PR

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Ready for benchmark

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmark - Regular kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the source code to the profiler

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added constructors to Job and Block descriptors

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the prefetch overlapping between jobs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cache tensor ID

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* ShapeRepresentation is not a template parameter

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed redundant fence_proxy

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used mixed precision FMA

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added Quantize parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math branch

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math to cpp test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Align tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use STS instead of generic ST

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Add zero-tensor cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in colwise path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in rowwise

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for merge

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Uncommented test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added FP16 Fast math path to rowwise processing

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed lint

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modifications per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Assert the buffer size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added fast math RCP for bf16

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fast math for BF16 is now default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed compilation error when compiling on previous archs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Boundary condition fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed compilation error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring. Moved helpers to core-common

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Addressed the PR review comments

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the compilation error when PTX was compiled for CUDA 13.0

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed pytorch extensions

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…os (NVIDIA#2823)

* Fix: Use jitted kernels for generating THD (and BSHD) segment pos if only segment id is passed

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
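
An illustrative (not the TE/JAX implementation) version of the segment-id → segment-position derivation that such a jitted kernel computes:

```python
import jax
import jax.numpy as jnp

@jax.jit
def segment_ids_to_pos(segment_ids):
    # Position restarts at 0 whenever the segment id changes.
    idx = jnp.arange(segment_ids.shape[-1])
    boundary = jnp.concatenate(
        [jnp.array([True]), segment_ids[1:] != segment_ids[:-1]]
    )
    segment_start = jax.lax.cummax(jnp.where(boundary, idx, 0))
    return idx - segment_start

# segment_ids [1, 1, 1, 2, 2, 3] -> positions [0, 1, 2, 0, 1, 0]
```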

* Make passing of segment_pos to from_segment_ids_and_pos for creating a SequenceDescriptor mandatory

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make test changes for the from_segment_ids_and_pos API change. Also some nits.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Make segment_pos arg mandatory and not Optional

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add comments for from_segment_ids_and_pos

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Change data types for BSHD seg pos and seg id to be int32 and consistent with THD when setting up test inputs

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Replace a TypeError if segment_pos is not passed with a ValueError with a message

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* GEMM + Swiglu fused Grouped MLP for MXFP8

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* cleanup/lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Properly cache the alpha tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* nD dummy grad

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 0 tokens in entire rank

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tmp downgrade cublas version check

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* delayed wgrad tests pass for basic gl

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge everything

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix tests being skipped for fusible ops

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Integrate mxfp8 dbias kernel in group_quantize

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add bias/dbias fused support with cute GEMMs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check bias/dbias support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pack biases more efficiently

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* GroupedTensor for biases to avoid concat

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Support 1D grouped tensor shape for bias and fix checkpointing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes and tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Refactor grouped tensor marking for paged stashing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove setting logical_shape in mark_grouped_tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup logical_shape

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* pass the tests for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address some review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more cleanups

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* refactor wgrad logic

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename argument from single_grouped_parameter to single_grouped_weight

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check wgrad store context is not empty for 0 token case.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Test only checks for fusion if fused kernel is available

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix the tolerance to be of bf16 for the cute gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address further review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address more review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address more review comments + test for zero grouped tensor work case

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* cublaslt remove zero work gemm avoidance

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the wgrad test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* split dbias functionality from gq api

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* port fixes and add better doc for page stashing war

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Guard fusion via env

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change to trigger CI

Remove unnecessary blank line in docstring.

* To retrigger CI

* Space to trigger the pipeline

* fix zero work cublas gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* change distributed tests infra for fsdp2

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* verbose flag for reporting

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add back coments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* another minor fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* not needed for this PR

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary comments
…ter CI error re… (NVIDIA#2802)

* Capture subprocess stderr in distributed tests for better CI error reporting

Distributed tests launch subprocesses via torch.distributed.launch/torchrun.
When these fail, pytest only captures the CalledProcessError from the parent
process, not the actual worker traceback. This makes CI JUnit XML reports
show "exit code 1" with no useful error detail.

Add run_distributed() utility to tests/pytorch/utils.py that captures stderr
while letting stdout stream to the terminal. On failure, the worker's stderr
(containing the actual Python traceback) is included in the AssertionError,
which pytest writes into the JUnit XML report.

Behavior:
- Interactive use: stdout streams in real time (unchanged), stderr shown on failure
- CI/JUnit XML: failure reports now include the actual worker traceback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
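
A minimal sketch of the run_distributed() behavior described above (the actual tests/pytorch/utils.py helper may differ in detail):

```python
import subprocess

def run_distributed(cmd):
    """Run a torchrun/launcher command, streaming stdout but capturing stderr."""
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 0:
        # Surface the worker traceback so pytest records it in the JUnit XML.
        raise AssertionError(
            f"Distributed test failed (exit {proc.returncode}):\n{proc.stderr}"
        )
```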

* Add JUnit XML output to ctest in L0_cppunittest

Add --output-junit flag so ctest writes JUnit XML to /logs/,
matching the pattern used by pytest tests. The XML is written
before ctest exits, so it's captured even on test failure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
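
A minimal sketch of the hook-based per-layer memory measurement described above (module and variable names are illustrative, not the test's code; requires a CUDA device):

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).cuda()
increments = []

def pre_hook(module, args):
    torch.cuda.synchronize()
    module._mem_before = torch.cuda.memory_allocated()

def post_hook(module, args, output):
    torch.cuda.synchronize()
    increments.append(torch.cuda.memory_allocated() - module._mem_before)

for layer in model:
    layer.register_forward_pre_hook(pre_hook)
    layer.register_forward_hook(post_hook)

model(torch.randn(8, 1024, device="cuda"))
# Issue NVIDIA#2681 would show up here as increments that grow layer over
# layer relative to a bf16 baseline instead of staying roughly flat.
```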
… states should also be DTensors. (NVIDIA#2795)

* If model parameters are DTensors, optimizer state should also be DTensor.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Unpack DTensor in FusedAdam.step().

Signed-off-by: Cory Ye <cye@nvidia.com>

* Apply suggestions from code review

Add Greptile bug-fixes.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>

* Revert erroneous Greptile diff.

Signed-off-by: Cory Ye <cye@nvidia.com>

* Add DTensor parity check to FusedAdam.step().

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add DTensor handling in state_dict and load_state_dict, and add a DCP re-sharding test.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test commentary.

Signed-off-by: Cory Ye <cye@nvidia.com>

* Filter out DCP resharding tests from the 2 GPU FusedAdam test matrix, as those tests need to be run in sequence.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix float8

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* xfail block scaling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Fix rebase error, pytest filters were shoved into a different test.

Signed-off-by: Cory Ye <cye@nvidia.com>

---------

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
…ch tensor class for the shape (NVIDIA#2841)

* fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Resolved 28 merge conflicts:
- P0 (20 files): transformer_engine/pytorch/ — preserved te_device_type()
  device abstraction and plugin system hooks
- P1 (1 file): transformer_engine/common/__init__.py — preserved plugin
  bootstrapping and skip_cuda_build()
- P2 (7 files): CI/CD, qa, config — preserved fork-specific CI and test harness
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase)
- All 5 vendor backends (cuda, iluvatar, metax, musa, hygon)
- All 5 vendor register_ops.py files
- Scanned flagos/reference backends for changed interfaces (no changes needed)

New APIs added: group_quantize, bgrad_group_quantize, glu, dglu,
te_general_grouped_gemm_for_grouped_tensor, te_general_grouped_gemm_for_discrete_in,
te_general_grouped_gemm_for_discrete_out, nvfp4_data_transpose, swizzle_scales_for_gemm_,
grouped_swizzle_for_gemm, convert_host_pointers_to_tensor,
get_device_pointer_for_data_and_scales, splits_to_offsets,
mxfp8_scaling_compute_partial_amax, mxfp8_scaling_partial_cast,
nvfp4_2d_compute_partial_amax, nvfp4_multi_tensor_compute_partial_amax,
nvfp4_compute_global_scale, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8,
nvfp4_fused_scale, nvfp4_multi_tensor_fused_scale, nvfp4_2d_partial_cast,
nvfp4_multi_tensor_2d_partial_cast, nvfp4_2d_multi_tensor_transpose,
multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0

Modified APIs: split_quantize (added disable_bulk_allocation param)
…ype()

Scanned Python-layer diff (base..dev, excluding csrc) for newly introduced
hardcoded 'cuda' device strings. Replaced 11 instances across 7 files:
- device=torch.device('cuda') → device=torch.device(te_device_type()): 3
- device='cuda' → device=te_device_type(): 1
- .device.type == 'cuda' → .device.type == te_device_type(): 2
- get_autocast_dtype('cuda') → get_autocast_dtype(te_device_type()): 5
Skipped 10 intentional default parameter values and docstrings.
torch.cuda.* API calls left as-is (handled by vendor patches.py at runtime).
Scanned fork-specific code (new in merge vs dev) for references to
functions, classes, and file paths that upstream renamed or relocated
between base and dev. Fixed 6 stale reference(s):
- _load_cudnn() → _load_cuda_library("cudnn")
- _load_nvrtc() → _load_cuda_library("nvrtc")
- _load_curand() → _load_cuda_library("curand")
- _load_nvidia_cuda_library("cublas"/"cuda_runtime") → _load_cuda_library_from_python()
- tensor.quantized_tensor → quantized_tensor (pytorch/utils.py)
- tensor.quantized_tensor → quantized_tensor (flagos backends.py)
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase): added cuda_graph, deterministic to get_fused_attn_backend
- ops.py FlashAttentionBase: added num_splits to forward/_forward_impl signatures
- All vendor FlashAttention subclasses (cuda, hygon, metax, musa, kunlunxin)
- All 5 vendor backends get_fused_attn_backend (cuda, iluvatar, metax, musa, hygon)
- Reference and flagos backends updated for both APIs
- Verified get_attention_backend/AttentionParams pass-through (no changes needed)
See /tmp/plugin_api_changes.log for details.
…_attn_fwd/bwd

Found during batch validation combo 2/9
(te_fl_prefer=vendor, attention_backend=fused, attempt 1).
Error: CUDABackend.fused_attn_fwd() takes 29 positional arguments but 31 were given
Root cause: upstream merge added bottom_right_diagonal and cuda_graph params to the
caller (cpp_extensions/fused_attn.py) but the plugin backend signatures were not updated.
Fix: added both params to ops.py base class, CUDA backend, and all vendor backends
(musa, iluvatar, hygon, metax) for both fused_attn_fwd and fused_attn_bwd.
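
Illustrative only (abbreviated; the real method takes roughly 30 parameters): the plugin signature has to grow in lockstep with the caller in cpp_extensions/fused_attn.py, and these two parameters were the additions named above:

```python
class TEFLBackendBase:
    """Stub; the real base class lives in the plugin ops.py."""

    def fused_attn_fwd(
        self,
        # ... the 29 pre-existing positional parameters elided ...
        bottom_right_diagonal,
        cuda_graph,
    ):
        raise NotImplementedError
```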
…led() in flagos backend

Found during batch validation combo 4/9
(te_fl_prefer=flagos, attention_backend=flash, attempt 1).
Error: Cached implementation 'default.flagos' failed for op 'get_flash_attention_class':
cannot import name 'CPUOffloadEnabled' from 'transformer_engine.pytorch.cpu_offload'
Root cause: upstream removed CPUOffloadEnabled from cpu_offload.py (v2 API),
replacing it with is_cpu_offload_enabled() function.
Fix: updated flagos backend to use the new function.
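
A sketch of the rename fix (module path and both symbol names come from the log above; the guard usage is illustrative):

```python
# Before (removed upstream):
#   from transformer_engine.pytorch.cpu_offload import CPUOffloadEnabled
#   if CPUOffloadEnabled: ...
# After (v2 API):
from transformer_engine.pytorch.cpu_offload import is_cpu_offload_enabled

if is_cpu_offload_enabled():
    ...  # take the CPU-offload path
```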
…metax runner (flagos-ai#60)

Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax
(C500) platforms, removes obsolete workflows, and fixes several
platform-specific test failures. Adds functional testing and log
reporting, significantly simplifies the workflows, and switches the
Metax platform to BAAI runner configs.

---

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [x] Bug fix
- [ ] Documentation change
- [ ] Breaking change

---

- **Workflow cleanup**: Removed 7 obsolete workflows; extracted lint
into a standalone reusable `lint_common.yml` (runs in parallel); added
`integration_tests_common.yml`
- **Platform refactoring**: Added per-platform setup scripts
(`setup_cuda.sh` / `setup_metax.sh`); switched Metax config to BAAI
online environment; removed unsupported test types (JAX distributed)
from Metax matrix
- **Bug fixes**:
- Metax: skip incompatible distributed test files (`test_numerics`,
`test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Metax: replace `nvidia-smi`-only FP8 detection with platform-aware
check
- CUDA: fix `libcudart` load failure when the runtime is pip-installed (add a
proper fallback chain in `_load_cudart()` and `try_load_lib`; a sketch follows this list)
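
A hedged sketch of such a fallback chain, not the repo's `_load_cudart()`; the `nvidia/cuda_runtime/lib` wheel layout is an assumption based on how NVIDIA distributes the runtime via pip:

```python
import ctypes
import glob
import os
import sysconfig

def _load_cudart():
    # Try system-installed libraries first.
    candidates = ["libcudart.so.12", "libcudart.so"]
    # Then fall back to a pip-installed runtime wheel (assumed layout:
    # site-packages/nvidia/cuda_runtime/lib/libcudart.so*).
    purelib = sysconfig.get_paths()["purelib"]
    candidates += glob.glob(
        os.path.join(purelib, "nvidia", "cuda_runtime", "lib", "libcudart.so*")
    )
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    raise OSError("could not load libcudart from any known location")
```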

---

- [x] I have read and followed the contributing guidelines
- [x] The functionality is complete
- [x] I have commented my code, particularly in CI workflow setup steps
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on CUDA and
Metax platform
- [x] New and existing unit tests pass locally on CUDA and Metax
platform

---------

Co-authored-by: qqjxzxq <1376782660@qq.com>
Co-authored-by: HermiaHuan <3081497279@qq.com>
Tree replacement merge from merge/dev-to-main-20260410.
Working tree is identical to the source branch.
Stages 1-8 completed and verified.

CLAassistant commented May 9, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 17 committers have signed the CLA.

✅ BrianPei
❌ timmoon10
❌ phu0ngng
❌ ptrendx
❌ ksivaman
❌ zhongbozhu
❌ vthumbe1503
❌ jomitchellnv
❌ vcherepanov-nv
❌ CarlosGomes98
❌ KshitijLakhani
❌ lixianduo
❌ cspades
❌ Oleg-Goncharov
❌ jberchtold-nvidia
❌ sudhakarsingh27
❌ pstjohn


lixianduo seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

lixianduo and others added 7 commits May 9, 2026 15:23
- Remove unused imports in utils.py, multi_head_attention.py, float8_blockwise_tensor.py
- Reorder imports to follow stdlib → third-party → first-party → local convention
- Fixes CI lint failures while maintaining 10.00/10 pylint score

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lxd-cumt force-pushed the merge-to-main-20260509 branch from fe21a60 to e5c8380 on May 12, 2026 06:18
lixianduo and others added 3 commits May 12, 2026 16:11
This commit synchronizes all vendor backends (enflame, hygon, iluvatar, metax, musa) with the CUDA backend API:

1. **Enflame backend**:
   - Added 27 new operations: group_quantize, bgrad_group_quantize, glu, dglu, te_general_grouped_gemm_*, nvfp4_*, mxfp8_*, multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0, and utility functions
   - Fixed 6 method signatures: group_quantize, bgrad_group_quantize, get_fused_attn_backend, fused_attn_fwd, fused_attn_bwd, fused_rope_backward (added missing parameters)
   - Updated split_quantize to accept disable_bulk_allocation parameter
   - Replaced *args/**kwargs with explicit parameter lists for 15 methods

2. **Hygon, Iluvatar, Metax backends**:
   - Replaced *args/**kwargs with explicit parameter lists for 16 methods each

3. **Musa backend**:
   - Replaced *args/**kwargs with explicit parameter lists for 11 methods
   - Added type hints to splits_to_offsets and mxfp8_scaling_partial_cast

All backends now use explicit parameter signatures matching CUDA backend, except for methods where CUDA also uses *args/**kwargs (te_general_grouped_gemm_*, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8, nvfp4_fused_scale, nvfp4_multi_tensor_2d_partial_cast).
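
Illustrative only: the shape of the *args/**kwargs → explicit-signature change, with a stub base class and hypothetical parameter names (real signatures mirror the CUDA backend's):

```python
class TEFLBackendBase:
    """Stub standing in for the real base class in ops.py."""

class VendorBackend(TEFLBackendBase):
    # Before: an opaque pass-through that silently drifts from the caller:
    #     def split_quantize(self, *args, **kwargs): ...
    # After: explicit parameters, including the newly added flag:
    def split_quantize(self, tensors, quantizers, disable_bulk_allocation=False):
        raise NotImplementedError
```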

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>