
TE-FL Upgrade: Synchronization with TE Release V2.14 #62

Open
lxd-cumt wants to merge 324 commits into flagos-ai:main from lxd-cumt:merge-to-main-20260509

Conversation

lxd-cumt (Collaborator) commented May 9, 2026

Merge upstream release_v2.14 into main

Integrates NVIDIA TransformerEngine upstream release v2.14 (304 commits, v2.9.0 → v2.14.0) into the TransformerEngine-FL fork via a tree-replacement strategy, preserving the custom plugin system while incorporating upstream enhancements.

Upstream Enhancements

Quantization & Precision

  • MXFP8 grouped GEMM with persistent quantization kernels and tensor-scaled FP8 support
  • NVFP4 grouped quantization with Hadamard transform for MoE workloads
  • QuantizedTensor support in FusedAdam optimizer for MXFP8/Float8 block scaling

Architecture Support

  • Blackwell (sm120) fused attention support with cuDNN ≥9.18.1
  • Deterministic training on Blackwell with cuDNN ≥9.18.1
  • Grouped GEMM cuBLAS bindings with bias support and tensor swizzling

Distributed Training

  • FSDP2 support with DTensor-aware optimizer states and allgather optimizations
  • Collective GEMM with FP8/MXFP8 for JAX
  • GroupedTensor torch ops for DDP and distributed optimizer

Operators

  • Fused RMSNorm dLN with add-through via cuDNN
  • MoE grouped MLP ops with split dBias and router kernel JAX bindings
  • Configurable philox rounds for stochastic rounding

FlagOS Features

Plugin System Preservation

  • Synced plugin OP API signatures with upstream csrc changes (fused_attn_fwd/bwd parameters, attention backend dispatch)
  • Patched newly introduced upstream CUDA hardcoding to use te_device_type() for multi-backend compatibility (a minimal sketch follows this list)
  • Fixed stale references to renamed upstream symbols (e.g., CPUOffloadEnabled → is_cpu_offload_enabled())
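
A minimal sketch of the device-abstraction patch pattern, with the hook stubbed so the snippet runs standalone (the real te_device_type() lives in the fork's plugin layer):

```python
import torch

def te_device_type():
    # Stand-in for the fork's plugin hook; a vendor backend would return
    # its own device type string (e.g. "musa") instead of "cuda".
    return "cuda"

# Upstream hardcoding:  torch.empty(16, device=torch.device("cuda"))
# Patched form:
buf = torch.empty(16, device=torch.device(te_device_type()))
```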

Verification

  • Build & import validation passed
  • Unit & integration tests passed
  • FlagScale end-to-end training test; summary follows:

Qwen3-32B, 16 layers, 20 iters, 1 node × 8 GPUs

| Config | Status | Avg Throughput (tokens/s/GPU) | Note |
| --- | --- | --- | --- |
| vendor-flash | PASS | 124.33 | |
| vendor-fused | PASS | 121.95 | |
| vendor-unfused | PASS | 108.35 | |
| flagos-flash | PASS | 94.20 | |
| flagos-fused | FAIL | | No fused attention backend support for the flagos backend |
| flagos-unfused | PASS | 65.01 | |
| reference-flash | PASS | 93.14 | |
| reference-fused | FAIL | | No fused attention backend support for the reference backend |
| reference-unfused | PASS | 66.89 | |

DeepSeek-V3 16BA3B, 18 layers with 1 MTP layer, 20 iters, 1 node × 8 GPUs; there is no flash-attn or fused-attn support for multi-latent attention

| Config | Status | Avg Throughput (tokens/s/GPU) |
| --- | --- | --- |
| vendor-unfused | PASS | 47.00 |
| flagos-unfused | PASS | 18.57 |
| reference-unfused | PASS | 20.68 |

tdophung and others added 30 commits November 18, 2025 12:23
…tation (NVIDIA#2394)

Signed-off-by: tdophung <tdophung@nvidia.com>
* Cache device tensors properly

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix annotation and add test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* skip nvfp4 test if not supported

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…LP with checkpoint flag (NVIDIA#2311)

* custom tests for selective activation checkpointing for layernorm mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add selective layernorm mlp to te.pytorch

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* update test and fix SLNMLP bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* implement slnmlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix tests pointed out by greptile app bot, still pass

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* minor formatting change in tests/pytorch/selective_layernorm_mlp/distributed/run_numerics.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove duplicate import in test/pytorch/selective_layernorm_mlp/test_recipe.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests, remove unused imports

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove unused paths in test_deffered_init

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix issue with zero_centered_gamma in test_numerics reference implementation

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make comparison.py more extensive, cleaner output

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix small typo in tests/pytorch/selective_layernorm_mlp/compare.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix typo by grepbot in compare.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make selective activation checkpointing optional in slnmlp via checkpoint flag

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add comments to clarify logic

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add checkpoint param to pytests, change compare.py to compare checkpoint=False vs checkpoint=True, skip cuda graph tests for checkpoint=True

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor tests to call modified LayerNormMLP

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor to implement selective activation checkpointing directly into LayerNormMLP, also fix bug to reach cleanup logic in fwd

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix skip explanation for cuda_graphs.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make _recompute deal with lists instead of tuples

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix MOST cuda graph failures by initializing identical quantizers during fwd. Float8CurrentScaling with bf16 and fp16 still fail with checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cuda graphs issue, all tests pass now

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix small logic bugs, clean up

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* integrate tests into main testing scripts

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* incorporate rng state tracking in checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix return type mismatches

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove checkpoint test from test_recipe, add separate test in test_numerics

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor typo fix

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clear up assertions in tests/pytorch/layernorm_mlp/test_selective_activation_checkpoint.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license and copyright info

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix lint issues in layernorm_mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix cpu_offload_v1 error

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* possibly fix recomputation in cuda graph bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip cuda graphs test for SLNMLP with SM>=10.0 and using delayed scaling

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo for setting IS_FIRST_FP8_MODULE

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

---------

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix test_current_device

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* refactor mxfp8_cast_only kernel

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

* fix ptx.cuh after format

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

---------

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
…A#2399)

Disable Flash attention in Userbuffers tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…VIDIA#2397)

* Avoid autogenerating docs for Python files with leading underscore

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not exclude __init__.py files from doc generation

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Minor CPU overhead changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cache per device

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jack <lityangweiguang@163.com>
* ci: Build and attach bdist wheels to release page

Signed-off-by: oliver könig <okoenig@nvidia.com>

* free up space

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Reduce months to check from 7 to 5

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/scripts/check_for_ngc_images.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update .github/actions/build-pytorch-wheel/build.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…on (NVIDIA#2103)

Signed-off-by: janbernloehr <jan@bernloehrs.de>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…IA#2392)

* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add warnings in DPA and MHA to warn users of changed defaults to BSHD instead of SBHD

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ns_offsets() (NVIDIA#2201)

* Remove unnecessary SWA calculation from _segment_ids_pos_to_seqlens_offsets

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add support for THD+CP+SWA through A2A comms

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* unblock the `padding`+`THD`+`CP(A2A)` with SWA case in A2A forward

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add proper support for thd

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* enable thd+cp tests as essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp+thd+a2a test to essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix comments from greptile

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add proper skip for flash attention

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the test to create separate tensors for flash and fused attention backend scenarios

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove redundant compare

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* simplify code

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add note for cu_seqlens_kv and cu_seqlens_kv_padded

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update tests/pytorch/attention/test_attention_with_cp.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the argument name

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…VIDIA#2401)

Only disable Flash Attention in Userbuffers test on A100

Signed-off-by: Tim Moon <tmoon@nvidia.com>
… work (NVIDIA#2416)

* Change order of arguments to make jax work

Signed-off-by: tdophung <tdophung@nvidia.com>

* make num_experts a tl.constexpr again

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
…#2414)

Add: NVTE_CUDA_ARCHS to README

Signed-off-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
Co-authored-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
* allow dp + fsdp and fixed sr_rng_state partitioning

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup for lint test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
remove linear redundant check

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* minor fix of torch view dtype

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* multi-tensor RHT amax, compiles

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* setup multi_tensor_quantize_nvfp4_impl

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* wire things up and run without crash

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* numerical test

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unit test passing

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* finish unit test of split quantize api

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* bump up padding to 64 for nvfp4 grouped quantize

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix stochastic rounding

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* lint

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* change error message

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* clean up

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* enable multi-amax without RHT

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix col-only quantize mode

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* improve benchmark script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add NCU example script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add larger test case

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add contiguous_data_and_scale check to bulk allocator

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unified naming and differentiate between group_ and multi_

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* move regular amax into multi_tensor.h

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Disentangle logic for split-quantize and general multi-tensor quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use size_t for split sections

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Suggestions from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…NVIDIA#2370)

* fix ci issue

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert back testing changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove quantizer copy + fused adam working

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix mxfp8 bug, god knows who created it

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/optimizers/fused_adam.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update comment

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
* fix backward_dw cuda graph order

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* add validation for num_layers_per_chunk

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* main

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* add

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…nstructor (NVIDIA#2421)

Do not initialize recipe state in base op class

Op attrs may not be set. Move recipe state initialization to linear op constructor.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
)

* Extend docs with quantizers/quantized_tensors/custom_recipe

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Bring structure, reduce redundant members

Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* line lengths

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* subtitle --- fix in many files:

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* a lot of small fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* torch_version() change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing module and fix warnings

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* removed trailing whitespace

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/api/pytorch.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Fix import

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix more imports

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix NumPy docstring parameter spacing and indentation

- Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
- Fix inconsistent indentation in cpu_offload.py docstring
- Modified 51 Python files across transformer_engine/pytorch

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
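
For illustration, the "param : type" spacing this commit standardizes looks like the following (hypothetical function, not from the diff):

```python
def scale(x, factor):
    """Scale a tensor.

    Parameters
    ----------
    x : torch.Tensor
        Input tensor.
    factor : float
        Multiplicative factor.
    """
    return x * factor
```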

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Oleg-Goncharov and others added 18 commits April 2, 2026 23:14
* Enabled persistency with WorkID Query feature

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added a struct with tunable parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added persistency with static scheduling

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmarking

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed out-of-boundary error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Tuned kernel parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 2

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring 3

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the dynamic (WorkID Query) persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for PR

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Ready for benchmark

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for benchmark - Regular kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the source code to the profiler

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added constructors to Job and Block descriptors

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed the prefetch overlapping between jobs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cache tensor ID

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* ShapeRepresentation is not a template parameter

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed redundant fence_proxy

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used mixed precision FMA

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added Quantize parameters

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math branch

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added the fast math to cpp test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Align tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Use STS instead of generic ST

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Add zero-tensor cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in colwise path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Used LDS instead of generic LD in rowwise

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Ready for merge

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Uncommented test cases

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added FP16 Fast math path to rowwise processing

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed lint

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed test suite

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modifications per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Assert the buffer size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added fast math RCP for bf16

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fast math for BF16 is now default

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed compilation error when compiling on previous archs

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Boundary condition fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed compilation error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Refactoring. Moved helpers to core-common

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactoring per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Addressed the PR review comments

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the compilation error when PTX was compiled for CUDA 13.0

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed pytorch extensions

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…os (NVIDIA#2823)

* Fix: Use jitted kernels for generating THD (and BSHD) segment pos if only segment id is passed

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
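
An illustrative (not the TE/JAX implementation) version of the segment-id → segment-position derivation that such a jitted kernel computes:

```python
import jax
import jax.numpy as jnp

@jax.jit
def segment_ids_to_pos(segment_ids):
    # Position restarts at 0 whenever the segment id changes.
    idx = jnp.arange(segment_ids.shape[-1])
    boundary = jnp.concatenate(
        [jnp.array([True]), segment_ids[1:] != segment_ids[:-1]]
    )
    segment_start = jax.lax.cummax(jnp.where(boundary, idx, 0))
    return idx - segment_start

# segment_ids [1, 1, 1, 2, 2, 3] -> positions [0, 1, 2, 0, 1, 0]
```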

* Make passing of segment_pos to from_segment_ids_and_pos for creating a SequenceDescriptor mandatory

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make test changes for the from_segment_ids_and_pos API change. Also some nits.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Make segment_pos arg mandatory and not Optional

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add comments for from_segment_ids_and_pos

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Change data types for BSHD seg pos and seg id to be int32 and consistent with THD when setting up test inputs

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Replace a TypeError if segment_pos is not passed with a ValueError with a message

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* GEMM + Swiglu fused Grouped MLP for MXFP8

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* cleanup/lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Properly cache the alpha tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* nD dummy grad

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 0 tokens in entire rank

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tmp downgrade cublas version check

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* delayed wgrad tests pass for basic gl

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* merge everything

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix tests being skipped for fusible ops

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Integrate mxfp8 dbias kernel in group_quantize

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add bias/dbias fused support with cute GEMMs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check bias/dbias support

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pack biases more efficiently

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* GroupedTensor for biases to avoid concat

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Support 1D grouped tensor shape for bias and fix checkpointing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes and tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Refactor grouped tensor marking for paged stashing

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove setting logical_shape in mark_grouped_tensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup logical_shape

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* pass the tests for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address some review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more cleanups

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* refactor wgrad logic

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename argument from single_grouped_parameter to single_grouped_weight

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Check wgrad store context is not empty for 0 token case.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Test only checks for fusion if fused kernel is available

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix the tolerance to be of bf16 for the cute gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address further review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address more review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address more review comments + test for zero grouped tensor work case

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* cublaslt remove zero work gemm avoidance

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the wgrad test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* split dbias functionality from gq api

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* port fixes and add better doc for page stashing war

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Guard fusion via env

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change to trigger CI

Remove unnecessary blank line in docstring.

* To retrigger CI

* Space to trigger the pipeline

* fix zero work cublas gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* change distributed tests infra for fsdp2

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* verbose flag for reporting

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add back coments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* another minor fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* not needed for this PR

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary comments
…ter CI error re… (NVIDIA#2802)

* Capture subprocess stderr in distributed tests for better CI error reporting

Distributed tests launch subprocesses via torch.distributed.launch/torchrun.
When these fail, pytest only captures the CalledProcessError from the parent
process, not the actual worker traceback. This makes CI JUnit XML reports
show "exit code 1" with no useful error detail.

Add run_distributed() utility to tests/pytorch/utils.py that captures stderr
while letting stdout stream to the terminal. On failure, the worker's stderr
(containing the actual Python traceback) is included in the AssertionError,
which pytest writes into the JUnit XML report.

Behavior:
- Interactive use: stdout streams in real time (unchanged), stderr shown on failure
- CI/JUnit XML: failure reports now include the actual worker traceback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
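
A minimal sketch of the run_distributed() behavior described above (the actual tests/pytorch/utils.py helper may differ in detail):

```python
import subprocess

def run_distributed(cmd):
    """Run a torchrun/launcher command, streaming stdout but capturing stderr."""
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 0:
        # Surface the worker traceback so pytest records it in the JUnit XML.
        raise AssertionError(
            f"Distributed test failed (exit {proc.returncode}):\n{proc.stderr}"
        )
```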

* Add JUnit XML output to ctest in L0_cppunittest

Add --output-junit flag so ctest writes JUnit XML to /logs/,
matching the pattern used by pytest tests. The XML is written
before ctest exits, so it's captured even on test failure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
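
A minimal sketch of the hook-based per-layer memory measurement described above (module and variable names are illustrative, not the test's code; requires a CUDA device):

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).cuda()
increments = []

def pre_hook(module, args):
    torch.cuda.synchronize()
    module._mem_before = torch.cuda.memory_allocated()

def post_hook(module, args, output):
    torch.cuda.synchronize()
    increments.append(torch.cuda.memory_allocated() - module._mem_before)

for layer in model:
    layer.register_forward_pre_hook(pre_hook)
    layer.register_forward_hook(post_hook)

model(torch.randn(8, 1024, device="cuda"))
# Issue NVIDIA#2681 would show up here as increments that grow layer over
# layer relative to a bf16 baseline instead of staying roughly flat.
```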
… states should also be DTensors. (NVIDIA#2795)

* If model parameters are DTensors, optimizer state should also be DTensor.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Unpack DTensor in FusedAdam.step().

Signed-off-by: Cory Ye <cye@nvidia.com>

* Apply suggestions from code review

Add Greptile bug-fixes.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>

* Revert erroneous Greptile diff.

Signed-off-by: Cory Ye <cye@nvidia.com>

* Add DTensor parity check to FusedAdam.step().

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add DTensor handling in state_dict and load_state_dict, and add a DCP re-sharding test.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test commentary.

Signed-off-by: Cory Ye <cye@nvidia.com>

* Filter out DCP resharding tests from the 2 GPU FusedAdam test matrix, as those tests need to be run in sequence.

Signed-off-by: Cory Ye <cye@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix float8

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* xfail block scaling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Fix rebase error, pytest filters were shoved into a different test.

Signed-off-by: Cory Ye <cye@nvidia.com>

---------

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
…ch tensor class for the shape (NVIDIA#2841)

* fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Resolved 28 merge conflicts:
- P0 (20 files): transformer_engine/pytorch/ — preserved te_device_type()
  device abstraction and plugin system hooks
- P1 (1 file): transformer_engine/common/__init__.py — preserved plugin
  bootstrapping and skip_cuda_build()
- P2 (7 files): CI/CD, qa, config — preserved fork-specific CI and test harness
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase)
- All 5 vendor backends (cuda, iluvatar, metax, musa, hygon)
- All 5 vendor register_ops.py files
- Scanned flagos/reference backends for changed interfaces (no changes needed)

New APIs added: group_quantize, bgrad_group_quantize, glu, dglu,
te_general_grouped_gemm_for_grouped_tensor, te_general_grouped_gemm_for_discrete_in,
te_general_grouped_gemm_for_discrete_out, nvfp4_data_transpose, swizzle_scales_for_gemm_,
grouped_swizzle_for_gemm, convert_host_pointers_to_tensor,
get_device_pointer_for_data_and_scales, splits_to_offsets,
mxfp8_scaling_compute_partial_amax, mxfp8_scaling_partial_cast,
nvfp4_2d_compute_partial_amax, nvfp4_multi_tensor_compute_partial_amax,
nvfp4_compute_global_scale, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8,
nvfp4_fused_scale, nvfp4_multi_tensor_fused_scale, nvfp4_2d_partial_cast,
nvfp4_multi_tensor_2d_partial_cast, nvfp4_2d_multi_tensor_transpose,
multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0

Modified APIs: split_quantize (added disable_bulk_allocation param)
…ype()

Scanned Python-layer diff (base..dev, excluding csrc) for newly introduced
hardcoded 'cuda' device strings. Replaced 11 instances across 7 files:
- device=torch.device('cuda') → device=torch.device(te_device_type()): 3
- device='cuda' → device=te_device_type(): 1
- .device.type == 'cuda' → .device.type == te_device_type(): 2
- get_autocast_dtype('cuda') → get_autocast_dtype(te_device_type()): 5
Skipped 10 intentional default parameter values and docstrings.
torch.cuda.* API calls left as-is (handled by vendor patches.py at runtime).
Scanned fork-specific code (new in merge vs dev) for references to
functions, classes, and file paths that upstream renamed or relocated
between base and dev. Fixed 6 stale reference(s):
- _load_cudnn() → _load_cuda_library("cudnn")
- _load_nvrtc() → _load_cuda_library("nvrtc")
- _load_curand() → _load_cuda_library("curand")
- _load_nvidia_cuda_library("cublas"/"cuda_runtime") → _load_cuda_library_from_python()
- tensor.quantized_tensor → quantized_tensor (pytorch/utils.py)
- tensor.quantized_tensor → quantized_tensor (flagos backends.py)
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase): added cuda_graph, deterministic to get_fused_attn_backend
- ops.py FlashAttentionBase: added num_splits to forward/_forward_impl signatures
- All vendor FlashAttention subclasses (cuda, hygon, metax, musa, kunlunxin)
- All 5 vendor backends get_fused_attn_backend (cuda, iluvatar, metax, musa, hygon)
- Reference and flagos backends updated for both APIs
- Verified get_attention_backend/AttentionParams pass-through (no changes needed)
See /tmp/plugin_api_changes.log for details.
…_attn_fwd/bwd

Found during batch validation combo 2/9
(te_fl_prefer=vendor, attention_backend=fused, attempt 1).
Error: CUDABackend.fused_attn_fwd() takes 29 positional arguments but 31 were given
Root cause: upstream merge added bottom_right_diagonal and cuda_graph params to the
caller (cpp_extensions/fused_attn.py) but the plugin backend signatures were not updated.
Fix: added both params to ops.py base class, CUDA backend, and all vendor backends
(musa, iluvatar, hygon, metax) for both fused_attn_fwd and fused_attn_bwd.
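
Illustrative only (abbreviated; the real method takes roughly 30 parameters): the plugin signature has to grow in lockstep with the caller in cpp_extensions/fused_attn.py, and these two parameters were the additions named above:

```python
class TEFLBackendBase:
    """Stub; the real base class lives in the plugin ops.py."""

    def fused_attn_fwd(
        self,
        # ... the 29 pre-existing positional parameters elided ...
        bottom_right_diagonal,
        cuda_graph,
    ):
        raise NotImplementedError
```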
…led() in flagos backend

Found during batch validation combo 4/9
(te_fl_prefer=flagos, attention_backend=flash, attempt 1).
Error: Cached implementation 'default.flagos' failed for op 'get_flash_attention_class':
cannot import name 'CPUOffloadEnabled' from 'transformer_engine.pytorch.cpu_offload'
Root cause: upstream removed CPUOffloadEnabled from cpu_offload.py (v2 API),
replacing it with is_cpu_offload_enabled() function.
Fix: updated flagos backend to use the new function.
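
A sketch of the rename fix (module path and both symbol names come from the log above; the guard usage is illustrative):

```python
# Before (removed upstream):
#   from transformer_engine.pytorch.cpu_offload import CPUOffloadEnabled
#   if CPUOffloadEnabled: ...
# After (v2 API):
from transformer_engine.pytorch.cpu_offload import is_cpu_offload_enabled

if is_cpu_offload_enabled():
    ...  # take the CPU-offload path
```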
…metax runner (flagos-ai#60)

Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax
(C500) platforms, removes obsolete workflows, and fixes several
platform-specific test failures. Adds functional testing and log
reporting, significantly simplifies the workflows, and switches the
Metax platform to BAAI runner configs.

---

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [x] Bug fix
- [ ] Documentation change
- [ ] Breaking change

---

- **Workflow cleanup**: Removed 7 obsolete workflows; extracted lint
into a standalone reusable `lint_common.yml` (runs in parallel); added
`integration_tests_common.yml`
- **Platform refactoring**: Added per-platform setup scripts
(`setup_cuda.sh` / `setup_metax.sh`); switched Metax config to BAAI
online environment; removed unsupported test types (JAX distributed)
from Metax matrix
- **Bug fixes**:
- Metax: skip incompatible distributed test files (`test_numerics`,
`test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Metax: replace `nvidia-smi`-only FP8 detection with platform-aware
check
- CUDA: fix `libcudart` load failure when the runtime is pip-installed (add a
proper fallback chain in `_load_cudart()` and `try_load_lib`; a sketch follows this list)
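
A hedged sketch of such a fallback chain, not the repo's `_load_cudart()`; the `nvidia/cuda_runtime/lib` wheel layout is an assumption based on how NVIDIA distributes the runtime via pip:

```python
import ctypes
import glob
import os
import sysconfig

def _load_cudart():
    # Try system-installed libraries first.
    candidates = ["libcudart.so.12", "libcudart.so"]
    # Then fall back to a pip-installed runtime wheel (assumed layout:
    # site-packages/nvidia/cuda_runtime/lib/libcudart.so*).
    purelib = sysconfig.get_paths()["purelib"]
    candidates += glob.glob(
        os.path.join(purelib, "nvidia", "cuda_runtime", "lib", "libcudart.so*")
    )
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    raise OSError("could not load libcudart from any known location")
```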

---

- [x] I have read and followed the contributing guidelines
- [x] The functionality is complete
- [x] I have commented my code, particularly in CI workflow setup steps
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on CUDA and
Metax platform
- [x] New and existing unit tests pass locally on CUDA and Metax
platform

---------

Co-authored-by: qqjxzxq <1376782660@qq.com>
Co-authored-by: HermiaHuan <3081497279@qq.com>
Tree replacement merge from merge/dev-to-main-20260410.
Working tree is identical to the source branch.
Stages 1-8 completed and verified.

CLAassistant commented May 9, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 17 committers have signed the CLA.

✅ BrianPei
❌ timmoon10
❌ phu0ngng
❌ ptrendx
❌ ksivaman
❌ zhongbozhu
❌ vthumbe1503
❌ jomitchellnv
❌ vcherepanov-nv
❌ CarlosGomes98
❌ KshitijLakhani
❌ lixianduo
❌ cspades
❌ Oleg-Goncharov
❌ jberchtold-nvidia
❌ sudhakarsingh27
❌ pstjohn


lixianduo seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

lixianduo and others added 7 commits May 9, 2026 15:23
- Remove unused imports in utils.py, multi_head_attention.py, float8_blockwise_tensor.py
- Reorder imports to follow stdlib → third-party → first-party → local convention
- Fixes CI lint failures while maintaining 10.00/10 pylint score

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lxd-cumt force-pushed the merge-to-main-20260509 branch from fe21a60 to e5c8380 on May 12, 2026 06:18
lixianduo and others added 3 commits May 12, 2026 16:11
This commit synchronizes all vendor backends (enflame, hygon, iluvatar, metax, musa) with the CUDA backend API:

1. **Enflame backend**:
   - Added 27 new operations: group_quantize, bgrad_group_quantize, glu, dglu, te_general_grouped_gemm_*, nvfp4_*, mxfp8_*, multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0, and utility functions
   - Fixed 6 method signatures: group_quantize, bgrad_group_quantize, get_fused_attn_backend, fused_attn_fwd, fused_attn_bwd, fused_rope_backward (added missing parameters)
   - Updated split_quantize to accept disable_bulk_allocation parameter
   - Replaced *args/**kwargs with explicit parameter lists for 15 methods

2. **Hygon, Iluvatar, Metax backends**:
   - Replaced *args/**kwargs with explicit parameter lists for 16 methods each

3. **Musa backend**:
   - Replaced *args/**kwargs with explicit parameter lists for 11 methods
   - Added type hints to splits_to_offsets and mxfp8_scaling_partial_cast

All backends now use explicit parameter signatures matching CUDA backend, except for methods where CUDA also uses *args/**kwargs (te_general_grouped_gemm_*, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8, nvfp4_fused_scale, nvfp4_multi_tensor_2d_partial_cast).
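
Illustrative only: the shape of the *args/**kwargs → explicit-signature change, with a stub base class and hypothetical parameter names (real signatures mirror the CUDA backend's):

```python
class TEFLBackendBase:
    """Stub standing in for the real base class in ops.py."""

class VendorBackend(TEFLBackendBase):
    # Before: an opaque pass-through that silently drifts from the caller:
    #     def split_quantize(self, *args, **kwargs): ...
    # After: explicit parameters, including the newly added flag:
    def split_quantize(self, tensors, quantizers, disable_bulk_allocation=False):
        raise NotImplementedError
```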

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>