
TE-FL Upgrade: Synchronization with TE Release V2.14 #59

Open

lxd-cumt wants to merge 61 commits into flagos-ai:dev+te2.14.0 from lxd-cumt:merge/dev-to-main-20260410

Conversation

@lxd-cumt
Collaborator

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits October 16, 2025 16:35
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…A#2274)

* Fix imports in test for deprecated jax.experimental.pjit

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix: Pass NamedSharding instead of PartitionSpec to compare_ops() so that when the in and out sharding is used to create a jitted function, it has the mesh info

Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
* Support wheel build for cuda 13

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes for cu13 runtime, format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add documentation

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better error handling

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix jax sdist

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Modify function names

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…tization (NVIDIA#2270)

* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Include TE core headers in build

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Added sm_120f to the build

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the arch specific handling

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Support for CUDA<12.9

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved through the rest of the files

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Common cases

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Remove pure 100 from the list

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* CMake changes, (not yet working)

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Do not pass the arch-specific thing from build_tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved some of the files to arch-specific compilation

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix and also changing the order of compilation to hopefully get the
compilation time lower

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the files overwriting custom compile properties

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually make this whole thing work

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add space to the error message

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Apply suggestions from code review

Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Fixes from review

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changing the naming to be more intuitive

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing cassert include for device-side asserts

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
* add max_score for fused/unfused F16 non-CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* calculate max per head instead of max over all heads

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fused attn max_score shape

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert FE to github

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FE to 1.15.0-rc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce ew kernels; fix causal masks; add more tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix to tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove logic for flash-attn

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add CP support for p2p/a2a/all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor improvements of implementation/tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: add thd support

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add thd to UnfusedDPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update to FE 1.15

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove unneeded changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable unfused for thd + pad_between_seqs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable thd for unfused until bug is fixed

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename max_score to max_logit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable fused attn + thd

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…DIA#2288)

* Fix CI failures due to deterministic attention

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some more cleanup

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix debug test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…2315)

* Fix: Skip determinism tests for bprop for all sm >=100

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add username to TODO

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix attention backend and tests for sm120

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Disable MLA only for backward

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…determinism on Blackwell (NVIDIA#2316)

* Bump the min version expected to support FP8 cs det on Blackwell

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Disable fused attn for cudnn < 9.14 for FP8 CS. Disable fused attn for cudnn < 9.18 for FP8 deterministic CS

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…NVIDIA#2322)

Ensure JAX reference impl uses an accurate backend

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
# Description
 
Add the FlagOS multi-chip backend for TransformerEngine


---------

Co-authored-by: zhaoyinglia <ylzhao@baai.ac.cn>
# Description

Fix import bugs

…lagos-ai#4)

# TransformerEngine-FL Plugin System

## Overview

This PR implements a comprehensive multi-backend plugin system for
TransformerEngine-FL, enabling support for multiple hardware vendors
(NVIDIA, AMD, Hygon, etc.) while maintaining full API compatibility with
the original `transformer_engine_torch`.

**Core Philosophy**: A plugin-based backend system that allows hardware
vendors to easily implement their own operator optimizations while
preserving complete compatibility with the original TransformerEngine
API.

## Key Features

### Full API Compatibility
- Drop-in replacement for `transformer_engine_torch`
- Switch backends via environment variables
- Zero changes required to existing user code
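
For example, picking a backend for a single run via the `TE_FL_PREFER` variable documented below:

```bash
# Select the vendor backend globally for one run (no code changes)
TE_FL_PREFER=vendor python your_script.py
```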

### Multi-Backend Support

| Backend | Description | Implementation |
|---------|-------------|----------------|
| **FlagOS (default)** | Triton-based cross-platform implementation | `backends/flagos/` |
| **CUDA (vendor)** | Wraps original TransformerEngine C++ extensions | `backends/vendor/cuda/` |
| **Reference** | Pure PyTorch fallback implementation | `backends/reference/` |

### Three-Tier Backend Selection

```
┌─────────────────────────────────────────────────────────┐
│  1. TE_FL_PER_OP (Per-operator override)    [Highest]   │
│     Example: TE_FL_PER_OP="rmsnorm_fwd=vendor:cuda"     │
├─────────────────────────────────────────────────────────┤
│  2. TE_FL_PREFER (Global preference)                    │
│     Values: flagos / vendor / reference                │
├─────────────────────────────────────────────────────────┤
│  3. Backend Priority (Intrinsic)            [Lowest]    │
│     Each implementation has a priority value            │
└─────────────────────────────────────────────────────────┘
```
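
Expressed in code, a minimal sketch of this resolution order (the helper name and parsing details are assumptions, not the actual `policy.py` logic):

```python
import os

def resolve_backend_order(op_name: str) -> list:
    # 1. TE_FL_PER_OP per-operator override wins, e.g.
    #    TE_FL_PER_OP="rmsnorm_fwd=vendor:cuda|default"
    per_op = dict(
        rule.split("=", 1)
        for rule in os.environ.get("TE_FL_PER_OP", "").split(",")
        if "=" in rule
    )
    if op_name in per_op:
        return per_op[op_name].split("|")
    # 2. TE_FL_PREFER sets a global preference
    prefer = os.environ.get("TE_FL_PREFER")
    if prefer:
        return [prefer]
    # 3. Otherwise fall back to intrinsic backend priority
    return ["flagos", "vendor", "reference"]
```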

## Architecture

### Directory Structure

```
transformer_engine/plugin/core/
├── __init__.py              # Public API exports
├── types.py                 # Core types: BackendImplKind, OpImpl
├── registry.py              # OpRegistry: stores all implementations
├── manager.py               # OpManager: selects and calls implementations
├── policy.py                # SelectionPolicy: backend selection rules
├── discovery.py             # Plugin auto-discovery (entry_points, env)
├── builtin_ops.py           # Registers all built-in backends
├── ops.py                   # TEFLModule: transformer_engine_torch compatible API
├── logger_manager.py        # Logging utilities
├── _module_setup.py         # Module aliasing setup
├── _build_config.py         # Build-time configuration
│
└── backends/
    ├── flagos/              # FlagOS backend (Triton-based)
    │   ├── flagos.py        # FlagOSBackend class
    │   ├── register_ops.py  # Operator registration
    │   └── impl/            # Operator implementations
    │       ├── rmsnorm.py
    │       ├── gemm.py
    │       └── ...
    │
    ├── vendor/              # Vendor backends
    │   └── cuda/            # NVIDIA CUDA backend
    │       ├── cuda.py      # CUDABackend class
    │       └── register_ops.py
    │
    └── reference/           # Reference backend (PyTorch)
        ├── reference.py     # ReferenceBackend class
        ├── register_ops.py
        └── impl/            # Pure PyTorch implementations
```

### Core Components

| File | Description |
|------|-------------|
| `types.py` | Defines `BackendImplKind` (DEFAULT/VENDOR/REFERENCE) and the `OpImpl` dataclass |
| `registry.py` | `OpRegistry` - Central storage for all operator implementations |
| `manager.py` | `OpManager` - Handles implementation selection, fallback, and execution |
| `policy.py` | `SelectionPolicy` - Configurable rules for backend selection |
| `discovery.py` | Auto-discovers plugins via `entry_points` or `TE_FL_PLUGIN_MODULES` |
| `ops.py` | `TEFLModule` - Provides a `transformer_engine_torch`-compatible interface |

## Installation

### Build with CUDA support
```bash
pip install --no-build-isolation -e .
```

### Build without CUDA (FlagOS only)
```bash
TE_FL_SKIP_CUDA=1 pip install --no-build-isolation -e .
```

## Environment Variables

### Backend Selection

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TE_FL_PREFER` | Preferred backend type | `flagos` / `vendor` / `reference` | `flagos` |
| `TE_FL_PREFER_VENDOR` | Prefer vendor (legacy) | `1` / `0` | `0` |
| `TE_FL_STRICT` | Strict mode (no fallback) | `1` / `0` | `0` |

### Vendor Filtering

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_ALLOW_VENDORS` | Allowed vendors (whitelist) | `nvidia,amd` |
| `TE_FL_DENY_VENDORS` | Denied vendors (blacklist) | `vendor_a` |

### Per-Operator Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_PER_OP` | Per-operator backend ordering | `rmsnorm_fwd=vendor:cuda\|default` |

### Plugin Discovery

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_PLUGIN_MODULES` | Plugin modules to load | `my_plugin,another_plugin` |

### Build Configuration

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TE_FL_SKIP_CUDA` | Skip CUDA backend | `1` / `0` | `0` |
| `CUDA_HOME` | CUDA installation path | `/usr/local/cuda` | Auto-detected |

### Logging

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TEFL_LOG_LEVEL` | Log level | `DEBUG` / `INFO` / `WARNING` / `ERROR` | `INFO` |

## Usage Examples

### Basic Usage (No Code Changes Required)
```python
# Existing code works as-is
import transformer_engine.pytorch as te
# or
import transformer_engine_torch as te
```

### Register Custom Backend (In-tree)

```python
import torch

from transformer_engine.plugin.core import (
    OpRegistry, OpManager, OpImpl, BackendImplKind
)

# 1. Define implementation
def my_rmsnorm(input, weight, eps=1e-5, **kwargs):
    # Compute the inverse RMS once and return it alongside the output
    variance = input.pow(2).mean(-1, keepdim=True)
    rsigma = torch.rsqrt(variance + eps)
    return input * rsigma * weight, rsigma

# 2. Register
registry = OpRegistry()
registry.register_impl(OpImpl(
    op_name="rmsnorm_fwd",
    impl_id="vendor.mybackend",
    kind=BackendImplKind.VENDOR,
    vendor="mybackend",
    fn=my_rmsnorm,
    priority=200,
))

# 3. Call
manager = OpManager(registry)
output, rsigma = manager.call("rmsnorm_fwd", input, weight)
```

### Register Custom Backend (Out-of-tree Plugin)

Create a plugin package with `register(registry)` function:

```python
# my_vendor_plugin/__init__.py
from transformer_engine.plugin.core import OpImpl, BackendImplKind

def my_rmsnorm(input, weight, eps=1e-5, **kwargs):
    # Your implementation
    ...

def register(registry):
    """Called automatically by TE-FL"""
    registry.register_impl(OpImpl(
        op_name="rmsnorm_fwd",
        impl_id="vendor.myvendor",
        kind=BackendImplKind.VENDOR,
        vendor="myvendor",
        fn=my_rmsnorm,
        priority=200,
    ))
```

Load via environment variable:
```bash
export TE_FL_PLUGIN_MODULES=my_vendor_plugin
python your_script.py
```

## Runtime Logs

When running, you'll see logs indicating which backend is used:

```
[TE-FL manager.py:133 INFO] Registered impl_ids: ['default.flagos', 'reference.torch', 'vendor.cuda']
[TE-FL manager.py:390 INFO] Op 'rmsnorm_fwd' using 'default.flagos' (kind=default, vendor=None)
[TE-FL manager.py:395 INFO] Op 'rmsnorm_fwd' switched from 'default.flagos' to 'vendor.cuda' (kind=vendor, vendor=CUDA)
```

## Examples

See `transformer_engine/plugins/examples/` for complete working
examples:
- `example_intree.py` - In-tree backend registration
- `example_outtree.py` - Out-of-tree plugin registration


---------

Co-authored-by: panpy <panpy@sugon.com>
# Description

- Add missing `__init__.py` files to the
transformer_engine/plugin/core/backends/flagos/attention/ directory tree
to fix import errors when accessing these modules as Python packages
- Add a comprehensive test suite (test_policy.py) covering the TE-FL
scheduling policy system, including:
  - SelectionPolicy creation and configuration
  - Environment variable parsing (TE_FL_PREFER, TE_FL_STRICT, etc.)
  - Policy context managers
  - Vendor filtering (allow/deny)
  - Thread safety validation
- Minor code style improvements
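
For flavor, a minimal sketch of the kind of check such a suite contains (SelectionPolicy per the core components table above; the `from_env` constructor and `prefer` attribute are assumptions):

```python
from transformer_engine.plugin.core import SelectionPolicy  # assumed export

def test_te_fl_prefer_env_is_parsed(monkeypatch):
    # TE_FL_PREFER should steer the global backend preference
    monkeypatch.setenv("TE_FL_PREFER", "vendor")
    policy = SelectionPolicy.from_env()  # assumed constructor
    assert policy.prefer == "vendor"     # assumed attribute
```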

…n fallback (flagos-ai#14)

## Summary
This PR contains two major improvements:

1. **Register `get_attention_backend` function for all backends** (CUDA,
   FlagOS, Reference)
   - Added `get_attention_backend` implementation to all backend types
   - Ensures consistent attention backend selection across different
     hardware platforms

2. **Fix FlashAttention fallback mechanism**
   - Removed the redundant `_called_impls` dictionary, replacing it with a
     simpler `_last_impl_id` class variable
   - Removed the unused `_log_lock` threading lock
   - Simplified implementation tracking and logging logic
   - Reduced code complexity and memory overhead while maintaining full
     functionality

## Changes
- Updated `FlashAttentionBase` class in `ops.py` to remove redundant
implementation tracking
- Added `get_attention_backend` registration to CUDA, FlagOS, and
Reference backends
- Fixed fallback logic in attention backend selection

## Test Plan
- [x] Code builds successfully
- [x] Existing tests pass
- [x] Manual testing with different backend configurations

## Related Issues
Fixes issues with FlashAttention fallback and improves backend
consistency.
# Description

Fix the NVIDIA shared library loading bug:

[CUDA] Import failed: No module named 'transformer_engine_torch_nv'

# Description

This PR adds a Hygon backend for calling basic ops on Hygon DCUs.

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## Changes

Please list the changes introduced in this PR:

- Add a new `hygon` folder in `vendor` containing `__init__.py`,
`hygon.py`, and `register_ops.py`
- Register hygon ops in `builtin_ops.py`

# Requirements

In order to use the Hygon backend, the following requirements must be
met:

- The Python package `transformer_engine_fl_hygon` needs to be installed


---------

Signed-off-by: wenjh <wenjh@sugon.com>
…gos-ai#18)

Add a flag that permanently enables flag_gems with a single switch,
eliminating the need to call flag_gems.use_gems for every single
operator. This removes significant registration overhead and improves
end-to-end throughput.
- When the flag is set, every operator's implementation is forced to use
flagos/vendor; the default PyTorch reference backend is unavailable.
- When the flag is not set, operators can freely switch among flagos,
vendor, and torch backends.

Unify the usage of the gems context:
- Only enter or exit the context when switching between the flagos
backend and the torch backend (or vice versa).
- This avoids the overhead of repeated enter/exit calls across multiple ops.
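
A minimal sketch of the single-switch idea (the flag name and helper are assumptions, not the actual TE-FL code):

```python
import os
import flag_gems

_GEMS_ALWAYS_ON = os.environ.get("TE_FL_GEMS_ALWAYS_ON", "0") == "1"  # assumed flag name
_gems_ctx = None  # the currently entered flag_gems context, if any

def ensure_gems(use_flagos: bool) -> None:
    """Enter/exit flag_gems.use_gems() only when the selected backend
    actually flips between flagos and torch, not once per operator."""
    global _gems_ctx
    if (use_flagos or _GEMS_ALWAYS_ON) and _gems_ctx is None:
        _gems_ctx = flag_gems.use_gems()
        _gems_ctx.__enter__()
    elif not use_flagos and not _GEMS_ALWAYS_ON and _gems_ctx is not None:
        _gems_ctx.__exit__(None, None, None)
        _gems_ctx = None
```
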
## Summary
- Support combined qkv_layout formats like `sbhd_sbhd_sbhd` by
extracting the first part for layout conversion
- Distinguish between standard 4D tensor format (sbhd/bshd) and true
packed format (thd). For 4D tensors, directly convert layout like flagos
backend does, instead of incorrectly trying to unpack

## Problem
When using torch SDPA backend with `batch_size > 1`, the following error
occurs:
```
ValueError: Unexpected 4D tensor shape torch.Size([4096, 4, 16, 128]). Expected [total_tokens, 1, num_heads, head_dim]
```

The original code incorrectly tried to unpack 4D tensors when
`cu_seqlens` was provided, but 4D tensors in `sbhd`/`bshd` format should
be handled with simple layout conversion (like flagos backend does).
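
A minimal sketch of the fix described above (the function shape and names are assumptions):

```python
import torch

def to_bshd(t: torch.Tensor, qkv_layout: str) -> torch.Tensor:
    # Combined layouts like "sbhd_sbhd_sbhd" share one per-tensor layout,
    # so take the first component for conversion.
    layout = qkv_layout.split("_")[0]
    if t.dim() == 4:
        # Standard 4D sbhd/bshd tensors only need a layout transpose;
        # they must never take the thd-style unpacking path.
        return t.transpose(0, 1).contiguous() if layout == "sbhd" else t
    # True packed thd tensors are unpacked using cu_seqlens instead.
    raise NotImplementedError("thd unpacking elided in this sketch")
```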

## Test plan
- [x] Tested with batch_size=4, verified no ValueError
- [x] Results match flagos backend output
- Remove the flag_gems.use_gems() context to avoid context-switching
overhead
- Call flag_gems.xxx directly wherever possible.
# Description

Add the new vendor backend METAX

## Type of change

- [ ] New feature (non-breaking change which adds functionality)

## Changes

Please list the changes introduced in this PR:

- Add Metax ops registration
- Add Metax backend implementation
- Register Metax ops in `builtin_ops.py`

## Requirements

- The module `transformer_engine_torch_metax` is required; to use it,
install the package `transformer_engine_metax`

# Checklist:

- [x] I have read and followed the [contributing
guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] New and existing unit tests pass locally with my changes
…lagos-ai#23)

## Summary
- flagos: Add multi_tensor_adam_param_remainder implementation
- reference: Add multi_tensor_adam_param_remainder implementation  
- reference: Add context parallel support for Flash Attention
- manager: Add cache mechanism with _impl_cache and _impl_cache_meta for
conditional op selection

## Changes
### flagos backend
- Implemented multi_tensor_adam_param_remainder operation for handling
parameter remainders in multi-tensor Adam optimizer

### reference backend  
- Implemented multi_tensor_adam_param_remainder operation
- Added context parallel support for Flash Attention implementation

### Core manager
- Added cache mechanism using _impl_cache and _impl_cache_meta
- Improved op selection with conditional caching based on policy
fingerprint and epoch
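
A hedged sketch of the caching scheme (attribute names from the summary above; the fingerprint/epoch plumbing is assumed):

```python
class OpManager:
    def __init__(self, registry, policy):
        self.registry = registry
        self.policy = policy
        self._impl_cache = {}        # op_name -> chosen implementation
        self._impl_cache_meta = {}   # op_name -> (policy fingerprint, registry epoch)

    def select(self, op_name):
        meta = (self.policy.fingerprint(), self.registry.epoch)  # assumed accessors
        if self._impl_cache_meta.get(op_name) == meta:
            return self._impl_cache[op_name]   # hit: policy and registry unchanged
        impl = self.policy.choose(self.registry.implementations(op_name))
        self._impl_cache[op_name] = impl       # refresh on any policy change
        self._impl_cache_meta[op_name] = meta
        return impl
```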

---------

Signed-off-by: wenone766 <wenone766@wenobug.com>
Co-authored-by: wenone766 <wenone766@wenobug.com>
- Fix enum mismatch between `transformer_engine/plugin/core/ops.py`
and `transformer_engine/common/include/transformer_engine/xxx.h`
# Description

Add the new vendor backend KUNLUNXIN
Darryl233 and others added 17 commits March 2, 2026 10:36
# Description

Validate TE QA test cases with new CI workflows

## Type of change
- [ ] Documentation change (change only to the documentation, either a
fix or a new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] Infra/Build change
- [ ] Code refactoring

## Changes

Please list the changes introduced in this PR:

- Added code inspection and PyTorch/C++ unit tests to improve the TE
testing system
- Implemented end-to-end automation of TE wheel package building,
installation, and verification, supporting multiple versions of Flash
Attention and GPUs with different CUDA architectures
- Verified TE's core functions (distributed communication, matrix
multiplication, ONNX export) and compatibility with
Megatron-LM/Lightning-Thunder
- Completed the verification of the nvinspect debugging tool and
re-verification of core numerical tests

# Checklist:

- [ ] I have read and followed the [contributing
guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [ ] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] New and existing unit tests pass locally with my changes

---------

Co-authored-by: zihugithub <fbye@baai.ac.cn>
Co-authored-by: liyuzhuo <lee.yuzhuo233@gamil.com>
…os-ai#36)

## Summary

Refactor and improve the FlagOS optimizer and multi_tensor
implementations to better match CUDA behavior and improve code quality.

## Changes

### `fused_adam.py` (FlagOS backend)
- Remove unused `inv_scale` and `out_dtype` parameters from
`multi_tensor_adam_fl`
- `multi_tensor_adam_param_remainder_fl`: rewrite FP32 master weight
reconstruction using bit manipulation (int16 high/low bits), matching
the CUDA implementation exactly
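
A minimal sketch of that reconstruction (assuming, per the bullet above, that the BF16 parameter carries the high 16 bits of the FP32 master weight and the int16 remainder the low 16 bits):

```python
import torch

def reconstruct_fp32_master(param_bf16: torch.Tensor,
                            remainder_int16: torch.Tensor) -> torch.Tensor:
    # Reinterpret the BF16 parameter as the high int16 half of an FP32 word
    high = param_bf16.view(torch.int16).to(torch.int32) << 16
    # Mask the remainder down to its low 16 bits before recombining
    low = remainder_int16.to(torch.int32) & 0xFFFF
    return (high | low).view(torch.float32)
```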

### `multi_tensor.py` (FlagOS backend)
- `multi_tensor_l2_norm_fl`: add proper type hints, noop_flag check,
inf/nan detection, and replace raw `**` / `+` operators with
`flag_gems.mul` / `flag_gems.add`
- `multi_tensor_scale_fl`: add type hints, noop_flag check, inf/nan
detection, and replace `src * scale` with `flag_gems.mul(src, scale)`
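
A hedged sketch of the `multi_tensor_scale_fl` shape (the signature is an assumption; the flag_gems call and the checks follow the bullet above):

```python
import torch
import flag_gems

def multi_tensor_scale_fl(chunk_size: int, noop_flag: torch.Tensor,
                          tensor_lists: list, scale: float) -> None:
    # Honor the noop flag: skip all work when it is already set
    if noop_flag.item() != 0:
        return
    srcs, dsts = tensor_lists
    for src, dst in zip(srcs, dsts):
        out = flag_gems.mul(src, scale)        # instead of src * scale
        if not torch.isfinite(out).all():      # inf/nan detection
            noop_flag.fill_(1)
        dst.copy_(out)
```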

### `optimizer.py` (reference backend)
- Update `multi_tensor_l2norm_torch` and `multi_tensor_adam_torch` to
match new signatures and CUDA behavior (L2 vs AdamW mode split)
- Rewrite `multi_tensor_adam_param_remainder_torch` with bit
manipulation matching CUDA
- Rename `eps` → `epsilon` for consistency

### `optimizers/__init__.py`
- Export `multi_tensor_scale` and `multi_tensor_l2norm`

### Misc
- Fix missing newline at end of files
# Description

Add Musa backend
The TE-FL Python level now supports multiple platforms, via the
following changes:

1. Support for vendor-specific patches: vendors can now add their own
patches, e.g., patching `torch.cuda` to `torch.musa`. For the patch
implementation, please refer to
`transformer_engine/plugin/core/backends/vendor/musa/musa_patches.py`;
for the patch integration, please refer to
`transformer_engine/__init__.py`.

2. Abstraction of CUDA device references: files under
`transformer_engine/` now abstract CUDA device-related code into
`te_device_type`. For example, `torch.device("cuda")` is now
replaced with `torch.device(te_device_type)`.

3. Fixes:

- FlagOS backend: `get_num_cublas_stream` and `get_cudnn_version`
- Reference backend: `get_num_cublas_stream` and
`scaled_mask_softmax_forward`
…-ai#52)

Add two functions for the flagos backend, based on flag_gems:

- scaled_masked_softmax_forward
- scaled_masked_softmax_backward
- Fix quantizer dtype attr conversion errors for vendor backends
- Polish logger for vendor backend
- Add `te_general_grouped_gemm` op for the flagos backend, based on
flag_gems
- Support both forward and backward computation, distinguished by
`grad`
# Description

This PR implements and integrates the **Metax (MACA)** workflow into
TransformerEngine-FL. It enables automated CI/CD pipelines, functional
training tests, and unit tests specifically optimized for Metax hardware
environments.

**Key updates in this version:** Successful TE compilation on Metax and
alignment with NVIDIA's standard QA workflows.

Fixes # (issue_number_if_applicable)

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [ ] Documentation change
- [ ] Bug fix
- [ ] Code refactoring

## Changes

### 1. Build & Compilation
- **TE Build Completion**: Successfully completed the compilation and
build process for TransformerEngine on the Metax platform.
- **Workflow Alignment**: Designed the Metax testing workflow based on
NVIDIA's `qa-l0-te-cpp-unittest-pytorch-lint` standard to ensure parity
with upstream quality gates.

### 2. CI/CD Infrastructure & Test Modules
- **Metax Platform Support**: Added `configs/metax.yml` to define
Metax-specific runner labels, images, and device configurations.
- **Verified Workflow Modules**: The following modules have been
implemented and verified on the Metax platform:
    - **pytorch-lint**: Static code analysis and linting.
- **pytorch-debug**: Debug-level build and basic functional
verification.
- **pytorch-unittest**: Core unit testing for Metax-adapted operators.
- **Workflow Modularization**: 
- Introduced `configs/all_tests_common.yml` and
`configs/unit_tests_common.yml` for reusable test logic.
- Added `configs/all_tests_metax.yml` as the dedicated entry point for
Metax functional testing.

### 3. Environment & Runtime Fixes
- **Image Management**: Implemented `image-pull-policy: never` and
`--pull never` options to force the use of local registry images
(localhost:5000), optimizing startup time in local cluster environments.
- **Dynamic Resource Scaling**: 
- Adapted `torchrun` and training scripts to support dynamic
GPU/Accelerator counts (specifically for C500 clusters).
- Removed hardcoded GPU host configurations to improve portability
across different Metax nodes.

### 4. Cleanup
- Removed legacy CUDA/Ascend specific configurations from the Metax
workflow path to prevent environment contamination.

## Hardware/Environment Verified
- **Platform**: Metax MACA
- **Accelerator**: C500
- **Registry**: Local Registry (localhost:5000)

---

## TODO / Next Steps
- [ ] Integrate the Metax-specific adaptation workflow into the central
platform.
- [ ] Generate and upload comprehensive Benchmark and Performance test
reports.

# Checklist:

- [x] I have read and followed the contributing guidelines.
- [x] The functionality is complete and verified on Metax hardware.
- [x] I have commented my code, particularly in hardware-specific
adaptation areas.
- [x] My changes generate no new warnings.
- [x] I have added/updated tests that prove my feature works on the MACA
platform.
- [x] New and existing unit tests (Lint, Debug, Unittest) pass locally
with Metax environment.

---------

Co-authored-by: 爱洗澡 qq <aixizaoqq@aixizaodeMacBook-Air.local>
Co-authored-by: zhoujiamei <2867770387@qq.com>
Co-authored-by: zhoujiamei <zjm>
Co-authored-by: peiyu <peiyu@jinglong.ai>
… FlagCICD runner (flagos-ai#58)

# Description

Simplifies and consolidates the coverage report generation logic in the
CI unittest workflow, reducing redundant steps and dependencies.
Need to test **uploading reports to FlagCICD step** in CI env.

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [ ] Documentation change
- [ ] Bug fix
- [ ] Breaking change

## Changes

- Merged `Generate Coverage Report` into the `Execute Tests` step —
coverage `combine` and `json` generation now run inline after `bash
test.sh`, following the same pattern as Megatron-LM-FL
- Coverage collection is gated on `test_type == 'unittest'` to avoid
running for lint/debug groups, and `pip install` is done only once
- Removed `fetch-depth: 0` from checkout steps (not required for unit
test runs)
- Removed unused/leftover scripts from the repository

## TODO

# Checklist:

- [x] I have read and followed the contributing guidelines.
- [x] The functionality is complete
- [x] I have commented my code, particularly in coverage report
uploading steps
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on Cuda and
Metax platform.
- [x] New and existing unit tests pass locally on Cuda and Metax
platform.
Resolved 28 merge conflicts:
- P0 (20 files): transformer_engine/pytorch/ — preserved te_device_type()
  device abstraction and plugin system hooks
- P1 (1 file): transformer_engine/common/__init__.py — preserved plugin
  bootstrapping and skip_cuda_build()
- P2 (7 files): CI/CD, qa, config — preserved fork-specific CI and test harness
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase)
- All 5 vendor backends (cuda, iluvatar, metax, musa, hygon)
- All 5 vendor register_ops.py files
- Scanned flagos/reference backends for changed interfaces (no changes needed)

New APIs added: group_quantize, bgrad_group_quantize, glu, dglu,
te_general_grouped_gemm_for_grouped_tensor, te_general_grouped_gemm_for_discrete_in,
te_general_grouped_gemm_for_discrete_out, nvfp4_data_transpose, swizzle_scales_for_gemm_,
grouped_swizzle_for_gemm, convert_host_pointers_to_tensor,
get_device_pointer_for_data_and_scales, splits_to_offsets,
mxfp8_scaling_compute_partial_amax, mxfp8_scaling_partial_cast,
nvfp4_2d_compute_partial_amax, nvfp4_multi_tensor_compute_partial_amax,
nvfp4_compute_global_scale, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8,
nvfp4_fused_scale, nvfp4_multi_tensor_fused_scale, nvfp4_2d_partial_cast,
nvfp4_multi_tensor_2d_partial_cast, nvfp4_2d_multi_tensor_transpose,
multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0

Modified APIs: split_quantize (added disable_bulk_allocation param)
…ype()

Scanned Python-layer diff (base..dev, excluding csrc) for newly introduced
hardcoded 'cuda' device strings. Replaced 11 instances across 7 files:
- device=torch.device('cuda') → device=torch.device(te_device_type()): 3
- device='cuda' → device=te_device_type(): 1
- .device.type == 'cuda' → .device.type == te_device_type(): 2
- get_autocast_dtype('cuda') → get_autocast_dtype(te_device_type()): 5
Skipped 10 intentional default parameter values and docstrings.
torch.cuda.* API calls left as-is (handled by vendor patches.py at runtime).
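
For illustration, a sketch of the shape of such a helper (the real `te_device_type()` lives in the TE-FL tree; this body, including the musa probe, is an assumption):

```python
import torch

def te_device_type() -> str:
    # Return the active accelerator's device-type string so Python-layer
    # code avoids hardcoded "cuda" literals; vendor patches may redirect
    # torch.* at runtime (the musa probe below is illustrative only).
    if getattr(torch, "musa", None) is not None and torch.musa.is_available():
        return "musa"
    return "cuda"
```
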
Scanned fork-specific code (new in merge vs dev) for references to
functions, classes, and file paths that upstream renamed or relocated
between base and dev. Fixed 6 stale reference(s):
- _load_cudnn() → _load_cuda_library("cudnn")
- _load_nvrtc() → _load_cuda_library("nvrtc")
- _load_curand() → _load_cuda_library("curand")
- _load_nvidia_cuda_library("cublas"/"cuda_runtime") → _load_cuda_library_from_python()
- tensor.quantized_tensor → quantized_tensor (pytorch/utils.py)
- tensor.quantized_tensor → quantized_tensor (flagos backends.py)
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase): added cuda_graph, deterministic to get_fused_attn_backend
- ops.py FlashAttentionBase: added num_splits to forward/_forward_impl signatures
- All vendor FlashAttention subclasses (cuda, hygon, metax, musa, kunlunxin)
- All 5 vendor backends get_fused_attn_backend (cuda, iluvatar, metax, musa, hygon)
- Reference and flagos backends updated for both APIs
- Verified get_attention_backend/AttentionParams pass-through (no changes needed)
See /tmp/plugin_api_changes.log for details.
…_attn_fwd/bwd

Found during batch validation combo 2/9
(te_fl_prefer=vendor, attention_backend=fused, attempt 1).
Error: CUDABackend.fused_attn_fwd() takes 29 positional arguments but 31 were given
Root cause: upstream merge added bottom_right_diagonal and cuda_graph params to the
caller (cpp_extensions/fused_attn.py) but the plugin backend signatures were not updated.
Fix: added both params to ops.py base class, CUDA backend, and all vendor backends
(musa, iluvatar, hygon, metax) for both fused_attn_fwd and fused_attn_bwd.
…led() in flagos backend

Found during batch validation combo 4/9
(te_fl_prefer=flagos, attention_backend=flash, attempt 1).
Error: Cached implementation 'default.flagos' failed for op 'get_flash_attention_class':
cannot import name 'CPUOffloadEnabled' from 'transformer_engine.pytorch.cpu_offload'
Root cause: upstream removed CPUOffloadEnabled from cpu_offload.py (v2 API),
replacing it with is_cpu_offload_enabled() function.
Fix: updated flagos backend to use the new function.
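
A hedged sketch of a defensive version of that fix (the try/except shim is an assumption; the commit itself simply switched to the new function):

```python
# Prefer the new v2 API; fall back for older TransformerEngine versions.
try:
    from transformer_engine.pytorch.cpu_offload import is_cpu_offload_enabled
except ImportError:  # pre-v2 trees still expose the module-level flag
    from transformer_engine.pytorch.cpu_offload import CPUOffloadEnabled

    def is_cpu_offload_enabled() -> bool:
        return bool(CPUOffloadEnabled)
```
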
@CLAassistant

CLAassistant commented Apr 15, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
9 out of 15 committers have signed the CLA.

✅ lihongyang1990
✅ lxd-cumt
✅ ssuurrffaaccee
✅ dinghaodhd
✅ DannyP0
✅ chai-xiaonan
✅ BrianPei
✅ qqjxzxq
✅ Darryl233
❌ KshitijLakhani
❌ ptrendx
❌ ksivaman
❌ cyanguwa
❌ jberchtold-nvidia
❌ wendell


wendell seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator


This configuration file is not derived from the template used by cuda.yml and meta.yml

BrianPei and others added 10 commits April 24, 2026 18:04
…metax runner (flagos-ai#60)

## Description

Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax
(C500) platforms, removes obsolete workflows, and fixes several
platform-specific test failures. Add functional testing, and log
reporting, with significant workflow simplification, and Metax platform
use BAAI runner configs.

---

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [x] Bug fix
- [ ] Documentation change
- [ ] Breaking change

---

### Changes

- **Workflow cleanup**: Removed 7 obsolete workflows; extracted lint
into a standalone reusable `lint_common.yml` (runs in parallel); add
`integration_tests_common.yml`
- **Platform refactoring**: Added per-platform setup scripts
(`setup_cuda.sh` / `setup_metax.sh`); switched Metax config to BAAI
online environment; removed unsupported test types (JAX distributed)
from Metax matrix
- **Bug fixes**:
- Metax: skip incompatible distributed test files (`test_numerics`,
`test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Metax: replace `nvidia-smi`-only FP8 detection with platform-aware
check
- CUDA: fix `libcudart` load failure when runtime is pip-installed (add
proper fallback chain in `_load_cudart()` and `try_load_lib`)

---

## Checklist

- [x] I have read and followed the contributing guidelines
- [x] The functionality is complete
- [x] I have commented my code, particularly in CI workflow setup steps
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on CUDA and
Metax platform
- [x] New and existing unit tests pass locally on CUDA and Metax
platform

---------

Co-authored-by: qqjxzxq <1376782660@qq.com>
Co-authored-by: HermiaHuan <3081497279@qq.com>
Tree replacement merge from merge/dev-to-main-20260410.
Working tree is identical to the source branch.
Stages 1-8 completed and verified.
- Remove unused imports in utils.py, multi_head_attention.py, float8_blockwise_tensor.py
- Reorder imports to follow stdlib → third-party → first-party → local convention
- Fixes CI lint failures while maintaining 10.00/10 pylint score

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@lxd-cumt force-pushed the merge/dev-to-main-20260410 branch from fe21a60 to e5c8380 on May 12, 2026 06:16