
TE-FL Upgrade: Synchronization with TE Release V2.14 #59

Open

lxd-cumt wants to merge 61 commits into flagos-ai:dev+te2.14.0 from lxd-cumt:merge/dev-to-main-20260410

Conversation

@lxd-cumt
Collaborator

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits October 16, 2025 16:35
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
…A#2274)

* Fix imports in test for deprecated jax.experimental.pjit

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix: Pass NamedSharding instead of PartitionSpec to compare_ops() so that when the in and out sharding is used to create a jitted function, it has the mesh info

Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
* Support wheel build for cuda 13

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes for cu13 runtime, format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add documentation

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better error handling

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix jax sdist

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Modify function names

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…tization (NVIDIA#2270)

* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Include TE core headers in build

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Added sm_120f to the build

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the arch specific handling

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Support for CUDA<12.9

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved through the rest of the files

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Common cases

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Remove pure 100 from the list

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* CMake changes, (not yet working)

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Do not pass the arch-specific thing from build_tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved some of the files to arch-specific compilation

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix and also changing the order of compilation to hopefully get the
compilation time lower

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the files overwriting custom compile properties

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually make this whole thing work

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add space to the error message

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Apply suggestions from code review

Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Fixes from review

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changing the naming to be more intuitive

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing cassert include for device-side asserts

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
* add max_score for fused/unfused F16 non-CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* calculate max per head instead of max over all heads

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fused attn max_score shape

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert FE to github

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FE to 1.15.0-rc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce ew kernels; fix causal masks; add more tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix to tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove logic for flash-attn

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add CP support for p2p/a2a/all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor improvements of implementation/tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: add thd support

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add thd to UnfusedDPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update to FE 1.15

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove unneeded changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable unfused for thd + pad_between_seqs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable thd for unfused until bug is fixed

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename max_score to max_logit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable fused attn + thd

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…DIA#2288)

* Fix CI failures due to deterministic attention

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some more cleanup

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix debug test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…2315)

* Fix: Skip determinism tests for bprop for all sm >=100

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add username to TODO

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100+

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix attention backend and tests for sm120

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Disable MLA only for backward

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…determinism on Blackwell (NVIDIA#2316)

* Bump the min version expected to support FP8 cs det on Blackwell

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Disable fused attn for cudnn < 9.14 for FP8 CS. Disable fused attn for cudnn < 9.18 for FP8 deterministic CS

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…NVIDIA#2322)

Ensure JAX reference impl uses an accurate backend

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
# Description
 
Add the FlagOS multi-chip backend for TransformerEngine


---------

Co-authored-by: zhaoyinglia <ylzhao@baai.ac.cn>
# Description

Fix import bugs

…lagos-ai#4)

# TransformerEngine-FL Plugin System

## Overview

This PR implements a comprehensive multi-backend plugin system for
TransformerEngine-FL, enabling support for multiple hardware vendors
(NVIDIA, AMD, Hygon, etc.) while maintaining full API compatibility with
the original `transformer_engine_torch`.

**Core Philosophy**: A plugin-based backend system that allows hardware
vendors to easily implement their own operator optimizations while
preserving complete compatibility with the original TransformerEngine
API.

## Key Features

### Full API Compatibility
- Drop-in replacement for `transformer_engine_torch`
- Switch backends via environment variables
- Zero changes required to existing user code
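
For example, picking a backend for a single run via the `TE_FL_PREFER` variable documented below:

```bash
# Select the vendor backend globally for one run (no code changes)
TE_FL_PREFER=vendor python your_script.py
```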

### Multi-Backend Support

| Backend | Description | Implementation |
|---------|-------------|----------------|
| **FlagOS (default)** | Triton-based cross-platform implementation | `backends/flagos/` |
| **CUDA (vendor)** | Wraps original TransformerEngine C++ extensions | `backends/vendor/cuda/` |
| **Reference** | Pure PyTorch fallback implementation | `backends/reference/` |

### Three-Tier Backend Selection

```
┌─────────────────────────────────────────────────────────┐
│  1. TE_FL_PER_OP (Per-operator override)    [Highest]   │
│     Example: TE_FL_PER_OP="rmsnorm_fwd=vendor:cuda"     │
├─────────────────────────────────────────────────────────┤
│  2. TE_FL_PREFER (Global preference)                    │
│     Values: flagos / vendor / reference                │
├─────────────────────────────────────────────────────────┤
│  3. Backend Priority (Intrinsic)            [Lowest]    │
│     Each implementation has a priority value            │
└─────────────────────────────────────────────────────────┘
```
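
Expressed in code, a minimal sketch of this resolution order (the helper name and parsing details are assumptions, not the actual `policy.py` logic):

```python
import os

def resolve_backend_order(op_name: str) -> list:
    # 1. TE_FL_PER_OP per-operator override wins, e.g.
    #    TE_FL_PER_OP="rmsnorm_fwd=vendor:cuda|default"
    per_op = dict(
        rule.split("=", 1)
        for rule in os.environ.get("TE_FL_PER_OP", "").split(",")
        if "=" in rule
    )
    if op_name in per_op:
        return per_op[op_name].split("|")
    # 2. TE_FL_PREFER sets a global preference
    prefer = os.environ.get("TE_FL_PREFER")
    if prefer:
        return [prefer]
    # 3. Otherwise fall back to intrinsic backend priority
    return ["flagos", "vendor", "reference"]
```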

## Architecture

### Directory Structure

```
transformer_engine/plugin/core/
├── __init__.py              # Public API exports
├── types.py                 # Core types: BackendImplKind, OpImpl
├── registry.py              # OpRegistry: stores all implementations
├── manager.py               # OpManager: selects and calls implementations
├── policy.py                # SelectionPolicy: backend selection rules
├── discovery.py             # Plugin auto-discovery (entry_points, env)
├── builtin_ops.py           # Registers all built-in backends
├── ops.py                   # TEFLModule: transformer_engine_torch compatible API
├── logger_manager.py        # Logging utilities
├── _module_setup.py         # Module aliasing setup
├── _build_config.py         # Build-time configuration
│
└── backends/
    ├── flagos/              # FlagOS backend (Triton-based)
    │   ├── flagos.py        # FlagOSBackend class
    │   ├── register_ops.py  # Operator registration
    │   └── impl/            # Operator implementations
    │       ├── rmsnorm.py
    │       ├── gemm.py
    │       └── ...
    │
    ├── vendor/              # Vendor backends
    │   └── cuda/            # NVIDIA CUDA backend
    │       ├── cuda.py      # CUDABackend class
    │       └── register_ops.py
    │
    └── reference/           # Reference backend (PyTorch)
        ├── reference.py     # ReferenceBackend class
        ├── register_ops.py
        └── impl/            # Pure PyTorch implementations
```

### Core Components

| File | Description |
|------|-------------|
| `types.py` | Defines `BackendImplKind` (DEFAULT/VENDOR/REFERENCE) and the `OpImpl` dataclass |
| `registry.py` | `OpRegistry` - Central storage for all operator implementations |
| `manager.py` | `OpManager` - Handles implementation selection, fallback, and execution |
| `policy.py` | `SelectionPolicy` - Configurable rules for backend selection |
| `discovery.py` | Auto-discovers plugins via `entry_points` or `TE_FL_PLUGIN_MODULES` |
| `ops.py` | `TEFLModule` - Provides a `transformer_engine_torch`-compatible interface |

## Installation

### Build with CUDA support
```bash
pip install --no-build-isolation -e .
```

### Build without CUDA (FlagOS only)
```bash
TE_FL_SKIP_CUDA=1 pip install --no-build-isolation -e .
```

## Environment Variables

### Backend Selection

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TE_FL_PREFER` | Preferred backend type | `flagos` / `vendor` / `reference` | `flagos` |
| `TE_FL_PREFER_VENDOR` | Prefer vendor (legacy) | `1` / `0` | `0` |
| `TE_FL_STRICT` | Strict mode (no fallback) | `1` / `0` | `0` |

### Vendor Filtering

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_ALLOW_VENDORS` | Allowed vendors (whitelist) | `nvidia,amd` |
| `TE_FL_DENY_VENDORS` | Denied vendors (blacklist) | `vendor_a` |

### Per-Operator Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_PER_OP` | Per-operator backend ordering | `rmsnorm_fwd=vendor:cuda\|default` |

### Plugin Discovery

| Variable | Description | Example |
|----------|-------------|---------|
| `TE_FL_PLUGIN_MODULES` | Plugin modules to load | `my_plugin,another_plugin` |

### Build Configuration

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TE_FL_SKIP_CUDA` | Skip CUDA backend | `1` / `0` | `0` |
| `CUDA_HOME` | CUDA installation path | `/usr/local/cuda` | Auto-detected |

### Logging

| Variable | Description | Values | Default |
|----------|-------------|--------|---------|
| `TEFL_LOG_LEVEL` | Log level | `DEBUG` / `INFO` / `WARNING` / `ERROR` | `INFO` |

## Usage Examples

### Basic Usage (No Code Changes Required)
```python
# Existing code works as-is
import transformer_engine.pytorch as te
# or
import transformer_engine_torch as te
```

### Register Custom Backend (In-tree)

```python
import torch

from transformer_engine.plugin.core import (
    OpRegistry, OpManager, OpImpl, BackendImplKind
)

# 1. Define implementation
def my_rmsnorm(input, weight, eps=1e-5, **kwargs):
    # Compute the inverse RMS once and return it alongside the output
    variance = input.pow(2).mean(-1, keepdim=True)
    rsigma = torch.rsqrt(variance + eps)
    return input * rsigma * weight, rsigma

# 2. Register
registry = OpRegistry()
registry.register_impl(OpImpl(
    op_name="rmsnorm_fwd",
    impl_id="vendor.mybackend",
    kind=BackendImplKind.VENDOR,
    vendor="mybackend",
    fn=my_rmsnorm,
    priority=200,
))

# 3. Call
manager = OpManager(registry)
output, rsigma = manager.call("rmsnorm_fwd", input, weight)
```

### Register Custom Backend (Out-of-tree Plugin)

Create a plugin package with `register(registry)` function:

```python
# my_vendor_plugin/__init__.py
from transformer_engine.plugin.core import OpImpl, BackendImplKind

def my_rmsnorm(input, weight, eps=1e-5, **kwargs):
    # Your implementation
    ...

def register(registry):
    """Called automatically by TE-FL"""
    registry.register_impl(OpImpl(
        op_name="rmsnorm_fwd",
        impl_id="vendor.myvendor",
        kind=BackendImplKind.VENDOR,
        vendor="myvendor",
        fn=my_rmsnorm,
        priority=200,
    ))
```

Load via environment variable:
```bash
export TE_FL_PLUGIN_MODULES=my_vendor_plugin
python your_script.py
```

## Runtime Logs

When running, you'll see logs indicating which backend is used:

```
[TE-FL manager.py:133 INFO] Registered impl_ids: ['default.flagos', 'reference.torch', 'vendor.cuda']
[TE-FL manager.py:390 INFO] Op 'rmsnorm_fwd' using 'default.flagos' (kind=default, vendor=None)
[TE-FL manager.py:395 INFO] Op 'rmsnorm_fwd' switched from 'default.flagos' to 'vendor.cuda' (kind=vendor, vendor=CUDA)
```

## Examples

See `transformer_engine/plugins/examples/` for complete working
examples:
- `example_intree.py` - In-tree backend registration
- `example_outtree.py` - Out-of-tree plugin registration


---------

Co-authored-by: panpy <panpy@sugon.com>
# Description

- Add missing `__init__.py` files to the
transformer_engine/plugin/core/backends/flagos/attention/ directory tree
to fix import errors when accessing these modules as Python packages
- Add a comprehensive test suite (test_policy.py) covering the TE-FL
scheduling policy system, including:
  - SelectionPolicy creation and configuration
  - Environment variable parsing (TE_FL_PREFER, TE_FL_STRICT, etc.)
  - Policy context managers
  - Vendor filtering (allow/deny)
  - Thread safety validation
- Minor code style improvements
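
For flavor, a minimal sketch of the kind of check such a suite contains (SelectionPolicy per the core components table above; the `from_env` constructor and `prefer` attribute are assumptions):

```python
from transformer_engine.plugin.core import SelectionPolicy  # assumed export

def test_te_fl_prefer_env_is_parsed(monkeypatch):
    # TE_FL_PREFER should steer the global backend preference
    monkeypatch.setenv("TE_FL_PREFER", "vendor")
    policy = SelectionPolicy.from_env()  # assumed constructor
    assert policy.prefer == "vendor"     # assumed attribute
```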

…n fallback (flagos-ai#14)

## Summary
This PR contains two major improvements:

1. **Register `get_attention_backend` function for all backends** (CUDA,
   FlagOS, Reference)
   - Added `get_attention_backend` implementation to all backend types
   - Ensures consistent attention backend selection across different
     hardware platforms

2. **Fix FlashAttention fallback mechanism**
   - Removed the redundant `_called_impls` dictionary, replacing it with a
     simpler `_last_impl_id` class variable
   - Removed the unused `_log_lock` threading lock
   - Simplified implementation tracking and logging logic
   - Reduced code complexity and memory overhead while maintaining full
     functionality

## Changes
- Updated `FlashAttentionBase` class in `ops.py` to remove redundant
implementation tracking
- Added `get_attention_backend` registration to CUDA, FlagOS, and
Reference backends
- Fixed fallback logic in attention backend selection

## Test Plan
- [x] Code builds successfully
- [x] Existing tests pass
- [x] Manual testing with different backend configurations

## Related Issues
Fixes issues with FlashAttention fallback and improves backend
consistency.
# Description

Fix the NVIDIA shared library loading bug:

[CUDA] Import failed: No module named 'transformer_engine_torch_nv'

# Description

This PR adds a Hygon backend for calling basic ops on Hygon DCUs.

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## Changes

Please list the changes introduced in this PR:

- Add a new `hygon` folder in `vendor` containing `__init__.py`,
`hygon.py`, and `register_ops.py`
- Register hygon ops in `builtin_ops.py`

# Requirements

In order to use the Hygon backend, the following requirements must be
met:

- The Python package `transformer_engine_fl_hygon` needs to be installed


---------

Signed-off-by: wenjh <wenjh@sugon.com>
…gos-ai#18)

Add a flag that permanently enables flag_gems with a single switch,
eliminating the need to call flag_gems.use_gems for every single
operator. This removes significant registration overhead and improves
end-to-end throughput.
- When the flag is set, every operator's implementation is forced to use
flagos/vendor; the default PyTorch reference backend is unavailable.
- When the flag is not set, operators can freely switch among flagos,
vendor, and torch backends.

Unify the usage of the gems context:
- Only enter or exit the context when switching between the flagos
backend and the torch backend (or vice versa).
- This avoids the overhead of repeated enter/exit calls across multiple ops.
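
A minimal sketch of the single-switch idea (the flag name and helper are assumptions, not the actual TE-FL code):

```python
import os
import flag_gems

_GEMS_ALWAYS_ON = os.environ.get("TE_FL_GEMS_ALWAYS_ON", "0") == "1"  # assumed flag name
_gems_ctx = None  # the currently entered flag_gems context, if any

def ensure_gems(use_flagos: bool) -> None:
    """Enter/exit flag_gems.use_gems() only when the selected backend
    actually flips between flagos and torch, not once per operator."""
    global _gems_ctx
    if (use_flagos or _GEMS_ALWAYS_ON) and _gems_ctx is None:
        _gems_ctx = flag_gems.use_gems()
        _gems_ctx.__enter__()
    elif not use_flagos and not _GEMS_ALWAYS_ON and _gems_ctx is not None:
        _gems_ctx.__exit__(None, None, None)
        _gems_ctx = None
```
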
## Summary
- Support combined qkv_layout formats like `sbhd_sbhd_sbhd` by
extracting the first part for layout conversion
- Distinguish between standard 4D tensor format (sbhd/bshd) and true
packed format (thd). For 4D tensors, directly convert layout like flagos
backend does, instead of incorrectly trying to unpack

## Problem
When using torch SDPA backend with `batch_size > 1`, the following error
occurs:
```
ValueError: Unexpected 4D tensor shape torch.Size([4096, 4, 16, 128]). Expected [total_tokens, 1, num_heads, head_dim]
```

The original code incorrectly tried to unpack 4D tensors when
`cu_seqlens` was provided, but 4D tensors in `sbhd`/`bshd` format should
be handled with simple layout conversion (like flagos backend does).
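
A minimal sketch of the fix described above (the function shape and names are assumptions):

```python
import torch

def to_bshd(t: torch.Tensor, qkv_layout: str) -> torch.Tensor:
    # Combined layouts like "sbhd_sbhd_sbhd" share one per-tensor layout,
    # so take the first component for conversion.
    layout = qkv_layout.split("_")[0]
    if t.dim() == 4:
        # Standard 4D sbhd/bshd tensors only need a layout transpose;
        # they must never take the thd-style unpacking path.
        return t.transpose(0, 1).contiguous() if layout == "sbhd" else t
    # True packed thd tensors are unpacked using cu_seqlens instead.
    raise NotImplementedError("thd unpacking elided in this sketch")
```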

## Test plan
- [x] Tested with batch_size=4, verified no ValueError
- [x] Results match flagos backend output
- Remove the flag_gems.use_gems() context to avoid context-switching
overhead
- Call flag_gems.xxx directly wherever possible.
# Description

Add the new vendor backend METAX

## Type of change

- [ ] New feature (non-breaking change which adds functionality)

## Changes

Please list the changes introduced in this PR:

- Add Metax ops registration
- Add Metax backend implementation
- Register Metax ops in `builtin_ops.py`

## Requirements

- The module `transformer_engine_torch_metax` is required; to use it,
install the package `transformer_engine_metax`

# Checklist:

- [x] I have read and followed the [contributing
guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] New and existing unit tests pass locally with my changes
…lagos-ai#23)

## Summary
- flagos: Add multi_tensor_adam_param_remainder implementation
- reference: Add multi_tensor_adam_param_remainder implementation  
- reference: Add context parallel support for Flash Attention
- manager: Add cache mechanism with _impl_cache and _impl_cache_meta for
conditional op selection

## Changes
### flagos backend
- Implemented multi_tensor_adam_param_remainder operation for handling
parameter remainders in multi-tensor Adam optimizer

### reference backend  
- Implemented multi_tensor_adam_param_remainder operation
- Added context parallel support for Flash Attention implementation

### Core manager
- Added cache mechanism using _impl_cache and _impl_cache_meta
- Improved op selection with conditional caching based on policy
fingerprint and epoch
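
A hedged sketch of the caching scheme (attribute names from the summary above; the fingerprint/epoch plumbing is assumed):

```python
class OpManager:
    def __init__(self, registry, policy):
        self.registry = registry
        self.policy = policy
        self._impl_cache = {}        # op_name -> chosen implementation
        self._impl_cache_meta = {}   # op_name -> (policy fingerprint, registry epoch)

    def select(self, op_name):
        meta = (self.policy.fingerprint(), self.registry.epoch)  # assumed accessors
        if self._impl_cache_meta.get(op_name) == meta:
            return self._impl_cache[op_name]   # hit: policy and registry unchanged
        impl = self.policy.choose(self.registry.implementations(op_name))
        self._impl_cache[op_name] = impl       # refresh on any policy change
        self._impl_cache_meta[op_name] = meta
        return impl
```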

---------

Signed-off-by: wenone766 <wenone766@wenobug.com>
Co-authored-by: wenone766 <wenone766@wenobug.com>
- Fix enum mismatch between `transformer_engine/plugin/core/ops.py`
and `transformer_engine/common/include/transformer_engine/xxx.h`
# Description

Add the new vendor backend KUNLUNXIN
Darryl233 and others added 17 commits March 2, 2026 10:36
# Description

Validate TE QA test cases with new CI workflows

## Type of change
- [ ] Documentation change (change only to the documentation, either a
fix or a new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] Infra/Build change
- [ ] Code refactoring

## Changes

Please list the changes introduced in this PR:

- Added code inspection and PyTorch/C++ unit tests to improve the TE
testing system
- Implemented end-to-end automation of TE wheel package building,
installation, and verification, supporting multiple versions of Flash
Attention and GPUs with different CUDA architectures
- Verified TE's core functions (distributed communication, matrix
multiplication, ONNX export) and compatibility with
Megatron-LM/Lightning-Thunder
- Completed the verification of the nvinspect debugging tool and
re-verification of core numerical tests

# Checklist:

- [ ] I have read and followed the [contributing
guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [ ] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] New and existing unit tests pass locally with my changes

---------

Co-authored-by: zihugithub <fbye@baai.ac.cn>
Co-authored-by: liyuzhuo <lee.yuzhuo233@gamil.com>
…os-ai#36)

## Summary

Refactor and improve the FlagOS optimizer and multi_tensor
implementations to better match CUDA behavior and improve code quality.

## Changes

### `fused_adam.py` (FlagOS backend)
- Remove unused `inv_scale` and `out_dtype` parameters from
`multi_tensor_adam_fl`
- `multi_tensor_adam_param_remainder_fl`: rewrite FP32 master weight
reconstruction using bit manipulation (int16 high/low bits), matching
the CUDA implementation exactly
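
A minimal sketch of that reconstruction (assuming, per the bullet above, that the BF16 parameter carries the high 16 bits of the FP32 master weight and the int16 remainder the low 16 bits):

```python
import torch

def reconstruct_fp32_master(param_bf16: torch.Tensor,
                            remainder_int16: torch.Tensor) -> torch.Tensor:
    # Reinterpret the BF16 parameter as the high int16 half of an FP32 word
    high = param_bf16.view(torch.int16).to(torch.int32) << 16
    # Mask the remainder down to its low 16 bits before recombining
    low = remainder_int16.to(torch.int32) & 0xFFFF
    return (high | low).view(torch.float32)
```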

### `multi_tensor.py` (FlagOS backend)
- `multi_tensor_l2_norm_fl`: add proper type hints, noop_flag check,
inf/nan detection, and replace raw `**` / `+` operators with
`flag_gems.mul` / `flag_gems.add`
- `multi_tensor_scale_fl`: add type hints, noop_flag check, inf/nan
detection, and replace `src * scale` with `flag_gems.mul(src, scale)`
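
A hedged sketch of the `multi_tensor_scale_fl` shape (the signature is an assumption; the flag_gems call and the checks follow the bullet above):

```python
import torch
import flag_gems

def multi_tensor_scale_fl(chunk_size: int, noop_flag: torch.Tensor,
                          tensor_lists: list, scale: float) -> None:
    # Honor the noop flag: skip all work when it is already set
    if noop_flag.item() != 0:
        return
    srcs, dsts = tensor_lists
    for src, dst in zip(srcs, dsts):
        out = flag_gems.mul(src, scale)        # instead of src * scale
        if not torch.isfinite(out).all():      # inf/nan detection
            noop_flag.fill_(1)
        dst.copy_(out)
```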

### `optimizer.py` (reference backend)
- Update `multi_tensor_l2norm_torch` and `multi_tensor_adam_torch` to
match new signatures and CUDA behavior (L2 vs AdamW mode split)
- Rewrite `multi_tensor_adam_param_remainder_torch` with bit
manipulation matching CUDA
- Rename `eps` → `epsilon` for consistency

### `optimizers/__init__.py`
- Export `multi_tensor_scale` and `multi_tensor_l2norm`

### Misc
- Fix missing newline at end of files
# Description

Add Musa backend
The TE-FL Python level now supports multiple platforms, via the
following changes:

1. Support for vendor-specific patches: vendors can now add their own
patches, e.g., patching `torch.cuda` to `torch.musa`. For the patch
implementation, please refer to
`transformer_engine/plugin/core/backends/vendor/musa/musa_patches.py`;
for the patch integration, please refer to
`transformer_engine/__init__.py`.

2. Abstraction of CUDA device references: files under
`transformer_engine/` now abstract CUDA device-related code into
`te_device_type`. For example, `torch.device("cuda")` is now
replaced with `torch.device(te_device_type)`.

3. Fixes:

- FlagOS backend: `get_num_cublas_stream` and `get_cudnn_version`
- Reference backend: `get_num_cublas_stream` and
`scaled_mask_softmax_forward`
…-ai#52)

Add two functions for the flagos backend, based on flag_gems:

- scaled_masked_softmax_forward
- scaled_masked_softmax_backward
- Fix quantizer dtype attr conversion errors for vendor backends
- Polish logger for vendor backend
- Add `te_general_grouped_gemm` op for the flagos backend, based on
flag_gems
- Support both forward and backward computation, distinguished by
`grad`
# Description

This PR implements and integrates the **Metax (MACA)** workflow into
TransformerEngine-FL. It enables automated CI/CD pipelines, functional
training tests, and unit tests specifically optimized for Metax hardware
environments.

**Key updates in this version:** Successful TE compilation on Metax and
alignment with NVIDIA's standard QA workflows.

Fixes # (issue_number_if_applicable)

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [ ] Documentation change
- [ ] Bug fix
- [ ] Code refactoring

## Changes

### 1. Build & Compilation
- **TE Build Completion**: Successfully completed the compilation and
build process for TransformerEngine on the Metax platform.
- **Workflow Alignment**: Designed the Metax testing workflow based on
NVIDIA's `qa-l0-te-cpp-unittest-pytorch-lint` standard to ensure parity
with upstream quality gates.

### 2. CI/CD Infrastructure & Test Modules
- **Metax Platform Support**: Added `configs/metax.yml` to define
Metax-specific runner labels, images, and device configurations.
- **Verified Workflow Modules**: The following modules have been
implemented and verified on the Metax platform:
    - **pytorch-lint**: Static code analysis and linting.
- **pytorch-debug**: Debug-level build and basic functional
verification.
- **pytorch-unittest**: Core unit testing for Metax-adapted operators.
- **Workflow Modularization**: 
- Introduced `configs/all_tests_common.yml` and
`configs/unit_tests_common.yml` for reusable test logic.
- Added `configs/all_tests_metax.yml` as the dedicated entry point for
Metax functional testing.

### 3. Environment & Runtime Fixes
- **Image Management**: Implemented `image-pull-policy: never` and
`--pull never` options to force the use of local registry images
(localhost:5000), optimizing startup time in local cluster environments.
- **Dynamic Resource Scaling**: 
- Adapted `torchrun` and training scripts to support dynamic
GPU/Accelerator counts (specifically for C500 clusters).
- Removed hardcoded GPU host configurations to improve portability
across different Metax nodes.

### 4. Cleanup
- Removed legacy CUDA/Ascend specific configurations from the Metax
workflow path to prevent environment contamination.

## Hardware/Environment Verified
- **Platform**: Metax MACA
- **Accelerator**: C500
- **Registry**: Local Registry (localhost:5000)

---

## TODO / Next Steps
- [ ] Integrate the Metax-specific adaptation workflow into the central
platform.
- [ ] Generate and upload comprehensive Benchmark and Performance test
reports.

# Checklist:

- [x] I have read and followed the contributing guidelines.
- [x] The functionality is complete and verified on Metax hardware.
- [x] I have commented my code, particularly in hardware-specific
adaptation areas.
- [x] My changes generate no new warnings.
- [x] I have added/updated tests that prove my feature works on the MACA
platform.
- [x] New and existing unit tests (Lint, Debug, Unittest) pass locally
with Metax environment.

---------

Co-authored-by: 爱洗澡 qq <aixizaoqq@aixizaodeMacBook-Air.local>
Co-authored-by: zhoujiamei <2867770387@qq.com>
Co-authored-by: zhoujiamei <zjm>
Co-authored-by: peiyu <peiyu@jinglong.ai>
… FlagCICD runner (flagos-ai#58)

# Description

Simplifies and consolidates the coverage report generation logic in the
CI unittest workflow, reducing redundant steps and dependencies.
Need to test **uploading reports to FlagCICD step** in CI env.

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [ ] Documentation change
- [ ] Bug fix
- [ ] Breaking change

## Changes

- Merged `Generate Coverage Report` into the `Execute Tests` step —
coverage `combine` and `json` generation now run inline after `bash
test.sh`, following the same pattern as Megatron-LM-FL
- Coverage collection is gated on `test_type == 'unittest'` to avoid
running for lint/debug groups, and `pip install` is done only once
- Removed `fetch-depth: 0` from checkout steps (not required for unit
test runs)
- Removed unused/leftover scripts from the repository

## TODO

# Checklist:

- [x] I have read and followed the contributing guidelines.
- [x] The functionality is complete
- [x] I have commented my code, particularly in coverage report
uploading steps
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on Cuda and
Metax platform.
- [x] New and existing unit tests pass locally on Cuda and Metax
platform.
Resolved 28 merge conflicts:
- P0 (20 files): transformer_engine/pytorch/ — preserved te_device_type()
  device abstraction and plugin system hooks
- P1 (1 file): transformer_engine/common/__init__.py — preserved plugin
  bootstrapping and skip_cuda_build()
- P2 (7 files): CI/CD, qa, config — preserved fork-specific CI and test harness
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase)
- All 5 vendor backends (cuda, iluvatar, metax, musa, hygon)
- All 5 vendor register_ops.py files
- Scanned flagos/reference backends for changed interfaces (no changes needed)

New APIs added: group_quantize, bgrad_group_quantize, glu, dglu,
te_general_grouped_gemm_for_grouped_tensor, te_general_grouped_gemm_for_discrete_in,
te_general_grouped_gemm_for_discrete_out, nvfp4_data_transpose, swizzle_scales_for_gemm_,
grouped_swizzle_for_gemm, convert_host_pointers_to_tensor,
get_device_pointer_for_data_and_scales, splits_to_offsets,
mxfp8_scaling_compute_partial_amax, mxfp8_scaling_partial_cast,
nvfp4_2d_compute_partial_amax, nvfp4_multi_tensor_compute_partial_amax,
nvfp4_compute_global_scale, nvfp4_compute_per_block_scale, nvfp4_expand_scale_to_fp8,
nvfp4_fused_scale, nvfp4_multi_tensor_fused_scale, nvfp4_2d_partial_cast,
nvfp4_multi_tensor_2d_partial_cast, nvfp4_2d_multi_tensor_transpose,
multi_tensor_scale_tensor, multi_tensor_compute_scale_inv_e8m0

Modified APIs: split_quantize (added disable_bulk_allocation param)
…ype()

Scanned Python-layer diff (base..dev, excluding csrc) for newly introduced
hardcoded 'cuda' device strings. Replaced 11 instances across 7 files:
- device=torch.device('cuda') → device=torch.device(te_device_type()): 3
- device='cuda' → device=te_device_type(): 1
- .device.type == 'cuda' → .device.type == te_device_type(): 2
- get_autocast_dtype('cuda') → get_autocast_dtype(te_device_type()): 5
Skipped 10 intentional default parameter values and docstrings.
torch.cuda.* API calls left as-is (handled by vendor patches.py at runtime).
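
For illustration, a sketch of the shape of such a helper (the real `te_device_type()` lives in the TE-FL tree; this body, including the musa probe, is an assumption):

```python
import torch

def te_device_type() -> str:
    # Return the active accelerator's device-type string so Python-layer
    # code avoids hardcoded "cuda" literals; vendor patches may redirect
    # torch.* at runtime (the musa probe below is illustrative only).
    if getattr(torch, "musa", None) is not None and torch.musa.is_available():
        return "musa"
    return "cuda"
```
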
Scanned fork-specific code (new in merge vs dev) for references to
functions, classes, and file paths that upstream renamed or relocated
between base and dev. Fixed 6 stale reference(s):
- _load_cudnn() → _load_cuda_library("cudnn")
- _load_nvrtc() → _load_cuda_library("nvrtc")
- _load_curand() → _load_cuda_library("curand")
- _load_nvidia_cuda_library("cublas"/"cuda_runtime") → _load_cuda_library_from_python()
- tensor.quantized_tensor → quantized_tensor (pytorch/utils.py)
- tensor.quantized_tensor → quantized_tensor (flagos backends.py)
Updated plugin OP API layer to match pytorch/csrc/ pybind changes
between base and dev branches. Changes applied to:
- ops.py base class (TEFLBackendBase): added cuda_graph, deterministic to get_fused_attn_backend
- ops.py FlashAttentionBase: added num_splits to forward/_forward_impl signatures
- All vendor FlashAttention subclasses (cuda, hygon, metax, musa, kunlunxin)
- All 5 vendor backends get_fused_attn_backend (cuda, iluvatar, metax, musa, hygon)
- Reference and flagos backends updated for both APIs
- Verified get_attention_backend/AttentionParams pass-through (no changes needed)
See /tmp/plugin_api_changes.log for details.
…_attn_fwd/bwd

Found during batch validation combo 2/9
(te_fl_prefer=vendor, attention_backend=fused, attempt 1).
Error: CUDABackend.fused_attn_fwd() takes 29 positional arguments but 31 were given
Root cause: upstream merge added bottom_right_diagonal and cuda_graph params to the
caller (cpp_extensions/fused_attn.py) but the plugin backend signatures were not updated.
Fix: added both params to ops.py base class, CUDA backend, and all vendor backends
(musa, iluvatar, hygon, metax) for both fused_attn_fwd and fused_attn_bwd.
…led() in flagos backend

Found during batch validation combo 4/9
(te_fl_prefer=flagos, attention_backend=flash, attempt 1).
Error: Cached implementation 'default.flagos' failed for op 'get_flash_attention_class':
cannot import name 'CPUOffloadEnabled' from 'transformer_engine.pytorch.cpu_offload'
Root cause: upstream removed CPUOffloadEnabled from cpu_offload.py (v2 API),
replacing it with is_cpu_offload_enabled() function.
Fix: updated flagos backend to use the new function.
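
A hedged sketch of a defensive version of that fix (the try/except shim is an assumption; the commit itself simply switched to the new function):

```python
# Prefer the new v2 API; fall back for older TransformerEngine versions.
try:
    from transformer_engine.pytorch.cpu_offload import is_cpu_offload_enabled
except ImportError:  # pre-v2 trees still expose the module-level flag
    from transformer_engine.pytorch.cpu_offload import CPUOffloadEnabled

    def is_cpu_offload_enabled() -> bool:
        return bool(CPUOffloadEnabled)
```
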
@CLAassistant

CLAassistant commented Apr 15, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
9 out of 15 committers have signed the CLA.

✅ lihongyang1990
✅ lxd-cumt
✅ ssuurrffaaccee
✅ dinghaodhd
✅ DannyP0
✅ chai-xiaonan
✅ BrianPei
✅ qqjxzxq
✅ Darryl233
❌ KshitijLakhani
❌ ptrendx
❌ ksivaman
❌ cyanguwa
❌ jberchtold-nvidia
❌ wendell


wendell seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator


This configuration file is not derived from the template used by cuda.yml and meta.yml

BrianPei and others added 10 commits April 24, 2026 18:04
…metax runner (flagos-ai#60)

## Description

Refactors CI/CD workflows to support both CUDA (NVIDIA A100) and Metax
(C500) platforms, removes obsolete workflows, and fixes several
platform-specific test failures. Add functional testing, and log
reporting, with significant workflow simplification, and Metax platform
use BAAI runner configs.

---

## Type of change

- [x] New feature (non-breaking change which adds functionality)
- [x] Infra/Build change (changes to CI/CD workflows or build scripts)
- [x] Code refactoring
- [x] Bug fix
- [ ] Documentation change
- [ ] Breaking change

---

### Changes

- **Workflow cleanup**: Removed 7 obsolete workflows; extracted lint
into a standalone reusable `lint_common.yml` (runs in parallel); add
`integration_tests_common.yml`
- **Platform refactoring**: Added per-platform setup scripts
(`setup_cuda.sh` / `setup_metax.sh`); switched Metax config to BAAI
online environment; removed unsupported test types (JAX distributed)
from Metax matrix
- **Bug fixes**:
- Metax: skip incompatible distributed test files (`test_numerics`,
`test_torch_fsdp2`, etc.) to prevent `torchrun` SIGSEGV
- Metax: replace `nvidia-smi`-only FP8 detection with platform-aware
check
- CUDA: fix `libcudart` load failure when runtime is pip-installed (add
proper fallback chain in `_load_cudart()` and `try_load_lib`)

---

## Checklist

- [x] I have read and followed the contributing guidelines
- [x] The functionality is complete
- [x] I have commented my code, particularly in CI workflow setup steps
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added/updated tests that prove my feature works on CUDA and
Metax platform
- [x] New and existing unit tests pass locally on CUDA and Metax
platform

---------

Co-authored-by: qqjxzxq <1376782660@qq.com>
Co-authored-by: HermiaHuan <3081497279@qq.com>
Tree replacement merge from merge/dev-to-main-20260410.
Working tree is identical to the source branch.
Stages 1-8 completed and verified.
- Remove unused imports in utils.py, multi_head_attention.py, float8_blockwise_tensor.py
- Reorder imports to follow stdlib → third-party → first-party → local convention
- Fixes CI lint failures while maintaining 10.00/10 pylint score

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@lxd-cumt force-pushed the merge/dev-to-main-20260410 branch from fe21a60 to e5c8380 on May 12, 2026 06:16