Add the new vendor backend ENFLAME #61
gongxijun commented Apr 28, 2026
env config:

```
export FRAME_WORK="Megatron"
```

running log:

```
CUDA vendor backend skipped (CUDA build was disabled at build time)
[2026-05-09 10:23:49,665 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38751
  warnings.warn(msg, ImportWarning)
INFO:megatron.training.initialize:Setting logging level to 40
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2026-05-09 10:24:20] iteration 1/ 4 | consumed samples: 16 | elapsed time per iteration (ms): 2800.2 | throughput per GPU (TFLOP/s/GPU): 19.9 | learning rate: 3.385972E-06 | global batch size: 16 | lm loss: 1.280801E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.105 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-05-09 10:24:21] iteration 2/ 4 | consumed samples: 32 | elapsed time per iteration (ms): 609.1 | throughput per GPU (TFLOP/s/GPU): 91.5 | learning rate: 2.145000E-06 | global batch size: 16 | lm loss: 1.272279E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.646 | number of skipped iterations: 0 | number of nan iterations: 0 |
```
Force-pushed from 8aea577 to 6574fd0: install and run pre-commit to pass ci
```python
from transformer_engine.plugin.core.ops import FlashAttentionBase


class FlashAttentionMETAX(FlashAttentionBase):
```
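For context, the quoted lines subclass the plugin's `FlashAttentionBase`. A minimal, self-contained sketch of that vendor-subclass pattern follows; the stand-in base class and its `forward` signature here are illustrative assumptions, not the real `transformer_engine.plugin.core.ops` API:

```python
# Sketch of the vendor-subclass pattern from the diff above.
# FlashAttentionBase below is a stand-in: the real class lives in
# transformer_engine.plugin.core.ops and has a richer interface.

class FlashAttentionBase:
    """Stand-in base: vendor backends override forward() with their kernel."""
    vendor = "reference"

    def forward(self, q, k, v):
        raise NotImplementedError


class FlashAttentionMETAX(FlashAttentionBase):
    vendor = "METAX"

    def forward(self, q, k, v):
        # A real backend would dispatch to the vendor's fused attention
        # kernel here; this toy version just combines the inputs.
        return [qi + ki + vi for qi, ki, vi in zip(q, k, v)]


impl = FlashAttentionMETAX()
print(impl.vendor)                   # METAX
print(impl.forward([1], [2], [3]))   # [6]
```

The OpManager can then resolve `get_flash_attention_class` to the vendor subclass when that backend is active, falling back to the reference implementation otherwise, which is the behavior the log below shows for `vendor.enflame`.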
new running log:
[2026-05-11 08:59:19,336 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18699
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,339 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18702
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,343 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18701
[2026-05-11 08:59:19,343 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18697
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
CUDA vendor backend skipped (CUDA build was disabled at build time)
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:387: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.gcu and torch.nn.Module.gcu now..
The backend in torch.distributed.init_process_group set to eccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.gcu.* and torch.gcu.amp.* now..
The device parameters have been replaced with gcu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
[2026-05-11 08:59:19,355 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18704
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,357 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,360 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18700
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,375 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,375 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,376 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,376 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,376 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[the same plugin discovery / OpManager initialization sequence is repeated by the remaining ranks]
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 2
building HuggingFaceTokenizer tokenizer ...
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
INFO:megatron.training.initialize:Setting logging level to 40
setting tensorboard ...
padded vocab (size: 151669) with 267 dummy tokens (new size: 151936)
initializing torch distributed ...
/usr/local/lib/python3.12/dist-packages/eventlet/support/greenlets.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  preserves_excinfo = (distutils.version.LooseVersion(greenlet.__version__)
wandb: Tracking run with wandb version 0.24.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto
/usr/local/lib/python3.12/dist-packages/wandb/util.py:1967: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
yield InstalledDistribution(key=d.metadata["Name"], version=d.version)
/usr/lib/python3.12/importlib/metadata/__init__.py:467: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
return self.metadata['Version']
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:195: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  self.FromDatetime(datetime.datetime.utcnow())
INFO:megatron.training.initialize:Setting logging level to 40
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'
done with dataset index builder. Compilation time: 0.162 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
compiling and loading fused kernels ...
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
warnings.warn( # warn only once
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
done with compiling and loading fused kernels. Compilation time: 0.675 seconds
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:149: UserWarning: GCU not support Double use Float replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:24.)
return fn(*args, **kwargs)
time to initialize megatron (seconds): 3.901
[after megatron is initialized] datetime: 2026-05-11 08:59:47
building GPT model ...
[2026-05-11 08:59:48,030 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-11 08:59:48,031 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-11 08:59:48,031 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/extensions/transformer_engine.py:227: DeprecationWarning: `hidden_size` arg has been renamed to `normalized_shape` for compatibility with `torch.nn.LayerNorm`.
  instance = te.pytorch.RMSNorm(
[the get_cudnn_version fallback warning and the RMSNorm hidden_size DeprecationWarning above repeat on the remaining ranks]
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:147: UserWarning: GCU not support Long use Int replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:28.)
  return fn(*args, **kwargs)
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1538544128
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-05-11 08:59:48
building train, validation, and test datasets ...
train: 64
validation: 0
test: 0
building train, validation, and test datasets for GPT ...
finished creating GPT datasets ...
/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py:659: UserWarning: pin_memory_device is deprecated, the current accelerator will be used as the device, ignore pin_memory_device='gcu'.
  warnings.warn(
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18698) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18702) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18703) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18704) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18697) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
[2026-05-11 08:59:49,085 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18699) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18701) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18700) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
[the "OpManager reset after fork" DEBUG line then repeats once per forked dataloader worker]
[2026-05-11 08:59:49,290 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,291 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,300 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,300 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,301 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,304 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,306 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,310 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,312 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,312 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,320 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,322 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,323 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,327 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,328 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,329 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,331 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,332 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,342 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,342 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,343 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,346 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,347 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[2026-05-11 08:59:49,354 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,357 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,357 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[2026-05-11 08:59:49,363 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[2026-05-11 08:59:49,368 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,370 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,372 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,372 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[2026-05-11 08:59:49,383 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,384 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,387 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,389 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,393 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,395 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,397 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,400 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,405 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,406 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,409 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,414 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,414 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,418 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,420 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,423 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,426 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,426 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,430 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,436 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,441 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,442 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,445 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,446 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,449 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,450 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,453 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,462 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,469 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,472 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,474 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,477 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,477 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,478 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,481 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,488 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,490 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,500 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,502 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,506 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,505 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,506 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,510 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,512 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,529 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,529 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[2026-05-11 08:59:49,531 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[after dataloaders are built] datetime: 2026-05-11 08:59:49
done with setup ...
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-05-11 08:59:49
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
[2026-05-11 08:59:50,005 TE-FL manager.py:439 INFO] Op 'rmsnorm_fwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:50,009 TE-FL manager.py:439 INFO] Op 'generic_gemm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:50,034 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
... (the ECCL version banner and the op-selection INFO lines above are each emitted once per rank; duplicate rank copies omitted)
[2026-05-11 08:59:51,287 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,061 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,331 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
Number of parameters in transformer block in billions: 1.35
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 2.60
Number of parameters in most loaded shard in billions: 2.5953
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 544.0 MB
Theoretical memory footprints: weight and optimizer=18562.79 MB, activation=2439.18 MB, total=21001.98 MB
[2026-05-11 08:59:52] iteration 1/ 4 | consumed samples: 16 | elapsed time per iteration (ms): 3134.6 | throughput per GPU (TFLOP/s/GPU): 17.8 | learning rate: 3.385972E-06 | global batch size: 16 | lm loss: 1.280801E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.105 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 1 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 16005.3583984375 | reserved: 20646.0 | max reserved: 20646.0 | device usage: 26724.37890625
... (parameter-count and theoretical-memory summary reprinted at iteration 2, identical to the block above; omitted)
[2026-05-11 08:59:53] iteration 2/ 4 | consumed samples: 32 | elapsed time per iteration (ms): 564.8 | throughput per GPU (TFLOP/s/GPU): 98.7 | learning rate: 2.145000E-06 | global batch size: 16 | lm loss: 1.272276E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.646 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 2 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 18474.93505859375 | reserved: 23024.0 | max reserved: 23024.0 | device usage: 29104.37890625
[2026-05-11 08:59:53] iteration 3/ 4 | consumed samples: 48 | elapsed time per iteration (ms): 562.9 | throughput per GPU (TFLOP/s/GPU): 99.0 | learning rate: 9.040276E-07 | global batch size: 16 | lm loss: 1.269750E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 14.716 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-05-11 08:59:54] iteration 4/ 4 | consumed samples: 64 | elapsed time per iteration (ms): 558.7 | throughput per GPU (TFLOP/s/GPU): 99.8 | learning rate: 3.900000E-07 | global batch size: 16 | lm loss: 1.250520E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.292 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-05-11 08:59:54
wandb: training_time_layer_fwd 0
wandb: training_time_layer_bwd 0
wandb: tokens_per_sec_per_device 14552.366627560767
wandb: MFU% 31.736331295294892
wandb:
wandb: Run history:
wandb: batch-size ▁▁▁▁
wandb: grad-norm ▄█▁▅
wandb: iteration-time █▁▁▁
wandb: learning-rate █▅▂▁
wandb: lm loss █▆▅▁
wandb: load_balancing_loss ▁▁▁▁
wandb: loss-scale ▁▁▁▁
wandb: samples vs steps ▁▃▆█
wandb: throughput ▁███
wandb:
wandb: Run summary:
wandb: batch-size 16
wandb: grad-norm 15.29184
wandb: iteration-time 0.55875
wandb: learning-rate 0.0
wandb: lm loss 12.5052
wandb: load_balancing_loss 1.00034
wandb: loss-scale 1
wandb: samples vs steps 64
wandb: throughput 99.75928
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto
wandb: Find logs at: /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto/logs
test_pass
`
from ....ops import *


def _get_tex():
    from migration.patches.transformer_engine import v2_9_0
Need to add `ensure_enflame_libs` to guarantee the required libraries are importable in the environment, or alternatively add try/except handling.
    return v2_9_0


def _check_enflame_available() -> bool:
    from torch_gcu import transfer_to_gcu
`torch_gcu` is not a generally available package; please wrap the import in try/except.
# Description
Add the new vendor backend ENFLAME
## Type of change
- [x] New feature (non-breaking change which adds functionality)
## Changes
Please list the changes introduced in this PR:
- Add the ENFLAME op registrations
- Add the ENFLAME backend implementation
- Register the ENFLAME ops in builtin_ops.py
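The bullets above follow a common dispatch pattern: vendor implementations are keyed by (op name, vendor) and resolved at call time, falling back to the built-in op. A self-contained sketch of that pattern — the registry, decorator, and names below are illustrative, not the plugin's real API:

```python
VENDOR_OPS = {}

def register_vendor_op(name, vendor):
    """Decorator recording a vendor implementation under (op name, vendor)."""
    def wrap(fn):
        VENDOR_OPS[(name, vendor)] = fn
        return fn
    return wrap

@register_vendor_op("rmsnorm_fwd", "ENFLAME")
def rmsnorm_fwd_enflame(x):
    # placeholder: a real backend would dispatch to the vendor kernel here
    return x

def resolve(name, vendor, fallback):
    """Return the registered vendor op if present, else the built-in fallback."""
    return VENDOR_OPS.get((name, vendor), fallback)
```

This matches the `Op 'rmsnorm_fwd' using 'vendor.enflame'` lines in the log: each op resolves to the vendor implementation when one is registered for the active vendor.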
## Requirements
- The `migration` module is required: to use this backend, install the `migration` wheel package.
# Checklist:
- [x] I have read and followed the [contributing
guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] New and existing unit tests pass locally with my changes

