Add the new vendor backend ENFLAME#61

Merged
zhaoyinglia merged 1 commit into flagos-ai:main from gongxijun:main on May 12, 2026

Conversation

@gongxijun

# Description

Add the new vendor backend ENFLAME

## Type of change

- [x] New feature (non-breaking change which adds functionality)

## Changes

Please list the changes introduced in this PR:

- Add the Enflame op registrations
- Add the Enflame backend implementation
- Register the Enflame ops in `builtin_ops.py`
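The registration pattern above can be sketched roughly as follows. This is a minimal illustrative sketch, not the actual TE-FL API: `REGISTRY`, `register_op`, and `resolve` are hypothetical names, and the toy `gemm` implementations stand in for real Enflame GCU kernels. It only shows the general idea of filing implementations under `(op_name, impl_id)` and preferring vendor implementations, as the `TE_FL_PREFER=vendor` setting in the log suggests.

```python
# Hypothetical sketch of vendor-op registration; names are illustrative,
# not the real TE-FL interfaces.

REGISTRY = {}  # op_name -> {impl_id: callable}

def register_op(op_name, impl_id):
    """Decorator that files an implementation under (op_name, impl_id)."""
    def decorator(fn):
        REGISTRY.setdefault(op_name, {})[impl_id] = fn
        return fn
    return decorator

@register_op("gemm", "vendor.enflame")
def enflame_gemm(a, b):
    # A real backend would dispatch to an Enflame GCU kernel here;
    # a pure-Python matmul stands in for it.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

@register_op("gemm", "reference.torch")
def reference_gemm(a, b):
    # Reference fallback implementation.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def resolve(op_name, prefer="vendor"):
    """Pick an implementation, preferring impl_ids that match `prefer`
    (cf. TE_FL_PREFER=vendor in the env config below)."""
    impls = REGISTRY[op_name]
    for impl_id in sorted(impls, key=lambda i: not i.startswith(prefer)):
        return impls[impl_id]
```

With this sketch, `resolve("gemm", prefer="vendor")` returns the Enflame implementation while `prefer="reference"` falls back to the torch reference, mirroring the `['reference.torch', 'vendor.enflame']` impl_ids reported in the log.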

## Requirements

- The `migration` module is required: to use this backend, install the `migration` wheel package.

# Checklist:

- [x] I have read and followed the [contributing guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes

@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
All committers have signed the CLA.


@gongxijun (Author) left a comment


LGTM

@gongxijun

Env config:

```shell
TEFL_LOG_LEVEL=DEBUG
TE_FL_SKIP_CUDA=1
TE_FL_PREFER=vendor
NVTE_DEBUG=1
NVTE_DEBUG_LEVEL=2
```
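A launch wrapper for these settings might look like the following. This is a hedged sketch: the variable values are taken from the config above, but the launch command itself is left commented out because the script path (`./models/megatron/pretrain.py`, from the log) and its arguments depend on the local checkout.

```shell
# Export the TE-FL / NVTE debug settings from the env config above,
# then launch training (command shown commented; path from the log).
export TEFL_LOG_LEVEL=DEBUG
export TE_FL_SKIP_CUDA=1
export TE_FL_PREFER=vendor
export NVTE_DEBUG=1
export NVTE_DEBUG_LEVEL=2
# python ./models/megatron/pretrain.py
```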

Running log:

```text
------------------CASE_NAME: Megatron_Qwen3-235B_pretrain_T1P1C1E8_4KMB1GB16_BF16L2E32------------------
default not use ENFLAME_TORCH_GCU_ENABLE_AUTO_MIGRATION and TOPSLM_ENABLE_AUTO_MIGRATION, use ENFLAME_ENABLE_AUTO_MIGRATION
WARNING: Skipping /usr/lib/python3/dist-packages/PyJWT-2.7.0.dist-info due to invalid metadata entry 'name'
migration 3.7.20260507+gcu

export FRAME_WORK="Megatron"
export MODEL_NAME="Qwen3-235B"
export TASK_NAME="pretrain"
export TP="1"
export PP="1"
export EP="8"
export CP="1"
export VPP="0"
export UP="1"
export SDP="-1"
export MBS="1"
export GBS="16"
export SEQ_LEN="4096"
export NUM_LAYER="2"
export NUM_EXPERT="32"
export PR="BF16"
frame_work: Megatron
export TEST_ID="1x8_Megatron_dev_Qwen3-235B_pretrain_E8_4KMB1GB16_BF16L2E32"
MODEL_NAME_LOW: qwen3-235b
TEST_ID_RUN: GCU400_1x8_Megatron_dev_Qwen3-235B_pretrain_E8_4KMB1GB16_BF16L2E32Do_test
PYTHONPATH: /home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/:.:/workspace/Verl/
PY_MAIN: ./models/megatron/pretrain.py

CUDA vendor backend skipped (CUDA build was disabled at build time)
CUDA vendor backend skipped (CUDA build was disabled at build time)
CUDA vendor backend skipped (CUDA build was disabled at build time)
[2026-05-09 10:23:49,635 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38752
[2026-05-09 10:23:49,635 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38756
[2026-05-09 10:23:49,635 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38754
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'

[2026-05-09 10:23:49,665 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38751
[2026-05-09 10:23:49,666 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38750
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
CUDA vendor backend skipped (CUDA build was disabled at build time)
CUDA vendor backend skipped (CUDA build was disabled at build time)
CUDA vendor backend skipped (CUDA build was disabled at build time)
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:387: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.gcu and torch.nn.Module.gcu now..
The backend in torch.distributed.init_process_group set to eccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.gcu.* and torch.gcu.amp.* now..
The device parameters have been replaced with gcu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************

warnings.warn(msg, ImportWarning)
[2026-05-09 10:23:49,684 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38755
[2026-05-09 10:23:49,685 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38757
[2026-05-09 10:23:49,685 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 38753
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-09 10:23:49,699 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,699 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,704 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,704 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,705 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,712 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,712 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,712 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,712 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,712 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,712 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,712 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,712 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,712 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,712 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,716 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,716 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,717 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,717 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,717 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,717 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,717 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,717 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,717 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,718 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,718 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,718 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,718 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,718 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,718 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,729 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,729 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,729 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,742 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,742 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,742 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,742 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,742 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,742 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-09 10:23:49,742 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-09 10:23:49,743 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-09 10:23:49,743 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-09 10:23:49,743 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
⚙️ Running in WANDB offline mode
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
using world size: 8, data-parallel size: 8, context-parallel size: 1, hierarchical context-parallel sizes: None, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
Number of virtual stages per pipeline stage: None
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/training/utils.py:409: UserWarning: Disabling sequence parallelism because tensor model parallelism is disabled
warnings.warn(message)
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. True
activation_func_clamp_value ..................... None
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
allow_ambiguous_pad_tokens ...................... False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... False
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.flash
attention_dropout ............................... 0.0
attention_output_gate ........................... False
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... True
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
cache_mla_latents ............................... False
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. False
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
cp_comm_type .................................... ['p2p']
cpu_offloading_num_layers ....................... 0
create_attention_mask_in_dataloader ............. False
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_impl ................................. none
cuda_graph_scope ................................ []
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. /topsmodels/data-llm/data_cache
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 8
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... False
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
ddp_reduce_scatter_with_fp32_accumulation ....... False
decode_only_cuda_graphs ......................... False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
delay_wgrad_compute ............................. False
deprecated_use_mcore_models ..................... True
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_bf16_reduced_precision_matmul ........... False
disable_chunked_prefill ......................... False
disable_jit_fuser ............................... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
disable_symmetric_registration .................. False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_optim_fully_reshardable ............... False
dist_ckpt_save_pre_mcore_014 .................... False
dist_ckpt_strictness ............................ log_all
distrib_optim_fully_reshardable_mem_efficient ... False
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 60
distributed_timeout_seconds_after_init .......... None
dsa_indexer_head_dim ............................ None
dsa_indexer_loss_coeff .......................... 0.0
dsa_indexer_n_heads ............................. None
dsa_indexer_topk ................................ None
dsa_indexer_use_sparse_loss ..................... False
dump_param_to_param_group_map ................... None
embedding_init_method_std ....................... None
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_cuda_graph ............................... False
enable_experimental ............................. True
enable_ft_package ............................... False
enable_full_sharding_in_hsdp .................... False
enable_gloo_process_groups ...................... True
enable_gpt_oss .................................. False
enable_msc ...................................... True
enable_one_logger ............................... True
encoder_num_layers .............................. 2
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
ep_overlap_early_attn_memory_release ............ False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 20000
eval_iters ...................................... 0
evidence_data_path .............................. None
exit_duration_in_mins ........................... 220
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal ..................................... 15
exit_signal_handler ............................. False
exit_signal_handler_for_dataloader .............. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
experimental_attention_variant .................. None
expert_model_parallel_size ...................... 8
expert_tensor_parallel_size ..................... 1
export_force_local_attention .................... False
export_kd_cfg ................................... None
export_kd_teacher_ckpt_format ................... None
export_kd_teacher_load .......................... None
export_kv_cache_quant ........................... False
export_legacy_megatron .......................... False
export_model_type ............................... GPTModel
export_moe_apply_probs_on_input ................. False
export_offline_model ............................ False
export_qk_l2_norm ............................... False
export_quant_cfg ................................ None
export_real_quant_cfg ........................... None
export_te_mcore_model ........................... False
external_cuda_graph ............................. False
fake_process_group .............................. False
fallback_to_eager_attn .......................... False
ffn_hidden_size ................................. 12288
fim_data ........................................ False
fim_eod_token ................................... <|endoftext|>
fim_fragment_rate ............................... None
fim_middle_token ................................ <fim_middle>
fim_no_prefix ................................... None
fim_pad_token ................................... <fim_pad>
fim_prefix_token ................................ <fim_prefix>
fim_rate ........................................ 0.5
fim_split_sample ................................ None
fim_spm_rate .................................... 0.5
fim_suffix_token ................................ <fim_suffix>
fine_grained_activation_offloading .............. False
finetune ........................................ False
finetune_data_split ............................. train
finetune_hf_dataset ............................. None
first_last_layers_bf16 .......................... False
flash_decode .................................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp4 ............................................. None
fp4_param ....................................... False
fp4_quantizer_factory ........................... None
fp4_recipe ...................................... nvfp4
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_quantizer_factory ........................... None
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
fsdp_double_buffer .............................. False
full_validation ................................. False
global_batch_size ............................... 16
glu_linear_offset ............................... 0.0
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... False
gradient_reduce_div_fusion ...................... True
group_query_attention ........................... True
grpo_clamp_eps_lower ............................ 0.01
grpo_clamp_eps_upper ............................ 0.01
grpo_default_temperature ........................ 1.0
grpo_default_top_p .............................. 0
grpo_entropy_term_weight ........................ 0.0
grpo_filter_groups_with_same_reward ............. False
grpo_group_size ................................. 2
grpo_iterations ................................. 2
grpo_kl_beta .................................... 0.001
grpo_prompts_per_step ........................... 32
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hierarchical_context_parallel_sizes ............. None
high_priority_stream_groups ..................... []
hybrid_attention_ratio .......................... 0.0
hybrid_context_parallel ......................... False
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_coordinator_port ...................... 12346
inference_dynamic_batching ...................... False
inference_dynamic_batching_block_size ........... 256
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_cuda_graph_max_tokens 1024
inference_dynamic_batching_cuda_graph_mixed_prefill_count 16
inference_dynamic_batching_max_tokens ........... None
inference_dynamic_batching_num_cuda_graphs ...... 16
inference_dynamic_batching_track_paused_request_events False
inference_dynamic_batching_unified_memory_level . 1
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
inference_wandb_logging_step_interval ........... 0
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
inprocess_active_world_size ..................... 8
inprocess_barrier_timeout ....................... 120
inprocess_completion_timeout .................... 120
inprocess_empty_cuda_cache ...................... False
inprocess_granularity ........................... node
inprocess_hard_timeout .......................... 90
inprocess_heartbeat_interval .................... 30
inprocess_heartbeat_timeout ..................... 60
inprocess_last_call_wait ........................ 1
inprocess_max_iterations ........................ None
inprocess_monitor_process_interval .............. 1.0
inprocess_monitor_thread_interval ............... 1.0
inprocess_progress_watchdog_interval ............ 1.0
inprocess_restart ............................... False
inprocess_soft_timeout .......................... 60
inprocess_termination_grace_time ................ 1
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iterations_to_skip .............................. []
keep_fp8_transpose_cache ........................ False
kitchen_config_file ............................. None
kitchen_recipe_number ........................... None
kv_channels ..................................... 128
kv_lora_rank .................................... 32
langrl_env_config ............................... None
langrl_external_server .......................... False
langrl_inference_server_conversation_template ... None
langrl_inference_server_type .................... inplace_megatron
lazy_mpu_init ................................... None
legacy_tokenizer ................................ False
linear_attention_freq ........................... None
linear_attention_type ........................... None
linear_conv_kernel_dim .......................... 4
linear_key_head_dim ............................. 128
linear_num_key_heads ............................ 16
linear_num_value_heads .......................... 32
linear_value_head_dim ........................... 128
load ............................................ None
load_main_params_from_ckpt ...................... None
local_rank ...................................... 0
log_energy ...................................... False
log_interval .................................... 1
log_loss_scale_to_tensorboard ................... True
log_max_attention_logit ......................... False
log_memory_to_tensorboard ....................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... True
log_validation_ppl_to_tensorboard ............... True
log_world_size_to_tensorboard ................... False
logging_level ................................... 40
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3.9e-06
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 3.9e-07
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 1187
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... True
manual_gc_eval .................................. True
manual_gc_interval .............................. 10
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 40960
max_seqlen_per_cp_rank .......................... None
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... None
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 3.9e-07
min_offloaded_tensor_size ....................... 1048576
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. False
mock_data ....................................... False
modelopt_enabled ................................ False
moe_apply_probs_on_input ........................ False
moe_aux_loss_coeff .............................. 0.001
moe_deepep_num_sms .............................. 20
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. 1536
moe_flex_dispatcher_backend ..................... deepep
moe_grouped_gemm ................................ False
moe_hybridep_num_sms ............................ 16
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. [1, 1]
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_pad_experts_for_cuda_graph_inference ........ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ fp32
moe_router_enable_expert_bias ................... False
moe_router_force_load_balancing ................. True
moe_router_fusion ............................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_padding_for_fp8 ...................... False
moe_router_padding_for_quantization ............. False
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 8
moe_router_topk_scaling_factor .................. None
moe_shared_expert_gate .......................... False
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... alltoall
moe_token_drop_policy ........................... probs
moe_upcycling_granularity ....................... 1
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 1.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
multiple_validation_sets ........................ False
muon_extra_scale_factor ......................... 1.0
muon_fp32_matmul_prec ........................... medium
muon_momentum ................................... 0.9
muon_num_ns_steps ............................... 5
muon_scale_mode ................................. spectral
muon_split_qkv .................................. True
muon_tp_mode .................................... blockwise
muon_use_nesterov ............................... False
nccl_all_reduce_for_prefill ..................... False
nccl_communicator_config_path ................... None
nccl_ub ......................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_rope_freq .................................... None
no_save_optim ................................... True
no_save_rng ..................................... None
no_weight_decay_cond_type ....................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-06
normalization ................................... RMSNorm
num_attention_heads ............................. 64
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... 32
num_layers ...................................... 2
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 4
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 6
object_storage_cache_path ....................... None
offload_modules ................................. []
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_grad_reduce ............................. True
overlap_moe_expert_parallel_comm ................ False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ True
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... None
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
per_split_data_args_path ........................ None
perform_initialization .......................... True
perform_rl_step ................................. False
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_layout .................. None
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_clip ......................................... False
qk_clip_alpha ................................... 0.5
qk_clip_threshold ............................... 100
qk_head_dim ..................................... 128
qk_l2_norm ...................................... False
qk_layernorm .................................... True
qk_pos_emb_head_dim ............................. 64
query_in_block_prob ............................. 0.1
quick_geglu ..................................... False
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
rerun_mode ...................................... validate_results
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_grad_buf_for_mxfp8_param_ag ............... False
rl_calculate_intra_group_similarity ............. False
rl_importance_sampling_truncation_coef .......... None
rl_inference_logprobs_is_correction ............. False
rl_offload_kv_cache_during_training ............. False
rl_offload_optimizer_during_inference ........... False
rl_partial_rollouts ............................. False
rl_prompts_per_eval ............................. 32
rl_remove_kv_cache_during_training .............. False
rl_reset_cuda_graphs ............................ False
rl_sequence_packing_algo ........................ fifo
rl_sequence_packing_bin_size .................... 8192
rl_use_sequence_packing ......................... False
rope_scaling_factor ............................. 8.0
rope_type ....................................... None
rotary_base ..................................... 1000000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. 1
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... 20000
save_retain_interval ............................ None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... False
sft ............................................. False
sft_tokenizer_prompt_format ..................... nemotron-h-aligned
sgd_momentum .................................... 0.9
sharp_enabled_group ............................. None
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
softmax_type .................................... vanilla
spec ............................................ None
split ........................................... None
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
strict_fsdp_dtensor_load ........................ True
suggested_communication_unit_size ............... None
swiglu .......................................... True
swin_backbone_type .............................. tiny
symmetric_ar_type ............................... None
te_rng_tracker .................................. True
teacher_model_config ............................ None
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. /datasets/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. ['/topsmodels/data-llm/c4_part/processed-gpt/c4_part_validation-gpt_text_document']
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_metadata .............................. None
tokenizer_model ................................. /topsmodels/data-llm/Qwen3-235B-A22B-Instruct-2507-FP8
tokenizer_type .................................. HuggingFaceTokenizer
torch_fsdp2_reshard_after_forward ............... True
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. ['/topsmodels/data-llm/c4_part/processed-gpt/c4_part_train-gpt_text_document']
train_iters ..................................... 4
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
trust_remote_code ............................... False
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_weighted_squared_relu ................. False
use_legacy_models ............................... False
use_legacy_static_engine ........................ False
use_megatron_fsdp ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_sharp ....................................... False
use_te_activation_func .......................... False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. ['/topsmodels/data-llm/c4_part/processed-gpt/c4_part_validation-gpt_text_document']
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity ....................................
wandb_exp_name .................................. qwen3-235b-v0.15
wandb_project ................................... qwen3-235b-benchmarking-v0.15
wandb_save_dir .................................. /datasets/checkpoints
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
window_attn_skip_freq ........................... None
window_size ..................................... None
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 2

building HuggingFaceTokenizer tokenizer ...
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
setting tensorboard ...
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
padded vocab (size: 151669) with 267 dummy tokens (new size: 151936)
INFO:megatron.training.initialize:Setting logging level to 40
initializing torch distributed ...
/usr/local/lib/python3.12/dist-packages/eventlet/support/greenlets.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
preserves_excinfo = (distutils.version.LooseVersion(greenlet.version)
wandb: Tracking run with wandb version 0.24.0
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /datasets/checkpoints/wandb/offline-run-20260509_102414-fffxft76
/usr/local/lib/python3.12/dist-packages/wandb/util.py:1967: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
yield InstalledDistribution(key=d.metadata["Name"], version=d.version)
/usr/lib/python3.12/importlib/metadata/__init__.py:467: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
return self.metadata['Version']
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:195: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
self.FromDatetime(datetime.datetime.utcnow())

INFO:megatron.training.initialize:Setting logging level to 40
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0

initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'

done with dataset index builder. Compilation time: 0.155 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
compiling and loading fused kernels ...
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
warnings.warn( # warn only once
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
done with compiling and loading fused kernels. Compilation time: 0.548 seconds
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:149: UserWarning: GCU not support Double use Float replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:24.)
return fn(*args, **kwargs)
time to initialize megatron (seconds): -47.838
[after megatron is initialized] datetime: 2026-05-09 10:24:16
building GPT model ...
[2026-05-09 10:24:16,268 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,269 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,269 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
[2026-05-09 10:24:16,270 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/extensions/transformer_engine.py:227: DeprecationWarning: hidden_size arg has been renamed to normalized_shape for compatibility with torch.nn.LayerNorm.
instance = te.pytorch.RMSNorm(
[2026-05-09 10:24:16,296 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,296 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,296 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,296 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,296 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,296 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,296 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,297 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,297 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
[2026-05-09 10:24:16,297 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
[2026-05-09 10:24:16,297 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:147: UserWarning: GCU not support Long use Int replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:28.)
return fn(*args, **kwargs)
[2026-05-09 10:24:16,297 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,298 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,298 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
[2026-05-09 10:24:16,298 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,298 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,298 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,298 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,298 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_flash_attention_class': cannot import name 'FlashAttentionENFLAME' from 'transformer_engine.plugin.core.backends.vendor.enflame.flash_attention' (/usr/local/lib/python3.12/dist-packages/transformer_engine/plugin/core/backends/vendor/enflame/flash_attention.py)
[2026-05-09 10:24:16,299 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-09 10:24:16,308 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-09 10:24:16,308 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[... identical 'get_flash_attention_class' / 'get_cudnn_version' fallback warnings and INFO lines repeated ...]
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1538544128
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:147: UserWarning: GCU not support Long use Int replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:28.)
return fn(*args, **kwargs)
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-05-09 10:24:16
building train, validation, and test datasets ...
train: 64
validation: 0
test: 0
building train, validation, and test datasets for GPT ...
finished creating GPT datasets ...
/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py:659: UserWarning: pin_memory_device is deprecated, the current accelerator will be used as the device,ignore pin_memory_device='gcu'.
warnings.warn(
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=38756) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
[... identical fork() DeprecationWarnings repeated for the other worker pids ...]
[2026-05-09 10:24:17,164 TE-FL manager.py:70 DEBUG] OpManager reset after fork
[... 'OpManager reset after fork' DEBUG lines repeated across the forked workers (10:24:17,164 through 10:24:17,605) ...]
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[... identical Tensor.pin_memory() DeprecationWarnings repeated across workers ...]
[after dataloaders are built] datetime: 2026-05-09 10:24:17
done with setup ...
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-05-09 10:24:17
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
[2026-05-09 10:24:18,075 TE-FL manager.py:439 INFO] Op 'rmsnorm_fwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:18,078 TE-FL manager.py:439 INFO] Op 'rmsnorm_fwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:18,080 TE-FL manager.py:439 INFO] Op 'generic_gemm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:18,083 TE-FL manager.py:439 INFO] Op 'generic_gemm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
DEBUG:DotProductAttention:Running with config={'transformer_engine_version': '2.9.0', 'flash_attn_version': '2.7.2+torch.2.9.1.gcu.3.4.20260420', 'flash_attn_3_version': 'not installed', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'sbhd_sbhd_sbhd', 'batch_size': 1, 'num_heads': 64, 'num_gqa_groups': 4, 'max_seqlen_q': 4096, 'max_seqlen_kv': 4096, 'head_dim_qk': 128, 'head_dim_v': 128, 'attn_mask_type': 'causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'cp_comm_type': 'p2p', 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None, 'softmax_type': 'vanilla', 'return_max_logit': False}
DEBUG:DotProductAttention:Disabling FusedAttention due to NVTE_FUSED_ATTN=0
DEBUG:DotProductAttention:Disabling UnfusedDotProductAttention due to NVTE_UNFUSED_ATTN=0
DEBUG:DotProductAttention:Available backends = {FlashAttention=True (2.7.2+torch.2.9.1.gcu.3.4.20260420), FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG:DotProductAttention:Selected backend = FlashAttention (2.7.2+torch.2.9.1.gcu.3.4.20260420)
[2026-05-09 10:24:18,109 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
INFO:DotProductAttention:Running with FlashAttention backend (version 2.7.2+torch.2.9.1.gcu.3.4.20260420)
[... repeated log blocks elided: 'rmsnorm_fwd', 'generic_gemm', and 'get_attention_backend' resolving to 'vendor.enflame', plus identical DotProductAttention config dumps selecting the FlashAttention backend (2.7.2+torch.2.9.1.gcu.3.4.20260420) ...]
DEBUG:DotProductAttention:Selected backend = FlashAttention (2.7.2+torch.2.9.1.gcu.3.4.20260420)
[2026-05-09 10:24:18,164 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
INFO:DotProductAttention:Running with FlashAttention backend (version 2.7.2+torch.2.9.1.gcu.3.4.20260420)
DEBUG:DotProductAttention:Running with config={'transformer_engine_version': '2.9.0', 'flash_attn_version': '2.7.2+torch.2.9.1.gcu.3.4.20260420', 'flash_attn_3_version': 'not installed', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'sbhd_sbhd_sbhd', 'batch_size': 1, 'num_heads': 64, 'num_gqa_groups': 4, 'max_seqlen_q': 4096, 'max_seqlen_kv': 4096, 'head_dim_qk': 128, 'head_dim_v': 128, 'attn_mask_type': 'causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'cp_comm_type': 'p2p', 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None, 'softmax_type': 'vanilla', 'return_max_logit': False}
DEBUG:DotProductAttention:Disabling FusedAttention due to NVTE_FUSED_ATTN=0
DEBUG:DotProductAttention:Disabling UnfusedDotProductAttention due to NVTE_UNFUSED_ATTN=0
DEBUG:DotProductAttention:Available backends = {FlashAttention=True (2.7.2+torch.2.9.1.gcu.3.4.20260420), FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG:DotProductAttention:Selected backend = FlashAttention (2.7.2+torch.2.9.1.gcu.3.4.20260420)
[2026-05-09 10:24:18,165 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
INFO:DotProductAttention:Running with FlashAttention backend (version 2.7.2+torch.2.9.1.gcu.3.4.20260420)
DEBUG:DotProductAttention:Running with config={'transformer_engine_version': '2.9.0', 'flash_attn_version': '2.7.2+torch.2.9.1.gcu.3.4.20260420', 'flash_attn_3_version': 'not installed', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'sbhd_sbhd_sbhd', 'batch_size': 1, 'num_heads': 64, 'num_gqa_groups': 4, 'max_seqlen_q': 4096, 'max_seqlen_kv': 4096, 'head_dim_qk': 128, 'head_dim_v': 128, 'attn_mask_type': 'causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'cp_comm_type': 'p2p', 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None, 'softmax_type': 'vanilla', 'return_max_logit': False}
DEBUG:DotProductAttention:Disabling FusedAttention due to NVTE_FUSED_ATTN=0
DEBUG:DotProductAttention:Disabling UnfusedDotProductAttention due to NVTE_UNFUSED_ATTN=0
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 3ba6f82 && git submodule update --init && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
DEBUG:DotProductAttention:Available backends = {FlashAttention=True (2.7.2+torch.2.9.1.gcu.3.4.20260420), FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG:DotProductAttention:Selected backend = FlashAttention (2.7.2+torch.2.9.1.gcu.3.4.20260420)
[2026-05-09 10:24:18,166 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
INFO:DotProductAttention:Running with FlashAttention backend (version 2.7.2+torch.2.9.1.gcu.3.4.20260420)
[2026-05-09 10:24:19,090 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,090 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,093 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,094 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,094 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,094 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,094 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,101 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/pipeline_parallel/schedules.py:183: UserWarning: [GCU_DETERMINISTIC] _scaled_dot_product_efficient_attention_backward: Setting deterministic mode to 0 (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/aten/aot_ops/gcu_attention.cpp:250.)
Variable._execution_engine.run_backward(
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,817 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,818 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:19,818 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,099 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,106 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,106 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,106 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,107 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,110 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,115 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,116 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-09 10:24:20,120 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
Number of parameters in transformer block in billions: 1.35
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 2.60
Number of parameters in most loaded shard in billions: 2.5953
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 544.0 MB
Theoretical memory footprints: weight and optimizer=18562.79 MB, activation=2439.18 MB, total=21001.98 MB

[2026-05-09 10:24:20] iteration 1/ 4 | consumed samples: 16 | elapsed time per iteration (ms): 2800.2 | throughput per GPU (TFLOP/s/GPU): 19.9 | learning rate: 3.385972E-06 | global batch size: 16 | lm loss: 1.280801E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.105 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 1 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 16309.083984375 | reserved: 20966.0 | max reserved: 20966.0 | device usage: 27060.37890625
Number of parameters in transformer block in billions: 1.35
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 2.60
Number of parameters in most loaded shard in billions: 2.5953
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 544.0 MB
Theoretical memory footprints: weight and optimizer=18562.79 MB, activation=2439.18 MB, total=21001.98 MB

[2026-05-09 10:24:21] iteration 2/ 4 | consumed samples: 32 | elapsed time per iteration (ms): 609.1 | throughput per GPU (TFLOP/s/GPU): 91.5 | learning rate: 2.145000E-06 | global batch size: 16 | lm loss: 1.272279E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.646 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 2 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 18778.93603515625 | reserved: 23344.0 | max reserved: 23344.0 | device usage: 29440.37890625
[2026-05-09 10:24:21] iteration 3/ 4 | consumed samples: 48 | elapsed time per iteration (ms): 606.7 | throughput per GPU (TFLOP/s/GPU): 91.9 | learning rate: 9.040276E-07 | global batch size: 16 | lm loss: 1.269746E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 14.716 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-05-09 10:24:22] iteration 4/ 4 | consumed samples: 64 | elapsed time per iteration (ms): 603.9 | throughput per GPU (TFLOP/s/GPU): 92.3 | learning rate: 3.900000E-07 | global batch size: 16 | lm loss: 1.250517E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.292 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-05-09 10:24:22
wandb: training_time_layer_fwd 0
wandb: training_time_layer_bwd 0
wandb: tokens_per_sec_per_device 13501.876904069928
wandb: MFU% 29.44538503615737
wandb:
wandb: Run history:
wandb: batch-size ▁▁▁▁
wandb: grad-norm ▄█▁▅
wandb: iteration-time █▁▁▁
wandb: learning-rate █▅▂▁
wandb: lm loss █▆▅▁
wandb: load_balancing_loss ▁▁▁▁
wandb: loss-scale ▁▁▁▁
wandb: samples vs steps ▁▃▆█
wandb: throughput ▁███
wandb:
wandb: Run summary:
wandb: batch-size 16
wandb: grad-norm 15.29164
wandb: iteration-time 0.60393
wandb: learning-rate 0.0
wandb: lm loss 12.50517
wandb: load_balancing_loss 1.00034
wandb: loss-scale 1
wandb: samples vs steps 64
wandb: throughput 92.29521
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /datasets/checkpoints/wandb/offline-run-20260509_102414-fffxft76
wandb: Find logs at: /datasets/checkpoints/wandb/offline-run-20260509_102414-fffxft76/logs
test_pass

`
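The selection trace in the log above (FusedAttention and UnfusedDotProductAttention disabled via `NVTE_FUSED_ATTN=0` / `NVTE_UNFUSED_ATTN=0`, FlashAttention chosen) amounts to an env-gated priority list. A minimal sketch of that pattern, not the actual TE dispatch code; the function names and the exact gating logic here are illustrative:

```python
import os

def _enabled(env_var: str, default: str = "1") -> bool:
    # A backend counts as available unless its env var is explicitly "0".
    return os.getenv(env_var, default) != "0"

def select_attention_backend() -> str:
    # Candidate backends in priority order, each gated by an env var,
    # mirroring the names seen in the DotProductAttention debug log.
    candidates = [
        ("FlashAttention", "NVTE_FLASH_ATTN"),
        ("FusedAttention", "NVTE_FUSED_ATTN"),
        ("UnfusedDotProductAttention", "NVTE_UNFUSED_ATTN"),
    ]
    for name, var in candidates:
        if _enabled(var):
            return name
    raise RuntimeError("No attention backend available")

# Reproduce the configuration from the log: fused and unfused disabled.
os.environ["NVTE_FUSED_ATTN"] = "0"
os.environ["NVTE_UNFUSED_ATTN"] = "0"
print(select_attention_backend())  # -> FlashAttention
```

With both alternatives disabled, FlashAttention is the only remaining candidate, which is why every rank in the log settles on the same backend.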

@gongxijun gongxijun marked this pull request as draft May 11, 2026 06:18
@gongxijun gongxijun marked this pull request as ready for review May 11, 2026 06:30
@lxd-cumt
Collaborator

install and run pre-commit to pass ci format_check

@gongxijun gongxijun force-pushed the main branch 2 times, most recently from 8aea577 to 6574fd0 Compare May 11, 2026 06:51
from transformer_engine.plugin.core.ops import FlashAttentionBase


class FlashAttentionMETAX(FlashAttentionBase):
Collaborator

not METAX?

Author

fix done

Author

new running log:
`
[2026-05-11 08:59:19,336 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18699
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,339 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18702
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,343 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18701
[2026-05-11 08:59:19,343 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18697
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
CUDA vendor backend skipped (CUDA build was disabled at build time)
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:387: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.gcu and torch.nn.Module.gcu now..
The backend in torch.distributed.init_process_group set to eccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.gcu.* and torch.gcu.amp.* now..
The device parameters have been replaced with gcu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************

warnings.warn(msg, ImportWarning)
[2026-05-11 08:59:19,355 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18704
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,357 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,360 TE-FL manager.py:99 DEBUG] Initializing OpManager in PID 18700
[WARNING] Failed to register FlagOS operators: No module named 'flag_gems'
[2026-05-11 08:59:19,375 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,375 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,376 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,376 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,376 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,414 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,414 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,414 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,414 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,414 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,416 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,417 TE-FL discovery.py:24 DEBUG] Starting plugin discovery...
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,427 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,427 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,427 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,428 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,428 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,428 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,428 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,428 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,428 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,428 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,428 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,428 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,428 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,428 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,428 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,428 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,430 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,430 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,430 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,430 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,430 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
[2026-05-11 08:59:19,439 TE-FL discovery.py:24 DEBUG] No entry points found for group: te_fl.plugin
[2026-05-11 08:59:19,439 TE-FL discovery.py:24 DEBUG] Plugin discovery complete. Loaded 0 plugin.
[2026-05-11 08:59:19,439 TE-FL manager.py:122 INFO] OpManager initialized: 110 ops with 164 implementations
[2026-05-11 08:59:19,439 TE-FL manager.py:146 DEBUG] Vendor: 110, Default: 0, Reference: 54
[2026-05-11 08:59:19,440 TE-FL manager.py:155 INFO] Registered impl_ids: ['reference.torch', 'vendor.enflame']
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 2

building HuggingFaceTokenizer tokenizer ...
WARNING:megatron.core.datasets.megatron_tokenizer:You’re using the legacy tokenizer system, which is deprecated and will be removed in a future release. Please migrate to the new tokenizer system (megatron.core.tokenizers.MegatronTokenizer).
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
INFO:megatron.training.initialize:Setting logging level to 40
setting tensorboard ...
INFO:megatron.training.initialize:Setting logging level to 40
padded vocab (size: 151669) with 267 dummy tokens (new size: 151936)
INFO:megatron.training.initialize:Setting logging level to 40
initializing torch distributed ...
/usr/local/lib/python3.12/dist-packages/eventlet/support/greenlets.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
preserves_excinfo = (distutils.version.LooseVersion(greenlet.__version__)
wandb: Tracking run with wandb version 0.24.0
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto
/usr/local/lib/python3.12/dist-packages/wandb/util.py:1967: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
yield InstalledDistribution(key=d.metadata["Name"], version=d.version)
/usr/lib/python3.12/importlib/metadata/__init__.py:467: DeprecationWarning: Implicit None on return values is deprecated and will raise KeyErrors.
return self.metadata['Version']
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
/usr/local/lib/python3.12/dist-packages/google/protobuf/internal/well_known_types.py:195: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
self.FromDatetime(datetime.datetime.utcnow())

INFO:megatron.training.initialize:Setting logging level to 40
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0

initialized tensor model parallel with size 1
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/datasets'

done with dataset index builder. Compilation time: 0.162 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
compiling and loading fused kernels ...
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
warnings.warn( # warn only once
ECCL version 3.6.3.9 + compiled with TopsPlatform 1.7.2.21
done with compiling and loading fused kernels. Compilation time: 0.675 seconds
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:149: UserWarning: GCU not support Double use Float replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:24.)
return fn(*args, **kwargs)
time to initialize megatron (seconds): 3.901
[after megatron is initialized] datetime: 2026-05-11 08:59:47
building GPT model ...
[2026-05-11 08:59:48,030 TE-FL manager.py:460 WARNING] Implementation 'vendor.enflame' failed for op 'get_cudnn_version': module 'migration.patches.transformer_engine.v2_9_0' has no attribute 'get_cudnn_version'
[2026-05-11 08:59:48,031 TE-FL manager.py:439 INFO] Op 'get_cudnn_version' using 'reference.torch' (kind=reference, vendor=None)
[2026-05-11 08:59:48,031 TE-FL manager.py:439 INFO] Op 'get_flash_attention_class' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
/home/xijun.gong/topslm/megatron_repos/Megatron-LM_dev_2b1fc7/megatron/core/extensions/transformer_engine.py:227: DeprecationWarning: hidden_size arg has been renamed to normalized_shape for compatibility with torch.nn.LayerNorm.
instance = te.pytorch.RMSNorm(
/usr/local/lib/python3.12/dist-packages/torch_gcu/transfer_to_gcu.py:147: UserWarning: GCU not support Long use Int replace, maybe lead to unexpected overflow issues. (Triggered internally at /builds/workspace/torch_gcu/torch_gcu/csrc/gcu/gcu_utils.h:28.)
return fn(*args, **kwargs)
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1538544128
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-05-11 08:59:48
building train, validation, and test datasets ...
train: 64
validation: 0
test: 0
building train, validation, and test datasets for GPT ...
finished creating GPT datasets ...
/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py:659: UserWarning: pin_memory_device is deprecated, the current accelerator will be used as the device,ignore pin_memory_device='gcu'.
warnings.warn(
/usr/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=18698) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
[2026-05-11 08:59:49,085 TE-FL manager.py:70 DEBUG] OpManager reset after fork
/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/pin_memory.py:57: DeprecationWarning: The argument 'device' of Tensor.pin_memory() is deprecated. Please do not pass this argument. (Triggered internally at /pytorch/aten/src/ATen/native/Memory.cpp:46.)
return data.pin_memory(device)
[after dataloaders are built] datetime: 2026-05-11 08:59:49
done with setup ...
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-05-11 08:59:49
[2026-05-11 08:59:50,005 TE-FL manager.py:439 INFO] Op 'rmsnorm_fwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:50,009 TE-FL manager.py:439 INFO] Op 'generic_gemm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:50,034 TE-FL manager.py:439 INFO] Op 'get_attention_backend' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:51,287 TE-FL manager.py:439 INFO] Op 'rmsnorm_bwd' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,061 TE-FL manager.py:439 INFO] Op 'multi_tensor_l2norm' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,324 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,325 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,325 TE-FL manager.py:439 INFO] Op 'multi_tensor_scale' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,331 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,331 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,331 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,332 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,332 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,332 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,332 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
[2026-05-11 08:59:52,332 TE-FL manager.py:439 INFO] Op 'multi_tensor_adam' using 'vendor.enflame' (kind=vendor, vendor=ENFLAME)
Number of parameters in transformer block in billions: 1.35
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 2.60
Number of parameters in most loaded shard in billions: 2.5953
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 544.0 MB
Theoretical memory footprints: weight and optimizer=18562.79 MB, activation=2439.18 MB, total=21001.98 MB

[2026-05-11 08:59:52] iteration 1/ 4 | consumed samples: 16 | elapsed time per iteration (ms): 3134.6 | throughput per GPU (TFLOP/s/GPU): 17.8 | learning rate: 3.385972E-06 | global batch size: 16 | lm loss: 1.280801E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.105 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 1 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 16005.3583984375 | reserved: 20646.0 | max reserved: 20646.0 | device usage: 26724.37890625
Number of parameters in transformer block in billions: 1.35
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 2.60
Number of parameters in most loaded shard in billions: 2.5953
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 544.0 MB
Theoretical memory footprints: weight and optimizer=18562.79 MB, activation=2439.18 MB, total=21001.98 MB

[2026-05-11 08:59:53] iteration 2/ 4 | consumed samples: 32 | elapsed time per iteration (ms): 564.8 | throughput per GPU (TFLOP/s/GPU): 98.7 | learning rate: 2.145000E-06 | global batch size: 16 | lm loss: 1.272276E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.646 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 2 iterations) memory (MB) | allocated: 12524.95361328125 | max allocated: 18474.93505859375 | reserved: 23024.0 | max reserved: 23024.0 | device usage: 29104.37890625
[2026-05-11 08:59:53] iteration 3/ 4 | consumed samples: 48 | elapsed time per iteration (ms): 562.9 | throughput per GPU (TFLOP/s/GPU): 99.0 | learning rate: 9.040276E-07 | global batch size: 16 | lm loss: 1.269750E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 14.716 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-05-11 08:59:54] iteration 4/ 4 | consumed samples: 64 | elapsed time per iteration (ms): 558.7 | throughput per GPU (TFLOP/s/GPU): 99.8 | learning rate: 3.900000E-07 | global batch size: 16 | lm loss: 1.250520E+01 | load_balancing_loss: 1.000339E+00 | loss scale: 1.0 | grad norm: 15.292 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-05-11 08:59:54
wandb: training_time_layer_fwd 0
wandb: training_time_layer_bwd 0
wandb: tokens_per_sec_per_device 14552.366627560767
wandb: MFU% 31.736331295294892
wandb:
wandb: Run history:
wandb: batch-size ▁▁▁▁
wandb: grad-norm ▄█▁▅
wandb: iteration-time █▁▁▁
wandb: learning-rate █▅▂▁
wandb: lm loss █▆▅▁
wandb: load_balancing_loss ▁▁▁▁
wandb: loss-scale ▁▁▁▁
wandb: samples vs steps ▁▃▆█
wandb: throughput ▁███
wandb:
wandb: Run summary:
wandb: batch-size 16
wandb: grad-norm 15.29184
wandb: iteration-time 0.55875
wandb: learning-rate 0.0
wandb: lm loss 12.5052
wandb: load_balancing_loss 1.00034
wandb: loss-scale 1
wandb: samples vs steps 64
wandb: throughput 99.75928
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto
wandb: Find logs at: /datasets/checkpoints/wandb/offline-run-20260511_085946-0t1qgbto/logs
test_pass

`

from ....ops import *

def _get_tex():
    from migration.patches.transformer_engine import v2_9_0
Collaborator

Need to add `ensure_enflame_libs` to guarantee the imports resolve in the environment, or alternatively wrap them in try/except handling.

Author

fix done

    return v2_9_0

def _check_enflame_available() -> bool:
    from torch_gcu import transfer_to_gcu
Collaborator

`torch_gcu` is not a generally available package; please wrap the import in try/except.

Author

fix done

@lxd-cumt
Collaborator

lxd-cumt commented May 11, 2026

[Screenshot: 2026-05-11 14:53:08]

The `get_flash_attn_class` op failed because it imports `FlashAttentionENFLAME` while the class is defined as `FlashAttentionMETAX`; this needs a fix.

@gongxijun
Author

> [Screenshot: 2026-05-11 14:53:08] The `get_flash_attn_class` op failed because it imports `FlashAttentionENFLAME` while the class is defined as `FlashAttentionMETAX`; this needs a fix.

fix done
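The fix amounts to keeping one canonical class name on both sides of the registry lookup. A minimal sketch of the corrected shape (the class body is illustrative, not the actual implementation):

```python
class FlashAttentionENFLAME:
    """Placeholder for the Enflame flash-attention wrapper.

    Previously the class was declared as FlashAttentionMETAX while the
    registry imported FlashAttentionENFLAME, so the lookup raised
    ImportError at registration time.
    """
    backend = "ENFLAME"


def get_flash_attn_class():
    # Definition and import site now agree on the same identifier.
    return FlashAttentionENFLAME
```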

Collaborator

@lxd-cumt lxd-cumt left a comment

LGTM

Collaborator

@zhaoyinglia zhaoyinglia left a comment

LGTM

@zhaoyinglia zhaoyinglia merged commit 38bce13 into flagos-ai:main May 12, 2026
20 of 22 checks passed