Skip to content

add continue_pretrain script#589

Merged
ftgreat merged 1 commit into
masterfrom
ftgreat-patch-3
May 11, 2026
Merged

add continue_pretrain script#589
ftgreat merged 1 commit into
masterfrom
ftgreat-patch-3

Conversation

@ftgreat
Copy link
Copy Markdown
Collaborator

@ftgreat ftgreat commented May 11, 2026


name: Pull Request
title: '[PR]'
assignees: 'BAAI-OpenPlatform,ftgreat'


Description

Please describe here what the PR does.

Checklist

  • bug fixed
  • new feature provided
  • documentation updated
  • tests added

@ftgreat ftgreat merged commit 83b8bea into master May 11, 2026
1 of 2 checks passed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new shell script, continue_pretrain_aquila_moe.sh, for pre-training Aquila MoE models using dlrover-run. The review feedback highlights several issues regarding script portability and maintainability, specifically pointing out redundant variable assignments, hardcoded user-specific and root directory paths, a hardcoded rendezvous ID, and the intrusive modification of global git configurations.

Comment on lines +58 to +60
EXPNAME=K74_BETA_Merge
EXPNAME=K74_BETA_Single
EXPNAME=K74_BETA_Single_Lin
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable EXPNAME is assigned three times consecutively. This is redundant and appears to be leftover from development; only the final assignment is effective.

Suggested change
EXPNAME=K74_BETA_Merge
EXPNAME=K74_BETA_Single
EXPNAME=K74_BETA_Single_Lin
EXPNAME=K74_BETA_Single_Lin


CMD=" \
dlrover-run \
--rdzv-id 20240410_074637.966007 \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The --rdzv-id is hardcoded to a specific timestamp. This can cause conflicts if multiple training jobs are launched or if the script is reused without modification. It is recommended to use a dynamic ID or an environment variable.

Suggested change
--rdzv-id 20240410_074637.966007 \
--rdzv-id ${RDZV_ID:-$(date +%Y%m%d_%H%M%S)} \

#ulimit -c unlimited
#ulimit -a

git config --global --add safe.directory /mnt/baai/ldwang/$FlagScale
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Modifying the global git configuration (git config --global) is an intrusive side effect that persists across sessions and affects other repositories on the system. This should be avoided in a training script.

export FORCE_ACTIVATE_WAIT=1
export MHA_USE_BLAS=ON
#export MHA_BWD_NO_ATOMIC_F64=1
export LD_LIBRARY_PATH=/root/FWD_76_BWD_79:/opt/maca/lib:/opt/maca/ompi/lib:/opt/maca/ucx/lib
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The LD_LIBRARY_PATH includes a hardcoded path in the /root directory. This makes the script non-portable and will cause errors for users who do not have access to that directory or are not running as root.

Comment on lines +52 to +56
BASE_PATH=/mnt/baai/ldwang
# TODO Datasets
source /mnt/baai/ldwang/FlagScale/examples/aquila/datasets_mx.sh
DATA_PATH=$DATASET_K74_BETA
TOKENIZER_PATH=/mnt/baai/ldwang/FlagScale/examples/aquila/1_8B/tokenizer_hf
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script contains multiple hardcoded paths specific to a user's directory (/mnt/baai/ldwang). These should be parameterized or use environment variables to make the script usable in different environments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant