Description
PySDK Version
- PySDK V2 (2.x)
- PySDK V3 (3.x)
Describe the bug
When submitting training jobs with ModelTrainer, the jobs complete successfully. These jobs ingest custom training code supplied through a SourceCode object, as well as training, validation, and config data supplied through designated channels (using InputData objects). However, when the ModelTrainer is wrapped in a HyperparameterTuner, the tuning jobs fail.
Looking at the logs, I see:
- Training jobs: SM_CHANNELS=['code', 'config', 'sm_drivers', 'train', 'validation']
- Tuning jobs: SM_CHANNELS=["config","train","validation"]
The tuning job is using the default framework container behavior instead of the ModelTrainer's custom entrypoint that runs sm_train.sh. This is evidenced by the following run commands in the logs:
- Training job: torchrun --nnodes=1 --nproc_per_node=4 train.py ....
- Tuning job: /usr/local/bin/python train.py --learning_rate ....
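Both observations above come from the job logs; the active channels can also be confirmed from inside the entry script with a couple of lines (an illustrative snippet, not part of the original train.py):

import json
import os

# The SageMaker training toolkit exposes the configured channel names
# to the entry script through the SM_CHANNELS environment variable.
channels = json.loads(os.environ.get("SM_CHANNELS", "[]"))
print(f"SM_CHANNELS={channels}")
# ModelTrainer job:        ['code', 'config', 'sm_drivers', 'train', 'validation']
# HyperparameterTuner job: ['config', 'train', 'validation']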
It appears that HyperparameterTuner._build_training_job_definition() does not carry over the source code channels (code, sm_drivers) or the container entrypoint configuration from the ModelTrainer.
To reproduce
Below is a minimal example:
import os
from pathlib import Path

from sagemaker.core.training.configs import SourceCode, Compute, InputData, OutputDataConfig
from sagemaker.core.shapes import RetryStrategy, MetricDefinition, StoppingCondition, CheckpointConfig
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer, Mode
root = str(Path.cwd().parent)
source_dir = os.path.join(root, "sagemaker")
requirements = 'requirements.txt'
entry_script = "train.py"
source_code = SourceCode(source_dir=source_dir,
                         requirements=requirements,
                         entry_script=entry_script)
instance_type = "ml.g6e.12xlarge"
instance_count = 1
volume_size_in_gb = 200
compute = Compute(instance_type=instance_type,
                  instance_count=instance_count,
                  volume_size_in_gb=volume_size_in_gb)
distributed_strategy = Torchrun()
s3_input_path = "s3 path to training data"
training_data = InputData(channel_name='train',
                          data_source=s3_input_path)
s3_input_path = "s3 path to validation data"
validation_data = InputData(channel_name='validation',
                            data_source=s3_input_path)
# Path to S3 yaml file containing training hyperparameters
config_path = "s3 path to yaml config file"
config_data = InputData(channel_name='config', data_source=config_path)
s3_output_path = "s3 output path"
output = OutputDataConfig(s3_output_path=s3_output_path,
                          compression_type='NONE')
# Define checkpoint config
checkpoint_config = CheckpointConfig(s3_uri=s3_output_path, local_path="/opt/ml/checkpoints")
# Define retry strategy
retry_strategy = RetryStrategy(maximum_retry_attempts=3)
# Define tracking metrics
metric_definitions = [
    MetricDefinition(
        name="eval_macro_f1",
        regex="eval_macro_f1: (.*?)",
    )]
# Define stopping condition
num_hours = 3
stopping = StoppingCondition(max_runtime_in_seconds=3600 * num_hours)
job_name = "my_training_job"
training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB
# sagemaker_session and role are defined elsewhere (not shown here)
model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
    hyperparameters={"learning_rate": 1e-5},
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])
from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.parameter import ContinuousParameter
metric_definitions = [{
    "Name": "eval_loss",
    "Regex": "eval_loss: (.*?)"}]
learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type='Logarithmic')
hyperparameter_ranges = {"learning_rate": learning_rate}
tuner = HyperparameterTuner(model_trainer=model_trainer,
                            objective_metric_name="eval_loss",
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=3,
                            max_parallel_jobs=3)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])
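Once both the plain training job and the tuning job have run, the missing channels can also be verified against the SageMaker API with boto3 (a rough diagnostic sketch; the job names and region are placeholders):

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Channels attached to the training job launched directly by ModelTrainer
training_job = sm.describe_training_job(TrainingJobName="my-training-job-name")
print(sorted(c["ChannelName"] for c in training_job["InputDataConfig"]))

# Channels in the tuning job's training job definition
tuning_job = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-tuning-job-name")
print(sorted(c["ChannelName"] for c in
             tuning_job["TrainingJobDefinition"]["InputDataConfig"]))
# The first list contains 'code' and 'sm_drivers'; the second does not.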
Expected behavior
Tuning jobs complete without errors, using the same custom entrypoint and source code channels as training jobs launched directly from the ModelTrainer.
Screenshots or logs
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 3.3.1
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version:
- Python version: 3.13
- CPU or GPU: GPU
- Custom Docker image (Y/N): 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0
Additional context