
HyperParameterTuner not fetching channels from ModelTrainer #5508

@CoolFish88

Description

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When submitting a training job using ModelTrainer, the job terminates successfully. The job ingests custom training code supplied through a SourceCode object, as well as training, validation, and config data supplied through designated channels (using InputData objects). However, when the ModelTrainer is wrapped in a HyperparameterTuner, the tuning jobs fail.

Looking at the logs, I see:

  • Training jobs: SM_CHANNELS=['code', 'config', 'sm_drivers', 'train', 'validation']
  • Tuning jobs: SM_CHANNELS=["config","train","validation"]
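For reference, the channel lists above were taken from the container logs. A minimal sketch like the following (a hypothetical addition to train.py, relying on the SM_CHANNELS environment variable set by the SageMaker training toolkit) prints the channels that actually reach the container:

import json
import os

# SM_CHANNELS is a JSON-encoded list of the input channel names that
# SageMaker mounted under /opt/ml/input/data.
channels = json.loads(os.environ.get("SM_CHANNELS", "[]"))
print(f"SM_CHANNELS inside the container: {sorted(channels)}")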

The tuning jobs fall back to the default framework container behavior instead of the ModelTrainer's custom entrypoint that runs sm_train.sh, as evidenced by the run commands in the logs:

  • Training job: torchrun --nnodes=1 --nproc_per_node=4 train.py ....
  • Tuning job: /usr/local/bin/python train.py --learning_rate ....

It appears that HyperparameterTuner._build_training_job_definition() does not carry over the source-code channels (code, sm_drivers) or the container configuration prepared by the ModelTrainer.
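The discrepancy can also be confirmed outside the container by comparing the submitted job definitions via boto3. The sketch below is illustrative only: the job names are placeholders, and the expected channel lists follow from the logs above.

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Channels of the standalone training job launched by ModelTrainer.train()
train_desc = sm.describe_training_job(TrainingJobName="<model-trainer-job-name>")
print([c["ChannelName"] for c in train_desc["InputDataConfig"]])
# expected (per the logs): code, config, sm_drivers, train, validation

# Channels baked into the tuner's training job definition
tuning_desc = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="<tuning-job-name>"
)
print([c["ChannelName"] for c in tuning_desc["TrainingJobDefinition"]["InputDataConfig"]])
# observed: only config, train, validation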

To reproduce
Below is a minimal example:

import os
from pathlib import Path

from sagemaker.core.shapes import RetryStrategy, MetricDefinition, StoppingCondition, CheckpointConfig
from sagemaker.core.training.configs import Compute, InputData, OutputDataConfig, SourceCode
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer, Mode

root = str(Path.cwd().parent)
source_dir = os.path.join(root, "sagemaker")
requirements = 'requirements.txt'
entry_script = "train.py"
source_code = SourceCode(source_dir=source_dir,
                         requirements=requirements,
                         entry_script=entry_script)

instance_type = "ml.g6e.12xlarge"
instance_count = 1
volume_size_in_gb = 200
compute = Compute(instance_type=instance_type,
                  instance_count=instance_count,
                  volume_size_in_gb=volume_size_in_gb)

distributed_strategy = Torchrun()

s3_input_path = "s3 path to training data"
training_data = InputData(channel_name='train',
                          data_source=s3_input_path)

s3_input_path = "s3 path to validation data"
validation_data = InputData(channel_name='validation',
                            data_source=s3_input_path)

# Path to S3 yaml file containing training hyperparameters
config_path = "s3 path to yaml config file"
config_data = InputData(channel_name='config', data_source=config_path)

s3_output_path = "s3 output path"
output = OutputDataConfig(s3_output_path=s3_output_path,
                          compression_type='NONE')

# Define checkpoint config
checkpoint_config = CheckpointConfig(s3_uri=s3_output_path, local_path="/opt/ml/checkpoints")

# Define retry strategy
retry_strategy = RetryStrategy(maximum_retry_attempts=3)

# Define tracking metrics
metric_definitions = [
    MetricDefinition(
        name="eval_acro_f1",
        regex="eval_macro_f1: (.*?)",
    )]

# Define stopping condition
num_hours = 3
stopping = StoppingCondition(max_runtime_in_seconds=3600 * num_hours)

job_name = "my_training_job"
training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB

# sagemaker_session and role are assumed to be defined earlier in the notebook
model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
   hyperparameters={"learning_rate": 1e-5}
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])

from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.parameter import ContinuousParameter

metric_definitions = [{
    "Name": "eval_loss",
    "Regex": "eval_loss: (.*?)"}]

learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type='Logarithmic')

hyperparameter_ranges = {"learning_rate": learning_rate}

tuner = HyperparameterTuner(model_trainer=model_trainer,
                            objective_metric_name="eval_loss",
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=3,
                            max_parallel_jobs=3)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])

Expected behavior
Tuning jobs complete without errors, using the same source-code channels and custom entrypoint as the standalone ModelTrainer training job.

Screenshots or logs

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.3.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version:
  • Python version: 3.13
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0

Additional context
