Migrate SageMaker use case to v3 #10

@jslecointre

Migration to AWS SageMaker Python SDK v3

Overview

Migrate the mlmonitor project from AWS SageMaker Python SDK v2 to v3 to adopt its unified training and deployment API (ModelTrainer, ModelBuilder) and its modular package layout.

Objectives

  • Upgrade from SageMaker SDK v2.x to v3.x
  • Migrate training workflows from framework-specific Estimators to unified ModelTrainer
  • Migrate deployment workflows from framework-specific Models to unified ModelBuilder
  • Maintain compatibility with Watson OpenScale and AI Factsheets integrations
  • Ensure all existing functionality continues to work

Current State

Affected Components

Files requiring changes:

  • mlmonitor/src/aws/__init__.py - Framework imports and mappings
  • mlmonitor/src/aws/train_sagemaker_job.py - Training job orchestration
  • mlmonitor/src/aws/deploy_sagemaker_endpoint.py - Endpoint deployment
  • mlmonitor/src/aws/training.py - Training parameter generation
  • mlmonitor/src/aws/deployment.py - Deployment parameter generation

Current SDK v2 Usage:

  • Training: Estimator, PyTorch, TensorFlow, SKLearn classes
  • Deployment: SKLearnModel, XGBoostModel, TensorFlowModel, PyTorchModel
  • Serialization: CSVSerializer, JSONSerializer, CSVDeserializer, JSONDeserializer

Detailed Changes

1. Dependency Updates

File: requirements.txt or setup.py

- sagemaker==2.*
+ sagemaker>=3.0.0
+ sagemaker-core
+ sagemaker-train
+ sagemaker-serve

2. Import Changes

File: mlmonitor/src/aws/__init__.py

# BEFORE (v2)
from sagemaker.sklearn.estimator import SKLearnModel
from sagemaker.xgboost import XGBoostModel
from sagemaker.tensorflow import TensorFlowModel, TensorFlow
from sagemaker.pytorch import PyTorchModel, PyTorch
from sagemaker.estimator import Estimator
from sagemaker.deserializers import CSVDeserializer, JSONDeserializer
from sagemaker.serializers import CSVSerializer, JSONSerializer

# AFTER (v3)
from sagemaker.train import ModelTrainer
from sagemaker.serve import ModelBuilder
from sagemaker.serve import serializers, deserializers

3. Training Workflow Migration

File: mlmonitor/src/aws/train_sagemaker_job.py

Current v2 Pattern:

SelectedEstimator = sagemaker_estimators.get(framework)
est = SelectedEstimator(**estimator_params)
est.fit(train_dict)

New v3 Pattern:

from sagemaker.train import ModelTrainer
from sagemaker.train.configs import Compute, InputData, OutputDataConfig, SourceCode

# Build one InputData channel per non-empty S3 path
input_data_list = [
    InputData(channel_name=channel_name, data_source=s3_path)
    for channel_name, s3_path in train_dict.items()
    if s3_path
]

# Create unified trainer; compute, source code, and output settings are
# passed as config objects rather than flat keyword arguments
trainer = ModelTrainer(
    training_image=estimator_params.get("image_uri"),
    role=estimator_params["role"],
    compute=Compute(
        instance_type=estimator_params["instance_type"],
        instance_count=estimator_params["instance_count"],
    ),
    source_code=SourceCode(
        source_dir=estimator_params.get("source_dir"),
        entry_script=estimator_params.get("entry_point"),
    ),
    output_data_config=OutputDataConfig(
        s3_output_path=estimator_params["output_path"],
    ),
    hyperparameters=estimator_params["hyperparameters"],
)

# Train
# Assumes train() returns a job object exposing model_uri; confirm this
# against the installed SDK release before relying on it
training_job = trainer.train(input_data_config=input_data_list)
trained_model_data = training_job.model_uri

4. Deployment Workflow Migration

File: mlmonitor/src/aws/deploy_sagemaker_endpoint.py

Current v2 Pattern:

SelectedModel = sagemaker_models.get(framework)
selected_model = SelectedModel(**model_params)
predictor = selected_model.deploy(
    endpoint_name=deployment_name,
    initial_instance_count=1,
    instance_type=model_config.inference_instance,
    serializer=SelectedSerializer(),
    deserializer=SelectedDeserializer(),
)
preds = predictor.predict(scoring_data)

New v3 Pattern:

from sagemaker.serve import ModelBuilder
from sagemaker.serve.configs import DeploymentConfig

# Create model builder
model_builder = ModelBuilder(
    model=deployment_name,
    model_path=model_params["model_data"],
    role=model_params["role"],
    image_uri=model_params["image_uri"],
    source_dir=model_params.get("source_dir"),
    entry_point=model_params.get("entry_point"),
    framework_version=model_params.get("framework_version"),
)

# Build and deploy
endpoint = model_builder.build(
    deployment_config=DeploymentConfig(
        endpoint_name=deployment_name,
        instance_type=model_config.inference_instance,
        instance_count=1,
    )
)

# Invoke
response = endpoint.invoke(scoring_data)
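The v2 pattern attaches a serializer and deserializer to the predictor, while the v3 pattern above hands scoring_data directly to invoke. If the payload has to be encoded by hand before the call, a minimal sketch could look like the following. The helper names encode_payload and decode_response are hypothetical and not part of the SageMaker SDK; only the standard library is used, and whether invoke expects a pre-encoded body should be checked against the installed release.

```python
import csv
import io
import json
from typing import Any, Sequence, Tuple


def encode_payload(rows: Sequence[Sequence[Any]], content_type: str) -> Tuple[str, str]:
    """Encode rows as a request body and return (body, content_type)."""
    if content_type == "text/csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        # Drop the trailing newline so the body ends with the last record
        return buffer.getvalue().rstrip("\n"), content_type
    if content_type == "application/json":
        return json.dumps([list(row) for row in rows]), content_type
    raise ValueError(f"Unsupported content type: {content_type}")


def decode_response(body: str, content_type: str) -> Any:
    """Decode a response body produced with the same content type."""
    if content_type == "text/csv":
        # CSV carries no type information, so every field comes back as str
        return [row for row in csv.reader(io.StringIO(body)) if row]
    if content_type == "application/json":
        return json.loads(body)
    raise ValueError(f"Unsupported content type: {content_type}")
```

Keeping the content type next to the body makes it harder to send CSV bytes with a JSON header, which is the kind of mismatch the v2 serializer classes used to prevent.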

5. Training Parameter Generation

File: mlmonitor/src/aws/training.py

Consolidate framework-specific functions into unified parameter generation:

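from typing import Dict, List, Optional, Tuple

import sagemaker
# v2 import location; confirm where Session and image_uris live in the installed v3 release
from sagemaker import image_uris
from sagemaker.train.configs import Compute, InputData, OutputDataConfig, SourceCode


def generate_training_params(
    framework: str,
    framework_version: str,
    estimator_params: Dict,
    train_dict: Dict,
    sagemaker_session: sagemaker.Session,
    py_version: Optional[str] = None,
) -> Tuple[Dict, List]:
    """Unified parameter generation for all frameworks in SDK v3"""
    # Retrieve the training container image; py_version and instance_type
    # are forwarded because PyTorch and TensorFlow image lookups depend on them
    container = image_uris.retrieve(
        framework,
        sagemaker_session.boto_region_name,
        framework_version,
        py_version=py_version,
        instance_type=estimator_params["instance_type"],
        image_scope="training",
    )

    # Map mlmonitor channel keys to SageMaker channel names, skipping empty paths
    channel_mapping = {
        "train": "training",
        "test": "testing",
        "validation": "validation",
    }
    input_data_list = [
        InputData(channel_name=new_key, data_source=train_dict[old_key])
        for old_key, new_key in channel_mapping.items()
        if train_dict.get(old_key)
    ]

    # Keyword arguments for ModelTrainer; compute, source code, and output
    # settings are grouped into config objects
    trainer_params = {
        "training_image": container,
        "role": estimator_params["role"],
        "compute": Compute(
            instance_type=estimator_params["instance_type"],
            instance_count=estimator_params["instance_count"],
        ),
        "source_code": SourceCode(
            source_dir=estimator_params.get("source_dir"),
            entry_script=estimator_params.get("entry_point"),
        ),
        "output_data_config": OutputDataConfig(
            s3_output_path=estimator_params["output_path"],
        ),
        "hyperparameters": estimator_params["hyperparameters"],
    }

    return trainer_params, input_data_list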
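The channel renaming step can be exercised without the SDK or AWS access. The sketch below isolates it, using a hypothetical ChannelSpec dataclass as a stand-in for InputData so that it runs without sagemaker installed; build_channels and the bucket paths are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# mlmonitor channel key -> SageMaker channel name, as in the function above
CHANNEL_MAPPING = {"train": "training", "test": "testing", "validation": "validation"}


@dataclass(frozen=True)
class ChannelSpec:
    """Stand-in for InputData, holding the same two fields."""
    channel_name: str
    data_source: str


def build_channels(train_dict: Dict[str, Optional[str]]) -> List[ChannelSpec]:
    """Rename known channels and skip missing or empty S3 paths."""
    return [
        ChannelSpec(channel_name=new_key, data_source=train_dict[old_key])
        for old_key, new_key in CHANNEL_MAPPING.items()
        if train_dict.get(old_key)
    ]


channels = build_channels(
    {"train": "s3://example-bucket/train", "test": "", "validation": None}
)
print([c.channel_name for c in channels])  # prints ['training']
```

Keys that are not listed in the mapping are dropped silently, so a use case that relies on an extra channel would need the mapping extended.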

6. Deployment Parameter Generation

File: mlmonitor/src/aws/deployment.py

Update to return v3-compatible parameters:

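import os
from typing import Dict

# v2 import location; confirm where image_uris lives in the installed v3 release
from sagemaker import image_uris

# PROJECT_ROOT and ROLE are assumed to be module-level constants defined elsewhere in mlmonitor


def generate_base_deployment_params(
    trained_model_data: str,
    source_dir: str,
    framework: str,
    framework_version: str,
    py_version: str,
    script: str,
    instance: str,
) -> Dict:
    """Generate parameters for ModelBuilder (SDK v3)"""

    # Retrieve the inference container image; py_version is forwarded so that
    # framework images that depend on the Python version resolve correctly
    container = image_uris.retrieve(
        framework=framework,
        region=os.environ.get("AWS_DEFAULT_REGION", "ca-central-1"),
        version=framework_version,
        py_version=py_version,
        image_scope="inference",
        instance_type=instance,
    )

    model_builder_params = {
        "model_path": trained_model_data.strip(),
        "source_dir": f"{PROJECT_ROOT}/{source_dir}",
        "image_uri": container,
        "role": ROLE,
        "entry_point": script,
        "framework_version": framework_version,
    }

    return model_builder_params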
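Callers that still hold v2-style model parameters need their keys renamed before the values reach ModelBuilder. The sketch below follows the argument names used in section 4 of this issue (model_data becomes model_path, the rest pass through unchanged). The helper to_model_builder_params is hypothetical, and the accepted key set should be checked against the installed SDK.

```python
from typing import Any, Dict

# v2 key -> v3 key, following the ModelBuilder call shown in section 4
KEY_RENAMES = {"model_data": "model_path"}

# Keys that section 4 forwards under the same name
PASS_THROUGH = ("role", "image_uri", "source_dir", "entry_point", "framework_version")


def to_model_builder_params(v2_params: Dict[str, Any]) -> Dict[str, Any]:
    """Translate v2 model parameters into the key names used in this issue."""
    params: Dict[str, Any] = {}
    for old_key, new_key in KEY_RENAMES.items():
        if old_key in v2_params:
            value = v2_params[old_key]
            # Strip stray whitespace from S3 URIs, as the function above does
            params[new_key] = value.strip() if isinstance(value, str) else value
    for key in PASS_THROUGH:
        if v2_params.get(key) is not None:
            params[key] = v2_params[key]
    return params
```

Unknown keys are dropped rather than forwarded, so a leftover v2-only argument cannot reach the v3 constructor by accident.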

✅ Testing Requirements

Unit Tests

  • Test ModelTrainer instantiation for all frameworks (sklearn, xgboost, pytorch, tensorflow)
  • Validate InputData configuration
  • Test ModelBuilder instantiation
  • Validate endpoint creation and invocation
  • Test serialization/deserialization
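The instantiation tests listed above can run without AWS credentials if the trainer class is injected rather than imported inside the function under test. A minimal sketch with unittest.mock follows; run_training is a hypothetical wrapper, not existing mlmonitor code, and the parameter values are placeholders.

```python
from typing import Any, Dict, List
from unittest import mock


def run_training(trainer_cls: Any, trainer_params: Dict[str, Any], input_data: List[Any]) -> Any:
    """Instantiate the trainer and start training; the class is injected for testability."""
    trainer = trainer_cls(**trainer_params)
    return trainer.train(input_data_config=input_data)


def test_run_training_passes_params_and_channels() -> None:
    # MagicMock stands in for ModelTrainer, so no SageMaker call is made
    fake_trainer_cls = mock.MagicMock(name="ModelTrainer")
    params = {"training_image": "example-image", "role": "example-role"}
    channels = ["channel-placeholder"]

    result = run_training(fake_trainer_cls, params, channels)

    # The constructor receives the generated parameters unchanged
    fake_trainer_cls.assert_called_once_with(**params)
    # train() receives the channel list under the expected keyword
    fake_trainer_cls.return_value.train.assert_called_once_with(input_data_config=channels)
    assert result is fake_trainer_cls.return_value.train.return_value


test_run_training_passes_params_and_channels()
```

The same injection point lets one test loop over sklearn, xgboost, pytorch, and tensorflow parameter sets without launching a training job.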

Integration Tests

  • End-to-end training: sklearn model
  • End-to-end training: xgboost model
  • End-to-end training: pytorch model
  • End-to-end training: tensorflow model
  • End-to-end deployment and scoring
  • Watson OpenScale integration
  • AI Factsheets integration

Test Files to Update

  • mlmonitor/tests/aws_model_use_case/test_aws_model_config.py
  • mlmonitor/tests/aws_model_use_case/test_aws_resources.py

📋 Implementation Checklist

Pre-Migration

  • Create backup branch: git checkout -b backup/sagemaker-v2
  • Document current SDK version: pip freeze | grep sagemaker
  • Run all existing tests and save results
  • Create feature branch: git checkout -b feature/sagemaker-v3-migration

Core Migration

  • Update requirements.txt or setup.py dependencies
  • Update mlmonitor/src/aws/__init__.py imports
  • Migrate mlmonitor/src/aws/training.py
  • Migrate mlmonitor/src/aws/train_sagemaker_job.py
  • Migrate mlmonitor/src/aws/deployment.py
  • Migrate mlmonitor/src/aws/deploy_sagemaker_endpoint.py
  • Review mlmonitor/use_case_gcr/train_gcr.py (may not need changes)
  • Review mlmonitor/use_case_gcr/inference_aws_gcr.py (may not need changes)

Testing Phase

  • Run unit tests for training module
  • Run unit tests for deployment module
  • Run integration test: sklearn model
  • Run integration test: xgboost model
  • Run integration test: pytorch model
  • Run integration test: tensorflow model
  • Test Watson OpenScale payload logging
  • Test AI Factsheets model tracking
  • Update examples/mlmonitor-sagemaker.ipynb and verify it works

Documentation

  • Update README.md with SDK v3 requirements
  • Update inline code documentation
  • Add migration notes to CHANGELOG
  • Update any architecture diagrams if needed

Deployment

  • Code review
  • Merge to main branch
  • Tag release: v3.0.0-sagemaker-v3
  • Monitor first production deployment

🔙 Rollback Plan

Immediate Rollback

git checkout main
pip install "sagemaker==2.*"

Partial Rollback

If some features work but others fail, use version detection so the same code base runs against whichever SDK major version is installed in the environment:

import sagemaker
SDK_VERSION = int(sagemaker.__version__.split('.')[0])

if SDK_VERSION >= 3:
    from sagemaker.train import ModelTrainer
else:
    from sagemaker.estimator import Estimator
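The check above assumes that sagemaker imports cleanly and exposes a plain numeric version. A slightly more defensive sketch reads the installed distribution metadata instead, so the check also works when the package is missing. The names parse_major, installed_major, and USE_V3 are hypothetical.

```python
from importlib import metadata
from typing import Optional


def parse_major(version: str) -> int:
    """Return the leading integer of a version string such as '3.0.1'."""
    head = version.strip().split(".", 1)[0]
    if not head.isdigit():
        raise ValueError(f"Cannot parse major version from {version!r}")
    return int(head)


def installed_major(distribution: str = "sagemaker") -> Optional[int]:
    """Return the installed major version, or None if the package is absent."""
    try:
        return parse_major(metadata.version(distribution))
    except metadata.PackageNotFoundError:
        return None


# Treat a missing package as "not v3" so that imports fall back to the v2 path
USE_V3 = (installed_major() or 0) >= 3
```

Reading metadata avoids importing sagemaker just to learn its version, which keeps the check cheap and free of import side effects.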

Data Integrity

  • ✅ Model artifacts remain compatible (same S3 .tar.gz format)
  • ✅ Endpoints can be managed with either SDK version
  • ✅ No data migration needed
  • ✅ IAM roles and permissions unchanged

API Mapping Reference

Training API

SDK v2                                   | SDK v3
-----------------------------------------+---------------------------------------------------
sagemaker.estimator.Estimator            | sagemaker.train.ModelTrainer
sagemaker.pytorch.PyTorch                | sagemaker.train.ModelTrainer
sagemaker.tensorflow.TensorFlow          | sagemaker.train.ModelTrainer
estimator.fit({"training": "s3://..."})  | trainer.train(input_data_config=[InputData(...)])

Deployment API

SDK v2                                | SDK v3
--------------------------------------+------------------------------
sagemaker.sklearn.SKLearnModel        | sagemaker.serve.ModelBuilder
sagemaker.xgboost.XGBoostModel        | sagemaker.serve.ModelBuilder
sagemaker.tensorflow.TensorFlowModel  | sagemaker.serve.ModelBuilder
sagemaker.pytorch.PyTorchModel        | sagemaker.serve.ModelBuilder
model.deploy(...)                     | model_builder.build()
predictor.predict(data)               | endpoint.invoke(data)
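The mapping can also be used to find code that still needs migrating. The sketch below scans Python source text for imports from the v2 modules listed in the "BEFORE (v2)" block of section 2. The function find_v2_imports and the module tuple are hypothetical helpers, not part of mlmonitor or the SDK.

```python
import re
from typing import List, Tuple

# v2 modules taken from the "BEFORE (v2)" imports in section 2
V2_MODULES = (
    "sagemaker.estimator",
    "sagemaker.sklearn",
    "sagemaker.xgboost",
    "sagemaker.tensorflow",
    "sagemaker.pytorch",
    "sagemaker.serializers",
    "sagemaker.deserializers",
)

# Match "from <module>..." or "import <module>..." at the start of a line;
# the trailing group prevents partial matches on longer module names
_IMPORT_RE = re.compile(
    r"^\s*(?:from|import)\s+("
    + "|".join(re.escape(module) for module in V2_MODULES)
    + r")(?:\.|\s|$)"
)


def find_v2_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each v2 import in the source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if _IMPORT_RE.match(line):
            hits.append((lineno, line.strip()))
    return hits
```

Running this over the files listed under "Affected Components" gives a quick count of remaining v2 imports; commented-out imports are ignored because the pattern is anchored at the start of the statement.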

Benefits of SDK v3

  • Unified API: Single ModelTrainer and ModelBuilder for all frameworks
  • Modular Architecture: Separate packages for core, training, and serving
  • Better Structure: Clearer separation of concerns
  • Object-Oriented: Structured configs aligned with AWS APIs
  • Less Boilerplate: Simplified workflows and reduced code duplication
  • Future-Proof: Aligned with AWS's long-term SDK strategy
