# Migration to AWS SageMaker Python SDK v3
## Overview
Migrate the mlmonitor project from AWS SageMaker Python SDK v2 to v3 to leverage the new unified API, improved architecture, and modern design patterns.
## Objectives

- Upgrade from SageMaker SDK v2.x to v3.x
- Migrate training workflows from framework-specific Estimators to the unified `ModelTrainer`
- Migrate deployment workflows from framework-specific Models to the unified `ModelBuilder`
- Maintain compatibility with Watson OpenScale and AI Factsheets integrations
- Ensure all existing functionality continues to work
## Current State

### Affected Components

Files requiring changes:

- `mlmonitor/src/aws/__init__.py` - framework imports and mappings
- `mlmonitor/src/aws/train_sagemaker_job.py` - training job orchestration
- `mlmonitor/src/aws/deploy_sagemaker_endpoint.py` - endpoint deployment
- `mlmonitor/src/aws/training.py` - training parameter generation
- `mlmonitor/src/aws/deployment.py` - deployment parameter generation
### Current SDK v2 Usage

- Training: `Estimator`, `PyTorch`, `TensorFlow`, `SKLearn` classes
- Deployment: `SKLearnModel`, `XGBoostModel`, `TensorFlowModel`, `PyTorchModel`
- Serialization: `CSVSerializer`, `JSONSerializer`, `CSVDeserializer`, `JSONDeserializer`
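The framework-specific classes above are selected at runtime via dispatch dicts such as `sagemaker_estimators` (used in the patterns further down). A minimal sketch of why v3 simplifies this, using stand-in classes rather than real sagemaker imports so it runs anywhere:

```python
# Sketch of the dispatch pattern this migration removes. The classes here are
# stand-ins (not real sagemaker imports); in v2, mlmonitor keeps one class per
# framework, while in v3 every framework maps to the same unified ModelTrainer.
class Estimator: pass
class SKLearn(Estimator): pass
class PyTorch(Estimator): pass
class TensorFlow(Estimator): pass
class ModelTrainer: pass  # stand-in for sagemaker.train.ModelTrainer

# v2: framework-specific dispatch (as in mlmonitor/src/aws/__init__.py)
sagemaker_estimators_v2 = {
    "sklearn": SKLearn,
    "pytorch": PyTorch,
    "tensorflow": TensorFlow,
    "xgboost": Estimator,  # generic Estimator with an XGBoost image URI
}

# v3: the dict collapses -- every framework uses ModelTrainer
sagemaker_estimators_v3 = {fw: ModelTrainer for fw in sagemaker_estimators_v2}
```

Under v3, framework differences move out of the class choice and into parameters such as the training image URI.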
## Detailed Changes

### 1. Dependency Updates

File: `requirements.txt` or `setup.py`

```diff
- sagemaker==2.*
+ sagemaker>=3.0.0
+ sagemaker-core
+ sagemaker-train
+ sagemaker-serve
```

### 2. Import Changes
File: `mlmonitor/src/aws/__init__.py`

```python
# BEFORE (v2)
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.xgboost import XGBoostModel
from sagemaker.tensorflow import TensorFlowModel, TensorFlow
from sagemaker.pytorch import PyTorchModel, PyTorch
from sagemaker.estimator import Estimator
from sagemaker.deserializers import CSVDeserializer, JSONDeserializer
from sagemaker.serializers import CSVSerializer, JSONSerializer

# AFTER (v3)
from sagemaker.train import ModelTrainer
from sagemaker.serve import ModelBuilder
from sagemaker.serve import serializers, deserializers
```

### 3. Training Workflow Migration
File: `mlmonitor/src/aws/train_sagemaker_job.py`

Current v2 pattern:

```python
SelectedEstimator = sagemaker_estimators.get(framework)
est = SelectedEstimator(**estimator_params)
est.fit(train_dict)
```

New v3 pattern:

```python
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import InputData

# Create the input data configuration, skipping empty channels
input_data_list = []
for channel_name, s3_path in train_dict.items():
    if s3_path:
        input_data_list.append(
            InputData(
                channel_name=channel_name,
                data_source=s3_path,
            )
        )

# Create the unified trainer
trainer = ModelTrainer(
    training_image=estimator_params.get("image_uri"),
    role=estimator_params["role"],
    instance_type=estimator_params["instance_type"],
    instance_count=estimator_params["instance_count"],
    output_path=estimator_params["output_path"],
    hyperparameters=estimator_params["hyperparameters"],
    source_dir=estimator_params.get("source_dir"),
    entry_point=estimator_params.get("entry_point"),
)

# Train and retrieve the model artifact location
training_job = trainer.train(input_data_config=input_data_list)
trained_model_data = training_job.model_uri
```

### 4. Deployment Workflow Migration
File: `mlmonitor/src/aws/deploy_sagemaker_endpoint.py`

Current v2 pattern:

```python
SelectedModel = sagemaker_models.get(framework)
selected_model = SelectedModel(**model_params)
predictor = selected_model.deploy(
    endpoint_name=deployment_name,
    initial_instance_count=1,
    instance_type=model_config.inference_instance,
    serializer=SelectedSerializer(),
    deserializer=SelectedDeserializer(),
)
preds = predictor.predict(scoring_data)
```

New v3 pattern:

```python
from sagemaker.serve import ModelBuilder
from sagemaker.serve.configs import DeploymentConfig

# Create the model builder
model_builder = ModelBuilder(
    model=deployment_name,
    model_path=model_params["model_data"],
    role=model_params["role"],
    image_uri=model_params["image_uri"],
    source_dir=model_params.get("source_dir"),
    entry_point=model_params.get("entry_point"),
    framework_version=model_params.get("framework_version"),
)

# Build and deploy
endpoint = model_builder.build(
    deployment_config=DeploymentConfig(
        endpoint_name=deployment_name,
        instance_type=model_config.inference_instance,
        instance_count=1,
    )
)

# Invoke
response = endpoint.invoke(scoring_data)
```

### 5. Training Parameter Generation
File: `mlmonitor/src/aws/training.py`

Consolidate the framework-specific functions into unified parameter generation:

```python
from typing import Dict, List, Optional, Tuple

import sagemaker
from sagemaker import image_uris
from sagemaker.train.configs import InputData


def generate_training_params(
    framework: str,
    framework_version: str,
    estimator_params: Dict,
    train_dict: Dict,
    sagemaker_session: sagemaker.Session,
    py_version: Optional[str] = None,
) -> Tuple[Dict, List]:
    """Unified parameter generation for all frameworks in SDK v3."""
    # Retrieve the training container image
    container = image_uris.retrieve(
        framework,
        sagemaker_session.boto_region_name,
        framework_version,
        image_scope="training",
    )

    # Map legacy channel names to v3 channel names
    channel_mapping = {
        "train": "training",
        "test": "testing",
        "validation": "validation",
    }
    input_data_list = []
    for old_key, new_key in channel_mapping.items():
        if old_key in train_dict and train_dict[old_key]:
            input_data_list.append(
                InputData(
                    channel_name=new_key,
                    data_source=train_dict[old_key],
                )
            )

    trainer_params = {
        "training_image": container,
        "role": estimator_params["role"],
        "instance_type": estimator_params["instance_type"],
        "instance_count": estimator_params["instance_count"],
        "output_path": estimator_params["output_path"],
        "hyperparameters": estimator_params["hyperparameters"],
        "source_dir": estimator_params.get("source_dir"),
        "entry_point": estimator_params.get("entry_point"),
    }
    return trainer_params, input_data_list
```

### 6. Deployment Parameter Generation
File: `mlmonitor/src/aws/deployment.py`

Update to return v3-compatible parameters:

```python
import os
from typing import Dict

from sagemaker import image_uris

# PROJECT_ROOT and ROLE are module-level globals defined elsewhere in mlmonitor


def generate_base_deployment_params(
    trained_model_data: str,
    source_dir: str,
    framework: str,
    framework_version: str,
    py_version: str,
    script: str,
    instance: str,
) -> Dict:
    """Generate parameters for ModelBuilder (SDK v3)."""
    # Retrieve the inference container image
    container = image_uris.retrieve(
        framework=framework,
        region=os.environ.get("AWS_DEFAULT_REGION", "ca-central-1"),
        version=framework_version,
        image_scope="inference",
        instance_type=instance,
    )

    model_builder_params = {
        "model_path": trained_model_data.strip(),
        "source_dir": f"{PROJECT_ROOT}/{source_dir}",
        "image_uri": container,
        "role": ROLE,
        "entry_point": script,
        "framework_version": framework_version,
    }
    return model_builder_params
```

## ✅ Testing Requirements
### Unit Tests

- Test `ModelTrainer` instantiation for all frameworks (sklearn, xgboost, pytorch, tensorflow)
- Validate `InputData` configuration
- Test `ModelBuilder` instantiation
- Validate endpoint creation and invocation
- Test serialization/deserialization
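The `InputData` configuration check can be unit-tested offline: the channel-mapping logic from `generate_training_params` is pure Python. A sketch using a dataclass stand-in for `InputData` (hypothetical helper names, no AWS calls):

```python
# Offline unit-test sketch for the channel-mapping logic in
# generate_training_params. InputData is a dataclass stand-in for
# sagemaker.train.configs.InputData so the test needs no AWS access.
from dataclasses import dataclass


@dataclass
class InputData:
    channel_name: str
    data_source: str


CHANNEL_MAPPING = {"train": "training", "test": "testing", "validation": "validation"}


def build_input_data(train_dict: dict) -> list:
    """Replicates the loop in generate_training_params: skip empty channels."""
    return [
        InputData(channel_name=new_key, data_source=train_dict[old_key])
        for old_key, new_key in CHANNEL_MAPPING.items()
        if train_dict.get(old_key)
    ]


def test_empty_and_missing_channels_are_skipped():
    inputs = build_input_data({"train": "s3://bucket/train", "test": ""})
    assert [i.channel_name for i in inputs] == ["training"]


def test_all_channels_mapped():
    inputs = build_input_data(
        {"train": "s3://b/tr", "test": "s3://b/te", "validation": "s3://b/va"}
    )
    assert {i.channel_name for i in inputs} == {"training", "testing", "validation"}
```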
### Integration Tests

- End-to-end training: sklearn model
- End-to-end training: xgboost model
- End-to-end training: pytorch model
- End-to-end training: tensorflow model
- End-to-end deployment and scoring
- Watson OpenScale integration
- AI Factsheets integration
### Test Files to Update

- `mlmonitor/tests/aws_model_use_case/test_aws_model_config.py`
- `mlmonitor/tests/aws_model_use_case/test_aws_resources.py`
## 📋 Implementation Checklist

### Pre-Migration

- [ ] Create backup branch: `git checkout -b backup/sagemaker-v2`
- [ ] Document current SDK version: `pip freeze | grep sagemaker`
- [ ] Run all existing tests and save results
- [ ] Create feature branch: `git checkout -b feature/sagemaker-v3-migration`
### Core Migration

- [ ] Update `requirements.txt` or `setup.py` dependencies
- [ ] Update `mlmonitor/src/aws/__init__.py` imports
- [ ] Migrate `mlmonitor/src/aws/training.py`
- [ ] Migrate `mlmonitor/src/aws/train_sagemaker_job.py`
- [ ] Migrate `mlmonitor/src/aws/deployment.py`
- [ ] Migrate `mlmonitor/src/aws/deploy_sagemaker_endpoint.py`
- [ ] Review `mlmonitor/use_case_gcr/train_gcr.py` (may not need changes)
- [ ] Review `mlmonitor/use_case_gcr/inference_aws_gcr.py` (may not need changes)
### Testing Phase

- [ ] Run unit tests for training module
- [ ] Run unit tests for deployment module
- [ ] Run integration test: sklearn model
- [ ] Run integration test: xgboost model
- [ ] Run integration test: pytorch model
- [ ] Run integration test: tensorflow model
- [ ] Test Watson OpenScale payload logging
- [ ] Test AI Factsheets model tracking
- [ ] Update `examples/mlmonitor-sagemaker.ipynb` and verify it works
### Documentation

- [ ] Update `README.md` with SDK v3 requirements
- [ ] Update inline code documentation
- [ ] Add migration notes to CHANGELOG
- [ ] Update any architecture diagrams if needed
### Deployment

- [ ] Code review
- [ ] Merge to main branch
- [ ] Tag release: `v3.0.0-sagemaker-v3`
- [ ] Monitor first production deployment
## 🔙 Rollback Plan

### Immediate Rollback

```shell
git checkout main
pip install "sagemaker==2.*"
```

### Partial Rollback

If some features work but others fail, use version detection:

```python
import sagemaker

SDK_VERSION = int(sagemaker.__version__.split(".")[0])

if SDK_VERSION >= 3:
    from sagemaker.train import ModelTrainer
else:
    from sagemaker.estimator import Estimator
```

### Data Integrity
- ✅ Model artifacts remain compatible (same S3 `.tar.gz` format)
- ✅ Endpoints can be managed with either SDK version
- ✅ No data migration needed
- ✅ IAM roles and permissions unchanged
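The SDK-version check used in the partial-rollback shim can be factored into a small helper so the import fallback and any compatibility assertions share one tested code path (helper names are illustrative, not part of the SDK):

```python
# Illustrative helpers wrapping the version detection from the
# partial-rollback snippet; pure string parsing, so trivially testable.
def major_version(version_string: str) -> int:
    """Return the major component of a version string like '2.219.0'."""
    return int(version_string.split(".")[0])


def is_sdk_v3(version_string: str) -> bool:
    """True when the installed SDK is v3 or later."""
    return major_version(version_string) >= 3


# At import time, something like:
#   import sagemaker
#   if is_sdk_v3(sagemaker.__version__):
#       from sagemaker.train import ModelTrainer
#   else:
#       from sagemaker.estimator import Estimator
```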
## API Mapping Reference

### Training API

| SDK v2 | SDK v3 |
|---|---|
| `sagemaker.estimator.Estimator` | `sagemaker.train.ModelTrainer` |
| `sagemaker.pytorch.PyTorch` | `sagemaker.train.ModelTrainer` |
| `sagemaker.tensorflow.TensorFlow` | `sagemaker.train.ModelTrainer` |
| `estimator.fit({"training": "s3://..."})` | `trainer.train(input_data_config=[InputData(...)])` |
### Deployment API

| SDK v2 | SDK v3 |
|---|---|
| `sagemaker.sklearn.SKLearnModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.xgboost.XGBoostModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.tensorflow.TensorFlowModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.pytorch.PyTorchModel` | `sagemaker.serve.ModelBuilder` |
| `model.deploy(...)` | `model_builder.build()` |
| `predictor.predict(data)` | `endpoint.invoke(data)` |
## Benefits of SDK v3

- **Unified API**: Single `ModelTrainer` and `ModelBuilder` for all frameworks
- **Modular Architecture**: Separate packages for core, training, and serving
- **Better Structure**: Clearer separation of concerns
- **Object-Oriented**: Structured configs aligned with AWS APIs
- **Less Boilerplate**: Simplified workflows and reduced code duplication
- **Future-Proof**: Aligned with AWS's long-term SDK strategy