Migrate SageMaker use case to v3

# Migration to AWS SageMaker Python SDK v3

## Overview

Migrate the `mlmonitor` project from AWS SageMaker Python SDK v2 to v3 to adopt the unified `ModelTrainer` and `ModelBuilder` API and the modular package layout.

## Objectives

- Upgrade from SageMaker SDK v2.x to v3.x
- Migrate training workflows from framework-specific Estimators to unified `ModelTrainer`
- Migrate deployment workflows from framework-specific Models to unified `ModelBuilder`
- Maintain compatibility with Watson OpenScale and AI Factsheets integrations
- Ensure all existing functionality continues to work

## Current State

### Affected Components

**Files requiring changes:**
- `mlmonitor/src/aws/__init__.py` - Framework imports and mappings
- `mlmonitor/src/aws/train_sagemaker_job.py` - Training job orchestration
- `mlmonitor/src/aws/deploy_sagemaker_endpoint.py` - Endpoint deployment
- `mlmonitor/src/aws/training.py` - Training parameter generation
- `mlmonitor/src/aws/deployment.py` - Deployment parameter generation

**Current SDK v2 Usage:**
- Training: `Estimator`, `PyTorch`, `TensorFlow`, `SKLearn` classes
- Deployment: `SKLearnModel`, `XGBoostModel`, `TensorFlowModel`, `PyTorchModel`
- Serialization: `CSVSerializer`, `JSONSerializer`, `CSVDeserializer`, `JSONDeserializer`

## Detailed Changes

### 1. Dependency Updates

**File:** `requirements.txt` or `setup.py`

```diff
- sagemaker==2.*
+ sagemaker>=3.0.0
+ sagemaker-core
+ sagemaker-train
+ sagemaker-serve
```
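After updating the requirements, it can help to confirm which of these distributions are actually installed in the target environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of `mlmonitor` or the SDK:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the dependency diff above
    required = ["sagemaker", "sagemaker-core", "sagemaker-train", "sagemaker-serve"]
    for name, ver in installed_versions(required).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Running the script prints one line per distribution, which is also a convenient record for the pre-migration step of documenting the current SDK version.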

### 2. Import Changes

**File:** `mlmonitor/src/aws/__init__.py`

```python
# BEFORE (v2)
from sagemaker.sklearn.estimator import SKLearnModel
from sagemaker.xgboost import XGBoostModel
from sagemaker.tensorflow import TensorFlowModel, TensorFlow
from sagemaker.pytorch import PyTorchModel, PyTorch
from sagemaker.estimator import Estimator
from sagemaker.deserializers import CSVDeserializer, JSONDeserializer
from sagemaker.serializers import CSVSerializer, JSONSerializer

# AFTER (v3)
from sagemaker.train import ModelTrainer
from sagemaker.serve import ModelBuilder
from sagemaker.serve import serializers, deserializers
```

### 3. Training Workflow Migration

**File:** `mlmonitor/src/aws/train_sagemaker_job.py`

**Current v2 Pattern:**
```python
SelectedEstimator = sagemaker_estimators.get(framework)
est = SelectedEstimator(**estimator_params)
est.fit(train_dict)
```

**New v3 Pattern:**
```python
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import Compute, InputData, OutputDataConfig, SourceCode

# Build one InputData channel per non-empty S3 path
input_data_list = [
    InputData(channel_name=channel_name, data_source=s3_path)
    for channel_name, s3_path in train_dict.items()
    if s3_path
]

# Create the unified trainer. ModelTrainer groups instance, source, and output
# settings into config objects instead of flat keyword arguments; the field
# names below should be verified against the installed SDK version.
trainer = ModelTrainer(
    training_image=estimator_params.get("image_uri"),
    role=estimator_params["role"],
    compute=Compute(
        instance_type=estimator_params["instance_type"],
        instance_count=estimator_params["instance_count"],
    ),
    source_code=SourceCode(
        source_dir=estimator_params.get("source_dir"),
        entry_script=estimator_params.get("entry_point"),
    ),
    output_data_config=OutputDataConfig(
        s3_output_path=estimator_params["output_path"],
    ),
    hyperparameters=estimator_params["hyperparameters"],
)

# Train
training_job = trainer.train(input_data_config=input_data_list)
trained_model_data = training_job.model_uri
```
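The channel-building step above can be unit tested without the SDK if the filtering is separated from the `InputData` construction. Here is a minimal sketch, assuming a hypothetical helper `select_channels` (not part of `mlmonitor` or SageMaker) that returns plain `(channel_name, s3_uri)` pairs. Rejecting non-S3 values is a project-level assumption, not an SDK rule:

```python
from typing import Dict, List, Optional, Tuple


def select_channels(train_dict: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Return (channel_name, s3_uri) pairs for every non-empty channel.

    Hypothetical helper: empty or None paths are skipped, and any value that
    is not an s3:// URI raises ValueError.
    """
    channels = []
    for channel_name, s3_path in train_dict.items():
        if not s3_path:
            continue  # same non-empty guard as the migration snippet
        if not s3_path.startswith("s3://"):
            raise ValueError(f"channel {channel_name!r} is not an S3 URI: {s3_path!r}")
        channels.append((channel_name, s3_path))
    return channels
```

Each returned pair can then be passed to `InputData(channel_name=..., data_source=...)`.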

### 4. Deployment Workflow Migration

**File:** `mlmonitor/src/aws/deploy_sagemaker_endpoint.py`

**Current v2 Pattern:**
```python
SelectedModel = sagemaker_models.get(framework)
selected_model = SelectedModel(**model_params)
predictor = selected_model.deploy(
    endpoint_name=deployment_name,
    initial_instance_count=1,
    instance_type=model_config.inference_instance,
    serializer=SelectedSerializer(),
    deserializer=SelectedDeserializer(),
)
preds = predictor.predict(scoring_data)
```

**New v3 Pattern:**
```python
from sagemaker.serve import ModelBuilder
from sagemaker.serve.configs import DeploymentConfig

# Create model builder
model_builder = ModelBuilder(
    model=deployment_name,
    model_path=model_params["model_path"],  # key returned by generate_base_deployment_params
    role=model_params["role"],
    image_uri=model_params["image_uri"],
    source_dir=model_params.get("source_dir"),
    entry_point=model_params.get("entry_point"),
    framework_version=model_params.get("framework_version"),
)

# Build and deploy
endpoint = model_builder.build(
    deployment_config=DeploymentConfig(
        endpoint_name=deployment_name,
        instance_type=model_config.inference_instance,
        instance_count=1,
    )
)

# Invoke
response = endpoint.invoke(scoring_data)
```
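In v2 the `Predictor` applied `CSVSerializer` or `JSONSerializer` before sending the request. If the v3 endpoint call expects an already-serialized body (an assumption to verify against the SDK), the payload can be prepared with the standard library. Below is a minimal sketch with a hypothetical helper `serialize_payload`. The JSON shape (a list of rows) is also an assumption and depends on what the inference script expects:

```python
import csv
import io
import json
from typing import Sequence, Tuple


def serialize_payload(rows: Sequence[Sequence], content_type: str) -> Tuple[str, str]:
    """Serialize scoring rows into a request body for the given content type.

    Hypothetical helper, not an SDK function. Returns (body, content_type).
    """
    if content_type == "text/csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue().rstrip("\n"), content_type
    if content_type == "application/json":
        return json.dumps([list(row) for row in rows]), content_type
    raise ValueError(f"unsupported content type: {content_type}")
```

Keeping serialization in a plain function like this also makes the "Test serialization/deserialization" item checkable without a live endpoint.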

### 5. Training Parameter Generation

**File:** `mlmonitor/src/aws/training.py`

Consolidate framework-specific functions into unified parameter generation:

```python
def generate_training_params(
    framework: str,
    framework_version: str,
    estimator_params: Dict,
    train_dict: Dict,
    sagemaker_session: sagemaker.Session,
    py_version: Optional[str] = None,
) -> Tuple[Dict, List]:
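    """Generate ModelTrainer keyword arguments and input channels for any framework (SDK v3)."""
    from sagemaker.train.configs import Compute, InputData, OutputDataConfig, SourceCode

    # Retrieve the training container image. py_version and instance_type are
    # forwarded because image lookups for some frameworks depend on them.
    container = image_uris.retrieve(
        framework,
        sagemaker_session.boto_region_name,
        framework_version,
        py_version=py_version,
        instance_type=estimator_params["instance_type"],
        image_scope="training",
    )

    # Map the project's channel keys to SageMaker channel names
    channel_mapping = {
        "train": "training",
        "test": "testing",
        "validation": "validation",
    }

    # Create one InputData entry per channel that has a non-empty S3 path
    input_data_list = [
        InputData(channel_name=new_key, data_source=train_dict[old_key])
        for old_key, new_key in channel_mapping.items()
        if train_dict.get(old_key)
    ]

    # ModelTrainer takes structured config objects rather than flat
    # instance/source keyword arguments; verify the field names against the
    # installed SDK version.
    trainer_params = {
        "training_image": container,
        "role": estimator_params["role"],
        "compute": Compute(
            instance_type=estimator_params["instance_type"],
            instance_count=estimator_params["instance_count"],
        ),
        "source_code": SourceCode(
            source_dir=estimator_params.get("source_dir"),
            entry_script=estimator_params.get("entry_point"),
        ),
        "output_data_config": OutputDataConfig(
            s3_output_path=estimator_params["output_path"],
        ),
        "hyperparameters": estimator_params["hyperparameters"],
    }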
    
    return trainer_params, input_data_list
```

### 6. Deployment Parameter Generation

**File:** `mlmonitor/src/aws/deployment.py`

Update to return v3-compatible parameters:

```python
def generate_base_deployment_params(
    trained_model_data: str,
    source_dir: str,
    framework: str,
    framework_version: str,
    py_version: str,
    script: str,
    instance: str,
) -> Dict:
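    """Generate parameters for ModelBuilder (SDK v3)"""

    # py_version was accepted but never used; forward it because image
    # lookups for some frameworks depend on it.
    container = image_uris.retrieve(
        framework=framework,
        region=os.environ.get("AWS_DEFAULT_REGION", "ca-central-1"),
        version=framework_version,
        py_version=py_version,
        image_scope="inference",
        instance_type=instance,
    )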
    
    model_builder_params = {
        "model_path": trained_model_data.strip(),
        "source_dir": f"{PROJECT_ROOT}/{source_dir}",
        "image_uri": container,
        "role": ROLE,
        "entry_point": script,
        "framework_version": framework_version,
    }
    
    return model_builder_params
```

## ✅ Testing Requirements

### Unit Tests
- [ ] Test `ModelTrainer` instantiation for all frameworks (sklearn, xgboost, pytorch, tensorflow)
- [ ] Validate `InputData` configuration
- [ ] Test `ModelBuilder` instantiation
- [ ] Validate endpoint creation and invocation
- [ ] Test serialization/deserialization
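Most of the unit tests above can run without AWS credentials if the SDK classes are injected rather than imported inside the function under test. The following is a minimal sketch, assuming a hypothetical wrapper `run_training` (not existing `mlmonitor` code) and using `unittest.mock` as a stand-in for `ModelTrainer`:

```python
from unittest.mock import MagicMock


def run_training(trainer_cls, trainer_params, input_data_list):
    """Instantiate the trainer and start training (hypothetical wrapper)."""
    trainer = trainer_cls(**trainer_params)
    return trainer.train(input_data_config=input_data_list)


def test_run_training_passes_params_and_channels():
    fake_trainer_cls = MagicMock(name="ModelTrainer")
    params = {"training_image": "example-image-uri", "role": "example-role-arn"}
    channels = ["train-channel", "validation-channel"]  # stand-ins for InputData objects

    result = run_training(fake_trainer_cls, params, channels)

    # The class was constructed once with exactly the generated parameters
    fake_trainer_cls.assert_called_once_with(**params)
    # train() received the channel list as input_data_config
    fake_trainer_cls.return_value.train.assert_called_once_with(input_data_config=channels)
    # The wrapper returns whatever train() returns
    assert result is fake_trainer_cls.return_value.train.return_value
```

The same injection pattern applies to the deployment path, with a mock in place of `ModelBuilder`.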

### Integration Tests
- [ ] End-to-end training: sklearn model
- [ ] End-to-end training: xgboost model
- [ ] End-to-end training: pytorch model
- [ ] End-to-end training: tensorflow model
- [ ] End-to-end deployment and scoring
- [ ] Watson OpenScale integration
- [ ] AI Factsheets integration

### Test Files to Update
- `mlmonitor/tests/aws_model_use_case/test_aws_model_config.py`
- `mlmonitor/tests/aws_model_use_case/test_aws_resources.py`

## 📋 Implementation Checklist

### Pre-Migration
- [ ] Create backup branch: `git checkout -b backup/sagemaker-v2`
- [ ] Document current SDK version: `pip freeze | grep sagemaker`
- [ ] Run all existing tests and save results
- [ ] Create feature branch: `git checkout -b feature/sagemaker-v3-migration`

### Core Migration
- [ ] Update `requirements.txt` or `setup.py` dependencies
- [ ] Update `mlmonitor/src/aws/__init__.py` imports
- [ ] Migrate `mlmonitor/src/aws/training.py`
- [ ] Migrate `mlmonitor/src/aws/train_sagemaker_job.py`
- [ ] Migrate `mlmonitor/src/aws/deployment.py`
- [ ] Migrate `mlmonitor/src/aws/deploy_sagemaker_endpoint.py`
- [ ] Review `mlmonitor/use_case_gcr/train_gcr.py` (may not need changes)
- [ ] Review `mlmonitor/use_case_gcr/inference_aws_gcr.py` (may not need changes)

### Testing Phase
- [ ] Run unit tests for training module
- [ ] Run unit tests for deployment module
- [ ] Run integration test: sklearn model
- [ ] Run integration test: xgboost model
- [ ] Run integration test: pytorch model
- [ ] Run integration test: tensorflow model
- [ ] Test Watson OpenScale payload logging
- [ ] Test AI Factsheets model tracking
- [ ] Update `examples/mlmonitor-sagemaker.ipynb` and verify it works

### Documentation
- [ ] Update `README.md` with SDK v3 requirements
- [ ] Update inline code documentation
- [ ] Add migration notes to CHANGELOG
- [ ] Update any architecture diagrams if needed

### Deployment
- [ ] Code review
- [ ] Merge to main branch
- [ ] Tag release: `v3.0.0-sagemaker-v3`
- [ ] Monitor first production deployment

## 🔙 Rollback Plan

### Immediate Rollback
```bash
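# Return to the pre-migration code; use backup/sagemaker-v2 instead if the
# migration has already been merged to main
git checkout main
# Quote the specifier so the shell does not expand the wildcard
pip install "sagemaker==2.*"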
```

### Partial Rollback
If some features work but others fail, use version detection:
```python
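from importlib.metadata import version

# Read the installed distribution version from package metadata so the check
# does not depend on a `sagemaker.__version__` attribute being present.
SDK_VERSION = int(version("sagemaker").split(".")[0])

if SDK_VERSION >= 3:
    from sagemaker.train import ModelTrainer
else:
    from sagemaker.estimator import Estimator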
```

### Data Integrity
- ✅ Model artifacts remain compatible (same S3 `.tar.gz` format)
- ✅ Endpoints can be managed with either SDK version
- ✅ No data migration needed
- ✅ IAM roles and permissions unchanged
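The artifact-compatibility point can be spot-checked on a downloaded `model.tar.gz` before redeploying. This is a minimal sketch using only the standard library; the helper name `list_artifact_members` and the local file path are illustrative assumptions:

```python
import tarfile
from typing import List


def list_artifact_members(archive_path: str) -> List[str]:
    """Return the sorted names of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

Calling `list_artifact_members("model.tar.gz")` on an artifact from a v2 job and on one from a v3 job should show the same layout that the inference script expects.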

## API Mapping Reference

### Training API

| SDK v2 | SDK v3 |
|--------|--------|
| `sagemaker.estimator.Estimator` | `sagemaker.train.ModelTrainer` |
| `sagemaker.pytorch.PyTorch` | `sagemaker.train.ModelTrainer` |
| `sagemaker.tensorflow.TensorFlow` | `sagemaker.train.ModelTrainer` |
| `sagemaker.sklearn.SKLearn` | `sagemaker.train.ModelTrainer` |
| `estimator.fit({"training": "s3://..."})` | `trainer.train(input_data_config=[InputData(...)])` |

### Deployment API

| SDK v2 | SDK v3 |
|--------|--------|
| `sagemaker.sklearn.SKLearnModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.xgboost.XGBoostModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.tensorflow.TensorFlowModel` | `sagemaker.serve.ModelBuilder` |
| `sagemaker.pytorch.PyTorchModel` | `sagemaker.serve.ModelBuilder` |
| `model.deploy(...)` | `model_builder.build()` |
| `predictor.predict(data)` | `endpoint.invoke(data)` |
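
The v2 column of these tables can double as a checklist for finding code that still needs migrating. Here is a minimal sketch that scans Python source for imports of the v2 classes with the standard-library `ast` module. The helper name `find_v2_imports` and the set `V2_CLASS_NAMES` are illustrative, not part of `mlmonitor`:

```python
import ast
from typing import List, Tuple

# v2 class names listed in this document
V2_CLASS_NAMES = {
    "Estimator", "PyTorch", "TensorFlow", "SKLearn",
    "SKLearnModel", "XGBoostModel", "TensorFlowModel", "PyTorchModel",
}


def find_v2_imports(source: str) -> List[Tuple[int, str]]:
    """Return sorted (line_number, class_name) pairs for v2 classes imported from sagemaker."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom) or not node.module:
            continue
        if node.module != "sagemaker" and not node.module.startswith("sagemaker."):
            continue
        for alias in node.names:
            if alias.name in V2_CLASS_NAMES:
                hits.append((node.lineno, alias.name))
    return sorted(hits)
```

Running it over the contents of each file under `mlmonitor/src/aws/` gives a quick list of remaining v2 imports. It only detects `from ... import ...` statements, so attribute-style usage such as `sagemaker.estimator.Estimator(...)` would need a separate check.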

## Benefits of SDK v3

- **Unified API**: Single `ModelTrainer` and `ModelBuilder` for all frameworks
- **Modular Architecture**: Separate packages for core, training, and serving
- **Better Structure**: Clearer separation of concerns
- **Object-Oriented**: Structured configs aligned with AWS APIs
- **Less Boilerplate**: Simplified workflows and reduced code duplication
- **Future-Proof**: Aligned with AWS's long-term SDK strategy


## Resources

- [SageMaker SDK v3 GitHub Repository](https://github.com/aws/sagemaker-python-sdk)
- [SageMaker V3 Examples](https://github.com/aws/sagemaker-python-sdk/tree/master/v3-examples)
- [Local Training Example](https://github.com/aws/sagemaker-python-sdk/blob/master/v3-examples/training-examples/local-training-example.ipynb)
- [SDK v3 Migration Guide](https://github.com/aws/sagemaker-python-sdk#key-benefits-of-3x)
- [SageMaker V2 Documentation](https://sagemaker.readthedocs.io/en/v2.x/)


Migrate SageMaker use case to v3 #10
