Train your first Machine Learning model using AWS cloud tools.
- Machine Learning workflow basics
- How cloud notebooks work
- How to train and evaluate a model
- How cloud storage is used in ML
Amazon S3 (cloud storage) is used to store:
- Datasets
- Training outputs
- Saved models
Amazon SageMaker is used to:
- Write ML code
- Train models
- Test models
- Run ML notebooks in the cloud
Think of the workflow like this:
- Collect Data
- Store Data in Cloud
- Train Model
- Test Model
- Save Model
💡 Optional Practice – Before You Start: Draw the workflow on paper as a diagram with arrows connecting each step. Try to explain it out loud to a friend or family member like you're teaching them. Teaching something is one of the best ways to understand it!
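The five steps above can also be sketched end-to-end in plain Python with scikit-learn, before any cloud setup. This is just an illustrative sketch (the file name and model choice are examples, not part of the AWS steps that follow):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib

# 1. Collect data (here: the built-in Iris dataset)
X, y = load_iris(return_X_y=True)

# 2. Store data — locally in this sketch; in the cloud this is your S3 bucket

# 3. Train model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# 4. Test model
score = model.score(X_test, y_test)
print("Accuracy:", score)

# 5. Save model (example filename)
joblib.dump(model, "workflow_sketch_model.pkl")
```

The rest of this guide walks through the same five steps, but running in the cloud instead of on your own machine.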
- Go to AWS website
- Click Create Account
- Enter email and password
- Add a payment method (the Free Tier still requires a card)
- Verify phone number
- Choose Free Tier plan
- Log into AWS
- In the search bar, type: `SageMaker`
- Click SageMaker service
- Click Studio
- Click Open Studio
Wait for it to load.
- In the AWS search bar, type: `S3`
- Click S3 service
- Click Create Bucket
- Enter a bucket name, e.g. `ml-project-3-yourname` (bucket names must be lowercase and globally unique)
- Leave settings default
- Click Create Bucket
🔮 Optional Practice – Bucket Explorer: After creating your bucket, create a second bucket with a different name and try uploading a regular text file (like a `.txt` file with a fun message) into it. Then delete that second bucket. This builds comfort with S3 before your real data goes in!
Inside SageMaker Studio:
- Click File
- Click New
- Click Notebook
- Choose Python 3 kernel
- Click Create
Good beginner dataset:
- Iris dataset (CSV format)
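If you don't already have an Iris CSV, scikit-learn can generate one for you. This is a hedged sketch: the column renames and the filename `your_file.csv` are chosen to match the code later in this guide, so adjust them if yours differ:

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
data = iris.frame  # features plus a numeric "target" column

# Rename columns to the short names used later in this tutorial
data.columns = ["sepal_length", "sepal_width",
                "petal_length", "petal_width", "target"]

# Replace numeric labels (0/1/2) with species names
data["target"] = data["target"].map(dict(enumerate(iris.target_names)))

data.to_csv("your_file.csv", index=False)
print(data.shape)  # 150 flowers, 5 columns
```

Upload the resulting `your_file.csv` to your bucket in the next step.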
- Open your bucket
- Click Upload
- Select CSV file
- Click Upload
Run in the notebook:

```
%pip install sagemaker pandas numpy scikit-learn
```

```python
import pandas as pd
import numpy as np

# If your CSV is in S3, you can read it directly (requires the s3fs package):
# data = pd.read_csv("s3://ml-project-3-yourname/your_file.csv")
data = pd.read_csv("your_file.csv")
print(data.head())
```

📊 Optional Practice – Explore Your Data: Before moving on, try running these extra lines to get to know your dataset better:

```python
print(data.shape)          # How many rows and columns?
print(data.describe())     # Min, max, average of each column
print(data.isnull().sum()) # Are there any missing values?
```

See if you can answer: How many flowers are in the Iris dataset? How many different species are there?
```python
from sklearn.model_selection import train_test_split

X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2
)
```

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
```

🌲 Optional Practice – Change the Forest Size: The `RandomForestClassifier` has a setting called `n_estimators` which controls how many decision trees it uses. Try changing it and see if your accuracy improves:

```python
model = RandomForestClassifier(n_estimators=10)   # small forest
model = RandomForestClassifier(n_estimators=100)  # medium forest
model = RandomForestClassifier(n_estimators=500)  # big forest
```

Train and test each one. Does more trees always mean better accuracy?
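One way to run the forest-size experiment is a simple loop. This self-contained sketch uses the built-in Iris data and a fixed `random_state` (an assumption, added so reruns give the same split) so the comparison is fair:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for n in [10, 100, 500]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    results[n] = model.score(X_test, y_test)
    print(f"{n} trees -> accuracy {results[n]:.3f}")
```

On a tiny dataset like Iris you'll often see the accuracy plateau quickly: more trees cost more training time without improving the score.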
```python
predictions = model.predict(X_test)
```

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
```

If accuracy is close to:
- `1.0` – very good
- `0.5` – average
- Below `0.5` – needs improvement
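A caveat worth knowing: what counts as "average" depends on how many classes your data has. Iris has three species, so blind guessing lands near 0.33, not 0.5. scikit-learn's `DummyClassifier` gives you a real baseline to beat (a sketch using the built-in Iris data):

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Always predicts the most common class it saw during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
print("Baseline accuracy:", baseline_acc)
```

If your real model can't beat this baseline, something is wrong with the features or the training.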
🏆 Optional Practice – Try a Different Model: You used a Random Forest, but there are other models too! Try swapping it out and compare their accuracy scores:

```python
# Option 1: Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# Option 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
```

Train each one and record the accuracy. Which model wins on the Iris dataset? Make a little leaderboard in a comment in your notebook!
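Instead of swapping models by hand, you can let a loop build the leaderboard for you. This is a self-contained sketch (the `random_state` and `max_iter` values are assumptions added for reproducibility and convergence):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

leaderboard = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    leaderboard[name] = clf.score(X_test, y_test)

# Print the leaderboard, best model first
for name, acc in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

On Iris most of these models score very close to each other, which is itself a useful lesson: on easy datasets, model choice matters less than you might expect.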
Usually models are saved:
- Locally
- Or back into S3 storage
💾 Optional Practice – Actually Save Your Model: Try saving and reloading your trained model using `joblib`:

```python
import joblib

# Save the model to a file
joblib.dump(model, "my_first_model.pkl")
print("Model saved!")

# Load it back and make a prediction
loaded_model = joblib.load("my_first_model.pkl")
print("Model loaded! Accuracy:",
      accuracy_score(y_test, loaded_model.predict(X_test)))
```

If the accuracy matches your original, the save worked perfectly. Congrats – you just built and preserved your first ML model! 🎉
Inside SageMaker:
- Stop notebook instance
Check:
- Running training jobs
- Stop any active jobs
⚠️ Important: AWS can charge if resources keep running. Always clean up when done.
These are completely optional but a great way to level up your skills!
Use your trained model to predict just one new flower you make up:
```python
import numpy as np

# A made-up flower: sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)

# predict() returns an array, so take the first element
print("This flower is probably a:", prediction[0])
```

Try changing the numbers. Does the predicted species change?
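You can also ask the model how confident it is, not just what it picked. A sketch using `predict_proba` on a model trained on the built-in Iris data (self-contained so it runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = load_iris()
model = RandomForestClassifier(random_state=42)
model.fit(iris.data, iris.target)

# Same made-up flower as above
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# One probability per species, summing to 1.0
probs = model.predict_proba(new_flower)[0]
for species, p in zip(iris.target_names, probs):
    print(f"{species}: {p:.2f}")
```

A flower that sits between two species will show split probabilities (like 0.55 vs 0.45), which `predict` alone would hide.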
See your data as a chart before training:
```python
import matplotlib.pyplot as plt

# Plot two features against each other, colored by species
colors = {"setosa": "red", "versicolor": "blue", "virginica": "green"}
for species, group in data.groupby("target"):
    plt.scatter(group["sepal_length"], group["petal_length"],
                label=species, color=colors.get(species))
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
plt.title("Iris Flowers by Species")
plt.show()
```

Can you see natural clusters forming? That's what the model is learning!
The Iris dataset is great for learning, but try one of these next:
| Dataset | What it predicts | Where to find it |
|---|---|---|
| Titanic | Who survived | kaggle.com |
| Penguins | Penguin species | seaborn.load_dataset("penguins") |
| Wine Quality | Wine rating | UCI ML Repository |
| Digits | Handwritten numbers | sklearn.datasets.load_digits() |
Load a new dataset and repeat the full workflow from Step 11 onwards. You already know how!
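For instance, the Digits dataset loads straight from scikit-learn with no download. A hedged sketch of the same workflow on it (the `random_state` is just an example value):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
print("Samples and features:", digits.data.shape)  # 1797 images, 64 pixels each

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Digits accuracy:", model.score(X_test, y_test))
```

Same split, same fit, same score call — only the data changed. That's the whole point of learning the workflow once.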
A confusion matrix shows you where your model gets confused, not just how accurate it is overall:
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.title("Where Does My Model Get Confused?")
plt.show()
```

Each row is the real answer. Each column is what the model guessed. Diagonal squares = correct! Off-diagonal = mistakes. Which species confuses your model the most?
🎉 You Did It! You've gone from zero to training, evaluating, and saving a real Machine Learning model in the cloud. Every data scientist started exactly where you are right now.