This repository contains Infrastructure as Code (IaC) and code definitions for training ML models using Ray on Kubernetes. The setup includes distributed training capabilities and secure access to model weights and datasets through MinIO.
Prerequisites:
- Access to the Kubernetes cluster
- kubectl configured with the correct context
- uv package manager installed
The training infrastructure consists of:
- A Ray cluster with 1 head node and 2 worker nodes
- Cilium Network Policies for secure communication
- Integration with internal MinIO for model weights and datasets storage
Install dependencies:

    make install

Deploy the Ray cluster and network policies:

    make deploy

To refresh the deployment (delete and redeploy):

    make refresh

Run this whenever the Ray worker image has been updated.
Start a training job:

    make train

This command will:
- Install dependencies
- Set up port forwarding to the Ray cluster
- Run the training script
- Automatically clean up port forwarding on completion
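The port-forward lifecycle in the steps above (start, run the job, always clean up) can be sketched in Python with a context manager. This is a minimal illustration and not the repository's Makefile logic; `background_process`, `run_with_port_forward`, and the startup delay are hypothetical names and values chosen for the example.

```python
import contextlib
import subprocess
import sys
import time


@contextlib.contextmanager
def background_process(cmd, startup_delay=1.0):
    """Run cmd in the background and always terminate it on exit."""
    proc = subprocess.Popen(cmd)
    try:
        # Crude wait so the process has a chance to start (or fail fast).
        time.sleep(startup_delay)
        if proc.poll() is not None:
            raise RuntimeError(f"{cmd[0]} exited early with code {proc.returncode}")
        yield proc
    finally:
        # Runs on success and on error, so the forward never lingers.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()


def run_with_port_forward(forward_cmd, script_path):
    """Keep forward_cmd alive while script_path runs with this interpreter."""
    with background_process(forward_cmd):
        subprocess.run([sys.executable, script_path], check=True)
```

Here `forward_cmd` would be whatever port-forward command the Makefile runs (for example a `kubectl port-forward ...` invocation), and `script_path` the training script; both are left as parameters because their actual values are not stated here.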
Port forwarding can also be managed manually:

    make start_port_forward

This makes the Ray dashboard accessible on localhost:8265 and the Hubble dashboard accessible on localhost:8080.
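To confirm that the forwarded ports are actually reachable, a quick TCP check such as the following may help. It is a sketch using only the standard library; `port_is_open` is a hypothetical helper, and the ports are the two mentioned above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in {"Ray dashboard": 8265, "Hubble dashboard": 8080}.items():
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {state}")
```

A successful connection only shows that something is listening on the port; it does not verify that the dashboard itself is healthy.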
To stop port forwarding:

    make stop_port_forward

Model weights and datasets are stored in the internal MinIO deployment, accessible at minio.minio-internal.svc.cluster.local. The training setup includes utilities for copying data between S3 and local storage:
    import pyarrow.fs

    # Connect to the in-cluster MinIO endpoint over plain HTTP.
    fs = pyarrow.fs.S3FileSystem(
        access_key="minioadmin",
        secret_key="minioadmin",
        endpoint_override="minio.minio-internal.svc.cluster.local:9000",
        scheme="http",
    )

All Dockerfiles in the training-infra/dockerfiles repo are built externally and pushed under the name of the Dockerfile. Any update to these files in a PR automatically triggers a build and push of the images. Helm templates can then reference them as, e.g., localhost:5001/<dockerfile_name>:{{ .Values.environment }}.
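As a small illustration of the image reference pattern above, the helper below assembles the same string in Python. `image_ref` is a hypothetical function, and the example values `training` and `dev` are placeholders, not names taken from this repository.

```python
def image_ref(dockerfile_name, environment, registry="localhost:5001"):
    """Return an image reference of the form <registry>/<dockerfile_name>:<environment>."""
    if not dockerfile_name or not environment:
        raise ValueError("dockerfile_name and environment must be non-empty")
    return f"{registry}/{dockerfile_name}:{environment}"


# Placeholder values for illustration only.
print(image_ref("training", "dev"))
```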
The training script automatically handles data and model transfer between MinIO and local storage.
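A transfer of this kind might look like the sketch below, which is not the repository's actual utility. It assumes `pyarrow` is installed; the `MINIO_*` environment variable names, the helper names, and any bucket path passed in are hypothetical. The defaults mirror the connection values shown earlier, and `pyarrow.fs.copy_files` performs the copy.

```python
import os


def minio_settings(env=None):
    """Build keyword arguments for pyarrow.fs.S3FileSystem.

    The MINIO_* variable names are hypothetical; the defaults mirror the
    connection values shown earlier in this document.
    """
    env = os.environ if env is None else env
    return {
        "access_key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "secret_key": env.get("MINIO_SECRET_KEY", "minioadmin"),
        "endpoint_override": env.get(
            "MINIO_ENDPOINT", "minio.minio-internal.svc.cluster.local:9000"
        ),
        "scheme": "http",
    }


def download_dir(remote_path, local_dir, settings=None):
    """Copy remote_path ("bucket/prefix") from MinIO into local_dir."""
    # Imported here so minio_settings() stays usable without pyarrow.
    import pyarrow.fs

    s3 = pyarrow.fs.S3FileSystem(**(settings or minio_settings()))
    os.makedirs(local_dir, exist_ok=True)
    pyarrow.fs.copy_files(
        remote_path,
        local_dir,
        source_filesystem=s3,
        destination_filesystem=pyarrow.fs.LocalFileSystem(),
    )
```

Reading the credentials from the environment keeps them out of the source while leaving the defaults intact. Uploading results would follow the same pattern with the source and destination filesystems swapped.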