This guide demonstrates how to train a model on Google Kubernetes Engine (GKE) using Ray, PyTorch, and the Ray Operator add-on.
About Ray
Ray is an open-source scalable compute framework for AI/ML applications. Ray Train is a component within Ray designed for distributed model training and fine-tuning. You can use the Ray Train API to scale training across multiple machines and to integrate with machine learning libraries such as PyTorch.
You can deploy Ray training jobs using the RayCluster or RayJob resource. Use a RayJob resource when deploying Ray jobs in production, for the following reasons:
- The RayJob resource creates an ephemeral Ray cluster that can be automatically deleted when a job completes.
- The RayJob resource supports retry policies for resilient job execution.
- You can manage Ray jobs using familiar Kubernetes API patterns.
Prepare your environment
To prepare your environment, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking
Activate Cloud Shell. This launches a session in the bottom pane of the console.
Set environment variables:
export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=$(pwd)
Replace the following:
PROJECT_ID: your Google Cloud project ID.
CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
Clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
Change to the working directory:
cd kubernetes-engine-samples/ai-ml/gke-ray/raytrain/pytorch-mnist
Create a Python virtual environment:
python -m venv myenv && \
source myenv/bin/activate
Create a GKE cluster
Create an Autopilot or Standard GKE cluster:
Autopilot
Create an Autopilot cluster:
gcloud container clusters create-auto ${CLUSTER_NAME} \
--enable-ray-operator \
--cluster-version=${CLUSTER_VERSION} \
--location=${COMPUTE_REGION}
Standard
Create a Standard cluster:
gcloud container clusters create ${CLUSTER_NAME} \
--addons=RayOperator \
--cluster-version=${CLUSTER_VERSION} \
--machine-type=e2-standard-8 \
--location=${COMPUTE_ZONE} \
--num-nodes=4
Deploy a RayCluster resource
Deploy a RayCluster resource to your cluster:
Review the following manifest:
This manifest describes a RayCluster custom resource.
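The exact manifest is the ray-cluster.yaml file in the sample repository. As a rough sketch of the shape such a manifest takes (not the sample's exact contents; the image tag and resource requests here are illustrative assumptions, and the worker replica count matches the two workers shown in the verification output below):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: pytorch-mnist-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0   # assumption; the sample may pin a different image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
  workerGroupSpecs:
  - groupName: worker-group
    replicas: 2                          # two workers, as in the kubectl output below
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0    # assumption
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
```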
Apply the manifest to your GKE cluster:
kubectl apply -f ray-cluster.yaml
Verify the RayCluster resource is ready:
kubectl get raycluster
The output is similar to the following:
NAME                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
pytorch-mnist-cluster   2                 2                   6      20Gi     0      ready    63s
In this output, ready in the STATUS column indicates that the RayCluster resource is ready.
Connect to the RayCluster resource
Connect to the RayCluster resource to submit a Ray job.
Verify that GKE created the RayCluster Service:
kubectl get svc pytorch-mnist-cluster-head-svc
The output is similar to the following:
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
pytorch-mnist-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s
Establish a port-forwarding session to the Ray head:
kubectl port-forward svc/pytorch-mnist-cluster-head-svc 8265:8265 2>&1 >/dev/null &
Verify the Ray client can connect to the Ray cluster using localhost:
ray list nodes --address http://localhost:8265
The output is similar to the following:
Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                     NODE_IP      IS_HEAD_NODE   STATE   NODE_NAME    RESOURCES_TOTAL   LABELS
 0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2   10.28.1.21   False          ALIVE   10.28.1.21   CPU: 2.0          ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted
Train a model
Train a PyTorch model using the Fashion MNIST dataset:
Submit a Ray job and wait for the job to complete:
ray job submit \
    --submission-id pytorch-mnist-job \
    --working-dir . \
    --runtime-env-json='{"pip": ["torch", "torchvision"], "excludes": ["myenv"]}' \
    --address http://localhost:8265 \
    -- python train.py
The output is similar to the following:
Job submission server address: http://localhost:8265

--------------------------------------------
Job 'pytorch-mnist-job' submitted successfully
--------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs pytorch-mnist-job
  Query the status of the job:
    ray job status pytorch-mnist-job
  Request the job to be stopped:
    ray job stop pytorch-mnist-job

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
...
...
Verify the Job status:
ray job status pytorch-mnist-job
The output is similar to the following:
Job submission server address: http://localhost:8265

Status for job 'pytorch-mnist-job': RUNNING
Status message: Job is currently running.
Wait for the job status to change to SUCCEEDED. This can take 15 minutes or longer.

View the Ray job logs:
ray job logs pytorch-mnist-job
The output is similar to the following:
Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                 │
├─────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8 │
│ train_loop_config/epochs                     10 │
│ train_loop_config/lr                      0.001 │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result               │
├───────────────────────────────┤
│ checkpoint_dir_name           │
│ time_this_iter_s      25.7394 │
│ time_total_s          351.233 │
│ training_iteration         10 │
│ accuracy               0.8656 │
│ loss                  0.37827 │
╰───────────────────────────────╯

# Several lines omitted

-----------------------------------
Job 'pytorch-mnist-job' succeeded
-----------------------------------
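The training code in the logs above comes from train.py in the sample repository. As a minimal sketch (not the sample's exact code) of how such a script is typically structured with Ray Train's TorchTrainer: the hyperparameter names mirror the Training config table in the logs, while the model architecture and data-loading details are illustrative assumptions. Heavy imports are deferred into the functions so the sketch's structure can be read without Ray or PyTorch installed.

```python
def make_train_config(batch_size_per_worker, epochs, lr):
    """Hyperparameters matching the 'Training config' table in the job logs."""
    return {
        "batch_size_per_worker": batch_size_per_worker,
        "epochs": epochs,
        "lr": lr,
    }


def train_loop_per_worker(config):
    # Runs once on each Ray Train worker.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    import ray.train
    import ray.train.torch

    # Simple classifier for 28x28 Fashion MNIST images (illustrative architecture).
    model = nn.Sequential(
        nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
    )
    model = ray.train.torch.prepare_model(model)  # wraps the model for distributed training

    dataset = datasets.FashionMNIST(
        root="/tmp/data", train=True, download=True, transform=transforms.ToTensor()
    )
    loader = DataLoader(dataset, batch_size=config["batch_size_per_worker"], shuffle=True)
    loader = ray.train.torch.prepare_data_loader(loader)  # shards batches across workers

    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(config["epochs"]):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        ray.train.report({"loss": loss.item()})  # surfaces metrics in the job logs


def main():
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config=make_train_config(batch_size_per_worker=8, epochs=10, lr=0.001),
        scaling_config=ScalingConfig(num_workers=2),  # matches the two Ray workers above
    )
    trainer.fit()


if __name__ == "__main__":
    main()
```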
Deploy a RayJob
The RayJob custom resource manages the lifecycle of a RayCluster resource during the execution of a single Ray job.
Review the following manifest:
This manifest describes a RayJob custom resource.
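The exact manifest is the ray-job.yaml file in the sample repository. As a hedged sketch of the shape a RayJob manifest takes (not the sample's exact contents; the image tag, retry limit, and resource layout are illustrative assumptions), note how it embeds a cluster spec and sets the lifecycle fields that motivate using RayJob in production:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-mnist-job
spec:
  entrypoint: python train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - torchvision
  shutdownAfterJobFinishes: true   # delete the ephemeral RayCluster when the job completes
  backoffLimit: 2                  # illustrative retry policy
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # assumption; the sample may pin a different image
    workerGroupSpecs:
    - groupName: worker-group
      replicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0   # assumption
```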
Apply the manifest to your GKE cluster:
kubectl apply -f ray-job.yaml
Verify the RayJob resource is running:
kubectl get rayjob
The output is similar to the following:
NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
pytorch-mnist-job   RUNNING      Running             2024-06-19T15:43:32Z              2m29s
In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Running.

View the RayJob resource status:
kubectl logs -f -l job-name=pytorch-mnist-job
The output is similar to the following:
Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                 │
├─────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker       8 │
│ train_loop_config/epochs                     10 │
│ train_loop_config/lr                      0.001 │
╰─────────────────────────────────────────────────╯

# Several lines omitted

Training finished iteration 10 at 2024-06-19 08:29:36. Total running time: 9min 18s
╭───────────────────────────────╮
│ Training result               │
├───────────────────────────────┤
│ checkpoint_dir_name           │
│ time_this_iter_s      25.7394 │
│ time_total_s          351.233 │
│ training_iteration         10 │
│ accuracy               0.8656 │
│ loss                  0.37827 │
╰───────────────────────────────╯

# Several lines omitted

-------------------------------
Job 'pytorch-mnist' succeeded
-------------------------------
Verify the Ray job is complete:
kubectl get rayjob
The output is similar to the following:
NAME                JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
pytorch-mnist-job   SUCCEEDED    Complete            2024-06-19T15:43:32Z   2024-06-19T15:51:12Z   9m6s
In this output, the DEPLOYMENT STATUS column indicates that the RayJob resource is Complete.