Before you can deploy your first cluster on Managed Training, you must configure your Google Cloud project and environment. This guide covers all the necessary prerequisites, which fall into three main categories:
- Project Access: Gaining access to the service, which is by invitation only.
- Resource Configuration: Enabling APIs and setting up the required VPC network and storage services.
- User Permissions: Granting the necessary IAM roles for cluster management and resource access.
Completing these steps prepares your project for a successful deployment.
Prerequisites
To use Managed Training, you must:
- Allowlist your project by contacting your sales representative for access.
- Obtain capacity for GPU clusters in supported regions.
- Enable the necessary APIs, including the Compute Engine, Filestore, Cloud Storage, Managed Lustre (optional), Hypercomputer Configuration Service, and Vertex AI APIs.
- Configure networking by ensuring that an existing network meets specific conditions (for example, Private Google Access and firewall rules) or by creating a new VPC network and subnetwork.
- Configure storage by creating a zonal or regional Filestore instance to serve as the /home directory and optionally configuring a Google Cloud Managed Lustre instance.
- Grant IAM permissions to users for cluster management, storage access, and SSH access to cluster nodes, as described in the IAM permissions section.
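The networking and storage prerequisites above could be sketched with gcloud roughly as follows. This is a minimal sketch, not the service's documented procedure: the resource names, the subnet range, the zone, the Filestore tier, the file share name, and the capacity are all illustrative assumptions, and the firewall rules mentioned above are not covered because this guide does not specify them. The `run` helper and the `DRY_RUN` convention are hypothetical additions; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
# All values below are placeholders (assumptions); replace them with your own.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
REGION="${REGION:-us-central1}"              # must be a supported region
ZONE="${ZONE:-us-central1-a}"                # assumed zone inside REGION
NETWORK="${NETWORK:-training-net}"
SUBNET="${SUBNET:-training-subnet}"
SUBNET_RANGE="${SUBNET_RANGE:-10.0.0.0/20}"  # assumed CIDR range
FILESTORE_NAME="${FILESTORE_NAME:-training-home}"
DRY_RUN="${DRY_RUN:-1}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_network_and_storage() {
  # New VPC network with manually managed subnets.
  run gcloud compute networks create "$NETWORK" \
    --project="$PROJECT_ID" --subnet-mode=custom

  # Subnetwork with Private Google Access enabled.
  run gcloud compute networks subnets create "$SUBNET" \
    --project="$PROJECT_ID" --network="$NETWORK" --region="$REGION" \
    --range="$SUBNET_RANGE" --enable-private-ip-google-access

  # Zonal Filestore instance intended to back /home.
  # Tier, share name, and capacity are assumptions, not documented requirements.
  run gcloud filestore instances create "$FILESTORE_NAME" \
    --project="$PROJECT_ID" --zone="$ZONE" --tier=ZONAL \
    --file-share=name=home,capacity=1TB --network=name="$NETWORK"
}

setup_network_and_storage
```

Review the printed commands first, then rerun with `DRY_RUN=0` to execute them against your project.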
Supported regions
- us-central1
- us-east1
- us-east4
- us-east5
- us-south1
- us-west1
- us-west4
- asia-southeast1
- europe-west1
- europe-west4
- europe-north1
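When scripting a deployment, it can help to check a region against this list before requesting capacity. The following is a small sketch of such a check; `is_supported_region` and the `REGION` variable are hypothetical names introduced here, not part of any Google Cloud tool, and the list is simply copied from this page.

```shell
#!/bin/sh
# Regions copied from the "Supported regions" list on this page.
SUPPORTED_REGIONS="us-central1 us-east1 us-east4 us-east5 us-south1 us-west1 us-west4 asia-southeast1 europe-west1 europe-west4 europe-north1"

# is_supported_region REGION
# Returns 0 if REGION appears in SUPPORTED_REGIONS, 1 otherwise.
is_supported_region() {
  candidate="$1"
  for region in $SUPPORTED_REGIONS; do
    if [ "$region" = "$candidate" ]; then
      return 0
    fi
  done
  return 1
}

# REGION is an illustrative variable; it defaults to us-central1 here.
REGION="${REGION:-us-central1}"
if is_supported_region "$REGION"; then
  echo "$REGION is supported"
else
  echo "$REGION is not in the supported list" >&2
fi
```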
IAM permissions
- Grant the roles/aiplatform.admin role to users who will manage Managed Training clusters.
- Grant the roles/aiplatform.viewer role to users who only need to view clusters and their configurations.

Grant the following IAM roles to the user or service account that will manage (create, delete, and update) Managed Training clusters:

| Role Name | Role ID |
| --- | --- |
| Compute Instance Admin (v1) | roles/compute.instanceAdmin.v1 |
| Logs Writer | roles/logging.logWriter |
| Monitoring Metric Writer | roles/monitoring.metricWriter |
| Service Account User | roles/iam.serviceAccountUser |
| Service Networking Admin | roles/servicenetworking.networksAdmin |

To allow the cluster's nodes to read from and write to Cloud Storage buckets using Google Cloud Storage FUSE, grant the Storage Object User role (roles/storage.objectUser) to the service account used by the VMs.

For SSH access to the Slurm login nodes, grant the following permissions:

| Permissions | Descriptions | Purpose |
| --- | --- | --- |
| Compute OS Login | Sign in to a VM as a standard (non-administrator) user. If sudo is needed, use Compute OS Admin Login instead. | SSH to the deployed login node |
| IAP-secured Tunnel User | Access Tunnel resources which use Identity-Aware Proxy. | SSH to the deployed login node |
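One way to grant the cluster-management roles listed above is a project-level policy binding per role. The sketch below assumes the roles are granted on the project, which this guide does not state explicitly. `PROJECT_ID` and `MEMBER` are placeholders, and `run`, `grant_roles`, and the `DRY_RUN` switch are hypothetical helpers introduced for illustration; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
# Placeholders (assumptions); replace them with your own values.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
MEMBER="${MEMBER:-user:admin@example.com}"   # or serviceAccount:NAME@...
DRY_RUN="${DRY_RUN:-1}"

# Role IDs taken from the cluster-management table on this page.
CLUSTER_MANAGER_ROLES="roles/compute.instanceAdmin.v1 roles/logging.logWriter roles/monitoring.metricWriter roles/iam.serviceAccountUser roles/servicenetworking.networksAdmin"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# grant_roles MEMBER ROLE...
# Adds one project-level IAM policy binding per role for MEMBER.
grant_roles() {
  member="$1"
  shift
  for role in "$@"; do
    run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="$member" --role="$role"
  done
}

# $CLUSTER_MANAGER_ROLES is unquoted on purpose so each role is one argument.
grant_roles "$MEMBER" $CLUSTER_MANAGER_ROLES
```

The same helper could be reused for the other roles on this page, for example `grant_roles "$MEMBER" roles/aiplatform.viewer` for a read-only user.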
Enable APIs
Enable the Compute Engine API:
gcloud services enable compute.googleapis.com
Enable the Service Networking API, which is needed because Filestore must be deployed before you create the cluster:
gcloud services enable servicenetworking.googleapis.com
Enable the Cloud Storage API:
gcloud services enable storage.googleapis.com
Enable the Lustre API (if using Lustre):
gcloud services enable lustre.googleapis.com
Enable the Hypercomputer Configuration Service (HCS) API:
gcloud services enable hypercomputecluster.googleapis.com
Enable the Vertex AI API:
gcloud services enable aiplatform.googleapis.com
Enable the Cloud Resource Manager API:
gcloud services enable cloudresourcemanager.googleapis.com
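Because `gcloud services enable` accepts several service names at once, the commands above can also be combined into a single call. The sketch below does that for the APIs listed in this section and treats Lustre as opt-in. `PROJECT_ID` is a placeholder, and `run`, `enable_apis`, `USE_LUSTRE`, and `DRY_RUN` are hypothetical names introduced for illustration; with `DRY_RUN=1` (the default here) the command is only printed.

```shell
#!/bin/sh
# Placeholder (assumption); replace with your own project ID.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
DRY_RUN="${DRY_RUN:-1}"
USE_LUSTRE="${USE_LUSTRE:-0}"   # set to 1 to also enable the Lustre API

# Service names taken from the individual commands in this section.
REQUIRED_APIS="compute.googleapis.com servicenetworking.googleapis.com storage.googleapis.com hypercomputecluster.googleapis.com aiplatform.googleapis.com cloudresourcemanager.googleapis.com"
LUSTRE_API="lustre.googleapis.com"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# enable_apis: enable all required APIs (plus Lustre if requested) in one call.
enable_apis() {
  apis="$REQUIRED_APIS"
  if [ "$USE_LUSTRE" = "1" ]; then
    apis="$apis $LUSTRE_API"
  fi
  # $apis is unquoted on purpose so each API becomes its own argument.
  run gcloud services enable $apis --project="$PROJECT_ID"
}

enable_apis
```

After running with `DRY_RUN=0`, `gcloud services list --enabled --project=PROJECT_ID` can be used to confirm which APIs are active.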
What's next
For a detailed guide on creating a Managed Training cluster and running your AI/ML workloads, contact your sales representative.