Managed Training supports a variety of machine types to accommodate different workloads. You can choose from the following options when configuring your cluster node pools:
- a4-highgpu-8g
- a3-ultragpu-8g
- a3-megagpu-8g
- n2 CPU family
Capacity provisioning
Choosing the right provisioning model is critical for balancing cost, speed, and resource availability. See the following provisioning options:
- RESERVATION: Allocates nodes from a specific Compute Engine reservation that you've created in advance. This model ensures capacity and is the recommended choice for high-demand resources.
- FLEX_START: Utilizes the Dynamic Workload Scheduler to queue your job. The job begins automatically as soon as the requested compute resources become available, offering a flexible start time without requiring a reservation.
- SPOT: Provisions the node pool using Spot VMs. This is the most cost-effective option, but it should only be used for workloads that are fault-tolerant and can handle interruptions, as the VMs may be preempted at any time.
- ON_DEMAND: This is the default option for CPU-only node pools and is best suited for machine types that are not scarce. It provides standard VM instances with predictable, pay-as-you-go pricing.
Use the following guidance to make your selection:
- For high-demand GPU resources (like A3 and A4): The RESERVATION model is strongly recommended. It ensures you have dedicated access to the capacity you need for critical training jobs.
- For bursty or flexible workloads: Consider FLEX_START or SPOT. FLEX_START queues your job until resources are available, while SPOT offers significant cost savings for fault-tolerant jobs that can handle preemption.
- For abundant machine types: The ON_DEMAND model is the preferred choice. Use it for machine types that are not scarce and where immediate availability isn't a concern.
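
The selection guidance can be read as a small decision procedure. The Python sketch below is illustrative only: the function `suggest_provisioning_model`, its parameters, and the `SCARCE_PREFIXES` tuple are hypothetical names and not part of any Managed Training API. Treating every `a3-` and `a4-` machine type as scarce, and preferring SPOT over FLEX_START for fault-tolerant jobs, are simplifying assumptions.

```python
from enum import Enum


class ProvisioningModel(Enum):
    """The four provisioning options described in this section."""
    RESERVATION = "RESERVATION"
    FLEX_START = "FLEX_START"
    SPOT = "SPOT"
    ON_DEMAND = "ON_DEMAND"


# Assumption: machine types with these prefixes count as high-demand GPU
# capacity, following the A3 and A4 examples in the guidance.
SCARCE_PREFIXES = ("a3-", "a4-")


def suggest_provisioning_model(machine_type, has_reservation=False,
                               fault_tolerant=False):
    """Return a suggested provisioning model for a node pool.

    Hypothetical helper that mirrors the guidance in this section; it does
    not call any cloud API.
    """
    is_scarce = machine_type.startswith(SCARCE_PREFIXES)
    if not is_scarce:
        # Abundant machine types: ON_DEMAND is the preferred choice.
        return ProvisioningModel.ON_DEMAND
    if has_reservation:
        # High-demand GPUs with capacity reserved in advance.
        return ProvisioningModel.RESERVATION
    if fault_tolerant:
        # The job can handle preemption, so Spot VMs are an option.
        return ProvisioningModel.SPOT
    # Otherwise, queue the job until capacity becomes available.
    return ProvisioningModel.FLEX_START


if __name__ == "__main__":
    print(suggest_provisioning_model("a3-megagpu-8g", has_reservation=True).value)
```

For example, `suggest_provisioning_model("a4-highgpu-8g")` with no reservation and no fault tolerance falls through to FLEX_START, which matches the "flexible workloads" case in the guidance.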
Using a shared reservation (optional)
If you'd like to use a shared reservation, rather than a local reservation, there are additional steps to take before you can create a cluster.
Before using a shared reservation in Managed Training, make sure the shared reservation works by manually creating a VM that uses the shared reservation. If this VM creation works, move on to the next step.
In the cluster creation configuration, use the reservation name in the following format:
projects/RESERVATION_HOST_PROJECT_ID/zones/RESERVATION_ZONE/reservations/RESERVATION_NAME
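
As a minimal sketch of assembling that reservation name, the Python helper below fills in the three placeholders. The function name `shared_reservation_path` and its parameter names are hypothetical. The check that rejects empty values and values containing `/` is an added safeguard, not a documented validation rule.

```python
def shared_reservation_path(host_project_id, zone, reservation_name):
    """Build the fully qualified name of a shared reservation.

    Hypothetical helper: it only formats the string in the pattern shown
    in this section and does not verify that the reservation exists.
    """
    parts = {
        "host_project_id": host_project_id,
        "zone": zone,
        "reservation_name": reservation_name,
    }
    for label, value in parts.items():
        # Assumption: a component is never empty and never contains "/",
        # since a slash would change the structure of the resource path.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return (
        f"projects/{host_project_id}"
        f"/zones/{zone}"
        f"/reservations/{reservation_name}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(shared_reservation_path("host-project", "us-central1-a", "my-reservation"))
```

Passing the host project's ID, rather than the ID of the project that consumes the reservation, is what makes the name resolve to the shared reservation.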