This document provides a comprehensive overview of the compute resources that are available for Cluster Director clusters, including CPU-based machines, GPU-accelerated machines, and the pre-configured OS images. You use these resources to deploy and manage the underlying infrastructure for high performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads.
CPU-based machines
Cluster Director supports general-purpose machine series, which are primarily CPU-based and suitable for HPC workloads and cluster management roles. These machines are ideal for cluster components that don't require GPU acceleration, such as login nodes, or for running CPU-bound computational tasks.
N2 machine series
Cluster Director supports the N2 machine series. This series offers a balance of price and performance suitable for a variety of workloads, such as CPU-bound computational tasks or general-purpose HPC workloads.
The N2 machine types are differentiated by the amount of memory configured per vCPU. These memory configurations are as follows:
standard: 4 GB of system memory per vCPU
highmem: 8 GB of system memory per vCPU
highcpu: 1 GB of system memory per vCPU
Each family offers a range of machine sizes, from 2 vCPUs up to 128 vCPUs (up to 96 vCPUs for high-cpu). For login nodes, you can only use N2 standard machine types with 32 or fewer vCPUs. Otherwise, you can use any N2 machine type. The following tables summarize the available N2 machine types in Cluster Director:
N2 standard
| Machine types | vCPUs* | Memory (GB) | Default egress bandwidth (Gbps)‡ | Tier_1 egress bandwidth (Gbps)# |
|---|---|---|---|---|
| n2-standard-2 | 2 | 8 | Up to 10 | N/A |
| n2-standard-4 | 4 | 16 | Up to 10 | N/A |
| n2-standard-8 | 8 | 32 | Up to 16 | N/A |
| n2-standard-16 | 16 | 64 | Up to 32 | N/A |
| n2-standard-32 | 32 | 128 | Up to 32 | Up to 50 |
| n2-standard-48 | 48 | 192 | Up to 32 | Up to 50 |
| n2-standard-64 | 64 | 256 | Up to 32 | Up to 75 |
| n2-standard-80 | 80 | 320 | Up to 32 | Up to 100 |
| n2-standard-96 | 96 | 384 | Up to 32 | Up to 100 |
| n2-standard-128 | 128 | 512 | Up to 32 | Up to 100 |
* A vCPU is implemented as a single hardware thread, or logical core, on one of the available CPU platforms.
‡ Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information, see Network bandwidth.
# Supports high-bandwidth networking for larger machine types. For Windows OS images, the maximum network bandwidth is limited to 50 Gbps.
N2 high-mem
| Machine types | vCPUs* | Memory (GB) | Default egress bandwidth (Gbps)‡ | Tier_1 egress bandwidth (Gbps)# |
|---|---|---|---|---|
| n2-highmem-2 | 2 | 16 | Up to 10 | N/A |
| n2-highmem-4 | 4 | 32 | Up to 10 | N/A |
| n2-highmem-8 | 8 | 64 | Up to 16 | N/A |
| n2-highmem-16 | 16 | 128 | Up to 32 | N/A |
| n2-highmem-32 | 32 | 256 | Up to 32 | Up to 50 |
| n2-highmem-48 | 48 | 384 | Up to 32 | Up to 50 |
| n2-highmem-64 | 64 | 512 | Up to 32 | Up to 75 |
| n2-highmem-80 | 80 | 640 | Up to 32 | Up to 100 |
| n2-highmem-96 | 96 | 768 | Up to 32 | Up to 100 |
| n2-highmem-128 | 128 | 864 | Up to 32 | Up to 100 |
* A vCPU is implemented as a single hardware thread, or logical core, on one of the available CPU platforms.
‡ Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information, see Network bandwidth.
# Supports high-bandwidth networking for larger machine types. For Windows OS images, the maximum network bandwidth is limited to 50 Gbps.
N2 high-cpu
| Machine types | vCPUs* | Memory (GB) | Default egress bandwidth (Gbps)‡ | Tier_1 egress bandwidth (Gbps)# |
|---|---|---|---|---|
| n2-highcpu-2 | 2 | 2 | Up to 10 | N/A |
| n2-highcpu-4 | 4 | 4 | Up to 10 | N/A |
| n2-highcpu-8 | 8 | 8 | Up to 16 | N/A |
| n2-highcpu-16 | 16 | 16 | Up to 32 | N/A |
| n2-highcpu-32 | 32 | 32 | Up to 32 | Up to 50 |
| n2-highcpu-48 | 48 | 48 | Up to 32 | Up to 50 |
| n2-highcpu-64 | 64 | 64 | Up to 32 | Up to 75 |
| n2-highcpu-80 | 80 | 80 | Up to 32 | Up to 100 |
| n2-highcpu-96 | 96 | 96 | Up to 32 | Up to 100 |
* A vCPU is implemented as a single hardware thread, or logical core, on one of the available CPU platforms.
‡ Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information, see Network bandwidth.
# Supports high-bandwidth networking for larger machine types. For Windows OS images, the maximum network bandwidth is limited to 50 Gbps.
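To confirm which of these N2 machine types are offered in a given zone before you size a cluster, you can query the Compute Engine API. The following Python sketch uses the google-cloud-compute client library; the project ID and zone are placeholder values, and the result is the full Compute Engine list, which you still need to check against the Cluster Director restrictions described earlier (for example, the 32-vCPU limit for login nodes).

```python
# Minimal sketch: list the N2 machine types that Compute Engine offers in one
# zone, using the google-cloud-compute client library.
# "my-project" and "us-central1-a" are placeholder values.
from google.cloud import compute_v1


def list_n2_machine_types(project_id: str, zone: str) -> None:
    client = compute_v1.MachineTypesClient()
    # The API returns every machine type in the zone; filter client-side to N2.
    for machine_type in client.list(project=project_id, zone=zone):
        if machine_type.name.startswith("n2-"):
            print(f"{machine_type.name}: {machine_type.guest_cpus} vCPUs, "
                  f"{machine_type.memory_mb // 1024} GB memory")


if __name__ == "__main__":
    list_n2_machine_types("my-project", "us-central1-a")
```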
GPU-accelerated machines
Cluster Director supports a specific set of accelerator-optimized machine series designed for large-scale AI, ML, and HPC workloads.
GPU machine types are available only in specific regions and zones. To review availability, see GPU locations.
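You can also check availability programmatically. The following Python sketch is a minimal example that uses the google-cloud-compute client library to find the zones that offer a given accelerator model; the project ID is a placeholder, and the accelerator names match the ones used in the sections that follow (nvidia-gb200, nvidia-b200, nvidia-h200-141gb).

```python
# Minimal sketch: find the zones that offer a given GPU model, using the
# google-cloud-compute client library. "my-project" is a placeholder project ID.
from google.cloud import compute_v1


def zones_with_accelerator(project_id: str, accelerator_name: str) -> list[str]:
    client = compute_v1.AcceleratorTypesClient()
    zones = []
    # aggregated_list yields (scope, AcceleratorTypesScopedList) pairs,
    # one entry per zone.
    for scope, scoped_list in client.aggregated_list(project=project_id):
        for accelerator in scoped_list.accelerator_types:
            if accelerator.name == accelerator_name:
                zones.append(scope.removeprefix("zones/"))
    return zones


if __name__ == "__main__":
    print(zones_with_accelerator("my-project", "nvidia-h200-141gb"))
```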
A4X machine series
A4X accelerator-optimized machine types use NVIDIA GB200 Grace Blackwell Superchips (nvidia-gb200) and are ideal for foundation model training and serving.
A4X is an exascale platform based on NVIDIA GB200 NVL72. Each machine has two sockets with NVIDIA Grace CPUs with Arm Neoverse V2 cores. These CPUs are connected to four NVIDIA B200 Blackwell GPUs with fast chip-to-chip (NVLink-C2C) communication.
Attached NVIDIA GB200 Grace Blackwell Superchips

| Machine type | vCPU count1 | Instance memory (GB) | Attached Local SSD (GiB) | Physical NIC count | Maximum network bandwidth (Gbps)2 | GPU count | GPU memory3 (GB HBM3e) |
|---|---|---|---|---|---|---|---|
| a4x-highgpu-4g | 140 | 884 | 12,000 | 6 | 2,000 | 4 | 744 |
1 A vCPU is implemented as a single hardware hyper-thread on one of the available CPU platforms.
2 Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information about network bandwidth, see Network bandwidth.
3 GPU memory is the memory on a GPU device that can be used for temporary storage of data. It is separate from the instance's memory and is specifically designed to handle the higher bandwidth demands of your graphics-intensive workloads.
A4 machine series
A4 accelerator-optimized machine types have NVIDIA B200 Blackwell GPUs (nvidia-b200) attached and are ideal for foundation model training and serving.
Attached NVIDIA B200 Blackwell GPUs

| Machine type | vCPU count1 | Instance memory (GB) | Attached Local SSD (GiB) | Physical NIC count | Maximum network bandwidth (Gbps)2 | GPU count | GPU memory3 (GB HBM3e) |
|---|---|---|---|---|---|---|---|
| a4-highgpu-8g | 224 | 3,968 | 12,000 | 10 | 3,600 | 8 | 1,440 |
1 A vCPU is implemented as a single hardware hyper-thread on one of the available CPU platforms.
2 Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information about network bandwidth, see Network bandwidth.
3 GPU memory is the memory on a GPU device that can be used for temporary storage of data. It is separate from the instance's memory and is specifically designed to handle the higher bandwidth demands of your graphics-intensive workloads.
A3 machine series
The A3 machine series is powered by NVIDIA H100 or H200 SXM GPUs and is suitable for a wide range of large model training and inference workloads.
Cluster Director supports the A3 machine types that are described in the following sections.
A3 Ultra machine type
A3 Ultra machine types have NVIDIA H200 SXM GPUs (nvidia-h200-141gb) attached and provide the highest network performance in the A3 series. A3 Ultra machine types are ideal for foundation model training and serving.
Attached NVIDIA H200 GPUs

| Machine type | vCPU count1 | Instance memory (GB) | Attached Local SSD (GiB) | Physical NIC count | Maximum network bandwidth (Gbps)2 | GPU count | GPU memory3 (GB HBM3e) |
|---|---|---|---|---|---|---|---|
| a3-ultragpu-8g | 224 | 2,952 | 12,000 | 10 | 3,600 | 8 | 1,128 |
1 A vCPU is implemented as a single hardware hyper-thread on one of the available CPU platforms.
2 Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information about network bandwidth, see Network bandwidth.
3 GPU memory is the memory on a GPU device that can be used for temporary storage of data. It is separate from the instance's memory and is specifically designed to handle the higher bandwidth demands of your graphics-intensive workloads.
A3 Mega machine type
A3 Mega machine types have NVIDIA H100 SXM GPUs attached and are ideal for large model training and multi-host inference.

Attached NVIDIA H100 GPUs

| Machine type | vCPU count1 | Instance memory (GB) | Attached Local SSD (GiB) | Physical NIC count | Maximum network bandwidth (Gbps)2 | GPU count | GPU memory3 (GB HBM3) |
|---|---|---|---|---|---|---|---|
| a3-megagpu-8g | 208 | 1,872 | 6,000 | 9 | 1,800 | 8 | 640 |
1 A vCPU is implemented as a single hardware hyper-thread on one of the available CPU platforms.
2 Maximum egress bandwidth cannot exceed the number given. Actual egress bandwidth depends on the destination IP address and other factors. For more information about network bandwidth, see Network bandwidth.
3 GPU memory is the memory on a GPU device that can be used for temporary storage of data. It is separate from the instance's memory and is specifically designed to handle the higher bandwidth demands of your graphics-intensive workloads.
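Cluster Director normally provisions these GPU nodes for you as part of a cluster deployment, but the machine type names in the preceding tables are the same values that the Compute Engine API accepts. As a rough illustration only, the following Python sketch creates a standalone a3-megagpu-8g instance with the google-cloud-compute client library; the project, zone, and boot image are placeholder values, it assumes you have quota, capacity, or a reservation for the machine type, and it omits the network and storage configuration that a production A3 deployment requires.

```python
# Minimal, illustrative sketch: create one a3-megagpu-8g instance with the
# google-cloud-compute client library. Project, zone, and boot image are
# placeholders; real deployments need GPU quota or a reservation plus the
# appropriate network and storage configuration.
from google.cloud import compute_v1


def create_a3_instance(project_id: str, zone: str, name: str) -> None:
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts",
            disk_size_gb=200,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/a3-megagpu-8g",
        disks=[boot_disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")
        ],
        # GPU instances must terminate (not live-migrate) on host maintenance.
        scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
    )
    operation = compute_v1.InstancesClient().insert(
        project=project_id, zone=zone, instance_resource=instance
    )
    operation.result()  # wait for the create operation to finish


if __name__ == "__main__":
    create_a3_instance("my-project", "us-central1-a", "a3-mega-test")
```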
Operating system (OS) images
OS images are pre-configured, bootable disk images used to deploy virtual machine (VM) instances in Cluster Director. They include ML frameworks and libraries that simplify model creation and training for large-scale AI and ML workloads.
Slurm OS images for Cluster Director
When you deploy a Slurm cluster, Cluster Director uses a custom OS image that is automatically built by Cluster Toolkit. This process helps ensure that Cluster Director provisions every node in the cluster, including the controller and compute nodes, with a correct and consistent software stack.
These custom OS images are extensions of the Ubuntu LTS Accelerator OS images and include all necessary system software for cluster and workload management.
Included software
The custom OS images for Cluster Director nodes include the following software components by default:
Operating system: Ubuntu 22.04 LTS
Orchestration: Slurm version 24.11.2 and its dependencies, such as MUNGE and MariaDB.
Containerization tools: the NVIDIA enroot container runtime and NVIDIA pyxis, used for running containerized workloads on Slurm clusters.
Drivers and libraries: NVIDIA driver version 570, CUDA Toolkit (version 12.8), and libraries for GPUDirect-RDMA, including ibverbs-utils and rdma-core.
Parallel computing libraries: OpenMPI and PMIx for managing parallel processing tasks across the cluster.
Google Cloud integrations: the Ops Agent for monitoring and logging, and Cloud Storage FUSE for accessing Cloud Storage buckets from the cluster nodes.
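To confirm that a node was provisioned with this stack, you can run the standard version commands for each component after connecting to the node. The following Python sketch wraps those commands with subprocess; the CLIs it calls (nvidia-smi, nvcc, scontrol, mpirun, enroot) are the standard tools for the components listed above, and the sketch assumes that it runs directly on a cluster node.

```python
# Minimal sketch: print the versions of the main components listed above by
# calling their standard CLIs. Intended to run directly on a cluster node.
import shutil
import subprocess

# Component name -> version command. Adjust if your image differs.
CHECKS = {
    "NVIDIA driver / GPUs": ["nvidia-smi", "--query-gpu=name,driver_version",
                             "--format=csv,noheader"],
    "CUDA compiler": ["nvcc", "--version"],
    "Slurm": ["scontrol", "--version"],
    "OpenMPI": ["mpirun", "--version"],
    "enroot": ["enroot", "version"],
}


def report_versions() -> None:
    for component, command in CHECKS.items():
        if shutil.which(command[0]) is None:
            print(f"{component}: {command[0]} not found on PATH")
            continue
        result = subprocess.run(command, capture_output=True, text=True, check=False)
        output = result.stdout.strip() or result.stderr.strip()
        first_line = output.splitlines()[0] if output else "(no output)"
        print(f"{component}: {first_line}")


if __name__ == "__main__":
    report_versions()
```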