Cluster creation process overview

This document helps you understand the components that you need to configure before creating a cluster to meet the requirements of your workload.

Cluster creation process

Creating a cluster involves configuring compute, storage, and networking resources, as well as the Slurm orchestrator. The following process helps you pick the configurations that best fit your workload:

  1. Configure compute resources

  2. Configure storage resources

  3. Configure networking resources

  4. Configure the Slurm orchestrator

Configure compute resources

You must choose the type and number of virtual machine (VM) instances that you want to create in your cluster. You must consider the following:

  • Machine series: you must choose the machine series and machine types that you want to use in your cluster. Each cluster partition can use a different machine series. Each VM instance in your cluster serves as a node. For each type of node, we recommend the following machine series:

    • Compute nodes: based on your workload, use one of the following machine series:

      • Foundational model training and inference: A4X machine series

      • Large model training, fine-tuning, and inference: A4 or A3 Ultra machine series

      • Mainstream model inference, fine-tuning, and serving inference: A3 Mega machine series

    • Controller nodes: use the N2 machine series.

    • Login nodes: use the N2 machine series.

  • Location: based on the locations where your chosen machine series are available, you must select the region and zone where you want to deploy your cluster. For more information, review the regions and zones where machine series are available.

  • Consumption option: you must choose the consumption option that you want to use to get and use compute resources. Each option affects the availability, obtainability, and pricing of a compute resource. For more information, see Choose a consumption option.
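
The machine series recommendations above can be summarized as a small lookup table. The following Python sketch is illustrative only: the workload keys and the recommended_series helper are hypothetical names chosen for this example, and they are not part of any Cluster Director API.

```python
# Recommended machine series for compute nodes, keyed by workload.
# The key names are hypothetical labels for the workload categories listed above.
COMPUTE_SERIES_BY_WORKLOAD = {
    "foundation_model_training_and_inference": ["A4X"],
    "large_model_training_fine_tuning_and_inference": ["A4", "A3 Ultra"],
    "mainstream_model_inference_and_fine_tuning": ["A3 Mega"],
}

# Controller and login nodes share the same recommendation.
SERVICE_NODE_SERIES = {"controller": ["N2"], "login": ["N2"]}


def recommended_series(role, workload=None):
    """Return the recommended machine series for a node role.

    For compute nodes, a workload key from COMPUTE_SERIES_BY_WORKLOAD is required.
    """
    if role == "compute":
        if workload not in COMPUTE_SERIES_BY_WORKLOAD:
            raise ValueError(f"unknown workload: {workload!r}")
        return list(COMPUTE_SERIES_BY_WORKLOAD[workload])
    if role in SERVICE_NODE_SERIES:
        return list(SERVICE_NODE_SERIES[role])
    raise ValueError(f"unknown node role: {role!r}")


print(recommended_series("compute", "large_model_training_fine_tuning_and_inference"))
print(recommended_series("login"))
```

A table like this only captures the recommendation. The machine series must still be available in the region and zone that you select.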

Configure storage resources

You must define the storage that your cluster uses:

  • Shared file system for /home directory: a Filestore instance or Managed Lustre instance is required for the shared /home directory in a Slurm cluster. When you create a cluster, you can do one of the following:

    • Let Cluster Director create a new Filestore or Managed Lustre instance with pre-configured settings.

    • Use an existing Filestore or Managed Lustre instance on the same VPC network as your cluster.

  • Additional storage: for large-scale AI workloads that require more storage, you can add other solutions—for example, Cloud Storage buckets.

Configure networking resources

You must define the networking resources that your cluster uses:

  • Create a new VPC network: if you choose this option, then Cluster Director creates the VPC network on your behalf, and it automatically configures the necessary firewall rules to let you connect to your cluster.

  • Use an existing VPC network: if you want to use an existing VPC network, then you must configure the network and subnetwork as follows:

    • You must enable Private Google Access in your network. For more information, see Private Google Access configuration.

    • You must configure firewall rules to allow SSH ingress and internal access. For more information, see Use VPC firewall rules.

    • The subnetwork has the following configuration requirements:

      • The subnetwork must allow health checks by using the TCP protocol and port 8319. For more information, see Use health checks.

      • The subnetwork must have a firewall rule that allows Identity-Aware Proxy (IAP) to access the VMs in the cluster by using the TCP protocol and port 22. For more information, see Prepare your project for IAP TCP forwarding.

      • The subnetwork can't overlap with the 10.20.1.0/28 IPv4 address range.
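
Before you reuse an existing subnetwork, you can check the requirements above offline. The following is a minimal sketch that uses only the Python standard library. The function names are hypothetical, and the set of allowed TCP ports is assumed to be something you have already collected from your own firewall rules; the sketch doesn't query any Google Cloud API.

```python
import ipaddress

# IPv4 range that the subnetwork must not overlap with, as stated above.
RESERVED_RANGE = ipaddress.ip_network("10.20.1.0/28")

# TCP ports that the requirements above mention, with the reason for each.
REQUIRED_TCP_PORTS = {
    22: "IAP access to the cluster VMs",
    8319: "health checks",
}


def overlaps_reserved_range(subnet_cidr):
    """Return True if the subnetwork's primary range overlaps 10.20.1.0/28."""
    return ipaddress.ip_network(subnet_cidr).overlaps(RESERVED_RANGE)


def missing_required_ports(allowed_tcp_ports):
    """Return the required TCP ports that are absent from the allowed set."""
    return sorted(set(REQUIRED_TCP_PORTS) - set(allowed_tcp_ports))


# Example values are made up for illustration.
print(overlaps_reserved_range("10.20.0.0/16"))   # contains the reserved range
print(overlaps_reserved_range("10.20.1.16/28"))  # adjacent block, no overlap
print(missing_required_ports({22, 443}))         # 8319 is not allowed yet
```

The overlap check matters for wide ranges: a subnetwork such as 10.20.0.0/16 doesn't equal the reserved range, but it contains it, so it still conflicts.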

Configure the Slurm orchestrator

For the final step, you must configure the Slurm orchestrator. This process involves configuring the following components:

  • Login and controller nodes: you must define the machine series and boot disk options for the login and controller nodes. These nodes manage the cluster and let you connect to it.

  • Partitions and nodesets: you must define how you want to group your compute resources. Each partition can contain one or more nodesets, letting you create different pools of resources for different types of jobs. For each nodeset, you must also specify the minimum and maximum number of VMs that the nodeset can contain.
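
The relationship between partitions and nodesets can be sketched as a small data model. The following Python example is an assumption-laden illustration, not the Cluster Director configuration schema: the class names, field names, and the rule that the minimum can be zero are all hypothetical. Only the "one or more nodesets per partition" and "minimum and maximum number of VMs" constraints come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Nodeset:
    """A pool of identical VMs (hypothetical model)."""
    name: str
    machine_series: str
    min_nodes: int
    max_nodes: int


@dataclass
class Partition:
    """A group of one or more nodesets (hypothetical model)."""
    name: str
    nodesets: List[Nodeset] = field(default_factory=list)


def validation_errors(partition):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not partition.nodesets:
        errors.append(f"partition {partition.name!r} needs at least one nodeset")
    for ns in partition.nodesets:
        # Assumption: a minimum of zero is allowed; negative values are not.
        if ns.min_nodes < 0:
            errors.append(f"nodeset {ns.name!r}: minimum can't be negative")
        if ns.max_nodes < ns.min_nodes:
            errors.append(f"nodeset {ns.name!r}: maximum is below minimum")
    return errors


# Example values are made up for illustration.
training = Partition("training", [Nodeset("a4-pool", "A4", 2, 16)])
broken = Partition("inference", [Nodeset("a3-pool", "A3 Mega", 8, 4)])
print(validation_errors(training))
print(validation_errors(broken))
```

Keeping separate nodesets in separate partitions, as in this sketch, is one way to create different pools of resources for different types of jobs.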

What's next?