Create an AI-optimized cluster based on a template

This document explains how to create a cluster in Cluster Director by using a pre-configured template that is optimized for running artificial intelligence (AI) and machine learning (ML) training workloads.

Cluster Director provides a set of templates that are pre-configured for most AI and ML training workloads. Each template specifies a set of optimized compute, networking, and storage resources that are designed for high reliability and resiliency. This approach helps you to quickly provision a cluster while maintaining the flexibility to make changes if your workloads require them.

To create a fully customized cluster instead, see Create a custom cluster.

Limitations

When you create a cluster in Cluster Director, the following limitations apply:

  • Regional scope: clusters are regional resources. You can only create or use compute resources, storage resources, and subnetworks that exist within the same region as your cluster.
  • Compute resource configuration per nodeset: you can only assign one compute resource configuration for each nodeset that you want to create in your cluster.
  • Storage classes for new Cloud Storage buckets: if you plan to create one or more buckets when creating a cluster, then you can only specify the Standard storage class or Autoclass. If you want to use other classes, then you must update the bucket after you create the cluster.
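For example, to move a bucket to a different storage class after the cluster exists, you can update the bucket's default storage class with the gcloud CLI. The bucket name and storage class below are placeholders:

```shell
# Change the default storage class of an existing bucket after cluster
# creation. Replace the bucket name and class with your own values.
gcloud storage buckets update gs://my-cluster-bucket \
    --default-storage-class=NEARLINE
```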

Templates for AI-optimized clusters

The following sections describe the templates that you can use to create AI-optimized clusters in Cluster Director. When you create your cluster, you can either use the pre-configured settings in the template, or edit one or more settings based on the demands of your workloads.

Large-scale model training with A3 Ultra

  • Description:
    • Compute nodes: 4 A3 Ultra virtual machine (VM) instances
    • Storage:
      • One 5 TiB basic SSD Filestore instance
      • One 36 TiB Managed Lustre instance
    • Login nodes: 1 N2 VM
  • Consumption options:
    • Future reservations for blocks of capacity
    • Future reservations for up to 90 days (in calendar mode)
  • Use cases: This template is optimized for large-scale AI or ML training workloads.

High-performance training with A4X

  • Description:
    • Compute nodes: 18 A4X VMs
    • Storage: one 36 TiB Managed Lustre instance with a 1,000 MBps/TiB throughput tier
    • Login nodes: 1 N2 VM
  • Consumption option: Future reservations for blocks of capacity
  • Use cases: This template is optimized for compute- and memory-intensive workloads, such as multimodal inference workloads, and for the fastest reads and writes.

Short-term training and experimentation with A4

  • Description:
    • Compute nodes: 4 A4 VMs
    • Storage:
      • One 2 TiB basic SSD Filestore instance
      • One 18 TiB Managed Lustre instance
    • Login nodes: 1 N2 VM
  • Consumption option: Flex-start
  • Use cases: This template is optimized for running pre-training, fine-tuning, and experimentation workloads for up to seven days.

Before you begin

Before you create a cluster in Cluster Director, do the following:

  1. Choose consumption options. If you haven't already, then you must choose the consumption options for the virtual machine (VM) instances that you want to use in each partition for your cluster. Each consumption option determines the availability, obtainability, and pricing for your VMs.

    To learn more, see Choose a consumption option.

  2. Obtain capacity and quota. Based on your chosen consumption option, review the quota requirements for the VMs that you want to create in the cluster. If you lack sufficient quota, then creating your cluster fails.

    To learn more, see Capacity and quota overview.
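One way to review regional Compute Engine quota usage and limits is with the gcloud CLI; the region below is a placeholder:

```shell
# List quota metrics, current usage, and limits for a region.
# Replace us-central1 with your cluster's region.
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric, quotas.usage, quotas.limit)"
```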

  3. Verify usable reservations. If you want to create your cluster by using one or more reservations, then verify that the reservations have enough available resources to create your chosen number of VMs in the cluster. Otherwise, skip this step.

    To learn more, see Consumable VMs in a reservation.
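As a quick check of a reservation's remaining capacity, you can compare its total and in-use VM counts with the gcloud CLI. The reservation name and zone below are placeholders:

```shell
# Show how many VMs a reservation holds and how many are already in use.
# Replace the reservation name and zone with your own values.
gcloud compute reservations describe my-reservation \
    --zone=us-central1-a \
    --format="yaml(specificReservation.count, specificReservation.inUseCount)"
```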

  4. Verify trusted image policy. If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects.

    To learn more, see Setting up trusted image policies.
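To inspect your organization's current trusted image policy, you can describe the constraint with the gcloud CLI; the organization ID below is a placeholder:

```shell
# View the trusted image policy for an organization.
# Replace ORGANIZATION_ID with your organization's numeric ID.
gcloud resource-manager org-policies describe \
    compute.trustedImageProjects \
    --organization=ORGANIZATION_ID
```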

  5. Verify existing resource requirements. If you plan to use existing storage or networking resources in your cluster instead of creating new ones, then you must verify that those resources are correctly configured. Otherwise, skip this step.

    To learn more, see Cluster creation process overview.

Required roles

To get the permissions that you need to create a cluster based on a template, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to create a cluster based on a template. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create a cluster based on a template:

  • To create a cluster: hypercomputecluster.clusters.create

You might also be able to get these permissions with custom roles or other predefined roles.
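As a sketch of granting access from the gcloud CLI, an administrator can bind a role that includes the `hypercomputecluster.clusters.create` permission to your account. The project ID, user email, and role name below are placeholders; use the role that your administrator identifies:

```shell
# Grant a role containing hypercomputecluster.clusters.create to a user.
# PROJECT_ID, USER_EMAIL, and ROLE_NAME are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="ROLE_NAME"
```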

Create an AI-optimized cluster based on a template

To create an AI-optimized cluster based on a template, complete the following steps:

  1. Configure compute resource configurations

  2. Configure network

  3. Configure storage resources

  4. Configure the Slurm environment

Configure compute resource configurations

To configure compute resource configurations when creating a cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. Click Create cluster.

  3. In the dialog that appears, click Reference architecture. The Create a cluster page appears.

  4. Select one of the available templates.

  5. Click Customize.

  6. In the Compute section, in the Cluster name field, enter a name for your cluster. The name can contain up to 10 characters, and it can only use numbers or lowercase letters (a-z).

  7. To add information to the pre-configured compute resource configuration, or to edit the number and type of VMs that the configuration specifies, do the following:

    1. In the Compute section, click Edit resource configuration. The Add resource configuration pane appears.

    2. Optional: To change the compute resource configuration name, in the Name field, enter a new name.

    3. Optional: To change the number and type of VMs that your cluster uses, in the Machine configuration section, follow the prompts to update the compute resources.

    4. In the Consumption options section, specify the consumption option that you want to use to obtain resources:

      • To create VMs by using a reservation, do the following:

        1. Click the Use reservation tab.

        2. Click Select reservation. The Choose a reservation pane appears.

        3. Select the reservation that you want to use. Then, click Choose. This action automatically sets the Region and Zone of your compute resources.

      • To create Flex-start VMs, do the following:

        1. Click the Flex start tab.

        2. In the Time limit for the VM section, specify the run duration for the VMs. The value must be between 10 minutes and 7 days.

        3. In the Location section, select the region where you want to create Flex-start VMs. The Google Cloud console automatically filters the available regions to show only those regions that support Flex-start VMs for your selected machine type.

      • To create Spot VMs, do the following:

        1. Click the Use spot tab.

        2. In the On VM termination list, select one of the following options:

          • To delete Spot VMs on preemption, select Delete.

          • To stop Spot VMs on preemption, select Stop.

        3. In the Location section, select the Region and Zone where you want to create Spot VMs. The Google Cloud console automatically filters the available regions to show only those regions that support Spot VMs for your selected machine type.

    5. Click Done.

    6. Optional: To create additional compute resource configurations for a partition, click Add resource configuration, and then follow the prompts to specify the compute resources.

  8. Click Continue.
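The cluster-name rule from step 6 (at most 10 characters, lowercase letters and digits only) can be checked locally before you start. A minimal sketch, with the helper function name as an illustrative choice:

```shell
# Check a proposed cluster name against the documented rule:
# up to 10 characters, using only lowercase letters (a-z) and digits.
is_valid_cluster_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{1,10}$'
}

is_valid_cluster_name "train01"   && echo "valid"    # prints "valid"
is_valid_cluster_name "MyCluster" || echo "invalid"  # prints "invalid": uppercase
```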

Configure network

To configure the network that your cluster uses, complete the following steps:

  1. In the Choose new or existing network section, do one of the following:

    • Recommended: To let Cluster Director automatically create a network with the required firewall rules, select Create network.

    • To use an existing network, do the following:

      1. Select Select existing network.

      2. In the Select VPC network list, select an existing network.

      3. In the Select subnetwork list, select an existing subnetwork.

  2. Click Continue.

Configure storage resources

To configure the storage resources that your cluster uses, in the Storage section, complete the following steps:

  1. Optional: To edit a storage resource, click Edit storage plan, and then follow the prompts to update the configuration of the storage resource.

  2. Optional: To add storage resources to your cluster, click Add storage configuration, and then follow the prompts to specify the configuration for the storage resources.

  3. Click Continue.

Configure the Slurm environment

To configure the Slurm environment in your cluster, complete the following steps:

  1. Optional: To edit the number and type of VMs that the login node uses, expand the Login node section, and then follow the prompts to update the compute resources.

  2. Optional: To edit partitions of your cluster to organize your compute resources, expand the Partitions section, and then do one of the following:

    • To add a partition, click Add partition, and then do the following:

      1. In the Partition name field, enter a name for the partition.

      2. To edit a nodeset, click Toggle nodeset. Otherwise, to add a nodeset, click Add nodeset.

      3. In the Nodeset name field, enter a name for your nodeset.

      4. In the Resource configuration field, select a compute resource configuration that you created in the previous steps.

      5. In the Static node count field, enter the minimum number of VMs that must always be running in the cluster.

      6. In the Dynamic node count field, enter the maximum number of VMs that Cluster Director can scale the cluster up to during periods of increased demand.

      7. In the Boot disk type list and Boot disk size field, enter the type and size of the boot disk for the VMs to use.

      8. Click Done.

    • To remove a partition, click Delete partition.

  3. Optional: To add prolog or epilog scripts to your Slurm environment, do the following:

    1. Expand the Advanced orchestration settings section.

    2. In the Scripts section, follow the prompts to add scripts.

  4. Click Create.

    The Clusters page appears. Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and on resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available. To check the status of the cluster create operation, view your cluster's details.

    When Cluster Director creates your login node, the cluster state changes to Ready. You can then connect to your cluster; however, you can run workloads only after Cluster Director creates the compute nodes in the cluster.
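If you added prolog or epilog scripts, each one is a shell script that Slurm runs on compute nodes before or after each job. A minimal prolog sketch; the log location and the `PROLOG_LOG_DIR` variable are illustrative, not part of the product:

```shell
#!/bin/bash
# Minimal Slurm prolog sketch: runs on each compute node before a job.
# The log directory below is illustrative; adjust it for your image.
LOG_DIR="${PROLOG_LOG_DIR:-/tmp/slurm-prolog}"
mkdir -p "$LOG_DIR"
echo "$(date -Is) prolog: job ${SLURM_JOB_ID:-unknown} on $(hostname)" \
    >> "$LOG_DIR/prolog.log"
```

Epilog scripts follow the same shape and typically clean up per-job state that the prolog created.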

What's next?