Create a TPU nodeset for a Slurm partition

Use the schedmd-slurm-gcp-v6-nodeset-tpu module to define a group of Tensor Processing Unit (TPU) nodes. This module creates the data structure that's required by the schedmd-slurm-gcp-v6-partition module to build a partition that supports machine learning workloads.

TPUs are Google's custom-developed Application-Specific Integrated Circuits (ASICs) designed to accelerate machine learning workloads.

For the complete list of inputs and outputs, see the schedmd-slurm-gcp-v6-nodeset-tpu module on GitHub.

Before you begin

Verify that you meet the following requirements:

  • You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
  • You have an existing cluster blueprint. You can use and modify an existing blueprint or create one from scratch. To view a working example of a blueprint configured for the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page, click the Select scheduler menu and then select Slurm. For more information about creating and customizing blueprints, see Cluster blueprint.
  • To view a complete list of blueprints that support the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page.

Required roles

To get the permissions that you need to deploy the TPU nodeset, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a TPU nodeset

To create a TPU nodeset, add the schedmd-slurm-gcp-v6-nodeset-tpu module to your blueprint. You must specify the node type (node_type) and the TensorFlow version (tf_version).

The following example defines a TPU nodeset and a partition that uses the tpu_nodeset module as input. It configures the nodeset with the following settings:

  • Connects to a network defined by a separate module.
  • Uses the v2-8 node type.
  • Uses TensorFlow version 2.10.0.
  • Assigns public IP addresses to the TPU VMs by setting disable_public_ips to false.
  • Enables preemptibility for the TPU VMs.
  • Disables the preserve_tpu setting, so the system deletes suspended VMs.
  - id: tpu_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset-tpu
    use: [network]
    settings:
      node_type: v2-8
      tf_version: 2.10.0
      disable_public_ips: false
      preemptible: true
      preserve_tpu: false

  - id: tpu_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [tpu_nodeset]
    settings:
      partition_name: tpu
Configure preemptibility

To reduce costs, you can configure your TPU nodes to be preemptible by enabling the preemptible setting.

Additionally, you can control the behavior of suspended VMs by using the preserve_tpu setting:

  • false: suspended VMs are deleted. This is the default value.
  • true: suspended VMs are stopped but not deleted, preserving their state.
  settings:
    preemptible: true
    preserve_tpu: false
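Conversely, to stop suspended TPU VMs rather than delete them, so that their state is preserved, you can set preserve_tpu to true. A minimal sketch of the alternative setting:

```yaml
  settings:
    preserve_tpu: true
```

Preserving stopped VMs can speed up resume at the cost of keeping the VMs' resources allocated.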

What's next