Create a TPU nodeset for a Slurm partition

Use the schedmd-slurm-gcp-v6-nodeset-tpu module to define a group of Tensor Processing Unit (TPU) nodes. This module creates the data structure that's required by the schedmd-slurm-gcp-v6-partition module to build a partition that supports machine learning workloads.

TPUs are Google's custom-developed Application-Specific Integrated Circuits (ASICs) designed to accelerate machine learning workloads.

For the complete list of inputs and outputs, see the schedmd-slurm-gcp-v6-nodeset-tpu module on GitHub.

Before you begin

Verify that you meet the following requirements:

  • You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
  • You have an existing cluster blueprint. You can use and modify an existing blueprint or create one from scratch. To view a working example of a blueprint configured for the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page, click the Select scheduler menu and then select Slurm. For more information about creating and customizing blueprints, see Cluster blueprint.
  • To view a complete list of blueprints that support the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page.

Required roles

To get the permissions that you need to deploy the TPU nodeset, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a TPU nodeset

To create a TPU nodeset, add the schedmd-slurm-gcp-v6-nodeset-tpu module to your blueprint. You must specify the node type (node_type) and the TensorFlow version (tf_version).

The following example defines a TPU nodeset and a partition that uses the tpu_nodeset module as input. It configures the nodeset with the following settings:

  • Connects to a network defined by a separate module.
  • Uses the v2-8 node type.
  • Uses TensorFlow version 2.10.0.
  • Assigns public IP addresses to the TPU VMs by setting disable_public_ips to false.
  • Enables preemptibility for the TPU VMs.
  • Disables the preserve_tpu setting, so the system deletes suspended VMs.
  - id: tpu_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset-tpu
    use: [network]
    settings:
      node_type: v2-8
      tf_version: 2.10.0
      disable_public_ips: false
      preemptible: true
      preserve_tpu: false

  - id: tpu_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [tpu_nodeset]
    settings:
      partition_name: tpu
Configure preemptibility

To reduce costs, you can configure your TPU nodes to be preemptible by enabling the preemptible setting.

Additionally, you can control the behavior of suspended VMs by using the preserve_tpu setting:

  • false: suspended VMs are deleted. This is the default value.
  • true: suspended VMs are stopped but not deleted, preserving their state.
  settings:
    preemptible: true
    preserve_tpu: false
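Conversely, to stop suspended TPU VMs rather than delete them, so that their state is preserved, you can set preserve_tpu to true. A minimal sketch of the alternative setting:

```yaml
  settings:
    preserve_tpu: true
```

Preserving stopped VMs can speed up resume at the cost of keeping the VMs' resources allocated.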

What's next