This document describes how to use the schedmd-slurm-gcp-v6-nodeset-tpu module to define a group of Tensor Processing Unit (TPU) nodes. This module creates the data structure that's required by the schedmd-slurm-gcp-v6-partition module to build a partition that supports machine learning workloads.
TPUs are Google's custom-developed Application-Specific Integrated Circuits (ASICs) designed to accelerate machine learning workloads.
For the complete list of inputs and outputs, see the schedmd-slurm-gcp-v6-nodeset-tpu module on GitHub.
Before you begin
Before you begin, verify that you meet the following requirements:
- You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
- You have an existing cluster blueprint. You can use and modify an existing blueprint or create one from scratch. To view a working example of a blueprint configured for the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page, click the Select scheduler menu, and then select Slurm. For more information about creating and customizing blueprints, see Cluster blueprint. A minimal blueprint skeleton that shows where this module fits appears after this list.
- To view a complete list of blueprints that support the slurm-nodeset-tpu module, go to the Cluster blueprint catalog page.
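If you're creating a blueprint from scratch, the following sketch shows the overall shape of a blueprint that this module plugs into. This is a minimal outline under assumed placeholder values (the blueprint_name, project_id, deployment_name, region, and zone shown here are examples to replace with your own), not a complete, deployable blueprint:

  blueprint_name: tpu-slurm-cluster  # placeholder name

  vars:
    project_id: my-project-id   # placeholder; replace with your project ID
    deployment_name: tpu-slurm  # placeholder deployment name
    region: us-central1         # example region
    zone: us-central1-b         # example zone

  deployment_groups:
    - group: primary
      modules:
        # The network, nodeset, and partition modules described on this page
        # go here; see the example in the "Create a TPU nodeset" section.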
Required roles
To get the permissions that you need to deploy the TPU nodeset, ask your administrator to grant you the following IAM roles on your project:
- TPU Admin (roles/tpu.admin)
- Service Account User (roles/iam.serviceAccountUser)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Create a TPU nodeset
To create a TPU nodeset, add the schedmd-slurm-gcp-v6-nodeset-tpu module to
your blueprint. You must specify the node type and TensorFlow version.
The following example creates a TPU partition by using the tpu_nodeset module
as input. It configures the nodeset with the following settings:
- Connects to a network defined by a separate module.
- Uses the v2-8 node type.
- Uses TensorFlow version 2.10.0.
- Enables preemptibility for the TPU VMs.
- Disables the preserve_tpu setting. If this setting is disabled, then the system deletes suspended VMs.
  - id: tpu_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset-tpu
    use: [network]
    settings:
      node_type: v2-8
      tf_version: 2.10.0
      disable_public_ips: false
      preemptible: true
      preserve_tpu: false

  - id: tpu_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [tpu_nodeset]
    settings:
      partition_name: tpu
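The use: [network] setting in this example assumes that the blueprint also defines a network module with the ID network. A minimal sketch of such a module, using the Toolkit's VPC module, might look like the following:

  - id: network
    source: modules/network/vpc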
Configure preemptibility
To reduce costs, you can configure your TPU nodes to be preemptible by enabling
the preemptible setting.
Additionally, you can control the behavior of suspended VMs by using the
preserve_tpu setting:
- false: Suspended VMs are deleted. This is the default value for this setting.
- true: Suspended VMs are stopped but not deleted, which preserves their state.
    settings:
      preemptible: true
      preserve_tpu: false
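Conversely, the following sketch shows the alternative configuration, in which suspended TPU VMs are stopped rather than deleted so that their state is preserved. Keeping stopped VMs can incur additional costs, so weigh that against the benefit of preserving state:

    settings:
      preemptible: true
      preserve_tpu: true  # stop, rather than delete, suspended TPU VMs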
What's next
- To add this nodeset to a partition, see Create a partition for a Slurm controller.
- For a complete list of inputs and outputs, see the
slurm-nodeset-tpu module on GitHub.