Use the schedmd-slurm-gcp-v6-nodeset module to define a group of compute nodes, known as a
nodeset. This module creates the data structure required by the
schedmd-slurm-gcp-v6-partition
module to build a partition.
Nodesets let you add different types of nodes to a partition, so that you can run jobs that use a mix of hardware specifications.
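For example, a minimal sketch of a partition that mixes two machine types might look like the following. The module IDs, machine types, node counts, and the network module ID (network) are illustrative assumptions, not values required by the module:

- id: c2_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    machine_type: c2-standard-30
    node_count_dynamic_max: 10

- id: n2_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    machine_type: n2-standard-8
    node_count_dynamic_max: 10

- id: mixed_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use:
  - c2_nodeset
  - n2_nodeset
  settings:
    partition_name: mixed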
For a complete list of inputs and outputs, see the schedmd-slurm-gcp-v6-nodeset module
in the Cluster Toolkit GitHub repository.
Before you begin
Before you begin, verify that you meet the following requirements:
- You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
- You have an existing cluster blueprint. You can use and modify an existing
  blueprint or create one from scratch. For a working example of a blueprint
  configured for the slurm-nodeset module, see the examples/hpc-slurm.yaml
  file. For more information about creating and customizing blueprints, see
  Cluster blueprint.
- To view a complete list of blueprints that support the slurm-nodeset module,
  go to the Cluster blueprint catalog page, click the Select scheduler menu,
  and then select Slurm.
Required roles
To get the permissions that you need to deploy the nodeset, ask your administrator to grant you the following IAM roles on your project:
- Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
- Service Account User (roles/iam.serviceAccountUser)
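For example, an administrator can grant these roles with the gcloud CLI; PROJECT_ID and USER_EMAIL below are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/compute.instanceAdmin.v1"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountUser"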
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Create a nodeset
To create a nodeset, add the schedmd-slurm-gcp-v6-nodeset module to your
blueprint. You must specify the machine type and the maximum number of dynamic
nodes.
The following example creates a partition by using the schedmd-slurm-gcp-v6-nodeset module
as input. The example does the following:
- Uses c2-standard-30 machine types.
- Sets a maximum count of 200 dynamically created nodes for the nodeset.
- Uses a nodeset name of ghpc.
- Connects to the module with the network ID by using the use field.
- Mounts nodes to the homefs network file system by using the use field.
- id: nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use:
  - homefs
  - nodeset
  settings:
    partition_name: compute
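After you add these modules, redeploy your blueprint so that the new nodeset and partition take effect. With a recent Cluster Toolkit build, the command looks like the following sketch; the blueprint filename is an example:

./gcluster deploy hpc-slurm.yaml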
Target specific nodesets
To specify nodes from a specific nodeset in a partition, use the --nodelist (or
-w) flag. The following example selects the three nodes cluster-compute-group-[0-2] in
the compute partition:
srun -N 3 -p compute --nodelist cluster-compute-group-[0-2] hostname
Depending on how the nodes differ, you can also use the --constraint (or -C)
flag or other flags, such as --mincpus, to specify nodes with the selected
characteristics.
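For example, assuming the target nodes advertise a Slurm feature named c2 and provide at least 30 CPUs each (both assumptions depend on your cluster configuration), either of the following commands selects matching nodes:

srun -N 3 -p compute -C c2 hostname
srun -N 3 -p compute --mincpus 30 hostname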
Configure custom images
To create valid custom images for the node group virtual machine (VM) instances or for custom instance templates, see Slurm on Google Cloud custom images on GitHub.
Configure GPU support
To learn more about GPU support for this module, and other Cluster Toolkit modules, see GPU support on GitHub.
Configure zone policies
You can specify additional zones in which your nodeset can create VM instances by using bulk creation. Configuring partitions with popular VM families lets you access more compute resources across zones.
If Google adds a new zone to the region while the cluster is active, then nodes in the partition might be created in that zone. In this case, you might need to redeploy the partition to make sure that it excludes the newly added zone.
The following example configuration creates VM instances only in the zone set by the zone deployment variable:

vars:
  zone: us-central1-f

- id: zonal-nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
The following example configuration lets you create VM instances in additional zones:
vars:
  zone: us-central1-f

- id: multi-zonal-nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  settings:
    zones:
    - us-central1-a
    - us-central1-b
Configure VM instance placement in the nodeset
Use the enable_placement setting to decide how Google Cloud places your VM
instances in the nodeset.
By default, the enable_placement setting is set to true. With this setting
enabled, the module places your VM instances physically close together, which
reduces network latency and helps them communicate faster.
In certain cases, it might be helpful to disable the enable_placement setting,
such as when you want to do any of the following:
- Avoid resource shortages. When placement is enabled, Google Cloud tries to create all requested VMs in the same hardware group. If there isn't enough space in that group, the request might fail completely. Disabling placement allows Google Cloud to create your VMs wherever space is available.
- Improve reliability. To help ensure that a single hardware failure doesn't affect all of your VM instances at once, spread your VMs across different hardware.
- Avoid extra costs. If you use specific types of reservations (like non-dense reservations), then you can avoid the extra cost of placement policies.
To control the maximum distance between nodes in a placement group, use the
placement_max_distance variable. Cluster Toolkit restricts the available
values for placement_max_distance based on your chosen machine type.
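For example, the following sketch shows both options; the module IDs, machine types, and counts are illustrative:

- id: spread-nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    machine_type: c2-standard-30
    node_count_dynamic_max: 200
    # Let Google Cloud place VMs wherever capacity is available.
    enable_placement: false

- id: tight-nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    machine_type: c2-standard-60
    node_count_dynamic_max: 20
    # Allowed values depend on the machine type; 2 is illustrative.
    placement_max_distance: 2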
Get support
The Cluster Toolkit team maintains the wrapper around the slurm-on-gcp
Terraform modules. For support with the underlying modules, see the instructions
in the Slurm on Google Cloud
README
on GitHub.
What's next
- To add this nodeset to a partition, see Create a partition for a Slurm controller.
- To learn about creating nodesets specifically for TPUs, see Create a TPU nodeset for a Slurm partition.
- For a complete list of all available input fields and output values, see the
  schedmd-slurm-gcp-v6-nodeset module on GitHub.