Create a Slurm cluster for short-term training
This quickstart guides you through the process of creating and connecting to a Slurm cluster with two A3 Mega Flex-start VMs in Cluster Director. By using a custom cluster configuration that uses Flex-start VMs, you can create a cluster for short-term training workloads that run for no longer than seven days. This configuration lets you quickly create a cluster without the need for reserving capacity for a future date and time. After Cluster Director creates the login node in the cluster, you connect to it and run sample jobs by using Slurm commands for job management.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Hypercompute Cluster API, Compute Engine API, Filestore API, Google Cloud Managed Lustre API, Cloud Logging API, and Cloud Monitoring API.
- Verify that your project and the Compute Engine default service account have the following Identity and Access Management (IAM) roles:
  - To get the permissions that you need to complete this quickstart, ask your administrator to grant you the following IAM roles on your project:
    - To create and manage the cluster: Cluster Director Editor (roles/hypercomputecluster.editor)
    - To connect to the login node in a cluster:
      - Compute OS Login (roles/compute.osLogin)
      - IAP-Secured Tunnel User (roles/iap.tunnelResourceAccessor)
    For more information about granting roles, see Manage access to projects, folders, and organizations.
    You might also be able to get the required permissions through custom roles or other predefined roles.
  - To get the permissions that you need to complete this quickstart, ask your administrator to grant you the following IAM roles on the Compute Engine default service account:
    - Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1)
    - Logs Writer (roles/logging.logWriter)
    - Monitoring Metric Writer (roles/monitoring.metricWriter)
    - Storage Object Viewer (roles/storage.objectViewer)
- If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To view the trusted image policies for your organization, see Set image access constraints.
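The API and role prerequisites above can also be handled from the command line. The following is a minimal sketch, not the documented procedure. It assumes that the gcloud CLI is installed and authenticated, that the service names in the script correspond to the six APIs named above (verify them in the API Library before relying on them), and that the second role list is meant to be held by the Compute Engine default service account on the project. PROJECT_ID and USER_EMAIL are placeholders, and the function names are hypothetical helpers.

```shell
# Sketch only: these helper functions wrap gcloud and run nothing until called.

# Assumed service names for the six APIs listed above; verify before use.
QUICKSTART_APIS="hypercomputecluster.googleapis.com compute.googleapis.com file.googleapis.com lustre.googleapis.com logging.googleapis.com monitoring.googleapis.com"

# Role IDs copied from the lists above.
USER_ROLES="roles/hypercomputecluster.editor roles/compute.osLogin roles/iap.tunnelResourceAccessor"
SERVICE_ACCOUNT_ROLES="roles/compute.instanceAdmin.v1 roles/logging.logWriter roles/monitoring.metricWriter roles/storage.objectViewer"

# Enable all of the APIs in one call. $1 = project ID.
enable_quickstart_apis() {
  # Word splitting of $QUICKSTART_APIS is intentional.
  gcloud services enable $QUICKSTART_APIS --project="$1"
}

# Grant each role in a space-separated list to one member.
# $1 = project ID, $2 = member (for example, user:USER_EMAIL), $3 = role list.
grant_roles() {
  for role in $3; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="$2" --role="$role" || return 1
  done
}

# Print the member string of the Compute Engine default service account,
# which is named PROJECT_NUMBER-compute@developer.gserviceaccount.com.
# $1 = project ID.
default_compute_sa_member() {
  project_number=$(gcloud projects describe "$1" --format='value(projectNumber)') || return 1
  echo "serviceAccount:${project_number}-compute@developer.gserviceaccount.com"
}

# Example calls (replace the placeholders first):
#   enable_quickstart_apis PROJECT_ID
#   grant_roles PROJECT_ID user:USER_EMAIL "$USER_ROLES"
#   grant_roles PROJECT_ID "$(default_compute_sa_member PROJECT_ID)" "$SERVICE_ACCOUNT_ROLES"
```

Granting roles requires permission to change the project's IAM policy, so an administrator might need to run the grant_roles calls on your behalf.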
Costs
This quickstart uses the following billable Google Cloud resources:
Compute Engine: one N2 VM and two A3 Mega Flex-start VMs
Filestore: a Filestore instance
Google Cloud Managed Lustre: a Managed Lustre instance
To generate a cost estimate based on your projected usage, use the pricing calculator.
Create a Slurm cluster
To create a Slurm cluster, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
Click Create a cluster.
In the dialog that appears, click Step-by-step configuration. The Create cluster page appears.
In the Cluster name field, enter cluster000.
In the Compute section, click Configure resources. In the Add resource configuration pane that appears, complete the following steps:
In the GPU type list, select NVIDIA H100 80GB MEGA.
In the Number of instances field, enter 2.
In the Consumption options section, click Use Flex-start.
In the Location section, specify the following options:
For Region, select us-central1.
For Zone, select us-central1-a.
Click Done.
In the navigation menu, click Storage.
In the Storage section, click Add storage configuration. In the Add storage configuration pane that appears, complete the following steps:
Click the Managed lustre tab.
Click Done.
Click Create. The Clusters page appears.
Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available.
View the cluster creation request
To review the cluster creation request, complete the following steps:
In the Clusters table, in the Name column, click cluster000. A page that gives the details of the cluster appears, and the Details tab is selected.
In the Compute section, locate the Status row. When Cluster Director sets its value to Ready, you can proceed to the next section.
Connect to your cluster through SSH
To connect to your cluster through SSH, complete the following steps:
Click the Nodes tab.
In the Login nodes table, find the row that contains the cluster000-login-001 node. In that row, in the Connect column, click the SSH button. The SSH-in-browser window appears.
If prompted, then click Authorize. Connecting to your cluster can take some time to complete. When the terminal is ready, proceed to the next section.
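If you prefer a local terminal to the SSH-in-browser window, a connection through Identity-Aware Proxy might look like the following. This is a sketch under assumptions: the gcloud CLI is installed and authenticated, the login node's VM name matches the cluster000-login-001 name shown in the console, and the VM is in us-central1-a, the zone selected earlier. PROJECT_ID is a placeholder and connect_login_node is a hypothetical helper name.

```shell
# Sketch only: opens an SSH session to the login node through an IAP tunnel.
# $1 = project ID
# $2 = VM name (optional; the default below is an assumption)
# $3 = zone (optional; the default below is an assumption)
connect_login_node() {
  gcloud compute ssh "${2:-cluster000-login-001}" \
    --project="$1" \
    --zone="${3:-us-central1-a}" \
    --tunnel-through-iap
}

# Example call (replace the placeholder first):
#   connect_login_node PROJECT_ID
```

The Compute OS Login and IAP-Secured Tunnel User roles listed in Before you begin are the ones that this kind of connection relies on.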
Run sample jobs
In the SSH-in-browser window, complete the following steps:
To verify that Slurm is running, run the following command:
sinfo
To submit a test job that returns the hostname of the node, run the following command:
srun hostname
To submit a batch job that sleeps for 30 seconds, run the following command:
sbatch --wrap="sleep 30"
To check the status of jobs in the queue, run the following command:
squeue
To view accounting data for jobs, run the following command:
sacct
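Beyond one-line jobs, work is usually submitted to Slurm as a batch script. The following sketch writes a small script that runs hostname on two nodes and submits it only if sbatch is available. It assumes that both A3 Mega nodes belong to the cluster's default partition, and it uses a hypothetical file name, hostname_job.sh. Adjust the directives to match your cluster.

```shell
# Write a minimal batch script. The #SBATCH lines are standard sbatch options.
cat > hostname_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
# Assumes two compute nodes are available in the default partition.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# Five-minute time limit.
#SBATCH --time=00:05:00
# %j expands to the job ID.
#SBATCH --output=hostname-%j.out

# One task per node; each task prints its node's hostname.
srun hostname
EOF

# Submit only where Slurm is installed (for example, on the login node).
if command -v sbatch >/dev/null 2>&1; then
  if job_id=$(sbatch --parsable hostname_job.sh); then
    echo "Submitted job ${job_id}"
    # --parsable can append ";cluster" to the ID, so strip anything after ";".
    squeue --jobs="${job_id%%;*}"
  fi
fi
```

If the Flex-start VMs are not yet available, the job might stay pending in the queue until the nodes are ready. After the job finishes, the file hostname-JOB_ID.out in the submission directory should contain one hostname per node, and sacct shows the job's accounting record.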
You've successfully created a Slurm cluster, connected to it, and run sample jobs! If Cluster Director still hasn't created the two A3 Mega Flex-start VMs, then you can either wait for the cluster to create the VMs, or delete the cluster to avoid incurring any unnecessary charges.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
Delete your project
The easiest way to eliminate billing is to delete the project that you created for this quickstart.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
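The same deletion can be sketched from the command line, assuming that the gcloud CLI is installed and authenticated. PROJECT_ID is a placeholder and delete_quickstart_project is a hypothetical helper name. When run interactively, gcloud asks for confirmation before it shuts the project down.

```shell
# Sketch only: shuts down a project and everything in it. $1 = project ID.
delete_quickstart_project() {
  # Refuse to run without an explicit project ID.
  if [ -z "$1" ]; then
    echo "usage: delete_quickstart_project PROJECT_ID" >&2
    return 1
  fi
  gcloud projects delete "$1"
}

# Example call (replace the placeholder first):
#   delete_quickstart_project PROJECT_ID
```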
Delete your cluster
To delete the cluster that you created in this quickstart, along with its associated resources, complete the following steps:
On the page that contains the details of your cluster, click Delete.
In the dialog that appears, enter cluster000, and then click Delete to confirm.