Create a Slurm cluster for short-term training

This quickstart guides you through the process of creating and connecting to a Slurm cluster with two A3 Mega Flex-start VMs in Cluster Director. By using a custom cluster configuration that uses Flex-start VMs, you can create a cluster for short-term training workloads that run for no longer than seven days. This configuration lets you quickly create a cluster without the need for reserving capacity for a future date and time. After Cluster Director creates the login node in the cluster, you connect to it and run sample jobs by using Slurm commands for job management.


Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Hypercompute Cluster API, Compute Engine API, Filestore API, Google Cloud Managed Lustre API, Cloud Logging API, and Cloud Monitoring API:

    Enable the APIs

  5. Verify that your project and the Compute Engine default service account have the following Identity and Access Management (IAM) roles:

  6. If the organization in which your project exists has a trusted image policy (constraints/compute.trustedImageProjects), then verify that the clusterdirector-public-images project is included in the list of allowed projects. To view the trusted image policies for your organization, see Set image access constraints.
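
If you prefer the gcloud CLI to the console for the billing and API prerequisites, the following is a minimal sketch, not an official procedure. It assumes that the gcloud CLI is installed and authenticated. The project ID `my-project-id` is a placeholder, and the service names (in particular `hypercomputecluster.googleapis.com` and `lustre.googleapis.com`) are assumptions that you should confirm in the API Library before use. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Pre-flight sketch: check billing and enable the APIs that this quickstart lists.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project-id}"   # hypothetical placeholder
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands only

# Assumed service names for the six APIs named in the prerequisites.
SERVICES="hypercomputecluster.googleapis.com compute.googleapis.com file.googleapis.com lustre.googleapis.com logging.googleapis.com monitoring.googleapis.com"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Reports whether billing is enabled for the project.
run gcloud billing projects describe "$PROJECT_ID" --format="value(billingEnabled)"

# Enables all listed services in one request.
# Word splitting of $SERVICES is intended here.
run gcloud services enable $SERVICES --project="$PROJECT_ID"
```

Review the printed commands first, and only then rerun with `DRY_RUN=0`.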

Costs

This quickstart uses the following billable Google Cloud resources:

  • Compute Engine: one N2 VM and two A3 Mega Flex-start VMs

  • Filestore: a Filestore instance

  • Google Cloud Managed Lustre: a Managed Lustre instance

To generate a cost estimate based on your projected usage, use the pricing calculator.

Create a Slurm cluster

To create a Slurm cluster, complete the following steps:

  1. In the Google Cloud console, go to the Cluster Director page.

    Go to Cluster Director

  2. Click Create a cluster.

  3. In the dialog that appears, click Step-by-step configuration. The Create cluster page appears.

  4. In the Cluster name field, enter cluster000.

  5. In the Compute section, click Configure resources. In the Add resource configuration pane that appears, complete the following steps:

    1. In the GPU type list, select NVIDIA H100 80GB MEGA.

    2. In the Number of instances field, enter 2.

    3. In the Consumption options section, click Use Flex-start.

    4. In the Location section, specify the following options:

      • For Region, select us-central1.

      • For Zone, select us-central1-a.

    5. Click Done.

  6. In the navigation menu, click Storage.

  7. In the Storage section, click Add storage configuration. In the Add storage configuration pane that appears, complete the following steps:

    1. Click the Managed Lustre tab.

    2. Click Done.

  8. Click Create. The Clusters page appears.

    Creating the cluster can take some time to complete. The completion time depends on the number of VMs that you request and resource availability in the VMs' zone. If your requested resources are unavailable, then Cluster Director maintains the creation request until resources become available.

View the cluster creation request

To review the cluster creation request, complete the following steps:

  1. In the Clusters table, in the Name column, click cluster000. A page that gives the details of the cluster appears, and the Details tab is selected.

  2. In the Compute section, locate the Status row. When Cluster Director sets its value to Ready, you can proceed to the next section.

Connect to your cluster through SSH

To connect to your cluster through SSH, complete the following steps:

  1. Click the Nodes tab.

  2. In the Login nodes table, find the row that contains the cluster000-login-001 node. In that row, in the Connect column, click the SSH button. The SSH-in-browser window appears.

  3. If prompted, then click Authorize. Connecting to your cluster can take some time to complete. When the terminal is ready, proceed to the next section.
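
As an alternative to SSH-in-browser, you might be able to connect from a local terminal with `gcloud compute ssh`. The sketch below assumes that the login node is a Compute Engine VM named `cluster000-login-001` in zone `us-central1-a`, and that `my-project-id` stands in for your project ID. None of these assumptions is confirmed by this quickstart, so check the Nodes tab for the actual instance name and zone. The script only prints the command unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: connect to the login node with the gcloud CLI instead of SSH-in-browser.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project-id}"        # hypothetical placeholder
ZONE="${ZONE:-us-central1-a}"                    # assumed zone of the login node
LOGIN_NODE="${LOGIN_NODE:-cluster000-login-001}" # assumed VM instance name
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print the command only

# Build the SSH command as a string so that it can be reviewed first.
ssh_command() {
  echo "gcloud compute ssh $LOGIN_NODE --zone=$ZONE --project=$PROJECT_ID"
}

if [ "$DRY_RUN" = "1" ]; then
  echo "Would run: $(ssh_command)"
else
  gcloud compute ssh "$LOGIN_NODE" --zone="$ZONE" --project="$PROJECT_ID"
fi
```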

Run sample jobs

In the SSH-in-browser window, complete the following steps:

  1. To verify that Slurm is running, run the following command:

    sinfo
    
  2. To submit a test job that returns the hostname of the node, run the following command:

    srun hostname
    
  3. To submit a batch job that sleeps for 30 seconds, run the following command:

    sbatch --wrap="sleep 30"
    
  4. To check the status of jobs in the queue, run the following command:

    squeue
    
  5. To view accounting data for jobs, run the following command:

    sacct
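
The commands above run single, ad hoc jobs. To exercise both compute nodes at once, a batch script can request two nodes explicitly. The following is a minimal sketch: the file name, job name, time limit, and output pattern are illustrative choices, and it assumes that the default partition contains the two A3 Mega nodes. When `sbatch` isn't on the path (for example, outside the login node), the script only writes the batch file.

```shell
#!/bin/sh
# Sketch: write and submit a two-node batch job, then look it up by job ID.
set -eu

JOB_SCRIPT="${JOB_SCRIPT:-two_node_hostname.sh}"   # hypothetical file name

# Write a batch script that runs one task on each of two nodes.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --output=hostname-test-%j.out

srun hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
  # --parsable prints the job ID, optionally followed by ";cluster".
  JOB_ID="$(sbatch --parsable "$JOB_SCRIPT" | cut -d';' -f1)"
  # Show the job in the queue; tolerate it having already left the queue.
  squeue -j "$JOB_ID" || true
  # Show accounting data for this job only.
  sacct -j "$JOB_ID" --format=JobID,JobName,State,Elapsed
else
  echo "sbatch not found; wrote $JOB_SCRIPT without submitting it"
fi
```

If the Flex-start VMs haven't been created yet, the job is expected to stay pending until the nodes are available.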
    

You've successfully created a Slurm cluster, connected to it, and run sample jobs! If Cluster Director still hasn't created the two A3 Mega Flex-start VMs, then you can either wait for the cluster to create the VMs, or delete the cluster to avoid incurring any unnecessary charges.
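
To check from the login node whether the compute nodes are actually up, you can count nodes by state in the `sinfo` output. The sketch below is one way to do that, under the assumption that ready nodes report exactly `idle`. It matches the state exactly, so states that carry a suffix (for example `idle~`, which Slurm uses for powered-down nodes) aren't counted. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: count Slurm nodes that are in an exact state, such as "idle".
set -eu

# Reads "NODE STATE" lines on stdin, as printed by: sinfo -N -h -o "%N %T"
# Prints the number of lines whose state equals the first argument.
count_nodes_in_state() {
  awk -v want="$1" '$2 == want { n++ } END { print n + 0 }'
}

if command -v sinfo >/dev/null 2>&1; then
  # sort -u removes duplicates for nodes that belong to several partitions.
  idle="$(sinfo -N -h -o '%N %T' | sort -u | count_nodes_in_state idle)"
  echo "idle nodes: $idle"
else
  echo "sinfo not found; run this on the login node"
fi
```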

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, delete either your project or your cluster.

Delete your project

The easiest way to eliminate billing is to delete the project that you created for this quickstart.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
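
The same project deletion can be sketched with the gcloud CLI. This assumes that the CLI is installed and authenticated and that `my-project-id` is replaced with your real project ID. Because deleting a project removes every resource in it, the script only prints the command unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: delete the quickstart project from the command line.
set -eu

PROJECT_ID="${PROJECT_ID:-my-project-id}"   # hypothetical placeholder
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print the command only

# Build the delete command as a string so that it can be reviewed first.
delete_project_command() {
  echo "gcloud projects delete $PROJECT_ID"
}

if [ "$DRY_RUN" = "1" ]; then
  echo "Would run: $(delete_project_command)"
else
  gcloud projects delete "$PROJECT_ID"
fi
```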

Delete your cluster

To delete the cluster that you created in this quickstart, along with its associated resources, complete the following steps:

  1. On the page that contains the details of your cluster, click Delete.

  2. In the dialog that appears, enter cluster000, and then click Delete to confirm.

What's next