Monitor clusters with custom dashboards or alerts

This document explains how to monitor your cluster by creating custom dashboards or alerting policies in Cloud Monitoring. By using Prometheus metrics for Slurm, you can monitor specific events in your cluster by doing the following:

  • Build custom dashboards: create dashboards with custom widgets based on your workload's needs. For example, you can monitor the percentage of nodes in a cluster that Slurm sets to a specific state.

  • Create alerting policies: receive notifications when a certain threshold is crossed or an event is triggered. For example, you can receive notifications when 10% or more of the nodes in your cluster are degraded.

To monitor the health and resource usage of your clusters by using prebuilt Monitoring dashboards, see Monitor cluster performance with prebuilt dashboards.

Before you begin

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up separate authentication; the console authenticates you automatically.

Required roles

To get the permissions that you need to create Monitoring dashboards or alerting policies, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Available Prometheus metrics

Based on your use case, the following metrics are available for monitoring your clusters:

Job queue metrics

The following metrics help you track the status of your jobs in the queue:

Metric | Description
slurm_queue_pending | The number of pending jobs in the queue, labeled by user and reason for being pending.
slurm_queue_running | The number of running jobs in your cluster partition.
slurm_queue_suspended | The number of suspended jobs in your cluster partition.
slurm_queue_cancelled | The number of cancelled jobs in your cluster partition.
slurm_queue_completing | The number of jobs that are completing in your cluster partition.
slurm_queue_completed | The number of completed jobs in your cluster partition.
slurm_queue_configuring | The number of jobs that are configuring in your cluster partition.
slurm_queue_failed | The number of failed jobs in your cluster partition.
slurm_queue_timeout | The number of jobs that were stopped by timeout.
slurm_queue_preempted | The number of jobs that were preempted by Compute Engine.
slurm_queue_node_fail | The number of jobs that stopped due to host errors.
slurm_cores_pending | The number of vCPU cores that have pending jobs in the queue.
slurm_cores_running | The number of vCPU cores that are running jobs in your cluster partition.
slurm_cores_suspended | The number of vCPU cores that have suspended jobs in your cluster partition.
slurm_cores_cancelled | The number of vCPU cores that have cancelled jobs in your cluster partition.
slurm_cores_completing | The number of vCPU cores that have jobs that are completing in the cluster.
slurm_cores_completed | The number of vCPU cores that have completed jobs in your cluster partition.
slurm_cores_configuring | The number of vCPU cores that have configuring jobs in your cluster partition.
slurm_cores_failed | The number of vCPU cores that have failed jobs in your cluster partition.
slurm_cores_timeout | The number of vCPU cores that have jobs that were stopped by timeout in your cluster partition.
slurm_cores_preempted | The number of vCPU cores that have jobs that were stopped by preemption in your cluster partition.
slurm_cores_node_fail | The number of vCPU cores that have jobs that were stopped due to host errors.
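
As a sketch of how these metrics can be combined, the following PromQL query charts pending jobs grouped by user. It assumes the metrics are ingested with the names listed above and relies on the user label that slurm_queue_pending carries, as described in its table entry:

```promql
# Pending jobs per user, summed across the cluster
sum by (user) (slurm_queue_pending)
```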

Node state metrics

The following metrics help you understand the number of nodes in your cluster that have a specific Slurm state:

Metric | Description
slurm_nodes_alloc | The number of nodes that are in the allocated state.
slurm_nodes_comp | The number of nodes that are in the completing state.
slurm_nodes_down | The number of nodes that are in the down state.
slurm_nodes_drain | The number of nodes that are in the drain state.
slurm_nodes_err | The number of nodes that are in the error state.
slurm_nodes_fail | The number of nodes that are in the fail state.
slurm_nodes_idle | The number of nodes that are in the idle state.
slurm_nodes_maint | The number of nodes that are in the maintenance state.
slurm_nodes_mix | The number of nodes that are in the mix state.
slurm_nodes_resv | The number of nodes that are reserved in your cluster partition.
slurm_nodes_other | The number of nodes that have an unknown state.
slurm_nodes_planned | The number of nodes that have a job scheduled.
slurm_nodes_total | The total number of nodes in your cluster partition.
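
For example, the node state metrics can be combined to chart the fraction of the cluster that sits idle. A minimal sketch, assuming the metrics are ingested with the names listed above:

```promql
# Fraction of nodes that are idle, across the whole cluster
sum(slurm_nodes_idle) / sum(slurm_nodes_total)
```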

Scheduler performance metrics

The following metrics help you understand statistics from the Slurm scheduler daemon:

Metric | Description
slurm_scheduler_threads | The number of active threads in the scheduler.
slurm_scheduler_queue_size | The length of the main queue for the scheduler.
slurm_scheduler_dbd_queue_size | The length of the Slurm database daemon agent queue.
slurm_scheduler_last_cycle | The time that the scheduler took to complete the last cycle, in microseconds.
slurm_scheduler_mean_cycle | The average time that the scheduler takes for a cycle, in microseconds.
slurm_scheduler_cycle_per_minute | The number of cycles that the scheduler completes per minute.
slurm_scheduler_backfill_last_cycle | The time that the scheduler took to complete the last backfill cycle, in microseconds.
slurm_scheduler_backfill_mean_cycle | The average time that the scheduler takes for backfill cycles, in microseconds.
slurm_scheduler_backfill_depth_mean | The average depth of searches in backfill scheduling.
slurm_scheduler_backfilled_jobs_since_start_total | The total number of jobs that Slurm started by using backfill scheduling since Slurm started.
slurm_scheduler_backfilled_jobs_since_cycle_total | The number of jobs that Slurm started by using backfill scheduling since the last statistics reset.
slurm_scheduler_backfilled_heterogeneous_total | The total number of heterogeneous job components (components of jobs that use different types of resources) that Slurm started by using backfill scheduling since Slurm started.
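
For example, the cycle-time metrics above are reported in microseconds, so a query can convert them to seconds for easier reading in a chart. A sketch, assuming the metric names listed above:

```promql
# Mean backfill scheduling cycle time, converted from microseconds to seconds
slurm_scheduler_backfill_mean_cycle / 1e6
```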

GPU and hardware metrics

The following metrics help you understand the usage of your GPUs and the resource consumption for each node in your cluster:

Metric | Description
slurm_gpus_alloc | The number of GPUs that are allocated in your cluster partition.
slurm_gpus_idle | The number of GPUs that are idle.
slurm_gpus_other | The number of GPUs that are in other states.
slurm_gpus_total | The total number of GPUs in your cluster partition.
slurm_gpus_utilization | The overall utilization ratio of the GPUs in your cluster partition.
slurm_node_cpu_alloc | The number of vCPU cores that are allocated on each node in your cluster partition.
slurm_node_cpu_idle | The number of vCPU cores that are idle on each node in your cluster partition.
slurm_node_cpu_other | The number of vCPU cores that are in other states on each node in your cluster partition.
slurm_node_cpu_total | The total number of vCPU cores on each node in your cluster partition.
slurm_node_mem_alloc | The amount of memory that is allocated on each node in your cluster partition, in megabytes (MB).
slurm_node_mem_total | The total amount of memory available on each node in your cluster partition, in megabytes (MB).
slurm_node_status | The status of each node in your cluster partition.
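
For example, you can derive a cluster-wide GPU allocation percentage from these metrics. A sketch, assuming the metric names listed above:

```promql
# Percentage of GPUs that are allocated across the cluster partition
(sum(slurm_gpus_alloc) / sum(slurm_gpus_total)) * 100
```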

RPC and user activity metrics

The following metrics help you understand the volume and latency of remote procedure calls (RPCs) in your cluster, and the job activity of individual users:

Metric | Description
slurm_rpc_stats | The number of RPCs that Slurm has processed, labeled by operation.
slurm_rpc_stats_avg_time | The average time that Slurm takes for each RPC operation.
slurm_rpc_stats_total_time | The total time that Slurm spent on RPC operations.
slurm_user_rpc_stats | The number of RPCs that Slurm processed for each user.
slurm_user_rpc_stats_avg_time | The average time that Slurm takes for RPCs for each user.
slurm_user_rpc_stats_total_time | The total time that Slurm spent on RPCs for each user.
slurm_user_jobs_pending | The number of pending jobs for each user.
slurm_user_jobs_running | The number of running jobs for each user.
slurm_user_cpus_running | The number of vCPU cores that are running jobs for each user.
slurm_user_jobs_suspended | The number of suspended jobs for each user.
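
For example, to find the users generating the most RPC load, you can rank the per-user metrics. A sketch, assuming the metric names listed above and that each per-user time series carries a label identifying the user:

```promql
# The five users with the highest total RPC time
topk(5, slurm_user_rpc_stats_total_time)
```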

Reservation metrics

The following metrics help you understand the resources reserved for your cluster, and related account information:

Metric | Description
slurm_reservation_info | The metadata about active reservations.
slurm_reservation_start_time_seconds | The start time of the reservation, as a Unix timestamp.
slurm_reservation_end_time_seconds | The end time of the reservation, as a Unix timestamp.
slurm_reservation_node_count | The total number of nodes that Slurm has allocated to a reservation.
slurm_reservation_core_count | The total number of vCPU cores that Slurm has allocated to a reservation.
slurm_account_fairshare | The fairshare score that Slurm calculated for an account.
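
For example, the reservation timestamps can be turned into a time-remaining signal. A sketch, assuming the metric names listed above:

```promql
# Seconds until each active reservation ends (negative once it has expired)
slurm_reservation_end_time_seconds - time()
```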

Create custom dashboards

You can display Prometheus metrics in a widget within a Monitoring dashboard. This approach lets you configure and display data so that it best fits the needs of your workload.

The following example shows how to create a widget in a new custom dashboard by using PromQL. This widget displays the percentage of nodes in a cluster that Slurm sets to a specific state, which lets you verify how many nodes in your cluster are unused.

To create the example widget in a new custom dashboard, complete the following steps:

  1. In the Google Cloud console, go to the Dashboards page.

    Go to Dashboards

  2. Click Create custom dashboard. The New dashboard page opens.

  3. Click Add widget. The Add widget pane appears.

  4. In the Visualization section, click Line. The Configure widget pane appears.

  5. Click PromQL. Then, to view the percentage of nodes with a state set to allocated in a cluster, enter the following query:

    ((count({"__name__"="slurm_node_status","status"=~"allocated|allocated#|allocated~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
    
  6. Click Add query, click PromQL, and then, to view the percentage of nodes with a state set to completing in a cluster, enter the following query:

    ((count({"__name__"="slurm_node_status","status"=~"completing"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
    
  7. Repeat step 6 for the following queries:

    1. To view the percentage of nodes with a state set to down in a cluster, use the following query:

      ((count({"__name__"="slurm_node_status","status"=~"down|down\\*|down~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
      
    2. To view the percentage of nodes with a state set to drained in a cluster, use the following query:

      ((count({"__name__"="slurm_node_status","status"=~"drained"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
      
    3. To view the percentage of nodes with a state set to idle in a cluster, use the following query:

      ((count({"__name__"="slurm_node_status","status"=~"idle|idle\\*|idle#|idle%|idle~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
      
    4. To view the percentage of nodes with a state set to mixed in a cluster, use the following query:

      ((count({"__name__"="slurm_node_status","status"=~"mixed|mixed#"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
      
  8. Click Run query.

  9. Click Apply.
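
The queries in the preceding steps all follow the same pattern, so you can chart any other node state by substituting a different status expression. The general form, using the same metric and label names as the steps above, where STATUS_REGEX is a placeholder for the states you want to match:

```promql
# Percentage of nodes whose status matches STATUS_REGEX (replace STATUS_REGEX)
((count({"__name__"="slurm_node_status","status"=~"STATUS_REGEX"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```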

For more information about creating custom dashboards in Monitoring, see Create and manage custom dashboards.

Create alerting policies

You can create alerting policies and receive notifications when a metric in your cluster crosses a specific threshold. Alerting policies notify you when certain events occur in your cluster.

The following example shows how to create an alerting policy by using PromQL. The policy notifies you when Slurm sets 10% or more of the nodes in your cluster to the down state. This state indicates that the nodes are degraded and must be repaired before you can run jobs on them again.

To create the example alerting policy, complete the following steps:

  1. In the Google Cloud console, go to the Alerting page.

    Go to Alerting

  2. Click Create policy. The Create alerting policy page opens.

  3. In the Policy configuration mode pane that appears, do the following:

    1. Select PromQL.

    2. In the Query editor box, enter the following query:

      (((count({"__name__"="slurm_node_status","status"=~"down|down\\*|down~"}) or vector(0)) / count({"__name__"="slurm_node_status"}))) * 100 >= 10
      
    3. Click Next.

  4. In the Configure alert trigger pane that appears, do the following:

    1. In the Retest window list, select 5 minutes.

    2. In the Evaluation interval list, select 1 minute.

    3. In the Condition name field, enter a name for the condition that triggers the alert.

    4. Click Next.

  5. In the Configure notifications and finalize alert pane that appears, do the following:

    1. In the Notification channels list, select the channels where Google Cloud can notify you about the incident. If you don't have a notification channel in the list, then click Manage notification channels, and then follow the prompts to create one.

    2. In the Notification subject line field, enter the subject line for the message that Google Cloud sends you.

    3. In the Name the alert policy section, in the Alert policy name field, enter a name for your alerting policy.

    4. Click Next.

  6. Click Save policy.

For more information about creating alerting policies in Monitoring, see Create metric-threshold alerting policies.

What's next