This document explains how to monitor your cluster by creating custom dashboards or alerting policies in Cloud Monitoring. By using Prometheus metrics for Slurm, you can monitor specific events in your cluster by doing the following:
Build custom dashboards: create dashboards with custom widgets based on your workload's needs. For example, you can monitor the percentage of nodes in a cluster that Slurm sets to a specific state.
Create alerting policies: receive notifications when a certain threshold is crossed or an event is triggered. For example, you can receive notifications when 10% or more nodes in your cluster are degraded.
To use prebuilt Monitoring dashboards to monitor the health and resource usage of your clusters, see instead Monitor cluster performance with prebuilt dashboards.
Before you begin
If you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
Required roles
To get the permissions that you need to create Monitoring dashboards or alerting policies, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Available Prometheus metrics
Based on your use case, the following metrics are available for monitoring your clusters:
Job queue metrics
The following metrics help you track the status of your jobs in the queue:
| Metric | Description |
|---|---|
| `slurm_queue_pending` | The number of pending jobs in the queue, labeled by user and reason for being pending. |
| `slurm_queue_running` | The number of running jobs in your cluster partition. |
| `slurm_queue_suspended` | The number of suspended jobs in your cluster partition. |
| `slurm_queue_cancelled` | The number of cancelled jobs in your cluster partition. |
| `slurm_queue_completing` | The number of jobs that are completing in your cluster partition. |
| `slurm_queue_completed` | The number of completed jobs in your cluster partition. |
| `slurm_queue_configuring` | The number of jobs that are configuring in your cluster partition. |
| `slurm_queue_failed` | The number of failed jobs in your cluster partition. |
| `slurm_queue_timeout` | The number of jobs that were stopped by timeout. |
| `slurm_queue_preempted` | The number of jobs that were preempted by Compute Engine. |
| `slurm_queue_node_fail` | The number of jobs that stopped due to host errors. |
| `slurm_cores_pending` | The number of vCPU cores that have pending jobs in the queue. |
| `slurm_cores_running` | The number of vCPU cores that are running jobs in your cluster partition. |
| `slurm_cores_suspended` | The number of vCPU cores that have suspended jobs in your cluster partition. |
| `slurm_cores_cancelled` | The number of vCPU cores that have cancelled jobs in your cluster partition. |
| `slurm_cores_completing` | The number of vCPU cores that have jobs that are completing in the cluster. |
| `slurm_cores_completed` | The number of vCPU cores that have completed jobs in your cluster partition. |
| `slurm_cores_configuring` | The number of vCPU cores that have configuring jobs in your cluster partition. |
| `slurm_cores_failed` | The number of vCPU cores that have failed jobs in your cluster partition. |
| `slurm_cores_timeout` | The number of vCPU cores that have jobs that were stopped by timeout in your cluster partition. |
| `slurm_cores_preempted` | The number of vCPU cores that have jobs that were stopped by preemption in your cluster partition. |
| `slurm_cores_node_fail` | The number of vCPU cores that have jobs that were stopped due to host errors. |
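As a sketch of how you might combine these metrics in PromQL, the following queries chart queue backlog by user and in core terms. The `user` label on `slurm_queue_pending` comes from the metric description above; other label names depend on your exporter configuration:

```
# Pending jobs per user (the "user" label comes from the exporter).
sum by (user) (slurm_queue_pending)

# Ratio of pending to running vCPU cores across the cluster.
sum(slurm_cores_pending) / sum(slurm_cores_running)
```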
Node state metrics
The following metrics help you understand the number of nodes in your cluster that have a specific Slurm state:
| Metric | Description |
|---|---|
| `slurm_nodes_alloc` | The number of nodes that are in the allocated state. |
| `slurm_nodes_comp` | The number of nodes that are in the completing state. |
| `slurm_nodes_down` | The number of nodes that are in the down state. |
| `slurm_nodes_drain` | The number of nodes that are in the drain state. |
| `slurm_nodes_err` | The number of nodes that are in the error state. |
| `slurm_nodes_fail` | The number of nodes that are in the fail state. |
| `slurm_nodes_idle` | The number of nodes that are in the idle state. |
| `slurm_nodes_maint` | The number of nodes that are in the maintenance state. |
| `slurm_nodes_mix` | The number of nodes that are in the mix state. |
| `slurm_nodes_resv` | The number of nodes that are reserved in your cluster partition. |
| `slurm_nodes_other` | The number of nodes that have an unknown state. |
| `slurm_nodes_planned` | The number of nodes that have a job scheduled. |
| `slurm_nodes_total` | The total number of nodes in your cluster partition. |
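For example, these counters let you express degraded capacity directly as a ratio. The following PromQL sketch assumes both metrics carry matching labels in your deployment:

```
# Percentage of nodes in the down state across the cluster.
(sum(slurm_nodes_down) / sum(slurm_nodes_total)) * 100
```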
Scheduler performance metrics
The following metrics help you understand statistics from the Slurm scheduler daemon:
| Metric | Description |
|---|---|
| `slurm_scheduler_threads` | The number of active threads in the scheduler. |
| `slurm_scheduler_queue_size` | The length of the main queue for the scheduler. |
| `slurm_scheduler_dbd_queue_size` | The length of the Slurm database daemon agent queue. |
| `slurm_scheduler_last_cycle` | The time that the scheduler took to complete the last cycle, in microseconds. |
| `slurm_scheduler_mean_cycle` | The average time that the scheduler takes for a cycle, in microseconds. |
| `slurm_scheduler_cycle_per_minute` | The number of cycles that the scheduler completes per minute. |
| `slurm_scheduler_backfill_last_cycle` | The time that the scheduler took to complete the last backfill cycle, in microseconds. |
| `slurm_scheduler_backfill_mean_cycle` | The average time that the scheduler takes for backfill cycles, in microseconds. |
| `slurm_scheduler_backfill_depth_mean` | The average depth of searches in backfill scheduling. |
| `slurm_scheduler_backfilled_jobs_since_start_total` | The total number of jobs that Slurm started using backfill scheduling since Slurm started. |
| `slurm_scheduler_backfilled_jobs_since_cycle_total` | The number of jobs that Slurm started using backfill scheduling since the last statistics reset. |
| `slurm_scheduler_backfilled_heterogeneous_total` | The total number of heterogeneous job components (jobs that use different types of resources) that Slurm started using backfill scheduling since Slurm started. |
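To track scheduler health over time, you might plot cycle latency and backfill throughput. This sketch assumes `slurm_scheduler_backfilled_jobs_since_start_total` behaves as a monotonically increasing counter, as its `_total` suffix suggests:

```
# Last backfill cycle time, converted from microseconds to milliseconds.
slurm_scheduler_backfill_last_cycle / 1000

# Rate of jobs started through backfill scheduling over the last 10 minutes.
rate(slurm_scheduler_backfilled_jobs_since_start_total[10m])
```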
GPU and hardware metrics
The following metrics help you understand the usage of your GPUs and the resource consumption for each node in your cluster:
| Metric | Description |
|---|---|
| `slurm_gpus_alloc` | The number of GPUs that are allocated in your cluster partition. |
| `slurm_gpus_idle` | The number of GPUs that are idle. |
| `slurm_gpus_other` | The number of GPUs that are in other states. |
| `slurm_gpus_total` | The total number of GPUs in your cluster partition. |
| `slurm_gpus_utilization` | The overall utilization ratio of the GPUs in your cluster partition. |
| `slurm_node_cpu_alloc` | The number of vCPU cores that are allocated on each node in your cluster partition. |
| `slurm_node_cpu_idle` | The number of vCPU cores that are idle on each node in your cluster partition. |
| `slurm_node_cpu_other` | The number of vCPU cores that are in other states on each node in your cluster partition. |
| `slurm_node_cpu_total` | The total number of vCPU cores on each node in your cluster partition. |
| `slurm_node_mem_alloc` | The amount of memory that is allocated on each node in your cluster partition, in megabytes (MB). |
| `slurm_node_mem_total` | The total amount of memory available on each node in your cluster partition, in megabytes (MB). |
| `slurm_node_status` | The status of each node in your cluster partition. |
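For example, you can derive allocation percentages from these per-node counters. This PromQL sketch assumes the allocation and total metrics share the same per-node labels so they divide element-wise:

```
# Fraction of GPUs currently allocated across the cluster.
sum(slurm_gpus_alloc) / sum(slurm_gpus_total)

# Allocated memory as a percentage of total memory, per node.
(slurm_node_mem_alloc / slurm_node_mem_total) * 100
```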
RPC and user activity metrics
The following metrics help you understand the volume and latency of remote procedure calls (RPCs) in your cluster:
| Metric | Description |
|---|---|
| `slurm_rpc_stats` | The number of RPCs that Slurm has processed, labeled by operation. |
| `slurm_rpc_stats_avg_time` | The average time that Slurm takes for each RPC operation. |
| `slurm_rpc_stats_total_time` | The total time that Slurm spent on RPC operations. |
| `slurm_user_rpc_stats` | The number of RPCs that Slurm processed for each user. |
| `slurm_user_rpc_stats_avg_time` | The average time that Slurm takes for RPCs for each user. |
| `slurm_user_rpc_stats_total_time` | The total time that Slurm spent on RPCs for each user. |
| `slurm_user_jobs_pending` | The number of pending jobs for each user. |
| `slurm_user_jobs_running` | The number of running jobs for each user. |
| `slurm_user_cpus_running` | The number of vCPU cores that are running jobs for each user. |
| `slurm_user_jobs_suspended` | The number of suspended jobs for each user. |
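These per-user metrics are well suited to ranking queries. As a sketch, the following PromQL surfaces the heaviest users and the slowest RPC operations:

```
# Five users with the most pending jobs.
topk(5, slurm_user_jobs_pending)

# Five slowest RPC operations by average time.
topk(5, slurm_rpc_stats_avg_time)
```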
Reservation metrics
The following metrics help you understand the reserved resources for your cluster, and related cluster information:
| Metric | Description |
|---|---|
| `slurm_reservation_info` | The metadata about active reservations. |
| `slurm_reservation_start_time_seconds` | The start time of the reservation, as a Unix timestamp. |
| `slurm_reservation_end_time_seconds` | The end time of the reservation, as a Unix timestamp. |
| `slurm_reservation_node_count` | The total number of nodes that Slurm has allocated to a reservation. |
| `slurm_reservation_core_count` | The total number of vCPU cores that Slurm has allocated to a reservation. |
| `slurm_account_fairshare` | The fairshare score that Slurm calculated for an account. |
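Because the reservation timestamps are Unix epoch seconds, you can compare them against the evaluation time. This sketch uses the standard PromQL `time()` function:

```
# Seconds remaining until each active reservation ends.
slurm_reservation_end_time_seconds - time()
```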
Create custom dashboards
You can display Prometheus metrics in a widget within a Monitoring dashboard. This approach lets you configure and display data so that it best fits the needs of your workload.
The following example shows how to create a widget in a new custom dashboard by using PromQL. This widget displays the percentage of nodes in a cluster that Slurm sets to a specific state, letting you verify how many nodes in your cluster are unused.
To create the example widget in a new custom dashboard, complete the following steps:
In the Google Cloud console, go to the Dashboards page.
Click Create custom dashboard. The New dashboard page opens.
Click Add widget. The Add widget pane appears.
In the Visualization section, click Line. The Configure widget pane appears.
Click PromQL. Then, to view the percentage of nodes with a state set to allocated in a cluster, enter the following query:

```
((count({"__name__"="slurm_node_status","status"=~"allocated|allocated#|allocated~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```

Click Add query, click PromQL, and then, to view the percentage of nodes with a state set to completing in a cluster, enter the following query:

```
((count({"__name__"="slurm_node_status","status"=~"completing"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```

Repeat the previous step for the following queries:

To view the percentage of nodes with a state set to down in a cluster, use the following query:

```
((count({"__name__"="slurm_node_status","status"=~"down|down\\*|down~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```

To view the percentage of nodes with a state set to drained in a cluster, use the following query:

```
((count({"__name__"="slurm_node_status","status"=~"drained"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```

To view the percentage of nodes with a state set to idle in a cluster, use the following query:

```
((count({"__name__"="slurm_node_status","status"=~"idle|idle\\*|idle#|idle%|idle~"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```

To view the percentage of nodes with a state set to mixed in a cluster, use the following query:

```
((count({"__name__"="slurm_node_status","status"=~"mixed|mixed#"}) or vector(0)) / count({"__name__"="slurm_node_status"})) * 100
```
Click Run query.
Click Apply.
For more information about creating custom dashboards in Monitoring, see Create and manage custom dashboards.
Create alerting policies
You can create alerting policies and receive notifications when a metric in your cluster crosses a specific threshold. Alerting policies notify you when certain events occur in your cluster.
The following example shows how to create an alerting policy by using PromQL. The policy notifies you when Slurm sets 10% or more nodes in your cluster to the down state. This state indicates that the nodes are degraded and must be repaired before you can run jobs on them again.
To create the example alerting policy, complete the following steps:
In the Google Cloud console, go to the Alerting page.
Click Create policy. The Create alerting policy page opens.
In the Policy configuration mode pane that appears, do the following:
Select PromQL.
In the Query editor box, enter the following query:

```
(((count({"__name__"="slurm_node_status","status"=~"down|down\\*|down~"}) or vector(0)) / count({"__name__"="slurm_node_status"}))) * 100 >= 10
```

Click Next.
In the Configure alert trigger pane that appears, do the following:
In the Retest window list, select 5 minutes.
In the Evaluation interval list, select 1 minute.
In the Condition name field, enter a name for the condition that triggers the alert.
Click Next.
In the Configure notifications and finalize alert pane that appears, do the following:
In the Notification channels list, select the channels where Google Cloud can notify you about the incident. If you don't have a notification channel in the list, then click Manage notification channels, and then follow the prompts to create one.
In the Notification subject line field, enter the subject line for the message that Google Cloud sends you.
In the Name the alert policy section, in the Alert policy name field, enter a name for your alerting policy.
Click Next.
Click Save policy.
For more information about creating alerting policies in Monitoring, see Create metric-threshold alerting policies.