Node lifecycle

This document explains the lifecycle of a cluster node in Cluster Director.

A node in a cluster transitions through various states from its creation to its deletion. Understanding these states lets you do the following:

Manage resources and costs: understand when Cluster Director provisions and stops dynamic nodes to run queued jobs and reduce unnecessary charges.
Minimize disruptions: anticipate planned host maintenance events.
Troubleshoot errors: identify why a node fails to boot, restart, or resume.

To check the current state of a node, view the details of your cluster.

Node states

The following diagram shows the different states that Cluster Director can set a node to:

The states shown in the preceding diagram are as follows:

Unprovisioned state
Creating state
Active state
Suspending state
Suspended state
Repairing state
Failed state
Deleting state

Unprovisioned state

When Cluster Director creates a dynamic node, the node starts in the UNPROVISIONED state. From this state, a dynamic node can transition to one of the following states:

CREATING: when you submit a job that requires more compute resources than your static nodes provide, Cluster Director attempts to allocate resources to create the dynamic node.
DELETING: when you reduce the dynamic node count by modifying the cluster, or when you delete the cluster, the delete operation for the node starts.

Creating state

When a node is in the CREATING state, Cluster Director attempts to allocate resources to boot, restart, or resume your node.

From this state, the node can transition to one of the following states:

ACTIVE: when Cluster Director allocates resources for your node, the node boots up and starts running.
FAILED: the node encounters a host error.

Active state

When a node is in the ACTIVE state, the node is healthy, running, and ready to run Slurm jobs.

From this state, the node can transition to one of the following states:

SUSPENDING: you stop or suspend the node.
REPAIRING: planned or unplanned host maintenance for the node starts.
UNPROVISIONED: Slurm deletes the node after it has been in the idle state for some time, or Cluster Director deletes the node after you reduce the number of dynamic nodes in the cluster.
DELETING: you reduce the static node count in a nodeset or you delete the cluster.
FAILED: the node encounters a host error.

Suspending state

When a node is in the SUSPENDING state, Cluster Director is stopping or suspending the node. You can suspend only nodes that use N2 machine types. For more information, see Suspend, stop, or reset Compute Engine instances.

From this state, when the stop or suspend operation completes, the node transitions to the SUSPENDED state.

Suspended state

When a node is in the SUSPENDED state, the stop or suspend operation has completed and the node is no longer running.

From this state, the node can transition to one of the following states:

CREATING: when you resume or restart the node.
DELETING: when you delete the node or the cluster.

Repairing state

When a node is in the REPAIRING state, Cluster Director stops the node to perform host maintenance on the node's underlying Compute Engine infrastructure. Only GPU nodes enter this state. Nodes that use N2 machine types keep running during maintenance because Cluster Director live migrates them.

From this state, the node can transition to one of the following states:

ACTIVE: when host maintenance completes, the node restarts. Cluster Director helps ensure that the resources needed to restart the node are available.
FAILED: the node encounters a host error.

Failed state

When a node is in the FAILED state, the node is unhealthy or has encountered a host error. To help you troubleshoot the issue, Cluster Director provides details about the error in Logs Explorer, or in the stateMessage field when you use the Google Cloud CLI or REST API.

From this state, when you delete the node or cluster, the node transitions to the DELETING state.
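
As an illustration of how the stateMessage field might be used when troubleshooting, the following Python sketch lists the nodes that are in the FAILED state together with their error details. It assumes that you have already retrieved node details and loaded them as a list of dictionaries. Only the stateMessage field name and the FAILED state come from this document; the name and state keys, the overall shape of the data, and the sample values are hypothetical, so adjust them to match the output that you actually receive.

```python
def failed_nodes(nodes):
    """Return (name, message) pairs for every node whose state is FAILED.

    `nodes` is assumed to be a list of dictionaries. The "name" and "state"
    keys are hypothetical; "stateMessage" is the field named in this document.
    """
    report = []
    for node in nodes:
        if node.get("state") == "FAILED":
            name = node.get("name", "<unknown>")
            message = node.get("stateMessage", "")
            report.append((name, message))
    return report


# Made-up sample data for illustration only; not real Cluster Director output.
sample_nodes = [
    {"name": "node-0", "state": "ACTIVE"},
    {"name": "node-1", "state": "FAILED", "stateMessage": "example host error text"},
    {"name": "node-2", "state": "SUSPENDED"},
]

for name, message in failed_nodes(sample_nodes):
    print(f"{name}: {message}")
```

Because a failed node can only transition to the DELETING state, a report like this one is mainly useful for deciding which nodes to delete and which error details to look up in Logs Explorer.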

Deleting state

When a node is in the DELETING state, Cluster Director is deleting the node and releasing its resources.

Nodes enter this state when you manually reduce the static node count (staticNodeCount) or maximum dynamic node count (maxDynamicNodeCount) in a nodeset, or when you delete the cluster. When the deletion operation ends, Cluster Director permanently deletes the node and all of its attached resources.
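
The transitions described in the preceding sections can be summarized as a lookup table. The following Python sketch is not part of Cluster Director; it only transcribes the state names and transitions from this document into a dictionary so that you can check whether a transition is expected, for example in a monitoring script. The names NodeState, TRANSITIONS, can_transition, and reachable are hypothetical helpers introduced for this illustration.

```python
from enum import Enum


class NodeState(Enum):
    UNPROVISIONED = "UNPROVISIONED"
    CREATING = "CREATING"
    ACTIVE = "ACTIVE"
    SUSPENDING = "SUSPENDING"
    SUSPENDED = "SUSPENDED"
    REPAIRING = "REPAIRING"
    FAILED = "FAILED"
    DELETING = "DELETING"


# Allowed transitions, transcribed from the state descriptions in this document.
TRANSITIONS = {
    NodeState.UNPROVISIONED: {NodeState.CREATING, NodeState.DELETING},
    NodeState.CREATING: {NodeState.ACTIVE, NodeState.FAILED},
    NodeState.ACTIVE: {
        NodeState.SUSPENDING,
        NodeState.REPAIRING,
        NodeState.UNPROVISIONED,
        NodeState.DELETING,
        NodeState.FAILED,
    },
    NodeState.SUSPENDING: {NodeState.SUSPENDED},
    NodeState.SUSPENDED: {NodeState.CREATING, NodeState.DELETING},
    NodeState.REPAIRING: {NodeState.ACTIVE, NodeState.FAILED},
    NodeState.FAILED: {NodeState.DELETING},
    NodeState.DELETING: set(),  # terminal: the node is permanently deleted
}


def can_transition(current, target):
    """Return True if this document lists `target` as a next state of `current`."""
    return target in TRANSITIONS[current]


def reachable(start):
    """Return every state reachable from `start` through one or more transitions."""
    seen = set()
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


print(can_transition(NodeState.ACTIVE, NodeState.SUSPENDING))
print(can_transition(NodeState.FAILED, NodeState.ACTIVE))
print(sorted(s.value for s in reachable(NodeState.FAILED)))
```

In this table, the only way out of the FAILED state is deletion, and DELETING has no outgoing transitions, which matches the descriptions of those two states.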

Pricing for clusters and nodes

When you create a cluster, you incur charges for the underlying Google Cloud resources that your cluster uses:

Nodes: you incur charges for vCPUs, memory, and any attached GPUs only while the node is running, which is when the node is in the ACTIVE state. For any disks and external IP addresses that are attached to the node, you incur charges as long as the node exists, regardless of the node state. For more information, see Virtual machines pricing.
Networking resources: you incur charges for data transfer (egress) and any additional networking features that your cluster uses. For more information, see All networking pricing.
Storage resources: you incur charges for the storage resources that your cluster uses, as long as those resources exist. For more information, see the following: