Supported modules

This document describes the core module categories that are supported by Cluster Toolkit. Use these references to identify the right building blocks to construct a deployment tailored to your specific workload requirements.

For more information about module concepts and structure, see Modules.

For a complete list of supported modules, including experimental and community modules, see the modules page in the Cluster Toolkit GitHub repository.

Compute modules

Compute modules define the processing power of your cluster, including virtual machine (VM) instances and autoscaling groups. Use these modules to provision resources that are optimized for your specific workloads, such as high-performance computing (HPC) or AI/ML training.

The following modules let you configure compute resources:

Slurm compute resources

If you use the Slurm scheduler, then use specific modules to define the partitions and node sets that make up your cluster:

  • Create a partition for a Slurm controller: This module creates a compute partition composed of one or more nodesets and configures scheduling options such as job exclusivity and power management.
  • Create a nodeset for a Slurm partition: This module defines a set of compute nodes with specific hardware configurations, letting you build heterogeneous Slurm partitions by grouping multiple nodesets together.
  • Create a TPU nodeset for a Slurm partition: This module defines a nodeset of Tensor Processing Units (TPUs) for use in a Slurm partition, letting you accelerate machine learning workloads with Google's custom application-specific integrated circuits (ASICs).
  • Create a dynamic nodeset for a Slurm partition: This module creates an instance template and a dynamic nodeset configuration, letting your Slurm partition provision compute nodes on demand.
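
As a rough sketch of how a partition groups nodesets, a blueprint fragment might look like the following. The module source paths and setting names are recalled from the Cluster Toolkit GitHub repository and should be verified against the release you use. The IDs, machine type, and node count are hypothetical values chosen for illustration.

```yaml
# Illustrative blueprint fragment; verify paths and settings against the repository.
deployment_groups:
- group: primary
  modules:
  # Hypothetical network module that the nodeset attaches to.
  - id: network
    source: modules/network/vpc

  # A nodeset describes one hardware configuration for compute nodes.
  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      machine_type: c2-standard-60   # hypothetical choice
      node_count_dynamic_max: 4      # hypothetical upper bound on nodes

  # A partition is composed of one or more nodesets.
  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [compute_nodeset]
    settings:
      partition_name: compute
      exclusive: true                # job exclusivity option
```

Listing more than one nodeset under the partition's `use` field is how a heterogeneous partition would be expressed in this sketch.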

Scheduler modules

Scheduler modules deploy and configure the workload managers that orchestrate your jobs. These modules handle the setup of controller nodes, login nodes, and the necessary software to manage job queues.

The following modules let you configure schedulers:

  • Create a Batch job template: This module creates a Google Cloud Batch job template and an associated instance template to configure compute resources, networking, and storage options.
  • Create a Batch login node: This module provisions a login node VM that mirrors the configuration of your Batch job, letting you test workloads and submit jobs.
  • Create a GKE cluster: This module creates a GKE cluster and configures options such as VPC-native networking, multi-networking, and inference gateways.
  • Use an existing GKE cluster: This module discovers an existing GKE cluster and outputs its attributes, letting you use the existing cluster as a drop-in substitute for a new cluster in your blueprints.
  • Create a Slurm controller node: This module deploys a Slurm controller node to orchestrate jobs and manage the cluster, with support for live reconfiguration and scheduled maintenance.
  • Create a Slurm login node: This module creates a login node for a Slurm cluster to provide an access point for users to submit jobs and interact with the controller.
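
For the Slurm scheduler, the controller and login modules are typically wired to the rest of the blueprint through the `use` field. The fragment below is a minimal sketch under that assumption. The module paths are recalled from the Cluster Toolkit GitHub repository, and the IDs `network` and `compute_partition` are hypothetical modules assumed to be defined earlier in the same blueprint.

```yaml
# Illustrative blueprint fragment; verify paths against the repository.
deployment_groups:
- group: primary
  modules:
  # Login node: the access point where users submit jobs.
  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]               # hypothetical network module ID

  # Controller node: orchestrates jobs and manages the cluster.
  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network                    # hypothetical network module ID
    - compute_partition          # hypothetical partition module ID
    - slurm_login
```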

File system modules

File system modules provision shared storage resources that your cluster nodes can access. These modules let you mount high-performance storage solutions for data-intensive applications.

The following modules let you configure storage:

  • Create a Cloud Storage bucket: This module creates a Cloud Storage bucket to provide scalable object storage for your cluster, with options to configure encryption and lifecycle rules.
  • Create a Managed Lustre instance: This module creates a Google Cloud Managed Lustre instance to provide a high-performance network file system that you can mount to your cluster nodes, with options to import data from Cloud Storage.
  • Create a Filestore instance: This module creates a Filestore instance to provide high-performance network file storage for your VM instances and GKE clusters.
  • Create a NetApp Volumes storage pool: This module creates a Google Cloud NetApp Volumes storage pool to provide managed file storage services.
  • Create a NetApp Volumes volume: This module creates a Google Cloud NetApp Volumes volume within a storage pool, letting you mount high-performance NFS or SMB shares to your nodes.
  • Create a Google Kubernetes Engine persistent volume: This module creates Kubernetes Persistent Volume (PV) and Persistent Volume Claim (PVC) resources to connect storage modules to GKE jobs.
  • Create a Google Kubernetes Engine storage class: This module creates a Kubernetes StorageClass (SC) to dynamically provision Google Cloud storage resources for GKE clusters.
  • Use pre-existing network storage: This module configures access to an existing file system, such as Network File System (NFS), Managed Lustre, or Cloud Storage, letting you mount external storage resources to your cluster nodes.
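
To illustrate the difference between creating storage and reusing it, the following sketch shows one module of each kind. The module paths and setting names are recalled from the Cluster Toolkit GitHub repository and should be checked before use. The server address, export name, and mount points are hypothetical placeholders.

```yaml
# Illustrative blueprint fragment; verify paths and settings against the repository.
deployment_groups:
- group: primary
  modules:
  # Hypothetical network module that the Filestore instance attaches to.
  - id: network
    source: modules/network/vpc

  # New Filestore instance, mounted at /home on nodes that use it.
  - id: homefs
    source: modules/file-system/filestore
    use: [network]
    settings:
      local_mount: /home

  # Existing NFS server that the toolkit does not create or manage.
  - id: datafs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.0.0.2        # hypothetical address of the existing server
      remote_mount: /exports/data  # hypothetical export name
      local_mount: /data
      fs_type: nfs
```

In this sketch, a compute or scheduler module would mount either file system by adding `homefs` or `datafs` to its `use` list.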

Networking modules

Networking modules define the connectivity layer of your cluster. Use these modules to create isolated networks, configure firewalls, or integrate your cluster into an existing VPC.

The following modules let you configure networking:

  • Create a VPC network: This module creates a new VPC network with subnetworks and configures a Cloud Router, Cloud NAT, and firewall rules for connectivity.
  • Create multiple VPC networks: This module creates between 2 and 8 VPC networks with distinct IP ranges to support multi-networking configurations.
  • Create a GPU RDMA VPC network: This module creates a network optimized for GPU Remote Direct Memory Access (RDMA) traffic, which is critical for high-performance AI training.
  • Use a pre-existing VPC network: This module uses an existing VPC network and outputs its attributes, letting you use the existing network as a drop-in substitute for a new network in your blueprints.
  • Use a pre-existing subnetwork: This module uses an existing subnetwork within an existing VPC network and retrieves its configuration.
  • Create firewall rules: This module creates custom ingress and egress firewall rules for existing networks, letting you granularly control network traffic.
  • Configure private service access: This module configures private service access for a VPC network, enabling private connectivity to Google Cloud services like Cloud SQL or Managed Lustre.
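
The "drop-in substitute" behavior of the pre-existing VPC module can be sketched as follows: other modules reference the ID `network` and do not need to know whether the network is new or existing. The module path and setting names are recalled from the Cluster Toolkit GitHub repository, and the network and subnetwork names are hypothetical.

```yaml
# Illustrative blueprint fragment; verify paths and settings against the repository.
deployment_groups:
- group: primary
  modules:
  # To create a new network instead, this module could be swapped for:
  #   source: modules/network/vpc
  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: my-existing-network     # hypothetical name
      subnetwork_name: my-existing-subnet   # hypothetical name
```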

Observability modules

Observability modules help you monitor the health and performance of your cluster. For example, the Create a Cloud Monitoring dashboard module creates a Cloud Monitoring dashboard that displays metrics for your cluster deployment, providing a default HPC-focused view that you can customize with additional widgets.

Software and system setup modules

These modules let you customize the software environment of your VM instances and automate system configuration tasks.

  • Build custom images with Packer: This module uses Packer to build custom VM images by provisioning a temporary instance and applying configurations using shell scripts, Ansible playbooks, or startup scripts.
  • Use startup scripts: This module generates a startup script that executes a specified list of runners, such as shell scripts or Ansible playbooks, to customize your VM instances at boot time.
  • Apply Kubernetes manifests: This module lets you deploy Kubernetes manifests to your GKE clusters from local files, remote URLs, or direct strings.
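
As an example of the runner list that the startup script module executes, the fragment below defines a single shell runner. The module path and the runner fields (`type`, `destination`, `content`) are recalled from the Cluster Toolkit GitHub repository and should be verified. The script name and its contents are hypothetical.

```yaml
# Illustrative blueprint fragment; verify paths and settings against the repository.
deployment_groups:
- group: primary
  modules:
  - id: startup
    source: modules/scripts/startup-script
    settings:
      runners:
      # Runners execute in the order listed when the VM boots.
      - type: shell
        destination: install_tools.sh   # hypothetical script name
        content: |
          #!/bin/bash
          # Hypothetical customization step.
          echo "Configuring node at boot"
```

A VM-creating module would pick up this script by listing `startup` in its `use` field.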

Project and identity modules

These modules let you manage project-level resources and identity configurations.

  • Create a service account: This module automates the creation of service accounts and the assignment of IAM roles to ensure secure access for your cluster resources.

What's next