Cluster Toolkit overview

Cluster Toolkit is an open-source tool that simplifies the deployment of high performance computing (HPC) and AI/ML workloads on Google Cloud. It uses customizable blueprints to provision infrastructure that aligns with Google Cloud best practices.

Cluster Toolkit is highly customizable and extensible, so you can adapt it to the deployment needs of a broad range of use cases.

Features

Cluster Toolkit lets you do the following:

  • Efficiently create and deploy turnkey HPC, AI, and ML clusters that follow Google Cloud best practices.
  • Configure and extend an open-source solution.
  • Integrate seamlessly with various partners, such as Intel DAOS, DDN EXAscaler, and Slurm.
  • Monitor and gain performance visibility through integration with Cloud Monitoring.

Components

Cluster Toolkit consists of the following components:

  • Cluster blueprint: A YAML file that defines your cluster's architecture by specifying which modules to use and how to configure them.
  • Modules: Reusable, configurable building blocks that define specific resources like schedulers, storage, or compute nodes.
  • The gcluster tool: A command-line utility that compiles your blueprint and modules into a deployment folder.
  • Deployment folder: A generated directory containing the Terraform and Packer configurations needed to deploy your cluster to Google Cloud. You can deploy this folder directly or customize it further before deployment.
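
As a rough illustration of how a blueprint ties modules together, the following shell sketch writes a minimal blueprint to a file. The file name, `blueprint_name`, variable values, module IDs, and settings are placeholders chosen for this example. The module sources are assumed to match paths in the Cluster Toolkit repository, so check the example blueprints in the repository before relying on them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical output path for this sketch; override with BLUEPRINT_FILE.
BLUEPRINT_FILE="${BLUEPRINT_FILE:-my-cluster.yaml}"

# Write a minimal blueprint. The quoted 'EOF' keeps the YAML literal.
cat > "${BLUEPRINT_FILE}" <<'EOF'
# Minimal blueprint sketch. All values are placeholders.
blueprint_name: my-cluster

vars:
  project_id: my-project-id
  deployment_name: my-cluster-01
  region: us-central1
  zone: us-central1-a

deployment_groups:
  - group: primary
    modules:
      # A network module that other modules can reference by its id.
      - id: network
        source: modules/network/vpc

      # A compute module that uses the network defined above.
      - id: compute
        source: modules/compute/vm-instance
        use: [network]
        settings:
          instance_count: 2
EOF

echo "Wrote ${BLUEPRINT_FILE}"
```

In this sketch, each entry under `modules` names one building block, and `use` links a module to another module's `id`. The gcluster tool would then compile a file like this into a deployment folder.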

How Cluster Toolkit works


Figure 1. Cluster Toolkit architecture overview

To deploy clusters on Google Cloud by using Cluster Toolkit, do the following:

  1. Set up your environment by using Cloud Shell or a local Linux or macOS terminal. If you use a local terminal, then you must install a few dependencies.
  2. Clone the Cluster Toolkit repository. This repository contains the gcluster binary, modules, cluster blueprint examples, and other resources.
  3. Use an editor to create your cluster blueprint file. Example blueprints are available in the Cluster Toolkit repository.
  4. Run gcluster create to generate a deployment folder that contains the necessary Terraform and Packer configurations.
  5. Run the commands that the gcluster tool provides. Terraform or Packer then deploys the cluster on Google Cloud. For more information, see Deploy a cluster.
  6. After your cluster is deployed, you can submit jobs to your HPC cluster. You can also use Cloud Monitoring to analyze and monitor the Google Cloud resources that are used by your cluster.
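
The steps above can be sketched as a short shell script. This is an illustration under stated assumptions, not the official procedure: the repository URL, the `make` build step, and the `gcluster deploy` subcommand reflect the public Cluster Toolkit repository as we understand it, and the project ID, blueprint path, and deployment name are placeholders. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder values for this sketch.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
BLUEPRINT="${BLUEPRINT:-my-cluster.yaml}"   # path relative to the repository root
DEPLOYMENT="${DEPLOYMENT:-my-cluster-01}"   # should match deployment_name in the blueprint

# Step 2: clone the repository (URL is an assumption) and build gcluster.
run git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
run cd cluster-toolkit
run make

# Step 4: generate the deployment folder from the blueprint.
run ./gcluster create "${BLUEPRINT}" --vars "project_id=${PROJECT_ID}"

# Step 5: deploy the generated folder (assumes the deploy subcommand).
run ./gcluster deploy "${DEPLOYMENT}"
```

Printing the commands first makes it easy to review what would run before any resources are created in your project.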

Limitations

Cluster Toolkit supports updating specific configurations of an active cluster, such as resizing a Slurm partition or updating a GKE node pool. For more information, see the following guides:

For fundamental architectural changes, such as switching to a new VPC or changing the scheduler, you must redeploy the cluster:

  1. Delete the cluster.
  2. Update the cluster blueprint.
  3. Create the deployment folder.
  4. Deploy the cluster.
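
The redeployment sequence above might look like the following sketch. It assumes the `gcluster destroy`, `gcluster create`, and `gcluster deploy` subcommands, and it uses placeholder names throughout. Moving the old deployment folder aside before regenerating it is a choice made for this example, not a documented requirement. The script prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder values for this sketch.
PROJECT_ID="${PROJECT_ID:-my-project-id}"
BLUEPRINT="${BLUEPRINT:-my-cluster.yaml}"
DEPLOYMENT="${DEPLOYMENT:-my-cluster-01}"

# 1. Delete the cluster (assumes the destroy subcommand).
run ./gcluster destroy "${DEPLOYMENT}"

# 2. Update the cluster blueprint by hand before continuing,
#    for example to point at a new VPC or a different scheduler.

# 3. Create the deployment folder again. The old folder is moved
#    aside first so that a fresh one can be generated.
run mv "${DEPLOYMENT}" "${DEPLOYMENT}.old"
run ./gcluster create "${BLUEPRINT}" --vars "project_id=${PROJECT_ID}"

# 4. Deploy the cluster from the regenerated folder.
run ./gcluster deploy "${DEPLOYMENT}"
```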

What's next