Use the gpu-rdma-vpc module to create a VPC network that is
optimized for GPU Remote Direct Memory Access (RDMA) traffic.
This module is a simplified version of the standard vpc module that lets you
automatically create a variable number of subnetworks within a single
VPC network. Each generated subnetwork contains its own distinct
IP address range. By using this module, you can provision the complex network
topologies that you need for high-performance GPU clusters.
The module outputs the following unique parameters:
- The subnetwork_interfaces parameter, which is compatible with Slurm modules
  and vm-instance modules.
- The subnetwork_interfaces_gke parameter, which is compatible with
  Google Kubernetes Engine modules, as shown in the sketch after this list.
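For example, a GKE blueprint can pass the GKE-formatted interfaces to the module
that creates the GPU node pool. The following sketch is illustrative only: the
module IDs gpu_pool, gke_cluster, and rdma-net, and the use of the
additional_networks setting on the gke-node-pool module, are assumptions on this
page. Check a GKE example blueprint in the Cluster Toolkit repository for the
exact wiring that your version expects.
- id: gpu_pool
  source: modules/compute/gke-node-pool
  use: [gke_cluster]
  settings:
    machine_type: a3-ultragpu-8g
    # Illustrative assumption: pass the GKE-formatted RDMA interfaces from
    # the gpu-rdma-vpc module (ID rdma-net) to the node pool. Depending on
    # your blueprint, you might need to combine this output with other
    # interfaces by using concat, as in the Slurm example later on this page.
    additional_networks: $(rdma-net.subnetwork_interfaces_gke)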
For the complete list of inputs and outputs for this module, see the
gpu-rdma-vpc module page in the Cluster Toolkit GitHub repository.
Before you begin
Before you begin, verify that you meet the following requirements:
- You have installed and configured Cluster Toolkit. For installation instructions, see Set up Cluster Toolkit.
- You have an existing cluster blueprint. You can use and modify an existing
  blueprint or create one from scratch. For a working example of a blueprint
  configured for the gpu-rdma-vpc module, see the examples/a3-ultragpu-8g.yaml
  file. For more information about creating and customizing blueprints, see
  Cluster blueprint.
- To view a complete list of blueprints that support the gpu-rdma-vpc module,
  go to the Cluster blueprint catalog page, click the Select machine type
  menu, and then select a3-ultragpu-8g.
- The gpu-rdma-vpc module does not create a continuous long-running workload
  or a full cluster. It provisions a VPC network with multiple subnetworks
  optimized for GPU RDMA traffic.
Required roles
To get the permissions that
you need to create the VPC network and subnetworks,
ask your administrator to grant you the
Compute Network Admin (roles/compute.networkAdmin) IAM role on your project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Configure the subnetworks template
The main difference between this module and the standard vpc module is the
subnetworks_template variable. This variable serves as the template for all
subnetworks that the module creates within the network.
The template contains the following values:
- Count: the number of subnetworks to create. Specify this value by using the
  count field.
- Name prefix: the prefix for the subnetwork names. Specify this value by
  using the name_prefix field.
- IP range: the Classless Inter-Domain Routing (CIDR) formatted IP address
  range. Specify this value by using the ip_range field.
- Region: the Google Cloud region where the module deploys the subnetwork.
  Specify this value by using the region field.
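In a blueprint, these fields appear under the module's subnetworks_template
setting, as in the following skeleton. The uppercase values are placeholders;
the next example shows a complete configuration.
- id: rdma-net
  source: modules/network/gpu-rdma-vpc
  settings:
    network_name: NETWORK_NAME
    subnetworks_template:
      name_prefix: NAME_PREFIX
      count: COUNT
      ip_range: IP_RANGE
      region: REGION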
The following example demonstrates how to use the gpu-rdma-vpc module to
create a new VPC network named test-rdma-net. The module creates eight
subnetworks named test-mrdma-sub-0 through test-mrdma-sub-7.
The subnetworks split the IP address range evenly, starting from bit 16
(0-indexed); the resulting ranges for this example are described after the
blueprint excerpt. The Slurm nodeset then ingests the networks through its
additional_networks setting.
- id: rdma-net
  source: modules/network/gpu-rdma-vpc
  settings:
    network_name: test-rdma-net
    network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-roce
    network_routing_mode: REGIONAL
    subnetworks_template:
      name_prefix: test-mrdma-sub
      count: 8
      ip_range: 192.168.0.0/16
      region: $(vars.region)

- id: a3_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network0]
  settings:
    machine_type: a3-ultragpu-8g
    additional_networks:
      $(concat(
        [{
          network=null,
          subnetwork=network1.subnetwork_self_link,
          subnetwork_project=vars.project_id,
          nic_type="GVNIC",
          queue_count=null,
          network_ip="",
          stack_type=null,
          access_config=[],
          ipv6_access_config=[],
          alias_ip_range=[]
        }],
        rdma-net.subnetwork_interfaces
      ))
...
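In this example, splitting the 192.168.0.0/16 range evenly across eight
subnetworks uses three additional prefix bits, so each subnetwork should
receive a /19 range: test-mrdma-sub-0 gets 192.168.0.0/19, test-mrdma-sub-1
gets 192.168.32.0/19, and so on, up to test-mrdma-sub-7 with 192.168.224.0/19.
The concat expression in the a3_nodeset module prepends a single GVNIC
interface on the network1 subnetwork to the eight RDMA interfaces from
rdma-net.subnetwork_interfaces, so each nodeset VM is created with nine
network interfaces in total.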
What's next
- For the complete list of inputs and outputs for this module, see the
  gpu-rdma-vpc module page in the Cluster Toolkit GitHub repository.
- For a complete list of supported modules, see the compatibility matrix on
  GitHub.