Run NCCL on custom GKE clusters that use A4X Max

This document describes how to run NCCL/gIB tests on provisioned clusters that use GPUDirect RDMA. Depending on your use case, use one of the following options:

Test on two nodes

The following test runs an NCCL workload across two nodes. Note the following about this test:

  • By default, GKE schedules the two Pods onto separate node pools, if available. If the node pools were created in distinct NVLink domains, this test measures cross-domain RDMA throughput. To schedule both Pods in the same domain, modify the Pod affinity so that they land on the same node pool.
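
To keep both Pods in the same node pool (and therefore the same NVLink domain), the Pod affinity in nccl-test-a4x-max.yaml could be changed to a required podAffinity on the node pool label. This is a sketch, not the manifest's exact contents; the app: nccl-test label and the surrounding structure are assumptions based on the anti-affinity example later in this document:

```yaml
# Sketch: require both test Pods to land in the same node pool.
# The app: nccl-test label is assumed from the anti-affinity example
# shown later in this document.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nccl-test
        topologyKey: cloud.google.com/gke-nodepool
```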

Run the two-node test:

  1. Connect to your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION
    

    Replace the following variables:

    • CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on the DEPLOYMENT_NAME.
    • COMPUTE_REGION: the name of the compute region.
  2. Deploy an NCCL test workload that creates two test Pods running on two A4X Max nodes:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-rdma/nccl-test-a4x-max.yaml
    
  3. Check that both Pods are scheduled and running:

    kubectl get pods nccl-test-host-1 nccl-test-host-2
    

    If the two Pods show a Running status, you can proceed to the next step.

  4. Trigger an all-gather test for the A4X Max nodes:

    HOSTS="nccl-host-1 nccl-host-2"
    kubectl exec nccl-test-host-1 -it -- bash -c "/usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 4K -e 16G ${HOSTS}"
    

    The output is similar to the following:

    # nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
    # Collective test starting: all_gather_perf
    # nThread 1 nGpus 1 minBytes 4096 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
    #
    # Using devices
    #  Rank  0 Group  0 Pid    299 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  0 [0008:06:00] NVIDIA GB300
    #  Rank  1 Group  0 Pid    300 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  1 [0009:06:00] NVIDIA GB300
    #  Rank  2 Group  0 Pid    301 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  2 [0018:06:00] NVIDIA GB300
    #  Rank  3 Group  0 Pid    302 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  3 [0019:06:00] NVIDIA GB300
    #  Rank  4 Group  0 Pid    237 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  0 [0008:06:00] NVIDIA GB300
    #  Rank  5 Group  0 Pid    238 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  1 [0009:06:00] NVIDIA GB300
    #  Rank  6 Group  0 Pid    239 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  2 [0018:06:00] NVIDIA GB300
    #  Rank  7 Group  0 Pid    240 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  3 [0019:06:00] NVIDIA GB300
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
            4096           128     float    none      -1    28.51    0.14    0.13       0    27.71    0.15    0.13       0
            8192           256     float    none      -1    28.10    0.29    0.26       0    28.40    0.29    0.25       0
           16384           512     float    none      -1    28.55    0.57    0.50       0    28.19    0.58    0.51       0
           32768          1024     float    none      -1    30.56    1.07    0.94       0    29.65    1.11    0.97       0
           65536          2048     float    none      -1    33.30    1.97    1.72       0    33.14    1.98    1.73       0
          131072          4096     float    none      -1    36.18    3.62    3.17       0    36.14    3.63    3.17       0
          262144          8192     float    none      -1    38.50    6.81    5.96       0    94.91    2.76    2.42       0
          524288         16384     float    none      -1   152.25    3.44    3.01       0    54.79    9.57    8.37       0
         1048576         32768     float    none      -1    63.82   16.43   14.38       0    64.06   16.37   14.32       0
         2097152         65536     float    none      -1    65.10   32.21   28.19       0    66.13   31.71   27.75       0
         4194304        131072     float    none      -1    67.73   61.92   54.18       0    67.16   62.45   54.65       0
         8388608        262144     float    none      -1    79.65  105.31   92.15       0    80.02  104.83   91.73       0
        16777216        524288     float    none      -1   189.74   88.42   77.37       0   187.57   89.44   78.26       0
        33554432       1048576     float    none      -1   252.85  132.70  116.11       0   202.31  165.86  145.13       0
        67108864       2097152     float    none      -1   250.55  267.85  234.37       0   276.11  243.06  212.67       0
       134217728       4194304     float    none      -1   394.38  340.33  297.79       0   487.60  275.26  240.85       0
       268435456       8388608     float    none      -1   717.97  373.88  327.15       0   799.98  335.55  293.61       0
       536870912      16777216     float    none      -1  1421.29  377.73  330.52       0  1392.81  385.46  337.28       0
      1073741824      33554432     float    none      -1  2783.37  385.77  337.55       0  2596.97  413.46  361.78       0
      2147483648      67108864     float    none      -1  5396.10  397.97  348.22       0  5059.01  424.49  371.43       0
      4294967296     134217728     float    none      -1  10579.7  405.96  355.22       0  9918.44  433.03  378.90       0
      8589934592     268435456     float    none      -1  21012.9  408.79  357.69       0  20043.4  428.57  375.00       0
     17179869184     536870912     float    none      -1  42091.7  408.15  357.13       0  40243.2  426.90  373.54       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 146.047
    #
    # Collective test concluded: all_gather_perf
    

    If the Pods are scheduled on nodes in distinct NVLink domains, the test measures cross-domain RDMA throughput, as in the preceding output. To spread the Pods across node pools created in distinct NVLink domains, modify the Pod spec affinity in nccl-test-a4x-max.yaml as follows:

    spec:
      ...
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: nccl-test
              topologyKey: cloud.google.com/gke-nodepool
    

Test using JobSet

  1. Install JobSet:

    VERSION=v0.10.1
    kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
    
  2. Make sure that your non-GPU node pools have enough resources to schedule the JobSet controllers. If they don't, adjust the resource requests in the JobSet manifest as needed.

    For more information about JobSet installation, see Installation.

  3. Run the following commands, setting NUM_NODES to the number of nodes that you want to run the NCCL test on:

    wget https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4x-max-jobset.yaml
    
    NUM_NODES=2
    sed "s|__NUM_NODES__|${NUM_NODES}|" nccl-test-a4x-max-jobset.yaml | kubectl apply -f -
    
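
    The placeholder substitution can be checked locally before applying the manifest. The sample line below is illustrative, not taken from the real YAML:

```shell
# Quick local check of the __NUM_NODES__ placeholder substitution.
# The sample manifest line is illustrative, not from the real YAML.
NUM_NODES=2
line='replicas: __NUM_NODES__'
result=$(printf '%s\n' "$line" | sed "s|__NUM_NODES__|${NUM_NODES}|")
echo "$result"
# prints: replicas: 2
```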
  4. Check that all Pods are in the Completed state:

    kubectl get pods | grep allgather-worker
    

    The output is similar to the following:

    allgather-worker-0-0-g45d2   0/1     Completed   0          13m
    allgather-worker-0-1-prpvw   0/1     Completed   0          13m
    allgather-worker-0-2-qbwt5   0/1     Completed   0          13m
    
  5. View the test result from the worker Pod where the test is launched (allgather-worker-0-0):

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep allgather-worker-0-0)

    The output is similar to the following:

    # nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
    # Collective test starting: all_gather_perf
    # nThread 1 nGpus 1 minBytes 1024 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
    #
    ...
    #                                                            out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
            1024            32     float    none      -1    45.49    0.02    0.02       0    45.29    0.02    0.02       0
            2048            64     float    none      -1    45.52    0.04    0.04       0    45.37    0.05    0.04       0
            4096           128     float    none      -1    46.02    0.09    0.08       0    45.83    0.09    0.08       0
            8192           256     float    none      -1    63.93    0.13    0.11       0    46.98    0.17    0.15       0
           16384           512     float    none      -1    46.51    0.35    0.31       0    47.11    0.35    0.30       0
           32768          1024     float    none      -1    66.32    0.49    0.43       0    50.73    0.65    0.57       0
           65536          2048     float    none      -1    49.89    1.31    1.15       0    50.04    1.31    1.15       0
          131072          4096     float    none      -1    54.68    2.40    2.10       0    52.38    2.50    2.19       0
          262144          8192     float    none      -1    54.66    4.80    4.20       0    54.06    4.85    4.24       0
          524288         16384     float    none      -1    66.28    7.91    6.92       0    65.75    7.97    6.98       0
         1048576         32768     float    none      -1    85.63   12.25   10.72       0    86.44   12.13   10.61       0
         2097152         65536     float    none      -1    68.33   30.69   26.86       0    72.32   29.00   25.37       0
         4194304        131072     float    none      -1    71.85   58.37   51.08       0    71.58   58.60   51.28       0
         8388608        262144     float    none      -1    83.80  100.10   87.59       0    85.73   97.85   85.62       0
        16777216        524288     float    none      -1   195.94   85.62   74.92       0   195.86   85.66   74.95       0
        33554432       1048576     float    none      -1   240.84  139.32  121.91       0   210.82  159.16  139.27       0
        67108864       2097152     float    none      -1   254.95  263.22  230.32       0   250.93  267.44  234.01       0
       134217728       4194304     float    none      -1   411.09  326.49  285.68       0   386.11  347.61  304.16       0
       268435456       8388608     float    none      -1   741.69  361.92  316.68       0   722.42  371.58  325.13       0
       536870912      16777216     float    none      -1  1358.44  395.21  345.81       0  1343.63  399.57  349.62       0
      1073741824      33554432     float    none      -1  2679.62  400.71  350.62       0  2585.68  415.26  363.36       0
      2147483648      67108864     float    none      -1  5281.54  406.60  355.78       0  5074.73  423.17  370.28       0
      4294967296     134217728     float    none      -1  10476.2  409.97  358.73       0  10027.5  428.32  374.78       0
      8589934592     268435456     float    none      -1  20853.9  411.91  360.42       0  20194.7  425.36  372.19       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 126.85
    #
    # Collective test concluded: all_gather_perf
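
For either test, the summary figure is the `# Avg bus bandwidth` line. One way to pull that value out of a saved log (for example, one captured by piping the `kubectl exec` or `kubectl logs` command through `tee`); the inline sample below stands in for a real log file:

```shell
# Parse the "Avg bus bandwidth" value (GB/s) from NCCL test output.
# The inline sample stands in for a real log captured with `tee`.
nccl_output='# Out of bounds values : 0 OK
# Avg bus bandwidth    : 126.85
#
# Collective test concluded: all_gather_perf'

avg=$(printf '%s\n' "$nccl_output" | awk -F: '/Avg bus bandwidth/ {gsub(/ /, "", $2); print $2}')
echo "avg_busbw_gbps=${avg}"
# prints: avg_busbw_gbps=126.85
```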