Run NCCL on GKE clusters that use the default configuration

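This page describes how to run NCCL tests on GKE clusters that use the default configuration. A cluster uses the default configuration if it was created with Cluster Toolkit. If you created the cluster with gcloud commands, you are using a custom configuration, and the instructions on this page might not apply. In that case, see the documentation for custom configurations instead.

Select the steps for your machine type: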

A4X

Connect to the cluster:
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```
Replace the following variables:
- CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.
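
After the command runs, one way to confirm that kubectl points at the intended cluster is to compare the current context name with the values you substituted. The sketch below assumes the usual GKE context naming, gke_PROJECT_LOCATION_CLUSTER. The helper name context_matches and the sample values are hypothetical.

```shell
#!/bin/sh
# Sketch: check that a kubeconfig context name refers to the expected cluster.
# Assumes the context follows the pattern gke_PROJECT_LOCATION_CLUSTER.
# context_matches is a hypothetical helper, not part of gcloud or kubectl.
context_matches() {
  context="$1"; cluster="$2"; region="$3"
  case "$context" in
    gke_*_"${region}"_"${cluster}") return 0 ;;
    *) return 1 ;;
  esac
}

# Example with made-up values.
if context_matches "gke_my-project_us-central1_my-deployment" "my-deployment" "us-central1"; then
  echo "context matches"
else
  echo "context does not match"
fi
```

In practice, you might pass "$(kubectl config current-context)" as the first argument, with your own CLUSTER_NAME and COMPUTE_REGION as the other two.
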
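Use the gke-a4x/nccl-jobset-example.yaml file to deploy an all-gather NCCL performance test with Topology Aware Scheduling (TAS) enabled:
1. By default, the test uses a specific number of nodes. To change the number of nodes, edit the YAML file and set the following values to the number of nodes that you want:
  - numNodes
  - parallelism
  - completions
  - N_NODES
2. Create the resources to run the test: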
```
kubectl create -f ~/cluster-toolkit/examples/gke-a4x/nccl-jobset-example.yaml
```
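
Before you run the create command above, it can help to list every node-count field from step 1 so that you can confirm that they agree. The following is a minimal sketch. The helper name show_node_fields and the sample file contents are hypothetical, and the sketch assumes that N_NODES is an environment variable whose value is on the line after its name.

```shell
#!/bin/sh
# Sketch: list the node-count fields from step 1 with their line numbers.
# A line that mentions N_NODES is printed together with the next line, on the
# assumption that the value sits on the following line.
show_node_fields() {
  awk '
    after > 0 { printf "%d: %s\n", NR, $0; after--; next }
    /numNodes|parallelism|completions/ { printf "%d: %s\n", NR, $0 }
    /N_NODES/ { printf "%d: %s\n", NR, $0; after = 1 }
  ' "$1"
}

# Demo on a made-up fragment; the real manifest layout may differ.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
parallelism: 2
completions: 2
- name: N_NODES
  value: "2"
EOF
show_node_fields "$sample"
rm -f "$sample"
```

To inspect the real file, you might run show_node_fields ~/cluster-toolkit/examples/gke-a4x/nccl-jobset-example.yaml and check that every listed value matches the number of nodes that you want.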

Confirm that all nccl-test Pods have reached the Completed state:

kubectl get pods

The output should be similar to the following:

nccl-all-worker-0-0-ft8jm   0/1     Completed   0          13m
nccl-all-worker-0-1-prpvw   0/1     Completed   0          13m
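
If you want to script this check instead of reading the list by eye, a small filter over the kubectl get pods output is enough. The following is a sketch only. The helper name all_completed is hypothetical, and it assumes the default column layout in which STATUS is the third column.

```shell
#!/bin/sh
# Sketch: succeed only if every Pod whose name starts with the given prefix
# reports STATUS "Completed". Reads `kubectl get pods --no-headers` text on stdin.
# all_completed is a hypothetical helper name.
all_completed() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 { seen++; if ($3 != "Completed") pending++ }
    END { if (seen > 0 && pending == 0) exit 0; exit 1 }
  '
}

# Demo using the sample output shown above.
all_completed "nccl-all-worker-" <<'EOF' && echo "all Pods completed"
nccl-all-worker-0-0-ft8jm   0/1     Completed   0          13m
nccl-all-worker-0-1-prpvw   0/1     Completed   0          13m
EOF
```

Against a live cluster, the equivalent call would be kubectl get pods --no-headers | all_completed "nccl-all-worker-".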

Find the Pod name that matches the pattern nccl-all-worker-0-0-*. The logs of this Pod contain the results of the NCCL test.

To fetch the logs for this Pod, run the following command:

kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-all-worker-0-0)

The output should be similar to the following:

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024            32     float    none      -1    19.60    0.05    0.05      0    19.00    0.05    0.05      0
        2048            64     float    none      -1    19.63    0.10    0.09      0    19.47    0.11    0.09      0
        4096           128     float    none      -1    19.88    0.21    0.18      0    19.61    0.21    0.18      0
        8192           256     float    none      -1    20.31    0.40    0.35      0    19.82    0.41    0.36      0
       16384           512     float    none      -1    20.30    0.81    0.71      0    20.17    0.81    0.71      0
       32768          1024     float    none      -1    20.70    1.58    1.39      0    20.36    1.61    1.41      0
       65536          2048     float    none      -1    20.94    3.13    2.74      0    20.88    3.14    2.75      0
      131072          4096     float    none      -1    21.12    6.20    5.43      0    20.96    6.25    5.47      0
      262144          8192     float    none      -1    21.24   12.34   10.80      0    21.01   12.48   10.92      0
      524288         16384     float    none      -1    21.28   24.63   21.55      0    21.07   24.88   21.77      0
     1048576         32768     float    none      -1    21.95   47.77   41.80      0    21.72   48.28   42.24      0
     2097152         65536     float    none      -1    24.15   86.85   76.00      0    23.75   88.30   77.26      0
     4194304        131072     float    none      -1    31.50  133.13  116.49      0    30.75  136.39  119.34      0
     8388608        262144     float    none      -1    47.42  176.88  154.77      0    46.47  180.51  157.95      0
    16777216        524288     float    none      -1    48.72  344.39  301.34      0    47.85  350.63  306.80      0
    33554432       1048576     float    none      -1    75.08  446.91  391.05      0    73.89  454.10  397.34      0
    67108864       2097152     float    none      -1    178.7  375.47  328.53      0    179.1  374.67  327.84      0
   134217728       4194304     float    none      -1    211.1  635.86  556.37      0    211.3  635.21  555.81      0
   268435456       8388608     float    none      -1    413.2  649.68  568.47      0    414.9  646.95  566.08      0
   536870912      16777216     float    none      -1    820.1  654.64  572.81      0    814.9  658.81  576.46      0
  1073741824      33554432     float    none      -1   1566.5  685.43  599.76      0   1567.9  684.82  599.22      0
  2147483648      67108864     float    none      -1   3025.3  709.83  621.10      0   3017.2  711.74  622.77      0
  4294967296     134217728     float    none      -1   5898.8  728.11  637.10      0   5784.0  742.57  649.74      0
  8589934592     268435456     float    none      -1    11541  744.31  651.28      0    11293  760.67  665.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 236.839
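
To pull the headline numbers out of a long log, you can filter the output with awk. The following sketch prints the reported average bus bandwidth and the peak out-of-place busbw with its message size. The helper name summarize_nccl is hypothetical, and the field positions assume the 13-column table layout shown above.

```shell
#!/bin/sh
# Sketch: summarize nccl-tests output read from stdin.
# Data rows are the lines whose first field is a number; field 8 is the
# out-of-place busbw column in the table layout shown above.
summarize_nccl() {
  awk '
    /^# Avg bus bandwidth/ { avg = $NF }
    $1 ~ /^[0-9]+$/ && NF >= 13 { if ($8 + 0 > peak) { peak = $8 + 0; size = $1 } }
    END { printf "avg_busbw=%s peak_busbw=%.2f at_size=%s\n", avg, peak, size }
  '
}

# Demo on the last lines of the sample output.
summarize_nccl <<'EOF'
  4294967296     134217728     float    none      -1   5898.8  728.11  637.10      0   5784.0  742.57  649.74      0
  8589934592     268435456     float    none      -1    11541  744.31  651.28      0    11293  760.67  665.58      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 236.839
EOF
```

For a real run, you could pipe the Pod logs into the function, for example kubectl logs POD_NAME | summarize_nccl, where POD_NAME is the Pod that you identified earlier.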

A4

Connect to the cluster:
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```
Replace the following variables:
- CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.
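Use the gke-a4/nccl-jobset-example.yaml file to deploy a TAS-enabled all-gather NCCL performance test:
1. Modify the YAML file if either of the following applies:
  - By default, the test uses a specific number of nodes. To change the number of nodes, set the following values to the number of nodes that you want:
    - parallelism
    - completions
    - N_NODES
  - To test nodes provisioned by flex-start, do the following in the metadata field:
    - Replace the kueue.x-k8s.io/queue-name value with dws-local-queue.
    - Add the following annotation:
      annotations:
        provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
2. Create the resources to run the test: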
```
kubectl create -f ~/cluster-toolkit/examples/gke-a4/nccl-jobset-example.yaml
```
  This command returns the JobSet name.

  The output should be similar to the following:
```
jobset.jobset.x-k8s.io/all-gather8t7dt created
```
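
The generated JobSet name is needed later to find the Pod that holds the results, so it can be convenient to capture it. The following is a minimal sketch that extracts the name from the output line shown above. The helper name jobset_name is hypothetical.

```shell
#!/bin/sh
# Sketch: pull the JobSet name out of the `kubectl create` output line.
# jobset_name is a hypothetical helper; it expects "<resource>/<name> created".
jobset_name() {
  sed -n 's|^[^/]*/\(.*\) created$|\1|p'
}

name="$(echo 'jobset.jobset.x-k8s.io/all-gather8t7dt created' | jobset_name)"
echo "JobSet: ${name}"
echo "Pod name prefix with the results: ${name}-w-0-0-"
```

With the sample line, this prints the JobSet name all-gather8t7dt and the prefix all-gather8t7dt-w-0-0-. In a real session you would pipe the output of the create command into jobset_name instead of the echoed sample.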

To view the results of the NCCL test, first run the following command to list the Pods:

kubectl get pods

The output should be similar to the following:

NAME                          READY   STATUS      RESTARTS   AGE
all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s

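Find the Pod name that matches the pattern jobset-name-w-0-0-*, where jobset-name is the JobSet name returned by the create command. The logs of this Pod contain the results of the NCCL test.

To fetch the logs for this Pod, run the following command, replacing the example Pod name with the name of your Pod: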

kubectl logs all-gather8t7dt-w-0-0-n9s6j
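
Because the Pod name ends in a random suffix, you might prefer to look it up from the JobSet name rather than copy it by hand. The sketch below selects the first Pod that matches JOBSET-w-0-0-* from the kubectl get pods output. The helper name results_pod is hypothetical.

```shell
#!/bin/sh
# Sketch: print the name of the first Pod matching JOBSET-w-0-0-*.
# Reads `kubectl get pods --no-headers` text on stdin.
# results_pod is a hypothetical helper name.
results_pod() {
  awk -v prefix="$1-w-0-0-" 'index($1, prefix) == 1 { print $1; exit }'
}

# Demo using the sample Pod list shown above.
results_pod "all-gather8t7dt" <<'EOF'
all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s
EOF
```

Against a live cluster, this could be combined with the logs command as kubectl logs "$(kubectl get pods --no-headers | results_pod JOBSET_NAME)", where JOBSET_NAME is the name returned when you created the resources.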

The output should be similar to the following:

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
        2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
        4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
        8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
       16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
       32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
       65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
      131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
      262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
      524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
     1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
     2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
     4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
     8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
    16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
    33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
    67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
   134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
   268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
   536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
  1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
  2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
  4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
  8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 120.248
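
When you read this table, it helps to know how the two bandwidth columns relate. In nccl-tests, algbw is the message size divided by the elapsed time, and for all-gather busbw = algbw × (n − 1) / n, where n is the number of ranks. The sketch below recomputes both values for the last row of the output above. It assumes n = 16 (for example, two nodes with eight GPUs each); that rank count is inferred from the ratio of the two columns and is not stated on this page. The helper name allgather_bw is hypothetical.

```shell
#!/bin/sh
# Sketch: recompute algbw and busbw for one all-gather row.
# Arguments: size in bytes, time in microseconds, number of ranks (assumed).
allgather_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000        # bytes per microsecond -> GB/s
    busbw = algbw * (n - 1) / n     # all-gather correction factor
    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
  }'
}

# Last out-of-place row of the sample output: 8589934592 bytes in 24489 us.
allgather_bw 8589934592 24489 16
```

The printed values agree with the 350.77 and 328.85 in the table to within about 0.01 GB/s; the small difference comes from the time column being rounded in the log.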

A3 Ultra

Connect to the cluster:
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```
Replace the following variables:
- CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.
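Use the gke-a3-ultragpu/nccl-jobset-example.yaml file to deploy a TAS-enabled all-gather NCCL performance test:
1. Modify the YAML file if either of the following applies:
  - By default, the test uses a specific number of nodes. To change the number of nodes, set the following values to the number of nodes that you want:
    - parallelism
    - completions
    - N_NODES
  - To test nodes provisioned by flex-start, do the following in the metadata field:
    - Replace the kueue.x-k8s.io/queue-name value with dws-local-queue.
    - Add the following annotation:
      annotations:
        provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
2. Create the resources to run the test: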
```
kubectl create -f ~/cluster-toolkit/examples/gke-a3-ultragpu/nccl-jobset-example.yaml
```
  This command returns the JobSet name.

  The output should be similar to the following:
```
jobset.jobset.x-k8s.io/all-gather8t7dt created
```

To view the results of the NCCL test, first run the following command to list the Pods:

kubectl get pods

The output should be similar to the following:

NAME                          READY   STATUS      RESTARTS   AGE
all-gather8t7dt-w-0-0-n9s6j   0/1     Completed   0          9m34s
all-gather8t7dt-w-0-1-rsf7r   0/1     Completed   0          9m34s

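Find the Pod name that matches the pattern jobset-name-w-0-0-*, where jobset-name is the JobSet name returned by the create command. The logs of this Pod contain the results of the NCCL test.

To fetch the logs for this Pod, run the following command, replacing the example Pod name with the name of your Pod: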

kubectl logs all-gather8t7dt-w-0-0-n9s6j

The output should be similar to the following:

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
        2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
        4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
        8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
       16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
       32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
       65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
      131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
      262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
      524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
     1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
     2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
     4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
     8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
    16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
    33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
    67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
   134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
   268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
   536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
  1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
  2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
  4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
  8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 120.248
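
If you run the test repeatedly, for example after a node pool change, you might want an automated pass/fail signal based on the final summary line. The sketch below compares the reported average bus bandwidth with a threshold that you choose. The helper name check_avg_busbw is hypothetical, and the threshold in the demo is an arbitrary illustration, not a baseline defined by this page.

```shell
#!/bin/sh
# Sketch: fail when the reported average bus bandwidth is below a threshold.
# The threshold is whatever baseline you choose; none is defined on this page.
check_avg_busbw() {
  awk -v min="$1" '
    /^# Avg bus bandwidth/ { avg = $NF; found = 1 }
    END {
      if (!found) { print "no Avg bus bandwidth line found"; exit 2 }
      if (avg + 0 < min + 0) { printf "FAIL: %s < %s GB/s\n", avg, min; exit 1 }
      printf "OK: %s >= %s GB/s\n", avg, min
    }
  '
}

# Demo on the summary line from the sample output, with an arbitrary threshold.
echo '# Avg bus bandwidth    : 120.248' | check_avg_busbw 100
```

The function reads the log on stdin, so a real check could look like kubectl logs POD_NAME | check_avg_busbw THRESHOLD. It exits with status 1 when the value is below the threshold and with status 2 when the summary line is missing.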

A3 High

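This workload includes a sidecar container named tcpx-daemon, which runs a service that lets the Pod use GPUDirect-TCPX. If any Pods in your own environment need to use GPUDirect-TCPX, you must add this sidecar container to those Pods. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.

Review the nccl-config.yaml ConfigMap manifest on GitHub. This manifest deploys scripts that initialize an NCCL all-gather test and sets NCCL-specific configuration settings.
Review the nccl-test-latest.yaml Deployment manifest on GitHub.

Deploy the ConfigMap and the test workload: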

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-config.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-test-latest.yaml
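
Once the Pods are created, you can check that the tcpx-daemon sidecar described above is part of the Pod spec. The following sketch tests whether a space-separated list of container names includes a given name. The helper name has_container and the sample list are hypothetical, and the usage note assumes that the sidecar is listed under spec.containers.

```shell
#!/bin/sh
# Sketch: check that a space-separated list of container names (stdin)
# includes the named container. has_container is a hypothetical helper.
has_container() {
  tr ' ' '\n' | grep -qxF -- "$1"
}

# Demo with a made-up list of container names.
if echo "tcpx-daemon nccl-test" | has_container "tcpx-daemon"; then
  echo "sidecar present"
fi
```

On a live cluster, the list of names could come from kubectl get pod nccl-test-host-1 -o jsonpath='{.spec.containers[*].name}', piped into has_container tcpx-daemon.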

Run the following command to trigger an NCCL all-gather test for the nodes:

kubectl exec \
   --stdin --tty --container=nccl-test nccl-test-host-1 \
   -- /configs/allgather.sh nccl-host-1 nccl-host-2

The output is similar to the following:

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
     2097152         32768     float    none      -1    776.4    2.70    2.53      0    726.7    2.89    2.71      0
     4194304         65536     float    none      -1    774.3    5.42    5.08      0    805.1    5.21    4.88      0
     8388608        131072     float    none      -1    812.1   10.33    9.68      0    817.6   10.26    9.62      0
    16777216        262144     float    none      -1   1035.2   16.21   15.19      0   1067.8   15.71   14.73      0
    33554432        524288     float    none      -1   1183.3   28.36   26.59      0   1211.8   27.69   25.96      0
    67108864       1048576     float    none      -1   1593.4   42.12   39.49      0   1510.5   44.43   41.65      0
   134217728       2097152     float    none      -1   2127.8   63.08   59.13      0   2312.7   58.03   54.41      0
   268435456       4194304     float    none      -1   3603.0   74.50   69.85      0   3586.2   74.85   70.17      0
   536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293

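After GPUDirect-TCPX is installed on the nodes, you can use it to optimize the throughput of GPU-intensive workloads that run on those nodes. The Add GPUDirect to your manifests section describes the fields that are required to use GPUDirect-TCPX in your own Pods.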

A3 Mega

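This workload includes a sidecar container named tcpxo-daemon, which runs a service that lets the Pod use GPUDirect-TCPXO. If any Pods in your own environment need to use GPUDirect-TCPXO, you must add this sidecar container to those Pods. For a snippet of the required fields to add to your manifests, see Add GPUDirect to your manifests.

Review the nccl-test-latest.yaml manifest on GitHub.

Deploy two Pods that contain the test workload: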

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/nccl-test-latest.yaml

After the Pods are deployed, trigger an all-gather test:

kubectl exec --stdin --tty --container=nccl-test nccl-test-host-1 -- /scripts/allgather.sh nccl-host-1 nccl-host-2

The output is similar to the following:

#                                                              out-of-place                       in-place
#        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            0             0     float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
            0             0     float    none      -1     0.19    0.00    0.00      0     0.17    0.00    0.00      0
            0             0     float    none      -1     0.17    0.00    0.00      0     0.17    0.00    0.00      0
            0             0     float    none      -1     0.17    0.00    0.00      0     0.17    0.00    0.00      0
            0             0     float    none      -1     0.17    0.00    0.00      0     0.17    0.00    0.00      0
          256             4     float    none      -1    235.2    0.00    0.00      0    235.1    0.00    0.00      0
          512             8     float    none      -1    241.0    0.00    0.00      0    236.1    0.00    0.00      0
         1024            16     float    none      -1    236.3    0.00    0.00      0    233.3    0.00    0.00      0
         2048            32     float    none      -1    234.1    0.01    0.01      0    233.4    0.01    0.01      0
         4096            64     float    none      -1    237.1    0.02    0.02      0    235.3    0.02    0.02      0
         8192           128     float    none      -1    236.2    0.03    0.03      0    235.2    0.03    0.03      0
        16384           256     float    none      -1    236.6    0.07    0.06      0    238.5    0.07    0.06      0
        32768           512     float    none      -1    237.9    0.14    0.13      0    238.8    0.14    0.13      0
        65536          1024     float    none      -1    242.3    0.27    0.25      0    239.4    0.27    0.26      0
       131072          2048     float    none      -1    263.0    0.50    0.47      0    275.1    0.48    0.45      0
       262144          4096     float    none      -1    279.2    0.94    0.88      0    269.9    0.97    0.91      0
       524288          8192     float    none      -1    273.5    1.92    1.80      0    273.5    1.92    1.80      0
      1048576         16384     float    none      -1    315.1    3.33    3.12      0    314.1    3.34    3.13      0
      2097152         32768     float    none      -1    319.2    6.57    6.16      0    311.5    6.73    6.31      0
      4194304         65536     float    none      -1    331.8   12.64   11.85      0    331.3   12.66   11.87      0
      8388608        131072     float    none      -1    356.3   23.54   22.07      0    353.8   23.71   22.23      0
     16777216        262144     float    none      -1    409.1   41.01   38.45      0    405.2   41.40   38.81      0
     33554432        524288     float    none      -1    451.4   74.34   69.69      0    447.7   74.94   70.26      0
     67108864       1048576     float    none      -1    713.4   94.07   88.19      0    713.8   94.01   88.13      0
    134217728       2097152     float    none      -1   1122.1  119.62  112.14      0   1116.3  120.23  112.72      0
    268435456       4194304     float    none      -1   1785.8  150.32  140.92      0   1769.2  151.72  142.24      0
    536870912       8388608     float    none      -1   2859.7  187.74  176.00      0   2852.6  188.20  176.44      0
   1073741824      16777216     float    none      -1   5494.1  195.44  183.22      0   5568.2  192.83  180.78      0
   2147483648      33554432     float    none      -1    10841  198.09  185.71      0    10798  198.88  186.45      0
   4294967296      67108864     float    none      -1    21453  200.21  187.70      0    21490  199.86  187.37      0
   8589934592     134217728     float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.7587
#
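
To plot or compare results across runs, it can be easier to work with a compact CSV than with the raw log. The sketch below keeps the message size and the two busbw columns and skips the zero-size warm-up rows that appear at the top of this output. The helper name nccl_to_csv and the CSV column names are hypothetical, and the field positions assume the 13-column layout shown above.

```shell
#!/bin/sh
# Sketch: convert nccl-tests data rows (stdin) to CSV, skipping zero-size rows.
# Columns kept: size in bytes, out-of-place busbw, in-place busbw.
nccl_to_csv() {
  awk '
    BEGIN { print "size_bytes,busbw_out_of_place,busbw_in_place" }
    $1 ~ /^[0-9]+$/ && NF >= 13 && $1 > 0 { print $1 "," $8 "," $12 }
  '
}

# Demo on two rows of the sample output: one zero-size row and the last row.
nccl_to_csv <<'EOF'
            0             0     float    none      -1     0.17    0.00    0.00      0     0.17    0.00    0.00      0
   8589934592     134217728     float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
EOF
```

The function reads from stdin, so you can save the output of the kubectl exec command to a file and redirect that file into nccl_to_csv.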

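After GPUDirect-TCPXO is installed on the nodes, you can use it to optimize the throughput of GPU-intensive workloads that run on those nodes. The Add GPUDirect to your manifests section describes the fields that are required to use GPUDirect-TCPXO in your own Pods.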

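What's next

Collect and understand NCCL logs for troubleshooting, so that you can interpret the test output and troubleshoot issues.
Learn how to troubleshoot slow performance.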
