Run NCCL on Slurm clusters

This page describes how to run NCCL tests on a Slurm cluster. To use a managed Slurm environment that includes built-in NCCL tests for verifying cluster health, see Cluster Director.

Choose the steps for your machine type:

A4X Max, A4X, and A4

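The following test uses Ramble, an open-source, multi-platform experimentation framework written in Python, to coordinate the NCCL test runs. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X Max and A4X machines.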

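The run scripts for this test are staged in /opt/apps/system_benchmarks on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble in the /opt/apps/ramble directory.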

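From the ${HOME} directory on the login node, run the following command. The test takes about 10 minutes, or longer if other jobs are in the queue, so the command uses nohup and redirects stdout and stderr to a log file.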
```
nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
```
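This command creates a folder named nccl-tests_$(date +%s) that stores all of the test results. The suffix is the current Unix timestamp in seconds, which gives each run its own folder.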
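Because the folder name ends in a Unix timestamp, the most recent run is the folder with the largest numeric suffix. The following is a minimal sketch of a helper that finds it. The function name latest_nccl_results_dir is hypothetical and is not part of the staged benchmark scripts.

```shell
# Print the nccl-tests_<timestamp> folder with the largest timestamp.
# Hypothetical helper for illustration; not shipped with the benchmarks.
latest_nccl_results_dir() {
  base_dir="${1:-.}"
  latest=""
  latest_ts=-1
  for dir in "$base_dir"/nccl-tests_*/; do
    [ -d "$dir" ] || continue            # glob matched nothing
    name="$(basename "$dir")"
    ts="${name#nccl-tests_}"
    case "$ts" in
      ''|*[!0-9]*) continue ;;           # skip non-numeric suffixes
    esac
    if [ "$ts" -gt "$latest_ts" ]; then
      latest_ts="$ts"
      latest="${dir%/}"
    fi
  done
  # Returns non-zero when no matching folder exists.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For example, `cd "$(latest_nccl_results_dir "$HOME")"` changes to the newest results folder under the home directory, if one exists.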

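For example, if your cluster has 16 nodes, the NCCL tests run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
View the results. The nccl.log file contains the logs from setting up and running the tests. To view these logs, run the following command: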
```
tail -f nccl.log
```
You can press Ctrl+C at any time to stop following the output. At the end of nccl.log, your output should resemble the following:
```
...
---- SUMMARY for >1GB Message Sizes ----
workload        n_nodes msg_size        busbw
all-gather      2       1073741824      ###.##
all-gather      2       2147483648      ###.##
all-gather      2       4294967296      ###.##
all-gather      2       8589934592      ###.##
...
all-reduce      2       1073741824      ###.##
...
reduce-scatter  2       1073741824      ###.##
...
-------- Benchmarking Complete -------
```
All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in the nccl-tests_$(date +%s)/summary.tsv file.
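To compare runs quickly, it can help to reduce the summary to one line per workload and node count. The sketch below assumes that summary.tsv is tab-separated with one header row and the same four columns as the log excerpt above (workload, n_nodes, msg_size, busbw). That layout is an assumption, so check the file before relying on it. The function name peak_busbw is hypothetical.

```shell
# Print the highest busbw for each (workload, n_nodes) pair.
# Assumes tab-separated columns: workload, n_nodes, msg_size, busbw,
# preceded by a single header row (an assumption, not confirmed).
peak_busbw() {
  awk -F '\t' '
    NR == 1 { next }                     # skip the header row
    NF < 4  { next }                     # skip blank or malformed lines
    {
      key = $1 "\t" $2
      if (!(key in peak) || $4 + 0 > peak[key]) peak[key] = $4 + 0
    }
    END { for (key in peak) printf "%s\t%.2f\n", key, peak[key] }
  ' "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

Call it with the path to the summary file, for example `peak_busbw "${RESULTS_DIR}/summary.tsv"`, where RESULTS_DIR is a placeholder for your nccl-tests_<timestamp> folder.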

Removing the nccl-tests_$(date +%s)/ directory removes all of the files generated during these tests.

A3 Ultra

From a shared directory on the login node (usually ${HOME}), run the following command to download the script needed to build the NCCL tests:

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
```
sbatch build-nccl-tests.sh
```
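The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL tests are built. To verify this, run the following command: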
```
sacct -a
```
If the build completed successfully, the output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3ultra                   112  COMPLETED      0:0
```
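If you prefer a scripted check over reading the table, the following sketch parses pipe-delimited sacct output. The function name jobs_completed is hypothetical. The example assumes that the build job's name starts with build-ncc, as in the truncated JobName column above. Because sacct -a lists jobs from all users, an older failed build with the same name prefix also makes the check fail.

```shell
# Succeed only if at least one job whose name starts with the given
# prefix exists and every such job is COMPLETED with exit code 0:0.
# Reads pipe-delimited lines from stdin: JobID|JobName|State|ExitCode
jobs_completed() {
  awk -F '|' -v prefix="$1" '
    index($2, prefix) == 1 {
      seen = 1
      if ($3 != "COMPLETED" || $4 != "0:0") {
        printf "job %s (%s): state=%s exit=%s\n", $1, $2, $3, $4
        failed = 1
      }
    }
    END { if (seen && !failed) exit 0; exit 1 }
  '
}

# Usage on the login node (requires Slurm accounting):
# sacct -a -X --noheader --parsable2 \
#   --format=JobID,JobName,State,ExitCode | jobs_completed build-ncc
```

The --parsable2 option makes sacct print fields separated by a pipe character without truncation, which avoids the column shift caused by the empty Account field in the default layout.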
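If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh and a directory named nccl-tests in the directory where you ran the command.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.
Download the NCCL test script: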
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
```
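To run any job on an A3 Ultra cluster, you must set several environment variables to enable high-performance networking over RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script. The test can take about 15 minutes or longer.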
```
sbatch run-nccl-tests.sh
```

View the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```
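The two trailer lines of this output are the quickest health signal: the out-of-bounds count should be 0 and the average bus bandwidth summarizes the run. The following sketch extracts both from a log file. It assumes the trailer lines appear exactly as in the sample above, and the function name summarize_nccl_log is hypothetical.

```shell
# Print the average bus bandwidth reported in an nccl-tests log.
# Fails if the log does not contain "Out of bounds values : 0 OK"
# or has no "Avg bus bandwidth" line.
summarize_nccl_log() {
  log_file="$1"
  if ! grep -Eq '^# Out of bounds values *: *0 OK' "$log_file"; then
    echo "out-of-bounds check missing or non-zero in $log_file" >&2
    return 1
  fi
  # The value is the last whitespace-separated field on the line.
  awk '/^# Avg bus bandwidth/ { print $NF; found = 1 }
       END { exit !found }' "$log_file"
}
```

For example, `summarize_nccl_log slurm-XX.out` prints one value for each benchmark recorded in that file, where XX is your job ID.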

A3 Mega

From a shared directory on the login node (usually ${HOME}), run the following command to download the script needed to build the NCCL tests:

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-megagpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests:
```
sbatch build-nccl-tests.sh
```
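The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL tests are built: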
```
sacct -a
```
The output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3mega                   112  COMPLETED      0:0
```
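After the build completes, an nccl-tests directory is created. This directory contains the nvidia+pytorch+24.09-py3.sqsh file. A .sqsh file is a compressed, read-only file system image that serves as a standard container format for AI workloads.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.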
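The binary check can be scripted so that a missing binary is reported by name. This is a minimal sketch that checks only the four binaries named above. The function name missing_nccl_binaries is hypothetical, and the default path assumes that you run it from the directory that contains nccl-tests.

```shell
# Report expected nccl-tests binaries that are missing or not executable.
# Returns 0 when all four are present, 1 otherwise.
missing_nccl_binaries() {
  build_dir="${1:-nccl-tests/build}"
  status=0
  for bin in all_gather_perf all_reduce_perf reduce_scatter_perf alltoall_perf; do
    if [ ! -x "$build_dir/$bin" ]; then
      echo "missing: $build_dir/$bin"
      status=1
    fi
  done
  return "$status"
}
```

Running `missing_nccl_binaries` with no arguments prints nothing and returns 0 when the build directory is complete.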
Download the NCCL test script:
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-megagpu-8g/nccl-tests/run-nccl-tests.sh
```
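To run any job on an A3 Mega cluster, you must set several environment variables to enable high-performance networking with the GPUDirect-TCPXO protocol. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you downloaded in the previous step.
Run the NCCL test script. The test can take about 15 minutes or longer.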
```
sbatch run-nccl-tests.sh
```

View the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```

A3 High

From a shared directory on the login node (usually ${HOME}), run the following command to download the script needed to build the NCCL tests:

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-highgpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
```
sbatch build-nccl-tests.sh
```
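The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL tests are built: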
```
sacct -a
```
The output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3high                   112  COMPLETED      0:0
```
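If the build is successful, an nccl-tests directory is created. This directory contains the nvidia+pytorch+24.09-py3.sqsh file. A .sqsh file is a compressed, read-only file system image that serves as a standard container format for AI workloads.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.
Download the NCCL test script: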
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-highgpu-8g/nccl-tests/run-nccl-tests.sh
```
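To run any job on an A3 High cluster, you must set several environment variables to enable high-performance networking with GPUDirect-TCPX. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script. The test can take about 15 minutes or longer.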
```
sbatch run-nccl-tests.sh
```

View the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```
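To plot or compare bandwidth by message size, the data rows can be reduced to three columns. The sketch below assumes the 13-column layout shown in the sample output, where field 8 is the out-of-place busbw and field 12 is the in-place busbw. Other nccl-tests versions might print a different set of columns, so check the header in your own log first. The function name busbw_by_size is hypothetical.

```shell
# Print "size_MiB<TAB>out_of_place_busbw<TAB>in_place_busbw" for each
# data row of an nccl-tests log. Assumes the 13-column layout:
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
busbw_by_size() {
  awk '
    $1 ~ /^[0-9]+$/ && NF == 13 {        # data rows start with the byte size
      printf "%d\t%s\t%s\n", $1 / 1048576, $8, $12
    }
  ' "$1"
}
```

Comment lines start with # and other log lines do not have 13 fields with a numeric first field, so only the data rows are printed.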
