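Run NCCL on Slurm clusters

This page describes how to run NCCL tests on a Slurm cluster. To use a managed Slurm environment that includes built-in NCCL tests for verifying cluster health, see Cluster Director.

Choose the steps for your machine type:

A4X Max, A4X, and A4

The following test uses Ramble, an open source, multi-platform experimentation framework written in Python, to coordinate the NCCL test runs. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X Max and A4X machines.

The run scripts for this test are staged in /opt/apps/system_benchmarks on the Slurm controller node and are available to all nodes in the cluster. Running the test installs Ramble in the /opt/apps/ramble directory.

On the login node, from the ${HOME} directory, run the following command. The test can take approximately 10 minutes, or longer if other jobs are in the queue, so the command uses nohup and redirects stdout and stderr to a log file.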
```
nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
```
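This command creates a folder named nccl-tests_$(date +%s) that stores all test results. The suffix is the Unix timestamp at the time the command runs, so each run gets its own folder.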
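Because the folder name depends on when the run started, it can help to look the folder up rather than retype the timestamp. The following is a minimal sketch, not part of the provided scripts; the function name latest_nccl_results is hypothetical, and it assumes the results folders sit directly under the directory you pass in.

```shell
# Print the newest nccl-tests_<timestamp> directory under a parent directory
# (default: the current directory). Hypothetical helper, not part of the
# provided run script.
latest_nccl_results() {
  local parent="${1:-.}" best="" best_ts=-1 dir ts
  for dir in "$parent"/nccl-tests_*/; do
    [ -d "$dir" ] || continue            # skip the unexpanded glob if nothing matches
    dir="${dir%/}"
    ts="${dir##*_}"                      # text after the last underscore
    case "$ts" in
      ''|*[!0-9]*) continue ;;           # ignore suffixes that are not numeric
    esac
    if [ "$ts" -gt "$best_ts" ]; then
      best="$dir"
      best_ts="$ts"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

For example, `cd "$(latest_nccl_results "$HOME")"` would change into the most recent results folder, if one exists.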

For example, if your cluster has 16 nodes, the NCCL tests run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
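The 16-node example is consistent with node counts that double from 2 up to the cluster size. The sketch below illustrates that pattern only; it is an assumption about the selection rule, not the run script's actual logic, and node_counts is a hypothetical name.

```shell
# Print the node counts that a doubling scheme would test for a cluster of
# the given size. Assumption: counts start at 2 and double while they still
# fit; the real script may choose differently for other cluster sizes.
node_counts() {
  local total="$1" n=2 out=""
  while [ "$n" -le "$total" ]; do
    out="${out:+$out }$n"
    n=$((n * 2))
  done
  printf '%s\n' "$out"
}

node_counts 16
```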
View the results. The nccl.log file contains the logs from setting up and running the test. To follow these logs, run the following command:
```
tail -f nccl.log
```
You can press Ctrl+C at any time to stop following the output. When the run finishes, the end of nccl.log should look similar to the following:
```
...
---- SUMMARY for >1GB Message Sizes ----
workload        n_nodes msg_size        busbw
all-gather      2       1073741824      ###.##
all-gather      2       2147483648      ###.##
all-gather      2       4294967296      ###.##
all-gather      2       8589934592      ###.##
...
all-reduce      2       1073741824      ###.##
...
reduce-scatter  2       1073741824      ###.##
...
-------- Benchmarking Complete -------
```
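If you prefer to wait for completion in a script rather than watch the log, one option is to look for the final marker line shown in the sample output. This is a minimal sketch; it assumes the log ends with the "Benchmarking Complete" line as in that sample, and nccl_run_complete is a hypothetical name.

```shell
# Succeed once the log contains the completion marker from the sample output.
# Assumption: a finished run always prints "Benchmarking Complete".
nccl_run_complete() {
  grep -q 'Benchmarking Complete' "${1:-nccl.log}"
}

# Example usage (commented out so that the block has no side effects):
# until nccl_run_complete nccl.log; do sleep 30; done
```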
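All Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in the nccl-tests_$(date +%s)/summary.tsv file.

Removing the nccl-tests_$(date +%s)/ directory removes all files generated during these tests.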
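To compare workloads at a glance, you can reduce summary.tsv to the highest bus bandwidth per workload. The sketch below assumes the file is tab-separated with one header row and the same four columns as the summary printed in nccl.log (workload, n_nodes, msg_size, busbw); check the file before relying on it. The name peak_busbw is hypothetical.

```shell
# Print "workload<TAB>highest busbw" for each workload in a summary file.
# Assumed layout: header row, then workload, n_nodes, msg_size, busbw
# separated by tabs.
peak_busbw() {
  awk -F'\t' '
    NR == 1 { next }                                   # skip the header row
    !($1 in best) || $4 + 0 > best[$1] { best[$1] = $4 + 0 }
    END { for (w in best) printf "%s\t%s\n", w, best[w] }
  ' "$1" | sort
}

# Example usage:
# peak_busbw nccl-tests_<timestamp>/summary.tsv
```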

A3 Ultra

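From a shared directory on the login node (this directory is usually ${HOME}), run the following command to download the script needed to build the NCCL tests: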

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
```
sbatch build-nccl-tests.sh
```
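The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, at the /nccl directory inside the container.
Verify that the NCCL tests are built. To do this, run the following command: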
```
sacct -a
```
If the build completed successfully, the output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3ultra                   112  COMPLETED      0:0
```
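The same check can be scripted by asking sacct for machine-readable output. The sketch below reads rows in the form produced by `sacct --parsable2 --noheader --format=JobID,State,ExitCode` and succeeds only if every row is COMPLETED with exit code 0:0. The function name all_jobs_completed is hypothetical.

```shell
# Read "JobID|State|ExitCode" rows on stdin and succeed only if there is at
# least one row and every row reports COMPLETED with exit code 0:0.
all_jobs_completed() {
  awk -F'|' '
    { rows++ }
    $2 != "COMPLETED" || $3 != "0:0" { bad++ }
    END { exit !(rows > 0 && bad == 0) }
  '
}

# Example usage (replace JOB_ID with the ID that sbatch printed):
# sacct -j JOB_ID --parsable2 --noheader --format=JobID,State,ExitCode | all_jobs_completed
```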
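If the build succeeded, the directory where you ran the command also contains a file named nvidia+pytorch+24.09-py3.sqsh and a directory named nccl-tests.
Confirm that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.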
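A short loop can confirm that the four binaries named above exist and are executable. This is a minimal sketch; check_nccl_binaries is a hypothetical name, and the build produces other binaries that this check does not cover.

```shell
# Report any of the four named test binaries that are missing or not
# executable in the build directory (default: nccl-tests/build).
check_nccl_binaries() {
  local build_dir="${1:-nccl-tests/build}" missing=0 name
  for name in all_gather_perf all_reduce_perf reduce_scatter_perf alltoall_perf; do
    if [ ! -x "$build_dir/$name" ]; then
      echo "missing: $build_dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
# check_nccl_binaries nccl-tests/build && echo "all binaries present"
```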
Download the NCCL test script:
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
```
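To run any job on an A3 Ultra cluster, you must set several environment variables to enable high-performance networking over RDMA. Because this procedure launches workloads in enroot containers, these variables must be set in the container environment rather than the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.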
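One quick way to inspect the script is to list the NCCL-related variable names it mentions. The sketch below relies on NCCL's convention that its settings are prefixed with NCCL_; it is an assumption that this covers the variables of interest, because the script may also set variables with other prefixes. The name list_nccl_vars is hypothetical.

```shell
# Print each distinct NCCL_* variable name that appears in a file.
# Variables with other prefixes are not listed.
list_nccl_vars() {
  grep -oE 'NCCL_[A-Z0-9_]+' "$1" | sort -u
}

# Example usage:
# list_nccl_vars run-nccl-tests.sh
```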
Run the NCCL test script. The test takes approximately 15 minutes or longer.
```
sbatch run-nccl-tests.sh
```

Review the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```
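To pull the headline number out of the output file, you can extract the value on the "Avg bus bandwidth" line. This is a minimal sketch that assumes the line has the same form as in the sample output; avg_busbw is a hypothetical name.

```shell
# Print the value from the "# Avg bus bandwidth : <value>" line of a
# benchmark output file.
avg_busbw() {
  awk -F: '/^# Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$1"
}

# Example usage (replace XX with your job ID):
# avg_busbw slurm-XX.out
```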

A3 Mega

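From a shared directory on the login node (this directory is usually ${HOME}), run the following command to download the script needed to build the NCCL tests: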

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-megagpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests:
```
sbatch build-nccl-tests.sh
```
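The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, at the /nccl directory inside the container.
Verify that the NCCL tests are built: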
```
sacct -a
```
The output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3mega                   112  COMPLETED      0:0
```
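After the build completes, an nccl-tests directory is created. This directory contains the nvidia+pytorch+24.09-py3.sqsh file. A .sqsh file is a SquashFS image: a compressed, read-only file system image that enroot uses as its container image format.
Confirm that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.
Download the NCCL test script: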
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-megagpu-8g/nccl-tests/run-nccl-tests.sh
```
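To run any job on an A3 Mega cluster, you must set several environment variables to enable high-performance networking over the GPUDirect-TCPXO protocol. Because this procedure launches workloads in enroot containers, these variables must be set in the container environment rather than the host environment. You can inspect these variables in the run-nccl-tests.sh script that you downloaded in the previous step.
Run the NCCL test script. The test takes approximately 15 minutes or longer.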
```
sbatch run-nccl-tests.sh
```

Review the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```
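To see how bus bandwidth changes with message size, you can reduce each data row to its size and the two busbw columns. The sketch below assumes the 13-column layout from the sample output, where field 1 is the size in bytes, field 8 is the out-of-place busbw, and field 12 is the in-place busbw. The name busbw_by_size is hypothetical.

```shell
# Print "size<TAB>out-of-place busbw<TAB>in-place busbw" for each data row.
# Header and footer lines start with "#" and are skipped because their first
# field is not a number.
busbw_by_size() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 { printf "%s\t%s\t%s\n", $1, $8, $12 }' "$1"
}

# Example usage (replace XX with your job ID):
# busbw_by_size slurm-XX.out
```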

A3 High

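From a shared directory on the login node (this directory is usually ${HOME}), run the following command to download the script needed to build the NCCL tests: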

```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-highgpu-8g/nccl-tests/build-nccl-tests.sh
```

After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
```
sbatch build-nccl-tests.sh
```
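The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, at the /nccl directory inside the container.
Verify that the NCCL tests are built: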
```
sacct -a
```
The output is similar to the following:
```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1            build-ncc+    a3high                   112  COMPLETED      0:0
```
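If the build succeeded, an nccl-tests directory is created. This directory contains the nvidia+pytorch+24.09-py3.sqsh file. A .sqsh file is a SquashFS image: a compressed, read-only file system image that enroot uses as its container image format.
Confirm that the nccl-tests/build folder contains several binaries, including all_gather_perf, all_reduce_perf, reduce_scatter_perf, and alltoall_perf.
Download the NCCL test script: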
```
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-highgpu-8g/nccl-tests/run-nccl-tests.sh
```
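To run any job on an A3 High cluster, you must set several environment variables to enable high-performance networking over GPUDirect-TCPX. Because this procedure launches workloads in enroot containers, these variables must be set in the container environment rather than the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script. The test takes approximately 15 minutes or longer.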
```
sbatch run-nccl-tests.sh
```
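If you want to follow the job's output file without looking up the job ID by hand, you can capture the ID at submission time. The sketch below relies on two Slurm behaviors: `sbatch --parsable` prints the job ID, optionally followed by a semicolon and the cluster name, and the default output file is slurm-<job ID>.out. It is an assumption that run-nccl-tests.sh does not override the output path with an #SBATCH --output directive. The name slurm_log_name is hypothetical.

```shell
# Turn the output of `sbatch --parsable` ("<jobid>" or "<jobid>;<cluster>")
# into the default Slurm output file name.
slurm_log_name() {
  local job_id="${1%%;*}"      # drop an optional ";cluster" suffix
  printf 'slurm-%s.out\n' "$job_id"
}

# Example usage:
# job_id=$(sbatch --parsable run-nccl-tests.sh)
# tail -f "$(slurm_log_name "$job_id")"
```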

Review the results. The script writes a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

The output is similar to the following:

```
#
#                                                              out-of-place                       in-place
#        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
   8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : ###.##
#
```
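Before reading the bandwidth figures, it is worth confirming that the run reported no data errors. The sketch below checks for the "Out of bounds values : 0 OK" line from the sample output; it assumes a clean run prints that line in the same form, and nccl_output_ok is a hypothetical name.

```shell
# Succeed only if the output file reports zero out-of-bounds values.
nccl_output_ok() {
  grep -Eq '^# Out of bounds values *: *0 OK' "$1"
}

# Example usage (replace XX with your job ID):
# nccl_output_ok slurm-XX.out || echo "check the output for errors" >&2
```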
