
Run NCCL on custom GKE clusters that use A4X Max

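This document describes how to run NCCL/gIB tests on provisioned clusters that use GPUDirect RDMA. Depending on your use case, use one of the following options:

- If you have two nodes to test, run the basic test.
- If you have three or more nodes to test, run the test with JobSet.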

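Test on two nodes

The following test runs an NCCL workload across two nodes. Note the following about this test:

By default, GKE schedules the two Pods onto separate node pools when they are available. If the node pools were created in different NVLink domains, this test measures cross-domain RDMA throughput. To schedule the Pods in the same domain, modify the Pod affinity so that both Pods are scheduled onto the same node pool.

Run the two-node test:

Connect to your cluster: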
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```
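Replace the following variables:
- CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, the name is based on DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.

Deploy an NCCL test workload that consists of two test Pods running on two A4X Max nodes: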

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-rdma/nccl-test-a4x-max.yaml
```

Check that both Pods have been scheduled onto nodes and are running:
```
kubectl get pods nccl-test-host-1 nccl-test-host-2
```
When both Pods show the Running status, you can proceed to the next step.
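Rather than polling `kubectl get pods` by hand, you can block until both Pods are ready. The following is a minimal sketch, not part of the official procedure: the function name `wait_for_nccl_pods` and the 300-second default timeout are illustrative assumptions, while the Pod names come from the preceding command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until both test Pods report the Ready condition.
# The timeout (in seconds) is an assumed default; adjust it for your cluster.
wait_for_nccl_pods() {
  local timeout="${1:-300}"
  kubectl wait --for=condition=Ready \
    pod/nccl-test-host-1 pod/nccl-test-host-2 \
    --timeout="${timeout}s"
}

# Usage: wait_for_nccl_pods 600
```

The function returns a non-zero exit code if either Pod is not ready before the timeout, so it can gate the next step in a script.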

Trigger an all-gather test on the A4X Max nodes:

```
HOSTS="nccl-host-1 nccl-host-2"
kubectl exec nccl-test-host-1 -it -- bash -c "/usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1M -e 16G ${HOSTS}"
```

The output is similar to the following:

```
# nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 4096 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    299 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  0 [0008:06:00] NVIDIA GB300
#  Rank  1 Group  0 Pid    300 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  1 [0009:06:00] NVIDIA GB300
#  Rank  2 Group  0 Pid    301 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  2 [0018:06:00] NVIDIA GB300
#  Rank  3 Group  0 Pid    302 on gke-psch-ug-cluster-a4xmax-2-1391c2ef-m9bm device  3 [0019:06:00] NVIDIA GB300
#  Rank  4 Group  0 Pid    237 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  0 [0008:06:00] NVIDIA GB300
#  Rank  5 Group  0 Pid    238 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  1 [0009:06:00] NVIDIA GB300
#  Rank  6 Group  0 Pid    239 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  2 [0018:06:00] NVIDIA GB300
#  Rank  7 Group  0 Pid    240 on gke-psch-ug-cluster-a4xmax-03878a77-d3lp device  3 [0019:06:00] NVIDIA GB300
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        4096           128     float    none      -1    28.51    0.14    0.13       0    27.71    0.15    0.13       0
        8192           256     float    none      -1    28.10    0.29    0.26       0    28.40    0.29    0.25       0
       16384           512     float    none      -1    28.55    0.57    0.50       0    28.19    0.58    0.51       0
       32768          1024     float    none      -1    30.56    1.07    0.94       0    29.65    1.11    0.97       0
       65536          2048     float    none      -1    33.30    1.97    1.72       0    33.14    1.98    1.73       0
      131072          4096     float    none      -1    36.18    3.62    3.17       0    36.14    3.63    3.17       0
      262144          8192     float    none      -1    38.50    6.81    5.96       0    94.91    2.76    2.42       0
      524288         16384     float    none      -1   152.25    3.44    3.01       0    54.79    9.57    8.37       0
     1048576         32768     float    none      -1    63.82   16.43   14.38       0    64.06   16.37   14.32       0
     2097152         65536     float    none      -1    65.10   32.21   28.19       0    66.13   31.71   27.75       0
     4194304        131072     float    none      -1    67.73   61.92   54.18       0    67.16   62.45   54.65       0
     8388608        262144     float    none      -1    79.65  105.31   92.15       0    80.02  104.83   91.73       0
    16777216        524288     float    none      -1   189.74   88.42   77.37       0   187.57   89.44   78.26       0
    33554432       1048576     float    none      -1   252.85  132.70  116.11       0   202.31  165.86  145.13       0
    67108864       2097152     float    none      -1   250.55  267.85  234.37       0   276.11  243.06  212.67       0
   134217728       4194304     float    none      -1   394.38  340.33  297.79       0   487.60  275.26  240.85       0
   268435456       8388608     float    none      -1   717.97  373.88  327.15       0   799.98  335.55  293.61       0
   536870912      16777216     float    none      -1  1421.29  377.73  330.52       0  1392.81  385.46  337.28       0
  1073741824      33554432     float    none      -1  2783.37  385.77  337.55       0  2596.97  413.46  361.78       0
  2147483648      67108864     float    none      -1  5396.10  397.97  348.22       0  5059.01  424.49  371.43       0
  4294967296     134217728     float    none      -1  10579.7  405.96  355.22       0  9918.44  433.03  378.90       0
  8589934592     268435456     float    none      -1  21012.9  408.79  357.69       0  20043.4  428.57  375.00       0
 17179869184     536870912     float    none      -1  42091.7  408.15  357.13       0  40243.2  426.90  373.54       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 146.047
#
# Collective test concluded: all_gather_perf
```
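To read the table, it helps to know how nccl-tests derives the two bandwidth columns: `algbw` is the message size divided by the elapsed time, and for all-gather `busbw` is `algbw` scaled by (n-1)/n, where n is the number of ranks. The following sketch recomputes both values for the largest out-of-place row of the preceding output (8 ranks); the function name `allgather_bw` is illustrative.

```shell
#!/usr/bin/env bash
# Recompute algbw and busbw for one all_gather_perf row.
# Arguments: size in bytes, time in microseconds, number of ranks.
allgather_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000        # bytes per microsecond -> GB/s
    busbw = algbw * (n - 1) / n     # all-gather correction factor
    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
  }'
}

allgather_bw 17179869184 42091.7 8   # prints: algbw=408.15 busbw=357.13
```

The result matches the 408.15 and 357.13 GB/s reported in the table, which confirms that the `busbw` column already accounts for the number of ranks.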

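If the Pods are scheduled onto nodes in different NVLink domains, this test measures cross-domain RDMA throughput, as shown in the preceding output. To spread the Pods across node pools that were created in separate NVLink domains, modify the Pod spec affinity in nccl-test-a4x-max.yaml as follows: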

```
spec:
  ...
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: nccl-test
          topologyKey: cloud.google.com/gke-nodepool
```
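To confirm where the scheduler placed each Pod, you can look up the node pool label on the node that hosts it. The sketch below assumes the Pod names from the earlier steps and the `cloud.google.com/gke-nodepool` node label that the affinity rule uses as its topology key; the function name `pod_nodepool` is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the node and node pool that host a given Pod.
pod_nodepool() {
  local pod="$1" node pool
  node="$(kubectl get pod "$pod" -o jsonpath='{.spec.nodeName}')"
  # Dots inside the label key must be escaped in a JSONPath expression.
  pool="$(kubectl get node "$node" \
    -o jsonpath='{.metadata.labels.cloud\.google\.com/gke-nodepool}')"
  echo "$pod $node $pool"
}

# Usage:
#   for p in nccl-test-host-1 nccl-test-host-2; do pod_nodepool "$p"; done
```

If both Pods report different node pools, the anti-affinity preference took effect. Because the rule is only a preference, the scheduler can still place both Pods in one pool when capacity is limited.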

Test with JobSet

Install JobSet:

```
VERSION=v0.10.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

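Make sure that your non-GPU node pools have enough resources to schedule the JobSet controllers. If necessary, follow the steps to define your own resource adjustments.

For more information about installing JobSet, see Installation.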
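Before you submit the test, you can confirm that the JobSet controller was actually scheduled. This sketch assumes the default names from the upstream JobSet manifests (namespace `jobset-system`, Deployment `jobset-controller-manager`); verify them against your installation if it was customized. The function name and the 120-second default timeout are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for the JobSet controller rollout to finish.
# Namespace and Deployment names are assumed upstream defaults.
jobset_controller_ready() {
  local timeout="${1:-120}"
  kubectl rollout status deployment/jobset-controller-manager \
    --namespace jobset-system --timeout="${timeout}s"
}

# Usage: jobset_controller_ready 300
```

A timeout here usually means that the controller Pod is still Pending, which is the symptom you would expect when the non-GPU node pools lack resources.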

Run the following commands, setting NUM_NODES to the number of nodes on which to run the NCCL test:

```
wget https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-a4x-max-jobset.yaml

NUM_NODES=2
sed "s|__NUM_NODES__|${NUM_NODES}|" nccl-test-a4x-max-jobset.yaml | kubectl apply -f -
```
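Because the placeholder is filled in by plain text substitution, it can be worth validating the node count and previewing the rendered manifest before applying it. The following sketch is illustrative: the function name `render_jobset` is hypothetical, and it assumes that the manifest marks the node count with the `__NUM_NODES__` placeholder, as in the preceding command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: substitute the node count into a manifest template
# and write the result to stdout. Rejects anything but a positive integer.
render_jobset() {
  local num_nodes="$1" template="$2"
  case "$num_nodes" in
    ''|*[!0-9]*|0*)
      echo "NUM_NODES must be a positive integer" >&2
      return 1
      ;;
  esac
  sed "s|__NUM_NODES__|${num_nodes}|g" "$template"
}

# Usage: preview first, then apply.
#   render_jobset 4 nccl-test-a4x-max-jobset.yaml | less
#   render_jobset 4 nccl-test-a4x-max-jobset.yaml | kubectl apply -f -
```

The double quotes around the `sed` expression matter: they let the shell expand the variable before `sed` runs, whereas single quotes would insert the variable name literally.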

Check that all Pods are in the Completed state:

```
kubectl get pods | grep allgather-worker
```

The output is similar to the following:

```
allgather-worker-0-0-g45d2   0/1     Completed   0          13m
allgather-worker-0-1-prpvw   0/1     Completed   0          13m
allgather-worker-0-2-qbwt5   0/1     Completed   0          13m
```
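To make this check scriptable, you can count the worker Pods that have not yet reached Completed. The sketch below filters `kubectl get pods` output on the `allgather-worker` name prefix from the preceding output; the function name `count_incomplete_workers` is illustrative.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods` output on stdin and print how many
# allgather-worker Pods are in a state other than Completed.
count_incomplete_workers() {
  awk '/^allgather-worker/ && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Usage: kubectl get pods --no-headers | count_incomplete_workers
```

A result of 0 means that every worker finished. Any other value is the number of workers that are still running, pending, or failed.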

View the test results in the logs of the head Pod (the Pod whose name starts with allgather-worker-0-0), where the test ran:

```
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep allgather-worker-0-0)
```

The output is similar to the following:

```
# nccl-tests version 2.17.6 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 1024 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
...
#                                                            out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        1024            32     float    none      -1    45.49    0.02    0.02       0    45.29    0.02    0.02       0
        2048            64     float    none      -1    45.52    0.04    0.04       0    45.37    0.05    0.04       0
        4096           128     float    none      -1    46.02    0.09    0.08       0    45.83    0.09    0.08       0
        8192           256     float    none      -1    63.93    0.13    0.11       0    46.98    0.17    0.15       0
       16384           512     float    none      -1    46.51    0.35    0.31       0    47.11    0.35    0.30       0
       32768          1024     float    none      -1    66.32    0.49    0.43       0    50.73    0.65    0.57       0
       65536          2048     float    none      -1    49.89    1.31    1.15       0    50.04    1.31    1.15       0
      131072          4096     float    none      -1    54.68    2.40    2.10       0    52.38    2.50    2.19       0
      262144          8192     float    none      -1    54.66    4.80    4.20       0    54.06    4.85    4.24       0
      524288         16384     float    none      -1    66.28    7.91    6.92       0    65.75    7.97    6.98       0
     1048576         32768     float    none      -1    85.63   12.25   10.72       0    86.44   12.13   10.61       0
     2097152         65536     float    none      -1    68.33   30.69   26.86       0    72.32   29.00   25.37       0
     4194304        131072     float    none      -1    71.85   58.37   51.08       0    71.58   58.60   51.28       0
     8388608        262144     float    none      -1    83.80  100.10   87.59       0    85.73   97.85   85.62       0
    16777216        524288     float    none      -1   195.94   85.62   74.92       0   195.86   85.66   74.95       0
    33554432       1048576     float    none      -1   240.84  139.32  121.91       0   210.82  159.16  139.27       0
    67108864       2097152     float    none      -1   254.95  263.22  230.32       0   250.93  267.44  234.01       0
   134217728       4194304     float    none      -1   411.09  326.49  285.68       0   386.11  347.61  304.16       0
   268435456       8388608     float    none      -1   741.69  361.92  316.68       0   722.42  371.58  325.13       0
   536870912      16777216     float    none      -1  1358.44  395.21  345.81       0  1343.63  399.57  349.62       0
  1073741824      33554432     float    none      -1  2679.62  400.71  350.62       0  2585.68  415.26  363.36       0
  2147483648      67108864     float    none      -1  5281.54  406.60  355.78       0  5074.73  423.17  370.28       0
  4294967296     134217728     float    none      -1  10476.2  409.97  358.73       0  10027.5  428.32  374.78       0
  8589934592     268435456     float    none      -1  20853.9  411.91  360.42       0  20194.7  425.36  372.19       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 126.85
#
# Collective test concluded: all_gather_perf
```
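When you compare runs, it can be handy to pull the headline numbers out of the log instead of reading the whole table. The following sketch reads nccl-tests output on stdin and prints the peak out-of-place `busbw` and the reported average. It assumes the 13-column layout of the preceding output, with out-of-place `busbw` in the eighth column; the function name `summarize_nccl` is illustrative.

```shell
#!/usr/bin/env bash
# Summarize nccl-tests output read from stdin.
summarize_nccl() {
  awk '
    # Data rows start with a numeric size; comment rows start with "#".
    $1 ~ /^[0-9]+$/ && NF >= 13 { if ($8 + 0 > peak) peak = $8 + 0 }
    # The average is the last field of the summary line.
    /^# Avg bus bandwidth/      { avg = $NF }
    END { printf "peak_busbw=%.2f avg_busbw=%s\n", peak, avg }
  '
}

# Usage (POD_NAME is a placeholder for the head Pod name):
#   kubectl logs POD_NAME | summarize_nccl
```

For the preceding output, this prints `peak_busbw=360.42 avg_busbw=126.85`. The average is much lower than the peak because it includes the small message sizes, where latency dominates.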
