A4X를 사용하는 맞춤 GKE 클러스터에서 NCCL 실행

이 페이지에서는 GPUDirect RDMA를 사용하는 프로비저닝된 클러스터에서 NCCL/gIB 테스트를 실행하는 방법을 설명합니다. 다음 시나리오의 테스트를 설명합니다.

두 노드에서 테스트

  1. 클러스터에 연결합니다.

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=COMPUTE_REGION
    

    다음 변수를 바꿉니다.

    • CLUSTER_NAME: 클러스터의 이름입니다. 클러스터 툴킷으로 만든 클러스터의 경우 DEPLOYMENT_NAME을 기반으로 합니다.
    • COMPUTE_REGION: 컴퓨팅 리전의 이름입니다.
  2. 두 A4X 노드에서 실행되는 두 테스트 포드의 NCCL 테스트 워크로드를 배포하려면 다음을 실행하세요.

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml
    
  3. 포드가 일부 노드에서 모두 실행 중인지 확인합니다.

    kubectl get pods nccl-test-host-1 nccl-test-host-2
    

    두 포드에 Running 상태가 표시되면 다음 단계로 진행할 수 있습니다.

  4. A4X 노드의 all-gather 테스트를 트리거합니다.

    kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
    

    출력은 다음과 비슷합니다.

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            32     float    none      -1    21.20    0.05    0.04      0    20.56    0.05    0.04      0
            2048            64     float    none      -1    21.03    0.10    0.09      0    20.82    0.10    0.09      0
            4096           128     float    none      -1    21.11    0.19    0.17      0    20.98    0.20    0.17      0
            8192           256     float    none      -1    21.51    0.38    0.33      0    21.15    0.39    0.34      0
           16384           512     float    none      -1    21.85    0.75    0.66      0    21.72    0.75    0.66      0
           32768          1024     float    none      -1    24.08    1.36    1.19      0    23.73    1.38    1.21      0
           65536          2048     float    none      -1    24.68    2.66    2.32      0    24.02    2.73    2.39      0
          131072          4096     float    none      -1    24.93    5.26    4.60      0    24.30    5.40    4.72      0
          262144          8192     float    none      -1    24.86   10.55    9.23      0    24.33   10.78    9.43      0
          524288         16384     float    none      -1    25.10   20.89   18.28      0    24.48   21.41   18.74      0
         1048576         32768     float    none      -1    25.43   41.24   36.09      0    24.82   42.25   36.97      0
         2097152         65536     float    none      -1    32.30   64.93   56.81      0    31.28   67.04   58.66      0
         4194304        131072     float    none      -1    45.92   91.34   79.92      0    44.22   94.84   82.99      0
         8388608        262144     float    none      -1    71.38  117.52  102.83      0    68.98  121.61  106.41      0
        16777216        524288     float    none      -1    74.17  226.20  197.93      0    72.37  231.83  202.85      0
        33554432       1048576     float    none      -1    116.6  287.84  251.86      0    112.7  297.75  260.54      0
        67108864       2097152     float    none      -1    188.9  355.27  310.86      0    184.0  364.71  319.12      0
       134217728       4194304     float    none      -1    309.6  433.56  379.36      0    299.7  447.83  391.85      0
       268435456       8388608     float    none      -1    559.0  480.23  420.20      0    540.3  496.85  434.75      0
       536870912      16777216     float    none      -1   1053.7  509.52  445.83      0   1021.4  525.64  459.93      0
      1073741824      33554432     float    none      -1   2087.4  514.39  450.10      0   2013.8  533.19  466.54      0
      2147483648      67108864     float    none      -1   4154.7  516.88  452.27      0   3987.4  538.57  471.25      0
      4294967296     134217728     float    none      -1   8289.2  518.14  453.37      0   7907.4  543.16  475.26      0
      8589934592     268435456     float    none      -1    16556  518.85  453.99      0    15726  546.24  477.96      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 175.233
    #
    

TAS로 테스트

프로비저닝된 클러스터의 기능을 검증하려면 TAS를 사용하여 다음 NCCL 테스트를 실행하면 됩니다.

TAS가 사용 설정된 Kueue 구성

  1. TAS가 사용 설정된 Kueue를 설치합니다.
  2. 다음 파일을 만들어 TAS가 사용 설정된 Kueue를 구성합니다. 파일 이름은 a4x-kueue-config.yaml입니다.

    apiVersion: kueue.x-k8s.io/v1alpha1
    kind: Topology
    metadata:
      name: "a4x-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gke-nodepool"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    kind: ResourceFlavor
    apiVersion: kueue.x-k8s.io/v1beta1
    metadata:
      name: "a4x"
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-gb200
      topologyName: "a4x-default"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: NoSchedule
      - key: "kubernetes.io/arch"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "a4x"
    spec:
      namespaceSelector: {} # match all.
      resourceGroups:
      - coveredResources: ["nvidia.com/gpu"]
        flavors:
        - name: "a4x"
          resources:
          - name: "nvidia.com/gpu"
            nominalQuota: 1_000_000_000
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "default"
      name: "a4x"
    spec:
      clusterQueue: "a4x"
    
  3. 테스트를 실행합니다.

    kubectl apply -f a4x-kueue-config.yaml
    

TAS가 사용 설정된 Kueue로 토폴로지 인식 NCCL 테스트 예약

다음 워크로드는 단일 NVLink 도메인 하위 블록 내에 배치해야 합니다.

  1. Kubernetes 작업 그룹을 하나의 단위로 관리하기 위한 Kubernetes 네이티브 API인 JobSet 설치 비 GPU 노드 풀에 JobSet 컨트롤러를 예약할 수 있는 리소스가 충분한지 확인합니다.
  2. nccl-tas-test.yaml이라는 이름으로 다음 파일을 만듭니다. NUM_NODES를 NCCL 테스트를 실행할 의도된 노드 수로 바꿉니다(최대 18).

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: nccl-test-compute-domain
    spec:
      numNodes: NUM_NODES
      channel:
        resourceClaimTemplate:
          name: nccl-test-compute-domain-channel
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: kueue-tas-nccl-all-gather
      labels:
        kueue.x-k8s.io/queue-name: a4x
    spec:
      ttlSecondsAfterFinished: 1200
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth2","network":"rdma-0"},
                        {"interfaceName":"eth3","network":"rdma-1"},
                        {"interfaceName":"eth4","network":"rdma-2"},
                        {"interfaceName":"eth5","network":"rdma-3"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-gb200
                  tolerations:
                  - key: nvidia.com/gpu
                    operator: Equal
                    value: present
                    effect: NoSchedule
                  - key: kubernetes.io/arch
                    operator: Equal
                    value: arm64
                    effect: NoSchedule
                  setHostnameAsFQDN: true
                  volumes:
                  - name: gib
                    hostPath:
                      path: /home/kubernetes/bin/gib
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  resourceClaims:
                  - name: compute-domain-channel
                    resourceClaimTemplateName: nccl-test-compute-domain-channel
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.4
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
                      # Install ping
                      apt update -y
                      apt install -y iputils-ping
    
                      # Start sshd
                      /scripts/container_entry.sh daemon &
    
                      # Get helper variables to form all hostnames
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
    
                      # For every worker, wait till online and add to hostfile
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          echo Waiting for ${OTHER}...
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=4 | tee -a /tmp/hostfile;
                      done
    
                      cat /tmp/hostfile
    
                      # Launch from head node
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
    
                          # World Level = 0x0, Rail Aligned = 0x7
                          export NCCL_TESTS_SPLIT_MASK="0x0";
    
                          # Force use of libnccl-gib
                          export NCCL_NET=gIB
    
                          # Set all the correct libnccl-gib environment variables
                          source /usr/local/gib/scripts/set_nccl_env.sh
    
                          # Get all relevant NCCL / env vars to pass to all workers
                          ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
    
                          mpirun --hostfile /tmp/hostfile \
                            -x $ENV_VARS  \
                            -mca plm_rsh_no_tree_spawn 1 \
                            --mca orte_keep_fqdn_hostnames 1 \
                            --mca btl self,tcp \
                            --mca btl_tcp_if_include eth0 \
                            --bind-to none \
                            --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                            /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
    
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                          sleep 5
                      done
                      fi
    
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: gib
                      mountPath: /usr/local/gib
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        nvidia.com/gpu: 4
                      requests:
                        nvidia.com/gpu: 4
                      claims:
                        - name: compute-domain-channel
                  restartPolicy: Never
    
  3. 테스트를 실행합니다.

    kubectl apply -f nccl-tas-test.yaml
    
  4. 로그를 검토하여 테스트 결과를 확인합니다.

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep kueue-tas-nccl-all-gather-worker-0-0)

    출력은 다음과 비슷하게 표시됩니다.

     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
             1024             8     float    none      -1    56.72    0.02    0.02      0    56.12    0.02    0.02      0
             2048            16     float    none      -1    56.85    0.04    0.03      0    56.87    0.04    0.03      0
             4096            32     float    none      -1    57.53    0.07    0.07      0    57.47    0.07    0.07      0
             8192            64     float    none      -1    58.43    0.14    0.14      0    58.27    0.14    0.14      0
            16384           128     float    none      -1    59.29    0.28    0.27      0    58.87    0.28    0.27      0
            32768           256     float    none      -1    60.02    0.55    0.53      0    59.60    0.55    0.53      0
            65536           512     float    none      -1    61.83    1.06    1.03      0    61.64    1.06    1.03      0
           131072          1024     float    none      -1    70.99    1.85    1.79      0    70.82    1.85    1.79      0
           262144          2048     float    none      -1    71.56    3.66    3.55      0    71.07    3.69    3.57      0
           524288          4096     float    none      -1    72.62    7.22    6.99      0    71.90    7.29    7.06      0
          1048576          8192     float    none      -1    72.80   14.40   13.95      0    72.31   14.50   14.05      0
          2097152         16384     float    none      -1    73.40   28.57   27.68      0    72.96   28.74   27.85      0
          4194304         32768     float    none      -1    73.86   56.78   55.01      0    73.44   57.12   55.33      0
          8388608         65536     float    none      -1    102.5   81.86   79.30      0    101.4   82.69   80.11      0
         16777216        131072     float    none      -1    158.3  105.97  102.66      0    156.8  107.02  103.68      0
         33554432        262144     float    none      -1    158.4  211.89  205.26      0    157.5  212.99  206.33      0
         67108864        524288     float    none      -1    250.7  267.68  259.32      0    248.7  269.81  261.38      0
        134217728       1048576     float    none      -1    417.7  321.29  311.25      0    414.1  324.13  314.01      0
        268435456       2097152     float    none      -1    728.8  368.32  356.81      0    721.5  372.08  360.45      0
        536870912       4194304     float    none      -1   1226.5  437.72  424.04      0   1216.1  441.46  427.66      0
       1073741824       8388608     float    none      -1   2268.4  473.35  458.56      0   2247.0  477.86  462.93      0
       2147483648      16777216     float    none      -1   4330.6  495.88  480.39      0   4291.6  500.39  484.76      0
       4294967296      33554432     float    none      -1   8640.9  497.05  481.52      0   8544.0  502.69  486.98      0
       8589934592      67108864     float    none      -1    17258  497.75  482.19      0    17052  503.75  488.00      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 157.091
    

다음 단계