Run NCCL on custom GKE clusters that use A3 Mega or A3 High

This page describes how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.

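You can use the tests described on this page in the following scenarios: a two-node test on Flex-start nodes, and a test with Topology Aware Scheduling (TAS) for three or more nodes.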

Before you begin

The tests on this page use JobSet and Kueue with Topology Aware Scheduling (TAS). Before you run the tests, set up your cluster and do the following:

  1. Install JobSet.

  2. Install Kueue.

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
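
Kueue objects can only be applied once the Kueue controller and its webhooks are serving. The following is a minimal sketch of a wait step, assuming the default `kueue-system` namespace and `kueue-controller-manager` Deployment created by the upstream manifest. The `KUBECTL` variable and the `wait_for_kueue` function name are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# KUBECTL is a hypothetical override; it defaults to the kubectl on PATH.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_kueue [TIMEOUT]
# Blocks until the Kueue controller Deployment reports Available.
# Assumes the default namespace and Deployment name from manifests.yaml;
# adjust both if your installation differs.
wait_for_kueue() {
  "$KUBECTL" wait deploy/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=available \
    --timeout="${1:-5m}"
}

# Usage (not run here): wait_for_kueue 10m
```

Calling `wait_for_kueue` before applying the Kueue configuration avoids webhook errors caused by a controller that is still starting.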
    

Set up your cluster with JobSet and Kueue

After you install JobSet and Kueue, complete the following steps:

  1. Save the following manifest as kueue-config.yaml.

    A3 High

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-high-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-high-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-high-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
    

    A3 Mega

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-mega-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-mega-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-mega-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
    
  2. Apply the manifest:

    kubectl apply -f kueue-config.yaml
    

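When you run workloads with TAS enabled, you can specify how strictly topology constraints are enforced by using one of the following annotations in the workload manifest:

  • kueue.x-k8s.io/podset-required-topology: With this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation to ensure that Pods are placed together for the best performance.

  • kueue.x-k8s.io/podset-preferred-topology: With this annotation, Kueue tries to schedule the Pods within the requested topology constraint, but if that isn't possible, it admits the workload without meeting the constraint.

Note: Don't use required mode with DWS Flex-start. Flex-start provisions nodes dynamically, so the resulting nodes might not meet strict topology requirements, which can leave workloads unschedulable. For these configurations, use podset-preferred-topology instead.

For either annotation, specify one of the following values as the topology constraint:

  • cloud.google.com/gce-topology-block: Schedules Pods within the same network block.
  • cloud.google.com/gce-topology-subblock: Schedules Pods within the same rack.
  • cloud.google.com/gce-topology-host: Schedules Pods on the same physical host.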
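
The annotation key encodes the strictness and the value encodes the topology level. As a rough illustration, the following sketch prints the annotation line for a given combination and rejects anything outside the two modes and three levels listed above. The function name `tas_annotation` is a hypothetical helper, not part of Kueue.

```shell
#!/bin/sh
# tas_annotation MODE LEVEL
#   MODE:  required | preferred
#   LEVEL: block | subblock | host
# Prints one YAML annotation line for a Pod template, or returns 1 on
# an unknown mode or level.
tas_annotation() {
  mode="$1"
  level="$2"
  case "$mode" in
    required|preferred) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  case "$level" in
    block|subblock|host) ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  printf 'kueue.x-k8s.io/podset-%s-topology: "cloud.google.com/gce-topology-%s"\n' \
    "$mode" "$level"
}

# Usage: tas_annotation preferred block
```

For Flex-start workloads, pass `preferred` as the mode, in line with the note above.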

Test on two Flex-start nodes

To run an NCCL test on a GKE cluster that uses A3 Mega or A3 High Flex-start VMs, follow this procedure. The procedure uses a JobSet manifest to run the NCCL test on two nodes.

  1. Save the following manifest as nccl-tas-jobset.yaml:

    A3 Mega

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-configmap
    data:
      allgather.sh: |
        #!/bin/bash
        # Start sshd, then set up SSH access and hostfiles for the hosts
        # passed as arguments.
        service ssh restart
        /scripts/init_ssh.sh "$@"
        pushd /scripts
        /scripts/gen_hostfiles.sh "$@"
        popd
        # Set up environment variables for GPUDirect-TCPXO
        export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
        export NCCL_FASTRAK_CTRL_DEV=eth0
        export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
        export NCCL_SOCKET_IFNAME=eth0
        export NCCL_CROSS_NIC=0
        export NCCL_ALGO=Ring,Tree
        export NCCL_PROTO=Simple
        export NCCL_NET_GDR_LEVEL=PIX
        # Run the benchmark
        /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          replicas: 2
          template:
            spec:
              parallelism: 1
              completions: 1
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"},
                        {"interfaceName":"eth5","network":"vpc4"},
                        {"interfaceName":"eth6","network":"vpc5"},
                        {"interfaceName":"eth7","network":"vpc6"},
                        {"interfaceName":"eth8","network":"vpc7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                  tolerations:
                  - key: cloud.google.com/gke-queued
                    effect: NoSchedule
                    value: "true"
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  - name: nccl-config
                    configMap:
                      name: nccl-configmap
                      defaultMode: 0755
                  containers:
                  - name: nccl-test
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
                    stdin: true
                    tty: true
                    securityContext:
                      privileged: true
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: shared-memory
                      mountPath: /dev/shm
                    - name: nccl-config
                      mountPath: /configs
                    resources:
                      limits:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                  - name: tcpxo-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
                    imagePullPolicy: Always
                    command: ["/bin/sh", "-c"]
                    args:
                      - |
                        set -ex
                        chmod 755 /fts/entrypoint_rxdm_container.sh
                        /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
                    securityContext:
                      privileged: true
                      capabilities:
                        add:
                          - NET_ADMIN
                          - NET_BIND_SERVICE
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: proc
                      mountPath: /proc
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
    

    A3 High

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-config
    data:
      allgather.sh: |
        #!/bin/bash
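        # Copy the mounted scripts into /scripts and make them executable.
        for script in /configs/*; do
          name=$(basename "$script")
          cp "$script" "/scripts/$name"
          chmod +x "/scripts/$name"
        done
        # Set up SSH access and hostfiles for the hosts passed as arguments.
        /scripts/init_ssh.sh "$@"
        pushd /scripts
        /scripts/gen_hostfiles.sh "$@"
        popd
        # The last argument, $#, is the number of hosts passed to this script.
        /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M $#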
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                terminationGracePeriodSeconds: 0
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                    - /tcpgpudmarxd/build/app/tcpgpudmarxd
                    - --gpu_nic_preset
                    - a3vm
                    - --gpu_shmem_type
                    - fd
                    - --uds_path
                    - /run/tcpx
                    - --setup_param
                    - "--verbose 128 2 0 "
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                        - NET_ADMIN
                  volumeMounts:
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: tcpx-socket
                      mountPath: /run/tcpx
                    - name: sys
                      mountPath: /hostsysfs
                    - name: proc-sys
                      mountPath: /hostprocsysfs
                  env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  command:
                    - bash
                    - -c
                    - |
                      /scripts/container_entry.sh daemon;
                      sleep infinity;
                  securityContext:
                    privileged: true
                  volumeMounts:
                    - name: tcpx-socket
                      mountPath: /tmp
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: nccl-config
                      mountPath: /configs
                    - name: shared-memory
                      mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                volumes:
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: sys
                  hostPath:
                    path: /sys
                - name: proc-sys
                  hostPath:
                    path: /proc/sys
                - name: shared-memory
                  emptyDir:
                    medium: Memory
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-config
                    defaultMode: 0777
    
  2. Apply the manifest to your cluster:

    kubectl apply -f nccl-tas-jobset.yaml
    
  3. Verify that the JobSet is admitted and running:

    kubectl get jobset nccl-tas-test
    

    Wait for the JobSet to be unsuspended and for the Pods to reach the Running status.

  4. Trigger the NCCL test by running the allgather.sh script from the first worker Pod:

    kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0
    

    The output of the two-node test is similar to the following:

    A3 Mega

    #                                                              out-of-place                       in-place
    #        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        0                 0         float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
        ...
        8589934592     134217728    float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 45.7587
    

    A3 High

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
        ...
        536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 29.8293
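
When you compare runs, it can help to pull the summary figures out of the test output instead of reading the table by eye. The following is a minimal sketch that reads nccl-tests output on stdin, assuming the column layout shown in the sample output above (out-of-place busbw in the eighth column). The function names `avg_busbw` and `peak_busbw` are hypothetical.

```shell
#!/bin/sh
# avg_busbw: print the value from the "Avg bus bandwidth" summary line.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# peak_busbw: print the largest out-of-place busbw (column 8, GB/s)
# among the data rows. Data rows start with a numeric size and have
# 13 columns; header and summary lines start with "#" and are skipped.
peak_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 {
         if ($8 + 0 > max) max = $8 + 0
       }
       END { print max + 0 }'
}

# Usage: save the test output to a file, then run
#   avg_busbw < nccl-output.txt
#   peak_busbw < nccl-output.txt
```

The average covers all message sizes, including small ones that are dominated by latency, so the peak value at large sizes is usually the more useful figure for judging network throughput.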
    

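Deploy an NCCL test workload with TAS

If you have three or more nodes, we recommend the following test, which uses Topology Aware Scheduling (TAS). To run an NCCL test with TAS on a GKE cluster that uses A3 Mega or A3 High Flex-start VMs, follow this procedure.

  1. Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool.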

    A3 Mega

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"},
                        {"interfaceName":"eth5","network":"vpc4"},
                        {"interfaceName":"eth6","network":"vpc5"},
                        {"interfaceName":"eth7","network":"vpc6"},
                        {"interfaceName":"eth8","network":"vpc7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                  tolerations:
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_FASTRAK_CTRL_DEV
                      value: eth0
                    - name: NCCL_FASTRAK_IFNAME
                      value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
                    - name: NCCL_CROSS_NIC
                      value: "0"
                    - name: NCCL_ALGO
                      value: Ring,Tree
                    - name: NCCL_PROTO
                      value: Simple
                    - name: NCCL_NET_GDR_LEVEL
                      value: PIX
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      /scripts/container_entry.sh daemon &
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                          export NCCL_TESTS_SPLIT_MASK="0x0";
                          ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                          mpirun --hostfile /tmp/hostfile \
                            -x $ENV_VARS  \
                            -mca plm_rsh_no_tree_spawn 1 \
                            --mca orte_keep_fqdn_hostnames 1 \
                            --mca btl self,tcp \
                            --mca btl_tcp_if_include eth0 \
                            --bind-to none \
                            --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                            /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                            sleep 5
                          done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: lib64
                      mountPath: /lib64
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                  - name: tcpxo-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
                    imagePullPolicy: Always
                    command:
                    - bash
                    - -c
                    - |
                      /usr/bin/tcpxo_daemon
                    securityContext:
                      privileged: true
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: proc
                      mountPath: /proc
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
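
In the manifest above, each Pod builds the DNS names of its peers from its own hostname by using the `POSTFIX` and `WORKERS_BASENAME` variables. The following sketch reproduces that derivation with shell parameter expansion in place of the `cut` and `rev` pipeline, so that the steps are easier to follow. The function name `peer_fqdn` and the sample hostname in the usage comment are assumptions for illustration; the actual value that `hostname` returns depends on your JobSet name and namespace.

```shell
#!/bin/sh
# peer_fqdn FQDN INDEX
# Derives the FQDN of the worker Pod at INDEX from this Pod's own FQDN.
peer_fqdn() {
  fqdn="$1"
  index="$2"
  short="${fqdn%%.*}"     # first DNS label, for example nccl-ag-worker-0-3
  postfix="${fqdn#*.}"    # everything after the first dot (POSTFIX)
  base="${short%-*}"      # drop the trailing "-<pod index>" (WORKERS_BASENAME)
  printf '%s-%s.%s\n' "$base" "$index" "$postfix"
}

# Usage with a hypothetical hostname:
#   peer_fqdn nccl-ag-worker-0-3.nccl-ag.default.svc.cluster.local 0
```

Because every Pod in the replicated Job shares the same base name and suffix, only the trailing index changes between peers, which is why the loop in the manifest can enumerate all workers from `0` to `N_NODES-1`.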
    

    A3 High

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-80gb
                  tolerations:
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: libraries
                    hostPath:
                      path: /home/kubernetes/bin/nvidia/lib64
                  - name: tcpx-socket
                    emptyDir: {}
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  containers:
                  - name: tcpx-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                    command:
                      - /tcpgpudmarxd/build/app/tcpgpudmarxd
                      - --gpu_nic_preset
                      - a3vm
                      - --uds_path
                      - /run/tcpx
                    securityContext:
                      privileged: true
                    volumeMounts:
                      - name: tcpx-socket
                        mountPath: /run/tcpx
                      - name: libraries
                        mountPath: /usr/local/nvidia/lib64
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      /scripts/container_entry.sh daemon &
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                          /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                            sleep 5
                          done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: tcpx-socket
                      mountPath: /tmp
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        cpu: "200"
                        memory: "1800Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "1800Gi"
                        nvidia.com/gpu: 8
    
  2. Apply the manifest.

    kubectl apply -f nccl-jobset-test.yaml
    
  3. Verify that the workload is admitted and reaches the Completed state.

  4. Fetch the logs of the Pod whose name matches nccl-ag-worker-0-0-.* to check the results.

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
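
The entrypoint script in the manifest builds each peer's address from the Pod's own fully qualified hostname. Everything after the first dot becomes POSTFIX, and the first label without its trailing `-<index>` becomes WORKERS_BASENAME. The following sketch replays that derivation outside a cluster, which can help when checking why a worker cannot reach its peers over SSH. The hostname `nccl-ag-worker-0-1.example-subdomain` and the node count of 2 are hypothetical sample values chosen for illustration; they are not taken from a running cluster.

```shell
# Sketch: replay the hostname derivation used by the nccl-test entrypoint.
# FQDN and N_NODES are hypothetical sample values (assumptions), standing in
# for the output of `hostname` and the N_NODES environment variable.
FQDN="nccl-ag-worker-0-1.example-subdomain"
N_NODES=2

# Everything after the first dot: the domain suffix shared by all workers.
POSTFIX=$(echo "$FQDN" | cut -d . -f 2-)

# First label with its trailing "-<pod index>" removed.
WORKERS_BASENAME=$(echo "$FQDN" | cut -d . -f 1 | rev | cut -d - -f 2- | rev)

# Print one hostfile entry per worker, in the same format that the manifest
# appends to /tmp/hostfile.
for i in $(seq 0 $((N_NODES - 1))); do
  OTHER="${WORKERS_BASENAME}-${i}.${POSTFIX}"
  echo "${OTHER} port=222 slots=8"
done
```

With these sample values, the loop prints one entry for `nccl-ag-worker-0-0.example-subdomain` and one for `nccl-ag-worker-0-1.example-subdomain`. In the real manifest the script additionally waits until each peer answers over SSH on port 222 before it writes the entry.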
    

What's next