Run NCCL on custom GKE clusters that use A3 Mega or A3 High

This page describes how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.

You can use the tests described on this page in the following scenarios:

Before you begin

The tests on this page use JobSet and Kueue with Topology Aware Scheduling (TAS). Before you run any tests, you must set up your cluster by doing the following:

  1. Install JobSet.

  2. Install Kueue:

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
    

Set up your cluster with JobSet and Kueue

After you install JobSet and Kueue, complete the following steps:

  1. Save the following manifest as kueue-config.yaml:

    A3 High

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---     
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-high-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-high-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-high-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
    

    A3 Mega

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---     
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-mega-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-mega-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-mega-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
    
  2. Apply the manifest:

    kubectl apply -f kueue-config.yaml
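
The Topology object in the manifest refers to four node labels. When a workload is not admitted as expected, it can help to check whether every GPU node actually carries those labels. The following Python sketch does that check on the parsed JSON that `kubectl get nodes -o json` returns. The function name `nodes_missing_topology_labels` is hypothetical and is not part of Kueue or GKE.

```python
# Node labels taken from the `levels` list of the Topology manifest above.
TOPOLOGY_LABELS = (
    "cloud.google.com/gce-topology-block",
    "cloud.google.com/gce-topology-subblock",
    "cloud.google.com/gce-topology-host",
    "kubernetes.io/hostname",
)


def nodes_missing_topology_labels(node_list, required=TOPOLOGY_LABELS):
    """Map each node name to the required labels that the node lacks.

    `node_list` is the dict parsed from `kubectl get nodes -o json`.
    Nodes that carry every required label are left out of the result.
    """
    missing = {}
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels") or {}
        absent = [label for label in required if label not in labels]
        if absent:
            missing[metadata.get("name", "<unnamed>")] = absent
    return missing
```

One possible way to use it is to save the node list with `kubectl get nodes -o json > nodes.json`, load the file with `json.load`, and pass the result to the function. An empty result means that no node lacks a label.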
    

When you run workloads with Topology Aware Scheduling enabled, you can specify how strictly topology constraints are enforced. To do so, use one of the following annotations in the workload manifest:

  • kueue.x-k8s.io/podset-required-topology: With this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation to ensure that Pods are placed together for optimal performance.

  • kueue.x-k8s.io/podset-preferred-topology: With this annotation, Kueue tries to place Pods within the requested topology constraint. If that isn't possible, Kueue admits the workload without satisfying the topology constraint.

Note: Avoid using the required mode with DWS flex-start. Because flex-start provisions nodes dynamically, the resulting nodes might not meet the strict topology requirements, which can leave workloads unschedulable. For these configurations, use podset-preferred-topology instead.

For either annotation, specify one of the following values as the topology constraint:

  • cloud.google.com/gce-topology-block: Schedules Pods within the same network block.
  • cloud.google.com/gce-topology-subblock: Schedules Pods within the same rack.
  • cloud.google.com/gce-topology-host: Schedules Pods on the same physical host.
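
The choice between the two annotations and the three values can be summarized in a small helper. The following Python sketch is only an illustration of the rules described above. The function `topology_annotation` and its `strict` and `flex_start` parameters are hypothetical names, not part of Kueue.

```python
# Topology values listed above, from widest to narrowest scope.
TOPOLOGY_LEVELS = (
    "cloud.google.com/gce-topology-block",
    "cloud.google.com/gce-topology-subblock",
    "cloud.google.com/gce-topology-host",
)

REQUIRED = "kueue.x-k8s.io/podset-required-topology"
PREFERRED = "kueue.x-k8s.io/podset-preferred-topology"


def topology_annotation(level, strict=False, flex_start=False):
    """Return the single Pod template annotation for the chosen constraint.

    strict=True selects the required mode, otherwise the preferred mode.
    flex_start=True marks a workload that runs on DWS flex-start nodes,
    for which this sketch rejects the required mode.
    """
    if level not in TOPOLOGY_LEVELS:
        raise ValueError(f"unknown topology level: {level}")
    if strict and flex_start:
        raise ValueError("use the preferred mode with DWS flex-start")
    return {(REQUIRED if strict else PREFERRED): level}
```

For example, `topology_annotation("cloud.google.com/gce-topology-block", flex_start=True)` returns the preferred-mode annotation with the block-level value, which matches the annotation used in the two-node manifest on this page.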

Test on two flex-start nodes

To run NCCL tests on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps. This procedure uses a JobSet manifest to run an NCCL test on two nodes.

  1. Save the following manifest as nccl-tas-jobset.yaml:

    A3 Mega

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-configmap
    data:
      allgather.sh: |
        #!/bin/bash
        service ssh restart;
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        # Set up environment variables for GPUDirect-TCPXO
        export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
        export NCCL_FASTRAK_CTRL_DEV=eth0
        export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
        export NCCL_SOCKET_IFNAME=eth0
        export NCCL_CROSS_NIC=0
        export NCCL_ALGO=Ring,Tree
        export NCCL_PROTO=Simple
        export NCCL_NET_GDR_LEVEL=PIX
        # Run the benchmark
        /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          replicas: 2
          template:
            spec:
              parallelism: 1
              completions: 1
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"},
                        {"interfaceName":"eth5","network":"vpc4"},
                        {"interfaceName":"eth6","network":"vpc5"},
                        {"interfaceName":"eth7","network":"vpc6"},
                        {"interfaceName":"eth8","network":"vpc7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                  tolerations:
                  - key: cloud.google.com/gke-queued
                    effect: NoSchedule
                    value: "true"
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  - name: nccl-config
                    configMap:
                      name: nccl-configmap
                      defaultMode: 0755
                  containers:
                  - name: nccl-test
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
                    stdin: true
                    tty: true
                    securityContext:
                      privileged: true
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: shared-memory
                      mountPath: /dev/shm
                    - name: nccl-config
                      mountPath: /configs
                    resources:
                      limits:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                  - name: tcpxo-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
                    imagePullPolicy: Always
                    command: ["/bin/sh", "-c"]
                    args:
                      - |
                        set -ex
                        chmod 755 /fts/entrypoint_rxdm_container.sh
                        /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
                    securityContext:
                      privileged: true
                      capabilities:
                        add:
                          - NET_ADMIN
                          - NET_BIND_SERVICE
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia/lib64
                    - name: proc
                      mountPath: /proc
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
    

    A3 High

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-config
    data:
      allgather.sh: |
        #!/bin/bash
        for script in /configs/*; do
          name=$(basename $script)
          cp $script "/scripts/$name"
          chmod +x "/scripts/$name"
        done
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${#};
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                terminationGracePeriodSeconds: 0
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                    - /tcpgpudmarxd/build/app/tcpgpudmarxd
                    - --gpu_nic_preset
                    - a3vm
                    - --gpu_shmem_type
                    - fd
                    - --uds_path
                    - /run/tcpx
                    - --setup_param
                    - "--verbose 128 2 0 "
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                        - NET_ADMIN
                  volumeMounts:
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: tcpx-socket
                      mountPath: /run/tcpx
                    - name: sys
                      mountPath: /hostsysfs
                    - name: proc-sys
                      mountPath: /hostprocsysfs
                  env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  command:
                    - bash
                    - -c
                    - |
                      /scripts/container_entry.sh daemon;
                      sleep infinity;
                  securityContext:
                    privileged: true
                  volumeMounts:
                    - name: tcpx-socket
                      mountPath: /tmp
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: nccl-config
                      mountPath: /configs
                    - name: shared-memory
                      mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                volumes:
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: sys
                  hostPath:
                    path: /sys
                - name: proc-sys
                  hostPath:
                    path: /proc/sys
                - name: shared-memory
                  emptyDir:
                    medium: Memory
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-config
                    defaultMode: 0777
    
  2. Apply the manifest to your cluster:

    kubectl apply -f nccl-tas-jobset.yaml
    
  3. Check that the JobSet is admitted and running:

    kubectl get jobset nccl-tas-test
    

    Wait until the JobSet is unsuspended and the Pods reach the Running status.

  4. Trigger the NCCL test by running the allgather.sh script from the first worker Pod:

    kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0
    

    The output for a two-node test looks similar to the following:

    A3 Mega

    #                                                              out-of-place                       in-place
    #        size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        0                 0         float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
        ...
        8589934592     134217728    float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 45.7587
    

    A3 High

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
        ...
        536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 29.8293
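
If you want to compare runs, the bandwidth columns can be pulled out of this text with a short script. The following Python sketch assumes the 13-column row layout shown in the sample output above (size, count, type, redop, root, then time, algbw, busbw, and #wrong for out-of-place and in-place). The function name `parse_nccl_output` is hypothetical and is not part of nccl-tests.

```python
def parse_nccl_output(text):
    """Extract per-size bus bandwidth and the reported average from test output.

    Returns (rows, avg_busbw). Each row is a dict with the message size in
    bytes and the out-of-place and in-place bus bandwidth in GB/s.
    avg_busbw is None if the summary line is not present.
    """
    rows = []
    avg_busbw = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            # Summary line, for example "# Avg bus bandwidth    : 29.8293".
            if "Avg bus bandwidth" in line:
                avg_busbw = float(line.split(":", 1)[1])
            continue
        fields = line.split()
        # Skip blank lines, "..." placeholders, and anything that is not a data row.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        rows.append({
            "size_bytes": int(fields[0]),
            "out_of_place_busbw": float(fields[7]),
            "in_place_busbw": float(fields[11]),
        })
    return rows, avg_busbw
```

The largest message size is usually the most informative row, so `rows[-1]` is a reasonable value to track across runs alongside the reported average.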
    

Deploy an NCCL test workload with Topology Aware Scheduling

If you have more than two nodes, we recommend the following test, which uses Topology Aware Scheduling. To run NCCL tests with Topology Aware Scheduling on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps.

  1. Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool:

    A3 Mega

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"},
                        {"interfaceName":"eth5","network":"vpc4"},
                        {"interfaceName":"eth6","network":"vpc5"},
                        {"interfaceName":"eth7","network":"vpc6"},
                        {"interfaceName":"eth8","network":"vpc7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                  tolerations:
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_FASTRAK_CTRL_DEV
                      value: eth0
                    - name: NCCL_FASTRAK_IFNAME
                      value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
                    - name: NCCL_CROSS_NIC
                      value: "0"
                    - name: NCCL_ALGO
                      value: Ring,Tree
                    - name: NCCL_PROTO
                      value: Simple
                    - name: NCCL_NET_GDR_LEVEL
                      value: PIX
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      /scripts/container_entry.sh daemon &
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                          export NCCL_TESTS_SPLIT_MASK="0x0";
                          ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                          mpirun --hostfile /tmp/hostfile \
                            -x $ENV_VARS  \
                            -mca plm_rsh_no_tree_spawn 1 \
                            --mca orte_keep_fqdn_hostnames 1 \
                            --mca btl self,tcp \
                            --mca btl_tcp_if_include eth0 \
                            --bind-to none \
                            --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                            /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                            sleep 5
                          done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: lib64
                      mountPath: /lib64
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "3700Gi"
                        nvidia.com/gpu: 8
                  - name: tcpxo-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
                    imagePullPolicy: Always
                    command:
                    - bash
                    - -c
                    - |
                      /usr/bin/tcpxo_daemon
                    securityContext:
                      privileged: true
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: proc
                      mountPath: /proc
                    env:
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
    

    A3 High

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"vpc0"},
                        {"interfaceName":"eth2","network":"vpc1"},
                        {"interfaceName":"eth3","network":"vpc2"},
                        {"interfaceName":"eth4","network":"vpc3"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h100-80gb
                  tolerations:
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: proc
                    hostPath:
                      path: /proc
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: libraries
                    hostPath:
                      path: /home/kubernetes/bin/nvidia/lib64
                  - name: tcpx-socket
                    emptyDir: {}
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  containers:
                  - name: tcpx-daemon
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                    command:
                      - /tcpgpudmarxd/build/app/tcpgpudmarxd
                      - --gpu_nic_preset
                      - a3vm
                      - --uds_path
                      - /run/tcpx
                    securityContext:
                      privileged: true
                    volumeMounts:
                      - name: tcpx-socket
                        mountPath: /run/tcpx
                      - name: libraries
                        mountPath: /usr/local/nvidia/lib64
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      /scripts/container_entry.sh daemon &
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
                          /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                            sleep 5
                          done
                      fi
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: tcpx-socket
                      mountPath: /tmp
                    - name: libraries
                      mountPath: /usr/local/nvidia/lib64
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        cpu: "200"
                        memory: "1800Gi"
                        nvidia.com/gpu: 8
                      requests:
                        cpu: "200"
                        memory: "1800Gi"
                        nvidia.com/gpu: 8
    
  2. Apply the manifest:

    kubectl apply -f nccl-jobset-test.yaml
    
  3. Verify that the workload is admitted and that it reaches the Completed state.

  4. To see the results, get the logs for the Pod whose name matches the pattern nccl-ag-worker-0-0-.*:

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
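
The manifest in step 1 uses NUM_NODES as a placeholder in three places: parallelism, completions, and the value of the N_NODES environment variable. The following is a minimal sketch of one way to fill the placeholder in with sed. The temporary file is a stand-in for nccl-jobset-test.yaml that contains only the affected lines, and the node count of 2 is a hypothetical value, not a recommendation.

```shell
# Stand-in for nccl-jobset-test.yaml (assumption): only the lines that
# carry the NUM_NODES placeholder are reproduced here.
MANIFEST=$(mktemp)
cat > "$MANIFEST" <<'EOF'
parallelism: NUM_NODES
completions: NUM_NODES
- name: N_NODES
  value: "NUM_NODES"
EOF

# Hypothetical node count; use the number of nodes you want to test.
NODE_COUNT=2

# Replace every occurrence of the placeholder. The variable name N_NODES
# does not contain the string NUM_NODES, so it is left unchanged.
RENDERED=$(sed "s/NUM_NODES/${NODE_COUNT}/g" "$MANIFEST")
echo "$RENDERED"

rm -f "$MANIFEST"
```

With the real manifest, the same sed expression can be run against nccl-jobset-test.yaml, and its output can be piped to kubectl apply -f - instead of being printed.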
    

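The entrypoint script in the A3 High manifest derives the DNS name of every worker from the Pod's own hostname before it writes /tmp/hostfile. The sketch below reproduces that derivation outside the cluster so that the naming scheme is easier to follow. It uses shell parameter expansion in place of the cut and rev pipeline from the manifest, which should give the same result for hostnames of this shape. The sample hostname is an illustrative assumption; inside the Pod the value comes from the hostname command.

```shell
# Illustrative hostname only (assumption). In the Pod, this value is the
# output of `hostname`.
SAMPLE_HOSTNAME="nccl-ag-worker-0-3.nccl-ag.default.svc.cluster.local"
N_NODES=2

# Everything after the first dot: the DNS suffix that all workers share.
POSTFIX="${SAMPLE_HOSTNAME#*.}"

# First DNS label, then the same label without its trailing "-<index>".
FIRST_LABEL="${SAMPLE_HOSTNAME%%.*}"
WORKERS_BASENAME="${FIRST_LABEL%-*}"

# Build one peer name per node, as the loop in the manifest does.
i=0
LAST_PEER=""
while [ "$i" -lt "$N_NODES" ]; do
  LAST_PEER="${WORKERS_BASENAME}-${i}.${POSTFIX}"
  echo "${LAST_PEER} port=222 slots=8"
  i=$((i + 1))
done
```

Each printed line has the same form as an entry that the manifest appends to /tmp/hostfile: the peer's DNS name, the SSH port, and eight slots for the eight GPUs on the node.
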
What's next