This page describes how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.
You can use the tests described on this page in the following scenarios:
- If your GKE cluster uses flex-start nodes, run a basic test on two nodes.
- If your GKE cluster uses different node types, such as on-demand or reservation-bound nodes, run an NCCL test with Topology Aware Scheduling.
Before you begin
The tests on this page use JobSet and Kueue with Topology Aware Scheduling. Before you run the tests, you must set up your cluster and do the following:
Install Kueue:

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
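Before you apply any Kueue objects, it can help to confirm that the Kueue controller is available, because applying them while the controller is still starting can fail. The following is a minimal sketch, not part of the original procedure: the function name wait_for_kueue is hypothetical, and the Deployment name kueue-controller-manager and the namespace kueue-system are assumed to be the defaults from the upstream release manifest. Adjust them if your installation differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the Kueue controller Deployment reports Available.
# Assumption: default names from the upstream release manifest
# (Deployment "kueue-controller-manager" in namespace "kueue-system").
wait_for_kueue() {
  local timeout="${1:-5m}"   # how long kubectl waits before giving up
  kubectl wait deploy/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=available \
    --timeout="${timeout}"
}
```

For example, `wait_for_kueue 5m` returns a non-zero exit code if the controller is not available within five minutes.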
Set up the cluster with JobSet and Kueue
After you install JobSet and Kueue, complete the following steps:
Save the following manifest as kueue-config.yaml:

A3 High

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-high-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-high-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-high-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas

A3 Mega

    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-mega-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-mega-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-mega-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas

Apply the manifest:

    kubectl apply -f kueue-config.yaml
When you run workloads with Topology Aware Scheduling enabled, you can specify how strictly topology constraints are enforced. To do so, use one of the following annotations in the workload manifest:
- kueue.x-k8s.io/podset-required-topology: With this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation to ensure that Pods are placed together for optimal performance.
- kueue.x-k8s.io/podset-preferred-topology: With this annotation, Kueue tries to place Pods within the requested topology constraint. If that isn't possible, Kueue admits the workload without satisfying the topology constraints.
Note: Avoid using the required mode with DWS flex-start. Because flex-start provisions nodes dynamically, the resulting nodes might not satisfy strict topology requirements, which can lead to unschedulable workloads. For these configurations, use podset-preferred-topology instead.
For both annotations, specify one of the following values as the topology constraint:
- cloud.google.com/gce-topology-block: Schedules Pods within the same network block.
- cloud.google.com/gce-topology-subblock: Schedules Pods within the same rack.
- cloud.google.com/gce-topology-host: Schedules Pods on the same physical host.
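To make the combination of annotation key and value concrete, the following sketch prints the annotation line for a chosen mode and topology level. The helper name topology_annotation and its short argument names (required or preferred; block, subblock, or host) are hypothetical. The function only assembles the string from the names listed above and does not contact the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Kueue topology annotation for a workload manifest.
#   $1: mode  -> required | preferred
#   $2: level -> block | subblock | host
topology_annotation() {
  local mode="$1" level="$2"
  case "$mode" in
    required|preferred) ;;
    *) echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
  case "$level" in
    block|subblock|host) ;;
    *) echo "unknown level: ${level}" >&2; return 1 ;;
  esac
  printf 'kueue.x-k8s.io/podset-%s-topology: "cloud.google.com/gce-topology-%s"\n' \
    "$mode" "$level"
}
```

For example, `topology_annotation preferred block` prints the annotation that suits flex-start nodes, which should not use the required mode.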
Test on two flex-start nodes
To run NCCL tests on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps. This procedure uses a JobSet manifest to run an NCCL test on two nodes.
Save the following manifest as nccl-tas-jobset.yaml:

A3 Mega

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-configmap
    data:
      allgather.sh: |
        #!/bin/bash
        service ssh restart;
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        # Set up environment variables for GPUDirect-TCPXO
        export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
        export NCCL_FASTRAK_CTRL_DEV=eth0
        export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
        export NCCL_SOCKET_IFNAME=eth0
        export NCCL_CROSS_NIC=0
        export NCCL_ALGO=Ring,Tree
        export NCCL_PROTO=Simple
        export NCCL_NET_GDR_LEVEL=PIX
        # Run the benchmark
        /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"},
                      {"interfaceName":"eth5","network":"vpc4"},
                      {"interfaceName":"eth6","network":"vpc5"},
                      {"interfaceName":"eth7","network":"vpc6"},
                      {"interfaceName":"eth8","network":"vpc7"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: lib64
                  hostPath:
                    path: /lib64
                - name: proc
                  hostPath:
                    path: /proc
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-configmap
                    defaultMode: 0755
                containers:
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
                  stdin: true
                  tty: true
                  securityContext:
                    privileged: true
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: shared-memory
                    mountPath: /dev/shm
                  - name: nccl-config
                    mountPath: /configs
                  resources:
                    limits:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                - name: tcpxo-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
                  imagePullPolicy: Always
                  command: ["/bin/sh", "-c"]
                  args:
                  - |
                    set -ex
                    chmod 755 /fts/entrypoint_rxdm_container.sh
                    /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                      - NET_ADMIN
                      - NET_BIND_SERVICE
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia/lib64
                  - name: proc
                    mountPath: /proc
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64

A3 High

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-config
    data:
      allgather.sh: |
        #!/bin/bash
        for script in /configs/*; do
          name=$(basename $script)
          cp $script "/scripts/$name"
          chmod +x "/scripts/$name"
        done
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${#};
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                terminationGracePeriodSeconds: 0
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                  - /tcpgpudmarxd/build/app/tcpgpudmarxd
                  - --gpu_nic_preset
                  - a3vm
                  - --gpu_shmem_type
                  - fd
                  - --uds_path
                  - /run/tcpx
                  - --setup_param
                  - "--verbose 128 2 0 "
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                      - NET_ADMIN
                  volumeMounts:
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: tcpx-socket
                    mountPath: /run/tcpx
                  - name: sys
                    mountPath: /hostsysfs
                  - name: proc-sys
                    mountPath: /hostprocsysfs
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  command:
                  - bash
                  - -c
                  - |
                    /scripts/container_entry.sh daemon;
                    sleep infinity;
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: tcpx-socket
                    mountPath: /tmp
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: nccl-config
                    mountPath: /configs
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                volumes:
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: sys
                  hostPath:
                    path: /sys
                - name: proc-sys
                  hostPath:
                    path: /proc/sys
                - name: shared-memory
                  emptyDir:
                    medium: Memory
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-config
                    defaultMode: 0777

Apply the manifest to your cluster:

    kubectl apply -f nccl-tas-jobset.yaml

Check that the JobSet is admitted and running:

    kubectl get jobset nccl-tas-test

Wait until the JobSet is unsuspended and the Pods reach the Running status.

Trigger the NCCL test by running the allgather.sh script from the first worker Pod:

    kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0

The output for a two-node test is similar to the following:
A3 Mega

    #                                                               out-of-place                      in-place
    #       size        count     type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)   (elements)                              (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               0            0    float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
    ...
      8589934592    134217728    float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 45.7587

A3 High

    #                                                               out-of-place                      in-place
    #       size        count     type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)   (elements)                              (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         1048576        16384    float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
    ...
       536870912      8388608    float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 29.8293
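When you compare runs, the summary line "Avg bus bandwidth" is usually the first figure to look at. The following sketch extracts that value from saved test output and compares it against a minimum. The helper names avg_busbw and busbw_at_least are hypothetical, both read the log text from standard input, and any threshold that you pass is your own assumption rather than a documented target.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking NCCL test output; both read log text on stdin.

# Print the value from the "# Avg bus bandwidth : <value>" summary line.
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Succeed if the average bus bandwidth (GB/s) is at least the given minimum.
busbw_at_least() {
  local min="$1" value
  value="$(avg_busbw)"
  if [ -z "$value" ]; then
    echo "no 'Avg bus bandwidth' line found" >&2
    return 1
  fi
  # Force a numeric comparison in awk; exit status 0 means value >= min.
  awk -v v="$value" -v min="$min" 'BEGIN { exit !(v + 0 >= min + 0) }'
}
```

For example, if you saved the test output to a file named nccl.log (an illustrative name), `busbw_at_least 25 < nccl.log` succeeds only when the reported average is 25 GB/s or higher.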
Deploy an NCCL test workload with Topology Aware Scheduling
If you have more than two nodes, we recommend the following test, which uses Topology Aware Scheduling. To run NCCL tests with Topology Aware Scheduling on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps.
Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool:

A3 Mega

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        template:
          spec:
            parallelism: NUM_NODES
            completions: NUM_NODES
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"},
                      {"interfaceName":"eth5","network":"vpc4"},
                      {"interfaceName":"eth6","network":"vpc5"},
                      {"interfaceName":"eth7","network":"vpc6"},
                      {"interfaceName":"eth8","network":"vpc7"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                tolerations:
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: proc
                  hostPath:
                    path: /proc
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: lib64
                  hostPath:
                    path: /lib64
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                containers:
                - name: nccl-test
                  stdin: true
                  tty: true
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
                  securityContext:
                    privileged: true
                  env:
                  - name: MY_NODE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: spec.nodeName
                  - name: OMPI_ALLOW_RUN_AS_ROOT
                    value: "1"
                  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                    value: "1"
                  - name: N_NODES
                    value: "NUM_NODES"
                  - name: NCCL_SOCKET_IFNAME
                    value: eth0
                  - name: NCCL_FASTRAK_CTRL_DEV
                    value: eth0
                  - name: NCCL_FASTRAK_IFNAME
                    value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
                  - name: NCCL_CROSS_NIC
                    value: "0"
                  - name: NCCL_ALGO
                    value: Ring,Tree
                  - name: NCCL_PROTO
                    value: Simple
                  - name: NCCL_NET_GDR_LEVEL
                    value: PIX
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  command:
                  - bash
                  - -c
                  - |
                    set -x
                    /scripts/container_entry.sh daemon &
                    export POSTFIX=$(hostname | cut -d . -f 2-)
                    export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                    export NODE_RANK=$JOB_COMPLETION_INDEX
                    for i in `seq 0 $(($N_NODES-1))`; do
                      OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                      until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                        sleep 10
                      done
                      echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                    done
                    if [[ "${NODE_RANK}" -eq "0" ]]; then
                      export NCCL_TESTS_SPLIT_MASK="0x0";
                      ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                      mpirun --hostfile /tmp/hostfile \
                        -x $ENV_VARS \
                        -mca plm_rsh_no_tree_spawn 1 \
                        --mca orte_keep_fqdn_hostnames 1 \
                        --mca btl self,tcp \
                        --mca btl_tcp_if_include eth0 \
                        --bind-to none \
                        --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                        /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                    else
                      while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                        sleep 5
                      done
                    fi
                    exit 0
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: lib64
                    mountPath: /lib64
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                - name: tcpxo-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
                  imagePullPolicy: Always
                  command:
                  - bash
                  - -c
                  - |
                    /usr/bin/tcpxo_daemon
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: proc
                    mountPath: /proc
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64

A3 High

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        template:
          spec:
            parallelism: NUM_NODES
            completions: NUM_NODES
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: proc
                  hostPath:
                    path: /proc
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                  - /tcpgpudmarxd/build/app/tcpgpudmarxd
                  - --gpu_nic_preset
                  - a3vm
                  - --uds_path
                  - /run/tcpx
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: tcpx-socket
                    mountPath: /run/tcpx
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                - name: nccl-test
                  stdin: true
                  tty: true
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  securityContext:
                    privileged: true
                  env:
                  - name: MY_NODE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: spec.nodeName
                  - name: OMPI_ALLOW_RUN_AS_ROOT
                    value: "1"
                  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                    value: "1"
                  - name: N_NODES
                    value: "NUM_NODES"
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  command:
                  - bash
                  - -c
                  - |
                    /scripts/container_entry.sh daemon &
                    export POSTFIX=$(hostname | cut -d . -f 2-)
                    export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                    export NODE_RANK=$JOB_COMPLETION_INDEX
                    for i in `seq 0 $(($N_NODES-1))`; do
                      OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                      until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                        sleep 10
                      done
                      echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                    done
                    if [[ "${NODE_RANK}" -eq "0" ]]; then
                      /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                    else
                      while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                        sleep 5
                      done
                    fi
                    exit 0
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: tcpx-socket
                    mountPath: /tmp
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8

Apply the manifest:

    kubectl apply -f nccl-jobset-test.yaml

Check that the workload is admitted and reaches the Completed status.

To see the results, get the logs for the Pod that matches nccl-ag-worker-0-0-.*:

    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
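The NUM_NODES placeholder appears three times in the nccl-jobset-test.yaml manifest (parallelism, completions, and the N_NODES environment variable), so replacing it by hand is easy to get wrong. The following sketch performs the substitution in one pass. The helper name render_manifest is hypothetical; it reads the manifest from standard input and writes the result to standard output, which leaves the saved file unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace every NUM_NODES placeholder in a manifest read
# from stdin with the given node count and write the result to stdout.
render_manifest() {
  local num_nodes="$1"
  # Reject anything that is not a positive integer before touching the manifest.
  if ! [[ "$num_nodes" =~ ^[1-9][0-9]*$ ]]; then
    echo "usage: render_manifest <positive integer>" >&2
    return 1
  fi
  sed "s/NUM_NODES/${num_nodes}/g"
}
```

For example, for a node pool with four nodes you could run `render_manifest 4 < nccl-jobset-test.yaml | kubectl apply -f -` instead of editing the file first.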
What's next
- To learn how to collect and analyze NCCL logs for troubleshooting, see Collect and analyze NCCL logs for troubleshooting.
- For information about troubleshooting slow performance