This page explains how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.
You can use the tests described on this page in the following scenarios:
- If your GKE cluster uses flex-start nodes, run a basic test on two nodes.
- If your GKE cluster uses other node types, such as on-demand or reservation-bound nodes, use an NCCL test with topology-aware scheduling.
Before you begin
The tests on this page use JobSet and Kueue with topology-aware scheduling (TAS). Before you run any tests, you must configure your cluster and complete the following steps:
Install Kueue.
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
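Before you continue, it can help to confirm that the Kueue controller is ready. The following is a minimal sketch, not an official procedure: it assumes that the release manifest creates a Deployment named kueue-controller-manager in the kueue-system namespace, and the default timeout of 5m is an arbitrary illustrative value.

```shell
#!/usr/bin/env bash
# Wait until the Kueue controller Deployment reports the Available condition.
# Assumptions (not stated on this page): the Deployment is named
# kueue-controller-manager and lives in the kueue-system namespace.
wait_for_kueue() {
  timeout="${1:-5m}"   # illustrative default, adjust as needed
  kubectl wait deploy/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=available \
    --timeout="${timeout}"
}

# Usage: wait_for_kueue 10m
```

The function only wraps kubectl wait, so it returns a non-zero status if the controller does not become available within the timeout.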
Configure your cluster with JobSet and Kueue
After you install JobSet and Kueue, do the following:
Save the following manifest as kueue-config.yaml:
A3 High
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-high-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: "gke-default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-high-dws-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: "gke-default"
  tolerations:
  - key: "cloud.google.com/gke-queued"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  podSetUpdates:
  - key: autoscaling.gke.io/provisioning-request
    valueFromProvisioningClassDetail: ResizeRequestName
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cq-tas
spec:
  namespaceSelector: {}
  clusterQueueingStrategy: BestEffortFIFO
  resourceGroups:
  - flavors:
    - name: a3-high-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
    - name: a3-high-dws-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
  admissionChecksStrategy:
    admissionChecks:
    - name: "dws-prov"
      onFlavors: [a3-high-dws-flavor]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: lq-tas
spec:
  clusterQueue: cq-tas
A3 Mega
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-mega-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
  topologyName: "gke-default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-mega-dws-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
  topologyName: "gke-default"
  tolerations:
  - key: "cloud.google.com/gke-queued"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  podSetUpdates:
  - key: autoscaling.gke.io/provisioning-request
    valueFromProvisioningClassDetail: ResizeRequestName
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cq-tas
spec:
  namespaceSelector: {}
  clusterQueueingStrategy: BestEffortFIFO
  resourceGroups:
  - flavors:
    - name: a3-mega-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
    - name: a3-mega-dws-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
  admissionChecksStrategy:
    admissionChecks:
    - name: "dws-prov"
      onFlavors: [a3-mega-dws-flavor]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: lq-tas
spec:
  clusterQueue: cq-tas
Apply the manifest:
kubectl apply -f kueue-config.yaml
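After you apply the manifest, you might want to confirm that the objects exist. The following sketch lists them by the names used in kueue-config.yaml (gke-default, cq-tas, and lq-tas in the default namespace). The function name check_kueue_objects is hypothetical, and the fully qualified resource names assume that the Kueue CRDs are installed under the kueue.x-k8s.io group.

```shell
#!/usr/bin/env bash
# List the Kueue objects that kueue-config.yaml defines.
# check_kueue_objects is an illustrative helper name, not part of Kueue.
# The chain stops at the first kubectl call that fails.
check_kueue_objects() {
  kubectl get topologies.kueue.x-k8s.io gke-default &&
  kubectl get resourceflavors.kueue.x-k8s.io &&
  kubectl get clusterqueues.kueue.x-k8s.io cq-tas &&
  kubectl get localqueues.kueue.x-k8s.io lq-tas --namespace default
}

# Usage: check_kueue_objects
```

The ResourceFlavor names differ between the A3 High and A3 Mega manifests, so the sketch lists all flavors instead of naming one.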
When you run workloads with TAS enabled, you can specify how strictly topology constraints are enforced by using one of the following annotations in your workload manifest:
kueue.x-k8s.io/podset-required-topology: if you use this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation to make sure that Pods are placed together for optimal performance.
kueue.x-k8s.io/podset-preferred-topology: if you use this annotation, Kueue tries to schedule the Pods within the requested topology constraint, but if that isn't possible, it admits the workload without satisfying the topology constraint.
Note: Avoid using the required mode with DWS flex-start. Because flex-start provisions nodes dynamically, the resulting nodes might not meet strict topology requirements, which can leave workloads unschedulable.
For these configurations, use podset-preferred-topology instead.
For either annotation, specify one of the following values as the topology constraint:
cloud.google.com/gce-topology-block: schedules Pods in the same network block.
cloud.google.com/gce-topology-subblock: schedules Pods in the same rack.
cloud.google.com/gce-topology-host: schedules Pods on the same physical host.
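To make the combination of annotation and value concrete, the following sketch prints the annotation line for a given mode and topology level. The helper name topology_annotation and its two arguments are hypothetical conveniences; only the annotation keys and label values come from this page.

```shell
#!/usr/bin/env bash
# Print a Kueue TAS annotation line for a Pod template.
# Arguments (illustrative): mode = required | preferred,
#                           level = block | subblock | host.
topology_annotation() {
  mode="$1"
  level="$2"
  case "$mode" in
    required|preferred) ;;
    *) echo "mode must be 'required' or 'preferred'" >&2; return 1 ;;
  esac
  case "$level" in
    block|subblock|host) ;;
    *) echo "level must be 'block', 'subblock', or 'host'" >&2; return 1 ;;
  esac
  printf 'kueue.x-k8s.io/podset-%s-topology: "cloud.google.com/gce-topology-%s"\n' \
    "$mode" "$level"
}

# Usage: topology_annotation preferred block
```

For flex-start nodes, calling the helper with preferred matches the recommendation in the note above.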
Test on two flex-start nodes
To run NCCL tests on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps. This procedure uses a JobSet manifest to run an NCCL test on two nodes.
Save the following manifest as nccl-tas-jobset.yaml:
A3 Mega
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-configmap
data:
  allgather.sh: |
    #!/bin/bash
    service ssh restart;
    /scripts/init_ssh.sh ${@};
    pushd /scripts;
    /scripts/gen_hostfiles.sh ${@};
    popd;
    # Set up environment variables for GPUDirect-TCPXO
    export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
    export NCCL_FASTRAK_CTRL_DEV=eth0
    export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring,Tree
    export NCCL_PROTO=Simple
    export NCCL_NET_GDR_LEVEL=PIX
    # Run the benchmark
    /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-tas-test
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"},
                  {"interfaceName":"eth5","network":"vpc4"},
                  {"interfaceName":"eth6","network":"vpc5"},
                  {"interfaceName":"eth7","network":"vpc6"},
                  {"interfaceName":"eth8","network":"vpc7"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
            tolerations:
            - key: cloud.google.com/gke-queued
              effect: NoSchedule
              value: "true"
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: proc
              hostPath:
                path: /proc
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            - name: nccl-config
              configMap:
                name: nccl-configmap
                defaultMode: 0755
            containers:
            - name: nccl-test
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
              stdin: true
              tty: true
              securityContext:
                privileged: true
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: shared-memory
                mountPath: /dev/shm
              - name: nccl-config
                mountPath: /configs
              resources:
                limits:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
            - name: tcpxo-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
              imagePullPolicy: Always
              command: ["/bin/sh", "-c"]
              args:
              - |
                set -ex
                chmod 755 /fts/entrypoint_rxdm_container.sh
                /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
              securityContext:
                privileged: true
                capabilities:
                  add:
                  - NET_ADMIN
                  - NET_BIND_SERVICE
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia/lib64
              - name: proc
                mountPath: /proc
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
A3 High
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-config
data:
  allgather.sh: |
    #!/bin/bash
    for script in /configs/*; do
      name=$(basename $script)
      cp $script "/scripts/$name"
      chmod +x "/scripts/$name"
    done
    /scripts/init_ssh.sh ${@};
    pushd /scripts;
    /scripts/gen_hostfiles.sh ${@};
    popd;
    /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${#};
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-tas-test
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"}
                ]
          spec:
            terminationGracePeriodSeconds: 0
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
            tolerations:
            - key: cloud.google.com/gke-queued
              effect: NoSchedule
              value: "true"
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            containers:
            - name: tcpx-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
              command:
              - /tcpgpudmarxd/build/app/tcpgpudmarxd
              - --gpu_nic_preset
              - a3vm
              - --gpu_shmem_type
              - fd
              - --uds_path
              - /run/tcpx
              - --setup_param
              - "--verbose 128 2 0 "
              securityContext:
                privileged: true
                capabilities:
                  add:
                  - NET_ADMIN
              volumeMounts:
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: tcpx-socket
                mountPath: /run/tcpx
              - name: sys
                mountPath: /hostsysfs
              - name: proc-sys
                mountPath: /hostprocsysfs
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
            - name: nccl-test
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
              command:
              - bash
              - -c
              - |
                /scripts/container_entry.sh daemon;
                sleep infinity;
              securityContext:
                privileged: true
              volumeMounts:
              - name: tcpx-socket
                mountPath: /tmp
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: nccl-config
                mountPath: /configs
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
            volumes:
            - name: libraries
              hostPath:
                path: /home/kubernetes/bin/nvidia/lib64
            - name: tcpx-socket
              emptyDir: {}
            - name: sys
              hostPath:
                path: /sys
            - name: proc-sys
              hostPath:
                path: /proc/sys
            - name: shared-memory
              emptyDir:
                medium: Memory
                sizeLimit: 250Gi
            - name: nccl-config
              configMap:
                name: nccl-config
                defaultMode: 0777
Apply the manifest to your cluster:
kubectl apply -f nccl-tas-jobset.yaml
Verify that the JobSet is admitted and running:
kubectl get jobset nccl-tas-test
Wait until the JobSet is unsuspended and the Pods reach the Running state.
Trigger the NCCL test by running the allgather.sh script from the first worker Pod:
kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0
The output of a two-node test looks similar to the following:
A3 Mega
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
...
  8589934592     134217728     float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.7587
A3 High
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
...
   536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293
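When you read this output, two quick checks can be useful. For all-gather, the nccl-tests suite derives bus bandwidth from algorithm bandwidth as busbw = algbw × (n − 1) / n, where n is the number of ranks. The sketch below assumes 16 ranks (two nodes with 8 GPUs each) and reproduces the last A3 Mega row; it also extracts the average bus bandwidth from a saved log. The helper names allgather_busbw and avg_busbw are hypothetical.

```shell
#!/usr/bin/env bash
# Compute all-gather bus bandwidth from algorithm bandwidth.
# Formula from nccl-tests: busbw = algbw * (n - 1) / n, with n ranks.
# Arguments (illustrative): $1 = algbw in GB/s, $2 = number of ranks.
allgather_busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * (n - 1) / n }'
}

# Read an nccl-tests log on stdin and print the value from the
# "# Avg bus bandwidth : <value>" summary line.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage:
#   allgather_busbw 201.63 16
#   avg_busbw < nccl-test.log    # nccl-test.log is a hypothetical file name
```

The average on the summary line covers every message size in the sweep, including small messages with low throughput, which is why it sits well below the bandwidth reported for the largest message.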
Deploy an NCCL test workload with TAS
If you have more than two nodes, we recommend the following test, which uses topology-aware scheduling (TAS). To run NCCL tests with TAS on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, complete the following steps.
Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool:
A3 Mega
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-ag
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"},
                  {"interfaceName":"eth5","network":"vpc4"},
                  {"interfaceName":"eth6","network":"vpc5"},
                  {"interfaceName":"eth7","network":"vpc6"},
                  {"interfaceName":"eth8","network":"vpc7"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
            tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: proc
              hostPath:
                path: /proc
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            containers:
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
              securityContext:
                privileged: true
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: NCCL_SOCKET_IFNAME
                value: eth0
              - name: NCCL_FASTRAK_CTRL_DEV
                value: eth0
              - name: NCCL_FASTRAK_IFNAME
                value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
              - name: NCCL_CROSS_NIC
                value: "0"
              - name: NCCL_ALGO
                value: Ring,Tree
              - name: NCCL_PROTO
                value: Simple
              - name: NCCL_NET_GDR_LEVEL
                value: PIX
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                set -x
                /scripts/container_entry.sh daemon &
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX
                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                done
                if [[ "${NODE_RANK}" -eq "0" ]]; then
                  export NCCL_TESTS_SPLIT_MASK="0x0";
                  ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                  mpirun --hostfile /tmp/hostfile \
                    -x $ENV_VARS \
                    -mca plm_rsh_no_tree_spawn 1 \
                    --mca orte_keep_fqdn_hostnames 1 \
                    --mca btl self,tcp \
                    --mca btl_tcp_if_include eth0 \
                    --bind-to none \
                    --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                    /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi
                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: lib64
                mountPath: /lib64
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
            - name: tcpxo-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
              imagePullPolicy: Always
              command:
              - bash
              - -c
              - |
                /usr/bin/tcpxo_daemon
              securityContext:
                privileged: true
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: proc
                mountPath: /proc
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
A3 High
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-ag
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
            tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: proc
              hostPath:
                path: /proc
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: libraries
              hostPath:
                path: /home/kubernetes/bin/nvidia/lib64
            - name: tcpx-socket
              emptyDir: {}
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            containers:
            - name: tcpx-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
              command:
              - /tcpgpudmarxd/build/app/tcpgpudmarxd
              - --gpu_nic_preset
              - a3vm
              - --uds_path
              - /run/tcpx
              securityContext:
                privileged: true
              volumeMounts:
              - name: tcpx-socket
                mountPath: /run/tcpx
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
              securityContext:
                privileged: true
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                /scripts/container_entry.sh daemon &
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX
                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                done
                if [[ "${NODE_RANK}" -eq "0" ]]; then
                  /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi
                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: tcpx-socket
                mountPath: /tmp
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
Apply the manifest:
kubectl apply -f nccl-jobset-test.yaml
Verify that the workload is admitted and that it reaches the Completed state.
Fetch the logs of the Pod that matches nccl-ag-worker-0-0-.* to view the results:
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
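The startup script in the manifest derives every peer's hostname from the Pod's own fully qualified name before it writes /tmp/hostfile. The following sketch isolates that naming logic so that you can see which entries a given Pod would generate. It uses shell parameter expansion in place of the cut and rev pipeline, skips the SSH readiness wait, and takes the hostname as an argument; the function name peer_hostfile and the example hostname are hypothetical.

```shell
#!/usr/bin/env bash
# Print the hostfile lines that a worker would generate.
# Arguments (illustrative): $1 = this Pod's FQDN, $2 = number of nodes.
peer_hostfile() {
  fqdn="$1"
  n_nodes="$2"
  short="${fqdn%%.*}"    # for example nccl-ag-worker-0-3
  base="${short%-*}"     # drop the trailing completion index
  postfix="${fqdn#*.}"   # everything after the first dot
  i=0
  while [ "$i" -lt "$n_nodes" ]; do
    # Same line format as the manifest: SSH port 222, 8 slots per node.
    printf '%s-%s.%s port=222 slots=8\n' "$base" "$i" "$postfix"
    i=$((i + 1))
  done
}

# Usage (hypothetical hostname): peer_hostfile nccl-ag-worker-0-1.nccl-ag 2
```

Because every Pod computes the same list, the Pod with completion index 0 can launch the benchmark across all nodes while the other Pods only wait for it to finish.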
What's next
- See Collect and understand NCCL logs for troubleshooting to interpret test results and resolve issues.
- Learn how to troubleshoot slow performance.