This page describes how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.
You can use the tests described on this page in the following scenarios:
- If your GKE cluster uses flex-start nodes, run a basic test on two nodes.
- If your GKE cluster uses other node types, such as on-demand or reservation-bound nodes, use an NCCL test with Topology Aware Scheduling.
Before you begin
The tests on this page use JobSet and Kueue with Topology Aware Scheduling (TAS). Before you run any test, you must set up your cluster and do the following:
Install Kueue:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
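Before you continue, it can help to confirm that the Kueue controller is available. The following is a minimal sketch, not part of the official procedure. It assumes that the release manifest installs Kueue into the kueue-system namespace with a Deployment named kueue-controller-manager (the upstream defaults, which you should verify for your version). The function name wait_for_kueue and the default timeout are hypothetical.

```shell
# Assumption: upstream Kueue defaults for the namespace and Deployment name.
KUEUE_NAMESPACE="${KUEUE_NAMESPACE:-kueue-system}"
KUEUE_DEPLOYMENT="${KUEUE_DEPLOYMENT:-kueue-controller-manager}"

# Hypothetical helper: block until the Kueue controller Deployment reports
# the Available condition, or until the timeout (default 300s) expires.
wait_for_kueue() {
  kubectl wait "deployment/${KUEUE_DEPLOYMENT}" \
    --namespace "${KUEUE_NAMESPACE}" \
    --for=condition=Available \
    --timeout="${1:-300s}"
}
```

For example, run wait_for_kueue 120s after the kubectl apply command returns.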
Configure your cluster with JobSet and Kueue
After you install JobSet and Kueue, follow these steps:
Save the following manifest as kueue-config.yaml:
A3 High
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-high-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: "gke-default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-high-dws-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: "gke-default"
  tolerations:
  - key: "cloud.google.com/gke-queued"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  podSetUpdates:
  - key: autoscaling.gke.io/provisioning-request
    valueFromProvisioningClassDetail: ResizeRequestName
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cq-tas
spec:
  namespaceSelector: {}
  clusterQueueingStrategy: BestEffortFIFO
  resourceGroups:
  - flavors:
    - name: a3-high-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
    - name: a3-high-dws-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
  admissionChecksStrategy:
    admissionChecks:
    - name: "dws-prov"
      onFlavors: [a3-high-dws-flavor]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: lq-tas
spec:
  clusterQueue: cq-tas
A3 Mega
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-mega-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
  topologyName: "gke-default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a3-mega-dws-flavor
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
  topologyName: "gke-default"
  tolerations:
  - key: "cloud.google.com/gke-queued"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: AdmissionCheck
metadata:
  name: dws-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: dws-config
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ProvisioningRequestConfig
metadata:
  name: dws-config
spec:
  provisioningClassName: queued-provisioning.gke.io
  podSetUpdates:
  - key: autoscaling.gke.io/provisioning-request
    valueFromProvisioningClassDetail: ResizeRequestName
  managedResources:
  - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cq-tas
spec:
  namespaceSelector: {}
  clusterQueueingStrategy: BestEffortFIFO
  resourceGroups:
  - flavors:
    - name: a3-mega-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
    - name: a3-mega-dws-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 1000Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 1000
  admissionChecksStrategy:
    admissionChecks:
    - name: "dws-prov"
      onFlavors: [a3-mega-dws-flavor]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: default
  name: lq-tas
spec:
  clusterQueue: cq-tas
Apply the manifest:
kubectl apply -f kueue-config.yaml
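To check that every object in kueue-config.yaml was created, you can query each one by name. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the object names are taken from the manifest above, and the plural resource names (such as topologies and provisioningrequestconfigs) are assumed to match the CRDs that your Kueue version installs.

```shell
# Print the objects that kueue-config.yaml defines, one TYPE.GROUP/NAME per line.
# The argument is the flavor prefix used in the manifest: a3-high or a3-mega.
expected_kueue_objects() {
  prefix="${1:-a3-mega}"
  group="kueue.x-k8s.io"
  printf '%s\n' \
    "topologies.${group}/gke-default" \
    "resourceflavors.${group}/${prefix}-flavor" \
    "resourceflavors.${group}/${prefix}-dws-flavor" \
    "admissionchecks.${group}/dws-prov" \
    "provisioningrequestconfigs.${group}/dws-config" \
    "clusterqueues.${group}/cq-tas" \
    "localqueues.${group}/lq-tas"
}

# Hypothetical helper: query each object and report any that are missing.
# The namespace flag only matters for the LocalQueue; the other kinds are
# cluster-scoped.
check_kueue_objects() {
  expected_kueue_objects "$1" | while read -r object; do
    kubectl get "${object}" --namespace default >/dev/null \
      || echo "missing: ${object}"
  done
}
```

For example, check_kueue_objects a3-high prints nothing when all seven objects exist.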
When you run workloads with TAS enabled, you can specify how strictly topology constraints are enforced by using one of the following annotations in your workload manifest:
- kueue.x-k8s.io/podset-required-topology: If you use this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation to ensure that Pods are placed together for the best performance.
- kueue.x-k8s.io/podset-preferred-topology: If you use this annotation, Kueue tries to schedule Pods within the requested topology constraint, but if that isn't possible, it admits the workload without satisfying the topology constraints.
Note: Avoid using required mode with DWS flex-start. Because flex-start provisions nodes dynamically, the resulting nodes might not satisfy strict topology requirements, which can leave workloads unschedulable.
For these configurations, use podset-preferred-topology instead.
For either annotation, specify one of the following values as the topology constraint:
- cloud.google.com/gce-topology-block: Schedules Pods within the same network block.
- cloud.google.com/gce-topology-subblock: Schedules Pods within the same rack.
- cloud.google.com/gce-topology-host: Schedules Pods on the same physical host.
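The annotation keys and label values listed above combine into a single line in the Pod template's annotations. The following sketch builds that line and lists the topology labels on your nodes so that you can see which nodes share a block, subblock, or host. The function names topology_annotation and show_node_topology are hypothetical helpers, not part of Kueue or GKE.

```shell
# Build a Kueue topology annotation line for a Pod template.
#   $1: mode, either "required" or "preferred"
#   $2: level, one of "block", "subblock", or "host"
topology_annotation() {
  case "$1" in
    required|preferred) ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  case "$2" in
    block|subblock|host) ;;
    *) echo "unknown level: $2" >&2; return 1 ;;
  esac
  printf 'kueue.x-k8s.io/podset-%s-topology: "cloud.google.com/gce-topology-%s"\n' "$1" "$2"
}

# Hypothetical helper: list nodes with their topology labels as extra columns.
show_node_topology() {
  kubectl get nodes \
    --label-columns=cloud.google.com/gce-topology-block,cloud.google.com/gce-topology-subblock,cloud.google.com/gce-topology-host
}
```

For flex-start nodes, topology_annotation preferred block produces the annotation that the two-node manifests on this page use.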
Test on two flex-start nodes
To run NCCL tests on a GKE cluster that uses A3 Mega or A3 High flex-start VMs, follow this procedure. This procedure uses a JobSet manifest to run an NCCL test on two nodes.
Save the following manifest as nccl-tas-jobset.yaml:
A3 Mega
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-configmap
data:
  allgather.sh: |
    #!/bin/bash
    service ssh restart;
    /scripts/init_ssh.sh ${@};
    pushd /scripts;
    /scripts/gen_hostfiles.sh ${@};
    popd;
    # Set up environment variables for GPUDirect-TCPXO
    export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
    export NCCL_FASTRAK_CTRL_DEV=eth0
    export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_CROSS_NIC=0
    export NCCL_ALGO=Ring,Tree
    export NCCL_PROTO=Simple
    export NCCL_NET_GDR_LEVEL=PIX
    # Run the benchmark
    /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-tas-test
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"},
                  {"interfaceName":"eth5","network":"vpc4"},
                  {"interfaceName":"eth6","network":"vpc5"},
                  {"interfaceName":"eth7","network":"vpc6"},
                  {"interfaceName":"eth8","network":"vpc7"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
            tolerations:
            - key: cloud.google.com/gke-queued
              effect: NoSchedule
              value: "true"
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: proc
              hostPath:
                path: /proc
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            - name: nccl-config
              configMap:
                name: nccl-configmap
                defaultMode: 0755
            containers:
            - name: nccl-test
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
              stdin: true
              tty: true
              securityContext:
                privileged: true
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: shared-memory
                mountPath: /dev/shm
              - name: nccl-config
                mountPath: /configs
              resources:
                limits:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
            - name: tcpxo-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
              imagePullPolicy: Always
              command: ["/bin/sh", "-c"]
              args:
              - |
                set -ex
                chmod 755 /fts/entrypoint_rxdm_container.sh
                /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
              securityContext:
                privileged: true
                capabilities:
                  add:
                  - NET_ADMIN
                  - NET_BIND_SERVICE
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia/lib64
              - name: proc
                mountPath: /proc
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
A3 High
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-config
data:
  allgather.sh: |
    #!/bin/bash
    for script in /configs/*; do
      name=$(basename $script)
      cp $script "/scripts/$name"
      chmod +x "/scripts/$name"
    done
    /scripts/init_ssh.sh ${@};
    pushd /scripts;
    /scripts/gen_hostfiles.sh ${@};
    popd;
    /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${#};
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-tas-test
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    replicas: 2
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"}
                ]
          spec:
            terminationGracePeriodSeconds: 0
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
            tolerations:
            - key: cloud.google.com/gke-queued
              effect: NoSchedule
              value: "true"
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            containers:
            - name: tcpx-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
              command:
              - /tcpgpudmarxd/build/app/tcpgpudmarxd
              - --gpu_nic_preset
              - a3vm
              - --gpu_shmem_type
              - fd
              - --uds_path
              - /run/tcpx
              - --setup_param
              - "--verbose 128 2 0 "
              securityContext:
                privileged: true
                capabilities:
                  add:
                  - NET_ADMIN
              volumeMounts:
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: tcpx-socket
                mountPath: /run/tcpx
              - name: sys
                mountPath: /hostsysfs
              - name: proc-sys
                mountPath: /hostprocsysfs
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
            - name: nccl-test
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
              command:
              - bash
              - -c
              - |
                /scripts/container_entry.sh daemon;
                sleep infinity;
              securityContext:
                privileged: true
              volumeMounts:
              - name: tcpx-socket
                mountPath: /tmp
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: nccl-config
                mountPath: /configs
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
            volumes:
            - name: libraries
              hostPath:
                path: /home/kubernetes/bin/nvidia/lib64
            - name: tcpx-socket
              emptyDir: {}
            - name: sys
              hostPath:
                path: /sys
            - name: proc-sys
              hostPath:
                path: /proc/sys
            - name: shared-memory
              emptyDir:
                medium: Memory
                sizeLimit: 250Gi
            - name: nccl-config
              configMap:
                name: nccl-config
                defaultMode: 0777
Apply the manifest to the cluster:
kubectl apply -f nccl-tas-jobset.yaml
Verify that the JobSet is admitted and running:
kubectl get jobset nccl-tas-test
Wait for the JobSet to be resumed and for the Pods to reach the Running status.
To trigger the NCCL test, run the allgather.sh script from the first worker Pod:
kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0
The output of a two-node test is similar to the following:
A3 Mega
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
...
  8589934592     134217728     float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.7587
A3 High
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
...
   536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 29.8293
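When you read this output, note how the algbw and busbw columns relate. For all-gather, the nccl-tests performance notes define bus bandwidth as algorithm bandwidth multiplied by (n - 1) / n, where n is the number of ranks. The following sketch reproduces that calculation and extracts the summary line from saved output. The function names are hypothetical, and the rank count assumes one rank per GPU (two nodes with 8 GPUs each).

```shell
# Bus bandwidth for all-gather: busbw = algbw * (n - 1) / n, where n is the
# number of ranks (here, one rank per GPU).
#   $1: algorithm bandwidth in GB/s   $2: number of ranks
allgather_busbw() {
  awk -v algbw="$1" -v ranks="$2" \
    'BEGIN { printf "%.2f\n", algbw * (ranks - 1) / ranks }'
}

# Read nccl-tests output on stdin and print the average bus bandwidth value.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Two nodes with 8 GPUs each give 16 ranks.
allgather_busbw 201.63 16   # prints 189.03, matching the A3 Mega sample row
```

The Avg bus bandwidth line averages over every message size in the run, including the small ones, which is why it is much lower than the bus bandwidth of the largest message.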
Deploy an NCCL test workload with TAS
If you have more than two nodes, we recommend that you use the following test, which uses Topology Aware Scheduling (TAS). To run NCCL tests with TAS on a GKE cluster that uses A3 Mega or A3 High VMs with flex-start, follow this procedure.
Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool:
A3 Mega
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-ag
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"},
                  {"interfaceName":"eth5","network":"vpc4"},
                  {"interfaceName":"eth6","network":"vpc5"},
                  {"interfaceName":"eth7","network":"vpc6"},
                  {"interfaceName":"eth8","network":"vpc7"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
            tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: proc
              hostPath:
                path: /proc
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            containers:
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
              securityContext:
                privileged: true
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: NCCL_SOCKET_IFNAME
                value: eth0
              - name: NCCL_FASTRAK_CTRL_DEV
                value: eth0
              - name: NCCL_FASTRAK_IFNAME
                value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
              - name: NCCL_CROSS_NIC
                value: "0"
              - name: NCCL_ALGO
                value: Ring,Tree
              - name: NCCL_PROTO
                value: Simple
              - name: NCCL_NET_GDR_LEVEL
                value: PIX
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                set -x
                /scripts/container_entry.sh daemon &
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX

                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                done

                if [[ "${NODE_RANK}" -eq "0" ]]; then
                  export NCCL_TESTS_SPLIT_MASK="0x0";
                  ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                  mpirun --hostfile /tmp/hostfile \
                    -x $ENV_VARS \
                    -mca plm_rsh_no_tree_spawn 1 \
                    --mca orte_keep_fqdn_hostnames 1 \
                    --mca btl self,tcp \
                    --mca btl_tcp_if_include eth0 \
                    --bind-to none \
                    --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                    /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi

                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: lib64
                mountPath: /lib64
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "3700Gi"
                  nvidia.com/gpu: 8
            - name: tcpxo-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
              imagePullPolicy: Always
              command:
              - bash
              - -c
              - |
                /usr/bin/tcpxo_daemon
              securityContext:
                privileged: true
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: proc
                mountPath: /proc
              env:
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
A3 High
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: nccl-ag
  labels:
    kueue.x-k8s.io/queue-name: lq-tas
spec:
  ttlSecondsAfterFinished: 1200
  suspend: true
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth1","network":"vpc0"},
                  {"interfaceName":"eth2","network":"vpc1"},
                  {"interfaceName":"eth3","network":"vpc2"},
                  {"interfaceName":"eth4","network":"vpc3"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
            tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
            setHostnameAsFQDN: true
            volumes:
            - name: proc
              hostPath:
                path: /proc
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: libraries
              hostPath:
                path: /home/kubernetes/bin/nvidia/lib64
            - name: tcpx-socket
              emptyDir: {}
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            containers:
            - name: tcpx-daemon
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
              command:
              - /tcpgpudmarxd/build/app/tcpgpudmarxd
              - --gpu_nic_preset
              - a3vm
              - --uds_path
              - /run/tcpx
              securityContext:
                privileged: true
              volumeMounts:
              - name: tcpx-socket
                mountPath: /run/tcpx
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
              securityContext:
                privileged: true
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                /scripts/container_entry.sh daemon &
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX

                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                done

                if [[ "${NODE_RANK}" -eq "0" ]]; then
                  /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi

                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: tcpx-socket
                mountPath: /tmp
              - name: libraries
                mountPath: /usr/local/nvidia/lib64
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
                requests:
                  cpu: "200"
                  memory: "1800Gi"
                  nvidia.com/gpu: 8
Apply the manifest:
kubectl apply -f nccl-jobset-test.yaml
Check that the workload is admitted and that it reaches the Completed status.
Retrieve the logs of the Pod that matches nccl-ag-worker-0-0-.* to view the results:
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
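If you prefer to script the last two steps, the following sketch waits for the JobSet to finish and then prints the logs of the first worker Pod. It is an illustration under stated assumptions, not part of the official procedure: the function names are hypothetical, and the wait relies on the JobSet controller setting a status condition of type Completed, which you should confirm for your JobSet version.

```shell
# Filter Pod names on stdin and print the first Pod with completion index 0
# for the given JobSet (default: nccl-ag). The trailing dash avoids matching
# other indexes that share the same prefix.
first_worker_pod() {
  grep "^${1:-nccl-ag}-worker-0-0-" | head -n 1
}

# Hypothetical helper: wait for the JobSet to finish, then print the logs of
# the nccl-test container in the first worker Pod.
# Assumption: the JobSet controller sets a condition of type Completed.
fetch_nccl_results() {
  jobset="${1:-nccl-ag}"
  kubectl wait "jobset/${jobset}" --for=condition=Completed \
    --timeout="${2:-3600s}" || return 1
  pod="$(kubectl get pods --output name | sed 's|^pod/||' | first_worker_pod "${jobset}")"
  kubectl logs "${pod}" --container nccl-test
}
```

Because the manifest sets ttlSecondsAfterFinished to 1200, retrieve the logs within 20 minutes of completion, before the JobSet and its Pods are cleaned up.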
What's next
- Read Collect and Understand NCCL Logs for Troubleshooting to understand the test output and troubleshoot issues.
- Learn more about troubleshooting slow performance.