This page describes how to run NCCL/gIB tests on provisioned clusters that use GPUDirect RDMA. It covers tests for the following scenarios:
- If you have nodes provisioned with flex-start (Preview), use a basic test across two nodes.
- If you have a larger number of nodes that weren't provisioned with flex-start, use an NCCL test with Topology Aware Scheduling (TAS).
Two-node test
Connect to the cluster:
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```

Replace the following variables:

- CLUSTER_NAME: the name of your cluster which, for clusters created with Cluster Toolkit, is based on DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.
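As a concrete illustration of the substitution, the following sketch uses hypothetical example values (the cluster name and region below are assumptions, not values from this page) and prints the fully substituted command so you can review it before running it:

```shell
# Hypothetical example values -- substitute your own cluster name and region.
CLUSTER_NAME=a4x-cluster
COMPUTE_REGION=us-central1

# Print the fully substituted command for review before executing it.
echo "gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${COMPUTE_REGION}"
```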
To deploy an NCCL test workload of two test pods running on two A4X nodes, run the following command:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml
```

Check whether the pods are running on nodes:

```
kubectl get pods nccl-test-host-1 nccl-test-host-2
```

If both Pods have the status Running, you can proceed to the next step.

Trigger an all_gather collective test on the A4X nodes:

```
kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
```

The output is similar to the following:
```
#                                                        out-of-place                       in-place
#        size        count     type   redop    root     time  algbw  busbw  #wrong     time  algbw  busbw  #wrong
#         (B)   (elements)                               (us) (GB/s) (GB/s)             (us) (GB/s) (GB/s)
         1024           32    float    none      -1    21.20   0.05   0.04       0    20.56   0.05   0.04       0
         2048           64    float    none      -1    21.03   0.10   0.09       0    20.82   0.10   0.09       0
         4096          128    float    none      -1    21.11   0.19   0.17       0    20.98   0.20   0.17       0
         8192          256    float    none      -1    21.51   0.38   0.33       0    21.15   0.39   0.34       0
        16384          512    float    none      -1    21.85   0.75   0.66       0    21.72   0.75   0.66       0
        32768         1024    float    none      -1    24.08   1.36   1.19       0    23.73   1.38   1.21       0
        65536         2048    float    none      -1    24.68   2.66   2.32       0    24.02   2.73   2.39       0
       131072         4096    float    none      -1    24.93   5.26   4.60       0    24.30   5.40   4.72       0
       262144         8192    float    none      -1    24.86  10.55   9.23       0    24.33  10.78   9.43       0
       524288        16384    float    none      -1    25.10  20.89  18.28       0    24.48  21.41  18.74       0
      1048576        32768    float    none      -1    25.43  41.24  36.09       0    24.82  42.25  36.97       0
      2097152        65536    float    none      -1    32.30  64.93  56.81       0    31.28  67.04  58.66       0
      4194304       131072    float    none      -1    45.92  91.34  79.92       0    44.22  94.84  82.99       0
      8388608       262144    float    none      -1    71.38 117.52 102.83       0    68.98 121.61 106.41       0
     16777216       524288    float    none      -1    74.17 226.20 197.93       0    72.37 231.83 202.85       0
     33554432      1048576    float    none      -1    116.6 287.84 251.86       0    112.7 297.75 260.54       0
     67108864      2097152    float    none      -1    188.9 355.27 310.86       0    184.0 364.71 319.12       0
    134217728      4194304    float    none      -1    309.6 433.56 379.36       0    299.7 447.83 391.85       0
    268435456      8388608    float    none      -1    559.0 480.23 420.20       0    540.3 496.85 434.75       0
    536870912     16777216    float    none      -1   1053.7 509.52 445.83       0   1021.4 525.64 459.93       0
   1073741824     33554432    float    none      -1   2087.4 514.39 450.10       0   2013.8 533.19 466.54       0
   2147483648     67108864    float    none      -1   4154.7 516.88 452.27       0   3987.4 538.57 471.25       0
   4294967296    134217728    float    none      -1   8289.2 518.14 453.37       0   7907.4 543.16 475.26       0
   8589934592    268435456    float    none      -1    16556 518.85 453.99       0    15726 546.24 477.96       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 175.233
#
```
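If you run this test repeatedly, it can help to pull the summary figures out of a saved log instead of scanning the whole table. The following sketch uses a two-line inline sample standing in for a real log (the /tmp path is a hypothetical choice) and extracts the out-of-bounds count and the average bus bandwidth with awk:

```shell
# Inline sample standing in for a saved NCCL test log.
cat > /tmp/nccl-test.log <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 175.233
EOF

# Print the out-of-bounds count and the average bus bandwidth summary fields.
awk '/Out of bounds values/ {print "oob=" $(NF-1)}
     /Avg bus bandwidth/    {print "busbw=" $NF}' /tmp/nccl-test.log
```

The out-of-bounds count should be 0; a nonzero value indicates data-corruption errors rather than a pure performance problem.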
Test with TAS
To validate the functionality of the provisioned cluster, you can run the following NCCL test with TAS.
Configure Kueue with TAS enabled
- Install Kueue with TAS enabled.
Configure Kueue with TAS enabled by creating the following file, named a4x-kueue-config.yaml:

```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "a4x-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gke-nodepool"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "a4x"
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-gb200
  topologyName: "a4x-default"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: NoSchedule
  - key: "kubernetes.io/arch"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a4x"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "a4x"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 1_000_000_000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "a4x"
spec:
  clusterQueue: "a4x"
```

Apply the configuration:

```
kubectl apply -f a4x-kueue-config.yaml
```
Schedule a topology-aware NCCL test with TAS-enabled Kueue
The following workload must be placed within a single sub-block of the NVLink domain.
- Install JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit. Make sure that your non-GPU node pools have enough resources to schedule the JobSet controllers.
Create the following file and name it nccl-tas-test.yaml. Replace NUM_NODES with the number of nodes that you expect to run the NCCL test, up to 18:

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nccl-test-compute-domain
spec:
  numNodes: NUM_NODES
  channel:
    resourceClaimTemplate:
      name: nccl-test-compute-domain-channel
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: kueue-tas-nccl-all-gather
  labels:
    kueue.x-k8s.io/queue-name: a4x
spec:
  ttlSecondsAfterFinished: 1200
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth2","network":"rdma-0"},
                  {"interfaceName":"eth3","network":"rdma-1"},
                  {"interfaceName":"eth4","network":"rdma-2"},
                  {"interfaceName":"eth5","network":"rdma-3"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-gb200
            tolerations:
            - key: nvidia.com/gpu
              operator: Equal
              value: present
              effect: NoSchedule
            - key: kubernetes.io/arch
              operator: Equal
              value: arm64
              effect: NoSchedule
            setHostnameAsFQDN: true
            volumes:
            - name: gib
              hostPath:
                path: /home/kubernetes/bin/gib
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            resourceClaims:
            - name: compute-domain-channel
              resourceClaimTemplateName: nccl-test-compute-domain-channel
            containers:
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.4
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                set -x
                echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"

                # Install ping
                apt update -y
                apt install -y iputils-ping

                # Start sshd
                /scripts/container_entry.sh daemon &

                # Get helper variables to form all hostnames
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX

                # For every worker, wait till online and add to hostfile
                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    echo Waiting for ${OTHER}...
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=4 | tee -a /tmp/hostfile;
                done

                cat /tmp/hostfile

                # Launch from head node
                if [[ "${NODE_RANK}" -eq "0" ]]; then

                  # World Level = 0x0, Rail Aligned = 0x7
                  export NCCL_TESTS_SPLIT_MASK="0x0";

                  # Force use of libnccl-gib
                  export NCCL_NET=gIB

                  # Set all the correct libnccl-gib environment variables
                  source /usr/local/gib/scripts/set_nccl_env.sh

                  # Get all relevant NCCL / env vars to pass to all workers
                  ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')

                  mpirun --hostfile /tmp/hostfile \
                    -x $ENV_VARS \
                    -mca plm_rsh_no_tree_spawn 1 \
                    --mca orte_keep_fqdn_hostnames 1 \
                    --mca btl self,tcp \
                    --mca btl_tcp_if_include eth0 \
                    --bind-to none \
                    --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                    /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi

                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: gib
                mountPath: /usr/local/gib
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  nvidia.com/gpu: 4
                requests:
                  nvidia.com/gpu: 4
                claims:
                - name: compute-domain-channel
```

Run the test:

```
kubectl apply -f nccl-tas-test.yaml
```

Check the test result by examining the logs:
```
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep kueue-tas-nccl-all-gather-worker-0-0)
```
The output should be similar to the following:
```
#        size        count     type   redop    root     time  algbw  busbw  #wrong     time  algbw  busbw  #wrong
#         (B)   (elements)                               (us) (GB/s) (GB/s)             (us) (GB/s) (GB/s)
         1024            8    float    none      -1    56.72   0.02   0.02       0    56.12   0.02   0.02       0
         2048           16    float    none      -1    56.85   0.04   0.03       0    56.87   0.04   0.03       0
         4096           32    float    none      -1    57.53   0.07   0.07       0    57.47   0.07   0.07       0
         8192           64    float    none      -1    58.43   0.14   0.14       0    58.27   0.14   0.14       0
        16384          128    float    none      -1    59.29   0.28   0.27       0    58.87   0.28   0.27       0
        32768          256    float    none      -1    60.02   0.55   0.53       0    59.60   0.55   0.53       0
        65536          512    float    none      -1    61.83   1.06   1.03       0    61.64   1.06   1.03       0
       131072         1024    float    none      -1    70.99   1.85   1.79       0    70.82   1.85   1.79       0
       262144         2048    float    none      -1    71.56   3.66   3.55       0    71.07   3.69   3.57       0
       524288         4096    float    none      -1    72.62   7.22   6.99       0    71.90   7.29   7.06       0
      1048576         8192    float    none      -1    72.80  14.40  13.95       0    72.31  14.50  14.05       0
      2097152        16384    float    none      -1    73.40  28.57  27.68       0    72.96  28.74  27.85       0
      4194304        32768    float    none      -1    73.86  56.78  55.01       0    73.44  57.12  55.33       0
      8388608        65536    float    none      -1    102.5  81.86  79.30       0    101.4  82.69  80.11       0
     16777216       131072    float    none      -1    158.3 105.97 102.66       0    156.8 107.02 103.68       0
     33554432       262144    float    none      -1    158.4 211.89 205.26       0    157.5 212.99 206.33       0
     67108864       524288    float    none      -1    250.7 267.68 259.32       0    248.7 269.81 261.38       0
    134217728      1048576    float    none      -1    417.7 321.29 311.25       0    414.1 324.13 314.01       0
    268435456      2097152    float    none      -1    728.8 368.32 356.81       0    721.5 372.08 360.45       0
    536870912      4194304    float    none      -1   1226.5 437.72 424.04       0   1216.1 441.46 427.66       0
   1073741824      8388608    float    none      -1   2268.4 473.35 458.56       0   2247.0 477.86 462.93       0
   2147483648     16777216    float    none      -1   4330.6 495.88 480.39       0   4291.6 500.39 484.76       0
   4294967296     33554432    float    none      -1   8640.9 497.05 481.52       0   8544.0 502.69 486.98       0
   8589934592     67108864    float    none      -1    17258 497.75 482.19       0    17052 503.75 488.00       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 157.091
```
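As a quick sanity check on figures like these: for an all_gather over n ranks, the NCCL tests derive bus bandwidth from algorithm bandwidth as busbw = algbw × (n − 1) / n. The sketch below assumes 32 ranks (8 nodes × 4 GPUs, which is consistent with the sample output above) and reproduces the final in-place row's bus bandwidth from its algorithm bandwidth:

```shell
# busbw = algbw * (n - 1) / n for all_gather over n ranks (NCCL tests convention).
# Assumed: 32 ranks (8 nodes x 4 GPUs); algbw taken from the last in-place row above.
awk 'BEGIN {
  n = 32
  algbw = 503.75   # GB/s, in-place algbw at the 8 GiB message size
  printf "busbw=%.2f GB/s\n", algbw * (n - 1) / n
}'
```

The result, about 488 GB/s, matches the 488.00 GB/s shown in the table to within rounding; a large mismatch between measured busbw and this formula usually means the run used a different rank count than expected.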
What's next
- Collect and understand NCCL logs for troubleshooting, to interpret the test output and diagnose issues.
- Learn more about troubleshooting slow performance.