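This page describes how to run NVIDIA Collective Communications Library (NCCL) tests on custom GKE clusters that use the GPUDirect-TCPXO and GPUDirect-TCPX networking protocols. A custom GKE cluster is a cluster that you create by using gcloud commands.
The tests described on this page apply to the following scenarios:
- If your GKE cluster uses Flex Start nodes, use the basic test on two nodes.
- If your GKE cluster uses different types of nodes, such as on-demand nodes or reservation-bound nodes, use the NCCL test with topology-aware scheduling.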
Before you begin
The tests on this page use JobSet and Kueue with Topology Aware Scheduling (TAS). Before you run the tests, you must set up your cluster with both JobSet and Kueue.
Install Kueue:
    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.16.5/manifests.yaml
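Before you apply any Kueue resources, it can help to confirm that the Kueue controller is available. The following is a minimal sketch, assuming that the release manifests install a Deployment named kueue-controller-manager in the kueue-system namespace (the upstream Kueue defaults). The function name wait_for_kueue and the default timeout are illustrative and not part of the original procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the Kueue controller Deployment is available.
# Assumes the upstream default namespace (kueue-system) and Deployment name
# (kueue-controller-manager); adjust both if your installation differs.
wait_for_kueue() {
  local timeout="${1:-5m}"
  kubectl wait deployment/kueue-controller-manager \
    --namespace kueue-system \
    --for=condition=Available \
    --timeout="${timeout}"
}

# Usage (requires access to the cluster):
#   wait_for_kueue 10m
```

Waiting on the Available condition avoids applying Kueue custom resources before the admission webhooks are ready to serve.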
Configure your cluster with JobSet and Kueue
After you install JobSet and Kueue, complete the following steps.
Save the following manifest as kueue-config.yaml:
A3 High
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-high-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-high-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-high-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-high-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
A3 Mega
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: Topology
    metadata:
      name: "gke-default"
    spec:
      levels:
      - nodeLabel: "cloud.google.com/gce-topology-block"
      - nodeLabel: "cloud.google.com/gce-topology-subblock"
      - nodeLabel: "cloud.google.com/gce-topology-host"
      - nodeLabel: "kubernetes.io/hostname"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: a3-mega-dws-flavor
    spec:
      nodeLabels:
        cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
      topologyName: "gke-default"
      tolerations:
      - key: "cloud.google.com/gke-queued"
        operator: "Exists"
        effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: AdmissionCheck
    metadata:
      name: dws-prov
    spec:
      controllerName: kueue.x-k8s.io/provisioning-request
      parameters:
        apiGroup: kueue.x-k8s.io
        kind: ProvisioningRequestConfig
        name: dws-config
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ProvisioningRequestConfig
    metadata:
      name: dws-config
    spec:
      provisioningClassName: queued-provisioning.gke.io
      podSetUpdates:
      - key: autoscaling.gke.io/provisioning-request
        valueFromProvisioningClassDetail: ResizeRequestName
      managedResources:
      - nvidia.com/gpu
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cq-tas
    spec:
      namespaceSelector: {}
      clusterQueueingStrategy: BestEffortFIFO
      resourceGroups:
      - flavors:
        - name: a3-mega-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
        - name: a3-mega-dws-flavor
          resources:
          - name: "cpu"
            nominalQuota: 1000
          - name: "memory"
            nominalQuota: 1000Ti
          - name: "nvidia.com/gpu"
            nominalQuota: 1000
      admissionChecksStrategy:
        admissionChecks:
        - name: "dws-prov"
          onFlavors: [a3-mega-dws-flavor]
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: default
      name: lq-tas
    spec:
      clusterQueue: cq-tas
Apply the manifest:
    kubectl apply -f kueue-config.yaml
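After the manifest is applied, a quick lookup of the objects it defines can confirm that they exist. The sketch below assumes the object names from kueue-config.yaml (gke-default, cq-tas, lq-tas) and the fully qualified Kueue resource names; the function name verify_kueue_config is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: confirm that the Topology, ClusterQueue, and LocalQueue
# from kueue-config.yaml exist. Stops at the first lookup that fails.
verify_kueue_config() {
  kubectl get topologies.kueue.x-k8s.io gke-default &&
  kubectl get clusterqueues.kueue.x-k8s.io cq-tas &&
  kubectl get localqueues.kueue.x-k8s.io lq-tas --namespace default
}

# Usage (requires access to the cluster):
#   verify_kueue_config
```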
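When you run workloads with TAS enabled, you can control how strictly topology constraints are enforced by using one of the following annotations in your workload manifest:
- kueue.x-k8s.io/podset-required-topology: With this annotation, Kueue blocks scheduling until the workload can be scheduled within the requested topology constraint. Use this annotation when Pods must be placed together for the best performance.
- kueue.x-k8s.io/podset-preferred-topology: With this annotation, Kueue tries to schedule the Pods within the requested topology constraint. If that isn't possible, Kueue admits the workload without satisfying the constraint.
Note: Don't use required mode with DWS Flex Start. Because Flex Start provisions nodes dynamically, the resulting nodes might not meet strict topology requirements, and the workload might become unschedulable. For this configuration, use podset-preferred-topology instead.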
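The note above can be turned into a simple pre-flight check on a workload manifest. This is only a sketch: it assumes that the presence of the cloud.google.com/gke-queued toleration key marks a manifest as targeting Flex Start nodes, which is a heuristic, and the function name check_topology_mode is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical lint: warn when a manifest combines the required-topology
# annotation with the gke-queued toleration (used here as a proxy for
# Flex Start). Returns 1 on a warning and 0 otherwise.
check_topology_mode() {
  local manifest="$1"
  if grep -qF 'kueue.x-k8s.io/podset-required-topology' "$manifest" &&
     grep -qF 'cloud.google.com/gke-queued' "$manifest"; then
    echo "WARN: ${manifest} uses podset-required-topology with Flex Start; use podset-preferred-topology instead"
    return 1
  fi
  echo "OK: ${manifest}"
}

# Usage:
#   check_topology_mode nccl-tas-jobset.yaml
```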
For either annotation, specify one of the following values as the topology constraint:
- cloud.google.com/gce-topology-block: Schedules Pods within the same network block.
- cloud.google.com/gce-topology-subblock: Schedules Pods within the same rack.
- cloud.google.com/gce-topology-host: Schedules Pods on the same physical host.
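To see which block, subblock, and host each GPU node belongs to, the three labels listed above can be printed as extra columns. The following sketch assumes that GPU nodes carry the cloud.google.com/gke-accelerator label (as the node selectors on this page imply); the function name show_gpu_topology is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list GPU nodes with their topology labels as columns.
# --selector with a bare key matches any node that has that label.
show_gpu_topology() {
  kubectl get nodes \
    --selector cloud.google.com/gke-accelerator \
    --label-columns cloud.google.com/gce-topology-block,cloud.google.com/gce-topology-subblock,cloud.google.com/gce-topology-host
}

# Usage (requires access to the cluster):
#   show_gpu_topology
```

Nodes that share a value in a column satisfy the corresponding topology constraint with respect to each other.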
Test on two Flex Start nodes
To run an NCCL test on a GKE cluster that uses A3 Mega or A3 High Flex Start VMs, use the following procedure. This procedure uses a JobSet manifest to run the NCCL test on two nodes.
Save the following manifest as nccl-tas-jobset.yaml:
A3 Mega
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-configmap
    data:
      allgather.sh: |
        #!/bin/bash
        service ssh restart;
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        # Set up environment variables for GPUDirect-TCPXO
        export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
        export NCCL_FASTRAK_CTRL_DEV=eth0
        export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
        export NCCL_SOCKET_IFNAME=eth0
        export NCCL_CROSS_NIC=0
        export NCCL_ALGO=Ring,Tree
        export NCCL_PROTO=Simple
        export NCCL_NET_GDR_LEVEL=PIX
        # Run the benchmark
        /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"},
                      {"interfaceName":"eth5","network":"vpc4"},
                      {"interfaceName":"eth6","network":"vpc5"},
                      {"interfaceName":"eth7","network":"vpc6"},
                      {"interfaceName":"eth8","network":"vpc7"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: lib64
                  hostPath:
                    path: /lib64
                - name: proc
                  hostPath:
                    path: /proc
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-configmap
                    defaultMode: 0755
                containers:
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.15
                  stdin: true
                  tty: true
                  securityContext:
                    privileged: true
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: shared-memory
                    mountPath: /dev/shm
                  - name: nccl-config
                    mountPath: /configs
                  resources:
                    limits:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                - name: tcpxo-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.21
                  imagePullPolicy: Always
                  command: ["/bin/sh", "-c"]
                  args:
                  - |
                    set -ex
                    chmod 755 /fts/entrypoint_rxdm_container.sh
                    /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                      - NET_ADMIN
                      - NET_BIND_SERVICE
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia/lib64
                  - name: proc
                    mountPath: /proc
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
A3 High
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nccl-config
    data:
      allgather.sh: |
        #!/bin/bash
        # Copy the mounted scripts into /scripts and make them executable.
        for script in /configs/*; do
          name=$(basename $script)
          cp $script "/scripts/$name"
          chmod +x "/scripts/$name"
        done
        /scripts/init_ssh.sh ${@};
        pushd /scripts;
        /scripts/gen_hostfiles.sh ${@};
        popd;
        # ${#} is the number of host arguments, that is, the node count.
        /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${#};
    ---
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-tas-test
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        replicas: 2
        template:
          spec:
            parallelism: 1
            completions: 1
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                terminationGracePeriodSeconds: 0
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: cloud.google.com/gke-queued
                  effect: NoSchedule
                  value: "true"
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                  - /tcpgpudmarxd/build/app/tcpgpudmarxd
                  - --gpu_nic_preset
                  - a3vm
                  - --gpu_shmem_type
                  - fd
                  - --uds_path
                  - /run/tcpx
                  - --setup_param
                  - "--verbose 128 2 0 "
                  securityContext:
                    privileged: true
                    capabilities:
                      add:
                      - NET_ADMIN
                  volumeMounts:
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: tcpx-socket
                    mountPath: /run/tcpx
                  - name: sys
                    mountPath: /hostsysfs
                  - name: proc-sys
                    mountPath: /hostprocsysfs
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                - name: nccl-test
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  command:
                  - bash
                  - -c
                  - |
                    /scripts/container_entry.sh daemon;
                    sleep infinity;
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: tcpx-socket
                    mountPath: /tmp
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: nccl-config
                    mountPath: /configs
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                volumes:
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: sys
                  hostPath:
                    path: /sys
                - name: proc-sys
                  hostPath:
                    path: /proc/sys
                - name: shared-memory
                  emptyDir:
                    medium: Memory
                    sizeLimit: 250Gi
                - name: nccl-config
                  configMap:
                    name: nccl-config
                    defaultMode: 0777
Apply the manifest to your cluster:
    kubectl apply -f nccl-tas-jobset.yaml
Verify that the JobSet is admitted and running:
    kubectl get jobset nccl-tas-test
Wait until the JobSet is unsuspended and the Pods reach the Running status.
Trigger the NCCL test by running the allgather.sh script from the first worker Pod:
    kubectl exec --stdin --tty --container=nccl-test nccl-tas-test-worker-0-0 -- /configs/allgather.sh nccl-tas-test-worker-0-0 nccl-tas-test-worker-1-0
The output of the two-node test is similar to the following:
A3 Mega
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
               0             0     float    none      -1     0.24    0.00    0.00      0     0.18    0.00    0.00      0
    ...
      8589934592     134217728     float    none      -1    42603  201.63  189.03      0    42670  201.31  188.73      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 45.7587
A3 High
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         1048576         16384     float    none      -1    696.8    1.50    1.41      0    729.0    1.44    1.35      0
    ...
       536870912       8388608     float    none      -1   7101.7   75.60   70.87      0   7060.9   76.03   71.28      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 29.8293
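When the test runs repeatedly, the summary line can be extracted from the output instead of being read by eye. The sketch below parses the "Avg bus bandwidth" line in the format shown above and compares it with a threshold. The function names and any threshold value are hypothetical; this page doesn't define a pass/fail bandwidth.

```shell
#!/usr/bin/env bash
# Print the "Avg bus bandwidth" value (GB/s) found in nccl-tests output on stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Succeed only if the average bus bandwidth read from stdin is at least the
# threshold given as the first argument. Fails if no summary line is present.
busbw_at_least() {
  local threshold="$1" value
  value="$(avg_busbw)"
  [ -n "$value" ] && awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v >= t) }'
}

# Usage (nccl-output.log is a placeholder for a saved copy of the test output):
#   busbw_at_least 25 < nccl-output.log && echo "bandwidth OK"
```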
Deploy an NCCL test workload with TAS
If you have more than two nodes, we recommend the following test, which uses Topology Aware Scheduling (TAS). To run an NCCL test with TAS on a GKE cluster that uses A3 Mega or A3 High Flex Start VMs, use the following procedure.
Save the following manifest as nccl-jobset-test.yaml. Replace NUM_NODES with the number of nodes in the node pool.
A3 Mega
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        template:
          spec:
            parallelism: NUM_NODES
            completions: NUM_NODES
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"},
                      {"interfaceName":"eth5","network":"vpc4"},
                      {"interfaceName":"eth6","network":"vpc5"},
                      {"interfaceName":"eth7","network":"vpc6"},
                      {"interfaceName":"eth8","network":"vpc7"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-mega-80gb
                tolerations:
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: proc
                  hostPath:
                    path: /proc
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: lib64
                  hostPath:
                    path: /lib64
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                containers:
                - name: nccl-test
                  stdin: true
                  tty: true
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-tcpxo-diagnostic:v1.0.6
                  securityContext:
                    privileged: true
                  env:
                  - name: MY_NODE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: spec.nodeName
                  - name: OMPI_ALLOW_RUN_AS_ROOT
                    value: "1"
                  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                    value: "1"
                  - name: N_NODES
                    value: "NUM_NODES"
                  - name: NCCL_SOCKET_IFNAME
                    value: eth0
                  - name: NCCL_FASTRAK_CTRL_DEV
                    value: eth0
                  - name: NCCL_FASTRAK_IFNAME
                    value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
                  - name: NCCL_CROSS_NIC
                    value: "0"
                  - name: NCCL_ALGO
                    value: Ring,Tree
                  - name: NCCL_PROTO
                    value: Simple
                  - name: NCCL_NET_GDR_LEVEL
                    value: PIX
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  command:
                  - bash
                  - -c
                  - |
                    set -x
                    /scripts/container_entry.sh daemon &
                    export POSTFIX=$(hostname | cut -d . -f 2-)
                    export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                    export NODE_RANK=$JOB_COMPLETION_INDEX

                    # Wait until every worker accepts SSH, then add it to the hostfile.
                    for i in `seq 0 $(($N_NODES-1))`; do
                      OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                      until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                        sleep 10
                      done
                      echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                    done

                    if [[ "${NODE_RANK}" -eq "0" ]]; then
                      # Rank 0 launches the benchmark across all workers.
                      export NCCL_TESTS_SPLIT_MASK="0x0";
                      ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
                      mpirun --hostfile /tmp/hostfile \
                        -x $ENV_VARS \
                        -mca plm_rsh_no_tree_spawn 1 \
                        --mca orte_keep_fqdn_hostnames 1 \
                        --mca btl self,tcp \
                        --mca btl_tcp_if_include eth0 \
                        --bind-to none \
                        --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                        /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                    else
                      # Other ranks stay alive while rank 0 is reachable.
                      while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                        sleep 5
                      done
                    fi

                    exit 0
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: lib64
                    mountPath: /lib64
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "3700Gi"
                      nvidia.com/gpu: 8
                - name: tcpxo-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpxo-daemon:v1.0.1
                  imagePullPolicy: Always
                  command:
                  - bash
                  - -c
                  - |
                    /usr/bin/tcpxo_daemon
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: proc
                    mountPath: /proc
                  env:
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
A3 High
    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-ag
      labels:
        kueue.x-k8s.io/queue-name: lq-tas
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: true
      network:
        enableDNSHostnames: true
      replicatedJobs:
      - name: worker
        template:
          spec:
            parallelism: NUM_NODES
            completions: NUM_NODES
            template:
              metadata:
                annotations:
                  kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"
                  networking.gke.io/default-interface: 'eth0'
                  networking.gke.io/interfaces: |
                    [
                      {"interfaceName":"eth0","network":"default"},
                      {"interfaceName":"eth1","network":"vpc0"},
                      {"interfaceName":"eth2","network":"vpc1"},
                      {"interfaceName":"eth3","network":"vpc2"},
                      {"interfaceName":"eth4","network":"vpc3"}
                    ]
              spec:
                activeDeadlineSeconds: 3600
                restartPolicy: Never
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-h100-80gb
                tolerations:
                - key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"
                setHostnameAsFQDN: true
                volumes:
                - name: proc
                  hostPath:
                    path: /proc
                - name: nvidia
                  hostPath:
                    path: /home/kubernetes/bin/nvidia
                - name: libraries
                  hostPath:
                    path: /home/kubernetes/bin/nvidia/lib64
                - name: tcpx-socket
                  emptyDir: {}
                - name: shared-memory
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: 250Gi
                containers:
                - name: tcpx-daemon
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.11
                  command:
                  - /tcpgpudmarxd/build/app/tcpgpudmarxd
                  - --gpu_nic_preset
                  - a3vm
                  - --uds_path
                  - /run/tcpx
                  securityContext:
                    privileged: true
                  volumeMounts:
                  - name: tcpx-socket
                    mountPath: /run/tcpx
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                - name: nccl-test
                  stdin: true
                  tty: true
                  image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.8
                  securityContext:
                    privileged: true
                  env:
                  - name: MY_NODE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: spec.nodeName
                  - name: OMPI_ALLOW_RUN_AS_ROOT
                    value: "1"
                  - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                    value: "1"
                  - name: N_NODES
                    value: "NUM_NODES"
                  - name: LD_LIBRARY_PATH
                    value: /usr/local/nvidia/lib64
                  command:
                  - bash
                  - -c
                  - |
                    /scripts/container_entry.sh daemon &
                    export POSTFIX=$(hostname | cut -d . -f 2-)
                    export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                    export NODE_RANK=$JOB_COMPLETION_INDEX

                    # Wait until every worker accepts SSH, then add it to the hostfile.
                    for i in `seq 0 $(($N_NODES-1))`; do
                      OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                      until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                        sleep 10
                      done
                      echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                    done

                    if [[ "${NODE_RANK}" -eq "0" ]]; then
                      # Rank 0 launches the benchmark across all workers.
                      /scripts/run-allgather.sh 8 eth1,eth2,eth3,eth4 1M 512M ${N_NODES}
                    else
                      # Other ranks stay alive while rank 0 is reachable.
                      while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                        sleep 5
                      done
                    fi

                    exit 0
                  volumeMounts:
                  - name: nvidia
                    mountPath: /usr/local/nvidia
                  - name: tcpx-socket
                    mountPath: /tmp
                  - name: libraries
                    mountPath: /usr/local/nvidia/lib64
                  - name: shared-memory
                    mountPath: /dev/shm
                  resources:
                    limits:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
                    requests:
                      cpu: "200"
                      memory: "1800Gi"
                      nvidia.com/gpu: 8
Apply the manifest:
    kubectl apply -f nccl-jobset-test.yaml
Verify that the workload is admitted and reaches the Completed state.
To view the results, get the logs of the Pod whose name matches nccl-ag-worker-0-0-.*:
    kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-ag-worker-0-0)
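The steps in this section can be sketched as a single function that fills in NUM_NODES, applies the manifest, waits for completion, and prints the rank-0 logs. This is an illustrative wrapper, not part of the documented procedure: the function name run_nccl_tas_test is hypothetical, the 60m timeout is an assumption chosen to match activeDeadlineSeconds: 3600, and the wait step assumes that the JobSet reports a condition named Completed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the TAS test steps.
#   $1: number of nodes (replaces the NUM_NODES placeholder)
#   $2: manifest path (defaults to nccl-jobset-test.yaml)
run_nccl_tas_test() {
  local num_nodes="$1" manifest="${2:-nccl-jobset-test.yaml}" pod
  # Render the manifest on the fly so the saved file keeps its placeholder.
  sed "s/NUM_NODES/${num_nodes}/g" "$manifest" | kubectl apply -f - || return 1
  # Assumption: the JobSet exposes a "Completed" condition when it finishes.
  kubectl wait --for=condition=Completed jobset/nccl-ag --timeout=60m || return 1
  # Locate the rank-0 Pod by name prefix and print the nccl-test container logs.
  pod="$(kubectl get pods -o name | grep nccl-ag-worker-0-0 | head -n 1)"
  [ -n "$pod" ] && kubectl logs "$pod" --container nccl-test
}

# Usage (requires access to the cluster):
#   run_nccl_tas_test 4
```

Because the manifest sets ttlSecondsAfterFinished: 1200, the logs are available only for a limited time after the JobSet finishes, so retrieving them right after the wait returns is the safer order.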
What's next
- Collect and understand NCCL logs for troubleshooting, so that you can interpret the test output and diagnose issues.
- Learn about troubleshooting slow performance.