Schedule GKE workloads with Topology Aware Scheduling (TAS)

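AI and ML workloads require significant Pod-to-Pod communication. Because of this requirement, the network bandwidth between Pods directly affects workload run time and cost. This bandwidth depends on where the virtual machine (VM) instances are placed in the cluster.

This document explains how to optimize the scheduling of large-scale AI or ML workloads on a Google Kubernetes Engine (GKE) cluster for both performance and reliability. Specifically, you configure your cluster to use Topology Aware Scheduling (TAS) for low-latency communication. This approach minimizes communication overhead and helps maximize workload performance.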

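What is Topology Aware Scheduling (TAS)?

TAS can significantly improve the efficiency of large language model (LLM) training. TAS strategically places workers on the network topology to minimize communication overhead during gradient aggregation, which requires workers to communicate in a specific rank order. By minimizing network hops between sequentially communicating workers, TAS reduces network contention and improves bandwidth utilization, which leads to faster convergence and shorter training times. As LLMs grow larger, TAS becomes essential for maximizing the performance and scalability of distributed training.

TAS works best with densely placed capacity, which you can obtain through reservations. With flex-start VMs or Spot VMs, capacity is less likely to be allocated close together, so TAS might not work well in this scenario.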

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

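If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone property instead. Setting a default location helps you avoid gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

To connect to your cluster, run the following command: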
```
gcloud container clusters get-credentials CLUSTER_NAME
```
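Replace CLUSTER_NAME with the name of your cluster.

Prepare your GKE cluster

To prepare your GKE cluster to run workloads with TAS, complete the following steps:

- Install Kueue with TAS enabled
- View the topology of your GKE cluster
- Configure Kueue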

Install Kueue with TAS enabled

We recommend that you use TAS with Kueue, a Kubernetes-native system that manages quotas and how jobs consume them. TAS requires Kueue version 0.10.0 or later, and you must explicitly enable it.

To install Kueue and enable TAS, select one of the following options:

Kueue manifest

Install Kueue:

```
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.0/manifests.yaml
```

Enable TAS in Kueue:

```
kubectl -n kueue-system patch deployment kueue-controller-manager \
    --type json -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--feature-gates=TopologyAwareScheduling=true"}]'
```

Helm chart

Install Kueue with TAS enabled by using a Helm chart:

```
helm install kueue oci://us-central1-docker.pkg.dev/k8s-staging-images/charts/kueue \
    --version="v0.10.0" \
    --create-namespace \
    --namespace=kueue-system \
    --set="controllerManager.featureGates[0].name=TopologyAwareScheduling,controllerManager.featureGates[0].enabled=true"
```

After you install Kueue, you must configure it so that it understands the infrastructure that it manages. For details, see the Configure Kueue section.
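One way to confirm that the feature gate took effect is to inspect the arguments of the controller manager container. The following is a minimal sketch rather than an official check. The function name `kueue_tas_enabled` is hypothetical, and the sketch assumes that the manager is the first container in the Deployment (the same assumption that the patch command makes) and that the feature gate appears in the container arguments.

```shell
# Hypothetical helper: succeeds (exit status 0) when the first container of the
# kueue-controller-manager Deployment lists the TAS feature gate in its args.
# Assumption: the gate is passed as a container argument, as the patch command does.
kueue_tas_enabled() {
  kubectl -n kueue-system get deployment kueue-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[0].args}' \
    | grep -q 'TopologyAwareScheduling=true'
}
```

For example, after the rollout finishes you could run `kueue_tas_enabled && echo "TAS feature gate is set"`.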

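View the topology of your GKE cluster

To view the topology of A4X, A4, A3 Ultra, A3 Mega, and A3 High (8 GPUs) nodes that are provisioned as Spot VMs, you must define compact placement on the GKE nodes to expose their physical topology to TAS. Otherwise, you will encounter errors.

To view the topology of your GKE cluster nodes in a specific node pool, run the following command:

```
kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME \
    -ocustom-columns='NAME:.metadata.name,BLOCK:.metadata.labels.cloud\.google\.com/gce-topology-block,SUBBLOCK:.metadata.labels.cloud\.google\.com/gce-topology-subblock,HOST:.metadata.labels.cloud\.google\.com/gce-topology-host' | sort -k2,4
```

Replace NODE_POOL_NAME with the name of the node pool.

To understand the physical topology of the GKE nodes on the VMs in the output, refer to the following node labels:

- cloud.google.com/gce-topology-block: the organization-specific ID of the reserved block in which the VM is located.
- cloud.google.com/gce-topology-subblock: the organization-specific ID of the sub-block in which the VM is located.
- cloud.google.com/gce-topology-host: the ID of the host on which the VM is located.
- kubernetes.io/hostname: the hostname of the Kubernetes node. This hostname is typically also the GKE node name.

The more label values that two VMs share, the physically closer the VMs are to each other. For more information about these terms, see Terminology.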
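To make the idea of shared label values concrete, the following sketch counts how many topology levels two nodes have in common, comparing from the block level downward. The function name `shared_topology_levels` and the sample IDs are illustrative assumptions and don't come from a real cluster.

```shell
# Usage: shared_topology_levels BLOCK_A SUBBLOCK_A HOST_A BLOCK_B SUBBLOCK_B HOST_B
# Prints 0-3: the number of consecutive levels, starting from the block, on which
# the two nodes carry the same label value. A higher number means closer VMs.
shared_topology_levels() {
  shared=0
  if [ "$1" = "$4" ]; then
    shared=1
    if [ "$2" = "$5" ]; then
      shared=2
      if [ "$3" = "$6" ]; then
        shared=3
      fi
    fi
  fi
  echo "$shared"
}

# Two hypothetical nodes in the same block and sub-block, but on different hosts.
shared_topology_levels block-a subblock-1 host-x block-a subblock-1 host-y   # prints 2
```

The comparison stops at the first level that differs, because a matching sub-block or host ID only indicates proximity when the enclosing levels also match.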

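Configure Kueue

After you install Kueue, you must configure it to specify the infrastructure that it manages. Typically, Kueue requires a ClusterQueue resource quota definition, either for static infrastructure or for dynamic infrastructure with cluster autoscaling enabled. The ClusterQueue admits a workload only if the resources that the workload requests are less than or equal to the pool of resources defined in the ClusterQueue. After you configure Kueue as described in this section, Kueue admits workloads by using TAS as follows:

- TAS workloads: Kueue checks both the topology of the physical infrastructure and its current usage.
- Non-TAS workloads: Kueue doesn't check the topology of the physical infrastructure. Kueue manages the entire quota defined in the configuration and leaves node assignment to kube-scheduler.

To learn how to provide a ClusterQueue resource quota definition to Kueue, review the following examples:

- Very high quota: Kueue practically never stops the admission of a workload based on the requested resources. Depending on the TAS definitions, Kueue might or might not admit the workload based on the infrastructure topology. For details, see Very high resource quota.
- Realistic quota: Kueue admits the workload only if the resources that the workload requests are within these resource quota limits. Kueue then checks the infrastructure topology against the TAS definitions before it admits the workload. For details, see Realistic resource quota.

In the following sections, every mention of resource quota refers to ClusterQueue resource quota.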

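Very high resource quota

The following example uses a very high resource quota, so Kueue never stops a workload based on the available resource quota. Instead, Kueue uses the topology information of the available nodes to try to match the topology to the workload's requirements.

To use the following resource quota definition, complete the following steps:

Open a file editor of your choice. Then add the following quota definition to a YAML file named kueue-tas-config-very-high-quota.yaml: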

```
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: "NODE_POOL_NAME"
  topologyName: "gke-default"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "tas-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 10000000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "tas-user-queue"
spec:
  clusterQueue: "tas-cluster-queue"
```

Replace NODE_POOL_NAME with the name of the node pool.

Create and apply the resource quota configuration for the Kueue job queueing system:
```
kubectl create -f kueue-tas-config-very-high-quota.yaml
```

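Realistic resource quota

The previous example configured only GPU resources. However, Kueue can manage all Kubernetes-compatible resources.

The following example defines a more realistic resource quota that includes CPU, memory, and GPU. The quota covers 100 a3-ultragpu-8g machines. A single machine has 224 vCPUs, 2944 GB of memory, and 8 GPUs.

To use the following resource quota definition, complete the following steps:

Open a file editor of your choice. Then add the following quota definition to a YAML file named kueue-tas-config-real-quota.yaml: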

```
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "gke-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: "NODE_POOL_NAME"
  topologyName: "gke-default"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {} # match all
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "tas-flavor"
      resources:
      # numbers below represent quota of 100 a3-ultragpu-8g machines
      - name: "cpu"
        nominalQuota: 22400
      - name: "memory"
        nominalQuota: 294400Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 800
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "tas-user-queue"
spec:
  clusterQueue: "tas-cluster-queue"
```

Replace NODE_POOL_NAME with the name of the node pool.

Create and apply the resource quota configuration for the Kueue job queueing system:

```
kubectl create -f kueue-tas-config-real-quota.yaml
```

The output is similar to the following:

```
topology.kueue.x-k8s.io/gke-default created
resourceflavor.kueue.x-k8s.io/tas-flavor created
clusterqueue.kueue.x-k8s.io/tas-cluster-queue created
localqueue.kueue.x-k8s.io/tas-user-queue created
```
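The nominalQuota values in the realistic quota example are the per-machine figures multiplied by the number of machines. The following sketch reproduces that arithmetic so that you can adapt it to a different fleet size. The function name `print_nominal_quota` is hypothetical. The defaults are the a3-ultragpu-8g figures quoted in this section, and the memory value is printed with the Gi suffix only to match the manifest.

```shell
# Usage: print_nominal_quota MACHINES [VCPUS_PER_MACHINE] [MEMORY_PER_MACHINE] [GPUS_PER_MACHINE]
# Defaults are the a3-ultragpu-8g figures from this section: 224 vCPUs, 2944 of memory, 8 GPUs.
print_nominal_quota() {
  machines=$1
  vcpus=${2:-224}
  memory=${3:-2944}
  gpus=${4:-8}
  echo "cpu: $((machines * vcpus))"
  echo "memory: $((machines * memory))Gi"
  echo "nvidia.com/gpu: $((machines * gpus))"
}

# Quota for 100 machines, which matches the values in the manifest:
# cpu: 22400, memory: 294400Gi, nvidia.com/gpu: 800
print_nominal_quota 100
```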

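Schedule workloads with TAS by using Kueue

The following scenarios show how to use topology request types and topology request levels to direct Kueue and TAS to manage common combinations of workloads and infrastructure.

The following topology request types (preferred or required) are available:
- kueue.x-k8s.io/podset-preferred-topology: Kueue prioritizes scheduling the entire workload within the given topology level, but still admits a workload that doesn't fit within that level. Even for a workload that could fit within a single topology level, Kueue might schedule it across multiple instances of that level.
- kueue.x-k8s.io/podset-required-topology: Kueue keeps trying to admit the workload until the entire workload can fit within the chosen topology level.

The following topology request levels are available. They let you be more specific about the physical infrastructure where you prefer or require your Job to run:
- cloud.google.com/gce-topology-block
- cloud.google.com/gce-topology-subblock
- cloud.google.com/gce-topology-host
- kubernetes.io/hostname

To schedule a workload by using these values, use the following Job YAML file: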

```
apiVersion: batch/v1
kind: Job
metadata:
  generateName: JOB_NAME
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: NUMBER_OF_REPLICAS
  completions: NUMBER_OF_REPLICAS
  completionMode: Indexed
  template:
    metadata:
      annotations:
        ANNOTATIONS_STRING
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["60s"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: Never
```

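Replace the following variables:

- JOB_NAME: the name of the Job.
- NUMBER_OF_REPLICAS: the number of Pods that run in parallel.
- ANNOTATIONS_STRING: see the following table.

| Requested topology type and level | Description | `ANNOTATIONS_STRING` |
| --- | --- | --- |
| Preferred to run within a hostname | This configuration admits your workload as long as enough resources are available to satisfy its resource requirements, even if the capacity is fragmented. Kueue schedules your Pods as compactly as possible. | `kueue.x-k8s.io/podset-preferred-topology: "kubernetes.io/hostname"` |
| Required to run within a host | This configuration admits your workload only if a host has enough resources to satisfy its resource requirements. This is useful when there are multiple VMs per host (for example, with smaller machine types) or when multiple Pods can run on a single node. In these cases, an admitted workload runs on a single host. | `kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-host"` |
| Preferred to run within a host | This configuration admits your workload as long as enough resources are available to satisfy its resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a host and uses additional hosts if needed. | `kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host"` |
| Required to run within a sub-block | This configuration admits your workload only if a sub-block has enough resources to satisfy its resource requirements. | `kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"` |
| Preferred to run within a sub-block | This configuration admits your workload as long as enough resources are available to satisfy its resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a sub-block and uses additional sub-blocks if needed. In this case, Kueue ranks a sub-block with more available capacity higher than a sub-block with just enough capacity to satisfy the requirements. | `kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock"` |
| Required to run within a block | This configuration admits your workload only if the resources available within a block satisfy its resource requirements. If the workload is admitted, Kueue minimizes the number of sub-blocks and hosts that it uses to schedule the workload. This might fragment the available capacity. | `kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"` |
| Preferred to run within a block | This configuration admits your workload as long as enough resources are available to satisfy its resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a block and uses additional blocks if needed. | `kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block"` |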
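As an illustration of how the request type and request level combine into a single annotation, the following sketch fills in the Job template from this section. The helper name `render_tas_job` and its argument order are assumptions made for this example. The manifest body is the Job from this section, with the placeholders replaced by the function arguments.

```shell
# Usage: render_tas_job JOB_NAME_PREFIX REPLICAS required|preferred TOPOLOGY_LEVEL
# Hypothetical helper that prints the Job manifest from this section with the
# placeholders filled in. It builds the annotation key from the request type.
render_tas_job() {
  case "$3" in
    required|preferred) ;;
    *) echo "topology request type must be 'required' or 'preferred'" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: $1
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: $2
  completions: $2
  completionMode: Indexed
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-$3-topology: "$4"
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["60s"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: Never
EOF
}
```

For example, `render_tas_job tas-job- 4 required cloud.google.com/gce-topology-block | kubectl create -f -` would submit a four-Pod Job that must fit within one block. Because the manifest uses generateName, it has to be submitted with kubectl create rather than kubectl apply.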

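Schedule workloads by using PodGroups with TAS through Kueue

When you use PodGroups, you must specify three additional fields for every Pod in the PodGroup:

Labels:
- kueue.x-k8s.io/pod-group-name: the name of the PodGroup, used for aggregation.
- kueue.x-k8s.io/pod-group-pod-index: the index of each Pod within the PodGroup.
Annotations:
- kueue.x-k8s.io/pod-group-total-count: the total number of Pods in the PodGroup.

Depending on the ML framework that you use, the leader of a PodGroup might or might not require a GPU. Because of a Kueue limitation, these two cases must be handled differently. The following examples show how to create a PodGroup of three Pods, with one leader and two workers.

Case 1: The leader is also a worker and requires a GPU

If the leader is one of the workers and also requires a GPU, the leader can have any index within the PodGroup. For simplicity, the leader has index 0 in the following example: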

```
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-leader-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "0"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  containers:
  - name: leader
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-1-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "1"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-2-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "2"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
```

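Case 2: The leader isn't a worker and doesn't require a GPU

If the leader isn't one of the workers, a Kueue limitation requires the leader to have the last index in the PodGroup, because of how Kueue creates PodSets. If the leader doesn't have the last index and the first worker doesn't use the first index, Kueue doesn't apply rank assignments.

See the following example: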

```
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-leader-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "2"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  containers:
  - name: leader
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "1"
  restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-0-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "0"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-1-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "1"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
```
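The index rule for a leader that isn't a worker (workers take indexes starting at 0, and the leader takes the last index) can be applied mechanically for any number of workers. The following is a sketch under stated assumptions: the function name `render_podgroup` is hypothetical, the Pod spec is copied from the Case 2 example, and the block-level required topology is hard-coded for brevity.

```shell
# Usage: render_podgroup GROUP_NAME WORKER_COUNT
# Hypothetical helper: prints WORKER_COUNT GPU worker Pods with indexes
# 0..WORKER_COUNT-1, followed by one CPU-only leader Pod with the last index.
render_podgroup() {
  group=$1
  workers=$2
  total=$((workers + 1))
  index=0
  while [ "$index" -le "$workers" ]; do
    if [ "$index" -lt "$workers" ]; then
      role=worker; resource=nvidia.com/gpu
    else
      role=leader; resource=cpu
    fi
    cat <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  generateName: $group-$role-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "$group"
    kueue.x-k8s.io/pod-group-pod-index: "$index"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "$total"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: $role
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        $resource: "1"
      limits:
        $resource: "1"
EOF
    index=$((index + 1))
  done
}
```

For example, `render_podgroup tas-podgroup-example-group2 2 | kubectl create -f -` would produce a group equivalent to the Case 2 example: two workers with indexes 0 and 1, and a leader with index 2.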

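What's next

- To learn how to enable node health prediction in your GKE cluster, see Enable node health prediction.
- To learn how to manage common events relevant to GKE clusters and AI workloads, see Manage AI-optimized GKE clusters.
- To learn more about scheduling Jobs on GKE with Kueue, see Deploy a batch system using Kueue.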
