For release information, see the Distributed Cloud connected release notes.

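Manage GPU workloads

This page describes how to enable and manage graphics processing unit (GPU) workloads on Google Distributed Cloud. To use this functionality, you must have a Distributed Cloud hardware configuration that includes GPUs. GPU support is disabled by default. You must explicitly enable GPU support on your Distributed Cloud cluster.

To plan and order such a configuration, select Configuration 2 in the relevant Distributed Cloud planning and ordering documents.

If your Distributed Cloud rack includes GPUs, you can configure your Distributed Cloud workloads to use GPU resources.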

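Distributed Cloud workloads can run in containers and on virtual machines:

- GPU workloads running in containers. After you enable GPU support, all GPU resources on the Distributed Cloud cluster are initially allocated to workloads running in containers. Distributed Cloud includes the GPU driver for running GPU-based containerized workloads. In each container, the GPU libraries are mounted at /opt/nvidia.
- GPU workloads running on virtual machines. To run GPU-based workloads on virtual machines, you must allocate GPU resources to virtual machines on the target Distributed Cloud node, as described later on this page. Doing so bypasses the built-in GPU driver and passes the GPUs directly through to the virtual machines. You must manually install a compatible GPU driver in the guest operating system of each virtual machine. You must also obtain all licenses required to run proprietary GPU drivers on your virtual machines.

To check whether a Distributed Cloud node has GPUs, verify that the node has the vm.cluster.gke.io.gpu=true label. If the label is not present on the node, no GPUs are installed in the corresponding Distributed Cloud physical machine.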
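
As a minimal sketch, the label check described above can be done with a standard kubectl label selector. The function name `list_gpu_nodes` is illustrative and is not a Distributed Cloud command; the label itself is the one given above.

```shell
# List the nodes that carry the GPU label described above.
# list_gpu_nodes is a hypothetical helper name, not part of Distributed Cloud.
list_gpu_nodes() {
  kubectl get nodes --selector 'vm.cluster.gke.io.gpu=true' --output name
}

# Usage: prints one "node/NAME" line for each labeled node.
# list_gpu_nodes
```

If the command prints no node names, none of the nodes visible to your kubectl context carries the label.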

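Enable GPU support

To enable GPU support for your workloads, create or modify a VMRuntime custom resource so that it contains the enableGPU parameter set to true, and then apply the resource to your Distributed Cloud cluster. For example: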

apiVersion: vm.cluster.gke.io/v1
kind: VMRuntime
metadata:
  name: vmruntime
spec:
  # Enable GPU support
  enableGPU: true

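Depending on the type of cluster on which you want to enable the VM Runtime on GDC virtual machine subsystem, do one of the following:

- For Cloud control plane clusters on which the VM Runtime on GDC virtual machine subsystem has not yet been enabled, you must manually create the VMRuntime resource.
- For Cloud control plane clusters on which the VM Runtime on GDC virtual machine subsystem is already enabled, you must edit the existing VMRuntime resource.
- For local control plane clusters, you must edit the existing VMRuntime resource.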
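
For the case where the VMRuntime resource has to be created manually, the following sketch writes the example manifest to a file and applies it with `kubectl apply`. The function name `apply_gpu_vmruntime` and the default file name are assumptions made for illustration. The manifest sets only enableGPU, so on a cluster that already has a VMRuntime resource, review the existing spec before applying anything.

```shell
# Write the example VMRuntime manifest to a file and apply it to the cluster.
# apply_gpu_vmruntime and the default file name are hypothetical.
apply_gpu_vmruntime() {
  manifest="${1:-vmruntime-gpu.yaml}"
  cat > "$manifest" <<'EOF'
apiVersion: vm.cluster.gke.io/v1
kind: VMRuntime
metadata:
  name: vmruntime
spec:
  # Enable GPU support
  enableGPU: true
EOF
  kubectl apply --filename "$manifest"
}

# Usage:
# apply_gpu_vmruntime vmruntime-gpu.yaml
```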

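This VMRuntime resource also uses the enable parameter to configure VM Runtime on GDC support on the cluster. Set both parameters according to the requirements of your workloads. You do not need to enable VM Runtime on GDC support in order to enable GPU support on your Distributed Cloud cluster.

The following table describes the available configurations.

`enable` value	`enableGPU` value	Resulting configuration
`false`	`false`	Workloads run only in containers and cannot use GPU resources.
`false`	`true`	Workloads run only in containers and can use GPU resources.
`true`	`true`	Workloads can run on virtual machines and in containers. Both types of workloads can use GPU resources.
`true`	`false`	Workloads can run on virtual machines and in containers. Neither type of workload can use GPU resources.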
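
The table can also be read as two independent switches: enable decides where workloads can run, and enableGPU decides whether they can use GPUs. The following sketch encodes that reading in a small lookup function. The name `describe_vmruntime_config` and the output wording are illustrative only.

```shell
# Print the workload capabilities for a pair of VMRuntime settings,
# mirroring the table above. describe_vmruntime_config is a hypothetical name.
describe_vmruntime_config() {
  enable="$1"; enable_gpu="$2"
  case "$enable" in
    true)  where="containers and virtual machines" ;;
    false) where="containers only" ;;
    *) echo "enable must be true or false" >&2; return 1 ;;
  esac
  case "$enable_gpu" in
    true)  gpu="GPU resources available" ;;
    false) gpu="no GPU resources" ;;
    *) echo "enableGPU must be true or false" >&2; return 1 ;;
  esac
  echo "$where; $gpu"
}

# Usage:
# describe_vmruntime_config false true
```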

Verify that GPU support is enabled

To verify that GPU support is enabled on your cluster, use the following command:

kubectl get pods --namespace vm-system

The command returns output similar to the following example:

NAME                                         READY   STATUS     RESTARTS  AGE
...
gpu-controller-controller-manager-vbv4w      2/2     Running    0         31h
kubevirt-gpu-dp-daemonset-gxj7g              1/1     Running    0         31h
nvidia-gpu-dp-daemonset-bq2vj                1/1     Running    0         31h
...

In the output, confirm that the GPU controller Pods have been deployed and are running in the vm-system namespace.
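
The same check can be scripted. The following sketch succeeds only if at least one Pod for each of the three name prefixes from the sample output is listed and none of them is in a state other than Running. The function name `gpu_pods_running` is hypothetical, and the prefixes are taken from the sample output, so they might differ on your cluster.

```shell
# Succeed only if the GPU-related Pods in vm-system are all Running.
# gpu_pods_running is a hypothetical helper; the name prefixes come from the
# sample output and are an assumption about what your cluster shows.
gpu_pods_running() {
  kubectl get pods --namespace vm-system --no-headers |
    awk '
      /^gpu-controller/  { ctrl = 1 }
      /^kubevirt-gpu-dp/ { kv = 1 }
      /^nvidia-gpu-dp/   { nv = 1 }
      # Column 3 of the listing is the Pod STATUS.
      /^(gpu-controller|kubevirt-gpu-dp|nvidia-gpu-dp)/ && $3 != "Running" { bad = 1 }
      END { exit !(ctrl && kv && nv && !bad) }
    '
}

# Usage:
# gpu_pods_running && echo "GPU support looks enabled"
```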

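Allocate GPU resources

By default, when you enable GPU support on a Distributed Cloud cluster, all GPU resources on each node in the cluster are allocated to containerized workloads. To customize the GPU resource allocation for each node, complete the steps in this section.

Configure GPU resource allocation

To allocate GPU resources on a Distributed Cloud node, use the following command to edit the GPUAllocation custom resource on the target node: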
```
kubectl edit gpuallocation NODE_NAME --namespace vm-system
```
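Replace NODE_NAME with the name of the target Distributed Cloud node.

The following example shows the factory-default GPU resource allocation as displayed by the command. By default, all GPU resources are allocated to containerized (pod) workloads, and none are allocated to virtual machine (vm) workloads: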
```
...
spec:
  pod:   2  # Number of GPUs allocated for container workloads
  vm:    0  # Number of GPUs allocated for VM workloads
```
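Set the GPU resource allocation as follows:
- To allocate GPU resources to containerized workloads, increase the value of the pod field and decrease the value of the vm field by the same amount.
- To allocate GPU resources to virtual machine workloads, increase the value of the vm field and decrease the value of the pod field by the same amount.
The total number of allocated GPU resources must not exceed the number of GPUs installed in the physical Distributed Cloud machine on which the node runs. Otherwise, the node rejects the invalid allocation.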
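
Before editing the resource, you can sanity-check a proposed split locally. The following sketch only restates the rule above (pod plus vm must not exceed the installed GPU count). The function name `valid_gpu_split` is hypothetical, and the node itself remains the authority on whether an allocation is accepted.

```shell
# Check a proposed pod/vm split against the number of GPUs installed in the node.
# valid_gpu_split is a hypothetical local pre-check, not a Distributed Cloud tool.
# Arguments: POD_COUNT VM_COUNT INSTALLED_GPUS (non-negative integers).
valid_gpu_split() {
  pod="$1"; vm="$2"; installed="$3"
  [ "$pod" -ge 0 ] && [ "$vm" -ge 0 ] && [ $((pod + vm)) -le "$installed" ]
}

# Usage:
# valid_gpu_split 0 2 2 && echo "split fits the installed GPUs"
```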

In the following example, two GPU resources have been reallocated from containerized (pod) workloads to virtual machine (vm) workloads:
```
...
spec:
  pod:   0  # Number of GPUs allocated for container workloads
  vm:    2  # Number of GPUs allocated for VM workloads
```
When you are finished, apply the modified GPUAllocation resource to your cluster and wait for its status to change to AllocationFulfilled.
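
Waiting for the status can be scripted as a simple polling loop around the `kubectl describe gpuallocations` command used on this page. The function name `wait_for_gpu_allocation`, the retry count, and the interval are assumptions. The sketch only looks for the AllocationFulfilled reason in the output and does not compare the observed generation, so run it after your change has been saved.

```shell
# Poll the GPUAllocation resource until it reports AllocationFulfilled.
# wait_for_gpu_allocation, the retry count, and the interval are hypothetical.
# Arguments: NODE_NAME [TRIES] [INTERVAL_SECONDS]
wait_for_gpu_allocation() {
  node="$1"; tries="${2:-30}"; interval="${3:-10}"
  while [ "$tries" -gt 0 ]; do
    if kubectl describe gpuallocations "$node" --namespace vm-system |
         grep -q 'Reason:[[:space:]]*AllocationFulfilled'; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$interval"
  done
  # The reason never appeared within the allowed number of tries.
  return 1
}

# Usage:
# wait_for_gpu_allocation NODE_NAME 30 10
```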

Check GPU resource allocation

To check the GPU resource allocation, use the following command:

kubectl describe gpuallocations NODE_NAME --namespace vm-system

Replace NODE_NAME with the name of the target Distributed Cloud node.

The command returns output similar to the following example:

 Name:         mynode1
 ...
 spec:
   node:  mynode1
   pod:   2  # Number of GPUs allocated for container workloads
   vm:    0  # Number of GPUs allocated for VM workloads
 Status:
   Allocated:  true
   Conditions:
     Last Transition Time:  2022-09-23T03:14:10Z
     Message:
     Observed Generation:   1
     Reason:                AllocationFulfilled
     Status:                True
     Type:                  AllocationStatus
     Last Transition Time:  2022-09-23T03:14:16Z
     Message:
     Observed Generation:   1
     Reason:                DeviceStateUpdated
     Status:                True
     Type:                  DeviceStateUpdated
   Consumption:
     pod:         0/2   # Number of GPUs currently consumed by container workloads
     vm:          0/0   # Number of GPUs currently consumed by VM workloads
   Device Model:  Tesla T4
 Events:          <none>
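
To pull only the consumption figures out of that output, a small awk filter is enough. The following sketch assumes the layout of the sample above, with a Consumption section followed by a Device Model line. The function name `gpu_consumption` is hypothetical, and field names are matched case-insensitively in case your kubectl version capitalizes them.

```shell
# Print the consumed/allocated GPU counts from the Consumption section.
# gpu_consumption is a hypothetical helper that parses the describe output;
# it assumes the section layout shown in the sample above.
gpu_consumption() {
  kubectl describe gpuallocations "$1" --namespace vm-system |
    awk '
      /^ *Consumption:/ { in_section = 1; next }
      in_section && tolower($1) == "pod:" { print "pod " $2 }
      in_section && tolower($1) == "vm:"  { print "vm " $2 }
      in_section && /Device Model:/ { in_section = 0 }
    '
}

# Usage:
# gpu_consumption NODE_NAME
```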

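Configure a container to use GPU resources

To configure a container running on Distributed Cloud to use GPU resources, configure its specification as shown in the following example, and then apply it to your cluster: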

  apiVersion: v1
  kind: Pod
  metadata:
    name: my-gpu-pod
  spec:
    containers:
    - name: my-gpu-container
      image: CUDA_TOOLKIT_IMAGE
      # Keep the container alive so that you can exec into it
      command: ["/bin/bash", "-c", "--"]
      args: ["while true; do sleep 600; done;"]
      env:
      resources:
        # Requests and limits for an extended resource must be equal
        requests:
          nvidia.com/gpu-pod-TESLA_T4: 2
        limits:
          nvidia.com/gpu-pod-TESLA_T4: 2
    # Schedule the Pod on the target GPU node
    nodeSelector:
      kubernetes.io/hostname: NODE_NAME

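Replace the following:

- CUDA_TOOLKIT_IMAGE: the full path and name of the NVIDIA CUDA toolkit image. The CUDA toolkit version must match the version of the NVIDIA driver running on your Distributed Cloud cluster. To determine the NVIDIA driver version, see the Distributed Cloud release notes. To find the matching CUDA toolkit version, see CUDA Compatibility.
- NODE_NAME: the name of the target Distributed Cloud node.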
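
Once the placeholders are filled in, applying the specification and checking the result might look like the following sketch. The function name `check_gpu_pod`, the manifest file name, and the 120-second timeout are assumptions. The Pod name my-gpu-pod comes from the example, and /opt/nvidia is the library mount path stated earlier on this page.

```shell
# Apply the Pod manifest, wait for the Pod to become Ready, and list the GPU
# libraries mounted in the container.
# check_gpu_pod, the default file name, and the timeout are hypothetical.
check_gpu_pod() {
  manifest="${1:-my-gpu-pod.yaml}"
  kubectl apply --filename "$manifest" &&
    kubectl wait --for=condition=Ready pod/my-gpu-pod --timeout=120s &&
    kubectl exec my-gpu-pod -- ls /opt/nvidia
}

# Usage:
# check_gpu_pod my-gpu-pod.yaml
```

If the final listing is empty or the command fails, GPU support might not be enabled on the cluster, or the requested GPU resources might not be allocated to containerized workloads on that node.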

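Configure a virtual machine to use GPU resources

To configure a virtual machine running on Distributed Cloud to use GPU resources, configure its VirtualMachine resource specification as shown in the following example, and then apply it to your cluster: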

apiVersion: vm.cluster.gke.io/v1
kind: VirtualMachine
...
spec:
  ...
  gpu:
    model: nvidia.com/gpu-vm-TESLA_T4
    quantity: 2