Optimize GKE resource utilization for mixed AI/ML training and inference workloads

Standard Autopilot

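This tutorial shows you how to efficiently share accelerator resources between training and inference serving workloads in a single Google Kubernetes Engine (GKE) cluster. By distributing your mixed workloads across a single cluster, you improve resource utilization, simplify cluster management, reduce problems caused by accelerator quantity limits, and improve overall cost-effectiveness.

In this tutorial, you create a high-priority serving Deployment that uses the Gemma 2 large language model (LLM) for inference and the Hugging Face TGI (Text Generation Inference) serving framework, along with a low-priority LLM fine-tuning Job. Both workloads run on a single cluster that uses NVIDIA L4 GPUs. You use Kueue, an open source Kubernetes-native job queueing system, to manage and schedule your workloads. Kueue lets you prioritize serving tasks and preempt lower-priority training Jobs to optimize resource utilization. As serving demand decreases, you reallocate the freed-up accelerators to resume the training Jobs. You use Kueue and priority classes to manage resource quotas throughout the process.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to train and host an ML model on a GKE cluster, and who also want to reduce costs and management overhead, especially when dealing with a limited number of accelerators. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before reading this page, make sure that you're familiar with the following: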

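Objectives

By the end of this guide, you should be able to perform the following steps:

Configure a high-priority serving Deployment.
Set up lower-priority training Jobs.
Implement preemption strategies to address varying demand.
Manage resource allocation between training and serving tasks by using Kueue.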

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
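Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.

If you don't already have a Hugging Face account, create one.
Ensure that your project has sufficient quota for L4 GPUs. To learn more, see About GPUs and Allocation quotas.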

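Prepare the environment

In this section, you provision the resources that you need to deploy TGI and the model for your inference and training workloads.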

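Get access to the model

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

Sign the license consent agreement. Go to the model consent page, verify consent by using your Hugging Face account, and accept the model terms.
Generate an access token. To access the model through Hugging Face, you need a Hugging Face token. If you don't already have one, follow these steps to generate a new token:
1. Click Your Profile > Settings > Access Tokens.
2. Select New Token.
3. Specify a name of your choice and a role of at least Read.
4. Select Generate a token.
5. Copy the generated token to your clipboard.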

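Launch Cloud Shell

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl, the gcloud CLI, and Terraform.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.

Set the default environment variables: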

gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)

Replace PROJECT_ID with your Google Cloud project ID.

Clone the sample code from GitHub. In Cloud Shell, run the following commands:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/
cd kubernetes-engine-samples/ai-ml/mix-train-and-inference
export EXAMPLE_HOME=$(pwd)

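Create a GKE cluster

You can use an Autopilot or a Standard cluster for your mixed workloads. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.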

Autopilot

Set the default environment variables in Cloud Shell:
```
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
```
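Replace the following values:
- HF_TOKEN: the Hugging Face token that you generated earlier.
- REGION: a region that supports the accelerator type that you want to use, for example, us-central1 for L4 GPUs.
You can adjust the MODEL_BUCKET variable, which represents the Cloud Storage bucket where the trained model weights are stored.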

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --release-channel=rapid

Create the Cloud Storage bucket for the fine-tuning Job:

gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access

To grant access to the Cloud Storage bucket, run the following command:

gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None

To get authentication credentials for the cluster, run the following command:

gcloud container clusters get-credentials llm-cluster \
    --location=$REGION \
    --project=$PROJECT_ID

Create a namespace for your Deployments. In Cloud Shell, run the following command:
```
kubectl create ns llm
```

Standard

Set the default environment variables in Cloud Shell:
```
export HF_TOKEN=HF_TOKEN
export REGION=REGION
export CLUSTER_NAME="llm-cluster"
export GPU_POOL_MACHINE_TYPE="g2-standard-24"
export GPU_POOL_ACCELERATOR_TYPE="nvidia-l4"
export PROJECT_NUMBER=$(gcloud projects list \
    --filter="$(gcloud config get-value project)" \
    --format="value(PROJECT_NUMBER)")
export MODEL_BUCKET="model-bucket-$PROJECT_ID"
```
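Replace the following values:
- HF_TOKEN: the Hugging Face token that you generated earlier.
- REGION: a region that supports the accelerator type that you want to use, for example, us-central1 for L4 GPUs.
You can adjust the following variables:
- GPU_POOL_MACHINE_TYPE: the node pool machine type that you want to use in your selected region. This value depends on the accelerator type that you selected. To learn more, see Limitations of using GPUs on GKE. For example, this tutorial uses g2-standard-24, which has two GPUs attached per node. For the most up-to-date list of available GPUs, see GPUs for Compute Workloads.
- GPU_POOL_ACCELERATOR_TYPE: the accelerator type that's supported in your selected region. For example, this tutorial uses nvidia-l4. For the most up-to-date list of available GPUs, see GPUs for Compute Workloads.
- MODEL_BUCKET: the Cloud Storage bucket where the trained model weights are stored.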

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --release-channel=rapid \
    --machine-type=e2-standard-4 \
    --addons GcsFuseCsiDriver \
    --num-nodes=1

Create the GPU node pool for the inference and fine-tuning workloads:

gcloud container node-pools create gpupool \
    --accelerator type=${GPU_POOL_ACCELERATOR_TYPE},count=2,gpu-driver-version=latest \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --node-locations=${REGION}-a \
    --cluster=${CLUSTER_NAME} \
    --machine-type=${GPU_POOL_MACHINE_TYPE} \
    --num-nodes=3

Create the Cloud Storage bucket for the fine-tuning Job:

gcloud storage buckets create gs://${MODEL_BUCKET} \
    --location ${REGION} \
    --uniform-bucket-level-access

To grant access to the Cloud Storage bucket, run the following command:

gcloud storage buckets add-iam-policy-binding "gs://$MODEL_BUCKET" \
    --role=roles/storage.objectAdmin \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/llm/sa/default \
    --condition=None

To get authentication credentials for the cluster, run the following command:

gcloud container clusters get-credentials llm-cluster \
    --location=$REGION \
    --project=$PROJECT_ID

Create a namespace for your Deployments. In Cloud Shell, run the following command:
```
kubectl create ns llm
```

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=$HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply --namespace=llm --filename=-

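Configure Kueue

In this tutorial, Kueue is the central resource manager that enables efficient GPU sharing between your training and serving workloads. Kueue achieves this by defining resource requirements ("flavors"), prioritizing workloads through queues (with serving tasks prioritized over training), and dynamically allocating resources based on demand and priority. This tutorial uses the Workload resource type to group the inference and fine-tuning workloads, respectively.

Kueue's preemption feature ensures that high-priority serving workloads always have the necessary resources by pausing or evicting lower-priority training Jobs when resources are scarce.

To control the inference server Deployment with Kueue, you enable the pod integration and configure managedJobsNamespaceSelector to exclude the kube-system and kueue-system namespaces.

In the /kueue directory, view the code in kustomization.yaml. This manifest installs the Kueue resource manager with custom configurations.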

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/kubernetes-sigs/kueue/releases/download/v0.12.3/manifests.yaml
patches:
- path: patch.yaml
  target:
    version: v1
    kind: ConfigMap
    name: kueue-manager-config

In the /kueue directory, view the code in patch.yaml. This ConfigMap customizes Kueue to exclude management of Pods in the kube-system and kueue-system namespaces.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: :8080
    # enableClusterQueueResources: true
    webhook:
      port: 9443
    leaderElection:
      leaderElect: true
      resourceName: c1f6bfd2.kueue.x-k8s.io
    controller:
      groupKindConcurrency:
        Job.batch: 5
        Pod: 5
        Workload.kueue.x-k8s.io: 5
        LocalQueue.kueue.x-k8s.io: 1
        ClusterQueue.kueue.x-k8s.io: 1
        ResourceFlavor.kueue.x-k8s.io: 1
    clientConnection:
      qps: 50
      burst: 100
    #pprofBindAddress: :8083
    #waitForPodsReady:
    #  enable: false
    #  timeout: 5m
    #  blockAdmission: false
    #  requeuingStrategy:
    #    timestamp: Eviction
    #    backoffLimitCount: null # null indicates infinite requeuing
    #    backoffBaseSeconds: 60
    #    backoffMaxSeconds: 3600
    #manageJobsWithoutQueueName: true
    managedJobsNamespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [ kube-system, kueue-system ]
    #internalCertManagement:
    #  enable: false
    #  webhookServiceName: ""
    #  webhookSecretName: ""
    integrations:
      frameworks:
      - "batch/job"
      - "kubeflow.org/mpijob"
      - "ray.io/rayjob"
      - "ray.io/raycluster"
      - "jobset.x-k8s.io/jobset"
      - "kubeflow.org/paddlejob"
      - "kubeflow.org/pytorchjob"
      - "kubeflow.org/tfjob"
      - "kubeflow.org/xgboostjob"
      - "kubeflow.org/jaxjob"
      - "workload.codeflare.dev/appwrapper"
      - "pod"
    #  - "deployment" # requires enabling pod integration
    #  - "statefulset" # requires enabling pod integration
    #  - "leaderworkerset.x-k8s.io/leaderworkerset" # requires enabling pod integration
    #  externalFrameworks:
    #  - "Foo.v1.example.com"
    #fairSharing:
    #  enable: true
    #  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
    #admissionFairSharing:
    #  usageHalfLifeTime: "168h" # 7 days
    #  usageSamplingInterval: "5m"
    #  resourceWeights: # optional, defaults to 1 for all resources if not specified
    #    cpu: 0    # if you want to completely ignore cpu usage
    #    memory: 0 # ignore completely memory usage
    #    example.com/gpu: 100 # and you care only about GPUs usage
    #resources:
    #  excludeResourcePrefixes: []
    #  transformations:
    #  - input: nvidia.com/mig-4g.5gb
    #    strategy: Replace | Retain
    #    outputs:
    #      example.com/accelerator-memory: 5Gi
    #      example.com/accelerator-gpc: 4
    #objectRetentionPolicies:
    #  workloads:
    #    afterFinished: null # null indicates infinite retention, 0s means no retention at all
    #    afterDeactivatedByKueue: null # null indicates infinite retention, 0s means no retention at all
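
To make the effect of the managedJobsNamespaceSelector setting concrete, the following is a minimal Python sketch of how a single NotIn match expression evaluates against namespace labels. The helper name namespace_is_managed is hypothetical and is not part of Kueue; the sketch only mirrors standard Kubernetes label selector semantics, in which NotIn also matches objects that don't carry the key at all.

```python
# Hypothetical helper (not a Kueue API): mimics one NotIn match expression
# evaluated against the labels of a namespace.
def namespace_is_managed(labels,
                         key="kubernetes.io/metadata.name",
                         excluded=("kube-system", "kueue-system")):
    """Return True when the namespace is selected by the NotIn expression."""
    # With NotIn, a namespace without the key is also selected,
    # because None is not one of the excluded values.
    return labels.get(key) not in excluded


# Evaluate the selector for the namespaces that appear in this tutorial.
managed = {
    ns: namespace_is_managed({"kubernetes.io/metadata.name": ns})
    for ns in ("llm", "kube-system", "kueue-system")
}
print(managed)
```

Under these assumptions, Pods in the llm namespace are candidates for Kueue management, while Pods in the two system namespaces are left alone.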

In Cloud Shell, run the following commands to install Kueue:

cd ${EXAMPLE_HOME}
kubectl kustomize kueue |kubectl apply --server-side --filename=-

Wait until the Kueue Pods are ready:

watch kubectl --namespace=kueue-system get pods

The output should look similar to the following:

NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-bdc956fc4-vhcmx    1/1     Running   0          3m15s

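In the /workloads directory, view the flavors.yaml, cluster-queue.yaml, and local-queue.yaml files. These manifests specify how Kueue manages resource quotas:

ResourceFlavor

This manifest defines a default ResourceFlavor in Kueue for resource management.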

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor

ClusterQueue

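This manifest sets up a Kueue ClusterQueue with resource limits for CPU, memory, and GPU.

This tutorial uses nodes with two NVIDIA L4 GPUs attached. The corresponding node type is g2-standard-24, which offers 24 vCPUs and 96 GB of RAM. The example code shows how to limit your workloads' resource usage to a maximum of six GPUs.

The preemption field in the ClusterQueue configuration references the PriorityClasses to determine which Pods can be preempted when resources are scarce.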

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: [ "cpu", "memory", "nvidia.com/gpu", "ephemeral-storage" ]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 72
      - name: "memory"
        nominalQuota: 288Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 6
      - name: "ephemeral-storage"
        nominalQuota: 200Gi
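
The CPU, memory, and GPU quotas in this manifest line up with three g2-standard-24 nodes, which is the node count that the Standard node pool command in this tutorial requests. The following Python sketch reproduces that arithmetic. The variable names are illustrative, and treating the 96 GB of RAM per node as 96Gi is an assumption made to match the manifest.

```python
# Illustrative arithmetic only; the names below are not part of any API.
NODES = 3  # --num-nodes=3 in the Standard GPU node pool

# Per-node capacity of g2-standard-24 as described in this tutorial.
PER_NODE = {"cpu": 24, "memory_gib": 96, "nvidia.com/gpu": 2}

# The nominal quota is the per-node capacity multiplied by the node count.
quota = {name: amount * NODES for name, amount in PER_NODE.items()}
print(quota)
```

If you change the node count or the machine type, you would likely adjust the nominalQuota values in cluster-queue.yaml in the same proportion.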

LocalQueue

This manifest creates a Kueue LocalQueue named lq in the llm namespace.

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: llm # LocalQueue under llm namespace 
  name: lq
spec:
  clusterQueue: cluster-queue # Point to the ClusterQueue

View the default-priorityclass.yaml, low-priorityclass.yaml, and high-priorityclass.yaml files. These manifests define the PriorityClass objects for Kubernetes scheduling.

Default priority

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority-nonpreempting
value: 10
preemptionPolicy: Never
globalDefault: true
description: "This priority class will not cause other pods to be preempted."

Low priority

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-preempting
value: 20
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This priority class will cause pods with lower priority to be preempted."

High priority

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-preempting
value: 30
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This high priority class will cause other pods to be preempted."
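
The relationship between the three PriorityClass objects can be summarized with a small Python model. This is a deliberately simplified sketch of the value and preemptionPolicy fields only; the real Kubernetes scheduler and Kueue take more factors into account, and the PriorityClass dataclass and may_preempt function here are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityClass:
    name: str
    value: int
    preemption_policy: str


def may_preempt(preemptor, victim):
    """Simplified rule: only a preempting class with a strictly higher value wins."""
    return (preemptor.preemption_policy == "PreemptLowerPriority"
            and preemptor.value > victim.value)


# Values copied from the three manifests in this section.
default = PriorityClass("default-priority-nonpreempting", 10, "Never")
low = PriorityClass("low-priority-preempting", 20, "PreemptLowerPriority")
high = PriorityClass("high-priority-preempting", 30, "PreemptLowerPriority")

print(may_preempt(high, low), may_preempt(low, high), may_preempt(default, low))
```

In this model, the serving Deployment (high priority) can displace the fine-tuning Job (low priority), but never the other way around.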

Create the Kueue and Kubernetes objects by running the following commands to apply the corresponding manifests.

cd ${EXAMPLE_HOME}/workloads
kubectl apply --filename=flavors.yaml
kubectl apply --filename=default-priorityclass.yaml
kubectl apply --filename=high-priorityclass.yaml
kubectl apply --filename=low-priorityclass.yaml
kubectl apply --filename=cluster-queue.yaml
kubectl apply --filename=local-queue.yaml --namespace=llm

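Deploy the TGI inference server

In this section, you deploy the TGI container to serve the Gemma 2 model.

In the /workloads directory, view the tgi-gemma-2-9b-it-hp.yaml file. This manifest defines a Kubernetes Deployment that deploys the TGI serving runtime and the gemma-2-9B-it model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods distributed among the nodes in a cluster.

The Deployment prioritizes inference tasks and uses two GPUs for the model. It uses tensor parallelism, by setting the NUM_SHARD environment variable, to fit the model into GPU memory.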

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-deployment
  labels:
    app: gemma-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2-9b-it
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: user-guide
        kueue.x-k8s.io/queue-name: lq
    spec:
      priorityClassName: high-priority-preempting
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
        resources:
          requests:
            cpu: "4"
            memory: "30Gi"
            ephemeral-storage: "30Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "30Gi"
            ephemeral-storage: "30Gi"
            nvidia.com/gpu: "2"
        env:
        - name: AIP_HTTP_PORT
          value: '8000'
        - name: NUM_SHARD
          value: '2'
        - name: MODEL_ID
          value: google/gemma-2-9b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-l4"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
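
As a rough illustration of why the Deployment shards the model across two GPUs, the following Python sketch estimates the weight memory per GPU. The parameter count (about 9 billion) and the 2 bytes per parameter (16-bit weights) are assumptions for illustration, not values taken from the manifest. The 24 GB figure is the memory size of an NVIDIA L4. The estimate ignores the KV cache, activations, and runtime overhead, which also consume GPU memory.

```python
# Back-of-the-envelope estimate; PARAMS and BYTES_PER_PARAM are assumptions.
PARAMS = 9e9            # assumed parameter count for a 9B model
BYTES_PER_PARAM = 2     # assumed 16-bit weights
L4_MEMORY_GB = 24       # memory of one NVIDIA L4 GPU
NUM_SHARD = 2           # value of the NUM_SHARD environment variable

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # total weight size
per_gpu_gb = weights_gb / NUM_SHARD           # weights held by each shard
headroom_gb = L4_MEMORY_GB - per_gpu_gb       # left for cache and activations

print(weights_gb, per_gpu_gb, headroom_gb)
```

Under these assumptions, sharding roughly halves the weight footprint on each GPU and leaves more room for serving-time memory.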

Apply the manifest by running the following command:
```
kubectl apply --filename=tgi-gemma-2-9b-it-hp.yaml --namespace=llm
```
The deployment operation takes a few minutes to complete.

To check whether GKE successfully created the Deployment, run the following command:

kubectl --namespace=llm get deployment

The output should look similar to the following:

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
tgi-gemma-deployment   1/1     1            1           5m13s

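Verify Kueue quota management

In this section, you confirm that Kueue correctly enforces the GPU quota for your Deployment.

To check whether Kueue is aware of your Deployment, run the following command to retrieve the status of the Workload objects: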

kubectl --namespace=llm get workloads

The output should look similar to the following:

NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6bf9ffdc9b-zcfrh-84f19   lq      cluster-queue   True                  8m23s

To test what happens when you exceed the quota limits, scale the Deployment to four replicas:

kubectl scale --replicas=4 deployment/tgi-gemma-deployment --namespace=llm

Run the following command to see how many replicas Kueue admits:

kubectl get workloads --namespace=llm

The output should look similar to the following:

NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
pod-tgi-gemma-deployment-6cb95cc7f5-5thgr-3f7d4   lq      cluster-queue   True                  14s
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  5m41s
pod-tgi-gemma-deployment-6cb95cc7f5-tznkl-80f6b   lq                                            13s
pod-tgi-gemma-deployment-6cb95cc7f5-wd4q9-e4302   lq      cluster-queue   True                  13s

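The output shows that only three Pods are admitted because of the resource quota that Kueue enforces.

Run the following command to display the Pods in the llm namespace: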

kubectl get pod --namespace=llm

The output should look similar to the following:

NAME                                    READY   STATUS            RESTARTS   AGE
tgi-gemma-deployment-7649884d64-6j256   1/1     Running           0          4m45s
tgi-gemma-deployment-7649884d64-drpvc   0/1     SchedulingGated   0          7s
tgi-gemma-deployment-7649884d64-thdkq   0/1     Pending           0          7s
tgi-gemma-deployment-7649884d64-znvpb   0/1     Pending           0          7s

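Now, scale the Deployment back down to one replica. This step is required before you deploy the fine-tuning Job; otherwise, the fine-tuning Job isn't admitted because the inference workload takes priority.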
```
kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm
```

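Explanation of the behavior

The scaling example results in only three admitted replicas (despite scaling to four) because of the GPU quota limit that you set in the ClusterQueue configuration. The spec.resourceGroups section of the ClusterQueue defines a nominalQuota of "6" for nvidia.com/gpu. The Deployment specifies that each Pod requires "2" GPUs. Therefore, the ClusterQueue can accommodate at most three replicas of the Deployment at a time (because 3 replicas * 2 GPUs per replica = 6 GPUs, which is the total quota).

When you attempt to scale to four replicas, Kueue recognizes that this action would exceed the GPU quota and prevents the fourth replica from being scheduled. This is indicated by the SchedulingGated status of the fourth Pod. This behavior demonstrates Kueue's resource quota enforcement.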
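
The quota arithmetic described above can be written as a short Python sketch. The function name admitted_replicas is hypothetical, and the sketch considers only the GPU quota, whereas Kueue also checks the CPU, memory, and ephemeral-storage quotas.

```python
def admitted_replicas(requested, gpus_per_replica, gpu_quota):
    """Number of replicas that fit within the GPU quota (GPU check only)."""
    return min(requested, gpu_quota // gpus_per_replica)


# Values from this tutorial: 4 requested replicas, 2 GPUs each, quota of 6.
admitted = admitted_replicas(requested=4, gpus_per_replica=2, gpu_quota=6)
gated = 4 - admitted

print(admitted, gated)
```

With the tutorial's values, three replicas fit and one replica stays gated until quota is freed.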

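Deploy the training Job

In this section, you deploy a lower-priority fine-tuning Job for a Gemma 2 model that requires four GPUs across two Pods. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.

This Job uses the remaining GPU quota in the ClusterQueue. The Job uses a prebuilt image and saves checkpoints so that it can restart from intermediate results.

The fine-tuning Job uses the b-mc2/sql-create-context dataset. You can find the source code for the fine-tuning Job in the repository.

View the fine-tune-l4.yaml file. This manifest defines the fine-tuning Job.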

apiVersion: v1
kind: Service
metadata:
  name: headless-svc-l4
spec:
  clusterIP: None # clusterIP must be None to create a headless service
  selector:
    job-name: finetune-gemma-l4 # must match Job name
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-gemma-l4
  labels:
    kueue.x-k8s.io/queue-name: lq
spec:
  backoffLimit: 4
  completions: 2
  parallelism: 2
  completionMode: Indexed
  suspend: true # Set to true to allow Kueue to control the Job when it starts
  template:
    metadata:
      labels:
        app: finetune-job
      annotations:
        gke-gcsfuse/volumes: "true"
        gke-gcsfuse/memory-limit: "35Gi"
    spec:
      priorityClassName: low-priority-preempting
      containers:
      - name: gpu-job
        imagePullPolicy: Always
        image: us-docker.pkg.dev/google-samples/containers/gke/gemma-fine-tuning:v1.0.0
        ports:
        - containerPort: 29500
        resources:
          requests:
            nvidia.com/gpu: "2"
          limits:
            nvidia.com/gpu: "2"
        command:
        - bash
        - -c
        - |
          accelerate launch \
          --config_file fsdp_config.yaml \
          --debug \
          --main_process_ip finetune-gemma-l4-0.headless-svc-l4 \
          --main_process_port 29500 \
          --machine_rank ${JOB_COMPLETION_INDEX} \
          --num_processes 4 \
          --num_machines 2 \
          fine_tune.py
        env:
        - name: "EXPERIMENT"
          value: "finetune-experiment"
        - name: MODEL_NAME
          value: "google/gemma-2-2b"
        - name: NEW_MODEL
          value: "gemma-ft"
        - name: MODEL_PATH
          value: "/model-data/model-gemma2/experiment"
        - name: DATASET_NAME
          value: "b-mc2/sql-create-context"
        - name: DATASET_LIMIT
          value: "5000"
        - name: EPOCHS
          value: "1"
        - name: GRADIENT_ACCUMULATION_STEPS
          value: "2"
        - name: CHECKPOINT_SAVE_STEPS
          value: "10"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: gcs-fuse-csi-ephemeral
          mountPath: /model-data
          readOnly: false
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      restartPolicy: OnFailure
      serviceAccountName: default
      subdomain: headless-svc-l4
      terminationGracePeriodSeconds: 60
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: gcs-fuse-csi-ephemeral
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: <MODEL_BUCKET>
            mountOptions: "implicit-dirs"
            gcsfuseLoggingSeverity: warning

Apply the manifest to create the fine-tuning Job:

cd ${EXAMPLE_HOME}/workloads

sed -e "s/<MODEL_BUCKET>/$MODEL_BUCKET/g" \
    -e "s/<PROJECT_ID>/$PROJECT_ID/g" \
    -e "s/<REGION>/$REGION/g" \
    fine-tune-l4.yaml |kubectl apply --filename=- --namespace=llm

Verify that your workloads are running. To check the status of the Workload objects, run the following command:

kubectl get workloads --namespace=llm

The output should look similar to the following:

NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  29m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  68m

Next, view the Pods in the llm namespace by running the following command:

kubectl get pod --namespace=llm

The output should look similar to the following:

NAME                                    READY   STATUS    RESTARTS   AGE
finetune-gemma-l4-0-vcxpz               2/2     Running   0          31m
finetune-gemma-l4-1-9ppt9               2/2     Running   0          31m
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running   0          70m

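The output shows that Kueue admits both the fine-tuning Job and the inference server Pods to run, reserving the correct resources based on the quota limits that you specified.

View the output logs to verify that the fine-tuning Job saves checkpoints to the Cloud Storage bucket. The fine-tuning Job takes around 10 minutes before it starts saving the first checkpoint.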

kubectl logs --namespace=llm --follow --selector=app=finetune-job

The output for the first saved checkpoint looks similar to the following:

{"name": "finetune", "thread": 133763559483200, "threadName": "MainThread", "processName": "MainProcess", "process": 33, "message": "Fine tuning started", "timestamp": 1731002351.0016131, "level": "INFO", "runtime": 451579.89835739136}
…
{"name": "accelerate.utils.fsdp_utils", "thread": 136658669348672, "threadName": "MainThread", "processName": "MainProcess", "process": 32, "message": "Saving model to /model-data/model-gemma2/experiment/checkpoint-10/pytorch_model_fsdp_0", "timestamp": 1731002386.1763802, "level": "INFO", "runtime": 486753.8924217224}
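
Because the Job saves checkpoints into directories such as checkpoint-10, a restarted Job can resume from the directory with the highest step number. The following Python sketch shows one way to pick that directory from a list of names. It is an illustration under the assumption that checkpoint directories follow the checkpoint-<step> pattern seen in the log; it is not necessarily how the sample's fine_tune.py selects its checkpoint, and latest_checkpoint is a hypothetical helper.

```python
import re


def latest_checkpoint(dir_names):
    """Return the checkpoint-<step> name with the highest step, or None."""
    steps = []
    for name in dir_names:
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match:
            # Compare numerically so that checkpoint-100 sorts after checkpoint-20.
            steps.append(int(match.group(1)))
    return f"checkpoint-{max(steps)}" if steps else None


# Example listing of the experiment directory (names are illustrative).
newest = latest_checkpoint(["checkpoint-10", "checkpoint-20", "checkpoint-100", "runs"])
print(newest)
```

Comparing the step numbers as integers avoids the lexicographic ordering problem that plain string sorting would introduce.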

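Test Kueue preemption and dynamic allocation on your mixed workloads

In this section, you simulate a scenario where the load on the inference server increases and the server needs to scale up. This scenario demonstrates how Kueue prioritizes the high-priority inference server by suspending and preempting the lower-priority fine-tuning Job when resources are constrained.

Run the following command to scale the inference server to two replicas: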

kubectl scale --replicas=2 deployment/tgi-gemma-deployment --namespace=llm

Check the status of the Workload objects:

kubectl get workloads --namespace=llm

The output looks similar to the following:

NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq                      False                 32m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  70m
pod-tgi-gemma-deployment-6cb95cc7f5-p49sh-167de   lq      cluster-queue   True                  14s

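The output shows that the fine-tuning Job is no longer admitted because the additional inference server replica uses the available GPU quota.

Check the status of the fine-tuning Job: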

kubectl get job --namespace=llm

The output looks similar to the following, indicating that the fine-tuning Job is now suspended:

NAME                STATUS      COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Suspended   0/2                      33m

Run the following command to inspect your Pods:

kubectl get pod --namespace=llm

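The output looks similar to the following, indicating that Kueue terminated the fine-tuning Job Pods to free resources for the higher-priority inference server Deployment.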

NAME                                    READY   STATUS              RESTARTS   AGE
tgi-gemma-deployment-6cb95cc7f5-cbxg2   1/1     Running             0          72m
tgi-gemma-deployment-6cb95cc7f5-p49sh   0/1     ContainerCreating   0          91s

Next, test the scenario where the inference server load decreases and its Pods scale down. Run the following command:

kubectl scale --replicas=1 deployment/tgi-gemma-deployment --namespace=llm

Run the following command to display the Workload objects:

kubectl get workloads --namespace=llm

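The output looks similar to the following, indicating that one of the inference server replicas was terminated and that the fine-tuning Job was re-admitted.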

NAME                                              QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-finetune-gemma-l4-3316f                       lq      cluster-queue   True                  37m
pod-tgi-gemma-deployment-6cb95cc7f5-cbxg2-d9fe7   lq      cluster-queue   True                  75m

Run the following command to display the Jobs:

kubectl get job --namespace=llm

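The output looks similar to the following, indicating that the fine-tuning Job is running again, resuming from the latest available checkpoint.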

NAME                STATUS    COMPLETIONS   DURATION   AGE
finetune-gemma-l4   Running   0/2           2m11s      38m
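
The scale-up and scale-down behavior in this section can be approximated with a small priority-ordered admission model. The following Python sketch is a simplification under stated assumptions: it looks only at the GPU quota, treats the fine-tuning Job as a single all-or-nothing workload that needs four GPUs, and admits workloads in descending priority order. Kueue's actual admission and preemption logic is more involved, and the admit and inference helpers are hypothetical.

```python
GPU_QUOTA = 6  # nominalQuota for nvidia.com/gpu in the ClusterQueue


def admit(workloads, quota=GPU_QUOTA):
    """Admit (name, priority, gpus) workloads by descending priority within quota."""
    admitted, used = [], 0
    for name, _priority, gpus in sorted(workloads, key=lambda w: -w[1]):
        if used + gpus <= quota:
            admitted.append(name)
            used += gpus
    return admitted


def inference(replicas):
    """Each inference replica is its own workload: priority 30, two GPUs."""
    return [(f"tgi-replica-{i}", 30, 2) for i in range(replicas)]


# The fine-tuning Job: priority 20, four GPUs across its two Pods.
job = ("finetune-gemma-l4", 20, 4)

one_replica = admit(inference(1) + [job])
two_replicas = admit(inference(2) + [job])

print(one_replica)
print(two_replicas)
```

In this model, one inference replica and the Job fit exactly within six GPUs, while a second replica pushes the total to eight, so the lower-priority Job loses its admission until the Deployment scales back down.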

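Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands: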

gcloud storage rm --recursive gs://${MODEL_BUCKET}
gcloud container clusters delete ${CLUSTER_NAME} --location ${REGION}
