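Manage maintenance events with TPUs in All Capacity mode

All TPU hosts undergo periodic maintenance. In TPU All Capacity mode, you can plan for upcoming maintenance events and start maintenance on any of your capacity when you need to. You can update used and unused capacity together or separately. You can also perform maintenance at the VM, sub-block, block, or reservation level. This fine-grained control lets you build the best maintenance sequence and schedule maintenance operations to minimize business impact.

TPU All Capacity mode supports only "grouped maintenance", which means that maintenance for all VM instances in a reservation is scheduled for the same time. All TPU VMs in a reservation share the same maintenance window. However, you can still perform maintenance separately at the host, sub-block, block, or reservation level. Maintenance notifications are sent approximately 90 days in advance. Maintenance occurs no more than once every 90 days.

If you use TPU Cluster Director on GKE with multi-host TPU slice node pools, we recommend that you delete the GKE node pool before you manually start pending maintenance on any host in that node pool. After maintenance has run on all hosts from the original node pool, you can recreate the node pool.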

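The following is a sample timeline of a maintenance event for a TPU host:

Maintenance is scheduled. You receive a notification that the host will be updated within 90 days.
You can choose to update the host manually at any time during those 90 days.
After 90 days, the maintenance operation runs without exception.
If another maintenance event is scheduled before the previous maintenance event runs, the second operation is scheduled to run after 180 days, that is, 90 days after the initial maintenance event is scheduled to run.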
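The timeline above can be sketched as simple date arithmetic. This is only an illustration under the assumption that each notice period is exactly 90 days; the function name `maintenance_deadlines` and the notification date are hypothetical, and real windows come from the maintenance notification itself.

```python
from datetime import date, timedelta

NOTICE_DAYS = 90  # notice period described in the timeline above


def maintenance_deadlines(notified_on, events=2):
    """Return the date on which each back-to-back maintenance event runs
    if you never start it manually (assumes exact 90-day periods)."""
    return [notified_on + timedelta(days=NOTICE_DAYS * (i + 1))
            for i in range(events)]


first, second = maintenance_deadlines(date(2025, 6, 14))
print(first)   # 2025-09-12: the first event runs after 90 days
print(second)  # 2025-12-11: a second event scheduled meanwhile runs after 180 days
```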

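Set up maintenance notification alerts for physical capacity

Compute Engine sends you Cloud Logging events when maintenance is scheduled, started, or completed. These maintenance events are retained in the logs, so you can build log queries to view the maintenance history of your capacity. You can also create log-based alerting policies to be notified of future maintenance events on a reservation, block, or sub-block.

To create an alert for maintenance events on physical capacity, do the following:

In the Google Cloud console, go to Logs Explorer.
Make sure that Show query is turned on.
In the query pane, build a query in the format listed in the following sections. Replace the placeholders with your values, and then run the query.
After you verify that the results match what you expect, select Create log alert from the Actions drop-down menu in the query results toolbar, and then provide the requested information to create the alert.

Query for upcoming maintenance

The following is a sample query for upcoming maintenance: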

protoPayload.methodName="compute.CAPACITY_COMPONENT.upcomingGroupMaintenance" severity>=DEFAULT
protoPayload.resourceName="projects/shared-reservation-project/reservations/RESOURCE_NAME"
protoPayload.status.message =~ "scheduled"

Replace CAPACITY_COMPONENT and RESOURCE_NAME with the following values:

Receive notifications about upcoming maintenance for	`CAPACITY_COMPONENT`	`RESOURCE_NAME`
All reservations	`reservations`	Omit `RESOURCE_NAME`
A specific reservation	`reservations`	`YOUR_RESERVATION_NAME`
Blocks in all reservations	`reservations.blocks`	Omit `RESOURCE_NAME`
A specific block	`reservations.blocks`	`YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID`
Sub-blocks in all reservations	`reservations.blocks.subblocks`	Omit `RESOURCE_NAME`
A specific sub-block	`reservations.blocks.subblocks`	`YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID/reservationSubBlocks/YOUR_RESERVATION_SUBBLOCK_ID`
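If you maintain several alerts, it may help to generate these filters rather than edit them by hand. The following is a minimal sketch that assembles the upcoming-maintenance query from the table above. The function name and its parameters are hypothetical, and it assumes that "omit RESOURCE_NAME" means leaving out the `protoPayload.resourceName` line entirely.

```python
COMPONENTS = ("reservations", "reservations.blocks", "reservations.blocks.subblocks")


def upcoming_maintenance_filter(component, project=None, resource_name=None):
    """Build the Logs Explorer filter text for upcoming group maintenance."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown capacity component: {component!r}")
    if resource_name and not project:
        raise ValueError("a project is required when resource_name is set")
    lines = [
        f'protoPayload.methodName="compute.{component}.upcomingGroupMaintenance" '
        "severity>=DEFAULT"
    ]
    if resource_name:
        # Assumption: omitting RESOURCE_NAME means dropping this line.
        lines.append(
            f'protoPayload.resourceName="projects/{project}/reservations/{resource_name}"'
        )
    lines.append('protoPayload.status.message =~ "scheduled"')
    return "\n".join(lines)


print(upcoming_maintenance_filter("reservations.blocks"))
```

Paste the printed text into the query pane, or pass it to whatever tooling you use to manage alerting policies.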

Query for maintenance window start

protoPayload.methodName="compute.CAPACITY_COMPONENT.startGroupMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "started"

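Replace CAPACITY_COMPONENT with one of the following values:

Receive notifications about the maintenance window starting for	`CAPACITY_COMPONENT`
Blocks in a reservation	`reservations.blocks`
Sub-blocks in a reservation	`reservations.blocks.subblocks`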

Query for completed maintenance

The following is a sample query for completed maintenance:

protoPayload.methodName="compute.CAPACITY_COMPONENT.completedGroupMaintenance" severity>=DEFAULT
protoPayload.resourceName="projects/YOUR_RESERVATION_PROJECT/reservations/RESOURCE_NAME"
protoPayload.status.message =~ "completed"

Replace CAPACITY_COMPONENT and RESOURCE_NAME with the following values:

Receive notifications about completed maintenance for	`CAPACITY_COMPONENT`	`RESOURCE_NAME`
All reservations	`reservations`	Omit `RESOURCE_NAME`
A specific reservation	`reservations`	`YOUR_RESERVATION_NAME`
Blocks in all reservations	`reservations.blocks`	Omit `RESOURCE_NAME`
A specific block	`reservations.blocks`	`YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID`
Sub-blocks in all reservations	`reservations.blocks.subblocks`	Omit `RESOURCE_NAME`
A specific sub-block	`reservations.blocks.subblocks`	`YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID/reservationSubBlocks/YOUR_RESERVATION_SUBBLOCK_ID`

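View the maintenance status of physical capacity

You can check the maintenance status of your capacity through Cloud Logging, the API, and the CLI. Maintenance status information is available at four levels: reservation, block, sub-block, and host.

Cloud Logging

The following sample JSON is the kind of log entry that a query for upcoming maintenance on a block returns: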

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance is scheduled for this block in reservation YOUR_RESERVATION. Review the maintenance schedule by describing the reservation and block via gcloud CLI"
    },
    "metadata": {
      "type": "SCHEDULED",
      "canReschedule": true,
      "windowGroupStartTime": "2025-09-12T13:00:00.000-07:00",
      "windowGroupEndTime": "2025-09-12T17:00:00.000-07:00",
      "maintenanceGroupStatus": "PENDING",
      "maintenancePendingCount": 128, # Used and unused machines
      "instanceMaintenancePendingCount": 64 # VMs only
    },
    "methodName": "compute.reservations.block.upcomingGroupMaintenance",
    …
  }
}
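If you export these entries, for example through a log sink, a few lines of code can reduce each one to the fields that matter for planning. The following is a minimal sketch: the `entry` dictionary is a hand-written stand-in that mirrors the sample above rather than real log output, and `summarize_group_maintenance` is a hypothetical helper name.

```python
def summarize_group_maintenance(entry):
    """Pull the planning-relevant fields out of a group maintenance log entry."""
    payload = entry["protoPayload"]
    meta = payload["metadata"]
    return {
        "method": payload["methodName"],
        "status": meta["maintenanceGroupStatus"],
        "hosts_pending": meta.get("maintenancePendingCount", 0),
        "vms_pending": meta.get("instanceMaintenancePendingCount", 0),
    }


# Stand-in entry that mirrors the sample above (not real log output).
entry = {
    "protoPayload": {
        "metadata": {
            "type": "SCHEDULED",
            "canReschedule": True,
            "maintenanceGroupStatus": "PENDING",
            "maintenancePendingCount": 128,
            "instanceMaintenancePendingCount": 64,
        },
        "methodName": "compute.reservations.block.upcomingGroupMaintenance",
    }
}
print(summarize_group_maintenance(entry))
```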

gcloud

gcloud compute reservations blocks describe YOUR_RESERVATION \
--block-name=YOUR_BLOCK \
--project=YOUR_PROJECT \
--zone=YOUR_ZONE

The output is similar to the following:

count: 128 # Host count
creationTimestamp: '2025-08-19T18:23:32.825-07:00'
id: '6404259976725386932'
inUseCount: 64 # In-use host count
kind: compute#reservationBlock
name: exr1-block-0002
…
reservationMaintenance:
  instanceMaintenanceOngoingCount: 0
  instanceMaintenancePendingCount: 64 # VMs only
  maintenanceOngoingCount: 0
  maintenancePendingCount: 128 # Used and unused hosts
  schedulingType: GROUPED
  subblockInfraMaintenanceOngoingCount: 0
  subblockInfraMaintenancePendingCount: 0
  upcomingGroupMaintenance:
    canReschedule: true
    maintenanceReasons:
    - PLANNED_UPDATE
    maintenanceStatus: PENDING
    type: SCHEDULED
    windowEndTime: '2025-09-12T17:00:00.000-07:00'
    windowStartTime: '2025-09-12T13:00:00.000-07:00'
…

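The following values in the output describe the maintenance:

reservationMaintenance.instanceMaintenanceOngoingCount: the number of VM instances (in-use capacity) that are being updated
reservationMaintenance.instanceMaintenancePendingCount: the number of VM instances (in-use capacity) with pending maintenance
reservationMaintenance.maintenanceOngoingCount: the number of hosts, used and unused, that are being updated
reservationMaintenance.maintenancePendingCount: the number of hosts, used and unused, with pending maintenance
reservationMaintenance.upcomingGroupMaintenance.maintenanceReasons: the reasons for the maintenance, for example PLANNED_UPDATE
reservationMaintenance.upcomingGroupMaintenance.maintenanceStatus: the status of the maintenance operation
reservationMaintenance.upcomingGroupMaintenance.type: the maintenance type (SCHEDULED for planned maintenance, UNSCHEDULED for unplanned or emergency maintenance)
reservationMaintenance.upcomingGroupMaintenance.windowEndTime: the scheduled end time of the maintenance window
reservationMaintenance.upcomingGroupMaintenance.windowStartTime: the scheduled start time of the maintenance window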
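The window timestamps carry a UTC offset, so they can be compared directly against the current time. The following is a small sketch, assuming you have already loaded the `upcomingGroupMaintenance` fields into a dictionary (for instance from JSON-formatted describe output); `window_position` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def window_position(upcoming, now):
    """Return 'before', 'inside', or 'after' relative to the maintenance window."""
    start = datetime.fromisoformat(upcoming["windowStartTime"])
    end = datetime.fromisoformat(upcoming["windowEndTime"])
    if now < start:
        return "before"
    return "inside" if now <= end else "after"


upcoming = {
    "windowStartTime": "2025-09-12T13:00:00.000-07:00",
    "windowEndTime": "2025-09-12T17:00:00.000-07:00",
}
# 21:30 UTC is 14:30 at UTC-07:00, which falls within the sample window.
print(window_position(upcoming, datetime(2025, 9, 12, 21, 30, tzinfo=timezone.utc)))
```

Pass a timezone-aware `now`; comparing a naive datetime with these values raises a `TypeError`.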

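Set up maintenance notification alerts for TPU VMs

To create alerts for maintenance events on TPU VMs, do the following:

In the Google Cloud console, go to Logs Explorer.
Set the Show query toggle to the on position.
In the query pane, build a query in the format listed in the following sections.
After you verify that the results match what you expect, click the Actions drop-down menu, select Create log alert, and then fill in the Create log-based alert policy pane to create the alert.

Query when maintenance is scheduled for a VM instance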

protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "scheduled"

Query when the maintenance window opens for a VM instance

protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "ongoing"

Query when maintenance starts on a VM instance

protoPayload.methodName="compute.instances.terminateOnHostMaintenance" severity>=DEFAULT

Query when maintenance completes on a VM instance

protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "completed"
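When the same entries are processed outside Logs Explorer, the four queries above can be mirrored as a small classifier. This is an illustrative sketch only: `vm_maintenance_phase` and its phase labels are hypothetical, and the keyword matching simply copies the regular expressions used in the queries.

```python
import re

# Keywords copied from the status.message patterns in the queries above.
_MESSAGE_PHASES = (
    ("scheduled", "scheduled"),
    ("ongoing", "window_open"),
    ("completed", "completed"),
)


def vm_maintenance_phase(method_name, message=""):
    """Map a VM maintenance log entry to a phase label, or None if unrelated."""
    if method_name.endswith(".terminateOnHostMaintenance"):
        return "maintenance_started"
    if method_name == "compute.instances.upcomingMaintenance":
        for pattern, phase in _MESSAGE_PHASES:
            if re.search(pattern, message):
                return phase
    return None


print(vm_maintenance_phase(
    "compute.instances.upcomingMaintenance",
    "Maintenance is scheduled for this instance.",
))
```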

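View the maintenance status of Cloud TPU VMs

You can retrieve the maintenance status of a Cloud TPU VM by using the Compute Engine instances API, or by running a curl command from within the guest OS.

Describe the instance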

gcloud

gcloud compute instances describe INSTANCE --zone ZONE

The command returns output similar to the following:

…
upcomingMaintenance:
  type: SCHEDULED
  canReschedule: true
  windowStartTime: '2025-09-12T13:00:00.000-07:00'
  windowEndTime: '2025-09-12T17:00:00.000-07:00'
  latestWindowStartTime: '2025-09-12T13:00:00.000-07:00'
  maintenanceStatus: PENDING
…

curl

curl "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance?alt=json" -H "Metadata-Flavor: Google"

The command returns output similar to the following:

{
  "maintenanceType": "SCHEDULED",
  "canReschedule": true,
  "windowStartTime": "2025-09-12T13:00:00.000-07:00",
  "windowEndTime": "2025-09-12T17:00:00.000-07:00",
  "latestWindowStartTime": "2025-09-12T13:00:00.000-07:00",
  "maintenanceStatus": "PENDING"
}
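A workload running on the VM can make the same request programmatically, for example to checkpoint before the window opens. The following is a minimal sketch using only the Python standard library. The URL and header are the ones from the curl command; the function names are hypothetical, and the call succeeds only from inside a Compute Engine guest.

```python
import json
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/upcoming-maintenance?alt=json"
)


def build_request():
    """Build the metadata server request with the required header."""
    return urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})


def fetch_upcoming_maintenance(timeout=5.0):
    """Return the upcoming-maintenance metadata as a dict (run inside the VM)."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_upcoming_maintenance()` anywhere other than a Compute Engine VM fails with a network error, so handle `urllib.error.URLError` if the code might run elsewhere.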

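You can also find maintenance notifications in Cloud Logging.

The following is a sample log message for pending scheduled maintenance. For a sample query, see View the maintenance status of physical capacity.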

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance is scheduled for this instance. Review the maintenance schedule by describing the VM with gcloud CLI or querying the http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance metadata key."
    },
    "metadata": {
      "canReschedule": true,
      "latestWindowStartTime": "2024-01-01:00:00:00PST",
      "maintenanceStatus": "PENDING",
      "type": "SCHEDULED",
      "windowEndTime": "2024-01-01:00:02:00PST",
      "windowStartTime": "2024-01-01:00:00:00PST"
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": true,
    "last": false
  }
}

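The following sample is a log message for ongoing unscheduled maintenance. For a sample query, see Query when the maintenance window opens for a VM instance.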

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance window has started for this instance. Review the maintenance schedule by describing the VM with gcloud CLI or querying the http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance metadata key."
    },
    "metadata": {
      "canReschedule": true,
      "latestWindowStartTime": "2024-01-01:00:00:00PST",
      "maintenanceStatus": "ONGOING",
      "type": "UNSCHEDULED",
      "windowEndTime": "2024-01-01:00:02:00PST",
      "windowStartTime": "2024-01-01:00:00:00PST"
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": true,
    "last": false
  }
}

The following sample is a log message for completed maintenance. For a sample query, see Query when maintenance completes on a VM instance.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance window has completed for this instance. All maintenance notifications on the instance have been removed."
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": false,
    "last": true
  }
}

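Manually start pending maintenance on physical capacity

After a maintenance event is scheduled (maintenanceStatus is set to PENDING), you can manually start maintenance on any reservation, block, or sub-block whose canReschedule attribute is set to True. What happens when you manually start a pending maintenance event depends on the maintenance status of the reservation, block, or sub-block. The following table describes each status:

Maintenance status	Description	What you see
Pending	Compute Engine has scheduled maintenance for the reservation. You can manually start maintenance before the scheduled time.	In the Google Cloud CLI or REST API, the `maintenanceStatus` field is set to `PENDING`.
Ongoing	Maintenance is in progress. You can't reschedule it.	In the Google Cloud CLI or REST API, the `maintenanceStatus` field is set to `ONGOING`.
Completed	Maintenance is complete. Compute Engine has removed all maintenance notifications from the VMs.	In the Google Cloud CLI or REST API, the `maintenanceStatus` field is absent.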
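The rule in this section reduces to a two-field check. The following is a minimal sketch, assuming the `upcomingGroupMaintenance` fields have been loaded into a dictionary (or `None` when the field is absent); `can_start_manually` is a hypothetical helper name.

```python
def can_start_manually(upcoming):
    """Return True only when maintenance is pending and may be started early."""
    if not upcoming:
        # No maintenance info: nothing is scheduled, or it has already completed.
        return False
    return (
        upcoming.get("maintenanceStatus") == "PENDING"
        and bool(upcoming.get("canReschedule"))
    )


print(can_start_manually({"maintenanceStatus": "PENDING", "canReschedule": True}))
print(can_start_manually({"maintenanceStatus": "ONGOING", "canReschedule": True}))
```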

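Manually start maintenance on an entire reservation

The following command starts maintenance on a reservation. Use the --scope parameter with one of the following values to set the scope of the maintenance operation:

All hosts: --scope=all
Hosts with running VMs: --scope=running
Hosts that are unused or whose VMs are stopped or suspended: --scope=unused

To start maintenance on all blocks in a reservation, run the following command: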

gcloud compute reservations perform-maintenance YOUR_RESERVATION \
  --zone=YOUR_ZONE \
  --scope=all
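If you drive maintenance from a script, validating the scope before you call the CLI catches typos early. The following sketch only assembles the argument list for the command shown above; it uses no flags beyond those that appear in this document, and `perform_maintenance_args` is a hypothetical helper name.

```python
import shlex

VALID_SCOPES = ("all", "running", "unused")


def perform_maintenance_args(reservation, zone, scope="all", project=None):
    """Build the gcloud argument list for reservation-level maintenance."""
    scope = scope.lower()
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}, got {scope!r}")
    args = [
        "gcloud", "compute", "reservations", "perform-maintenance", reservation,
        f"--zone={zone}",
        f"--scope={scope}",
    ]
    if project:
        args.append(f"--project={project}")
    return args


# Print the command for review; pass the list to subprocess.run() to execute it.
print(shlex.join(perform_maintenance_args("YOUR_RESERVATION", "YOUR_ZONE", "unused")))
```

Starting with `unused` and following up with `running` is one way to update idle capacity first; whether that order suits you depends on your workloads.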

To check the progress of the maintenance event, run the following command:

gcloud compute reservations describe YOUR_RESERVATION  \
  --project=YOUR_PROJECT \
  --zone=YOUR_ZONE

The output is similar to the following:

ResourceStatus
  upcomingGroupMaintenance:
    "type":"SCHEDULED"
    "canReschedule":True
    "maintenanceStatus":"PENDING" → "ONGOING"
    "maintenancePendingCount":512 → 0 # All hosts move into the ongoing state.
    "maintenanceOngoingCount":0 → 512 → 256 → 0 # Rises to the full host count,
                                           # then falls as hosts complete.
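Watching these counters can be automated with a simple polling loop. The sketch below is deliberately decoupled from gcloud: you supply a `fetch_counts` callable (hypothetical) that returns the two counters shown above, for example by parsing JSON-formatted describe output. It treats missing counters as zero, on the assumption that the maintenance fields disappear once maintenance completes.

```python
import time


def wait_for_maintenance(fetch_counts, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Poll until no hosts are pending or ongoing; return False if polls run out."""
    for _ in range(max_polls):
        counts = fetch_counts() or {}
        pending = counts.get("maintenancePendingCount", 0)
        ongoing = counts.get("maintenanceOngoingCount", 0)
        if pending == 0 and ongoing == 0:
            return True
        sleep(poll_seconds)
    return False


# Simulated progression that mirrors the sample output above.
snapshots = iter([
    {"maintenancePendingCount": 0, "maintenanceOngoingCount": 512},
    {"maintenancePendingCount": 0, "maintenanceOngoingCount": 256},
    {"maintenancePendingCount": 0, "maintenanceOngoingCount": 0},
])
print(wait_for_maintenance(lambda: next(snapshots), sleep=lambda seconds: None))
```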

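Manually start maintenance on a block

The following command starts maintenance on a block. Use the --scope parameter with one of the following values to set the scope of the maintenance operation:

All hosts: --scope=all
Hosts with running VMs: --scope=running
Hosts that are unused or whose VMs are stopped or suspended: --scope=unused

The following command shows how to start maintenance on hosts with running VMs: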

gcloud compute reservations perform-maintenance YOUR_RESERVATION \
    --scope=running \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE

The following command shows how to check the maintenance progress of a block:

gcloud compute reservations blocks describe YOUR_RESERVATION \
  --block-name=YOUR_BLOCK_NAME \
  --project=YOUR_PROJECT \
  --zone=YOUR_ZONE

The output is similar to the following:

ResourceStatus
  upcomingGroupMaintenance:
    "maintenanceType":"SCHEDULED"
    …
    "maintenanceGroupStatus":"PENDING" → "ONGOING"
    "maintenancePending":0
    "maintenanceOngoing":70 → 0

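Manually start maintenance on a sub-block

When you start maintenance on a sub-block, you don't need to specify the --scope parameter, because a sub-block is the smallest maintenance scope.

The following command starts maintenance on all hosts in the sub-block: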

gcloud compute reservations sub-blocks perform-maintenance YOUR_RESERVATION \
    --block-name=YOUR_BLOCK_NAME \
    --sub-block-name=YOUR_SUBBLOCK_NAME \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE

The following command checks the maintenance progress:

gcloud compute reservations sub-blocks describe YOUR_RESERVATION \
    --block-name=YOUR_BLOCK_NAME \
    --sub-block-name=YOUR_SUBBLOCK_NAME \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE

The output is similar to the following:

ResourceStatus
  groupMaintenance:
    "maintenanceType":"SCHEDULED"
    "canReschedule":True
    "maintenanceGroupStatus":"PENDING" → "ONGOING"
    "maintenancePendingCount": 32 → 0 # 32 hosts updated
    "maintenanceOngoingCount":0 → 32 → 0
    "instanceMaintenancePendingCount": 64 → 0
    "instanceMaintenanceOngoingCount": 0 → 64 → 0 # 64 instances updated

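Manually start pending maintenance on a TPU VM

If a host is running multiple VMs, starting maintenance on one VM triggers maintenance on all VMs on that host.

The following example shows how to manually trigger maintenance on a Trillium host that has two VMs: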

gcloud compute instances perform-maintenance vm-1

Triggering maintenance on vm-1 also triggers maintenance on vm-2.
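Before you trigger maintenance on a single VM, it can be useful to list every VM that shares its host. The following sketch assumes that you already have a VM-to-host mapping from your own inventory; the mapping, the host labels, and the `vms_affected` helper are all hypothetical.

```python
def vms_affected(target_vm, vm_to_host):
    """Return every VM on the same host as target_vm, including target_vm itself."""
    host = vm_to_host[target_vm]
    return sorted(vm for vm, vm_host in vm_to_host.items() if vm_host == host)


# Hypothetical placement: vm-1 and vm-2 share a host, vm-3 runs elsewhere.
placement = {"vm-1": "host-a", "vm-2": "host-a", "vm-3": "host-b"}
print(vms_affected("vm-1", placement))  # ['vm-1', 'vm-2']
```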