View the topology and health status of an All Capacity mode reservation

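Before or after you provision TPU slices, you can use the Google Cloud console or the Google Cloud CLI to retrieve topology and health information about your All Capacity mode capacity. You can also retrieve the physical location of a TPU VM instance through the Compute Engine instances API or by running a curl command in the guest OS of the TPU VM. With topology and health information at the cluster, block, sub-block, host, and VM levels, you can make topology-aware placement decisions for your workloads, target specific blocks or sub-blocks for deployment, and understand how close TPU VM instances are to one another.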

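View capacity topology in the Google Cloud console

To view the details of a reservation in the Google Cloud console, do the following:

  1. In the Google Cloud console, use the search bar to search for "Reservations", and then go to the Reservations page.
  2. Select the On-demand reservations tab, and then find your TPU All Capacity mode reservation. Your customer support team tells you the name of the reservation.
  3. Select your reservation to open the reservation details page.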

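For an All Capacity mode reservation, the operational mode is set to All Capacity. The page lists the reservation's blocks along with a summary of their utilization and health.

Select a block from the list to view the block details page. The block's topology appears in the Cluster location section, which shows the cluster name, the hash ID of the block, and the hash IDs of the sub-blocks.

The cluster name is the same for every Google Cloud organization. In other words, two different customers might see the same cluster name. Unlike the cluster name, the hash ID of a block or sub-block is unique across the projects in your Google Cloud organization.

You can select a sub-block to open the Sub-block details page, which shows only the physical hosts that have active TPU VM instances. Unused physical hosts aren't shown.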

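View capacity topology using the Google Cloud CLI

You can use the Google Cloud CLI list and describe commands on reservations, blocks, and sub-blocks to find topology and health information about your capacity.

The information that the commands in this section display lets you determine the topology hierarchy of the physical capacity in your reservation.

Describe a reservation

You can use gcloud compute reservations describe to view an overview of the reserved capacity. The following command displays a summary of the reservation named "example-reservation":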

gcloud compute reservations describe example-reservation \
   --project=example-project \
   --zone=us-central1-c

The command displays output similar to the following:

advancedDeploymentControl:
  reservationOperationalMode: ALL_CAPACITY
aggregateReservation:
  inUseResources:
  - accelerator:
      acceleratorCount: 48
      acceleratorType: projects/example-project/zones/us-central1-c/acceleratorTypes/tpu7x
  reservedResources:
  - accelerator:
      acceleratorCount: 128
      acceleratorType: projects/example-project/zones/us-central1-c/acceleratorTypes/tpu7x
  vmFamily: VM_FAMILY_CLOUD_TPU_POD_SLICE_TPU7X
  workloadType: UNSPECIFIED
creationTimestamp: '2025-11-05T14:16:30.571-08:00'
deleteAtTime: '2026-11-06T08:00:00Z'
deploymentType: DENSE 
enableEmergentMaintenance: false
id: '8873145979824927313'
kind: compute#reservation
linkedCommitments:
- https://www.googleapis.com/compute/v1/projects/example-project/regions/us-central1/commitments/example-cud
name: example-reservation
protectionTier: STANDARD
reservationSharingPolicy:
  serviceShareType: ALLOW_ALL
resourceStatus:
  healthInfo:
    degradedBlockCount: 0
    healthStatus: HEALTHY
    healthyBlockCount: 1
  reservationBlockCount: 1
  reservationMaintenance:
    schedulingType: 
schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation
shareSettings:
  projectMap:
    '111111111111':
      projectId: '111111111111'
  shareType: SPECIFIC_PROJECTS
specificReservationRequired: true
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c

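The following values in the output describe the reservation:

  • advancedDeploymentControl.reservationOperationalMode - The capacity mode of the reservation
  • aggregateReservation.inUseResources.accelerator.acceleratorCount - The number of TPU chips in use
  • aggregateReservation.inUseResources.accelerator.acceleratorType - The TPU version
  • aggregateReservation.reservedResources.accelerator.acceleratorCount - The number of TPU chips in the reservation
  • deploymentType - The deployment type (always DENSE for TPUs)
  • reservationSharingPolicy.serviceShareType - The service share type
  • resourceStatus.healthInfo.healthStatus - The overall health of the capacity
  • resourceStatus.healthInfo.healthyBlockCount - The number of healthy blocks in the reservation
  • resourceStatus.reservationBlockCount - The number of blocks in the reservation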
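
As a rough illustration of how the in-use and reserved chip counts relate, the following shell sketch reads the describe output on standard input and reports utilization. The function name reservation_utilization is hypothetical, and the parsing assumes the YAML layout shown in the sample output, where each acceleratorCount line follows its inUseResources or reservedResources key.

```shell
# Hypothetical helper: summarize chip utilization from the YAML printed by
# `gcloud compute reservations describe`, read on standard input.
reservation_utilization() {
  awk '
    # Remember which section the next acceleratorCount line belongs to.
    /inUseResources:/    { section = "inuse" }
    /reservedResources:/ { section = "reserved" }
    /acceleratorCount:/  { count[section] = $2 }
    END {
      if (count["reserved"] <= 0) {
        print "no reserved accelerator count found" > "/dev/stderr"
        exit 1
      }
      printf "%d/%d chips in use (%.1f%%)\n",
        count["inuse"], count["reserved"],
        100 * count["inuse"] / count["reserved"]
    }'
}
```

Piping the gcloud compute reservations describe command from this section into reservation_utilization would, for the sample output, print 48/128 chips in use (37.5%).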

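List all reservation blocks

You can use the gcloud compute reservations blocks list command to display capacity, topology, and health information for all blocks in a reservation. In the following example, the reservation "example-reservation" contains two blocks: "example-reservation-block-0001" and "example-reservation-block-0002". Both blocks are in the cluster "example-cluster".

Every block, sub-block, and host object is identified by a hash ID. A parent object's ID appears in the physical topology fields of its child objects. You can use the hash IDs to build a hierarchical view of your capacity's topology.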

gcloud compute reservations blocks list example-reservation \
  --project=example-project \
  --zone=us-central1-c

The command displays the following output:

count: 32
creationTimestamp: '2025-11-05T15:00:15.223-08:00'
healthInfo:
  degradedSubBlockCount: 0
  healthStatus: HEALTHY
  healthySubBlockCount: 2
id: '2996501069483632657'
inUseCount: 12
kind: compute#reservationBlock
name: example-reservation-block-0001
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
reservationMaintenance:
  schedulingType: GROUPED
reservationSubBlockCount: 2
reservationSubBlockInUseCount: 1
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
---
count: 128
creationTimestamp: '2025-08-19T18:23:32.825-07:00'
healthInfo:
  degradedSubBlockCount: 0
  healthStatus: HEALTHY
  healthySubBlockCount: 4
id: '9a0e671424e45fd480ca172ad7a4e25d'
inUseCount: 64
kind: compute#reservationBlock
name: example-reservation-block-0002
physicalTopology:
  block: 3feffcdeb6434d68bb818a836f75c1b8
  cluster: example-cluster
reservationMaintenance:
  schedulingType: GROUPED
reservationSubBlockCount: 2
reservationSubBlockInUseCount: 1
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0002
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c

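The following values in the output describe the blocks in the reservation:

  • count - The number of physical hosts
  • healthInfo.healthStatus - The overall health of the block
  • healthInfo.healthySubBlockCount - The number of healthy sub-blocks in the block
  • id - The ID of the block
  • inUseCount - The number of physical hosts in use
  • kind - The type of the object being described
  • name - The name of the block
  • physicalTopology.block - The hash ID of the block
  • physicalTopology.cluster - The cluster that contains the block
  • reservationSubBlockCount - The number of sub-blocks in this block
  • reservationSubBlockInUseCount - The number of sub-blocks in use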
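
When you choose a block to target for a deployment, the difference between count and inUseCount gives the number of free physical hosts. The following shell sketch ranks blocks by free hosts. The function name block_free_hosts is hypothetical, and the parsing assumes that the list output is YAML with one document per block, separated by --- lines.

```shell
# Hypothetical helper: read the YAML from
# `gcloud compute reservations blocks list ... --format=yaml` on standard
# input and print "<block name> <free hosts>", most free hosts first.
block_free_hosts() {
  awk '
    # Print the block collected so far, then reset for the next document.
    function emit() {
      if (name != "") printf "%s %d\n", name, total - used
      name = ""; total = 0; used = 0
    }
    $1 == "---"         { emit() }
    $1 == "count:"      { total = $2 }
    $1 == "inUseCount:" { used = $2 }
    $1 == "name:"       { name = $2 }
    END                 { emit() }
  ' | sort -k2,2nr
}
```

For the two sample blocks, this would list example-reservation-block-0002 with 64 free hosts ahead of example-reservation-block-0001 with 20.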

Describe a reservation block

You can use the gcloud compute reservations blocks describe command to display information about a specific block.

gcloud compute reservations blocks describe example-reservation \
  --block-name=example-reservation-block-0001 \
  --project=example-project \
  --zone=us-central1-c

The command displays the following output:

resource:
  count: 32
  creationTimestamp: '2025-11-05T15:00:15.223-08:00'
  healthInfo:
    degradedSubBlockCount: 0
    healthStatus: HEALTHY
    healthySubBlockCount: 2
  id: '2996501069483632657'
  inUseCount: 12
  kind: compute#reservationBlock
  name: example-reservation-block-0001
  physicalTopology:
    block: 9a0e671424e45fd480ca172ad7a4e25d
    cluster: example-cluster
  reservationMaintenance:
    schedulingType: GROUPED
  reservationSubBlockCount: 2
  reservationSubBlockInUseCount: 1
  selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001
  selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
  status: READY
  zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c

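The following values in the output describe the block:

  • count - The number of hosts in the block
  • healthInfo.healthStatus - The overall health of the block
  • healthInfo.healthySubBlockCount - The number of healthy sub-blocks in the block
  • id - The ID of the block
  • inUseCount - The number of hosts in use
  • kind - The type of the object being described
  • name - The name of the block
  • physicalTopology.block - The hash ID of the block
  • physicalTopology.cluster - The cluster that contains the block
  • reservationSubBlockCount - The number of sub-blocks in this block
  • reservationSubBlockInUseCount - The number of sub-blocks in use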

List all sub-blocks in a block

You can list the sub-blocks in a block to display information about each sub-block:

gcloud compute reservations sub-blocks list example-reservation \
  --block-name=example-reservation-block-0001 \
  --project=example-project \
  --zone=us-central1-c

The command displays the following information:

count: 16
creationTimestamp: '2025-11-05T15:00:16.738-08:00'
healthInfo:
  degradedHostCount: 0
  degradedInfraCount: 0
  healthStatus: HEALTHY
  healthyHostCount: 16
  healthyInfraCount: 1
id: '8309376980435233263'
inUseCount: 0
kind: compute#reservationSubBlock
name: example-reservation-block-0001-subblock-0001
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
  subBlock: a0122935eb54d02750b65eef2d4f0366
reservationSubBlockMaintenance:
  schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0001
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/8309376980435233263
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
---
count: 16
creationTimestamp: '2025-11-05T15:00:16.736-08:00'
healthInfo:
  degradedHostCount: 0
  degradedInfraCount: 0
  healthStatus: HEALTHY
  healthyHostCount: 16
  healthyInfraCount: 1
id: '5629213080155482607'
inUseCount: 12
kind: compute#reservationSubBlock
name: example-reservation-block-0001-subblock-0002
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
  subBlock: 7aca49831e54d32970631524bc060d9c
reservationSubBlockMaintenance:
  schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0002
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/5629213080155482607
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c

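The following values in the output describe the sub-blocks in the reservation:

  • count - The number of hosts
  • healthInfo.degradedInfraCount - The health of the optical circuit switch (OCS) for an Ironwood cube. A value of 1 means that the OCS for the Ironwood cube is degraded. This value doesn't apply to Trillium.
  • healthInfo.healthStatus - The overall health of the sub-block
  • healthInfo.healthyHostCount - The number of healthy hosts in the sub-block
  • id - The ID of the sub-block
  • inUseCount - The number of hosts in use
  • kind - The type of the object being described
  • name - The name of the sub-block
  • physicalTopology.block - The ID of the block that contains the sub-block
  • physicalTopology.cluster - The cluster that contains the block
  • physicalTopology.subBlock - The ID of the sub-block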
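
Because each sub-block carries the IDs of its parent block and cluster, the list output can be flattened into one path per sub-block to get a hierarchical view. The following shell sketch shows one way to do that. The function name subblock_topology is hypothetical, and the parsing assumes ----separated YAML documents with the field names shown in the sample output.

```shell
# Hypothetical helper: read the YAML from
# `gcloud compute reservations sub-blocks list` on standard input and print
# "<cluster>/<block>/<subBlock> <sub-block name> <health status>" per sub-block.
subblock_topology() {
  awk '
    # Print the sub-block collected so far, then reset for the next document.
    function emit() {
      if (name != "")
        printf "%s/%s/%s %s %s\n", cluster, block, subblock, name, health
      name = ""; cluster = ""; block = ""; subblock = ""; health = ""
    }
    $1 == "---"           { emit() }
    $1 == "name:"         { name = $2 }
    $1 == "healthStatus:" { health = $2 }
    $1 == "cluster:"      { cluster = $2 }
    $1 == "block:"        { block = $2 }
    $1 == "subBlock:"     { subblock = $2 }
    END                   { emit() }
  ' | sort
}
```

Sorting the paths groups sub-blocks that share a cluster and block next to each other, which makes the parent and child relationships easier to scan.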

Describe a reservation sub-block

You can use gcloud compute reservations sub-blocks describe to view information about a sub-block:

gcloud compute reservations sub-blocks describe example-reservation \
  --block-name=example-reservation-block-0001 \
  --sub-block-name=example-reservation-block-0001-subblock-0002 \
  --project=example-project \
  --zone=us-central1-c

The command displays the following information:

resource:
  count: 16
  creationTimestamp: '2025-11-05T15:00:16.736-08:00'
  healthInfo:
    degradedHostCount: 0
    degradedInfraCount: 0
    healthStatus: HEALTHY
    healthyHostCount: 16
    healthyInfraCount: 1
  id: '5629213080155482607'
  inUseCount: 12
  kind: compute#reservationSubBlock
  name: example-reservation-block-0001-subblock-0002
  physicalTopology:
    block: 9a0e671424e45fd480ca172ad7a4e25d
    cluster: example-cluster
    subBlock: 7aca49831e54d32970631524bc060d9c
  reservationSubBlockMaintenance:
    schedulingType: GROUPED
  selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0002
  selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/5629213080155482607
  status: READY
  zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c

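The following values in the output describe the sub-block:

  • count - The number of hosts
  • healthInfo.degradedInfraCount - The health of the optical circuit switch (OCS) for an Ironwood cube. A value of 1 means that the OCS for the Ironwood cube is degraded. This value doesn't apply to Trillium.
  • healthInfo.healthStatus - The overall health of the sub-block
  • healthInfo.healthyHostCount - The number of healthy hosts in the sub-block
  • id - The ID of the sub-block
  • inUseCount - The number of hosts in use
  • kind - The type of the object being described
  • name - The name of the sub-block
  • physicalTopology.block - The ID of the block that contains the sub-block
  • physicalTopology.cluster - The cluster that contains the block
  • physicalTopology.subBlock - The ID of the sub-block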
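
Before you target a sub-block, you might want a simple pass or fail check on its health fields. The following shell sketch is one possible check. The function name subblock_is_healthy is hypothetical, and the rule it applies (HEALTHY status with no degraded hosts or infrastructure) is an assumption about what counts as usable, not a documented policy.

```shell
# Hypothetical helper: read the YAML from
# `gcloud compute reservations sub-blocks describe` on standard input.
# Exit status 0 means healthStatus is HEALTHY and both degraded counts are 0.
subblock_is_healthy() {
  awk '
    $1 == "healthStatus:"       { status = $2 }
    $1 == "degradedHostCount:"  { hosts = $2 }
    $1 == "degradedInfraCount:" { infra = $2 }
    END { exit !(status == "HEALTHY" && hosts == 0 && infra == 0) }
  '
}
```

A script could pipe the describe command from this section into subblock_is_healthy and proceed only when the function succeeds.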

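Find the physical location of a TPU VM instance

After you provision a TPU slice, you can retrieve the physical location of its TPU VM instances. This lets you understand the relative distance between TPU VM instances so that you can optimize workload scheduling.

You can use curl or the Google Cloud CLI to find the physical location of a TPU VM instance. The following examples show the physical location of a TPU VM instance in the example reservation "example-reservation".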

curl

curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host_topology

gcloud

gcloud compute instances describe vm-1 \
  --format="table[box,title=VM-Position](resourceStatus.physical_host_topology:label=location)" \
  --zone=ZONE

Both commands display information about the cluster, block, sub-block, and host of the TPU VM that you specify:

block: 3feffcdeb6434d68bb818a836f75c1b8
cluster: southamerica-west1-cluster-njga
subblock: cbee689cb721abdb0c7f80a4f2d0c1c7
host: 36b2d9731c1e1cf8594a759c8c4178f0
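
To judge how close two TPU VM instances are, you can compare their locations level by level, from cluster down to host. The following shell sketch does this for two location listings. The function names topology_field and closest_shared_level are hypothetical, and the sketch assumes the "key: value" line format shown in the output above.

```shell
# Hypothetical helper: print the value of one field (cluster, block,
# subblock, or host) from a "key: value" topology listing.
# Arguments: <field name> <topology text>
topology_field() {
  printf '%s\n' "$2" | awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Hypothetical helper: print the deepest level that two topology listings
# share (cluster, block, subblock, or host), or "none" if even the cluster
# differs. Arguments: <topology text A> <topology text B>
closest_shared_level() {
  local a="$1" b="$2" level va vb shared="none"
  # Walk from the widest level to the narrowest and stop at the first mismatch.
  for level in cluster block subblock host; do
    va="$(topology_field "$level" "$a")"
    vb="$(topology_field "$level" "$b")"
    if [ -z "$va" ] || [ "$va" != "$vb" ]; then
      break
    fi
    shared="$level"
  done
  echo "$shared"
}
```

Two VMs that report subblock as the shared level sit in the same sub-block on different hosts, while a result of cluster means they share only the cluster.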