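View topology and health status of All Capacity mode reservations
Before or after you provision TPU slices, you can use the Google Cloud console or the Google Cloud CLI to retrieve topology and health information for your All Capacity mode capacity. You can also retrieve the physical location of a TPU VM instance through the Compute Engine instances API or by running a curl command in the guest OS of the TPU VM. With topology and health information at the cluster, block, subblock, host, and VM levels, you can make topology-aware placement decisions for your workloads, target deployments at specific blocks or subblocks, and understand how close TPU VM instances are to one another.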
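View capacity topology in the Google Cloud console
To view the details of your reservation in the Google Cloud console, do the following:
- In the Google Cloud console, use the search bar to search for "Reservations", then go to the Reservations page.
- Select the On-demand reservations tab and find your TPU All Capacity mode reservation. Your customer support team tells you the name of the reservation.
- Select your reservation to open the reservation details page.
For All Capacity mode reservations, the operational mode is set to All capacity. The page lists the blocks in the reservation along with a summary of their utilization and health.
Select a block from the list to view the block details page. The topology of the block is shown in the Cluster location section, which displays the cluster name, the hashed ID of the block, and the hashed IDs of its subblocks.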
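Cluster names are global: a cluster has the same name in every Google Cloud organization, so two different customers might see the same cluster name. In contrast, the hashed IDs of blocks and subblocks are unique across the projects in your Google Cloud organization.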
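You can select a subblock to display the Subblock details page, which shows only the physical hosts that have active TPU VM instances. Unused physical hosts aren't shown.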
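View capacity topology using the Google Cloud CLI
You can use the Google Cloud CLI list and describe commands on reservations, blocks, and subblocks to find topology and health information about your capacity.
You can use the information displayed by the commands in this section to determine the topology hierarchy of the physical capacity in your reservation.
Describe a reservation
You can use gcloud compute reservations describe to view an overview of your reserved capacity. The following command displays a summary of the reservation named "example-reservation":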
gcloud compute reservations describe example-reservation \
--project=example-project \
--zone=us-central1-c
The command displays output similar to the following:
advancedDeploymentControl:
  reservationOperationalMode: ALL_CAPACITY
aggregateReservation:
  inUseResources:
  - accelerator:
      acceleratorCount: 48
      acceleratorType: projects/example-project/zones/us-central1-c/acceleratorTypes/tpu7x
  reservedResources:
  - accelerator:
      acceleratorCount: 128
      acceleratorType: projects/example-project/zones/us-central1-c/acceleratorTypes/tpu7x
  vmFamily: VM_FAMILY_CLOUD_TPU_POD_SLICE_TPU7X
  workloadType: UNSPECIFIED
creationTimestamp: '2025-11-05T14:16:30.571-08:00'
deleteAtTime: '2026-11-06T08:00:00Z'
deploymentType: DENSE
enableEmergentMaintenance: false
id: '8873145979824927313'
kind: compute#reservation
linkedCommitments:
- https://www.googleapis.com/compute/v1/projects/example-project/regions/us-central1/commitments/example-cud
name: example-reservation
protectionTier: STANDARD
reservationSharingPolicy:
  serviceShareType: ALLOW_ALL
resourceStatus:
  healthInfo:
    degradedBlockCount: 0
    healthStatus: HEALTHY
    healthyBlockCount: 1
  reservationBlockCount: 1
  reservationMaintenance:
    schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation
shareSettings:
  projectMap:
    '111111111111':
      projectId: '111111111111'
  shareType: SPECIFIC_PROJECTS
specificReservationRequired: true
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
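The following values in the output describe the reservation:
- advancedDeploymentControl.reservationOperationalMode: capacity mode of the reservation
- aggregateReservation.inUseResources.accelerator.acceleratorCount: number of TPU chips in use
- aggregateReservation.inUseResources.accelerator.acceleratorType: TPU version
- aggregateReservation.reservedResources.accelerator.acceleratorCount: number of TPU chips in the reservation
- deploymentType: deployment type (always DENSE for TPUs)
- reservationSharingPolicy.serviceShareType: service sharing type
- resourceStatus.healthInfo.healthStatus: overall health of the capacity
- resourceStatus.healthInfo.healthyBlockCount: number of healthy blocks in the reservation
- resourceStatus.reservationBlockCount: number of blocks in the reservation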
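The fields above can be combined into a quick utilization summary. The following is a minimal sketch, assuming the reservation was retrieved with the global --format=json flag of the Google Cloud CLI and parsed into a dictionary. The function name summarize_reservation and the trimmed sample dictionary are illustrative and aren't part of any Google Cloud library.

```python
def summarize_reservation(reservation: dict) -> dict:
    """Summarize chip utilization and health from a parsed reservation description."""
    aggregate = reservation.get("aggregateReservation", {})

    def total_chips(entries):
        # Each entry wraps an accelerator object that carries a chip count.
        return sum(int(e.get("accelerator", {}).get("acceleratorCount", 0)) for e in entries)

    reserved = total_chips(aggregate.get("reservedResources", []))
    in_use = total_chips(aggregate.get("inUseResources", []))
    status = reservation.get("resourceStatus", {})
    return {
        "mode": reservation.get("advancedDeploymentControl", {}).get("reservationOperationalMode"),
        "reserved_chips": reserved,
        "in_use_chips": in_use,
        "free_chips": reserved - in_use,
        "utilization": in_use / reserved if reserved else 0.0,
        "health": status.get("healthInfo", {}).get("healthStatus"),
        "blocks": status.get("reservationBlockCount"),
    }


# Trimmed copy of the sample output above (illustrative; in practice, parse the
# JSON printed by `gcloud compute reservations describe ... --format=json`).
sample_reservation = {
    "advancedDeploymentControl": {"reservationOperationalMode": "ALL_CAPACITY"},
    "aggregateReservation": {
        "inUseResources": [{"accelerator": {"acceleratorCount": 48}}],
        "reservedResources": [{"accelerator": {"acceleratorCount": 128}}],
    },
    "resourceStatus": {
        "healthInfo": {"degradedBlockCount": 0, "healthStatus": "HEALTHY", "healthyBlockCount": 1},
        "reservationBlockCount": 1,
    },
}

summary = summarize_reservation(sample_reservation)
print(summary)
```

Missing fields fall back to empty values rather than raising, because the exact set of fields in the output can vary by reservation.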
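List all reservation blocks
You can use the gcloud compute reservations blocks list command to display capacity, topology, and health information for all blocks in a reservation. In the following example, the reservation "example-reservation" contains two blocks: "example-reservation-block-0001" and "example-reservation-block-0002". Both blocks are in the cluster "example-cluster".
Each block, subblock, and host object is identified by a hashed ID. The ID of a parent object appears in the physical topology fields of its child objects, so you can use the hashed IDs to build a hierarchical view of the topology of your capacity.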
gcloud compute reservations blocks list example-reservation \
--project=example-project \
--zone=us-central1-c
The command displays output similar to the following:
count: 32
creationTimestamp: '2025-11-05T15:00:15.223-08:00'
healthInfo:
  degradedSubBlockCount: 0
  healthStatus: HEALTHY
  healthySubBlockCount: 2
id: '2996501069483632657'
inUseCount: 12
kind: compute#reservationBlock
name: example-reservation-block-0001
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
reservationMaintenance:
  schedulingType: GROUPED
reservationSubBlockCount: 2
reservationSubBlockInUseCount: 1
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
---
count: 128
creationTimestamp: '2025-08-19T18:23:32.825-07:00'
healthInfo:
  degradedSubBlockCount: 0
  healthStatus: HEALTHY
  healthySubBlockCount: 4
id: '9a0e671424e45fd480ca172ad7a4e25d'
inUseCount: 64
kind: compute#reservationBlock
name: example-reservation-block-0002
physicalTopology:
  block: 3feffcdeb6434d68bb818a836f75c1b8
  cluster: example-cluster
reservationMaintenance:
  schedulingType: GROUPED
reservationSubBlockCount: 2
reservationSubBlockInUseCount: 1
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0002
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
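The following values in the output describe each block in the reservation:
- count: number of physical hosts in the block
- healthInfo.healthStatus: overall health of the block
- healthInfo.healthySubBlockCount: number of healthy subblocks in the block
- id: ID of the block
- inUseCount: number of physical hosts in use
- kind: type of the object described
- name: name of the block
- physicalTopology.block: hashed ID of the block
- physicalTopology.cluster: cluster that contains the block
- reservationSubBlockCount: number of subblocks in the block
- reservationSubBlockInUseCount: number of subblocks in use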
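To compare blocks when deciding where to place a workload, you can rank them by health and free hosts. The following is a minimal sketch, assuming the block list was retrieved with --format=json and parsed into a list of dictionaries. The function name rank_blocks is hypothetical, and treating count minus inUseCount as the number of free hosts is an assumption based on the field descriptions above.

```python
def rank_blocks(blocks: list) -> list:
    """Return one row per block, healthy blocks first, then by free hosts (descending)."""
    rows = []
    for block in blocks:
        health = block.get("healthInfo", {})
        rows.append({
            "name": block["name"],
            "block_id": block.get("physicalTopology", {}).get("block"),
            # Assumption: hosts that are not in use are available for new TPU VMs.
            "free_hosts": int(block.get("count", 0)) - int(block.get("inUseCount", 0)),
            "healthy": health.get("healthStatus") == "HEALTHY"
            and int(health.get("degradedSubBlockCount", 0)) == 0,
        })
    return sorted(rows, key=lambda row: (not row["healthy"], -row["free_hosts"]))


# Trimmed copies of the two sample blocks shown above (illustrative).
sample_blocks = [
    {
        "name": "example-reservation-block-0001",
        "count": 32,
        "inUseCount": 12,
        "healthInfo": {"degradedSubBlockCount": 0, "healthStatus": "HEALTHY"},
        "physicalTopology": {"block": "9a0e671424e45fd480ca172ad7a4e25d", "cluster": "example-cluster"},
    },
    {
        "name": "example-reservation-block-0002",
        "count": 128,
        "inUseCount": 64,
        "healthInfo": {"degradedSubBlockCount": 0, "healthStatus": "HEALTHY"},
        "physicalTopology": {"block": "3feffcdeb6434d68bb818a836f75c1b8", "cluster": "example-cluster"},
    },
]

ranked = rank_blocks(sample_blocks)
for row in ranked:
    print(row["name"], row["free_hosts"], "healthy" if row["healthy"] else "degraded")
```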
Describe a reservation block
You can use the gcloud compute reservations blocks describe command to display information about a specific block.
gcloud compute reservations blocks describe example-reservation \
--block-name=example-reservation-block-0001 \
--project=example-project \
--zone=us-central1-c
The command displays output similar to the following:
resource:
  count: 32
  creationTimestamp: '2025-11-05T15:00:15.223-08:00'
  healthInfo:
    degradedSubBlockCount: 0
    healthStatus: HEALTHY
    healthySubBlockCount: 2
  id: '2996501069483632657'
  inUseCount: 12
  kind: compute#reservationBlock
  name: example-reservation-block-0001
  physicalTopology:
    block: 9a0e671424e45fd480ca172ad7a4e25d
    cluster: example-cluster
  reservationMaintenance:
    schedulingType: GROUPED
  reservationSubBlockCount: 2
  reservationSubBlockInUseCount: 1
  selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001
  selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/2996501069483632657
  status: READY
  zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
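The following values in the output describe the block:
- count: number of hosts in the block
- healthInfo.healthStatus: overall health of the block
- healthInfo.healthySubBlockCount: number of healthy subblocks in the block
- id: ID of the block
- inUseCount: number of hosts in use
- kind: type of the object described
- name: name of the block
- physicalTopology.block: hashed ID of the block
- physicalTopology.cluster: cluster that contains the block
- reservationSubBlockCount: number of subblocks in the block
- reservationSubBlockInUseCount: number of subblocks in use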
List all subblocks in a block
You can list the subblocks in a block to display information about each subblock:
gcloud compute reservations sub-blocks list example-reservation \
--block-name=example-reservation-block-0001 \
--project=example-project \
--zone=us-central1-c
The command displays output similar to the following:
count: 16
creationTimestamp: '2025-11-05T15:00:16.738-08:00'
healthInfo:
  degradedHostCount: 0
  degradedInfraCount: 0
  healthStatus: HEALTHY
  healthyHostCount: 16
  healthyInfraCount: 1
id: '8309376980435233263'
inUseCount: 0
kind: compute#reservationSubBlock
name: example-reservation-block-0001-subblock-0001
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
  subBlock: a0122935eb54d02750b65eef2d4f0366
reservationSubBlockMaintenance:
  schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0001
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/8309376980435233263
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
---
count: 16
creationTimestamp: '2025-11-05T15:00:16.736-08:00'
healthInfo:
  degradedHostCount: 0
  degradedInfraCount: 0
  healthStatus: HEALTHY
  healthyHostCount: 16
  healthyInfraCount: 1
id: '5629213080155482607'
inUseCount: 12
kind: compute#reservationSubBlock
name: example-reservation-block-0001-subblock-0002
physicalTopology:
  block: 9a0e671424e45fd480ca172ad7a4e25d
  cluster: example-cluster
  subBlock: 7aca49831e54d32970631524bc060d9c
reservationSubBlockMaintenance:
  schedulingType: GROUPED
selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0002
selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/5629213080155482607
status: READY
zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
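The following values in the output describe each subblock in the reservation:
- count: number of hosts in the subblock
- healthInfo.degradedInfraCount: health of the optical circuit switch (OCS) for the Ironwood cube. A value of 1 means that the OCS for the Ironwood cube is degraded. This value doesn't apply to Trillium
- healthInfo.healthStatus: overall health of the subblock
- healthInfo.healthyHostCount: number of healthy hosts in the subblock
- id: ID of the subblock
- inUseCount: number of hosts in use
- kind: type of the object described
- name: name of the subblock
- physicalTopology.block: hashed ID of the block that contains the subblock
- physicalTopology.cluster: cluster that contains the block
- physicalTopology.subBlock: hashed ID of the subblock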
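Because each subblock carries the hashed IDs of its parent block and cluster, the physicalTopology fields are enough to reconstruct the hierarchy. The following is a minimal sketch, assuming the subblock lists for one or more blocks were retrieved with --format=json and concatenated into a single list of dictionaries. The function names build_topology and render_topology are hypothetical.

```python
from collections import defaultdict


def build_topology(sub_blocks: list) -> dict:
    """Group subblocks as cluster -> block hashed ID -> subblock hashed ID."""
    tree = defaultdict(lambda: defaultdict(dict))
    for sub_block in sub_blocks:
        topology = sub_block["physicalTopology"]
        tree[topology["cluster"]][topology["block"]][topology["subBlock"]] = {
            "name": sub_block["name"],
            "hosts": int(sub_block["count"]),
            "in_use": int(sub_block["inUseCount"]),
            "health": sub_block["healthInfo"]["healthStatus"],
        }
    # Convert the nested defaultdicts to plain dictionaries before returning.
    return {cluster: {block: dict(subs) for block, subs in blocks.items()}
            for cluster, blocks in tree.items()}


def render_topology(tree: dict) -> str:
    """Render the hierarchy as indented text, one line per object."""
    lines = []
    for cluster, blocks in tree.items():
        lines.append(f"cluster {cluster}")
        for block, subs in blocks.items():
            lines.append(f"  block {block}")
            for sub_id, info in subs.items():
                lines.append(f"    subblock {sub_id}: {info['in_use']}/{info['hosts']} hosts in use, {info['health']}")
    return "\n".join(lines)


# Trimmed copies of the two sample subblocks shown above (illustrative).
sample_sub_blocks = [
    {
        "name": "example-reservation-block-0001-subblock-0001",
        "count": 16,
        "inUseCount": 0,
        "healthInfo": {"healthStatus": "HEALTHY"},
        "physicalTopology": {"block": "9a0e671424e45fd480ca172ad7a4e25d",
                             "cluster": "example-cluster",
                             "subBlock": "a0122935eb54d02750b65eef2d4f0366"},
    },
    {
        "name": "example-reservation-block-0001-subblock-0002",
        "count": 16,
        "inUseCount": 12,
        "healthInfo": {"healthStatus": "HEALTHY"},
        "physicalTopology": {"block": "9a0e671424e45fd480ca172ad7a4e25d",
                             "cluster": "example-cluster",
                             "subBlock": "7aca49831e54d32970631524bc060d9c"},
    },
]

topology_tree = build_topology(sample_sub_blocks)
print(render_topology(topology_tree))
```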
Describe a reservation subblock
You can use gcloud compute reservations sub-blocks describe to view information about a subblock:
gcloud compute reservations sub-blocks describe example-reservation \
--block-name=example-reservation-block-0001 \
--sub-block-name=example-reservation-block-0001-subblock-0002 \
--project=example-project \
--zone=us-central1-c
The command displays output similar to the following:
resource:
  count: 16
  creationTimestamp: '2025-11-05T15:00:16.736-08:00'
  healthInfo:
    degradedHostCount: 0
    degradedInfraCount: 0
    healthStatus: HEALTHY
    healthyHostCount: 16
    healthyInfraCount: 1
  id: '5629213080155482607'
  inUseCount: 12
  kind: compute#reservationSubBlock
  name: example-reservation-block-0001-subblock-0002
  physicalTopology:
    block: 9a0e671424e45fd480ca172ad7a4e25d
    cluster: example-cluster
    subBlock: 7aca49831e54d32970631524bc060d9c
  reservationSubBlockMaintenance:
    schedulingType: GROUPED
  selfLink: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/example-reservation-block-0001-subblock-0002
  selfLinkWithId: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c/reservations/example-reservation/reservationBlocks/example-reservation-block-0001/reservationSubBlocks/5629213080155482607
  status: READY
  zone: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-c
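The following values in the output describe the subblock:
- count: number of hosts in the subblock
- healthInfo.degradedInfraCount: health of the optical circuit switch (OCS) for the Ironwood cube. A value of 1 means that the OCS for the Ironwood cube is degraded. This value doesn't apply to Trillium
- healthInfo.healthStatus: overall health of the subblock
- healthInfo.healthyHostCount: number of healthy hosts in the subblock
- id: ID of the subblock
- inUseCount: number of hosts in use
- kind: type of the object described
- name: name of the subblock
- physicalTopology.block: hashed ID of the block that contains the subblock
- physicalTopology.cluster: cluster that contains the block
- physicalTopology.subBlock: hashed ID of the subblock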
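When you target a deployment at a specific subblock, the health and usage fields above can help you choose one. The following is a minimal sketch of a best-fit selection, assuming subblock descriptions parsed from --format=json output. The function name pick_sub_block is hypothetical, and the free-host estimate (count minus inUseCount minus degradedHostCount) is a conservative assumption rather than a documented formula.

```python
from typing import Optional


def pick_sub_block(sub_blocks: list, hosts_needed: int) -> Optional[str]:
    """Return the name of the healthy subblock with the fewest free hosts that still fits."""
    candidates = []
    for sub_block in sub_blocks:
        health = sub_block.get("healthInfo", {})
        if health.get("healthStatus") != "HEALTHY":
            continue
        # Skip subblocks whose infrastructure (for example, the OCS) is degraded.
        if int(health.get("degradedInfraCount", 0)) > 0:
            continue
        # Assumption: degraded hosts are treated as unavailable even if they are idle.
        free_hosts = (int(sub_block["count"]) - int(sub_block["inUseCount"])
                      - int(health.get("degradedHostCount", 0)))
        if free_hosts >= hosts_needed:
            candidates.append((free_hosts, sub_block["name"]))
    # Best fit: prefer the tightest match so larger subblocks stay open for bigger slices.
    return min(candidates)[1] if candidates else None


# Trimmed copies of the sample subblocks from this section (illustrative).
sample_candidates = [
    {"name": "example-reservation-block-0001-subblock-0001", "count": 16, "inUseCount": 0,
     "healthInfo": {"degradedHostCount": 0, "degradedInfraCount": 0, "healthStatus": "HEALTHY"}},
    {"name": "example-reservation-block-0001-subblock-0002", "count": 16, "inUseCount": 12,
     "healthInfo": {"degradedHostCount": 0, "degradedInfraCount": 0, "healthStatus": "HEALTHY"}},
]

print(pick_sub_block(sample_candidates, hosts_needed=4))
print(pick_sub_block(sample_candidates, hosts_needed=8))
print(pick_sub_block(sample_candidates, hosts_needed=32))
```

Best fit is only one possible policy. If you prefer to spread workloads out, replace min with max to select the subblock with the most free hosts.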
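Find the physical location of a TPU VM instance
After you provision a TPU slice, you can retrieve the physical location of its TPU VM instances. Knowing the physical location lets you understand how close TPU VM instances are to one another so that you can optimize workload scheduling.
You can use curl or the Google Cloud CLI to find the physical location of a TPU VM instance. The following examples show the physical location of a TPU VM instance in the sample reservation "example-reservation".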
curl
curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host_topology
gcloud
gcloud compute instances describe vm-1 \
--format="table[box,title=VM-Position](resourceStatus.physical_host_topology:label=location)" \
--zone=ZONE
Both commands display information about the cluster, block, subblock, and host of the TPU VM that you specify:
block: 3feffcdeb6434d68bb818a836f75c1b8
cluster: southamerica-west1-cluster-njga
subblock: cbee689cb721abdb0c7f80a4f2d0c1c7
host: 36b2d9731c1e1cf8594a759c8c4178f0
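To compare how close two TPU VM instances are, you can find the deepest topology level that they share. The following is a minimal sketch, assuming each location is available as text in the key: value form shown above. The function names parse_topology and closest_shared_level are hypothetical, and the second and third locations in the example are invented for illustration.

```python
from typing import Optional

# Topology levels ordered from the outermost to the innermost.
LEVELS = ("cluster", "block", "subblock", "host")


def parse_topology(text: str) -> dict:
    """Parse 'key: value' lines into a dictionary, keeping only known topology levels."""
    topology = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() in LEVELS:
            topology[key.strip()] = value.strip()
    return topology


def closest_shared_level(first: dict, second: dict) -> Optional[str]:
    """Return the innermost level at which both VMs have the same ID, or None."""
    shared = None
    for level in LEVELS:
        # Stop at the first level that is missing or differs; deeper levels can't match.
        if first.get(level) is None or first.get(level) != second.get(level):
            break
        shared = level
    return shared


vm_a = parse_topology("""block: 3feffcdeb6434d68bb818a836f75c1b8
cluster: southamerica-west1-cluster-njga
subblock: cbee689cb721abdb0c7f80a4f2d0c1c7
host: 36b2d9731c1e1cf8594a759c8c4178f0""")

# Hypothetical neighbors: same subblock but another host, and another block in the same cluster.
vm_b = dict(vm_a, host="ffffffffffffffffffffffffffffffff")
vm_c = dict(vm_a, block="00000000000000000000000000000000",
            subblock="11111111111111111111111111111111",
            host="22222222222222222222222222222222")

print(closest_shared_level(vm_a, vm_b))
print(closest_shared_level(vm_a, vm_c))
```

The comparison walks the levels from the cluster inward, so two VMs are reported as sharing a subblock only if they also share the enclosing block and cluster.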