创建 Cloud TPU 虚拟机

您可以使用 Create Node API、排队资源 API 或 Google Kubernetes Engine (GKE) 创建 TPU 虚拟机。

当您使用 Google Cloud CLI 运行 gcloud compute tpus tpu-vm create 命令以及使用Google Cloud 控制台创建 TPU 虚拟机时，系统会调用 Create Node API。当您使用“创建节点”API 时，系统会立即处理您的请求。如果没有足够的容量来满足您的请求，则请求会失败。

我们建议使用 Queued Resources API 创建 TPU 虚拟机。当您使用已排队的资源 API 创建 TPU 虚拟机时，Cloud TPU 服务会将您的已排队的资源请求添加到该服务维护的队列中。请求的资源可用后，该服务会将其分配给您的 Google Cloud 项目，供您立即专门使用。如需了解详情，请参阅管理已排队的资源。

如果您想使用 Google Kubernetes Engine (GKE) 管理 TPU 资源，请先创建一个 GKE 集群。然后，向集群添加包含 TPU 切片的节点池。如需了解详情，请参阅 GKE 中的 TPU 简介。

前提条件

完成以下前提条件：

按照为 TPU 设置 Google Cloud 项目中的说明，为 TPU 创建 Google Cloud 项目。
按照规划 Cloud TPU 资源中所述确定 TPU 要求。
如果您使用的是某个 Cloud 客户端库，请按照适用于您所用语言的设置说明执行操作。
- Python
- Java
- Node.js
设置用于创建具有 8 个芯片的 v5e TPU 的环境变量。以下示例使用具有 8 个芯片的 v5e TPU。您可以指定其他加速器类型和版本。如需了解详情，请参阅 TPU 版本。
```
  export TPU_NAME=your-tpu-name
  export PROJECT_ID=your-project
  export ZONE=us-central1-a
  export ACCELERATOR_TYPE=v5litepod-8
  export VERSION=v2-alpha-tpuv5-lite
```

使用 Create Node API 创建 Cloud TPU

您可以使用 gcloud、 Google Cloud 控制台或 Cloud TPU API 创建 Cloud TPU。

创建 Cloud TPU 时，请指定 TPU 软件版本（也称为运行时版本）。如需确定要使用的软件版本，请参阅 TPU 软件版本。

此外，还要指定所用 TPU 配置的 TensorCore 或 TPU 芯片数量。如需了解详情，请参阅系统架构中您所使用的 TPU 版本的相应部分。

gcloud

使用 gcloud compute tpus tpu-vm create 命令通过 Create Node API 创建 TPU。如需配置特定的内部或外部 IP 地址，请参阅外部和内部 IP 地址。

以下命令会创建一个具有 8 个 TPU 芯片的 v5e TPU 虚拟机：

gcloud compute tpus tpu-vm create $TPU_NAME \
  --project=$PROJECT_ID
  --zone=$ZONE \
  --accelerator-type=$ACCELERATOR_TYPE \
  --version=$VERSION

命令标志说明

zone: 您创建 Cloud TPU 的可用区。
accelerator-type: 加速器类型用于指定您要创建的 Cloud TPU 的版本和大小。如需详细了解每个 TPU 版本支持的加速器类型，请参阅 TPU 版本。
version: TPU 软件版本。

控制台

本部分介绍如何创建具有 8 个 TPU 芯片的 v5e TPU 虚拟机：

在 Google Cloud 控制台中，前往 TPU 页面：

前往 TPU
点击创建 TPU。
在名称字段中，输入 TPU 的名称。
在可用区字段中，选择您创建 TPU 的可用区。
在 TPU 类型字段中，选择一种加速器类型。加速器类型用于指定您要创建的 Cloud TPU 的版本和大小。如需详细了解每个 TPU 版本支持的加速器类型，请参阅 TPU 版本。
在 TPU 软件版本字段中，选择一个软件版本。创建 Cloud TPU 虚拟机时，TPU 软件版本指定了要安装的 TPU 运行时的版本。如需了解详情，请参阅 TPU 软件版本。
点击创建以创建资源。

curl

以下命令使用 curl 创建一个具有 8 个 TPU 芯片的 v5e TPU 虚拟机。

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" -d "{accelerator_type: $ACCELERATOR_TYPE, \
runtime_version:'$VERSION', \
network_config: {enable_external_ips: true}, \
shielded_instance_config: { enable_secure_boot: true }}" \
https://tpu.googleapis.com/v2/projects/$PROJECT_ID/locations/$ZONE/nodes?node_id=$TPU_NAME

必填字段

runtime_version: 您使用的 Cloud TPU 运行时版本。
project-id: 已注册的 Google Cloud 项目的名称。
zone: 您创建 Cloud TPU 的可用区。
node_name: 您创建的 TPU 虚拟机的名称。

Java

此代码示例使用 Java 中的 Cloud TPU API 创建一个具有 8 个 TPU 芯片的 v5e TPU 虚拟机。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.tpu.v2.CreateNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.TpuClient;
import com.google.cloud.tpu.v2.TpuSettings;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import org.threeten.bp.Duration;

public class CreateTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";
    // The accelerator type that specifies the version and size of the Cloud TPU you want to create.
    // For more information about supported accelerator types for each TPU version,
    // see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
    String tpuType = "v2-8";
    // Software version that specifies the version of the TPU runtime to install.
    // For more information see https://cloud.google.com/tpu/docs/runtimes
    String tpuSoftwareVersion = "v2-tpuv5-litepod";

    createTpuVm(projectId, zone, nodeName, tpuType, tpuSoftwareVersion);
  }

  // Creates a TPU VM with the specified name, zone, accelerator type, and version.
  public static Node createTpuVm(
      String projectId, String zone, String nodeName, String tpuType, String tpuSoftwareVersion)
      throws IOException, ExecutionException, InterruptedException {
    // With these settings the client library handles the Operation's polling mechanism
    // and prevent CancellationException error
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .createNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      Node tpuVm = Node.newBuilder()
              .setName(nodeName)
              .setAcceleratorType(tpuType)
              .setRuntimeVersion(tpuSoftwareVersion)
              .build();

      CreateNodeRequest request = CreateNodeRequest.newBuilder()
              .setParent(parent)
              .setNodeId(nodeName)
              .setNode(tpuVm)
              .build();

      return tpuClient.createNodeAsync(request).get();
    }
  }
}

Node.js

此代码示例使用 Node.js 中的 Cloud TPU API 创建一个具有 8 个 TPU 芯片的 v5e TPU 虚拟机。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;
const {Node, NetworkConfig} =
  require('@google-cloud/tpu').protos.google.cloud.tpu.v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update below line before running the sample.
// Project ID or project number of the Google Cloud project you want to create a node.
const projectId = await tpuClient.getProjectId();

// The name of the network you want the TPU node to connect to. The network should be assigned to your project.
const networkName = 'compute-tpu-network';

// The region of the network, that you want the TPU node to connect to.
const region = 'europe-west4';

// The name for your TPU.
const nodeName = 'node-name-1';

// The zone in which to create the TPU.
// For more information about supported TPU types for specific zones,
// see https://cloud.google.com/tpu/docs/regions-zones
const zone = 'europe-west4-a';

// The accelerator type that specifies the version and size of the Cloud TPU you want to create.
// For more information about supported accelerator types for each TPU version,
// see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
const tpuType = 'v5litepod-4';

// Software version that specifies the version of the TPU runtime to install. For more information,
// see https://cloud.google.com/tpu/docs/runtimes
const tpuSoftwareVersion = 'v2-tpuv5-litepod';

async function callCreateTpuVM() {
  // Create a node
  const node = new Node({
    name: nodeName,
    zone,
    acceleratorType: tpuType,
    runtimeVersion: tpuSoftwareVersion,
    // Define network
    networkConfig: new NetworkConfig({
      enableExternalIps: true,
      network: `projects/${projectId}/global/networks/${networkName}`,
      subnetwork: `projects/${projectId}/regions/${region}/subnetworks/${networkName}`,
    }),
  });

  const parent = `projects/${projectId}/locations/${zone}`;
  const request = {parent, node, nodeId: nodeName};

  const [operation] = await tpuClient.createNode(request);

  // Wait for the create operation to complete.
  const [response] = await operation.promise();

  console.log(`TPU VM: ${nodeName} created.`);
  return response;
}
return await callCreateTpuVM();

Python

此代码示例使用 Python 中的 Cloud TPU API 创建一个具有 8 个 TPU 芯片的 v5e TPU 虚拟机。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"
# tpu_type = "v5litepod-4"
# runtime_version = "v2-tpuv5-litepod"

# Create a TPU node
node = tpu_v2.Node()
node.accelerator_type = tpu_type
# To see available runtime version use command:
# gcloud compute tpus versions list --zone={ZONE}
node.runtime_version = runtime_version

request = tpu_v2.CreateNodeRequest(
    parent=f"projects/{project_id}/locations/{zone}",
    node_id=tpu_name,
    node=node,
)

# Create a TPU client
client = tpu_v2.TpuClient()
operation = client.create_node(request=request)
print("Waiting for operation to complete...")

response = operation.result()
print(response)
# Example response:
# name: "projects/[project_id]/locations/[zone]/nodes/my-tpu"
# accelerator_type: "v5litepod-4"
# state: READY
# ...

运行启动脚本

您可以在创建 TPU 虚拟机时指定 --metadata startup-script 标志，以便在 TPU 虚拟机上运行启动脚本。

gcloud

此命令会创建 TPU 虚拟机并指定启动脚本。

gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$ZONE \
  --accelerator-type=$ACCELERATOR_TYPE \
  --version=$VERSION \
  --metadata startup-script='#! /bin/bash
     pip3 install numpy
     EOF'

Java

此代码示例使用 Java 创建 TPU 虚拟机并指定启动脚本。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

import com.google.cloud.tpu.v2.CreateNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class CreateTpuVmWithStartupScript {
  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";
    // The accelerator type that specifies the version and size of the Cloud TPU you want to create.
    // For more information about supported accelerator types for each TPU version,
    // see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
    String acceleratorType = "v5litepod-4";
    // Software version that specifies the version of the TPU runtime to install.
    // For more information, see https://cloud.google.com/tpu/docs/runtimes
    String tpuSoftwareVersion = "v2-tpuv5-litepod";

    createTpuVmWithStartupScript(projectId, zone, nodeName, acceleratorType, tpuSoftwareVersion);
  }

  // Create a TPU VM with a startup script.
  public static Node createTpuVmWithStartupScript(String projectId, String zone,
      String nodeName, String acceleratorType, String tpuSoftwareVersion)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      String startupScriptContent = "#!/bin/bash\necho \"Hello from the startup script!\"";
      // Add startup script to metadata
      Map<String, String> metadata = new HashMap<>();
      metadata.put("startup-script", startupScriptContent);

      Node tpuVm =
          Node.newBuilder()
             .setName(nodeName)
             .setAcceleratorType(acceleratorType)
             .setRuntimeVersion(tpuSoftwareVersion)
             .putAllMetadata(metadata)
             .build();

      CreateNodeRequest request =
          CreateNodeRequest.newBuilder()
             .setParent(parent)
             .setNodeId(nodeName)
             .setNode(tpuVm)
             .build();

      return tpuClient.createNodeAsync(request).get();
    }
  }
}

Node.js

此代码示例创建了一个 TPU 虚拟机，并在 Node.js 中指定了一个启动脚本。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

const {Node, NetworkConfig} =
  require('@google-cloud/tpu').protos.google.cloud.tpu.v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to create a node.
const projectId = await tpuClient.getProjectId();

// The name of the network you want the TPU node to connect to. The network should be assigned to your project.
const networkName = 'compute-tpu-network';

// The region of the network, that you want the TPU node to connect to.
const region = 'europe-west4';

// The name for your TPU.
const nodeName = 'node-name-1';

// The zone in which to create the TPU.
// For more information about supported TPU types for specific zones,
// see https://cloud.google.com/tpu/docs/regions-zones
const zone = 'europe-west4-a';

// The accelerator type that specifies the version and size of the Cloud TPU you want to create.
// For more information about supported accelerator types for each TPU version,
// see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
const tpuType = 'v5litepod-4';

// Software version that specifies the version of the TPU runtime to install. For more information,
// see https://cloud.google.com/tpu/docs/runtimes
const tpuSoftwareVersion = 'v2-tpuv5-litepod';

async function callCreateTpuVMStartupScript() {
  // Create a node
  const node = new Node({
    name: nodeName,
    zone,
    acceleratorType: tpuType,
    runtimeVersion: tpuSoftwareVersion,
    // Define network
    networkConfig: new NetworkConfig({
      enableExternalIps: true,
      network: `projects/${projectId}/global/networks/${networkName}`,
      subnetwork: `projects/${projectId}/regions/${region}/subnetworks/${networkName}`,
    }),
    metadata: {
      // The script updates numpy to the latest version and logs the output to a file.
      'startup-script': `#!/bin/bash
        echo "Hello World" > /var/log/hello.log
        sudo pip3 install --upgrade numpy >> /var/log/hello.log 2>&1`,
    },
  });

  const parent = `projects/${projectId}/locations/${zone}`;
  const request = {parent, node, nodeId: nodeName};

  const [operation] = await tpuClient.createNode(request);

  // Wait for the create operation to complete.
  const [response] = await operation.promise();

  console.log(JSON.stringify(response));
  return response;
}
return await callCreateTpuVMStartupScript();

Python

此代码示例会创建一个 TPU 虚拟机，并以 Python 语言指定启动脚本。

如需向 Cloud TPU 进行身份验证，请设置应用默认凭证。如需了解详情，请参阅为本地开发环境设置身份验证。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"
# tpu_type = "v5litepod-4"
# runtime_version = "v2-tpuv5-litepod"

node = tpu_v2.Node()
node.accelerator_type = tpu_type
node.runtime_version = runtime_version

# This startup script updates numpy to the latest version and logs the output to a file.
metadata = {
    "startup-script": """#!/bin/bash
echo "Hello World" > /var/log/hello.log
sudo pip3 install --upgrade numpy >> /var/log/hello.log 2>&1
"""
}

# Adding metadata with startup script to the TPU node.
node.metadata = metadata
# Enabling external IPs for internet access from the TPU node.
node.network_config = tpu_v2.NetworkConfig(enable_external_ips=True)

request = tpu_v2.CreateNodeRequest(
    parent=f"projects/{project_id}/locations/{zone}",
    node_id=tpu_name,
    node=node,
)

client = tpu_v2.TpuClient()
operation = client.create_node(request=request)
print("Waiting for operation to complete...")

response = operation.result()
print(response.metadata)
# Example response:
# {'startup-script': '#!/bin/bash\n    echo "Hello World" > /var/log/hello.log\n
# ...

后续步骤

了解已排队的资源。
了解如何管理 TPU 虚拟机。
了解 GKE 中的 TPU。
了解如何在 TPU 虚拟机上运行 JAX 代码。
了解如何在 TPU 虚拟机上运行 PyTorch 代码。
了解如何在 TPU 上运行机器学习工作负载，例如通过 vLLM 在 TPU 上提供 Qwen2-72B-Instruct。