建立 Cloud TPU VM

您可以使用「建立節點」API、排隊資源 API 或 Google Kubernetes Engine (GKE) 建立 TPU VM。

使用 Google Cloud CLI 執行 gcloud compute tpus tpu-vm create 指令時,以及使用Google Cloud 控制台建立 TPU VM 時,系統會呼叫 Create Node API。使用「建立節點」API 時,系統會立即處理您的要求。如果容量不足,無法完成要求,要求就會失敗。

建議使用佇列資源 API 建立 TPU VM。使用佇列資源 API 建立 TPU VM 時,Cloud TPU 服務會將佇列資源要求新增至服務維護的佇列。當要求的資源可用時,服務會將其指派給您的 Google Cloud 專案,供您立即專屬使用。詳情請參閱「管理佇列資源」。

如要使用 Google Kubernetes Engine (GKE) 管理 TPU 資源,請先建立 GKE 叢集。接著,在叢集中新增包含 TPU 節點的節點集區。詳情請參閱「GKE 中的 TPU 簡介」。

必要條件

完成下列必要條件:

  1. 如「設定 TPU 專案」一文所述,為 TPU 建立 Google Cloud 專案。 Google Cloud

  2. 如「規劃 Cloud TPU 資源」一文所述,判斷 TPU 需求。

  3. 如果您使用Cloud 用戶端程式庫,請按照所用語言的設定說明操作:

  4. 設定環境變數,建立含有八個晶片的 v5e TPU。下列範例使用八個晶片的 v5e TPU。您可以指定其他加速器類型和版本。詳情請參閱「TPU 版本」。

      export TPU_NAME=your-tpu-name
      export PROJECT_ID=your-project
      export ZONE=us-central1-a
      export ACCELERATOR_TYPE=v5litepod-8
      export VERSION=v2-alpha-tpuv5-lite
    

使用 Create Node API 建立 Cloud TPU

您可以使用 gcloud、 Google Cloud 控制台或 Cloud TPU API 建立 Cloud TPU。

建立 Cloud TPU 時,請指定 TPU 軟體版本 (也稱為執行階段版本)。如要判斷要使用的軟體版本,請參閱「TPU 軟體版本」。

此外,請為使用的 TPU 設定指定 TensorCore 或 TPU 晶片的數量。詳情請參閱「系統架構」中 TPU 版本的相關章節。

gcloud

使用 gcloud compute tpus tpu-vm create 指令,透過 Create Node API 建立 TPU。如要設定特定內部或外部 IP 位址,請參閱「外部和內部 IP 位址」。

下列指令會建立含有 8 個 TPU 晶片的 v5e TPU VM:

gcloud compute tpus tpu-vm create $TPU_NAME \
  --project=$PROJECT_ID
  --zone=$ZONE \
  --accelerator-type=$ACCELERATOR_TYPE \
  --version=$VERSION

指令旗標說明

zone
您建立 Cloud TPU 的區域
accelerator-type
加速器類型會指定您建立的 Cloud TPU 版本和大小。如要進一步瞭解各個 TPU 版本支援的加速器類型,請參閱 TPU 版本
version
TPU 軟體版本

控制台

以下操作說明會建立具有 8 個 TPU 晶片的 v5e TPU VM:

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 按一下「建立 TPU」

  3. 在「Name」(名稱) 欄位中,輸入 TPU 的名稱。

  4. 在「Zone」(可用區) 欄位中,選取建立 TPU 的可用區。

  5. 在「TPU type」(TPU 類型) 欄位中,選取加速器類型。 加速器類型會指定您建立的 Cloud TPU 版本和大小。如要進一步瞭解各個 TPU 版本支援的加速器類型,請參閱「TPU 版本」。

  6. 在「TPU 軟體版本」欄位中,選取軟體版本。建立 Cloud TPU VM 時,TPU 軟體版本會指定要安裝的 TPU 執行階段版本。詳情請參閱「TPU 軟體版本」。

  7. 按一下「建立」即可建立資源。

curl

下列指令會使用 curl 建立含有 8 個 TPU 晶片的 v5e TPU VM。

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" -d "{accelerator_type: $ACCELERATOR_TYPE, \
runtime_version:'$VERSION', \
network_config: {enable_external_ips: true}, \
shielded_instance_config: { enable_secure_boot: true }}" \
https://tpu.googleapis.com/v2/projects/$PROJECT_ID/locations/$ZONE/nodes?node_id=$TPU_NAME

必填欄位

runtime_version
您使用的 Cloud TPU 執行階段版本。
project-id
已註冊 Google Cloud 專案的名稱。
zone
您建立 Cloud TPU 的區域
node_name
您建立的 TPU VM 名稱。

Java

這個程式碼範例會使用 Java 中的 Cloud TPU API,建立具有 8 個 TPU 晶片的 v5e TPU VM。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.tpu.v2.CreateNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.TpuClient;
import com.google.cloud.tpu.v2.TpuSettings;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import org.threeten.bp.Duration;

public class CreateTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";
    // The accelerator type that specifies the version and size of the Cloud TPU you want to create.
    // For more information about supported accelerator types for each TPU version,
    // see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
    String tpuType = "v2-8";
    // Software version that specifies the version of the TPU runtime to install.
    // For more information see https://cloud.google.com/tpu/docs/runtimes
    String tpuSoftwareVersion = "v2-tpuv5-litepod";

    createTpuVm(projectId, zone, nodeName, tpuType, tpuSoftwareVersion);
  }

  // Creates a TPU VM with the specified name, zone, accelerator type, and version.
  public static Node createTpuVm(
      String projectId, String zone, String nodeName, String tpuType, String tpuSoftwareVersion)
      throws IOException, ExecutionException, InterruptedException {
    // With these settings the client library handles the Operation's polling mechanism
    // and prevent CancellationException error
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .createNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      Node tpuVm = Node.newBuilder()
              .setName(nodeName)
              .setAcceleratorType(tpuType)
              .setRuntimeVersion(tpuSoftwareVersion)
              .build();

      CreateNodeRequest request = CreateNodeRequest.newBuilder()
              .setParent(parent)
              .setNodeId(nodeName)
              .setNode(tpuVm)
              .build();

      return tpuClient.createNodeAsync(request).get();
    }
  }
}

Node.js

這個程式碼範例會使用 Node.js 中的 Cloud TPU API,建立具有 8 個 TPU 晶片的 v5e TPU VM。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;
const {Node, NetworkConfig} =
  require('@google-cloud/tpu').protos.google.cloud.tpu.v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update below line before running the sample.
// Project ID or project number of the Google Cloud project you want to create a node.
const projectId = await tpuClient.getProjectId();

// The name of the network you want the TPU node to connect to. The network should be assigned to your project.
const networkName = 'compute-tpu-network';

// The region of the network, that you want the TPU node to connect to.
const region = 'europe-west4';

// The name for your TPU.
const nodeName = 'node-name-1';

// The zone in which to create the TPU.
// For more information about supported TPU types for specific zones,
// see https://cloud.google.com/tpu/docs/regions-zones
const zone = 'europe-west4-a';

// The accelerator type that specifies the version and size of the Cloud TPU you want to create.
// For more information about supported accelerator types for each TPU version,
// see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
const tpuType = 'v5litepod-4';

// Software version that specifies the version of the TPU runtime to install. For more information,
// see https://cloud.google.com/tpu/docs/runtimes
const tpuSoftwareVersion = 'v2-tpuv5-litepod';

async function callCreateTpuVM() {
  // Create a node
  const node = new Node({
    name: nodeName,
    zone,
    acceleratorType: tpuType,
    runtimeVersion: tpuSoftwareVersion,
    // Define network
    networkConfig: new NetworkConfig({
      enableExternalIps: true,
      network: `projects/${projectId}/global/networks/${networkName}`,
      subnetwork: `projects/${projectId}/regions/${region}/subnetworks/${networkName}`,
    }),
  });

  const parent = `projects/${projectId}/locations/${zone}`;
  const request = {parent, node, nodeId: nodeName};

  const [operation] = await tpuClient.createNode(request);

  // Wait for the create operation to complete.
  const [response] = await operation.promise();

  console.log(`TPU VM: ${nodeName} created.`);
  return response;
}
return await callCreateTpuVM();

Python

這個程式碼範例會使用 Python 中的 Cloud TPU API,建立具有 8 個 TPU 晶片的 v5e TPU VM。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"
# tpu_type = "v5litepod-4"
# runtime_version = "v2-tpuv5-litepod"

# Create a TPU node
node = tpu_v2.Node()
node.accelerator_type = tpu_type
# To see available runtime version use command:
# gcloud compute tpus versions list --zone={ZONE}
node.runtime_version = runtime_version

request = tpu_v2.CreateNodeRequest(
    parent=f"projects/{project_id}/locations/{zone}",
    node_id=tpu_name,
    node=node,
)

# Create a TPU client
client = tpu_v2.TpuClient()
operation = client.create_node(request=request)
print("Waiting for operation to complete...")

response = operation.result()
print(response)
# Example response:
# name: "projects/[project_id]/locations/[zone]/nodes/my-tpu"
# accelerator_type: "v5litepod-4"
# state: READY
# ...

執行開機指令碼

如要在 TPU VM 上執行啟動指令碼,請在建立 TPU VM 時指定 --metadata startup-script 標記。

gcloud

這個指令會建立 TPU VM,並指定啟動指令碼。

gcloud compute tpus tpu-vm create $TPU_NAME \
  --zone=$ZONE \
  --accelerator-type=$ACCELERATOR_TYPE \
  --version=$VERSION \
  --metadata startup-script='#! /bin/bash
     pip3 install numpy
     EOF'

Java

這個程式碼範例會建立 TPU VM,並以 Java 指定開機指令碼。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.tpu.v2.CreateNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class CreateTpuVmWithStartupScript {
  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";
    // The accelerator type that specifies the version and size of the Cloud TPU you want to create.
    // For more information about supported accelerator types for each TPU version,
    // see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
    String acceleratorType = "v5litepod-4";
    // Software version that specifies the version of the TPU runtime to install.
    // For more information, see https://cloud.google.com/tpu/docs/runtimes
    String tpuSoftwareVersion = "v2-tpuv5-litepod";

    createTpuVmWithStartupScript(projectId, zone, nodeName, acceleratorType, tpuSoftwareVersion);
  }

  // Create a TPU VM with a startup script.
  public static Node createTpuVmWithStartupScript(String projectId, String zone,
      String nodeName, String acceleratorType, String tpuSoftwareVersion)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      String startupScriptContent = "#!/bin/bash\necho \"Hello from the startup script!\"";
      // Add startup script to metadata
      Map<String, String> metadata = new HashMap<>();
      metadata.put("startup-script", startupScriptContent);

      Node tpuVm =
          Node.newBuilder()
             .setName(nodeName)
             .setAcceleratorType(acceleratorType)
             .setRuntimeVersion(tpuSoftwareVersion)
             .putAllMetadata(metadata)
             .build();

      CreateNodeRequest request =
          CreateNodeRequest.newBuilder()
             .setParent(parent)
             .setNodeId(nodeName)
             .setNode(tpuVm)
             .build();

      return tpuClient.createNodeAsync(request).get();
    }
  }
}

Node.js

這個程式碼範例會建立 TPU VM,並在 Node.js 中指定開機指令碼。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

const {Node, NetworkConfig} =
  require('@google-cloud/tpu').protos.google.cloud.tpu.v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to create a node.
const projectId = await tpuClient.getProjectId();

// The name of the network you want the TPU node to connect to. The network should be assigned to your project.
const networkName = 'compute-tpu-network';

// The region of the network, that you want the TPU node to connect to.
const region = 'europe-west4';

// The name for your TPU.
const nodeName = 'node-name-1';

// The zone in which to create the TPU.
// For more information about supported TPU types for specific zones,
// see https://cloud.google.com/tpu/docs/regions-zones
const zone = 'europe-west4-a';

// The accelerator type that specifies the version and size of the Cloud TPU you want to create.
// For more information about supported accelerator types for each TPU version,
// see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#versions.
const tpuType = 'v5litepod-4';

// Software version that specifies the version of the TPU runtime to install. For more information,
// see https://cloud.google.com/tpu/docs/runtimes
const tpuSoftwareVersion = 'v2-tpuv5-litepod';

async function callCreateTpuVMStartupScript() {
  // Create a node
  const node = new Node({
    name: nodeName,
    zone,
    acceleratorType: tpuType,
    runtimeVersion: tpuSoftwareVersion,
    // Define network
    networkConfig: new NetworkConfig({
      enableExternalIps: true,
      network: `projects/${projectId}/global/networks/${networkName}`,
      subnetwork: `projects/${projectId}/regions/${region}/subnetworks/${networkName}`,
    }),
    metadata: {
      // The script updates numpy to the latest version and logs the output to a file.
      'startup-script': `#!/bin/bash
        echo "Hello World" > /var/log/hello.log
        sudo pip3 install --upgrade numpy >> /var/log/hello.log 2>&1`,
    },
  });

  const parent = `projects/${projectId}/locations/${zone}`;
  const request = {parent, node, nodeId: nodeName};

  const [operation] = await tpuClient.createNode(request);

  // Wait for the create operation to complete.
  const [response] = await operation.promise();

  console.log(JSON.stringify(response));
  return response;
}
return await callCreateTpuVMStartupScript();

Python

這個程式碼範例會建立 TPU VM,並以 Python 指定開機指令碼。

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"
# tpu_type = "v5litepod-4"
# runtime_version = "v2-tpuv5-litepod"

node = tpu_v2.Node()
node.accelerator_type = tpu_type
node.runtime_version = runtime_version

# This startup script updates numpy to the latest version and logs the output to a file.
metadata = {
    "startup-script": """#!/bin/bash
echo "Hello World" > /var/log/hello.log
sudo pip3 install --upgrade numpy >> /var/log/hello.log 2>&1
"""
}

# Adding metadata with startup script to the TPU node.
node.metadata = metadata
# Enabling external IPs for internet access from the TPU node.
node.network_config = tpu_v2.NetworkConfig(enable_external_ips=True)

request = tpu_v2.CreateNodeRequest(
    parent=f"projects/{project_id}/locations/{zone}",
    node_id=tpu_name,
    node=node,
)

client = tpu_v2.TpuClient()
operation = client.create_node(request=request)
print("Waiting for operation to complete...")

response = operation.result()
print(response.metadata)
# Example response:
# {'startup-script': '#!/bin/bash\n    echo "Hello World" > /var/log/hello.log\n
# ...

後續步驟