管理 TPU 資源

本頁說明如何列出、停止、啟動、刪除及連線至 TPU VM。

必要條件

執行這些程序之前,請先完成下列步驟:

  1. 如「設定 TPU 專案」一文所述,為 TPU 建立 Google Cloud 專案。 Google Cloud

  2. 如「規劃 Cloud TPU 資源」一文所述,判斷 TPU 需求。

  3. 如「建立 TPU VM」一文所述,建立 TPU VM。

  4. 如果您使用其中一個 Cloud 用戶端程式庫,請按照所用語言的設定說明操作:

  5. 設定環境變數。

      export TPU_NAME=your-tpu-name
      export ZONE=your-zone
    

連線至 Cloud TPU

您可以使用 SSH 連線至 Cloud TPU。

如果無法使用 SSH 連線至 TPU VM,可能是因為 TPU VM 沒有外部 IP 位址。如要存取沒有外部 IP 位址的 TPU VM,請按照「連線至沒有公開 IP 位址的 TPU VM」一文中的操作說明進行。

gcloud

使用 SSH 連線至 Cloud TPU:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE

如果要求的配量大於單一主機,Cloud TPU 會為每個主機建立 TPU VM。每個主機的 TPU 晶片數量取決於 TPU 版本

如要安裝二進位檔或執行程式碼,請使用 tpu-vm ssh 指令連線至每個 TPU VM。

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE

如要使用 SSH 連線至特定 TPU VM,請使用 --worker 旗標和以 0 為基準的索引:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --worker=1

如要在所有 TPU VM 上執行指令,請使用 --worker=all--command 標記:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --zone=$ZONE \
  --worker=all \
  --command='pip install "jax[tpu]==0.4.20" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html'

如果是 Multislice,您可以使用列舉的 TPU 名稱在單一 VM 上執行指令,每個切片的前置字串和編號都會附加至該名稱。如要在所有切片的所有 TPU VM 上執行指令,請使用 --node=all--worker=all--command 標記,並視需要使用 --batch-size 標記。

$ gcloud compute tpus queued-resources ssh your-queued-resource-id \
  --zone=$ZONE \
  --node=all \
  --worker=all \
  --command='pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html' \
  --batch-size=4

使用 Google Cloud CLI 連線至 VM 時,Compute Engine 會建立永久 SSH 金鑰。

控制台

如要透過 Google Cloud 控制台連線至 TPU,請在瀏覽器中使用 SSH:

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 在 TPU VM 清單中,找到要連線的 TPU VM,然後按一下該列中的「SSH」SSH

    透過 Google Cloud 控制台連線至 TPU VM 時,Compute Engine 會建立暫時性 SSH 金鑰。

列出 Cloud TPU 資源

您可以列出指定區域中的所有 Cloud TPU 資源。

gcloud

$ gcloud compute tpus tpu-vm list --zone=$ZONE

控制台

前往 Google Cloud 控制台的「TPUs」頁面:

前往 TPU

Java

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.tpu.v2.ListNodesRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;

public class ListTpuVms {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPUs are located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";

    listTpuVms(projectId, zone);
  }

  // Lists TPU VMs in the specified zone.
  public static TpuClient.ListNodesPage listTpuVms(String projectId, String zone)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      ListNodesRequest request = ListNodesRequest.newBuilder().setParent(parent).build();

      return tpuClient.listNodes(request).getPage();
    }
  }
}

Node.js

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to retrive a list of TPU nodes.
const projectId = await tpuClient.getProjectId();

// The zone from which the TPUs are retrived.
const zone = 'europe-west4-a';

async function callTpuVMList() {
  const request = {
    parent: `projects/${projectId}/locations/${zone}`,
  };

  const [response] = await tpuClient.listNodes(request);

  return response;
}

return await callTpuVMList();

Python

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"

client = tpu_v2.TpuClient()

nodes = client.list_nodes(parent=f"projects/{project_id}/locations/{zone}")
for node in nodes:
    print(node.name)
    print(node.state)
    print(node.accelerator_type)
# Example response:
# projects/[project_id]/locations/[zone]/nodes/node-name
# State.READY
# v2-8

擷取 Cloud TPU 資訊

您可以擷取特定 Cloud TPU 的相關資訊。

gcloud

$ gcloud compute tpus tpu-vm describe $TPU_NAME \
  --zone=$ZONE

控制台

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 按一下 Cloud TPU 的名稱。主控台會顯示 Cloud TPU 詳細資料頁面。

Java

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.tpu.v2.GetNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;

public class GetTpuVm {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    getTpuVm(projectId, zone, nodeName);
  }

  // Describes a TPU VM with the specified name in the given project and zone.
  public static Node getTpuVm(String projectId, String zone, String nodeName)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      GetNodeRequest request = GetNodeRequest.newBuilder().setName(name).build();

      return tpuClient.getNode(request);
    }
  }
}

Node.js

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to retrive a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to retrive.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callGetTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [response] = await tpuClient.getNode(request);

  console.log(`Node: ${nodeName} retrived.`);
  return response;
}

return await callGetTpuVM();

Python

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()
node = client.get_node(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}"
)

print(node)
# Example response:
# name: "projects/[project_id]/locations/[zone]/nodes/tpu-name"
# state: "READY"
# runtime_version: ...

停止 Cloud TPU 資源

您可以停止單一 Cloud TPU,避免產生費用,同時保留 VM 設定和軟體。

已加入佇列的資源 API 不支援停止 TPU 配量或 TPU。如要停止透過排入佇列的資源 API 分配 TPU 的計費,請刪除 TPU。

gcloud

$ gcloud compute tpus tpu-vm stop $TPU_NAME \
  --zone=$ZONE

控制台

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 選取 Cloud TPU 旁的核取方塊。

  3. 按一下 「停止」

Java

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.StopNodeRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class StopTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPU is located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    stopTpuVm(projectId, zone, nodeName);
  }

  // Stops a TPU VM with the specified name in the given project and zone.
  public static Node stopTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      StopNodeRequest request = StopNodeRequest.newBuilder().setName(name).build();

      return tpuClient.stopNodeAsync(request).get();
    }
  }
}

Node.js

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to stop a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to stop.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callStopTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.stopNode(request);
  // Wait for the operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} stopped.`);
  return response;
}

return await callStopTpuVM();

Python

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()

request = tpu_v2.StopNodeRequest(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}",
)
try:
    operation = client.stop_node(request=request)
    print("Waiting for stop operation to complete...")
    response = operation.result()
    print(f"This TPU {tpu_name} has been stopped")
    print(response.state)
    # Example response:
    # State.STOPPED

except Exception as e:
    print(e)
    raise e

啟動 Cloud TPU 資源

您可以啟動已停止的 Cloud TPU。

佇列資源 API 不支援啟動 TPU Pod 或 TPU。

gcloud

$ gcloud compute tpus tpu-vm start $TPU_NAME \
  --zone=$ZONE

控制台

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 選取 Cloud TPU 旁的核取方塊。

  3. 按一下「開始」

Java

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.StartNodeRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class StartTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPU is located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    startTpuVm(projectId, zone, nodeName);
  }

  // Starts a TPU VM with the specified name in the given project and zone.
  public static Node startTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      StartNodeRequest request = StartNodeRequest.newBuilder().setName(name).build();

      return tpuClient.startNodeAsync(request).get();
    }
  }
}

Node.js

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to start a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to start.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callStartTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.startNode(request);

  // Wait for the operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} started.`);
  return response;
}

return await callStartTpuVM();

Python

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()

request = tpu_v2.StartNodeRequest(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}",
)
try:
    operation = client.start_node(request=request)
    print("Waiting for start operation to complete...")
    response = operation.result()
    print(f"TPU {tpu_name} has been started")
    print(response.state)
    # Example response:
    # State.READY

except Exception as e:
    print(e)
    raise e

刪除 Cloud TPU

工作階段結束後,請刪除 TPU VM 節點。

gcloud

$ gcloud compute tpus tpu-vm delete $TPU_NAME \
  --zone=$ZONE \
  --quiet

指令旗標說明

  • zone:您要刪除 Cloud TPU 的區域
  • quiet:執行 gcloud CLI 指令時,停用所有互動式提示。

控制台

  1. 前往 Google Cloud 控制台的「TPUs」頁面:

    前往 TPU

  2. 選取 Cloud TPU 旁的核取方塊。

  3. 按一下「刪除」圖示

Java

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.tpu.v2.DeleteNodeRequest;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.TpuClient;
import com.google.cloud.tpu.v2.TpuSettings;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import org.threeten.bp.Duration;

public class DeleteTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    deleteTpuVm(projectId, zone, nodeName);
  }

  // Deletes a TPU VM with the specified name in the given project and zone.
  public static void deleteTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // With these settings the client library handles the Operation's polling mechanism
    // and prevent CancellationException error
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .deleteNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      DeleteNodeRequest request = DeleteNodeRequest.newBuilder().setName(name).build();

      tpuClient.deleteNodeAsync(request).get();
      System.out.println("TPU VM deleted");
    }
  }
}

Node.js

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to delete a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to delete.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callDeleteTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.deleteNode(request);

  // Wait for the delete operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} deleted.`);
  return response;
}

return await callDeleteTpuVM();

Python

如要向 Cloud TPU 進行驗證,請設定應用程式預設憑證。詳情請參閱「為本機開發環境設定驗證機制」。

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()
try:
    client.delete_node(
        name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}"
    )
    print("The TPU node was deleted.")
except Exception as e:
    print(e)

後續步驟