Manage TPU resources

This page describes how to list, stop, start, delete, and connect to TPU VMs.

Prerequisites

Before you run these procedures, complete the following steps:

  1. Create a Google Cloud project for your TPUs as described in Set up a Google Cloud project for TPUs.

  2. Determine your TPU requirements as described in Plan your Cloud TPU resources.

  3. Create a TPU VM as described in Create a TPU VM.

  4. If you use one of the Cloud Client Libraries, follow the setup instructions for the language you're using:

  5. Set up environment variables.

      export TPU_NAME=your-tpu-name
      export ZONE=your-zone
    

Connect to a Cloud TPU

You can connect to a Cloud TPU using SSH.

If you can't connect to a TPU VM using SSH, the TPU VM might not have an external IP address. To access a TPU VM without an external IP address, follow the instructions in Connect to a TPU VM without a public IP address.

gcloud

Connect to your Cloud TPU using SSH:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE

When you request a slice larger than a single host, Cloud TPU creates a TPU VM for each host. The number of TPU chips per host depends on the TPU version.

To install binaries or run code, connect to each TPU VM using the tpu-vm ssh command.

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE

To connect to a specific TPU VM using SSH, use the --worker flag with a 0-based index:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME --zone=$ZONE --worker=1

To run a command on all TPU VMs, use the --worker=all and --command flags:

$ gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --zone=$ZONE \
  --worker=all \
  --command='pip install "jax[tpu]==0.4.20" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html'

For Multislice, you can run a command on a single VM using the enumerated TPU name, with each slice prefix and the number appended to it. To run a command on all TPU VMs in all slices, use the --node=all, --worker=all, and --command flags, with an optional --batch-size flag.

$ gcloud compute tpus queued-resources ssh your-queued-resource-id \
  --zone=$ZONE \
  --node=all \
  --worker=all \
  --command='pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html' \
  --batch-size=4

When you connect to VMs using the Google Cloud CLI, Compute Engine creates a persistent SSH key.

Console

To connect to your TPUs in the Google Cloud console, use SSH in your browser:

  1. In the Google Cloud console, go to the TPUs page:

    Go to TPUs

  2. In the list of TPU VMs, click SSH in the row of the TPU VM that you want to connect to.

    When you connect to TPU VMs using the Google Cloud console, Compute Engine creates an ephemeral SSH key.

List Cloud TPU resources

You can list all Cloud TPU resources in a specified zone.

gcloud

$ gcloud compute tpus tpu-vm list --zone=$ZONE

Console

In the Google Cloud console, go to the TPUs page:

Go to TPUs

Java

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.tpu.v2.ListNodesRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;

public class ListTpuVms {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPUs are located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";

    listTpuVms(projectId, zone);
  }

  // Lists TPU VMs in the specified zone.
  public static TpuClient.ListNodesPage listTpuVms(String projectId, String zone)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);

      ListNodesRequest request = ListNodesRequest.newBuilder().setParent(parent).build();

      return tpuClient.listNodes(request).getPage();
    }
  }
}

Node.js

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to retrive a list of TPU nodes.
const projectId = await tpuClient.getProjectId();

// The zone from which the TPUs are retrived.
const zone = 'europe-west4-a';

async function callTpuVMList() {
  const request = {
    parent: `projects/${projectId}/locations/${zone}`,
  };

  const [response] = await tpuClient.listNodes(request);

  return response;
}

return await callTpuVMList();

Python

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"

client = tpu_v2.TpuClient()

nodes = client.list_nodes(parent=f"projects/{project_id}/locations/{zone}")
for node in nodes:
    print(node.name)
    print(node.state)
    print(node.accelerator_type)
# Example response:
# projects/[project_id]/locations/[zone]/nodes/node-name
# State.READY
# v2-8

Retrieve Cloud TPU information

You can retrieve information about a specific Cloud TPU.

gcloud

$ gcloud compute tpus tpu-vm describe $TPU_NAME \
  --zone=$ZONE

Console

  1. In the Google Cloud console, go to the TPUs page:

    Go to TPUs

  2. Click the name of your Cloud TPU. The console displays the Cloud TPU details page.

Java

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.tpu.v2.GetNodeRequest;
import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;

public class GetTpuVm {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    getTpuVm(projectId, zone, nodeName);
  }

  // Describes a TPU VM with the specified name in the given project and zone.
  public static Node getTpuVm(String projectId, String zone, String nodeName)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      GetNodeRequest request = GetNodeRequest.newBuilder().setName(name).build();

      return tpuClient.getNode(request);
    }
  }
}

Node.js

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to retrive a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to retrive.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callGetTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [response] = await tpuClient.getNode(request);

  console.log(`Node: ${nodeName} retrived.`);
  return response;
}

return await callGetTpuVM();

Python

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()
node = client.get_node(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}"
)

print(node)
# Example response:
# name: "projects/[project_id]/locations/[zone]/nodes/tpu-name"
# state: "READY"
# runtime_version: ...

Stop Cloud TPU resources

You can stop a single Cloud TPU to avoid incurring charges without losing its VM configuration and software.

The queued resources API does not support stopping TPU slices or TPUs. To stop incurring charges for TPUs allocated through the queued resources API, delete the TPU.

gcloud

$ gcloud compute tpus tpu-vm stop $TPU_NAME \
  --zone=$ZONE

Console

  1. In the Google Cloud console, go to the TPUs page:

    Go to TPUs

  2. Select the checkbox next to your Cloud TPU.

  3. Click Stop.

Java

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.StopNodeRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class StopTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPU is located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    stopTpuVm(projectId, zone, nodeName);
  }

  // Stops a TPU VM with the specified name in the given project and zone.
  public static Node stopTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      StopNodeRequest request = StopNodeRequest.newBuilder().setName(name).build();

      return tpuClient.stopNodeAsync(request).get();
    }
  }
}

Node.js

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to stop a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to stop.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callStopTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.stopNode(request);
  // Wait for the operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} stopped.`);
  return response;
}

return await callStopTpuVM();

Python

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()

request = tpu_v2.StopNodeRequest(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}",
)
try:
    operation = client.stop_node(request=request)
    print("Waiting for stop operation to complete...")
    response = operation.result()
    print(f"This TPU {tpu_name} has been stopped")
    print(response.state)
    # Example response:
    # State.STOPPED

except Exception as e:
    print(e)
    raise e

Start Cloud TPU resources

You can start a stopped Cloud TPU.

The queued resources API does not support starting TPU Pods or TPUs.

gcloud

$ gcloud compute tpus tpu-vm start $TPU_NAME \
  --zone=$ZONE

Console

  1. In the Google Cloud console, go to the TPUs page:

    Go to TPUs

  2. Select the checkbox next to your Cloud TPU.

  3. Click Start.

Java

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.tpu.v2.Node;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.StartNodeRequest;
import com.google.cloud.tpu.v2.TpuClient;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class StartTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to use.
    String projectId = "YOUR_PROJECT_ID";
    // The zone where the TPU is located.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "us-central1-f";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    startTpuVm(projectId, zone, nodeName);
  }

  // Starts a TPU VM with the specified name in the given project and zone.
  public static Node startTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create()) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      StartNodeRequest request = StartNodeRequest.newBuilder().setName(name).build();

      return tpuClient.startNodeAsync(request).get();
    }
  }
}

Node.js

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to start a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to start.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callStartTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.startNode(request);

  // Wait for the operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} started.`);
  return response;
}

return await callStartTpuVM();

Python

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-a"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()

request = tpu_v2.StartNodeRequest(
    name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}",
)
try:
    operation = client.start_node(request=request)
    print("Waiting for start operation to complete...")
    response = operation.result()
    print(f"TPU {tpu_name} has been started")
    print(response.state)
    # Example response:
    # State.READY

except Exception as e:
    print(e)
    raise e

Delete a Cloud TPU

Delete your TPU VM slices after your session.

gcloud

$ gcloud compute tpus tpu-vm delete $TPU_NAME \
  --zone=$ZONE \
  --quiet

Command flag descriptions

  • zone: The zone where you plan to delete your Cloud TPU.
  • quiet: Disables all interactive prompts when running gcloud CLI commands.

Console

  1. In the Google Cloud console, go to the TPUs page:

    Go to TPUs

  2. Select the checkbox next to your Cloud TPU.

  3. Click Delete.

Java

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.tpu.v2.DeleteNodeRequest;
import com.google.cloud.tpu.v2.NodeName;
import com.google.cloud.tpu.v2.TpuClient;
import com.google.cloud.tpu.v2.TpuSettings;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import org.threeten.bp.Duration;

public class DeleteTpuVm {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    // Project ID or project number of the Google Cloud project you want to create a node.
    String projectId = "YOUR_PROJECT_ID";
    // The zone in which to create the TPU.
    // For more information about supported TPU types for specific zones,
    // see https://cloud.google.com/tpu/docs/regions-zones
    String zone = "europe-west4-a";
    // The name for your TPU.
    String nodeName = "YOUR_TPU_NAME";

    deleteTpuVm(projectId, zone, nodeName);
  }

  // Deletes a TPU VM with the specified name in the given project and zone.
  public static void deleteTpuVm(String projectId, String zone, String nodeName)
      throws IOException, ExecutionException, InterruptedException {
    // With these settings the client library handles the Operation's polling mechanism
    // and prevent CancellationException error
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .deleteNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String name = NodeName.of(projectId, zone, nodeName).toString();

      DeleteNodeRequest request = DeleteNodeRequest.newBuilder().setName(name).build();

      tpuClient.deleteNodeAsync(request).get();
      System.out.println("TPU VM deleted");
    }
  }
}

Node.js

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Import the TPUClient
// TODO(developer): Uncomment below line before running the sample.
// const {TpuClient} = require('@google-cloud/tpu').v2;

// Instantiate a tpuClient
// TODO(developer): Uncomment below line before running the sample.
// tpuClient = new TpuClient();

// TODO(developer): Update these variables before running the sample.
// Project ID or project number of the Google Cloud project you want to delete a node.
const projectId = await tpuClient.getProjectId();

// The name of TPU to delete.
const nodeName = 'node-name-1';

// The zone, where the TPU is created.
const zone = 'europe-west4-a';

async function callDeleteTpuVM() {
  const request = {
    name: `projects/${projectId}/locations/${zone}/nodes/${nodeName}`,
  };

  const [operation] = await tpuClient.deleteNode(request);

  // Wait for the delete operation to complete.
  const [response] = await operation.promise();

  console.log(`Node: ${nodeName} deleted.`);
  return response;
}

return await callDeleteTpuVM();

Python

To authenticate to Cloud TPU, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import tpu_v2

# TODO(developer): Update and un-comment below lines
# project_id = "your-project-id"
# zone = "us-central1-b"
# tpu_name = "tpu-name"

client = tpu_v2.TpuClient()
try:
    client.delete_node(
        name=f"projects/{project_id}/locations/{zone}/nodes/{tpu_name}"
    )
    print("The TPU node was deleted.")
except Exception as e:
    print(e)

What's next