This document explains how to use dynamic slicing in Google Kubernetes Engine (GKE). Dynamic slicing lets you configure provisioned TPU sub-blocks into different topologies. This capability reduces the need to re-create node pools, enhances fault tolerance by allowing automatic recovery when a failure occurs, and optimizes resource utilization.
This document is intended for AI/ML engineers and platform administrators who want to optimize TPU utilization, reduce provisioning time, and improve fault tolerance for large-scale training and inference workloads.
Before reading this document, ensure that you are familiar with the following:
- TPUs in GKE.
- TPU Cluster Director. Dynamic slicing is a TPU feature enabled by TPU Cluster Director.
- All Capacity mode reservations. Dynamic slicing features are available exclusively on TPUs that use All Capacity mode.
What is dynamic slicing?
Dynamic slicing delivers flexibility in managing Cloud TPU capacity by letting you decouple TPU provisioning from slice formation. Dynamic slicing involves the following process:
- You provision resources as smaller units called sub-blocks. A sub-block is the fundamental logical building unit of Ironwood (TPU7x) capacity. For Ironwood (TPU7x), it represents a 16-node group of TPU VMs with a `4x4x4` topology of interconnected TPU chips. In the context of TPU All Capacity mode and dynamic slicing, a node pool maps directly to a sub-block.
- Dynamic slicing then stitches these sub-blocks together into larger slices.
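You can cross-check the sub-block shape with quick arithmetic: a `4x4x4` topology contains 64 chips, and with 4 chips per TPU VM (as on the `tpu7x-standard-4t` machine type used later in this document), that yields the 16-node group:

```shell
# Cross-check the sub-block shape described above:
# a 4x4x4 topology of TPU chips, with 4 chips per TPU VM,
# yields a 16-node group of TPU VMs.
chips=$((4 * 4 * 4))          # chips in one sub-block
chips_per_vm=4                # for example, tpu7x-standard-4t
vms=$((chips / chips_per_vm)) # TPU VMs per sub-block
echo "chips=${chips} vms=${vms}"
```

The same arithmetic explains the Pod counts in the workload examples later in this document, where each Pod requests 4 TPU chips.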
Benefits of dynamic slicing
Dynamic slicing helps you to achieve the following:
- Reduce time to provision: individually provisioning sub-blocks leads to faster overall provisioning because it minimizes the impact of any single failure.
- Reduce time to recover: if a TPU chip failure occurs, the smallest unit of failure is a sub-block. Dynamic slicing isolates faulty sub-blocks so that workloads can be rescheduled on healthy sub-blocks faster than re-provisioning an entire large slice.
- Reshape capacity: if you have diverse workload requirements, you don't need to delete and re-create node pools for topology changes, which would be necessary without dynamic slicing. Instead, you can dynamically reconfigure the provisioned node pools to match specified shapes.
Key elements of dynamic slicing
Dynamic slicing introduces the following key concepts:
- Incremental provisioning of node pools: dynamic slicing uses incremental provisioning, a fault-tolerant provisioning model for node pools. This model converts all your TPU capacity into node pools, each a 16-node group of TPU VMs.
- Slice controller: a Kubernetes Custom Resource controller running within the GKE control plane that manages dynamic slicing. The slice controller manages the lifecycle of a Slice custom resource, which represents a dynamic slice. The slice controller handles creating, continuously monitoring, and deleting the Slice. When you use a scheduler, the scheduler directs creating and deleting the Slice custom resource.
- Slice custom resource: defines the dynamic slice that stitches sub-blocks together based on the requested TPU topology. This process relies on the dynamic reconfiguration of the OCS network to connect the TPU node pools, which helps to ensure optimized performance. You can inspect the progress or health of dynamic slice formation in the Slice custom resource's status fields.
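For orientation, a Slice custom resource has a small spec. The following sketch is illustrative only: the field names are inferred from the `kubectl describe slice` output shown later in this document, and the values are hypothetical:

```yaml
apiVersion: accelerator.gke.io/v1beta1
kind: Slice
metadata:
  name: test-slice  # hypothetical name
spec:
  # IDs of the sub-blocks (partitions) to stitch together; hypothetical value.
  partitionIds:
  - 5eae6a4f59d59cf30a9bf49de618eb2b
  topology: 4x4x4   # requested slice topology
  type: tpu7x       # accelerator type
```

When you use Kueue, you don't write this resource yourself; Kueue creates and deletes it for you.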
Requirements
To use dynamic slicing in GKE, you must meet the following requirements:
- Use a Standard cluster in version 1.35.0-gke.274500 or later, in the Rapid channel.
- Use the Ironwood (TPU7x) TPU version.
- Use the Container-Optimized OS image for your nodes.
- To use incremental provisioning, use All Capacity mode reservations. All Capacity mode is a feature enabled by TPU Cluster Director.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Ensure that you have an existing Standard cluster in version 1.35.0-gke.274500 or later, in the Rapid channel. To create a new cluster, see Creating a regional cluster.
- Ensure you have sufficient quota for Ironwood (TPU7x) in your region.
- If you plan to run multislice workloads, install JobSet v0.10.1 or later.
- Request TPU capacity in All Capacity mode.
Limitations
- A single slice must use sub-blocks within the same TPU block under a reservation. To use sub-blocks across TPU blocks, use TPU Multislices.
- Dynamic slicing does not support topologies smaller than `4x4x4`.
Use dynamic slicing in GKE with Kueue
This section describes the workflow for using dynamic slicing in GKE.
- View the topology and health status of All Capacity mode reservations.
- Enable the slice controller in your cluster.
- Create TPU node pools.
- Configure Kueue to create a Slice custom resource.
- Run workloads on dynamic slicing with Kueue.
- Clean up.
Enable the slice controller
To use dynamic slicing, enable the slice controller in your cluster.
Update your cluster:
```shell
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --enable-slice-controller
```

Replace the following:

- `CLUSTER_NAME`: the name of your cluster.
- `LOCATION`: the region with your available TPU capacity.

Get credentials so that you can communicate with your cluster with `kubectl` commands:

```shell
gcloud config set container/cluster CLUSTER_NAME
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=LOCATION
```

In the output of the following command, verify that the `slices.accelerator.gke.io` value is present:

```shell
kubectl get crd slices.accelerator.gke.io
```

The output is similar to the following:

```
slices.accelerator.gke.io 2026-01-09T23:58:02Z
```
Create node pools with incremental provisioning
This section describes how to create the TPU node pools with incremental provisioning. GKE converts all your TPU capacity into node pools, each a 16-node group of TPU VMs, or sub-block. GKE provisions these node pools even when it can't find all 16 healthy VMs by placing nodes on healthy parts of the host machine and incrementally provisioning the remaining nodes while unhealthy machines are repaired.
You can target your node pool to belong to any of the following:
- A specific block of TPUs, which is exposed in All Capacity mode reservations. Block targeting allows GKE to create the node pool in any available sub-block within the specified block.
- A specific sub-block, or a specific 16-node group of TPU VMs, of TPUs for more granular control.
Create a workload policy
To create a TPU slice node pool with Ironwood (TPU7x), you must first create a
workload policy with the accelerator-topology-mode field set to provision_only. This setting
triggers the incremental provisioning process.
Create a workload policy:
```shell
gcloud compute resource-policies create workload-policy WORKLOAD_POLICY_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --type=HIGH_THROUGHPUT \
    --accelerator-topology=4x4x4 \
    --accelerator-topology-mode=provision_only
```
Replace the following:
- `WORKLOAD_POLICY_NAME`: a name for your workload policy.
- `PROJECT_ID`: your Google Cloud project ID.
- `REGION`: the region for the workload policy.
In this command, do the following:

- Always set the `accelerator-topology` field to `4x4x4` to match the total number of chips within a single sub-block.
- Always set the `accelerator-topology-mode` field to `provision_only` to ensure that the incremental provisioning process is triggered. When `provision_only` is set, the node pool provisions TPU nodes without forming ICI or OCS links.
Target your node pool to belong to a block or a sub-block
You can target specific sub-blocks or blocks within your All Capacity mode reservation.
- Target a block: each node pool uses capacity from a specified block. GKE places the node pool within an available sub-block in that block. You must create as many node pools as there are sub-blocks in the block you want to use.
- Target a sub-block: each node pool maps to a specific and available sub-block. When using sub-block targeting, GKE creates the node pool if at least one VM is healthy. Incremental provisioning ensures that all nodes are placed within the specified sub-block.
Block
To retrieve the name of the block in a reservation and the count of available sub-blocks in the block, complete the following steps in the View the topology and health status of All Capacity Mode reservations document:
- Identify the name of the block by listing all reservation blocks and copying the value in the `name:` field. This value is the name of the block, or `BLOCK_NAME` in this document.
- Determine how many node pools to create by describing a reservation block and identifying the value in the `reservationSubBlockCount` field. This value is the number of sub-blocks available. For example, the `reservationSubBlockCount: 4` value indicates that the block has four sub-blocks available, and you need to create four separate node pools.
Set the reservation path:

```shell
export RESERVATION_PATH="projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME"
```

Replace the following:

- `RESERVATION_NAME`: the name of your TPU reservation.
- `BLOCK_NAME`: the name of the block.
Create a node pool for each sub-block identified in the preceding step. For example, if the count is `4`, run this command four times. Use a unique name for each node pool.

```shell
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --node-locations=ZONE \
    --machine-type=tpu7x-standard-4t \
    --num-nodes=16 \
    --placement-policy=WORKLOAD_POLICY_NAME \
    --reservation-affinity=specific \
    --reservation=${RESERVATION_PATH}
```

Replace the following:

- `NODE_POOL_NAME`: the name of your new node pool.
- `CLUSTER_NAME`: the name of your GKE cluster.
- `WORKLOAD_POLICY_NAME`: the name of the workload policy you created.
- `ZONE`: the zone for the node pool, for example, `us-central1-a`.
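Because block targeting requires one node pool per sub-block, you can script the repetition. The following sketch only prints the commands it would run (the pool-name prefix and the count of four are example values); remove the `echo` statements and substitute real names to execute them:

```shell
# Dry run: print one node-pool creation command per sub-block.
# SUB_BLOCK_COUNT comes from the reservationSubBlockCount field;
# "block-pool-" is a hypothetical naming prefix.
SUB_BLOCK_COUNT=4
POOL_NAMES=""
for i in $(seq 1 "${SUB_BLOCK_COUNT}"); do
  POOL_NAMES="${POOL_NAMES} block-pool-${i}"
  echo "gcloud container node-pools create block-pool-${i} \\"
  echo "    --cluster=CLUSTER_NAME --node-locations=ZONE \\"
  echo "    --machine-type=tpu7x-standard-4t --num-nodes=16 \\"
  echo "    --placement-policy=WORKLOAD_POLICY_NAME \\"
  echo "    --reservation-affinity=specific --reservation=\${RESERVATION_PATH}"
done
```

Each generated command targets the same block-level `RESERVATION_PATH`; GKE chooses an available sub-block for each pool.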
Sub-block
To retrieve the name of the block and the IDs of the available sub-blocks, complete the following steps in the View the topology and health status of All Capacity Mode reservations document:
- To identify the name of the block, list all reservation blocks and copy the value in the `name:` field. This value is the name of the block, or `BLOCK_NAME` in this document.
- To identify the name of the sub-blocks, list all sub-blocks of a block and copy the value in the `name:` field for each entry under `reservationSubBlocks`. This value is the name of the sub-block, or `SUBBLOCK_NAME` in this document.
Set the reservation path:

```shell
export RESERVATION_PATH="projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME/reservationSubBlocks/SUBBLOCK_NAME"
```

Replace the following:

- `RESERVATION_NAME`: the name of your TPU reservation.
- `BLOCK_NAME`: the name of the block.
- `SUBBLOCK_NAME`: the name of the sub-block.
Create the node pool:

```shell
gcloud container node-pools create NODE_POOL_NAME \
    --project=PROJECT_ID \
    --cluster=CLUSTER_NAME \
    --node-locations=ZONE \
    --machine-type=tpu7x-standard-4t \
    --num-nodes=16 \
    --placement-policy=WORKLOAD_POLICY_NAME \
    --reservation-affinity=specific \
    --reservation=${RESERVATION_PATH}
```

Replace the following:

- `NODE_POOL_NAME`: a unique name for your new node pool, for example, `sub-block-pool-1`.
- `PROJECT_ID`: your Google Cloud project ID.
- `CLUSTER_NAME`: the name of your GKE cluster.
- `ZONE`: the zone for the node pool, for example, `us-central2-b`.
- `WORKLOAD_POLICY_NAME`: the name of the workload policy you created.
At this stage, the nodes are created, but their Inter-Chip Interconnect (ICI) links are not yet active. Therefore, you can't run workloads on these node pools directly.
To enable all the necessary ICI links to form the slice and allow workloads to be scheduled, create a dynamic slice by using one of the following methods:
- Create a Slice custom resource. Instead of Pods, you use a Slice custom resource to define the specified topology, which the slice controller activates.
- Schedule GKE workloads with Kueue and TAS. Kueue automatically handles the creation and deletion of Slice custom resources. Avoid manually modifying Slice custom resources created by Kueue.
Create a dynamic slice with Kueue and TAS
In this section, you schedule GKE workloads with Kueue and TAS.
Install JobSet and Kueue resources for dynamic slicing
Install JobSet:
```shell
helm install jobset oci://registry.k8s.io/jobset/charts/jobset \
    --version 0.10.1 \
    --namespace jobset-system \
    --create-namespace \
    --set controller.resources.requests.cpu=4 \
    --set controller.resources.requests.memory=16Gi
```

Install Kueue:

```shell
helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
    --version 0.16.1 \
    --namespace kueue-system \
    --create-namespace \
    --wait \
    --set controllerManager.replicas=3 \
    --set controllerManager.manager.resources.requests.cpu=16 \
    --set controllerManager.manager.resources.requests.memory=64Gi
```

Install the Kueue slice controller:

```shell
kubectl apply -f https://gist.githubusercontent.com/mwysokin/cd90010d0d375b3bf57c536905692547/raw/506c36dd070f4ac222ba8a5e58ba28bbfcfa8ed3/kueue-slice-controller-v0.8.0-130.yaml
```

To configure Kueue for dynamic slicing, save the following manifest as `dynamic-slice-topology.yaml`:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: superslice-topology
spec:
  levels:
  # Label to identify the physical block a sub-block belongs to.
  # Only sub-blocks from the same block can form a slice.
  - nodeLabel: cloud.google.com/gce-topology-block
  # Label to identify individual TPU sub-blocks (4x4x4 topology).
  - nodeLabel: cloud.google.com/gke-tpu-partition-4x4x4-id
  # Standard Kubernetes label for individual nodes.
  # Required to assign Pods to specific VMs.
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: superslice-rf
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu7x
  topologyName: superslice-topology
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: superslice-ac
spec:
  controllerName: accelerator.gke.io/slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq
spec:
  namespaceSelector: {}
  admissionChecks:
  - superslice-ac
  resourceGroups:
  - coveredResources:
    - google.com/tpu
    flavors:
    - name: superslice-rf
      resources:
      - name: google.com/tpu
        nominalQuota: "999999" # modeling unlimited quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq
  namespace: default
spec:
  clusterQueue: cq
```

Apply the `dynamic-slice-topology.yaml` manifest:

```shell
kubectl apply -f dynamic-slice-topology.yaml
```

In this manifest, you configure Kueue for dynamic slicing by defining the following resources:
- Ironwood (TPU7x) dynamic slice topology (`superslice-topology`): the topology defines the levels that Kueue considers when it schedules dynamic slicing workloads. These levels are the following:
  - `cloud.google.com/gce-topology-block` label: this level is required to understand which sub-blocks belong to which blocks, because only sub-blocks from the same block can form a slice.
  - `cloud.google.com/gke-tpu-partition-4x4x4-id` label: this level represents individual Ironwood (TPU7x) sub-blocks (`4x4x4` topology).
  - `kubernetes.io/hostname` label: this level is required to assign Pods to specific VMs and to observe their labels and taints.
- Ironwood (TPU7x) SuperSlice ResourceFlavor (`superslice-rf`): the resource flavor for Ironwood (TPU7x) sub-blocks includes the `cloud.google.com/gke-tpu-accelerator: tpu7x` label to match nodes with Ironwood (TPU7x) machines.
- SuperSlice AdmissionCheck (`superslice-ac`): this admission check tells Kueue not to schedule a workload until the GKE slice controller confirms that the slice has become active. The admission check is first defined and then added to the ClusterQueue that handles dynamic slicing workloads.
- ClusterQueue (`cq`) and LocalQueue (`lq`): these resources manage `google.com/tpu` resources. The `cq` ClusterQueue includes the `superslice-ac` admission check. The `nominalQuota` for `google.com/tpu` can be configured in two ways:
  - Specific quota: set `nominalQuota` to match existing capacity for fair-sharing and quota management.
  - Unlimited quota: set `nominalQuota` to a very high value, such as `"999999"`, to model unlimited quota. To focus on TAS and dynamic slicing, this configuration bypasses Kueue's quota management functionality.
Define the sub-block health selection
Beyond standard node health and readiness, GKE exposes the specific state of
each sub-block by using the cloud.google.com/gke-tpu-partition-4x4x4-state label.
This label lets GKE account for factors that influence slice formation, such as
the state of TPU links.
You can define the value of the `cloud.google.com/gke-tpu-partition-4x4x4-state` label as follows:

- `HEALTHY`: the sub-block's infrastructure is healthy.
- `DEGRADED`: the sub-block's infrastructure is in a degraded state, for example, because of OCS link degradation. The sub-block can still form a slice, but overall performance might be lower compared to healthy sub-blocks. If you can tolerate potentially degraded performance, you can configure your workload to use `DEGRADED` sub-blocks by using node affinity, as shown in Example 3.
- `UNHEALTHY`: the sub-block is unhealthy and can't form a slice.

The Kueue Slice Controller webhook validates whether a workload includes a specific sub-block health requirement. If no preference is indicated, the webhook injects a default node affinity.

The behavior is as follows:

- If a `nodeSelector` or `nodeAffinity` that targets the `cloud.google.com/gke-tpu-partition-4x4x4-state` label is present, it remains unchanged.
- If no such label configuration exists, the webhook injects the following default node affinity to ensure that only available sub-blocks are used:

```yaml
nodeSelector:
  cloud.google.com/gke-tpu-partition-4x4x4-state: "HEALTHY"
```
The following section includes examples where the
cloud.google.com/gke-tpu-partition-4x4x4-state label is configured to specify
the different sub-block health configurations.
Run test workloads on dynamic slicing with Kueue
This section describes how to deploy workloads on dynamic slicing with Kueue and TAS. It includes three examples that show how to create a dynamic slice workload and a workload consisting of multiple slices. The workloads are submitted as JobSets.
Example 1: Single workload uses a single dynamic slice
The following example describes how to create a workload using a slice with a
4x12x16 topology, which is composed of 12 sub-blocks. The number of Pods was
calculated as: (4 * 12 * 16) / 4 chips per node = 192 Pods.
Save the following manifest as `big-super-slice.yaml`:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: big-super-slice
  labels:
    kueue.x-k8s.io/queue-name: lq
spec:
  replicatedJobs:
  - name: job-jax
    replicas: 1
    template:
      spec:
        parallelism: 192 # pods per slice calculation: 4*12*16 / 4 = 192
        completions: 192
        backoffLimit: 10
        template:
          metadata:
            annotations:
              cloud.google.com/gke-tpu-slice-topology: 4x12x16
          spec:
            tolerations:
            - key: "google.com/tpu"
              operator: "Equal"
              value: "present"
              effect: "NoSchedule"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu7x
              cloud.google.com/gke-tpu-partition-4x4x4-state: "HEALTHY"
            containers:
            - name: jax
              image: python:latest
              command:
              - bash
              - -c
              - |
                printenv
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
              resources:
                limits:
                  google.com/tpu: 4
            restartPolicy: Never
```

In this manifest, the following annotations and labels tell Kueue the slice characteristics and topology:

- `cloud.google.com/gke-tpu-slice-topology`: specifies `4x12x16` as the dynamic slice topology. Requirements for the `tpu7x` accelerator topology include the following rules:
  - The minimum topology is `4x4x4`.
  - The topology must be a three-dimensional string in the format `AxBxC`. For example, `4x8x8`.
  - Each dimension (A, B, and C) must be a multiple of four.
  - The dimensions must be sorted in non-decreasing order: A <= B <= C. For example, `4x8x4` is invalid; it should be `4x4x8`.
  - The product of the dimensions (A*B*C) must not exceed 9,216.
  - The largest supported slice topologies can include up to 32 sub-blocks. For example, `8x16x16` with 32 sub-blocks, `8x12x20` with 30 sub-blocks, or `12x12x12` with 27 sub-blocks are within the accepted limits.
- `cloud.google.com/gke-tpu-accelerator: tpu7x`: schedules Pods on VMs that run Ironwood (TPU7x).
- `kueue.x-k8s.io/queue-name`: assigns the JobSet to a Kueue LocalQueue.
Apply the `big-super-slice.yaml` manifest:

```shell
kubectl apply -f big-super-slice.yaml
```

After you apply the manifest, Kueue creates a JobSet named `big-super-slice`. Kueue then attempts to form a single dynamic slice with a `4x12x16` topology. After the slice is active, Kueue admits the workload, and the 192 Pods are scheduled on the nodes to form the dynamic slice that runs your workloads.
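Before you submit a JobSet, you can sanity-check a requested topology string against the rules listed earlier. The following bash sketch is illustrative only: it checks the string-level rules (minimum `4x4x4`, `AxBxC` format, multiples of four, non-decreasing order, product of at most 9,216) but not the 32-sub-block limit or your actual reserved capacity:

```shell
# Sketch: validate a tpu7x slice topology string against the documented rules.
validate_topology() {
  local topo="$1" a b c d
  IFS=x read -r a b c <<< "$topo"
  # Must be three positive integers separated by "x".
  [[ "$a" =~ ^[0-9]+$ && "$b" =~ ^[0-9]+$ && "$c" =~ ^[0-9]+$ ]] || { echo invalid; return; }
  # Each dimension must be at least 4 and a multiple of four.
  for d in "$a" "$b" "$c"; do
    (( d >= 4 && d % 4 == 0 )) || { echo invalid; return; }
  done
  # Dimensions must be in non-decreasing order: A <= B <= C.
  (( a <= b && b <= c )) || { echo invalid; return; }
  # The product of the dimensions must not exceed 9,216.
  (( a * b * c <= 9216 )) || { echo invalid; return; }
  echo valid
}

validate_topology 4x12x16   # Example 1's topology
validate_topology 4x8x4     # dimensions not in non-decreasing order
validate_topology 20x24x24  # product exceeds 9,216
```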
Example 2: Workload with more than one replica
The following example demonstrates how to create a workload that uses two dynamic slices, each composed of four sub-blocks.
Save the following manifest as `two-super-slices.yaml`:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: two-super-slices
  labels:
    kueue.x-k8s.io/queue-name: lq
spec:
  replicatedJobs:
  - name: job-jax
    replicas: 2
    template:
      spec:
        parallelism: 64 # Pods per slice calculation: (4*8*8) / 4 = 64
        completions: 64
        backoffLimit: 10
        template:
          metadata:
            annotations:
              cloud.google.com/gke-tpu-slice-topology: 4x8x8
          spec:
            tolerations:
            - key: "google.com/tpu"
              operator: "Equal"
              value: "present"
              effect: "NoSchedule"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu7x
              cloud.google.com/gke-tpu-partition-4x4x4-state: "HEALTHY"
            containers:
            - name: jax
              image: python:latest
              command:
              - bash
              - -c
              - |
                printenv
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
              resources:
                limits:
                  google.com/tpu: 4
            restartPolicy: Never
```

Apply the `two-super-slices.yaml` manifest:

```shell
kubectl apply -f two-super-slices.yaml
```
In this manifest, you set replicas: 2 in the replicatedJobs field.
After you apply the manifest, Kueue
attempts to form two separate slices with a 4x8x8 topology. Kueue creates a
dynamic slice for each replica defined in jobset.spec.replicatedJobs[].replicas.
If n replicas are specified, Kueue creates n dynamic slices for the workload
and waits for all slices to become active before admitting the workload.
Example 3: Workload with single dynamic slice and NodeAffinity
Starting from Kueue 0.15, Kueue supports NodeAffinity for TAS node selection.
You can use this functionality to allow both `HEALTHY` and `DEGRADED` nodes
to be part of a dynamic slice. The following example shows how to configure a
workload with a single dynamic slice and NodeAffinity:
Save the following manifest as `slice-8x8x8-na.yaml`:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: slice-8x8x8-na
  labels:
    kueue.x-k8s.io/queue-name: lq
spec:
  replicatedJobs:
  - name: rj1
    replicas: 1
    template:
      spec:
        parallelism: 128
        completions: 128
        backoffLimit: 10
        template:
          metadata:
            annotations:
              cloud.google.com/gke-tpu-slice-topology: 8x8x8
          spec:
            tolerations:
            - key: "google.com/tpu"
              operator: "Equal"
              value: "present"
              effect: "NoSchedule"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu7x
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: cloud.google.com/gke-tpu-partition-4x4x4-state
                      operator: In
                      values:
                      - "HEALTHY"
                      - "DEGRADED"
            containers:
            - name: jax
              image: python:latest
              command:
              - bash
              - -c
              - |
                printenv
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
              resources:
                limits:
                  google.com/tpu: 4
            restartPolicy: Never
```

Apply the `slice-8x8x8-na.yaml` manifest:

```shell
kubectl apply -f slice-8x8x8-na.yaml
```

After you apply the manifest, Kueue creates a JobSet named `slice-8x8x8-na`. Kueue then attempts to form a single dynamic slice with an `8x8x8` topology, which allows both `HEALTHY` and `DEGRADED` nodes to be included because of the specified NodeAffinity. After the slice is active, Kueue admits the workload, and the 128 Pods are scheduled on the nodes forming the dynamic slice.
Monitor the status of the slice
To check the status of your dynamic slices, run the following command:
```shell
kubectl describe slice SLICE_NAME
```
Replace SLICE_NAME with the name of your slice. The
slice name is typically derived from the JobSet name and replica index. For
Example 1, a slice created by Kueue would have a name similar to
default-jobset-big-super-slice-yyyyy-job-jax-0.
The output is similar to the following:
```
Name:         test-slice
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  accelerator.gke.io/v1beta1
Kind:         Slice
Metadata:
  Creation Timestamp:  2026-02-12T23:44:28Z
  Finalizers:
    accelerator.gke.io/slice-finalizer
  Generation:        1
  Resource Version:  1770939905695871008
  UID:               6dbbfe14-4486-4462-864d-e078d0ca8b5b
Spec:
  Partition Ids:
    5eae6a4f59d59cf30a9bf49de618eb2b
  Topology:  4x4x4
  Type:      tpu7x
Status:
  Conditions:
    Last Transition Time:  2026-02-12T23:45:05Z
    Message:
    Reason:                ACTIVE
    Status:                True
    Type:                  Ready
    Last Transition Time:  2026-02-12T23:45:05Z
    Message:               NodeLabelingCompleted
    Reason:                NodeLabelIsAdded
    Status:                True
    Type:                  NodeLabeled
Events:  <none>
```
The slice name adheres to the following rules to ensure compatibility with underlying Compute Engine resource naming conventions:

- Template: `{namespace}-jobset-{jobset.metadata.name}-kueueHash[5-character]-{jobset.spec.replicatedJobs[].name}-sliceIndex`.
- Length: the name has 54 characters or fewer. The controller appends a hyphen and an 8-character cluster hash to create Compute Engine resource names, which have a 63-character limit.
- Format: the name matches the regular expression `^[a-z]([-a-z0-9]*[a-z0-9])?$`. The name has the following characteristics:
  - Starts with a lowercase letter.
  - Contains only lowercase letters, numbers, and hyphens (-).
  - Ends with a lowercase letter or a number (it can't end with a hyphen).
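The length and format rules above can be checked mechanically. The following bash sketch validates a candidate name; the first sample is the example slice name from the monitoring section, and the second is a deliberately malformed, hypothetical name:

```shell
# Sketch: check a slice name against the documented length and format rules.
check_slice_name() {
  local name="$1"
  # 54 characters or fewer, leaving room for the "-<8-char hash>" suffix
  # within Compute Engine's 63-character limit.
  (( ${#name} <= 54 )) || { echo invalid; return; }
  # Starts with a lowercase letter; contains only lowercase letters,
  # numbers, and hyphens; does not end with a hyphen.
  [[ "$name" =~ ^[a-z]([-a-z0-9]*[a-z0-9])?$ ]] || { echo invalid; return; }
  echo valid
}

check_slice_name "default-jobset-big-super-slice-yyyyy-job-jax-0"
check_slice_name "Default-Slice-0"  # fails: uppercase letters
```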
Clean up
To avoid unexpected charges, delete your slices before deleting node pools.
Delete the JobSet. This action triggers Kueue to delete the associated Slice custom resources.

```shell
kubectl delete jobset JOBSET_NAME
```

Replace `JOBSET_NAME` with the name of your JobSet, for example, `big-super-slice`.

Delete the TPU node pool:

```shell
gcloud container node-pools delete NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION
```
(Optional) Use dynamic slicing with your own scheduler
This document focuses on using Kueue and TAS. However, you can also manage dynamic slicing with your own custom scheduler. If you choose to use a different scheduler, follow the Slice custom resource reference information.
What's next
- Learn more about TPU Cluster Director.
- Learn how to Manage maintenance events with TPUs in All Capacity mode.