This document is for Standard clusters only.
This document shows you how to use GKE Pod snapshots to save the state of a running Agent Sandbox environment.
Agent Sandbox provides a secure and isolated environment for executing untrusted code, such as code generated by large language models (LLMs). Running this type of code directly in a cluster poses security risks, because untrusted code could potentially access or interfere with other apps or the underlying cluster node itself.
GKE Pod snapshots let you save and restore the state of sandboxed environments. This functionality is useful for the following reasons:
- Fast startup: reduce sandbox startup time by restoring from a pre-warmed snapshot.
- Long-running agents: pause a sandbox that takes a long time to run and resume it later, or move it to a different node, without losing progress.
- Stateful workloads: persist the context of an agent, such as conversation history or intermediate calculations, by saving and restoring the state of its sandboxed environment.
- Reproducibility: capture a specific state and use it as a base to start multiple new sandboxes with the same initialized state.
A snapshot can be triggered in two ways:
- Manual trigger: you create a PodSnapshotManualTrigger resource to trigger the snapshot.
- Workload trigger: the sandboxed application itself signals when it's ready to be saved.
This document shows you how to manually trigger a snapshot.
Note: GKE Pod snapshots is a Preview feature.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: to create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Google Kubernetes Engine, Cloud Storage, and Identity and Access Management (IAM) APIs.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- In the Google Cloud console, activate Cloud Shell.
Requirements
To use GKE Pod snapshots, you need a Standard mode cluster running GKE version 1.34.1-gke.3084001 or later. We recommend that you use GKE version 1.35 or later. GKE version 1.34 has a known issue where checkpointing might fail with an input/output error related to Cloud Storage permissions.
Pod snapshots don't support E2 machine types. Consequently, this tutorial creates a node pool consisting of N2 machines.
If you want to use GPU-based machine types for the nodes in your node pool, see Limitations and requirements.
Define environment variables
To simplify the commands that you run in this document, you can set environment variables in Cloud Shell. These variables store values such as the ID of your Google Cloud project, the name of the Cloud Storage bucket that will store your snapshots, and the location of your GKE cluster.
After you define these variables, you can reuse them in multiple commands by
referencing the variable name (for example, $CLUSTER_NAME) instead of retyping
or replacing values each time. This approach makes the process easier to follow
and reduces the risk of errors.
To define these environment variables in Cloud Shell, run the following commands:
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME="agent-sandbox-cluster"
export GKE_LOCATION="us-central1"
export GKE_VERSION="1.35.0-gke.1795000"
export AGENT_SANDBOX_VERSION="v0.1.0"
export NODE_POOL_NAME="agent-sandbox-node-pool"
export MACHINE_TYPE="n2-standard-2"
export SNAPSHOTS_BUCKET_NAME="agent-sandbox-snapshots-${PROJECT_ID}"
export SNAPSHOT_NAMESPACE="pod-snapshots-ns"
export SNAPSHOT_KSA_NAME="pod-snapshot-sa"
export SNAPSHOT_FOLDER="my-snapshots"
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
Here's an explanation of these environment variables:
- PROJECT_ID: the ID of your current Google Cloud project. Defining this variable helps ensure that all resources, like your GKE cluster, are created in the correct project.
- CLUSTER_NAME: the name of your GKE cluster (for example, agent-sandbox-cluster).
- GKE_LOCATION: the Google Cloud region where your GKE cluster will be created (for example, us-central1).
- GKE_VERSION: the GKE cluster and node version required for compatibility with Pod snapshots. GKE Pod snapshots requires GKE version 1.34.1-gke.3084001 or later.
- AGENT_SANDBOX_VERSION: the version of the Agent Sandbox controller to deploy to your cluster.
- NODE_POOL_NAME: the name of the node pool that will run sandboxed workloads (for example, agent-sandbox-node-pool).
- MACHINE_TYPE: the machine type of the nodes in your node pool (for example, n2-standard-2). Pod snapshots don't support E2 machine types. If you want to use GPU-based machine types for the nodes in your node pool, see Limitations and requirements. For details about different machine series and choosing between options, see the Machine families resource and comparison guide.
- SNAPSHOTS_BUCKET_NAME: the name of the Cloud Storage bucket that you create to store snapshots.
- SNAPSHOT_NAMESPACE: the Kubernetes namespace where your snapshot workload and service account reside.
- SNAPSHOT_KSA_NAME: the name of the Kubernetes service account that your workload uses to authenticate.
- SNAPSHOT_FOLDER: the directory inside your Cloud Storage bucket where snapshots are organized.
- PROJECT_NUMBER: the unique numerical identifier for your project, used for IAM permission bindings.
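Because every later command depends on these variables, an optional sanity check can catch an empty or unset variable early. The following sketch is illustrative and not part of the official setup; the `check_vars` helper and the placeholder values are assumptions:

```shell
# Optional sanity check (illustrative): report any listed variable that is
# empty or unset before you continue. Requires bash (Cloud Shell uses bash).
check_vars() {
  local var ok=0
  for var in "$@"; do
    # ${!var} is bash indirect expansion: the value of the variable named $var.
    if [ -z "${!var}" ]; then
      echo "ERROR: $var is not set" >&2
      ok=1
    fi
  done
  return "$ok"
}

# Example usage with placeholder values; in Cloud Shell, call check_vars on
# the variables that you exported above.
PROJECT_ID="example-project" CLUSTER_NAME="agent-sandbox-cluster"
check_vars PROJECT_ID CLUSTER_NAME && echo "All required variables are set"
```

Running the check before the cluster-creation commands avoids half-created resources with empty names.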
Overview of configuration steps
To enable Pod snapshots of Agent Sandbox environments, you need to perform some configuration steps. Before you start, it's helpful to review a few key concepts and the overall snapshot process:
Key concepts
- Environment: a sandboxed application runs inside a Kubernetes Pod on a GKE cluster node.
- Identity: the Pod is associated with a Kubernetes service account and runs in a special namespace that you create. Together, the Kubernetes service account and the namespace form a unique identity that is used to grant the Pod secure access to Google Cloud resources.
- Permissions: to allow snapshots to be saved to Cloud Storage, the Pod's identity must be granted specific IAM permissions that allow it to write to a Cloud Storage bucket.
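To make the identity concept concrete: the IAM bindings later in this document reference the Pod's identity through Workload Identity Federation principal identifiers, which combine the project, namespace, and Kubernetes service account. The following sketch shows how those identifiers are built; all values below are placeholders:

```shell
# Illustrative: how the namespace and Kubernetes service account (KSA) combine
# into the Workload Identity Federation principal identifiers used by the IAM
# bindings in this document. All values are placeholders.
PROJECT_ID="example-project"
PROJECT_NUMBER="123456789012"
NAMESPACE="pod-snapshots-ns"
KSA="pod-snapshot-sa"
POOL="${PROJECT_ID}.svc.id.goog"

# Matches every KSA in the namespace (used for the bucketViewer binding):
echo "principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${POOL}/namespace/${NAMESPACE}"

# Matches one specific KSA (used for the read-write bindings):
echo "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${POOL}/subject/ns/${NAMESPACE}/sa/${KSA}"
```

The `principalSet://` form grants a role to every service account in a namespace, while the `principal://` form targets exactly one service account.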
Snapshot process
- Trigger: a snapshot is initiated, either manually (externally) or by the sandboxed workload itself. This document demonstrates a manual trigger, which you initiate by creating a PodSnapshotManualTrigger resource.
- Capture: GKE captures the running state of the Pod, such as the state of the Pod's memory and its file system.
- Upload: using the permissions granted to the Pod's Kubernetes service account, GKE uploads the captured state as snapshot files to the designated Cloud Storage bucket.
To learn more about how GKE uses Kubernetes service accounts and IAM roles to access Google Cloud resources, see Authenticate to Google Cloud APIs from GKE workloads.
To enable Pod snapshots of Agent Sandbox environments, perform the following configuration. First, you prepare a cluster environment by creating a GKE cluster with Workload Identity Federation for GKE and the Pod snapshots features enabled. Next, you configure Cloud Storage and IAM policies to help ensure that your snapshots are stored securely and that your sandbox has the necessary permissions. Finally, you create the snapshot resources that specify the storage location and policies for your sandbox.
Each of these configuration steps is explained in the sections that follow.
Cluster setup
A sandboxed app runs inside a Pod on a GKE cluster node, so you need to set up a cluster environment. This section shows you how to create a GKE cluster and deploy the Agent Sandbox controller.
Create a GKE cluster
Create a GKE Standard cluster. This cluster provides the Kubernetes environment where you take a snapshot of an Agent Sandbox environment.
To create a Standard cluster using the Google Cloud CLI, run the following command:
gcloud beta container clusters create ${CLUSTER_NAME} \
    --location=${GKE_LOCATION} \
    --cluster-version=${GKE_VERSION} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --workload-metadata=GKE_METADATA \
    --num-nodes=1 \
    --enable-pod-snapshots
In addition to creating the cluster, this command enables Workload Identity Federation for GKE and the Pod snapshots feature on the cluster.
Retrieve the cluster's credentials so that the kubectl CLI can connect to the cluster. This command updates your Kubernetes config file, which is stored by default at ~/.kube/config. This config file contains the credentials that kubectl requires to interact with your GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${GKE_LOCATION}
Create a node pool with gVisor enabled:
gcloud container node-pools create ${NODE_POOL_NAME} \
    --cluster=${CLUSTER_NAME} \
    --location=${GKE_LOCATION} \
    --machine-type=${MACHINE_TYPE} \
    --node-version=${GKE_VERSION} \
    --image-type=cos_containerd \
    --num-nodes=1 \
    --sandbox type=gvisor
This command uses the following key flags:
- --image-type=cos_containerd: specifies that the nodes use Container-Optimized OS with containerd.
- --sandbox type=gvisor: enables the gVisor sandbox technology on the nodes, which is required for Agent Sandbox.
Deploy the Agent Sandbox Controller to your cluster
You can deploy the Agent Sandbox controller and its required components by applying the official release manifests to your cluster. These manifests are configuration files that instruct Kubernetes to deploy all the components required to run the Agent Sandbox controller on your cluster.
To deploy Agent Sandbox to your GKE cluster, run the following command:
# Apply the main manifest for the controller and its Custom Resource Definitions (CRDs)
kubectl apply \
-f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}/manifest.yaml \
-f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}/extensions.yaml
Verify the controller
After applying the manifests, check that the Agent Sandbox controller Pod is
running correctly in the agent-sandbox-system namespace. The manifests
automatically created the agent-sandbox-system namespace when you applied them
in the previous step.
kubectl get pods -n agent-sandbox-system
Wait for the Pod to show 'Running' in the STATUS column and '1/1' in the READY column. When the Pod is running correctly, the output looks similar to this:
NAME READY STATUS RESTARTS AGE
agent-sandbox-controller-0 1/1 Running 0 44d
Now that the Agent Sandbox controller is running, it can automatically create
and manage sandboxed environments for any Sandbox resources that you create in
your cluster.
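If you prefer to script this readiness check instead of reading the table by eye, you can parse the kubectl output. The following sketch parses the sample output shown above; in a live cluster, you would capture the line with `kubectl get pods -n agent-sandbox-system --no-headers` instead of using a hard-coded sample:

```shell
# Illustrative readiness check: parse a `kubectl get pods` output line.
# The sample below is the captured output from above; in practice, replace
# it with: sample=$(kubectl get pods -n agent-sandbox-system --no-headers)
sample='agent-sandbox-controller-0   1/1     Running   0          44d'
ready=$(echo "$sample" | awk '{print $2}')    # READY column
status=$(echo "$sample" | awk '{print $3}')   # STATUS column
if [ "$ready" = "1/1" ] && [ "$status" = "Running" ]; then
  echo "controller is ready"
fi
```

A loop around this check (with a short sleep) gives you a simple wait-until-ready script for automation.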
Configure storage and permissions
This section shows you how to configure the storage and permissions required for Pod snapshots. You create a Cloud Storage bucket and a managed folder to store the snapshot data. Then, you grant your sandbox and the snapshot controller the permissions they need to access that storage.
Create the Cloud Storage bucket
Create a bucket to store your snapshots. To help ensure that the snapshot process is fast and cost-effective, we recommend that you create the bucket with the following settings:
- Enable hierarchical namespaces: the hierarchical namespaces feature organizes your bucket into a file system hierarchy rather than a flat namespace. This configuration allows for higher read and write throughput, and consequently speeds up saving and restoring snapshots.
- Disable soft delete: the soft delete feature protects data by retaining deleted files for a set period. However, the snapshot process creates and deletes many temporary files during upload. We recommend that you disable soft delete to avoid unnecessary charges for storing these temporary files.
To create the bucket with these settings, run the following command:
gcloud storage buckets create "gs://${SNAPSHOTS_BUCKET_NAME}" \
--uniform-bucket-level-access \
--enable-hierarchical-namespace \
--soft-delete-duration=0d \
--location="${GKE_LOCATION}"
Create a managed folder
Create a managed folder to organize snapshots within your bucket. Managed folders allow you to apply IAM permissions to a specific folder rather than to the entire bucket. This folder-level access limits your sandbox's access to only its own snapshots, and isolates those snapshots from other data in the bucket.
To create a managed folder, run this command:
gcloud storage managed-folders create "gs://${SNAPSHOTS_BUCKET_NAME}/${SNAPSHOT_FOLDER}/"
Configure the service account and IAM roles
To allow GKE to save snapshots securely, the Kubernetes service account used by the Pods running your sandboxed workload needs permission to write to your bucket. You grant this permission by binding Google Cloud IAM roles to the Kubernetes service account used by the Pods. This section shows you how to create a custom IAM role, create the Kubernetes service account, and configure the necessary permissions.
Create a custom IAM role called podSnapshotGcsReadWriter that contains the permissions required to write snapshot data:
gcloud iam roles create podSnapshotGcsReadWriter \
    --project="${PROJECT_ID}" \
    --permissions="storage.objects.get,storage.objects.create,storage.objects.delete,storage.folders.create"
When the role is created successfully, the output looks like this:
Created role [podSnapshotGcsReadWriter].
etag: BwZJUfjNbew=
includedPermissions:
- storage.folders.create
- storage.objects.create
- storage.objects.delete
- storage.objects.get
name: projects/${PROJECT_ID}/roles/podSnapshotGcsReadWriter
stage: ALPHA
title: podSnapshotGcsReadWriter
Create the namespace in which your sandbox and its service account will reside:
kubectl create namespace "${SNAPSHOT_NAMESPACE}"
Create the Kubernetes service account in the namespace you just created. Together, the Kubernetes service account and the namespace form a unique identity that is used to grant your sandbox secure access to Google Cloud resources:
kubectl create serviceaccount "${SNAPSHOT_KSA_NAME}" \
    --namespace "${SNAPSHOT_NAMESPACE}"
Grant the roles/storage.bucketViewer role to all service accounts in the namespace. This role allows the accounts to view the bucket's metadata but not to read or write the data itself:
gcloud storage buckets add-iam-policy-binding "gs://${SNAPSHOTS_BUCKET_NAME}" \
    --member="principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/namespace/${SNAPSHOT_NAMESPACE}" \
    --role="roles/storage.bucketViewer"
Grant your custom podSnapshotGcsReadWriter role to the Kubernetes service account for your sandbox. This binding allows only this specific account to write data to the managed folder:
gcloud storage buckets add-iam-policy-binding "gs://${SNAPSHOTS_BUCKET_NAME}" \
    --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${SNAPSHOT_NAMESPACE}/sa/${SNAPSHOT_KSA_NAME}" \
    --role="projects/${PROJECT_ID}/roles/podSnapshotGcsReadWriter"
Grant the roles/storage.objectUser role to the Kubernetes service account. This role is required for the Pod snapshot agent to perform operations on managed folders:
gcloud storage buckets add-iam-policy-binding "gs://${SNAPSHOTS_BUCKET_NAME}" \
    --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${SNAPSHOT_NAMESPACE}/sa/${SNAPSHOT_KSA_NAME}" \
    --role="roles/storage.objectUser"
Grant permissions to the snapshot controller
Grant the roles/storage.objectUser role to the GKE system's snapshot controller. This permission allows the controller to manage the snapshot lifecycle, such as deleting snapshot objects when you delete a PodSnapshot resource:
gcloud storage buckets add-iam-policy-binding "gs://${SNAPSHOTS_BUCKET_NAME}" \
--member="serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
--role="roles/storage.objectUser"
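The --member value in this command is the GKE service agent, a Google-managed service account whose email follows a fixed pattern derived from your project number. The following snippet shows that pattern with a placeholder project number; in this document, the real value comes from ${PROJECT_NUMBER}:

```shell
# The GKE service agent's email address is derived from the project number.
# 123456789012 is a placeholder; in this document it comes from ${PROJECT_NUMBER}.
PROJECT_NUMBER="123456789012"
GKE_SERVICE_AGENT="service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com"
echo "${GKE_SERVICE_AGENT}"
```

If the binding fails with an unknown-member error, check that PROJECT_NUMBER holds the numeric project number (not the project ID).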
Configure snapshot resources
This section shows you how to configure the snapshot resources for your Agent Sandbox workload.
Define snapshot storage and rules
To specify where GKE saves your snapshots and which rules govern the snapshot process, you create two custom resources:
- PodSnapshotStorageConfig: this resource specifies the Cloud Storage bucket and folder location where GKE saves your snapshot files.
- PodSnapshotPolicy: this resource defines which Pods are eligible for snapshots based on their Kubernetes labels. It also specifies the trigger rules, such as whether snapshots are manual or initiated by the sandbox workload.
To apply both resources in one step, run the following command in Cloud Shell. This method helps to ensure that your environment variables are correctly injected:
kubectl apply -f - <<EOF
apiVersion: podsnapshot.gke.io/v1alpha1
kind: PodSnapshotStorageConfig
metadata:
  name: cpu-pssc-gcs
spec:
  snapshotStorageConfig:
    gcs:
      bucket: "${SNAPSHOTS_BUCKET_NAME}"
      path: "${SNAPSHOT_FOLDER}"
EOF
sleep 5
kubectl apply -f - <<EOF
apiVersion: podsnapshot.gke.io/v1alpha1
kind: PodSnapshotPolicy
metadata:
  name: cpu-psp
  namespace: ${SNAPSHOT_NAMESPACE}
spec:
  storageConfigName: cpu-pssc-gcs
  selector:
    matchLabels:
      app: agent-sandbox-workload
  triggerConfig:
    type: manual
    postCheckpoint: resume
EOF
Verify the configuration
After you apply the snapshot storage configuration and policy, verify that the resources are ready to use. This section shows you how to check the status of these custom resources.
Check the status of the PodSnapshotStorageConfig resource:
kubectl get podsnapshotstorageconfigs.podsnapshot.gke.io cpu-pssc-gcs \
    --namespace "${SNAPSHOT_NAMESPACE}" -o yaml
The output should contain a condition with type: Ready and status: "True":
status:
  conditions:
  - lastTransitionTime: "2025-10-31T18:18:02Z"
    message: Valid PodSnapshotStorageConfig
    reason: StorageConfigValid
    status: "True"
    type: Ready
Check the status of the PodSnapshotPolicy resource:
kubectl get podsnapshotpolicies.podsnapshot.gke.io cpu-psp \
    --namespace "${SNAPSHOT_NAMESPACE}" -o yaml
The output should contain a condition with type: Ready and status: "True". It should also indicate that the referenced PodSnapshotStorageConfig was found:
status:
  conditions:
  - lastTransitionTime: "2025-10-31T18:19:47Z"
    message: The referenced PodSnapshotStorageConfig "cpu-pssc-gcs" was found
    reason: StorageConfigValid
    status: "True"
    type: Ready
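If you want to script this verification instead of reading the YAML manually, a simple condition check works. The sketch below runs against the sample output from this document; in a live cluster, you would capture the YAML with the kubectl command shown above (or use kubectl's -o jsonpath filtering):

```shell
# Illustrative scripted check for the Ready condition. In a live cluster,
# capture the YAML first, for example:
#   yaml=$(kubectl get podsnapshotstorageconfigs.podsnapshot.gke.io cpu-pssc-gcs -o yaml)
yaml='status:
  conditions:
  - lastTransitionTime: "2025-10-31T18:18:02Z"
    message: Valid PodSnapshotStorageConfig
    reason: StorageConfigValid
    status: "True"
    type: Ready'
# Succeeds only when both a Ready condition type and a "True" status appear.
if echo "$yaml" | grep -q 'type: Ready' && echo "$yaml" | grep -q 'status: "True"'; then
  echo "resource is Ready"
fi
```

This grep-based check is deliberately loose; for production automation, prefer structured output such as -o jsonpath over pattern matching on YAML.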
Create the sandbox template
Now that the storage policies and permissions are in place, you create the
SandboxTemplate and SandboxClaim resources. For the snapshot process to
work, the sandbox must run with the Kubernetes service account you created
earlier in this document. The sandbox must also have the labels that you
specified earlier in the PodSnapshotPolicy.
This example uses a Python app that prints an incrementing counter to the logs. This counter lets you verify that the state is successfully saved and restored later.
To create the SandboxTemplate and SandboxClaim
resources, apply the following manifest:
kubectl apply -f - <<EOF
---
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: python-runtime-template
  namespace: ${SNAPSHOT_NAMESPACE}
spec:
  podTemplate:
    metadata:
      labels:
        app: agent-sandbox-workload
    spec:
      serviceAccountName: ${SNAPSHOT_KSA_NAME}
      runtimeClassName: gvisor
      containers:
      - name: my-container
        image: python:3.10-slim
        command: ["python3", "-c"]
        args:
        - |
          import time
          i = 0
          while True:
              print(f"Count: {i}", flush=True)
              i += 1
              time.sleep(1)
---
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: python-sandbox-example
  namespace: ${SNAPSHOT_NAMESPACE}
  labels:
    app: agent-sandbox-workload
spec:
  sandboxTemplateRef:
    name: python-runtime-template
EOF
Your sandbox is now running with the correct identity and is ready to be snapshotted.
Create a snapshot
This section shows you how to manually trigger a snapshot of your running sandbox. You create a trigger resource that targets your sandbox Pod and then you verify that the snapshot process completes successfully.
Check the initial counter logs: Before triggering the snapshot, view the logs of the running sandbox to see the current counter value. Viewing the logs establishes a baseline to compare against after restoring.
kubectl logs python-sandbox-example --namespace "${SNAPSHOT_NAMESPACE}" --tail=5
The output shows the last few lines of the counter, for example:
Count: 15
Count: 16
Count: 17
Note the last few "Count" values printed.
Create a PodSnapshotManualTrigger resource to initiate the snapshot:
kubectl apply -f - <<EOF
apiVersion: podsnapshot.gke.io/v1alpha1
kind: PodSnapshotManualTrigger
metadata:
  name: cpu-snapshot-trigger
  namespace: ${SNAPSHOT_NAMESPACE}
spec:
  targetPod: python-sandbox-example
EOF
Verify that the manual trigger was successful:
kubectl get podsnapshotmanualtriggers.podsnapshot.gke.io \
    --namespace "${SNAPSHOT_NAMESPACE}"
The output should show a status of Complete, indicating that GKE successfully triggered the snapshot of the target Pod:
NAME                   TARGET POD               STATUS     AGE
cpu-snapshot-trigger   python-sandbox-example   Complete   XXs
View more details about the captured state by describing the trigger:
kubectl describe podsnapshotmanualtriggers.podsnapshot.gke.io cpu-snapshot-trigger \
    --namespace "${SNAPSHOT_NAMESPACE}"
The output contains a Snapshot Created section with the unique name of the snapshot files stored in your bucket:
Status:
  Conditions:
    Last Transition Time:  2026-01-30T19:11:04Z
    Message:               checkpoint completed successfully
    Reason:                Complete
    Status:                True
    Type:                  Triggered
  Observed Generation:     1
  Snapshot Created:
    Name:  <UNIQUE_SNAPSHOT_NAME>
Restore from a snapshot
After you capture a snapshot, you can restore the sandbox environment to resume
execution from its saved state. To restore the sandbox, create a new
SandboxClaim that references the original SandboxTemplate. The Pod Snapshot
controller automatically identifies and restores the most recent matching
snapshot.
Create a new SandboxClaim to restore the sandbox:
kubectl apply -f - <<EOF
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: python-sandbox-from-snapshot
  namespace: ${SNAPSHOT_NAMESPACE}
  labels:
    app: agent-sandbox-workload
spec:
  sandboxTemplateRef:
    name: python-runtime-template
EOF
Verify that the restore took place by viewing the logs. Note that the counter continues from the point where the snapshot was taken:
kubectl logs python-sandbox-from-snapshot --namespace "${SNAPSHOT_NAMESPACE}"
The output should show the counter resuming, for example:
Count: 18
Count: 19
Count: 20
Count: 21
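To confirm programmatically that state was preserved, you can compare the last counter value before the snapshot with the first value after the restore. The sketch below uses the sample log lines from this document; in practice, you would capture both sets of lines with kubectl logs:

```shell
# Illustrative continuity check using the sample log lines from this document.
# In practice:
#   logs_before=$(kubectl logs python-sandbox-example -n "${SNAPSHOT_NAMESPACE}" --tail=5)
#   logs_after=$(kubectl logs python-sandbox-from-snapshot -n "${SNAPSHOT_NAMESPACE}" --tail=5)
logs_before='Count: 15
Count: 16
Count: 17'
logs_after='Count: 18
Count: 19
Count: 20'
last_before=$(echo "$logs_before" | tail -n1 | awk '{print $2}')
first_after=$(echo "$logs_after" | head -n1 | awk '{print $2}')
# The restored counter should pick up after the last pre-snapshot value.
if [ "$first_after" -gt "$last_before" ]; then
  echo "state preserved: counter resumed at ${first_after} (was ${last_before} before the snapshot)"
fi
```

If the restored counter instead starts again at 0, the new claim was scheduled without a matching snapshot, so recheck the PodSnapshotPolicy labels and trigger status.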
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this document, perform the following steps, in order, to delete the resources you created:
Delete the sandbox claims to stop any running Pods and allow the Agent Sandbox controller to gracefully terminate the containers:
kubectl delete sandboxclaims --all --namespace "${SNAPSHOT_NAMESPACE}"
Delete the sandbox templates and manual triggers used to create sandboxes and initiate snapshots:
# Delete the blueprints
kubectl delete sandboxtemplates --all --namespace "${SNAPSHOT_NAMESPACE}"
# Delete the snapshot initiation objects
kubectl delete podsnapshotmanualtriggers --all --namespace "${SNAPSHOT_NAMESPACE}"
Delete the snapshot policies that define which Pods are eligible for snapshots within your namespace:
kubectl delete podsnapshotpolicy cpu-psp --namespace "${SNAPSHOT_NAMESPACE}"
Delete the snapshot storage configuration, which is the global definition of your snapshot storage backend. Because this resource is cluster-scoped, don't use a namespace flag:
kubectl delete podsnapshotstorageconfig cpu-pssc-gcs
Delete the Kubernetes namespace to automatically remove the Kubernetes service account and any remaining namespaced metadata:
kubectl delete namespace "${SNAPSHOT_NAMESPACE}"
Delete the GKE cluster to remove the underlying infrastructure and all nodes associated with the tutorial:
gcloud container clusters delete "${CLUSTER_NAME}" --location="${GKE_LOCATION}" --quiet
Optional: delete the Cloud Storage bucket by using the recursive remove command if you want to fully reset your storage. You can skip this step if you intend to reuse your correctly configured bucket for future tests:
gcloud storage rm --recursive "gs://${SNAPSHOTS_BUCKET_NAME:?Error: SNAPSHOTS_BUCKET_NAME is not set. Re-define the environment variables you defined earlier.}"
Optional: delete the custom IAM role if you want to return your project to a completely clean state. Because IAM roles persist even after the cluster is deleted, you must delete them separately:
gcloud iam roles delete podSnapshotGcsReadWriter --project="${PROJECT_ID}"
What's next
- Learn more about GKE Pod Snapshots.
- Learn more about the Agent Sandbox open-source project on GitHub.
- Learn how to Isolate AI code execution with Agent Sandbox.