This tutorial shows you how to deploy and test the Agent Sandbox snapshot feature from within a Google Kubernetes Engine (GKE) cluster. You learn how to run a client application inside the cluster to programmatically create, pause, and resume sandboxed environments.
For more information about taking snapshots of Pods, see Restore from a Pod snapshot.
Costs
Agent Sandbox is offered at no extra charge in GKE. GKE pricing applies to the resources that you create.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
- Create a project: to create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Artifact Registry and Kubernetes Engine APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- In the Google Cloud console, activate Cloud Shell.
- Verify that you have the permissions required to complete this tutorial.
Required roles
To get the permissions that you need to create and manage sandboxes, ask your administrator to grant you the Kubernetes Engine Admin (roles/container.admin) IAM role on your project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
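As an illustration, a project owner could grant this role with a command along the following lines. This sketch is an addition to the tutorial, and USER_EMAIL and PROJECT_ID are placeholders:

```shell
# Grant the Kubernetes Engine Admin role to a user.
# USER_EMAIL and PROJECT_ID are placeholders; replace them with real values.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.admin"
```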
Limitations
In a regional cluster, nodes in different zones might have different CPU
microarchitectures. Because snapshots capture the CPU state, restoring a
snapshot on a node with missing CPU features fails (for example, with the error
OCI runtime restore failed: incompatible FeatureSet).
To avoid this issue, use the appropriate configuration for your environment:
- Production: To preserve high availability across your cluster, don't pin workloads to a specific zone. Instead, help ensure CPU feature consistency across all zones by specifying a minimum CPU platform. For more information, see Choose a minimum CPU platform.
- Testing: To simplify setup and avoid initial CPU mismatch errors, you can use a nodeSelector field in your SandboxTemplate manifest to pin the Pod to a specific zone, such as us-central1-a. The example in this tutorial uses this testing configuration.
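For the production option, you can set a minimum CPU platform when you create a node pool. The following sketch is illustrative only; "Intel Ice Lake" is an assumed platform name, and availability depends on your region's hardware:

```shell
# Example only: create a node pool whose nodes share a common CPU baseline,
# so snapshots restore cleanly on any node in the pool.
# "Intel Ice Lake" is a placeholder; pick a platform available in your region.
gcloud container node-pools create consistent-pool \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --min-cpu-platform="Intel Ice Lake"
```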
Define environment variables
To simplify the commands that you run in this tutorial, set environment variables. In Cloud Shell, run the following commands:
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
export CLUSTER_NAME="test-snapshot"
export LOCATION="us-central1"
export BUCKET_LOCATION="us"
export MACHINE_TYPE="n2-standard-2"
export REPOSITORY_NAME="agent-sandbox"
export BUCKET_NAME="${PROJECT_ID}_snapshots"
export CLOUDBUILD_BUCKET_NAME="${PROJECT_ID}_cloudbuild"
Here's an explanation of these environment variables:
- PROJECT_ID: the ID of your current Google Cloud project. Defining this variable helps ensure that all resources are created in the correct project.
- PROJECT_NUMBER: the project number of your current Google Cloud project.
- CLUSTER_NAME: the name of your GKE cluster (for example, test-snapshot).
- LOCATION: the Google Cloud region where your GKE cluster and Artifact Registry repository are located (for example, us-central1).
- BUCKET_LOCATION: the location of your Cloud Storage buckets (for example, us).
- MACHINE_TYPE: the machine type to use for the cluster nodes (for example, n2-standard-2).
- REPOSITORY_NAME: the name of the Artifact Registry repository (for example, agent-sandbox).
- BUCKET_NAME: the name of the Cloud Storage bucket used for snapshots.
- CLOUDBUILD_BUCKET_NAME: the name of the Cloud Storage bucket used for Cloud Build logs.
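Before continuing, you can confirm that every variable is set. This quick check is an addition to the tutorial; it prints the name of any variable that is still empty:

```shell
# Print the name of any required variable that is not yet set.
for var in PROJECT_ID PROJECT_NUMBER CLUSTER_NAME LOCATION BUCKET_LOCATION \
    MACHINE_TYPE REPOSITORY_NAME BUCKET_NAME CLOUDBUILD_BUCKET_NAME; do
  if [ -z "${!var}" ]; then
    echo "Missing: $var"
  fi
done
```

If the loop prints nothing, all variables are defined.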
Overview of configuration steps
To enable and test Pod snapshots of Agent Sandbox environments from within your cluster, you need to perform several configuration steps. Before you start, it helps to review the components involved in the overall workflow.
Key components
This tutorial uses the following two Python applications to test the snapshot process:
- Client application: a Python script running in a standard Pod in your cluster. This application manages the sandbox lifecycle: it programmatically creates the sandbox, pauses it to trigger a snapshot, resumes the sandbox, and verifies that the state was preserved. In this tutorial, you create a Kubernetes service account named agent-sandbox-client-sa and grant it RBAC permissions so that the client application Pod can manage sandbox custom resources and snapshot trigger objects using the Kubernetes API.
- Sandboxed application: a Python script that increments and prints a counter every second. This application runs securely inside the isolated sandbox environment to generate a changing state that the client application can verify. In this tutorial, you create a dedicated Kubernetes service account named snapshot-sa and configure Workload Identity to authorize the sandboxed Pod to securely read and write snapshot objects in Cloud Storage.
Configuration and testing process
The following list summarizes the steps you need to perform to set up your environment and run the test:
- Create a cluster: create an Autopilot or Standard cluster with Pod snapshots and the Agent Sandbox feature enabled.
- Create an Artifact Registry repository: create a Docker repository to store the container image for your client application.
- Install Agent Sandbox: install the core agent-sandbox components and extensions on your cluster.
- Configure storage and permissions: create a Cloud Storage bucket and configure Workload Identity permissions to allow snapshots to be saved securely.
- Configure Pod snapshots: create and apply the snapshot storage configuration, the snapshot policy, and the sandbox template.
- Build the client application: build the container image for the client application and push it to your Artifact Registry repository.
- Run the test: deploy the client application Pod, which creates the sandbox, pauses it to capture a snapshot, resumes it, and verifies that the counter's state was successfully restored.
Create a cluster
Create a new GKE cluster with Pod snapshots enabled. For full feature compatibility, specify the rapid release channel.
Autopilot
Create an Autopilot cluster with the required features:
gcloud beta container clusters create-auto ${CLUSTER_NAME} \
--enable-pod-snapshots \
--release-channel=rapid \
--location=${LOCATION}
Standard
Create a Standard cluster with the required features:
gcloud beta container clusters create ${CLUSTER_NAME} \
--enable-pod-snapshots \
--release-channel=rapid \
--machine-type=${MACHINE_TYPE} \
--workload-pool=${PROJECT_ID}.svc.id.goog \
--workload-metadata=GKE_METADATA \
--num-nodes=1 \
--location=${LOCATION}
Create a node pool with gVisor enabled:
gcloud container node-pools create gvisor-pool \
--cluster ${CLUSTER_NAME} \
--num-nodes=1 \
--location=${LOCATION} \
--project=${PROJECT_ID} \
--sandbox type=gvisor
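Optionally, you can confirm that the gVisor runtime is registered on the cluster before you continue. This check is an addition to the tutorial:

```shell
# List the gvisor RuntimeClass; it should appear once the node pool is ready.
kubectl get runtimeclass gvisor
```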
Create an Artifact Registry repository
Create a Docker repository in Artifact Registry to store the container image for your client application (the application that creates and manages the sandbox):
gcloud artifacts repositories create ${REPOSITORY_NAME} \
--repository-format=docker \
--location=${LOCATION} \
--description="Docker repository for Agent Sandbox"
Install Agent Sandbox
Install the Agent Sandbox core components and extensions on your cluster (using version v0.4.6 as an example):
# Install the core agent-sandbox components
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.6/manifest.yaml
# Install the extensions (e.g., Warm Pools, Claims)
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.4.6/extensions.yaml
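To confirm that the installation succeeded, you can check for the Agent Sandbox custom resource definitions. This verification step is an addition to the tutorial, and the exact CRD names may vary by release:

```shell
# Confirm that the Agent Sandbox CRDs (for example, sandboxes.agents.x-k8s.io)
# were installed by the manifests above.
kubectl get crds | grep -E 'agents\.x-k8s\.io'
```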
Configure storage and permissions
Configure a Cloud Storage bucket for storing Pod snapshots and grant the required Workload Identity permissions to the snapshot-sa service account and the GKE service agent. These permissions let your sandboxed workloads securely save and retrieve snapshot objects.

Create a new Cloud Storage bucket:

gcloud storage buckets create "gs://${BUCKET_NAME}" \
    --uniform-bucket-level-access \
    --enable-hierarchical-namespace \
    --soft-delete-duration=0d \
    --location="${BUCKET_LOCATION}"

Create a Kubernetes service account in the default namespace. Your sandboxed application (the Python counter script) uses this identity to authenticate to external APIs and securely access snapshot objects stored in Cloud Storage:

kubectl create serviceaccount "snapshot-sa" \
    --namespace "default"

Bind the storage.bucketViewer role to your service account using Workload Identity. This role allows the sandboxed workload to list the bucket contents and locate specific snapshots:

gcloud storage buckets add-iam-policy-binding "gs://${BUCKET_NAME}" \
    --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/snapshot-sa" \
    --role="roles/storage.bucketViewer"

Bind the storage.objectUser role to your service account using Workload Identity. This role provides the permission to read, save, and delete snapshot binary objects within the bucket:

gcloud storage buckets add-iam-policy-binding "gs://${BUCKET_NAME}" \
    --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/snapshot-sa" \
    --role="roles/storage.objectUser"

Grant the GKE service agent permissions to manage (create, list, read, and delete) snapshot objects within the bucket:

gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
    --role="roles/storage.objectUser" \
    --condition="expression=resource.name.startsWith(\"projects/_/buckets/${BUCKET_NAME}\"),title=restrict_to_bucket,description=Restricts access to one bucket only"
Configure Pod snapshots
Create and apply the configuration files to install the required Kubernetes custom resources. These resources define how the cluster stores and manages Pod snapshots:
- PodSnapshotStorageConfig: specifies the Cloud Storage bucket designated for storing snapshot binary objects.
- PodSnapshotPolicy: defines how snapshots are triggered, how they are grouped, and how long they are retained.
- SandboxTemplate: defines the underlying container, node selectors, and service accounts for running the isolated sandboxed workload.
Create a file named test_client/snapshot_storage_config.yaml. This configuration specifies the target Cloud Storage bucket where the cluster saves the binary Pod snapshot state:

apiVersion: podsnapshot.gke.io/v1
kind: PodSnapshotStorageConfig
metadata:
  name: example-pod-snapshot-storage-config
spec:
  snapshotStorageConfig:
    gcs:
      bucket: "$BUCKET_NAME"

Substitute the environment variable placeholder in the configuration file:

sed -i "s/\$BUCKET_NAME/$BUCKET_NAME/g" test_client/snapshot_storage_config.yaml

Apply the storage configuration manifest:

kubectl apply -f test_client/snapshot_storage_config.yaml

Wait for the storage configuration to be ready:

kubectl wait --for=condition=Ready podsnapshotstorageconfig/example-pod-snapshot-storage-config --timeout=60s

Create a file named test_client/snapshot_policy.yaml. This configuration establishes a retention rule that retains a maximum of two snapshots for your sandboxed workload. The trigger type is set to manual, which lets the client application control snapshots on demand:

apiVersion: podsnapshot.gke.io/v1
kind: PodSnapshotPolicy
metadata:
  name: example-pod-snapshot-policy
  namespace: default
spec:
  storageConfigName: example-pod-snapshot-storage-config
  selector:
    matchLabels:
      app: agent-sandbox-workload
  triggerConfig:
    type: manual
    postCheckpoint: resume
  snapshotGroupingRules:
    groupByLabelValue:
      labels: ["agents.x-k8s.io/sandbox-name-hash", "tenant-id", "user-id"]
  groupRetentionPolicy:
    maxSnapshotCountPerGroup: 2

Apply the snapshot policy manifest:

kubectl apply -f test_client/snapshot_policy.yaml

Create a file named test_client/python-counter-template.yaml. This configuration defines the sandbox Pod and assigns the snapshot-sa service account identity to it. This assignment helps ensure that the sandbox runs securely. Inside that Pod, the sandboxed application (a Python script) continuously prints an incrementing counter to the container logs:

apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: python-counter-template
  namespace: default
spec:
  podTemplate:
    metadata:
      labels:
        app: agent-sandbox-workload
    spec:
      serviceAccountName: snapshot-sa
      runtimeClassName: gvisor
      nodeSelector:
        topology.kubernetes.io/zone: us-central1-a # Pin to a zone to avoid CPU mismatch during restore
      containers:
      - name: python-counter
        image: python:3.13-slim
        command: ["python3", "-c"]
        args:
        - |
          import time
          i = 0
          while True:
              print(f"Count: {i}", flush=True)
              i += 1
              time.sleep(1)

Apply the sandbox template manifest:

kubectl apply -f test_client/python-counter-template.yaml
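Before building the client, you can optionally list the snapshot resources to confirm they were created. This check is an addition to the tutorial, and the scope of each resource (cluster or namespace) may differ by release:

```shell
# Confirm that the snapshot configuration resources exist.
kubectl get podsnapshotstorageconfig example-pod-snapshot-storage-config
kubectl get podsnapshotpolicy,sandboxtemplate -n default
```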
Build the client application
Create the container image for the client application and upload it to Artifact Registry.
Create a file named test_client/Dockerfile.client. This file defines the Python runtime environment and dependencies for the client application:

FROM python:3.13-slim

WORKDIR /app

RUN pip install "k8s-agent-sandbox[tracing]==0.4.6"

# Copy test script
COPY client_test.py /app/client_test.py

CMD ["python", "/app/client_test.py"]

Create a file named test_client/client_test.py. This script manages the sandbox lifecycle and verifies that the state successfully resumes after taking a snapshot:

import time
import logging
import re
from kubernetes import config, client
from k8s_agent_sandbox.gke_extensions.snapshots import PodSnapshotSandboxClient

logging.basicConfig(level=logging.INFO)

def get_last_count(pod_name, namespace):
    v1 = client.CoreV1Api()
    try:
        logs = v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)
        counts = re.findall(r"Count: (\d+)", logs)
        if counts:
            return int(counts[-1])
        return None
    except Exception as e:
        logging.error(f"Failed to read logs for pod {pod_name}: {e}")
        return None

def get_current_pod_name(sandbox_id, namespace):
    custom_api = client.CustomObjectsApi()
    try:
        sandbox_cr = custom_api.get_namespaced_custom_object(
            group="agents.x-k8s.io",
            version="v1alpha1",
            namespace=namespace,
            plural="sandboxes",
            name=sandbox_id
        )
        metadata = sandbox_cr.get("metadata", {})
        annotations = metadata.get("annotations", {})
        return annotations.get("agents.x-k8s.io/pod-name")
    except Exception as e:
        logging.error(f"Failed to get sandbox CR: {e}")
        return None

def get_current_count(sandbox_id, namespace="default"):
    pod_name = get_current_pod_name(sandbox_id, namespace)
    if not pod_name:
        logging.error(f"Could not determine pod name for sandbox {sandbox_id}")
        return None
    return get_last_count(pod_name, namespace)

def suspend_sandbox(sandbox):
    logging.info("Pausing sandbox (using snapshots)...")
    try:
        suspend_resp = sandbox.suspend(snapshot_before_suspend=True)
        if suspend_resp.success:
            logging.info("Sandbox paused successfully.")
            if suspend_resp.snapshot_response:
                logging.info(f"Snapshot created: {suspend_resp.snapshot_response.snapshot_uid}")
            return suspend_resp
        else:
            logging.error(f"Failed to pause: {suspend_resp.error_reason}")
            exit(1)
    except Exception as e:
        logging.error(f"Failed to pause sandbox: {e}")
        exit(1)

def resume_sandbox(sandbox):
    logging.info("Resuming sandbox (using snapshots)...")
    try:
        resume_resp = sandbox.resume()
        if resume_resp.success:
            logging.info("Sandbox resumed successfully.")
            if resume_resp.restored_from_snapshot:
                logging.info(f"Restored from snapshot: {resume_resp.snapshot_uid}")
            return resume_resp
        else:
            logging.error(f"Failed to resume: {resume_resp.error_reason}")
            exit(1)
    except Exception as e:
        logging.error(f"Failed to resume sandbox: {e}")
        exit(1)

def verify_continuity(count_before, count_after):
    if count_before is not None and count_after is not None:
        logging.info(f"Verification: Count before={count_before}, Count after={count_after}")
        if count_after >= count_before:
            logging.info("SUCCESS: Sandbox resumed from where it left off (or later).")
        else:
            logging.error("FAIL: Sandbox counter reset or went backwards!")
    else:
        logging.warning("Could not verify counter continuity.")

def main():
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

    client_reg = PodSnapshotSandboxClient()

    logging.info("Creating sandbox...")
    sandbox = client_reg.create_sandbox(template="python-counter-template", namespace="default")
    logging.info(f"Sandbox created with ID: {sandbox.sandbox_id}")

    logging.info("Waiting for sandbox to run...")
    time.sleep(10)

    count_before = get_current_count(sandbox.sandbox_id)
    logging.info(f"Count before suspend: {count_before}")

    suspend_sandbox(sandbox)

    logging.info("Waiting 10 seconds...")
    time.sleep(10)

    resume_sandbox(sandbox)

    logging.info("Waiting for sandbox to be ready again...")
    time.sleep(10)

    count_after = get_current_count(sandbox.sandbox_id)
    logging.info(f"Count after resume: {count_after}")

    verify_continuity(count_before, count_after)
    logging.info("Snapshot test completed successfully.")

if __name__ == "__main__":
    main()

Build the client container image and upload it to Artifact Registry. If your environment (such as Cloud Shell) has Docker installed, you can use Docker to build the image locally. If you are working in an environment without Docker, you can use Cloud Build to build and push the image remotely.
Docker
Configure Docker authentication for Artifact Registry:

gcloud auth configure-docker "${LOCATION}-docker.pkg.dev"

Build and push the client container image:

docker build -t "${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY_NAME}/sandbox-client:latest" \
    -f test_client/Dockerfile.client test_client
docker push "${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY_NAME}/sandbox-client:latest"
Cloud Build
Create a file named test_client/cloudbuild.yaml:

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', '$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY_NAME/sandbox-client:latest', '-f', 'test_client/Dockerfile.client', 'test_client']
images:
- '$LOCATION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY_NAME/sandbox-client:latest'

Substitute the environment variable placeholders in the configuration file:

sed -i "s/\$REPOSITORY_NAME/$REPOSITORY_NAME/g" test_client/cloudbuild.yaml
sed -i "s/\$LOCATION/$LOCATION/g" test_client/cloudbuild.yaml
sed -i "s/\$PROJECT_ID/$PROJECT_ID/g" test_client/cloudbuild.yaml

Grant the necessary permissions to the Cloud Build service account:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/artifactregistry.writer"
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/logging.logWriter"
gcloud storage buckets add-iam-policy-binding "gs://$CLOUDBUILD_BUCKET_NAME" \
    --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

Run the build using Cloud Build:

gcloud builds submit --config test_client/cloudbuild.yaml
Run the test
Deploy the client application to create the sandbox, trigger a snapshot, and verify that the internal counter successfully resumes from its saved state.
Create a file named test_client/client_sa.yaml. This manifest defines the agent-sandbox-client-sa service account and its required RBAC permissions for managing sandbox custom resources:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-sandbox-client-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-sandbox-client-role
  namespace: default
rules:
- apiGroups: ["agents.x-k8s.io"]
  resources: ["sandboxes"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["extensions.agents.x-k8s.io"]
  resources: ["sandboxclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["podsnapshot.gke.io"]
  resources: ["podsnapshotmanualtriggers", "podsnapshots"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-sandbox-client-rolebinding
  namespace: default
subjects:
- kind: ServiceAccount
  name: agent-sandbox-client-sa
  namespace: default
roleRef:
  kind: Role
  name: agent-sandbox-client-role
  apiGroup: rbac.authorization.k8s.io

Apply the client service account and RBAC manifest:

kubectl apply -f test_client/client_sa.yaml

Create a file named test_client/client_pod.yaml. This manifest creates the client application Pod using the container image that you built:

apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox-client-pod
  namespace: default
spec:
  serviceAccountName: agent-sandbox-client-sa
  containers:
  - name: client
    image: $LOCATION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY_NAME/sandbox-client:latest
    imagePullPolicy: Always
  restartPolicy: Never

Substitute the environment variable placeholders in the manifest:

sed -i "s/\$REPOSITORY_NAME/$REPOSITORY_NAME/g" test_client/client_pod.yaml
sed -i "s/\$LOCATION/$LOCATION/g" test_client/client_pod.yaml
sed -i "s/\$PROJECT_ID/$PROJECT_ID/g" test_client/client_pod.yaml

Apply the client application Pod manifest:

kubectl apply -f test_client/client_pod.yaml

Stream the Pod logs to verify the execution flow:

kubectl logs -f agent-sandbox-client-pod
When the test is running correctly, the output looks similar to this (shortened here for readability):
2026-04-21 23:02:39,030 - INFO - Creating sandbox...
...
2026-04-21 23:02:51,755 - INFO - Count before suspend: 23
2026-04-21 23:02:51,755 - INFO - Pausing sandbox (using snapshots)...
...
2026-04-21 23:03:07,115 - INFO - Resuming sandbox (using snapshots)...
...
2026-04-21 23:03:21,329 - INFO - Count after resume: 38
2026-04-21 23:03:21,329 - INFO - Verification: Count before=23, Count after=38
2026-04-21 23:03:21,329 - INFO - SUCCESS: Sandbox resumed from where it left off (or later).
The output shows that the sandbox preserves its state across suspend and resume. The counter stops advancing while the sandbox is suspended (paused and scaled to zero) and continues from its saved value when the sandbox is restored. If the sandbox had kept running instead of being suspended, the counter would have continued to advance during that period and the final count would be significantly higher.
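The client finds the latest counter value by pattern-matching the sandbox Pod's logs. The same parsing can be exercised in isolation; the sample log text below is illustrative, not real tutorial output:

```shell
# Extract the most recent counter value from sample log text, mirroring
# the "Count: (\d+)" pattern that client_test.py matches with re.findall.
printf 'Count: 36\nCount: 37\nCount: 38\n' \
  | grep -oE 'Count: [0-9]+' \
  | tail -n 1
# prints: Count: 38
```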
Clean up resources
To avoid incurring charges to your Google Cloud account, delete the resources that you created:
Delete the GKE cluster. This also deletes the node pool and all Kubernetes service accounts inside it:

gcloud beta container clusters delete test-snapshot --location="${LOCATION}" --quiet

Delete the Artifact Registry repository to remove the Docker repository you created for the test image:

gcloud artifacts repositories delete ${REPOSITORY_NAME} --location="${LOCATION}" --quiet

Delete the Cloud Storage bucket and all the snapshots inside it. This automatically removes the bucket-level Workload Identity IAM bindings applied to it:

gcloud storage rm --recursive "gs://${BUCKET_NAME}"

Remove the project-level IAM binding for the GKE service agent:

gcloud projects remove-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com" \
    --role="roles/storage.objectUser" \
    --condition="expression=resource.name.startsWith(\"projects/_/buckets/${BUCKET_NAME}\"),title=restrict_to_bucket,description=Restricts access to one bucket only"

If you used Cloud Build instead of Docker to build and push the container image, delete the logs bucket and remove the service account permissions:

gcloud storage rm --recursive "gs://${CLOUDBUILD_BUCKET_NAME}"
gcloud projects remove-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/artifactregistry.writer"
gcloud projects remove-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/logging.logWriter"
What's next
- Learn how to Isolate AI code execution: external trigger.
- Learn how to Save and restore Agent Sandbox environments.
- To understand the isolation layers that protect your untrusted workloads, see GKE Sandbox.
- Explore the open-source Agent Sandbox project on GitHub.