This page describes troubleshooting steps that you might find helpful if you run into problems when you use Gemini Enterprise Agent Platform Workbench.
See also Troubleshooting Agent Platform for help using other components of Gemini Enterprise Agent Platform.
Agent Platform Workbench instances
This section describes troubleshooting steps for Agent Platform Workbench instances.
Troubleshooting with AI Tools
This section describes how to use AI tools for troubleshooting.
Troubleshooting with Gemini Cloud Assist Investigations
When connecting Agent Platform with other Google Cloud products, you may find Gemini Cloud Assist Investigations to be helpful in troubleshooting integration issues. It may also accelerate troubleshooting on the instance itself. Gemini Cloud Assist lets you draw insights from metrics and logs generated by the instance.
- Stop the instance and follow the View in Compute Engine link.
- Install the Ops Agent (recommended). This takes a few minutes.
- Add a custom metadata field named notebook-enable-debug and set it to true.
- Restart the instance and reproduce the issue.
- Enable and configure the Cloud Assist Investigations API.
- Create a new investigation and describe the issue in detail using a natural language prompt.
- As you type, a dialog appears that suggests resources to add to the investigation. Review this list and be sure to add the instance as a resource, as well as any other resources from this list of supported products.
- Start the investigation and review the results.
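The metadata and API setup steps above can be sketched with the gcloud CLI. This is a sketch, not a prescription: the instance name, zone, and the Cloud Assist API service name are assumptions to replace with your own values.

```shell
# Stop the instance before editing metadata (instance and zone names are examples)
gcloud compute instances stop example-workbench-vm --zone=us-central1-a

# Add the custom metadata field that enables debug logging
gcloud compute instances add-metadata example-workbench-vm \
    --zone=us-central1-a \
    --metadata=notebook-enable-debug=true

# Enable the Gemini Cloud Assist API (service name is an assumption; verify in your project)
gcloud services enable geminicloudassist.googleapis.com

# Restart the instance and reproduce the issue
gcloud compute instances start example-workbench-vm --zone=us-central1-a
```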
Troubleshoot diagnostic files with Gemini CLI
You can use the results from the Gemini Cloud Assist investigation to inform further AI-driven investigation of the diagnostic file from the instance.
- Run the diagnostic tool and specify a Cloud Storage bucket to upload the results.
sudo /opt/deeplearning/bin/diagnostic_tool.sh [--repair] [--bucket=$BUCKET]
- Download the diagnostic file to your workstation, then install and configure the Gemini CLI.
- Start the application, then describe your issue. Include the hypothesis from the Gemini Cloud Assist investigation in the context. Ask the model to extend the investigation by reading the contents of the diagnostic file using natural language prompts.
Connecting to and opening JupyterLab
This section describes troubleshooting steps for connecting to and opening JupyterLab.
Nothing happens after clicking Open JupyterLab
Issue
When you click Open JupyterLab, nothing happens.
Solution
Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.
Can't access the terminal in an Agent Platform Workbench instance
Issue
If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your Agent Platform Workbench instance doesn't have terminal access enabled.
Solution
You must create a new Agent Platform Workbench instance with the Terminal access option enabled. This option can't be changed after instance creation.
502 error when opening JupyterLab
Issue
A 502 error might mean that your Agent Platform Workbench instance isn't ready yet.
Solution
Wait a few minutes, refresh the Google Cloud console browser tab, and try again.
Notebook is unresponsive
Issue
Your Agent Platform Workbench instance isn't running cells or appears to be frozen.
Solution
First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:
- Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
- Reset your instance.
Unable to connect with Agent Platform Workbench instance using SSH
Issue
You're unable to connect to your instance by using SSH through a terminal window.
Agent Platform Workbench instances use OS Login to
enable SSH access. When you create an instance, Agent Platform Workbench enables
OS Login by default by setting the metadata key enable-oslogin to TRUE. If
you're unable to use SSH to connect to your instance, this metadata key might
have been changed and might need to be set back to TRUE.
Solution
Connecting to an Agent Platform Workbench instance by using the Google Cloud console isn't supported. If you're unable to connect to your instance by using SSH through a terminal window, see the following:
To set the metadata key enable-oslogin to TRUE, use the
projects.locations.instances.patch
method in the Notebooks API or the
gcloud workbench instances update
command in the Agent Platform SDK.
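As a sketch, the metadata key can be set with the update command mentioned above. The instance name and location are illustrative, and the --metadata flag is an assumption about your gcloud version; verify with gcloud workbench instances update --help.

```shell
# Set enable-oslogin back to TRUE on the instance (names are examples)
gcloud workbench instances update example-workbench-vm \
    --location=us-central1-a \
    --metadata=enable-oslogin=TRUE
```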
GPU quota has been exceeded
Issue
You're unable to create an Agent Platform Workbench instance with GPUs.
Solution
Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase for Compute Engine GPUs. See Request a higher quota limit.
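One way to inspect GPU quota from the command line is sketched below, assuming jq is installed and with us-central1 standing in for your region:

```shell
# List GPU-related quotas for a region; the region name is an example
gcloud compute regions describe us-central1 --format=json \
  | jq '.quotas[] | select(.metric | test("GPU"))'
```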
Creating Agent Platform Workbench instances
This section describes how to troubleshoot issues related to creating Agent Platform Workbench instances.
Instance stays in pending state indefinitely or is stuck in provisioning status
Issue
After you create an Agent Platform Workbench instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:
Could not resolve host: notebooks.googleapis.com
If your instance is stuck in provisioning status, this could be because you have an invalid private networking configuration for your instance.
Solution
Follow the steps in the Instance logs show connection or timeout errors section.
Unable to create an instance within a Shared VPC network
Issue
Attempting to create an instance within a Shared VPC network results in an error message like the following:
Required 'compute.subnetworks.use' permission for 'projects/network-administration/regions/us-central1/subnetworks/v'
Solution
The issue is that the Notebooks Service Account is attempting to create the instance without the correct permissions.
To ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network,
ask your administrator to grant the
Compute Network User role (roles/compute.networkUser)
to the Notebooks Service Account on the host project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network:
- To use subnetworks: compute.subnetworks.use
Your administrator might also be able to give the Notebooks Service Account these permissions with custom roles or other predefined roles.
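For illustration, the grant can be made with gcloud. The host project ID is a placeholder, and the Notebooks Service Account address shown (typically service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com) is an assumption to verify in your project's IAM page.

```shell
# Grant the Compute Network User role on the Shared VPC host project
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```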
Can't create an Agent Platform Workbench instance with a custom container
Issue
There isn't an option to use a custom container when creating an Agent Platform Workbench instance in the Google Cloud console.
Solution
Adding a custom container to an Agent Platform Workbench instance isn't supported, and you can't add a custom container by using the Google Cloud console.
Adding a conda environment is recommended instead of using a custom container.
You can add a custom container to an Agent Platform Workbench instance by using the Notebooks API, but this capability isn't supported.
Unable to use the Gemini CLI
Issue
The Gemini CLI tile is in the JupyterLab launcher and opens successfully, but Gemini doesn't respond to prompts.
Solution
An administrator might have blocked access to the Gemini CLI. See Control access to the Gemini CLI.
Mount shared storage button isn't displayed
Issue
The Mount shared storage button doesn't appear in the File Browser tab of the JupyterLab interface.
Solution
The storage.buckets.list permission is required for the
Mount shared storage button to appear in the JupyterLab interface of your
Agent Platform Workbench instance. Ask your administrator to grant your
Agent Platform Workbench instance's service account the
storage.buckets.list permission on the project.
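One minimal approach, sketched with gcloud, is a custom role that carries only storage.buckets.list. The role ID and service account address are illustrative assumptions.

```shell
# Create a minimal custom role with only the required permission
gcloud iam roles create bucketLister --project=PROJECT_ID \
    --permissions=storage.buckets.list

# Grant that role to the instance's service account on the project
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="projects/PROJECT_ID/roles/bucketLister"
```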
599 error when using Managed Service for Apache Spark
Issue
Attempting to create a Managed Service for Apache Spark-enabled instance results in an error message like the following:
HTTP 599: Unknown (Error from Gateway: [Timeout while connecting] Exception while attempting to connect to Gateway server url. Ensure gateway url is valid and the Gateway instance is running.)
Solution
In your Cloud DNS configuration, add a Cloud DNS entry for the
*.googleusercontent.com domain.
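A sketch of that configuration with gcloud follows. The zone name and network are placeholders, and the record addresses assume you route Google APIs through the private.googleapis.com VIP range; adjust to match your Private Google Access setup.

```shell
# Create a private Cloud DNS zone covering googleusercontent.com
gcloud dns managed-zones create googleusercontent \
    --dns-name="googleusercontent.com." \
    --visibility=private \
    --networks=VPC_NETWORK \
    --description="Resolves *.googleusercontent.com for Workbench gateways"

# Add a wildcard record pointing at the private.googleapis.com VIPs
# (an assumption, not a prescription; verify against your network design)
gcloud dns record-sets create "*.googleusercontent.com." \
    --zone=googleusercontent --type=A --ttl=300 \
    --rrdatas=199.36.153.8,199.36.153.9,199.36.153.10,199.36.153.11
```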
Unable to install third-party JupyterLab extension
Issue
Attempting to install a third-party JupyterLab extension results in an
Error: 500 message.
Solution
Third-party JupyterLab extensions aren't supported in Agent Platform Workbench instances.
Unable to edit underlying virtual machine
Issue
When you try to edit the underlying virtual machine (VM) of an Agent Platform Workbench instance, you might get an error message similar to the following:
Current principal doesn't have permission to mutate this resource.
Solution
This error occurs because you can't edit the underlying VM of an instance by using the Google Cloud console or the Compute Engine API.
To edit an Agent Platform Workbench instance's underlying VM, use the
projects.locations.instances.patch
method in the Notebooks API or the
gcloud workbench instances update
command in the Agent Platform SDK.
pip packages aren't available after adding conda environment
Issue
Your pip packages aren't available after you add a conda-based kernel.
Solution
To resolve the issue, see Add a conda environment and try the following:
- Check that you used the DL_ANACONDA_ENV_HOME variable and that it contains the name of your environment.
- Check that pip is located in a path similar to /opt/conda/envs/ENVIRONMENT/bin/pip. You can run the which pip command to get the path.
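A quick terminal check, with myenv standing in for your environment name:

```shell
# Confirm the environment variable and the pip on PATH point into your environment
echo "$DL_ANACONDA_ENV_HOME"   # expected to end with your environment name
which pip                      # expected: /opt/conda/envs/myenv/bin/pip
```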
Unable to access or copy data of an instance with single user access
Issue
The data on an instance with single user access is inaccessible.
For Agent Platform Workbench instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.
Solution
To access or copy the data when you aren't the owner of the instance, open a support case.
Unexpected shutdown
Issue
Your Agent Platform Workbench instance shuts down unexpectedly.
Solution
If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.
If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.
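To inspect or extend the idle timeout, one option is the instance metadata. This is a hedged sketch: idle-timeout-seconds is assumed to be the metadata key in use, and the instance name and location are examples.

```shell
# Show the current idle timeout setting, if any
gcloud workbench instances describe example-workbench-vm \
    --location=us-central1-a --format=json | grep idle-timeout-seconds

# Raise the idle timeout to two hours (7200 seconds)
gcloud workbench instances update example-workbench-vm \
    --location=us-central1-a \
    --metadata=idle-timeout-seconds=7200
```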
Instance logs show connection or timeout errors
Issue
Your Agent Platform Workbench instance's logs show connection or timeout errors.
Solution
If you notice connection or timeout errors in the instance's logs, make sure that the Jupyter server is running on port 8080. Follow the steps in the Verify that the Jupyter internal API is active section.
If you have turned off External IP and you are using a private VPC
network, make sure you have also followed the
network configuration options documentation.
Consider the following:
You must enable Private Google Access on the chosen subnetwork in the same region where your instance is located in the VPC host project. For more information on configuring Private Google Access, see the Private Google Access documentation.
If you're using Cloud DNS, the instance must be able to resolve the required Cloud DNS domains specified in the network configuration options documentation. To verify this, follow the steps in the Verify the instance can resolve the required DNS domains section.
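The Private Google Access check above can be scripted; the subnetwork name and region are placeholders:

```shell
# Verify Private Google Access is enabled on the subnetwork (expect: True)
gcloud compute networks subnets describe SUBNET_NAME \
    --region=us-central1 \
    --format="value(privateIpGoogleAccess)"
```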
Instance logs show 'Unable to contact Jupyter API' or 'ReadTimeoutError'
Issue
Your Agent Platform Workbench instance logs show an error such as:
notebooks_collection_agent. Unable to contact Jupyter API:
HTTPConnectionPool(host=\'127.0.0.1\', port=8080):
Max retries exceeded ReadTimeoutError(\"HTTPConnectionPool(host=\'127.0.0.1\', port=8080
Solution
Follow the steps in the
Instance logs show connection or timeout errors section.
You can also try
modifying the Notebooks Collection Agent script
to change HTTP_TIMEOUT_SESSION to a larger value,
for example 60, to help determine whether the request failed because
the call took too long to respond or because the requested URL couldn't be reached.
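If you make that edit from a terminal, a hedged sketch follows; it assumes the timeout is assigned in the file as HTTP_TIMEOUT_SESSION = <number>, which you should confirm before running sed.

```shell
# Bump the collection agent's HTTP timeout to 60 seconds
sudo sed -i 's/HTTP_TIMEOUT_SESSION = [0-9]\+/HTTP_TIMEOUT_SESSION = 60/' \
    /opt/deeplearning/bin/notebooks_collection_agent.py

# Restart the service so the new value takes effect
sudo systemctl restart notebooks-collection-agent.service
```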
docker0 address conflicts with VPC addressing
Issue
By default, the docker0 interface is created with an IP address of 172.17.0.1/16. This might conflict with the IP addressing in your VPC network such that the instance is unable to connect to other endpoints with 172.17.0.1/16 addresses.
Solution
You can force the docker0 interface to be created with an IP address that
doesn't conflict with your VPC network by using the following
post-startup script and
setting the post-startup script behavior to run_once.
#!/bin/bash
# Wait for Docker to be fully started
while ! systemctl is-active docker; do
  sleep 1
done

# Stop the Docker service
systemctl stop docker

# Modify /etc/docker/daemon.json
cat <<EOF >/etc/docker/daemon.json
{
  "bip": "CUSTOM_DOCKER_IP/16"
}
EOF

# Restart the Docker service
systemctl start docker
Specified reservations don't exist
Issue
The operation to create the instance results in a Specified reservations do
not exist error message. The operation's output might be similar to the following:
{
  "name": "projects/PROJECT/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.notebooks.v2.OperationMetadata",
    "createTime": "2025-01-01T01:00:01.000000000Z",
    "endTime": "2025-01-01T01:00:01.000000000Z",
    "target": "projects/PROJECT/locations/LOCATION/instances/INSTANCE_NAME",
    "verb": "create",
    "requestedCancellation": false,
    "apiVersion": "v2",
    "endpoint": "CreateInstance"
  },
  "done": true,
  "error": {
    "code": 3,
    "message": "Invalid value for field 'resource.reservationAffinity': '{ \"consumeReservationType\": \"SPECIFIC_ALLOCATION\", \"key\": \"compute.googleapis.com/reservation-name...'. Specified reservations [projects/PROJECT/zones/ZONE/futureReservations/RESERVATION_NAME] do not exist.",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.RequestInfo",
        "requestId": "REQUEST_ID"
      }
    ]
  }
}
Solution
Some Compute Engine machine types require additional parameters at creation such as local SSDs or a minimum CPU platform. The instance specification must include these additional fields.
- Agent Platform Workbench instances use automatic minimum CPU platform by default. If your reservation sets a specific platform, you need to set min_cpu_platform accordingly when creating Agent Platform Workbench instances.
- Agent Platform Workbench instances always set the number of local SSDs to the default values according to the machine type. For example, a2-ultragpu-1g always has 1 local SSD, while a2-highgpu-1g always has 0 local SSDs. When creating reservations to be used for Agent Platform Workbench instances, you need to leave the local SSD count at its default value.
Helpful procedures
This section describes procedures that you might find helpful.
Use SSH to connect to your Agent Platform Workbench instance
Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.
gcloud compute ssh --project PROJECT_ID \
--zone ZONE \
INSTANCE_NAME -- -L 8080:localhost:8080
Replace the following:
- PROJECT_ID: Your project ID
- ZONE: The Google Cloud zone where your instance is located
- INSTANCE_NAME: The name of your instance
You can also connect to your instance by opening your instance's Compute Engine detail page, and then clicking the SSH button.
Re-register with the Inverting Proxy server
To re-register the Agent Platform Workbench instance with the internal Inverting Proxy server, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:
cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh
Verify the Docker service status
To verify the Docker service status you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo service docker status
Verify that the Inverting Proxy agent is running
To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your Agent Platform Workbench instance and enter:
# Confirm the Inverting Proxy agent Docker container (proxy-agent) is running
sudo docker ps
# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent
# Grab logs
sudo docker logs proxy-agent
Verify the Jupyter service status and collect logs
To verify the Jupyter service status you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo service jupyter status
To collect Jupyter service logs:
sudo journalctl -u jupyter.service --no-pager
Verify that the Jupyter internal API is active
The Jupyter API should always run on port 8080. You can verify this by inspecting the instance's syslogs for an entry similar to:
Jupyter Server ... running at: http://localhost:8080
To verify that the Jupyter internal API is active you can also use ssh to connect to your Agent Platform Workbench instance and enter:
curl http://127.0.0.1:8080/api/kernelspecs
You can also measure the time it takes for the API to respond in case the requests are taking too long:
time curl -v http://127.0.0.1:8080/api/status
time curl -v http://127.0.0.1:8080/api/kernels
time curl -v http://127.0.0.1:8080/api/connections
To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.
Restart the Docker service
To restart the Docker service, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo service docker restart
Restart the Inverting Proxy agent
To restart the Inverting Proxy agent, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo docker restart proxy-agent
Restart the Jupyter service
To restart the Jupyter service, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo service jupyter restart
Restart the Notebooks Collection Agent
The Notebooks Collection Agent service runs a Python process in the background that verifies the status of the Agent Platform Workbench instance's core services.
To restart the Notebooks Collection Agent service, you can stop and start the VM from the Google Cloud console or you can use ssh to connect to your Agent Platform Workbench instance and enter:
sudo systemctl stop notebooks-collection-agent.service
followed by:
sudo systemctl start notebooks-collection-agent.service
To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.
Modify the Notebooks Collection Agent script
To access and edit the script, open a terminal in your instance or use ssh to connect to your Agent Platform Workbench instance, and enter:
nano /opt/deeplearning/bin/notebooks_collection_agent.py
After editing the file, remember to save it.
Then, you must restart the Notebooks Collection Agent service.
Verify the instance can resolve the required DNS domains
To verify that the instance can resolve the required DNS domains, you can use ssh to connect to your Agent Platform Workbench instance and enter:
host notebooks.googleapis.com
host *.notebooks.cloud.google.com
host *.notebooks.googleusercontent.com
host *.kernels.googleusercontent.com
or:
curl --silent --output /dev/null "https://notebooks.cloud.google.com"; echo $?
If the instance has Managed Service for Apache Spark enabled, you can verify that the instance
resolves *.kernels.googleusercontent.com by running:
curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://${PROJECT_NUMBER}-dot-${REGION}.kernels.googleusercontent.com/api/kernelspecs | jq .
To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.
Make a copy of the user data on an instance
To store a copy of an instance's user data in Cloud Storage, complete the following steps.
Create a Cloud Storage bucket (optional)
In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.
Create a Cloud Storage bucket:
gcloud storage buckets create gs://BUCKET_NAME
Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.
Copy your user data
In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For Agent Platform Workbench instances, you can instead connect to your instance's terminal by using SSH.
Use the gcloud CLI to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.
gcloud storage cp /home/jupyter/* gs://BUCKET_NAME/PATH --recursive
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket
- PATH: the path to the directory where you want to copy your files, for example: /copy/jupyter/
Investigate an instance stuck in provisioning by using gcpdiag
gcpdiag
is an open source tool. It is not an officially supported Google Cloud product.
You can use the gcpdiag tool to help you identify and fix Google Cloud
project issues. For more information, see the
gcpdiag project on GitHub.
The gcpdiag runbook investigates potential causes for an
Agent Platform Workbench instance to get stuck in provisioning status,
including the following areas:
- Status: Checks the instance's current status to ensure that it is stuck in provisioning and not stopped or active.
- Instance's Compute Engine VM boot disk image: Checks whether the instance was created with a custom container, an official workbench-instances image, Deep Learning VM Images, or unsupported images that might cause the instance to get stuck in provisioning status.
- Custom scripts: Checks whether the instance is using custom startup or post-startup scripts that change the default Jupyter port or break dependencies, which might cause the instance to get stuck in provisioning status.
- Environment version: Checks whether the instance is using the latest environment version by checking its upgradability. Earlier versions might cause the instance to get stuck in provisioning status.
- Instance's Compute Engine VM performance: Checks the VM's current performance to ensure that it isn't impaired by high CPU usage, insufficient memory, or disk space issues that might disrupt normal operations.
- Instance's Compute Engine serial port or system logging: Checks whether the instance has serial port logs, which are analyzed to ensure that Jupyter is running on port 127.0.0.1:8080.
- Instance's Compute Engine SSH and terminal access: Checks whether the instance's Compute Engine VM is running so that the user can SSH and open a terminal to verify that space usage in /home/jupyter is lower than 85%. If no space is left, this might cause the instance to get stuck in provisioning status.
- External IP turned off: Checks whether external IP access is turned off. An incorrect networking configuration can cause the instance to get stuck in provisioning status.
Docker
You can
run gcpdiag using a wrapper that starts gcpdiag in a
Docker container. Docker or
Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command.
./gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
    --parameter project_id=PROJECT_ID \
    --parameter instance_name=INSTANCE_NAME \
    --parameter zone=ZONE
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource.
- INSTANCE_NAME: The name of the target Agent Platform Workbench instance within your project.
- ZONE: The zone in which your target Agent Platform Workbench instance is located.
Useful flags:
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the
gcpdiag usage instructions.
Permissions errors when using service account roles with Agent Platform
Issue
You get general permissions errors when you use service account roles with Agent Platform.
These errors can appear in Cloud Logging, in either the product component logs or the audit logs, and in any of the affected projects.
These issues can be caused by one or both of the following:
- Use of the Service Account Token Creator role when the Service Account User role should have been used, or the other way around. These roles grant different permissions on a service account and aren't interchangeable. To learn about the differences between the Service Account Token Creator and Service Account User roles, see Service account roles.
- You've granted a service account permissions across multiple projects, which isn't permitted by default.
Solution
To resolve the issue, try one or more of the following:
- Determine whether the Service Account Token Creator or Service Account User role is needed. To learn more, read the IAM documentation for the Agent Platform services you're using, as well as any other product integrations that you're using.
- If you have granted a service account permissions across multiple projects, enable service accounts to be attached across projects by ensuring that the iam.disableCrossProjectServiceAccountUsage organization policy isn't enforced. To do so, run the following command:
gcloud resource-manager org-policies disable-enforce \
    iam.disableCrossProjectServiceAccountUsage \
    --project=PROJECT_ID