Troubleshooting Agent Platform Workbench

This page describes troubleshooting steps that you might find helpful if you run into problems when you use Gemini Enterprise Agent Platform Workbench.

See also Troubleshooting Agent Platform for help using other components of Gemini Enterprise Agent Platform.

Agent Platform Workbench instances

This section describes troubleshooting steps for Agent Platform Workbench instances.

Troubleshooting with AI tools

This section discusses how to use AI tools for troubleshooting.

Troubleshooting with Gemini Cloud Assist investigations

When connecting Agent Platform with other Google Cloud products, you might find Gemini Cloud Assist investigations helpful for troubleshooting integration issues. They can also accelerate troubleshooting on the instance itself. Gemini Cloud Assist lets you draw insights from metrics and logs generated by the instance.

  1. Stop the instance and follow the View in Compute Engine link.
  2. Install the Ops Agent (recommended). This takes a few minutes.
  3. Add a custom metadata field named notebook-enable-debug and set it to true.
  4. Restart the instance and reproduce the issue.
  5. Enable and configure the Cloud Assist Investigations API.
  6. Create a new investigation and describe the issue in detail using a natural language prompt.
  7. As you type, a dialog appears that suggests resources to add to the investigation. Review this list, and be sure to add the instance as a resource, along with any other resources from the list of supported products.
  8. Start the investigation and review the results.
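Assuming you have the gcloud CLI available, the preparation steps above can be sketched as follows. The instance name, zone, and the Cloud Assist API service name are placeholders or assumptions; confirm the service name in the API Library before enabling it.

```shell
# Stop the instance before changing its metadata
gcloud compute instances stop INSTANCE_NAME --zone=ZONE

# Add the debug metadata field
gcloud compute instances add-metadata INSTANCE_NAME --zone=ZONE \
    --metadata=notebook-enable-debug=true

# Restart the instance, then reproduce the issue
gcloud compute instances start INSTANCE_NAME --zone=ZONE

# Enable the Cloud Assist Investigations API
# (service name is an assumption; verify it in the API Library)
gcloud services enable geminicloudassist.googleapis.com
```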

Troubleshoot diagnostic files with Gemini CLI

You can use the results of the Cloud Assist investigation to inform further AI-driven investigation of the diagnostic file from the instance.

  1. Run the diagnostic tool and specify a Cloud Storage bucket to upload the results.
sudo /opt/deeplearning/bin/diagnostic_tool.sh [--repair] [--bucket=$BUCKET]
  2. Download the diagnostic file to your workstation, then install and configure Gemini CLI.
  3. Start the application and describe your issue. Include the hypothesis from the Cloud Assist investigation in the context. Ask the model to extend the investigation by reading the contents of the diagnostic file using natural language prompts.
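For example, assuming the diagnostic tool uploaded an archive to your bucket, you might fetch it and start Gemini CLI from the extracted directory so that the model can read the files. The archive filename shown here is illustrative only.

```shell
# Download and extract the diagnostic archive (filename is illustrative)
gcloud storage cp gs://BUCKET/diagnostic_logs_INSTANCE_NAME.tar.gz .
tar -xzf diagnostic_logs_INSTANCE_NAME.tar.gz

# Start Gemini CLI from the extracted directory so it can read the files
cd diagnostic_logs_INSTANCE_NAME
gemini
```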

Connecting to and opening JupyterLab

This section describes troubleshooting steps for connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

Can't access the terminal in an Agent Platform Workbench instance

Issue

If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your Agent Platform Workbench instance doesn't have terminal access enabled.

Solution

You must create a new Agent Platform Workbench instance with the Terminal access option enabled. This option can't be changed after instance creation.

502 error when opening JupyterLab

Issue

A 502 error might mean that your Agent Platform Workbench instance isn't ready yet.

Solution

Wait a few minutes, refresh the Google Cloud console browser tab, and try again.

Notebook is unresponsive

Issue

Your Agent Platform Workbench instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • Reset your instance.

Unable to connect with Agent Platform Workbench instance using SSH

Issue

You're unable to connect to your instance by using SSH through a terminal window.

Agent Platform Workbench instances use OS Login to enable SSH access. When you create an instance, Agent Platform Workbench enables OS Login by default by setting the metadata key enable-oslogin to TRUE. If you're unable to use SSH to connect to your instance, this metadata key might not be set to TRUE.

Solution

Connecting to an Agent Platform Workbench instance by using the Google Cloud console isn't supported. If you're unable to connect to your instance by using SSH through a terminal window, see the following:

To set the metadata key enable-oslogin to TRUE, use the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Agent Platform SDK.
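For example, using the gcloud command named above (the flag names shown here are an assumption; check `gcloud workbench instances update --help` for the exact syntax in your SDK version):

```shell
# Set the OS Login metadata key on the instance
gcloud workbench instances update INSTANCE_NAME \
    --location=ZONE \
    --metadata=enable-oslogin=TRUE
```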

GPU quota has been exceeded

Issue

You're unable to create an Agent Platform Workbench instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase for Compute Engine GPUs. See Request a higher quota limit.
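You can also inspect GPU quotas from the command line. This sketch dumps a region's quota list as JSON and filters for GPU entries; the grep context window is a convenience, not an exact output format.

```shell
# List quota metrics for a region and filter for GPU entries
gcloud compute regions describe REGION --format=json \
    | grep -B1 -A2 -i gpu
```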

Creating Agent Platform Workbench instances

This section describes how to troubleshoot issues related to creating Agent Platform Workbench instances.

Instance stays in pending state indefinitely or is stuck in provisioning status

Issue

After you create an Agent Platform Workbench instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:

Could not resolve host: notebooks.googleapis.com

If your instance is stuck in provisioning status, this could be because you have an invalid private networking configuration for your instance.

Solution

Follow the steps in the Instance logs show connection or timeout errors section.

Unable to create an instance within a Shared VPC network

Issue

Attempting to create an instance within a Shared VPC network results in an error message like the following:

Required 'compute.subnetworks.use' permission for
'projects/network-administration/regions/us-central1/subnetworks/v'

Solution

The issue is that the Notebooks Service Account is attempting to create the instance without the correct permissions.

To ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network, ask your administrator to grant the Compute Network User role (roles/compute.networkUser) to the Notebooks Service Account on the host project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to ensure that the Notebooks Service Account can create an Agent Platform Workbench instance within a Shared VPC network:

  • To use subnetworks: compute.subnetworks.use

Your administrator might also be able to give the Notebooks Service Account these permissions with custom roles or other predefined roles.
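A hedged sketch of the grant on the host project. The service agent address format shown here is an assumption; substitute the actual Notebooks Service Account shown in your IAM page if it differs.

```shell
# Grant Compute Network User to the Notebooks Service Account
# on the Shared VPC host project (account address is an assumption)
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
```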

Can't create an Agent Platform Workbench instance with a custom container

Issue

There isn't an option to use a custom container when creating an Agent Platform Workbench instance in the Google Cloud console.

Solution

Adding a custom container to an Agent Platform Workbench instance isn't supported, and you can't add a custom container by using the Google Cloud console.

Adding a conda environment is recommended instead of using a custom container.

You can add a custom container to an Agent Platform Workbench instance by using the Notebooks API, but this capability isn't supported.

Unable to use the Gemini CLI

Issue

The Gemini CLI tile is in the JupyterLab launcher and opens successfully, but Gemini doesn't respond to prompts.

Solution

An administrator might have blocked access to the Gemini CLI. See Control access to the Gemini CLI.

Mount shared storage button isn't there

Issue

The Mount shared storage button isn't in the File Browser tab of the JupyterLab interface.

Solution

The storage.buckets.list permission is required for the Mount shared storage button to appear in the JupyterLab interface of your Agent Platform Workbench instance. Ask your administrator to grant your Agent Platform Workbench instance's service account the storage.buckets.list permission on the project.
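One way to sketch the grant, with placeholders: ROLE must be a predefined or custom role that includes the storage.buckets.list permission, and SERVICE_ACCOUNT_EMAIL is the instance's service account.

```shell
# ROLE must include the storage.buckets.list permission
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role=ROLE
```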

599 error when using Managed Service for Apache Spark

Issue

Attempting to create a Managed Service for Apache Spark-enabled instance results in an error message like the following:

HTTP 599: Unknown (Error from Gateway: [Timeout while connecting]
Exception while attempting to connect to Gateway server url.
Ensure gateway url is valid and the Gateway instance is running.)

Solution

In your Cloud DNS configuration, add a Cloud DNS entry for the *.googleusercontent.com domain.
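A sketch of that entry with the gcloud CLI, assuming you already have a private managed zone for the googleusercontent.com domain. The record data shown is the private.googleapis.com address range, which is a common choice for Private Google Access setups but is an assumption; use the addresses appropriate to your network configuration.

```shell
# Add a wildcard A record in an existing private zone (zone name and
# record data are assumptions; adjust to your environment)
gcloud dns record-sets create "*.googleusercontent.com." \
    --zone=PRIVATE_ZONE_NAME \
    --type=A \
    --ttl=300 \
    --rrdatas=199.36.153.8,199.36.153.9,199.36.153.10,199.36.153.11
```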

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in Agent Platform Workbench instances.

Unable to edit underlying virtual machine

Issue

When you try to edit the underlying virtual machine (VM) of an Agent Platform Workbench instance, you might get an error message similar to the following:

Current principal doesn't have permission to mutate this resource.

Solution

This error occurs because you can't edit the underlying VM of an instance by using the Google Cloud console or the Compute Engine API.

To edit an Agent Platform Workbench instance's underlying VM, use the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Agent Platform SDK.

pip packages aren't available after adding conda environment

Issue

Your pip packages aren't available after you add a conda-based kernel.

Solution

To resolve the issue, see Add a conda environment and try the following:

  • Check that you used the DL_ANACONDA_ENV_HOME variable and that it contains the name of your environment.

  • Check that pip is located in a path similar to /opt/conda/envs/ENVIRONMENT/bin/pip. You can run the which pip command to get the path.
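For example, in a terminal on the instance (the environment name is a placeholder, and `conda activate` assumes conda has been initialized for your shell):

```shell
# Activate the environment, then confirm that pip resolves inside it
conda activate ENVIRONMENT
which pip    # expect a path like /opt/conda/envs/ENVIRONMENT/bin/pip
pip list     # your installed packages should appear here
```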

Unable to access or copy data of an instance with single user access

Issue

The data on an instance with single user access is inaccessible.

For Agent Platform Workbench instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.

Solution

To access or copy the data when you aren't the owner of the instance, open a support case.

Unexpected shutdown

Issue

Your Agent Platform Workbench instance shuts down unexpectedly.

Solution

If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.

If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.
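To check whether idle shutdown is configured, you can inspect the instance metadata. The idle-timeout-seconds key name is an assumption based on common Workbench configurations; verify it against your instance's actual metadata keys.

```shell
# Look for an idle timeout setting in the instance metadata
# (key name is an assumption; inspect the full metadata if no match)
gcloud compute instances describe INSTANCE_NAME --zone=ZONE \
    --format=json | grep -A1 idle-timeout
```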

Instance logs show connection or timeout errors

Issue

Your Agent Platform Workbench instance's logs show connection or timeout errors.

Solution

If you notice connection or timeout errors in the instance's logs, make sure that the Jupyter server is running on port 8080. Follow the steps in the Verify that the Jupyter internal API is active section.

If you have turned off external IP access and you are using a private VPC network, make sure that you have also followed the network configuration options documentation.

Instance logs show 'Unable to contact Jupyter API' or 'ReadTimeoutError'

Issue

Your Agent Platform Workbench instance logs show an error such as:

notebooks_collection_agent. Unable to contact Jupyter API:
HTTPConnectionPool(host=\'127.0.0.1\', port=8080):
Max retries exceeded ReadTimeoutError(\"HTTPConnectionPool(host=\'127.0.0.1\', port=8080

Solution

Follow the steps in the Instance logs show connection or timeout errors section. You can also try modifying the Notebooks Collection Agent script to change HTTP_TIMEOUT_SESSION to a larger value, for example 60. This helps you verify whether the request fails because the call takes too long to respond or because the requested URL can't be reached.
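A sketch of that modification with sed instead of an interactive editor, assuming the variable is assigned at the top level of the script as `HTTP_TIMEOUT_SESSION = <value>` (verify the exact assignment before editing):

```shell
# Raise the timeout to 60 seconds (assumes a top-level assignment)
sudo sed -i 's/^HTTP_TIMEOUT_SESSION = .*/HTTP_TIMEOUT_SESSION = 60/' \
    /opt/deeplearning/bin/notebooks_collection_agent.py

# Restart the service so the change takes effect
sudo systemctl restart notebooks-collection-agent.service
```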

docker0 address conflicts with VPC addressing

Issue

By default, the docker0 interface is created with an IP address of 172.17.0.1/16. This might conflict with the IP addressing in your VPC network such that the instance is unable to connect to other endpoints with 172.17.0.1/16 addresses.

Solution

You can force the docker0 interface to be created with an IP address that doesn't conflict with your VPC network by using the following post-startup script and setting the post-startup script behavior to run_once.

#!/bin/bash
# Wait for Docker to be fully started
while ! systemctl is-active docker; do
  sleep 1
done
# Stop the Docker service
systemctl stop docker
# Write a custom bridge IP range to /etc/docker/daemon.json
cat <<EOF > /etc/docker/daemon.json
{
  "bip": "CUSTOM_DOCKER_IP/16"
}
EOF
# Restart the Docker service
systemctl start docker

Specified reservations don't exist

Issue

The operation to create the instance results in a Specified reservations do not exist error message. The operation's output might be similar to the following:

{
  "name": "projects/PROJECT/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.notebooks.v2.OperationMetadata",
    "createTime": "2025-01-01T01:00:01.000000000Z",
    "endTime": "2025-01-01T01:00:01.000000000Z",
    "target": "projects/PROJECT/locations/LOCATION/instances/INSTANCE_NAME",
    "verb": "create",
    "requestedCancellation": false,
    "apiVersion": "v2",
    "endpoint": "CreateInstance"
  },
  "done": true,
  "error": {
    "code": 3,
    "message": "Invalid value for field 'resource.reservationAffinity': '{  \"consumeReservationType\": \"SPECIFIC_ALLOCATION\",  \"key\": \"compute.googleapis.com/reservation-name...'. Specified reservations [projects/PROJECT/zones/ZONE/futureReservations/RESERVATION_NAME] do not exist.",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.RequestInfo",
        "requestId": "REQUEST_ID"
      }
    ]
  }
}

Solution

Some Compute Engine machine types require additional parameters at creation, such as local SSDs or a minimum CPU platform. The instance specification must include these additional fields.

  • Agent Platform Workbench instances use the automatic minimum CPU platform by default. If your reservation sets a specific platform, you need to set min_cpu_platform accordingly when creating Agent Platform Workbench instances.
  • Agent Platform Workbench instances always set the number of local SSDs to the default value for the machine type. For example, a2-ultragpu-1g always has 1 local SSD, while a2-highgpu-1g always has 0. When creating reservations to be used for Agent Platform Workbench instances, you need to leave the local SSD count at its default value.

Helpful procedures

This section describes procedures that you might find helpful.

Use SSH to connect to your Agent Platform Workbench instance

Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute ssh --project PROJECT_ID \
  --zone ZONE \
  INSTANCE_NAME -- -L 8080:localhost:8080

Replace the following:

  • PROJECT_ID: Your project ID
  • ZONE: The Google Cloud zone where your instance is located
  • INSTANCE_NAME: The name of your instance

You can also connect to your instance by opening your instance's Compute Engine detail page, and then clicking the SSH button.

Re-register with the Inverting Proxy server

To re-register the Agent Platform Workbench instance with the internal Inverting Proxy server, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:

cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh

Verify the Docker service status

To verify the Docker service status you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo service docker status

Verify that the Inverting Proxy agent is running

To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your Agent Platform Workbench instance and enter:

# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps

# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent

# Grab logs
sudo docker logs proxy-agent

Verify the Jupyter service status and collect logs

To verify the Jupyter service status you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo service jupyter status

To collect Jupyter service logs:

sudo journalctl -u jupyter.service --no-pager

Verify that the Jupyter internal API is active

The Jupyter API should always run on port 8080. You can verify this by inspecting the instance's syslogs for an entry similar to:

Jupyter Server ... running at:
http://localhost:8080

To verify that the Jupyter internal API is active you can also use ssh to connect to your Agent Platform Workbench instance and enter:

curl http://127.0.0.1:8080/api/kernelspecs

You can also measure the time it takes for the API to respond in case the requests are taking too long:

time curl http://127.0.0.1:8080/api/status
time curl http://127.0.0.1:8080/api/kernels
time curl http://127.0.0.1:8080/api/connections

To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.

Restart the Docker service

To restart the Docker service, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo service docker restart

Restart the Inverting Proxy agent

To restart the Inverting Proxy agent, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo docker restart proxy-agent

Restart the Jupyter service

To restart the Jupyter service, you can stop and start the VM from the Instances page or you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo service jupyter restart

Restart the Notebooks Collection Agent

The Notebooks Collection Agent service runs a Python process in the background that verifies the status of the Agent Platform Workbench instance's core services.

To restart the Notebooks Collection Agent service, you can stop and start the VM from the Google Cloud console or you can use ssh to connect to your Agent Platform Workbench instance and enter:

sudo systemctl stop notebooks-collection-agent.service

followed by:

sudo systemctl start notebooks-collection-agent.service

To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.

Modify the Notebooks Collection Agent script

To access and edit the script, open a terminal in your instance or use ssh to connect to your Agent Platform Workbench instance, and enter:

nano /opt/deeplearning/bin/notebooks_collection_agent.py

After editing the file, remember to save it.

Then, you must restart the Notebooks Collection Agent service.

Verify the instance can resolve the required DNS domains

To verify that the instance can resolve the required DNS domains, you can use ssh to connect to your Agent Platform Workbench instance and enter:

host notebooks.googleapis.com
host *.notebooks.cloud.google.com
host *.notebooks.googleusercontent.com
host *.kernels.googleusercontent.com

or:

curl --silent --output /dev/null "https://notebooks.cloud.google.com"; echo $?

If the instance has Managed Service for Apache Spark enabled, you can verify that the instance resolves *.kernels.googleusercontent.com by running:

curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://${PROJECT_NUMBER}-dot-${REGION}.kernels.googleusercontent.com/api/kernelspecs | jq .

To run these commands in your Agent Platform Workbench instance, open JupyterLab, and create a new terminal.

Make a copy of the user data on an instance

To store a copy of an instance's user data in Cloud Storage, complete the following steps.

Create a Cloud Storage bucket (optional)

In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.

  • Create a Cloud Storage bucket:
    gcloud storage buckets create gs://BUCKET_NAME
    Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.

Copy your user data

  1. In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For Agent Platform Workbench instances, you can instead connect to your instance's terminal by using SSH.

  2. Use the gcloud CLI to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.

    gcloud storage cp /home/jupyter/* gs://BUCKET_NAMEPATH --recursive
    

    Replace the following:

    • BUCKET_NAME: the name of your Cloud Storage bucket
    • PATH: the path to the directory where you want to copy your files, for example: /copy/jupyter/

Investigate an instance stuck in provisioning by using gcpdiag

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

This gcpdiag runbook investigates potential causes for an Agent Platform Workbench instance to get stuck in provisioning status, including the following areas:
  • Status: Checks the instance's current status to ensure that it is stuck in provisioning and not stopped or active.
  • Instance's Compute Engine VM boot disk image: Checks whether the instance was created with a custom container, an official workbench-instances image, Deep Learning VM Images, or unsupported images that might cause the instance to get stuck in provisioning status.
  • Custom scripts: Checks whether the instance is using custom startup or post-startup scripts that change the default Jupyter port or break dependencies that might cause the instance to get stuck in provisioning status.
  • Environment version: Checks whether the instance is using the latest environment version by checking its upgradability. Earlier versions might cause the instance to get stuck in provisioning status.
  • Instance's Compute Engine VM performance: Checks the VM's current performance to ensure that it isn't impaired by high CPU usage, insufficient memory, or disk space issues that might disrupt normal operations.
  • Instance's Compute Engine serial port or system logging: Checks whether the instance has serial port logs, which are analyzed to ensure that Jupyter is running on 127.0.0.1:8080.
  • Instance's Compute Engine SSH and terminal access: Checks whether the instance's Compute Engine VM is running so that the user can SSH in and open a terminal to verify that space usage in /home/jupyter is lower than 85%. If no space is left, this might cause the instance to get stuck in provisioning status.
  • External IP turned off: Checks whether external IP access is turned off. An incorrect networking configuration can cause the instance to get stuck in provisioning status.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command.
    ./gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
        --parameter project_id=PROJECT_ID \
        --parameter instance_name=INSTANCE_NAME \
        --parameter zone=ZONE

View available parameters for this runbook.

Replace the following:

  • PROJECT_ID: The ID of the project containing the resource.
  • INSTANCE_NAME: The name of the target Agent Platform Workbench instance within your project.
  • ZONE: The zone in which your target Agent Platform Workbench instance is located.

Useful flags:

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

Permissions errors when using service account roles with Agent Platform

Issue

You get general permissions errors when you use service account roles with Agent Platform.

These errors can appear in Cloud Logging in either the product component logs or the audit logs. They might also appear in any of the affected projects.

These issues can be caused by one or both of the following:

  • Use of the Service Account Token Creator role when the Service Account User role should have been used, or the other way around. These roles grant different permissions on a service account and aren't interchangeable. To learn about the differences between the Service Account Token Creator and Service Account User roles, see Service account roles.

  • You've granted a service account permissions across multiple projects, which isn't permitted by default.

Solution

To resolve the issue, try one or more of the following:

  • Determine whether the Service Account Token Creator or Service Account User role is needed. To learn more, read the IAM documentation for the Agent Platform services you are using, as well as any other product integrations that you are using.

  • If you have granted a service account permissions across multiple projects, enable service accounts to be attached across projects by ensuring that the iam.disableCrossProjectServiceAccountUsage organization policy constraint isn't enforced. To do so, run the following command:

    gcloud resource-manager org-policies disable-enforce \
      iam.disableCrossProjectServiceAccountUsage \
      --project=PROJECT_ID
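To confirm the resulting state, you can describe the constraint and check that enforcement is off:

```shell
# Inspect the current state of the constraint for the project;
# booleanPolicy.enforced should be absent or false
gcloud resource-manager org-policies describe \
    iam.disableCrossProjectServiceAccountUsage \
    --project=PROJECT_ID
```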