Troubleshooting and known issues

This page includes troubleshooting steps for some common issues and errors.

Known issues

  • When no I/O is happening on the file system, performance graphs show the message "No data is available for the selected time frame". This is because performance metrics are generated only when there is I/O. For more information about metrics, see Monitor instances and operations.
  • Managed Lustre does not support VPC Service Controls (VPC-SC).
  • The following Lustre features are not supported:
    • Client-side data compression
    • Persistent client caching
  • Some Lustre commands are not supported.
  • Some exceptions to POSIX compliance exist.

Compute Engine issues

If you encounter issues mounting a Managed Lustre file system on a Compute Engine instance, follow these steps to diagnose the problem.

Verify that the Managed Lustre instance is reachable

First, ensure that your Managed Lustre instance is reachable from your Compute Engine instance:

sudo lctl ping IP_ADDRESS@tcp

To obtain the value of IP_ADDRESS, see Get an instance.

A successful ping returns a response similar to the following:

12345-0@lo
12345-10.115.0.3@tcp

A failed ping returns the following:

failed to ping 10.115.0.3@tcp: Input/output error

If your ping fails:

  • Make sure your Managed Lustre instance and your Compute Engine instance are in the same VPC network. Compare the output of the following commands:

    gcloud compute instances describe VM_NAME \
      --zone=VM_ZONE \
      --format='get(networkInterfaces[0].network)'
    
    gcloud lustre instances describe INSTANCE_NAME \
      --location=ZONE --format='get(network)'
    

    The output looks like:

    https://www.googleapis.com/compute/v1/projects/my-project/global/networks/my-network
    projects/my-project/global/networks/my-network
    

    The output of the gcloud compute instances describe command is prefixed with https://www.googleapis.com/compute/v1/; everything after that prefix must match the output of the gcloud lustre instances describe command. A scripted comparison is sketched after this list.

  • Review your VPC network's firewall rules and routing configurations to ensure they allow traffic between your Compute Engine instance and the Managed Lustre instance.
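
The network comparison in the first item can be scripted. The following is a minimal sketch in bash that uses the same placeholder values as the commands above:

VM_NETWORK="$(gcloud compute instances describe VM_NAME \
  --zone=VM_ZONE \
  --format='get(networkInterfaces[0].network)')"
LUSTRE_NETWORK="$(gcloud lustre instances describe INSTANCE_NAME \
  --location=ZONE --format='get(network)')"

# Strip the Compute Engine API prefix before comparing the two values.
if [ "${VM_NETWORK#https://www.googleapis.com/compute/v1/}" = "${LUSTRE_NETWORK}" ]; then
  echo "Networks match"
else
  echo "Networks differ: ${VM_NETWORK} vs ${LUSTRE_NETWORK}"
fi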

Check the LNet accept port

Managed Lustre instances can be configured to support GKE clients by specifying the --gke-support-enabled flag at the time of creation.

If GKE support has been enabled, you must configure LNet on all Compute Engine instances to use accept_port 6988. See Configure LNet for gke-support-enabled instances.

To determine whether the instance has been configured to support GKE clients, run the following command:

gcloud lustre instances describe INSTANCE_NAME \
  --location=LOCATION | grep gkeSupportEnabled

If the command returns gkeSupportEnabled: true, you must configure LNet.
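
As a quick check on the client, you can read the LNet module's current accept_port value from sysfs. This is a minimal sketch and assumes the lnet kernel module is already loaded:

# Prints the port the LNet acceptor is configured to use; for
# gke-support-enabled instances this value must be 6988.
cat /sys/module/lnet/parameters/accept_port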

Ubuntu kernel version mismatch with Lustre client

For Compute Engine instances running Ubuntu, the Ubuntu kernel version must match the specific version of the Lustre client packages. If your Lustre client tools are failing, check whether your Compute Engine instance has auto-upgraded to a newer kernel.

To check your kernel version:

uname -r

The response looks like:

6.8.0-1029-gcp

To check your Lustre client package version:

dpkg -l | grep -i lustre

The response looks like:

ii  lustre-client-modules-6.8.0-1029-gcp 2.14.0-ddn198-1  amd64  Lustre Linux kernel module (kernel 6.8.0-1029-gcp)
ii  lustre-client-utils                  2.14.0-ddn198-1  amd64  Userspace utilities for the Lustre filesystem (client)

If the kernel version reported by uname -r doesn't match the kernel version in the installed lustre-client-modules package name, you must re-install the Lustre client packages.
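
The following is a minimal sketch, for Debian or Ubuntu clients, that checks whether a lustre-client-modules package matching the running kernel is installed:

KVER="$(uname -r)"
# dpkg -l prints an "ii" status line only for installed packages.
if dpkg -l "lustre-client-modules-${KVER}" 2>/dev/null | grep -q '^ii'; then
  echo "Lustre client modules match kernel ${KVER}"
else
  echo "No Lustre client modules for kernel ${KVER}; re-install the Lustre client packages"
fi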

Check dmesg for Lustre errors

Many Lustre warnings and errors are logged to the Linux kernel ring buffer. The dmesg command prints the kernel ring buffer.

To search for Lustre-specific messages, use grep in conjunction with dmesg:

dmesg | grep -i lustre

Or, to look for more general errors that might be related:

dmesg | grep -i error
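
You can also add human-readable timestamps and include LNet messages, which often accompany Lustre errors. For example:

dmesg -T | grep -iE 'lustre|lnet'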

Information to include with a support request

If you're unable to resolve the mount failure, gather diagnostic information before creating a support case.

Run sosreport: This utility collects system logs and configuration information and generates a compressed tarball:

sudo sosreport

Attach the sosreport archive and any relevant output from dmesg to your support case.

GKE issues

Before following the troubleshooting steps in this section, refer to the limitations when connecting to Managed Lustre from GKE.

Incorrect performance tier for dynamically provisioned Lustre instances

When dynamically provisioning a Lustre instance, the instance creation fails with an InvalidArgument error for PerUnitStorageThroughput, regardless of the perUnitStorageThroughput value specified in the API request. This affects GKE 1.33 versions before 1.33.4-gke.1036000.

Workaround:

Upgrade the GKE cluster to version 1.33.4-gke.1036000 or later. If you use the Stable channel, a newer version might not be available yet; in that case, manually select a version from the Regular or Rapid channel that includes the fix.

Managed Lustre communication ports

The Managed Lustre CSI driver uses different ports for communication with Managed Lustre instances, depending on your GKE cluster version and existing Managed Lustre configurations.

  • Default port (988): For new GKE clusters that run version 1.33.2-gke.4780000 or later, the driver uses port 988 for Lustre communication by default.

  • Legacy port (6988): The driver uses port 6988 in the following scenarios:

    • Earlier GKE versions: If your GKE cluster runs a version earlier than 1.33.2-gke.4780000, the --enable-legacy-lustre-port flag is required when enabling the CSI driver. Enabling this flag works around a port conflict with the gke-metadata-server on GKE nodes.
    • Existing Managed Lustre instances with GKE support: If you are connecting to an existing Managed Lustre instance that was created with the --gke-support-enabled flag, you must include --enable-legacy-lustre-port when enabling the CSI driver, irrespective of your cluster version. Without this flag, your GKE cluster will fail to mount the existing Lustre instance.

    For more details on enabling the CSI driver with the legacy port, see Lustre communication ports.

Log queries

To check logs, run the following query in Logs Explorer.

To return Managed Lustre CSI driver node server logs:

resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-node*"
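
You can also run the same filter from the command line with the gcloud CLI. For example, with an arbitrary result limit:

gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.pod_name=~"lustre-csi-node*"' \
  --limit=50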

Troubleshoot volume provisioning

If the PersistentVolumeClaim (PVC) remains in a Pending state and no PersistentVolume (PV) is created after 20-30 minutes, an error might have occurred.

  1. Check the PVC events:

    kubectl describe pvc PVC_NAME
    
  2. If the error indicates configuration issues or invalid arguments, verify your StorageClass parameters.

  3. Recreate the PVC (see the sketch after this list).

  4. If the issue persists, contact Cloud Customer Care.
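
The following is a minimal sketch of steps 2 and 3, assuming the PVC manifest is saved locally as pvc.yaml and that STORAGE_CLASS_NAME is the class referenced by the PVC:

# Inspect the StorageClass parameters referenced by the PVC.
kubectl get pvc PVC_NAME -o jsonpath='{.spec.storageClassName}{"\n"}'
kubectl get storageclass STORAGE_CLASS_NAME -o yaml

# Recreate the PVC.
kubectl delete pvc PVC_NAME
kubectl apply -f pvc.yaml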

Troubleshoot volume mounting

After the Pod is scheduled to a node, the volume is mounted. If this fails, check the Pod events and kubelet logs.

kubectl describe pod POD_NAME
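
If the Pod events are inconclusive, check the logs of the Managed Lustre CSI driver Pod running on the same node. The namespace and pod name pattern in this sketch are assumptions based on the log query earlier on this page:

# Find the node the Pod was scheduled to.
NODE="$(kubectl get pod POD_NAME -o jsonpath='{.spec.nodeName}')"

# List the CSI driver Pod on that node, then read its logs.
kubectl get pods -n kube-system -o wide \
  --field-selector spec.nodeName="${NODE}" | grep lustre-csi-node
kubectl logs -n kube-system LUSTRE_CSI_NODE_POD_NAME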

CSI driver enablement issues

Symptom:

MountVolume.MountDevice failed for volume "yyy" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers

or

MountVolume.SetUp failed for volume "yyy" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers

Cause: The CSI driver is not enabled or not yet running.

Resolution:

  1. Verify the CSI driver is enabled (see the sketch after this list).
  2. If the cluster was recently scaled or upgraded, wait a few minutes for the driver to become functional.
  3. If the error persists, check the lustre-csi-node logs for "Operation not permitted". This indicates that the node version is too old to support Managed Lustre. To resolve this, upgrade your node pool to version 1.33.2-gke.1111000 or above.
  4. If the logs show "LNET_PORT mismatch", upgrade your node pool to ensure compatible Lustre kernel modules are installed.
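
The following is a minimal sketch of these checks; the driver Pod name and namespace are assumptions based on the log query earlier on this page:

# Confirm the CSI driver object exists and is registered on the node.
kubectl get csidriver lustre.csi.storage.gke.io
kubectl get csinode NODE_NAME -o jsonpath='{.spec.drivers[*].name}{"\n"}'

# Search the node driver logs for the errors described in steps 3 and 4.
kubectl logs -n kube-system LUSTRE_CSI_NODE_POD_NAME | grep -iE 'Operation not permitted|LNET_PORT'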

Mountpoint already exists

Symptom:

MountVolume.MountDevice failed for volume "yyy" : rpc error: code = AlreadyExists
desc = A mountpoint with the same lustre filesystem name "yyy" already exists on
node "yyy". Please mount different lustre filesystems

Cause: Mounting multiple volumes from different Managed Lustre instances with the same file system name on a single node is not supported.

Resolution: Use a unique file system name for each Managed Lustre instance.

Mount failed: No such file or directory

Symptom:

MountVolume.MountDevice failed for volume "yyy" : rpc error: code = Internal desc = Could not mount ... failed: No such file or directory

Cause: The file system name specified is incorrect or does not exist.

Resolution: Verify the fs_name in your StorageClass or PV configuration matches the Managed Lustre instance.
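
To confirm the file system name on the instance side, you can describe the Managed Lustre instance; the grep pattern below assumes the relevant output field name contains "filesystem":

gcloud lustre instances describe INSTANCE_NAME \
  --location=LOCATION | grep -i filesystem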

Mount failed: Input/output error

Symptom:

MountVolume.MountDevice failed for volume "yyy" : rpc error: code = Internal desc = Could not mount ... failed: Input/output error

Cause: The cluster cannot connect to the Managed Lustre instance.

Resolution:

  1. Verify the IP address of the Managed Lustre instance (see the sketch after this list).
  2. Ensure the GKE cluster and Managed Lustre instance are in the same VPC network or are correctly peered.
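
The following is a minimal sketch of both checks, run from a client that is in the same VPC network and has the Lustre utilities installed:

# Look up the instance details, including its IP address and network.
gcloud lustre instances describe INSTANCE_NAME --location=LOCATION

# Test LNet reachability to the instance's IP address.
sudo lctl ping IP_ADDRESS@tcp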

Internal errors

Symptom: rpc error: code = Internal desc = ...

Resolution: If the error persists, contact Cloud Customer Care.

Troubleshoot volume unmounting

Symptom:

UnmountVolume.TearDown failed for volume "yyy" : rpc error: code = Internal desc = ...

Resolution:

  1. Force delete the Pod:

    kubectl delete pod POD_NAME --force
    
  2. If the issue persists, contact Cloud Customer Care.

Troubleshoot volume deletion

If the PV remains in a "Released" state for an extended period (e.g., more than an hour) after deleting the PVC, contact Cloud Customer Care.

Troubleshoot volume expansion

PVC stuck in ExternalExpanding

Symptom: The PVC status does not change to Resizing, and events show ExternalExpanding.

Cause: The allowVolumeExpansion field might be missing or set to false.

Resolution: Ensure the StorageClass has allowVolumeExpansion: true.

kubectl get storageclass STORAGE_CLASS_NAME -o yaml
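
Because allowVolumeExpansion is a mutable StorageClass field, you can enable it in place. For example:

kubectl patch storageclass STORAGE_CLASS_NAME \
  -p '{"allowVolumeExpansion": true}'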

Expansion fails: Invalid Argument

Symptom: VolumeResizeFailed: rpc error: code = InvalidArgument ...

Cause: The requested size is invalid (e.g., not a multiple of the step size or outside limits).

Resolution: Check the valid capacity ranges and update the PVC with a valid size.
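
To update the PVC, you can patch its requested capacity. In the following sketch, NEW_SIZE is a placeholder for a capacity that falls within the allowed range and step size:

kubectl patch pvc PVC_NAME \
  -p '{"spec":{"resources":{"requests":{"storage":"NEW_SIZE"}}}}'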

Expansion fails: Internal Error

Symptom: VolumeResizeFailed ... rpc error: code = Internal

Resolution: Retry the expansion by reapplying the PVC. If it fails repeatedly, contact Cloud Customer Care.

Deadline Exceeded

Symptom: VolumeResizeFailed with DEADLINE_EXCEEDED.

Cause: The operation is taking longer than expected but may still be in progress.

Resolution: Wait for the operation to complete. The resizer retries automatically. If it remains stuck for a long time (e.g., > 90 minutes), contact Cloud Customer Care.

Quota Exceeded

Symptom: Expansion fails due to quota limits.

Resolution: Request a quota increase or request a smaller capacity increase.

VPC network issues

The following sections describe common VPC network issues.

Managed Lustre does not support VPC-SC

Managed Lustre does not support VPC Service Controls (VPC-SC).

The Google Cloud project where you create your Managed Lustre instance must not be part of any VPC-SC perimeter.

GKE and Compute Engine clients connecting to your Managed Lustre instance must also be outside of any VPC-SC perimeter.

Can't access Managed Lustre from a peered project

To access your Managed Lustre instance from a VM in a peered VPC network, you must use Network Connectivity Center (NCC). NCC lets you connect multiple VPC networks and on-premises networks to a central hub, providing connectivity between them.

For instructions on how to set up NCC, refer to the Network Connectivity Center documentation.

Mounting Lustre on a multi-NIC VM fails

When a VM has multiple network interface controllers (NICs), and the Managed Lustre instance is on a VPC connected to a secondary NIC (e.g., eth1), mounting the instance may fail. To resolve this issue, follow the instructions to mount using a secondary NIC.

Cannot connect from the 172.17.0.0/16 subnet range

Compute Engine and GKE clients with an IP address in the 172.17.0.0/16 subnet range cannot mount Managed Lustre instances.

Permission denied to add peering for service servicenetworking.googleapis.com

ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.

This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.

See Access control with IAM for instructions on adding one of the following roles to your account (a gcloud sketch follows the list):

  • roles/compute.networkAdmin
  • roles/servicenetworking.networksAdmin
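
For example, a project administrator can grant one of these roles with the following command, where PROJECT_ID and USER_EMAIL are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/servicenetworking.networksAdmin"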

Cannot modify allocated ranges in CreateConnection

ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection."

This error is returned when a VPC peering has already been created on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges:

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com \
  --force

Or, add the new IP range to the existing connection:

  1. Retrieve the list of existing IP ranges for the peering:

    EXISTING_RANGES="$(
      gcloud services vpc-peerings list \
        --network=NETWORK_NAME \
        --service=servicenetworking.googleapis.com \
        --format="value(reservedPeeringRanges.list())" \
        --flatten=reservedPeeringRanges
    )"
    
  2. Then, add the new range to the peering:

    gcloud services vpc-peerings update \
      --network=NETWORK_NAME \
      --ranges="${EXISTING_RANGES}",IP_RANGE_NAME \
      --service=servicenetworking.googleapis.com
    

IP address range exhausted

If instance creation fails with an IP address range exhausted error:

ERROR: (gcloud.alpha.lustre.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted

Follow the VPC guide to modify the existing private connection to add IP address ranges.

We recommend a prefix length of at least /20 (4,096 addresses).
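
The following is a minimal sketch that reserves a new /20 range and attaches it to the existing private connection; NEW_RANGE_NAME and EXISTING_RANGE_NAME are placeholders:

# Reserve a new /20 range for private services access.
gcloud compute addresses create NEW_RANGE_NAME \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=20 \
  --network=NETWORK_NAME

# Attach the new range alongside the existing range(s).
gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=EXISTING_RANGE_NAME,NEW_RANGE_NAME \
  --service=servicenetworking.googleapis.com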