Troubleshoot client connections

If you encounter issues mounting or connecting to a Managed Lustre file system from a client VM or instance, follow these steps to diagnose the problem.

Verify that the Managed Lustre instance is reachable

First, ensure that your Managed Lustre instance is reachable from your client instance:

sudo lctl ping IP_ADDRESS@tcp

To obtain the value of IP_ADDRESS, see Get an instance.

A successful ping returns a response similar to the following:

12345-0@lo
12345-10.115.0.3@tcp

A failed ping returns an error similar to the following:

failed to ping 10.115.0.3@tcp: Input/output error
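
The two outcomes above can be told apart in a script. The following is a minimal sketch, not part of the Managed Lustre tooling: lnet_ping_ok is a hypothetical helper that reads the output of lctl ping on standard input. It assumes the output follows the sample formats shown above, and treats it as a success only if it contains an @tcp NID and no "failed to ping" message.

```shell
# Hypothetical helper: classify `lctl ping` output read from stdin.
# Assumes the success and failure formats match the samples in this section.
lnet_ping_ok() (
  output="$(cat)"
  case "$output" in
    *"failed to ping"*) exit 1 ;;   # explicit failure message
    *"@tcp"*)           exit 0 ;;   # at least one TCP NID was returned
    *)                  exit 1 ;;   # empty or unexpected output
  esac
)

# Example use (replace IP_ADDRESS with your instance's address):
#   sudo lctl ping IP_ADDRESS@tcp 2>&1 | lnet_ping_ok && echo "reachable"
```

Because the helper only inspects text, it can also be run against saved command output.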

If your ping fails:

  • Make sure your Managed Lustre instance and your client instance are in the same VPC network. Compare the output of the following commands:

    gcloud compute instances describe VM_NAME \
      --zone=VM_ZONE \
      --format='get(networkInterfaces[0].network)'
    
    gcloud lustre instances describe INSTANCE_NAME \
      --location=ZONE \
      --format='get(network)'
    

    The output looks like:

    https://www.googleapis.com/compute/v1/projects/my-project/global/networks/my-network
    projects/my-project/global/networks/my-network
    

    The output of the gcloud compute instances describe command is prefixed with https://www.googleapis.com/compute/v1/; everything following that string must match the output of the gcloud lustre instances describe command.

  • Review your VPC network's firewall rules and routing configurations to ensure they allow traffic between your client instance and the Managed Lustre instance.
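
The network comparison described in the first item above can be scripted rather than checked by eye. This is a minimal sketch under the assumption that the two commands return values in the formats shown in the sample output. strip_compute_prefix and same_network are hypothetical helper names, not gcloud features.

```shell
# Hypothetical helper: strip the Compute Engine API prefix from a network URL.
# Values without the prefix are returned unchanged.
strip_compute_prefix() {
  printf '%s\n' "${1#https://www.googleapis.com/compute/v1/}"
}

# Hypothetical helper: succeed if both values name the same VPC network.
same_network() {
  [ "$(strip_compute_prefix "$1")" = "$(strip_compute_prefix "$2")" ]
}

# Example use with the placeholders from the commands above:
#   vm_net="$(gcloud compute instances describe VM_NAME --zone=VM_ZONE \
#     --format='get(networkInterfaces[0].network)')"
#   lustre_net="$(gcloud lustre instances describe INSTANCE_NAME \
#     --location=ZONE --format='get(network)')"
#   same_network "$vm_net" "$lustre_net" || echo "Networks differ"
```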

Check the LNet accept port (legacy instances)

Although the --gke-support-enabled flag is deprecated and no longer required when creating new Managed Lustre instances, you might have existing older instances that were created with this flag.

If you are connecting to a legacy instance where GKE support was enabled, you must configure LNet on all client Compute Engine instances to use accept_port 6988. See Configure LNet for gke-support-enabled instances.

To determine whether an existing instance was configured with this legacy flag, run the following command:

gcloud lustre instances describe INSTANCE_NAME \
  --location=LOCATION | grep gkeSupportEnabled

If the command returns gkeSupportEnabled: true, you must configure LNet on your client VMs to use accept_port 6988.
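
If you check many instances, the same test can be wrapped in a small function. The sketch below assumes the describe output contains the field on its own line, as in the grep example above. gke_support_enabled is a hypothetical name.

```shell
# Hypothetical helper: read `gcloud lustre instances describe` output on stdin
# and succeed only if it contains the line "gkeSupportEnabled: true".
gke_support_enabled() {
  grep -q '^[[:space:]]*gkeSupportEnabled:[[:space:]]*true[[:space:]]*$'
}

# Example use:
#   gcloud lustre instances describe INSTANCE_NAME --location=LOCATION \
#     | gke_support_enabled && echo "Legacy instance: set accept_port 6988"
```

The function fails both when the field is absent and when it is false, so a non-zero exit status means no extra LNet configuration is indicated by this check.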

Ubuntu kernel version mismatch with Lustre client

For Compute Engine instances running Ubuntu, the Ubuntu kernel version must match the specific version of the Lustre client packages. If your Lustre client tools are failing, check whether your Compute Engine instance has auto-upgraded to a newer kernel.

To check your kernel version:

uname -r

The response looks like:

6.8.0-1029-gcp

To check your Lustre client package version:

dpkg -l | grep -i lustre

The response looks like:

ii  lustre-client-modules-6.8.0-1029-gcp 2.14.0-ddn198-1  amd64  Lustre Linux kernel module (kernel 6.8.0-1029-gcp)
ii  lustre-client-utils                  2.14.0-ddn198-1  amd64  Userspace utilities for the Lustre filesystem (client)

If the kernel version reported by uname -r doesn't match the kernel version in the name of the lustre-client-modules package, you must reinstall the Lustre client packages.
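
The comparison can be automated. The following sketch assumes the package naming shown in the sample output above (lustre-client-modules- followed by the kernel version). lustre_modules_match_kernel is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the dpkg listing on stdin contains an
# installed ("ii") lustre-client-modules package built for kernel version $1.
# An optional architecture suffix such as ":amd64" on the name is tolerated.
lustre_modules_match_kernel() {
  awk -v pkg="lustre-client-modules-$1" '
    $1 == "ii" && ($2 == pkg || index($2, pkg ":") == 1) { found = 1 }
    END { exit !found }
  '
}

# Example use:
#   dpkg -l | lustre_modules_match_kernel "$(uname -r)" \
#     || echo "Lustre client modules do not match the running kernel"
```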

Check dmesg for Lustre errors

Many Lustre warnings and errors are logged to the Linux kernel ring buffer. The dmesg command prints the kernel ring buffer.

To search for Lustre-specific messages, use grep in conjunction with dmesg:

dmesg | grep -i lustre

Or, to look for more general errors that might be related:

dmesg | grep -i error
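
The two searches can be narrowed into one filter. The sketch below rests on an assumption that goes beyond this page: messages from LNet, the Lustre networking layer, are logged under their own prefix and are not matched by a search for "lustre" alone. lustre_kernel_messages is a hypothetical helper name.

```shell
# Hypothetical helper: keep only kernel log lines that mention Lustre or LNet.
# Matching is case-insensitive, so "LustreError" and "LNetError" are included.
lustre_kernel_messages() {
  grep -iE 'lustre|lnet'
}

# Example use (-T prints human-readable timestamps):
#   sudo dmesg -T | lustre_kernel_messages | tail -n 50
```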

Mounting Lustre on a multi-NIC VM fails

When a VM has multiple network interface controllers (NICs) and the Managed Lustre instance is on a VPC network connected to a secondary NIC (for example, eth1), mounting the instance might fail. To resolve this issue, follow the instructions to mount using a secondary NIC.

Cannot connect from the 172.17.0.0/16 subnet range

Compute Engine and GKE clients with an IP address in the 172.17.0.0/16 subnet range cannot mount Managed Lustre instances.
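
To check whether a client is affected, you can test its IPv4 addresses against this range. The following is a minimal sketch. in_unsupported_range is a hypothetical helper name, and it assumes its argument is a well-formed dotted-decimal IPv4 address.

```shell
# Hypothetical helper: succeed if the IPv4 address in $1 falls inside
# 172.17.0.0/16. For a /16, a match on the first two octets is sufficient.
in_unsupported_range() {
  case "$1" in
    172.17.*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example use on a Linux client (hostname -I lists the host's addresses):
#   for ip in $(hostname -I); do
#     in_unsupported_range "$ip" && echo "$ip is in 172.17.0.0/16"
#   done
```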

Cannot access Managed Lustre from a peered VPC network

To access your Managed Lustre instance from a VM in a peered VPC network, you must use Network Connectivity Center (NCC). NCC lets you connect multiple VPC networks and on-premises networks to a central hub, providing connectivity between them.

For instructions on how to set up NCC, refer to the Network Connectivity Center documentation.

Mounting fails on Shielded VMs (Secure Boot)

Managed Lustre can't be mounted on Shielded VMs. Attempting to load the Lustre kernel module in a Secure Boot environment fails with the error: ERROR: could not insert 'lustre': Required key not available.
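
To confirm whether Secure Boot is active on a client before troubleshooting further, you can inspect the firmware state from inside the VM. This sketch assumes that the mokutil utility is installed on the client, which this page does not state, and that it reports the state as a line beginning with "SecureBoot enabled". secure_boot_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: read `mokutil --sb-state` output on stdin and succeed
# only if it reports that Secure Boot is enabled.
secure_boot_enabled() {
  grep -qi '^SecureBoot enabled'
}

# Example use (assumes mokutil is available on the client):
#   mokutil --sb-state | secure_boot_enabled \
#     && echo "Secure Boot is on: the lustre module will not load"
```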

Information to include with a support request

If you're unable to resolve the mount failure, gather diagnostic information before creating a support case.

Run the sosreport utility, which collects system logs and configuration information and generates a compressed tarball:

sudo sosreport

Attach the sosreport archive and any relevant output from dmesg to your support case.