Troubleshooting and known issues

This page includes troubleshooting steps for some common issues and errors.

Known issues

Compute Engine issues

When encountering issues mounting a Managed Lustre file system on a Compute Engine instance, follow these steps to diagnose the problem.

Verify that the Managed Lustre instance is reachable

First, ensure that your Managed Lustre instance is reachable from your Compute Engine instance:

sudo lctl ping IP_ADDRESS@tcp

To obtain the value of IP_ADDRESS, see Get an instance.

A successful ping returns a response similar to the following:

 12345-0@lo
12345-10.115.0.3@tcp 

A failed ping returns the following:

 failed to ping 10.115.0.3@tcp: Input/output error 

If your ping fails:

  • Make sure your Managed Lustre instance and your Compute Engine instance are in the same VPC network. Compare the output of the following commands:

    gcloud compute instances describe VM_NAME \
        --zone=VM_ZONE \
        --format='get(networkInterfaces[0].network)'

    gcloud lustre instances describe INSTANCE_NAME \
        --location=ZONE \
        --format='get(network)'

    The output looks like:

     https://www.googleapis.com/compute/v1/projects/my-project/global/networks/my-network
    projects/my-project/global/networks/my-network 
    

    The output of the gcloud compute instances describe command is prefixed with https://www.googleapis.com/compute/v1/; everything after that prefix must match the output of the gcloud lustre instances describe command (a comparison sketch follows this list).

  • Review your VPC network's firewall rules and routing configurations to ensure they allow traffic between your Compute Engine instance and the Managed Lustre instance.
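As a convenience, the comparison can be scripted. The following is a minimal sketch that reuses the two describe commands above with the same placeholder names; it simply strips the Compute Engine API prefix before comparing:

# Strip the Compute Engine API prefix, then compare the two network paths.
VM_NETWORK="$(gcloud compute instances describe VM_NAME \
    --zone=VM_ZONE \
    --format='get(networkInterfaces[0].network)' \
    | sed 's|https://www.googleapis.com/compute/v1/||')"
LUSTRE_NETWORK="$(gcloud lustre instances describe INSTANCE_NAME \
    --location=ZONE \
    --format='get(network)')"
[ "${VM_NETWORK}" = "${LUSTRE_NETWORK}" ] && echo "Networks match" || echo "Networks differ"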

Check the LNet accept port

Managed Lustre instances can be configured to support GKE clients by specifying the --gke-support-enabled flag at the time of creation.

If GKE support has been enabled, you must configure LNet on all Compute Engine instances to use accept_port 6988. See Configure LNet for gke-support-enabled instances .

To determine whether the instance has been configured to support GKE clients, run the following command:

gcloud lustre instances describe INSTANCE_NAME \
    --location=LOCATION | grep gkeSupportEnabled

If the command returns gkeSupportEnabled: true, then you must configure LNet.
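The authoritative steps are in Configure LNet for gke-support-enabled instances. As a rough sketch only, setting the accept port through an LNet module option might look like the following; the configuration file path and reload sequence here are assumptions, not the documented procedure:

# Assumption: set the LNet accept port as a module option before the module loads.
echo "options lnet accept_port=6988" | sudo tee /etc/modprobe.d/lnet.conf
# Reload the Lustre modules so the new option takes effect.
sudo lustre_rmmod
sudo modprobe lustre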

Ubuntu kernel version mismatch with Lustre client

For Compute Engine instances running Ubuntu, the Ubuntu kernel version must match the specific version of the Lustre client packages. If your Lustre client tools are failing, check whether your Compute Engine instance has auto-upgraded to a newer kernel.

To check your kernel version:

uname -r

The response looks like:

 6.8.0-1029-gcp 

To check your Lustre client package version:

dpkg -l | grep -i lustre

The response looks like:

 ii  lustre-client-modules-6.8.0-1029-gcp 2.14.0-ddn198-1  amd64  Lustre Linux kernel module (kernel 6.8.0-1029-gcp)
ii  lustre-client-utils                  2.14.0-ddn198-1  amd64  Userspace utilities for the Lustre filesystem (client) 

If the kernel version from uname -r doesn't match the kernel version in the Lustre client package names, you must re-install the Lustre client packages.
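As a quick, hypothetical check that the two versions agree, you can look for a lustre-client-modules package that names the running kernel:

# Hypothetical helper: verify that a lustre-client-modules package exists for the running kernel.
KERNEL="$(uname -r)"
if dpkg -l | grep -q "lustre-client-modules-${KERNEL}"; then
  echo "Lustre client modules match kernel ${KERNEL}"
else
  echo "No lustre-client-modules package for kernel ${KERNEL}; re-install the client packages"
fi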

Check dmesg for Lustre errors

Many Lustre warnings and errors are logged to the Linux kernel ring buffer. The dmesg command prints the kernel ring buffer.

To search for Lustre-specific messages, use grep in conjunction with dmesg :

dmesg | grep -i lustre

Or, to look for more general errors that might be related:

dmesg | grep -i error
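You can also combine both searches and add human-readable timestamps (the -T flag is standard util-linux dmesg):

dmesg -T | grep -iE 'lustre|lnet|error'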

Information to include with a support request

If you're unable to resolve the mount failure, gather diagnostic information before creating a support case.

Run sosreport. This utility collects system logs and configuration information and generates a compressed tarball:

sudo sosreport

Attach the sosreport archive and any relevant output from dmesg to your support case.
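If you want to attach the kernel log as a separate file, one simple way to capture it is:

# Save a timestamped copy of the kernel ring buffer to attach to the support case.
dmesg -T > "dmesg-$(hostname)-$(date +%F).txt"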

GKE issues

Before following the troubleshooting steps in this section, refer to the limitations when connecting to Managed Lustre from GKE .

Google Kubernetes Engine nodes are not able to connect to a Managed Lustre instance

Verify that the Managed Lustre instance has gke-support-enabled specified:

gcloud lustre instances describe INSTANCE_ID \
    --location=LOCATION | grep gkeSupportEnabled

If the GKE support flag has been enabled and you still cannot connect, continue to the next section.

Log queries

To check logs, run the following query in Logs Explorer .

To return Managed Lustre CSI driver node server logs:

 resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-node*" 
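To narrow the results to warnings and errors only, you can append a severity filter using standard Logging query syntax:

resource.type="k8s_container"
resource.labels.pod_name=~"lustre-csi-node*"
severity>=WARNING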

Pod event warnings

If your workload Pods cannot start up, run the following command to check the Pod events:

kubectl describe pod POD_NAME -n NAMESPACE

Then, read the following sections for information about your specific error.
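If the Pod's own events have already expired, a broader view of recent events in the namespace can help. This is a general kubectl pattern, not specific to the Lustre CSI driver:

kubectl get events -n NAMESPACE --sort-by=.lastTimestamp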

CSI driver enablement issues

The following errors indicate issues with the CSI driver:

 MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers 
 MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers 

These warnings indicate that the CSI driver is either not installed or not yet running. Double-check that the CSI driver is running in your cluster by following the instructions in Install the CSI driver .
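As a quick check, you can confirm that the CSIDriver object is registered and that the driver's node Pods are running. The namespace and Pod name prefix below are assumptions based on the lustre-csi-node naming shown in the log query above:

# Confirm the CSIDriver object exists in the cluster.
kubectl get csidriver lustre.csi.storage.gke.io
# Assumption: the driver's node Pods run in kube-system with a lustre-csi prefix.
kubectl get pods -n kube-system | grep lustre-csi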

If the cluster was recently scaled, updated, or upgraded, this warning is expected and should be transient, as the CSI driver Pods may take a few minutes to become fully functional after cluster operations.

MountVolume failures

AlreadyExists

An AlreadyExists error may look like the following:

 MountVolume.MountDevice failed for volume "xxx" : rpc error: code = AlreadyExists
desc = A mountpoint with the same lustre filesystem name "xxx" already exists on
node "xxx". Please mount different lustre filesystems 

Recreate the Managed Lustre instance with a different file system name, or use another Managed Lustre instance with a unique file system name. Mounting multiple volumes from different Managed Lustre instances with the same file system name on a single node is not supported. This is because identical file system names result in the same major and minor device numbers, which conflicts with the shared mount architecture on a per-node basis.
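To see which Lustre file systems are already mounted on a node (and therefore which file system names are taken), you can connect to the node and list its Lustre mounts with findmnt, a standard util-linux tool:

# Each entry shows the MGS address and file system name, for example 10.90.2.4@tcp:/testlfs1.
findmnt -t lustre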

Internal

An Internal error code usually contains additional information to help locate the issue.

Is the MGS specification correct? Is the filesystem name correct?

MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.4@tcp:/testlfs1" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 2
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.4@tcp:/testlfs1 /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.4@tcp:/testlfs1 at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

This error means the file system name of the Managed Lustre instance you're trying to mount is incorrect or does not exist. Double-check the file system name of the Managed Lustre instance.
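One way to double-check the name is to read it back from the instance description. The exact field name in the output is an assumption here, so the example greps case-insensitively:

gcloud lustre instances describe INSTANCE_NAME \
    --location=LOCATION | grep -i filesystem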

Is the MGS running?

MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.5@tcp:/testlfs" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 5
Mounting command: mount
Mounting arguments: -t lustre 10.90.2.5@tcp:/testlfs /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
Output: mount.lustre: mount 10.90.2.5@tcp:/testlfs at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: Input/output error
Is the MGS running?

This error means your Google Kubernetes Engine cluster cannot connect to the Managed Lustre instance using the specified IP address and file system name. Ensure the IP address is correct and that your Google Kubernetes Engine cluster is in the same VPC network as your Managed Lustre instance.

To check that the IP address is correct, run the following command:

sudo lctl ping IP_ADDRESS@tcp0

If the IP address is correct, the result looks like the following:

 12345-0@lo
12345-172.26.15.16@tcp 

If the IP address is unreachable, the error looks like the following:

 failed to ping 172.26.15.16@tcp: Connection timed out 

Errors not listed

Warnings not listed in this section that include the RPC error code Internal indicate unexpected issues in the CSI driver. Create a new issue on the GitHub project page and include your GKE cluster version, detailed workload information, and the Pod event warning message.

VPC network issues

The following sections describe common VPC network issues.

Managed Lustre does not support VPC-SC

Managed Lustre does not support VPC Service Controls (VPC-SC).

The Google Cloud project where you create your Managed Lustre instance must not be part of any VPC-SC perimeter.

GKE and Compute Engine clients connecting to your Managed Lustre instance must also be outside of any VPC-SC perimeter.

Can't access Managed Lustre from a peered project

To access your Managed Lustre instance from a VM in a peered VPC network, you must use Network Connectivity Center (NCC). NCC lets you connect multiple VPC networks and on-premises networks to a central hub, providing connectivity between them.

For instructions on how to set up NCC, refer to the Network Connectivity Center documentation .

Mounting Lustre on a multi-NIC VM fails

When a VM has multiple network interface controllers (NICs), and the Managed Lustre instance is on a VPC network connected to a secondary NIC (for example, eth1), mounting the instance may fail.

To resolve this issue:

  1. Configure LNET to use the correct NIC.

    Copy the contents of /etc/lnet.conf to a file named /etc/modprobe.d/lustre.conf on the VM. Append the following line to the file, replacing eth1 with the name of the secondary NIC:

    options lnet networks="tcp0(eth1)"
     
    

    Reload the Lustre kernel module:

    lustre_rmmod
    modprobe lustre
    

    Verify that LNet is configured to use the secondary NIC. The output of the following command should show the secondary NIC:

    lctl list_nids
    
  2. Add a static route for the MGS network.

    Add a static route to the MGS network via the secondary NIC's gateway. For example, if the MGS network is 172.16.0.0/16 , the secondary NIC is eth1 , and its gateway is 10.128.0.1 , run the following command:

    ip route add 172.16.0.0/16 via 10.128.0.1 dev eth1
    
    • To find the gateway, run the route command.
    • To find the MGS network, run gcloud lustre instances describe to find the value of mountPoint. Convert the IP address to a CIDR range containing it, with a size of /16. A verification sketch follows this list.
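After adding the route, a quick way to confirm that MGS traffic leaves through the secondary NIC is to query the routing table and ping the MGS again. The address below is a hypothetical host inside the example 172.16.0.0/16 MGS network:

# Confirm that the kernel routes MGS traffic through eth1.
ip route get 172.16.0.10
# Confirm LNet connectivity over the new route.
sudo lctl ping 172.16.0.10@tcp0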

Cannot connect from the 172.17.0.0/16 subnet range

Compute Engine and GKE clients with an IP address in the 172.17.0.0/16 subnet range cannot mount Managed Lustre instances.
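To check whether a Compute Engine client falls within this range, you can inspect its internal IP address; VM_NAME and ZONE are placeholders:

gcloud compute instances describe VM_NAME \
    --zone=ZONE \
    --format='get(networkInterfaces[0].networkIP)'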

Permission denied to add peering for service servicenetworking.googleapis.com

 ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'. 

This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.

See Access control with IAM for instructions on adding one of the following roles to your account; an example binding follows the list:

  • roles/compute.networkAdmin or
  • roles/servicenetworking.networksAdmin
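As an illustration only, if you administer the project yourself, one way to grant such a role is with an IAM policy binding; PROJECT_ID and USER_EMAIL are placeholders:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=user:USER_EMAIL \
    --role=roles/servicenetworking.networksAdmin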

Cannot modify allocated ranges in CreateConnection

 ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection." 

This error is returned when a VPC peering connection has already been created on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges:

gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com \
    --force

Or, add the new IP range to the existing connection:

  1. Retrieve the list of existing IP ranges for the peering:

    EXISTING_RANGES="$(gcloud services vpc-peerings list \
        --network=NETWORK_NAME \
        --service=servicenetworking.googleapis.com \
        --format="value(reservedPeeringRanges.list())" \
        --flatten=reservedPeeringRanges)"
     
    
  2. Then, add the new range to the peering:

    gcloud services vpc-peerings update \
        --network=NETWORK_NAME \
        --ranges="${EXISTING_RANGES}",IP_RANGE_NAME \
        --service=servicenetworking.googleapis.com
    

IP address range exhausted

If instance creation fails with a range exhausted error:

 ERROR: (gcloud.alpha.Google Cloud Managed Lustre.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted 

Follow the VPC guide to modify the existing private connection to add IP address ranges.

We recommend a prefix length of at least /20 (1024 addresses).
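For example, one way to add capacity is to reserve an additional /20 range and attach it to the existing peering, reusing the update command shown in the previous section. The range name below is hypothetical:

# Reserve an additional /20 range for service networking (the name is a placeholder).
gcloud compute addresses create lustre-range-2 \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=20 \
    --network=NETWORK_NAME

# Attach the new range alongside the existing ranges (see the previous section
# for how to preserve the ranges that are already attached).
gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=EXISTING_RANGE_NAMES,lustre-range-2 \
    --service=servicenetworking.googleapis.com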

Data transfer issues

Known issues with the Autoclass and Object Lifecycle Management features

Cloud Storage's Autoclass and Object Lifecycle Management features automatically adjust an object's storage class to optimize costs, moving infrequently-accessed objects to cheaper classes or frequently-accessed objects to more available classes.

When an object's storage class changes, Cloud Storage updates its updated timestamp to reflect the time of this change.

How this affects data transfers with Managed Lustre

The Managed Lustre import and export commands use the updated timestamp to determine if an object has changed and needs to be synchronized between Cloud Storage and Managed Lustre. A difference in this timestamp signals a change to Managed Lustre.
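To see the timestamps and storage class that Cloud Storage currently records for an object, you can inspect it with gsutil; the bucket and object names are placeholders:

# The output includes the storage class and the update time that Managed Lustre compares.
gsutil stat gs://BUCKET_NAME/OBJECT_NAME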

For data imports to Managed Lustre instances: Objects in Cloud Storage whose storage class has changed will be re-transferred during the next import operation, even if the object's content has not been modified. This can lead to unnecessary data transfer costs and increased import times.

For data exports from Managed Lustre instances: Managed Lustre will fail to export an object from the instance if its storage class in Cloud Storage was changed after the object was last modified on the Managed Lustre instance. This can result in data inconsistencies between your instance and Cloud Storage.

Recommendation

For imports, avoid using Autoclass or OLM if incremental import behavior is important. For exports, we strongly recommend not using Cloud Storage buckets with Autoclass or Object Lifecycle Management features enabled.
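To check whether a bucket you plan to use for import or export has these features enabled, you can query its Autoclass and lifecycle configuration; BUCKET_NAME is a placeholder:

# Shows whether Autoclass is enabled on the bucket.
gsutil autoclass get gs://BUCKET_NAME
# Shows any Object Lifecycle Management rules configured on the bucket.
gsutil lifecycle get gs://BUCKET_NAME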
