This page includes troubleshooting steps for some common issues and errors.
FAILED instance

The FAILED status means that the instance data has been lost and the instance must be deleted. Parallelstore instances in a FAILED state continue to be billed until they're deleted.

To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance. To delete an instance, read Manage instances: Delete an instance.
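As a sketch, the state check and the deletion can be done with the gcloud CLI. The command group (beta vs. GA) and flag names below are assumptions; verify them against your installed gcloud version:

```shell
# Check the instance state; a FAILED value means the data is lost.
# (command group and flags are assumptions about the current CLI surface)
gcloud beta parallelstore instances describe INSTANCE_ID \
    --location=LOCATION \
    --format="value(state)"

# Delete a FAILED instance to stop being billed for it.
gcloud beta parallelstore instances delete INSTANCE_ID \
    --location=LOCATION
```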
Timeouts during dfuse mount or network tests

If the dfuse -m command times out when mounting your Parallelstore instance, or if network test commands such as self_test or daos health net-test time out, the cause may be a network connectivity issue.

To verify connectivity to the Parallelstore servers, run:

self_test --use-daos-agent-env -r 1

If the test reports a connection issue, two possible causes are:

- The DAOS agent may have selected the wrong network interface during setup.
- You may need to exclude network interfaces that are not able to reach the IPs in the access_points list.
- Run ifconfig to list the available network interfaces. An example output may show several network interfaces such as eth0, docker0, ens8, lo, and so on.
- Stop the daos_agent.
- Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces. Uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example: exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
- Restart the daos_agent.
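On a systemd-based client, the steps above might look like the following sketch. The daos_agent service name, the sudo usage, and the example interface list are assumptions about your environment:

```shell
# List available network interfaces to identify the ones that
# cannot reach the access_points IPs (for example docker0, lo).
ifconfig

# Stop the DAOS agent before editing its configuration.
# (service name is an assumption; your install may manage daos_agent differently)
sudo systemctl stop daos_agent

# Uncomment and set exclude_fabric_ifaces in the agent config.
# The interface list here is illustrative; use the ones from your ifconfig output.
sudo sed -i 's|^#\? *exclude_fabric_ifaces:.*|exclude_fabric_ifaces: ["docker0", "ens8", "lo"]|' \
    /etc/daos/daos_agent.yml

# Restart the agent so it re-selects a fabric interface.
sudo systemctl start daos_agent
```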
The instance or client IP address conflicts with internal IP addresses
Parallelstore instances and clients cannot use an IP address from the 172.17.0.0/16 subnet range. See Known issues for more information.
ENOSPC when there is unused capacity in the instance

If your instance uses minimum or balanced (the default) file striping, you might run into ENOSPC errors even if the existing files are not using all of the instance's capacity. This is likely to happen when writing large files, generally greater than 8 GiB, or when importing such files from Cloud Storage. Use maximum file striping to reduce the likelihood of these errors.
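File striping is chosen at instance creation time. The following is a hedged sketch; the --file-stripe-level flag and its FILE_STRIPE_LEVEL_MAX value are assumptions about the current gcloud CLI and should be checked against the reference before use:

```shell
# Create an instance with maximum file striping to better handle
# large (> 8 GiB) files. Flag name and enum value are assumptions.
gcloud beta parallelstore instances create INSTANCE_ID \
    --location=LOCATION \
    --capacity-gib=12000 \
    --network=NETWORK_NAME \
    --file-stripe-level=FILE_STRIPE_LEVEL_MAX
```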
Google Kubernetes Engine troubleshooting
The following section lists some common issues and steps to resolve them.
Transport endpoint is not connected in workload Pods
This error is due to dfuse termination, in most cases because dfuse ran out of memory. Use the Pod annotations gke-parallelstore/cpu-limit and gke-parallelstore/memory-limit to allocate more resources to the Parallelstore sidecar container. If you don't know how much memory to allocate, you can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limitation. Note that this only works with Standard clusters; with Autopilot clusters, you cannot use the value 0 to unset the sidecar container resource limits and requests, and you have to explicitly set a larger resource limit for the sidecar container.

Once you've modified the annotations, you must restart your workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.
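As an illustration, the annotations from this section go on the workload's Pod template. The annotation keys come from this page; the workload name, image, and the 4Gi value are placeholders:

```yaml
# Hypothetical Deployment snippet: annotation keys are from this page,
# everything else (names, image, memory value) is illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workload            # placeholder
spec:
  template:
    metadata:
      annotations:
        gke-parallelstore/cpu-limit: "2"
        gke-parallelstore/memory-limit: "4Gi"   # or "0" on Standard clusters
    spec:
      containers:
      - name: app              # placeholder
        image: busybox         # placeholder
```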
Pod event warnings
If your workload Pods cannot start up, check the Pod events:
kubectl describe pod POD_NAME -n NAMESPACE
The following solutions are for common errors.
CSI driver enablement issues
Common CSI driver enablement errors are as follows:
MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers
These warnings indicate that the CSI driver is not enabled, or not running.
If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It takes a few minutes for the CSI driver Pods to be functional after cluster operations.
Otherwise, confirm that the CSI driver is enabled on your cluster. See Enable the CSI driver for details. If the CSI driver is enabled, each node shows a Pod named parallelstore-csi-node-ID up and running.
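To confirm the driver Pods are present, you can list them by name. The kube-system namespace and the parallelstore-csi-node name prefix follow from this section; the exact Pod labels are not documented here, so a name filter is used instead:

```shell
# Each node should show one parallelstore-csi-node-* Pod in Running state.
kubectl get pods -n kube-system -o wide | grep parallelstore-csi-node
```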
AttachVolume.Attach failures

After the Pod is scheduled to a node, the volume is attached to the node, and the mounter Pod is created if node mount is used. This happens on the controller and involves the AttachVolume step from the attachdetach-controller.

- AttachVolume.Attach failed for volume "volume" : rpc error: code = InvalidArgument desc = an error occurred while preparing mount options: invalid mount options
- AttachVolume.Attach failed for volume "volume" : rpc error: code = NotFound desc = failed to get instance "instance"
MountVolume.MountDevice failures

After the volume is attached to a node, the volume is staged on the node. This happens on the node and involves the MountVolume.MountDevice step from kubelet.

- MountVolume.MountDevice failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod "pod" expected to exist but was not found
- MountVolume.MountDevice failed for volume "volume" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
MountVolume.SetUp failures

After the volume is staged on the node, the volume is mounted and provided to the container in the Pod. This happens on the node and involves the MountVolume.SetUp step in kubelet.

Pod mount

- MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed
- MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container terminated due to OOMKilled, exit code: 137

To resolve these errors, allocate more memory to the sidecar container by using the gke-parallelstore/memory-limit annotation. If you're unsure about the amount of memory you want to allocate to the parallelstore-sidecar, we recommend setting gke-parallelstore/memory-limit: "0" to eliminate the memory restriction imposed by Parallelstore.
- MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit
- MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists
- MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =
- MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec

If you see the sidecar container error, check that the gke-parallelstore/volumes: "true" Pod annotation is set correctly.

Node mount
- MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit
- MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists
- MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =
- MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod expected to exist but was not found
- MountVolume.SetUp failed for volume "volume" : rpc error: code = DeadlineExceeded desc = timeout waiting for mounter pod gRPC server to become available
Troubleshooting VPC networks
Permission denied to add peering for service servicenetworking.googleapis.com
ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.
This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission. See Access control with IAM for instructions on adding one of the following roles to your account:

- roles/compute.networkAdmin, or
- roles/servicenetworking.networksAdmin
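A role can be granted with a single gcloud command; the project ID and user email below are placeholders:

```shell
# Grant the network admin role, which includes
# servicenetworking.services.addPeering.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/compute.networkAdmin"
```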
Cannot modify allocated ranges in CreateConnection

ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection."

This error is returned when you have already created a VPC peering on this network with different IP ranges. There are two possible solutions:
Replace the existing IP ranges:

gcloud services vpc-peerings update \
    --network=NETWORK_NAME \
    --ranges=IP_RANGE_NAME \
    --service=servicenetworking.googleapis.com \
    --force
Or, add the new IP range to the existing connection:

- Retrieve the list of existing IP ranges for the peering:

  EXISTING_RANGES=$(gcloud services vpc-peerings list \
      --network=NETWORK_NAME \
      --service=servicenetworking.googleapis.com \
      --format="value(reservedPeeringRanges.list())")

- Then, add the new range to the peering:

  gcloud services vpc-peerings update \
      --network=NETWORK_NAME \
      --ranges=$EXISTING_RANGES,IP_RANGE_NAME \
      --service=servicenetworking.googleapis.com
IP address range exhausted
Instance creation might fail with the following range exhausted error:
ERROR: (gcloud.alpha.Parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted
If you see this error message, follow the VPC guide to either recreate the IP range or extend the existing IP range.
If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.
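Allocating a new (or larger) peering range is done with gcloud compute addresses. This is a sketch; the range name, network, and prefix length are placeholders for your values:

```shell
# Allocate an IP range for VPC peering (private services access).
# IP_RANGE_NAME, NETWORK_NAME, and the prefix length are placeholders.
gcloud compute addresses create IP_RANGE_NAME \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=16 \
    --network=NETWORK_NAME
```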
Maintenance blocked due to restrictive Pod Disruption Budget
The Google Cloud console might display the following error message indicating that maintenance can't proceed because a Pod Disruption Budget (PDB) is configured to allow zero Pod evictions:
GKE can't perform maintenance because the Pod Disruption Budget allows for 0 Pods evictions.
If you see this error message, identify the problematic Pod by completing the following steps:
- Click the error message to open the error insight panel.
- Check the Unpermissive Pod Disruption Budgets section for the Pod's name.
- If the Pod is parallelstorecsi-mount, you can disregard this error as it won't prevent maintenance. For any other Pod, examine your PDB.
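To see which PodDisruptionBudget is blocking evictions, you can list the budgets and look for one that allows zero disruptions:

```shell
# List all PodDisruptionBudgets; look for ALLOWED DISRUPTIONS of 0.
kubectl get pdb --all-namespaces

# Show details of a specific budget (PDB_NAME and NAMESPACE are placeholders).
kubectl describe pdb PDB_NAME -n NAMESPACE
```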