This document gives troubleshooting guidance for storage issues.
## Volume fails to attach
This issue can occur if a virtual disk is attached to the wrong virtual machine, which may be due to Issue #32727 in Kubernetes 1.12.
The output of `gkectl diagnose cluster` looks like this:

```
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-776459c3-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-776459c3-d350-11e9-9db8-e297f465bc84.vmdk" IS attached to machine "gsl-test-user-9b46dbf9b-9wdj7" but IS NOT listed in the Node.Status
1 storage errors
```
One or more Pods are stuck in the `ContainerCreating` state with warnings like this:

```
Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Warning  FailedAttachVolume  6s (x6 over 31s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-776459c3-d350-11e9-9db8-e297f465bc84" : Failed to add disk 'scsi0:6'.
```
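A quick way to find the affected Pods is to filter on their status; this is a generic `kubectl` sketch, not a `gkectl` feature:

```
# List Pods in all namespaces that are stuck in ContainerCreating.
kubectl get pods --all-namespaces | grep ContainerCreating
```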
To resolve this issue, if a virtual disk is attached to the wrong virtual machine, you might need to manually detach it:

1.  Drain the node, as shown in the sketch after this list. See Safely draining a node in the Kubernetes documentation. You might want to include the `--ignore-daemonsets` and `--delete-local-data` flags in your `kubectl drain` command.

2.  Edit the VM's hardware configuration in vCenter to remove the volume.
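For example, assuming the affected node is named `node-1` (a placeholder), the drain might look like this:

```
# Cordon the node and evict its Pods. --ignore-daemonsets lets the drain
# proceed past DaemonSet-managed Pods; --delete-local-data allows eviction
# of Pods that use emptyDir volumes (newer kubectl versions call this flag
# --delete-emptydir-data).
kubectl drain node-1 --ignore-daemonsets --delete-local-data

# After the disk is detached in vCenter, make the node schedulable again.
kubectl uncordon node-1
```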
## Volume is lost
This issue can occur if a virtual disk was permanently deleted. This can happen if an operator manually deletes a virtual disk or the virtual machine it is attached to. If you see a "not found" error related to your VMDK file, it is likely that the virtual disk was permanently deleted.
The output of `gkectl diagnose cluster` looks like this:

```
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-52161704-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk" IS NOT found
1 storage errors
```
One or more Pods are stuck in the `ContainerCreating` state:

```
Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Warning  FailedAttachVolume  71s (x28 over 42m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-52161704-d350-11e9-9db8-e297f465bc84" : File []/vmfs/volumes/43416d29-03095e58/kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk was not found
```
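To confirm that the virtual disk is really gone, one option is to list the `kubevols` folder on the datastore with `govc` (assuming the tool is installed and configured with `GOVC_URL` credentials); the datastore and folder names here are taken from the error above:

```
# List the kubevols directory on the datastore named in the error.
# If the VMDK does not appear, it was permanently deleted.
govc datastore.ls -ds datastore_nfs kubevols | grep 52161704
```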
To prevent this issue from occurring, manage your virtual machines as described in Resizing a user cluster and Upgrading clusters.
To resolve this issue, you might need to manually clean up related Kubernetes resources, as shown in the sketch after this list:

1.  Delete the PVC that referenced the PV by running `kubectl delete pvc [PVC_NAME]`.

2.  Delete the Pod that referenced the PVC by running `kubectl delete pod [POD_NAME]`.

3.  Repeat step 2. Yes, really. See Kubernetes issue 74374.
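As a concrete sketch, with `pvc-xxxxx` and `my-pod` as placeholder names:

```
# 1. Delete the PVC that referenced the lost PV.
kubectl delete pvc pvc-xxxxx

# 2. Delete the Pod that referenced the PVC.
kubectl delete pod my-pod

# 3. Repeat the Pod deletion (see Kubernetes issue 74374).
kubectl delete pod my-pod
```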
## vSphere CSI volume fails to detach

This issue occurs if the `CNS > Searchable` privilege has not been granted to the vSphere user.
If you find Pods stuck in the `ContainerCreating` phase with `FailedAttachVolume` warnings, it could be due to a failed detach on a different node.
To check for CSI detach errors:

```
kubectl get volumeattachments -o=custom-columns=NAME:metadata.name,DETACH_ERROR:status.detachError.message
```
The output is similar to the following:

```
NAME                                                                   DETACH_ERROR
csi-0e80d9be14dc09a49e1997cc17fc69dd8ce58254bd48d0d8e26a554d930a91e5   rpc error: code = Internal desc = QueryVolume failed for volumeID: "57549b5d-0ad3-48a9-aeca-42e64a773469". ServerFaultCode: NoPermission
csi-164d56e3286e954befdf0f5a82d59031dbfd50709c927a0e6ccf21d1fa60192d
csi-8d9c3d0439f413fa9e176c63f5cc92bd67a33a1b76919d42c20347d52c57435c
csi-e40d65005bc64c45735e91d7f7e54b2481a2bd41f5df7cc219a2c03608e8e7a8
```
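To see which PersistentVolume and node each stuck VolumeAttachment refers to, you can extend the custom columns; `spec.source.persistentVolumeName` and `spec.nodeName` are standard fields of the VolumeAttachment API object:

```
kubectl get volumeattachments -o=custom-columns=NAME:metadata.name,PV:spec.source.persistentVolumeName,NODE:spec.nodeName,DETACH_ERROR:status.detachError.message
```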
To resolve this issue, add the `CNS > Searchable` privilege to your vCenter user account.
The detach operation automatically retries until it succeeds.
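After granting the privilege, you can watch the VolumeAttachments until the errors clear; each object is removed once its detach succeeds:

```
# Watch VolumeAttachments; successfully detached volumes disappear from the list.
kubectl get volumeattachments --watch
```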
## CSI volume creation fails with `NotSupported` error
This issue occurs when an ESXi host in the vSphere cluster is running a version lower than ESXi 6.7U3.
The output of `kubectl describe pvc` includes this error:

```
Failed to provision volume with StorageClass: rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to create disk.:Fault cause: vmodl.fault.NotSupported
```
To resolve this issue, upgrade your ESXi hosts to version 6.7U3 or later.
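One way to check the version of each host, assuming you have SSH access to the ESXi hosts, is `esxcli`:

```
# Run on each ESXi host. Version 6.7.0 with Update 3 (or any later
# release) satisfies the requirement.
esxcli system version get
```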
## vSphere CSI volume fails to attach
This known issue in the open-source vSphere CSI driver occurs when a node is shut down, deleted, or fails.
The output of `kubectl describe pod` looks like this:

```
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Warning  FailedAttachVolume  2m30s  attachdetach-controller  Multi-Attach error for volume "pvc-xxxxx"
           Volume is already exclusively attached to one node and can't be attached to another
```
To resolve this issue:

1.  Note the name of the PersistentVolumeClaim (PVC) in the preceding output.

2.  Find the VolumeAttachments that are associated with that PVC. For example:

    ```
    kubectl get volumeattachments | grep pvc-xxxxx
    ```

    The output shows the names of the VolumeAttachments. For example:

    ```
    csi-yyyyy   csi.vsphere.vmware.com   pvc-xxxxx   node-zzzzz ...
    ```

3.  Describe the VolumeAttachments. For example:

    ```
    kubectl describe volumeattachments csi-yyyyy | grep "Deletion Timestamp"
    ```

    Make a note of the deletion timestamp in the output. For example:

    ```
    Deletion Timestamp:   2021-03-10T22:14:58Z
    ```

4.  Wait until the time specified by the deletion timestamp, and then force delete the VolumeAttachment. To do this, edit the VolumeAttachment object and delete the finalizer. For example:

    ```
    kubectl edit volumeattachment csi-yyyyy
    ```

    In the editor, remove the finalizer:

    ```
    Finalizers:
      external-attacher/csi-vsphere-vmware-com
    ```

    For a non-interactive alternative, see the sketch after this list.
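If you prefer not to use `kubectl edit`, a merge patch can clear the finalizers; `csi-yyyyy` is the same placeholder name used above:

```
# Remove all finalizers so the pending deletion can complete.
kubectl patch volumeattachment csi-yyyyy --type=merge -p '{"metadata":{"finalizers":null}}'
```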