This is a continuation of bug #1569404. The Cinder CSI driver we use for testing is broken: https://github.com/kubernetes/cloud-provider-openstack/issues/150

It trusts Cinder to return the correct device name for volumes. That does not hold in the OpenStack + KVM combination, where Cinder sometimes returns a wrong device name. The in-tree Cinder driver has code that finds the right device despite Cinder returning a wrong one; the current CSI driver does not. In the SetUp call it waits until the (wrong) device appears. It never appears, kubelet times out, and the pod stays in ContainerCreating forever.

So far so good (at least on the Kubernetes side). Kubernetes should recover from that. The bug is that it does *not* recover when the user deletes the pod:

-> Kubelet calls the CSI plugin's UnmountDevice().
-> UnmountDevice() calls PVs.Get() to get the coordinates of the volume (i.e. the driver name and volume handle).
-> The pod is already deleted, so kubelet can't read the PV:

   error csi_attacher.go:471] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "kubernetes-dynamic-pv-b2f4b7ad6a4511e8" is forbidden: User "system:node:qe-juzhao-310-qeos-2-nrr-1" cannot get persistentvolumes at the cluster scope: no path found to object

-> The volume is never removed from volumesInUse. It is forcefully detached only when the A/D controller gives up waiting for kubelet (after 5 minutes?).

Version-Release number of selected component (if applicable): openshift v3.10.0-0.63.0
Created attachment 1448693 [details] UnmountDevice log
Upstream issue: https://github.com/kubernetes/kubernetes/issues/64875
Easy steps to reproduce (with a working Cinder CSI driver):

1. Run a pod with a CSI volume.
2. Wait until it's Running.
3. oc delete pod --grace-period=0 --force

What happens:
- The volume is still in node.status.volumesInUse
- The volume is not detached
- The VolumeAttachment does not disappear

What should happen:
- The volume disappears from node.status.volumesInUse
- The volume is detached
- The VolumeAttachment disappears
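For the first step, a minimal manifest might look like the following. The PVC name, image, and StorageClass name are placeholders; substitute whatever StorageClass your Cinder CSI driver provisions from.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-test-pvc           # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cinder # placeholder; use your Cinder CSI StorageClass
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: csi-test-pvc
```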
Upstream PR: https://github.com/kubernetes/kubernetes/pull/64882
Additional PR with fixes: https://github.com/kubernetes/kubernetes/pull/65323
3.10 PR: https://github.com/openshift/origin/pull/20111
OSE 3.10.1 PR: https://github.com/openshift/ose/pull/1341
Sorry, 3.10.1 PR is still open.
The fix has been merged into origin/3.10 branch.
Verified in OCP: v3.10.34

# uname -a
Linux qe-piqin-master-etcd-1 3.10.0-862.9.1.el7.x86_64 #1 SMP Wed Jun 27 04:30:39 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2376