1588453 – CSI: volumes not detached on kubelet error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.10.z
Assignee:	Jan Safranek
QA Contact:	Qin Ping
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-06-07 11:44 UTC by Jan Safranek
Modified:	2018-08-31 06:18 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-08-31 06:18:10 UTC
Target Upstream Version:
Embargoed:

Attachments
UnmountDevice log (2.47 KB, text/plain) 2018-06-07 11:56 UTC, Jan Safranek

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:2376	0	None	None	None	2018-08-31 06:18:58 UTC

Description Jan Safranek 2018-06-07 11:44:37 UTC

This is a continuation of bug #1569404.

The Cinder CSI driver we use for testing is broken:

https://github.com/kubernetes/cloud-provider-openstack/issues/150

It trusts Cinder to return the correct device name for volumes. That does not hold for the OpenStack + KVM combination, where Cinder sometimes returns a wrong device name.

The in-tree Cinder driver has code that finds the right volume even when Cinder returns a wrong device name.

The current CSI driver has no such code. In the SetUp call, it waits for the (wrong) device to appear. The device never appears, kubelet times out, and the pod stays in ContainerCreating forever.

So far so good (at least on the Kubernetes side). Kubernetes should recover from that. The bug is that it does *not* recover when the user deletes the pod.

-> Kubelet calls CSI's UnmountDevice()
-> UnmountDevice() calls PVs.Get() to get the coordinates of the volume (i.e. driver name and volume handle)
-> The pod is already deleted, so kubelet is no longer authorized to read the PV
-> error

csi_attacher.go:471] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "kubernetes-dynamic-pv-b2f4b7ad6a4511e8" is forbidden: User "system:node:qe-juzhao-310-qeos-2-nrr-1" cannot get persistentvolumes at the cluster scope: no path found to object

The volume is never removed from volumesInUse. It is detached only when the A/D controller gives up waiting for kubelet (5 minutes?) and detaches it forcefully.

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.63.0

Comment 1 Jan Safranek 2018-06-07 11:56:34 UTC

Created attachment 1448693
UnmountDevice log

Comment 2 Jan Safranek 2018-06-07 12:11:41 UTC

Upstream issue: https://github.com/kubernetes/kubernetes/issues/64875

Comment 3 Jan Safranek 2018-06-08 07:47:38 UTC

Easy steps to reproduce (with a working Cinder CSI driver):

1. Run a pod with a CSI volume.
2. Wait until it's Running.
3. Force-delete the pod: oc delete pod <pod-name> --grace-period=0 --force

What happens:
- The volume is still in node.status.volumesInUse
- The volume is not detached
- VolumeAttachment does not disappear

What should happen:
- The volume disappears from node.status.volumesInUse
- The volume is detached
- VolumeAttachment disappears
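
The first check in both lists can be automated with a few lines of Go. The sketch below assumes the node object has been fetched as JSON (for example with oc get node <node-name> -o json) and only inspects status.volumesInUse. The function name volumeInUse, the sample JSON and the volume name in it are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node mirrors only the part of the Node object this check needs.
type node struct {
	Status struct {
		VolumesInUse []string `json:"volumesInUse"`
	} `json:"status"`
}

// volumeInUse reports whether volumeName is listed in status.volumesInUse
// of the given Node JSON document.
func volumeInUse(nodeJSON []byte, volumeName string) (bool, error) {
	var n node
	if err := json.Unmarshal(nodeJSON, &n); err != nil {
		return false, err
	}
	for _, v := range n.Status.VolumesInUse {
		if v == volumeName {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Made-up sample; a real document comes from the API server.
	sample := []byte(`{"status":{"volumesInUse":["kubernetes.io/csi/example-driver^vol-1"]}}`)
	inUse, err := volumeInUse(sample, "kubernetes.io/csi/example-driver^vol-1")
	fmt.Println(inUse, err)
}
```

After the forced pod deletion, a result of true that persists indicates the buggy behaviour; false matches the expected behaviour.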

Comment 4 Jan Safranek 2018-06-08 11:24:03 UTC

Upstream PR: https://github.com/kubernetes/kubernetes/pull/64882

Comment 5 Jan Safranek 2018-06-21 17:37:43 UTC

Additional PR with fixes: https://github.com/kubernetes/kubernetes/pull/65323

Comment 6 Jan Safranek 2018-06-27 11:06:34 UTC

3.10 PR: https://github.com/openshift/origin/pull/20111

Comment 7 Jan Safranek 2018-07-04 11:47:56 UTC

OSE 3.10.1 PR: https://github.com/openshift/ose/pull/1341

Comment 8 Jan Safranek 2018-07-23 08:24:05 UTC

Sorry, the 3.10.1 PR is still open.

Comment 9 Jan Safranek 2018-08-20 07:59:53 UTC

The fix has been merged into origin/3.10 branch.

Comment 11 Qin Ping 2018-08-24 02:42:37 UTC

Verified in OCP: v3.10.34

# uname -a
Linux qe-piqin-master-etcd-1 3.10.0-862.9.1.el7.x86_64 #1 SMP Wed Jun 27 04:30:39 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Comment 13 errata-xmlrpc 2018-08-31 06:18:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2376
