Bug 1588453 - CSI: volumes not detached on kubelet error
Summary: CSI: volumes not detached on kubelet error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-07 11:44 UTC by Jan Safranek
Modified: 2018-08-31 06:18 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-31 06:18:10 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
UnmountDevice log (2.47 KB, text/plain)
2018-06-07 11:56 UTC, Jan Safranek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2376 0 None None None 2018-08-31 06:18:58 UTC

Description Jan Safranek 2018-06-07 11:44:37 UTC
This is a continuation of bug #1569404.

The Cinder CSI driver we use for testing is broken:

https://github.com/kubernetes/cloud-provider-openstack/issues/150

It trusts Cinder to return the correct device name for a volume. That does not hold on the OpenStack + KVM combo, where Cinder sometimes returns the wrong device name.

The in-tree Cinder driver has code that finds the right device even when Cinder returns the wrong one.

The current CSI driver has no such code. In the SetUp call it waits for the (wrong) device to appear. The device never appears, kubelet times out, and the pod stays in ContainerCreating forever.

So far so good (at least on the Kubernetes side). Kubernetes should recover from that. The bug is that it does *not* recover when the user deletes the pod.

-> Kubelet calls CSI's UnmountDevice()
  -> UnmountDevice() calls PVs.Get() to get coordinates of the volume (i.e. driver name and volume handle)
    -> The pod is already deleted, so Kubelet can't read the PV
      -> error

csi_attacher.go:471] kubernetes.io/csi: attacher.UnmountDevice failed to get driver and volume name from device mount path: persistentvolumes "kubernetes-dynamic-pv-b2f4b7ad6a4511e8" is forbidden:  User "system:node:qe-juzhao-310-qeos-2-nrr-1" cannot get persistentvolumes at the cluster scope: no path found to object
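
For illustration only, the call chain above can be reproduced with a toy Python model. This is not kubelet or API server code: `ToyApiServer`, `Forbidden` and `unmount_device` are names made up for this sketch, and the access rule (a node may read a PV only while one of its pods references it) is a simplified assumption based on the error message above.

```python
class Forbidden(Exception):
    """Raised when the toy authorizer denies a request."""


class ToyApiServer:
    """Toy model of node-scoped access: a node may read a PV only while
    a pod scheduled to that node references it (assumption of this sketch)."""

    def __init__(self):
        self.pvs = {}   # PV name -> {"driver": str, "volume_handle": str}
        self.pods = {}  # pod name -> (node name, PV name)

    def get_pv(self, node, pv_name):
        # Authorization: is there any pod on this node that uses the PV?
        linked = any(n == node and pv == pv_name for n, pv in self.pods.values())
        if not linked:
            raise Forbidden(
                f'node "{node}" cannot get persistentvolumes "{pv_name}": '
                "no path found to object"
            )
        return self.pvs[pv_name]


def unmount_device(api, node, pv_name):
    """Mimics the failing flow: look up the PV before doing anything else."""
    pv = api.get_pv(node, pv_name)  # raises Forbidden once the pod is gone
    return pv["driver"], pv["volume_handle"]
```

While a pod on the node still references the PV, `unmount_device()` returns the driver name and volume handle. Once the pod entry is removed from `api.pods`, the same call raises `Forbidden`, which corresponds to the error in the log line above.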

The volume is never removed from volumesInUse. The A/D controller force-detaches the volume only after it gives up waiting for kubelet (5 minutes?).
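
A rough sketch of the detach decision described above, in Python. It is not the A/D controller's code: `should_detach` and its parameters are hypothetical names, and the timeout is left to the caller because its actual value is not confirmed in this report.

```python
def should_detach(reported_in_use, waited_seconds, max_wait_seconds):
    """Decide whether a volume that no pod needs anymore may be detached.

    reported_in_use:  volume is still listed in node.status.volumesInUse.
    waited_seconds:   how long the controller has been waiting for the node.
    max_wait_seconds: give-up timeout (hypothetical parameter, set by caller).
    """
    if not reported_in_use:
        return True  # kubelet confirmed the unmount: safe to detach now
    # Kubelet never cleared the entry: detach only after the timeout expires.
    return waited_seconds >= max_wait_seconds
```

With this bug the first branch is never taken, so every detach has to wait for the timeout branch.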

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.63.0

Comment 1 Jan Safranek 2018-06-07 11:56:34 UTC
Created attachment 1448693
UnmountDevice log

Comment 2 Jan Safranek 2018-06-07 12:11:41 UTC
Upstream issue: https://github.com/kubernetes/kubernetes/issues/64875

Comment 3 Jan Safranek 2018-06-08 07:47:38 UTC
Easy steps to reproduce (with a working Cinder CSI driver):

1. Run a pod with a CSI volume.
2. Wait until it's Running.
3. oc delete pod <pod-name> --grace-period=0 --force

What happens:
- The volume is still in node.status.volumesInUse
- The volume is not detached
- VolumeAttachment does not disappear

What should happen:
- The volume disappears from node.status.volumesInUse
- The volume is detached
- VolumeAttachment disappears
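
The two API-visible conditions above can be checked with a small Python helper. This is only a sketch: `leftover_state` is a made-up name, and it assumes the node and VolumeAttachment objects are already loaded as dicts (for example from `oc get ... -o json` output). Whether the volume is really detached on the cloud side cannot be seen from these objects and is not checked here.

```python
def leftover_state(node, volume_attachments, unique_volume_name, pv_name):
    """Return a list of leftovers found for one volume on one node.

    node:               Node object as a dict.
    volume_attachments: list of VolumeAttachment objects as dicts.
    unique_volume_name: the string expected in node.status.volumesInUse.
    pv_name:            name of the PersistentVolume backing the volume.
    """
    problems = []

    # 1. The volume must be gone from node.status.volumesInUse.
    in_use = (node.get("status") or {}).get("volumesInUse") or []
    if unique_volume_name in in_use:
        problems.append("volume still in node.status.volumesInUse")

    # 2. No VolumeAttachment may still bind this PV to this node.
    node_name = node["metadata"]["name"]
    for va in volume_attachments:
        spec = va.get("spec", {})
        source = spec.get("source", {})
        if (spec.get("nodeName") == node_name
                and source.get("persistentVolumeName") == pv_name):
            problems.append(
                "VolumeAttachment %s still exists" % va["metadata"]["name"]
            )
    return problems
```

An empty list means both conditions hold; with this bug present, both entries would be expected after the forced pod deletion.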

Comment 4 Jan Safranek 2018-06-08 11:24:03 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/64882

Comment 5 Jan Safranek 2018-06-21 17:37:43 UTC
Additional PR with fixes: https://github.com/kubernetes/kubernetes/pull/65323

Comment 6 Jan Safranek 2018-06-27 11:06:34 UTC
3.10 PR: https://github.com/openshift/origin/pull/20111

Comment 7 Jan Safranek 2018-07-04 11:47:56 UTC
OSE 3.10.1 PR: https://github.com/openshift/ose/pull/1341

Comment 8 Jan Safranek 2018-07-23 08:24:05 UTC
Sorry, 3.10.1 PR is still open.

Comment 9 Jan Safranek 2018-08-20 07:59:53 UTC
The fix has been merged into the origin/3.10 branch.

Comment 11 Qin Ping 2018-08-24 02:42:37 UTC
Verified in OCP: v3.10.34

# uname -a
Linux qe-piqin-master-etcd-1 3.10.0-862.9.1.el7.x86_64 #1 SMP Wed Jun 27 04:30:39 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Comment 13 errata-xmlrpc 2018-08-31 06:18:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2376

