Bug 1489603

Summary: Volume unmounted but not being detached from node
Product: OpenShift Container Platform
Component: Storage
Version: 3.6.1
Target Release: 3.9.0
Reporter: Hemant Kumar <hekumar>
Assignee: Hemant Kumar <hekumar>
QA Contact: Chao Yang <chaoyang>
CC: aos-bugs, aos-storage-staff, bchilds, dcaldwel, hekumar, lxia, tlarsson
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2018-03-28 14:06:20 UTC

Description Hemant Kumar 2017-09-07 21:12:29 UTC
I am seeing many instances of volumes being unmounted but not detached from the node on various OpenShift clusters. We need to find out why this is happening.

Comment 5 Hemant Kumar 2017-09-14 19:57:58 UTC
As I have stated above, the root cause of this bug was:

1. A user created a pod with a volume, but the volume was stuck in the "attaching" state for more than 1 hour.
2. The attach/detach (A/D) controller gave up after a certain time, so the volume was never added to the A/D controller's actual state of the world.
3. The attach eventually succeeded, but the A/D controller no longer knew about this volume.

Obviously the main issue is that the volume should not have been stuck in the attaching state for such a long time. We have to work with Amazon to find a solution for that problem.
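A rough way to spot this kind of dangling attachment is to compare what the node object reports as attached with what AWS reports for the EBS volume. This is only a sketch; the node name and volume ID below are placeholders, not taken from a real incident:

# Volumes the A/D controller has published on the node object (placeholder node name):
oc get node ip-172-18-14-251.ec2.internal -o jsonpath='{.status.volumesAttached}'

# What AWS reports for the EBS volume (placeholder volume ID):
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].Attachments[].{Instance:InstanceId,State:State}'

# A volume that AWS shows as attached but that is missing from
# .status.volumesAttached on every node is dangling and will never be detached.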

Comment 9 David Caldwell 2017-10-16 13:04:34 UTC
Hey guys, 

Any updates on this issue? 

Is there a workaround?

Thanks,

David.

Comment 10 Hemant Kumar 2017-10-16 18:01:14 UTC
Each instance of this problem can have a different underlying cause. Can you give some more details about the customer's problem?

The bug I opened is caused by a volume being stuck in the "attaching" state for too long, and the user then deleting the pod while waiting for it to come up. The volume attach eventually succeeds, but because the attach finishes outside the expiry window of the attach/detach controller, the controller does not know about the volume, and hence it never gets detached.

I am not sure if the incident you linked is the same as what I outlined above. The symptoms may look similar from the outside, but the root cause can be different.

I would request that you open a new bug with the following details (example commands for collecting them are below):

1. PV & PVC yaml
2. output of describe pv and pvc
3. Node logs where this happened.
4. Controller log during same time period.
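A minimal sketch of how this information can be gathered; the PV, PVC, and pod resource names are placeholders, and the systemd unit names assume an RPM-based OCP 3.x install (they may differ depending on how the cluster was deployed):

# 1. and 2. PV/PVC definitions and describe output (placeholder names):
oc get pvc mypvc -o yaml > pvc.yaml
oc get pv mypv -o yaml > pv.yaml
oc describe pvc mypvc > pvc-describe.txt
oc describe pv mypv > pv-describe.txt

# 3. On the node where the pod ran:
journalctl -u atomic-openshift-node > node.log

# 4. On the master, covering the same time period:
journalctl -u atomic-openshift-master-controllers > controller.log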

Comment 11 Hemant Kumar 2017-12-20 01:48:15 UTC
We have opened a PR against OpenShift 3.8 which will cause all dangling volumes to correct themselves: https://github.com/openshift/origin/pull/17544

The specific commit that includes the fix is https://github.com/openshift/origin/pull/17544/commits/2885375c4d0f1738dc45a013e11d64d638f0f050

Comment 13 Hemant Kumar 2018-01-18 22:55:46 UTC
Yes, that is fine. The fix has been merged into 3.9. Moving to MODIFIED.

Comment 15 Chao Yang 2018-01-24 07:41:06 UTC
This passed verification on
oc v3.9.0-0.23.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-251.ec2.internal:443
openshift v3.9.0-0.23.0
kubernetes v1.9.1+a0ce1bc657


1. Make sure the pod is stuck in ContainerCreating because the volume could not be attached:
[root@ip-172-18-14-251 ~]# oc get pods
NAME      READY     STATUS              RESTARTS   AGE
mypod1    0/1       ContainerCreating   0          1h
2. Let the pod become Running:
[root@ip-172-18-14-251 ~]# oc get pods
NAME      READY     STATUS    RESTARTS   AGE
mypod1    1/1       Running   0          1h
3. Delete the pod and check that the volume is detached and becomes available, for example:
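For step 3, a sketch of the check assuming an EBS-backed PV; the volume ID is a placeholder (take the real ID from the PV's spec.awsElasticBlockStore.volumeID):

oc delete pod mypod1

# After the detach completes, the backing EBS volume should be free again:
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[].State'
# Expected output once the detach completes: "available"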

Comment 18 errata-xmlrpc 2018-03-28 14:06:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489