Bug 1455680

Summary: Volume stuck in attaching state for 15 minutes
Product: OpenShift Container Platform
Reporter: Hemant Kumar <hekumar>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED ERRATA
QA Contact: Chao Yang <chaoyang>
Severity: low
Docs Contact:
Priority: low
Version: 3.5.1
CC: aos-bugs, bchilds, mwhittin, wsun
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Nodes enter an impaired state when a volume is force-detached and the node is not rebooted. Consequence: Any new volume we try to attach to the node gets stuck in the attaching state. Fix: Any node with a volume stuck in the attaching state for more than 21 minutes is tainted; the node must be removed from the cluster and added back to remove the taint and clear the impaired state. Result: Impaired nodes are removed from scheduling, giving the OpenShift admin a chance to fix the node and bring it back.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-17 06:42:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 1 Hemant Kumar 2017-05-25 19:01:47 UTC
There is nothing specific in the kubelet logs because the device isn't even attached to the node yet. When this happened, the node had only 5 EBS volumes attached to it (in other words, the node wasn't very crowded).

Ops fixed this by force-detaching the volume.

Comment 8 Hemant Kumar 2017-06-02 16:16:26 UTC
We don't use trailing digits in device naming: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/device_allocator.go#L82

The device pool looks like ["/dev/xvdbb", "/dev/xvdbc", ..., "/dev/xvdbz", "/dev/xvdcb", "/dev/xvdcc", ..., "/dev/xvdcz"].

Comment 10 Hemant Kumar 2017-12-20 01:46:50 UTC
We are going to handle the problem of volumes stuck in the "attaching" state by detecting such nodes as early as possible and stopping pods from being scheduled on them.

I have opened https://github.com/openshift/origin/pull/17544, which taints the node if a volume is stuck. This approach should bring the number of volumes stuck in the "attaching" state down to near zero, and the few odd problems that still happen can be resolved by the OpenShift admin.

Comment 11 Hemant Kumar 2018-01-16 21:53:59 UTC
Most instances of this bug are caused by admin error: admins not restarting nodes after force-detaching volumes. The fix that taints nodes with stuck volumes has been merged for 3.9. It ensures that no more than one volume can be stuck on a node at a time, and that admins are notified as early as possible whenever that happens.

The true root cause of this bug lies somewhere in the EBS stack, and we do not know enough to fix it there. I am hoping the mitigations we are putting in place in 3.9 help reduce this problem.

Comment 15 Chao Yang 2018-04-16 09:16:56 UTC
This passed on:
oc v3.9.20
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-85.ec2.internal:8443
openshift v3.9.20
kubernetes v1.9.1+a0ce1bc657

The node is tainted NoSchedule if a volume has been attaching to it for a long time.

Comment 17 Wei Sun 2018-04-28 02:22:03 UTC
Per comment 15, this has been verified. Changing to the previous state since it was moved to ON_QA by errata.

Comment 20 errata-xmlrpc 2018-05-17 06:42:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566