Bug 1455680
Summary: | Volume stuck in attaching state for 15 minutes | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Hemant Kumar <hekumar> |
Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
Status: | CLOSED ERRATA | QA Contact: | Chao Yang <chaoyang> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 3.5.1 | CC: | aos-bugs, bchilds, mwhittin, wsun |
Target Milestone: | --- | ||
Target Release: | 3.9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: A node enters an impaired state when a volume is force-detached from it and the node is not rebooted.
Consequence: Any new volume that we try to attach to the node is stuck in the attaching state.
Fix: Any node that has a volume stuck in the attaching state for more than 21 minutes is tainted. The node must be removed from the cluster and added back to remove the taint and clear its impaired state.
Result: Impaired nodes are removed from scheduling, giving the OpenShift admin the opportunity to fix the node and bring it back.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-05-17 06:42:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Comment 1
Hemant Kumar
2017-05-25 19:01:47 UTC
We don't use trailing digits in device naming: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/device_allocator.go#L82

The device pool looks like this: ["/dev/xvdbb", "/dev/xvdbc", ..., "/dev/xvdbz", "/dev/xvdcb", "/dev/xvdcc", ..., "/dev/xvdcz"]

We are going to handle the problem of volumes stuck in the "attaching" state by detecting such nodes as early as possible and stopping pods from being scheduled on them. I have opened https://github.com/openshift/origin/pull/17544, which taints the node if a volume is stuck. This approach should bring the number of volumes stuck in the "attaching" state close to zero, and the few odd problems that still happen can be resolved by the OpenShift admin. Most instances of this bug are caused by admin error and by admins not restarting nodes after force-detaching volumes.

The fix that implements the node taint for stuck volumes has been merged in 3.9. It ensures that no more than one volume can be stuck on a node and that admins can be notified as early as possible whenever that happens. The true root cause of this bug is somewhere in the EBS stack, and we do not know enough to fix it. I am hoping that the mitigations we are putting in place in 3.9 help reduce this problem.

This passed on:
oc v3.9.20
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-85.ec2.internal:8443
openshift v3.9.20
kubernetes v1.9.1+a0ce1bc657

The node is tainted with NoSchedule if a volume has been attaching to the node for a long time.

Per comment 15, it has been verified. Changing to the previous state since it was moved to ON_QA by errata.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566
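The mitigation described in this report (taint a node once a volume has been stuck in "attaching" for too long, so that no new pods are scheduled there) can be illustrated with a small sketch. This is not the code from the linked pull request. The 21-minute threshold comes from the Doc Text field of this report, while the `Attachment` and `Node` types, the helper functions, and the taint key are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Threshold from the Doc Text field of this report.
STUCK_THRESHOLD = timedelta(minutes=21)
# Hypothetical taint key; the key used by the real fix is not stated here.
STUCK_TAINT_KEY = "example.com/volume-stuck-attaching"


@dataclass
class Attachment:
    """Illustrative record of one volume attachment on a node."""
    volume_id: str
    state: str        # for example "attaching" or "attached"
    since: datetime   # when the attachment entered this state


@dataclass
class Node:
    """Illustrative node model: attachments plus a taint-key -> effect map."""
    name: str
    attachments: list = field(default_factory=list)
    taints: dict = field(default_factory=dict)


def stuck_volumes(node, now, threshold=STUCK_THRESHOLD):
    """Return IDs of volumes that have been 'attaching' longer than threshold."""
    return [
        a.volume_id
        for a in node.attachments
        if a.state == "attaching" and now - a.since > threshold
    ]


def taint_if_stuck(node, now):
    """Add a NoSchedule taint when any volume on the node is stuck."""
    stuck = stuck_volumes(node, now)
    if stuck:
        node.taints[STUCK_TAINT_KEY] = "NoSchedule"
    return stuck


def is_schedulable(node):
    """A node carrying any NoSchedule taint receives no new pods in this model."""
    return "NoSchedule" not in node.taints.values()
```

In this model, a node with a volume that has been attaching for 30 minutes is tainted and drops out of scheduling, while a node whose volume has been attaching for 5 minutes is left alone. Because a tainted node receives no new pods, no further attach attempts pile up on it, which matches the comment that no more than one volume can end up stuck per node.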