There is nothing specific in the kubelet logs because the device isn't even attached to the node yet. When this happened, the node had only 5 EBS volumes attached to it (in other words, the node wasn't very crowded).
Ops fixed this by force detaching the volume.
We don't use trailing digits in device naming: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/device_allocator.go#L82
The device pool looks like ["/dev/xvdbb", "/dev/xvdbc", ..., "/dev/xvdbz", "/dev/xvdcb", "/dev/xvdcc", ..., "/dev/xvdcz"].
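A minimal sketch of how such a pool could be generated, assuming the letters simply run 'b' through 'c' for the first position and 'b' through 'z' for the second (the linked device_allocator.go is the authoritative implementation; this is only illustrative):

```go
package main

import "fmt"

// buildPool sketches a device-name pool with no trailing digits:
// first letter 'b'..'c', second letter 'b'..'z', giving
// /dev/xvdbb ... /dev/xvdcz. Illustrative only; the real allocator
// in kubernetes may differ in details.
func buildPool() []string {
	var pool []string
	for first := 'b'; first <= 'c'; first++ {
		for second := 'b'; second <= 'z'; second++ {
			pool = append(pool, fmt.Sprintf("/dev/xvd%c%c", first, second))
		}
	}
	return pool
}

func main() {
	pool := buildPool()
	// prints: 50 /dev/xvdbb /dev/xvdcz
	fmt.Println(len(pool), pool[0], pool[len(pool)-1])
}
```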
We are going to handle the problem of volumes stuck in the "attaching" state by detecting such nodes as early as possible and stopping pods from being scheduled on them.
I have opened https://github.com/openshift/origin/pull/17544, which taints the node if a volume is stuck. This approach should bring the number of volumes stuck in the "attaching" state down to near zero, and the few odd problems that still happen can be resolved by an OpenShift admin.
Most instances of this bug are caused by admin error: admins not restarting nodes after force detaching volumes. The fix that taints nodes with stuck volumes has been merged in 3.9. This will ensure that never more than one volume can be stuck on a node, and that admins are notified as early as possible whenever that happens.
The true root cause of this bug lies somewhere in the EBS stack, and we do not know enough to fix it there. I am hoping the mitigations we are putting in place in 3.9 help reduce this problem.
This has passed verification on:
features: Basic-Auth GSSAPI Kerberos SPNEGO
The node will be tainted NoSchedule if a volume has been attaching to it for a long time.
Per comment 15, this has been verified. Changing back to the previous state since it was moved to ON_QA by the errata.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.