Bug 1486523 - New EBS PVs sometimes can't attach and result in errors in events & multiple retries
Summary: New EBS PVs sometimes can't attach and result in errors in events & multiple retries
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Storage
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Tomas Smetana
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-30 02:50 UTC by jchevret
Modified: 2017-10-12 14:14 UTC
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-12 14:14:41 UTC
Target Upstream Version:
Embargoed:


Attachments
OpenShift.io Che pod log attached (6.00 KB, text/plain)
2017-08-30 19:48 UTC, Len DiMaggio

Description jchevret 2017-08-30 02:50:47 UTC
Description of problem:

We are seeing occurrences where even fresh PVCs report that the backing EBS volume can't be attached. We believe this might be down to OpenShift trying to re-use an existing EBS volume already in the cluster. We were unable to investigate the OpenShift logs as we do not have that level of access.

1) Should we really be seeing these errors, where an EBS volume fails to attach because it is already attached elsewhere, for a brand-new PVC?

2) If the EBS volume is indeed being recycled, should it not have been detached from the other node when the corresponding claim was deleted?

3) This means that every new deployment has a high chance of getting stuck in this state.

Sample error event right after creating a new project, creating the PVC, and running the pod that uses it:

12:45:24 PM jenkins-1-7sm8n Failed to attach volume "pvc-60471f01-8cd9-11e7-b495-02d7377a4b17" on node "ip-172-31-71-117.us-east-2.compute.internal" with: Error attaching EBS volume "vol-04f6fecacace99087" to instance "i-0fc1543476b107dcf": VolumeInUse: vol-04f6fecacace99087 is already attached to an instance status code: 400, request id: a7b9219a-2262-4b4f-a36a-1ea533b595e3. The volume is currently attached to instance "i-070649039c09d5ee5"
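
For reference, a minimal sketch of the reproduction flow described above; the project, claim, and pod names are illustrative, not taken from the affected cluster:

# Create a fresh project and a PVC backed by the default EBS provisioner
oc new-project ebs-attach-test
oc create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# Run a pod that mounts the claim
oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-claim
EOF

# Watch the events for the attach failure
oc get events -w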

Comment 1 Tomas Smetana 2017-08-30 07:26:34 UTC
It's hard to guess without logs what is going on. The one known problem we are solving right now is described in bug #1481729: if AWS takes a long time to attach a volume to a node and the pod requesting the volume is deleted in the meantime, the volume gets attached, but the controller will never (== not until after a long timeout) detach it, since it considers it mounted.

If this is the same problem then I have a suggested fix. However, the patch basically adds a synchronization "mechanism" between the kubelet and the controller, so I'm fixing the kubelet and ADC (attach/detach controller) tests for it, and I expect some discussion around it upstream too (too many components involved).
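
A rough sketch of how the window described above could be hit; the manifest file and pod name here are hypothetical, and this only illustrates the timing, not a confirmed reproducer:

# Pod creation triggers AttachVolume for the claim's EBS volume on node A
oc create -f pod-using-claim.yaml

# Delete the pod while the AWS attach call may still be in flight
sleep 2
oc delete pod test-pod --grace-period=0

# A replacement pod may schedule onto node B while the volume is still
# attached to node A, producing the VolumeInUse error from the report
oc create -f pod-using-claim.yaml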

Comment 2 Len DiMaggio 2017-08-30 19:46:19 UTC
This issue affects:  https://github.com/openshiftio/openshift.io/issues/666

Comment 3 Len DiMaggio 2017-08-30 19:48:47 UTC
Created attachment 1320280 [details]
OpenShift.io Che pod log attached

OpenShift.io Che pod log from related issue: https://github.com/openshiftio/openshift.io/issues/666

Comment 4 Tomas Smetana 2017-08-31 08:12:45 UTC
These are event logs... There is not much to discover there: do we have the controller and kubelet logs? What actually was the instance "i-0e724bdbe7dea5968"?

It seems like something grabbed the newly created PVC as soon as it was created...

Comment 5 Tomas Smetana 2017-09-08 14:48:34 UTC
We need the logs from the kubelet on the affected nodes and from the controller on the master. It is not possible to deduce what is going on here just from the pod events.
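
For anyone else hitting this, the requested data could be gathered roughly as follows; the journald unit names assume an OpenShift 3.x RPM install and may differ per environment:

# On the affected node: kubelet logs
journalctl -u atomic-openshift-node --since "1 hour ago" > node.log

# On the master: controller logs (HA installs use the
# atomic-openshift-master-controllers unit; single-master installs
# may use atomic-openshift-master instead)
journalctl -u atomic-openshift-master-controllers --since "1 hour ago" > controller.log

# Cross-check the volume's actual attachment state on the AWS side,
# using the volume ID from the event in the description
aws ec2 describe-volumes --volume-ids vol-04f6fecacace99087 \
    --query 'Volumes[0].Attachments'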

Comment 6 Tomas Smetana 2017-09-08 14:54:29 UTC
Marking as "UpcomingRelease".

Comment 7 Tomas Smetana 2017-10-12 08:04:20 UTC
I'm tempted to close this one with "Insufficient data". However, we have discovered that we run into API quota issues on the online cluster, which I think might explain the cause of the problem.
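
If the controller logs ever become available, AWS API throttling would show up as RequestLimitExceeded errors; a quick check against the log file gathered earlier (file name hypothetical):

# EC2 returns RequestLimitExceeded when API request quotas are exceeded
grep -i 'RequestLimitExceeded' controller.log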

Comment 8 jchevret 2017-10-12 13:17:09 UTC
I have not seen this issue again since the last cluster upgrades. Let's close it, and I will re-open with the requested logs if the issue comes back.

Comment 9 Tomas Smetana 2017-10-12 14:14:41 UTC
OK. Thanks for the response.

