Bug 1486523

Summary: New EBS PVs sometimes can't attach, resulting in errors in events and multiple retries
Product: OpenShift Online
Reporter: jchevret
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Jianwei Hou <jhou>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.x
CC: aos-bugs, aos-storage-staff, jchevret, ldimaggi, xtian
Target Milestone: ---
Keywords: OnlineStarter
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-12 14:14:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
OpenShift.io Che pod log attached (flags: none)

Description jchevret 2017-08-30 02:50:47 UTC
Description of problem:

We are seeing occurrences where even fresh PVCs report that the backing EBS
volume can't be attached. We believe this may be down to OpenShift trying to
re-use an existing EBS volume already in the cluster. We were unable to investigate the OpenShift logs as we do not have that level of access.

1) Should we really be seeing these errors, where an EBS volume fails to
attach because it is already attached elsewhere, for a brand-new PVC?

2) If the EBS volume is indeed being recycled, should it not have been detached
from the other node when the corresponding claim was deleted?

3) This means that every new deployment has a high chance of
getting stuck in this state.

Sample error event right after creating a new project, creating the PVC, and running the pod that uses it:

12:45:24 PM jenkins-1-7sm8n Failed to attach volume "pvc-60471f01-8cd9-11e7-b495-02d7377a4b17" on node "ip-172-31-71-117.us-east-2.compute.internal" with: Error attaching EBS volume "vol-04f6fecacace99087" to instance "i-0fc1543476b107dcf": VolumeInUse: vol-04f6fecacace99087 is already attached to an instance status code: 400, request id: a7b9219a-2262-4b4f-a36a-1ea533b595e3. The volume is currently attached to instance "i-070649039c09d5ee5"
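The event above names both the EBS volume and the instance that still holds it. As a quick triage step (a sketch, not part of this report's original investigation), those IDs can be pulled out of the event text before querying AWS; the event string below is a trimmed copy of the one above:

```shell
# Extract the EBS volume ID and the instance currently holding it from
# the attach-failure event text.
event='Failed to attach volume "pvc-60471f01-8cd9-11e7-b495-02d7377a4b17" on node "ip-172-31-71-117.us-east-2.compute.internal" with: Error attaching EBS volume "vol-04f6fecacace99087" to instance "i-0fc1543476b107dcf": VolumeInUse: vol-04f6fecacace99087 is already attached to an instance. The volume is currently attached to instance "i-070649039c09d5ee5"'

# First vol-... match is the PV's backing volume; last i-... match is the
# instance the volume is still attached to.
vol=$(printf '%s' "$event" | grep -o 'vol-[0-9a-f]*' | head -n1)
holder=$(printf '%s' "$event" | grep -o 'i-[0-9a-f]*' | tail -n1)
echo "volume=$vol holder=$holder"

# With AWS credentials one could then confirm the attachment, e.g.:
#   aws ec2 describe-volumes --volume-ids "$vol" \
#     --query 'Volumes[0].Attachments[].[InstanceId,State]'
```

This only confirms what the event already claims; the interesting question (why the controller never detached the volume) still needs the controller logs requested below.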

Comment 1 Tomas Smetana 2017-08-30 07:26:34 UTC
It's hard to guess what is going on without logs. The one known problem we are solving right now is described in bug #1481729: if AWS takes a long time to attach a volume to a node and the pod requesting the volume is deleted in the meantime, the volume gets attached but the controller will never detach it (or only after a long timeout), since it considers it mounted.

If this is the same problem then I have a suggested fix. However, the patch basically adds a synchronization "mechanism" between the kubelet and the controller, so I'm fixing the kubelet and ADC (attach/detach controller) tests for it, and I expect some discussion around it upstream too (too many components involved).

Comment 2 Len DiMaggio 2017-08-30 19:46:19 UTC
This issue affects:  https://github.com/openshiftio/openshift.io/issues/666

Comment 3 Len DiMaggio 2017-08-30 19:48:47 UTC
Created attachment 1320280 [details]
OpenShift.io Che pod log attached

OpenShift.io Che pod log from related issue: https://github.com/openshiftio/openshift.io/issues/666

Comment 4 Tomas Smetana 2017-08-31 08:12:45 UTC
These are event logs... There is not much to discover there: do we have the controller and kubelet logs? What actually was the instance "i-0e724bdbe7dea5968"?

It seems like something grabbed the newly created PVC as soon as it was created...

Comment 5 Tomas Smetana 2017-09-08 14:48:34 UTC
We need the logs from the kubelet on the affected nodes and from the controller on the master. It is not possible to deduce what is going on here just from the pod events.
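For reference, on an OpenShift 3.x RPM install those logs would typically be pulled from the journal. A sketch of the collection commands (the unit names below assume the default 3.x service names; adjust for the actual install):

```shell
# Print the journalctl commands to run on the affected nodes (kubelet)
# and on the master (controllers) to capture logs around the failure.
# Unit names assume a default OpenShift 3.x install.
cmds=$(for unit in atomic-openshift-node atomic-openshift-master-controllers; do
  echo "journalctl -u $unit --since '2017-08-30 02:00' > /tmp/$unit.log"
done)
echo "$cmds"
```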

Comment 6 Tomas Smetana 2017-09-08 14:54:29 UTC
Marking as "UpcomingRelease".

Comment 7 Tomas Smetana 2017-10-12 08:04:20 UTC
I'm tempted to close this one with "Insufficient data". However, we have discovered that we are running into API quota issues on the online cluster, which might explain the cause of the problem.

Comment 8 jchevret 2017-10-12 13:17:09 UTC
I have not seen this issue again since the last cluster upgrades. Let's close, and I will re-open with the requested logs if the issue comes back.

Comment 9 Tomas Smetana 2017-10-12 14:14:41 UTC
OK. Thanks for the response.