Description of problem: We are seeing occurrences where even freshly created PVCs report that the backing EBS volume cannot be attached. We believe this may be down to OpenShift trying to re-use an EBS volume that already exists in the cluster. We were unable to investigate the OpenShift logs, as we do not have that level of access.

1) Should we really be seeing these errors, where an EBS volume fails to attach because it is already attached elsewhere, for a brand-new PVC claim?
2) If the EBS volume is indeed being recycled, should it not have been detached from the other node when the corresponding claim was deleted?
3) This means that every new deployment has a high chance of getting stuck in this state.

Sample error event right after creating a new project, creating the PVC, and running the pod that uses it:

  12:45:24 PM  jenkins-1-7sm8n  Failed to attach volume "pvc-60471f01-8cd9-11e7-b495-02d7377a4b17" on node "ip-172-31-71-117.us-east-2.compute.internal" with: Error attaching EBS volume "vol-04f6fecacace99087" to instance "i-0fc1543476b107dcf": VolumeInUse: vol-04f6fecacace99087 is already attached to an instance status code: 400, request id: a7b9219a-2262-4b4f-a36a-1ea533b595e3. The volume is currently attached to instance "i-070649039c09d5ee5"
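For reference, a rough sketch of the reproduction described above, assuming a default dynamically provisioning StorageClass (gp2 on AWS); the project, claim, and pod names here are illustrative only, not taken from the affected cluster.

pvc.yaml:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: test-claim
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi

pod.yaml:

  apiVersion: v1
  kind: Pod
  metadata:
    name: test-pod
  spec:
    containers:
    - name: busybox
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
      - name: data
        mountPath: /data
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-claim

Then:

  oc new-project ebs-attach-test
  oc create -f pvc.yaml
  oc create -f pod.yaml
  oc get events -w    # watch for the attach error shown above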
It's hard to guess what is going on without logs. The one known problem we are solving right now is described in bug #1481729: if AWS takes a long time to attach a volume and the pod requesting the volume is deleted in the meantime, the volume still gets attached, but the controller will never (or only after a long timeout) detach it, since it considers it mounted. If this is the same problem, then I have a fix suggested; however, the patch basically adds a synchronization "mechanism" between the kubelet and the controller, so I'm fixing the kubelet and attach/detach controller (ADC) tests for it, and I expect some discussion around it upstream too (too many components are involved).
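In the meantime, one way to confirm from the AWS side that the volume really is stuck on the old instance (and, if nothing mounts it any more, to clear it manually) is a plain AWS CLI check. This is just a generic sketch using the IDs from the event above, not part of the proposed fix; only detach volumes that are genuinely no longer mounted:

  # where does AWS think the volume is attached?
  aws ec2 describe-volumes --volume-ids vol-04f6fecacace99087 \
      --query 'Volumes[0].Attachments'

  # if it is still attached to the old instance and no pod uses it any more,
  # detach it manually so the controller can attach it to the new node
  aws ec2 detach-volume --volume-id vol-04f6fecacace99087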
This issue affects: https://github.com/openshiftio/openshift.io/issues/666
Created attachment 1320280 [details]
OpenShift.io Che pod log

Attached the OpenShift.io Che pod log from the related issue: https://github.com/openshiftio/openshift.io/issues/666
These are event logs... there is not much to discover there. Do we have the controller and kubelet logs? What was instance "i-0e724bdbe7dea5968" actually? It seems like something grabbed the newly created PVC as soon as it was created...
We need the logs from the kubelet on the affected nodes and from the controller on the master. It is not possible to deduce what is going on here just from the pod events.
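For reference, on an OpenShift 3.x install these usually live in the systemd journal; the unit names below assume a default RPM install (on an all-in-one master the controller unit may be atomic-openshift-master instead):

  # on the affected node(s): kubelet / node logs
  journalctl -u atomic-openshift-node > node.log

  # on the master: controller (attach/detach controller) logs
  journalctl -u atomic-openshift-master-controllers > controller.log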
Marking as "UpcomingRelease".
I'm tempted to close this one with "Insufficient data". However, we have discovered that we are running into AWS API quota issues on the online cluster; I think that might explain the cause of the problem.
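If that is the case, the controller logs (once we have them) should contain AWS throttling errors; a quick check, assuming the controller.log collected as described above:

  # RequestLimitExceeded is the EC2 error code returned when the API request rate limit is hit
  grep -c RequestLimitExceeded controller.log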
I have not seen this issue again since the last cluster upgrades. Let's close this; I will re-open with the requested logs if the issue comes back.
OK. Thanks for the response.