Description of problem:

With the upgrade to 3.6, it looks like some of the EBS problems are back; we are now seeing issues like this in the event logs again (ref: project aslak-che on the starter-us-east-2 cluster):

21m  26m  3  che-2-jr9lf  Pod  Warning  FailedMount  kubelet, ip-172-31-79-193.us-east-2.compute.internal  Unable to mount volumes for pod "che-2-jr9lf_aslak-che(8834b53f-81a6-11e7-a1ae-0233cba325d9)": timeout expired waiting for volumes to attach/mount for pod "aslak-che"/"che-2-jr9lf". list of unattached/unmounted volumes=[che-data-volume]

21m  26m  3  che-2-jr9lf  Pod  Warning  FailedSync  kubelet, ip-172-31-79-193.us-east-2.compute.internal  Error syncing pod

25m  25m  1  che-2-jr9lf  Pod  Warning  FailedMount  attachdetach  Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance status code: 400, request id: cf794c0d-2580-4395-81a7-987f3766dce9. The volume is currently attached to instance "i-0e12b6108c6915c15"

23m  23m  1  che-2-jr9lf  Pod  Warning  FailedMount  attachdetach  (combined from similar events): Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance status code: 400, request id: cf87853e-6ddf-40fa-acbe-8d44e57e91e4. The volume is currently attached to instance "i-0e12b6108c6915c15"

Version-Release number of selected component (if applicable):
OpenShift Master: v3.6.173.0.5 (online version 3.5.0.20)
Kubernetes Master: v1.6.1+5115d708d7

How reproducible:
Try attaching an EBS volume.

Steps to Reproduce:
1.
2.
3.

Actual results:
Volume not attaching.

Expected results:
Volume attaching.

Additional info:
Race condition, locks don't expire? A workaround was in place on 3.5.
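(Not part of the original report.) A minimal reproduction sketch of the scenario described above, assuming a default EBS-backed StorageClass for dynamic provisioning and a logged-in `oc` client; the claim/pod names and image are placeholders, not taken from the events:
---------------------
#!/bin/bash
# Reproduction sketch -- names and image are assumptions, not from the report.

# 1. Dynamically provision an EBS-backed PVC.
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF

# Helper that creates a pod mounting the claim; the pod name is passed as $1.
make_pod() {
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ebs-claim
EOF
}

# 2. Start the first pod and wait until it is Running (volume attached).
make_pod ebs-pod-1
oc get pod ebs-pod-1 -w        # Ctrl-C once the pod is Running

# 3. Delete it and immediately create a second pod using the same claim.
#    If the new pod lands on a different node while the old attachment is still
#    held, the FailedMount / VolumeInUse events shown above are reproduced.
oc delete pod ebs-pod-1
make_pod ebs-pod-2

# 4. Watch the namespace events for the failure.
oc get events -w | grep -E 'FailedMount|FailedSync'
------------------------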
Please provide kubelet logs for ip-172-31-77-48.us-east-2.compute.internal and ip-172-31-79-193.us-east-2.compute.internal covering 10:40-10:50.
Opened a smaller PR to fix it: https://github.com/kubernetes/kubernetes/pull/52221. We need to convince upstream that it is correct.
*** Bug 1472530 has been marked as a duplicate of this bug. ***
Upstream doesn't want to merge this fix so late in the 1.8 release, but it has been reviewed and approved for merge into 1.9 once it opens. Per eparis, we will carry this patch: https://github.com/openshift/ose/pull/864 and in Origin: https://github.com/openshift/origin/pull/16384
Starter is already on OCP 3.7 - can this bug be tested/verified on Starter?
The fix was merged upstream as well and is part of 3.7, 3.8 and 3.9. But just a reminder - this bug is about a narrow case:
1. Create a pod with an EBS volume.
2. Delete the pod after the volume has been attached to the new node, but before the pod starts running on it.
3. Before this fix, detaching the volume itself took 6-8 minutes.
4. After the fix, the volume should be detached sooner.
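(Not from the comment above.) A rough way to time this narrow case, assuming a configured AWS CLI; the pod name and EBS volume ID below are placeholders and must be replaced with the real pod and the volume backing its dynamically provisioned PV:
---------------------
#!/bin/bash
# Placeholder values -- substitute the real pod name and the volume ID taken
# from the dynamically provisioned PV.
POD=mypod
VOL=vol-0123456789abcdef0

# Delete the pod while it is still ContainerCreating (volume attached, pod not running).
oc delete pod "$POD"

# Poll the attachment state once per second. Before the fix the volume stayed
# attached/detaching for 6-8 minutes; after the fix it should detach much sooner.
start=$(date +%s)
while true; do
  state=$(aws ec2 describe-volumes --volume-ids "$VOL" \
            --query 'Volumes[0].Attachments[0].State' --output text)
  echo "$(( $(date +%s) - start ))s: ${state}"
  [ "$state" = "None" ] && break    # no attachment left -> detach completed
  sleep 1
done
------------------------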
The fix for this issue should be in INT/STG for the Pro tier at this point.
Verified this bug with the script below.
---------------------
#!/bin/bash
oc create -f https://raw.githubusercontent.com/chao007/v3-testfiles/master/persistent-volumes/ebs/dynamic-provisioning-pvc.json
# make sure the PV and PVC are bound
sleep 5
oc create -f pod.yaml
sleep 6
oc describe pods mypod
oc get pod
oc delete pods mypod
------------------------
During the `sleep 6` step, I can see from the AWS web console that the EBS volume is in "attached" status and the pod is in "ContainerCreating" status.
After `oc delete pods mypod`, the EBS volume goes into "detaching" status immediately and becomes available soon after.

QE could not verify the EBS volume status from the AWS web console on Starter, so I tested it on the OCP product instead; the version is:

oc v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip-172-18-2-46.ec2.internal:443
openshift v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657
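The pod.yaml referenced in the script is not included in the comment; a minimal pod spec of the kind presumably used could be generated like this (the image and claimName are assumptions -- claimName must match the PVC created by dynamic-provisioning-pvc.json):
---------------------
#!/bin/bash
# Hypothetical pod.yaml for the verification script above; adjust claimName to
# the claim created by dynamic-provisioning-pvc.json.
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: registry.access.redhat.com/rhel7
    command: ["sleep", "3600"]
    volumeMounts:
    - name: ebs-data
      mountPath: /mnt/ebs
  volumes:
  - name: ebs-data
    persistentVolumeClaim:
      claimName: ebs-claim
EOF
------------------------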