Bug 1481729
| Summary: | EBS issues on us-starter-east-2 | ||
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Paul Bergene <pbergene> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Chao Yang <chaoyang> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.x | CC: | abhgupta, aos-bugs, aos-storage-staff, bchilds, chaoyang, erich, hchen, hekumar, hongkliu, jfiala, jupierce, lxia, mifiedle, pbergene, rhowe, sspeiche, sten, tsmetana |
| Target Milestone: | --- | Keywords: | OnlinePro, OnlineStarter |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-03-05 18:15:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Please provide kubelet logs for ip-172-31-77-48.us-east-2.compute.internal and ip-172-31-79-193.us-east-2.compute.internal that cover 10:40~10:50.

Opened a smaller PR to fix it: https://github.com/kubernetes/kubernetes/pull/52221
We need to make the case that it is correct.

*** Bug 1472530 has been marked as a duplicate of this bug. ***

Upstream doesn't want to merge this fix so late in the 1.8 release, but has reviewed and approved it for merge into 1.9 when it opens. Per eparis, we will carry this patch: https://github.com/openshift/ose/pull/864
And Origin: https://github.com/openshift/origin/pull/16384

Starter is already on OCP 3.7 - can this bug be tested/verified on Starter?

The fix was merged upstream as well and is part of 3.7, 3.8 and 3.9. But just a reminder - this bug is about a narrow case:
1. Create a pod with an EBS volume.
2. Delete the pod after the volume has been attached to the new node but before the pod starts running there.
3. Before this fix, detaching alone took 6-8 minutes.
4. After the fix, the volume should be detached sooner.

The fix for this issue should be in INT/STG for the Pro tier at this point.

Verified this bug with the script below.
---------------------
#!/bin/bash
oc create -f https://raw.githubusercontent.com/chao007/v3-testfiles/master/persistent-volumes/ebs/dynamic-provisioning-pvc.json
#make sure pv and pvc is bound
sleep 5
oc create -f pod.yaml
sleep 6
oc describe pods mypod
oc get pod
oc delete pods mypod
------------------------
During the `sleep 6` step, the AWS web console shows the EBS volume as attached and the pod is in "ContainerCreating" status.
After `oc delete pods mypod`, the EBS volume goes into `detaching` status immediately and becomes available soon afterwards.

QE could not verify the EBS volume status from the AWS web console, so I tested it on the OCP product. The version is:
oc v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-2-46.ec2.internal:443
openshift v3.9.0-0.24.0
kubernetes v1.9.1+a0ce1bc657
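The verification above relies on watching the AWS web console to see how quickly the volume becomes available after the pod is deleted. A minimal sketch of measuring that from the shell follows. The helper names `wait_for_state` and `volume_state` are hypothetical and not part of the original test, and `volume_state` assumes the AWS CLI is installed and has credentials for the account that owns the volume.

```shell
#!/bin/bash
# wait_for_state WANT TIMEOUT_SECONDS CMD [ARGS...]
# Runs CMD repeatedly until its output equals WANT or TIMEOUT_SECONDS elapse.
# Prints the elapsed seconds and returns 0 on success; returns 1 on timeout.
wait_for_state() {
  local want="$1" timeout="$2" start=$SECONDS state
  shift 2
  while true; do
    state="$("$@")"
    if [ "$state" = "$want" ]; then
      echo $(( SECONDS - start ))
      return 0
    fi
    if [ $(( SECONDS - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Hypothetical state query (assumption: AWS CLI available and configured).
# Prints the volume state, for example "in-use" or "available".
volume_state() {
  aws ec2 describe-volumes --volume-ids "$1" \
    --query 'Volumes[0].State' --output text
}

# Example usage (VOLUME_ID is a placeholder you must supply):
#   oc delete pods mypod
#   wait_for_state available 120 volume_state "$VOLUME_ID"
```

With the fix in place, the printed elapsed time should be far below the 6-8 minutes seen before it. The query command is passed in as an argument so the polling loop can be exercised without AWS access.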
Description of problem:
With the upgrade to 3.6, it looks like some of the EBS problems are back. We are again seeing issues like this in the event logs
(ref: project aslak-che on the starter-us-east-2 cluster):

21m 26m 3 che-2-jr9lf Pod Warning FailedMount kubelet, ip-172-31-79-193.us-east-2.compute.internal Unable to mount volumes for pod "che-2-jr9lf_aslak-che(8834b53f-81a6-11e7-a1ae-0233cba325d9)": timeout expired waiting for volumes to attach/mount for pod "aslak-che"/"che-2-jr9lf". list of unattached/unmounted volumes=[che-data-volume]

21m 26m 3 che-2-jr9lf Pod Warning FailedSync kubelet, ip-172-31-79-193.us-east-2.compute.internal Error syncing pod

25m 25m 1 che-2-jr9lf Pod Warning FailedMount attachdetach Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance status code: 400, request id: cf794c0d-2580-4395-81a7-987f3766dce9. The volume is currently attached to instance "i-0e12b6108c6915c15"

23m 23m 1 che-2-jr9lf Pod Warning FailedMount attachdetach (combined from similar events): Failed to attach volume "pvc-950f4b94-814e-11e7-ac45-0233cba325d9" on node "ip-172-31-79-193.us-east-2.compute.internal" with: Error attaching EBS volume "vol-0f06f75a93ad3a6a0" to instance "i-0ca452e5adc5d3e40": VolumeInUse: vol-0f06f75a93ad3a6a0 is already attached to an instance status code: 400, request id: cf87853e-6ddf-40fa-acbe-8d44e57e91e4. The volume is currently attached to instance "i-0e12b6108c6915c15

Version-Release number of selected component (if applicable):
OpenShift Master: v3.6.173.0.5 (online version 3.5.0.20)
Kubernetes Master: v1.6.1+5115d708d7

How reproducible:
Try attaching an EBS volume

Steps to Reproduce:
1.
2.
3.

Actual results:
Volume not attaching

Expected results:
Volume attaching

Additional info:
Race condition, locks don't expire? Workaround was in place on 3.5.
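When triaging events like the ones in the description, it helps to pull out which volume failed to attach, which instance requested it, and which instance still holds it. The following is a minimal sketch that assumes the message wording matches the `VolumeInUse` events quoted in this report. The function name `parse_volume_in_use` is hypothetical, and the pattern expects one event per input line.

```shell
#!/bin/bash
# parse_volume_in_use: reads event text on stdin, one event per line.
# For each VolumeInUse attach failure it prints three fields:
#   <volume-id> <instance that requested the attach> <instance holding the volume>
# Assumption: the message format is the one shown in this bug report.
parse_volume_in_use() {
  sed -n -E 's/.*Error attaching EBS volume "(vol-[0-9a-f]+)" to instance "(i-[0-9a-f]+)".*currently attached to instance "(i-[0-9a-f]+).*/\1 \2 \3/p'
}

# Example usage (assumes you are logged in to the affected project):
#   oc get events | parse_volume_in_use | sort -u
```

Lines that do not match the pattern are dropped, so the output lists only the stuck volume and the two instances involved.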