Red Hat Bugzilla – Bug 1463717
EBS volume stuck on other instance
Last modified: 2018-01-16 16:48:56 EST
Description of problem:
EBS volume stuck on another instance. Error messages:
Unable to mount volumes for pod "postgres-9-xyz...": timeout expired waiting for volumes to attach/mount for pod "app-test"/"postgres-9-....". list of unattached/unmounted volumes=[postgres-data]
Version-Release number of selected component (if applicable):
This only happens on one account; I tried creating a different account but was unable to reproduce it. It seems likely to be related to restarted instances and/or a failure to remove the volume from the instance after taking down the pod.
Steps to Reproduce:
I'm not sure whether these steps will trigger it, but I assume they could:
1. Create a postgres pod
2. Force the instance down, or make it crash, without gracefully unmounting EBS volumes
3. Redeploy the pod (it should go to a new instance)
Actual results:
Volume is stuck on the previous instance.
Expected results:
EBS volume is moved to the new instance, so the pod can attach to it.
@Leif - is there a chance you can post the exact error message you saw, with the exact pod name and PVC name?
Did you delete the project afterwards?
I cannot get the exact pod name, unfortunately, as the pod has been recreated and now works. The monitoring events are gone (12h retention?).
Here's the only error message I stored locally.
Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-48-232.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-0673191765eefd002": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:
The project is not deleted.
Okay, thank you. Also, to confirm again: were you using the OpenShift Online environment, or an internal OpenShift cluster running on AWS?
> Force instance down or to crash, without gracefully unmounting ebs volumes
Can you elaborate on that? Did you terminate the EC2 instance, did you just shut it down, or did it crash?
@Hemant, this question originally came through the OpenShift Online community support form. I can confirm that this user is provisioned on starter-us-west-2, where he experienced the above issue.
@Hemant Sorry for the late reply. Yes, on OpenShift Online.
The bug appears again:
Messages in events:
Successfully assigned postgres-11-ws9zp to ip-172-31-61-198.us-west-2.compute.internal
Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-61-198.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-07d2474b3b0cc27be": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:
4 times in the last 2 minutes
pulling image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"
Successfully pulled image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"
Created container with docker id 24620cd49343; Security:[seccomp=unconfined]
Started container with docker id 24620cd49343
This time the error message appeared only 4 times, and then the volume seems to have been moved. So the error currently seems partially fixed, except that the volume stays stuck for a while on a different instance.
I did not crash or take down an instance; I was only suggesting that might be a way to trigger the bug, since I'm unable to reproduce it by creating a new project.
Yeah, it is expected that the volume will not move immediately, because it has to be detached from the old instance and attached to the new one. What should never happen is the attach on the new instance taking forever.
I am still investigating.
I'm currently having a very similar issue on starter-us-west-1 with a MySQL database. I scaled down the database application from 1 to 0 pods, which I remember taking a long time. I did this because I was having issues deploying a new version of my Python Flask application in the same project and wanted to see if this would help. Shortly after this, I wanted to scale the database back up again to 1 pod. However, I now keep getting the following errors:
- Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-172-31-21-202.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-0a24213452d493c6e": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 5d1308eb-221d-427f-a0bc-5f419f055a70. The volume is currently attached to instance "i-08717eab8bf9d3a15"
- Unable to mount volumes for pod "mysql-10-5sb5f_uytdenhouwen(2d1556d5-bd4b-11e7-987b-06579ed29230)": timeout expired waiting for volumes to attach/mount for pod "uytdenhouwen"/"mysql-10-5sb5f". list of unattached/unmounted volumes=[volume-4lwk8]
- Error syncing pod
These errors keep rotating as long as the pod is trying to create the container. Let me know if you need more information.
We have implemented a generic recovery mechanism in OpenShift 3.9, which detects volumes stuck on another instance (when no pod on that instance is actively using the volume) and detaches them if necessary.
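As a rough illustration of the idea (this is not the actual controller-manager code; the function and data shapes here are made up for the sketch), the recovery pass amounts to comparing where the cloud provider says each volume is attached against where a pod actually needs it:

```python
# Hypothetical sketch of the 3.9-style recovery idea: a volume is "stuck" when
# the cloud reports it attached to a node that no pod needs it on, so it can be
# force-detached and reattached where it is wanted. Illustrative only.

def find_stuck_volumes(actual_attachments, desired_mounts):
    """actual_attachments: dict of volume_id -> node the cloud says it is attached to.
    desired_mounts: dict of volume_id -> node where a pod currently needs the volume.
    Returns the volume ids that should be force-detached from their current node."""
    stuck = []
    for vol, attached_node in actual_attachments.items():
        wanted_node = desired_mounts.get(vol)
        # Stuck if the volume is needed on a different node, or not needed at
        # all, while still attached to the old node.
        if wanted_node != attached_node:
            stuck.append(vol)
    return stuck
```

With the volume and instance ids from the events above, the volume attached to the old instance but wanted on the new one would be flagged for detach.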
One easy way to reproduce this problem (before 3.9) is:
1. Create a standalone pod (no deployments, RCs, etc.) with volumes.
2. Shut down the node.
3. Now wait for the pod on the node to be deleted.
4. Once the pod is deleted (spam kubectl get pods) but before the controller-manager can detach the volume (there is a minimum 6-minute delay), restart the controller-manager.
5. The above action will cause the volume attachment information to be wiped from the controller-manager.
6. Now try to attach the same PVC in another pod (it may be scheduled on a different node). The pod will get stuck in "ContainerCreating" state on 3.7, but not on 3.9.
There are a few other ways to reproduce this error, but this is perhaps the easiest.
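For step 1, a minimal standalone pod spec with a PVC-backed volume might look like the following (pod, container, and claim names are illustrative; the claim is assumed to already be bound to a dynamically provisioned EBS volume):

```yaml
# Illustrative standalone pod (no deployment/RC) mounting a PVC.
apiVersion: v1
kind: Pod
metadata:
  name: standalone-db
spec:
  containers:
  - name: db
    image: registry.access.redhat.com/rhscl/postgresql-95-rhel7
    volumeMounts:
    - name: data
      mountPath: /var/lib/pgsql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres-data   # existing PVC bound to an EBS-backed PV
```

Because the pod is not managed by a controller, deleting the node removes the pod without anything recreating it, which is what makes this sequence easy to time.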