Bug 1463717 - ebs volume stuck on other instance
Status: MODIFIED
Product: OpenShift Online
Classification: Red Hat
Component: Storage
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Severity: low
Assigned To: Hemant Kumar
QA Contact: Jianwei Hou
Whiteboard: OnlineStarter
Reported: 2017-06-21 10:49 EDT by Leif Ringstad
Modified: 2018-01-16 16:48 EST
CC List: 9 users

Type: Bug

Attachments: None
Description Leif Ringstad 2017-06-21 10:49:09 EDT
Description of problem:
EBS volume stuck on another instance. Error message:

Unable to mount volumes for pod "postgres-9-xyz...": timeout expired waiting for volumes to attach/mount for pod "app-test"/"postgres-9-....". list of unattached/unmounted volumes=[postgres-data]

Version-Release number of selected component (if applicable):

How reproducible:
This only happens on one account; I tried creating a different account but was unable to reproduce. It seems likely to be related to restarted instances and/or a failure to remove the volume from the instance after taking down the pod.

Steps to Reproduce:

I'm not sure these steps will trigger it, but I assume they could:
1. Create a postgres pod
2. Force the instance down, or make it crash, without gracefully unmounting EBS volumes
3. Redeploy the pod (it should go to a new instance)

Actual results:
The volume is stuck on the previous instance.

Expected results:
The EBS volume is moved to the new instance, so the pod can attach to it.


Additional info:
Comment 1 Hemant Kumar 2017-06-21 11:57:23 EDT
@Leif - is there a chance you can post the exact error message you saw, with the exact pod name and PVC name?

Did you delete the project afterwards?
Comment 3 Leif Ringstad 2017-06-22 03:02:22 EDT
I cannot get the exact pod name, unfortunately, as the pod has been recreated and now works. The monitoring events are gone (12h retention?).

Here's the only error message I stored locally.

Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-48-232.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-0673191765eefd002": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:

The project is not deleted.
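As an aside for anyone triaging the same symptom: the VolumeInUse message embeds both the volume ID and the instance the attach was targeting, so the relevant IDs can be pulled out of logs mechanically. The helper below is an illustrative sketch (not part of OpenShift or AWS tooling):

```python
import re
from typing import Optional

# Matches the AWS "VolumeInUse" attach error quoted above and captures
# the volume ID and the instance the attach was attempted on.
VOLUME_IN_USE = re.compile(
    r'Error attaching EBS volume "(?P<volume>vol-[0-9a-f]+)" '
    r'to instance "(?P<target>i-[0-9a-f]+)": '
    r'VolumeInUse: (?P=volume) is already attached to an instance'
)

def parse_volume_in_use(message: str) -> Optional[dict]:
    """Return {'volume': ..., 'target': ...} for a VolumeInUse error, else None."""
    m = VOLUME_IN_USE.search(message)
    return m.groupdict() if m else None

error = (
    'Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" '
    'on node "ip-172-31-48-232.us-west-2.compute.internal" with: '
    'Error attaching EBS volume "vol-0358c51a80c111fa4" to instance '
    '"i-0673191765eefd002": VolumeInUse: vol-0358c51a80c111fa4 is already '
    'attached to an instance status code: 400, request id:'
)
info = parse_volume_in_use(error)
# info == {'volume': 'vol-0358c51a80c111fa4', 'target': 'i-0673191765eefd002'}
```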
Comment 4 Hemant Kumar 2017-06-22 14:10:22 EDT
Okay, thank you. Also, to confirm again: were you using the OpenShift Online environment, or an internal OpenShift cluster running on AWS?
Comment 5 Hemant Kumar 2017-06-22 14:11:37 EDT
> Force instance down or to crash, without gracefully unmounting ebs volumes

Can you elaborate on that? Did you terminate the EC2 instance, did you just shut it down, or did it crash?
Comment 6 Will Gordon 2017-06-22 14:59:54 EDT
@Hemant, this question originally came through the OpenShift Online community support form. I can confirm that this user is provisioned on starter-us-west-2, where he experienced the above issue.
Comment 7 Leif Ringstad 2017-06-22 17:08:07 EDT
@Hemant Sorry for the late reply. Yes, on OpenShift Online.

The bug has appeared again:

Messages in events:

---

Successfully assigned postgres-11-ws9zp to ip-172-31-61-198.us-west-2.compute.internal

Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-61-198.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-07d2474b3b0cc27be": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:
4 times in the last 2 minutes

pulling image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"

Successfully pulled image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"

Created container with docker id 24620cd49343; Security:[seccomp=unconfined]

Started container with docker id 24620cd49343
---

This time the error message only appeared 4 times, and then the volume seems to have been moved. So currently the error seems partially fixed, except that the volume stays stuck for a while on a different instance.

I did not crash or take down an instance; I was only suggesting that that might be a way to trigger the bug, since I'm unable to recreate it by creating a new project.
Comment 8 Hemant Kumar 2017-06-22 17:12:24 EDT
Yeah, it is expected that the volume will not move immediately, because it has to be detached from the old instance and attached to the new one. What should never happen is the attach on the new instance taking forever.

I am still investigating.
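The detach-then-attach sequence described above can be sketched with a small in-memory model. This is illustrative only: FakeCloud and move_volume are hypothetical stand-ins, not OpenShift's attach/detach controller or the AWS API, but they show why the attach must fail with VolumeInUse until the detach from the old instance completes:

```python
class FakeCloud:
    """In-memory stand-in for the EC2 attach/detach API (illustrative only)."""

    def __init__(self, attachments):
        # volume_id -> instance_id currently holding the volume
        self.attachments = dict(attachments)

    def attach(self, volume_id, instance_id):
        holder = self.attachments.get(volume_id)
        if holder is not None and holder != instance_id:
            # Mirrors the VolumeInUse error seen in the logs above.
            raise RuntimeError(
                f"VolumeInUse: {volume_id} is already attached to {holder}")
        self.attachments[volume_id] = instance_id

    def detach(self, volume_id):
        self.attachments.pop(volume_id, None)

def move_volume(cloud, volume_id, new_instance, retries=3):
    """Detach the volume from its old instance, then attach it to the new one.

    Loosely mirrors what the controller has to do: the attach on the new
    node can only succeed once the detach from the old node has finished,
    so a short delay is expected; an attach that never succeeds is the bug.
    """
    for _ in range(retries):
        try:
            cloud.attach(volume_id, new_instance)
            return True
        except RuntimeError:
            # The real controller waits for the detach to complete here.
            cloud.detach(volume_id)
    return False

cloud = FakeCloud({"vol-1": "i-old"})
moved = move_volume(cloud, "vol-1", "i-new")
# moved is True, and "vol-1" is now attached to "i-new"
```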
Comment 14 uyt.95 2017-10-30 09:47:22 EDT
I'm currently having a very similar issue on starter-us-west-1 with a MySQL database. I scaled down the database application from 1 to 0 pods, which I remember taking a long time. I did this because I was having issues deploying a new version of my Python Flask application in the same project and wanted to see if this would help. Shortly after this, I wanted to scale the database back up again to 1 pod. However, I now keep getting the following errors:
- Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-172-31-21-202.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-0a24213452d493c6e": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 5d1308eb-221d-427f-a0bc-5f419f055a70. The volume is currently attached to instance "i-08717eab8bf9d3a15"
- Unable to mount volumes for pod "mysql-10-5sb5f_uytdenhouwen(2d1556d5-bd4b-11e7-987b-06579ed29230)": timeout expired waiting for volumes to attach/mount for pod "uytdenhouwen"/"mysql-10-5sb5f". list of unattached/unmounted volumes=[volume-4lwk8]
- Error syncing pod

These errors keep rotating as long as the pod is trying to create the container. Let me know if you need more information.
Comment 16 Hemant Kumar 2018-01-16 16:48:56 EST
We have implemented a generic recovery mechanism in OpenShift 3.9, which will detect volumes stuck on another instance (when no pod on that instance is actively using the volume) and detach them if necessary.

One easy way to reproduce this problem (before 3.9) is:

1. Create a standalone pod (no deployments, RCs, etc.) with volumes.
2. Shut down the node.
3. Wait for the pod on the node to be deleted.
4. Once the pod is deleted (poll kubectl get pods), but before the controller-manager can detach the volume (there is a minimum 6-minute delay), restart the controller-manager.
5. The above action will cause the volume information to be wiped from the controller-manager.
6. Now try to attach the same PVC in another pod (it may be scheduled on a different node). The pod will be stuck in "ContainerCreating" state on 3.7, but not on 3.9.

There are a few other ways to reproduce this error, but this is perhaps the easiest.
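The recovery check described above (detach a volume from a node when no pod there still uses it) can be sketched roughly as follows. The function name and data shapes are hypothetical, not OpenShift's actual implementation:

```python
def find_orphaned_attachments(attached, pods_using):
    """Return (volume, node) pairs that are candidates for a forced detach.

    `attached` maps volume -> node as reported by the cloud provider;
    `pods_using` maps node -> set of volumes that pods scheduled there
    still mount. A volume attached to a node where no pod uses it is
    exactly the "stuck" state the 3.9 recovery mechanism cleans up.
    (Illustrative sketch only.)
    """
    orphans = []
    for volume, node in attached.items():
        if volume not in pods_using.get(node, set()):
            orphans.append((volume, node))
    return orphans

# Example: vol-b is still attached to node-2, but no pod there mounts it.
attached = {"vol-a": "node-1", "vol-b": "node-2"}
pods_using = {"node-1": {"vol-a"}}
orphans = find_orphaned_attachments(attached, pods_using)
# orphans == [("vol-b", "node-2")]
```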
