Bug 1463717 - ebs volume stuck on other instance
Summary: ebs volume stuck on other instance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.9.z
Assignee: Hemant Kumar
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-21 14:49 UTC by Leif Ringstad
Modified: 2019-06-06 06:56 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-06 06:56:05 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0788 (last updated 2019-06-06 06:56:14 UTC)

Description Leif Ringstad 2017-06-21 14:49:09 UTC
Description of problem:
EBS volume stuck on another instance. Error message:

Unable to mount volumes for pod "postgres-9-xyz...": timeout expired waiting for volumes to attach/mount for pod "app-test"/"postgres-9-....". list of unattached/unmounted volumes=[postgres-data]

Version-Release number of selected component (if applicable):

How reproducible:
This only happens on one account; I tried creating a different account but was unable to reproduce. It seems likely to be related to restarted instances and/or a failure to detach the volume from the instance after the pod was taken down.

Steps to Reproduce:

I'm not sure whether these steps will trigger it, but I assume they could:
1. Create a postgres pod
2. Force the instance down, or make it crash, without gracefully unmounting EBS volumes
3. Redeploy the pod (it should land on a new instance)

Actual results:
The volume is stuck on the previous instance.

Expected results:
The EBS volume is moved to the new instance so the pod can mount it.


Additional info:

Comment 1 Hemant Kumar 2017-06-21 15:57:23 UTC
@Leif - is there a chance you can post the exact error message you saw, with the exact pod name and PVC name?

Did you delete the project afterwards?

Comment 3 Leif Ringstad 2017-06-22 07:02:22 UTC
I cannot get the exact pod name, unfortunately, as the pod has been recreated and now works. The monitoring events are gone (12h retention?).

Here's the only error message I stored locally.

Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-48-232.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-0673191765eefd002": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:

The project is not deleted.

Comment 4 Hemant Kumar 2017-06-22 18:10:22 UTC
Okay, thank you. Also, to confirm: were you using the OpenShift Online environment, or an internal OpenShift cluster running on AWS?

Comment 5 Hemant Kumar 2017-06-22 18:11:37 UTC
> Force instance down or to crash, without gracefully unmounting ebs volumes

Can you elaborate on that? Did you terminate the EC2 instance, just shut it down, or did it crash?

Comment 6 Will Gordon 2017-06-22 18:59:54 UTC
@Hemant, this question originally came through the OpenShift Online community support form. I can confirm that this user is provisioned on starter-us-west-2, where he experienced the above issue.

Comment 7 Leif Ringstad 2017-06-22 21:08:07 UTC
@Hemant Sorry for the late reply. Yes, on OpenShift Online.

The bug appears again:

Messages in events:

---

Successfully assigned postgres-11-ws9zp to ip-172-31-61-198.us-west-2.compute.internal

Failed to attach volume "pvc-52c6736b-3ede-11e7-aeb3-0a69cdf75e6f" on node "ip-172-31-61-198.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0358c51a80c111fa4" to instance "i-07d2474b3b0cc27be": VolumeInUse: vol-0358c51a80c111fa4 is already attached to an instance status code: 400, request id:
4 times in the last 2 minutes

pulling image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"

Successfully pulled image "registry.access.redhat.com/rhscl/postgresql-95-rhel7@sha256:54cfbbaac6c89aec0baf62d49e854c7ae43816138b6ff7a3de5016f90b29f4f5"

Created container with docker id 24620cd49343; Security:[seccomp=unconfined]

Started container with docker id 24620cd49343
---

This time the error message appeared only 4 times, and then the volume seems to have been moved. So currently the error seems partially fixed, except that the volume stays stuck for a while on the other instance.

I did not crash or take down an instance; I was only suggesting that might be a way to trigger the bug, since I'm unable to reproduce it by creating a new project.

Comment 8 Hemant Kumar 2017-06-22 21:12:24 UTC
Yeah, it is expected that the volume will not move immediately, because it has to be detached from the old instance and attached to the new one. What should never happen is the attach on the new instance taking forever.

I am still investigating.
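The attachment state described above can be checked directly from the AWS side. A minimal sketch using the AWS CLI, assuming credentials for the account that owns the cluster; the volume ID is taken from the error message earlier in this report:

```shell
# Show which instance the volume is currently attached to, and the
# attachment state (attaching / attached / detaching / detached).
aws ec2 describe-volumes \
  --volume-ids vol-0358c51a80c111fa4 \
  --query 'Volumes[0].Attachments[].{Instance:InstanceId,State:State}' \
  --output table

# If the volume stays "attached" to the old instance indefinitely, a
# manual detach is a last resort (risk of data loss if still mounted):
# aws ec2 detach-volume --volume-id vol-0358c51a80c111fa4
```

This only inspects state; it does not fix the underlying controller behavior discussed below.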

Comment 14 uyt.95 2017-10-30 13:47:22 UTC
I'm currently having a very similar issue on starter-us-west-1 with a MySQL database. I scaled down the database application from 1 to 0 pods, which I remember taking a long time. I did this because I was having issues deploying a new version of my Python Flask application in the same project and wanted to see if this would help. Shortly after this, I wanted to scale the database back up again to 1 pod. However, I now keep getting the following errors:
- Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-172-31-21-202.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-0a24213452d493c6e": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 5d1308eb-221d-427f-a0bc-5f419f055a70. The volume is currently attached to instance "i-08717eab8bf9d3a15"
- Unable to mount volumes for pod "mysql-10-5sb5f_uytdenhouwen(2d1556d5-bd4b-11e7-987b-06579ed29230)": timeout expired waiting for volumes to attach/mount for pod "uytdenhouwen"/"mysql-10-5sb5f". list of unattached/unmounted volumes=[volume-4lwk8]
- Error syncing pod

These errors keep rotating as long as the pod is trying to create the container. Let me know if you need more information.

Comment 16 Hemant Kumar 2018-01-16 21:48:56 UTC
We have implemented a generic recovery mechanism in OpenShift 3.9, which detects volumes stuck on another instance and, if no pod is actively using the volume on that instance, detaches them.

One easy way to reproduce this problem (before 3.9) is:

1. Create a standalone pod (no deployments, RCs, etc.) with volumes.
2. Shut down the node.
3. Wait for the pod on the node to be deleted.
4. Once the pod is deleted (poll kubectl get pods), but before the controller-manager can detach the volume (there is a minimum 6-minute delay), restart the controller-manager.
5. The above action causes the volume attachment information to be wiped from the controller-manager.
6. Now try to use the same PVC in another pod (which may be scheduled on a different node). The pod will be stuck in "ContainerCreating" state on 3.7, but not on 3.9.

There are a few other ways to reproduce this error, but this is perhaps the easiest.
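The reproduction steps above could be sketched as a shell session. The manifest names and instance ID are placeholders, and the controller-manager restart assumes a 3.x master running the atomic-openshift-master-controllers service:

```shell
# 1. Standalone pod (no deployment/RC) using an EBS-backed PVC.
oc create -f pod-with-pvc.yaml          # placeholder manifest

# 2. Shut down the node from AWS (not a graceful drain/unmount).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0   # placeholder ID

# 3./4. Poll until the pod object is gone, then restart the
# controller-manager BEFORE it detaches the volume (~6-minute window).
watch oc get pods
systemctl restart atomic-openshift-master-controllers   # run on the master

# 5./6. Re-use the same PVC in a second pod, likely scheduled elsewhere.
# On 3.7 it sticks in ContainerCreating; on 3.9 the recovery mechanism
# detects the orphaned attachment and detaches the volume.
oc create -f second-pod-with-same-pvc.yaml   # placeholder manifest
```

This is a sketch of the timing window, not an exact script; the key point is restarting the controller-manager inside the delay between pod deletion and volume detach.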

Comment 21 Liang Xia 2019-04-17 08:45:06 UTC
Unable to reproduce with the versions below; moving the bug to VERIFIED.

oc v3.9.77
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-10-24.ec2.internal:8443
openshift v3.9.77
kubernetes v1.9.1+a0ce1bc657

Comment 23 errata-xmlrpc 2019-06-06 06:56:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0788

