Description of problem:

We have reports of volumes failing to attach/detach in both Dedicated and Online. In some cases, the volume is already attached to the host requesting it; in other cases, the volume is attached to another instance. In both cases, detach operations do not appear to be retried often, while *attach* operations are retried indefinitely (and too often). This leads to a flood of repeated AWS API requests to attach the volume, which always fail with:

  Error attaching volume "aws://eu-west-1a/vol-096fa76a8ebaf7a9b": Error attaching EBS volume: VolumeInUse: vol-096fa76a8ebaf7a9b is already attached to an instance

In addition, the attachment retries do not appear to use exponential backoff, which we will need in order to avoid hitting the API too often. (Otherwise we get throttled, which is happening in prod right now.)
http://docs.aws.amazon.com/general/latest/gr/api-retries.html

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.1.3-1.git.0.86dc49a.el7.x86_64

How reproducible:
Unknown. It's happening in two clusters right now, and it's something we've seen off and on for a long time as part of other bugs. In Preview, it's happening almost 200,000 times per day across 133 unique volumes. In Dedicated, it's happening ~1,300 times per day on a single volume. (Manually detaching the volumes hasn't resolved the issue in Preview.)

Steps to Reproduce:
1.
2.
3.

Actual results:
Volumes are unable to attach to a new instance after having been attached somewhere else.

Expected results:
The volume detaches from its current instance before attempting to attach to a different instance.
Additional info:

# controller logs:
Nov 14 21:16:11 ip-172-31-61-119.eu-west-1.compute.internal atomic-openshift-master-controllers[18832]: E1114 21:16:11.959744   18832 attacher.go:66] Error attaching volume "aws://eu-west-1a/vol-096fa76a8ebaf7a9b": Error attaching EBS volume: VolumeInUse: vol-096fa76a8ebaf7a9b is already attached to an instance

# oc describe pv:
Name:            pv-aws-wy0r7
Labels:          failure-domain.beta.kubernetes.io/region=eu-west-1
                 failure-domain.beta.kubernetes.io/zone=eu-west-1a
Status:          Bound
Claim:           openshift-infra/metrics-cassandra-1
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        100Gi
Message:
Source:
    Type:        AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:    aws://eu-west-1a/vol-096fa76a8ebaf7a9b
    FSType:      ext4
    Partition:   0
    ReadOnly:    false
No events.

[root@REDACTED-master-2a0e7 ~]# oc get events -w -n openshift-infra
LASTSEEN               FIRSTSEEN              COUNT  NAME                        KIND  SUBOBJECT  TYPE     REASON       SOURCE                                                 MESSAGE
2016-11-14T21:27:40Z   2016-11-11T06:19:32Z   2360   hawkular-cassandra-1-92nmz  Pod              Warning  FailedMount  {kubelet ip-172-31-63-250.eu-west-1.compute.internal}  Unable to mount volumes for pod "hawkular-cassandra-1-92nmz_openshift-infra(0a7cf99d-a7d6-11e6-b044-0a542ebef4f7)": timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-1-92nmz"/"openshift-infra". list of unattached/unmounted volumes=[cassandra-data]
2016-11-14T21:27:40Z   2016-11-11T06:19:32Z   2360   hawkular-cassandra-1-92nmz  Pod              Warning  FailedSync   {kubelet ip-172-31-63-250.eu-west-1.compute.internal}  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-1-92nmz"/"openshift-infra". list of unattached/unmounted volumes=[cassandra-data]

[root@REDACTED-master-2a0e7 ~]# oc get pods -n openshift-infra -o wide
NAME                        READY  STATUS             RESTARTS  AGE  IP      NODE
hawkular-cassandra-1-92nmz  0/1    ContainerCreating  0         3d   <none>  ip-172-31-63-250.eu-west-1.compute.internal

[root@REDACTED-master-2a0e7 ~]# oc get nodes ip-172-31-63-250.eu-west-1.compute.internal -o yaml |grep vol-096fa76a8ebaf7a9b
  - kubernetes.io/aws-ebs/aws://eu-west-1a/vol-096fa76a8ebaf7a9b

----------------------------------------------

In other cases, like the following app, the EBS volume is unable to attach to the correct instance because it's already attached to another instance.

# controller logs:
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: I1114 08:07:44.714355   89858 aws.go:1158] Releasing mount device mapping: bc -> volume vol-09993c5d2634550b8
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: E1114 08:07:44.714377   89858 attacher.go:66] Error attaching volume "aws://us-east-1c/vol-09993c5d2634550b8": Error attaching EBS volume: VolumeInUse: vol-09993c5d2634550b8 is already attached to an instance
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: E1114 08:07:44.714408   89858 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8\"" failed. No retries permitted until 2016-11-14 08:07:46.714396698 +0000 UTC (durationBeforeRetry 2s). Error: AttachVolume.Attach failed for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8" (spec.Name: "pv-aws-tk0r4") from node "ip-172-31-2-71.ec2.internal" with: Error attaching EBS volume: VolumeInUse: vol-09993c5d2634550b8 is already attached to an instance
Nov 14 08:07:46 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: I1114 08:07:46.767212   89858 reconciler.go:170] Started AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8" to node "ip-172-31-2-71.ec2.internal"

[root@preview-master-e69da ~]# oc describe pv pv-aws-tk0r4
Name:            pv-aws-tk0r4
Labels:          failure-domain.beta.kubernetes.io/region=us-east-1
                 failure-domain.beta.kubernetes.io/zone=us-east-1c
Status:          Bound
Claim:           insultapp2/postgresql
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Message:
Source:
    Type:        AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:    aws://us-east-1c/vol-09993c5d2634550b8
    FSType:      ext4
    Partition:   0
    ReadOnly:    false
No events.

[root@preview-master-e69da ~]# oc get events -n insultapp2
LASTSEEN  FIRSTSEEN  COUNT  NAME                KIND  SUBOBJECT  TYPE     REASON       SOURCE                                 MESSAGE
52s       5d         3870   postgresql-3-knctc  Pod              Warning  FailedMount  {kubelet ip-172-31-2-71.ec2.internal}  Unable to mount volumes for pod "postgresql-3-knctc_insultapp2(2eaa1c95-a608-11e6-a5a1-0e3d364e19a5)": timeout expired waiting for volumes to attach/mount for pod "postgresql-3-knctc"/"insultapp2". list of unattached/unmounted volumes=[postgresql-data]
52s       5d         3870   postgresql-3-knctc  Pod              Warning  FailedSync   {kubelet ip-172-31-2-71.ec2.internal}  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "postgresql-3-knctc"/"insultapp2". list of unattached/unmounted volumes=[postgresql-data]

# the pod is running on one node and the PV is connected to another
[root@preview-master-e69da ~]# oc get pods -n insultapp2 -o wide
NAME                READY  STATUS             RESTARTS  AGE  IP      NODE
postgresql-3-knctc  0/1    ContainerCreating  0         5d   <none>  ip-172-31-2-71.ec2.internal

# here are some bugs I filed where I was observing the same behavior. Possibly related.
https://bugzilla.redhat.com/show_bug.cgi?id=1335293
https://bugzilla.redhat.com/show_bug.cgi?id=1377486
https://bugzilla.redhat.com/show_bug.cgi?id=1370312
https://bugzilla.redhat.com/show_bug.cgi?id=1392650

# currently affecting Preview
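The expected behavior (detach from the current instance before attempting the attach elsewhere) can be sketched as a small state check. This is purely illustrative; the types and function names below are invented for the sketch and are not the real Kubernetes attach/detach controller:

```go
package main

import (
	"errors"
	"fmt"
)

// volumeState is a hypothetical stand-in for an EBS volume's attachment.
type volumeState struct {
	attachedTo string // "" means detached
}

// ensureAttached illustrates the desired ordering: if the volume is
// attached to another instance, detach it first and do NOT attempt the
// attach until the detach has completed; retrying the attach while it
// is attached elsewhere can only fail with VolumeInUse.
func ensureAttached(v *volumeState, wantNode string, detach func(*volumeState) error) error {
	switch v.attachedTo {
	case wantNode:
		return nil // already where we want it
	case "":
		v.attachedTo = wantNode // detached, safe to attach
		return nil
	default:
		if err := detach(v); err != nil {
			return err
		}
		return errors.New("detached from old node; attach will be retried")
	}
}

func main() {
	v := &volumeState{attachedTo: "ip-172-31-63-250.eu-west-1.compute.internal"}
	detach := func(v *volumeState) error { v.attachedTo = ""; return nil }
	if err := ensureAttached(v, "ip-172-31-2-71.ec2.internal", detach); err != nil {
		fmt.Println("first reconcile:", err)
	}
	if err := ensureAttached(v, "ip-172-31-2-71.ec2.internal", detach); err == nil {
		fmt.Println("second reconcile: attached to", v.attachedTo)
	}
}
```

The key point is the default branch: the attach is deferred rather than retried in a tight loop, which is the opposite of the behavior seen in the logs above.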
This is similar to issues we've fixed in later builds of 3.3. Can you try to reproduce this on a 3.3.6 build?
I'm unable to find any information on a 3.3.6 build. However, the latest build, 3.3.1.4, is coming out today. Do you happen to know the commit ID or have a link to the PR that fixes this? I can check the latest build to ensure the fix is in there and get started on STG testing.
I found the version we'll need to install and I have it deployed to dev-preview-stg. Assuming it passes QE, I'll try to get it into Preview prod tomorrow and we'll find out if it fixes the issue there.
Do you have any other logs, or the output of `oc describe pod` from when this happened? Any OpenShift users who rely on EBS volumes should run at least 3.4.1.8 because of bug https://bugzilla.redhat.com/show_bug.cgi?id=1422531.
Closing this bug: we have closed several similar bugs in the past, and this one should be resolved by those fixes as well.