Description of problem:

We have reports of volumes failing to attach/detach in both Dedicated and Online. In some cases, the volume is already attached to the host requesting it; in other cases, the volume is attached to another instance. In both cases, detach operations do not appear to be retried often, while *attach* operations are retried indefinitely (and too often). This leads to a flood of repeated AWS API requests to attach the volume, which always fail with:

  Error attaching volume "aws://eu-west-1a/vol-096fa76a8ebaf7a9b": Error attaching EBS volume: VolumeInUse: vol-096fa76a8ebaf7a9b is already attached to an instance

In addition, the attachment retries do not appear to use exponential backoff, which we will need in order to avoid hitting the API too often. (Otherwise we get throttled, which is happening in prod right now.)
http://docs.aws.amazon.com/general/latest/gr/api-retries.html

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.1.3-1.git.0.86dc49a.el7.x86_64

How reproducible:
Unknown. It's happening in two clusters right now, and it's something we've seen off and on for a long time as part of other bugs. In Preview, it's happening almost 200,000 times per day across 133 unique volumes. In Dedicated, it's happening ~1,300 times per day on a single volume. (Manually detaching the volumes hasn't resolved the issue in Preview.)

Steps to Reproduce:
1.
2.
3.

Actual results:
Volumes are unable to attach to a new instance after having been attached somewhere else.

Expected results:
The volume detaches from its current instance before attempting to attach to a different instance.
Additional info:

# controller logs:
Nov 14 21:16:11 ip-172-31-61-119.eu-west-1.compute.internal atomic-openshift-master-controllers[18832]: E1114 21:16:11.959744   18832 attacher.go:66] Error attaching volume "aws://eu-west-1a/vol-096fa76a8ebaf7a9b": Error attaching EBS volume: VolumeInUse: vol-096fa76a8ebaf7a9b is already attached to an instance

# oc describe pv:
Name:            pv-aws-wy0r7
Labels:          failure-domain.beta.kubernetes.io/region=eu-west-1
                 failure-domain.beta.kubernetes.io/zone=eu-west-1a
Status:          Bound
Claim:           openshift-infra/metrics-cassandra-1
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        100Gi
Message:
Source:
    Type:        AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:    aws://eu-west-1a/vol-096fa76a8ebaf7a9b
    FSType:      ext4
    Partition:   0
    ReadOnly:    false
No events.

[root@REDACTED-master-2a0e7 ~]# oc get events -w -n openshift-infra
LASTSEEN               FIRSTSEEN              COUNT  NAME                        KIND  SUBOBJECT  TYPE     REASON       SOURCE                                                 MESSAGE
2016-11-14T21:27:40Z   2016-11-11T06:19:32Z   2360   hawkular-cassandra-1-92nmz  Pod              Warning  FailedMount  {kubelet ip-172-31-63-250.eu-west-1.compute.internal}  Unable to mount volumes for pod "hawkular-cassandra-1-92nmz_openshift-infra(0a7cf99d-a7d6-11e6-b044-0a542ebef4f7)": timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-1-92nmz"/"openshift-infra". list of unattached/unmounted volumes=[cassandra-data]
2016-11-14T21:27:40Z   2016-11-11T06:19:32Z   2360   hawkular-cassandra-1-92nmz  Pod              Warning  FailedSync   {kubelet ip-172-31-63-250.eu-west-1.compute.internal}  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "hawkular-cassandra-1-92nmz"/"openshift-infra". list of unattached/unmounted volumes=[cassandra-data]

[root@REDACTED-master-2a0e7 ~]# oc get pods -n openshift-infra -o wide
NAME                        READY  STATUS             RESTARTS  AGE  IP      NODE
hawkular-cassandra-1-92nmz  0/1    ContainerCreating  0         3d   <none>  ip-172-31-63-250.eu-west-1.compute.internal

[root@REDACTED-master-2a0e7 ~]# oc get nodes ip-172-31-63-250.eu-west-1.compute.internal -o yaml |grep vol-096fa76a8ebaf7a9b
  - kubernetes.io/aws-ebs/aws://eu-west-1a/vol-096fa76a8ebaf7a9b

----------------------------------------------

In other cases, like the following app, the EBS volume is unable to attach to the correct instance because it's already attached to another instance.

# controller logs:
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: I1114 08:07:44.714355   89858 aws.go:1158] Releasing mount device mapping: bc -> volume vol-09993c5d2634550b8
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: E1114 08:07:44.714377   89858 attacher.go:66] Error attaching volume "aws://us-east-1c/vol-09993c5d2634550b8": Error attaching EBS volume: VolumeInUse: vol-09993c5d2634550b8 is already attached to an instance
Nov 14 08:07:44 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: E1114 08:07:44.714408   89858 nestedpendingoperations.go:233] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8\"" failed. No retries permitted until 2016-11-14 08:07:46.714396698 +0000 UTC (durationBeforeRetry 2s). Error: AttachVolume.Attach failed for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8" (spec.Name: "pv-aws-tk0r4") from node "ip-172-31-2-71.ec2.internal" with: Error attaching EBS volume: VolumeInUse: vol-09993c5d2634550b8 is already attached to an instance
Nov 14 08:07:46 ip-172-31-10-25.ec2.internal atomic-openshift-master-controllers[89858]: I1114 08:07:46.767212   89858 reconciler.go:170] Started AttachVolume for volume "kubernetes.io/aws-ebs/aws://us-east-1c/vol-09993c5d2634550b8" to node "ip-172-31-2-71.ec2.internal"

[root@preview-master-e69da ~]# oc describe pv pv-aws-tk0r4
Name:            pv-aws-tk0r4
Labels:          failure-domain.beta.kubernetes.io/region=us-east-1
                 failure-domain.beta.kubernetes.io/zone=us-east-1c
Status:          Bound
Claim:           insultapp2/postgresql
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        1Gi
Message:
Source:
    Type:        AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:    aws://us-east-1c/vol-09993c5d2634550b8
    FSType:      ext4
    Partition:   0
    ReadOnly:    false
No events.

[root@preview-master-e69da ~]# oc get events -n insultapp2
LASTSEEN  FIRSTSEEN  COUNT  NAME                KIND  SUBOBJECT  TYPE     REASON       SOURCE                                 MESSAGE
52s       5d         3870   postgresql-3-knctc  Pod              Warning  FailedMount  {kubelet ip-172-31-2-71.ec2.internal}  Unable to mount volumes for pod "postgresql-3-knctc_insultapp2(2eaa1c95-a608-11e6-a5a1-0e3d364e19a5)": timeout expired waiting for volumes to attach/mount for pod "postgresql-3-knctc"/"insultapp2". list of unattached/unmounted volumes=[postgresql-data]
52s       5d         3870   postgresql-3-knctc  Pod              Warning  FailedSync   {kubelet ip-172-31-2-71.ec2.internal}  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "postgresql-3-knctc"/"insultapp2". list of unattached/unmounted volumes=[postgresql-data]

# the pod is running on one node and the PV is connected to another
[root@preview-master-e69da ~]# oc get pods -n insultapp2 -o wide
NAME                READY  STATUS             RESTARTS  AGE  IP      NODE
postgresql-3-knctc  0/1    ContainerCreating  0         5d   <none>  ip-172-31-2-71.ec2.internal

# here are some bugs I filed where I was observing the same behavior. Possibly related.
https://bugzilla.redhat.com/show_bug.cgi?id=1335293
https://bugzilla.redhat.com/show_bug.cgi?id=1377486
https://bugzilla.redhat.com/show_bug.cgi?id=1370312
https://bugzilla.redhat.com/show_bug.cgi?id=1392650

# currently affecting Preview
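The expected behavior (detach from the current instance before attempting the attach elsewhere) can be sketched as a small state check. This is purely illustrative; the types and function names below are invented for the sketch and are not the real Kubernetes attach/detach controller:

```go
package main

import (
	"errors"
	"fmt"
)

// volumeState is a hypothetical stand-in for an EBS volume's attachment.
type volumeState struct {
	attachedTo string // "" means detached
}

// ensureAttached illustrates the desired ordering: if the volume is
// attached to another instance, detach it first and do NOT attempt the
// attach until the detach has completed; retrying the attach while it
// is attached elsewhere can only fail with VolumeInUse.
func ensureAttached(v *volumeState, wantNode string, detach func(*volumeState) error) error {
	switch v.attachedTo {
	case wantNode:
		return nil // already where we want it
	case "":
		v.attachedTo = wantNode // detached, safe to attach
		return nil
	default:
		if err := detach(v); err != nil {
			return err
		}
		return errors.New("detached from old node; attach will be retried")
	}
}

func main() {
	v := &volumeState{attachedTo: "ip-172-31-63-250.eu-west-1.compute.internal"}
	detach := func(v *volumeState) error { v.attachedTo = ""; return nil }
	if err := ensureAttached(v, "ip-172-31-2-71.ec2.internal", detach); err != nil {
		fmt.Println("first reconcile:", err)
	}
	if err := ensureAttached(v, "ip-172-31-2-71.ec2.internal", detach); err == nil {
		fmt.Println("second reconcile: attached to", v.attachedTo)
	}
}
```

The key point is the default branch: the attach is deferred rather than retried in a tight loop, which is the opposite of the behavior seen in the logs above.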
This is similar to issues we've fixed in later builds of 3.3. Can you try to reproduce this on a 3.3.6 build?
I'm unable to find any information on a 3.3.6 build. However, the latest build, 3.3.1.4, is coming out today. Do you happen to know the commit ID or have a link to the PR that fixes this? I can check the latest build to ensure the fix is in there and get started on STG testing.
I found the version we'll need to install and I have it deployed to dev-preview-stg. Assuming it passes QE, I'll try to get it into Preview prod tomorrow and we'll find out if it fixes the issue there.
Do you have any other logs, or the output of `oc describe pod` from when this happened? Any OpenShift users who rely on EBS volumes should run at least 3.4.1.8 because of bug https://bugzilla.redhat.com/show_bug.cgi?id=1422531.
Closing this bug: we have closed several similar bugs in the past, and this one should be resolved by those fixes as well.