Bug 1698829 - flake: volume is not attached in time due to AWS API throttling
Summary: flake: volume is not attached in time due to AWS API throttling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jan Safranek
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks: 1764044
 
Reported: 2019-04-11 10:02 UTC by Jan Safranek
Modified: 2020-01-23 11:04 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:


Attachments
controller-manager logs (456.58 KB, application/gzip)
2019-04-11 10:02 UTC, Jan Safranek
controller-manager logs with extra logging (546.08 KB, application/gzip)
2019-05-02 10:56 UTC, Jan Safranek
2nd log with extra logging (479.94 KB, application/gzip)
2019-05-02 12:11 UTC, Jan Safranek


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 23977 0 'None' closed Bug 1698829: UPSTREAM: 83567: Fix attachment of just detached AWS volumes 2020-12-22 08:02:33 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:04:13 UTC

Description Jan Safranek 2019-04-11 10:02:41 UTC
Created attachment 1554470 [details]
controller-manager logs

openshift-tests [sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22512/pull-ci-openshift-origin-master-e2e-aws/6810

The test fails with "Couldn't delete ns: "e2e-tests-volumes-r6nhr"", but the real reason the test flaked is that a pod with an AWS EBS PV could not be started due to a failed attach.

PV: pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2
EBS ID: vol-067c4e081c6c40573

There are two pods involved: (old), which writes data to the PV, and (new), which reads it back and checks the mount options.

The (old) pod is started and stopped without issues. The (new) pod can't start because its volumes are not attached:

I0410 17:21:20.785231       1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-067c4e081c6c40573") on node "ip-10-0-132-132.ec2.internal"
I0410 17:21:20.787080       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-067c4e081c6c40573") from node "ip-10-0-132-132.ec2.internal"

W0410 17:21:21.076418       1 aws.go:1740] Got assignment call for already-assigned volume: bo@vol-067c4e081c6c40573

I0410 17:21:21.119716       1 aws.go:2012] Waiting for volume "vol-067c4e081c6c40573" state: actual=detached, desired=attached

(and "Waiting for volume ... actual=detached, desired=attached" continues until the test gives up)

Is there a race? "Got assignment call for already-assigned volume" means that the volume is already assigned, so getMountDevice returns alreadyAttached=true and the controller-manager does not call AWS AttachVolume. But the volume has just been detached...
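
For illustration, a minimal Go sketch of the decision described above (simplified names, not the exact pkg/cloudprovider/providers/aws code): getMountDevice scans the block device mappings returned by DescribeInstances, and if the volume already appears there it reports alreadyAttached=true, so the plugin skips the EC2 AttachVolume call and only waits for the volume to become attached, which never happens if that mapping was stale.

// Simplified, illustrative sketch; not the upstream implementation.
package main

import "fmt"

// getMountDevice scans the instance's block device mappings (device -> volume ID,
// as reported by DescribeInstances). If the volume already has a device, it is
// considered already attached and no AttachVolume call will be made.
func getMountDevice(mappings map[string]string, volumeID string) (device string, alreadyAttached bool) {
	for dev, vol := range mappings {
		if vol == volumeID {
			return dev, true
		}
	}
	return "", false // caller would pick a free device and call AttachVolume
}

func main() {
	// Stale mapping: DescribeVolumes just said "detached", but DescribeInstances
	// still lists the volume, so the attach call is silently skipped.
	mappings := map[string]string{"bo": "vol-067c4e081c6c40573"}
	dev, attached := getMountDevice(mappings, "vol-067c4e081c6c40573")
	fmt.Println(dev, attached) // "bo true" -> "Got assignment call for already-assigned volume"
}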

Comment 2 Matthew Wong 2019-04-11 15:19:36 UTC
Previously: https://github.com/openshift/origin/issues/22084

I had the same conclusion: the "Got assignment call for already-assigned volume" message (aws.go:1653) means that a device was already assigned according to DescribeInstances (https://github.com/kubernetes/kubernetes/blob/9fd23b217800a312a45dbaedb70f4ffa65ec661b/pkg/cloudprovider/providers/aws/aws.go#L1750). How can this be if the device was successfully detached ~300ms ago and its assignment released, according to DescribeVolumes?

I guessed that maybe DescribeVolumes and DescribeInstances could disagree. It is quite difficult to reproduce the flake and test that out, though.
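
For anyone trying to catch this in the wild, a rough probe sketch in Go (a hypothetical helper, not part of Kubernetes; assumes aws-sdk-go v1 with default credentials, and the IDs are placeholders) that compares the two views and reports when they disagree:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Placeholder IDs; substitute a volume/instance pair from a failed run.
	volID, instID := "vol-0a90988d68c4166b5", "i-02b72eba7c14c5434"
	svc := ec2.New(session.Must(session.NewSession()))

	// View 1: attachments reported for the volume itself (DescribeVolumes).
	vols, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: aws.StringSlice([]string{volID}),
	})
	if err != nil {
		log.Fatal(err)
	}
	volumeView := false
	for _, att := range vols.Volumes[0].Attachments {
		fmt.Printf("DescribeVolumes:   device=%s state=%s\n",
			aws.StringValue(att.Device), aws.StringValue(att.State))
		volumeView = true
	}

	// View 2: block device mappings reported for the instance (DescribeInstances).
	insts, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: aws.StringSlice([]string{instID}),
	})
	if err != nil {
		log.Fatal(err)
	}
	instanceView := false
	for _, m := range insts.Reservations[0].Instances[0].BlockDeviceMappings {
		if aws.StringValue(m.Ebs.VolumeId) == volID {
			fmt.Printf("DescribeInstances: device=%s status=%s\n",
				aws.StringValue(m.DeviceName), aws.StringValue(m.Ebs.Status))
			instanceView = true
		}
	}

	// The suspected bug: one API shows the volume attached while the other does not.
	if volumeView != instanceView {
		fmt.Println("DescribeVolumes and DescribeInstances disagree for", volID)
	}
}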

Comment 3 Jan Safranek 2019-04-12 13:59:27 UTC
I checked all pull-ci-openshift-origin-master-e2e-aws runs between Apr 4 and Apr 12 (= 1 week) and found "Got assignment call for already-assigned volume" in 33 out of 305 runs.

Unfortunately, I haven't been able to reproduce a single occurrence of it in my cluster despite running many of the failed tests in parallel for half a day, on both an upstream single-node Kubernetes cluster and OCP.

Comment 4 Jan Safranek 2019-04-12 15:12:28 UTC
PR with extra logs: https://github.com/openshift/origin/pull/22562

Will run it for a couple of days and we'll see...

Comment 5 Jan Safranek 2019-05-02 10:56:10 UTC
Created attachment 1561529 [details]
controller-manager logs with extra logging

Got some results. 

1. vol-0a90988d68c4166b5 is attached to node i-02b72eba7c14c5434 as /dev/xvdcp
I0430 19:17:00.410307       1 aws.go:2155] AttachVolume volume="vol-0a90988d68c4166b5" instance="i-02b72eba7c14c5434" request returned {
  AttachTime: 2019-04-30 19:17:00.391 +0000 UTC,
  Device: "/dev/xvdcp",
  InstanceId: "i-02b72eba7c14c5434",
  State: "attaching",
  VolumeId: "vol-0a90988d68c4166b5"
}

...

I0430 19:17:06.547883       1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal" 


2. The volume is detached:
I0430 19:20:00.183092       1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal" 

...

I0430 19:20:07.184488       1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: {
  AttachTime: 2019-04-30 19:17:00 +0000 UTC,
  DeleteOnTermination: false,
  Device: "/dev/xvdcp",
  InstanceId: "i-02b72eba7c14c5434",
  State: "detaching",
  VolumeId: "vol-0a90988d68c4166b5"
}
I0430 19:20:07.404850       1 aws.go:2258] volume ID: vol-0a90988d68c4166b5: DetachDisk: instance after detach: [{
  DeviceName: "/dev/xvda",
  Ebs: {
    AttachTime: 2019-04-30 18:20:37 +0000 UTC,
    DeleteOnTermination: true,
    Status: "attached",
    VolumeId: "vol-0714b435a9b3d1682"
  }
} {
  DeviceName: "/dev/xvdcv",
  Ebs: {
    AttachTime: 2019-04-30 19:16:53 +0000 UTC,
    DeleteOnTermination: false,
    Status: "attached",
    VolumeId: "vol-01a18b18d5521bf8c"
  }
}]
I0430 19:20:07.404931       1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal" 

So, DescribeInstances shows that /dev/xvdcp is detached.


3. 0.4 seconds after that, the volume is being attached to the same node, but DescribeInstances shows that the volume is already attached:

I0430 19:20:07.800180       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal" 
I0430 19:20:07.932111       1 aws.go:2101] volumeID: vol-0a90988d68c4166b5, AttachDisk got AWS instance [{
  DeviceName: "/dev/xvda",
  Ebs: {
    AttachTime: 2019-04-30 18:20:37 +0000 UTC,
    DeleteOnTermination: true,
    Status: "attached",
    VolumeId: "vol-0714b435a9b3d1682"
  }
} {
  DeviceName: "/dev/xvdcv",
  Ebs: {
    AttachTime: 2019-04-30 19:16:53 +0000 UTC,
    DeleteOnTermination: false,
    Status: "attached",
    VolumeId: "vol-01a18b18d5521bf8c"
  }
} {
  DeviceName: "/dev/xvdcp",
  Ebs: {
    AttachTime: 2019-04-30 19:17:00 +0000 UTC,
    DeleteOnTermination: false,
    Status: "detaching",
    VolumeId: "vol-0a90988d68c4166b5"
  }
}]

Since the volume is shown as attached, the volume plugin won't call AttachVolume.


It looks like a bug somewhere in AWS: DescribeInstances shows a volume as detached and then as attached again, without any action taken by Kubernetes.

Comment 6 Jan Safranek 2019-05-02 12:11:08 UTC
Created attachment 1561575 [details]
2nd log with extra logging

Perhaps another occurrence, now with different symptoms:

0. (volume vol-0380c554281eeed57 is attached and detached a couple of times without any issue)

1. Volume vol-0380c554281eeed57 is attached as /dev/xvdbo:
I0501 05:24:22.793642       1 aws.go:1802] Releasing in-process attachment entry: bo -> volume vol-0380c554281eeed57
I0501 05:24:22.793715       1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal" 


2. It is detached:
I0501 05:26:27.593192       1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal" 

...

I0501 05:26:36.076535       1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: {
  AttachTime: 2019-05-01 05:24:19 +0000 UTC,
  DeleteOnTermination: false,
  Device: "/dev/xvdbo",
  InstanceId: "i-04a63d642e9877109",
  State: "detaching",
  VolumeId: "vol-0380c554281eeed57"
}

I0501 05:26:36.257923       1 aws.go:2258] volume ID: vol-0380c554281eeed57: DetachDisk: instance after detach: [{
  DeviceName: "/dev/xvda",
  Ebs: {
    AttachTime: 2019-05-01 04:54:55 +0000 UTC,
    DeleteOnTermination: true,
    Status: "attached",
    VolumeId: "vol-0de2ec74ea401d2f9"
  }
}]

I0501 05:26:36.257988       1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal" 

3. The volume is being attached again in ~50ms:

I0501 05:26:36.319815       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal" 
I0501 05:26:36.517073       1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{
  DeviceName: "/dev/xvda",
  Ebs: {
    AttachTime: 2019-05-01 04:54:55 +0000 UTC,
    DeleteOnTermination: true,
    Status: "attached",
    VolumeId: "vol-0de2ec74ea401d2f9"
  }
}]

(no issue here this time, DescribeInstances is correct)

I0501 05:26:36.517174       1 aws.go:1779] Assigned mount device bh -> volume vol-0380c554281eeed57
I0501 05:26:36.947801       1 aws.go:2155] AttachVolume volume="vol-0380c554281eeed57" instance="i-04a63d642e9877109" request returned {
  AttachTime: 2019-05-01 05:26:36.916 +0000 UTC,
  Device: "/dev/xvdbh",
  InstanceId: "i-04a63d642e9877109",
  State: "attaching",
  VolumeId: "vol-0380c554281eeed57"
}
I0501 05:26:37.090169       1 aws.go:1802] Releasing in-process attachment entry: bh -> volume vol-0380c554281eeed57
E0501 05:26:37.090388       1 attacher.go:82] Error attaching volume "aws://us-east-1a/vol-0380c554281eeed57" to node "ip-10-0-137-111.ec2.internal": disk attachment of "aws://us-east-1a/vol-0380c554281eeed57" to "ip-10-0-137-111.ec2.internal" failed: requested device "/dev/xvdbh" but found "/dev/xvdbo"
E0501 05:26:37.090559       1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\"" failed. No retries permitted until 2019-05-01 05:26:37.590505739 +0000 UTC m=+1480.632297215 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\") from node \"ip-10-0-137-111.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-0380c554281eeed57\" to \"ip-10-0-137-111.ec2.internal\" failed: requested device \"/dev/xvdbh\" but found \"/dev/xvdbo\""


So, AWS shows the volume as detached, but when we call AttachVolume again with a different device (xvdbh), a subsequent DescribeInstances shows that it is still attached as the old device (xvdbo)!
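
For context, the "requested device ... but found ..." error above comes from a consistency check after the attach: the plugin compares the device name AWS reports for the attachment with the one it requested and fails the attach if they differ. A minimal Go sketch of that check (illustrative names and types, not the exact upstream code):

package main

import (
	"fmt"
	"log"
)

// attachment is a simplified stand-in for the attachment info AWS returns.
type attachment struct {
	Device string // device name AWS reports, e.g. "/dev/xvdbo"
	State  string // "attaching", "attached", "detaching", "detached"
}

// checkDevice fails the attach if AWS reports the volume on a different
// device than the one we requested (here: requested xvdbh, found xvdbo).
func checkDevice(requested string, att attachment) error {
	if att.Device != requested {
		return fmt.Errorf("requested device %q but found %q", requested, att.Device)
	}
	return nil
}

func main() {
	err := checkDevice("/dev/xvdbh", attachment{Device: "/dev/xvdbo", State: "attached"})
	log.Println(err) // requested device "/dev/xvdbh" but found "/dev/xvdbo"
}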



4. Kubernetes retries attaching the volume and, more interestingly, a subsequent DescribeInstances shows the volume as "attaching" under the new device name, even before AttachVolume is called this time:

I0501 05:26:37.632338       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal" 
I0501 05:26:37.917172       1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{
  DeviceName: "/dev/xvda",
  Ebs: {
    AttachTime: 2019-05-01 04:54:55 +0000 UTC,
    DeleteOnTermination: true,
    Status: "attached",
    VolumeId: "vol-0de2ec74ea401d2f9"
  }
} {
  DeviceName: "/dev/xvdbh",
  Ebs: {
    AttachTime: 2019-05-01 05:26:36 +0000 UTC,
    DeleteOnTermination: false,
    Status: "attaching",
    VolumeId: "vol-0380c554281eeed57"
  }
}]



5. The attach continues successfully with the new device and the volume is eventually attached:

I0501 05:26:38.029474       1 aws.go:2016] Waiting for volume "vol-0380c554281eeed57" state: actual=attaching, desired=attached

...

E0501 05:26:41.049673       1 aws.go:2114] endAttaching called for disk "vol-0380c554281eeed57" when attach not in progress
I0501 05:26:41.049742       1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal" 


So from Kubernetes' point of view, we recovered from the hiccup without any issue.

"endAttaching called for disk XYZ when attach not in progress" looks a bit scary, but it seems to be harmless, since AWS reports the device as Attaching and Kubernetes won't reuse it for different volumes.

To sum it up, it seems that DescribeVolumes shows the old attachment again, now even after we called AttachVolume with a new device name.
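
For reference, a rough Go sketch of the in-process attachment bookkeeping behind the "Assigned mount device", "Releasing in-process attachment entry" and "endAttaching called ..." messages (assumed, simplified structure, not the upstream implementation). endAttaching only drops the in-flight entry, so calling it when no entry exists merely logs an error, which is why the message above is harmless:

package main

import (
	"log"
	"sync"
)

// deviceAllocator is a simplified stand-in for the plugin's in-process
// attachment bookkeeping (assumed structure, for illustration only).
type deviceAllocator struct {
	mu        sync.Mutex
	attaching map[string]string // device suffix (e.g. "bh") -> volume ID
}

// startAttaching records that a device name has been handed to a volume,
// so it will not be reused for another volume while the attach is in flight.
func (d *deviceAllocator) startAttaching(device, volumeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.attaching[device] = volumeID
}

// endAttaching releases the entry. If the entry is already gone, it only
// logs an error; no state is corrupted.
func (d *deviceAllocator) endAttaching(device, volumeID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.attaching[device] != volumeID {
		log.Printf("endAttaching called for disk %q when attach not in progress", volumeID)
		return false
	}
	delete(d.attaching, device)
	return true
}

func main() {
	d := &deviceAllocator{attaching: map[string]string{}}
	d.startAttaching("bh", "vol-0380c554281eeed57")
	d.endAttaching("bh", "vol-0380c554281eeed57") // normal release
	d.endAttaching("bh", "vol-0380c554281eeed57") // logs the "scary" message, otherwise a no-op
}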

Comment 7 Hemant Kumar 2019-05-02 16:01:17 UTC
Yeah, I think the same volume going from detached (i.e. not listed on the instance) back to "detaching" is weird and most likely an error on the AWS side. There is an instance cache in the AWS cloud provider, but it is not used in this code path; only the load balancer code uses it. So I do not think this is a client-side instance caching issue.

Comment 8 Jan Safranek 2019-05-03 10:49:28 UTC
Reported upstream as https://github.com/kubernetes/kubernetes/issues/77389

Comment 9 Hemant Kumar 2019-05-16 22:21:06 UTC
Another occurrence of this bug: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/809 . In this case, the second pod that uses the volume for resizing could not start in time because AWS incorrectly reported the volume as attached in a DescribeInstances call. This caused the code to enter the "WaitForAttachmentStatus" loop and the test timed out.
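
For context, a rough Go sketch of such a wait loop (illustrative parameters and a hypothetical describe callback, loosely modeled on the plugin's waitForAttachmentStatus; assumes k8s.io/apimachinery): the attachment state is polled with exponential backoff, so when AWS keeps reporting a stale state the loop gives up only after several minutes and the test times out.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAttachmentStatus polls the volume's attachment state (via the supplied
// describe callback) until it reaches the desired state or the backoff is
// exhausted. Parameters are illustrative, not the upstream values.
func waitForAttachmentStatus(desired string, describe func() (string, error)) error {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // initial delay between polls
		Factor:   1.2,              // grow the delay on each step
		Steps:    10,               // bounded number of attempts
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		state, err := describe()
		if err != nil {
			return false, err
		}
		fmt.Printf("Waiting for volume state: actual=%s, desired=%s\n", state, desired)
		return state == desired, nil
	})
}

func main() {
	// Toy usage: the fake describe callback immediately reports the desired state.
	err := waitForAttachmentStatus("attached", func() (string, error) { return "attached", nil })
	fmt.Println("wait finished:", err)
}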

Comment 17 Jan Safranek 2019-12-10 09:56:52 UTC
This bug tracks AWS API throttling, not generic test failures.

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/562
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/561
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/558

These three timed out after 15 minutes. Filed 1781579.

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/552

There is something wrong with the API server (or the network / auth path to the API server):

    Error from server (Forbidden): pods "hostexec-ci-op-vbfvbx0x-cbd67-cdnbq-worker-eastus23-z9pp4-phbbk" is forbidden: failed to get pod: etcdserver: request timed out

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/550

Many different errors (or one error with many symptoms):

    the server is currently unable to handle the request (post oauthaccesstokens.oauth.openshift.io)
    Message: "Error trying to reach service: 'net/http: TLS handshake timeout'",
    Message: "Timeout: request did not complete within requested timeout 34s",
    error: Get https://api.ci-op-rltt6qy9-cbd67.ci.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/infrastructures/cluster: dial tcp 52.142.31.115:6443: connect: connection refused

Comment 19 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

