Created attachment 1554470 [details]
controller-manager logs

openshift-tests [sig-storage] In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22512/pull-ci-openshift-origin-master-e2e-aws/6810

The test fails with "Couldn't delete ns: "e2e-tests-volumes-r6nhr"", but the real reason the test flaked is that a pod with an AWS EBS PV could not be started because the attach failed.

PV: pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2
EBS ID: vol-067c4e081c6c40573

There are two pods involved: (old), which writes data to the PV, and (new), which reads it back and checks the mount options. The (old) pod starts and stops without issues. The (new) pod can't start because its volume is never attached:

I0410 17:21:20.785231 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-067c4e081c6c40573") on node "ip-10-0-132-132.ec2.internal"
I0410 17:21:20.787080 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-f8e217c0-5bb4-11e9-96e0-12d7e5f9ebb2" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-067c4e081c6c40573") from node "ip-10-0-132-132.ec2.internal"
W0410 17:21:21.076418 1 aws.go:1740] Got assignment call for already-assigned volume: bo@vol-067c4e081c6c40573
I0410 17:21:21.119716 1 aws.go:2012] Waiting for volume "vol-067c4e081c6c40573" state: actual=detached, desired=attached

(and "Waiting for volume ... actual=detached, desired=attached" repeats until the test gives up)

Is there a race? "Got assignment call for already-assigned volume" means the volume is already assigned, so getMountDevice returns alreadyAttached=true and the controller-manager does not call AWS AttachVolume. But the volume has just been detached...
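The skip logic in question can be illustrated with a simplified sketch (hypothetical names and signature; the real logic lives in pkg/cloudprovider/providers/aws/aws.go): getMountDevice consults the block-device mappings reported by DescribeInstances, and if the volume already appears there it returns alreadyAttached=true, so the caller never issues AttachVolume. If that snapshot is stale because the volume was in fact just detached, nothing ever re-attaches it:

```go
package main

import "fmt"

// getMountDevice mimics the decision in the AWS cloud provider (simplified,
// hypothetical helper): given the instance's block-device mappings as
// reported by DescribeInstances (device suffix -> volume ID), return the
// device for volumeID and whether it is considered already attached.
func getMountDevice(mappings map[string]string, volumeID string) (device string, alreadyAttached bool) {
	for dev, vol := range mappings {
		if vol == volumeID {
			// Volume already appears in the mappings: reuse the device
			// and skip the AWS AttachVolume call entirely.
			return dev, true
		}
	}
	// Otherwise a fresh device suffix would be assigned and AttachVolume called.
	return "bh", false
}

func main() {
	// Stale DescribeInstances snapshot: the volume was just detached,
	// but AWS still lists it under device suffix "bo" (/dev/xvdbo).
	stale := map[string]string{"bo": "vol-067c4e081c6c40573"}
	dev, attached := getMountDevice(stale, "vol-067c4e081c6c40573")
	fmt.Printf("device=%s alreadyAttached=%v\n", dev, attached)
	// alreadyAttached=true means AttachVolume is never called, so the
	// subsequent wait for actual=detached -> desired=attached never ends.
}
```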
Another occurrence: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6604#openshift-tests-sig-storage-in-tree-volumes-driver-aws-testpattern-dynamic-pv-default-fs-subpath-should-support-existing-directories-when-readonly-specified-in-the-volumesource-suiteopenshiftconformanceparallel-suitek8s
Previously: https://github.com/openshift/origin/issues/22084

I reached the same conclusion there: "aws.go:1653] Got " means that a device was already assigned according to DescribeInstances, https://github.com/kubernetes/kubernetes/blob/9fd23b217800a312a45dbaedb70f4ffa65ec661b/pkg/cloudprovider/providers/aws/aws.go#L1750. How can this be if the device was successfully detached ~300 ms ago and its assignment released, according to DescribeVolumes? My guess was that DescribeVolumes and DescribeInstances might disagree. The flake is quite difficult to reproduce and test, though.
I checked all pull-ci-openshift-origin-master-e2e-aws runs between Apr 4 and Apr 12 (one week) and found "Got assignment call for already-assigned volume" in 33 of them out of 305. Unfortunately, I haven't been able to reproduce a single occurrence in my own cluster, despite running many of the failed tests in parallel for half a day, both on an upstream single-node Kubernetes cluster and on OCP.
PR with extra logs: https://github.com/openshift/origin/pull/22562. I'll run it for a couple of days and we'll see...
Created attachment 1561529 [details]
controller-manager logs with extra logging

Got some results.

1. vol-0a90988d68c4166b5 is attached to node i-02b72eba7c14c5434 as /dev/xvdcp:

I0430 19:17:00.410307 1 aws.go:2155] AttachVolume volume="vol-0a90988d68c4166b5" instance="i-02b72eba7c14c5434" request returned { AttachTime: 2019-04-30 19:17:00.391 +0000 UTC, Device: "/dev/xvdcp", InstanceId: "i-02b72eba7c14c5434", State: "attaching", VolumeId: "vol-0a90988d68c4166b5" }
...
I0430 19:17:06.547883 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal"

2. The volume is detached:

I0430 19:20:00.183092 1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal"
...
I0430 19:20:07.184488 1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: { AttachTime: 2019-04-30 19:17:00 +0000 UTC, DeleteOnTermination: false, Device: "/dev/xvdcp", InstanceId: "i-02b72eba7c14c5434", State: "detaching", VolumeId: "vol-0a90988d68c4166b5" }
I0430 19:20:07.404850 1 aws.go:2258] volume ID: vol-0a90988d68c4166b5: DetachDisk: instance after detach: [{ DeviceName: "/dev/xvda", Ebs: { AttachTime: 2019-04-30 18:20:37 +0000 UTC, DeleteOnTermination: true, Status: "attached", VolumeId: "vol-0714b435a9b3d1682" } } { DeviceName: "/dev/xvdcv", Ebs: { AttachTime: 2019-04-30 19:16:53 +0000 UTC, DeleteOnTermination: false, Status: "attached", VolumeId: "vol-01a18b18d5521bf8c" } }]
I0430 19:20:07.404931 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal"

So, DescribeInstances shows that /dev/xvdcp is detached.

3.
0.4 seconds after that, the volume is being attached to the same node again, but DescribeInstances shows that it is still attached:

I0430 19:20:07.800180 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal"
I0430 19:20:07.932111 1 aws.go:2101] volumeID: vol-0a90988d68c4166b5, AttachDisk got AWS instance [{ DeviceName: "/dev/xvda", Ebs: { AttachTime: 2019-04-30 18:20:37 +0000 UTC, DeleteOnTermination: true, Status: "attached", VolumeId: "vol-0714b435a9b3d1682" } } { DeviceName: "/dev/xvdcv", Ebs: { AttachTime: 2019-04-30 19:16:53 +0000 UTC, DeleteOnTermination: false, Status: "attached", VolumeId: "vol-01a18b18d5521bf8c" } } { DeviceName: "/dev/xvdcp", Ebs: { AttachTime: 2019-04-30 19:17:00 +0000 UTC, DeleteOnTermination: false, Status: "detaching", VolumeId: "vol-0a90988d68c4166b5" } }]

Since the volume is shown as attached, the volume plugin won't call AttachVolume. It looks like a bug somewhere on the AWS side: DescribeInstances shows the volume as detached and then as attached again, without any action taken by Kubernetes.
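The consequence of trusting the stale snapshot can be sketched as a simplified simulation (hypothetical helper, not the actual aws.go code): because the plugin believes the volume is already attached, it never calls AttachVolume and only polls the reported state, which stays detached until the operation times out.

```go
package main

import "fmt"

// pollUntilAttached mimics, in very simplified form, the wait loop the AWS
// cloud provider enters after deciding a volume is already attached: it only
// polls the reported state and never calls AttachVolume. states is the
// sequence of states AWS would report on successive polls.
func pollUntilAttached(states []string, maxPolls int) (last string, ok bool) {
	last = "unknown"
	for i := 0; i < maxPolls && i < len(states); i++ {
		last = states[i]
		if last == "attached" {
			return last, true
		}
		// Matches the repeated log line from the flake.
		fmt.Printf("Waiting for volume state: actual=%s, desired=attached\n", last)
	}
	return last, false
}

func main() {
	// Nobody issued AttachVolume, so AWS keeps reporting the volume as
	// detached; the loop can only give up after maxPolls attempts.
	state, ok := pollUntilAttached([]string{"detaching", "detached", "detached"}, 3)
	fmt.Printf("final state=%s attached=%v\n", state, ok)
}
```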
Created attachment 1561575 [details]
2nd log with extra logging

Perhaps another occurrence, now with different symptoms:

0. (volume vol-0380c554281eeed57 is attached and detached a couple of times without any issue)

1. Volume vol-0380c554281eeed57 is attached as /dev/xvdbo:

I0501 05:24:22.793642 1 aws.go:1802] Releasing in-process attachment entry: bo -> volume vol-0380c554281eeed57
I0501 05:24:22.793715 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"

2. It is detached:

I0501 05:26:27.593192 1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal"
...
I0501 05:26:36.076535 1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: { AttachTime: 2019-05-01 05:24:19 +0000 UTC, DeleteOnTermination: false, Device: "/dev/xvdbo", InstanceId: "i-04a63d642e9877109", State: "detaching", VolumeId: "vol-0380c554281eeed57" }
I0501 05:26:36.257923 1 aws.go:2258] volume ID: vol-0380c554281eeed57: DetachDisk: instance after detach: [{ DeviceName: "/dev/xvda", Ebs: { AttachTime: 2019-05-01 04:54:55 +0000 UTC, DeleteOnTermination: true, Status: "attached", VolumeId: "vol-0de2ec74ea401d2f9" } }]
I0501 05:26:36.257988 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal"

3.
The volume is being attached again ~50 ms later:

I0501 05:26:36.319815 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
I0501 05:26:36.517073 1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{ DeviceName: "/dev/xvda", Ebs: { AttachTime: 2019-05-01 04:54:55 +0000 UTC, DeleteOnTermination: true, Status: "attached", VolumeId: "vol-0de2ec74ea401d2f9" } }]

(no issue here this time, DescribeInstances is correct)

I0501 05:26:36.517174 1 aws.go:1779] Assigned mount device bh -> volume vol-0380c554281eeed57
I0501 05:26:36.947801 1 aws.go:2155] AttachVolume volume="vol-0380c554281eeed57" instance="i-04a63d642e9877109" request returned { AttachTime: 2019-05-01 05:26:36.916 +0000 UTC, Device: "/dev/xvdbh", InstanceId: "i-04a63d642e9877109", State: "attaching", VolumeId: "vol-0380c554281eeed57" }
I0501 05:26:37.090169 1 aws.go:1802] Releasing in-process attachment entry: bh -> volume vol-0380c554281eeed57
E0501 05:26:37.090388 1 attacher.go:82] Error attaching volume "aws://us-east-1a/vol-0380c554281eeed57" to node "ip-10-0-137-111.ec2.internal": disk attachment of "aws://us-east-1a/vol-0380c554281eeed57" to "ip-10-0-137-111.ec2.internal" failed: requested device "/dev/xvdbh" but found "/dev/xvdbo"
E0501 05:26:37.090559 1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\"" failed. No retries permitted until 2019-05-01 05:26:37.590505739 +0000 UTC m=+1480.632297215 (durationBeforeRetry 500ms).
Error: "AttachVolume.Attach failed for volume \"pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\") from node \"ip-10-0-137-111.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-0380c554281eeed57\" to \"ip-10-0-137-111.ec2.internal\" failed: requested device \"/dev/xvdbh\" but found \"/dev/xvdbo\""

So, AWS shows the volume as detached, but when we call AttachVolume again with a different device (xvdbh), subsequent DescribeInstances calls show it still attached as the old device (xvdbo)!

4. Kubernetes retries the attach, and, more interestingly, a subsequent DescribeInstances now shows the volume as "attaching" under the new device name, even before AttachVolume is called this time:

I0501 05:26:37.632338 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
I0501 05:26:37.917172 1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{ DeviceName: "/dev/xvda", Ebs: { AttachTime: 2019-05-01 04:54:55 +0000 UTC, DeleteOnTermination: true, Status: "attached", VolumeId: "vol-0de2ec74ea401d2f9" } } { DeviceName: "/dev/xvdbh", Ebs: { AttachTime: 2019-05-01 05:26:36 +0000 UTC, DeleteOnTermination: false, Status: "attaching", VolumeId: "vol-0380c554281eeed57" } }]

5. The attach continues successfully with the new device, and the volume is eventually attached:

I0501 05:26:38.029474 1 aws.go:2016] Waiting for volume "vol-0380c554281eeed57" state: actual=attaching, desired=attached
...
E0501 05:26:41.049673 1 aws.go:2114] endAttaching called for disk "vol-0380c554281eeed57" when attach not in progress
I0501 05:26:41.049742 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"

So from Kubernetes' point of view, we recovered from the hiccup without any issue. "endAttaching called for disk XYZ when attach not in progress" looks a bit scary, but it seems to be harmless, since AWS reports the device as attaching and Kubernetes won't reuse it for a different volume.

To sum it up, it seems that DescribeVolumes shows the old attachment again, now even after we called AttachVolume with a new device name.
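The failure and recovery in this second occurrence hinge on a consistency check like the one that produced the "requested device ... but found ..." error. A simplified, hypothetical sketch (not the actual aws.go code): after AttachVolume, the observed attachment's device name must match the one that was requested; the first attempt fails because AWS briefly resurrects the old attachment, and the retry succeeds once AWS converges on the new device:

```go
package main

import "fmt"

// checkAttachedDevice mirrors, in simplified form, the consistency check
// behind the "requested device ... but found ..." error in the logs:
// the attachment AWS reports must use the device name we requested.
func checkAttachedDevice(requested, observed string) error {
	if requested != observed {
		return fmt.Errorf("requested device %q but found %q", requested, observed)
	}
	return nil
}

func main() {
	// Attempt 1: AWS briefly reports the stale attachment under the old
	// device /dev/xvdbo, so the check fails and the operation errors out.
	if err := checkAttachedDevice("/dev/xvdbh", "/dev/xvdbo"); err != nil {
		fmt.Println("attempt 1 failed:", err)
	}
	// Attempt 2 (after the 500ms backoff): AWS has converged and reports
	// the new device, so the same check passes and the attach succeeds.
	if err := checkAttachedDevice("/dev/xvdbh", "/dev/xvdbh"); err == nil {
		fmt.Println("attempt 2 succeeded")
	}
}
```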
Yeah, I think the same volume going from detached (i.e. not listed on the instance) back to "detaching" is weird and most likely an error on the AWS side. There is an instance cache in the AWS cloud provider, but it is not used in this code path; only the load balancer code uses it. So I do not think this is a client-side instance-caching issue.
Reported upstream as https://github.com/kubernetes/kubernetes/issues/77389
Another occurrence of this bug: https://openshift-gce-devel.appspot.com/view-build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/809 . In this case, the second pod that uses the volume for resizing could not start in time because AWS incorrectly reported the volume as attached in the DescribeInstances call. This caused the code to enter the waitForAttachmentStatus loop, and the test timed out.
This seems to be another occurrence of this bug: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1582/pull-ci-openshift-console-master-e2e-aws/1496
The flake has not occurred from Oct 16 to today (Oct 29), so I am moving this to VERIFIED.

https://testgrid.k8s.io/redhat-openshift-release-4.2-blocking-ci#release-openshift-origin-installer-e2e-aws-4.2
https://testgrid.k8s.io/redhat-openshift-release-4.2-blocking-ocp#release-openshift-ocp-installer-e2e-aws-4.2

It also has not flaked from Oct 26 to Oct 29 (18 builds per day):

https://testgrid.k8s.io/redhat-openshift-release-4.3-blocking-ci#release-openshift-origin-installer-e2e-aws-4.3
https://testgrid.k8s.io/redhat-openshift-release-4.3-blocking-ocp#release-openshift-ocp-installer-e2e-aws-4.3
It looks like this problem still persists, but in the Azure environment. Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-ocp-installer-e2e-azure-4.3, you'll notice every second or third failure is related to the

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs

test failure. Logs from recent failures:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/562
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/561
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/558
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/552
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/550
This bug tracks AWS API throttling, not all generic test failures.

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/562
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/561
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/558

These three timed out after 15 minutes. Filed 1781579.

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/552

There is something wrong with the API server (or the network / auth path to the API server):

Error from server (Forbidden): pods "hostexec-ci-op-vbfvbx0x-cbd67-cdnbq-worker-eastus23-z9pp4-phbbk" is forbidden: failed to get pod: etcdserver: request timed out

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/550

Many different errors (or one with many symptoms):

the server is currently unable to handle the request (post oauthaccesstokens.oauth.openshift.io)
Message: "Error trying to reach service: 'net/http: TLS handshake timeout'",
Message: "Timeout: request did not complete within requested timeout 34s",
error: Get https://api.ci-op-rltt6qy9-cbd67.ci.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/infrastructures/cluster: dial tcp 52.142.31.115:6443: connect: connection refused
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062