Bug 1698829
Summary: | flake: volume is not attached in time due to AWS API throttling | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jan Safranek <jsafrane> | ||||||||
Component: | Storage | Assignee: | Jan Safranek <jsafrane> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Liang Xia <lxia> | ||||||||
Severity: | high | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 4.1.0 | CC: | aos-bugs, aos-storage-staff, bchilds, hekumar, maszulik | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | 4.3.0 | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | No Doc Update | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2020-01-23 11:03:45 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1764044 | ||||||||||
Attachments: |
Description
Jan Safranek
2019-04-11 10:02:41 UTC
Previously: https://github.com/openshift/origin/issues/22084

I reached the same conclusion there: "aws.go:1653] Got " means that a device was already assigned according to DescribeInstances (https://github.com/kubernetes/kubernetes/blob/9fd23b217800a312a45dbaedb70f4ffa65ec661b/pkg/cloudprovider/providers/aws/aws.go#L1750). How can that be, if the device was successfully detached ~300 ms ago and its assignment released, according to DescribeVolumes? I guessed that DescribeVolumes and DescribeInstances might disagree. The flake is quite difficult to reproduce and test, though.

I checked all pull-ci-openshift-origin-master-e2e-aws runs between Apr 4 and Apr 12 (= 1 week) and found "Got assignment call for already-assigned volume" in 33 of 305 runs. Unfortunately, I have not been able to reproduce a single occurrence in my own cluster, despite running many of the failed tests in parallel for half a day, on both an upstream single-node Kubernetes cluster and OCP.

PR with extra logs: https://github.com/openshift/origin/pull/22562 I will run it for a couple of days and we'll see...

Created attachment 1561529 [details]
controller-manager logs with extra logging
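To make the "already-assigned" path concrete, here is a minimal sketch (my own simplified model, with my own names, not the upstream aws.go code) of the mount-device bookkeeping behind the "Got assignment call for already-assigned volume" message: if DescribeInstances already lists the volume on the instance, the existing device suffix is reused and no AttachVolume call is made; otherwise a free suffix is allocated.

```go
package main

import "fmt"

// deviceAssignments maps volume IDs to device-name suffixes, e.g. "cp" for
// /dev/xvdcp, as reported by DescribeInstances. Hypothetical, simplified model.
type deviceAssignments map[string]string

// getMountDevice reuses the existing suffix when DescribeInstances already
// lists the volume (the "already-assigned" path); otherwise it allocates the
// first free suffix from a small candidate pool.
func getMountDevice(assigned deviceAssignments, volumeID string) (suffix string, alreadyAttached bool) {
	if s, ok := assigned[volumeID]; ok {
		// DescribeInstances still lists this volume, so its device name is
		// trusted, even if a detach completed moments ago (the stale read
		// this bug is about).
		return s, true
	}
	used := map[string]bool{}
	for _, s := range assigned {
		used[s] = true
	}
	for c := 'a'; c <= 'z'; c++ {
		s := "b" + string(c)
		if !used[s] {
			assigned[volumeID] = s
			return s, false
		}
	}
	return "", false // no free device names left
}

func main() {
	a := deviceAssignments{"vol-0a90988d68c4166b5": "cp"}

	// Stale DescribeInstances view: the volume is still listed, so the old
	// suffix comes back and the plugin skips AttachVolume.
	s, already := getMountDevice(a, "vol-0a90988d68c4166b5")
	fmt.Println(s, already) // cp true

	// A volume not listed on the instance gets a fresh suffix.
	s, already = getMountDevice(a, "vol-0380c554281eeed57")
	fmt.Println(s, already) // ba false
}
```

With a fresh DescribeInstances view the same call would have allocated a new suffix; the stale listing is what short-circuits the attach.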
Got some results.
1. vol-0a90988d68c4166b5 is attached to node i-02b72eba7c14c5434 as /dev/xvdcp
I0430 19:17:00.410307 1 aws.go:2155] AttachVolume volume="vol-0a90988d68c4166b5" instance="i-02b72eba7c14c5434" request returned {
AttachTime: 2019-04-30 19:17:00.391 +0000 UTC,
Device: "/dev/xvdcp",
InstanceId: "i-02b72eba7c14c5434",
State: "attaching",
VolumeId: "vol-0a90988d68c4166b5"
}
...
I0430 19:17:06.547883 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal"
2. The volume is detached:
I0430 19:20:00.183092 1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal"
...
I0430 19:20:07.184488 1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: {
AttachTime: 2019-04-30 19:17:00 +0000 UTC,
DeleteOnTermination: false,
Device: "/dev/xvdcp",
InstanceId: "i-02b72eba7c14c5434",
State: "detaching",
VolumeId: "vol-0a90988d68c4166b5"
}
I0430 19:20:07.404850 1 aws.go:2258] volume ID: vol-0a90988d68c4166b5: DetachDisk: instance after detach: [{
DeviceName: "/dev/xvda",
Ebs: {
AttachTime: 2019-04-30 18:20:37 +0000 UTC,
DeleteOnTermination: true,
Status: "attached",
VolumeId: "vol-0714b435a9b3d1682"
}
} {
DeviceName: "/dev/xvdcv",
Ebs: {
AttachTime: 2019-04-30 19:16:53 +0000 UTC,
DeleteOnTermination: false,
Status: "attached",
VolumeId: "vol-01a18b18d5521bf8c"
}
}]
I0430 19:20:07.404931 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") on node "ip-10-0-149-35.ec2.internal"
So, DescribeInstances shows that /dev/xvdcp is detached.
3. 0.4 seconds after that, the volume is being attached to the same node, but DescribeInstances shows that the volume is already attached:
I0430 19:20:07.800180 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-6e0cf15b-6b7c-11e9-b232-0af5bde0c660" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-0a90988d68c4166b5") from node "ip-10-0-149-35.ec2.internal"
I0430 19:20:07.932111 1 aws.go:2101] volumeID: vol-0a90988d68c4166b5, AttachDisk got AWS instance [{
DeviceName: "/dev/xvda",
Ebs: {
AttachTime: 2019-04-30 18:20:37 +0000 UTC,
DeleteOnTermination: true,
Status: "attached",
VolumeId: "vol-0714b435a9b3d1682"
}
} {
DeviceName: "/dev/xvdcv",
Ebs: {
AttachTime: 2019-04-30 19:16:53 +0000 UTC,
DeleteOnTermination: false,
Status: "attached",
VolumeId: "vol-01a18b18d5521bf8c"
}
} {
DeviceName: "/dev/xvdcp",
Ebs: {
AttachTime: 2019-04-30 19:17:00 +0000 UTC,
DeleteOnTermination: false,
Status: "detaching",
VolumeId: "vol-0a90988d68c4166b5"
}
}]
Since the volume is shown as attached, the volume plugin won't call AttachVolume.
It looks like a bug somewhere in AWS: DescribeInstances shows a volume as detached and then as attached again, without any action taken by Kubernetes.
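One way to tolerate such stale reads is to poll the DescribeInstances-style view until it converges, instead of trusting a single read. The sketch below is my own illustration of that idea (the function names and the simulated `describe` callback are hypothetical, not the upstream aws.go API).

```go
package main

import (
	"errors"
	"fmt"
)

// waitForDetached polls a DescribeInstances-style lookup until the volume
// disappears from the instance's block-device list, up to maxPolls attempts.
func waitForDetached(describe func() map[string]string, volumeID string, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		if _, attached := describe()[volumeID]; !attached {
			return nil // consistent view reached
		}
		// A real implementation would sleep with backoff between polls.
	}
	return errors.New("volume still listed after max polls: " + volumeID)
}

func main() {
	// Simulate the eventually consistent behaviour seen in the logs: the
	// first two reads still show the old attachment, the third shows it gone.
	reads := 0
	describe := func() map[string]string {
		reads++
		if reads < 3 {
			return map[string]string{"vol-0a90988d68c4166b5": "/dev/xvdcp"}
		}
		return map[string]string{}
	}
	fmt.Println(waitForDetached(describe, "vol-0a90988d68c4166b5", 5)) // <nil>
}
```

The single DescribeInstances read in the attach path has no such convergence check, which is why the stale listing is believed immediately.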
Created attachment 1561575 [details]
2nd log with extra logging
Perhaps another occurrence, now with different symptoms:
0. (volume vol-0380c554281eeed57 is attached and detached a couple of times without any issue)
1. Volume vol-0380c554281eeed57 is attached as /dev/xvdbo:
I0501 05:24:22.793642 1 aws.go:1802] Releasing in-process attachment entry: bo -> volume vol-0380c554281eeed57
I0501 05:24:22.793715 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
2. It is detached:
I0501 05:26:27.593192 1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal"
...
I0501 05:26:36.076535 1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: {
AttachTime: 2019-05-01 05:24:19 +0000 UTC,
DeleteOnTermination: false,
Device: "/dev/xvdbo",
InstanceId: "i-04a63d642e9877109",
State: "detaching",
VolumeId: "vol-0380c554281eeed57"
}
I0501 05:26:36.257923 1 aws.go:2258] volume ID: vol-0380c554281eeed57: DetachDisk: instance after detach: [{
DeviceName: "/dev/xvda",
Ebs: {
AttachTime: 2019-05-01 04:54:55 +0000 UTC,
DeleteOnTermination: true,
Status: "attached",
VolumeId: "vol-0de2ec74ea401d2f9"
}
}]
I0501 05:26:36.257988 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") on node "ip-10-0-137-111.ec2.internal"
3. The volume is being attached again in ~50ms:
I0501 05:26:36.319815 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
I0501 05:26:36.517073 1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{
DeviceName: "/dev/xvda",
Ebs: {
AttachTime: 2019-05-01 04:54:55 +0000 UTC,
DeleteOnTermination: true,
Status: "attached",
VolumeId: "vol-0de2ec74ea401d2f9"
}
}]
(no issue here this time, DescribeInstances is correct)
I0501 05:26:36.517174 1 aws.go:1779] Assigned mount device bh -> volume vol-0380c554281eeed57
I0501 05:26:36.947801 1 aws.go:2155] AttachVolume volume="vol-0380c554281eeed57" instance="i-04a63d642e9877109" request returned {
AttachTime: 2019-05-01 05:26:36.916 +0000 UTC,
Device: "/dev/xvdbh",
InstanceId: "i-04a63d642e9877109",
State: "attaching",
VolumeId: "vol-0380c554281eeed57"
}
I0501 05:26:37.090169 1 aws.go:1802] Releasing in-process attachment entry: bh -> volume vol-0380c554281eeed57
E0501 05:26:37.090388 1 attacher.go:82] Error attaching volume "aws://us-east-1a/vol-0380c554281eeed57" to node "ip-10-0-137-111.ec2.internal": disk attachment of "aws://us-east-1a/vol-0380c554281eeed57" to "ip-10-0-137-111.ec2.internal" failed: requested device "/dev/xvdbh" but found "/dev/xvdbo"
E0501 05:26:37.090559 1 nestedpendingoperations.go:278] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\"" failed. No retries permitted until 2019-05-01 05:26:37.590505739 +0000 UTC m=+1480.632297215 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57\") from node \"ip-10-0-137-111.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-0380c554281eeed57\" to \"ip-10-0-137-111.ec2.internal\" failed: requested device \"/dev/xvdbh\" but found \"/dev/xvdbo\""
So, AWS shows the volume as detached, but when we call AttachVolume again with a different device (xvdbh), a subsequent DescribeInstances call shows that it's still attached as the old device (xvdbo)!
4. Kubernetes retries attaching the volume and, more interestingly, a subsequent DescribeInstances call shows the volume as "attaching" under the new device name, even before AttachVolume is called this time:
I0501 05:26:37.632338 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
I0501 05:26:37.917172 1 aws.go:2101] volumeID: vol-0380c554281eeed57, AttachDisk got AWS instance [{
DeviceName: "/dev/xvda",
Ebs: {
AttachTime: 2019-05-01 04:54:55 +0000 UTC,
DeleteOnTermination: true,
Status: "attached",
VolumeId: "vol-0de2ec74ea401d2f9"
}
} {
DeviceName: "/dev/xvdbh",
Ebs: {
AttachTime: 2019-05-01 05:26:36 +0000 UTC,
DeleteOnTermination: false,
Status: "attaching",
VolumeId: "vol-0380c554281eeed57"
}
}]
5. The attach continues successfully for the new device, and the volume is eventually attached:
I0501 05:26:38.029474 1 aws.go:2016] Waiting for volume "vol-0380c554281eeed57" state: actual=attaching, desired=attached
...
E0501 05:26:41.049673 1 aws.go:2114] endAttaching called for disk "vol-0380c554281eeed57" when attach not in progress
I0501 05:26:41.049742 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-110962f1-6bd1-11e9-9b9e-0e090072c4a6" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-0380c554281eeed57") from node "ip-10-0-137-111.ec2.internal"
So from the Kubernetes point of view, we recovered from the hiccup without any issue.
"endAttaching called for disk XYZ when attach not in progress" looks a bit scary, but it seems to be harmless, since AWS reports the device as attaching and Kubernetes won't reuse it for a different volume.
To sum up, it seems that DescribeInstances shows the old attachment again, now even after we called AttachVolume with a new device name.
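Why the "endAttaching" message is harmless can be seen from a minimal sketch of the in-process attachment bookkeeping (my own simplified stand-in, with hypothetical names, not the upstream aws.go code): releasing an entry that was already released is a no-op that merely reports false, and the device stays reserved on the AWS side.

```go
package main

import "fmt"

// attachingVolumes tracks in-process attachments: device suffix -> volume ID.
// Simplified, hypothetical stand-in for the upstream bookkeeping.
type attachingVolumes map[string]string

// endAttaching releases an in-process entry. Returning false is the
// "attach not in progress" case from the log: the entry was already released
// (here, by an earlier retry), which is logged and otherwise ignored.
func (a attachingVolumes) endAttaching(volumeID, suffix string) bool {
	if a[suffix] != volumeID {
		return false // attach not in progress; harmless
	}
	delete(a, suffix)
	return true
}

func main() {
	a := attachingVolumes{"bh": "vol-0380c554281eeed57"}
	fmt.Println(a.endAttaching("vol-0380c554281eeed57", "bh")) // true
	fmt.Println(a.endAttaching("vol-0380c554281eeed57", "bh")) // false: already released
}
```

The second call corresponds to the E0501 05:26:41 log line: the first (failed) attach attempt had already released the entry, so the successful retry finds nothing to release.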
Yeah, I think the same volume going from detached (i.e. not listed in the instance) back to "detaching" is weird and most likely an error on the AWS side. There is an instance cache in the AWS cloud provider, but it is not used in this code path; only the load balancer code uses the instance cache. So I do not think this is a client-side instance caching issue.

Reported upstream as https://github.com/kubernetes/kubernetes/issues/77389

Another occurrence of this bug: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/809 . In this case, the second pod that uses the volume for resizing could not start in time because AWS incorrectly reported the volume as attached in the DescribeInstances call. This caused the code to enter the "WaitForAttachmentStatus" loop, and the test timed out.

This seems to be another occurrence of this bug: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1582/pull-ci-openshift-console-master-e2e-aws/1496

The flake has not occurred from Oct 16 to today (Oct 29), so I am moving it to verified.
https://testgrid.k8s.io/redhat-openshift-release-4.2-blocking-ci#release-openshift-origin-installer-e2e-aws-4.2
https://testgrid.k8s.io/redhat-openshift-release-4.2-blocking-ocp#release-openshift-ocp-installer-e2e-aws-4.2
It also has not flaked from Oct 26 to Oct 29 (18 builds every day).
https://testgrid.k8s.io/redhat-openshift-release-4.3-blocking-ci#release-openshift-origin-installer-e2e-aws-4.3
https://testgrid.k8s.io/redhat-openshift-release-4.3-blocking-ocp#release-openshift-ocp-installer-e2e-aws-4.3

It looks like this problem still persists, but in the Azure environment. Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-informing#release-openshift-ocp-installer-e2e-azure-4.3 you'll notice every 2nd or 3rd failure is related to the "[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs" test failure. Logs from recent failures:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/562
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/561
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/558
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/552
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/550

This bug tracks AWS API throttling, not all generic test failures.

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/562
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/561
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/558

These three timed out after 15 minutes. Filed bug 1781579.
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/552

There is something wrong with the API server (or the network/auth path to the API server):

Error from server (Forbidden): pods "hostexec-ci-op-vbfvbx0x-cbd67-cdnbq-worker-eastus23-z9pp4-phbbk" is forbidden: failed to get pod: etcdserver: request timed out

> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/550

Many different errors (or one error with many symptoms):

the server is currently unable to handle the request (post oauthaccesstokens.oauth.openshift.io)
Message: "Error trying to reach service: 'net/http: TLS handshake timeout'"
Message: "Timeout: request did not complete within requested timeout 34s"
error: Get https://api.ci-op-rltt6qy9-cbd67.ci.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/infrastructures/cluster: dial tcp 52.142.31.115:6443: connect: connection refused

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062