Description of problem:

Seen in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-openshift-apiserver-operator/257/pull-ci-openshift-cluster-openshift-apiserver-operator-master-e2e-aws/1326

It seems like it timed out on:

Nov 13 12:12:01.855: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.872: INFO: Pod aws-client still exists
Nov 13 12:12:01.872: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.887: INFO: Pod aws-client still exists
Nov 13 12:12:01.887: INFO: Failed to create client pod: timed out waiting for the condition

This failed multiple times in the last 14 days:
https://search.svc.ci.openshift.org/?search=Failed+to+create+client+pod%3A+timed+out+waiting+for+the+condition&maxAge=336h&context=2&type=all

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Something is wrong with attach. A pod that uses a volume was deleted and another pod that uses the same volume was scheduled to the same node:

Nov 13 12:12:27.676: INFO: At 2019-11-13 12:02:10 +0000 UTC - event for aws-client: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "aws-volume-0" : disk attachment of "aws://us-east-1a/vol-01a363c6613e81301" to "ip-10-0-137-61.ec2.internal" failed: requested device "/dev/xvdbn" but found "/dev/xvdbb"

AWS eventual consistency strikes again?
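For illustration only, here is a minimal Go sketch of the kind of check that produces the "requested device X but found Y" error above: poll DescribeVolumes for the attachment on the node and compare the device name AWS reports with the one that was requested. This is not the actual aws.go cloud-provider code; the volume ID, instance ID, and device names are just the values from the event above, and the aws-sdk-go calls are used in their standard form.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	volumeID := "vol-01a363c6613e81301" // from the event above
	instanceID := "i-096f624460ef7d2f9" // node ip-10-0-137-61.ec2.internal
	requestedDevice := "/dev/xvdbn"     // device chosen for the new attach call

	out, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: []*string{aws.String(volumeID)},
	})
	if err != nil {
		log.Fatalf("DescribeVolumes failed: %v", err)
	}

	for _, vol := range out.Volumes {
		for _, att := range vol.Attachments {
			if aws.StringValue(att.InstanceId) != instanceID {
				continue
			}
			found := aws.StringValue(att.Device)
			if found != requestedDevice {
				// The situation in the event: DescribeVolumes still reports
				// the previous attachment (/dev/xvdbb) while the new
				// AttachVolume call asked for /dev/xvdbn.
				log.Fatalf("requested device %q but found %q", requestedDevice, found)
			}
			fmt.Printf("volume %s attached as %s (%s)\n", volumeID, found, aws.StringValue(att.State))
		}
	}
}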
In the controller-manager logs, I can see:

1. The volume gets detached:

I1113 12:02:08.141511 1 operation_generator.go:558] DetachVolume.Detach succeeded for volume "aws-volume-0" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301") on node "ip-10-0-137-61.ec2.internal"

2. Immediately after that, the A/D controller tries to attach the volume again:

I1113 12:02:08.258146 1 aws.go:1942] Assigned mount device bn -> volume vol-01a363c6613e81301
...
I1113 12:02:08.584687 1 aws.go:2321] AttachVolume volume="vol-01a363c6613e81301" instance="i-096f624460ef7d2f9" request returned { AttachTime: 2019-11-13 12:02:08.564 +0000 UTC, Device: "/dev/xvdbn", InstanceId: "i-096f624460ef7d2f9", State: "attaching", VolumeId: "vol-01a363c6613e81301" }
...
I1113 12:02:10.716451 1 aws.go:1965] Releasing in-process attachment entry: bn -> volume vol-01a363c6613e81301

3. DescribeVolumes shows that the volume is still attached as xvdbb (= the attachment from step 1):

E1113 12:02:10.716596 1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\"" failed. No retries permitted until 2019-11-13 12:02:11.216560511 +0000 UTC m=+2010.013859030 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"aws-volume-0\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\") from node \"ip-10-0-137-61.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-01a363c6613e81301\" to \"ip-10-0-137-61.ec2.internal\" failed: requested device \"/dev/xvdbn\" but found \"/dev/xvdbb\""

4. The A/D controller tries again. This time the old attachment is already gone, but the AWS plugin sees the volume as already attached:

W1113 12:02:11.399486 1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attaching
...
W1113 12:02:12.505496 1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attached
...

We can see that the volume is already attached, but the AWS plugin cannot be sure if it's the old attachment (that's being torn down) or the new one we're waiting for.
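To make the "already-assigned volume" part of step 4 concrete, here is a hedged Go sketch of an in-memory device-name allocator of the kind behind the "Assigned mount device" / "Got assignment call for already-assigned volume" messages. It is illustrative only, not the actual aws.go implementation, and the suffix-picking scheme is simplified. The point it shows: once a volume has an assignment, later attach attempts get the same device name back, so an attachment reported by AWS under that name cannot be distinguished as "old attachment being torn down" vs. "new attachment we are waiting for" by device name alone.

package main

import (
	"fmt"
	"sync"
)

// deviceAllocator maps volume IDs to device suffixes (e.g. "bn" -> /dev/xvdbn)
// for a single node. This is a hypothetical, simplified stand-in.
type deviceAllocator struct {
	mu       sync.Mutex
	assigned map[string]string // volumeID -> device suffix
	inUse    map[string]bool   // device suffix -> taken
}

func newDeviceAllocator() *deviceAllocator {
	return &deviceAllocator{
		assigned: make(map[string]string),
		inUse:    make(map[string]bool),
	}
}

// assign returns a device suffix for volumeID, reusing any existing
// assignment. The second return value reports whether the volume was
// already assigned (the "already-assigned volume" warning case).
func (d *deviceAllocator) assign(volumeID string) (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()

	if dev, ok := d.assigned[volumeID]; ok {
		return dev, true
	}
	// Pick the first free suffix; the real allocator walks a larger range.
	for c := 'a'; c <= 'z'; c++ {
		dev := "b" + string(c)
		if !d.inUse[dev] {
			d.inUse[dev] = true
			d.assigned[volumeID] = dev
			return dev, false
		}
	}
	return "", false
}

func main() {
	alloc := newDeviceAllocator()

	// First attach attempt: gets a fresh device suffix.
	dev, already := alloc.assign("vol-01a363c6613e81301")
	fmt.Printf("assigned %s (already-assigned: %v)\n", dev, already)

	// Retry after the mismatch error: the same suffix comes back, so a
	// DescribeVolumes result showing an attachment cannot be matched to
	// the old vs. the new attachment by device name alone.
	dev, already = alloc.assign("vol-01a363c6613e81301")
	fmt.Printf("assigned %s (already-assigned: %v)\n", dev, already)
}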
Upstream PR: https://bugzilla.redhat.com/show_bug.cgi?id=1774582
Sorry, this is the upstream PR: https://github.com/kubernetes/kubernetes/pull/85675
Passed on version 4.4.0-0.nightly-2020-02-02-201619

True        False         4h22m   Cluster version is 4.4.0-0.nightly-2020-02-02-201619

1. Create pvc, pod1
2. oc delete pod pod1; oc create -f pod.yaml

Tried several times; did not hit the above issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581