Bug 1772189 - Failed to create client pod: timed out waiting for the condition
Summary: Failed to create client pod: timed out waiting for the condition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.4.0
Assignee: Jan Safranek
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-13 20:50 UTC by Michal Fojtik
Modified: 2020-05-04 11:16 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:15:35 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 24244 'None' closed Bug 1772189: UPSTREAM: 85675: Fix AWS eventual consistency of AttachDisk 2020-06-22 08:55:22 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:16:06 UTC

Description Michal Fojtik 2019-11-13 20:50:58 UTC
Description of problem:

Seen in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-openshift-apiserver-operator/257/pull-ci-openshift-cluster-openshift-apiserver-operator-master-e2e-aws/1326

It seems like it timeout on:

Nov 13 12:12:01.855: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.872: INFO: Pod aws-client still exists
Nov 13 12:12:01.872: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.887: INFO: Pod aws-client still exists
Nov 13 12:12:01.887: INFO: Failed to create client pod: timed out waiting for the condition

This failed multiple times in last 14 days:

https://search.svc.ci.openshift.org/?search=Failed+to+create+client+pod%3A+timed+out+waiting+for+the+condition&maxAge=336h&context=2&type=all

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jan Safranek 2019-11-14 10:22:33 UTC
Something is wrong with attach. A pod that uses a volume was deleted and another pod that uses the same volume was scheduled to the same node:

Nov 13 12:12:27.676: INFO: At 2019-11-13 12:02:10 +0000 UTC - event for aws-client: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "aws-volume-0" : disk attachment of "aws://us-east-1a/vol-01a363c6613e81301" to "ip-10-0-137-61.ec2.internal" failed: requested device "/dev/xvdbn" but found "/dev/xvdbb"

AWS eventual consistency strikes again?

Comment 2 Jan Safranek 2019-11-14 13:16:02 UTC
In controller-manager logs, I can see:

1. The volume gets detached

I1113 12:02:08.141511       1 operation_generator.go:558] DetachVolume.Detach succeeded for volume "aws-volume-0" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301") on node "ip-10-0-137-61.ec2.internal" 


2. Immediately after that A/D controller tries to attach the volume again:

I1113 12:02:08.258146       1 aws.go:1942] Assigned mount device bn -> volume vol-01a363c6613e81301
...
I1113 12:02:08.584687       1 aws.go:2321] AttachVolume volume="vol-01a363c6613e81301" instance="i-096f624460ef7d2f9" request returned {
  AttachTime: 2019-11-13 12:02:08.564 +0000 UTC,
  Device: "/dev/xvdbn",
  InstanceId: "i-096f624460ef7d2f9",
  State: "attaching",
  VolumeId: "vol-01a363c6613e81301"
}
...
I1113 12:02:10.716451       1 aws.go:1965] Releasing in-process attachment entry: bn -> volume vol-01a363c6613e81301


3. DescribeVolumes shows that the volume is still attached as xvdbb (= the attachment from 1.):

E1113 12:02:10.716596       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\"" failed. No retries permitted until 2019-11-13 12:02:11.216560511 +0000 UTC m=+2010.013859030 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"aws-volume-0\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\") from node \"ip-10-0-137-61.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-01a363c6613e81301\" to \"ip-10-0-137-61.ec2.internal\" failed: requested device \"/dev/xvdbn\" but found \"/dev/xvdbb\""

4. A/D controller tries again. This time, the old attachment is already gone, but AWS plugin sees the volume is already attached:

W1113 12:02:11.399486       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attaching
...
W1113 12:02:12.505496       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attached
...

We can see that the volume is already attached, but the AWS plugin cannot be sure if it's the old attachment (that's being torn down) or the new one we're waiting for.

Comment 4 Jan Safranek 2019-11-27 10:05:12 UTC
Upstream PR: https://bugzilla.redhat.com/show_bug.cgi?id=1774582

Comment 5 Jan Safranek 2019-11-28 13:16:44 UTC
Sorry, this is the upstream PR: https://github.com/kubernetes/kubernetes/pull/85675

Comment 7 Chao Yang 2020-02-03 08:42:25 UTC
Passed on
version   4.4.0-0.nightly-2020-02-02-201619   True        False         4h22m   Cluster version is 4.4.0-0.nightly-2020-02-02-201619

1.Create pvc, pod1
2.oc delete pod pod1;oc create -f pod.yaml

Tried several times. did not meet above issue.

Comment 9 errata-xmlrpc 2020-05-04 11:15:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.