Bug 1772189

Summary: Failed to create client pod: timed out waiting for the condition
Product: OpenShift Container Platform
Component: Storage
Version: 4.3.0
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Michal Fojtik <mfojtik>
Assignee: Jan Safranek <jsafrane>
QA Contact: Chao Yang <chaoyang>
CC: aos-bugs, jsafrane, lxia
Type: Bug
Last Closed: 2020-05-04 11:15:35 UTC

Description Michal Fojtik 2019-11-13 20:50:58 UTC
Description of problem:

Seen in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-openshift-apiserver-operator/257/pull-ci-openshift-cluster-openshift-apiserver-operator-master-e2e-aws/1326

It seems like it timed out on:

Nov 13 12:12:01.855: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.872: INFO: Pod aws-client still exists
Nov 13 12:12:01.872: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.887: INFO: Pod aws-client still exists
Nov 13 12:12:01.887: INFO: Failed to create client pod: timed out waiting for the condition

This has failed multiple times in the last 14 days:

https://search.svc.ci.openshift.org/?search=Failed+to+create+client+pod%3A+timed+out+waiting+for+the+condition&maxAge=336h&context=2&type=all

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jan Safranek 2019-11-14 10:22:33 UTC
Something is wrong with volume attach. A pod that used a volume was deleted, and another pod that uses the same volume was scheduled to the same node:

Nov 13 12:12:27.676: INFO: At 2019-11-13 12:02:10 +0000 UTC - event for aws-client: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "aws-volume-0" : disk attachment of "aws://us-east-1a/vol-01a363c6613e81301" to "ip-10-0-137-61.ec2.internal" failed: requested device "/dev/xvdbn" but found "/dev/xvdbb"

AWS eventual consistency strikes again?
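
To make that failure concrete, the event above boils down to a device-name consistency check along these lines (a simplified Go sketch with made-up types, not the actual legacy cloud-provider code):

// Simplified sketch of the check that produces the "requested device X but
// found Y" error in the event above. Names and types are illustrative only.
package main

import "fmt"

// attachment is what a DescribeVolumes-style call reports for a volume.
type attachment struct {
	Device     string // e.g. "/dev/xvdbb" - device of the old, not-yet-gone attachment
	InstanceID string
	State      string // "attaching", "attached", "detaching", ...
}

// verifyAttachment fails when the device reported by the cloud does not match
// the device the controller just requested, which is the race hit here: the
// old attachment (xvdbb) is still visible while the new one (xvdbn) is pending.
func verifyAttachment(requestedDevice string, got attachment) error {
	if got.Device != requestedDevice {
		return fmt.Errorf("requested device %q but found %q", requestedDevice, got.Device)
	}
	return nil
}

func main() {
	old := attachment{Device: "/dev/xvdbb", InstanceID: "i-096f624460ef7d2f9", State: "attached"}
	if err := verifyAttachment("/dev/xvdbn", old); err != nil {
		fmt.Println("AttachVolume.Attach failed:", err) // mirrors the event in this comment
	}
}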

Comment 2 Jan Safranek 2019-11-14 13:16:02 UTC
In controller-manager logs, I can see:

1. The volume gets detached

I1113 12:02:08.141511       1 operation_generator.go:558] DetachVolume.Detach succeeded for volume "aws-volume-0" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301") on node "ip-10-0-137-61.ec2.internal" 


2. Immediately after that, the A/D controller tries to attach the volume again:

I1113 12:02:08.258146       1 aws.go:1942] Assigned mount device bn -> volume vol-01a363c6613e81301
...
I1113 12:02:08.584687       1 aws.go:2321] AttachVolume volume="vol-01a363c6613e81301" instance="i-096f624460ef7d2f9" request returned {
  AttachTime: 2019-11-13 12:02:08.564 +0000 UTC,
  Device: "/dev/xvdbn",
  InstanceId: "i-096f624460ef7d2f9",
  State: "attaching",
  VolumeId: "vol-01a363c6613e81301"
}
...
I1113 12:02:10.716451       1 aws.go:1965] Releasing in-process attachment entry: bn -> volume vol-01a363c6613e81301


3. DescribeVolumes shows that the volume is still attached as xvdbb (i.e. the old attachment that was detached in step 1):

E1113 12:02:10.716596       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\"" failed. No retries permitted until 2019-11-13 12:02:11.216560511 +0000 UTC m=+2010.013859030 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"aws-volume-0\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\") from node \"ip-10-0-137-61.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-01a363c6613e81301\" to \"ip-10-0-137-61.ec2.internal\" failed: requested device \"/dev/xvdbn\" but found \"/dev/xvdbb\""

4. The A/D controller tries again. This time the old attachment is already gone, but the AWS plugin sees that the volume is already attached:

W1113 12:02:11.399486       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attaching
...
W1113 12:02:12.505496       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attached
...

We can see that the volume is already attached, but the AWS plugin cannot be sure whether it is the old attachment (which is being torn down) or the new one we are waiting for.
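
The "Assigned mount device" / "already-assigned" messages come from per-node device-name bookkeeping. A minimal sketch of that bookkeeping (hypothetical names, not the real aws.go code) shows why the retry lands on the "already-assigned" path:

// Simplified model of the device-name bookkeeping visible in the aws.go log
// lines above ("Assigned mount device", "Got assignment call for
// already-assigned volume"). Hypothetical code, not the actual cloud provider.
package main

import "fmt"

type deviceAllocator struct {
	// volumeID -> single-letter device suffix (e.g. "bn" for /dev/xvdbn)
	assigned map[string]string
}

// assign reserves a device suffix for the volume. If the volume already has an
// entry - e.g. because the volume still shows as attached from the previous
// attempt - the existing suffix is reused and a warning is logged, which is the
// "already-assigned" message above.
func (d *deviceAllocator) assign(volumeID, nextFree string) string {
	if dev, ok := d.assigned[volumeID]; ok {
		fmt.Printf("W: Got assignment call for already-assigned volume: %s@%s\n", dev, volumeID)
		return dev
	}
	d.assigned[volumeID] = nextFree
	fmt.Printf("I: Assigned mount device %s -> volume %s\n", nextFree, volumeID)
	return nextFree
}

func main() {
	alloc := &deviceAllocator{assigned: map[string]string{}}
	vol := "vol-01a363c6613e81301"

	alloc.assign(vol, "bn") // first attach attempt reserves "bn"
	// The attempt fails ("requested /dev/xvdbn but found /dev/xvdbb"), but the
	// volume still looks attached, so the reservation is kept...
	alloc.assign(vol, "bo") // ...and the retry hits the "already-assigned" path.
}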

Comment 4 Jan Safranek 2019-11-27 10:05:12 UTC
Upstream PR: https://bugzilla.redhat.com/show_bug.cgi?id=1774582

Comment 5 Jan Safranek 2019-11-28 13:16:44 UTC
Sorry, this is the upstream PR: https://github.com/kubernetes/kubernetes/pull/85675
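
For illustration, one way to avoid this race is to hold off the new AttachVolume call until DescribeVolumes no longer reports any attachment of the volume to the node. The sketch below only illustrates that idea in hypothetical Go; it does not claim to mirror what the upstream PR actually implements:

// Minimal sketch: wait for the stale attachment to disappear before issuing a
// new attach with a fresh device name. Illustrative only.
package main

import (
	"errors"
	"fmt"
	"time"
)

// describeAttachment stands in for a DescribeVolumes-style lookup; it returns
// the device the volume is currently attached under, or "" if detached.
type describeAttachment func(volumeID, instanceID string) (string, error)

func waitForDetach(describe describeAttachment, volumeID, instanceID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		device, err := describe(volumeID, instanceID)
		if err != nil {
			return err
		}
		if device == "" {
			return nil // old attachment is gone; safe to attach under a fresh device name
		}
		time.Sleep(time.Second)
	}
	return errors.New("timed out waiting for the previous attachment to go away")
}

func main() {
	calls := 0
	fakeDescribe := func(volumeID, instanceID string) (string, error) {
		calls++
		if calls < 3 {
			return "/dev/xvdbb", nil // stale attachment still visible
		}
		return "", nil // eventually consistent view catches up
	}
	err := waitForDetach(fakeDescribe, "vol-01a363c6613e81301", "i-096f624460ef7d2f9", 30*time.Second)
	fmt.Println("wait result:", err)
}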

Comment 7 Chao Yang 2020-02-03 08:42:25 UTC
Passed on
version   4.4.0-0.nightly-2020-02-02-201619   True        False         4h22m   Cluster version is 4.4.0-0.nightly-2020-02-02-201619

1. Create a PVC and pod1.
2. oc delete pod pod1; oc create -f pod.yaml

Tried this several times; did not hit the issue above.

Comment 9 errata-xmlrpc 2020-05-04 11:15:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581