1772189 – Failed to create client pod: timed out waiting for the condition

Bug 1772189 - Failed to create client pod: timed out waiting for the condition

Summary: Failed to create client pod: timed out waiting for the condition

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Jan Safranek
QA Contact:	Chao Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-11-13 20:50 UTC by Michal Fojtik
Modified:	2020-05-04 11:16 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-04 11:15:35 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24244	0	'None'	closed	Bug 1772189: UPSTREAM: 85675: Fix AWS eventual consistency of AttachDisk	2021-02-19 13:06:55 UTC
Red Hat Product Errata	RHBA-2020:0581	0	None	None	None	2020-05-04 11:16:06 UTC

Description Michal Fojtik 2019-11-13 20:50:58 UTC

Description of problem:

Seen in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-openshift-apiserver-operator/257/pull-ci-openshift-cluster-openshift-apiserver-operator-master-e2e-aws/1326

It seems like it timeout on:

Nov 13 12:12:01.855: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.872: INFO: Pod aws-client still exists
Nov 13 12:12:01.872: INFO: Waiting for pod aws-client to disappear
Nov 13 12:12:01.887: INFO: Pod aws-client still exists
Nov 13 12:12:01.887: INFO: Failed to create client pod: timed out waiting for the condition

This failed multiple times in last 14 days:

https://search.svc.ci.openshift.org/?search=Failed+to+create+client+pod%3A+timed+out+waiting+for+the+condition&maxAge=336h&context=2&type=all

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jan Safranek 2019-11-14 10:22:33 UTC

Something is wrong with attach. A pod that uses a volume was deleted and another pod that uses the same volume was scheduled to the same node:

Nov 13 12:12:27.676: INFO: At 2019-11-13 12:02:10 +0000 UTC - event for aws-client: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "aws-volume-0" : disk attachment of "aws://us-east-1a/vol-01a363c6613e81301" to "ip-10-0-137-61.ec2.internal" failed: requested device "/dev/xvdbn" but found "/dev/xvdbb"

AWS eventual consistency strikes again?

Comment 2 Jan Safranek 2019-11-14 13:16:02 UTC

In controller-manager logs, I can see:

1. The volume gets detached

I1113 12:02:08.141511       1 operation_generator.go:558] DetachVolume.Detach succeeded for volume "aws-volume-0" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301") on node "ip-10-0-137-61.ec2.internal" 


2. Immediately after that A/D controller tries to attach the volume again:

I1113 12:02:08.258146       1 aws.go:1942] Assigned mount device bn -> volume vol-01a363c6613e81301
...
I1113 12:02:08.584687       1 aws.go:2321] AttachVolume volume="vol-01a363c6613e81301" instance="i-096f624460ef7d2f9" request returned {
  AttachTime: 2019-11-13 12:02:08.564 +0000 UTC,
  Device: "/dev/xvdbn",
  InstanceId: "i-096f624460ef7d2f9",
  State: "attaching",
  VolumeId: "vol-01a363c6613e81301"
}
...
I1113 12:02:10.716451       1 aws.go:1965] Releasing in-process attachment entry: bn -> volume vol-01a363c6613e81301


3. DescribeVolumes shows that the volume is still attached as xvdbb (= the attachment from 1.):

E1113 12:02:10.716596       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\"" failed. No retries permitted until 2019-11-13 12:02:11.216560511 +0000 UTC m=+2010.013859030 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"aws-volume-0\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-1a/vol-01a363c6613e81301\") from node \"ip-10-0-137-61.ec2.internal\" : disk attachment of \"aws://us-east-1a/vol-01a363c6613e81301\" to \"ip-10-0-137-61.ec2.internal\" failed: requested device \"/dev/xvdbn\" but found \"/dev/xvdbb\""

4. A/D controller tries again. This time, the old attachment is already gone, but AWS plugin sees the volume is already attached:

W1113 12:02:11.399486       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attaching
...
W1113 12:02:12.505496       1 aws.go:1906] Got assignment call for already-assigned volume: bn@vol-01a363c6613e81301, volume status: attached
...

We can see that the volume is already attached, but the AWS plugin cannot be sure if it's the old attachment (that's being torn down) or the new one we're waiting for.

Comment 4 Jan Safranek 2019-11-27 10:05:12 UTC

Upstream PR: https://bugzilla.redhat.com/show_bug.cgi?id=1774582

Comment 5 Jan Safranek 2019-11-28 13:16:44 UTC

Sorry, this is the upstream PR: https://github.com/kubernetes/kubernetes/pull/85675

Comment 7 Chao Yang 2020-02-03 08:42:25 UTC

Passed on
version   4.4.0-0.nightly-2020-02-02-201619   True        False         4h22m   Cluster version is 4.4.0-0.nightly-2020-02-02-201619

1.Create pvc, pod1
2.oc delete pod pod1;oc create -f pod.yaml

Tried several times. did not meet above issue.

Comment 9 errata-xmlrpc 2020-05-04 11:15:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Note You need to log in before you can comment on or make changes to this bug.