Bug 1866868

Summary: Flake: error waiting for deployment e2e-aws-fips
Product: OpenShift Container Platform
Reporter: Matthew Heon <mheon>
Component: kube-controller-manager
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: medium
Priority: medium
Version: unspecified
CC: aos-bugs, danili, fromani, jokerman, knarra, mfojtik
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: non-multi-arch
Doc Type: No Doc Update
Story Points: ---
Type: Bug
Regression: ---
Last Closed: 2020-10-27 16:26:05 UTC

Description Matthew Heon 2020-08-06 16:26:37 UTC
[BUILD-WATCHER]

Seeing frequent deployment errors in the pull-ci-openshift-kubernetes-master-e2e-aws-fips job:

Aug  6 14:01:30.921: INFO: Running AfterSuite actions on all nodes
Aug  6 14:01:30.921: INFO: Running AfterSuite actions on node 1
fail [@/k8s.io/kubernetes/test/e2e/apps/deployment.go:904]: Unexpected error:
    <*errors.errorString | 0xc001b13e90>: {
        s: "error waiting for deployment \"test-rolling-update-with-lb\" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:\"Available\", Status:\"False\", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:\"MinimumReplicasUnavailable\", Message:\"Deployment does not have minimum availability.\"}, v1.DeploymentCondition{Type:\"Progressing\", Status:\"True\", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732319009, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:\"ReplicaSetUpdated\", Message:\"ReplicaSet \\\"test-rolling-update-with-lb-b9c9c6bcc\\\" is progressing.\"}}, CollisionCount:(*int32)(nil)}",
    }
    error waiting for deployment "test-rolling-update-with-lb" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732319009, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"test-rolling-update-with-lb-b9c9c6bcc\" is progressing."}}, CollisionCount:(*int32)(nil)}
occurred

Sample failing job:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.5/1291373614746046464

Search of all failing jobs:

https://search.ci.openshift.org/?search=error+waiting+for+deployment+&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1861095
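
For anyone triaging: the failing assertion is the e2e framework's wait for the deployment status to match its expected state (deployment.go:904). Below is a minimal sketch of an equivalent check using client-go; the package name, helper name, and polling interval are my assumptions, not the framework's actual code.

    package e2eutil

    import (
    	"context"
    	"time"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/util/wait"
    	"k8s.io/client-go/kubernetes"
    )

    // waitForDeploymentAvailable is a hypothetical stand-in for the e2e helper
    // behind the "error waiting for deployment ... status to match expectation"
    // failure above. It polls until all replicas are updated and available.
    func waitForDeploymentAvailable(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
    	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
    		d, err := c.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
    		if err != nil {
    			return false, err
    		}
    		// In the failure pasted above, AvailableReplicas stays at 2 of 3,
    		// so this condition never becomes true and the poll times out.
    		return d.Status.ObservedGeneration >= d.Generation &&
    			d.Status.UpdatedReplicas == *d.Spec.Replicas &&
    			d.Status.AvailableReplicas == *d.Spec.Replicas, nil
    	})
    }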

Comment 1 Seth Jennings 2020-08-06 18:18:40 UTC
e2e-fips is having a myriad of issues right now. Sometimes the cluster doesn't even initialize, and other times it has a ton of flakes/failures.

I want to try to narrow this before suggesting changes or passing it along.

Comment 2 Seth Jennings 2020-08-12 19:16:43 UTC
https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904

1 of the 3 deployment test pods is not being scheduled

Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-4phtw: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-4phtw to ip-10-0-146-124.us-east-2.compute.internal
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-hlsnz: { } FailedScheduling: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-vx4j5: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-vx4j5 to ip-10-0-228-200.us-east-2.compute.internal
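
For illustration only: the FailedScheduling message (nodes rejected for anti-affinity plus the master taint) is what you would expect from a required pod anti-affinity term keyed on kubernetes.io/hostname. A hypothetical sketch of such a term follows; the helper and label selector are assumptions, not the test's actual pod spec.

    package e2eutil

    import (
    	corev1 "k8s.io/api/core/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // antiAffinityOnePerNode builds a required anti-affinity term that forces
    // each matching replica onto a distinct node. With the 3 masters tainted and
    // only 2 schedulable workers, a 3-replica deployment using such a term leaves
    // one pod Pending, matching the FailedScheduling event above.
    func antiAffinityOnePerNode(podLabels map[string]string) *corev1.Affinity {
    	return &corev1.Affinity{
    		PodAntiAffinity: &corev1.PodAntiAffinity{
    			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
    				// Hypothetical selector; the real test selects its own pod-template labels.
    				LabelSelector: &metav1.LabelSelector{MatchLabels: podLabels},
    				TopologyKey:   "kubernetes.io/hostname",
    			}},
    		},
    	}
    }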

Comment 3 Maciej Szulik 2020-08-17 11:44:58 UTC
Fix is in https://github.com/openshift/origin/pull/25408

Comment 4 Maciej Szulik 2020-08-21 14:00:09 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 5 Maciej Szulik 2020-09-10 18:47:10 UTC
Fix actually landed in https://github.com/openshift/origin/pull/25010

Comment 7 RamaKasturi 2020-09-14 16:12:43 UTC
Will wait a few more days to check for the flake and then move the bug to the verified state.

Comment 8 RamaKasturi 2020-09-16 06:11:32 UTC
Moving the bug to the verified state: the fix landed about 6 days ago, and no failures have been seen since then when checking here over roughly the past 7 days.

https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904

Comment 11 errata-xmlrpc 2020-10-27 16:26:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196