[BUILD-WATCHER] Seeing frequent deployment errors in the pull-ci-openshift-kubernetes-master-e2e-aws-fips job:

Aug 6 14:01:30.921: INFO: Running AfterSuite actions on all nodes
Aug 6 14:01:30.921: INFO: Running AfterSuite actions on node 1
fail [@/k8s.io/kubernetes/test/e2e/apps/deployment.go:904]: Unexpected error:
error waiting for deployment "test-rolling-update-with-lb" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732319009, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"test-rolling-update-with-lb-b9c9c6bcc\" is progressing."}}, CollisionCount:(*int32)(nil)}
occurred

Sample failing job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.5/1291373614746046464
Search of all failing jobs: https://search.ci.openshift.org/?search=error+waiting+for+deployment+&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1861095
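For context, the expectation the e2e helper polls for can be sketched as a pure check over the fields in the status dump above. This is an illustrative sketch, not the actual helper in k8s.io/kubernetes/test/e2e (which also handles rollout history and timeouts); the field names come straight from the failure message:

```go
package main

import "fmt"

// DeploymentStatus mirrors the fields printed in the failure above.
type DeploymentStatus struct {
	ObservedGeneration  int64
	Replicas            int32
	UpdatedReplicas     int32
	AvailableReplicas   int32
	UnavailableReplicas int32
}

// complete reports whether a rolling update has finished: the controller has
// observed the latest generation, and every desired replica is both updated
// and available. The test keeps polling until this (or a timeout) is hit.
func complete(generation int64, replicas int32, s DeploymentStatus) bool {
	return s.ObservedGeneration >= generation &&
		s.UpdatedReplicas == replicas &&
		s.AvailableReplicas == replicas &&
		s.UnavailableReplicas == 0
}

func main() {
	// The exact numbers from the failure: 3 desired, only 2 ever available,
	// so the poll loop never succeeds and the test times out.
	failing := DeploymentStatus{
		ObservedGeneration:  1,
		Replicas:            3,
		UpdatedReplicas:     3,
		AvailableReplicas:   2,
		UnavailableReplicas: 1,
	}
	fmt.Println(complete(1, 3, failing)) // false
}
```

With AvailableReplicas stuck at 2 of 3, the "Available" condition stays False with reason MinimumReplicasUnavailable, which is exactly what the dump shows.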
e2e-fips is having a myriad of issues right now. Sometimes the cluster doesn't even initialize, and other times it hits a large number of flakes/failures. I want to narrow this down before suggesting changes or passing it along.
https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904

1 of the 3 deployment test pods is not being scheduled:

Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-4phtw: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-4phtw to ip-10-0-146-124.us-east-2.compute.internal
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-hlsnz: { } FailedScheduling: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-vx4j5: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-vx4j5 to ip-10-0-228-200.us-east-2.compute.internal

With the 3 masters tainted, only 2 worker nodes are schedulable; the test's pod anti-affinity puts one replica per node, so the third replica has nowhere to go and stays Pending.
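The capacity arithmetic behind that FailedScheduling event can be made explicit. This is a toy model of the constraint, not scheduler code; the helper name and shape are my own, only the numbers (5 nodes, 3 tainted masters, 3 replicas) come from the log:

```go
package main

import "fmt"

// pendingReplicas models a hard (required) pod anti-affinity keyed on the
// node hostname: each replica needs its own untainted node, so any replicas
// beyond the schedulable node count stay Pending. Illustrative only.
func pendingReplicas(totalNodes, taintedNodes, replicas int) int {
	schedulable := totalNodes - taintedNodes // nodes the test pod tolerates
	if replicas <= schedulable {
		return 0
	}
	return replicas - schedulable
}

func main() {
	// "0/5 nodes are available": 3 masters carry an untolerated taint and
	// the 2 workers each already host a replica, so one of the 3 pods is stuck.
	fmt.Println(pendingReplicas(5, 3, 3)) // 1
}
```

This is why the deployment sits at AvailableReplicas:2 forever: the third pod can never be placed on this 2-worker topology as long as the anti-affinity rule is required rather than preferred.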
Fix is in https://github.com/openshift/origin/pull/25408
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Fix actually landed in https://github.com/openshift/origin/pull/25010
Will wait a few more days to check for the flake, then move this to the verified state.
Moving the bug to VERIFIED: the fix landed about 6 days ago, and no failures have appeared since then in roughly 7 days of results checked here: https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196