Bug 1866868 - Flake: error waiting for deployment e2e-aws-fips
Summary: Flake: error waiting for deployment e2e-aws-fips
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Maciej Szulik
QA Contact: RamaKasturi
URL:
Whiteboard: non-multi-arch
Depends On:
Blocks:
Reported: 2020-08-06 16:26 UTC by Matthew Heon
Modified: 2020-10-27 16:26 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:26:05 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:26:08 UTC

Description Matthew Heon 2020-08-06 16:26:37 UTC
[BUILD-WATCHER]

Seeing frequent deployment errors in pull-ci-openshift-kubernetes-master-e2e-aws-fips job:

Aug  6 14:01:30.921: INFO: Running AfterSuite actions on all nodes
Aug  6 14:01:30.921: INFO: Running AfterSuite actions on node 1
fail [@/k8s.io/kubernetes/test/e2e/apps/deployment.go:904]: Unexpected error:
    <*errors.errorString | 0xc001b13e90>: {
        s: "error waiting for deployment \"test-rolling-update-with-lb\" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:\"Available\", Status:\"False\", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:\"MinimumReplicasUnavailable\", Message:\"Deployment does not have minimum availability.\"}, v1.DeploymentCondition{Type:\"Progressing\", Status:\"True\", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732319009, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:\"ReplicaSetUpdated\", Message:\"ReplicaSet \\\"test-rolling-update-with-lb-b9c9c6bcc\\\" is progressing.\"}}, CollisionCount:(*int32)(nil)}",
    }
    error waiting for deployment "test-rolling-update-with-lb" status to match expectation: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732319009, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732318989, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"test-rolling-update-with-lb-b9c9c6bcc\" is progressing."}}, CollisionCount:(*int32)(nil)}
occurred
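
To make the status dump above easier to read, here is a minimal Python sketch that pulls the replica counters out of it and lists the ones lagging behind the replica count. It assumes the test's expectation is that every replica is updated, ready, and available; the helper names `parse_counters` and `lagging_counters` are hypothetical and are not part of the e2e framework.

```python
import re

# Counters copied from the v1.DeploymentStatus dump in the failure message.
STATUS_DUMP = (
    "v1.DeploymentStatus{ObservedGeneration:1, Replicas:3, UpdatedReplicas:3, "
    "ReadyReplicas:2, AvailableReplicas:2, UnavailableReplicas:1}"
)


def parse_counters(dump):
    """Extract every `Name:<integer>` pair from a Go struct dump."""
    # The trailing \b skips hex values such as `wall:0x0`.
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)\b", dump)}


def lagging_counters(counters):
    """Return the counters that are below the total replica count.

    Assumption: the rollout is complete only when all of these equal `Replicas`.
    """
    desired = counters["Replicas"]
    watched = ("UpdatedReplicas", "ReadyReplicas", "AvailableReplicas")
    return [name for name in watched if counters[name] < desired]


counters = parse_counters(STATUS_DUMP)
print(lagging_counters(counters))
```

With the values from this failure, the ready and available counters are the ones that never reach 3, which matches the reported `UnavailableReplicas:1`.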

Sample failing job:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-s390x-4.5/1291373614746046464

Search of all failing jobs:

https://search.ci.openshift.org/?search=error+waiting+for+deployment+&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1861095

Comment 1 Seth Jennings 2020-08-06 18:18:40 UTC
e2e-fips is having a myriad of issues right now.  Sometimes the cluster doesn't even initialize, and other times it has a ton of flakes/failures.

I want to try to narrow this down before suggesting changes or passing it along.

Comment 2 Seth Jennings 2020-08-12 19:16:43 UTC
https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904

One of the three deployment test pods is not being scheduled:

Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-4phtw: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-4phtw to ip-10-0-146-124.us-east-2.compute.internal
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-hlsnz: { } FailedScheduling: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Aug 12 18:33:01.831: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-rolling-update-with-lb-b9c9c6bcc-vx4j5: { } Scheduled: Successfully assigned e2e-deployment-5371/test-rolling-update-with-lb-b9c9c6bcc-vx4j5 to ip-10-0-228-200.us-east-2.compute.internal
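
The events above are consistent with three replicas competing for two schedulable nodes. The Python sketch below is a toy model of that situation, not the scheduler's real algorithm. It assumes a required anti-affinity rule that allows at most one replica per node and a master taint that the pod does not tolerate; the node names and the `schedule` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tainted: bool = False      # stands in for node-role.kubernetes.io/master
    has_replica: bool = False  # a replica of this deployment already runs here


def schedule(replicas, nodes):
    """Place replicas one per untainted node; return (placed, pending)."""
    placed = 0
    for _ in range(replicas):
        # A node is eligible only if it is untainted and holds no replica yet.
        candidate = next(
            (n for n in nodes if not n.tainted and not n.has_replica), None
        )
        if candidate is None:
            break  # no node satisfies both the taint and anti-affinity checks
        candidate.has_replica = True
        placed += 1
    return placed, replicas - placed


# Five nodes as in the FailedScheduling event: 3 tainted masters, 2 workers.
nodes = [Node(f"master-{i}", tainted=True) for i in range(3)]
nodes += [Node(f"worker-{i}") for i in range(2)]

placed, pending = schedule(3, nodes)
print(placed, pending)
```

Under these assumptions two replicas are placed and one stays pending, which lines up with ReadyReplicas:2 and UnavailableReplicas:1 in the deployment status.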

Comment 3 Maciej Szulik 2020-08-17 11:44:58 UTC
Fix is in https://github.com/openshift/origin/pull/25408

Comment 4 Maciej Szulik 2020-08-21 14:00:09 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 5 Maciej Szulik 2020-09-10 18:47:10 UTC
Fix actually landed in https://github.com/openshift/origin/pull/25010

Comment 7 RamaKasturi 2020-09-14 16:12:43 UTC
Will wait a few more days to check for the flake and then move the bug to the verified state.

Comment 8 RamaKasturi 2020-09-16 06:11:32 UTC
Moving the bug to the verified state, as I see that the fix landed about 6 days ago and no failures have been seen since then when checking the search below over roughly the last 7 days.

https://search.ci.openshift.org/?search=k8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fapps%2Fdeployment.go%3A904

Comment 11 errata-xmlrpc 2020-10-27 16:26:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

