Description of problem: Seeing a lot of error messages in the release-openshift-origin-installer-e2e-aws-upgrade-4.1 job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1/206

Jul 25 14:26:44.336 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator Status for clusteroperator/kube-apiserver changed: Degraded message changed from "" to "StaticPodsDegraded: nodes/ip-10-0-143-150.ec2.internal pods/kube-apiserver-ip-10-0-143-150.ec2.internal container=\"kube-apiserver-7\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-143-150.ec2.internal pods/kube-apiserver-ip-10-0-143-150.ec2.internal container=\"kube-apiserver-7\" is terminated: \"Error\" - \"esetting endpoints for master service \\\"kubernetes\\\" to [10.0.158.95 10.0.163.120] log.go:172] suppressing panic for copyResponse error in test; copy error: context canceled
That test failed because of:

Jul 25 14:27:12.603: INFO: cluster upgrade is failing: Cluster operator machine-config is still updating
Jul 25 14:34:02.602: INFO: cluster upgrade is failing: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (315 of 350)
Jul 25 14:34:12.605: INFO: cluster upgrade is failing: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (315 of 350)
Jul 25 14:34:22.601: INFO: cluster upgrade is failing: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (315 of 350)

The upgrade got stuck at this point and timed out. The Degraded transition is expected; it is not a bug. The fact that the upgrade got stuck on updating etcd-quorum-guard is. Sam, is there a known bug about this?
This is actually not a duplicate of 1742744 [1]. I am going to reopen this for further review. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1742744#c4
Saw an occurrence today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7308
If "resetting endpoints for master service" is the signal on this bug, it is showing up quite a bit in recent searches: https://search-clayton-ci-search.apps.build01.ci.devcluster.openshift.com/?search=resetting+endpoints+for+master+service&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job 2.4% of all recent job runs show it.