Seen in 4.8.0-0.ci-2021-06-03-215105 -> 4.8.0-0.ci-2021-06-03-222105 CI [1]: the cluster came up sad about etcd:

[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 1s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 4 04:01:30.309: Some cluster operators are not ready: kube-apiserver (Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out)

And then hung on the update:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 2h35m21s
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-sqrgrlfc/release@sha256:77fa9b8c518f8cef038f11aef1035f53f055ed44a87de5b274a8896609e63efe
Unexpected error:
    <*errors.errorString | 0xc0014be540>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)",
    }

Assorted sad operators:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 2h34m42s
Jun 4 06:36:34.711: Unexpected alerts fired or pending during the upgrade:

alert ClusterOperatorDegraded fired for 3844 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.168.14:9099", job="cluster-version-operator", name="kube-apiserver", namespace="openshift-cluster-version", pod="cluster-version-operator-5477d6f586-2622g", reason="EncryptionPruneController_Error", service="cluster-version-operator", severity="warning"}

Operator upgrade kube-apiserver 0s
Failed to upgrade kube-apiserver, operator was degraded (EncryptionPruneController_Error): EncryptionPruneControllerDegraded: etcdserver: request timed out

Checking ClusterVersion:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-06-04T04:00:42Z Available=True : Done applying 4.8.0-0.ci-2021-06-03-215105
2021-06-04T06:40:30Z Failing=False :
2021-06-04T04:05:26Z Progressing=True : Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)
2021-06-04T03:26:27Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
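For anyone poking at similar runs, something like the following (an untested sketch against the same gather-extra artifacts) should list every operator reporting Degraded=True or Available=False, rather than checking them one at a time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | . as $co | .status.conditions[] | select((.type == "Degraded" and .status == "True") or (.type == "Available" and .status == "False")) | $co.metadata.name + " " + .lastTransitionTime + " " + .type + "=" + .status + " " + .reason'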
Kube-API-server has been sad for a while:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5477d6f586-2622g_cluster-version-operator.log | grep 'Result of work' | tail
I0604 05:33:35.098840 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 05:39:40.981064 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 05:46:07.535335 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 05:53:26.031238 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 06:02:20.276362 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 06:11:01.611089 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 06:19:48.562131 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 06:28:44.827046 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
I0604 06:37:20.851574 1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]

And kube-apiserver conditions:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-06-04T03:34:11Z Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out
...

Seems like the operator should have recovered by now? Or maybe etcd is really on the floor, and the etcd operator just isn't complaining about it?
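To check the etcd half of that question, the same jq against the gathered clusteroperators.json, just selecting the etcd operator instead (not run here), should show whether the etcd operator is reporting anything:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'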
Not all that common, but also not super-rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 56 runs, 43% failed, 13% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 111 runs, 52% failed, 2% of failures match = 1% impact
pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws-operator-preferred-host (all) - 10 runs, 50% failed, 20% of failures match = 10% impact
pull-ci-openshift-console-master-ceph-storage-plugin (all) - 196 runs, 42% failed, 1% of failures match = 1% impact
pull-ci-openshift-machine-api-operator-master-e2e-azure (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
pull-ci-openshift-origin-master-e2e-aws-csi (all) - 105 runs, 64% failed, 1% of failures match = 1% impact
release-openshift-ocp-installer-e2e-azure-serial-4.4 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 14 runs, 64% failed, 22% of failures match = 14% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392
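(Aside: the "impact" column appears to be roughly "matching runs over total runs". Rounding back from the reported percentages for the 4.7-from-4.6 GCP OVN job, for example:

$ awk 'BEGIN { runs = 56; fails = int(runs * 0.43 + 0.5); m = int(fails * 0.13 + 0.5); printf "%d failed, %d matched, ~%.0f%% impact\n", fails, m, 100 * m / runs }'
24 failed, 3 matched, ~5% impact

which lines up with the 5% impact reported above.)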
The patch was merged 4 days ago, and over the past 72 hours the results look better:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 45 runs, 44% failed, 5% of failures match = 2% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 28 runs, 89% failed, 4% of failures match = 4% impact
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep -v '4.6' | grep 'failures match' | sort
pull-ci-openshift-assisted-service-master-subsystem-aws (all) - 155 runs, 35% failed, 2% of failures match = 1% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 107 runs, 17% failed, 6% of failures match = 1% impact

I am unable to find any failed 4.8 upgrade tests in the past 72 hours; the results look good, so moving the bug to VERIFIED.
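For a more direct check of just the 4.8 jobs, grepping the same search output for them explicitly (rather than excluding 4.6) would be something like:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | grep '4\.8' | sort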
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438