Bug 1969631

Summary: EncryptionPruneControllerDegraded: etcdserver: request timed out

Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.8
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified

Reporter: W. Trevor King <wking>
Assignee: Lukasz Szaszkiewicz <lszaszki>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, xxia

Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: Upgrades
Whiteboard: tag-ci
Type: Bug
Last Closed: 2021-07-27 23:12:15 UTC

Description W. Trevor King 2021-06-08 19:05:03 UTC
Seen in 4.8.0-0.ci-2021-06-03-215105 -> 4.8.0-0.ci-2021-06-03-222105 CI [1]: the cluster came up sad about etcd:

  [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	1s
  fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun  4 04:01:30.309: 
  Some cluster operators are not ready: kube-apiserver (Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out)

And then hung on the update:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h35m21s
  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-sqrgrlfc/release@sha256:77fa9b8c518f8cef038f11aef1035f53f055ed44a87de5b274a8896609e63efe
  Unexpected error:
      <*errors.errorString | 0xc0014be540>: {
          s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)",
      }

Assorted sad operators:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h34m42s
  Jun  4 06:36:34.711: Unexpected alerts fired or pending during the upgrade:

    alert ClusterOperatorDegraded fired for 3844 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.168.14:9099", job="cluster-version-operator", name="kube-apiserver", namespace="openshift-cluster-version", pod="cluster-version-operator-5477d6f586-2622g", reason="EncryptionPruneController_Error", service="cluster-version-operator", severity="warning"}

  Operator upgrade kube-apiserver	0s
  Failed to upgrade kube-apiserver, operator was degraded (EncryptionPruneController_Error): EncryptionPruneControllerDegraded: etcdserver: request timed out
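
For context (my summary, not something stated in the job artifacts): as I understand it, the EncryptionPruneController is one of the kube-apiserver operator's etcd-encryption controllers and garbage-collects old encryption-key secrets, which is why an etcd timeout on one of its requests is enough to mark the operator Degraded. A rough way to look at the encryption state on a live cluster (the secret name pattern below is an assumption):

  # Is etcd encryption configured at all?  Empty or "identity" means no.
  $ oc get apiserver cluster -o jsonpath='{.spec.encryption.type}'

  # Key secrets the prune controller would be managing (name pattern assumed):
  $ oc -n openshift-config-managed get secrets | grep encryption-key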

Checking ClusterVersion:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-04T04:00:42Z Available=True : Done applying 4.8.0-0.ci-2021-06-03-215105
  2021-06-04T06:40:30Z Failing=False : 
  2021-06-04T04:05:26Z Progressing=True : Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)
  2021-06-04T03:26:27Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
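
The same conditions can be read from a live cluster instead of the gathered artifacts; a minimal equivalent (sketch, using the same jq filter as above, minus .items[] since the live object is a single ClusterVersion named "version"):

  $ oc get clusterversion version -o json | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'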

Kube-API-server has been sad for a while:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5477d6f586-2622g_cluster-version-operator.log | grep 'Result of work' | tail
  I0604 05:33:35.098840       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:39:40.981064       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:46:07.535335       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:53:26.031238       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:02:20.276362       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:11:01.611089       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:19:48.562131       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:28:44.827046       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:37:20.851574       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
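
On a live cluster, rather than the gathered CVO log, something like the following would show what the CVO is stuck on (just a convenience sketch; for this job we only have the artifacts):

  # Overall upgrade status as the CVO reports it:
  $ oc adm upgrade

  # Which operators are unhappy:
  $ oc get clusteroperators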

And kube-apiserver conditions:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-04T03:34:11Z Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out
  ...

Seems like the operator should have recovered by now?  Or maybe etcd is really on the floor, and the etcd operator just isn't complaining about it?  Not all that common, but also not super-rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 56 runs, 43% failed, 13% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 111 runs, 52% failed, 2% of failures match = 1% impact
pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws-operator-preferred-host (all) - 10 runs, 50% failed, 20% of failures match = 10% impact
pull-ci-openshift-console-master-ceph-storage-plugin (all) - 196 runs, 42% failed, 1% of failures match = 1% impact
pull-ci-openshift-machine-api-operator-master-e2e-azure (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
pull-ci-openshift-origin-master-e2e-aws-csi (all) - 105 runs, 64% failed, 1% of failures match = 1% impact
release-openshift-ocp-installer-e2e-azure-serial-4.4 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 14 runs, 64% failed, 22% of failures match = 14% impact
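
One way to tell those two possibilities apart on a live cluster would be to ask etcd directly; a rough sketch, assuming the usual openshift-etcd pod layout, with <etcd-pod> standing in for a real member pod name:

  # Is the etcd operator itself reporting trouble?
  $ oc get clusteroperator etcd

  # Cluster-wide health and per-endpoint status from one member
  # (the etcdctl container carries the client certs and endpoint env):
  $ oc -n openshift-etcd exec -c etcdctl <etcd-pod> -- etcdctl endpoint health --cluster
  $ oc -n openshift-etcd exec -c etcdctl <etcd-pod> -- etcdctl endpoint status -w table

  # Slow applies show up in the member logs:
  $ oc -n openshift-etcd logs <etcd-pod> -c etcd | grep -i 'took too long' | tail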

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392

Comment 3 Ke Wang 2021-06-15 07:30:51 UTC
Since the patch was merged 4 days ago, the results over the past 72 hours look better.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 45 runs, 44% failed, 5% of failures match = 2% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 28 runs, 89% failed, 4% of failures match = 4% impact

Comment 5 Ke Wang 2021-06-16 09:17:03 UTC
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep -v '4.6' |grep 'failures match' | sort
pull-ci-openshift-assisted-service-master-subsystem-aws (all) - 155 runs, 35% failed, 2% of failures match = 1% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 107 runs, 17% failed, 6% of failures match = 1% impact

No 4.8 upgrade-job failures matched in the past 72 hours, so the results look good; moving the bug to VERIFIED.
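
(For completeness, a narrower run of the same search restricted to 4.8 job names, just as a sketch, would be:

  $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | grep '4\.8' | sort

and it should come back empty if the fix is holding.)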

Comment 7 errata-xmlrpc 2021-07-27 23:12:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438