Bug 1969631 - EncryptionPruneControllerDegraded: etcdserver: request timed out
Summary: EncryptionPruneControllerDegraded: etcdserver: request timed out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
 
Reported: 2021-06-08 19:05 UTC by W. Trevor King
Modified: 2021-07-27 23:12 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:12:15 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 453 0 None closed Bug 1969631: encryption controllers wait for all informers before running the sync loops 2021-06-15 07:12:21 UTC
Github openshift cluster-openshift-apiserver-operator pull 456 0 None closed Bug 1969631: encryption controllers wait for all informers before running the sync loops 2021-06-15 07:06:25 UTC
Github openshift library-go pull 1108 0 None closed Bug 1969631: encryption controllers wait for all informers before running the sync loops 2021-06-15 07:06:25 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:12:29 UTC

Description W. Trevor King 2021-06-08 19:05:03 UTC
Seen in 4.8.0-0.ci-2021-06-03-215105 -> 4.8.0-0.ci-2021-06-03-222105 CI [1]: the cluster came up sad about etcd:

  [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	1s
  fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun  4 04:01:30.309: 
  Some cluster operators are not ready: kube-apiserver (Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out)

And then hung on the update:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h35m21s
  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-sqrgrlfc/release@sha256:77fa9b8c518f8cef038f11aef1035f53f055ed44a87de5b274a8896609e63efe
  Unexpected error:
      <*errors.errorString | 0xc0014be540>: {
          s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)",
      }

Assorted sad operators:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	2h34m42s
  Jun  4 06:36:34.711: Unexpected alerts fired or pending during the upgrade:

    alert ClusterOperatorDegraded fired for 3844 seconds with labels: {condition="Degraded", endpoint="metrics", instance="10.0.168.14:9099", job="cluster-version-operator", name="kube-apiserver", namespace="openshift-cluster-version", pod="cluster-version-operator-5477d6f586-2622g", reason="EncryptionPruneController_Error", service="cluster-version-operator", severity="warning"}

  Operator upgrade kube-apiserver	0s
  Failed to upgrade kube-apiserver, operator was degraded (EncryptionPruneController_Error): EncryptionPruneControllerDegraded: etcdserver: request timed out

Checking ClusterVersion:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-04T04:00:42Z Available=True : Done applying 4.8.0-0.ci-2021-06-03-215105
  2021-06-04T06:40:30Z Failing=False : 
  2021-06-04T04:05:26Z Progressing=True : Working towards 4.8.0-0.ci-2021-06-03-222105: 93 of 695 done (13% complete)
  2021-06-04T03:26:27Z RetrievedUpdates=False NoChannel: The update channel has not been configured.

Kube-API-server has been sad for a while:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5477d6f586-2622g_cluster-version-operator.log | grep 'Result of work' | tail
  I0604 05:33:35.098840       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:39:40.981064       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:46:07.535335       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 05:53:26.031238       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:02:20.276362       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:11:01.611089       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:19:48.562131       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:28:44.827046       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]
  I0604 06:37:20.851574       1 task_graph.go:555] Result of work: [Cluster operator kube-apiserver is degraded]

And kube-apiserver conditions:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "kube-apiserver").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-04T03:34:11Z Degraded=True EncryptionPruneController_Error: EncryptionPruneControllerDegraded: etcdserver: request timed out
  ...

Seems like the operator should have recovered by now?  Or maybe etcd is really on the floor, and the etcd operator just isn't complaining about it?  Not all that common, but also not super-rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 56 runs, 43% failed, 13% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 14 runs, 50% failed, 14% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt (all) - 111 runs, 52% failed, 2% of failures match = 1% impact
pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws-operator-preferred-host (all) - 10 runs, 50% failed, 20% of failures match = 10% impact
pull-ci-openshift-console-master-ceph-storage-plugin (all) - 196 runs, 42% failed, 1% of failures match = 1% impact
pull-ci-openshift-machine-api-operator-master-e2e-azure (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
pull-ci-openshift-origin-master-e2e-aws-csi (all) - 105 runs, 64% failed, 1% of failures match = 1% impact
release-openshift-ocp-installer-e2e-azure-serial-4.4 (all) - 14 runs, 43% failed, 17% of failures match = 7% impact
release-openshift-origin-installer-e2e-aws-disruptive-4.3 (all) - 14 runs, 100% failed, 14% of failures match = 14% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 14 runs, 64% failed, 22% of failures match = 14% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1400652395809083392
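For readers puzzling over the search table above, the "impact" column appears to be the matching failures as a share of total runs. A hypothetical reconstruction (the rounding is inferred from the rows above, not taken from the search tool's source):

```go
package main

import (
	"fmt"
	"math"
)

// impact reconstructs the "impact" column of the search.ci.openshift.org
// output: matching failures as a percentage of total runs, with the failure
// and match counts rounded to whole jobs at each step (assumed behavior).
func impact(runs int, pctFailed, pctMatch float64) int {
	failed := math.Round(float64(runs) * pctFailed / 100)
	matched := math.Round(failed * pctMatch / 100)
	return int(math.Round(100 * matched / float64(runs)))
}

func main() {
	// The 4.7-from-stable-4.6 GCP OVN upgrade row: 56 runs, 43% failed,
	// 13% of failures match.
	fmt.Println(impact(56, 43, 13)) // -> 5
	fmt.Println(impact(14, 64, 22)) // -> 14
}
```

So a "5% impact" means roughly 3 of the 56 runs hit this symptom, which matches the "not all that common, but also not super-rare" read above.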

Comment 3 Ke Wang 2021-06-15 07:30:51 UTC
Since the patch merged 4 days ago, the results over the past 72 hours look better.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 45 runs, 44% failed, 5% of failures match = 2% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 28 runs, 89% failed, 4% of failures match = 4% impact

Comment 5 Ke Wang 2021-06-16 09:17:03 UTC
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=72h&type=junit&search=EncryptionPruneControllerDegraded:+etcdserver:+request+timed+out' | grep -v '4.6' |grep 'failures match' | sort
pull-ci-openshift-assisted-service-master-subsystem-aws (all) - 155 runs, 35% failed, 2% of failures match = 1% impact
pull-ci-openshift-origin-master-e2e-gcp (all) - 107 runs, 17% failed, 6% of failures match = 1% impact

Unable to find any 4.8 upgrade test failures in the past 72 hours; the results look good, so moving the bug to VERIFIED.

Comment 7 errata-xmlrpc 2021-07-27 23:12:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

