2059347 – FSyncControllerDegraded latches True, even after fsync latency recovers on all members

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Allen Ray
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:	2057644
Blocks:

Reported:	2022-02-28 20:36 UTC by OpenShift BugZilla Robot
Modified:	2023-04-26 16:26 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-04-21 13:15:55 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 755	0	None	open	[release-4.10] Bug 2059347: Fix FSyncController degraded latch	2022-02-28 20:36:47 UTC
Red Hat Product Errata	RHSA-2022:1356	0	None	None	None	2022-04-21 13:16:05 UTC

Description OpenShift BugZilla Robot 2022-02-28 20:36:26 UTC

+++ This bug was initially created as a clone of Bug #2057644 +++

+++ This bug was initially created as a clone of Bug #2052270 +++

The new condition landed in the dev branch [1] after 4.9 forked off and before 4.10 forked off, so 4.10 and later are impacted.

There are a few issues with the current implementation:

* Some "treshold" -> "threshold" typos.
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs that the condition's reason field expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, although I have no exact code links to back this one up.

...

[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131

--- Additional comment from William Caban on 2022-02-21 16:48:49 UTC ---

I've been able to reproduce etcd staying degraded and blocking cluster upgrades afterwards. In my case, after crashing the cluster and rebooting the nodes, we found that one of the nodes had an SSD that would sometimes spike in latency on initial access (randomly, at boot). After that, even when the disks were stable, or on other reboots where the disk was fine, etcd kept FSyncControllerDegraded set and blocked any further upgrades.


This was confirmed multiple times when upgrading from 4.9.19 to 4.10rc2 or 4.10rc3.

--- Additional comment from W. Trevor King on 2022-02-21 17:10:44 UTC ---

...
[The latching] is by far the most important, because a Degraded=True etcd ClusterOperator will block updates, including 4.y.z -> 4.y.z' patch updates, unless the cluster admin does some hoop-jumpy workarounds, or is updating to a release that fixes the latching behavior.
...

---

Bug 2052270 is covering the typos.  Bug 2057642 is covering the slugging.  This bug series picks up the third point: latching Degraded=True.  I'm preserving blocker- from [1], but if this was fixed for 4.10.0, I would not be sad ;).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2052270#c11
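
To illustrate what "latching" means here, the following is a minimal Go sketch and not the operator's actual code. The Condition type, the thresholdSeconds value, the member names, and the "FSyncLatencyExceeded" and "AsExpected" reasons are all hypothetical, chosen only for the example. It contrasts a sync function that only ever sets the condition to True with one that recomputes the condition from current observations on every sync.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for an operator status condition.
type Condition struct {
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// thresholdSeconds is an illustrative value, not the operator's real threshold.
const thresholdSeconds = 1.0

// slowMembers returns, in sorted order, the members whose observed fsync
// latency exceeds thresholdSeconds.
func slowMembers(latencySeconds map[string]float64) []string {
	var slow []string
	for member, latency := range latencySeconds {
		if latency > thresholdSeconds {
			slow = append(slow, member)
		}
	}
	sort.Strings(slow)
	return slow
}

// degraded builds the Degraded=True condition for the given slow members.
func degraded(slow []string) Condition {
	return Condition{
		Status:  "True",
		Reason:  "FSyncLatencyExceeded", // hypothetical CamelCase reason
		Message: "slow members: " + strings.Join(slow, ", "),
	}
}

// latchingSync only ever writes Degraded=True. When latency is healthy it
// returns the previous condition unchanged, so a single spike sticks forever.
func latchingSync(prev Condition, latencySeconds map[string]float64) Condition {
	if slow := slowMembers(latencySeconds); len(slow) > 0 {
		return degraded(slow)
	}
	return prev
}

// recomputingSync derives the condition from the current observations on
// every sync, so it returns to False once all members recover.
func recomputingSync(latencySeconds map[string]float64) Condition {
	if slow := slowMembers(latencySeconds); len(slow) > 0 {
		return degraded(slow)
	}
	return Condition{Status: "False", Reason: "AsExpected"}
}

func main() {
	spike := map[string]float64{"etcd-0": 2.5, "etcd-1": 0.01, "etcd-2": 0.02}
	healthy := map[string]float64{"etcd-0": 0.01, "etcd-1": 0.01, "etcd-2": 0.02}

	// One sync during the spike, then one after recovery.
	latched := latchingSync(Condition{Status: "False", Reason: "AsExpected"}, spike)
	latched = latchingSync(latched, healthy)
	fmt.Println("latching:", latched.Status)

	_ = recomputingSync(spike)
	fmt.Println("recomputing:", recomputingSync(healthy).Status)
}
```

With these inputs the latching variant still reports True after the recovery sync, while the recomputing variant reports False, which is the behavior this bug series asks for.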

Comment 8 errata-xmlrpc 2022-04-21 13:15:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.10 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1356

Comment 9 pupadhya 2022-04-22 08:07:46 UTC

Hi Team, 

Is there any workaround if a customer (CU) wants to upgrade from OCP 4.10.5 to 4.10.6?

Comment 11 W. Trevor King 2022-04-28 15:06:32 UTC

It is possible to complete an update to 4.10.6 by fiddling with ClusterVersion overrides, but whether using overrides like that is supported is unclear.

The preferred way to unstick a cluster with a latched FSyncControllerDegraded is to ask the cluster to update to 4.10.10 or a later release with the fix.  You can request that retarget with 'oc adm upgrade --to 4.10.10' or through the web console, without first completing an in-progress update to 4.10.6 or similar.
