Bug 2057644

Summary: FSyncControllerDegraded latches True, even after fsync latency recovers on all members
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Etcd
Assignee: Allen Ray <alray>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.10
CC: alray, dwest, geliu, mallmen, travi, william.caban, wlewis, yanyang
Keywords: Upgrades
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Clone Of: 2052270
Bug Blocks: 2059347
Last Closed: 2022-08-10 10:50:55 UTC

Description W. Trevor King 2022-02-23 19:16:56 UTC
+++ This bug was initially created as a clone of Bug #2052270 +++

The new condition landed in the dev branch [1] after 4.9 forked off and before 4.10 forked off, so 4.10 and later are impacted.

There are a few issues with the current implementation:

* Some "treshold" -> "threshold" typos.
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs that the reason field expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, although I don't have exact code links to back this one up.


[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131
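
The latching behavior in the third bullet can be sketched with a minimal, self-contained Go example. This is illustrative only: the types, function names, and the `FSyncThresholdExceeded` reason slug are assumptions, not the operator's actual code. The point is the shape of the bug: a sync loop that only ever sets the condition, versus one that also clears it.

```go
package main

import "fmt"

// Condition mirrors the shape of an operator status condition
// (field names follow the Kubernetes convention; illustrative only).
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// syncLatched shows the buggy pattern: the condition is set when
// fsync latency exceeds the threshold, but there is no path that
// clears it, so Degraded=True latches forever.
func syncLatched(cond *Condition, latencyMs, thresholdMs float64) {
	if latencyMs > thresholdMs {
		cond.Status = "True"
		cond.Reason = "FSyncThresholdExceeded" // hypothetical CamelCase slug
	}
	// Bug: no else branch to reset the condition after recovery.
}

// syncFixed clears the condition once latency recovers.
func syncFixed(cond *Condition, latencyMs, thresholdMs float64) {
	if latencyMs > thresholdMs {
		cond.Status = "True"
		cond.Reason = "FSyncThresholdExceeded"
	} else {
		cond.Status = "False"
		cond.Reason = "AsExpected"
	}
}

func main() {
	threshold := 10.0
	latched := Condition{Type: "FSyncControllerDegraded", Status: "False"}
	fixed := latched

	// A transient latency spike followed by recovery.
	for _, latency := range []float64{5, 50, 5} {
		syncLatched(&latched, latency, threshold)
		syncFixed(&fixed, latency, threshold)
	}
	fmt.Printf("latched after recovery: %s\n", latched.Status)
	fmt.Printf("fixed after recovery:   %s\n", fixed.Status)
}
```

After the spike subsides, the latched variant still reports True while the fixed variant reports False, which is why the condition blocks upgrades indefinitely even on healthy disks.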

--- Additional comment from William Caban on 2022-02-21 16:48:49 UTC ---

I've been able to reproduce etcd staying degraded and blocking further cluster upgrades. In my case, after crashing the cluster and rebooting the nodes, we found that one of the nodes had an SSD whose latency would sometimes spike on initial access (randomly on boot). Even after the disk stabilized, and across later reboots where it was fine, etcd kept FSyncControllerDegraded set and blocked any further upgrades.

This was confirmed multiple times when upgrading from 4.9.19 to 4.10rc2 or 4.10rc3.

--- Additional comment from W. Trevor King on 2022-02-21 17:10:44 UTC ---

[The latching] is by far the most important, because a Degraded=True etcd ClusterOperator will block updates, including 4.y.z -> 4.y.z' patch updates, unless the cluster admin does some hoop-jumpy workarounds, or is updating to a release that fixes the latching behavior.


Bug 2052270 is covering the typos.  Bug 2057642 is covering the slugging.  This bug series picks up the third point: latching Degraded=True.  I'm preserving blocker- from [1], but if this was fixed for 4.10.0, I would not be sad ;).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2052270#c11

Comment 3 ge liu 2022-03-22 07:43:19 UTC
@wking, this bug is assigned to us for verification. Do you have any suggestions for how to verify it? Thanks

Comment 4 W. Trevor King 2022-03-22 18:52:19 UTC
From comment 0, originally from [1], William Caban was able to reproduce this by restarting all of his control plane nodes.  How effective that is probably depends on how close your disks are to the threshold.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2052270#c12
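
As a concrete way to watch for this during verification, the Degraded condition can be read from the output of `oc get clusteroperator etcd -o json` after rebooting the control plane nodes. The sketch below parses a trimmed sample of that output; the sample values and the `degradedStatus` helper are illustrative assumptions, while the `status.conditions` field names follow the standard ClusterOperator schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is a trimmed ClusterOperator document like what
// `oc get clusteroperator etcd -o json` returns (values illustrative).
const sample = `{
  "status": {
    "conditions": [
      {"type": "Available", "status": "True"},
      {"type": "Degraded", "status": "True",
       "message": "FSyncControllerDegraded: fsync duration exceeded threshold"}
    ]
  }
}`

type clusterOperator struct {
	Status struct {
		Conditions []struct {
			Type    string `json:"type"`
			Status  string `json:"status"`
			Message string `json:"message"`
		} `json:"conditions"`
	} `json:"status"`
}

// degradedStatus returns the status of the Degraded condition,
// or "" if the condition is absent.
func degradedStatus(raw []byte) (string, error) {
	var co clusterOperator
	if err := json.Unmarshal(raw, &co); err != nil {
		return "", err
	}
	for _, c := range co.Status.Conditions {
		if c.Type == "Degraded" {
			return c.Status, nil
		}
	}
	return "", nil
}

func main() {
	status, err := degradedStatus([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println("Degraded:", status)
}
```

If Degraded stays True long after fsync latency has recovered on all members, the latching behavior is reproduced.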

Comment 5 ge liu 2022-03-24 02:23:52 UTC
Tried restarting all of the control plane nodes, but could not reproduce this issue with 4.11.0-0.nightly-2022-03-20-160505. Also ran some regression tests under high workload without hitting the issue, so closing this bug.

Comment 7 errata-xmlrpc 2022-08-10 10:50:55 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.