Bug 2052270 - FSyncControllerDegraded has "treshold" -> "threshold" typos
Summary: FSyncControllerDegraded has "treshold" -> "threshold" typos
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: W. Trevor King
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 2059632
 
Reported: 2022-02-08 23:28 UTC by W. Trevor King
Modified: 2022-08-10 10:49 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2057642 2057644
Environment:
Last Closed: 2022-08-10 10:48:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 743 0 None Merged Bug 2052270: pkg/operator/metriccontroller/fsync_controller: Fix "treshold" -> "thresholds" typos 2022-02-09 19:16:46 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:49:09 UTC

Description W. Trevor King 2022-02-08 23:28:12 UTC
The new condition landed in the dev branch [1] after 4.9 forked off and before 4.10 forked off, so 4.10 and later are impacted.

There are a few issues with the current implementation:

* Some "treshold" -> "threshold" typos.
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs reason expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, although no exact code links to back this one up.

If the lack of clearing is accurate, that would be pretty bad and deserve a backport to 4.10.z.  I don't feel all that strongly about the CamelCasing and typos; backport those if you want.

$ oc get -o yaml clusteroperator etcd
...
status:
  conditions:
  - lastTransitionTime: "2022-02-07T19:03:33Z"
    message: 'FSyncControllerDegraded: etcd disk metrics exceeded known tresholds:  fsync
      duration value: 6.220061, '
    reason: FSyncController_etcd disk metrics exceeded known tresholds
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-02-08T00:30:33Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 7
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-01-23T19:40:27Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-01-23T19:38:02Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-02-07T17:46:50Z"
    message: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-02-07_174639
      on node "m2"
    reason: UpgradeBackupSuccessful
    status: "True"
    type: RecentBackup
  extension: null
  relatedObjects: ...
  versions:
  - name: raw-internal
    version: 4.10.0-rc.1
  - name: etcd
    version: 4.10.0-rc.1
  - name: operator
    version: 4.10.0-rc.1

[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131
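
For illustration, here is a minimal sketch of what a properly slugged condition could look like, using the openshift/api types from [2].  The slug EtcdDiskFsyncThresholdExceeded is my own placeholder, not necessarily the name an eventual fix would pick:

package main

import (
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// reason is documented as a CamelCase slug [2]; compare with the
	// space-separated "FSyncController_etcd disk metrics exceeded known
	// tresholds" reason in the YAML above.
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.NewTime(time.Now()),
		Reason:             "EtcdDiskFsyncThresholdExceeded", // hypothetical slug
		Message:            "etcd disk metrics exceeded known thresholds: fsync duration value: 6.220061",
	}
	fmt.Printf("%s=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
}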

Comment 2 W. Trevor King 2022-02-08 23:36:49 UTC
We'll probably want to revert or otherwise restore the stale-condition controller dropped in [1] when we move to a slug reason.  Since Sam dropped it, the library-go implementation has moved from once-a-second syncs to once-a-minute syncs [2], which is much more reasonable.

[1]: https://github.com/openshift/cluster-etcd-operator/pull/672
[2]: https://github.com/openshift/library-go/blame/874db8a3dac9034969ba6497aa53aa647bfe25f8/pkg/operator/staleconditions/remove_stale_conditions.go#L28
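
For illustration, a minimal sketch of my own (not the library-go code at [2]) of what such a stale-condition sweep does on each sync: drop operator conditions whose types no controller sets any more.

package main

import (
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// removeStaleConditions drops conditions whose types are no longer
// managed, which is the core of what the stale-condition controller
// does on each periodic sync.
func removeStaleConditions(conds []operatorv1.OperatorCondition, staleTypes ...string) []operatorv1.OperatorCondition {
	stale := make(map[string]bool, len(staleTypes))
	for _, t := range staleTypes {
		stale[t] = true
	}
	var kept []operatorv1.OperatorCondition
	for _, c := range conds {
		if !stale[c.Type] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	conds := []operatorv1.OperatorCondition{
		{Type: "FSyncControllerDegraded", Status: operatorv1.ConditionTrue},
		{Type: "EtcdMembersAvailable", Status: operatorv1.ConditionTrue},
	}
	conds = removeStaleConditions(conds, "FSyncControllerDegraded")
	fmt.Println(conds) // only EtcdMembersAvailable remains
}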

Comment 7 W. Trevor King 2022-02-09 00:18:41 UTC
I'll set this at low severity until we have confirmed or refuted that the condition fails to clear, with high priority to make that decision quickly.  But it's not my project, so feel free to adjust as needed.

Comment 10 W. Trevor King 2022-02-09 19:17:33 UTC
#743 fixed the simplest bit of this.  The non-slug reason and the possible status-latching are still to go.

Comment 11 Wally 2022-02-09 20:08:55 UTC
Setting blocker- based on previous comments.

Comment 12 William Caban 2022-02-21 16:48:49 UTC
I've been able to reproduce etcd staying in degraded mode and blocking cluster upgrades after that.  In my case, after crashing the cluster and rebooting the nodes, we found that one node had an SSD that would sometimes spike on initial access (randomly on boot).  But even after the disks were stable again, and in other reboots where they were fine, etcd kept this FSyncControllerDegraded condition and blocked any further upgrades.


This was confirmed multiple times when upgrading from 4.9.19 to 4.10.0-rc.2 or 4.10.0-rc.3.

Comment 13 W. Trevor King 2022-02-21 17:10:44 UTC
I've updated the bug title based on comment 12's latching confirmation.  Repeating the three points from comment 0:

* Some "treshold" -> "threshold" typos (fixed already via #743).
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs reason expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, confirmed in comment 12.

Those latter two still need fixing, and the last one is by far the most important, because a Degraded=True etcd ClusterOperator will block updates, including 4.y.z -> 4.y.z' patch updates, unless the cluster admin does some hoop-jumpy workarounds or is updating to a release that fixes the latching behavior (a simplified sketch of the latching follows the links below).

[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131
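
For illustration only, a simplified sketch of the latching pattern described above.  This is not the actual fsync_controller code, and the threshold value is made up:

package main

import "fmt"

// fsyncThresholdSeconds is a made-up value for this sketch; the real
// controller compares an fsync-duration metric against its own limits.
const fsyncThresholdSeconds = 1.0

type controller struct{ degraded bool }

// sync latches Degraded on a single spike: there is a branch that sets
// the condition to True, but no branch that sets it back to False once
// latency recovers.  That missing else-branch is the heart of bug 2057644.
func (c *controller) sync(fsyncSeconds float64) {
	if fsyncSeconds > fsyncThresholdSeconds {
		c.degraded = true
	}
	// Missing: c.degraded = false when the metric is healthy again.
}

func main() {
	c := &controller{}
	for _, v := range []float64{8.192, 0.01, 0.01} { // one boot-time spike, then healthy
		c.sync(v)
		fmt.Println("Degraded:", c.degraded)
	}
	// Prints true three times -- the condition never clears.
}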

Comment 14 W. Trevor King 2022-02-21 18:39:50 UTC
Raising severity to high now that the latching has been confirmed, since blocking updates (in cases where we have latched) slows the rollout of other fixes.

Comment 15 W. Trevor King 2022-02-23 19:20:08 UTC
At Allen's request, I've sharded the three points into separate bug series.  This bug now focuses exclusively on the "treshold" -> "threshold" typos.  A new bug 2057642 series focuses on the slugging.  And a new bug 2057644 focuses on the latching.  All three series should be backported to 4.10.z, but the most pressing is the latching bug 2057644, because that has the update-sticking impact.

Comment 16 ge liu 2022-03-03 08:12:49 UTC
Hi W. Trevor King, thanks for your detailed info.  Verified the first issue with 4.11.0-0.nightly-2022-02-27-122819.

Comment 17 Kevin Chung 2022-04-04 17:15:48 UTC
I have a lab OCP 4.10.6 cluster that I began upgrading to 4.10.8, and encountered this error:
FSyncControllerDegraded: etcd disk metrics exceeded known tresholds:  fsync duration value: 8.192000,

I've gone ahead and rebooted my masters as well.  This being a lab on a hypervisor shared with other apps, I do think this cluster legitimately has etcd disk issues.  Practically speaking, what does this mean for my upgrade?  Am I no longer able to perform upgrades, and will the upgrade just sit in this state below?
Unable to apply 4.10.8: wait has exceeded 40 minutes for these operators: etcd

Or is there another path forward where the user can acknowledge and proceed ahead despite the etcd disk issues?

Comment 18 W. Trevor King 2022-04-05 00:02:01 UTC
The breakdown into bug series is summarized in comment 15.  The only serious component is the latching bug 2057644; the other two are cosmetic.  Bug 2057644 is going back to 4.10.z with bug 2059347, still in POST.  Once that bug gets fixed, you should be able to update your cluster to the fixed 4.10.z.

Comment 20 bottkars 2022-04-07 08:27:43 UTC
Ran into this on an AzureStack all-flash system; now stuck at the 4.10.6 update.  Is there any way to clear the Degraded=True status?

Comment 22 errata-xmlrpc 2022-08-10 10:48:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

