The new condition landed in the dev branch [1] after 4.9 forked off and before 4.10 forked off, so 4.10 and later are impacted. There are a few issues with the current implementation:

* Some "treshold" -> "threshold" typos.
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs that reason expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, although I have no exact code links to back this one up.

If the lack of clearing is accurate, that would be pretty bad, and deserve a backport to 4.10.z. I don't feel all that strongly about CamelCasing and typos; backport those if you want.

$ oc get -o yaml clusteroperator etcd
...
status:
  conditions:
  - lastTransitionTime: "2022-02-07T19:03:33Z"
    message: 'FSyncControllerDegraded: etcd disk metrics exceeded known tresholds: fsync duration value: 6.220061, '
    reason: FSyncController_etcd disk metrics exceeded known tresholds
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-02-08T00:30:33Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 7
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-01-23T19:40:27Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-01-23T19:38:02Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-02-07T17:46:50Z"
    message: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-02-07_174639 on node "m2"
    reason: UpgradeBackupSuccessful
    status: "True"
    type: RecentBackup
  extension: null
  relatedObjects:
  ...
  versions:
  - name: raw-internal
    version: 4.10.0-rc.1
  - name: etcd
    version: 4.10.0-rc.1
  - name: operator
    version: 4.10.0-rc.1

[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131
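For illustration, here is a rough Go sketch (not the operator's actual code) of what the second and third bullets could look like once fixed: keep the human-readable detail in Message, use a CamelCase slug in Reason, and set the condition back to False when fsync latency drops under the threshold so it can't latch. The "FSyncThresholdsExceeded" slug and the package/function names are made up for the example; library-go's v1helpers.SetOperatorCondition is the assumed mechanism for applying the condition.

// Sketch only, not the operator's actual implementation.
package fsynccheck

import (
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// updateFSyncCondition sets or clears the FSyncControllerDegraded condition
// based on the latest observed fsync duration (seconds) and a threshold.
func updateFSyncCondition(conditions *[]operatorv1.OperatorCondition, fsyncSeconds, threshold float64) {
	cond := operatorv1.OperatorCondition{
		Type:   "FSyncControllerDegraded",
		Status: operatorv1.ConditionFalse,
		Reason: "AsExpected",
	}
	if fsyncSeconds > threshold {
		cond.Status = operatorv1.ConditionTrue
		// CamelCase slug in Reason; human-readable detail in Message.
		cond.Reason = "FSyncThresholdsExceeded" // hypothetical slug
		cond.Message = fmt.Sprintf("etcd disk metrics exceeded known thresholds: fsync duration value: %f", fsyncSeconds)
	}
	// Setting ConditionFalse again once latency recovers (rather than only
	// ever setting ConditionTrue) is what keeps the condition from latching.
	v1helpers.SetOperatorCondition(conditions, cond)
}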
We'll probably want to revert or otherwise restore the stale-condition controller dropped in [1] when we move to a slug reason. Since Sam dropped it, the library-go implementation has moved from per-second syncs to per-minute syncs [2], which is much more reasonable.

[1]: https://github.com/openshift/cluster-etcd-operator/pull/672
[2]: https://github.com/openshift/library-go/blame/874db8a3dac9034969ba6497aa53aa647bfe25f8/pkg/operator/staleconditions/remove_stale_conditions.go#L28
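For context, a minimal sketch of what a stale-conditions sweep amounts to (an illustration, not the library-go controller itself): on each sync, drop any condition types that have been declared obsolete, such as an old condition left behind once we move to a slug reason. The package/function names and the listed condition type below are assumptions for the example.

// Illustration of the idea behind a stale-conditions sweep.
package staleillustration

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// removeStaleConditions drops obsolete condition types from the operator status.
func removeStaleConditions(conditions *[]operatorv1.OperatorCondition) {
	staleTypes := []string{
		"FSyncControllerDegraded", // example: a condition type we might want scrubbed
	}
	for _, conditionType := range staleTypes {
		v1helpers.RemoveOperatorCondition(conditions, conditionType)
	}
}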
I'll set this at low severity until we have confirmed or refuted whether the condition fails to clear, with high priority so we make that decision quickly. But it's not my project, so feel free to adjust as needed.
#743 fixed the simplest bit of this. Still the non-slug reason and the possible status-latching to go.
Setting blocker- based on previous comments.
I've been able to reproduce etcd staying in degraded mode and blocking cluster upgrades after that. In my case, after crashing the cluster and rebooting the nodes, we found that on boot one of the nodes had an SSD disk that would sometimes spike on initial access (randomly on boot). But even after the disks were stable, or on other reboots where they were fine, etcd kept this FSyncControllerDegraded condition and blocked any further upgrades. This was confirmed multiple times when upgrading from 4.9.19 to 4.10rc2 or 4.10rc3.
I've updated the bug title based on comment 12's latching confirmation. Repeating the three points from comment 0:

* Some "treshold" -> "threshold" typos (fixed already via #743).
* The "etcd disk metrics exceeded..." reasons [1] aren't the CamelCase slugs that reason expects [2].
* It seems that the condition may not get cleared once latency returns to reasonable levels, confirmed in comment 12.

The latter two still need fixing, and the last one is by far the most important, because a Degraded=True etcd ClusterOperator will block updates, including 4.y.z -> 4.y.z' patch updates, unless the cluster admin does some hoop-jumpy workarounds or is updating to a release that fixes the latching behavior.

[1]: https://github.com/openshift/cluster-etcd-operator/pull/687/files#diff-1ffe8d5a41289fd1acde2f0fcd5d53c842caffe41d70a478e82b018977b6ac14R127
[2]: https://github.com/openshift/api/blob/7e3ffb09accd36fb0536fa0e69bed5d70cccd6e5/config/v1/types_cluster_operator.go#L131
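For anyone wanting to confirm whether their cluster is in this state, here is a small Go sketch, equivalent to inspecting "oc get -o yaml clusteroperator etcd", that reads the etcd ClusterOperator and prints its Degraded condition. It assumes a kubeconfig at the default path and a reasonably recent openshift/client-go; it's an example, not product tooling.

// Sketch: print the etcd ClusterOperator's Degraded condition.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes ~/.kube/config; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "etcd", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range co.Status.Conditions {
		if cond.Type == configv1.OperatorDegraded {
			// Degraded=True here is what blocks updates, including patch updates.
			fmt.Printf("Degraded=%s reason=%s message=%s\n", cond.Status, cond.Reason, cond.Message)
		}
	}
}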
Raising severity to high now that the latching has been confirmed, since blocking updates (in cases where we have latched) slows the rollout of other fixes.
At Allen's request, I've sharded the three points into separate bug series. This bug now focuses exclusively on the "treshold" -> "threshold" typos. A new bug 2057642 series focuses on the slugging. And a new bug 2057644 focuses on the latching. All three series should be backported to 4.10.z, but the most pressing is the latching bug 2057644, because that has the update-sticking impact.
Hi W. Trevor King, thanks for your detailed info. Verified the first issue with 4.11.0-0.nightly-2022-02-27-122819.
I have a lab OCP 4.10.6 cluster that I began upgrading to 4.10.8, and encountered this error:

FSyncControllerDegraded: etcd disk metrics exceeded known tresholds: fsync duration value: 8.192000,

I've gone ahead and rebooted my masters as well. This being a lab on a hypervisor shared with other apps, I do think this cluster legitimately has etcd disk issues. Practically speaking, what does this mean for my upgrade? Am I no longer able to perform upgrades, and will it just sit in this state below?

Unable to apply 4.10.8: wait has exceeded 40 minutes for these operators: etcd

Or is there another path forward where the user can acknowledge and proceed ahead despite the etcd disk issues?
The breakdown into bug series is summarized in comment 15. The only serious component is the latching bug 2057644; the other two are cosmetic. Bug 2057644 is going back to 4.10.z with bug 2059347, which is still in POST. Once that bug gets fixed, you should be able to update from your cluster into the fixed 4.10.z.
Ran into this on an Azure Stack all-flash system; it is now stuck at the 4.10.6 update. Is there any way to clear the Degraded=True status?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069