This alert fired twice while upgrading the CI build cluster from 4.4.0 to 4.4.2 (https://coreos.slack.com/archives/CHY2E1BL4/p1588341749327300):

[FIRING:1] ClusterOperatorDegraded cluster-version-operator (Degraded metrics 10.0.140.81:9099 machine-config openshift-cluster-version cluster-version-operator-596c54cbf-7q98t openshift-monitoring/k8s RenderConfigFailed cluster-version-operator critical)
Cluster operator machine-config has been degraded for 10 mins. Operator is degraded because RenderConfigFailed and cluster upgrades will be unstable.

I will add a link to the must-gather later. The alert should not fire if the upgrade is considered successful.
We don't control this alert; it comes from the CVO: https://github.com/openshift/cluster-version-operator/blob/751c6d0c872e05f218f01d2a9f20293b4dfcca88/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
What should we do about this? The MCO goes Degraded on RenderConfigFailed and then recovers on its own. How can the CVO have an informed opinion about how long that sort of thing should be expected to take? Also, "and cluster upgrades will be unstable" doesn't square with our existing behavior (allowing updates to be initiated regardless of Degraded=True). I'm personally in favor of making this a warning-level alert and leaving it to individual operators to set critical alerts if they go degraded for long enough that something is actually worth worrying about.
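For illustration, a warning-level variant might look something like this. This is a sketch only, not the shipped CVO rule; the 30m window and annotation wording are made up for the example, though `cluster_operator_conditions` is the metric the existing alert family queries:

```yaml
# Sketch only -- not the actual CVO rule.  The 30m tolerance and
# message text are illustrative; severity is downgraded from
# critical to warning, leaving critical alerts to individual operators.
- alert: ClusterOperatorDegraded
  expr: cluster_operator_conditions{condition="Degraded"} == 1
  for: 30m
  labels:
    severity: warning
  annotations:
    message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
```

The longer `for` duration would also ride out self-recovering conditions like the one in this report.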
Why is the operator going degraded? Why would RenderConfig fail? An admin looking at our system would assume something went wrong. Components should not go degraded during normal upgrade operations. Misidentified failures waste human time and make upgrades appear to be broken. If it’s not a real failure, don’t fail. If it’s a real failure, fix the failure or retry (add hysteresis to smooth those out). If it’s really, really, really hard to avoid going degraded, ask why the system is that way and how it should be improved. If your car’s check engine light randomly comes on, you start to consider never buying that car brand again.
Operators shouldn’t go degraded over things that look like normal operation. Upgrade should be normal. Don’t make it look abnormal.
OK, moving this back to the MCO with the position that "the alert is fine; the MCO should only go Degraded=True on abnormal failures". It's not clear to me what was behind this particular RenderConfigFailed condition.
From the attached must-gather, the CVO saw issues wrapping up between 13:59 and 14:02:

$ grep clusteroperator cluster-version-operator/cluster-version-operator/logs/current.log | grep machine-config
2020-05-01T13:51:26.351715237Z I0501 13:51:26.351681 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T13:53:06.523764526Z E0501 13:53:06.523741 1 task.go:81] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
2020-05-01T13:54:26.589238261Z I0501 13:54:26.589229 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T13:59:37.235492962Z E0501 13:59:37.235448 1 task.go:81] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
2020-05-01T14:01:39.274058774Z I0501 14:01:39.274048 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:02:12.274321112Z I0501 14:02:12.274264 1 sync_worker.go:634] Done syncing for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:05:30.209893695Z I0501 14:05:30.209885 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:05:30.209961321Z I0501 14:05:30.209945 1 sync_worker.go:634] Done syncing for clusteroperator "machine-config" (475 of 573)

And the MCO logs from the must-gather open before then:

$ cat namespaces/openshift-machine-config-operator/pods/machine-config-operator-7886fd8595-vnkdw/machine-config-operator/machine-config-operator/logs/current.log
2020-05-01T13:51:23.940246481Z I0501 13:51:23.940042 1 start.go:45] Version: v4.4.0-202004301440-dirty (46ea9642ad42194747bc19ca42775c72e0376b80)
...
2020-05-01T13:53:19.593464975Z I0501 13:53:19.592675 1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config
2020-05-01T13:53:21.415681534Z I0501 13:53:21.415648 1 operator.go:264] Starting MachineConfigOperator
2020-05-01T13:53:21.419005865Z I0501 13:53:21.418964 1 event.go:281] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"ae1f5446-e258-48b7-9d67-f121557c48f2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.4.0}] to [{operator 4.4.2}]
2020-05-01T14:02:11.665704363Z I0501 14:02:11.665636 1 event.go:281] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"ae1f5446-e258-48b7-9d67-f121557c48f2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.4.0}] to [{operator 4.4.2}]

There's no mention of any RenderConfigFailed. Checking the ClusterOperator:

$ cat cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml
...
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    message: Cluster has deployed 4.4.2
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    message: Cluster version is 4.4.2
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    status: "False"
    type: Degraded
...

So I think we see the recovery, but the MCO logs aren't deep enough for me to understand the original issue. The events mentioned in the MCO logs are strange.
Looks like they ended up in the default namespace (I would have expected them to live in the openshift-machine-config-operator namespace):

$ yaml2json <namespaces/default/core/events.yaml | event-log
2020-04-30T07:49:42Z - 2020-05-01T13:47:42Z (136) default failed to delete empty node: failed to delete ip-10-0-143-172.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1a", machine replicas are <= 0
2020-04-30T07:49:42Z - 2020-05-01T13:48:22Z (160) default failed to delete empty node: failed to delete ip-10-0-174-84.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1c", machine replicas are <= 0
2020-05-01T13:46:12Z - 2020-05-01T13:46:12Z (1) default Progressing towards operator: 4.4.2
2020-05-01T13:53:21Z - 2020-05-01T13:53:21Z (1) default clusteroperator/machine-config-operator started a version change from [{operator 4.4.0}] to [{operator 4.4.2}]
2020-05-01T14:01:00Z - 2020-05-01T14:01:00Z (1) default failed to delete empty node: failed to delete ip-10-0-143-172.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1a", machine replicas are <= 0
2020-05-01T14:01:00Z - 2020-05-01T14:01:00Z (1) default failed to delete empty node: failed to delete ip-10-0-174-84.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1c", machine replicas are <= 0
2020-05-01T14:02:11Z - 2020-05-01T14:02:11Z (1) default clusteroperator/machine-config-operator version changed from [{operator 4.4.0}] to [{operator 4.4.2}]

But I guess it's possible that we need to close this INSUFFICIENT_DATA and wait for someone to must-gather while their cluster is actually showing a surprising RenderConfigFailed.
This was fixed via https://github.com/openshift/machine-config-operator/pull/1598 in 4.5. Do we want this backported?
> This was fixed via https://github.com/openshift/machine-config-operator/pull/1598 in 4.5

Hooray :). From the PR message:

> What happened from now on is that the MCO kept rolling and eventually finishes the upgrade I was running but the RenderConfigFailed Degraded status was never cleared back to False even if the task did succeed (I had to wait for all the sync functions to finish as expected, and it takes a while still reporting degraded).

It's not clear to me when the old code (before the fix) went back to Degraded=False. If it was "never" (from the PR message), then we definitely want this backported, probably as far as 4.2, because if the operator sticks at Degraded=True, updates are never going to complete and customers are going to get mad. If it was "it takes a while" (also from the PR message), and "a while" is just the ~10m we saw in the initial report here, then I'm agnostic about backports. It looks like it would be a pretty clean backport, so... maybe we should just do it? Even if we don't, 4.4 will be EOL before we know it ;).
Moving back to MODIFIED so ART's sweeper will find us and attach us to an errata while sweeping us into ON_QA.
This seems like something we'd want to backport at least to 4.4.z once it's VERIFIED here. Does that make sense? Do people think it should go back even further?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409