This alert fired twice while upgrading the CI build cluster from 4.4.0 to 4.4.2 (https://coreos.slack.com/archives/CHY2E1BL4/p1588341749327300):

[FIRING:1] ClusterOperatorDegraded cluster-version-operator (Degraded metrics 10.0.140.81:9099 machine-config openshift-cluster-version cluster-version-operator-596c54cbf-7q98t openshift-monitoring/k8s RenderConfigFailed cluster-version-operator critical)
Cluster operator machine-config has been degraded for 10 mins. Operator is degraded because RenderConfigFailed and cluster upgrades will be unstable.

I will add a link to the must-gather later. The alert should not fire if the upgrade is considered successful.
We don't control this alert; it comes from the CVO: https://github.com/openshift/cluster-version-operator/blob/751c6d0c872e05f218f01d2a9f20293b4dfcca88/install/0000_90_cluster-version-operator_02_servicemonitor.yaml
What should we do about this? The MCO goes Degraded on RenderConfigFailed and then recovers on its own. How can the CVO have an informed opinion about how long that sort of thing should be expected to take? Also, "and cluster upgrades will be unstable" doesn't square with our existing behavior (allowing updates to be initiated regardless of Degraded=True). I'm personally in favor of making this a warning-level alert and leaving it to individual operators to set critical alerts if they go degraded for long enough that something is actually worth worrying about.
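For illustration, a warning-level variant might look something like this. This is a sketch only, not the shipped CVO rule; the 30m window and annotation wording are made up for the example, though `cluster_operator_conditions` is the metric the existing alert family queries:

```yaml
# Sketch only -- not the actual CVO rule.  The 30m tolerance and
# message text are illustrative; severity is downgraded from
# critical to warning, leaving critical alerts to individual operators.
- alert: ClusterOperatorDegraded
  expr: cluster_operator_conditions{condition="Degraded"} == 1
  for: 30m
  labels:
    severity: warning
  annotations:
    message: Cluster operator {{ $labels.name }} has been degraded for 30 minutes.
```

The longer `for` duration would also ride out self-recovering conditions like the one in this report.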
Why is the operator going degraded? Why would RenderConfig fail? An admin looking at our system would assume something went wrong. Components should not go degraded during normal upgrade operations. Misidentified failures waste human time and make upgrades appear to be broken. If it’s not a real failure, don’t fail. If it’s a real failure, fix the failure or retry (add hysteresis to smooth those out). If it’s really, really, really hard to avoid going degraded, ask why the system is that way and how it should be improved. If your car’s check engine light randomly comes on, you start to consider never buying that car brand again.
Operators shouldn’t go degraded over things that look like normal operation. Upgrade should be normal. Don’t make it look abnormal.
OK, moving this back to the MCO with the position that "the alert is fine; the MCO should only go Degraded=True on abnormal failures". It's not clear to me what was behind this particular RenderConfigFailed condition.
From the attached must-gather, the CVO saw issues wrapping up between 13:59 and 14:02:

$ grep clusteroperator cluster-version-operator/cluster-version-operator/logs/current.log | grep machine-config
2020-05-01T13:51:26.351715237Z I0501 13:51:26.351681 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T13:53:06.523764526Z E0501 13:53:06.523741 1 task.go:81] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
2020-05-01T13:54:26.589238261Z I0501 13:54:26.589229 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T13:59:37.235492962Z E0501 13:59:37.235448 1 task.go:81] error running apply for clusteroperator "machine-config" (475 of 573): Cluster operator machine-config is still updating
2020-05-01T14:01:39.274058774Z I0501 14:01:39.274048 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:02:12.274321112Z I0501 14:02:12.274264 1 sync_worker.go:634] Done syncing for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:05:30.209893695Z I0501 14:05:30.209885 1 sync_worker.go:621] Running sync for clusteroperator "machine-config" (475 of 573)
2020-05-01T14:05:30.209961321Z I0501 14:05:30.209945 1 sync_worker.go:634] Done syncing for clusteroperator "machine-config" (475 of 573)

And the MCO logs from the must-gather open before then:

$ cat namespaces/openshift-machine-config-operator/pods/machine-config-operator-7886fd8595-vnkdw/machine-config-operator/machine-config-operator/logs/current.log
2020-05-01T13:51:23.940246481Z I0501 13:51:23.940042 1 start.go:45] Version: v4.4.0-202004301440-dirty (46ea9642ad42194747bc19ca42775c72e0376b80)
...
2020-05-01T13:53:19.593464975Z I0501 13:53:19.592675 1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config
2020-05-01T13:53:21.415681534Z I0501 13:53:21.415648 1 operator.go:264] Starting MachineConfigOperator
2020-05-01T13:53:21.419005865Z I0501 13:53:21.418964 1 event.go:281] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"ae1f5446-e258-48b7-9d67-f121557c48f2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.4.0}] to [{operator 4.4.2}]
2020-05-01T14:02:11.665704363Z I0501 14:02:11.665636 1 event.go:281] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"ae1f5446-e258-48b7-9d67-f121557c48f2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.4.0}] to [{operator 4.4.2}]

There's no mention of any RenderConfigFailed. Checking the ClusterOperator:

$ cat cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml
...
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    message: Cluster has deployed 4.4.2
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    message: Cluster version is 4.4.2
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-05-01T14:02:11Z"
    status: "False"
    type: Degraded
...

So I think we see the recovery, but the MCO logs aren't deep enough for me to understand the original issue. The events mentioned in the MCO logs are strange.
Looks like they ended up in the default namespace (I would have expected them to live in the openshift-machine-config-operator namespace):

$ yaml2json <namespaces/default/core/events.yaml | event-log
2020-04-30T07:49:42Z - 2020-05-01T13:47:42Z (136) default failed to delete empty node: failed to delete ip-10-0-143-172.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1a", machine replicas are <= 0
2020-04-30T07:49:42Z - 2020-05-01T13:48:22Z (160) default failed to delete empty node: failed to delete ip-10-0-174-84.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1c", machine replicas are <= 0
2020-05-01T13:46:12Z - 2020-05-01T13:46:12Z (1) default Progressing towards operator: 4.4.2
2020-05-01T13:53:21Z - 2020-05-01T13:53:21Z (1) default clusteroperator/machine-config-operator started a version change from [{operator 4.4.0}] to [{operator 4.4.2}]
2020-05-01T14:01:00Z - 2020-05-01T14:01:00Z (1) default failed to delete empty node: failed to delete ip-10-0-143-172.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1a", machine replicas are <= 0
2020-05-01T14:01:00Z - 2020-05-01T14:01:00Z (1) default failed to delete empty node: failed to delete ip-10-0-174-84.ec2.internal: unable to delete 1 machines in "openshift-machine-api/build01-9hdwj-worker-us-east-1c", machine replicas are <= 0
2020-05-01T14:02:11Z - 2020-05-01T14:02:11Z (1) default clusteroperator/machine-config-operator version changed from [{operator 4.4.0}] to [{operator 4.4.2}]

But I guess it's possible that we need to close this INSUFFICIENT_DATA and wait for someone to must-gather while their cluster is actually showing a surprising RenderConfigFailed.
This was fixed via https://github.com/openshift/machine-config-operator/pull/1598 in 4.5. Do we want this backported?
> This was fixed via https://github.com/openshift/machine-config-operator/pull/1598 in 4.5

Hooray :). From the PR message:

> What happened from now on is that the MCO kept rolling and eventually finishes the upgrade I was running but the RenderConfigFailed Degraded status was never cleared back to False even if the task did succeed (I had to wait for all the sync functions to finish as expected, and it takes a while still reporting degraded).

It's not clear to me when the old code (before the fix) went back to Degraded=False. If it was "never" (from the PR message), then we definitely want this backported, probably as far as 4.2, because if the operator sticks at Degraded=True, updates are never going to complete and customers are going to get mad. If it was "it takes a while" (also from the PR message), and "a while" is just the ~10m we saw in the initial report here, then I'm agnostic about backports. It looks like it would be a pretty clean backport, so... maybe we should just do it? Even if we don't, 4.4 will be EOL before we know it ;).
Moving back to MODIFIED so ART's sweeper will find us and attach us to an errata while sweeping us into ON_QA.
This seems like something we'd want to backport at least to 4.4.z once it's VERIFIED here. Does that make sense? Do people think it should go back even further?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409