Bug 1943289 - machine-config ClusterOperator claims Available=False and does not level while waiting on worker pool
Summary: machine-config ClusterOperator claims Available=False and does not level while waiting on worker pool
Keywords:
Status: CLOSED DUPLICATE of bug 1955300
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Kirsten Garrison
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-25 17:58 UTC by W. Trevor King
Modified: 2021-07-30 21:12 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-30 21:12:03 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-03-25 17:58:01 UTC
Seen in a 4.7.3 to 4.7.4 update: the master MachineConfigPool had completed, which should be enough for the machine-config operator, but the operator claimed Available=False and did not bump status.versions while waiting for the worker MachineConfigPool.  The expectation is that the machine-config operator levels once the required master pool completes, because compute can trail for many reasons, including restrictive PodDisruptionBudgets (PDBs), and as long as there isn't excessive minor-version skew, trailing compute does not significantly impact the cluster (and, for example, should not block subsequent control-plane updates from being applied).
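For anyone looking at a live cluster rather than this must-gather, the same pool state can be checked directly; a minimal sketch (the jsonpath expressions are illustrative, not taken from this gather), which in this situation should report Updated=True for master and Updated=False for worker:

$ oc get machineconfigpool master -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}'
$ oc get machineconfigpool worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}'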

The 4.7.3 to 4.7.4 update is not complete:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version' | head -n3
2021-03-25T07:30:44Z  Partial 4.7.4
2021-03-18T07:31:09Z 2021-03-18T21:07:48Z Completed 4.7.3
2021-03-11T07:30:44Z 2021-03-11T14:22:09Z Completed 4.7.2

The CVO isn't all that clear on what it's stuck on:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-01-30T13:59:22Z Available=True -: Done applying 4.7.3
2021-03-25T09:37:17Z Failing=False -: -
2021-03-25T07:30:44Z Progressing=True -: Working towards 4.7.4: 560 of 668 done (83% complete)
2021-03-08T15:39:03Z RetrievedUpdates=True -: -
2021-03-25T09:40:30Z Upgradeable=False One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading: Cluster operator machine-config cannot be upgraded between minor versions: 

But CVO logs show that it is the machine-config ClusterOperator:

$ grep 'Running sync.*in state\|Result of work' namespaces/openshift-cluster-version/pods/cluster-version-operator-f54b6964f-ckqcz/cluster-version-operator/cluster-version-operator/logs/current.log | tail
2021-03-25T11:51:36.695318607Z I0325 11:51:36.695312       1 task_graph.go:555] Result of work: []
2021-03-25T11:57:18.473047398Z I0325 11:57:18.473035       1 task_graph.go:555] Result of work: [Cluster operator machine-config is still updating]
2021-03-25T12:00:17.903909526Z I0325 12:00:17.903886       1 sync_worker.go:549] Running sync 4.7.4 (force=false) on generation 66 in state Updating at attempt 17
2021-03-25T12:00:18.047125497Z I0325 12:00:18.047115       1 task_graph.go:555] Result of work: []
2021-03-25T12:05:59.817059048Z I0325 12:05:59.817049       1 task_graph.go:555] Result of work: [Cluster operator machine-config is still updating]
2021-03-25T12:09:12.970352284Z I0325 12:09:12.970338       1 sync_worker.go:549] Running sync 4.7.4 (force=false) on generation 66 in state Updating at attempt 18
2021-03-25T12:09:13.099230749Z I0325 12:09:13.099222       1 task_graph.go:555] Result of work: []
2021-03-25T12:14:54.883529345Z I0325 12:14:54.883514       1 task_graph.go:555] Result of work: [Cluster operator machine-config is still updating]
2021-03-25T12:17:59.747114076Z I0325 12:17:59.747102       1 sync_worker.go:549] Running sync 4.7.4 (force=false) on generation 66 in state Updating at attempt 19
2021-03-25T12:17:59.877575546Z I0325 12:17:59.877569       1 task_graph.go:555] Result of work: []

machine-config is really mad, with Available=False and Degraded=True, claiming RequiredPoolsFailed despite the pool at issue being the ideally-not-required-for-update 'worker' pool:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'       
2021-03-25T09:18:42Z Available=False -: Cluster not available for 4.7.4
2021-03-25T09:08:42Z Progressing=True -: Working towards 4.7.4
2021-03-25T09:45:01Z Degraded=True RequiredPoolsFailed: Unable to apply 4.7.4: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 52, ready 25, updated: 25, unavailable: 11)
2021-03-25T09:39:16Z Upgradeable=False One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading: -

Just confirming that the 'master' pool is fine:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml | jq -r '.status.extension | to_entries[] | .key + "\n" + .value + "\n"'
master
all 3 nodes are at latest configuration rendered-master-52dbe6cc1c25686b710496466d9c1dbb
worker
pool is degraded because nodes fail with "9 nodes are reporting degraded status on sync": "Node ip-10-0-137-214.ec2.internal is reporting: \"failed to drain node (5 tries): timed out waiting...
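
The extension entry above is truncated; the full per-node drain messages should be on the worker pool's NodeDegraded condition in the same must-gather (an illustrative query following the yaml2json/jq pattern used above):

$ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq -r '.status.conditions[] | select(.type == "NodeDegraded") | .message'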

And that the machine-config operator has not bumped its version to 4.7.4 to claim level:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml | jq -r '.status.versions'                                              
[
  {
    "name": "operator",
    "version": "4.7.3"
  }
]

And that the worker pool does not have the required-for-upgrade label [1]:

$ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq .metadata.labels
{
  "custom-kubelet": "enabled",
  "machineconfiguration.openshift.io/mco-built-in": "",
  "pools.operator.machineconfiguration.openshift.io/worker": ""
}
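
For contrast, the master pool's labels can be pulled from the same gather; assuming [1] is where the required-for-upgrade label key is defined, master should carry it while worker (above) does not:

$ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/master.yaml | jq .metadata.labels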

The problem is the degraded-pool check that lands before that label check [2].  That check was introduced in [3].  Checking all pools for degraded status and using that to feed Upgradeable=False is fine with me.  But degradation in a non-required pool should not also translate to Available=False, Degraded=True, and a delayed version bump.  In this case, the compute pool is making progress towards the new version; it's just being slowed by PDBs.  That's not a page-at-midnight, Available=False level of issue.
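
As an aside, one illustrative way to back up the "slowed by PDBs" theory on a live cluster (not something captured in this gather) is to list PodDisruptionBudgets that currently allow zero disruptions, since those are the ones that stall drains:

$ oc get poddisruptionbudgets --all-namespaces -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.namespace + "/" + .metadata.name'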

Similar space to bug 1932105, but while that was "MCO is not strict enough for 'master'", this one is "MCO is too strict for non-required pools".

[1]: https://github.com/openshift/machine-config-operator/blob/5bc4bd4e6a0c778d8f55f30f6eae4f031c9e3c41/pkg/operator/sync.go#L38
[2]: https://github.com/openshift/machine-config-operator/blob/5bc4bd4e6a0c778d8f55f30f6eae4f031c9e3c41/pkg/operator/sync.go#L597-L605
[3]: https://github.com/openshift/machine-config-operator/pull/2231

Comment 1 Kirsten Garrison 2021-07-30 21:12:03 UTC
This is partially fixed by our drain timeout work: https://bugzilla.redhat.com/show_bug.cgi?id=1968759

And the Available status refactoring is going to be done by https://bugzilla.redhat.com/show_bug.cgi?id=1955300, which is a similar issue (Available=False is too broad and needs to be refined to cover only truly unavailable cases)

*** This bug has been marked as a duplicate of bug 1955300 ***

