Description of problem:

Upgrade from 4.4.0-rc.4 to 4.4.0-rc.6 with the Upgradeable=False condition set by overriding a cluster operator in ClusterVersion. The upgrade starts, but gets stuck at 78% complete with the network operator not updated.

# ./oc get co|grep rc.4
dns              4.4.0-rc.4   True   False   False   100m
machine-config   4.4.0-rc.4   True   False   False   36m
network          4.4.0-rc.4   True   False   False   101m

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.0-rc.6: the cluster operator network has not yet successfully rolled out
...
# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.4.0-rc.6: 78% complete

CVO logs:
...
E0410 02:48:22.295838 1 task.go:81] error running apply for clusteroperator "network" (457 of 580): Cluster operator network is still updating
I0410 02:48:22.295911 1 task_graph.go:568] Canceled worker 3
I0410 02:48:22.296001 1 task_graph.go:588] Workers finished
I0410 02:48:22.296022 1 task_graph.go:516] No more reachable nodes in graph, continue
I0410 02:48:22.296025 1 task_graph.go:596] Result of work: [Cluster operator network is still updating]
I0410 02:48:22.296035 1 task_graph.go:552] No more work
I0410 02:48:22.296044 1 sync_worker.go:783] Summarizing 1 errors
I0410 02:48:22.296052 1 sync_worker.go:787] Update error 457 of 580: ClusterOperatorNotAvailable Cluster operator network is still updating (*errors.errorString: cluster operator network is still updating)
E0410 02:48:22.296079 1 sync_worker.go:329] unable to synchronize image (waiting 43.131425612s): Cluster operator network is still updating

# ./oc get co network -o json|jq .status.conditions
[
  {
    "lastTransitionTime": "2020-04-10T01:29:40Z",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-04-10T01:29:40Z",
    "status": "True",
    "type": "Upgradeable"
  },
  {
    "lastTransitionTime": "2020-04-10T01:37:26Z",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-04-10T01:32:56Z",
    "status": "True",
    "type": "Available"
  }
]

# ./oc get clusterversion version -o json|jq .status.conditions[-1]
{
  "lastTransitionTime": "2020-04-10T02:20:47Z",
  "message": "Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.",
  "reason": "ClusterVersionOverridesSet",
  "status": "False",
  "type": "Upgradeable"
}

========================================================================

After removing the above overrides from ClusterVersion, the upgrade continues:

# ./oc get co|grep rc.4
machine-config   4.4.0-rc.4   True   False   False   99m
# ./oc get co network
NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.4.0-rc.6   True        False         False      164m
# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.4.0-rc.6: 83% complete

Version-Release number of the following components:
4.4.0-rc.4 to 4.4.0-rc.6

How reproducible:
Always

Steps to Reproduce:
1. oc patch the ClusterVersion to override network-operator (a plausible patch invocation is sketched after this report):
# ./oc get clusterversion version -o json|jq .spec.overrides
[
  {
    "group": "apps/v1",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]
2. Change the channel to candidate-4.4 and upgrade from 4.4.0-rc.4 to 4.4.0-rc.6:
# ./oc adm upgrade --to 4.4.0-rc.6
Updating to 4.4.0-rc.6

Actual results:
The upgrade is stuck on the network operator.

Expected results:
The upgrade should succeed.

Additional info:
The same upgrade from rc.4 to rc.6 succeeds when Upgradeable=False is not set, so I am assigning this bug to the CVO for initial debugging even though it is stuck at the network operator.
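Step 1 above does not show the patch command itself. A plausible invocation that produces the overrides shown (hypothetical, mirroring the exact values from the report, including its "apps/v1" group) would be:

# oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"group": "apps/v1", "kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true}]}]'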
This is definitely a hole in the Upgradeable precondition. Setting objects unmanaged in ClusterVersion should block even z-stream updates, because the network operator is never going to update if the CVO is not bumping its deployment.
Moving to 4.3.z. Ideally the CVO's precondition would catch this, but in the meantime, cluster admins who configure CVO overrides will mostly get stuck mid-update while the CVO waits for a ClusterOperator whose expected (but overridden) manifest never gets updated. That's not great, but it is also unlikely to result in terrible cluster degradation, so this should not be a 4.4.0 blocker.
First patch in this series will target 4.5, after which we will backport as far as necessary (at least as far as 4.3.z).
Looking at the current state of [1], I think we're not all on the same page about where we're headed. Here's my current understanding:

* The CVO syncs ClusterOperator Upgradeable=False conditions into its own ClusterVersion Upgradeable=False.
* The CVO also monitors ClusterVersion spec.overrides and sets Upgradeable=False with reason ClusterVersionOverridesSet if the admin puts anything significant in there.
* Upgradeable=False (from any source) should continue to block minor updates, because operators use this for things like "in the next minor we will stomp you; fix yourself first", while still allowing patch updates for minor bugfixing and CVEs.
* Upgradeable=False with ClusterVersionOverridesSet as one contributor should also block patch updates, because the CVO ignoring the overridden manifest (e.g. a cluster operator deployment) will likely mean that any update attempt hangs (e.g. when the CVO starts waiting for the associated ClusterOperator to level). This distinction is sketched in code below.

Does that make sense?

[1]: https://github.com/openshift/cluster-version-operator/pull/364
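To make the minor-vs-patch distinction concrete, here is a minimal sketch of the blocking decision described in the list above. This is hypothetical illustration code, not the actual cluster-version-operator implementation; the condition type, the ClusterVersionOverridesSet reason, and the blang/semver library mirror what appears in this bug, but the function and types are invented for clarity.

// Hypothetical sketch of the precondition policy described above -- not the
// actual cluster-version-operator code.
package main

import (
	"fmt"

	"github.com/blang/semver"
)

// condition is a pared-down stand-in for a ClusterVersion status condition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// blocksUpdate reports whether an Upgradeable=False condition should block an
// update from current to target: overrides block every update, while other
// Upgradeable=False reasons block only minor (and major) version bumps.
func blocksUpdate(cond condition, current, target semver.Version) bool {
	if cond.Type != "Upgradeable" || cond.Status != "False" {
		return false
	}
	// Overridden manifests are skipped by the CVO, so even a patch-level
	// update is likely to hang waiting on the associated ClusterOperator.
	if cond.Reason == "ClusterVersionOverridesSet" {
		return true
	}
	// Otherwise allow patch-level updates (bugfixes, CVEs) and block only
	// minor and major bumps.
	return target.Major > current.Major ||
		(target.Major == current.Major && target.Minor > current.Minor)
}

func main() {
	overridden := condition{Type: "Upgradeable", Status: "False", Reason: "ClusterVersionOverridesSet"}
	current := semver.MustParse("4.4.0-rc.4")
	target := semver.MustParse("4.4.0-rc.6")
	// Prints true: overrides block even this z-stream (same-minor) update.
	fmt.Println(blocksUpdate(overridden, current, target))
}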
(In reply to W. Trevor King from comment #7)

Yes, it does. Let me take another look. I kind of figured that if it were as easy to fix as my change, you or someone else would've already fixed it :).
Adding the UpcomingSprint keyword, since I'm not sure I'll have time to circle back and complete this by the weekend.
Adding the UpcomingSprint keyword. This bug fix needs additional thought due to another bug.
Verified on 4.6.0-0.nightly-2020-08-02-091622.

1. oc patch the ClusterVersion to override network-operator; the Upgradeable=False condition is set:

# oc get clusterversion -o json|jq -r '.items[0].status.conditions[-1]'
{
  "lastTransitionTime": "2020-08-03T02:12:33Z",
  "message": "Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.",
  "reason": "ClusterVersionOverridesSet",
  "status": "False",
  "type": "Upgradeable"
}

2. Change the upstream and attempt an upgrade from 4.6.0-0.nightly-2020-08-02-044648 to 4.6.0-0.nightly-2020-08-02-091622:

# ./oc adm upgrade --to 4.6.0-0.nightly-2020-08-02-091622
Updating to 4.6.0-0.nightly-2020-08-02-091622

The upgrade does not actually start (as expected):

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-02-044648   True        True          50s     Unable to apply 4.6.0-0.nightly-2020-08-02-091622: it may not be safe to apply this update
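For completeness: to let a blocked update proceed after verification, the overrides have to be removed first, as in the original report. A plausible invocation (hypothetical, not shown in the report, and assuming the override above is the only entry in spec.overrides):

# oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'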
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196