Description of problem:

Upgrade from 4.4.0-rc.4 to 4.4.0-rc.6 with the Upgradeable=False condition set by overriding a cluster operator in ClusterVersion. The upgrade starts, but gets stuck at 78% complete with the network operator not updated.

# ./oc get co|grep rc.4
dns              4.4.0-rc.4   True   False   False   100m
machine-config   4.4.0-rc.4   True   False   False   36m
network          4.4.0-rc.4   True   False   False   101m

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.0-rc.6: the cluster operator network has not yet successfully rolled out
...
# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.4.0-rc.6: 78% complete

CVO logs:
...
E0410 02:48:22.295838 1 task.go:81] error running apply for clusteroperator "network" (457 of 580): Cluster operator network is still updating
I0410 02:48:22.295911 1 task_graph.go:568] Canceled worker 3
I0410 02:48:22.296001 1 task_graph.go:588] Workers finished
I0410 02:48:22.296022 1 task_graph.go:516] No more reachable nodes in graph, continue
I0410 02:48:22.296025 1 task_graph.go:596] Result of work: [Cluster operator network is still updating]
I0410 02:48:22.296035 1 task_graph.go:552] No more work
I0410 02:48:22.296044 1 sync_worker.go:783] Summarizing 1 errors
I0410 02:48:22.296052 1 sync_worker.go:787] Update error 457 of 580: ClusterOperatorNotAvailable Cluster operator network is still updating (*errors.errorString: cluster operator network is still updating)
E0410 02:48:22.296079 1 sync_worker.go:329] unable to synchronize image (waiting 43.131425612s): Cluster operator network is still updating

# ./oc get co network -o json|jq .status.conditions
[
  {
    "lastTransitionTime": "2020-04-10T01:29:40Z",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2020-04-10T01:29:40Z",
    "status": "True",
    "type": "Upgradeable"
  },
  {
    "lastTransitionTime": "2020-04-10T01:37:26Z",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2020-04-10T01:32:56Z",
    "status": "True",
    "type": "Available"
  }
]

# ./oc get clusterversion version -o json|jq .status.conditions[-1]
{
  "lastTransitionTime": "2020-04-10T02:20:47Z",
  "message": "Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.",
  "reason": "ClusterVersionOverridesSet",
  "status": "False",
  "type": "Upgradeable"
}

========================================================================

After removing the above overrides from ClusterVersion, the upgrade continues:

# ./oc get co|grep rc.4
machine-config   4.4.0-rc.4   True   False   False   99m
# ./oc get co network
NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.4.0-rc.6   True        False         False      164m
# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.4.0-rc.6: 83% complete

Version-Release number of the following components:
4.4.0-rc.4 to 4.4.0-rc.6

How reproducible:
Always

Steps to Reproduce:
1. oc patch the ClusterVersion to override network-operator (a plausible patch invocation is sketched after this report):
# ./oc get clusterversion version -o json|jq .spec.overrides
[
  {
    "group": "apps/v1",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]
2. Change the channel to candidate-4.4 and upgrade from 4.4.0-rc.4 to 4.4.0-rc.6:
# ./oc adm upgrade --to 4.4.0-rc.6
Updating to 4.4.0-rc.6

Actual results:
The upgrade is stuck on the network operator.

Expected results:
The upgrade should succeed.

Additional info:
The same upgrade from rc.4 to rc.6 succeeds when Upgradeable=False is not set, so I am assigning this bug to the CVO for initial debugging even though it is stuck at the network operator.
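Step 1 above does not show the patch command itself. A plausible invocation that produces the overrides shown (hypothetical, mirroring the exact values from the report, including its "apps/v1" group) would be:

# oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"group": "apps/v1", "kind": "Deployment", "name": "network-operator", "namespace": "openshift-network-operator", "unmanaged": true}]}]'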
This is definitely a hole in the Upgradeable precondition. Setting objects unmanaged in ClusterVersion should block even z-stream updates, because the network operator is never going to update if the CVO is not bumping its deployment.
Moving to 4.3.z. Ideally the CVO's precondition would catch this, but in the meantime, cluster admins who configure CVO overrides will mostly get stuck mid-update while the CVO waits for a ClusterOperator whose expected (but overridden) manifest never gets updated. That's not great, but it is also unlikely to result in terrible cluster degradation, so this should not be a 4.4.0 blocker.
First patch in this series will target 4.5, after which we will backport as far as necessary (at least as far as 4.3.z).
Looking at the current state of [1], I think we're not all on the same page about where we're headed. Here's my current understanding:

* The CVO syncs ClusterOperator Upgradeable=False conditions into its own ClusterVersion Upgradeable=False.
* The CVO also monitors ClusterVersion spec.overrides and sets Upgradeable=False with reason ClusterVersionOverridesSet if the admin puts anything significant in there.
* Upgradeable=False (from any source) should continue to block minor updates, because operators use this for things like "in the next minor we will stomp you; fix yourself first", while still allowing patch updates for minor bugfixing and CVEs.
* Upgradeable=False with ClusterVersionOverridesSet as one contributor should also block patch updates, because the CVO ignoring the overridden manifest (e.g. a cluster operator deployment) will likely mean that any update attempt hangs (e.g. when the CVO starts waiting for the associated ClusterOperator to level). This distinction is sketched in code below.

Does that make sense?

[1]: https://github.com/openshift/cluster-version-operator/pull/364
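To make the minor-vs-patch distinction concrete, here is a minimal sketch of the blocking decision described in the list above. This is hypothetical illustration code, not the actual cluster-version-operator implementation; the condition type, the ClusterVersionOverridesSet reason, and the blang/semver library mirror what appears in this bug, but the function and types are invented for clarity.

// Hypothetical sketch of the precondition policy described above -- not the
// actual cluster-version-operator code.
package main

import (
	"fmt"

	"github.com/blang/semver"
)

// condition is a pared-down stand-in for a ClusterVersion status condition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// blocksUpdate reports whether an Upgradeable=False condition should block an
// update from current to target: overrides block every update, while other
// Upgradeable=False reasons block only minor (and major) version bumps.
func blocksUpdate(cond condition, current, target semver.Version) bool {
	if cond.Type != "Upgradeable" || cond.Status != "False" {
		return false
	}
	// Overridden manifests are skipped by the CVO, so even a patch-level
	// update is likely to hang waiting on the associated ClusterOperator.
	if cond.Reason == "ClusterVersionOverridesSet" {
		return true
	}
	// Otherwise allow patch-level updates (bugfixes, CVEs) and block only
	// minor and major bumps.
	return target.Major > current.Major ||
		(target.Major == current.Major && target.Minor > current.Minor)
}

func main() {
	overridden := condition{Type: "Upgradeable", Status: "False", Reason: "ClusterVersionOverridesSet"}
	current := semver.MustParse("4.4.0-rc.4")
	target := semver.MustParse("4.4.0-rc.6")
	// Prints true: overrides block even this z-stream (same-minor) update.
	fmt.Println(blocksUpdate(overridden, current, target))
}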
(In reply to W. Trevor King from comment #7)

Yes, it does. Let me take another look. I kind of figured that if it were as easy to fix as my change, you or someone else would've already fixed it :).
Adding the UpcomingSprint keyword, since I'm not sure I'll have time to circle back and complete this by the weekend.
Adding the UpcomingSprint keyword. This bug fix needs additional thought due to another bug.
Verified on 4.6.0-0.nightly-2020-08-02-091622.

1. oc patch the ClusterVersion to override network-operator; the Upgradeable=False condition is set:

# oc get clusterversion -o json|jq -r '.items[0].status.conditions[-1]'
{
  "lastTransitionTime": "2020-08-03T02:12:33Z",
  "message": "Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.",
  "reason": "ClusterVersionOverridesSet",
  "status": "False",
  "type": "Upgradeable"
}

2. Change the upstream and attempt an upgrade from 4.6.0-0.nightly-2020-08-02-044648 to 4.6.0-0.nightly-2020-08-02-091622:

# ./oc adm upgrade --to 4.6.0-0.nightly-2020-08-02-091622
Updating to 4.6.0-0.nightly-2020-08-02-091622

The upgrade does not actually start (as expected):

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-02-044648   True        True          50s     Unable to apply 4.6.0-0.nightly-2020-08-02-091622: it may not be safe to apply this update
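For completeness: to let a blocked update proceed after verification, the overrides have to be removed first, as in the original report. A plausible invocation (hypothetical, not shown in the report, and assuming the override above is the only entry in spec.overrides):

# oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'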
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196