Bug 1822513 - Upgrades are getting blocked during 4.y.z-4.y.(z+1) upgrade with "oc adm upgrade --to-image" command when CVO has upgradeable=false
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jack Ottofaro
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-09 08:44 UTC by liujia
Modified: 2020-10-27 15:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: currentMinor is always being pulled from cv.Status.History[0].Version, which contains the version being upgraded to and not the current version.
Consequence: When --to-image is used, cv.Status.History[0].Version = "", which then fails the check for a z-level upgrade.
Fix: Iterate the version history to find and use the first version with State == configv1.CompletedUpdate, which will yield the current version, and pull currentMinor from it.
Result: z-level upgrades using the "oc adm upgrade --to-image" command are allowed even when CVO has upgradeable=false.
Clone Of:
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 394 | 0 | None | closed | Bug 1822513: Determine current version by checking for status completed | 2021-01-18 02:28:04 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 15:58:14 UTC

Description liujia 2020-04-09 08:44:02 UTC
Description of problem:
Ran an upgrade test from 4.4.0-rc.6 to 4.4.0-0.nightly-2020-04-04-025830 with the upgradeable=false condition set. The upgrade cannot start because a precondition check fails.

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply registry.svc.ci.openshift.org/ocp/release@sha256:5e727bba8407a963fb2bdd95aaa2e2ba6aa63bc58da1f7e69ea28c3f43b90dea: it may not be safe to apply this update

E0409 04:04:55.478434       1 precondition.go:59] Precondition "ClusterVersionUpgradeable" failed: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

Version-Release number of the following components:
4.4.0-0.nightly-2020-04-04-025830

How reproducible:
always

Steps to Reproduce:
1. Install a 4.4.0-rc.6 cluster.
2. Use oc patch on clusterversion to override network-operator.
3. Upgrade from 4.4.0-rc.6 to 4.4.0-0.nightly-2020-04-04-025830:
./oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:5e727bba8407a963fb2bdd95aaa2e2ba6aa63bc58da1f7e69ea28c3f43b90dea --allow-explicit-upgrade=true

Actual results:
The upgrade fails due to the precondition check.

Expected results:
The upgrade succeeds.

Additional info:
Hit this issue during regression testing of the change in https://bugzilla.redhat.com/show_bug.cgi?id=1797624. After talking with the developer, it appears to be related to the "--to-image" name extraction he is working on. Tried another way, upgrading with "--to", which avoids the "--to-image" issue:

1. Use oc patch on clusterversion to override network-operator.
2. Change the channel to candidate-4.4 and upgrade from 4.4.0-rc.4 to 4.4.0-rc.6.
# ./oc adm upgrade --to 4.4.0-rc.6
Updating to 4.4.0-rc.6
3. The upgrade succeeds.

Comment 1 liujia 2020-04-09 10:43:04 UTC
>Additional info:
>Hit this issue during regression testing of the change in https://bugzilla.redhat.com/show_bug.cgi?id=1797624. After talking with the developer, it appears to be related to the "--to-image" name extraction he is working on. Tried another way, upgrading with "--to", which avoids the "--to-image" issue:

>1. Use oc patch on clusterversion to override network-operator.
>2. Change the channel to candidate-4.4 and upgrade from 4.4.0-rc.4 to 4.4.0-rc.6.
# ./oc adm upgrade --to 4.4.0-rc.6
Updating to 4.4.0-rc.6
>3. The upgrade succeeds.

Updating the result: the upgrade can start (unlike with --to-image), but it seems stuck at 78% complete (for more than 2 hours). I think that is a separate issue.
#./oc adm upgrade
info: An upgrade is in progress. Working towards 4.4.0-rc.6: 78% complete

# ./oc get co|grep rc.4
dns                                        4.4.0-rc.4   True        False         False      5h26m
machine-config                             4.4.0-rc.4   True        False         False      5h22m
network                                    4.4.0-rc.4   True        False         False      5h27m

E0409 10:20:58.526070       1 task.go:81] error running apply for clusteroperator "network" (457 of 580): Cluster operator network is still updating
I0409 10:20:58.526151       1 task_graph.go:568] Canceled worker 13
I0409 10:20:58.526192       1 task_graph.go:588] Workers finished
I0409 10:20:58.526233       1 task_graph.go:516] No more reachable nodes in graph, continue
I0409 10:20:58.526256       1 task_graph.go:552] No more work
I0409 10:20:58.526271       1 task_graph.go:596] Result of work: [Cluster operator network is still updating]
I0409 10:20:58.526289       1 sync_worker.go:783] Summarizing 1 errors
I0409 10:20:58.526297       1 sync_worker.go:787] Update error 457 of 580: ClusterOperatorNotAvailable Cluster operator network is still updating (*errors.errorString: cluster operator network is still updating)
E0409 10:20:58.526324       1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Cluster operator network is still updating


# ./oc get clusterversion version -o json|jq .status.conditions[-1]
{
  "lastTransitionTime": "2020-04-09T05:58:13Z",
  "message": "Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.",
  "reason": "ClusterVersionOverridesSet",
  "status": "False",
  "type": "Upgradeable"
}

# ./oc get clusterversion version -o json|jq .spec.overrides
[
  {
    "group": "apps/v1",
    "kind": "Deployment",
    "name": "network-operator",
    "namespace": "openshift-network-operator",
    "unmanaged": true
  }
]

@king, I think it doesn't work as expected even with --to. I'm attaching the result here first; if needed, we can file a new bug to track it, since the original bz1797624 is already verified.

Comment 2 W. Trevor King 2020-04-09 20:19:10 UTC
> Update the result, the upgrade can start(not the same with --to-image)...

So that means we were fine to mark bug 1797624 VERIFIED; the CVO has no problems with the --to bump.  And we need this bug about getting --to-image working.

> ...but seems stuck at 78% complete(more than 2 hrs). I think it's another issue.

Yeah, that seems like a separate issue.  Can you post the network ClusterOperator?  It might be that they are not actually on board with allowing z-stream updates when they set Upgradeable=False.

Comment 3 W. Trevor King 2020-04-09 20:19:46 UTC
Possibly also the logs of the network operator pod.

Comment 4 liujia 2020-04-10 04:20:02 UTC
(In reply to W. Trevor King from comment #2) 
> > ...but seems stuck at 78% complete(more than 2 hrs). I think it's another issue.
> 
> Yeah, that seems like a separate issue.  Can you post the network
> ClusterOperator?  It might be that they are not actually on board with
> allowing z-stream updates when they set Upgradeable=False.

Sure, I gave it another try to collect more logs and filed a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1822844, to track this issue separately. Let's track only the '--to-image' issue here.

Comment 6 W. Trevor King 2020-04-13 23:59:51 UTC
Deferring to 4.5.  I'm agnostic about whether we backport this once we have a fix.

Comment 7 Scott Dodson 2020-04-21 17:21:09 UTC
--to-image is not a common customer use case, lowering priority on this one.

Comment 11 Jack Ottofaro 2020-06-26 19:20:59 UTC
Changing the reproducer for this bug. With the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1822844, the expected behaviour will be to reject a z-level upgrade if overrides are set.

However the following will reproduce this bug's issue:

Steps to Reproduce:
1. Install a 4.4.0-rc.6 cluster.
2. ./oc patch featuregate cluster --type json -p '[{"op": "add", "path": "/spec/featureSet", "value": "TechPreviewNoUpgrade"}]'
featuregate.config.openshift.io/cluster patched
3. Upgrade from 4.4.0-rc.6 to 4.4.0-rc.7:
./oc adm upgrade --allow-explicit-upgrade=true --to-image quay.io/openshift-release-dev/ocp-release@sha256:2532227a868fca11a0cb7563232a26ab9a682d8ee1bb72fd416c4e7789d7ce11

Actual results:
The upgrade fails due to the precondition check.

CVO log:
E0626 18:40:35.598906       1 precondition.go:59] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates
E0626 18:40:35.598965       1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Precondition "ClusterVersionUpgradeable" failed because of "FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade": Cluster operator kube-apiserver cannot be upgraded: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates

Expected results:
The upgrade succeeds.

Additional info:

With "./oc adm upgrade --to 4.4.0-rc.7", the upgrade succeeds.

Comment 12 Jack Ottofaro 2020-07-06 16:48:44 UTC
With the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1822844 deployed, preconditions such as ClusterVersion overrides should block all upgrades, including z-level upgrades. However, other preconditions should not.

Currently, when an upgrade is performed using `oc adm upgrade` with the `--to-image` option, the CVO thread that updates the version history can add a new cluster version history entry before the thread loading the upgrade version has set the version field, resulting in an empty field. This field is used to extract the upgrade target's minor version number, which is compared with the current version's minor version number to check whether this is a z-level upgrade and therefore whether the precondition can be bypassed.
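
The z-level check described above can be sketched roughly as follows. This is an illustration only: `minorOf` and `isZLevel` are hypothetical names, not the CVO's actual functions, and the comparison looks only at the minor component (it assumes both versions share the same major). It shows why an empty version string on either side keeps the upgrade from being treated as z-level.

```go
package main

import (
	"fmt"
	"strings"
)

// minorOf returns the minor component of a "major.minor.patch" style
// version string, or "" when there is no minor component (for example
// when the string is empty). Hypothetical helper, for illustration only.
func minorOf(version string) string {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

// isZLevel reports whether current and target share the same minor
// version. An unknown minor on either side is treated as "not z-level",
// so the precondition would not be bypassed.
func isZLevel(current, target string) bool {
	cur, tgt := minorOf(current), minorOf(target)
	return cur != "" && tgt != "" && cur == tgt
}

func main() {
	fmt.Println(isZLevel("4.4.0-rc.6", "4.4.0-rc.7")) // same minor
	fmt.Println(isZLevel("4.4.0-rc.6", "4.5.0"))      // different minor
	fmt.Println(isZLevel("", "4.4.0-rc.7"))           // empty version field
}
```

Only the first call reports a z-level upgrade; the empty version in the last call is what blocks the `--to-image` case.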

Comment 13 Jack Ottofaro 2020-07-09 14:18:23 UTC
Adding UpcomingSprint keyword. Bug is still in work.

Comment 14 Jack Ottofaro 2020-07-13 19:43:08 UTC
Disregard my comment https://bugzilla.redhat.com/show_bug.cgi?id=1822513#c12 as it is not completely accurate.

The real issue is that currentMinor is always being pulled from cv.Status.History[0].Version (https://github.com/openshift/cluster-version-operator/blob/40ec7e4f90b9fa0992145b926bd5f5bf6bd973a3/pkg/payload/precondition/clusterversion/upgradeable.go#L65) which contains the version being upgraded to and not the current version. In this bug's specific case, when --to-image is used, cv.Status.History[0].Version = "" which then fails the check for a z-level upgrade. Instead we should iterate the version history to find and use the first version with State == configv1.CompletedUpdate, which will yield the current version, and pull currentMinor from it.
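
A minimal sketch of the approach proposed above, not the actual patch. `UpdateState` and `UpdateHistory` here are simplified local stand-ins for the configv1 types, `currentVersion` is a hypothetical helper name, and the history is assumed to be ordered newest first.

```go
package main

import "fmt"

// UpdateState and UpdateHistory are simplified stand-ins for the
// configv1 types; only the fields needed for this sketch are kept.
type UpdateState string

const (
	CompletedUpdate UpdateState = "Completed"
	PartialUpdate   UpdateState = "Partial"
)

type UpdateHistory struct {
	State   UpdateState
	Version string
}

// currentVersion walks the history (assumed newest first) and returns the
// version of the first entry whose update completed. The boolean is false
// when no completed entry exists.
func currentVersion(history []UpdateHistory) (string, bool) {
	for _, h := range history {
		if h.State == CompletedUpdate {
			return h.Version, true
		}
	}
	return "", false
}

func main() {
	// With --to-image, the newest entry is still in progress and its
	// Version may be empty; the completed entry holds the current version.
	history := []UpdateHistory{
		{State: PartialUpdate, Version: ""},
		{State: CompletedUpdate, Version: "4.4.0-rc.6"},
	}
	v, ok := currentVersion(history)
	fmt.Printf("history[0].Version=%q current=%q found=%t\n", history[0].Version, v, ok)
}
```

The minor version would then be taken from the returned string rather than from `history[0].Version`.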

Comment 17 liujia 2020-08-03 05:39:38 UTC
Version: 4.6.0-0.nightly-2020-08-02-091622

1. Install a 4.6.0-0.nightly-2020-08-02-044648 cluster.
2. # ./oc patch featuregate cluster --type json -p '[{"op": "add", "path": "/spec/featureSet", "value": "TechPreviewNoUpgrade"}]'
featuregate.config.openshift.io/cluster patched

# ./oc get clusterversion -o json|jq -r '.items[0].status.conditions[-1]'
{
  "lastTransitionTime": "2020-08-03T03:55:52Z",
  "message": "Cluster operator kube-apiserver cannot be upgraded between minor versions: FeatureGatesUpgradeable: \"TechPreviewNoUpgrade\" does not allow updates",
  "reason": "FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade",
  "status": "False",
  "type": "Upgradeable"
}
3. Upgrade from 4.6.0-0.nightly-2020-08-02-044648 to 4.6.0-0.nightly-2020-08-02-091622 with the --to-image option:
# ./oc adm upgrade --to-image registry.svc.ci.openshift.org/ocp/release@sha256:a0cd5e461757e8c0d0f4e6563ffd716dca90e8ed2956bd6b1405223e74da057c --allow-explicit-upgrade
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:a0cd5e461757e8c0d0f4e6563ffd716dca90e8ed2956bd6b1405223e74da057c

The upgrade succeeds.
# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-02-091622   True        False         7m9s    Cluster version is 4.6.0-0.nightly-2020-08-02-091622

Comment 19 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

