Description of problem:

When running a platform upgrade on a target cluster by using the ClusterVersion.yaml source CR, the spoke's cluster-version operator fails to start the upgrade if the channel and the version are changed at the same time. However, the ACM policy shows as compliant on the hub, because the ClusterVersion object on the spoke cluster is patched successfully (desired state equals current state).

Version-Release number of selected component (if applicable): 4.10

How reproducible: Always

Steps to Reproduce:
1. Create a PGT that uses the ClusterVersion.yaml source CR to modify the version of one or more target clusters. An example can be seen here: https://gitlab.cee.redhat.com/sysdeseng/5g-ericsson/-/blob/master/demos/ztp-policygen/site-policies/group-policies/platform-upgrade-sno.yaml#L12
2. Change both the channel and the version of the ClusterVersion object, so that the target cluster starts a platform upgrade to a version that is only available in another channel.
3. Apply the policy to the hub cluster. Note that you probably need to set the remediationAction to enforce.
4. Verify that the desired configuration has been applied to the target clusters.

Actual results:

After a while the policy is compliant and the spoke's ClusterVersion is configured as desired, but the upgrade process does not start due to an error:

  - lastTransitionTime: "2022-01-24T10:21:21Z"
    message: 'The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.9.15": when image is empty the update must be a previous version or an available update'
    reason: InvalidClusterVersion
    status: "True"
    type: Invalid

Every 2.0s: oc get clusterversion,node,co                    katatonic: Mon Jan 24 11:27:14 2022

NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.9.13    True        False         3d18h   Stopped at 4.9.13: the cluster version is invalid

Expected results:

The upgrade starts successfully.

Additional info:

In the error above, I tried to upgrade from 4.9.13 in the stable channel to 4.9.15 in the fast channel. I restarted the managed cluster's cluster-version operator, but the same error kept showing and the upgrade process never started.
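For reference, since the linked repository is internal, here is a minimal sketch of the kind of PGT used in step 1 that reproduces the issue. The name, namespace and binding rule are placeholders; the structure mirrors the PGTs shown in the comments below, and the channel/version match the upgrade attempted in this report:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "cnfdc8-platform-upgrade"   # placeholder name
  namespace: "ztp-cnfdc8-policies"  # placeholder namespace
spec:
  bindingRules:
    name: "cnfdc8"                  # placeholder cluster selector
  mcp: "master"
  remediationAction: enforce
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
      spec:
        channel: "fast-4.9"         # channel change and...
        desiredUpdate:
          version: "4.9.15"         # ...version change in the same policy, which triggers the issue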
I have been testing this morning with TALO/CGU. The idea is to upgrade from 4.9.19 in the fast channel to 4.9.21 in the candidate channel; note that 4.9.21 is not available in the fast channel.

The first approach was to create a PGT with two ClusterVersion entries grouped under a single policyName, and use ztp-deploy-wave annotations to tell TALO to patch the channel first and the version afterwards. Below, the example PGT:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "cnfdc8-day2"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: enforce
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "90"
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "95"
      spec:
        desiredUpdate:
          version: "4.9.21"

However, this just threw an error in ArgoCD; the kustomize plugin complained:

rpc error: code = Unknown desc = `kustomize build /tmp/https___gitlab.cee.redhat.com_sysdeseng_5g-ericsson/demos/ztp-policygen/site-policies --enable-alpha-plugins` failed exit status 1: 2022/02/11 10:06:32 Could not build the entire policy defined by /tmp/kust-plugin-config-774393509: ran.openshift.io/ztp-deploy-wave annotation in Resource ClusterVersion.yaml (wave 95) doesn't match with Policy cnfdc8-day2-platform-upgrade (wave 90)
Error: failure in plugin configured via /tmp/kust-plugin-config-774393509; exit status 1: exit status 1

The second approach was to create a PGT with two ungrouped, i.e. independent, policies: one called channel and the other called version. See the example:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"

Once the policies are applied in inform mode, I can create the following CGU CR to enforce them. The point is that in the CGU I can select which policy is applied first:

$ cat clustergroupupgrades-cnfdc8.yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240

Once the policies were being enforced, I realized that they were colliding. The first one (the channel policy) tries to set the following configuration in the spoke's ClusterVersion object:

oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "$version"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}

Note the version: $version. I had assumed that not setting this value would leave the version field unpatched by the source CR, but the literal placeholder is applied instead.
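The literal $version above suggests that the ClusterVersion.yaml source CR in ztp-site-generate carries placeholder tokens that the PGT overlay is expected to replace. Below is a sketch of what that source CR presumably contains, inferred from the patched spoke object above rather than copied from the actual source file:

apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: "$channel"       # assumed placeholder, replaced by the PGT overlay
  upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  desiredUpdate:
    version: "$version"     # pushed as a literal when the overlay does not set a version

If that is the case, any policy built from this source CR that does not override desiredUpdate.version ends up pushing the literal "$version" to the spoke, which matches the invalid ClusterVersion shown next.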
Also, notice that this literal $version causes the cluster to report an invalid ClusterVersion:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        False         57m     Stopped at 4.9.19: the cluster version is invalid

Once the channel policy is compliant, the second one (the version policy) is enforced:

oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "4.9"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}

However, this configuration only lasts a few seconds because it is quickly reverted to the previous one. The upgrade process is never triggered and the cluster stays in the "cluster version is invalid" status forever. In fact, in the ACM UI I can see the policies flipping between compliant (green) and non-compliant (red) all the time, since they are configuring the same object.

The third approach was to create two independent policies as in the previous approach, but this time each one is enforced by a different CGU. Here is the channel CGU, which enforces the channel policy:

clustergroupupgrades-cnfdc8-channel.yaml:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  actions:
    afterCompletion:
      deleteObjects: true
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240

And the CGU that enforces the version upgrade:

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: version-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240

In this case there are three scenarios:

* Both CGUs executed at the same time. The result is the same as with the second approach: two policies in enforce mode modify the same object, so they constantly flip between compliant and non-compliant.

* Executed sequentially and manually (for now). Once the channel CGU is compliant it is removed (automatically), and then the second CGU (version) is created. The result is somewhat unexpected: the ClusterVersion stays in an error status even though the ClusterVersion object is correct. I think this is because the channel policy does not set a version, so the object is patched with version=$version and never recovers. This is the error:

    Type:                  RetrievedUpdates
    Last Transition Time:  2022-02-10T21:18:25Z
    Message:               The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.9.19": when image is empty the update must be a previous version or an available update
    Reason:                InvalidClusterVersion
    Status:                True
    Type:                  Invalid

* Executed sequentially and manually (for now), but this time, instead of leaving the version field empty in the channel policy, it is set to the current version (4.9.19).
See the example PGT:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
        desiredUpdate:
          version: "4.9.19"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"

I think this avoids the problem with version: $version. In this case the channel CGU can be applied, and once it completes it removes the enforced policy it created:

2022-02-11T11:44:59.481Z  INFO  controllers.ClusterGroupUpgrade  [Reconcile]  {"CR": "fast-4921"}
2022-02-11T11:44:59.482Z  INFO  controllers.ClusterGroupUpgrade  [getClusterBySelectors]  {"clustersBySelector": []}
2022-02-11T11:44:59.482Z  INFO  controllers.ClusterGroupUpgrade  [getClustersBySelectors]  {"clusterNames": ["cnfdc8"]}
2022-02-11T11:44:59.482Z  INFO  controllers.ClusterGroupUpgrade  Upgrade is completed

Next, it is time to create the second CGU, the one called version. After that, the upgrade finally starts:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        True          9s      Working towards 4.9.21: 9 of 738 done (1% complete)

So, I did not find a way to upgrade both channel and version automatically in a single step.
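As a point of comparison, the working two-step sequence corresponds to what one would do manually on the spoke with a sufficiently recent oc client. This is just the non-GitOps equivalent for illustration, not part of the ZTP flow:

# 1. Switch the channel first so the CVO can retrieve the update graph for the new channel.
oc adm upgrade channel candidate-4.9

# 2. Only then request the new version, which by now is listed as an available update.
oc adm upgrade --to=4.9.21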
This is the CVO issue https://bugzilla.redhat.com/show_bug.cgi?id=1990635, which was fixed in CVO 4.10: https://github.com/openshift/cluster-version-operator/pull/669. For the OCP upgrade from 4.9 to 4.10, we can work around it by using two policies: first update upstream/channel, and then update desiredUpdate.version to trigger the OCP upgrade.
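With the two-policy workaround, a sanity check I would add between the two steps (my own suggestion, using standard ClusterVersion status fields) is to confirm on the spoke that the CVO has picked up the new channel and lists the target version before the version policy is enforced:

# Confirm the channel change has been applied.
oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'

# Confirm the CVO has retrieved the update graph for the new channel...
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")].status}{"\n"}'

# ...and that the target version shows up as an available update.
oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}{"\n"}'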
This BZ https://bugzilla.redhat.com/show_bug.cgi?id=2055314 has a PR posted to backport the CVO fix to 4.9. Once it is backported to 4.9, the two-step workaround is no longer needed.
Verified by using two separate PGTs and having policy1 applied first via TALO: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/6f0802ac1d6412867292fcfd9d676b478e2c8614/policygentemplates/upgrade.yaml#L24-L48
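Since the linked repository is internal, here is a minimal sketch of what such a two-PGT layout could look like. The names are placeholders; it simply splits the two sourceFiles entries from the single PGT shown earlier into separate PGTs, so that the channel policy can be enforced before the version policy:

apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade-channel"   # placeholder
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
        desiredUpdate:
          version: "4.9.19"          # current version, avoids the $version placeholder issue
---
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade-version"   # placeholder
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"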