Bug 2044339
| Summary: | ZTP: Platform upgrades do not work if modifying channel and version at the same time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alberto Losada <alosadag> |
| Component: | Telco Edge | Assignee: | Angie Wang <angwang> |
| Telco Edge sub component: | ZTP | QA Contact: | yliu1 |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | angwang, mcornea |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-26 16:43:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | Target Upstream Version: | |
Description
Alberto Losada
2022-01-24 12:16:17 UTC
I have been testing this morning with TALO/CGU. The idea is to upgrade from 4.9.19 on the fast channel to 4.9.21 on the candidate channel; notice that 4.9.21 is not available in the fast channel.
The first approach was to create a PGT with two policies grouped under a single policyName, then use waves to tell TALO to patch the channel first and the version afterward. Below is the example PGT:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "cnfdc8-day2"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: enforce
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "90"
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "95"
      spec:
        desiredUpdate:
          version: "4.9.21"
```
However, this threw an error in ArgoCD; the kustomize plugin complained:

```
rpc error: code = Unknown desc = `kustomize build /tmp/https___gitlab.cee.redhat.com_sysdeseng_5g-ericsson/demos/ztp-policygen/site-policies --enable-alpha-plugins` failed exit status 1: 2022/02/11 10:06:32 Could not build the entire policy defined by /tmp/kust-plugin-config-774393509: ran.openshift.io/ztp-deploy-wave annotation in Resource ClusterVersion.yaml (wave 95) doesn't match with Policy cnfdc8-day2-platform-upgrade (wave 90) Error: failure in plugin configured via /tmp/kust-plugin-config-774393509; exit status 1: exit status 1
```
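The error indicates that every source file grouped under one policyName must carry the same ztp-deploy-wave as the generated policy. A possible way to keep the different waves while still sequencing the two changes (sketch only; the policy names below are illustrative, not from the source) would be to split the entries into two policyNames, each with its own wave:

```yaml
sourceFiles:
  - fileName: ClusterVersion.yaml
    policyName: "platform-channel"   # hypothetical name
    metadata:
      name: version
      annotations:
        ran.openshift.io/ztp-deploy-wave: "90"
    spec:
      channel: "candidate-4.9"
  - fileName: ClusterVersion.yaml
    policyName: "platform-version"   # hypothetical name
    metadata:
      name: version
      annotations:
        ran.openshift.io/ztp-deploy-wave: "95"
    spec:
      desiredUpdate:
        version: "4.9.21"
```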
The second approach was to create a PGT with two ungrouped, i.e. independent, policies: one called channel and the other called version. See the example:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"
```
Once they are applied in inform mode, I can create the following CGU CR to enforce them. The point is that in the CGU I can select which policy is applied first:
$ cat clustergroupupgrades-cnfdc8.yaml

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
Once the policies were being enforced, I realized that they were colliding. The first one (the channel policy) tries to set the following configuration in the spoke's clusterversion object:
```
oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "$version"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}
```
Notice the version: $version. I assumed that NOT setting this variable would leave the version unpatched by the source CR. Also notice that this leaves the cluster with an invalid clusterversion:
```
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        False         57m     Stopped at 4.9.19: the cluster version is invalid
```
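For context, the literal $version comes from the ClusterVersion.yaml source CR shipped with the ZTP tooling, which (as I understand it; the exact content may differ between releases, so treat this as an assumption) uses placeholder variables that are substituted from the values in the PGT overlay, roughly:

```yaml
# Assumed shape of the ClusterVersion.yaml source CR; placeholders
# like $channel/$version are meant to be filled in by the PGT overlay.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: "$channel"
  upstream: $upstream
  desiredUpdate:
    version: $version
```

If that is the case, any placeholder the PGT does not override ends up rendered literally, which would explain why version: "$version" reaches the spoke.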
Once it is compliant, the second one (the version policy) is enforced:
```
oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "4.9"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}
```
However, the above configuration lasts only a few seconds before being reverted to the previous one. The upgrade process is never triggered and the cluster remains in the "cluster version is invalid" status indefinitely.
In fact, in the ACM UI I can see the policies flapping constantly between compliant (green) and non-compliant (red), since they are configuring the same object.
The third approach was to create two independent policies as in the previous approach, but this time each one is applied by a different CGU. Here is the channel CGU, which enforces the channel policy:
clustergroupupgrades-cnfdc8-channel.yaml:

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  actions:
    afterCompletion:
      deleteObjects: true
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
And the CGU that enforces the version upgrade:
```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: version-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
In this case there are three scenarios:

* Both CGUs executed at the same time. The result is the same as in the second approach: since two policies in enforce mode modify the same object, they flip constantly between compliant and non-compliant.
* CGUs executed sequentially and manually (so far). Once the channel CGU is compliant it is removed (automatically). Then we can create the second CGU (version). The result is a bit unexpected: the clusterversion stays in an error status even though the clusterversion object is correct. I think this is because, since we did not set a version in the channel policy, it automatically patched version=$version, and the cluster never recovered from it. This is the error:

  ```
  Type: RetrievedUpdates
  Last Transition Time: 2022-02-10T21:18:25Z
  Message: The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.9.19": when image is empty the update must be a previous version or an available update
  Reason: InvalidClusterVersion
  Status: True
  Type: Invalid
  ```

* CGUs executed sequentially and manually (so far), but in this case, instead of leaving the version field empty in the channel policy, we set it to the current version (4.9.19). See the example PGT:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
        desiredUpdate:
          version: "4.9.19"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"
```
I think this avoids the problem with version: $version. In this case, the channel CGU can be applied, and once it completes, the enforced policy created by the channel CGU is removed:
```
2022-02-11T11:44:59.481Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "fast-4921"}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["cnfdc8"]}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade Upgrade is completed
```
Next, it is time to create the second CGU, the one called version.
```
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        True          9s      Working towards 4.9.21: 9 of 738 done (1% complete)
```
So, I did not find a way to automatically upgrade both channel and version.
This is the CVO issue https://bugzilla.redhat.com/show_bug.cgi?id=1990635, which was fixed in CVO 4.10 (https://github.com/openshift/cluster-version-operator/pull/669). For the OCP upgrade from 4.9 to 4.10, we can work around it by using two policies: first update upstream/channel, then update desiredUpdate.version to trigger the OCP upgrade. BZ https://bugzilla.redhat.com/show_bug.cgi?id=2055314 has a PR posted to backport the CVO fix to 4.9; once it is backported, the two-step workaround is no longer needed.

Verified by using 2 separate PGTs and having policy1 applied first via TALO: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/6f0802ac1d6412867292fcfd9d676b478e2c8614/policygentemplates/upgrade.yaml#L24-L48
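Once the CVO fix is in place, updating channel and version together should no longer invalidate the clusterversion, so the two sourceFiles entries could in principle collapse into a single policy. A hypothetical sketch (untested; not taken from the verified PGTs linked above):

```yaml
sourceFiles:
  - fileName: ClusterVersion.yaml
    policyName: "platform-upgrade"
    metadata:
      name: version
    spec:
      channel: "candidate-4.9"
      desiredUpdate:
        version: "4.9.21"
```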