Bug 2044339
| Summary: | ZTP: Platform upgrades do not work if modifying channel and version at the same time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alberto Losada <alosadag> |
| Component: | Telco Edge | Assignee: | Angie Wang <angwang> |
| Telco Edge sub component: | ZTP | QA Contact: | yliu1 |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | angwang, mcornea |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-26 16:43:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | Target Upstream Version: | |
Description
Alberto Losada
2022-01-24 12:16:17 UTC
I have been testing this morning with TALO/CGU. The idea is to upgrade from 4.9.19 on the fast channel to 4.9.21 on the candidate channel; notice that 4.9.21 is not available in the fast channel.
The first approach was to create a PGT with two policies grouped under a single policyName, then use waves to tell TALO to patch the channel first and the version afterward. Below is the example PGT:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "cnfdc8-day2"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: enforce
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "90"
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "platform-upgrade"
      metadata:
        name: version
        annotations:
          ran.openshift.io/ztp-deploy-wave: "95"
      spec:
        desiredUpdate:
          version: "4.9.21"
```
However, this threw an error in ArgoCD; the kustomize plugin complained:

```
rpc error: code = Unknown desc = `kustomize build /tmp/https___gitlab.cee.redhat.com_sysdeseng_5g-ericsson/demos/ztp-policygen/site-policies --enable-alpha-plugins` failed exit status 1: 2022/02/11 10:06:32 Could not build the entire policy defined by /tmp/kust-plugin-config-774393509: ran.openshift.io/ztp-deploy-wave annotation in Resource ClusterVersion.yaml (wave 95) doesn't match with Policy cnfdc8-day2-platform-upgrade (wave 90) Error: failure in plugin configured via /tmp/kust-plugin-config-774393509; exit status 1: exit status 1
```
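The error indicates that every source file grouped under one policyName must carry the same ztp-deploy-wave as the generated policy. A possible way to keep the different waves while still sequencing the two changes (sketch only; the policy names below are illustrative, not from the source) would be to split the entries into two policyNames, each with its own wave:

```yaml
sourceFiles:
  - fileName: ClusterVersion.yaml
    policyName: "platform-channel"   # hypothetical name
    metadata:
      name: version
      annotations:
        ran.openshift.io/ztp-deploy-wave: "90"
    spec:
      channel: "candidate-4.9"
  - fileName: ClusterVersion.yaml
    policyName: "platform-version"   # hypothetical name
    metadata:
      name: version
      annotations:
        ran.openshift.io/ztp-deploy-wave: "95"
    spec:
      desiredUpdate:
        version: "4.9.21"
```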
The second approach was to create a PGT with two ungrouped, i.e. independent, policies: one called channel and the other called version. See the example:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"
```
Once they are applied in inform mode, I can create the following CGU CR to enforce them. The point is that in the CGU I can select which policy is applied first:
$ cat clustergroupupgrades-cnfdc8.yaml

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
Once the policies were being enforced, I realized that they were colliding. The first one (the channel policy) tries to set the following configuration in the spoke's clusterversion object:
```
oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "$version"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}
```
Notice the version: $version. I assumed that NOT setting this variable would leave the version unpatched by the source CR. Also notice that this leaves the cluster with an invalid clusterversion:
```
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        False         57m     Stopped at 4.9.19: the cluster version is invalid
```
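For context, the literal $version comes from the ClusterVersion.yaml source CR shipped with the ZTP tooling, which (as I understand it; the exact content may differ between releases, so treat this as an assumption) uses placeholder variables that are substituted from the values in the PGT overlay, roughly:

```yaml
# Assumed shape of the ClusterVersion.yaml source CR; placeholders
# like $channel/$version are meant to be filled in by the PGT overlay.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: "$channel"
  upstream: $upstream
  desiredUpdate:
    version: $version
```

If that is the case, any placeholder the PGT does not override ends up rendered literally, which would explain why version: "$version" reaches the spoke.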
Once it is compliant, the second one (the version policy) is enforced:
```
oc get clusterversion version -o jsonpath='{.spec}' | jq
{
  "channel": "candidate-4.9",
  "clusterID": "817f2d36-4a89-48f8-8553-2e9e03dcf3d3",
  "desiredUpdate": {
    "force": false,
    "version": "4.9"
  },
  "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
}
```
However, the above configuration lasts only a few seconds before being reverted to the previous one. The upgrade process is never triggered and the cluster remains in the "cluster version is invalid" status indefinitely.
In fact, in the ACM UI I can see the policies flapping constantly between compliant (green) and non-compliant (red), since they are configuring the same object.
The third approach was to create two independent policies as in the previous approach, but this time each one is applied by a different CGU. Here is the channel CGU, which enforces the channel policy:
clustergroupupgrades-cnfdc8-channel.yaml:

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: fast-4921
  namespace: ztp-cnfdc8-policies
spec:
  actions:
    afterCompletion:
      deleteObjects: true
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-channel
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
And the CGU that enforces the version upgrade:
```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: version-4921
  namespace: ztp-cnfdc8-policies
spec:
  preCaching: false
  deleteObjectsOnCompletion: false
  clusters:
    - cnfdc8
  enable: true
  managedPolicies:
    - platform-upgrade-version
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
```
In this case there are three scenarios:

* Both CGUs executed at the same time. The result is the same as in the second approach: since two policies in enforce mode modify the same object, they flip constantly between compliant and non-compliant.
* CGUs executed sequentially and manually (so far). Once the channel CGU is compliant it is removed (automatically). Then we can create the second CGU (version). The result is a bit unexpected: the clusterversion stays in an error status even though the clusterversion object is correct. I think this is because, since we did not set a version in the channel policy, it automatically patched version=$version, and the cluster never recovered from it. This is the error:

  ```
  Type: RetrievedUpdates
  Last Transition Time: 2022-02-10T21:18:25Z
  Message: The cluster version is invalid: spec.desiredUpdate.version: Invalid value: "4.9.19": when image is empty the update must be a previous version or an available update
  Reason: InvalidClusterVersion
  Status: True
  Type: Invalid
  ```

* CGUs executed sequentially and manually (so far), but in this case, instead of leaving the version field empty in the channel policy, we set it to the current version (4.9.19). See the example PGT:
```yaml
apiVersion: ran.openshift.io/v1
kind: PolicyGenTemplate
metadata:
  name: "platform-upgrade"
  namespace: "ztp-cnfdc8-policies"
spec:
  bindingRules:
    name: "cnfdc8"
  mcp: "master"
  remediationAction: inform
  sourceFiles:
    - fileName: ClusterVersion.yaml
      policyName: "channel"
      metadata:
        name: version
      spec:
        channel: "candidate-4.9"
        desiredUpdate:
          version: "4.9.19"
    - fileName: ClusterVersion.yaml
      policyName: "version"
      metadata:
        name: version
      spec:
        desiredUpdate:
          version: "4.9.21"
```
I think this avoids the problem with version: $version. In this case, the channel CGU can be applied, and once it completes, the enforced policy created by the channel CGU is removed:
```
2022-02-11T11:44:59.481Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "fast-4921"}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["cnfdc8"]}
2022-02-11T11:44:59.482Z INFO controllers.ClusterGroupUpgrade Upgrade is completed
```
Next, it is time to create the second CGU, the one called version.
```
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.19    True        True          9s      Working towards 4.9.21: 9 of 738 done (1% complete)
```
So, I did not find a way to automatically upgrade both channel and version.
This is the CVO issue https://bugzilla.redhat.com/show_bug.cgi?id=1990635, which was fixed in CVO 4.10 (https://github.com/openshift/cluster-version-operator/pull/669). For the OCP upgrade from 4.9 to 4.10, we can work around it by using two policies: first update upstream/channel, then update desiredUpdate.version to trigger the OCP upgrade. BZ https://bugzilla.redhat.com/show_bug.cgi?id=2055314 has a PR posted to backport the CVO fix to 4.9; once it is backported, the two-step workaround is no longer needed.

Verified by using 2 separate PGTs and having policy1 applied first via TALO: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/6f0802ac1d6412867292fcfd9d676b478e2c8614/policygentemplates/upgrade.yaml#L24-L48
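Once the CVO fix is in place, updating channel and version together should no longer invalidate the clusterversion, so the two sourceFiles entries could in principle collapse into a single policy. A hypothetical sketch (untested; not taken from the verified PGTs linked above):

```yaml
sourceFiles:
  - fileName: ClusterVersion.yaml
    policyName: "platform-upgrade"
    metadata:
      name: version
    spec:
      channel: "candidate-4.9"
      desiredUpdate:
        version: "4.9.21"
```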