Bug 2063265

Summary: [backport 4.10] Suggest to change ztp upgrade workflow to deploy TALO at the end
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Telco Edge
Assignee: Jim Ramsay <jramsay>
Telco Edge sub component: ZTP
QA Contact: yliu1
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: imiller, keyoung, scuppett
Version: 4.9   
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-21 12:40:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2057678
Bug Blocks:

Description OpenShift BugZilla Robot 2022-03-11 16:51:32 UTC
+++ This bug was initially created as a clone of Bug #2057678 +++

Description of problem:
In the following doc, TALO is deployed before the ZTP workflow (including the ArgoCD apps and PGT structures) is updated:
https://github.com/openshift-kni/cnf-features-deploy/blob/44aa7ebc675dff2a09f5afbdc784e48fdf51624e/ztp/gitops-subscriptions/argocd/Upgrade.md

The problem is that no CGU is created automatically: when TALO starts, the existing policies from the 4.9 deployment are all Compliant and carry no wave numbers, and only after the ZTP workflow is updated do all the policies become NonCompliant. (Also note that because the clusters were deployed via 4.9 ZTP, they are NOT removed from managedclusters even if the gitops apps are deleted, so TALO takes action on them right away.)

If we deploy (or restart) TALO after the ZTP workflow is updated to 4.10, a CGU is created automatically to apply the 4.10 structural changes such as wave annotations, the installplanapproval strategy, etc.

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always (probably)

Steps to Reproduce:
1. Deploy some 4.9 spoke clusters using the 4.9 ZTP workflow
2. Run the following steps as per the doc at https://github.com/openshift-kni/cnf-features-deploy/blob/44aa7ebc675dff2a09f5afbdc784e48fdf51624e/ztp/gitops-subscriptions/argocd/Upgrade.md
2.1 Delete gitops apps
2.2 Deploy TALO
2.3 Update PGT to 4.10 structure
2.4 Update ZTP apps to 4.10  

Actual results:
All policies became NonCompliant and no CGU was created.

Expected results:
A ztp-install CGU is created automatically and applies the 4.10 ZTP changes such as wave annotations.

Additional info:

Suggest changing the ZTP update workflow to the following:
1 Update ZTP apps to 4.10 (if we do this first, the policies will be inform by default, so we don't need to worry about automatic changes in step 2)
2 Update PGT to 4.10 structure
3 Deploy TALO (or restart TALO)

Or, if deleting the ArgoCD apps is necessary, we can do this:
1 Delete gitops apps
2 Update PGT to 4.10 structure
3 Update ZTP apps to 4.10  
4 Deploy TALO (or restart TALO)
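
The ordering argument above can be illustrated with a toy model. The sketch below is written in Python and is not TALO code: the step names, the Hub class, and the rule "TALO looks at existing clusters only when it starts and creates a CGU if policies are NonCompliant at that moment" are simplifying assumptions drawn from the description in this report.

```python
# Toy model of the ordering problem; this is NOT TALO's real logic.
# Step names and the Hub class are hypothetical, chosen for illustration.
from dataclasses import dataclass, field

KNOWN_STEPS = {"delete_gitops_apps", "update_pgt", "update_ztp_apps", "deploy_talo"}


@dataclass
class Hub:
    done: set = field(default_factory=set)
    cgu_created: bool = False

    def policies_noncompliant(self) -> bool:
        # From the report: policies turn NonCompliant once the ZTP workflow
        # (PGT structure and ZTP apps) has been updated to 4.10.
        return {"update_pgt", "update_ztp_apps"} <= self.done


def run_workflow(steps):
    """Apply the steps in order and report whether a CGU would be created."""
    hub = Hub()
    for step in steps:
        if step not in KNOWN_STEPS:
            raise ValueError(f"unknown step: {step}")
        hub.done.add(step)
        # Assumption: TALO evaluates existing clusters only when it starts
        # (deploy or restart) and creates a CGU if policies are NonCompliant.
        if step == "deploy_talo" and hub.policies_noncompliant():
            hub.cgu_created = True
    return hub.cgu_created


# Workflow from the doc under discussion (steps 2.1 to 2.4 above).
DOCUMENTED = ["delete_gitops_apps", "deploy_talo", "update_pgt", "update_ztp_apps"]
# The two workflows suggested in this report.
SUGGESTED = ["update_ztp_apps", "update_pgt", "deploy_talo"]
SUGGESTED_WITH_DELETE = ["delete_gitops_apps", "update_pgt", "update_ztp_apps", "deploy_talo"]

if __name__ == "__main__":
    for name, steps in [
        ("documented", DOCUMENTED),
        ("suggested", SUGGESTED),
        ("suggested, with delete", SUGGESTED_WITH_DELETE),
    ]:
        print(f"{name}: CGU created = {run_workflow(steps)}")
```

Under these assumptions, only the workflows that deploy (or restart) TALO last end with a CGU, which matches the actual and expected results reported above.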

--- Additional comment from yliu1 on 2022-02-24 15:58:10 UTC ---

If all the old policies were Compliant, no CGU is created when TALO is deployed before ZTP and the PGTs are updated; see the TALO logs below.
If some old policies were NonCompliant, a CGU is created against the old policies and will likely fail: the old policies were already set to enforce, so if they did not become compliant on their own, TALO won't make a difference either.

[kni@provisionhost-0-0 ~]$ oc logs -n openshift-cluster-group-upgrades cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd manager
I0223 18:30:18.459515       1 request.go:668] Waited for 1.022655491s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/view.open-cluster-management.io/v1beta1?timeout=32s
2022-02-23T18:30:21.816Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2022-02-23T18:30:21.825Z	INFO	setup	starting manager
I0223 18:30:21.826051       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-group-upgrades/9a2365a3.openshift.io...
2022-02-23T18:30:21.826Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I0223 18:30:21.853021       1 leaderelection.go:253] successfully acquired lease openshift-cluster-group-upgrades/9a2365a3.openshift.io
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting EventSource	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting EventSource	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting Controller	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster"}
2022-02-23T18:30:21.853Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-cluster-group-upgrades","name":"9a2365a3.openshift.io","uid":"346066c2-ffe2-4389-8030-c212c94fc09d","apiVersion":"v1","resourceVersion":"2490961"}, "reason": "LeaderElection", "message": "cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd_bd2d64c6-c4d8-492e-a337-f6c3e921d657 became leader"}
2022-02-23T18:30:21.854Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"openshift-cluster-group-upgrades","name":"9a2365a3.openshift.io","uid":"2a9546b1-6e25-4f2e-be37-cbb6f2f70543","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2490967"}, "reason": "LeaderElection", "message": "cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd_bd2d64c6-c4d8-492e-a337-f6c3e921d657 became leader"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: apps.open-cluster-management.io/v1, Kind=PlacementRule"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: policy.open-cluster-management.io/v1, Kind=PlacementBinding"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: policy.open-cluster-management.io/v1, Kind=Policy"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting Controller	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade"}
2022-02-23T18:30:22.055Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting workers	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "worker count": 1}
2022-02-23T18:30:22.055Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-2"}
2022-02-23T18:30:22.056Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-2"}
2022-02-23T18:30:22.056Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting workers	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "worker count": 1}
2022-02-23T18:30:22.157Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.157Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-3"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-3"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	WARN: No child policies found for cluster	{"Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-0"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-0"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-1"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-1"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
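
To read a log like the one above more quickly, the per-cluster outcome can be summarized with a short script. This is a minimal sketch in Python, assuming the manager log fields are tab-separated as in the excerpt and that a "No policies need to be managed" line belongs to the most recently reconciled cluster (the excerpt shows a worker count of 1). The function name and outcome labels are hypothetical.

```python
# Minimal sketch: summarize ManagedClusterForCGU outcomes from a TALO log.
# Assumes tab-separated fields: timestamp, level, logger, message, JSON.
import json

LOGGER = "controllers.ManagedClusterForCGU"


def summarize(log_text):
    """Return {cluster name: outcome label} for the clusters seen in the log."""
    outcome = {}
    current = None  # cluster named by the latest "Reconciling" line
    for line in log_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 4 or parts[2] != LOGGER:
            continue  # klog-style lines and other loggers are skipped
        message = parts[3]
        fields = json.loads(parts[4]) if len(parts) > 4 else {}
        if message.startswith("Reconciling managedCluster"):
            current = fields.get("Request.Name")
        elif message.startswith("No policies need to be managed") and current:
            # This line carries no cluster name, so attribute it to the
            # cluster currently being reconciled (an assumption).
            outcome[current] = "no CGU needed"
        elif "No child policies found" in message:
            outcome[fields.get("Name", current)] = "no child policies"
    return outcome


if __name__ == "__main__":
    import sys

    for cluster, result in summarize(sys.stdin.read()).items():
        print(f"{cluster}: {result}")
```

One way to use it, with the script saved under a name of your choice, is to pipe the output of the oc logs command shown above into it.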

--- Additional comment from imiller on 2022-03-11 16:49:13 UTC ---

The upgrade procedure was discussed and changes have been made in the upstream documentation.

Comment 2 yliu1 2022-03-14 16:31:56 UTC
Verified with 4.10 ZTP.

Comment 5 errata-xmlrpc 2022-03-21 12:40:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0928