Bug 2057678

Summary: Suggest to change ztp upgrade workflow to deploy TALO at the end
Product: OpenShift Container Platform Reporter: yliu1
Component: Telco Edge Assignee: Jim Ramsay <jramsay>
Telco Edge sub component: ZTP QA Contact: yliu1
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: keyoung, scuppett
Version: 4.9   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
This doc update is already part of https://github.com/openshift/openshift-docs/pull/43890
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-26 16:43:57 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2063265    

Description yliu1 2022-02-23 20:18:29 UTC
Description of problem:
In the following doc, TALO is deployed before the ZTP workflow (including the argocd apps and PGT structures) is updated.
https://github.com/openshift-kni/cnf-features-deploy/blob/44aa7ebc675dff2a09f5afbdc784e48fdf51624e/ztp/gitops-subscriptions/argocd/Upgrade.md

The problem is that no CGU is created automatically: when TALO starts, the existing policies from the 4.9 deployment are all Compliant and carry no wave numbers, so TALO finds nothing to manage. Only after the ZTP workflow is updated do all the policies become NonCompliant. (Also note that because the clusters were deployed via 4.9 ZTP, they are NOT removed from managedclusters even if the gitops apps are deleted, so TALO takes action right away.)

If we deploy (or restart) TALO after the ZTP workflow is updated to 4.10, then a CGU is created automatically to apply the 4.10 structural changes such as wave annotations, the installPlanApproval strategy, etc.
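
The reasoning above can be illustrated with a small model. This is only a sketch of the behavior described in this report (a CGU is auto-created only when TALO sees NonCompliant policies at the time it reconciles a managed cluster); it is not TALO's actual code, and the ChildPolicy type, the function name, and the policy names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChildPolicy:
    name: str
    compliant: bool  # True for Compliant, False for NonCompliant


def policies_for_auto_cgu(policies: List[ChildPolicy]) -> Optional[List[str]]:
    """Return the policy names an auto-created CGU would manage.

    Returns None when no CGU would be created. Simplified model based on
    this report, not on the TALO source.
    """
    non_compliant = [p.name for p in policies if not p.compliant]
    return non_compliant or None


# TALO deployed before the ZTP/PGT update: 4.9 policies are all Compliant.
before_update = [
    ChildPolicy("common-config-policy", True),  # hypothetical policy name
    ChildPolicy("group-du-config-policy", True),  # hypothetical policy name
]

# TALO deployed (or restarted) after the update: policies are NonCompliant.
after_update = [ChildPolicy(p.name, False) for p in before_update]

print(policies_for_auto_cgu(before_update))
print(policies_for_auto_cgu(after_update))
```

In this model the first call finds nothing to manage, which matches the "No policies need to be managed by ClusterGroupUpgrade operator" messages in the TALO logs, while the second call returns both policies.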

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always (probably)

Steps to Reproduce:
1. Deploy some 4.9 spoke clusters using the 4.9 ZTP workflow
2. Run the following steps as per the doc at https://github.com/openshift-kni/cnf-features-deploy/blob/44aa7ebc675dff2a09f5afbdc784e48fdf51624e/ztp/gitops-subscriptions/argocd/Upgrade.md
2.1 Delete gitops apps
2.2 Deploy TALO
2.3 Update PGT to 4.10 structure
2.4 Update ZTP apps to 4.10  

Actual results:
All policies became NonCompliant and no CGU was created.

Expected results:
A ztp-install CGU is created automatically and applies the 4.10 ZTP changes such as wave annotations.

Additional info:

Suggest changing the ZTP update workflow to the following:
1 Update ZTP apps to 4.10 (if we do this first, the policies will be in inform mode by default, so we don't need to worry about automatic changes in step 2)
2 Update PGT to 4.10 structure
3 Deploy TALO (or restart TALO)

Or, if deleting the argocd apps is necessary, we can do this:
1 Delete gitops apps
2 Update PGT to 4.10 structure
3 Update ZTP apps to 4.10  
4 Deploy TALO (or restart TALO)
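
The ordering constraint behind both suggested workflows (TALO goes last) can be written as a small check. This is a sketch only: the step labels are copied from the lists above, and the function name is hypothetical.

```python
from typing import List

# Step labels taken from the workflows listed in this report.
DELETE_APPS = "Delete gitops apps"
UPDATE_ZTP = "Update ZTP apps to 4.10"
UPDATE_PGT = "Update PGT to 4.10 structure"
DEPLOY_TALO = "Deploy TALO"


def talo_deployed_last(steps: List[str]) -> bool:
    """True when TALO is deployed only after both the ZTP apps and PGT updates."""
    if DEPLOY_TALO not in steps:
        return False
    talo_at = steps.index(DEPLOY_TALO)
    return all(
        step in steps and steps.index(step) < talo_at
        for step in (UPDATE_ZTP, UPDATE_PGT)
    )


# Order from the doc being reported against (steps 2.1 to 2.4).
documented = [DELETE_APPS, DEPLOY_TALO, UPDATE_PGT, UPDATE_ZTP]
# The two orders suggested in this report.
suggested = [UPDATE_ZTP, UPDATE_PGT, DEPLOY_TALO]
suggested_with_delete = [DELETE_APPS, UPDATE_PGT, UPDATE_ZTP, DEPLOY_TALO]

for name, steps in [
    ("documented", documented),
    ("suggested", suggested),
    ("suggested_with_delete", suggested_with_delete),
]:
    print(name, talo_deployed_last(steps))
```

Only the documented order fails the check, because TALO is deployed while the 4.9 policies are still in place.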

Comment 1 yliu1 2022-02-24 15:58:10 UTC
If all the old policies are Compliant and TALO is deployed before ZTP and the PGT are updated, no CGU is created. The TALO logs are below.
If some old policies are NonCompliant, a CGU is created against the old policies, and that CGU will likely fail: the old policies were already in enforce mode, so if they did not become Compliant on their own, TALO will not make a difference either.

[kni@provisionhost-0-0 ~]$ oc logs -n openshift-cluster-group-upgrades cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd manager
I0223 18:30:18.459515       1 request.go:668] Waited for 1.022655491s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/view.open-cluster-management.io/v1beta1?timeout=32s
2022-02-23T18:30:21.816Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2022-02-23T18:30:21.825Z	INFO	setup	starting manager
I0223 18:30:21.826051       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-group-upgrades/9a2365a3.openshift.io...
2022-02-23T18:30:21.826Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I0223 18:30:21.853021       1 leaderelection.go:253] successfully acquired lease openshift-cluster-group-upgrades/9a2365a3.openshift.io
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting EventSource	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting EventSource	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.853Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting Controller	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster"}
2022-02-23T18:30:21.853Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-cluster-group-upgrades","name":"9a2365a3.openshift.io","uid":"346066c2-ffe2-4389-8030-c212c94fc09d","apiVersion":"v1","resourceVersion":"2490961"}, "reason": "LeaderElection", "message": "cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd_bd2d64c6-c4d8-492e-a337-f6c3e921d657 became leader"}
2022-02-23T18:30:21.854Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"openshift-cluster-group-upgrades","name":"9a2365a3.openshift.io","uid":"2a9546b1-6e25-4f2e-be37-cbb6f2f70543","apiVersion":"coordination.k8s.io/v1","resourceVersion":"2490967"}, "reason": "LeaderElection", "message": "cluster-group-upgrades-controller-manager-75bcc7484d-7sqnd_bd2d64c6-c4d8-492e-a337-f6c3e921d657 became leader"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: /, Kind="}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: apps.open-cluster-management.io/v1, Kind=PlacementRule"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: policy.open-cluster-management.io/v1, Kind=PlacementBinding"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting EventSource	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "source": "kind source: policy.open-cluster-management.io/v1, Kind=Policy"}
2022-02-23T18:30:21.854Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting Controller	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade"}
2022-02-23T18:30:22.055Z	INFO	controller-runtime.manager.controller.managedclusterForCGU	Starting workers	{"reconciler group": "cluster.open-cluster-management.io", "reconciler kind": "ManagedCluster", "worker count": 1}
2022-02-23T18:30:22.055Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-2"}
2022-02-23T18:30:22.056Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-2"}
2022-02-23T18:30:22.056Z	INFO	controller-runtime.manager.controller.clustergroupupgrade	Starting workers	{"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "worker count": 1}
2022-02-23T18:30:22.157Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.157Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-3"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-3"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	WARN: No child policies found for cluster	{"Name": "local-cluster"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-0"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-0"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "helix21-1"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	cluster is ready	{"Name": "helix21-1"}
2022-02-23T18:30:22.158Z	INFO	controllers.ManagedClusterForCGU	No policies need to be managed by ClusterGroupUpgrade operator

Comment 2 Ian Miller 2022-03-11 16:49:13 UTC
Upgrade procedure discussed and changes have been made in upstream documentation.

Comment 4 yliu1 2022-03-11 18:55:14 UTC
Since this is a doc change, verification was done using 4.10 ZTP.