Bug 2050789

Summary: ArgoCD App Policy Refreshing Infinitely
Product: OpenShift Container Platform
Reporter: yliu1
Component: Telco Edge
Assignee: Ian Miller <imiller>
Telco Edge sub component: ZTP
QA Contact: Joshua Clark <josclark>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: unspecified
CC: josclark, jramsay, keyoung, tradej
Version: 4.10
Target Milestone: ---
Target Release: 4.11.z
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The GitOps Application that synchronizes content from Git to the hub cluster actively manages Policy CRs in any namespace starting with "ztp". If a cluster with a name beginning with this same prefix is deployed, the policies copied into the cluster namespace are removed by the Application.
Consequence: The cluster does not get configured with the full set of DU configuration CRs and never reaches full compliance. ZTP does not report "done" for the cluster.
Workaround (if any): Do not use the prefix "ztp" for any cluster name.
Result: Clusters deploy as expected.
Story Points: ---
Clone Of: 2049154
Environment:
Last Closed: 2022-10-17 18:46:31 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:
Bug Depends On: 2049154    
Bug Blocks:    

Description yliu1 2022-02-04 16:44:42 UTC
+++ This bug was initially created as a clone of Bug #2049154 +++
This bug was cloned so that a better fix can land in 4.11 and the naming restriction can be removed from the documentation.
-----------------

Description of problem:

When deploying managed clusters via ZTP, the ArgoCD Application named policies gets stuck in a refresh loop. This results in the copied Policy CRs in the cluster namespace being removed and re-added, preventing the ZTP process from reaching completion.

The ClusterGroupUpgrade controller logs loop with the following messages:

2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}


Version-Release number of selected component (if applicable):

gitops-service-source-vt7ln

How reproducible:

Unknown - additional testing needed.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators:
	All healthy and stable


$  oc describe -n openshift-gitops application policies |tail -25
  Normal  OperationCompleted  30s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationCompleted  25s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  OperationStarted    25s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    20s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  20s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationCompleted  15s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationStarted    15s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationStarted    10s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  10s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    5s     argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  5s     argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy

--- Additional comment from Joshua Clark on 2022-02-01 16:34:21 UTC ---

Must Gather: https://drive.google.com/file/d/1dag33-Ewb9LaLqhIYQUJIkFDgWPkzEt6/view?usp=sharing

--- Additional comment from  on 2022-02-03 14:40:19 UTC ---

I saw something similar before when the ArgoCD policies App didn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD policies App and its AppProject?

--- Additional comment from Jim Ramsay on 2022-02-04 14:24:14 UTC ---

Root cause:

The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policies app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.
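
For illustration, the namespace match lives in the ArgoCD AppProject behind the policies Application. A minimal sketch of the relevant excerpt, assuming the reference ZTP deployment layout (the object name policy-app-project and the openshift-gitops namespace are assumptions here; the `ztp*` glob is the important part):

# Hypothetical excerpt of the AppProject backing the policies Application.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: policy-app-project        # name is an assumption
  namespace: openshift-gitops
spec:
  sourceRepos:
  - '*'
  destinations:
  - server: '*'
    namespace: 'ztp*'             # matches ztp-install and the policy namespaces, but also ztpmultinode

Because the cluster namespace `ztpmultinode` matches the glob, ArgoCD treats the policies ACM copies there as resources it should own, and removes them on every sync.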

Workaround for QE:

Change the ArgoCD config so it only manages `ztp-*` namespaces; the cluster deployment then succeeds with no contention.
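
A minimal sketch of that workaround, under the same assumptions as the excerpt above, is to tighten the destination glob so that it requires the hyphen:

  destinations:
  - server: '*'
    namespace: 'ztp-*'            # still matches ztp-install etc., no longer matches ztpmultinode

With the tighter pattern, the ZTP namespaces (ztp-install and the hyphen-prefixed policy namespaces) are still managed, while a cluster namespace such as `ztpmultinode` falls outside the Application's scope, so ACM's copied policies are left alone.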

Fix for 4.10:

We should mention in our documentation that this potential collision exists, and warn customers against naming clusters with the `ztp` prefix.

Fix for 4.11:

Maybe we can do better with how we select/ignore these policies?  Needs more investigation.

--- Additional comment from Jim Ramsay on 2022-02-04 14:53:54 UTC ---

Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364

Comment 1 Ian Miller 2022-02-15 03:40:52 UTC
If the restriction on namespaces is captured in the documentation, can this bug be closed?
@jramsay @josclark

Comment 2 Joshua Clark 2022-02-28 23:01:28 UTC
@jramsay I haven't seen the documentation update. Should I wait for that before marking this BZ as verified?

Comment 3 Joshua Clark 2022-04-05 12:35:11 UTC
QE Verified. Errata RHBA-2022:0928 covers this bug.

Comment 4 Joshua Clark 2022-04-05 13:13:11 UTC
Moved back to ON_QE until this can be verified in 4.11

Comment 5 Ian Miller 2022-10-17 18:46:31 UTC
The restriction on naming has been captured in the documentation. If the naming restriction needs to be relaxed, please open an RFE.