Bug 2050789 - ArgoCD App Policy Refreshing Infinitely
Summary: ArgoCD App Policy Refreshing Infinitely
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.10
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.11.z
Assignee: Ian Miller
QA Contact: Joshua Clark
URL:
Whiteboard:
Depends On: 2049154
Blocks:
 
Reported: 2022-02-04 16:44 UTC by yliu1
Modified: 2023-02-08 21:11 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The GitOps Application that synchronizes content from Git to the hub cluster actively manages Policy CRs in any namespace starting with "ztp". If a cluster whose name begins with this same prefix is deployed, the copied policies in the cluster namespace are removed by the Application.
Consequence: The cluster does not get configured with the full set of DU configuration CRs and never reaches full compliance. ZTP does not report "done" for the cluster.
Workaround (if any): Do not use the prefix "ztp" for any cluster name.
Result: Clusters deploy as expected.
Clone Of: 2049154
Environment:
Last Closed: 2022-10-17 18:46:31 UTC
Target Upstream Version:
Embargoed:



Description yliu1 2022-02-04 16:44:42 UTC
+++ This bug was initially created as a clone of Bug #2049154 +++
This bug is cloned so that a better fix can be delivered in 4.11 and the naming restriction can be removed from the documentation.
-----------------

Description of problem:

When deploying managed clusters via ZTP, the ArgoCD policies Application gets stuck in a refresh loop. This results in the copied policies in the cluster namespace being repeatedly removed and re-added, preventing the ZTP process from reaching completion.

Logs loop with the following messages:

2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}


Version-Release number of selected component (if applicable):

gitops-service-source-vt7ln

How reproducible:

Unknown; additional testing is needed.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators:
	All healthy and stable


$  oc describe -n openshift-gitops application policies |tail -25
  Normal  OperationCompleted  30s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationCompleted  25s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  OperationStarted    25s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    20s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  20s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationCompleted  15s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationStarted    15s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationStarted    10s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  10s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    5s     argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  5s     argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy

--- Additional comment from Joshua Clark on 2022-02-01 16:34:21 UTC ---

Must Gather: https://drive.google.com/file/d/1dag33-Ewb9LaLqhIYQUJIkFDgWPkzEt6/view?usp=sharing

--- Additional comment from  on 2022-02-03 14:40:19 UTC ---

I saw something similar before when the ArgoCD policies App didn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD App Policies and its AppProject?

--- Additional comment from Jim Ramsay on 2022-02-04 14:24:14 UTC ---

Root cause:

The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policy app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.
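
For illustration only, the conflicting selection comes from a namespace glob along these lines in the AppProject backing the policies Application (the project name and trimmed manifest below are a sketch, not the exact shipped ZTP deployment files):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: policy-app-project        # hypothetical name, for illustration
  namespace: openshift-gitops
spec:
  sourceRepos:
  - '*'
  destinations:
  - server: '*'
    namespace: 'ztp*'             # also matches a cluster namespace such as "ztpmultinode"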

Workaround for QE:

Change ArgoCD so it only manages `ztp-*`, and then the cluster deployment succeeds with no contention.
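
As a minimal sketch of that workaround (same hypothetical AppProject as above), the glob is narrowed so ZTP-owned namespaces such as ztp-install still match while a cluster namespace like ztpmultinode does not:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: policy-app-project        # hypothetical name, for illustration
  namespace: openshift-gitops
spec:
  sourceRepos:
  - '*'
  destinations:
  - server: '*'
    namespace: 'ztp-*'            # no longer matches "ztpmultinode"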

Fix for 4.10:

We should mention in our documentation that this potential collision exists and warn customers against naming clusters with the `ztp` prefix.

Fix for 4.11:

Maybe we can do better with how we select/ignore these policies?  Needs more investigation.

--- Additional comment from Jim Ramsay on 2022-02-04 14:53:54 UTC ---

Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364

Comment 1 Ian Miller 2022-02-15 03:40:52 UTC
If the restriction on namespaces is captured, can this bug be closed?
@jramsay @josclark

Comment 2 Joshua Clark 2022-02-28 23:01:28 UTC
@jramsay I haven't seen the documentation update. Should I wait for that before marking this BZ as verified?

Comment 3 Joshua Clark 2022-04-05 12:35:11 UTC
QE Verified. Errata RHBA-2022:0928 covers this bug.

Comment 4 Joshua Clark 2022-04-05 13:13:11 UTC
Moved back to ON_QE until this can be verified in 4.11

Comment 5 Ian Miller 2022-10-17 18:46:31 UTC
The restriction on naming has been updated in the documentation. If the naming restriction needs to be relaxed please open an RFE.

