Bug 2049154

Summary: ArgoCD App Policy Refreshing Infinitely
Product: OpenShift Container Platform
Component: Telco Edge
Telco Edge sub component: ZTP
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 4.10
Target Release: 4.10.0
Hardware: Unspecified
OS: Linux
Reporter: Joshua Clark <josclark>
Assignee: Jim Ramsay <jramsay>
QA Contact: Joshua Clark <josclark>
Docs Contact: Tomas 'Sheldon' Radej <tradej>
CC: grajaiya, jramsay, keyoung
Doc Type: Known Issue
Doc Text:
Cause: The default ArgoCD configuration assumes that no clusters are named `ztp*`.
Consequence: Adding a cluster via ZTP with a name matching `ztp*` causes ArgoCD to delete the policies that ACM copies into the cluster namespace, leading to a reconciliation loop in which the policies never become compliant.
Workaround (if any): When using ZTP, do not give clusters names that begin with `ztp`. Alternatively, adjust the ArgoCD policy application's namespace glob to be more selective (for example, use `ztp-*` as the pattern in the app configuration if your cluster names do not start with `ztp-`).
Result: Avoiding or removing the name collision stops the reconciliation loop and the policies become compliant.
Clones: 2050789 (view as bug list)
Last Closed: 2022-03-21 12:40:05 UTC
Type: Bug
Bug Blocks: 2050789    

Description Joshua Clark 2022-02-01 16:30:08 UTC
Description of problem:

When deploying managed clusters via ZTP, the ArgoCD `policies` Application gets stuck in a refresh loop. This results in the policies in the cluster namespace being repeatedly removed and re-added, preventing the ZTP process from reaching completion.

The controller logs loop with the following messages:

2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
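
One way to observe the contention directly is to watch the copied Policy objects in the cluster namespace (here `ztpmultinode`, per the logs above; this assumes the ACM policy CRDs are installed):

# In the failure mode the policies are repeatedly created by ACM and pruned by ArgoCD.
$ oc get policies -n ztpmultinode --watch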


Version-Release number of selected component (if applicable):

gitops-service-source-vt7ln

How reproducible:

Unknown; additional testing is needed.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators:
	All healthy and stable


$ oc describe -n openshift-gitops application policies | tail -25
  Normal  OperationCompleted  30s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationCompleted  25s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  OperationStarted    25s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    20s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  20s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationCompleted  15s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationStarted    15s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationStarted    10s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  10s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    5s     argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  5s     argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy

Comment 2 melserng 2022-02-03 14:40:19 UTC
I've seen something similar before when the ArgoCD `policies` App doesn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD App policies and their AppProject?
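
One way to pull those details, using only the resource names this bug already shows (nothing else is assumed):

$ oc get application policies -n openshift-gitops -o yaml
$ oc get appproject -n openshift-gitops -o yaml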

Comment 3 Jim Ramsay 2022-02-04 14:24:14 UTC
Root cause:

The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policy app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.
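
For illustration only (not from the bug): these patterns behave like shell globs, so `ztpmultinode` matches the default `ztp*` but not the narrower `ztp-*`:

$ case ztpmultinode in ztp*) echo "matches ztp*";; esac
matches ztp*
$ case ztpmultinode in ztp-*) echo "matches ztp-*";; *) echo "no match for ztp-*";; esac
no match for ztp-*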

Workaround for QE:

Change the ArgoCD configuration so it only manages `ztp-*`; the cluster deployment then succeeds with no contention. A sketch follows below.
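
A minimal sketch of that change, assuming the glob lives in an AppProject destination as in the upstream ArgoCD schema; the project name `ztp-app-project` and the destination index are assumptions, so verify against the deployed objects before patching:

# Hypothetical: narrow the namespace glob from 'ztp*' to 'ztp-*'.
$ oc patch appproject ztp-app-project -n openshift-gitops --type=json \
    -p '[{"op":"replace","path":"/spec/destinations/0/namespace","value":"ztp-*"}]'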

Fix for 4.10:

We should mention in our documentation that this potential collision exists, and warn customers against naming clusters `ztp*`.

Fix for 4.11:

Maybe we can do better with how we select/ignore these policies?  Needs more investigation.

Comment 4 Jim Ramsay 2022-02-04 14:53:54 UTC
Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364

Comment 5 Gowrishankar Rajaiyan 2022-02-04 16:52:58 UTC
@stesmith please note that there will be a doc update once bug 2050789 has a better fix.

Comment 11 Joshua Clark 2022-03-15 21:03:35 UTC
Documentation changes look good; changing to Verified.

Comment 13 errata-xmlrpc 2022-03-21 12:40:05 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0928