Bug 2049154 - ArgoCD App Policy Refreshing Infinitely
Summary: ArgoCD App Policy Refreshing Infinitely
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telco Edge
Version: 4.10
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jim Ramsay
QA Contact: Joshua Clark
Docs Contact: Tomas 'Sheldon' Radej
URL:
Whiteboard:
Depends On:
Blocks: 2050789
Reported: 2022-02-01 16:30 UTC by Joshua Clark
Modified: 2022-03-21 12:40 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Our default ArgoCD configuration assumes that no clusters are named `ztp*`.
Consequence: Adding a cluster via ZTP with a name matching `ztp*` causes ArgoCD to delete the policies that ACM copies into the cluster namespace. This leads to a reconciliation loop, and the policies never become compliant.
Workaround (if any): When using ZTP, do not name clusters with `ztp` at the beginning of the name. Alternatively, adjust the ArgoCD policy application's namespace glob to be more selective (for example, use `ztp-*` as the pattern in the app configuration if your cluster names do not start with `ztp-`).
Result: Avoiding or removing the name collision stops the reconciliation loop, and the policies become compliant.
Clone Of:
Clones: 2050789
Environment:
Last Closed: 2022-03-21 12:40:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift-kni cnf-features-deploy pull 943 | 0 | None | open | ztp/bug2049154 document ztp namespace restriction | 2022-02-04 14:46:15 UTC
Red Hat Product Errata | RHBA-2022:0928 | 0 | None | None | None | 2022-03-21 12:40:22 UTC

Description Joshua Clark 2022-02-01 16:30:08 UTC
Description of problem:

When deploying managed clusters via ZTP, the ArgoCD policies Application gets stuck in a refresh loop. This results in the namespace being removed and re-added, preventing the ZTP process from reaching completion.

Logs loop with the following messages:

2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.369Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.468Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[Reconcile]	{"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-02-01T15:06:00.544Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}


Version-Release number of selected component (if applicable):

gitops-service-source-vt7ln

How reproducible:

Unknown; additional testing needed.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators:
	All healthy and stable


$  oc describe -n openshift-gitops application policies |tail -25
  Normal  OperationCompleted  30s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     30s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationCompleted  25s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  OperationStarted    25s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    20s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  20s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     20s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationCompleted  15s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     15s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationStarted    15s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationStarted    10s    argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  10s    argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     10s    argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    5s     argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  5s     argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     5s     argocd-application-controller  Updated health status: Progressing -> Healthy

Comment 2 melserng 2022-02-03 14:40:19 UTC
I saw something similar before when the ArgoCD policies App doesn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD policies App and its AppProject?

Comment 3 Jim Ramsay 2022-02-04 14:24:14 UTC
Root cause:

The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policy app is set up to manage all Policy objects in any namespaces that match `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.

Workaround for QE:

Change ArgoCD so it only manages `ztp-*`, and then the cluster deployment succeeds with no contention.
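
To illustrate the collision and the workaround, here is a minimal sketch that compares the two namespace patterns. It assumes shell-style glob semantics and approximates them with Python's `fnmatch`; it is not ArgoCD's actual matcher. The namespace list is hypothetical apart from `ztp-install` and `ztpmultinode`, which appear in this report.

```python
from fnmatch import fnmatchcase

# Hypothetical namespace list for illustration; only `ztp-install` and
# `ztpmultinode` are taken from this bug report.
namespaces = ["ztp-install", "ztp-common", "ztpmultinode", "openshift-gitops"]


def managed_namespaces(pattern, names):
    """Return the names selected by a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Default pattern: also selects the cluster namespace `ztpmultinode`.
default_match = managed_namespaces("ztp*", namespaces)

# Narrower pattern from the workaround: the cluster namespace is left alone.
narrow_match = managed_namespaces("ztp-*", namespaces)

print("ztp*  ->", default_match)
print("ztp-* ->", narrow_match)
```

With `ztp*`, the cluster namespace is in the set ArgoCD manages, so the policies ACM copies there look like unexpected objects. With `ztp-*`, it falls outside that set as long as the cluster name itself does not start with `ztp-`.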

Fix for 4.10:

We should mention in our documentation that this potential collision exists, and warn customers against naming clusters `ztp*`.
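
Until there is a better fix, a pre-flight check on planned cluster names could catch the collision before deployment. The following is only a sketch: the helper name `collides_with_ztp_glob` and the example names other than `ztpmultinode` are made up for illustration, and nothing like this ships with ZTP as far as this report shows.

```python
def collides_with_ztp_glob(cluster_name, reserved_prefix="ztp"):
    """Return True if the cluster name would match a `ztp*` namespace glob.

    Hypothetical helper: a `ztp*` glob matches exactly the names that start
    with `ztp`, so a prefix test is sufficient here.
    """
    return cluster_name.startswith(reserved_prefix)


# `ztpmultinode` is the cluster from this report; the others are invented.
cluster_names = ["ztpmultinode", "site1-sno", "multinode-ztp"]

flagged = [name for name in cluster_names if collides_with_ztp_glob(name)]

for name in flagged:
    print(f"warning: cluster name '{name}' starts with 'ztp' and may collide "
          "with the default ArgoCD policy app namespace pattern")
```

Only names that begin with `ztp` are flagged; a name such as `multinode-ztp` is fine because the default pattern anchors on the prefix.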

Fix for 4.11:

Maybe we can do better with how we select/ignore these policies?  Needs more investigation.

Comment 4 Jim Ramsay 2022-02-04 14:53:54 UTC
Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364

Comment 5 Gowrishankar Rajaiyan 2022-02-04 16:52:58 UTC
@stesmith please note that there would be a doc update once bug 2050789 has a better fix.

Comment 11 Joshua Clark 2022-03-15 21:03:35 UTC
Documentation changes look good; changing to verified.

Comment 13 errata-xmlrpc 2022-03-21 12:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0928

