Cause:
Our default ArgoCD configuration assumes that no cluster has a name matching `ztp*`.
Consequence:
Adding a cluster via ZTP with a name matching `ztp*` causes ArgoCD to delete the policies that ACM copies into the cluster namespace. This leads to a reconciliation loop, and the policies never become compliant.
Workaround (if any):
When using ZTP, do not name clusters with `ztp` at the beginning of the name. Alternatively, adjust the ArgoCD policy application's namespace glob to be more selective (for example, use `ztp-*` as the pattern in the app configuration if your cluster names do not start with `ztp-`).
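A pre-flight check along these lines could flag colliding names before deployment. This is only a sketch: the helper `colliding_names` and the sample names `cnfdf01` and `ztp-lab1` are hypothetical, and Python's `fnmatch` is assumed to behave like the ArgoCD glob for these simple patterns.

```python
from fnmatch import fnmatchcase


def colliding_names(cluster_names, namespace_glob="ztp*"):
    """Return the cluster names whose namespace would match the glob.

    Assumes the cluster namespace has the same name as the cluster.
    """
    return [name for name in cluster_names if fnmatchcase(name, namespace_glob)]


# Hypothetical cluster names; only "ztpmultinode" comes from this report.
names = ["ztpmultinode", "cnfdf01", "ztp-lab1"]

print(colliding_names(names, "ztp*"))   # names that collide with the default glob
print(colliding_names(names, "ztp-*"))  # names that collide with the narrower glob
```

With the narrower `ztp-*` glob, `ztpmultinode` no longer collides, but a name such as `ztp-lab1` still would, which is why the workaround only holds if cluster names do not start with `ztp-`.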
Result:
Avoiding or removing the name collision will stop the reconciliation loop and policies will become compliant.
Description of problem:
When deploying managed clusters via ZTP, the ArgoCD Application Policy gets stuck in a refresh loop. This results in the namespace being removed and re-added, preventing the ZTP process from reaching completion.
Logs loop with the following messages:
2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
Version-Release number of selected component (if applicable):
gitops-service-source-vt7ln
How reproducible:
Unknown; additional testing needed.
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators:
All healthy and stable
$ oc describe -n openshift-gitops application policies |tail -25
Normal OperationCompleted 30s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal ResourceUpdated 30s argocd-application-controller Updated health status: Healthy -> Progressing
Normal ResourceUpdated 30s argocd-application-controller Updated health status: Progressing -> Healthy
Normal ResourceUpdated 25s argocd-application-controller Updated health status: Healthy -> Progressing
Normal OperationCompleted 25s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal OperationStarted 25s argocd-application-controller Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal ResourceUpdated 25s argocd-application-controller Updated health status: Progressing -> Healthy
Normal ResourceUpdated 25s argocd-application-controller Updated health status: Progressing -> Healthy
Normal OperationStarted 20s argocd-application-controller Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal OperationCompleted 20s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal ResourceUpdated 20s argocd-application-controller Updated health status: Healthy -> Progressing
Normal ResourceUpdated 20s argocd-application-controller Updated health status: Progressing -> Healthy
Normal ResourceUpdated 15s argocd-application-controller Updated health status: Progressing -> Healthy
Normal OperationCompleted 15s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal ResourceUpdated 15s argocd-application-controller Updated health status: Healthy -> Progressing
Normal OperationStarted 15s argocd-application-controller Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal OperationStarted 10s argocd-application-controller Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal OperationCompleted 10s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal ResourceUpdated 10s argocd-application-controller Updated health status: Healthy -> Progressing
Normal ResourceUpdated 10s argocd-application-controller Updated health status: Progressing -> Healthy
Normal OperationStarted 5s argocd-application-controller Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal OperationCompleted 5s argocd-application-controller Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal ResourceUpdated 5s argocd-application-controller Updated health status: Healthy -> Progressing
Normal ResourceUpdated 5s argocd-application-controller Updated health status: Progressing -> Healthy
Normal ResourceUpdated 5s argocd-application-controller Updated health status: Progressing -> Healthy
I saw something similar before when the ArgoCD App policies didn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD App policies and its AppProject?
Root cause:
The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policy app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.
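The contention can be illustrated with a toy model. This is not how ArgoCD or ACM are implemented; the function `simulate` is hypothetical, each loop iteration stands in for one copy-then-sync cycle, and `fnmatch` is assumed to approximate the namespace glob.

```python
from fnmatch import fnmatchcase


def simulate(cluster_namespace, managed_glob, rounds=3):
    """Toy model of the loop: return (removals, policy_present_at_end)."""
    policy_present = False
    removals = 0
    for _ in range(rounds):
        # ACM copies the policy into the cluster namespace.
        policy_present = True
        # ArgoCD removes the copy if it manages Policy objects in this namespace.
        if fnmatchcase(cluster_namespace, managed_glob):
            policy_present = False
            removals += 1
    return removals, policy_present


print(simulate("ztpmultinode", "ztp*"))   # (3, False): removed every round
print(simulate("ztpmultinode", "ztp-*"))  # (0, True): the copy survives
```

In this model the policy never persists while the glob matches the cluster namespace, which corresponds to the policies never becoming compliant.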
Workaround for QE:
Change ArgoCD so it only manages `ztp-*`, and then the cluster deployment succeeds with no contention.
Fix for 4.10:
We should mention in our documentation that this potential collision exists, and warn customers against naming clusters `ztp*`.
Fix for 4.11:
Maybe we can do better with how we select/ignore these policies? Needs more investigation.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2022:0928