Description of problem:
When deploying managed clusters via ZTP, the ArgoCD policies Application gets stuck in a refresh loop. This results in the namespace being removed and re-added, preventing the ZTP process from reaching completion.

Logs loop with the following messages:

2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.369Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.468Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-02-01T15:06:00.544Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}

Version-Release number of selected component (if applicable):
gitops-service-source-vt7ln

How reproducible:
Unknown; additional testing needed.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators: All healthy and stable

$ oc describe -n openshift-gitops application policies | tail -25
Normal  OperationCompleted  30s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  ResourceUpdated     30s  argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated     30s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  OperationCompleted  25s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  OperationStarted    25s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  OperationStarted    20s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal  OperationCompleted  20s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  ResourceUpdated     20s  argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated     20s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated     15s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  OperationCompleted  15s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  ResourceUpdated     15s  argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  OperationStarted    15s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal  OperationStarted    10s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal  OperationCompleted  10s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  ResourceUpdated     10s  argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated     10s  argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  OperationStarted    5s   argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
Normal  OperationCompleted  5s   argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Healthy -> Progressing
Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Progressing -> Healthy
Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Progressing -> Healthy
Must Gather: https://drive.google.com/file/d/1dag33-Ewb9LaLqhIYQUJIkFDgWPkzEt6/view?usp=sharing
I saw something similar before when the ArgoCD policies Application didn't have the right config in its ArgoCD AppProject. Could you give more details on the ArgoCD policies Application and its AppProject?
Root cause:
The ArgoCD config is "right", but there's still a conflict. The cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policy app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.

Workaround for QE:
Change ArgoCD so it only manages `ztp-*`; the cluster deployment then succeeds with no contention.

Fix for 4.10:
We should mention in our documentation that this potential collision exists, and warn customers against naming clusters `ztp*`.

Fix for 4.11:
Maybe we can do better with how we select/ignore these policies? Needs more investigation.
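To illustrate the collision described in the root cause, here is a minimal sketch of how a namespace glob of `ztp*` also claims the cluster namespace `ztpmultinode`, while `ztp-*` does not. It uses Python's `fnmatch` as a stand-in for ArgoCD's own glob matching, which is an assumption; the namespaces `ztp-common` and `cluster1` and the helper `colliding_namespaces` are hypothetical and only there for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the root-cause comment: the default and the QE workaround.
DEFAULT_PATTERN = "ztp*"
WORKAROUND_PATTERN = "ztp-*"


def colliding_namespaces(namespaces, pattern):
    """Return the namespaces that the glob pattern would claim.

    Assumption: fnmatchcase approximates the glob semantics ArgoCD applies
    to namespace patterns; it is not ArgoCD's actual matcher.
    """
    return [ns for ns in namespaces if fnmatchcase(ns, pattern)]


if __name__ == "__main__":
    # "ztpmultinode" is the cluster namespace from this report; the other
    # two names are hypothetical examples.
    namespaces = ["ztp-common", "ztpmultinode", "cluster1"]
    print(colliding_namespaces(namespaces, DEFAULT_PATTERN))
    print(colliding_namespaces(namespaces, WORKAROUND_PATTERN))
```

A check like this could be run against planned cluster names before deployment to flag any name that the policy app's namespace pattern would also match.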
Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364
@stesmith please note that there will be another doc update once bug 2050789 has a better fix.
Documentation changes look good; changing to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.5 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0928