+++ This bug was initially created as a clone of Bug #2049154 +++

This bug is cloned to have a better fix in 4.11 and to remove the naming restriction from the documentation.

-----------------

Description of problem:

When deploying managed clusters via ZTP, the ArgoCD `policies` Application gets stuck in a refresh loop. This results in the namespace being removed and re-added, preventing the ZTP process from reaching completion.

The logs loop with the following messages:

2022-02-01T15:06:00.369Z  INFO  controllers.ClusterGroupUpgrade  [Reconcile]  {"CR": "ztpmultinode"}
2022-02-01T15:06:00.369Z  INFO  controllers.ClusterGroupUpgrade  [getClusterBySelectors]  {"clustersBySelector": []}
2022-02-01T15:06:00.369Z  INFO  controllers.ClusterGroupUpgrade  [getClustersBySelectors]  {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.382Z  DEBUG  controller-runtime.manager.events  Warning  {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.468Z  INFO  controllers.ClusterGroupUpgrade  [Reconcile]  {"CR": "ztpmultinode"}
2022-02-01T15:06:00.468Z  INFO  controllers.ClusterGroupUpgrade  [getClusterBySelectors]  {"clustersBySelector": []}
2022-02-01T15:06:00.468Z  INFO  controllers.ClusterGroupUpgrade  [getClustersBySelectors]  {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.475Z  DEBUG  controller-runtime.manager.events  Warning  {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-02-01T15:06:00.544Z  INFO  controllers.ClusterGroupUpgrade  [Reconcile]  {"CR": "ztpmultinode"}
2022-02-01T15:06:00.544Z  INFO  controllers.ClusterGroupUpgrade  [getClusterBySelectors]  {"clustersBySelector": []}
2022-02-01T15:06:00.544Z  INFO  controllers.ClusterGroupUpgrade  [getClustersBySelectors]  {"clusterNames": ["ztpmultinode"]}
2022-02-01T15:06:00.553Z  DEBUG  controller-runtime.manager.events  Warning  {"object": {"kind":"ClusterGroupUpgrade","namespace":"ztp-install","name":"ztpmultinode","uid":"b9addcf0-c0ae-4afb-9131-ebf18a85a475","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"22689220"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}

Version-Release number of selected component (if applicable):
gitops-service-source-vt7ln

How reproducible:
Unknown - additional testing needed.

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:

ClusterID: deff99aa-9230-4bf6-b0b3-92c8bebd2f5b
ClusterVersion: Stable at "4.9.15"
ClusterOperators: All healthy and stable

$ oc describe -n openshift-gitops application policies | tail -25
  Normal  OperationCompleted  30s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     30s  argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     30s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationCompleted  25s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  OperationStarted    25s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     25s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    20s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  20s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     20s  argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     20s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     15s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationCompleted  15s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     15s  argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  OperationStarted    15s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationStarted    10s  argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  10s  argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     10s  argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     10s  argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  OperationStarted    5s   argocd-application-controller  Initiated automated sync to '7df105e33994b18151f0d47bd8626a61937135b2'
  Normal  OperationCompleted  5s   argocd-application-controller  Partial sync operation to 7df105e33994b18151f0d47bd8626a61937135b2 succeeded
  Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Healthy -> Progressing
  Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Progressing -> Healthy
  Normal  ResourceUpdated     5s   argocd-application-controller  Updated health status: Progressing -> Healthy

--- Additional comment from Joshua Clark on 2022-02-01 16:34:21 UTC ---

Must Gather: https://drive.google.com/file/d/1dag33-Ewb9LaLqhIYQUJIkFDgWPkzEt6/view?usp=sharing

--- Additional comment from on 2022-02-03 14:40:19 UTC ---

I have seen something similar before when the ArgoCD App `policies` doesn't have the right config in its ArgoCD AppProject.
Could you give more details on the ArgoCD App `policies` and its AppProject?

--- Additional comment from Jim Ramsay on 2022-02-04 14:24:14 UTC ---

Root cause:

The ArgoCD config is "right", but there's still a conflict: the cluster being deployed is named `ztpmultinode`, and unfortunately our default ArgoCD policies app is set up to manage all Policy objects in any namespace that matches `ztp*`. So when ACM copies the policies into the cluster namespace, ArgoCD sees them appear and removes them, ACM recreates them, ArgoCD removes them again, and so on.

Workaround for QE:

Change ArgoCD so it only manages `ztp-*`; the cluster deployment then succeeds with no contention. (A sketch of this change follows the comments below.)

Fix for 4.10:

We should mention in our documentation that this potential collision exists, and warn customers against naming clusters `ztp*`.

Fix for 4.11:

Maybe we can do better with how we select/ignore these policies? Needs more investigation.

--- Additional comment from Jim Ramsay on 2022-02-04 14:53:54 UTC ---

Sheldon: I'm actively working on the docs portion of this bug with stesmith as part of TELCODOCS-364.
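For reference, a minimal sketch of the kind of AppProject change the workaround above describes. The project name, namespace, and surrounding fields here are illustrative assumptions, not taken from this bug; only the `ztp*` -> `ztp-*` destination pattern change comes from the comments above.

# Hypothetical AppProject for the ZTP policies app; names are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: policies-app-project
  namespace: openshift-gitops
spec:
  sourceRepos:
    - '*'
  destinations:
    # Before: a glob like 'ztp*' also matched a bare cluster namespace such
    # as 'ztpmultinode', so ArgoCD pruned the policies ACM copied there.
    # After: 'ztp-*' matches only hyphenated ZTP namespaces, leaving cluster
    # namespaces alone.
    - namespace: 'ztp-*'
      server: '*'

With the tighter pattern, a cluster named `ztpmultinode` no longer falls inside the project's destinations, so ArgoCD stops contending with ACM over the copied policies.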
If the restriction on namespaces is captured, can this bug be closed? @jramsay @josclark
@jramsay I haven't seen the documentation update. Should I wait for that before marking this BZ as verified?
QE Verified. Errata RHBA-2022:0928 covers this bug.
Moved back to ON_QE until this can be verified in 4.11
The naming restriction has been captured in the documentation. If the naming restriction needs to be relaxed, please open an RFE.