Bug 2115480
| Summary: | With multiple batches, if first batch timed out, then policies for second batch will not be created | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | yliu1 | |
| Component: | Telco Edge | Assignee: | Steven Skeard <sskeard> | |
| Telco Edge sub component: | TALO | QA Contact: | yliu1 | |
| Status: | CLOSED WONTFIX | Docs Contact: | ||
| Severity: | high | |||
| Priority: | unspecified | CC: | ijolliff, jun, keyoung | |
| Version: | 4.11 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.12.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2117228 (view as bug list) | Environment: | ||
| Last Closed: | 2024-04-30 18:04:53 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2108639, 2117038 | |||
I can see the change with latest talm build. However, there is a 5 minute gap between batch1 timeout and batch2 start. Steven has posted https://github.com/openshift-kni/cluster-group-upgrades-operator/pull/296 to resolve it. spec: actions: afterCompletion: deleteObjects: true beforeEnable: {} backup: false clusterSelector: - group-du-sno enable: true managedPolicies: - common-config-policy preCaching: false remediationStrategy: maxConcurrency: 1 timeout: 18 status: computedMaxConcurrency: 1 conditions: - lastTransitionTime: "2022-08-16T13:27:33Z" message: The ClusterGroupUpgrade CR policies are taking too long to complete reason: UpgradeTimedOut status: "False" type: Ready copiedPolicies: - test-common-config-policy-wzf2f managedPoliciesContent: common-config-policy: "null" managedPoliciesForUpgrade: - name: common-config-policy namespace: ztp-common managedPoliciesNs: common-config-policy: ztp-common placementBindings: - test-common-config-policy-placement-zf9st placementRules: - test-common-config-policy-placement-zf9st remediationPlan: - - ocp-edge88 - - ocp-edge87 safeResourceNames: test-common-config-policy: test-common-config-policy-wzf2f test-common-config-policy-config: test-common-config-policy-config-9s7cw test-common-config-policy-placement: test-common-config-policy-placement-zf9st status: currentBatch: 2 currentBatchRemediationProgress: ocp-edge87: policyIndex: 0 state: InProgress currentBatchStartedAt: "2022-08-16T13:42:34Z" startedAt: "2022-08-16T13:27:33Z" Dev tested on 4.12 and all looked good however will be verified by QE on 4.11 (https://bugzilla.redhat.com/show_bug.cgi?id=2117038) OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary |
Description of problem: With multiple batches, when a batch fail, the policies for next batch will not be created even though CGU indicates it moved to second batch. The first batch timed out after the total timeout reached. Version-Release number of selected component (if applicable): 4.11 How reproducible: 100% Steps to Reproduce: 1. Two spokes were deployed via ztp and still managed by hub cluster 2. Create a cgu that applies to both spokes with max concurrency set to 1 (The policy in CGU contains conflicting CRs, so it will never be compliant) 3. Wait for first batch to time out 4. Wait for second batch to start Actual results: 3. First batch timed out very late (after total timeout value for the whole CGU), and policies for second batch was never created. # cgu total timeout is 18 minutes, but first batch timed out after about 20 minutes, then whole CGU timed out in another 5 minutes (probably in the following audit). spec: actions: afterCompletion: deleteObjects: true beforeEnable: {} backup: false clusterSelector: - group-du-sno enable: true managedPolicies: - common-config-policy preCaching: false remediationStrategy: maxConcurrency: 1 timeout: 18 status: computedMaxConcurrency: 1 conditions: - lastTransitionTime: "2022-08-04T18:22:00Z" message: The ClusterGroupUpgrade CR policies are taking too long to complete reason: UpgradeTimedOut status: "False" type: Ready copiedPolicies: - test-common-config-policy-zdkzx managedPoliciesContent: common-config-policy: "null" managedPoliciesForUpgrade: - name: common-config-policy namespace: ztp-common managedPoliciesNs: common-config-policy: ztp-common placementBindings: - test-common-config-policy-placement-rx4p2 placementRules: - test-common-config-policy-placement-rx4p2 remediationPlan: - - ocp-edge88 - - ocp-edge87 safeResourceNames: test-common-config-policy: test-common-config-policy-zdkzx test-common-config-policy-config: test-common-config-policy-config-cwtnd test-common-config-policy-placement: test-common-config-policy-placement-rx4p2 status: currentBatch: 2 currentBatchRemediationProgress: ocp-edge87: policyIndex: 0 state: InProgress currentBatchStartedAt: "2022-08-04T18:47:00Z" startedAt: "2022-08-04T18:22:00Z" # note that ocp-edge88 belongs to batch1 [kni ~]$ oc get policies -A |grep common-config-policy default test-common-config-policy-zdkzx enforce Compliant 23m ocp-edge87 ztp-common.common-config-policy inform NonCompliant 9d ocp-edge88 default.test-common-config-policy-zdkzx enforce Compliant 23m ocp-edge88 ztp-common.common-config-policy inform NonCompliant 9d ztp-common common-config-policy inform NonCompliant 9d Expected results: first batch times out about 10 minutes after cgu started, and enforce policy for second batch is created. Additional info: 2022-08-04T18:32:00.302Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"Status.CurrentBatch": 1} 2022-08-04T18:32:00.308Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"isBatchComplete": false} 2022-08-04T18:32:00.308Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}} 2022-08-04T18:32:00.334Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "requeueAfter": 300} 2022-08-04T18:37:00.335Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "default/test"} 2022-08-04T18:37:00.436Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "default/test", "version": "45981553"} 2022-08-04T18:37:00.436Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge88", "ocp-edge87"]} 2022-08-04T18:37:00.436Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge88", "ocp-edge87"]} 2022-08-04T18:37:00.436Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"Status.CurrentBatch": 1} 2022-08-04T18:37:00.444Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"isBatchComplete": false} 2022-08-04T18:37:00.444Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}} 2022-08-04T18:37:00.470Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "requeueAfter": 300} 2022-08-04T18:42:00.470Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "default/test"} 2022-08-04T18:42:00.571Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "default/test", "version": "45981553"} 2022-08-04T18:42:00.571Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:42:00.571Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:42:00.571Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"Status.CurrentBatch": 1} 2022-08-04T18:42:00.578Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"isBatchComplete": false} 2022-08-04T18:42:00.578Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}} 2022-08-04T18:42:00.591Z INFO controllers.ClusterGroupUpgrade Batch upgrade timed out 2022-08-04T18:42:00.603Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "requeueAfter": 300} 2022-08-04T18:47:00.604Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "default/test"} 2022-08-04T18:47:00.705Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "default/test", "version": "46033245"} 2022-08-04T18:47:00.705Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:47:00.705Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:47:00.705Z INFO controllers.ClusterGroupUpgrade [Reconcile] {"Status.CurrentBatch": 2} 2022-08-04T18:47:00.705Z INFO controllers.ClusterGroupUpgrade [initializeRemediationPolicyForBatch] {"CurrentBatchRemediationProgress": {"ocp-edge87":{"state":"NotStarted"}}} 2022-08-04T18:47:00.715Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"isBatchComplete": false} 2022-08-04T18:47:00.715Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}} 2022-08-04T18:47:00.731Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "requeueRightAway": true} 2022-08-04T18:47:00.737Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "default/test"} 2022-08-04T18:47:00.837Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "default/test", "version": "46044299"} 2022-08-04T18:47:00.838Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:47:00.838Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge87", "ocp-edge88"]} 2022-08-04T18:47:00.838Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"default","name":"test","uid":"6e70b778-1778-4caf-bdda-7dd81dee2e99","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"46044299"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"} 2022-08-04T18:47:00.845Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"isBatchComplete": false} 2022-08-04T18:47:00.845Z INFO controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch] {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}} 2022-08-04T18:47:00.856Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "requeueAfter": 3600}