Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2115480

Summary:	With multiple batches, if first batch timed out, then policies for second batch will not be created
Product:	OpenShift Container Platform	Reporter:	yliu1
Component:	Telco Edge	Assignee:	Steven Skeard <sskeard>
Telco Edge sub component:	TALO	QA Contact:	yliu1
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	ijolliff, jun, keyoung
Version:	4.11
Target Milestone:	---
Target Release:	4.12.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2117228 (view as bug list)		Environment:
Last Closed:	2024-04-30 18:04:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2108639, 2117038

Description yliu1 2022-08-04 18:54:30 UTC

Description of problem:
With multiple batches, when a batch fail, the policies for next batch will not be created even though CGU indicates it moved to second batch.
The first batch timed out after the total timeout reached. 

Version-Release number of selected component (if applicable):
4.11

How reproducible:
100%

Steps to Reproduce:
1. Two spokes were deployed via ztp and still managed by hub cluster
2. Create a cgu that applies to both spokes with max concurrency set to 1
(The policy in CGU contains conflicting CRs, so it will never be compliant)
3. Wait for first batch to time out
4. Wait for second batch to start

Actual results:
3. First batch timed out very late (after total timeout value for the whole CGU), and policies for second batch was never created.

# cgu total timeout is 18 minutes, but first batch timed out after about 20 minutes, then whole CGU timed out in another 5 minutes (probably in the following audit).
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusterSelector:
    - group-du-sno
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-04T18:22:00Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-zdkzx
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-rx4p2
    placementRules:
    - test-common-config-policy-placement-rx4p2
    remediationPlan:
    - - ocp-edge88
    - - ocp-edge87
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-zdkzx
      test-common-config-policy-config: test-common-config-policy-config-cwtnd
      test-common-config-policy-placement: test-common-config-policy-placement-rx4p2
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge87:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: "2022-08-04T18:47:00Z"
      startedAt: "2022-08-04T18:22:00Z"



# note that ocp-edge88 belongs to batch1
[kni ~]$ oc get policies -A |grep common-config-policy
default      test-common-config-policy-zdkzx           enforce              Compliant          23m
ocp-edge87   ztp-common.common-config-policy           inform               NonCompliant       9d
ocp-edge88   default.test-common-config-policy-zdkzx   enforce              Compliant          23m
ocp-edge88   ztp-common.common-config-policy           inform               NonCompliant       9d
ztp-common   common-config-policy                      inform               NonCompliant       9d


Expected results:
first batch times out about 10 minutes after cgu started, and enforce policy for second batch is created.


Additional info:
2022-08-04T18:32:00.302Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:32:00.334Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:37:00.335Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:37:00.470Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:42:00.470Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:42:00.591Z        INFO    controllers.ClusterGroupUpgrade Batch upgrade timed out
2022-08-04T18:42:00.603Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:47:00.604Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46033245"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 2}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [initializeRemediationPolicyForBatch]   {"CurrentBatchRemediationProgress": {"ocp-edge87":{"state":"NotStarted"}}}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.731Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueRightAway": true}
2022-08-04T18:47:00.737Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.837Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46044299"}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"default","name":"test","uid":"6e70b778-1778-4caf-bdda-7dd81dee2e99","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"46044299"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.856Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 3600}

Comment 1 yliu1 2022-08-16 16:40:18 UTC

I can see the change with latest talm build. However, there is a 5 minute gap between batch1 timeout and batch2 start. Steven has posted https://github.com/openshift-kni/cluster-group-upgrades-operator/pull/296 to resolve it.

  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusterSelector:
    - group-du-sno
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-16T13:27:33Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-wzf2f
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-zf9st
    placementRules:
    - test-common-config-policy-placement-zf9st
    remediationPlan:
    - - ocp-edge88
    - - ocp-edge87
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-wzf2f
      test-common-config-policy-config: test-common-config-policy-config-9s7cw
      test-common-config-policy-placement: test-common-config-policy-placement-zf9st
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge87:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: "2022-08-16T13:42:34Z"
      startedAt: "2022-08-16T13:27:33Z"

Comment 2 Steven Skeard 2022-09-02 17:00:55 UTC

Dev tested on 4.12 and all looked good however will be verified by QE on 4.11 (https://bugzilla.redhat.com/show_bug.cgi?id=2117038)

Comment 5 Rory Thrasher 2024-04-30 18:04:53 UTC

OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary