Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2117228

Summary: With multiple batches, if first batch timed out, then policies for second batch will not be created
Product: OpenShift Container Platform Reporter: Steven Skeard <sskeard>
Component: Telco EdgeAssignee: Steven Skeard <sskeard>
Telco Edge sub component: TALO QA Contact: yliu1
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: ijolliff, jun, keyoung, yliu1
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2115480 Environment:
Last Closed: 2022-09-21 14:07:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2117038    
Bug Blocks: 2108639    

Description Steven Skeard 2022-08-10 11:08:32 UTC
+++ This bug was initially created as a clone of Bug #2115480 +++

Description of problem:
With multiple batches, when a batch fail, the policies for next batch will not be created even though CGU indicates it moved to second batch.
The first batch timed out after the total timeout reached. 

Version-Release number of selected component (if applicable):
4.11

How reproducible:
100%

Steps to Reproduce:
1. Two spokes were deployed via ztp and still managed by hub cluster
2. Create a cgu that applies to both spokes with max concurrency set to 1
(The policy in CGU contains conflicting CRs, so it will never be compliant)
3. Wait for first batch to time out
4. Wait for second batch to start

Actual results:
3. First batch timed out very late (after total timeout value for the whole CGU), and policies for second batch was never created.

# cgu total timeout is 18 minutes, but first batch timed out after about 20 minutes, then whole CGU timed out in another 5 minutes (probably in the following audit).
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusterSelector:
    - group-du-sno
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-04T18:22:00Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-zdkzx
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-rx4p2
    placementRules:
    - test-common-config-policy-placement-rx4p2
    remediationPlan:
    - - ocp-edge88
    - - ocp-edge87
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-zdkzx
      test-common-config-policy-config: test-common-config-policy-config-cwtnd
      test-common-config-policy-placement: test-common-config-policy-placement-rx4p2
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge87:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: "2022-08-04T18:47:00Z"
      startedAt: "2022-08-04T18:22:00Z"



# note that ocp-edge88 belongs to batch1
[kni ~]$ oc get policies -A |grep common-config-policy
default      test-common-config-policy-zdkzx           enforce              Compliant          23m
ocp-edge87   ztp-common.common-config-policy           inform               NonCompliant       9d
ocp-edge88   default.test-common-config-policy-zdkzx   enforce              Compliant          23m
ocp-edge88   ztp-common.common-config-policy           inform               NonCompliant       9d
ztp-common   common-config-policy                      inform               NonCompliant       9d


Expected results:
first batch times out about 10 minutes after cgu started, and enforce policy for second batch is created.


Additional info:
2022-08-04T18:32:00.302Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:32:00.334Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:37:00.335Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:37:00.470Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:42:00.470Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:42:00.591Z        INFO    controllers.ClusterGroupUpgrade Batch upgrade timed out
2022-08-04T18:42:00.603Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:47:00.604Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46033245"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 2}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [initializeRemediationPolicyForBatch]   {"CurrentBatchRemediationProgress": {"ocp-edge87":{"state":"NotStarted"}}}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.731Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueRightAway": true}
2022-08-04T18:47:00.737Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.837Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46044299"}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"default","name":"test","uid":"6e70b778-1778-4caf-bdda-7dd81dee2e99","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"46044299"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.856Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 3600}

Comment 3 yliu1 2022-09-16 15:57:36 UTC
Verified with 4.10 TALM.

With 18 minutes of total timeout, the second batch started 10 minutes after the cgu started as expected. 

    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        spoke-1:
          state: Completed
      currentBatchStartedAt: "2022-09-16T14:05:08Z"
      startedAt: "2022-09-16T13:55:07Z"

Comment 5 errata-xmlrpc 2022-09-21 14:07:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.33 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6533