Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2117038

Summary:	With multiple batches, if first batch timed out, then policies for second batch will not be created
Product:	OpenShift Container Platform	Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Telco Edge	Assignee:	Steven Skeard <sskeard>
Telco Edge sub component:	TALO	QA Contact:	yliu1
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	ijolliff, jun, keyoung
Version:	4.11
Target Milestone:	---
Target Release:	4.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-09-27 08:10:20 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2115480
Bug Blocks:	2108639, 2117228

Description OpenShift BugZilla Robot 2022-08-09 19:57:07 UTC

+++ This bug was initially created as a clone of Bug #2115480 +++

Description of problem:
With multiple batches, when a batch fail, the policies for next batch will not be created even though CGU indicates it moved to second batch.
The first batch timed out after the total timeout reached. 

Version-Release number of selected component (if applicable):
4.11

How reproducible:
100%

Steps to Reproduce:
1. Two spokes were deployed via ztp and still managed by hub cluster
2. Create a cgu that applies to both spokes with max concurrency set to 1
(The policy in CGU contains conflicting CRs, so it will never be compliant)
3. Wait for first batch to time out
4. Wait for second batch to start

Actual results:
3. First batch timed out very late (after total timeout value for the whole CGU), and policies for second batch was never created.

# cgu total timeout is 18 minutes, but first batch timed out after about 20 minutes, then whole CGU timed out in another 5 minutes (probably in the following audit).
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusterSelector:
    - group-du-sno
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-04T18:22:00Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-zdkzx
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-rx4p2
    placementRules:
    - test-common-config-policy-placement-rx4p2
    remediationPlan:
    - - ocp-edge88
    - - ocp-edge87
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-zdkzx
      test-common-config-policy-config: test-common-config-policy-config-cwtnd
      test-common-config-policy-placement: test-common-config-policy-placement-rx4p2
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge87:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: "2022-08-04T18:47:00Z"
      startedAt: "2022-08-04T18:22:00Z"



# note that ocp-edge88 belongs to batch1
[kni ~]$ oc get policies -A |grep common-config-policy
default      test-common-config-policy-zdkzx           enforce              Compliant          23m
ocp-edge87   ztp-common.common-config-policy           inform               NonCompliant       9d
ocp-edge88   default.test-common-config-policy-zdkzx   enforce              Compliant          23m
ocp-edge88   ztp-common.common-config-policy           inform               NonCompliant       9d
ztp-common   common-config-policy                      inform               NonCompliant       9d


Expected results:
first batch times out about 10 minutes after cgu started, and enforce policy for second batch is created.


Additional info:
2022-08-04T18:32:00.302Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:32:00.308Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:32:00.334Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:37:00.335Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge88", "ocp-edge87"]}
2022-08-04T18:37:00.436Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:37:00.444Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:37:00.470Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:42:00.470Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "45981553"}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:42:00.571Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 1}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:42:00.578Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge88":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:42:00.591Z        INFO    controllers.ClusterGroupUpgrade Batch upgrade timed out
2022-08-04T18:42:00.603Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 300}
2022-08-04T18:47:00.604Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46033245"}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [Reconcile]     {"Status.CurrentBatch": 2}
2022-08-04T18:47:00.705Z        INFO    controllers.ClusterGroupUpgrade [initializeRemediationPolicyForBatch]   {"CurrentBatchRemediationProgress": {"ocp-edge87":{"state":"NotStarted"}}}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.715Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.731Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueRightAway": true}
2022-08-04T18:47:00.737Z        INFO    controllers.ClusterGroupUpgrade Start reconciling CGU   {"name": "default/test"}
2022-08-04T18:47:00.837Z        INFO    controllers.ClusterGroupUpgrade Loaded CGU      {"name": "default/test", "version": "46044299"}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        INFO    controllers.ClusterGroupUpgrade [getClustersBySelectors]        {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-08-04T18:47:00.838Z        DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"ClusterGroupUpgrade","namespace":"default","name":"test","uid":"6e70b778-1778-4caf-bdda-7dd81dee2e99","apiVersion":"ran.openshift.io/v1alpha1","resourceVersion":"46044299"}, "reason": "UpgradeTimedOut", "message": "The ClusterGroupUpgrade CR policies are taking too long to complete"}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"isBatchComplete": false}
2022-08-04T18:47:00.845Z        INFO    controllers.ClusterGroupUpgrade [getNextRemediationPoliciesForBatch]    {"plan": {"ocp-edge87":{"state":"InProgress","policyIndex":0}}}
2022-08-04T18:47:00.856Z        INFO    controllers.ClusterGroupUpgrade Finish reconciling CGU  {"name": "default/test", "requeueAfter": 3600}

Comment 1 yliu1 2022-08-11 18:22:55 UTC

I can still reproduce this issue with latest brew build. In following CGU, you can see the timeout is 18, but second batch started 24 minutes after cgu started.
The brew build used is this one: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2118287, which was built on Aug 10th. Given the PR was merged on the Aug 9th. (Unless there is a build issue...)

[kni ~]$ oc get cgu -o yaml 
apiVersion: v1
items:
- apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ran.openshift.io/v1alpha1","kind":"ClusterGroupUpgrade","metadata":{"annotations":{},"name":"test","namespace":"default"},"spec":{"backup":false,"clusterSelector":["group-du-sno"],"enable":true,"managedPolicies":["common-config-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":18}}}
    creationTimestamp: "2022-08-11T17:11:02Z"
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 2
    name: test
    namespace: default
    resourceVersion: "72544555"
    uid: eb0f56e7-e8a1-4a7a-ade9-2fde7b1d4fa9
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    clusterSelector:
    - group-du-sno
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-11T17:11:02Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-jmb4j
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-fr9th
    placementRules:
    - test-common-config-policy-placement-fr9th
    remediationPlan:
    - - ocp-edge87
    - - ocp-edge88
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-jmb4j
      test-common-config-policy-config: test-common-config-policy-config-9wzxk
      test-common-config-policy-placement: test-common-config-policy-placement-fr9th
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge88:
          policyIndex: 0
          state: InProgress
      currentBatchStartedAt: "2022-08-11T17:36:02Z"
      startedAt: "2022-08-11T17:11:02Z"
kind: List
metadata:
  resourceVersion: ""

Comment 6 yliu1 2022-09-09 18:33:53 UTC

Verified on latest 4.11 TALM. Second batch started at expected time.

spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: false
  clusters:
  - spoke-1
  - spoke-2
  enable: true
  managedPolicies:
  - du-upgrade-cluster-version-policy1
  preCaching: false
  remediationStrategy:
    maxConcurrency: 1
    timeout: 18
status:
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2022-09-09T18:22:34Z"
    message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
    reason: UpgradeNotCompleted
    status: "False"
    type: Ready
  copiedPolicies:
  - spoke-2-du-upgrade-cluster-version-policy1-hg4s7
  managedPoliciesForUpgrade:
  - name: du-upgrade-cluster-version-policy1
    namespace: ztp-upgrade
  managedPoliciesNs:
    du-upgrade-cluster-version-policy1: ztp-upgrade
  placementBindings:
  - spoke-2-du-upgrade-cluster-version-policy1-placement-t4625
  placementRules:
  - spoke-2-du-upgrade-cluster-version-policy1-placement-t4625
  remediationPlan:
  - - spoke-1
  - - spoke-2
  safeResourceNames:
    spoke-2-du-upgrade-cluster-version-policy1: spoke-2-du-upgrade-cluster-version-policy1-hg4s7
    spoke-2-du-upgrade-cluster-version-policy1-config: spoke-2-du-upgrade-cluster-version-policy1-config-nrls5
    spoke-2-du-upgrade-cluster-version-policy1-placement: spoke-2-du-upgrade-cluster-version-policy1-placement-t4625
  status:
    currentBatch: 2
    currentBatchRemediationProgress:
      spoke-2:
        policyIndex: 0
        state: InProgress
    currentBatchStartedAt: "2022-09-09T18:32:35Z"
    startedAt: "2022-09-09T18:22:34Z"

Comment 8 errata-xmlrpc 2022-09-27 08:10:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.11.6 CNF vRAN extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6669