
Bug 2108639

Summary:	CGU status does not reflect timeouts from earlier batches
Product:	OpenShift Container Platform
Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Telco Edge
Telco Edge sub component:	TALO
Assignee:	jun
QA Contact:	yliu1
Status:	CLOSED ERRATA
Severity:	medium
Priority:	high
CC:	bzvonar, ijolliff, jun, keyoung
Version:	4.10
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2022-08-18 04:08:08 UTC
Bug Depends On:	2087125, 2115480, 2117038, 2117228
Bug Blocks:	2108692

Description OpenShift BugZilla Robot 2022-07-19 15:08:56 UTC

+++ This bug was initially created as a clone of Bug #2087125 +++

Description of problem:
For an upgrade with multiple batches, the final CGU status is set to completed as long as every cluster in the last batch becomes compliant within the overall timeout period. Timeouts from earlier batches are not reflected.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:
After the last batch is completed, the final status should be set to completed only if all batches are compliant.
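
The expected behavior can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the operator's implementation: the batchResult type, the finalStatus function, and the "Completed"/"TimedOut" labels are hypothetical names chosen for the example.

```go
package main

import "fmt"

// batchResult is a hypothetical record of how one batch ended.
type batchResult struct {
	Compliant bool // every cluster in the batch became compliant
	TimedOut  bool // the batch ran out of time before finishing
}

// finalStatus returns "Completed" only when every batch ended compliant
// and none timed out. Any earlier failure is carried into the final
// result, even if the last batch succeeded.
func finalStatus(batches []batchResult) string {
	for _, b := range batches {
		if b.TimedOut || !b.Compliant {
			return "TimedOut"
		}
	}
	return "Completed"
}

func main() {
	// Batch 1 timed out, batch 2 succeeded: the overall result must
	// still report the timeout.
	batches := []batchResult{
		{Compliant: false, TimedOut: true},
		{Compliant: true, TimedOut: false},
	}
	fmt.Println(finalStatus(batches)) // TimedOut
}
```

The point of the sketch is that the decision looks at every batch, not only the one that finished last.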

Additional info:

Comment 1 yliu1 2022-07-27 19:12:17 UTC

A regression was encountered when testing this bz. Not sure if the issue was introduced in this PR, but this scenario used to work in 4.10.

Steps:
- 2 spokes are managed
- create a CGU to apply a simple config on both clusters with maxConcurrency set to 1 and timeout set to 3
- before the original policy becomes compliant on the hub cluster, reboot or power off spoke1
- observe that the CGU status indicates a timeout (everything is as expected up to this point)
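
With maxConcurrency set to 1, the two spokes end up in two separate batches, which is why a timeout on spoke1 belongs to an earlier batch than spoke2. A minimal Go sketch of that batching, assuming a hypothetical planBatches helper (not taken from the operator's code):

```go
package main

import "fmt"

// planBatches splits clusters into consecutive batches of at most
// maxConcurrency clusters each. Hypothetical helper for illustration.
func planBatches(clusters []string, maxConcurrency int) [][]string {
	if maxConcurrency < 1 {
		maxConcurrency = 1 // assumption: treat invalid values as 1
	}
	var plan [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		plan = append(plan, clusters[start:end])
	}
	return plan
}

func main() {
	plan := planBatches([]string{"ocp-edge87", "ocp-edge88"}, 1)
	fmt.Println(plan) // [[ocp-edge87] [ocp-edge88]]
}
```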

Unexpected behavior after the CGU moves to batch 2:
- no enforce policy was ever created under the spoke2 namespace, even after spoke1 was recovered.
- the cluster-group-upgrades-controller-manager pod is in CrashLoopBackOff state with the following error in the pod logs:

2022-07-27T19:07:50.993Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge87"]}
2022-07-27T19:07:50.993Z	INFO	controllers.ClusterGroupUpgrade	Upgrade is completed
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	WARN: No child policies found for cluster	{"Name": "local-cluster"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	ZTP for the cluster has completed. ztp-done label found.	{"Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "ocp-edge88"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	ZTP for the cluster has completed. ztp-done label found.	{"Name": "ocp-edge88"}
2022-07-27T19:07:51.016Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "ztp-install/ocp-edge87", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.016Z	INFO	controllers.ClusterGroupUpgrade	Start reconciling CGU	{"name": "ztp-install/ocp-edge88"}
2022-07-27T19:07:51.117Z	INFO	controllers.ClusterGroupUpgrade	Loaded CGU	{"name": "ztp-install/ocp-edge88", "version": "18340784"}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge88"]}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	Upgrade is completed
2022-07-27T19:07:51.139Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "ztp-install/ocp-edge88", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.139Z	INFO	controllers.ClusterGroupUpgrade	Start reconciling CGU	{"name": "default/test"}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	Loaded CGU	{"name": "default/test", "version": "21234640"}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.240Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "default/test", "result": {"Requeue":false,"RequeueAfter":0}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13a388d]

goroutine 619 [running]:
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).isUpgradeComplete(0xc000e13e80, {0x18ca8d8, 0xc0008b1200}, 0xc0006f6000)
	/remote-source/app/controllers/clustergroupupgrade_controller.go:1234 +0xed
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).Reconcile(0xc000e13e40, {0x18ca8d8, 0xc0008b1200}, {{{0xc0007dac98, 0x7}, {0xc0007dac94, 0x4}}})
	/remote-source/app/controllers/clustergroupupgrade_controller.go:401 +0xbe5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c1220, {0x18ca830, 0xc000b9a080}, {0x154ab60, 0xc00085e140})
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c1220, {0x18ca830, 0xc000b9a080})
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x354


openshift-operators                                cluster-group-upgrades-controller-manager-6b94f4959-z78g8         1/2     CrashLoopBackOff   7 (3m42s ago)   46h
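
A nil pointer dereference like the one in the trace above typically happens when code reads a field through a pointer that was never populated. The following Go sketch shows a completion check that treats a missing status as "not compliant yet" instead of panicking. The policy and policyStatus types and the isCompliant function are hypothetical; the sketch does not claim to reproduce the actual cause or fix in isUpgradeComplete.

```go
package main

import "fmt"

// policyStatus is a hypothetical stand-in for a policy's reported status.
type policyStatus struct {
	ComplianceState string
}

// policy is a hypothetical stand-in for a policy object whose status
// may not have been populated yet.
type policy struct {
	Name   string
	Status *policyStatus
}

// isCompliant reports whether the policy is compliant. A nil policy or
// a nil status is treated as not compliant rather than dereferenced.
func isCompliant(p *policy) bool {
	if p == nil || p.Status == nil {
		return false
	}
	return p.Status.ComplianceState == "Compliant"
}

func main() {
	noStatus := &policy{Name: "example-policy"}
	done := &policy{
		Name:   "example-policy",
		Status: &policyStatus{ComplianceState: "Compliant"},
	}
	fmt.Println(isCompliant(noStatus)) // false
	fmt.Println(isCompliant(done))     // true
}
```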

Comment 2 yliu1 2022-07-27 19:41:09 UTC

More info for failure mentioned in comment #1:

cgu:
[kni ~]$ oc get cgu -n default test -o yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ran.openshift.io/v1alpha1","kind":"ClusterGroupUpgrade","metadata":{"annotations":{},"name":"test","namespace":"default"},"spec":{"backup":false,"clusterSelector":["group-du-sno"],"enable":true,"managedPolicies":["common-config-policy","common-subscriptions-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":3}}}
  creationTimestamp: "2022-07-27T18:49:25Z"
  finalizers:
  - ran.openshift.io/cleanup-finalizer
  generation: 2
  name: test
  namespace: default
  resourceVersion: "21234640"
  uid: 85760400-2728-4782-a7ba-136e2a3cf4e4
spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: false
  clusterSelector:
  - group-du-sno
  enable: true
  managedPolicies:
  - common-config-policy
  - common-subscriptions-policy
  preCaching: false
  remediationStrategy:
    maxConcurrency: 1
    timeout: 3
status:
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2022-07-27T18:49:25Z"
    message: The ClusterGroupUpgrade CR policies are taking too long to complete
    reason: UpgradeTimedOut
    status: "False"
    type: Ready
  copiedPolicies:
  - test-common-config-policy-s2vzg
  managedPoliciesCompliantBeforeUpgrade:
  - common-subscriptions-policy
  managedPoliciesContent:
    common-config-policy: "null"
  managedPoliciesForUpgrade:
  - name: common-config-policy
    namespace: ztp-common
  managedPoliciesNs:
    common-config-policy: ztp-common
  placementBindings:
  - test-common-config-policy-placement-h5sft
  placementRules:
  - test-common-config-policy-placement-h5sft
  remediationPlan:
  - - ocp-edge87
  - - ocp-edge88
  safeResourceNames:
    test-common-config-policy: test-common-config-policy-s2vzg
    test-common-config-policy-config: test-common-config-policy-config-24dg6
    test-common-config-policy-placement: test-common-config-policy-placement-h5sft
  status:
    currentBatch: 2
    currentBatchRemediationProgress:
      ocp-edge87:
        state: Completed
    startedAt: "2022-07-27T18:49:25Z"


Policies:
[kni ~]$ oc get policies -A |grep common-config-policy
default      test-common-config-policy-s2vzg          enforce                                 44m
ocp-edge87   ztp-common.common-config-policy          inform               NonCompliant       25h
ocp-edge88   ztp-common.common-config-policy          inform               NonCompliant       25h
ztp-common   common-config-policy                     inform               NonCompliant       25h


Note that the parent policy created by the CGU has no compliance status.

$ oc get policies -n default test-common-config-policy-s2vzg -oyaml 
..
status:
  placement:
  - placementBinding: test-common-config-policy-placement-h5sft
    placementRule: test-common-config-policy-placement-h5sft


$ oc get managedcluster
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hlxcl11.lab.eng.tlv2.redhat.com:6443      True     True        7d3h
ocp-edge87      true           https://api.ocp-edge87.lab.eng.tlv2.redhat.com:6443   True     True        25h
ocp-edge88      true           https://api.ocp-edge88.lab.eng.tlv2.redhat.com:6443   True     True        25h

Comment 5 yliu1 2022-08-05 18:18:37 UTC

A new issue has been introduced. The CGU does not start and reports a missing-policy error, although the policy exists and is NonCompliant. Please let me know if a new bz should be opened instead.

[kni ~]$ oc get cgu -o yaml 
apiVersion: v1
items:
- apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ran.openshift.io/v1alpha1","kind":"ClusterGroupUpgrade","metadata":{"annotations":{},"name":"test","namespace":"default"},"spec":{"CusterSelector":["group-du-sno"],"backup":false,"enable":true,"managedPolicies":["common-config-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":18}}}
    creationTimestamp: "2022-08-05T18:13:43Z"
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 2
    name: test
    namespace: default
    resourceVersion: "49721892"
    uid: 9b20a189-1f39-4824-afae-cd95de546573
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    conditions:
    - lastTransitionTime: "2022-08-05T18:13:43Z"
      message: 'The ClusterGroupUpgrade CR has managed policies that are missing:
        [common-config-policy]'
      reason: UpgradeCannotStart
      status: "False"
      type: Ready
    status: {}
kind: List
metadata:
  resourceVersion: ""

[kni ~]$ oc get policies -A
NAMESPACE    NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
ocp-edge87   ztp-common.common-config-policy          inform               NonCompliant       22h
ocp-edge87   ztp-common.common-subscriptions-policy   inform               Compliant          10d
ocp-edge87   ztp-group.group-du-sno-config-policy     inform               NonCompliant       10d
ocp-edge87   ztp-site.ocp-edge87-config-policy        inform               Compliant          10d
ocp-edge87   ztp-site.ocp-edge87-ptp-config-policy    inform               Compliant          10d
ocp-edge88   ztp-common.common-config-policy          inform               NonCompliant       22h
ocp-edge88   ztp-common.common-subscriptions-policy   inform               Compliant          10d
ocp-edge88   ztp-group.group-du-sno-config-policy     inform               Compliant          10d
ocp-edge88   ztp-site.ocp-edge88-config-policy        inform               Compliant          10d
ocp-edge88   ztp-site.ocp-edge88-ptp-config-policy    inform               Compliant          10d
ztp-common   common-config-policy                     inform               NonCompliant       22h
ztp-common   common-subscriptions-policy              inform               Compliant          10d
ztp-group    group-du-sno-config-policy               inform               NonCompliant       10d
ztp-group    group-du-sno-validator-du-policy         inform                                  10d
ztp-site     ocp-edge87-config-policy                 inform               Compliant          10d
ztp-site     ocp-edge87-ptp-config-policy             inform               Compliant          10d
ztp-site     ocp-edge88-config-policy                 inform               Compliant          10d
ztp-site     ocp-edge88-ptp-config-policy             inform               Compliant          10d

Comment 6 yliu1 2022-08-05 19:25:25 UTC

I had a typo in the clusterSelector field in my YAML, which caused the above error. Thanks to Jun for catching that. Changing back to on_qa.
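
In the CGU dump from comment 5, the last-applied-configuration contains "CusterSelector" while the resulting spec has no clusterSelector at all, so the misspelled field was dropped silently. One way to catch this kind of typo before applying a manifest is strict decoding. The Go sketch below uses the standard library's json.Decoder with DisallowUnknownFields; the cguSpec struct is a hypothetical, heavily trimmed subset of the spec, used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cguSpec is a hypothetical, trimmed-down subset of a CGU spec.
type cguSpec struct {
	ClusterSelector []string `json:"clusterSelector"`
	ManagedPolicies []string `json:"managedPolicies"`
}

// decodeStrict decodes raw JSON into a cguSpec and returns an error if
// the input contains a field the struct does not define.
func decodeStrict(raw string) (cguSpec, error) {
	var spec cguSpec
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&spec)
	return spec, err
}

func main() {
	// Same misspelling as in the report: "CusterSelector".
	raw := `{"CusterSelector":["group-du-sno"],"managedPolicies":["common-config-policy"]}`
	if _, err := decodeStrict(raw); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

With a lenient decoder the unknown key would simply be ignored, which matches what the dump shows: a spec with managedPolicies but no cluster selection.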

Comment 7 yliu1 2022-08-16 16:54:34 UTC

The overall status for the CGU indicates a timeout when a previous batch failed but the last batch passed.

  status:
    computedMaxConcurrency: 1
    conditions:
    - lastTransitionTime: "2022-08-16T16:31:48Z"
      message: The ClusterGroupUpgrade CR policies are taking too long to complete
      reason: UpgradeTimedOut
      status: "False"
      type: Ready
    copiedPolicies:
    - test-common-config-policy-jhf9c
    managedPoliciesContent:
      common-config-policy: "null"
    managedPoliciesForUpgrade:
    - name: common-config-policy
      namespace: ztp-common
    managedPoliciesNs:
      common-config-policy: ztp-common
    placementBindings:
    - test-common-config-policy-placement-gkcp2
    placementRules:
    - test-common-config-policy-placement-gkcp2
    remediationPlan:
    - - ocp-edge87
    - - ocp-edge88
    safeResourceNames:
      test-common-config-policy: test-common-config-policy-jhf9c
      test-common-config-policy-config: test-common-config-policy-config-8fnxw
      test-common-config-policy-placement: test-common-config-policy-placement-gkcp2
    status:
      currentBatch: 2
      currentBatchRemediationProgress:
        ocp-edge88:
          state: Completed
      currentBatchStartedAt: "2022-08-16T16:46:49Z"
      startedAt: "2022-08-16T16:31:48Z"

Comment 9 errata-xmlrpc 2022-08-18 04:08:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.11 CNF vRAN extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6110