Bug 2108639
| Summary: | CGU status does not reflect timeouts from earlier batches | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
| Component: | Telco Edge | Assignee: | jun |
| Telco Edge sub component: | TALO | QA Contact: | yliu1 |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | bzvonar, ijolliff, jun, keyoung |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-18 04:08:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2087125, 2115480, 2117038, 2117228 | | |
| Bug Blocks: | 2108692 | | |
Description
OpenShift BugZilla Robot
2022-07-19 15:08:56 UTC
A regression was encountered when testing this bz. Not sure if the issue was introduced in this PR, but this scenario used to work in 4.10.
Steps:
- 2 spokes are managed
- create a CGU to apply a simple config on both clusters, with maxConcurrency set to 1 and timeout set to 3
- before the original policy becomes compliant on the hub cluster, reboot or power off spoke1
- observe that the CGU status indicates a timeout (everything is as expected up to this point)
Unexpected behavior after the CGU moves to batch 2:
- no enforce policy was ever created under the spoke2 namespace, even after spoke1 had recovered
- the cluster-group-upgrades-controller-manager pod is in CrashLoopBackOff state, with the following error in the pod logs:
2022-07-27T19:07:50.993Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge87"]}
2022-07-27T19:07:50.993Z INFO controllers.ClusterGroupUpgrade Upgrade is completed
2022-07-27T19:07:51.002Z INFO controllers.ManagedClusterForCGU WARN: No child policies found for cluster {"Name": "local-cluster"}
2022-07-27T19:07:51.002Z INFO controllers.ManagedClusterForCGU Reconciling managedCluster to create clusterGroupUpgrade {"Request.Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z INFO controllers.ManagedClusterForCGU ZTP for the cluster has completed. ztp-done label found. {"Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z INFO controllers.ManagedClusterForCGU Reconciling managedCluster to create clusterGroupUpgrade {"Request.Name": "ocp-edge88"}
2022-07-27T19:07:51.002Z INFO controllers.ManagedClusterForCGU ZTP for the cluster has completed. ztp-done label found. {"Name": "ocp-edge88"}
2022-07-27T19:07:51.016Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "ztp-install/ocp-edge87", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.016Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "ztp-install/ocp-edge88"}
2022-07-27T19:07:51.117Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "ztp-install/ocp-edge88", "version": "18340784"}
2022-07-27T19:07:51.118Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": []}
2022-07-27T19:07:51.118Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge88"]}
2022-07-27T19:07:51.118Z INFO controllers.ClusterGroupUpgrade Upgrade is completed
2022-07-27T19:07:51.139Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "ztp-install/ocp-edge88", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.139Z INFO controllers.ClusterGroupUpgrade Start reconciling CGU {"name": "default/test"}
2022-07-27T19:07:51.239Z INFO controllers.ClusterGroupUpgrade Loaded CGU {"name": "default/test", "version": "21234640"}
2022-07-27T19:07:51.239Z INFO controllers.ClusterGroupUpgrade [getClusterBySelectors] {"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.239Z INFO controllers.ClusterGroupUpgrade [getClustersBySelectors] {"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.240Z INFO controllers.ClusterGroupUpgrade Finish reconciling CGU {"name": "default/test", "result": {"Requeue":false,"RequeueAfter":0}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13a388d]
goroutine 619 [running]:
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).isUpgradeComplete(0xc000e13e80, {0x18ca8d8, 0xc0008b1200}, 0xc0006f6000)
/remote-source/app/controllers/clustergroupupgrade_controller.go:1234 +0xed
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).Reconcile(0xc000e13e40, {0x18ca8d8, 0xc0008b1200}, {{{0xc0007dac98, 0x7}, {0xc0007dac94, 0x4}}})
/remote-source/app/controllers/clustergroupupgrade_controller.go:401 +0xbe5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c1220, {0x18ca830, 0xc000b9a080}, {0x154ab60, 0xc00085e140})
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c1220, {0x18ca830, 0xc000b9a080})
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x354
openshift-operators cluster-group-upgrades-controller-manager-6b94f4959-z78g8 1/2 CrashLoopBackOff 7 (3m42s ago) 46h
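The trace above ends in a nil pointer dereference inside isUpgradeComplete (clustergroupupgrade_controller.go:1234); the report does not say which value was nil. As a minimal sketch of the general pattern only, the following uses hypothetical types (`CGU`, `BatchStatus`, `isBatchComplete` are illustrative names, not the operator's actual code) to show a completion check that returns an error the reconciler could requeue on, instead of panicking and crash-looping the pod.

```go
package main

import "fmt"

// BatchStatus and CGU are hypothetical stand-ins for illustration;
// they are NOT the operator's real types.
type BatchStatus struct {
	CurrentBatch int
	Progress     map[string]string // cluster name -> state, e.g. "Completed"
}

type CGU struct {
	Name   string
	Status *BatchStatus // may be nil if the status was never populated
}

// isBatchComplete reports whether every cluster in the current batch is
// "Completed". It returns an error rather than dereferencing a nil pointer.
func isBatchComplete(c *CGU) (bool, error) {
	if c == nil {
		return false, fmt.Errorf("nil CGU")
	}
	if c.Status == nil {
		return false, fmt.Errorf("cgu %s: batch status not populated", c.Name)
	}
	for _, state := range c.Status.Progress {
		if state != "Completed" {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// A CGU whose status was never filled in: the guarded check reports an
	// error instead of crashing the process.
	done, err := isBatchComplete(&CGU{Name: "default/test"})
	fmt.Println(done, err)
}
```

Returning an error keeps the controller process alive, so other CGUs (such as the ztp-install ones in the log) can still be reconciled.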
More info for failure mentioned in comment #1:
cgu:
[kni ~]$ oc get cgu -n default test -o yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ran.openshift.io/v1alpha1","kind":"ClusterGroupUpgrade","metadata":{"annotations":{},"name":"test","namespace":"default"},"spec":{"backup":false,"clusterSelector":["group-du-sno"],"enable":true,"managedPolicies":["common-config-policy","common-subscriptions-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":3}}}
  creationTimestamp: "2022-07-27T18:49:25Z"
  finalizers:
  - ran.openshift.io/cleanup-finalizer
  generation: 2
  name: test
  namespace: default
  resourceVersion: "21234640"
  uid: 85760400-2728-4782-a7ba-136e2a3cf4e4
spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: false
  clusterSelector:
  - group-du-sno
  enable: true
  managedPolicies:
  - common-config-policy
  - common-subscriptions-policy
  preCaching: false
  remediationStrategy:
    maxConcurrency: 1
    timeout: 3
status:
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2022-07-27T18:49:25Z"
    message: The ClusterGroupUpgrade CR policies are taking too long to complete
    reason: UpgradeTimedOut
    status: "False"
    type: Ready
  copiedPolicies:
  - test-common-config-policy-s2vzg
  managedPoliciesCompliantBeforeUpgrade:
  - common-subscriptions-policy
  managedPoliciesContent:
    common-config-policy: "null"
  managedPoliciesForUpgrade:
  - name: common-config-policy
    namespace: ztp-common
  managedPoliciesNs:
    common-config-policy: ztp-common
  placementBindings:
  - test-common-config-policy-placement-h5sft
  placementRules:
  - test-common-config-policy-placement-h5sft
  remediationPlan:
  - - ocp-edge87
  - - ocp-edge88
  safeResourceNames:
    test-common-config-policy: test-common-config-policy-s2vzg
    test-common-config-policy-config: test-common-config-policy-config-24dg6
    test-common-config-policy-placement: test-common-config-policy-placement-h5sft
  status:
    currentBatch: 2
    currentBatchRemediationProgress:
      ocp-edge87:
        state: Completed
    startedAt: "2022-07-27T18:49:25Z"
Policies:
[kni ~]$ oc get policies -A |grep common-config-policy
default test-common-config-policy-s2vzg enforce 44m
ocp-edge87 ztp-common.common-config-policy inform NonCompliant 25h
ocp-edge88 ztp-common.common-config-policy inform NonCompliant 25h
ztp-common common-config-policy inform NonCompliant 25h
Note that the parent policy created by the CGU has no compliance status.
$ oc get policies -n default test-common-config-policy-s2vzg -oyaml
..
status:
  placement:
  - placementBinding: test-common-config-policy-placement-h5sft
    placementRule: test-common-config-policy-placement-h5sft
$ oc get managedcluster
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true https://api.hlxcl11.lab.eng.tlv2.redhat.com:6443 True True 7d3h
ocp-edge87 true https://api.ocp-edge87.lab.eng.tlv2.redhat.com:6443 True True 25h
ocp-edge88 true https://api.ocp-edge88.lab.eng.tlv2.redhat.com:6443 True True 25h
A new issue has been introduced: the CGU fails to start with a "policy missing" error, although the policy does exist and is NonCompliant. Please let me know if a new bz should be opened instead.
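The remediationPlan in the CGU status above holds two single-cluster batches, which is what maxConcurrency: 1 gives for the two selected clusters. A rough sketch of that batching follows; `buildRemediationPlan` is a hypothetical name, and the operator's real planning logic may differ (for example, in how it treats clusters that are already compliant).

```go
package main

import "fmt"

// buildRemediationPlan splits clusters into consecutive batches holding at
// most maxConcurrency clusters each. Hypothetical helper for illustration.
func buildRemediationPlan(clusters []string, maxConcurrency int) [][]string {
	if maxConcurrency < 1 {
		maxConcurrency = 1 // treat invalid values as "one at a time"
	}
	var plan [][]string
	for start := 0; start < len(clusters); start += maxConcurrency {
		end := start + maxConcurrency
		if end > len(clusters) {
			end = len(clusters)
		}
		plan = append(plan, clusters[start:end])
	}
	return plan
}

func main() {
	plan := buildRemediationPlan([]string{"ocp-edge87", "ocp-edge88"}, 1)
	fmt.Println(plan) // prints [[ocp-edge87] [ocp-edge88]]
}
```

With this layout, spoke2 (ocp-edge88) is only remediated once batch 1 has finished or timed out, which is why the missing enforce policy for spoke2 points at the transition to batch 2.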
[kni ~]$ oc get cgu -o yaml
apiVersion: v1
items:
- apiVersion: ran.openshift.io/v1alpha1
  kind: ClusterGroupUpgrade
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"ran.openshift.io/v1alpha1","kind":"ClusterGroupUpgrade","metadata":{"annotations":{},"name":"test","namespace":"default"},"spec":{"CusterSelector":["group-du-sno"],"backup":false,"enable":true,"managedPolicies":["common-config-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":18}}}
    creationTimestamp: "2022-08-05T18:13:43Z"
    finalizers:
    - ran.openshift.io/cleanup-finalizer
    generation: 2
    name: test
    namespace: default
    resourceVersion: "49721892"
    uid: 9b20a189-1f39-4824-afae-cd95de546573
  spec:
    actions:
      afterCompletion:
        deleteObjects: true
      beforeEnable: {}
    backup: false
    enable: true
    managedPolicies:
    - common-config-policy
    preCaching: false
    remediationStrategy:
      maxConcurrency: 1
      timeout: 18
  status:
    conditions:
    - lastTransitionTime: "2022-08-05T18:13:43Z"
      message: 'The ClusterGroupUpgrade CR has managed policies that are missing:
        [common-config-policy]'
      reason: UpgradeCannotStart
      status: "False"
      type: Ready
    status: {}
kind: List
metadata:
  resourceVersion: ""
[kni ~]$ oc get policies -A
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
ocp-edge87 ztp-common.common-config-policy inform NonCompliant 22h
ocp-edge87 ztp-common.common-subscriptions-policy inform Compliant 10d
ocp-edge87 ztp-group.group-du-sno-config-policy inform NonCompliant 10d
ocp-edge87 ztp-site.ocp-edge87-config-policy inform Compliant 10d
ocp-edge87 ztp-site.ocp-edge87-ptp-config-policy inform Compliant 10d
ocp-edge88 ztp-common.common-config-policy inform NonCompliant 22h
ocp-edge88 ztp-common.common-subscriptions-policy inform Compliant 10d
ocp-edge88 ztp-group.group-du-sno-config-policy inform Compliant 10d
ocp-edge88 ztp-site.ocp-edge88-config-policy inform Compliant 10d
ocp-edge88 ztp-site.ocp-edge88-ptp-config-policy inform Compliant 10d
ztp-common common-config-policy inform NonCompliant 22h
ztp-common common-subscriptions-policy inform Compliant 10d
ztp-group group-du-sno-config-policy inform NonCompliant 10d
ztp-group group-du-sno-validator-du-policy inform 10d
ztp-site ocp-edge87-config-policy inform Compliant 10d
ztp-site ocp-edge87-ptp-config-policy inform Compliant 10d
ztp-site ocp-edge88-config-policy inform Compliant 10d
ztp-site ocp-edge88-ptp-config-policy inform Compliant 10d
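In the CGU dump above, the last-applied-configuration spells the selector key `CusterSelector`, and the resulting spec has no clusterSelector entry at all. A misspelled key like this is easy to miss when it is accepted without complaint. As a sketch of one way to catch it before applying, the following strictly decodes the spec JSON with Go's standard `encoding/json` decoder; the `cguSpec` struct is a hypothetical, trimmed-down mirror of the fields seen in this report, not the operator's API type.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// cguSpec is a hypothetical, reduced mirror of the spec fields that appear
// in this report. It is not the operator's real type.
type cguSpec struct {
	Backup              bool     `json:"backup"`
	ClusterSelector     []string `json:"clusterSelector"`
	Enable              bool     `json:"enable"`
	ManagedPolicies     []string `json:"managedPolicies"`
	PreCaching          bool     `json:"preCaching"`
	RemediationStrategy struct {
		MaxConcurrency int `json:"maxConcurrency"`
		Timeout        int `json:"timeout"`
	} `json:"remediationStrategy"`
}

// decodeStrict rejects any key that does not map to a field of cguSpec.
func decodeStrict(raw []byte) (cguSpec, error) {
	var spec cguSpec
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&spec)
	return spec, err
}

func main() {
	// Spec copied from the last-applied-configuration annotation above.
	raw := []byte(`{"CusterSelector":["group-du-sno"],"backup":false,"enable":true,"managedPolicies":["common-config-policy"],"preCaching":false,"remediationStrategy":{"maxConcurrency":1,"timeout":18}}`)
	if _, err := decodeStrict(raw); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A lenient decode of the same input would succeed and leave ClusterSelector empty, which hides the mistake.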
I had a typo in the clusterSelector field in my YAML, hence the error above. Thanks to Jun for catching that. Changing back to on_qa.
The overall status for the CGU indicates a timeout when an earlier batch failed but the last batch passed:
status:
  computedMaxConcurrency: 1
  conditions:
  - lastTransitionTime: "2022-08-16T16:31:48Z"
    message: The ClusterGroupUpgrade CR policies are taking too long to complete
    reason: UpgradeTimedOut
    status: "False"
    type: Ready
  copiedPolicies:
  - test-common-config-policy-jhf9c
  managedPoliciesContent:
    common-config-policy: "null"
  managedPoliciesForUpgrade:
  - name: common-config-policy
    namespace: ztp-common
  managedPoliciesNs:
    common-config-policy: ztp-common
  placementBindings:
  - test-common-config-policy-placement-gkcp2
  placementRules:
  - test-common-config-policy-placement-gkcp2
  remediationPlan:
  - - ocp-edge87
  - - ocp-edge88
  safeResourceNames:
    test-common-config-policy: test-common-config-policy-jhf9c
    test-common-config-policy-config: test-common-config-policy-config-8fnxw
    test-common-config-policy-placement: test-common-config-policy-placement-gkcp2
  status:
    currentBatch: 2
    currentBatchRemediationProgress:
      ocp-edge88:
        state: Completed
    currentBatchStartedAt: "2022-08-16T16:46:49Z"
    startedAt: "2022-08-16T16:31:48Z"
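The status above shows the behavior this bug asks for: batch 2 (ocp-edge88) completed, yet the overall condition still reports UpgradeTimedOut because an earlier batch did not finish in time. A minimal sketch of that aggregation rule follows. The `batchResult` type and `overallReason` function are hypothetical, and the "UpgradeCompleted" reason string is an assumption used for the success case; only UpgradeTimedOut appears in this report.

```go
package main

import "fmt"

// batchResult is a hypothetical per-batch summary for illustration.
type batchResult struct {
	Clusters []string
	TimedOut bool
}

// overallReason reports a timeout if ANY batch timed out, so a successful
// final batch cannot mask an earlier failure.
func overallReason(batches []batchResult) string {
	for _, b := range batches {
		if b.TimedOut {
			return "UpgradeTimedOut"
		}
	}
	return "UpgradeCompleted" // assumed name for the success reason
}

func main() {
	batches := []batchResult{
		{Clusters: []string{"ocp-edge87"}, TimedOut: true},  // spoke1 was down
		{Clusters: []string{"ocp-edge88"}, TimedOut: false}, // spoke2 completed
	}
	fmt.Println(overallReason(batches)) // prints UpgradeTimedOut
}
```

Deriving the overall result from the last batch alone would report success here, which is the situation the bug summary describes.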
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.11 CNF vRAN extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2022:6110