Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2114903

Summary: TALM pod crashed after first batch timed out
Product: OpenShift Container Platform Reporter: yliu1
Component: Telco Edge Assignee: jun
Telco Edge sub component: TALO QA Contact: yliu1
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: ijolliff, jun
Version: 4.11   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-03 13:45:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description yliu1 2022-08-03 13:42:09 UTC
Description of problem:
TALM pod crashed after first batch timed out

Version-Release number of selected component (if applicable):
4.11

How reproducible:
100%

Steps to Reproduce:
- 2 spoke clusters are managed
- create a CGU that applies a simple config to both clusters, with maxConcurrency set to 1 and timeout set to 3
- before the original policy becomes compliant on the hub cluster, reboot or power off spoke1
- observe that the CGU status indicates a timeout (everything is as expected up to this point)
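A CGU matching the reproduction steps can be sketched as below. Field names follow the TALM ClusterGroupUpgrade API; the CGU name and cluster names are taken from the logs further down, while the managed policy name is a placeholder, not from this report:

```yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: test            # matches "default/test" in the reconcile logs
  namespace: default
spec:
  clusters:
    - ocp-edge87        # spoke1 in the steps above
    - ocp-edge88        # spoke2 in the steps above
  managedPolicies:
    - example-config-policy   # placeholder for the simple config policy
  enable: true
  remediationStrategy:
    maxConcurrency: 1   # one cluster per batch, so spoke2 waits for batch 2
    timeout: 3          # minutes; the first batch times out within this window
```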

Actual results:
- no enforce policy was ever created under the spoke2 namespace, even after spoke1 recovered.
- the cluster-group-upgrades-controller-manager pod is in CrashLoopBackOff state, with the following error in the pod logs:

2022-07-27T19:07:50.993Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge87"]}
2022-07-27T19:07:50.993Z	INFO	controllers.ClusterGroupUpgrade	Upgrade is completed
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	WARN: No child policies found for cluster	{"Name": "local-cluster"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	ZTP for the cluster has completed. ztp-done label found.	{"Name": "ocp-edge87"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	Reconciling managedCluster to create clusterGroupUpgrade	{"Request.Name": "ocp-edge88"}
2022-07-27T19:07:51.002Z	INFO	controllers.ManagedClusterForCGU	ZTP for the cluster has completed. ztp-done label found.	{"Name": "ocp-edge88"}
2022-07-27T19:07:51.016Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "ztp-install/ocp-edge87", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.016Z	INFO	controllers.ClusterGroupUpgrade	Start reconciling CGU	{"name": "ztp-install/ocp-edge88"}
2022-07-27T19:07:51.117Z	INFO	controllers.ClusterGroupUpgrade	Loaded CGU	{"name": "ztp-install/ocp-edge88", "version": "18340784"}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": []}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge88"]}
2022-07-27T19:07:51.118Z	INFO	controllers.ClusterGroupUpgrade	Upgrade is completed
2022-07-27T19:07:51.139Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "ztp-install/ocp-edge88", "result": {"Requeue":false,"RequeueAfter":0}}
2022-07-27T19:07:51.139Z	INFO	controllers.ClusterGroupUpgrade	Start reconciling CGU	{"name": "default/test"}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	Loaded CGU	{"name": "default/test", "version": "21234640"}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	[getClusterBySelectors]	{"clustersBySelector": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.239Z	INFO	controllers.ClusterGroupUpgrade	[getClustersBySelectors]	{"clusterNames": ["ocp-edge87", "ocp-edge88"]}
2022-07-27T19:07:51.240Z	INFO	controllers.ClusterGroupUpgrade	Finish reconciling CGU	{"name": "default/test", "result": {"Requeue":false,"RequeueAfter":0}}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13a388d]

goroutine 619 [running]:
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).isUpgradeComplete(0xc000e13e80, {0x18ca8d8, 0xc0008b1200}, 0xc0006f6000)
	/remote-source/app/controllers/clustergroupupgrade_controller.go:1234 +0xed
github.com/openshift-kni/cluster-group-upgrades-operator/controllers.(*ClusterGroupUpgradeReconciler).Reconcile(0xc000e13e40, {0x18ca8d8, 0xc0008b1200}, {{{0xc0007dac98, 0x7}, {0xc0007dac94, 0x4}}})
	/remote-source/app/controllers/clustergroupupgrade_controller.go:401 +0xbe5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c1220, {0x18ca830, 0xc000b9a080}, {0x154ab60, 0xc00085e140})
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c1220, {0x18ca830, 0xc000b9a080})
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x354


openshift-operators                                cluster-group-upgrades-controller-manager-6b94f4959-z78g8         1/2     CrashLoopBackOff   7 (3m42s ago)   46h


Expected results:
The CGU moves on to batch 2, and the pod does not crash.

Additional info:

Comment 1 jun 2022-08-03 13:45:44 UTC
This was introduced by an incomplete fix for Bug 2087125.

*** This bug has been marked as a duplicate of bug 2087125 ***