Description of problem:
The rook-ceph-operator pod is stuck in CrashLoopBackOff after tier1 execution in external mode, while the external RHCS 5 cluster remains healthy.

Version-Release number of selected component (if applicable):
ocp  : 4.10.0-rc.8
odf  : 4.10.0
RHCS : 5

How reproducible:
Deploy ODF in external mode with RHCS 5 and trigger ocs-ci tier1.

Steps to Reproduce:
1. Deploy ODF in external mode with RHCS 5.
2. Trigger ocs-ci tier1.

Actual results:
The rook-ceph-operator pod goes into CrashLoopBackOff after the tier1 execution.

Expected results:
ODF remains healthy after the tier execution.

Additional info:
Tier run and must-gather logs: https://drive.google.com/file/d/1Kf-IkIMeKb-9XZNSfDWml9hh00dxmqER/view?usp=sharing
ODF cluster status after the tier execution:
--------
[root@m4204001 ~]# oc -n openshift-storage get pod
NAME                                               READY   STATUS             RESTARTS          AGE
csi-addons-controller-manager-59588d8bc6-xhfg4     2/2     Running            0                 17h
csi-cephfsplugin-87bh5                             3/3     Running            0                 17h
csi-cephfsplugin-jpdzv                             3/3     Running            0                 17h
csi-cephfsplugin-provisioner-6547889564-58bts      6/6     Running            0                 11h
csi-cephfsplugin-provisioner-6547889564-v4hgw      6/6     Running            0                 17h
csi-cephfsplugin-wpxxq                             3/3     Running            0                 17h
csi-rbdplugin-hg9ml                                4/4     Running            0                 17h
csi-rbdplugin-provisioner-6f4cd57fcb-t4c47         7/7     Running            0                 17h
csi-rbdplugin-provisioner-6f4cd57fcb-xdhgj         7/7     Running            0                 17h
csi-rbdplugin-s8mc6                                4/4     Running            0                 17h
csi-rbdplugin-vv42j                                4/4     Running            0                 17h
noobaa-core-0                                      1/1     Running            0                 11h
noobaa-db-pg-0                                     1/1     Running            0                 17h
noobaa-endpoint-5f946897b6-bqhkv                   1/1     Running            0                 11h
noobaa-endpoint-5f946897b6-ztzzj                   1/1     Running            0                 17h
noobaa-operator-86dc95f87c-tdnp2                   1/1     Running            0                 11h
ocs-metrics-exporter-b76b778f5-snvk9               1/1     Running            0                 11h
ocs-operator-7445966997-pv79k                      1/1     Running            0                 17h
odf-console-759876895-gl9k9                        1/1     Running            0                 17h
odf-operator-controller-manager-5fcb6d85cc-2rrft   2/2     Running            0                 17h
rook-ceph-operator-6f87b7f4d8-2jzr4                0/1     CrashLoopBackOff   143 (4m25s ago)   11h
rook-ceph-tools-external-7b6558594f-4cqkp          1/1     Running            0                 11h
[root@m4204001 ~]#

[root@m4204001 ~]# oc -n openshift-storage get cephcluster
NAME                                       DATADIRHOSTPATH   MONCOUNT   AGE   PHASE        MESSAGE                                             HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                                 17h   Connecting   Attempting to connect to an external Ceph cluster   HEALTH_OK   true
[root@m4204001 ~]#

[root@m4204001 ~]# oc -n openshift-storage get sc
NAME                                   PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-external-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate           true                   17h
ocs-external-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate           false                  17h
ocs-external-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate           true                   17h
openshift-storage.noobaa.io            openshift-storage.noobaa.io/obc         Delete          Immediate           false                  17h
[root@m4204001 ~]#

[root@m4204001 ~]# oc -n openshift-storage get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0               Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0               Installing
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0               Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0               Succeeded
[root@m4204001 ~]#
--------

RHCS cluster status:
--------
[root@xzkvm01 ~]# ceph -s
  cluster:
    id:     bdabda3c-9b00-11ec-9831-525400e56e5d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum xzkvm01,xzkvm02,xzkvm03 (age 4d)
    mgr: xzkvm01.zblcqg(active, since 4d), standbys: xzkvm02.wippoc
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 4d), 3 in (since 4d)
    rgw: 4 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 241 pgs
    objects: 14.69k objects, 53 GiB
    usage:   61 GiB used, 539 GiB / 600 GiB avail
    pgs:     241 active+clean

  io:
    client:   5.3 KiB/s wr, 0 op/s rd, 0 op/s wr
[root@xzkvm01 ~]#
--------
The operator log shows the following stack:
--------
2022-03-08T08:48:40.962801533Z panic: runtime error: invalid memory address or nil pointer dereference
2022-03-08T08:48:40.962801533Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf81faa]
2022-03-08T08:48:40.962801533Z
2022-03-08T08:48:40.962801533Z goroutine 1077 [running]:
2022-03-08T08:48:40.962801533Z github.com/rook/rook/pkg/apis/ceph.rook.io/v1.(*CephObjectStore).GetObjectKind(0x0, 0x0, 0x0)
2022-03-08T08:48:40.962885446Z 	<autogenerated>:1 +0xa
2022-03-08T08:48:40.962885446Z github.com/rook/rook/pkg/operator/ceph/reporting.ReportReconcileResult(0xc0001821e0, 0x1de5be0, 0xc000e09e00, 0x1e1c010, 0x0, 0xc00086eed0, 0x0, 0x1dacdd8, 0xc0006ed6c8, 0xc0006ed6c8, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/pkg/operator/ceph/reporting/reporting.go:46 +0x3c
2022-03-08T08:48:40.962885446Z github.com/rook/rook/pkg/operator/ceph/object.(*ReconcileCephObjectStore).Reconcile(0xc000969760, 0x1debeb8, 0xc00086eed0, 0xc000a8b8f0, 0x11, 0xc0008a0270, 0x2b, 0xc00086eed0, 0xc00086ee70, 0x30, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/pkg/operator/ceph/object/controller.go:159 +0xac
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00069f040, 0x1debeb8, 0xc00086ee70, 0xc000a8b8f0, 0x11, 0xc0008a0270, 0x2b, 0xc00086ee70, 0x0, 0x0, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x220
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00069f040, 0x1debe10, 0xc0000d2f80, 0x1863620, 0xc0007522a0)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x29c
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00069f040, 0x1debe10, 0xc0000d2f80, 0x0)
2022-03-08T08:48:40.962896763Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x206
2022-03-08T08:48:40.962896763Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000044e90, 0xc00069f040, 0x1debe10, 0xc0000d2f80)
2022-03-08T08:48:40.962905747Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x5c
2022-03-08T08:48:40.962905747Z created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
2022-03-08T08:48:40.962916472Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x3ba
--------

This is coming from the source here:
https://github.com/red-hat-storage/rook/blob/release-4.10/pkg/operator/ceph/reporting/reporting.go#L46

Blaine, could you take a look?
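For anyone following along, this is the failure mode the trace points at: ReportReconcileResult is handed a typed-but-nil *CephObjectStore through an interface, and the first thing it does is call GetObjectKind() on it, which dereferences the nil receiver. Below is a minimal stand-alone sketch of that mechanism (this is NOT the Rook code; all type and function names are simplified stand-ins):

--------
package main

import "fmt"

// Simplified stand-ins for the Kubernetes/Rook types involved in the trace.
type TypeMeta struct {
	Kind string
}

type CephObjectStore struct {
	TypeMeta
}

// Shaped like the GetObjectKind accessor: it reads a field through the
// receiver, so a nil *CephObjectStore dereferences a nil pointer here.
func (c *CephObjectStore) GetObjectKind() string {
	return c.TypeMeta.Kind
}

type kindGetter interface {
	GetObjectKind() string
}

// Stand-in for reporting.ReportReconcileResult: it assumes the caller always
// hands it a usable object and immediately calls a method on it.
func reportReconcileResult(obj kindGetter) {
	fmt.Println("kind:", obj.GetObjectKind()) // panics when obj wraps a nil pointer
}

func main() {
	var store *CephObjectStore // nil, e.g. the CR was never fetched successfully
	// The interface value passed below is NOT nil (it still carries the concrete
	// type *CephObjectStore), so a plain obj == nil check inside the callee
	// would not catch this.
	reportReconcileResult(store) // runtime error: invalid memory address or nil pointer dereference
}
--------

That also matches the trace, where GetObjectKind is called with a 0x0 receiver even though the call made it past the interface boundary.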
I believe I found the source of the issue and am working on a fix upstream.
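Without pasting the actual patch here, the class of guard that avoids this panic is roughly: either avoid reaching the reporting call before the CR has been fetched, or detect the typed-nil interface before dereferencing it. A rough sketch of the second option only (helper name and placement are illustrative, not the upstream change):

--------
package main

import (
	"fmt"
	"reflect"
)

// CephObjectStore stands in for the real CRD type; an empty struct is enough here.
type CephObjectStore struct{}

// isNilObject reports whether an interface value is nil or wraps a typed nil
// pointer. A plain obj == nil comparison misses the typed-nil case once the
// nil pointer has been stored in an interface.
func isNilObject(obj interface{}) bool {
	if obj == nil {
		return true
	}
	v := reflect.ValueOf(obj)
	return v.Kind() == reflect.Ptr && v.IsNil()
}

// reportReconcileResult stands in for the reporting helper; with the guard it
// degrades gracefully instead of dereferencing a nil object.
func reportReconcileResult(obj interface{}) {
	if isNilObject(obj) {
		fmt.Println("skipping status/event reporting: no object available")
		return
	}
	fmt.Println("reporting reconcile result")
}

func main() {
	var store *CephObjectStore  // e.g. the CR could not be fetched in Reconcile
	reportReconcileResult(store) // no panic: the guard catches the typed nil
}
--------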
PR backport to ODF 4.10 here: https://github.com/red-hat-storage/rook/pull/358
The fix has been verified; the issue can no longer be reproduced.
*** Bug 2064763 has been marked as a duplicate of this bug. ***