Bug 2061675 - [IBM Z][External Mode] - segfault in rook objectstore controller when in external mode (ocs-ci tier1 execution)
Summary: [IBM Z][External Mode] - segfault in rook objectstore controller when in external mode (ocs-ci tier1 execution)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Blaine Gardner
QA Contact: Abdul Kandathil (IBM)
URL:
Whiteboard:
Duplicates: 2064763
Depends On:
Blocks:
 
Reported: 2022-03-08 09:55 UTC by Abdul Kandathil (IBM)
Modified: 2023-08-09 17:03 UTC (History)
CC List: 7 users

Fixed In Version: 4.10.0-189
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-21 09:12:52 UTC
Embargoed:




Links
System  ID                             Private  Priority  Status  Summary                                       Last Updated
Github  red-hat-storage/rook pull 358  0        None      open    core: rework usage of ReportReconcileResult   2022-03-11 17:56:04 UTC
Github  rook/rook pull 9873            0        None      open    core: rework usage of ReportReconcileResult   2022-03-09 17:12:59 UTC

Description Abdul Kandathil (IBM) 2022-03-08 09:55:37 UTC
Description of problem:
The rook-ceph-operator pod is stuck in a CrashLoopBackOff state after the ocs-ci tier1 run in external mode, while the external RHCS 5 cluster remains healthy.

Version-Release number of selected component (if applicable):
OCP: 4.10.0-rc.8
ODF: 4.10.0
RHCS: 5

How reproducible:
Deploy ODF in external mode with RHCS 5 and trigger an ocs-ci tier1 run.

Steps to Reproduce:
Deploy ODF in external mode with RHCS 5 and trigger an ocs-ci tier1 run.

Actual results:
The rook-ceph-operator pod crashes repeatedly (CrashLoopBackOff) with a segfault (nil pointer dereference) in the CephObjectStore controller.

Expected results:
ODF remains healthy after the tier1 execution.

Additional info:
Tier1 run and must-gather logs: https://drive.google.com/file/d/1Kf-IkIMeKb-9XZNSfDWml9hh00dxmqER/view?usp=sharing

Comment 2 Abdul Kandathil (IBM) 2022-03-08 10:07:21 UTC
ODF cluster status after the tier execution:
--------
[root@m4204001 ~]# oc -n openshift-storage get pod
NAME                                               READY   STATUS             RESTARTS          AGE
csi-addons-controller-manager-59588d8bc6-xhfg4     2/2     Running            0                 17h
csi-cephfsplugin-87bh5                             3/3     Running            0                 17h
csi-cephfsplugin-jpdzv                             3/3     Running            0                 17h
csi-cephfsplugin-provisioner-6547889564-58bts      6/6     Running            0                 11h
csi-cephfsplugin-provisioner-6547889564-v4hgw      6/6     Running            0                 17h
csi-cephfsplugin-wpxxq                             3/3     Running            0                 17h
csi-rbdplugin-hg9ml                                4/4     Running            0                 17h
csi-rbdplugin-provisioner-6f4cd57fcb-t4c47         7/7     Running            0                 17h
csi-rbdplugin-provisioner-6f4cd57fcb-xdhgj         7/7     Running            0                 17h
csi-rbdplugin-s8mc6                                4/4     Running            0                 17h
csi-rbdplugin-vv42j                                4/4     Running            0                 17h
noobaa-core-0                                      1/1     Running            0                 11h
noobaa-db-pg-0                                     1/1     Running            0                 17h
noobaa-endpoint-5f946897b6-bqhkv                   1/1     Running            0                 11h
noobaa-endpoint-5f946897b6-ztzzj                   1/1     Running            0                 17h
noobaa-operator-86dc95f87c-tdnp2                   1/1     Running            0                 11h
ocs-metrics-exporter-b76b778f5-snvk9               1/1     Running            0                 11h
ocs-operator-7445966997-pv79k                      1/1     Running            0                 17h
odf-console-759876895-gl9k9                        1/1     Running            0                 17h
odf-operator-controller-manager-5fcb6d85cc-2rrft   2/2     Running            0                 17h
rook-ceph-operator-6f87b7f4d8-2jzr4                0/1     CrashLoopBackOff   143 (4m25s ago)   11h
rook-ceph-tools-external-7b6558594f-4cqkp          1/1     Running            0                 11h
[root@m4204001 ~]#
[root@m4204001 ~]# oc -n openshift-storage get cephcluster
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE   PHASE        MESSAGE                                             HEALTH      EXTERNAL
ocs-external-storagecluster-cephcluster                                17h   Connecting   Attempting to connect to an external Ceph cluster   HEALTH_OK   true
[root@m4204001 ~]#
[root@m4204001 ~]# oc -n openshift-storage get sc
NAME                                   PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-external-storagecluster-ceph-rbd   openshift-storage.rbd.csi.ceph.com      Delete          Immediate           true                   17h
ocs-external-storagecluster-ceph-rgw   openshift-storage.ceph.rook.io/bucket   Delete          Immediate           false                  17h
ocs-external-storagecluster-cephfs     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate           true                   17h
openshift-storage.noobaa.io            openshift-storage.noobaa.io/obc         Delete          Immediate           false                  17h
[root@m4204001 ~]#
[root@m4204001 ~]# oc -n openshift-storage get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0               Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0               Installing
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0               Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0               Succeeded
[root@m4204001 ~]#
--------

RHCS cluster status :
--------
[root@xzkvm01 ~]# ceph -s
  cluster:
    id:     bdabda3c-9b00-11ec-9831-525400e56e5d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum xzkvm01,xzkvm02,xzkvm03 (age 4d)
    mgr: xzkvm01.zblcqg(active, since 4d), standbys: xzkvm02.wippoc
    mds: 1/1 daemons up, 2 standby
    osd: 3 osds: 3 up (since 4d), 3 in (since 4d)
    rgw: 4 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 241 pgs
    objects: 14.69k objects, 53 GiB
    usage:   61 GiB used, 539 GiB / 600 GiB avail
    pgs:     241 active+clean

  io:
    client:   5.3 KiB/s wr, 0 op/s rd, 0 op/s wr

[root@xzkvm01 ~]#
--------

Comment 3 Travis Nielsen 2022-03-08 19:16:08 UTC
The operator log shows the following stack:

2022-03-08T08:48:40.962801533Z panic: runtime error: invalid memory address or nil pointer dereference
2022-03-08T08:48:40.962801533Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf81faa]
2022-03-08T08:48:40.962801533Z 
2022-03-08T08:48:40.962801533Z goroutine 1077 [running]:
2022-03-08T08:48:40.962801533Z github.com/rook/rook/pkg/apis/ceph.rook.io/v1.(*CephObjectStore).GetObjectKind(0x0, 0x0, 0x0)
2022-03-08T08:48:40.962885446Z 	<autogenerated>:1 +0xa
2022-03-08T08:48:40.962885446Z github.com/rook/rook/pkg/operator/ceph/reporting.ReportReconcileResult(0xc0001821e0, 0x1de5be0, 0xc000e09e00, 0x1e1c010, 0x0, 0xc00086eed0, 0x0, 0x1dacdd8, 0xc0006ed6c8, 0xc0006ed6c8, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/pkg/operator/ceph/reporting/reporting.go:46 +0x3c
2022-03-08T08:48:40.962885446Z github.com/rook/rook/pkg/operator/ceph/object.(*ReconcileCephObjectStore).Reconcile(0xc000969760, 0x1debeb8, 0xc00086eed0, 0xc000a8b8f0, 0x11, 0xc0008a0270, 0x2b, 0xc00086eed0, 0xc00086ee70, 0x30, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/pkg/operator/ceph/object/controller.go:159 +0xac
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00069f040, 0x1debeb8, 0xc00086ee70, 0xc000a8b8f0, 0x11, 0xc0008a0270, 0x2b, 0xc00086ee70, 0x0, 0x0, ...)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x220
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00069f040, 0x1debe10, 0xc0000d2f80, 0x1863620, 0xc0007522a0)
2022-03-08T08:48:40.962885446Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x29c
2022-03-08T08:48:40.962885446Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00069f040, 0x1debe10, 0xc0000d2f80, 0x0)
2022-03-08T08:48:40.962896763Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x206
2022-03-08T08:48:40.962896763Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000044e90, 0xc00069f040, 0x1debe10, 0xc0000d2f80)
2022-03-08T08:48:40.962905747Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x5c
2022-03-08T08:48:40.962905747Z created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
2022-03-08T08:48:40.962916472Z 	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x3ba

This is coming from the source here: https://github.com/red-hat-storage/rook/blob/release-4.10/pkg/operator/ceph/reporting/reporting.go#L46
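
For illustration only, here is a minimal Go sketch (simplified stand-in types and names, not the actual Rook code) of the failure pattern the stack points at: a typed nil pointer stored in an interface passes an "obj == nil" check, and the panic fires as soon as a method like GetObjectKind dereferences its nil receiver.

--------
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real apimachinery/Rook types.
type TypeMeta struct{ Kind string }

type CephObjectStore struct {
	TypeMeta
}

// Mirrors the generated GetObjectKind pattern: it dereferences the receiver,
// so a nil *CephObjectStore panics here.
func (c *CephObjectStore) GetObjectKind() *TypeMeta { return &c.TypeMeta }

// object mimics runtime.Object just enough to show the typed-nil trap.
type object interface{ GetObjectKind() *TypeMeta }

func reportReconcileResult(obj object) {
	// obj is a non-nil interface value even when it wraps (*CephObjectStore)(nil),
	// so this guard does not prevent the panic below.
	if obj == nil {
		return
	}
	fmt.Println(obj.GetObjectKind().Kind) // panics inside GetObjectKind
}

func main() {
	var store *CephObjectStore // typed nil, as in the crashing reconcile
	reportReconcileResult(store)
}
--------

Running this produces the same class of panic as the operator log ("runtime error: invalid memory address or nil pointer dereference"), consistent with a nil *CephObjectStore reaching ReportReconcileResult.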

Blaine, could you take a look?

Comment 4 Blaine Gardner 2022-03-08 22:18:55 UTC
I believe I found the source of the issue and am working on a fix upstream.
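
As context for the kind of change needed, here is a minimal sketch of a defensive guard against typed-nil objects; this is only an assumption about the general approach, not the actual upstream patch (which reworks how ReportReconcileResult is called).

--------
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// isNil reports whether obj is nil or a typed nil pointer wrapped in an
// interface, which a plain "obj == nil" check misses.
func isNil(obj interface{}) bool {
	if obj == nil {
		return true
	}
	v := reflect.ValueOf(obj)
	return v.Kind() == reflect.Ptr && v.IsNil()
}

// reportReconcileResult stands in for the reporting helper; with the guard in
// place it never dereferences a nil object.
func reportReconcileResult(obj interface{}, reconcileErr error) error {
	if isNil(obj) {
		return fmt.Errorf("refusing to report status for a nil object: %w", reconcileErr)
	}
	// ... update the object's status and emit an event here ...
	return reconcileErr
}

type cephObjectStore struct{ name string }

func main() {
	var store *cephObjectStore // typed nil, as in the crashing reconcile
	fmt.Println(reportReconcileResult(store, errors.New("failed to reconcile object store")))
}
--------

With such a guard, the helper returns an error instead of segfaulting when a reconcile hands it a nil object.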

Comment 7 Blaine Gardner 2022-03-11 17:56:04 UTC
PR backport to ODF 4.10 here: https://github.com/red-hat-storage/rook/pull/358

Comment 8 Abdul Kandathil (IBM) 2022-03-15 15:15:16 UTC
This fix has been verified; I am not able to reproduce the issue anymore.

Comment 9 Blaine Gardner 2022-03-17 15:28:32 UTC
*** Bug 2064763 has been marked as a duplicate of this bug. ***

