Bug 2064763

Summary: [External Mode] rook-ceph-operator in CLBO state after upgrading 4.8 ---> 4.9 ---> 4.10
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Vijay Avuthu <vavuthu>
Component: rook
Assignee: Blaine Gardner <brgardne>
Status: CLOSED DUPLICATE
QA Contact: Elad <ebenahar>
Severity: urgent
Priority: unspecified
Version: 4.10
CC: jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Keywords: Automation
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-03-17 15:28:32 UTC
Type: Bug

Description Vijay Avuthu 2022-03-16 14:08:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

rook-ceph-operator in CLBO state after upgrading 4.8 ---> 4.9 ---> 4.10

Version of all relevant components (if applicable):

upgraded from 4.8 ---> 4.9 ( ocs-operator.v4.9.4 ) ---> 4.10

ocs-registry:4.10.0-189

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
NA

Steps to Reproduce:
1. In external mode, upgrade cluster from 4.8 ---> 4.9 ---> 4.10
2. Check that all pods are Running (see the sketch below)
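
A quick way to do step 2 from the CLI (a sketch only, assuming the openshift-storage namespace used in the outputs below):

# print only pods whose STATUS is not Running/Completed; empty output means step 2 passed
$ oc -n openshift-storage get pods --no-headers | awk '$3 != "Running" && $3 != "Completed"'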



Actual results:

$ oc get csv ocs-operator.v4.10.0
NAME                   DISPLAY                       VERSION   REPLACES              PHASE
ocs-operator.v4.10.0   OpenShift Container Storage   4.10.0    ocs-operator.v4.9.4   Installing

> pod status

$ oc get pods
NAME                                               READY   STATUS             RESTARTS        AGE
csi-addons-controller-manager-c6f4bcfdb-q2php      2/2     Running            0               23h
csi-cephfsplugin-fsj7w                             3/3     Running            0               25h
csi-cephfsplugin-kljl4                             3/3     Running            0               25h
csi-cephfsplugin-provisioner-58c7b655f-d85dh       6/6     Running            0               25h
csi-cephfsplugin-provisioner-58c7b655f-xllb6       6/6     Running            0               25h
csi-cephfsplugin-r9f68                             3/3     Running            0               25h
csi-rbdplugin-22x49                                3/3     Running            0               12h
csi-rbdplugin-provisioner-5bc5c7fcd9-ld9td         6/6     Running            0               12h
csi-rbdplugin-provisioner-5bc5c7fcd9-m5rxx         6/6     Running            0               12h
csi-rbdplugin-x45gw                                3/3     Running            0               12h
csi-rbdplugin-xmmtn                                3/3     Running            0               12h
noobaa-core-0                                      1/1     Running            0               25h
noobaa-db-pg-0                                     1/1     Running            0               25h
noobaa-endpoint-564b5c9b76-mk8qd                   1/1     Running            0               25h
noobaa-operator-764b8f8569-tzf7b                   1/1     Running            0               23h
ocs-metrics-exporter-7c5d8b7bd9-q4q6n              1/1     Running            0               23h
ocs-operator-d7fd9f5fb-925zg                       1/1     Running            0               23h
odf-console-f987957d9-79bld                        1/1     Running            0               23h
odf-operator-controller-manager-7f97874489-bjkrl   2/2     Running            0               23h
rook-ceph-operator-7cb464db7d-mhq6n                0/1     CrashLoopBackOff   5 (2m37s ago)   5m58s
rook-ceph-tools-external-5f456fb6cb-rd7j9          1/1     Running            0               25h
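
The panic shown under "Additional info" can be pulled from the crash-looping operator pod; a sketch (pod name taken from the listing above; --previous returns the log of the last crashed container):

$ oc -n openshift-storage logs rook-ceph-operator-7cb464db7d-mhq6n --previous
$ oc -n openshift-storage describe pod rook-ceph-operator-7cb464db7d-mhq6n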




Expected results:

Upgrade should be successful and all pods should be in the Running state.


Additional info:

> csv events

$ oc describe csv ocs-operator.v4.10.0
Name:         ocs-operator.v4.10.0
Namespace:    openshift-storage
Labels:       full_version=4.10.0-189

Events:
  Type     Reason              Age                    From                        Message
  ----     ------              ----                   ----                        -------
  Normal   NeedsReinstall      157m (x600 over 23h)   operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Warning  InstallCheckFailed  7m48s (x310 over 22h)  operator-lifecycle-manager  install timeout
  Normal   InstallSucceeded    102s (x396 over 23h)   operator-lifecycle-manager  install strategy completed with no errors
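
A sketch of how to watch the install loop OLM reports above (CSV phase, the unavailable Deployment, and its events; namespace and names taken from the output above):

$ oc -n openshift-storage get csv ocs-operator.v4.10.0 -o jsonpath='{.status.phase}{"\n"}'
$ oc -n openshift-storage get deployment rook-ceph-operator
$ oc -n openshift-storage get events --field-selector involvedObject.name=rook-ceph-operator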


> rook-ceph-operator log shows a runtime panic (nil pointer dereference)

2022-03-16 13:40:21.917865 I | op-bucket-prov: successfully reconciled bucket provisioner
I0316 13:40:21.917924       1 manager.go:135] objectbucket.io/provisioner-manager "msg"="starting provisioner"  "name"="openshift-storage.ceph.rook.io/bucket"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1462005]

goroutine 1045 [running]:
github.com/rook/rook/pkg/apis/ceph.rook.io/v1.(*CephObjectStore).GetObjectKind(0x0, 0x0, 0x0)
	<autogenerated>:1 +0x5
github.com/rook/rook/pkg/operator/ceph/reporting.ReportReconcileResult(0xc00000c150, 0x23a9980, 0xc0009b47c0, 0x23df8f0, 0x0, 0xc000cb6c00, 0x0, 0x2370c80, 0xc001408150, 0xc001408150, ...)
	/remote-source/rook/app/pkg/operator/ceph/reporting/reporting.go:46 +0x4f
github.com/rook/rook/pkg/operator/ceph/object.(*ReconcileCephObjectStore).Reconcile(0xc0000c3080, 0x23afb78, 0xc000b14270, 0xc000f60f48, 0x11, 0xc0001649c0, 0x2b, 0xc000b14270, 0xc000b14210, 0xc000b42db0, ...)
	/remote-source/rook/app/pkg/operator/ceph/object/controller.go:159 +0xc9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000ae3220, 0x23afb78, 0xc000b14210, 0xc000f60f48, 0x11, 0xc0001649c0, 0x2b, 0xc000b14200, 0x0, 0x0, ...)
	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x247
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ae3220, 0x23afad0, 0xc0009b4fc0, 0x1e2bc00, 0xc000f9a4c0)
	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x305
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ae3220, 0x23afad0, 0xc0009b4fc0, 0x0)
	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000d3c930, 0xc000ae3220, 0x23afad0, 0xc0009b4fc0)
	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/rook/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x425
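
The trace shows GetObjectKind() being invoked on a nil *CephObjectStore inside ReportReconcileResult. A minimal Go sketch of that failure class (illustrative only, not the actual Rook code; type and function names are simplified stand-ins):

package main

import "fmt"

// ObjectStore stands in for the CephObjectStore CRD type.
type ObjectStore struct {
	kind string
}

// Kind dereferences the receiver, so calling it on a nil *ObjectStore panics.
func (o *ObjectStore) Kind() string {
	return o.kind
}

type kinded interface {
	Kind() string
}

// reportReconcileResult mimics a helper that assumes obj is never nil.
func reportReconcileResult(obj kinded) {
	fmt.Println("reconciled:", obj.Kind())
}

func main() {
	var store *ObjectStore // typed nil: a non-nil interface value wrapping a nil pointer
	reportReconcileResult(store)
	// panic: runtime error: invalid memory address or nil pointer dereference
}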

Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10829/consoleFull

must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu-bz2064107/vavuthu-bz2064107_20220315T110140/logs/must-gather/

Comment 4 Blaine Gardner 2022-03-16 16:49:20 UTC
This behavior looks the same as that from https://bugzilla.redhat.com/show_bug.cgi?id=2061675. Supposedly, the fix for 2061675 is present in the version being tested here, but I wonder if maybe it isn't present until the next release.

@vavuthu Is there a newer version of ODF 4.10 that can be used to re-test this behavior to see if it persists?

Comment 5 Vijay Avuthu 2022-03-17 10:53:22 UTC
(In reply to Blaine Gardner from comment #4)
> This behavior looks the same as that from
> https://bugzilla.redhat.com/show_bug.cgi?id=2061675. Supposedly, the fix for
> 2061675 is present in the version being tested here, but I wonder if maybe
> it isn't present until the next release.
> 
> @vavuthu Is there a newer version of ODF 4.10 that can be used to
> re-test this behavior to see if it persists?

Tested with the latest version of 4.10 (4.10.0-198) and didn't see the issue.

job ( 4.9 to 4.10 external upgrade ) : https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10914/

4.8 to 4.9 external upgrade : https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3542/console

> $ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.10.0              NooBaa Operator               4.10.0    mcg-operator.v4.9.4   Succeeded
ocs-operator.v4.10.0              OpenShift Container Storage   4.10.0    ocs-operator.v4.9.4   Succeeded
odf-csi-addons-operator.v4.10.0   CSI Addons                    4.10.0                          Succeeded
odf-operator.v4.10.0              OpenShift Data Foundation     4.10.0    odf-operator.v4.9.4   Succeeded
$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
csi-addons-controller-manager-57bbfd7479-gqv9w     2/2     Running   0          88m
csi-cephfsplugin-9rp65                             3/3     Running   0          87m
csi-cephfsplugin-khkdm                             3/3     Running   0          88m
csi-cephfsplugin-provisioner-579ddb8f44-96d4f      6/6     Running   0          88m
csi-cephfsplugin-provisioner-579ddb8f44-ckz7g      6/6     Running   0          88m
csi-cephfsplugin-t7wsp                             3/3     Running   0          87m
csi-rbdplugin-68vzt                                3/3     Running   0          88m
csi-rbdplugin-g4msp                                3/3     Running   0          87m
csi-rbdplugin-provisioner-58887668cb-4nqwf         6/6     Running   0          88m
csi-rbdplugin-provisioner-58887668cb-t745h         6/6     Running   0          88m
csi-rbdplugin-w657c                                3/3     Running   0          88m
noobaa-core-0                                      1/1     Running   0          87m
noobaa-db-pg-0                                     1/1     Running   0          87m
noobaa-endpoint-8469489b8f-gb98t                   1/1     Running   0          88m
noobaa-endpoint-8469489b8f-r6fk5                   1/1     Running   0          87m
noobaa-operator-56948bd958-jqz7g                   1/1     Running   0          89m
ocs-metrics-exporter-7fd6498c-9gkx2                1/1     Running   0          88m
ocs-operator-8b49d4986-zmckr                       1/1     Running   0          88m
odf-console-58b4b85cb-b2d6w                        1/1     Running   0          90m
odf-operator-controller-manager-7488dc497c-dwfpf   2/2     Running   0          90m
rook-ceph-operator-5c54b594f-96rtd                 1/1     Running   0          88m
rook-ceph-tools-external-594b6f7978-bjzv4          1/1     Running   0          3h3m
$ 

As this issue is not seen in the latest version, we can close this bug.

Comment 6 Blaine Gardner 2022-03-17 15:28:32 UTC
Great. Thanks. Closing this since it seems to have been a duplicate of 2061675 given that the issue can no longer be reproduced with the latest version.

*** This bug has been marked as a duplicate of bug 2061675 ***