Bug 2049509

Summary: ocs operator stuck on CrashLoopBackOff while installing with KMS
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: aberner
Component: ocs-operator
Assignee: Jiffin <jthottan>
Status: CLOSED ERRATA
QA Contact: aberner
Severity: urgent
Priority: unspecified
Version: 4.10
CC: ikave, jthottan, madam, muagarwa, nberry, nibalach, ocs-bugs, odf-bz-bot, prasriva, rgeorge, sostapov
Target Milestone: ---
Keywords: AutomationBackLog, Regression, TestBlocker
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.10.0-141
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2022-04-13 18:52:41 UTC
Type: Bug

Description aberner 2022-02-02 11:20:46 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While trying to install an ODF StorageSystem with Vault as the KMS, the ocs operator gets stuck in CrashLoopBackOff, with an error message in the operator logs saying:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x170d20a]


Version of all relevant components (if applicable):
ocp: 4.10.0-0.nightly-2022-01-31-012936
odf: 4.10.0-133


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the ocs operator is unavailable.


Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
unknown

Can this issue be reproduced from the UI?
The failure occurred during an installation performed through the UI; whether it reproduces there is unknown.

If this is a regression, please provide more details to justify this:
Deploying with KMS worked in previous versions, so yes, this is a regression.

Steps to Reproduce:
1. Install an OCP cluster.
2. Install the ODF operator.
3. Create a StorageSystem with a full deployment and connect it to a KMS (Vault); a hedged sketch of the corresponding StorageCluster settings follows these steps.
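
The sketch below is illustrative only: it renders, in Go, a minimal StorageCluster manifest of the kind the UI wizard creates for step 3. The field paths (spec.encryption.enable, spec.encryption.kms.enable) and the ocs-kms-connection-details ConfigMap mentioned in the comments are assumptions based on the ODF 4.x encryption/KMS options, not values taken from this bug.

package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// Assumed field paths for cluster-wide encryption backed by an external KMS.
	// The Vault connection details themselves are expected to live in the
	// ocs-kms-connection-details ConfigMap in openshift-storage (assumption).
	storageCluster := map[string]interface{}{
		"apiVersion": "ocs.openshift.io/v1",
		"kind":       "StorageCluster",
		"metadata": map[string]interface{}{
			"name":      "ocs-storagecluster",
			"namespace": "openshift-storage",
		},
		"spec": map[string]interface{}{
			"encryption": map[string]interface{}{
				"enable": true,
				"kms": map[string]interface{}{
					"enable": true,
				},
			},
		},
	}

	out, err := yaml.Marshal(storageCluster)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}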


Actual results:
Cluster comes up unhealthy with the ocs operator stuck in CrashLoopBackOff

Expected results:
Cluster comes up healthy and operational 

Additional info:

Comment 2 aberner 2022-02-02 14:31:50 UTC
Was able to reproduce in ODF 4.10.0-137.

Comment 3 aberner 2022-02-02 14:39:03 UTC
Both failures occurred on vSphere, while a deployment on AWS with KMS enabled succeeded, so we suspect the issue is platform related.

Comment 7 Mudit Agarwal 2022-02-03 01:07:41 UTC
Thanks Amit!!

{"level":"info","ts":1643849128.0077307,"logger":"cmd","msg":"Go Version: go1.16.6"}
{"level":"info","ts":1643849128.0080402,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
I0203 00:45:29.057166       1 request.go:668] Waited for 1.0379668s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
{"level":"info","ts":1643849130.7604914,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1643849130.7768652,"logger":"cmd","msg":"OCSInitialization resource already exists"}
{"level":"info","ts":1643849133.5396976,"logger":"cmd","msg":"starting manager"}
I0203 00:45:33.539974       1 leaderelection.go:243] attempting to acquire leader lease openshift-storage/ab76f4c9.openshift.io...
{"level":"info","ts":1643849133.5400252,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
I0203 00:45:51.972963       1 leaderelection.go:253] successfully acquired lease openshift-storage/ab76f4c9.openshift.io
{"level":"info","ts":1643849151.9731855,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9731698,"logger":"controller-runtime.manager.controller.storageconsumer","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageConsumer","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9732287,"logger":"controller-runtime.manager.controller.persistentvolume","msg":"Starting EventSource","reconciler group":"","reconciler kind":"PersistentVolume","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.973263,"logger":"controller-runtime.manager.controller.persistentvolume","msg":"Starting Controller","reconciler group":"","reconciler kind":"PersistentVolume"}
{"level":"info","ts":1643849151.9732392,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9739225,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9739504,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9739673,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9739735,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9739833,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting Controller","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster"}
{"level":"info","ts":1643849151.974249,"logger":"controller-runtime.manager.controller.storageconsumer","msg":"Starting Controller","reconciler group":"ocs.openshift.io","reconciler kind":"StorageConsumer"}
{"level":"info","ts":1643849151.9743242,"logger":"controller-runtime.manager.controller.ocsinitialization","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"OCSInitialization","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9743755,"logger":"controller-runtime.manager.controller.ocsinitialization","msg":"Starting EventSource","reconciler group":"ocs.openshift.io","reconciler kind":"OCSInitialization","source":"kind source: /, Kind="}
{"level":"info","ts":1643849151.9743888,"logger":"controller-runtime.manager.controller.ocsinitialization","msg":"Starting Controller","reconciler group":"ocs.openshift.io","reconciler kind":"OCSInitialization"}
{"level":"info","ts":1643849152.0760622,"logger":"controller-runtime.manager.controller.storageconsumer","msg":"Starting workers","reconciler group":"ocs.openshift.io","reconciler kind":"StorageConsumer","worker count":1}
{"level":"info","ts":1643849152.0762343,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Starting workers","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","worker count":1}
{"level":"info","ts":1643849152.0763292,"logger":"controller-runtime.manager.controller.persistentvolume","msg":"Starting workers","reconciler group":"","reconciler kind":"PersistentVolume","worker count":1}
{"level":"info","ts":1643849152.0763702,"logger":"controllers.StorageCluster","msg":"Reconciling StorageCluster.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","StorageCluster":"openshift-storage/ocs-storagecluster"}
{"level":"info","ts":1643849152.0764022,"logger":"controllers.StorageCluster","msg":"Spec.AllowRemoteStorageConsumers is disabled","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1643849152.0764332,"logger":"controller-runtime.manager.controller.ocsinitialization","msg":"Starting workers","reconciler group":"ocs.openshift.io","reconciler kind":"OCSInitialization","worker count":1}
{"level":"info","ts":1643849152.0765593,"logger":"controllers.OCSInitialization","msg":"Reconciling OCSInitialization.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","OCSInitialization":"openshift-storage/ocsinit"}
{"level":"info","ts":1643849152.082092,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"rook-ceph"}
{"level":"info","ts":1643849152.0877435,"logger":"controllers.StorageCluster","msg":"Resource deletion for provider succeeded","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":1643849152.091996,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"rook-ceph-csi"}
{"level":"info","ts":1643849152.101062,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"noobaa"}
{"level":"info","ts":1643849152.1127658,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"noobaa-endpoint"}
{"level":"info","ts":1643849152.1313975,"logger":"controllers.OCSInitialization","msg":"Reconciling OCSInitialization.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","OCSInitialization":"openshift-storage/ocsinit"}
{"level":"info","ts":1643849152.1351314,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"rook-ceph"}
{"level":"info","ts":1643849152.1491027,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"rook-ceph-csi"}
{"level":"info","ts":1643849152.160247,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"noobaa"}
{"level":"info","ts":1643849152.1706407,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":"noobaa-endpoint"}
{"level":"info","ts":1643849152.9305122,"logger":"controllers.StorageCluster","msg":"Restoring original CephBlockPool.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephBlockPool":"openshift-storage/ocs-storagecluster-cephblockpool"}
{"level":"info","ts":1643849153.0419562,"logger":"controllers.StorageCluster","msg":"Restoring original CephFilesystem.","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster","CephFileSystem":"openshift-storage/ocs-storagecluster-cephfilesystem"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x170d20a]

goroutine 884 [running]:
github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).newCephObjectStoreInstances(0xc00048a0c0, 0xc000c32000, 0xc000bc0b40, 0x1e2bf80, 0xc00059cb40, 0xc000bc0b40, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephobjectstores.go:218 +0x96a
github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*ocsCephObjectStores).ensureCreated(0x2a22f90, 0xc00048a0c0, 0xc000c32000, 0x0, 0x0, 0x0, 0x0)
	/remote-source/app/controllers/storagecluster/cephobjectstores.go:59 +0x12c
github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).reconcilePhases(0xc00048a0c0, 0xc000c32000, 0xc000951140, 0x11, 0xc000951128, 0x12, 0x0, 0x0, 0xc000c32000, 0x0)
	/remote-source/app/controllers/storagecluster/reconcile.go:394 +0xd08
github.com/red-hat-storage/ocs-operator/controllers/storagecluster.(*StorageClusterReconciler).Reconcile(0xc00048a0c0, 0x1e174b8, 0xc000e44f90, 0xc000951140, 0x11, 0xc000951128, 0x12, 0xc000e44f00, 0x0, 0x0, ...)
	/remote-source/app/controllers/storagecluster/reconcile.go:161 +0x6c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000ba25a0, 0x1e17410, 0xc0005b94c0, 0x19e0a40, 0xc0007ec080)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000ba25a0, 0x1e17410, 0xc0005b94c0, 0xc000c82f00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000e07b70, 0xc000ba25a0, 0x1e17410, 0xc0005b94c0)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
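
The trace above points at newCephObjectStoreInstances (cephobjectstores.go:218), which builds the CephObjectStore resources during StorageCluster reconciliation; with KMS enabled, an optional value appears to be dereferenced without a guard. The snippet below is a minimal, self-contained sketch of that failure pattern and the nil guard that avoids it. The types and names in it (kmsConfig, objectStoreSpec, newObjectStoreInstance) are hypothetical and are not taken from the ocs-operator source; it illustrates the class of bug, not the actual fix shipped in 4.10.0-141.

package main

import "fmt"

// kmsConfig is a hypothetical stand-in for an optional KMS configuration.
type kmsConfig struct {
	Provider string
	Address  string
}

// objectStoreSpec is a hypothetical stand-in for a CephObjectStore spec.
type objectStoreSpec struct {
	Name string
	KMS  *kmsConfig // nil when KMS is not configured for object stores
}

// newObjectStoreInstance builds a spec, guarding the optional KMS pointer
// instead of dereferencing it unconditionally (the unguarded dereference is
// the pattern that produces a SIGSEGV like the one in the trace above).
func newObjectStoreInstance(name string, kms *kmsConfig) objectStoreSpec {
	spec := objectStoreSpec{Name: name}
	if kms != nil && kms.Provider != "" {
		spec.KMS = kms
	}
	return spec
}

func main() {
	// With KMS enabled but its config not yet resolved (nil), no panic occurs.
	fmt.Printf("%+v\n", newObjectStoreInstance("ocs-storagecluster-cephobjectstore", nil))
}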

Comment 8 Mudit Agarwal 2022-02-03 03:09:58 UTC
Jiffin/Pranshu, PTAL

Comment 9 Mudit Agarwal 2022-02-03 03:30:15 UTC
Are we supposed to test this on bare metal?
Was it tested before, or is it being tested for the first time in 4.10?

Comment 18 aberner 2022-03-01 09:32:56 UTC
Verified on ODF version 4.10.0-143.

Comment 20 errata-xmlrpc 2022-04-13 18:52:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372