Description of problem (please be as detailed as possible and provide log snippets):

I'm attempting to deploy OCS using a StorageCluster with an empty labelSelector (i.e., one that doesn't require nodes to be labeled). The result of applying my storagecluster.yaml is CrashLoopBackOff (CLBO) of the ocs-operator.

Version of all relevant components (if applicable):
ocs-operator.v4.5.0-479.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This prevents deploying an OCS cluster.

Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
unknown

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy ocs-operator.v4.5.0-479.ci
2. Add the StorageCluster below

Actual results:
ocs-operator CLBO due to panic

Expected results:
A usable cluster

Additional info:

---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalogsource
  namespace: openshift-marketplace
  labels:
    ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-olm-operator:latest-4.5
  publisher: Red Hat
  sourceType: grpc
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-subscription
  namespace: openshift-storage
spec:
  channel: stable-4.5
  name: ocs-operator
  source: ocs-catalogsource
  sourceNamespace: openshift-marketplace
---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  # The empty label selector removes the default so components can run on all
  # worker nodes.
  labelSelector:
    matchExpressions: []
  manageNodes: false
  monPVCTemplate:
    spec:
      storageClassName: gp2
      accessModes:
        - ReadWriteOnce
  resources:
    mds:
      limits:
        cpu: 1000m
        memory: 4Gi
      requests:
        cpu: 1000m
        memory: 4Gi
    mgr:
      limits:
        cpu: 1000m
        memory: 512Mi
      requests:
        cpu: 1000m
        memory: 512Mi
    mon:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    noobaa-core:
      limits: {}
      requests: {}
    noobaa-db:
      limits: {}
      requests: {}
  storageDeviceSets:
    - name: mydeviceset
      count: 3
      dataPVCTemplate:
        spec:
          storageClassName: gp2
          accessModes:
            - ReadWriteOnce
          volumeMode: Block
          resources:
            requests:
              storage: 1000Gi
      placement: {}
      portable: true
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 1000m
          memory: 2Gi

ocs-operator logs:

{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"storagecluster-controller"}
{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"storagecluster-controller","worker count":1}
{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-07-07T18:46:31.300Z","logger":"controller_storagecluster","msg":"not creating a CephObjectStore because the platform is 'aws'","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
E0707 18:46:31.544466       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 673 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x14ba8e0, 0x2398c20)
        /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x14ba8e0, 0x2398c20)
        /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newStorageClassDeviceSets(0xc0000d0400, 0x3, 0xc0009b8c30, 0x1)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:1040 +0x6a1
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newCephCluster(0xc0000d0400, 0xc00004800b, 0x61, 0xc, 0x18ab8c0, 0xc00060b200, 0xd4a2f12dba4b77ef)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:901 +0x170
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).ensureCephCluster(0xc0009dfe60, 0xc0000d0400, 0x18ab8c0, 0xc00060b200, 0x139e90a, 0xe)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:742 +0xe90
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile(0xc0009dfe60, 0xc000045ce0, 0x11, 0xc000045cc0, 0x12, 0xc000911cd8, 0xc0009de750, 0xc000616008, 0x1874a60)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:243 +0x64c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0009fc540, 0x150d180, 0xc0006a4000, 0x0)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0009fc540, 0xc000439900)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0009fc540)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000456f70)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000456f70, 0x3b9aca00, 0x0, 0x1edee691b801, 0xc0000a4600)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000456f70, 0x3b9aca00, 0xc0000a4600)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x132c931]

goroutine 673 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x14ba8e0, 0x2398c20)
        /opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newStorageClassDeviceSets(0xc0000d0400, 0x3, 0xc0009b8c30, 0x1)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:1040 +0x6a1
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newCephCluster(0xc0000d0400, 0xc00004800b, 0x61, 0xc, 0x18ab8c0, 0xc00060b200, 0xd4a2f12dba4b77ef)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:901 +0x170
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).ensureCephCluster(0xc0009dfe60, 0xc0000d0400, 0x18ab8c0, 0xc00060b200, 0x139e90a, 0xe)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:742 +0xe90
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile(0xc0009dfe60, 0xc000045ce0, 0x11, 0xc000045cc0, 0x12, 0xc000911cd8, 0xc0009de750, 0xc000616008, 0x1874a60)
        /go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:243 +0x64c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0009fc540, 0x150d180, 0xc0006a4000, 0x0)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0009fc540, 0xc000439900)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0009fc540)
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000456f70)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000456f70, 0x3b9aca00, 0x0, 0x1edee691b801, 0xc0000a4600)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000456f70, 0x3b9aca00, 0xc0000a4600)
        /go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
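For readers of the trace: the nil dereference happens in newStorageClassDeviceSets while building the CephCluster. As a loose illustration of the class of fix involved (a minimal sketch only, assuming the crash comes from dereferencing an optional selector/placement field without a nil check; this is not the actual patch, see the PR linked below, and the helper name is hypothetical):

// Hypothetical illustration only -- not the actual change from the fix PR.
package main

import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultLabelSelector shows the nil-guard pattern: an optional
// *metav1.LabelSelector is only dereferenced after checking for nil,
// falling back to an empty selector (which matches everything).
func defaultLabelSelector(sel *metav1.LabelSelector) *metav1.LabelSelector {
        if sel == nil {
                return &metav1.LabelSelector{}
        }
        // Copy so callers don't mutate the StorageCluster spec.
        return sel.DeepCopy()
}

func main() {
        // A nil selector no longer causes a nil-pointer panic.
        fmt.Printf("%+v\n", defaultLabelSelector(nil))

        // An empty matchExpressions list (as in the StorageCluster above) also works.
        empty := &metav1.LabelSelector{MatchExpressions: []metav1.LabelSelectorRequirement{}}
        fmt.Printf("%+v\n", defaultLabelSelector(empty))
}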
This is legit. And we need to fix it. Not sure how QE could verify https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)
PR is upstream: https://github.com/openshift/ocs-operator/pull/618
(In reply to Michael Adam from comment #2)
> This is legit. And we need to fix it.
> Not sure how QE could verify
> https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)

To explain: that BZ was for independent mode only, and it seems we're not hitting the crash in the code path for independent mode.
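For contrast with the converged-mode StorageCluster above, independent (external) mode is driven by spec.externalStorage rather than local storageDeviceSets. A minimal, hypothetical sketch (the name and values are illustrative, not taken from this BZ) would look like:

---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-independent-storagecluster   # hypothetical name
  namespace: openshift-storage
spec:
  # Independent/external mode consumes an existing external Ceph cluster,
  # so it takes a different reconcile path than the converged-mode one that crashed here.
  externalStorage:
    enable: true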
Backport PR: https://github.com/openshift/ocs-operator/pull/623
merged
https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OCS%20Build%20Pipeline%204.5/61/ contains the fix 4.5.0-482.ci
Tested on AWS IPI environment (converged mode):
- 3 Masters
- 3 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-14-213353
OCS: ocs-operator.v4.5.0-487.ci

Steps performed:
1. Created OCP cluster using ocs-ci
2. Ran deploy-olm.yaml
   $ oc create -f deploy-olm.yaml
3. Did the subscription through the UI
4. Ran the storagecluster.yaml
   $ oc create -f storagecluster.yaml

Observations:
Don't see any problem with the ocs-operator, and all pods are up and running.

@John Strunk, are these verification steps correct, or are we missing something? Do we need to validate on independent mode too?

Additional information:

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-28bvw                                            3/3     Running     0          13m
csi-cephfsplugin-lfmxk                                            3/3     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-q7hbm                     5/5     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-xj7hp                     5/5     Running     0          13m
csi-cephfsplugin-rddkh                                            3/3     Running     0          13m
csi-rbdplugin-h8bdk                                               3/3     Running     0          13m
csi-rbdplugin-pd9cq                                               3/3     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-6n6bt                         5/5     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-9wmz5                         5/5     Running     0          13m
csi-rbdplugin-tdsxm                                               3/3     Running     0          13m
noobaa-core-0                                                     1/1     Running     0          10m
noobaa-db-0                                                       1/1     Running     0          10m
noobaa-endpoint-758cbdd6d4-hj6wl                                  1/1     Running     0          9m20s
noobaa-operator-5f9d557669-2xg6g                                  1/1     Running     0          16m
ocs-operator-75b4fbfbff-q9t9p                                     1/1     Running     0          16m
rook-ceph-crashcollector-ip-10-0-135-206-7566bc5678-27s5d         1/1     Running     0          12m
rook-ceph-crashcollector-ip-10-0-185-61-78d5ffb9b4-5dwnz          1/1     Running     0          11m
rook-ceph-crashcollector-ip-10-0-199-13-76cf7d686-skk2q           1/1     Running     0          12m
rook-ceph-drain-canary-0e68ef29218a4256e368ebd8f2e7bd14-7cff4dx   1/1     Running     0          11m
rook-ceph-drain-canary-1dbf9852097ecaf2d538dccc5663ece1-65xmb9t   1/1     Running     0          10m
rook-ceph-drain-canary-f3fa4531e5fce199d25d2b6649d283da-69jnb2v   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78dbfbcdjmq7f   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-575f8c6bhr7jd   1/1     Running     0          10m
rook-ceph-mgr-a-6468b57f74-4zvhn                                  1/1     Running     0          11m
rook-ceph-mon-a-7959984c64-t7thn                                  1/1     Running     0          12m
rook-ceph-mon-b-5b4f9fb78b-tj7cp                                  1/1     Running     0          12m
rook-ceph-mon-c-6b6d5ccfd6-vbb2p                                  1/1     Running     0          11m
rook-ceph-operator-7cd55d84f6-hzsbf                               1/1     Running     0          16m
rook-ceph-osd-0-5cb8765454-l4nt2                                  1/1     Running     0          11m
rook-ceph-osd-1-795968c964-nh96d                                  1/1     Running     0          10m
rook-ceph-osd-2-76f96c57c5-gkhh7                                  1/1     Running     0          10m
rook-ceph-osd-prepare-mydeviceset-0-data-0-fx97q-c5x4s            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-1-data-0-jwxf8-wrgt6            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-2-data-0-8x7rb-hp88r            0/1     Completed   0          11m

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml
.
.
spec:
  externalStorage: {}
  labelSelector: {}
.
.

deploy-olm.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/deploy-olm.yaml
storagecluster.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/storagecluster.yaml
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/
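As a supplement to the listing above, a couple of quick checks can confirm the operator is not crash-looping and that the empty labelSelector was accepted (standard oc commands; the status.phase field name is an assumption about the StorageCluster status layout):

$ oc get pods -n openshift-storage | grep ocs-operator
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.labelSelector}{"\n"}'
$ oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'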
My understanding is that the empty labelSelector is enough to trigger the bug in affected versions. I have been running successfully with 4.5.0-485.ci using the same StorageCluster that caused the initial panic, so I believe this is fixed.
Moving the BZ to VERIFIED based on Comment #11 and Comment #12. The conclusion is that with an empty labelSelector we don't see any problem with the ocs-operator, and all pods were up and running.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754