Bug 1854651 - Converged Mode: ocs-operator in CrashLoopBackOff with empty labelSelector
Summary: Converged Mode: ocs-operator in CrashLoopBackOff with empty labelSelector
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Jose A. Rivera
QA Contact: akarsha
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-07 20:52 UTC by John Strunk
Modified: 2020-09-23 09:04 UTC
CC List: 6 users

Fixed In Version: 4.5.0-482.ci
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:18:18 UTC
Embargoed:




Links:
Github openshift/ocs-operator pull 623 (closed): Bug 1854651: [release 4.5] backport pr 618 - fix segfault. Last updated 2021-01-29 08:54:08 UTC
Red Hat Product Errata RHBA-2020:3754. Last updated 2020-09-15 10:18:47 UTC

Description John Strunk 2020-07-07 20:52:57 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

I'm attempting to deploy OCS using a StorageCluster with an empty labelSelector (i.e., one that doesn't require nodes to be labeled). Applying my storagecluster.yaml puts the ocs-operator into CrashLoopBackOff (CLBO).

Version of all relevant components (if applicable):

ocs-operator.v4.5.0-479.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This prevents deploying an OCS cluster

Is there any workaround available to the best of your knowledge?

no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2

Is this issue reproducible?

yes

Can this issue be reproduced from the UI?

unknown

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ocs-operator.v4.5.0-479.ci
2. Apply the StorageCluster below
3. Observe the ocs-operator pod going into CrashLoopBackOff


Actual results:

ocs-operator goes into CrashLoopBackOff due to the panic shown in the logs below

Expected results:

A usable cluster

Additional info:

---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ocs-catalogsource
  namespace: openshift-marketplace
  labels:
    ocs-operator-internal: "true"
spec:
  displayName: Openshift Container Storage
  icon:
    base64data: ""
    mediatype: ""
  image: quay.io/rhceph-dev/ocs-olm-operator:latest-4.5
  publisher: Red Hat
  sourceType: grpc

---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-subscription
  namespace: openshift-storage
spec:
  channel: stable-4.5
  name: ocs-operator
  source: ocs-catalogsource
  sourceNamespace: openshift-marketplace

---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  # The empty label selector removes the default so components can run on all
  # worker nodes (see the note after these manifests).
  labelSelector:
    matchExpressions: []
  manageNodes: false
  monPVCTemplate:
    spec:
      storageClassName: gp2
      accessModes:
        - ReadWriteOnce
  resources:
    mds:
      limits:
        cpu: 1000m
        memory: 4Gi
      requests:
        cpu: 1000m
        memory: 4Gi
    mgr:
      limits:
        cpu: 1000m
        memory: 512Mi
      requests:
        cpu: 1000m
        memory: 512Mi
    mon:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    noobaa-core:
      limits: {}
      requests: {}
    noobaa-db:
      limits: {}
      requests: {}
  storageDeviceSets:
    - name: mydeviceset
      count: 3
      dataPVCTemplate:
        spec:
          storageClassName: gp2
          accessModes:
            - ReadWriteOnce
          volumeMode: Block
          resources:
            requests:
              storage: 1000Gi
      placement: {}
      portable: true
      resources:
        limits:
          cpu: 1000m
          memory: 2Gi
        requests:
          cpu: 1000m
          memory: 2Gi
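
Side note on the labelSelector in the StorageCluster manifest above: Kubernetes treats a nil label selector and an empty label selector differently. The following is a minimal sketch using the upstream apimachinery helper (illustrative only, not ocs-operator code) showing why the empty selector lets components schedule on any worker node:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	workerLabels := labels.Set{"node-role.kubernetes.io/worker": ""}

	// A nil selector selects nothing.
	none, _ := metav1.LabelSelectorAsSelector(nil) // error ignored for brevity
	fmt.Println(none.Matches(workerLabels))        // false

	// An empty selector, like the one in the StorageCluster above,
	// selects everything, so components may land on any worker node.
	all, _ := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{},
	})
	fmt.Println(all.Matches(workerLabels)) // true
}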


ocs-operator logs:

{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"storagecluster-controller"}
{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"storagecluster-controller","worker count":1}
{"level":"info","ts":"2020-07-07T18:46:31.180Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-07-07T18:46:31.300Z","logger":"controller_storagecluster","msg":"not creating a CephObjectStore because the platform is 'aws'","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
E0707 18:46:31.544466       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 673 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x14ba8e0, 0x2398c20)
	/go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x14ba8e0, 0x2398c20)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newStorageClassDeviceSets(0xc0000d0400, 0x3, 0xc0009b8c30, 0x1)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:1040 +0x6a1
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newCephCluster(0xc0000d0400, 0xc00004800b, 0x61, 0xc, 0x18ab8c0, 0xc00060b200, 0xd4a2f12dba4b77ef)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:901 +0x170
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).ensureCephCluster(0xc0009dfe60, 0xc0000d0400, 0x18ab8c0, 0xc00060b200, 0x139e90a, 0xe)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:742 +0xe90
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile(0xc0009dfe60, 0xc000045ce0, 0x11, 0xc000045cc0, 0x12, 0xc000911cd8, 0xc0009de750, 0xc000616008, 0x1874a60)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:243 +0x64c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0009fc540, 0x150d180, 0xc0006a4000, 0x0)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0009fc540, 0xc000439900)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0009fc540)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000456f70)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000456f70, 0x3b9aca00, 0x0, 0x1edee691b801, 0xc0000a4600)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000456f70, 0x3b9aca00, 0xc0000a4600)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x132c931]

goroutine 673 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x105
panic(0x14ba8e0, 0x2398c20)
	/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newStorageClassDeviceSets(0xc0000d0400, 0x3, 0xc0009b8c30, 0x1)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:1040 +0x6a1
github.com/openshift/ocs-operator/pkg/controller/storagecluster.newCephCluster(0xc0000d0400, 0xc00004800b, 0x61, 0xc, 0x18ab8c0, 0xc00060b200, 0xd4a2f12dba4b77ef)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:901 +0x170
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).ensureCephCluster(0xc0009dfe60, 0xc0000d0400, 0x18ab8c0, 0xc00060b200, 0x139e90a, 0xe)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:742 +0xe90
github.com/openshift/ocs-operator/pkg/controller/storagecluster.(*ReconcileStorageCluster).Reconcile(0xc0009dfe60, 0xc000045ce0, 0x11, 0xc000045cc0, 0x12, 0xc000911cd8, 0xc0009de750, 0xc000616008, 0x1874a60)
	/go/src/github.com/openshift/ocs-operator/pkg/controller/storagecluster/reconcile.go:243 +0x64c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0009fc540, 0x150d180, 0xc0006a4000, 0x0)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0009fc540, 0xc000439900)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0009fc540)
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000456f70)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000456f70, 0x3b9aca00, 0x0, 0x1edee691b801, 0xc0000a4600)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000456f70, 0x3b9aca00, 0xc0000a4600)
	/go/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:193 +0x328
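
For context on the trace above: the panic in newStorageClassDeviceSets is a nil pointer dereference, the classic Go failure mode of reading through an optional pointer field that the user left unset. The sketch below uses hypothetical types and field names (it is not the real reconcile.go code; the actual fix is in the PRs linked from this bug) to show the pattern and the usual guard:

package sketch

// Hypothetical types and fields, for illustration only; these are not the
// real ocs-operator structures.
type Placement struct {
	TopologyKey string
}

type DeviceSetSpec struct {
	// Optional: nil when the StorageCluster leaves it unset.
	Placement *Placement
}

// Unguarded access: panics with "invalid memory address or nil pointer
// dereference" when spec.Placement is nil.
func topologyKeyUnsafe(spec *DeviceSetSpec) string {
	return spec.Placement.TopologyKey
}

// Guarded access: check for nil and fall back to a default.
func topologyKeySafe(spec *DeviceSetSpec) string {
	if spec.Placement == nil {
		return "kubernetes.io/hostname" // example default, not the real one
	}
	return spec.Placement.TopologyKey
}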

Comment 2 Michael Adam 2020-07-07 21:34:28 UTC
This is legit. And we need to fix it.
Not sure how QE could verify https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)

Comment 3 Jose A. Rivera 2020-07-08 03:57:27 UTC
PR is upstream: https://github.com/openshift/ocs-operator/pull/618

Comment 5 Michael Adam 2020-07-08 07:38:15 UTC
(In reply to Michael Adam from comment #2)
> This is legit. And we need to fix it.
> Not sure how QE could verify
> https://bugzilla.redhat.com/show_bug.cgi?id=1846389 :-)

To explain: that BZ was for independent mode only, and it seems we're not hitting the crash in the code path for independent mode.

Comment 8 Michael Adam 2020-07-08 22:58:48 UTC
https://github.com/openshift/ocs-operator/pull/623

Backport PR.

Comment 9 Michael Adam 2020-07-09 06:20:55 UTC
merged

Comment 10 Michael Adam 2020-07-09 07:25:23 UTC
https://ceph-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OCS%20Build%20Pipeline%204.5/61/ contains the fix: 4.5.0-482.ci

Comment 11 akarsha 2020-07-15 09:45:09 UTC
Tested on an AWS IPI environment (converged mode):
- 3 Masters
- 3 Workers

Version:
OCP: 4.5.0-0.nightly-2020-07-14-213353
OCS: ocs-operator.v4.5.0-487.ci

Steps performed:
1. Created OCP cluster using ocs-ci
2. Ran deploy-olm.yaml
   $ oc create -f deploy-olm.yaml
3. Did subscription through UI
4. Ran the storagecluster.yaml
   $ oc create -f storagecluster.yaml

Observations:
We don't see any problems with the ocs-operator, and all pods are up and running.


@John Strunk, are the verification steps correct, or are we missing something? Do we need to validate on independent mode too?


Additional information:

$ oc get pods -n openshift-storage 
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-28bvw                                            3/3     Running     0          13m
csi-cephfsplugin-lfmxk                                            3/3     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-q7hbm                     5/5     Running     0          13m
csi-cephfsplugin-provisioner-65c858dcb7-xj7hp                     5/5     Running     0          13m
csi-cephfsplugin-rddkh                                            3/3     Running     0          13m
csi-rbdplugin-h8bdk                                               3/3     Running     0          13m
csi-rbdplugin-pd9cq                                               3/3     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-6n6bt                         5/5     Running     0          13m
csi-rbdplugin-provisioner-b6b697b66-9wmz5                         5/5     Running     0          13m
csi-rbdplugin-tdsxm                                               3/3     Running     0          13m
noobaa-core-0                                                     1/1     Running     0          10m
noobaa-db-0                                                       1/1     Running     0          10m
noobaa-endpoint-758cbdd6d4-hj6wl                                  1/1     Running     0          9m20s
noobaa-operator-5f9d557669-2xg6g                                  1/1     Running     0          16m
ocs-operator-75b4fbfbff-q9t9p                                     1/1     Running     0          16m
rook-ceph-crashcollector-ip-10-0-135-206-7566bc5678-27s5d         1/1     Running     0          12m
rook-ceph-crashcollector-ip-10-0-185-61-78d5ffb9b4-5dwnz          1/1     Running     0          11m
rook-ceph-crashcollector-ip-10-0-199-13-76cf7d686-skk2q           1/1     Running     0          12m
rook-ceph-drain-canary-0e68ef29218a4256e368ebd8f2e7bd14-7cff4dx   1/1     Running     0          11m
rook-ceph-drain-canary-1dbf9852097ecaf2d538dccc5663ece1-65xmb9t   1/1     Running     0          10m
rook-ceph-drain-canary-f3fa4531e5fce199d25d2b6649d283da-69jnb2v   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78dbfbcdjmq7f   1/1     Running     0          10m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-575f8c6bhr7jd   1/1     Running     0          10m
rook-ceph-mgr-a-6468b57f74-4zvhn                                  1/1     Running     0          11m
rook-ceph-mon-a-7959984c64-t7thn                                  1/1     Running     0          12m
rook-ceph-mon-b-5b4f9fb78b-tj7cp                                  1/1     Running     0          12m
rook-ceph-mon-c-6b6d5ccfd6-vbb2p                                  1/1     Running     0          11m
rook-ceph-operator-7cd55d84f6-hzsbf                               1/1     Running     0          16m
rook-ceph-osd-0-5cb8765454-l4nt2                                  1/1     Running     0          11m
rook-ceph-osd-1-795968c964-nh96d                                  1/1     Running     0          10m
rook-ceph-osd-2-76f96c57c5-gkhh7                                  1/1     Running     0          10m
rook-ceph-osd-prepare-mydeviceset-0-data-0-fx97q-c5x4s            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-1-data-0-jwxf8-wrgt6            0/1     Completed   0          11m
rook-ceph-osd-prepare-mydeviceset-2-data-0-8x7rb-hp88r            0/1     Completed   0          11m

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml
.
.
spec:
  externalStorage: {}
  labelSelector: {}
.
.
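
Note that the stored spec shows labelSelector: {} even though the manifest above submitted matchExpressions: []. Both fields of metav1.LabelSelector carry omitempty JSON tags, so the empty list is dropped on serialization and what is stored is still an empty, non-nil selector, i.e. the same "match everything" case as in the reproducer. A minimal sketch of that round-trip (upstream apimachinery type, illustrative only):

package main

import (
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// What the reproducer manifest submits: an empty matchExpressions list.
	sel := metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{},
	}

	out, _ := json.Marshal(&sel) // error ignored for brevity
	// Prints {}: the empty list is dropped by omitempty, but the selector
	// itself remains a non-nil, match-everything selector.
	fmt.Println(string(out))
}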


deploy-olm.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/deploy-olm.yaml
storagecluster.yaml: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/storagecluster.yaml
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1854651/bz1854651/

Comment 12 John Strunk 2020-07-15 15:39:15 UTC
My understanding is that the empty labelSelector is enough to trigger the bug in affected versions.

I have been running successfully with 4.5.0-485.ci using the same StorageCluster that caused the initial panic, so I believe this is fixed.

Comment 13 akarsha 2020-07-24 06:39:36 UTC
Moving the BZ to Verified based on Comment #11 and Comment #12. The conclusion is that with an empty labelSelector we don't see any problems with the ocs-operator, and all pods were up and running.

Comment 16 errata-xmlrpc 2020-09-15 10:18:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

