Bug 2092372

Summary: [MS v2] StorageClassClaim is not reaching Ready Phase
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.11
Target Release: ODF 4.11.0
Fixed In Version: 4.11.0-96
Reporter: Jilju Joy <jijoy>
Assignee: Madhu Rajanna <mrajanna>
QA Contact: Jilju Joy <jijoy>
CC: madam, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, sostapov, srai
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: TestBlocker
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-08-24 13:54:12 UTC

Description Jilju Joy 2022-06-01 11:15:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The StorageClassClaim created on a consumer cluster is not reaching the Ready phase.

The following error appears when creating a StorageClassClaim in the openshift-storage namespace.
$ oc get storageclassclaim -n openshift-storage
NAME                             STORAGETYPE        PHASE
test-storageclassclaim-cephfs1   sharedfilesystem   Configuring

ocs-operator logs:

{"level":"info","ts":1654078402.6853268,"logger":"controller.storageclassclaim","msg":"Reconciling StorageClassClaim.","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"test-storageclassclaim-cephfs1","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/test-storageclassclaim-cephfs1"}
{"level":"info","ts":1654078402.6853898,"logger":"controller.storageclassclaim","msg":"Running StorageClassClaim controller in Consumer Mode","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"test-storageclassclaim-cephfs1","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/test-storageclassclaim-cephfs1"}
{"level":"error","ts":1654078402.7412593,"logger":"controller.storageclassclaim","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"test-storageclassclaim-cephfs1","namespace":"openshift-storage","error":"failed to get StorageClassClaim config: rpc error: code = Unavailable desc = storage class claim \"test-storageclassclaim-cephfs1\" for \"56f2103b-264f-45cf-887d-edb7c2bffaae\" is in \"Creating\" phase","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}


must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-j1-c1/jijoy-j1-c1_20220601T043509/logs/testcases_1654079214/

The phase of the storage class claim is the same as initially reported in bug #2089552, but the error is different. The workaround is to respin the ocs-operator pods on both consumer and provider.

===============================================================
Version of all relevant components (if applicable):
$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.11.0                      NooBaa Operator               4.11.0            mcg-operator.v4.10.2                      Succeeded
ocs-operator.v4.11.0                      OpenShift Container Storage   4.11.0            ocs-operator.v4.10.2                      Succeeded
ocs-osd-deployer.v2.0.2                   OCS OSD Deployer              2.0.2             ocs-osd-deployer.v2.0.1                   Succeeded
odf-csi-addons-operator.v4.11.0           CSI Addons                    4.11.0            odf-csi-addons-operator.v4.10.2           Succeeded
odf-operator.v4.11.0                      OpenShift Data Foundation     4.11.0            odf-operator.v4.10.2                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.418-6459408   Route Monitor Operator        0.1.418-6459408   route-monitor-operator.v0.1.408-c2256a2   Succeeded


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.15   True        False         6h1m    Cluster version is 4.10.15



$ oc get csv odf-operator.v4.11.0 -o yaml | grep full_version
    full_version: 4.11.0-85

===========================================================================
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. Cannot create StorageClassClaim.


Is there any workaround available to the best of your knowledge?
Respin ocs-operator pods on consumer and provider.
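A sketch of the respin, assuming the `name=ocs-operator` pod label shown later in this report and the `openshift-storage` namespace; run it against each cluster (consumer and provider) in turn:

```shell
# Delete the ocs-operator pod; its Deployment recreates it automatically.
oc delete pod -l name=ocs-operator -n openshift-storage

# Wait for the replacement pod to become Ready before retrying the claim.
oc wait pod -l name=ocs-operator -n openshift-storage \
  --for=condition=Ready --timeout=120s
```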


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create StorageClassClaim on consumer cluster openshift-storage namespace using the yaml given below.
apiVersion: ocs.openshift.io/v1alpha1
kind: StorageClassClaim
metadata:
  name: test-storageclassclaim-cephfs1
spec:
  type: sharedfilesystem

2. Verify the phase of the StorageClassClaim

Actual results:
$ oc get storageclassclaim -n openshift-storage
NAME                             STORAGETYPE        PHASE
test-storageclassclaim-cephfs1   sharedfilesystem   Configuring

Expected results:
PHASE should be Ready. A new storage class should be created.

Additional info:

Comment 5 Jilju Joy 2022-06-09 10:44:25 UTC
Tested in ODF 4.11.0-90

Storageclassclaims with type "sharedfilesystem" are reaching the Ready phase, but there are errors in the ocs-operator logs as described below.
Storageclassclaims with type "blockpool" are not reaching the Ready phase. There are errors in the ocs-operator logs on both consumer and provider.

ocs-operator pod on provider cluster got into CrashLoopBackOff state.

From provider cluster:
$ oc get pods -l name=ocs-operator
NAME                            READY   STATUS             RESTARTS       AGE
ocs-operator-548d896c89-z7nmg   0/1     CrashLoopBackOff   39 (94s ago)   3h12m

$ oc get csv ocs-operator.v4.11.0
NAME                   DISPLAY                       VERSION   REPLACES               PHASE
ocs-operator.v4.11.0   OpenShift Container Storage   4.11.0    ocs-operator.v4.10.2   Installing 


-----------------------------------------

Output from consumer:


$ oc get storageclassclaim -A
NAMESPACE           NAME                                          STORAGETYPE        PHASE
openshift-storage   test-storageclassclaim-cephfs                 sharedfilesystem   Ready
openshift-storage   test-storageclassclaim-rbd                    blockpool          Configuring
test-project        test-storageclassclaim-cephfs-test-project    sharedfilesystem   Ready
test-project        test-storageclassclaim-cephfs2-test-project   sharedfilesystem   Ready
test-project        test-storageclassclaim-rbd-test-project       blockpool          Configuring


$ oc get storageclassclaim -n test-project -n openshift-storage
NAME                            STORAGETYPE        PHASE
test-storageclassclaim-cephfs   sharedfilesystem   Ready
test-storageclassclaim-rbd      blockpool          Configuring



$ oc get sc
NAME                                          PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)                                 kubernetes.io/aws-ebs                   Delete          WaitForFirstConsumer   true                   24h
gp2-csi                                       ebs.csi.aws.com                         Delete          WaitForFirstConsumer   true                   24h
gp3-csi                                       ebs.csi.aws.com                         Delete          WaitForFirstConsumer   true                   24h
ocs-storagecluster-ceph-rbd                   openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   23h
ocs-storagecluster-cephfs                     openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   23h
test-storageclassclaim-cephfs                 openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   14h
test-storageclassclaim-cephfs-test-project    openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   14h
test-storageclassclaim-cephfs2-test-project   openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   2m24s


-------------------------------------------

Output from provider:

$ oc get storageclassclaim -A
NAMESPACE           NAME                                                 STORAGETYPE        PHASE
openshift-storage   storageclassclaim-3cb007f76e37dde41384f05bace3a916   sharedfilesystem   Ready
openshift-storage   storageclassclaim-54d0f4199966db79472bea3f118bd4bb   sharedfilesystem   Ready
openshift-storage   storageclassclaim-79ae51581e9ea848deddb135d3e7e324   blockpool          
openshift-storage   storageclassclaim-84652ffb5666282fa23a5deb45b5e859   blockpool          
openshift-storage   storageclassclaim-ccd98680a21d5146dbb28892ec353f87   sharedfilesystem   Ready


$ oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   28h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   28h
gp3-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   28h


$ oc get cephblockpools -A
NAMESPACE           NAME                                                                 PHASE
openshift-storage   cephblockpool-storageconsumer-5efc53a1-aae7-4fe8-a3ad-e2ec9ea52777   Ready

$ oc get cephfilesystem -A
NAMESPACE           NAME                                ACTIVEMDS   AGE   PHASE
openshift-storage   ocs-storagecluster-cephfilesystem   1           27h   Ready

-------------------------------------------------

Storageclassclaims with type "sharedfilesystem" are in the Ready state, e.g. test-storageclassclaim-cephfs.
But many occurrences of the error given below are present in the ocs-operator pod log on the consumer.

{"level":"error","ts":1654716099.2675238,"logger":"controller.storageclassclaim","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"test-storageclassclaim-cephfs","namespace":"openshift-storage","error":"failed to get StorageClassClaim config: rpc error: code = Unavailable desc = status is not set for storage class claim \"test-storageclassclaim-cephfs\" for \"9c185091-d7ab-4b94-9d42-1180ae6fc809\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}


----------------------------------------------
Storageclassclaims with type "blockpool" are not Ready, e.g. test-storageclassclaim-rbd.
Error logs in ocs-operator of consumer:

{"level":"error","ts":1654713853.2544885,"logger":"controller.storageclassclaim","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"test-storageclassclaim-rbd","namespace":"openshift-storage","error":"failed to get StorageClassClaim config: rpc error: code = Unavailable desc = status is not set for storage class claim \"test-storageclassclaim-rbd\" for \"9c185091-d7ab-4b94-9d42-1180ae6fc809\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}


Error logs in the ocs-operator of provider:

{"level":"info","ts":1654770544.472081,"logger":"controller.storageclassclaim","msg":"Reconciling StorageClassClaim.","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"storageclassclaim-54d0f4199966db79472bea3f118bd4bb","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/storageclassclaim-54d0f4199966db79472bea3f118bd4bb"}
{"level":"info","ts":1654770544.472141,"logger":"controller.storageclassclaim","msg":"Running StorageClassClaim controller in Converged/Provider Mode","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"storageclassclaim-54d0f4199966db79472bea3f118bd4bb","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/storageclassclaim-54d0f4199966db79472bea3f118bd4bb"}
{"level":"info","ts":1654770544.662482,"logger":"controllers.OCSInitialization","msg":"Updating SecurityContextConstraint.","Request.Namespace":"openshift-storage","Request.Name":"ocsinit","SecurityContextConstraint":{"name":"ocs-metrics-exporter"}}
{"level":"info","ts":1654770544.662524,"logger":"controller.storageclassclaim","msg":"Reconciling StorageClassClaim.","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"storageclassclaim-79ae51581e9ea848deddb135d3e7e324","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/storageclassclaim-79ae51581e9ea848deddb135d3e7e324"}
{"level":"info","ts":1654770544.6625803,"logger":"controller.storageclassclaim","msg":"Running StorageClassClaim controller in Converged/Provider Mode","reconciler group":"ocs.openshift.io","reconciler kind":"StorageClassClaim","name":"storageclassclaim-79ae51581e9ea848deddb135d3e7e324","namespace":"openshift-storage","StorageClassClaim":"openshift-storage/storageclassclaim-79ae51581e9ea848deddb135d3e7e324"}
panic: assignment to entry in nil map
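The panic above is a well-known Go failure mode, shown here as a minimal standalone sketch (this is not ocs-operator code; the `status` map name is hypothetical). A declared but uninitialized map is nil: reads return zero values, but any write panics, and the usual fix is to allocate the map with `make` (or a literal) before assigning into it.

```go
package main

import "fmt"

func main() {
	var status map[string]string // nil until initialized

	// Writing to the nil map panics; recover only to demonstrate it here.
	func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r)
			}
		}()
		status["phase"] = "Ready" // panic: assignment to entry in nil map
	}()

	// Fix: allocate the map before writing.
	status = make(map[string]string)
	status["phase"] = "Ready"
	fmt.Println(status["phase"]) // Ready
}
```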

----------------------------------------------

must-gather logs from provider cluster: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-j8-pr/jijoy-j8-pr_20220608T054725/logs/testcases_1654769209/

must-gather logs from consumer cluster: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-j8-cr/jijoy-j8-cr_20220608T095739/logs/testcases_1654769180/

=========================================================
Version:

ocs-operator.v4.11.0 
ocs-osd-deployer.v2.0.2
odf-csi-addons-operator.v4.11.0
odf-operator.v4.11.0  
ODF full version 4.11.0-90
OCP 4.10.16

Comment 6 Jilju Joy 2022-06-09 12:10:17 UTC
Changing the bug status to assigned because the issue is not completely fixed. Please let me know if we need to open a new bug.

Comment 7 Madhu Rajanna 2022-06-09 12:15:09 UTC
(In reply to Jilju Joy from comment #6)
> Changing the bug status to assigned because the issue is not completely
> fixed. Please let me know if we need to open a new bug.

The panic shouldn't happen; we are testing it in another cluster to confirm. @Jilju, can you please also check on some other cluster?

Comment 9 Subham Rai 2022-06-09 14:38:49 UTC
Removed the PR since it was from master.

Comment 11 Jilju Joy 2022-06-20 10:00:10 UTC
Verified in ODF version 4.11.0-98.
ocs-osd-deployer.v2.0.2


Created storageclassclaim on consumer cluster. Storageclassclaim and storage class created successfully.
Storageclassclaim was created automatically on the provider cluster. Verified PVC creation, pod creation and I/O. Tested both RBD and CephFS.

List of storageclass:
test-storageclassclaim-2cephfs-test-project   openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   25m
test-storageclassclaim-2rbd-test-project      openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   17m
test-storageclassclaim-rbd-test-project       openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   23m

Comment 14 errata-xmlrpc 2022-08-24 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156