Bug 2215060

Summary: Storageclass ocs-storagecluster-ceph-rbd is not created on consumer
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Jilju Joy <jijoy>
Component: ocs-client-operator Assignee: Madhu Rajanna <mrajanna>
Status: CLOSED WONTFIX QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.12 CC: muagarwa, nberry, odf-bz-bot, omitrani
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2215054 Environment:
Last Closed: 2023-08-14 06:19:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2215054    
Bug Blocks:    

Description Jilju Joy 2023-06-14 15:15:53 UTC
+++ This bug was initially created as a clone of Bug #2215054 +++

Description of problem:
The storageclassclaim ocs-storagecluster-ceph-rbd remains in the "Configuring" phase on the consumer cluster. This is observed on the second consumer cluster connected to a single provider cluster.
The storageclient is in the Connected state.


$ oc get storageclassclaim
NAME                          STORAGETYPE        STORAGEPROFILE   STORAGECLIENTNAME   STORAGECLIENTNAMESPACE   PHASE
ocs-storagecluster-ceph-rbd   blockpool                           ocs-storageclient   fusion-storage           Configuring
ocs-storagecluster-cephfs     sharedfilesystem                    ocs-storageclient   fusion-storage           Ready


$ oc get sc
NAME                        PROVISIONER                          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                         kubernetes.io/aws-ebs                Delete          WaitForFirstConsumer   true                   137m
gp2-csi                     ebs.csi.aws.com                      Delete          WaitForFirstConsumer   true                   134m
gp3 (default)               ebs.csi.aws.com                      Delete          WaitForFirstConsumer   true                   137m
gp3-csi                     ebs.csi.aws.com                      Delete          WaitForFirstConsumer   true                   134m
ocs-storagecluster-cephfs   fusion-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   112m


$ oc get storageclient -n fusion-storage
NAME                PHASE       CONSUMER
ocs-storageclient   Connected   47ce36eb-36fc-4606-917f-e057d8a31233


The storageclassrequest storageclassrequest-219ba9f10e9d4ddf258721dbdd9a6d5b is in the "Creating" phase on the provider cluster:

$ oc get storageclassrequest
NAME                                                   STORAGETYPE        PHASE
storageclassrequest-219ba9f10e9d4ddf258721dbdd9a6d5b   blockpool          Creating
storageclassrequest-5213583e8c34523a3dee5a7552a31923   sharedfilesystem   Ready
storageclassrequest-62cf7530a4403de2fd8ed17fb3b975cd   sharedfilesystem   Ready
storageclassrequest-a05b7e5805be20106692ee15d012ddc4   blockpool          Ready



$ oc -n fusion-storage logs ocs-client-operator-controller-manager-777dfbf986-p58q2 --tail 10| grep ocs-storagecluster-ceph-rbd
1.6867533119961958e+09	INFO	Reconciling StorageClassClaim.	{"controller": "storageclassclaim", "controllerGroup": "ocs.openshift.io", "controllerKind": "StorageClassClaim", "StorageClassClaim": {"name":"ocs-storagecluster-ceph-rbd"}, "namespace": "", "name": "ocs-storagecluster-ceph-rbd", "reconcileID": "8cb3b8a2-b817-4bd9-9527-61feed582b22", "StorageClassClaim": "/ocs-storagecluster-ceph-rbd"}
1.6867533120548582e+09	ERROR	Reconciler error	{"controller": "storageclassclaim", "controllerGroup": "ocs.openshift.io", "controllerKind": "StorageClassClaim", "StorageClassClaim": {"name":"ocs-storagecluster-ceph-rbd"}, "namespace": "", "name": "ocs-storagecluster-ceph-rbd", "reconcileID": "8cb3b8a2-b817-4bd9-9527-61feed582b22", "error": "failed to get StorageClassClaim config: rpc error: code = Unavailable desc = storage class request \"ocs-storagecluster-ceph-rbd\" for \"47ce36eb-36fc-4606-917f-e057d8a31233\" is in \"Creating\" phase"}
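
For triage, the reconciler error above can be split into its fields to pull out the phase that is blocking the claim. The following is a minimal sketch, not part of the operator: it assumes the log fields are tab-separated (timestamp, level, message, JSON payload), as they appear in the lines above, and that the timestamp is epoch seconds. The helper names parse_log_line and blocked_phase are hypothetical, and the sample payload is shortened from the logged line.

```python
import json
import re
from datetime import datetime, timezone

# Matches the phase quoted at the end of the provider's error text,
# e.g. ... is in "Creating" phase
PHASE_RE = re.compile(r'is in "(?P<phase>[^"]+)" phase')


def parse_log_line(line):
    """Split one operator log line into timestamp, level, message and JSON fields.

    Assumption: the four fields are separated by tab characters.
    """
    ts, level, message, payload = line.rstrip("\n").split("\t", 3)
    return {
        "time": datetime.fromtimestamp(float(ts), tz=timezone.utc),
        "level": level,
        "message": message,
        "fields": json.loads(payload),
    }


def blocked_phase(entry):
    """Return the phase named in the error text, or None if there is none."""
    match = PHASE_RE.search(entry["fields"].get("error", ""))
    return match.group("phase") if match else None


# Shortened copy of the ERROR line from the consumer's operator log.
SAMPLE = "\t".join([
    "1.6867533120548582e+09",
    "ERROR",
    "Reconciler error",
    r'{"controller": "storageclassclaim", "name": "ocs-storagecluster-ceph-rbd", '
    r'"error": "failed to get StorageClassClaim config: rpc error: code = Unavailable '
    r'desc = storage class request \"ocs-storagecluster-ceph-rbd\" for '
    r'\"47ce36eb-36fc-4606-917f-e057d8a31233\" is in \"Creating\" phase"}',
])

if __name__ == "__main__":
    entry = parse_log_line(SAMPLE)
    print(entry["time"].isoformat(), entry["level"], blocked_phase(entry))
```

Extracting the phase this way makes it easy to confirm that the consumer is only relaying the provider-side state of the storageclassrequest rather than failing on its own.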


Provider logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jn14-pr/jijoy-jn14-pr_20230614T050712/logs/must-gather-logs/
Logs from the second consumer where SC is missing: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jn14-c3/jijoy-jn14-c3_20230614T120542/logs/must-gather-logs/
Logs from the first consumer cluster where SCs are present: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jn14-c1/jijoy-jn14-c1_20230614T050719/logs/must-gather-logs/


=========================================================================

Version-Release number of selected component (if applicable):
$ oc get csv
NAME                                      DISPLAY                            VERSION           REPLACES                                  PHASE
managed-fusion-agent.v2.0.11              Managed Fusion Agent               2.0.11                                                      Succeeded
observability-operator.v0.0.22            Observability Operator             0.0.22            observability-operator.v0.0.21            Succeeded
ocs-client-operator.v4.12.4-rhodf         OpenShift Data Foundation Client   4.12.4-rhodf      ocs-client-operator.v4.12.3-rhodf         Succeeded
odf-csi-addons-operator.v4.12.4-rhodf     CSI Addons                         4.12.4-rhodf      odf-csi-addons-operator.v4.12.3-rhodf     Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator                4.10.0                                                      Succeeded
route-monitor-operator.v0.1.500-6152b76   Route Monitor Operator             0.1.500-6152b76   route-monitor-operator.v0.1.498-e33e391   Succeeded

Agent build: quay.io/resoni/managed-fusion-agent-index:4.12.3-31.05

===========================================================================
How reproducible:
2/2

Steps to Reproduce:
1. Create a provider cluster and 2 consumer clusters.
2. Check the phase of storageclassclaims.
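
Step 2 can be automated when the reproduction is repeated. The sketch below is only an illustration: it parses saved `oc get storageclassclaim` output and lists the claims whose PHASE is not Ready. Columns are located by the header offsets because the STORAGEPROFILE column is empty in this output, so a plain whitespace split would shift the fields. The function names parse_table and claims_not_ready are hypothetical.

```python
import re

# Output copied from `oc get storageclassclaim` on the affected consumer.
SAMPLE = """\
NAME                          STORAGETYPE        STORAGEPROFILE   STORAGECLIENTNAME   STORAGECLIENTNAMESPACE   PHASE
ocs-storagecluster-ceph-rbd   blockpool                           ocs-storageclient   fusion-storage           Configuring
ocs-storagecluster-cephfs     sharedfilesystem                    ocs-storageclient   fusion-storage           Ready
"""


def parse_table(text):
    """Parse fixed-width CLI table output into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Each column starts where its header word starts.
    columns = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    ends = [start for _, start in columns[1:]] + [None]
    return [
        {name: row[start:end].strip() for (name, start), end in zip(columns, ends)}
        for row in rows
    ]


def claims_not_ready(text):
    """Return the names of claims whose PHASE column is anything but Ready."""
    return [row["NAME"] for row in parse_table(text) if row["PHASE"] != "Ready"]


if __name__ == "__main__":
    print(claims_not_ready(SAMPLE))
```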

Actual results:
$ oc get storageclassclaim
NAME                          STORAGETYPE        STORAGEPROFILE   STORAGECLIENTNAME   STORAGECLIENTNAMESPACE   PHASE
ocs-storagecluster-ceph-rbd   blockpool                           ocs-storageclient   fusion-storage           Configuring
ocs-storagecluster-cephfs     sharedfilesystem                    ocs-storageclient   fusion-storage           Ready

$ oc get sc ocs-storagecluster-ceph-rbd
Error from server (NotFound): storageclasses.storage.k8s.io "ocs-storagecluster-ceph-rbd" not found

Expected results:
$ oc get storageclassclaim
NAME                          STORAGETYPE        STORAGEPROFILE   STORAGECLIENTNAME   STORAGECLIENTNAMESPACE   PHASE
ocs-storagecluster-ceph-rbd   blockpool                           ocs-storageclient   fusion-storage           Ready
ocs-storagecluster-cephfs     sharedfilesystem                    ocs-storageclient   fusion-storage           Ready

$ oc get sc ocs-storagecluster-ceph-rbd
NAME                          PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-storagecluster-ceph-rbd   fusion-storage.rbd.csi.ceph.com   Delete          Immediate           true                   132m


Additional info:

Comment 2 Jilju Joy 2023-07-19 12:59:06 UTC
This issue is not always reproducible. It was reproduced in 3 of the last 7 attempts, which include the 2 attempts reported initially.
Adding the logs from the clusters where the issue was seen recently.

$ oc get storageclassclaim
NAME                          STORAGETYPE        STORAGEPROFILE   STORAGECLIENTNAME   STORAGECLIENTNAMESPACE   PHASE
ocs-storagecluster-ceph-rbd   blockpool                           ocs-storageclient   fusion-storage           Configuring
ocs-storagecluster-cephfs     sharedfilesystem                    ocs-storageclient   fusion-storage           Ready

Provider cluster - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jl14-pr/jijoy-jl14-pr_20230714T100944/logs/testcases_1689338199/jijoy-jl14-pr/
Application cluster where the issue is found - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jl14-c1/jijoy-jl14-c1_20230714T100858/logs/testcases_1689338132/

There are 2 application clusters connected to the provider cluster. The issue is seen in only 1 application cluster.
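
Since the issue is intermittent, repeated runs may benefit from waiting on the claim with a timeout, to separate a claim that is merely slow from one that is stuck in Configuring. This is a rough sketch under stated assumptions, not the test framework's implementation: it assumes the PHASE column is backed by .status.phase on the storageclassclaim, and the names get_claim_phase and wait_for_phase, as well as the timeout and interval defaults, are hypothetical.

```python
import subprocess
import time


def get_claim_phase(name):
    """Read the claim phase via the oc CLI.

    Assumption: the PHASE column corresponds to .status.phase.
    """
    result = subprocess.run(
        ["oc", "get", "storageclassclaim", name, "-o", "jsonpath={.status.phase}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_phase(name, wanted="Ready", timeout=600.0, interval=10.0,
                   get_phase=get_claim_phase, sleep=time.sleep,
                   clock=time.monotonic):
    """Poll until the claim reaches `wanted` or `timeout` seconds elapse.

    Returns (reached, last_phase_seen). The get_phase, sleep and clock
    parameters are injectable so the loop can be exercised without a cluster.
    """
    deadline = clock() + timeout
    while True:
        last = get_phase(name)
        if last == wanted:
            return True, last
        if clock() >= deadline:
            return False, last
        sleep(interval)


if __name__ == "__main__":
    reached, phase = wait_for_phase("ocs-storagecluster-ceph-rbd")
    print("reached Ready" if reached else f"timed out in phase {phase!r}")
```

A timeout result that still reports Configuring, together with a storageclassrequest in Creating on the provider, would match the behaviour described in this bug.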