Bug 2166900

Summary: RBD PVCs are not working with 8 TiB and 20 TiB clusters
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Filip Balák <fbalak>
Component: odf-managed-service
Assignee: Rewant <resoni>
Status: CLOSED WONTFIX
QA Contact: Filip Balák <fbalak>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.10
CC: cblum, nberry, ocs-bugs, odf-bz-bot, shberry, ykukreja
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2172101
Environment:
Last Closed: 2023-07-03 14:16:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2172101

Description Filip Balák 2023-02-03 12:41:31 UTC
Description of problem:
RBD PVCs cannot be created on fresh clusters deployed with the 8 TiB or 20 TiB size option. Provisioning fails with the error:
InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"

E.g.:
Name:          pvc-test-a83f3365ad834bf996cd793c7a39b20
Namespace:     namespace-test-2984fe87e8ec4534907c7ba73
StorageClass:  ocs-storagecluster-ceph-rbd
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                From                                                                                                               Message
  ----     ------                ----               ----                                                                                                               -------
  Normal   Provisioning          31s (x7 over 62s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9kk8w_77954ce8-d024-42cc-9b93-1f58d176f537  External provisioner is provisioning volume for claim "namespace-test-2984fe87e8ec4534907c7ba73/pvc-test-a83f3365ad834bf996cd793c7a39b20"
  Warning  ProvisioningFailed    31s (x7 over 62s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9kk8w_77954ce8-d024-42cc-9b93-1f58d176f537  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"
  Normal   ExternalProvisioning  8s (x5 over 62s)   persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
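The "missing configuration for cluster ID" error indicates that the ceph-csi provisioner could not find an entry for the openshift-storage clusterID in its cluster configuration. A minimal check (a sketch, assuming the standard Rook ConfigMap name rook-ceph-csi-config and the openshift-storage namespace used by a default ODF deployment):

oc -n openshift-storage get configmap rook-ceph-csi-config \
  -o jsonpath='{.data.csi-cluster-config\.json}'

If the returned JSON is empty, or has no entry with "clusterID": "openshift-storage" and a populated "monitors" list, provisioning fails with exactly the error shown in the events above.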

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.8

How reproducible:
4/4

Steps to Reproduce:
1. Deploy a provider cluster with size 8 TiB or 20 TiB and a consumer cluster.
2. Create a PVC on the consumer that uses the RBD storage class (see the example manifest below).
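
A minimal PVC manifest for step 2 (a sketch; the PVC name, namespace, and requested size are illustrative placeholders, only the ocs-storagecluster-ceph-rbd storage class name comes from the report):

cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF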

Actual results:
PVC is stuck in Pending state with an error:
InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"

Expected results:
PVC is created.

Additional info:
There is a workaround: restart rook-ceph-operator
Discussion about the issue: https://chat.google.com/room/AAAASHA9vWs/H_U9EtJfcPQ
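
For reference, the operator restart can be done as follows (a sketch; it assumes the deployment is named rook-ceph-operator and runs in the openshift-storage namespace, and the report does not say whether the provider or the consumer side was restarted):

# restart the rook-ceph-operator deployment
oc -n openshift-storage rollout restart deployment/rook-ceph-operator

# alternatively, delete the operator pod and let the Deployment recreate it
oc -n openshift-storage delete pod -l app=rook-ceph-operator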

Comment 1 Shekhar Berry 2023-02-10 06:26:53 UTC
Hi,

I faced the exact same issue in my setup, with a PVC stuck in Pending state. I fixed it by restarting the rook-ceph-operator.

The following error message was seen in the output of oc describe pvc:

Warning  ProvisioningFailed    85m (x14 over 104m)     openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9x4s7_286be8d2-73a0-4410-b8ed-d1ce7d4e187e  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"


Here are the versions of the various components in my consumer cluster:

ocos get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-7          ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.10.45
Kubernetes Version: v1.23.12+8a6bfe4

Comment 2 Shekhar Berry 2023-02-10 06:29:45 UTC
(In reply to Shekhar Berry from comment #1)

This was seen with a 4 TiB cluster, FYI.
My setup consisted of 1 provider and 3 consumers; the issue was seen on just one consumer, while the other 2 worked fine out of the box.

Comment 3 Chris Blum 2023-02-10 10:21:36 UTC
Yash, can you please ACK whether the SRE workaround is acceptable from your perspective?

Comment 5 Rewant 2023-07-03 14:16:44 UTC
Closing this as WONTFIX since we have a workaround for it: restarting the rook-ceph-operator.