Bug 2166900 - RBD PVCs are not working with 8 TiB and 20 TiB clusters
Summary: RBD PVCs are not working with 8 TiB and 20 TiB clusters
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rewant
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 2172101
 
Reported: 2023-02-03 12:41 UTC by Filip Balák
Modified: 2023-08-09 17:00 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2172101 (view as bug list)
Environment:
Last Closed: 2023-07-03 14:16:44 UTC
Embargoed:



Description Filip Balák 2023-02-03 12:41:31 UTC
Description of problem:
RBD PVCs cannot be created on fresh clusters deployed with the 8 TiB or 20 TiB size option. They end up with an error:
InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"

E.g.:
Name:          pvc-test-a83f3365ad834bf996cd793c7a39b20
Namespace:     namespace-test-2984fe87e8ec4534907c7ba73
StorageClass:  ocs-storagecluster-ceph-rbd
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                From                                                                                                               Message
  ----     ------                ----               ----                                                                                                               -------
  Normal   Provisioning          31s (x7 over 62s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9kk8w_77954ce8-d024-42cc-9b93-1f58d176f537  External provisioner is provisioning volume for claim "namespace-test-2984fe87e8ec4534907c7ba73/pvc-test-a83f3365ad834bf996cd793c7a39b20"
  Warning  ProvisioningFailed    31s (x7 over 62s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9kk8w_77954ce8-d024-42cc-9b93-1f58d176f537  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"
  Normal   ExternalProvisioning  8s (x5 over 62s)   persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
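
For context, the clusterID in this error is resolved from the CSI cluster configuration that Rook maintains for the CSI driver. A quick way to inspect it on the affected cluster (a diagnostic sketch, assuming the default Rook ConfigMap name and the openshift-storage namespace):

# Show the CSI cluster configuration the provisioner reads; if there is no
# entry for clusterID "openshift-storage" with a monitor list, provisioning
# fails with the error above.
oc -n openshift-storage get configmap rook-ceph-csi-config \
  -o jsonpath='{.data.csi-cluster-config-json}{"\n"}'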

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.8

How reproducible:
4/4

Steps to Reproduce:
1. Deploy a provider cluster with the 8 TiB or 20 TiB size option and a consumer cluster.
2. Create a PVC on the consumer that uses the RBD storage class (see the minimal manifest below).
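
For reference, a minimal PVC that reproduces the failure (a sketch; the claim name, namespace, and 1Gi size are arbitrary placeholders, only the ocs-storagecluster-ceph-rbd storage class matters):

# Create a small RBD-backed PVC on the consumer cluster.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF

# The claim then stays Pending with the InvalidArgument error described below.
oc -n default describe pvc pvc-test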

Actual results:
PVC is stuck in Pending state with an error:
InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"

Expected results:
PVC is created.

Additional info:
There is a workaround: restart the rook-ceph-operator (example commands below).
Discussion about the issue: https://chat.google.com/room/AAAASHA9vWs/H_U9EtJfcPQ
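
A sketch of the workaround, assuming the operator runs as the standard rook-ceph-operator deployment in the openshift-storage namespace:

# Restart the operator; the new pod re-reconciles the CSI configuration.
oc -n openshift-storage rollout restart deployment/rook-ceph-operator

# Alternatively, delete the pod and let the deployment recreate it:
oc -n openshift-storage delete pod -l app=rook-ceph-operator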

Comment 1 Shekhar Berry 2023-02-10 06:26:53 UTC
Hi,

I faced the exact same issue in my setup, with the PVC stuck in Pending state. I fixed it by restarting the rook-ceph-operator.

The following error message was seen in describe pvc:

Warning  ProvisioningFailed    85m (x14 over 104m)     openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65477c4f5-9x4s7_286be8d2-73a0-4410-b8ed-d1ce7d4e187e  failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (openshift-storage): missing configuration for cluster ID "openshift-storage"


Here are the versions of the various components in my consumer cluster:

oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.9                      NooBaa Operator               4.10.9            mcg-operator.v4.10.8                      Succeeded
observability-operator.v0.0.20            Observability Operator        0.0.20            observability-operator.v0.0.19            Succeeded
ocs-operator.v4.10.9                      OpenShift Container Storage   4.10.9            ocs-operator.v4.10.8                      Succeeded
ocs-osd-deployer.v2.0.11                  OCS OSD Deployer              2.0.11-7          ocs-osd-deployer.v2.0.10                  Succeeded
odf-csi-addons-operator.v4.10.9           CSI Addons                    4.10.9            odf-csi-addons-operator.v4.10.8           Succeeded
odf-operator.v4.10.9                      OpenShift Data Foundation     4.10.9            odf-operator.v4.10.8                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.461-dbddf1f   Route Monitor Operator        0.1.461-dbddf1f   route-monitor-operator.v0.1.456-02ea942   Succeeded

oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.10.45
Kubernetes Version: v1.23.12+8a6bfe4

Comment 2 Shekhar Berry 2023-02-10 06:29:45 UTC
(In reply to Shekhar Berry from comment #1)

This was seen with a 4 TiB cluster, FYI.
My setup consisted of 1 provider and 3 consumers; the issue appeared on just one consumer, while the other 2 worked fine out of the box.

Comment 3 Chris Blum 2023-02-10 10:21:36 UTC
Yash, can you please ACK whether the SRE workaround is acceptable from your perspective?

Comment 5 Rewant 2023-07-03 14:16:44 UTC
Closing this as WONTFIX since we have a workaround: restarting the rook-ceph-operator.

