Bug 2189290

Summary: [MDR] : Ceph blocklists the rbd clients of the managed clusters during reinstallation due to old fencing entries in RHCS
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sravika <sbalusu>
Component: odf-dr Assignee: Raghavendra Talur <rtalur>
odf-dr sub component: ramen QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED NOTABUG Docs Contact:
Severity: unspecified    
Priority: unspecified CC: muagarwa, odf-bz-bot, rtalur
Version: 4.12   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-17 07:52:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sravika 2023-04-24 16:36:51 UTC
Description of problem (please be detailed as possible and provide log
snippets):

Ceph blocklists the RBD clients of the managed clusters during reinstallation due to old fencing entries in RHCS from the previous installation. Hence volumes are not created by the openshift-storage.rbd.csi.ceph.com provisioner, and the noobaa-db-pg-0 pod is stuck in Pending state during storage system creation.

Although the blocklist contains entries only from managed cluster1, PV creation failed on managed cluster2 as well. After clearing the blocklist entries in Ceph, the PVs were created on both managed clusters automatically.


[root@m4205001 ~]# oc get pvc -A
NAMESPACE           NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Pending                                      ocs-external-storagecluster-ceph-rbd   94m
openshift-storage   testpvc             Pending                                      ocs-external-storagecluster-ceph-rbd   5m39s
testpvc             testpvc1            Pending                                      ocs-external-storagecluster-ceph-rbd   2m13s


[root@m4202001 ~]# oc get pvc -A
NAMESPACE           NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Pending                                      ocs-external-storagecluster-ceph-rbd   174m
openshift-storage   testpvc             Pending                                      ocs-external-storagecluster-ceph-rbd



#  oc describe pvc db-noobaa-db-pg-0 -n openshift-storage
Name:          db-noobaa-db-pg-0
Namespace:     openshift-storage
StorageClass:  ocs-external-storagecluster-ceph-rbd
Status:        Pending
Volume:
Labels:        app=noobaa
               noobaa-db=postgres
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       noobaa-db-pg-0
Events:
  Type     Reason                Age                   From                                                                                                               Message
  ----     ------                ----                  ----                                                                                                               -------
  Warning  ProvisioningFailed    22m (x14 over 41m)    openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-c4977d969-8ml8j_5a19281a-2197-45c7-83ad-3e348d51056b  failed to provision volume with StorageClass "ocs-external-storagecluster-ceph-rbd": rpc error: code = Internal desc = rados: ret=-108, Cannot send after transport endpoint shutdown
  Normal   Provisioning          2m37s (x19 over 41m)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-c4977d969-8ml8j_5a19281a-2197-45c7-83ad-3e348d51056b  External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-pg-0"
  Normal   ExternalProvisioning  2m2s (x187 over 47m)  persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator



Fencing operation entries from managed cluster 1

# ceph osd blocklist ls
cidr:172.23.233.145:0/24 2028-04-20T13:48:47.119563+0000
cidr:172.23.233.146:0/24 2028-04-20T13:48:47.555998+0000
cidr:172.23.233.147:0/24 2028-04-20T13:48:48.636305+0000
cidr:172.23.233.148:0/24 2028-04-20T13:48:49.610905+0000
cidr:172.23.233.152:0/24 2028-04-20T13:48:50.639814+0000
cidr:172.23.233.153:0/24 2028-04-20T13:48:51.667072+0000
listed 6 entries
#

# ceph osd blocklist clear
 removed all blocklist entries

# ceph osd blocklist ls
listed 0 entries 
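As an alternative to wiping every entry with `ceph osd blocklist clear` (which also drops entries that may still be protecting a genuinely fenced cluster), entries belonging to one cluster can be removed selectively. A minimal sketch, assuming the reinstalled cluster's nodes share the `172.23.233.` prefix seen above, and assuming the cidr-style entries shown by `ls` are removed with the `range rm` form of the command (plain entries would use `ceph osd blocklist rm`) — adjust both to your environment:

```shell
# Prefix of the reinstalled cluster's node IPs (assumption; taken from the
# blocklist output above -- adjust to your environment).
CLUSTER_PREFIX="172.23.233."

# List the current blocklist, keep only entries whose address starts with
# the prefix, strip the leading "cidr:", and remove each one individually.
ceph osd blocklist ls 2>/dev/null \
  | awk -v p="$CLUSTER_PREFIX" 'index($1, p) { sub(/^cidr:/, "", $1); print $1 }' \
  | while read -r entry; do
      # cidr entries were added as ranges, so (assumption) they are removed
      # with the range subcommand rather than plain `blocklist rm`.
      ceph osd blocklist range rm "$entry"
    done
```

The summary line ("listed 6 entries") is filtered out by the prefix match, so only real entries reach the removal loop.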




Version of all relevant components (if applicable):
OCP: 4.12.11
odf-operator.v4.12.2-rhodf
RHCS: 5.3.z2


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Clear the blocklist entries in Ceph

# ceph osd blocklist clear
 removed all blocklist entries

# ceph osd blocklist ls
listed 0 entries 

[root@m4205001 ~]# oc get pvc -n openshift-storage
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
db-noobaa-db-pg-0   Bound    pvc-b01088b7-42e1-4e47-9ca8-57cd76f25d43   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   3h12m
[root@m4205001 ~]#


[root@m4202001 ~]# oc get pvc -n openshift-storage
NAMESPACE           NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Bound    pvc-9cae4774-cfd2-4dfe-9197-e5e90d3adf3a   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   174m
openshift-storage   testpvc             Bound    pvc-124453ce-0367-44cc-838a-aae607226372   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   50m
[root@m4202001 ~]#

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy RHCS 5.3.z2 
2. Connect two clusters (cluster1, cluster2) in external mode to RHCS
3. Configure Metro DR environment, deploy application on cluster 1
4. Apply Fencing operation on cluster 1 and verify that the fencing is successful
5. Reinstall cluster1 and cluster2 and connect to the same RHCS cluster in external mode
6. Verify the successful deployment of ODF on both the clusters

Actual results:
ODF deployment was not successful, as the noobaa-db-pg-0 pod is stuck in Pending state due to PV creation failure

Expected results:
Reinstallation of ODF in external mode should be successful, and PV creation should succeed on both clusters

Additional info:

Must-gather:

https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link

https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link

Comment 2 Raghavendra Talur 2023-04-24 16:47:28 UTC
> 4. Apply Fencing operation on cluster 1 and verify that the fencing is successful
> 5. Reinstall cluster1 and cluster2 and connect to the same RHCS cluster in external mode

If a cluster is fenced because it is impacted by a disaster, then it is expected that the admin will unfence it once it is recovered, OR when the admin is sure that the infra is permanently lost and doesn't have any applications running. It is not possible to automatically determine whether either of those events has occurred.

Reinstallation in step 5 implies that the infra was permanently lost and I would expect the admin to have unfenced the clusters before reinstallation. Please share your thoughts if you think this is not a good enough solution or if there is a better way to clear the fence entries.
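The unfence step described above is driven from the hub cluster. A sketch of that operation, assuming a DRCluster resource named `cluster1` and the Ramen `clusterFence` spec field (both names are assumptions here; check the actual resource names on your hub):

```shell
# On the hub cluster: flip the DRCluster's fencing state back to Unfenced
# before tearing the managed cluster down. "cluster1" is a placeholder
# DRCluster name for this environment.
oc patch drcluster cluster1 \
  --type merge \
  -p '{"spec": {"clusterFence": "Unfenced"}}'

# Ramen should then remove the corresponding blocklist entries from RHCS;
# verify on the external cluster with:
ceph osd blocklist ls
```

If this is done before reinstallation, no stale entries remain to blocklist the reinstalled clusters' RBD clients.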

Comment 3 Sravika 2023-04-25 08:42:30 UTC
@Rag

Comment 4 Sravika 2023-04-25 08:43:29 UTC
@rtalur: Makes sense. Does it also mean that unless the administrator unfences managed cluster1, any other cluster connecting to the external storage will be blocked? Managed cluster2 was also unable to create PVs, although its entries are not fenced/blocklisted by Ceph.

Comment 7 Raghavendra Talur 2023-08-17 07:52:12 UTC
Part of the question was answered in comment 2. Answer for comment 4 is also the same. To summarize:

1. Reinstallation of the managed clusters won't have any impact on the blocklist entries in RHCS; they must be removed manually if the reinstalled managed clusters use IPs in that range.
2. If the blocklist command is using a /24 netmask then there is a potential of blocking the IPs that will be used by nodes of another managed cluster.
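Point 2 can be illustrated with the addresses from this report: a /24 mask covers the whole last octet, so an entry created while fencing cluster1 also blocks any cluster2 node drawn from the same subnet (the cluster2 address below is a hypothetical example):

```shell
# Address actually blocklisted while fencing cluster1 (from this report).
fenced_addr="172.23.233.145"
# Hypothetical node IP of managed cluster2 in the same subnet.
cluster2_node="172.23.233.200"

# With a /24 mask only the first three octets matter, so comparing them
# shows whether the cluster1 entry also covers cluster2's node.
if [ "${fenced_addr%.*}" = "${cluster2_node%.*}" ]; then
  echo "blocklist entry for cluster1 also covers cluster2's node"
fi
```

This matches the observed behavior: cluster2's PV creation failed even though no entry named its addresses directly.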

Closing this bug as not a bug.