Bug 2189290
| Summary: | [MDR] : Ceph blocklists the rbd clients of the managed clusters during reinstallation due to old fencing entries in RHCS | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sravika <sbalusu> |
| Component: | odf-dr | Assignee: | Raghavendra Talur <rtalur> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | muagarwa, odf-bz-bot, rtalur |
| Version: | 4.12 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-08-17 07:52:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
> 4. Apply Fencing operation on cluster 1 and verify that the fencing is successful
> 5. Reinstall cluster1 and cluster2 and connect to the same RHCS cluster in external mode
If a cluster is fenced because it is impacted by a disaster, then it is expected that the admin will unfence it once it has recovered, or once the admin is sure that the infra is permanently lost and does not have any applications running. It is not possible to automatically determine whether either of those events has occurred.

Reinstallation in step 5 implies that the infra was permanently lost, and I would expect the admin to have unfenced the clusters before reinstallation. Please share your thoughts if you think this is not a good enough solution or if there is a better way to clear the fence entries.
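For completeness, a minimal sketch of the unfence step expected before reinstallation, run from the hub cluster context. The DRCluster name `ocp-cluster1` is a placeholder for the resource representing managed cluster1; adjust to your environment.

```sh
# From the hub cluster: list the DRClusters managed by Ramen.
oc get drclusters

# Unfence the cluster that was fenced in step 4.
# "ocp-cluster1" is a placeholder DRCluster name.
oc patch drcluster ocp-cluster1 --type merge \
  -p '{"spec":{"clusterFence":"Unfenced"}}'

# Verify the fencing state before tearing the cluster down for reinstall.
oc get drcluster ocp-cluster1 -o jsonpath='{.spec.clusterFence}{"\n"}'
```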
@Rag @rtalur : Makes sense. Does it also mean that, unless the administrator unfences managed cluster1, any other cluster connecting to the external storage will be blocked? Managed cluster2 was also unable to create PVs, even though its entries are not fenced/blocklisted by Ceph.

Part of the question was answered in comment 2, and the answer for comment 4 is the same. To summarize:

1. Reinstallation of the managed clusters has no impact on the blocklist entries in RHCS; they must be removed if the managed clusters are using IPs in that range (see the removal sketch below).
2. If the blocklist command uses a /24 netmask, there is a potential of blocking IPs that will be used by the nodes of another managed cluster.

Closing this bug as not a bug.
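For reference, a rough sketch of removing the stale entries individually instead of a blanket clear, run against the RHCS cluster (for example from a node with the admin keyring). The addresses are the ones from this report; whether the plain `rm` or the `range rm` subcommand applies to the `cidr:` entries depends on the Ceph release, so verify on your version first.

```sh
# Inspect the stale fencing entries left over from the previous installation.
ceph osd blocklist ls

# Remove one entry at a time; the address must match what "ls" printed.
# For the cidr:<addr>/24 entries created by network fencing, the "range"
# subcommand may be required on releases that support range blocklisting.
ceph osd blocklist rm 172.23.233.145:0/24
# or, for range/CIDR entries:
ceph osd blocklist range rm 172.23.233.145:0/24

# Confirm nothing stale is left behind.
ceph osd blocklist ls
```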
Description of problem (please be as detailed as possible and provide log snippets):

Ceph blocklists the RBD clients of the managed clusters during reinstallation because of old fencing entries in RHCS from the previous installation. As a result, volumes are not created by the openshift-storage.rbd.csi.ceph.com provisioner and the noobaa-db-pg-0 pod stays in Pending state during storage system creation. Although there are blocklist entries only from managed cluster1, PV creation failed on managed cluster2 as well. After clearing the blocklist entries in Ceph, the PVs were created on both managed clusters automatically.

```
[root@m4205001 ~]# oc get pvc -A
NAMESPACE           NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Pending                                      ocs-external-storagecluster-ceph-rbd   94m
openshift-storage   testpvc             Pending                                      ocs-external-storagecluster-ceph-rbd   5m39s
testpvc             testpvc1            Pending                                      ocs-external-storagecluster-ceph-rbd   2m13s
```

```
[root@m4202001 ~]# oc get pvc -A
NAMESPACE           NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Pending                                      ocs-external-storagecluster-ceph-rbd   174m
openshift-storage   testpvc             Pending                                      ocs-external-storagecluster-ceph-rbd
```

```
# oc describe pvc db-noobaa-db-pg-0 -n openshift-storage
Name:          db-noobaa-db-pg-0
Namespace:     openshift-storage
StorageClass:  ocs-external-storagecluster-ceph-rbd
Status:        Pending
Volume:
Labels:        app=noobaa
               noobaa-db=postgres
Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       noobaa-db-pg-0
Events:
  Type     Reason                Age                    From                                                                                                                Message
  ----     ------                ----                   ----                                                                                                                -------
  Warning  ProvisioningFailed    22m (x14 over 41m)     openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-c4977d969-8ml8j_5a19281a-2197-45c7-83ad-3e348d51056b   failed to provision volume with StorageClass "ocs-external-storagecluster-ceph-rbd": rpc error: code = Internal desc = rados: ret=-108, Cannot send after transport endpoint shutdown
  Normal   Provisioning          2m37s (x19 over 41m)   openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-c4977d969-8ml8j_5a19281a-2197-45c7-83ad-3e348d51056b   External provisioner is provisioning volume for claim "openshift-storage/db-noobaa-db-pg-0"
  Normal   ExternalProvisioning  2m2s (x187 over 47m)   persistentvolume-controller                                                                                         waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
```

Fencing operation entries from managed cluster 1:

```
# ceph osd blocklist ls
cidr:172.23.233.145:0/24 2028-04-20T13:48:47.119563+0000
cidr:172.23.233.146:0/24 2028-04-20T13:48:47.555998+0000
cidr:172.23.233.147:0/24 2028-04-20T13:48:48.636305+0000
cidr:172.23.233.148:0/24 2028-04-20T13:48:49.610905+0000
cidr:172.23.233.152:0/24 2028-04-20T13:48:50.639814+0000
cidr:172.23.233.153:0/24 2028-04-20T13:48:51.667072+0000
listed 6 entries

# ceph osd blocklist clear
removed all blocklist entries

# ceph osd blocklist ls
listed 0 entries
```

Version of all relevant components (if applicable):
OCP: 4.12.11
odf-operator.v4.12.2-rhodf
RHCS: 5.3.z2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Clear the blocklist entries in Ceph:

```
# ceph osd blocklist clear
removed all blocklist entries

# ceph osd blocklist ls
listed 0 entries
```

```
[root@m4205001 ~]# oc get pvc -n openshift-storage
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
db-noobaa-db-pg-0   Bound    pvc-b01088b7-42e1-4e47-9ca8-57cd76f25d43   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   3h12m
```

```
[root@m4202001 ~]# oc get pvc -n openshift-storage
NAMESPACE           NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
openshift-storage   db-noobaa-db-pg-0   Bound    pvc-9cae4774-cfd2-4dfe-9197-e5e90d3adf3a   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   174m
openshift-storage   testpvc             Bound    pvc-124453ce-0367-44cc-838a-aae607226372   5Gi        RWO            ocs-external-storagecluster-ceph-rbd   50m
```

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy RHCS 5.3.z2
2. Connect two clusters (cluster1, cluster2) in external mode to RHCS
3. Configure a Metro DR environment and deploy an application on cluster1
4. Apply the fencing operation on cluster1 and verify that the fencing is successful
5. Reinstall cluster1 and cluster2 and connect them to the same RHCS cluster in external mode
6. Verify the successful deployment of ODF on both clusters

Actual results:
ODF deployment is not successful because the noobaa-db-pg-0 pod is in Pending state due to PV creation failure.

Expected results:
Reinstallation of ODF in external mode should succeed and PV creation should work on both clusters.

Additional info:
Must-gather:
https://drive.google.com/file/d/1JjZ3e2xSCw33eszmwpr3iGLYYw9ac7NW/view?usp=share_link
https://drive.google.com/file/d/1XZVOxplFCsF4PNLBhZePl9LWsCSni7se/view?usp=share_link
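As additional context for why managed cluster2 was also affected even though only cluster1 was fenced: the /24 entries cover every address in the node subnet, not just cluster1's nodes. A rough, illustrative check is sketched below; it assumes the `ceph` CLI reaches the RHCS cluster, `oc` points at the reinstalled managed cluster, and the sed/jsonpath expressions are only an approximation of the output formats shown above.

```sh
# Collect the /24 prefixes currently blocklisted by the fencing entries.
blocked_prefixes=$(ceph osd blocklist ls 2>/dev/null \
  | sed -n 's/^cidr:\([0-9.]*\)\.[0-9]*:.*/\1/p' | sort -u)

# Compare against the InternalIP of every node in the managed cluster.
for ip in $(oc get nodes \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  prefix=${ip%.*}
  if echo "$blocked_prefixes" | grep -Fxq "$prefix"; then
    echo "$ip is covered by a blocklisted /24 entry"
  fi
done
```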