Bug 2246084
| Summary: | [RDR] [Hub recovery] Failover doesn't complete | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | bmekhiss, kramdoss, kseeger, muagarwa |
| Version: | 4.14 | Flags: | kramdoss: needinfo+ |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.15.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-03-19 15:28:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Aman Agrawal 2023-10-25 10:35:58 UTC
This is not always reproducible and we have a workaround as mentioned by Benamar in https://bugzilla.redhat.com/show_bug.cgi?id=2246186#c3. IMO, we should move both these BZs to 4.14.z as this is a corner case and it might require some code restructuring in MCO.

(In reply to Mudit Agarwal from comment #5)
> This is not always reproducible and we have a workaround as mentioned by
> Benamar in https://bugzilla.redhat.com/show_bug.cgi?id=2246186#c3
> IMO, we should move both these BZs to 4.14.z as this is a corner case and it
> might require some code restructuring in MCO

Actually, no. The workaround did not work as expected, and Benamar knows this. On reproducibility, I am sure this is reproducible: the workloads were in Deployed state before the active hub went down, and this is a normal failover scenario that is blocked by this BZ, so it is certainly a hub-recovery blocker BZ.

This issue was hit again with:

- OCP 4.14.0-0.nightly-2023-10-30-170011
- advanced-cluster-management.v2.9.0-188
- ODF 4.14.0-157
- ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
- ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
- Submariner brew.registry.redhat.io/rh-osbs/iib:607438

Steps:

1. On a hub recovery RDR setup, ensure backups are being created on the active and passive hub clusters. Failover and relocate different workloads so that each is running on the primary managed cluster after the failover and relocate operations complete. Ensure the latest backups are taken and no action on any of the workloads (CephFS or RBD, ApplicationSet or Subscription type) is in progress.
2. Collect the drpc status (see the sketch after this list). Bring the primary managed cluster down, and then bring the active hub down.
3. Ensure the secondary managed cluster is properly imported on the passive hub and that the DRPolicy gets validated.
4. Check the drpc status from the passive hub and compare it with the output taken from the active hub while it was up.

We notice that post hub recovery, a sanity check is run for all the workloads that were failed over or relocated: the same action that was performed from the active hub is performed again on those workloads, which marks Peer Ready as false for them.
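For reference, a minimal sketch of collecting the drpc status mentioned in steps 2 and 4, assuming the `drpc` shorthand seen in the captures below is simply an alias around `oc get drpc` across all namespaces:

```sh
# On the hub cluster: list DRPlacementControl resources for all DR-protected
# workloads; the wide output carries the PROGRESSION, START TIME, DURATION and
# PEER READY columns shown in the captures below.
oc get drpc --all-namespaces -o wide

# Save a copy so it can be compared against the passive hub after recovery.
oc get drpc --all-namespaces -o wide > drpc-active-hub.txt
```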
From active hub-

```
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION             PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T17:54:21Z   30.282249722s        True
busybox-workloads-5   subscription-rbd1-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T13:57:37Z   47m3.364814169s      True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     9h    amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocated      Completed     2023-11-01T14:16:28Z   3h17m50.318760845s   True
openshift-gitops      appset-cephfs-placement-drpc           9h    amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     Completed     2023-11-01T13:20:45Z   5m59.4021061s        True
openshift-gitops      appset-rbd1-placement-drpc             9h    amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailedOver     Completed     2023-11-01T14:15:30Z   41m2.588884417s      True
openshift-gitops      appset-rbd2-placement-drpc             9h    amagrawa-passivee                                      Deployed       Completed                                                 True
```

From passive hub-

```
amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE   PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                           START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   57m   amagrawa-31o-prim   amagrawa-passivee   Relocate       Relocating                                           2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    WaitForStorageMaintenanceActivation   2023-11-01T18:59:36Z              False
busybox-workloads-6   subscription-rbd2-placement-1-drpc     57m   amagrawa-31o-prim   amagrawa-passivee   Relocate                                                                                              True
openshift-gitops      appset-cephfs-placement-drpc           57m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                                    True
openshift-gitops      appset-rbd1-placement-drpc             57m   amagrawa-31o-prim   amagrawa-31o-prim   Failover       FailingOver    FailingOverToCluster                  2023-11-01T18:59:36Z              False
openshift-gitops      appset-rbd2-placement-drpc             57m   amagrawa-passivee                                      Deployed       Completed                                                               True
```

Since Peer Ready is now marked as false due to the sanity check, subscription-cephfs-placement-1-drpc, subscription-rbd1-placement-1-drpc and appset-rbd1-placement-drpc cannot be failed over in this example. This sanity check is needed as per the Kubernetes recommended guidelines, and we should not back up the CURRENTSTATE of the workloads, as confirmed by @bmekhiss, so the issue will always persist. As of now, the only option is to trigger a failover by editing the drpc YAML (which would be addressed by BZ 2247537); see the sketch below.

So all these apps were failed over via the CLI to the secondary managed cluster, which was available, but the failover did not succeed for the RBD-backed workloads because the volumereplicationclass was not backed up / got deleted. @bmekhiss tried a workaround that created the volumereplicationclass on the available secondary managed cluster. This let the failover proceed and created the workload pods, but not the VRs for the RBD-backed workloads, so the VRG CURRENTSTATE could not be marked as Primary. We need VRs to be created for the RBD-backed workloads, so the workaround did not work as expected.
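A minimal sketch of triggering the failover from the CLI by editing the drpc spec, as was done here. The cluster and resource names are taken from this setup; the exact spec fields (`action`, `failoverCluster`) are an assumption based on the DRPlacementControl API and should be verified against the deployed CRD:

```sh
# From the hub: set the DR action to Failover and point the workload at the
# surviving managed cluster. Equivalent to editing the drpc YAML via `oc edit`.
oc patch drpc subscription-rbd1-placement-1-drpc -n busybox-workloads-5 \
  --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-passivee"}}'
```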
From passive hub after triggering failover from CLI-

```
amagrawa:~$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER    FAILOVERCLUSTER     DESIREDSTATE   CURRENTSTATE   PROGRESSION                 START TIME             DURATION   PEER READY
busybox-workloads-2   subscription-cephfs-placement-1-drpc   3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailingOver    WaitingForResourceRestore   2023-11-01T18:59:35Z              False
busybox-workloads-5   subscription-rbd1-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
busybox-workloads-6   subscription-rbd2-placement-1-drpc     3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T20:12:09Z              True
openshift-gitops      appset-cephfs-placement-drpc           3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     EnsuringVolSyncSetup                                          True
openshift-gitops      appset-rbd1-placement-drpc             3h21m   amagrawa-31o-prim   amagrawa-passivee   Failover       FailedOver     WaitForReadiness            2023-11-01T18:59:36Z              True
openshift-gitops      appset-rbd2-placement-drpc             3h21m   amagrawa-passivee                                      Deployed       Completed                                                     True
```

From secondary available managed cluster to which failover was triggered-

```
amagrawa:~$ busybox-5
Now using project "busybox-workloads-5" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/busybox-pvc-21     Bound    pvc-81ff5583-61e1-45fd-a739-0ad850f9d803   43Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-22     Bound    pvc-b14f6c3b-f1ed-42dd-b658-abaaf3e77a3d   43Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-23     Bound    pvc-345815af-9b83-4e27-b8fa-6946f638e3c6   52Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-24     Bound    pvc-3345a8f9-4552-4f2e-80ad-670088e3334a   20Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem
persistentvolumeclaim/busybox-pvc-25     Bound    pvc-7088a4bf-5607-4b71-b578-7682ecd6fe24   45Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-rbd1-placement-1-drpc   primary

NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/busybox-21-7d6dfb858-qdkqn    1/1     Running   0          70m   10.129.3.25   compute-2   <none>           <none>
pod/busybox-22-6cf5dcc584-b9lwx   1/1     Running   0          70m   10.129.3.26   compute-2   <none>           <none>
pod/busybox-23-5bf89b9cc8-g62tl   1/1     Running   0          70m   10.131.0.97   compute-0   <none>           <none>
pod/busybox-24-6d5bc476dd-sx9xt   1/1     Running   0          70m   10.129.3.28   compute-2   <none>           <none>
pod/busybox-25-84d6dd6dc4-jqth2   1/1     Running   0          70m   10.131.0.98   compute-0   <none>           <none>

amagrawa:~$ busybox-6
Now using project "busybox-workloads-6" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/mysql-pv-claim    Bound    pvc-6ea645c2-b6f8-44d2-9526-9911282aa487   24Gi       RWO            ocs-storagecluster-ceph-rbd   70m   Filesystem

NAME                                                                              DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/subscription-rbd2-placement-1-drpc   primary

NAME                                   READY   STATUS      RESTARTS      AGE   IP            NODE        NOMINATED NODE   READINESS GATES
pod/data-viewer-1-build                0/1     Completed   0             70m   10.129.3.24   compute-2   <none>           <none>
pod/data-viewer-775bb7cb4d-zvgt5       1/1     Running     0             69m   10.129.3.29   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-bxvpp   1/1     Running     0             70m   10.129.3.22   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-q74zw   1/1     Running     0             70m   10.131.0.96   compute-0   <none>           <none>
pod/io-writer-mysql-68475c9785-qgdh7   1/1     Running     0             70m   10.129.3.23   compute-2   <none>           <none>
pod/io-writer-mysql-68475c9785-qkhck   1/1     Running     0             70m   10.131.0.95   compute-0   <none>           <none>
pod/io-writer-mysql-68475c9785-ttmzv   1/1     Running     0             70m   10.128.3.88   compute-1   <none>           <none>
pod/mysql-7c88dd4dff-gsvcr             1/1     Running     1 (69m ago)   70m   10.129.3.27   compute-2   <none>           <none>

amagrawa:~$ busybox-3
Now using project "busybox-workloads-3" on server "https://api.amagrawa-passivee.qe.rh-ocs.com:6443".

NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE   VOLUMEMODE
persistentvolumeclaim/dd-io-pvc-1     Bound    pvc-4cb8fad8-cd23-4e25-a6df-e8f00e2583a1   117Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-2     Bound    pvc-eef9d77b-d0bf-4b0b-9b67-cf1df477fdfc   143Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-3     Bound    pvc-ed60b47a-1724-4685-bf72-2925535114df   134Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-4     Bound    pvc-e56afbd0-65d3-4c67-b64d-24a5c301a65d   106Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-5     Bound    pvc-4e9e86a1-75d3-463a-ba9e-79abe33512aa   115Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-6     Bound    pvc-e541b7b9-36e4-4572-87aa-4276e7267b3e   129Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem
persistentvolumeclaim/dd-io-pvc-7     Bound    pvc-075a6bca-0c69-47c9-8e37-9a79a8f10f29   149Gi      RWO            ocs-storagecluster-ceph-rbd   69m   Filesystem

NAME                                                                      DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-rbd1-placement-drpc   primary

NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod/dd-io-1-854f867867-rcfd5   1/1     Running   0          69m   10.129.3.45    compute-2   <none>           <none>
pod/dd-io-2-56679fb667-7bjb7   1/1     Running   0          69m   10.129.3.44    compute-2   <none>           <none>
pod/dd-io-3-5757659b99-2th5r   1/1     Running   0          69m   10.131.0.100   compute-0   <none>           <none>
pod/dd-io-4-75bd89888c-x9rrv   1/1     Running   0          69m   10.129.3.47    compute-2   <none>           <none>
pod/dd-io-5-86c65fd579-8c6m7   1/1     Running   0          69m   10.129.3.46    compute-2   <none>           <none>
pod/dd-io-6-fd8994467-rcrkt    1/1     Running   0          69m   10.131.0.102   compute-0   <none>           <none>
pod/dd-io-7-685b4f6699-l7lb8   1/1     Running   0          69m   10.131.0.101   compute-0   <none>           <none>
```

Benamar, could you please check why VRs were not created for any of these workloads?
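To illustrate what is being asked about above, a hedged sketch of the checks on the surviving managed cluster. The resource kinds are the ones referenced in this report (VRG, VR, volumereplicationclass) and the namespace is one of the affected workloads; exact shortnames and output columns may differ by version:

```sh
# The VolumeReplicationGroup exists, but its CURRENTSTATE is not reported as primary:
oc get volumereplicationgroup -n busybox-workloads-5

# Per-PVC VolumeReplication resources that Ramen is expected to create for
# RBD-backed PVCs; in this failure they are missing:
oc get volumereplication -n busybox-workloads-5

# Cluster-scoped VolumeReplicationClass that the VRs reference; this is the
# resource the workaround recreated manually after it was lost:
oc get volumereplicationclass
```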
Logs collected before applying the workaround to create the volumereplicationclass-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-1/

Logs collected a few hours after triggering failover from the CLI are kept here-
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/02nov23-2/

Moving Hub Recovery issues to 4.14.z based on offline discussion.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.