Bug 2267885

Summary: [RDR] [Hub recovery] [Co-situated] Missing VRCs blocks failover operation for RBD workloads
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: odf-dr
Sub component: multicluster-orchestrator
Assignee: Vineet <vbadrina>
QA Contact: krishnaram Karthick <kramdoss>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: muagarwa
Version: 4.15
Keywords: Regression
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.15.0-157
Doc Type: No Doc Update
Last Closed: 2024-03-19 15:33:24 UTC
Type: Bug

Description Aman Agrawal 2024-03-05 11:51:49 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-02-27-181650
ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
ODF 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
**Active hub co-situated with the primary managed cluster at site1**

1. Deploy multiple RBD- and CephFS-backed workloads of both appset and subscription types.
2. Fail over and relocate them so that they all end up running on the primary managed cluster (which is expected to host all the workloads and may go down in the disaster): apps that were failed over from C1 to C2 are relocated back to C1, and apps that were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all workload combinations are in distinct states (Deployed, FailedOver, Relocated) on C1, and leave a few workloads in the Deployed state on C2 as well.
4. Let at least one latest backup be taken for each of the different workload states (when progression is completed and no action is in progress on any workload). Also ensure that sync is working for all workloads while on the active hub and that the cluster is healthy. Note `drpc -o wide` output, lastGroupSyncTime, download backups from S3, etc.
5. Perform a site failure (bring the active hub and the primary managed cluster down) and move to the passive hub at site2, which is co-situated with the secondary managed cluster, by performing hub recovery. Restore backups and ensure velero reports a successful restoration. Make sure the secondary managed cluster is successfully imported and the drpolicy gets validated.
6. Wait for drpc progression to be restored.
7. Fail over all the RBD and CephFS workloads that were running on the (now down) primary managed cluster to the secondary cluster, and observe the status.

The primary managed cluster remains down.
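The "missing VRCs" in the summary refers to VolumeReplicationClass objects that RBD DR workloads reference but that are absent on the surviving cluster after hub recovery. A minimal local sketch of that check follows; the VRC names are hypothetical, and in a real cluster the first list would come from the replication class referenced by each VolumeReplicationGroup and the second from `oc get volumereplicationclass -o name`:

```shell
# Hypothetical VRC names for illustration only (not taken from this cluster).
# referenced_vrcs: classes the RBD workloads expect; present_vrcs: classes
# actually found on the surviving cluster after hub recovery.
referenced_vrcs='rbd-volumereplicationclass-1625360775
rbd-volumereplicationclass-539797778'
present_vrcs='rbd-volumereplicationclass-1625360775'

# Any referenced class that is not present blocks failover for its workloads.
missing=$(printf '%s\n' "$referenced_vrcs" | while read -r vrc; do
  printf '%s\n' "$present_vrcs" | grep -qxF "$vrc" || printf '%s\n' "$vrc"
done)
printf 'missing VRCs:\n%s\n' "$missing"
```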

Actual results: [RDR] [Hub recovery] [Co-situated] Missing VRCs blocks failover operation for RBD workloads

Failover was successful for all CephFS workloads, but all the RBD workloads remained stuck:

amagrawa:~$ drpc|grep rbd
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:08Z                  False
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:15Z                  False
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc      26h   amagrawa-new-m2                                     Deployed       EnsuringVolSyncSetup   2024-03-03T10:28:54Z   461.285449ms   True
busybox-workloads-9    rbd-sub-busybox9-placement-1-drpc       26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:01Z                  False
openshift-gitops       rbd-appset-busybox1-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:22Z                  False
openshift-gitops       rbd-appset-busybox2-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:27Z                  False
openshift-gitops       rbd-appset-busybox3-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:34Z                  False
openshift-gitops       rbd-appset-busybox4-placement-drpc      26h   amagrawa-new-m2                                     Deployed       EnsuringVolSyncSetup   2024-03-03T10:28:55Z   365.204907ms   True
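The stuck entries above are the rows with progression `WaitForReadiness` and PEER READY `False`. A small local sketch of filtering them out of the `drpc` output (two sample rows from the paste above are embedded so it runs without cluster access; on a live hub the input would come from `oc get drpc -A -o wide`):

```shell
# Sample rows copied from the drpc output above: one stuck in
# WaitForReadiness (PEER READY=False), one healthy Deployed row.
drpc_output='busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc   26h   amagrawa-new-c1   amagrawa-new-m2   Failover   FailedOver   WaitForReadiness   2024-03-03T20:40:08Z   False
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc   26h   amagrawa-new-m2   Deployed   EnsuringVolSyncSetup   2024-03-03T10:28:54Z   461.285449ms   True'

# Print namespace and DRPC name for rows stuck awaiting peer readiness.
printf '%s\n' "$drpc_output" |
  awk '$NF == "False" && /WaitForReadiness/ {print $1, $2}'
```

Applied to the full output above, this lists the six stuck RBD DRPCs and skips the two healthy Deployed ones.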


Logs collected from the passive hub and the secondary managed cluster after observing that failover wasn't progressing: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/04march24/

Expected results: Failover should complete for all the workloads


Additional info:

Comment 12 errata-xmlrpc 2024-03-19 15:33:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383