Bug 2267885 - [RDR] [Hub recovery] [Co-situated] Missing VRCs blocks failover operation for RBD workloads
Summary: [RDR] [Hub recovery] [Co-situated] Missing VRCs blocks failover operation for RBD workloads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Vineet
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-03-05 11:51 UTC by Aman Agrawal
Modified: 2024-03-19 15:33 UTC (History)
CC List: 1 user

Fixed In Version: 4.15.0-157
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:33:24 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage odf-multicluster-orchestrator pull 194 0 None open Bug 2267885: Add hub recovery labels for backing up secrets 2024-03-06 12:11:32 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:33:25 UTC

Description Aman Agrawal 2024-03-05 11:51:49 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.15.0-0.nightly-2024-02-27-181650
ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
ODF 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
**Active hub co-situated with the primary managed cluster at site1**

1. Deploy multiple RBD- and CephFS-backed workloads of both ApplicationSet and Subscription types.
2. Fail over and relocate them so that they all end up running on the primary managed cluster (which is expected to host all the workloads and may go down in a disaster): apps that were failed over from C1 to C2 are relocated back to C1, and apps that were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all workload combinations are present in distinct states (Deployed, FailedOver, Relocated) on C1, and that a few workloads remain in the Deployed state on C2 as well.
4. Let at least one fresh backup be taken for each of the different workload states (when progression is completed and no action is in progress on any workload). Also ensure that sync for all workloads on the active hub is working fine and the cluster is healthy. Note drpc -o wide output, lastGroupSyncTime, download backups from S3, etc. (see the command sketch below these steps).
5. Perform a site failure (bring the active hub and the primary managed cluster down) and move to the passive hub at site2, which is co-situated with the secondary managed cluster, by performing hub recovery. Restore the backups and ensure velero reports a successful restoration. Make sure the secondary managed cluster is successfully imported and the drpolicy gets validated.
6. Wait for drpc progression to be restored.
7. Fail over all the RBD and CephFS workloads that were running on the primary managed cluster (which went down) to the secondary cluster and observe the status.

Primary managed cluster remains down.
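
For reference, below is a minimal CLI sketch of how the checks and actions in steps 4-7 can be performed from the hub. It assumes the drpc command shown in the outputs is a shell alias for "oc get drpc -A -o wide", that the ACM backup/restore resources live in the usual open-cluster-management-backup namespace, and it reuses workload/cluster names from this report as placeholders; it is not the exact procedure used here.

# Assumed expansion of the `drpc` alias used in this report
oc get drpc -A -o wide

# Step 4: record lastGroupSyncTime for every DRPC before the site failure
oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.lastGroupSyncTime}{"\n"}{end}'

# Step 5: on the passive hub, confirm the restore completed and the drpolicy is validated
oc get restore -n open-cluster-management-backup
oc get drpolicy -o yaml | grep -B1 -A3 'type: Validated'

# Step 7: one way to trigger failover for a single workload (names taken from this report)
oc patch drpc rbd-sub-busybox10-placement-1-drpc -n busybox-workloads-10 \
  --type merge -p '{"spec":{"action":"Failover","failoverCluster":"amagrawa-new-m2"}}'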

Actual results: [RDR] [Hub recovery] [Co-situated] Missing VRCs blocks failover operation for RBD workloads

Failover was successful for all CephFS workloads, but all the RBD workloads remained stuck:

amagrawa:~$ drpc|grep rbd
busybox-workloads-10   rbd-sub-busybox10-placement-1-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:08Z                  False
busybox-workloads-11   rbd-sub-busybox11-placement-1-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:15Z                  False
busybox-workloads-12   rbd-sub-busybox12-placement-1-drpc      26h   amagrawa-new-m2                                     Deployed       EnsuringVolSyncSetup   2024-03-03T10:28:54Z   461.285449ms   True
busybox-workloads-9    rbd-sub-busybox9-placement-1-drpc       26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:01Z                  False
openshift-gitops       rbd-appset-busybox1-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:22Z                  False
openshift-gitops       rbd-appset-busybox2-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:27Z                  False
openshift-gitops       rbd-appset-busybox3-placement-drpc      26h   amagrawa-new-c1    amagrawa-new-m2   Failover       FailedOver     WaitForReadiness       2024-03-03T20:40:34Z                  False
openshift-gitops       rbd-appset-busybox4-placement-drpc      26h   amagrawa-new-m2                                     Deployed       EnsuringVolSyncSetup   2024-03-03T10:28:55Z   365.204907ms   True
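
Since the summary points at missing VolumeReplicationClasses (VRCs), a quick way to confirm that suspicion is sketched below. The workload namespace and DRPC name are taken from this report; the rest is a generic check, not the exact triage performed for this bug.

# On the surviving (secondary) managed cluster: list the cluster-scoped VolumeReplicationClasses.
# If the RBD ones are missing, RBD failover cannot progress past WaitForReadiness.
oc get volumereplicationclass

# Inspect the VolumeReplicationGroup of one stuck workload on the same cluster
oc get volumereplicationgroup -n busybox-workloads-10 -o yaml

# On the hub, dump the DRPC conditions for that workload
oc get drpc rbd-sub-busybox10-placement-1-drpc -n busybox-workloads-10 -o jsonpath='{.status.conditions}'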


Logs collected from the passive hub and the secondary managed cluster after observing that failover isn't progressing: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/04march24/

 
Expected results: Failover should complete for all the workloads
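
For comparison, a sketch of what a successful outcome would look like from the hub, assuming the CURRENTSTATE and PROGRESSION columns above map to the DRPC status.phase and status.progression fields: every failed-over RBD DRPC should end up as FailedOver / Completed with PEER READY True.

# All failed-over DRPCs should report phase FailedOver and progression Completed
oc get drpc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.progression}{"\n"}{end}'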


Additional info:

Comment 12 errata-xmlrpc 2024-03-19 15:33:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

