Description of problem (please be as detailed as possible and provide log snippets):

This BZ extends the issue reported in BZ2246084, where it was identified that the MCO isn't creating the VolumeReplicationClass for RBD-backed workloads (if I am not wrong). This leads to an unstable workload resource status, which affects failover of those workloads (failover & cleanup won't complete as expected).

Requesting @bmekhiss to add more details to the BZ for better understanding and to correct the above observations if needed.

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-10-18-004928
advanced-cluster-management.v2.9.0-188
ODF 4.14.0-156
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Submariner image: brew.registry.redhat.io/rh-osbs/iib:599799
ACM 2.9.0-DOWNSTREAM-2023-10-18-17-59-25
Latency: 50ms RTT

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Read the Description above
2.
3.

Actual results:

Expected results:

Additional info:
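For context, the resource that is missing on the failover cluster looks roughly like the sketch below. The class name and provisioner are taken from actual output later in this BZ; the apiVersion is the csi-addons replication API, and the schedulingInterval parameter value is illustrative only (in practice it matches the DRPolicy sync interval):

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass-1625360775
spec:
  provisioner: openshift-storage.rbd.csi.ceph.com
  parameters:
    # illustrative value; must match the DRPolicy sync interval
    schedulingInterval: 5m

Without this class on the surviving cluster, the VolumeReplication resources for RBD-backed PVCs cannot be reconciled there, which is what destabilizes the workload resource status during failover and cleanup.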
Moving hub recovery issues out to 4.15 based on offline discussion.
Hi Mudit, could we please re-target this bug to 4.14.z and not 4.15?
https://bugzilla.redhat.com/show_bug.cgi?id=2248824 is the 4.14 clone
Tested with the following versions:
ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
OCP 4.16.0-0.nightly-2024-05-23-173505
ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26
MCE 2.6.0-104
ODF 4.16.0-108.stable
GitOps v1.12.3
Platform: VMware

Steps taken from BZ2246084:
1. On an RDR setup, deploy multiple RBD- and CephFS-based workloads of both subscription and appset types and run IOs for a few days. In this case, a few CephFS workloads were deployed on C2 but all other workloads were on C1.
2. Perform hub recovery by bringing the active hub down.
3. Restore the backup on the passive hub; ensure the managed clusters are successfully imported, the DRPolicy gets validated, drpc gets created, the managed clusters are healthy, and sync is working fine for all workloads.
4. Failover the CephFS workload running on C2 to C1 with all nodes of C2 up and running. Then let IOs continue for some more time (a few hrs) for all workloads.
5. Now bring the master nodes of the primary cluster down and wait for the cluster status to change to Unknown on the RHACM console.
6. Now perform failover of all workloads running on C1 to the secondary managed cluster C2.

Failover was successful for all RBD and CephFS workloads, and the VolumeReplicationClass was successfully restored on the surviving managed cluster (which is needed for RBD):

oc get volumereplicationclass -A
NAME                                    PROVISIONER
rbd-volumereplicationclass-1625360775   openshift-storage.rbd.csi.ceph.com
rbd-volumereplicationclass-473128587    openshift-storage.rbd.csi.ceph.com

DRPC from the new hub:

NAMESPACE               NAME                                     AGE     PREFERREDCLUSTER   FAILOVERCLUSTER    DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             PEER READY
busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc      4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:02Z   False
busybox-workloads-13    cephfs-sub-busybox13-placement-1-drpc    4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:27:33Z   False
busybox-workloads-16    cephfs-sub-busybox16-placement-1-drpc    4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:27:26Z   False
busybox-workloads-18    cnv-sub-busybox18-placement-1-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T16:52:14Z   False
busybox-workloads-5     rbd-sub-busybox5-placement-1-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:50Z   False
busybox-workloads-6     rbd-sub-busybox6-placement-1-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:56Z   False
busybox-workloads-7     rbd-sub-busybox7-placement-1-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:34Z   False
openshift-gitops        cephfs-appset-busybox12-placement-drpc   4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:28:14Z   False
openshift-gitops        cephfs-appset-busybox9-placement-drpc    4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:28:19Z   False
openshift-gitops        cnv-appset-busybox17-placement-drpc      4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T16:52:23Z   False
openshift-gitops        rbd-appset-busybox1-placement-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:08Z   False
openshift-gitops        rbd-appset-busybox100-placement-drpc     4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:14Z   False
openshift-gitops        rbd-appset-busybox2-placement-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:20Z   False
openshift-gitops        rbd-appset-busybox3-placement-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:49Z   False

Since the primary managed cluster is still down, PROGRESSION reports Cleaning Up, which is expected. Failover was also successful for the 2 CNV (RBD) workloads cnv-sub-busybox18-placement-1-drpc and cnv-appset-busybox17-placement-drpc (of subscription and appset pull-model types, respectively), and the data written into the VMs was successfully restored after failover completed.

Fix LGTM.
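For anyone re-verifying, the per-workload state above can be queried directly from the new hub. A minimal sketch (the workload namespace/name are taken from the table above; the jsonpath assumes the Ramen DRPlacementControl status schema):

oc get drpc -A -o wide
oc get drpc rbd-sub-busybox5-placement-1-drpc -n busybox-workloads-5 -o jsonpath='{.status.progression}{"\n"}'

Once the downed primary cluster is recovered and cleanup completes, PROGRESSION should move from Cleaning Up to Completed and PEER READY should return to True.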
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.