Description of problem (please be as detailed as possible and provide log snippets):

On an RDR setup, after initiating a failover action, maintenance mode is not enabled properly and the expected pre-failover steps are not performed on the failover cluster.

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-04-18-005127
ODF: 4.13.0-168
ACM: 2.7.3
Submariner: 0.14.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an RDR setup
2. Deploy an application
3. Initiate failover from C1 to C2
4. Observe whether MaintenanceMode resources are created and the pre-failover maintenance step of scaling down the RBD mirror is performed on the failover cluster

Actual results:
MaintenanceMode is not enabled after initiating the failover action

Expected results:
MaintenanceMode should be enabled and the pre-failover maintenance steps should be executed on the failover cluster after initiating the failover action

Additional info:
> RBD mirror pod status before and after the failover action

Before failover
---------------
hub cluster
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   103m   sagrawal-nc1       sagrawal-nc1      Failover       FailedOver     Completed     2023-04-20T07:18:46Z   4m46.343135353s   True

sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj     0/1   Completed   0             137m
rook-ceph-rbd-mirror-a-6448895769-mqfw9   2/2   Running     5 (78m ago)   136m

sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj    0/1   Completed   0   138m
rook-ceph-rbd-mirror-a-5f5748488-brgbv   2/2   Running     0   137m

After failover
--------------
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   111m   sagrawal-nc1       sagrawal-c2       Failover       FailedOver     Completed     2023-04-20T08:28:08Z   4m56.257080053s   True

sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj     0/1   Completed   0             144m
rook-ceph-rbd-mirror-a-6448895769-mqfw9   2/2   Running     5 (86m ago)   144m

sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj    0/1   Completed   0   144m
rook-ceph-rbd-mirror-a-5f5748488-brgbv   2/2   Running     0   144m
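For step 4, a rough way to verify on the failover cluster (while the failover is in progress) is to check for MaintenanceMode resources and whether the RBD mirror daemon has been scaled down. The exact resource and deployment names below are assumptions and may vary by environment:

$ oc get maintenancemodes.ramendr.openshift.io        # expected: a MaintenanceMode CR created for the failover; none appear in this bug
$ oc get deployment -n openshift-storage | grep rbd-mirror   # expected: rook-ceph-rbd-mirror-a scaled to 0 during failover
$ oc get pod -n openshift-storage | grep rbd-mirror          # in this bug the mirror pod keeps running throughout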
The label set by MCO on the StorageClass and the VolumeReplicationClass is "ramendr.openshift.io/replicationid", whereas the Ramen VRG controller was looking for "ramendr.openshift.io/replicationID" (with "ID" capitalized). Because of this mismatch, the VRG does not report any maintenance modes from the related labels in its status, which in turn prevents the hub Ramen operator from initiating a MaintenanceMode request on the ManagedCluster that the workload is failing over to. The end result is that the RBD mirror daemon is not shut down for the duration of the failover. Correcting this requires Ramen to update the label name it looks for on the different classes and to update the VRG status accordingly.
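As a rough illustration of the mismatch (the StorageClass name below is a typical ODF default and is used only as a placeholder), the label actually present on the classes can be compared against the key Ramen was querying:

$ oc get storageclass ocs-storagecluster-ceph-rbd --show-labels | grep -o "ramendr.openshift.io/replication[^,]*"
ramendr.openshift.io/replicationid=<id>        # label as set by MCO ("id" in lowercase)
$ oc get volumereplicationclass -o yaml | grep -i "ramendr.openshift.io/replication"
    ramendr.openshift.io/replicationid: <id>   # same lowercase key here

Since the VRG controller was selecting on the capitalized key "ramendr.openshift.io/replicationID", its lookup matched nothing, so no maintenance mode was ever reported or requested.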
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742