Bug 2188303

Summary: [RDR] Maintenance mode is not enabled after initiating failover action
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sidhant Agrawal <sagrawal>
Component: odf-dr
Assignee: Shyamsundar <srangana>
odf-dr sub component: ramen
QA Contact: Sidhant Agrawal <sagrawal>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: unspecified
CC: amagrawa, kseeger, muagarwa, ocs-bugs, odf-bz-bot, srangana
Version: 4.13
Target Milestone: ---
Target Release: ODF 4.13.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.13.0-178
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sidhant Agrawal 2023-04-20 12:03:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On an RDR setup, after initiating a failover action, maintenance mode is not enabled and the expected pre-failover steps are not performed on the failover cluster.

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-04-18-005127
ODF: 4.13.0-168
ACM: 2.7.3
Submariner: 0.14.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Configure an RDR setup
2. Deploy an application
3. Initiate failover from C1 to C2
4. Check whether MaintenanceMode resources are created and whether the pre-failover maintenance step of scaling down the RBD mirror daemon is performed on the failover cluster


Actual results:
MaintenanceMode is not enabled after initiating the failover action

Expected results:
MaintenanceMode should be enabled and the pre-failover maintenance steps should be executed on the failover cluster after initiating the failover action

Additional Info:

>RBD mirror pod status before and after failover action

Before failover
---------------
hub cluster
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   103m   sagrawal-nc1       sagrawal-nc1      Failover       FailedOver     Completed     2023-04-20T07:18:46Z   4m46.343135353s   True
 
sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj                             0/1     Completed   0              137m
rook-ceph-rbd-mirror-a-6448895769-mqfw9                           2/2     Running     5 (78m ago)    136m
 
sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj                             0/1     Completed   0          138m
rook-ceph-rbd-mirror-a-5f5748488-brgbv                            2/2     Running     0          137m
 
 
After failover 
--------------
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   111m   sagrawal-nc1       sagrawal-c2       Failover       FailedOver     Completed     2023-04-20T08:28:08Z   4m56.257080053s   True
 
sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj                             0/1     Completed   0              144m
rook-ceph-rbd-mirror-a-6448895769-mqfw9                           2/2     Running     5 (86m ago)    144m
 
sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj                             0/1     Completed   0          144m
rook-ceph-rbd-mirror-a-5f5748488-brgbv                            2/2     Running     0          144m
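
The manual check above can be scripted. The following is only a sketch: it assumes the usual NAME / READY / STATUS column order of `oc get pod` and assumes that a scaled-down RBD mirror daemon simply no longer shows up as a Running pod. The helper name `running_rbd_mirror_pods` is hypothetical; the sample text is the sagrawal-c2 output captured after failover in this report.

```python
from typing import List


def running_rbd_mirror_pods(oc_output: str) -> List[str]:
    """Return names of rook-ceph-rbd-mirror pods whose STATUS is Running."""
    running = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Assumed column order: NAME READY STATUS RESTARTS ... AGE
        if len(fields) < 3:
            continue
        name, status = fields[0], fields[2]
        if name.startswith("rook-ceph-rbd-mirror-") and status == "Running":
            running.append(name)
    return running


# Output captured on sagrawal-c2 after failover (copied from this report).
after_failover_c2 = """\
enable-rbd-mirror-debug-logging-8l9vj                             0/1     Completed   0          144m
rook-ceph-rbd-mirror-a-5f5748488-brgbv                            2/2     Running     0          144m
"""

still_running = running_rbd_mirror_pods(after_failover_c2)
# A non-empty list here means the mirror daemon was not scaled down.
print(still_running)
```

With the output from this report the list is non-empty, which matches the observation that the pre-failover scale-down did not happen on the failover cluster.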

Comment 3 Shyamsundar 2023-04-20 13:37:07 UTC
The label set by MCO on the StorageClass and the VolumeReplicationClass is "ramendr.openshift.io/replicationid", whereas the Ramen VRG controller was looking for "ramendr.openshift.io/replicationID" (ID capitalized).

Because of this mismatch, the VRG does not report in its status any maintenance modes derived from the other related labels. As a result, the hub Ramen operator does not initiate a MaintenanceMode request on the ManagedCluster that the workload is failing over to. The end result is that the RBD mirror daemon is not shut down for the duration of the failover.

Correcting this requires Ramen to update the label name it looks for on the different classes and to update the VRG status accordingly.
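
To illustrate why a one-letter case difference is enough to break the chain: Kubernetes label keys are matched exactly and case-sensitively. The sketch below is not Ramen code (Ramen is written in Go); it only mimics an exact-match label lookup with a Python dict. The two label keys are taken from this comment, while the label value and the helper name `replication_id` are hypothetical.

```python
# Label keys as quoted in this comment.
LABEL_SET_BY_MCO = "ramendr.openshift.io/replicationid"
LABEL_LOOKED_UP_BEFORE_FIX = "ramendr.openshift.io/replicationID"


def replication_id(labels, key):
    """Exact, case-sensitive lookup, the way label keys are matched."""
    return labels.get(key)


# Hypothetical labels on a StorageClass; the value is made up for illustration.
storage_class_labels = {LABEL_SET_BY_MCO: "example-replication-id"}

# Looking up the capitalized key finds nothing, so nothing downstream is reported.
before_fix = replication_id(storage_class_labels, LABEL_LOOKED_UP_BEFORE_FIX)

# Looking up the key that MCO actually sets finds the value.
after_fix = replication_id(storage_class_labels, LABEL_SET_BY_MCO)

print(before_fix)
print(after_fix)
```

The first lookup returns nothing and the second returns the value, which is the whole difference between the reported behavior and the expected one.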

Comment 13 errata-xmlrpc 2023-06-21 15:25:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742