Description of problem (please be as detailed as possible and provide log snippets):

On an RDR setup, after initiating a failover action, maintenance mode is not enabled properly and the expected pre-failover steps are not performed on the failover cluster.

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-04-18-005127
ODF: 4.13.0-168
ACM: 2.7.3
Submariner: 0.14.3

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Configure an RDR setup
2. Deploy an application
3. Initiate failover from C1 to C2
4. Observe whether MaintenanceMode resources are created and the pre-failover maintenance step of scaling down the RBD mirror is performed on the failover cluster

Actual results:
MaintenanceMode is not enabled after initiating the failover action

Expected results:
MaintenanceMode should be enabled and the pre-failover maintenance steps should be executed on the failover cluster after initiating the failover action

Additional info:
> RBD mirror pod status before and after the failover action

Before failover
---------------
hub cluster
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   103m   sagrawal-nc1       sagrawal-nc1      Failover       FailedOver     Completed     2023-04-20T07:18:46Z   4m46.343135353s   True

sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj     0/1   Completed   0             137m
rook-ceph-rbd-mirror-a-6448895769-mqfw9   2/2   Running     5 (78m ago)   136m

sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj    0/1   Completed   0   138m
rook-ceph-rbd-mirror-a-5f5748488-brgbv   2/2   Running     0   137m

After failover
--------------
$ oc get drpc -A -o wide
NAMESPACE             NAME           AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION          PEER READY
busybox-workloads-1   busybox-drpc   111m   sagrawal-nc1       sagrawal-c2       Failover       FailedOver     Completed     2023-04-20T08:28:08Z   4m56.257080053s   True

sagrawal-nc1 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-phbkj     0/1   Completed   0             144m
rook-ceph-rbd-mirror-a-6448895769-mqfw9   2/2   Running     5 (86m ago)   144m

sagrawal-c2 cluster
$ oc get pod -n openshift-storage | grep "mirror"
enable-rbd-mirror-debug-logging-8l9vj    0/1   Completed   0   144m
rook-ceph-rbd-mirror-a-5f5748488-brgbv   2/2   Running     0   144m
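For step 4, a rough way to verify on the failover cluster (while the failover is in progress) is to check for MaintenanceMode resources and whether the RBD mirror daemon has been scaled down. The exact resource and deployment names below are assumptions and may vary by environment:

$ oc get maintenancemodes.ramendr.openshift.io        # expected: a MaintenanceMode CR created for the failover; none appear in this bug
$ oc get deployment -n openshift-storage | grep rbd-mirror   # expected: rook-ceph-rbd-mirror-a scaled to 0 during failover
$ oc get pod -n openshift-storage | grep rbd-mirror          # in this bug the mirror pod keeps running throughout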
The label set by MCO on the StorageClass and the VolumeReplicationClass is "ramendr.openshift.io/replicationid", whereas the Ramen VRG controller was looking for "ramendr.openshift.io/replicationID" (with "ID" capitalized). Because of this mismatch, the VRG does not report any maintenance modes from the related labels in its status, which in turn prevents the hub Ramen operator from initiating a MaintenanceMode request on the ManagedCluster that the workload is failing over to. The end result is that the RBD mirror daemon is not shut down for the duration of the failover. Correcting this requires Ramen to update the label name it looks for on the different classes and to update the VRG status accordingly.
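As a rough illustration of the mismatch (the StorageClass name below is a typical ODF default and is used only as a placeholder), the label actually present on the classes can be compared against the key Ramen was querying:

$ oc get storageclass ocs-storagecluster-ceph-rbd --show-labels | grep -o "ramendr.openshift.io/replication[^,]*"
ramendr.openshift.io/replicationid=<id>        # label as set by MCO ("id" in lowercase)
$ oc get volumereplicationclass -o yaml | grep -i "ramendr.openshift.io/replication"
    ramendr.openshift.io/replicationid: <id>   # same lowercase key here

Since the VRG controller was selecting on the capitalized key "ramendr.openshift.io/replicationID", its lookup matched nothing, so no maintenance mode was ever reported or requested.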
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742