+++ This bug was initially created as a clone of Bug #2251022 +++

Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCP 4.14.0-0.nightly-2023-11-09-204811
Volsync 0.8.0
Submariner 0.16.2
ACM quay.io:443/acm-d/acm-custom-registry:v2.9.0-RC2
odf-multicluster-orchestrator.v4.14.1-rhodf
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

**Active hub at neutral site**
We move to the passive hub after the active hub goes down, continue running IOs for a few days, and then another disaster occurs where the primary managed cluster goes down. We therefore need to fail over those workloads to the failover cluster using the passive hub.

1. Deploy multiple RBD and CephFS backed workloads of both appset and subscription types.
2. Fail over and relocate them in such a way that they end up running on the primary managed cluster (which is expected to host all the workloads and can go down in a disaster).
3. Ensure the workloads are in distinct states such as Deployed, FailedOver, Relocated, etc.
4. Let at least 1 or 2 of the latest backups be taken (one every hour) for all the different workload states (when progression is completed and no action is in progress on any of the workloads). Also ensure that sync for all the workloads is working fine while on the active hub and the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc. (see the sketch after these steps).
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure velero reports a successful restoration. Make sure both managed clusters are successfully reported and the drpolicy gets validated.
6. Wait for the drpc resources to be restored and check whether all the workloads are in their last backed-up state. They seem to have retained the last state that was backed up, so everything is fine so far.
7. Let IOs continue for a few days (3-4 days in this case). Data sync for RBD based workloads was progressing just fine. Bring down the primary managed cluster (shut down all nodes) and wait for the cluster status to change on the RHACM console.
8. Trigger failover for all RBD workloads to the secondary failover cluster via the ACM UI of the passive hub and check the progress. (The older hub remains down forever and is completely unreachable.)
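For step 4, a minimal sketch (assumed commands, not taken from this run) of how the DRPC state and last sync times can be noted on the hub before bringing it down; the .status.lastGroupSyncTime field path assumes the Ramen DRPlacementControl API:

# DRPC placement and progression for all protected workloads
oc get drpc -A -o wide

# Last group sync time per DRPC (field path assumed)
oc get drpc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LASTGROUPSYNCTIME:.status.lastGroupSyncTime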
Actual results:

Failover of RBD workloads didn't proceed after the drpc reported WaitForStorageMaintenanceActivation.

From passive hub-

amagrawa:~$ date -u
Wednesday 22 November 2023 10:44:15 AM UTC

amagrawa:~$ drpc|grep rbd
busybox-workloads-6   rbd-sub-busybox-workloads-6-placement-1-drpc    5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:42:01Z   False
busybox-workloads-7   rbd-sub-busybox-workloads-7-placement-1-drpc    5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:42:30Z   False
busybox-workloads-8   rbd-sub-busybox-workloads-8-placement-1-drpc    5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:42:46Z   False
openshift-gitops      rbd-appset-busybox-workloads-1-placement-drpc   5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:42:53Z   False
openshift-gitops      rbd-appset-busybox-workloads-2-placement-drpc   5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:43:01Z   False
openshift-gitops      rbd-appset-busybox-workloads-3-placement-drpc   5d20h   amagrawa-10n-2   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:43:09Z   False
openshift-gitops      rbd-appset-busybox-workloads-4-placement-drpc   5d20h   amagrawa-10n-1   amagrawa-10n-2   Failover   FailingOver   WaitForStorageMaintenanceActivation   2023-11-22T10:43:17Z   False

C2 (Failover cluster)-

amagrawa:c2$ mm
NAME                                      AGE
cf83e1357eefb8bdf1542850d66d8007d620e40   8s

amagrawa:c2$ mmyaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: MaintenanceMode
  metadata:
    creationTimestamp: "2023-11-22T10:42:13Z"
    generation: 1
    name: cf83e1357eefb8bdf1542850d66d8007d620e40
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 12468cfb98c25f5699e4c121971044ca37b147bb769b8512ab330cdc5a7c53d2-cf83e1357eefb8bdf1542850d66d8007d620e40-mmode-mw
      uid: 189dbc54-c4d4-4f17-8d31-94202eea7569
    resourceVersion: "18019820"
    uid: 42d80af3-e49c-41cc-9f4f-016de7086cb5
  spec:
    modes:
    - Failover
    storageProvisioner: openshift-storage.rbd.csi.ceph.com
    targetID: cf83e1357eefb8bdf1542850d66d8007d620e40
kind: List
metadata:
  resourceVersion: ""

amagrawa:c2$ mm
NAME                                      AGE
cf83e1357eefb8bdf1542850d66d8007d620e40   4m19s

amagrawa:c2$ pods|grep mirror
rook-ceph-rbd-mirror-a-777755497d-7x65q   2/2   Running   1 (19h ago)   20h   10.128.2.38   compute-2   <none>   <none>

The rbd-mirror deployment wasn't auto-scaled down on the failover cluster C2.

amagrawa:c2$ pods|grep mirror
rook-ceph-rbd-mirror-a-777755497d-7x65q   2/2   Running   1 (20h ago)   20h   10.128.2.38   compute-2   <none>   <none>

Took the output again from C2; the observations remain the same. Failover didn't even start.
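For reference, a minimal sketch of the full commands behind the shell aliases used above (drpc, mm, mmyaml, pods); the exact alias expansions are assumptions, but the resource kinds and names match the output shown:

# On the passive hub: DRPC state for all protected workloads
oc get drpc -A -o wide

# On the failover cluster C2: MaintenanceMode CRs created for the failover
oc get maintenancemodes.ramendr.openshift.io
oc get maintenancemodes.ramendr.openshift.io -o yaml

# On C2: rbd-mirror pods and the deployment that is expected to be scaled down
# while the Failover maintenance mode is active
oc get pods -n openshift-storage -o wide | grep mirror
oc get deployment rook-ceph-rbd-mirror-a -n openshift-storage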
amagrawa:c2$ date -u
Wednesday 22 November 2023 11:25:01 AM UTC

amagrawa:c2$ mmyaml
apiVersion: v1
items:
- apiVersion: ramendr.openshift.io/v1alpha1
  kind: MaintenanceMode
  metadata:
    creationTimestamp: "2023-11-22T10:42:13Z"
    generation: 1
    name: cf83e1357eefb8bdf1542850d66d8007d620e40
    ownerReferences:
    - apiVersion: work.open-cluster-management.io/v1
      kind: AppliedManifestWork
      name: 12468cfb98c25f5699e4c121971044ca37b147bb769b8512ab330cdc5a7c53d2-cf83e1357eefb8bdf1542850d66d8007d620e40-mmode-mw
      uid: 189dbc54-c4d4-4f17-8d31-94202eea7569
    resourceVersion: "18019820"
    uid: 42d80af3-e49c-41cc-9f4f-016de7086cb5
  spec:
    modes:
    - Failover
    storageProvisioner: openshift-storage.rbd.csi.ceph.com
    targetID: cf83e1357eefb8bdf1542850d66d8007d620e40
kind: List
metadata:
  resourceVersion: ""

(C1 cluster remains down)

This leads to application unavailability, as the primary managed cluster C1 is down after the disaster and the workloads couldn't be failed over to C2.

Expected results: Failover of RBD workloads should proceed and they should be accessible on the failover cluster.

Additional info:

--- Additional comment from RHEL Program Management on 2023-11-22 17:01:35 IST ---

This bug, having no release flag set previously, has now had the release flag 'odf-4.15.0' set to '?', and so is being proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-11-22 17:01:35 IST ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Benamar Mekhissi on 2023-11-23 00:54:17 IST ---

The maintenance mode was stuck due to a mismatch between the replicationId label on the VRC and the replicationId label on the CephCluster. These two values should match.

old active hub
==============
vrc label: ramendr.openshift.io/replicationid: a99df9fc6c52c7ef44222ab38657a0b15628a14

new active hub
==============
vrc label: ramendr.openshift.io/replicationid: cf83e1357eefb8bdf1542850d66d8007d620e40

@vbadrina I am assigning this to you.

--- Additional comment from Eran Tamir on 2023-11-27 14:09:03 IST ---

Do we have a workaround for that? If not, why don't we consider it a blocker? @

--- Additional comment from Aman Agrawal on 2023-11-27 15:23:10 IST ---

Today, Umanga wanted us to try restarting the odfmo-controller-manager-xxxxx pod in the openshift-operators namespace on the passive hub, but the setup is no longer available due to issues with the datacenter: the hosts are down and the cluster shows as disconnected, so it couldn't be tested (the ecosystem team is aware of this issue). And yes, it is a hub recovery blocker bug.
Relevant thread- https://chat.google.com/room/AAAAqWkMm2s/CzNW3bY-Q_U

--- Additional comment from Shyamsundar on 2023-11-28 19:19:06 IST ---

Poked around some older data and here is what is happening:

old active hub (correct)
vrc: ramendr.openshift.io/replicationid: a99df9fc6c52c7ef44222ab38657a0b15628a14

new active hub (incorrect)
vrc: ramendr.openshift.io/replicationid: cf83e1357eefb8bdf1542850d66d8007d620e40

fsid 1: 7e252ee3-abd9-4c54-a4ff-a2fdce8931a0
fsid 2: aacfbd7e-5ced-42a5-bdc2-483fcbe5a29d

Correct hash generation:
$ echo -n "7e252ee3-abd9-4c54-a4ff-a2fdce8931a0-aacfbd7e-5ced-42a5-bdc2-483fcbe5a29d" | sha512sum
a99df9fc6c52c7ef44222ab38657a0b15628a14507417e8443111e17fb9623b0194b8a84c145db0a1bdabafe573fc9b0eeb6139e356748e0bd7c533e3cb423bb  -

Incorrect hash generation when the fsids are empty (this tallies with the incorrect values on the new active hub):
$ echo -n "" | sha512sum
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  -

The issue seems to be that MCO does not check whether it has valid fsid values, generates a hash from an empty string, and (re)labels the classes with it.

--- Additional comment from Aman Agrawal on 2023-11-29 18:09:06 IST ---

As updated in the thread, restarting the odfmo-controller-manager-xxxxx pod inside openshift-operators on the passive hub didn't seem to help. The failover progression remains stuck at WaitForStorageMaintenanceActivation. mmode was activated on the failover cluster, but it didn't scale down the rbd-mirror daemon deployment and failover doesn't proceed. mmode remains activated in the same state forever.

--- Additional comment from umanga on 2023-11-29 20:53:47 IST ---

Issue is identified and fix is available. Providing devel_ack+.
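Based on Shyamsundar's analysis above, a minimal sketch of how the mismatch could be confirmed on a live setup. Only the hashing scheme (sha512 over "<fsid1>-<fsid2>") comes from the comment; the CephCluster name, the .status.ceph.fsid field path, the kube contexts c1/c2, and whether the label uses the full hash or a truncated prefix are assumptions:

# Ceph fsid reported by each managed cluster (field path assumed)
FSID1=$(oc --context c1 -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.status.ceph.fsid}')
FSID2=$(oc --context c2 -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.status.ceph.fsid}')

# Recompute the replication ID; empty fsids reproduce the bogus cf83e135... value
echo -n "${FSID1}-${FSID2}" | sha512sum

# Compare with the label that MCO put on the VolumeReplicationClasses
oc --context c2 get volumereplicationclass \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.ramendr\.openshift\.io/replicationid}{"\n"}{end}'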
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7696