While working with an active/passive Hub Metro-DR setup, you might come across a rare scenario where the Ramen reconciler stops running after exceeding its allowed rate-limiting parameters. As reconciliation is specific to each workload, only that workload is impacted. In such an event, all disaster recovery orchestration activities related to that workload stop until the Ramen pod is restarted.
Workaround: Restart the Ramen pod on the Hub cluster.
$ oc delete pods <ramen-pod-name> -n openshift-operators
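Since the deleted pod is recreated by its Deployment, the "restart" is really a delete. A minimal sketch of a helper that builds the delete command, assuming the operator runs in the `openshift-operators` namespace (as in the workaround above) and that its pod name starts with `ramen` (e.g. `ramen-hub-operator-…`; the exact name is environment-specific, list pods with `oc get pods -n openshift-operators` first):

```shell
# Hedged sketch: restart the Ramen hub operator by deleting its pod so the
# Deployment controller recreates it. Namespace and pod-name pattern are
# assumptions; adjust both for your environment.
RAMEN_NS="openshift-operators"

# Build the oc command that deletes a given Ramen pod.
ramen_restart_cmd() {
    pod="$1"
    printf 'oc delete pod %s -n %s' "$pod" "$RAMEN_NS"
}

# Example with a hypothetical pod name:
ramen_restart_cmd "ramen-hub-operator-abc123"
# -> oc delete pod ramen-hub-operator-abc123 -n openshift-operators
```

After the delete, watch the namespace (`oc get pods -n openshift-operators -w`) until the replacement pod reaches Running before retrying the failover from the UI.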
Created attachment 1947701 [details]
[1]: Cannot initiate failover of app from UI
Description of problem (please be as detailed as possible and provide log
snippets):
Created an active/passive MDR setup and brought zone b down (where the c1 managed cluster, the active hub, and 3 Ceph nodes were running).
Then tried to restore the data on the passive hub, which was running in another zone.
After restoring to the passive hub, failover of c1 apps cannot be initiated, as shown in the attached screenshot [1].
Version of all relevant components (if applicable):
OCP: 4.12.0-0.nightly-2023-03-02-051935
ODF: 4.12.1-19
ACM: 2.7.1
CEPH: 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
A restart of the Ramen pod; once the pod is restarted, you should be able to initiate failover from the UI.
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2
Can this issue be reproduced?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Create 4 OCP clusters: 2 hubs and 2 managed clusters, plus one stretched RHCS cluster.
Deploy the clusters such that:
zone a: arbiter ceph node
zone b: c1, active hub, 3 ceph nodes
zone c: c2, passive hub, 3 ceph nodes
2. Configure MDR and deploy an application on each managed cluster
3. Initiate a backup process, such that the active and passive hubs are in sync
4. Bring zone b down
5. Initiate the restore process on the passive hub
6. Initiate failover of the application
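After step 5, before attempting the failover, it can help to confirm that Ramen is actually reconciling the workload. The per-workload object Ramen reconciles is the DRPlacementControl (`drpc`) resource; its current-state column in `oc get drpc -A -o wide` is a quick signal. A minimal sketch of a readiness check on that state value, assuming the usual DRPC states (`Deployed`, `FailedOver`, `Relocated`) indicate a settled workload; the exact column layout and state names may differ between versions, so verify against your deployment:

```shell
# Hedged sketch: classify a DRPlacementControl current-state string as
# ready/not-ready for a failover action. The accepted state names are an
# assumption drawn from common DRPC phases.
drpc_ready() {
    state="$1"
    case "$state" in
        Deployed|FailedOver|Relocated) echo "ready" ;;
        *) echo "not-ready" ;;
    esac
}

drpc_ready "Deployed"    # -> ready
drpc_ready "Unknown"     # -> not-ready
```

If the state never settles after the hub restore, that matches the symptom in this BZ, and restarting the Ramen pod (see the workaround above) is the way to recover.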
Actual results:
After the hub restore, Ramen was not reconciling, so a failover could not be initiated from the UI.
Expected results:
We should be able to initiate failover from the UI without restarting the Ramen pod.
Additional info:
Comment 25, Shrivaibavi Raghaventhiran, 2023-10-20 12:58:06 UTC
Tested versions:
----------------
OCP - 4.14.0-0.nightly-2023-10-08-220853
ODF - 4.14.0-146.stable
ACM - 2.9.0-180
Initiated a failover from the UI post hub recovery and did not see any of the errors stated in the description of this BZ. We have not seen this issue for quite some time now; we will reopen if we hit it again.
With the above observations, moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:6832