Bug 2267965 - [4.16][RDR][Hub Recovery] Failover remains stuck with WaitForReadiness
Summary: [4.16][RDR][Hub Recovery] Failover remains stuck with WaitForReadiness
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Benamar Mekhissi
QA Contact: Aman Agrawal
URL:
Whiteboard:
Depends On: 2264767
Blocks:
 
Reported: 2024-03-05 18:31 UTC by Karolin Seeger
Modified: 2024-11-15 04:25 UTC
CC List: 7 users

Fixed In Version: 4.16.0-86
Doc Type: No Doc Update
Doc Text:
Clone Of: 2264767
Environment:
Last Closed: 2024-07-17 13:14:59 UTC
Embargoed:




Links
- Github RamenDR/ramen pull 1222 (Merged): Lower s3timeout to reasonable duration (last updated 2024-03-05 18:34:11 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:15:01 UTC)

Description Karolin Seeger 2024-03-05 18:31:41 UTC
+++ This bug was initially created as a clone of Bug #2264767 +++

Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
ODF 4.15.0-132.stable
OCP 4.15.0-0.nightly-2024-02-13-231030
ACM 2.9.2 GA'ed
Submariner 0.16.3
ceph version 17.2.6-194.el9cp (d9f4aedda0fc0d99e7e0e06892a69523d2eb06dc) quincy (stable)



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
**Active hub at neutral site**

1. Deploy multiple RBD- and CephFS-backed workloads of both AppSet and Subscription types.
2. Fail over and relocate them such that they all end up running on the primary managed cluster (the cluster expected to host all the workloads and to go down in the disaster): the apps that were failed over from C1 to C2 are relocated back to C1, and the apps that were relocated to C2 are failed over to C1 (with all nodes up and running).
3. Ensure that all workload combinations exist in distinct states (Deployed, FailedOver, Relocated) on C1, with a few workloads in the Deployed state on C2 as well.
4. Let at least one fresh backup be taken for each of the different workload states (when progression is Completed and no action is in progress on any workload). Also ensure that sync for all workloads is working fine while on the active hub and that the cluster is healthy. Note drpc -o wide, lastGroupSyncTime, download backups from S3, etc.
5. Bring the active hub completely down and move to the passive hub. Restore backups and ensure the Velero backup reports successful restoration. Make sure both managed clusters are successfully reported and the DRPolicy gets validated.
6. Wait for the DRPCs to be restored and check whether all workloads are in their last backed-up state. (They retained their last backed-up state, so everything was fine up to this point.)
7. Let IOs continue for a few hours (20-30 hrs). Fail over the CephFS workloads running on C2 to C1 with all nodes of C2 up and running.
8. After successful failover and cleanup, wait for sync to resume; after some time, bring the primary cluster down (all nodes). Bring it back up after a few hours.
9. Check that the drpc state is still the same and that data sync for all workloads resumes as expected.
10. After a few hours, bring the master nodes of the primary cluster down. Once the cluster is marked offline on the RHACM console, fail over all the workloads running on the primary and observe the failover status.


Output collected around Sunday 18 February 2024 07:55:50 PM UTC (long after failover was triggered)

amagrawa:hub$ drpc
NAMESPACE              NAME                                    AGE    PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION        START TIME             DURATION       PEER READY
busybox-workloads-13   sub-rbd-busybox13-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:17Z                  False
busybox-workloads-14   sub-rbd-busybox14-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:25Z                  False
busybox-workloads-15   sub-rbd-busybox15-placement-1-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:20:33Z                  False
busybox-workloads-16   sub-rbd-busybox16-placement-1-drpc      2d9h   amagrawa-odf2                                       Deployed       Completed          2024-02-16T10:12:51Z   660.371688ms   True
busybox-workloads-5    sub-cephfs-busybox5-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Relocate       Relocating                        2024-02-18T19:16:07Z                  False
busybox-workloads-6    sub-cephfs-busybox6-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:19:49Z                  False
busybox-workloads-7    sub-cephfs-busybox7-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:19:59Z                  False
busybox-workloads-8    sub-cephfs-busybox8-placement-1-drpc    2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:20:06Z                  False
openshift-gitops       appset-cephfs-busybox1-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:33Z                  False
openshift-gitops       appset-cephfs-busybox2-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:38Z                  False
openshift-gitops       appset-cephfs-busybox3-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:43Z                  False
openshift-gitops       appset-cephfs-busybox4-placement-drpc   2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     Cleaning Up        2024-02-18T19:18:48Z                  False
openshift-gitops       appset-rbd-busybox10-placement-drpc     2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:18:52Z                  False
openshift-gitops       appset-rbd-busybox11-placement-drpc     2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:18:58Z                  False
openshift-gitops       appset-rbd-busybox12-placement-drpc     2d9h   amagrawa-odf2                                       Deployed       Completed          2024-02-16T10:13:47Z   571.259493ms   True
openshift-gitops       appset-rbd-busybox9-placement-drpc      2d9h   amagrawa-prim      amagrawa-odf2     Failover       FailedOver     WaitForReadiness   2024-02-18T19:19:17Z                  False

Failover remains stuck at WaitForReadiness for multiple apps, which leads to application downtime and inaccessibility.
 
Actual results: [RDR] [Hub recovery] [Neutral] Failover remains stuck with WaitForReadiness

Expected results: Failover should complete while maintaining the permissible RPO/RTO of 2x the sync interval.


Additional info:

--- Additional comment from RHEL Program Management on 2024-02-18 20:06:20 UTC ---

This bug, which previously had no release flag set, now has the release flag 'odf-4.15.0' set to '?' and is therefore being proposed to be fixed in the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-02-18 20:06:20 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Aman Agrawal on 2024-02-18 20:19:43 UTC ---

Logs- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/19feb24-2/

--- Additional comment from Benamar Mekhissi on 2024-02-20 00:56:51 UTC ---

The issue stems from the liveness probe terminating the Ramen process before the VRGs (for RBD) complete their reconciliations. This behavior is primarily triggered by the unavailability of the S3 store on C1. The calls to the S3 store take more than 2 minutes to fail, leading to a scenario where all calls become stuck waiting for the timeout to fire. Consequently, the liveness probe for `healthz` fails to report success, and the process is incorrectly interpreted as hung.

One potential workaround involves adjusting the `maxConcurrentReconciles` parameter from 50 to 1. 

To address the underlying problem, the S3 call timeout can be reduced from 125 seconds to a more appropriate value, such as 10 seconds. This, however, requires a code change.
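
For illustration only, here is a minimal sketch of capping the S3 client timeout, assuming an aws-sdk-go (v1) style client and a hypothetical newS3Client helper; it is not necessarily how RamenDR/ramen PR 1222 implements the change:

// Illustrative sketch only, not the actual Ramen change from PR 1222.
// A short per-request timeout makes calls against an unreachable S3 store
// fail within seconds instead of blocking reconcilers for ~125 seconds.
package s3util

import (
    "net/http"
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/credentials"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// newS3Client is a hypothetical helper; endpoint, region and keys would come
// from the configured s3StoreProfile.
func newS3Client(endpoint, region, accessKey, secretKey string) (*s3.S3, error) {
    sess, err := session.NewSession(&aws.Config{
        Endpoint:         aws.String(endpoint),
        Region:           aws.String(region),
        Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
        S3ForcePathStyle: aws.Bool(true),
        // Fail fast: cap every S3 HTTP call at 10s so a down S3 store does
        // not leave all reconcilers stuck waiting on the default timeout.
        HTTPClient: &http.Client{Timeout: 10 * time.Second},
    })
    if err != nil {
        return nil, err
    }
    return s3.New(sess), nil
}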

--- Additional comment from Mudit Agarwal on 2024-02-26 11:55:04 UTC ---

Aman/Benamar, does this issue affect only the neutral configuration or co-situated as well?

--- Additional comment from Karolin Seeger on 2024-02-26 14:22:07 UTC ---

Updating the subject line as this one is not related to hub recovery.

--- Additional comment from Karolin Seeger on 2024-02-27 09:17:42 UTC ---

Moving this one to 4.15.z and marking it as a non-blocker based on the following assessment:
- not a regression, in place since day 1
- happens only when failing over numerous workloads at a time (high load)
- workaround available to get customers unstuck if needed

--- Additional comment from RHEL Program Management on 2024-02-27 09:17:51 UTC ---

Since this bug has severity set to 'urgent', it is being proposed as a blocker for the currently set release flag. Please resolve ASAP.

--- Additional comment from Raghavendra Talur on 2024-02-27 15:29:14 UTC ---

TODO on rtalur:

Check if any of the process limits are being hit in the ramen process, like file handle limit etc. Use profiling to determine that. If no such issues are found, we will go back to investigating other causes in controller-runtime.
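
For reference, one generic way to do such a check (a sketch under the stated assumptions, not something taken from Ramen): expose pprof from the process and compare the number of open file descriptors with the soft limit.

// Illustrative sketch only, not Ramen code: check the open-FD count against
// the NOFILE soft limit and expose pprof so goroutine/heap profiles can be
// captured while the process appears hung. Linux-only (/proc, syscall).
package main

import (
    "fmt"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    "os"
    "syscall"
)

func main() {
    // Compare open file descriptors with the soft RLIMIT_NOFILE limit.
    if fds, err := os.ReadDir("/proc/self/fd"); err == nil {
        var lim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil {
            fmt.Printf("open fds: %d, soft limit: %d\n", len(fds), lim.Cur)
        }
    }

    // Serve pprof; e.g. `go tool pprof http://localhost:6060/debug/pprof/goroutine`
    // shows where reconcile goroutines are blocked.
    _ = http.ListenAndServe("localhost:6060", nil)
}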

--- Additional comment from Aman Agrawal on 2024-03-03 21:09:07 UTC ---

While testing hub recovery where active hub is co-situated with the primary managed cluster (total of 2 sites) on following versions:

OCP 4.15.0-0.nightly-2024-02-27-181650
ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
ODF 4.15.0-150
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

The bug was hit again.

Failed over 12 workloads (6 RBD and 6 CephFS); failover was successful for all CephFS workloads, however it remains stuck for RBD. Since this issue is consistent and impacts the basic functionality of hub recovery after site failure for RBD-backed workloads, where the only option is to perform failover to make the workloads/apps accessible, I am proposing this as a blocker and emphasising having the fix in place in order to release co-situated hub recovery.

Latest logs can be provided if needed. Setting needinfo on Karolin for re-consideration. Thanks!

--- Additional comment from Karolin Seeger on 2024-03-04 08:27:48 UTC ---

@rtalur, please add the steps to work around this issue.
Is an ETA available for the fix? Thanks!

--- Additional comment from Benamar Mekhissi on 2024-03-04 12:24:46 UTC ---

1. Edit the Ramen configmap on the hub cluster:
   oc edit cm -n ramen-system ramen-hub-operator-config
2. Change maxConcurrentReconciles from 50 to 1.
3. Save it.
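
For context, maxConcurrentReconciles maps to controller-runtime's controller.Options.MaxConcurrentReconciles; lowering it bounds how many reconcile workers can be blocked on slow S3 calls at the same time. A minimal generic sketch (placeholder resource type and no-op reconciler, not Ramen's actual wiring):

// Illustrative sketch only, not Ramen's actual wiring: how a controller's
// concurrency is bounded via controller-runtime options. With the value set
// to 1, at most one reconcile at a time can be stuck waiting on S3.
package main

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        panic(err)
    }

    // No-op placeholder standing in for the real DRPC/VRG reconcilers.
    r := reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
        return reconcile.Result{}, nil
    })

    if err := ctrl.NewControllerManagedBy(mgr).
        For(&corev1.ConfigMap{}). // placeholder resource type
        WithOptions(controller.Options{
            // The value the ramen configmap's maxConcurrentReconciles would feed in.
            MaxConcurrentReconciles: 1,
        }).
        Complete(r); err != nil {
        panic(err)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}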

--- Additional comment from Aman Agrawal on 2024-03-05 08:13:10 UTC ---

(In reply to Aman Agrawal from comment #10)
> While testing hub recovery where active hub is co-situated with the primary
> managed cluster (total of 2 sites) on following versions:
> 
> OCP 4.15.0-0.nightly-2024-02-27-181650
> ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
> ODF 4.15.0-150
> ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4)
> quincy (stable)
> 
> The bug was hit again.
> 
> Failedover 12 workloads, 6RBD and 6CephFS, failover was successful for all
> CephFS workloads however it remains stuck for RBD. Since this issue is
> consistent and impacts the basic functionality of hub recovery after site
> failure for RBD backed workloads where the only option is to perform
> failover to make the workloads/apps accessible, I am proposing this as a
> blocker and emphasising on having the fix in place in order to release
> co-situated hub recovery.
> 
> Latest logs can be provided if needed. Setting needinfo on Karolin for
> re-consideration. Thanks!

Earlier we thought it’s the same issue, but Benamar confirmed (offline) that VolumeReplicationClass is missing which obstructs failover of RBD workloads.
It’s a regression because we had already fixed it in https://bugzilla.redhat.com/show_bug.cgi?id=2258560.

Awaiting inputs from Vineet/Umanga.
This is being discussed offline.

--- Additional comment from Aman Agrawal on 2024-03-05 11:52:32 UTC ---

(In reply to Aman Agrawal from comment #13)
> (In reply to Aman Agrawal from comment #10)
> > While testing hub recovery where active hub is co-situated with the primary
> > managed cluster (total of 2 sites) on following versions:
> > 
> > OCP 4.15.0-0.nightly-2024-02-27-181650
> > ACM 2.10.0-DOWNSTREAM-2024-02-28-06-06-55
> > ODF 4.15.0-150
> > ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4)
> > quincy (stable)
> > 
> > The bug was hit again.
> > 
> > Failedover 12 workloads, 6RBD and 6CephFS, failover was successful for all
> > CephFS workloads however it remains stuck for RBD. Since this issue is
> > consistent and impacts the basic functionality of hub recovery after site
> > failure for RBD backed workloads where the only option is to perform
> > failover to make the workloads/apps accessible, I am proposing this as a
> > blocker and emphasising on having the fix in place in order to release
> > co-situated hub recovery.
> > 
> > Latest logs can be provided if needed. Setting needinfo on Karolin for
> > re-consideration. Thanks!
> 
> Earlier we thought it’s the same issue, but Benamar confirmed (offline) that
> VolumeReplicationClass is missing which obstructs failover of RBD workloads.
> It’s a regression because we had already fixed it in
> https://bugzilla.redhat.com/show_bug.cgi?id=2258560.
> 
> Awaiting inputs from Vineet/Umanga.
> This is being discussed offline.

Let's track it separately- https://bugzilla.redhat.com/show_bug.cgi?id=2267885

--- Additional comment from gowtham on 2024-03-05 13:19:25 UTC ---

The reason is:
- When QE brought C1 down, the rook secret for cluster C1 was somehow deleted from the hub. I can confirm this from the log: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/04march24/logs/acm/must-gat[…]ger-5c88d65c6-982pd/manager/manager/logs/current.log (search: 49e0eabdd8e8db91ebd1d4520592cf7810da8db)
- With the recent fix (https://github.com/red-hat-storage/odf-multicluster-orchestrator/pull/191), MCO reads the rook secret for both the C1 and C2 clusters to fetch the cephFSID needed to create the VolumeReplicationClass (VRC).
- Exactly at this line (https://github.com/red-hat-storage/odf-multicluster-orchestrator/blob/main/controllers/drpolicy_controller.go#L223), MCO is unable to find the rook secret for the down cluster C1, so it stops the reconciliation with an error.
- Even though the C2 cluster's rook secret is present on the hub, MCO fails to continue the reconciliation and create the VRC.
- So the fix is to continue VRC creation for the second cluster even when the secret for the first cluster is not found, and then requeue the reconciliation after VRC creation so it can retry the missing cluster after some time (or provide some other way to retry the reconciliation). This needs to be checked with the backend developers.
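
For illustration only, a minimal sketch of the suggested approach; DRPolicyReconciler, rookSecretNameFor and createVRCForCluster below are hypothetical stand-ins, not MCO's actual code:

// Illustrative sketch of the suggested fix, not MCO's implementation:
// keep creating VRCs for clusters whose rook secrets exist and requeue so
// the cluster with the missing secret is retried later.
package controllers

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// DRPolicyReconciler is a hypothetical stand-in for MCO's DRPolicy reconciler.
type DRPolicyReconciler struct {
    client.Client
}

func (r *DRPolicyReconciler) ensureVRCs(ctx context.Context, clusters []string) (ctrl.Result, error) {
    missingSecret := false
    for _, cluster := range clusters {
        secret := &corev1.Secret{}
        key := types.NamespacedName{Namespace: cluster, Name: rookSecretNameFor(cluster)}
        if err := r.Get(ctx, key, secret); err != nil {
            if apierrors.IsNotFound(err) {
                // The peer cluster may be down (hub recovery); do not abort
                // the whole reconciliation, just remember to retry later.
                missingSecret = true
                continue
            }
            return ctrl.Result{}, err
        }
        if err := r.createVRCForCluster(ctx, cluster, secret); err != nil {
            return ctrl.Result{}, err
        }
    }
    if missingSecret {
        // Requeue so the VRC for the down cluster gets created once its
        // rook secret is available again.
        return ctrl.Result{RequeueAfter: time.Minute}, nil
    }
    return ctrl.Result{}, nil
}

func rookSecretNameFor(cluster string) string {
    // Hypothetical naming; the real secret name/namespace comes from MCO.
    return cluster + "-rook-secret"
}

func (r *DRPolicyReconciler) createVRCForCluster(ctx context.Context, cluster string, secret *corev1.Secret) error {
    // Fetch the cephFSID from the secret and create/apply the
    // VolumeReplicationClass (e.g. via a ManifestWork); omitted here.
    return nil
}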

--- Additional comment from Aman Agrawal on 2024-03-05 14:15:36 UTC ---

(In reply to gowtham from comment #15)
> The reason is:
> - When QE brought C1 down, the rook secret for cluster C1 was somehow deleted from the hub. I can confirm this from the log: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/04march24/logs/acm/must-gat[…]ger-5c88d65c6-982pd/manager/manager/logs/current.log (search: 49e0eabdd8e8db91ebd1d4520592cf7810da8db)
> - With the recent fix (https://github.com/red-hat-storage/odf-multicluster-orchestrator/pull/191), MCO reads the rook secret for both the C1 and C2 clusters to fetch the cephFSID needed to create the VolumeReplicationClass (VRC).
> - Exactly at this line (https://github.com/red-hat-storage/odf-multicluster-orchestrator/blob/main/controllers/drpolicy_controller.go#L223), MCO is unable to find the rook secret for the down cluster C1, so it stops the reconciliation with an error.
> - Even though the C2 cluster's rook secret is present on the hub, MCO fails to continue the reconciliation and create the VRC.
> - So the fix is to continue VRC creation for the second cluster even when the secret for the first cluster is not found, and then requeue the reconciliation after VRC creation so it can retry the missing cluster after some time (or provide some other way to retry the reconciliation). This needs to be checked with the backend developers.

This summary is related to BZ2267885 and not BZ2264767.

--- Additional comment from Benamar Mekhissi on 2024-03-05 15:39:12 UTC ---

PR for issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2264767#c4 is https://github.com/RamenDR/ramen/pull/1222

--- Additional comment from Karolin Seeger on 2024-03-05 17:50:48 UTC ---

Proposing this one for 4.15.0 as a new RC has to be cut anyway and it's an important fix.

--- Additional comment from RHEL Program Management on 2024-03-05 18:24:44 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

Comment 7 Aman Agrawal 2024-05-30 18:16:51 UTC
Tested with the following versions:

ceph version 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
OCP 4.16.0-0.nightly-2024-05-23-173505
ACM 2.11.0-DOWNSTREAM-2024-05-23-15-16-26
MCE 2.6.0-104 
ODF 4.16.0-108.stable
Gitops v1.12.3 

Platform- VMware

When the steps to reproduce were repeated, failover was successful for all RBD and CephFS workloads, and the VolumeReplicationClass was successfully restored on the surviving managed cluster (which is needed for RBD).

oc get volumereplicationclass -A
NAME                                    PROVISIONER
rbd-volumereplicationclass-1625360775   openshift-storage.rbd.csi.ceph.com
rbd-volumereplicationclass-473128587    openshift-storage.rbd.csi.ceph.com


DRPC from new hub-

busybox-workloads-101   rbd-sub-busybox101-placement-1-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:02Z                        False
busybox-workloads-13    cephfs-sub-busybox13-placement-1-drpc     4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:27:33Z                        False
busybox-workloads-16    cephfs-sub-busybox16-placement-1-drpc     4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:27:26Z                        False
busybox-workloads-18    cnv-sub-busybox18-placement-1-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T16:52:14Z                        False
busybox-workloads-5     rbd-sub-busybox5-placement-1-drpc         4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:50Z                        False
busybox-workloads-6     rbd-sub-busybox6-placement-1-drpc         4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:56Z                        False
busybox-workloads-7     rbd-sub-busybox7-placement-1-drpc         4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:25:34Z                        False
openshift-gitops        cephfs-appset-busybox12-placement-drpc    4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:28:14Z                        False
openshift-gitops        cephfs-appset-busybox9-placement-drpc     4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:28:19Z                        False
openshift-gitops        cnv-appset-busybox17-placement-drpc       4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T16:52:23Z                        False
openshift-gitops        rbd-appset-busybox1-placement-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:08Z                        False
openshift-gitops        rbd-appset-busybox100-placement-drpc      4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:14Z                        False
openshift-gitops        rbd-appset-busybox2-placement-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:20Z                        False
openshift-gitops        rbd-appset-busybox3-placement-drpc        4h51m   amagrawa-c1-28my   amagrawa-c2-my28   Failover       FailedOver     Cleaning Up   2024-05-30T15:26:49Z                        False

Since the primary managed cluster is still down, PROGRESSION reports Cleaning Up, which is expected.

Failover was also successful for 2 CNV (RBD) workloads, cnv-sub-busybox18-placement-1-drpc and cnv-appset-busybox17-placement-drpc, of the subscription and appset (pull model) types respectively, and the data written into the VM was successfully restored after failover completed.


Fix for this BZ LGTM. Therefore I am marking this bug as verified.

Comment 10 errata-xmlrpc 2024-07-17 13:14:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 11 Red Hat Bugzilla 2024-11-15 04:25:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

