Bug 2009735 - [DR] After failing over an application, the mirrored RBDs can be attached but the filesystems can not be mounted (still in use) [NEEDINFO]
Summary: [DR] After failing over an application, the mirrored RBDs can be attached but the filesystems can not be mounted (still in use)
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0
Assignee: Ilya Dryomov
QA Contact:
URL:
Whiteboard:
Duplicates: 2011791 2144733 (view as bug list)
Depends On:
Blocks: 2007376 2011427 2030752
 
Reported: 2021-10-01 13:03 UTC by Shyamsundar
Modified: 2023-11-17 21:10 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2007376
Environment:
Last Closed:
Embargoed:
vashastr: needinfo? (jdurgin)




Links
System ID: Red Hat Issue Tracker RHCEPH-1960
Last Updated: 2021-10-04 12:09:43 UTC

Comment 1 RHEL Program Management 2021-10-01 13:04:02 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 4 Mudit Agarwal 2021-10-06 10:08:36 UTC
This is a blocker for ODF 4.9, hence targeting it for 5.0z1. Please prioritize.

Comment 10 Mudit Agarwal 2021-10-21 10:32:11 UTC
Last time I talked with Scott, we agreed on 5.0z2.

Comment 11 Scott Ostapovicz 2021-10-21 11:59:00 UTC
We could not get it into the 5.0 z2 release on time. 5.0 z3 is the earliest release we could get it into before 5.1.

Comment 26 Scott Ostapovicz 2022-01-26 16:57:40 UTC
Not completed in time for 5.0 z4, moving to 5.1

Comment 27 Josh Durgin 2022-02-03 07:33:17 UTC
Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.

Comment 28 Yaniv Kaul 2022-03-02 08:10:55 UTC
(In reply to Josh Durgin from comment #27)
> Deepika and I investigated the most pressing observation here - namely
> 'force promote' hanging indefinitely when the primary cluster is down. We
> understand the root cause now (rbd-mirror was not designed to handle this
> kind of use) and are working on a fix.

Is there an update?

Comment 29 Josh Durgin 2022-03-03 15:58:38 UTC
(In reply to Yaniv Kaul from comment #28)
> (In reply to Josh Durgin from comment #27)
> > Deepika and I investigated the most pressing observation here - namely
> > 'force promote' hanging indefinitely when the primary cluster is down. We
> > understand the root cause now (rbd-mirror was not designed to handle this
> > kind of use) and are working on a fix.
> 
> Is there an update?

We're testing the simplest fix and will update when there's more news.

Comment 30 Ilya Dryomov 2022-03-15 17:53:27 UTC
*** Bug 2011791 has been marked as a duplicate of this bug. ***

Comment 31 Yaniv Kaul 2022-03-24 08:41:12 UTC
(In reply to Josh Durgin from comment #29)
> (In reply to Yaniv Kaul from comment #28)
> > (In reply to Josh Durgin from comment #27)
> > > Deepika and I investigated the most pressing observation here - namely
> > > 'force promote' hanging indefinitely when the primary cluster is down. We
> > > understand the root cause now (rbd-mirror was not designed to handle this
> > > kind of use) and are working on a fix.
> > 
> > Is there an update?
> 
> We're testing the simplest fix and will update when there's more news.

Let's keep the bug updated on a regular basis, with the latest findings, please.

Comment 32 Josh Durgin 2022-03-24 16:27:01 UTC
After further testing we found that we can work around this problem by restarting the rbd-mirror daemon. We would like to propose this as a way forward - Shyam, can you fill in how this would work from an ODF perspective?

The issue is that rbd-mirror needs to close the image that is being force promoted, and may get stuck if the primary cluster is inaccessible. rbd-mirror will pick up where it left off after restart, as much as it can, but having it shut down during the force promote will allow the force promote to succeed.
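
For illustration, a rough sketch of this workaround on a cephadm-managed secondary cluster might look like the following (the daemon, pool, and image names are hypothetical; the actual daemon name would come from `ceph orch ps`):

  # Identify the rbd-mirror daemon on the surviving (secondary) cluster.
  ceph orch ps | grep rbd-mirror

  # Stop it so it releases its watcher on the image being promoted
  # (daemon name below is a placeholder).
  ceph orch daemon stop rbd-mirror.site-b.host1.abcdef

  # Force-promote the mirrored image while rbd-mirror is down.
  rbd mirror image promote --force mirror-pool/csi-vol-example

  # Bring rbd-mirror back once promotion has succeeded.
  ceph orch daemon start rbd-mirror.site-b.host1.abcdef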

Comment 33 Shyamsundar 2022-03-24 19:09:53 UTC
(In reply to Josh Durgin from comment #32)
> After further testing we found that we can work around this problem by
> restarting the rbd-mirror daemon. We would like to propose this as a way
> forward - Shyam can you fill in how this would work from an ODF perspective?

Thanks Josh.

We would need to test and document this in two ways:
- Initially, for the TechPreview release (ODF 4.10), we can add this to the troubleshooting section
  - If pods do not reach running state and report "in-use" errors post failover, scale down the RBD mirror daemon deployment; wait for failover to complete; scale up the RBD mirror daemon deployment
- For GA, we can make this a prerequisite prior to failover, so that it does not impact RTO (i.e., failover, detect/hit the in-use issue, and then take further action to rectify the situation by scaling down the RBD mirror daemon deployment)

These additional steps do impact the automated manner of performing these actions, but they are single oc/kubectl commands that would need to be executed prior to failover, and hence the impact is minimal.

Even if a user fails to scale the RBD mirror daemon deployment down prior to failover, it is safe to do so after failover has been initiated and an "in-use" error has been observed, so the workaround is viable even if the documented prerequisite is not followed.
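
A rough sketch of the scale-down/scale-up steps described above from the ODF side, assuming the rbd-mirror daemon runs as a Rook-managed deployment (the deployment name and namespace below are assumptions for illustration; use the actual names from `oc get deployments`):

  # Scale down the rbd-mirror daemon deployment prior to failover
  # (or after hitting the "in-use" error). Name/namespace are placeholders.
  oc -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=0

  # ... perform the application failover and wait for the PVCs to attach
  # and the pods to reach Running ...

  # Scale the rbd-mirror daemon back up once failover has completed.
  oc -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=1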

@vkolli Let us know if this is acceptable from a product standpoint.

@prsurve we would need to test scaling down the RBD mirror daemon prior to failover and scaling it back up after failover is complete, to ensure this works as desired in ODF.

As part of the testing and documentation we would mostly use bz #2007376 and leave this one open.

--------------------
<tagging @madhu for some CSI context>

Additionally, the current force-promote timeout in ceph-csi is 1 minute; we may want to bump it up to 2 minutes to give rollback enough time to complete:
  - As rollback is performed by the force-promote command, it may at times take more than a minute to complete (roughly proportional to the number of dirty blocks that need to be rolled back)
  - But rollback will resume in the next invocation of the command and so would eventually complete
  - The added minute is still useful to avoid multiple calls to complete the rollback and, in extreme corner cases, to avoid failures on the first call when the mirror watcher is not yet removed (after scaling down the RBD mirror instance)
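
As an illustration of the "resumes on the next invocation" behaviour described above, a manual equivalent could look like the sketch below (pool and image names are hypothetical; in practice ceph-csi issues the promote internally, so this only shows the semantics):

  # Retry the force promote until the rollback completes; each invocation
  # picks up the rollback where the previous one left off.
  until rbd mirror image promote --force mirror-pool/csi-vol-example; do
      echo "rollback still in progress, retrying..."
      sleep 10
  done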

An ideal future fix would be for ceph-csi to detect that the rollback process has started and wait, instead of killing the call.

Comment 35 Venkat Kolli 2022-03-30 22:56:02 UTC
This workaround indeed impacts the core value prop of our DR solution:
- 'Single Click' App failover
- Quick, error-free App recovery with the lowest RTO in chaotic DR situations.

But given that this issue does not seem to occur frequently, we might be able to 'live' with this workaround for some time. We should strive to fix it soon.

There are two failover situations customers will most commonly face:
- Single App failover/migration, when the cluster is otherwise healthy
- The entire cluster is unstable or has failed

In the Single App failover case, preemptively bringing down the RBD mirror daemon would be a heavy-handed solution and impacts other Apps. So we should resort to the post-failover recovery procedure if the App does not recover successfully. Since this is not a cluster DR situation, it might be manageable. But we should make this workaround as easy as we can and document it prominently.

For the cluster failure situation, we should have the mirror brought down preemptively, again making it easy and quick for the users. It should be no more than a single command.

Comment 38 Scott Ostapovicz 2022-04-20 14:44:18 UTC
This is not a blocker for ODF 4.10 (DR TP), so I am moving it to 6.0 (DR GA).

Comment 40 Mudit Agarwal 2022-06-07 09:17:19 UTC
Is it a DR GA blocker? If yes, then this should be fixed in 5.2, because DR GA is planned for ODF 4.11, which is consuming 5.2.

Comment 41 Scott Ostapovicz 2022-06-09 09:05:41 UTC
Updated target

Comment 43 Ilya Dryomov 2022-11-22 10:30:46 UTC
*** Bug 2144733 has been marked as a duplicate of this bug. ***

