Bug 2009735

Summary: [DR] After failing over an application, the mirrored RBDs can be attached but the filesystems can not be mounted (still in use)
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Shyamsundar <srangana>
Component: RBD-Mirror
Assignee: Christopher Hoffman <choffman>
Status: ASSIGNED
Severity: high
Priority: unspecified
Version: 5.0
CC: aclewett, bniver, ceph-eng-bugs, ebenahar, idryomov, jdurgin, jespy, kramdoss, kseeger, mrashish, muagarwa, owasserm, pnataraj, prsurve, sangadi, sostapov, srangana, tserlin, vereddy, vkolli
Keywords: AutomationBlocker
Target Release: 8.0
Flags: vashastr: needinfo? (jdurgin)
Clone Of: 2007376
Bug Blocks: 2007376, 2011427, 2030752

Comment 1 RHEL Program Management 2021-10-01 13:04:02 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 4 Mudit Agarwal 2021-10-06 10:08:36 UTC
This is a blocker for ODF 4.9, hence targeting it for 5.0z1; please prioritize.

Comment 10 Mudit Agarwal 2021-10-21 10:32:11 UTC
The last time I talked with Scott, we agreed on 5.0z2.

Comment 11 Scott Ostapovicz 2021-10-21 11:59:00 UTC
We could not get it into the 5.0 z2 release on time. 5.0 z3 is the earliest release we could get it into before 5.1.

Comment 26 Scott Ostapovicz 2022-01-26 16:57:40 UTC
Not completed in time for 5.0 z4; moving to 5.1.

Comment 27 Josh Durgin 2022-02-03 07:33:17 UTC
Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.

Comment 28 Yaniv Kaul 2022-03-02 08:10:55 UTC
(In reply to Josh Durgin from comment #27)
> Deepika and I investigated the most pressing observation here - namely
> 'force promote' hanging indefinitely when the primary cluster is down. We
> understand the root cause now (rbd-mirror was not designed to handle this
> kind of use) and are working on a fix.

Is there an update?

Comment 29 Josh Durgin 2022-03-03 15:58:38 UTC
(In reply to Yaniv Kaul from comment #28)
> (In reply to Josh Durgin from comment #27)
> > Deepika and I investigated the most pressing observation here - namely
> > 'force promote' hanging indefinitely when the primary cluster is down. We
> > understand the root cause now (rbd-mirror was not designed to handle this
> > kind of use) and are working on a fix.
> 
> Is there an update?

We're testing the simplest fix and will update when there's more news.

Comment 30 Ilya Dryomov 2022-03-15 17:53:27 UTC
*** Bug 2011791 has been marked as a duplicate of this bug. ***

Comment 31 Yaniv Kaul 2022-03-24 08:41:12 UTC
(In reply to Josh Durgin from comment #29)
> (In reply to Yaniv Kaul from comment #28)
> > (In reply to Josh Durgin from comment #27)
> > > Deepika and I investigated the most pressing observation here - namely
> > > 'force promote' hanging indefinitely when the primary cluster is down. We
> > > understand the root cause now (rbd-mirror was not designed to handle this
> > > kind of use) and are working on a fix.
> > 
> > Is there an update?
> 
> We're testing the simplest fix and will update when there's more news.

Let's keep the bug updated on a regular basis with the latest findings, please.

Comment 32 Josh Durgin 2022-03-24 16:27:01 UTC
After further testing we found that we can work around this problem by restarting the rbd-mirror daemon. We would like to propose this as a way forward - Shyam can you fill in how this would work from an ODF perspective?

The issue is that rbd-mirror needs to close the image that is being force promoted, and it may get stuck if the primary cluster is inaccessible. rbd-mirror will pick up where it left off after a restart, as much as it can, but having it shut down during the force promote will allow the force promote to succeed.
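
For illustration only, a minimal sketch of this workaround on a cephadm-managed cluster; the service handling and the pool/image placeholders are assumptions, not steps validated in this bug:

  # Stop the rbd-mirror daemon so it releases the image being force promoted
  ceph orch stop rbd-mirror

  # With rbd-mirror down, the force promote can complete
  rbd mirror image promote --force <pool>/<image>

  # Bring rbd-mirror back once the promotion has finished; it resumes where it left off
  ceph orch start rbd-mirror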

Comment 33 Shyamsundar 2022-03-24 19:09:53 UTC
(In reply to Josh Durgin from comment #32)
> After further testing we found that we can work around this problem by
> restarting the rbd-mirror daemon. We would like to propose this as a way
> forward - Shyam can you fill in how this would work from an ODF perspective?

Thanks Josh.

We would need to test and document this in two ways:
- Initially, for the TechPreview release (ODF 4.10), we can add this to the troubleshooting section:
  - If pods do not reach the running state and report "in-use" errors post failover, scale down the RBD mirror daemon deployment, wait for the failover to complete, then scale the RBD mirror daemon deployment back up (see the kubectl sketch below)
- For GA, we can make this a prerequisite prior to failover, so that it does not impact RTO (i.e., failing over, detecting/hitting the in-use issue, and only then scaling down the RBD mirror daemon deployment would impact RTO)
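
For illustration, a minimal kubectl sketch of the scale-down/scale-up step; the namespace and deployment name used here (openshift-storage, rook-ceph-rbd-mirror-a) are assumptions and would need to be confirmed against the actual ODF deployment:

  # Scale the RBD mirror daemon deployment down (ideally before failover, or on seeing "in-use" errors)
  kubectl -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=0

  # ... fail over the application and wait for it to complete ...

  # Scale the deployment back up once failover is complete
  kubectl -n openshift-storage scale deployment rook-ceph-rbd-mirror-a --replicas=1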

These additional steps do impact the automated manner of performing these actions, but they are single oc/kubectl commands that would need to be executed prior to failover, so the impact is minimal.

Even if a user fails to scale the RBD mirror daemon deployment down prior to failover, it is safe to do so after failover has been initiated and an "in-use" error has been observed; thus the workaround is viable even if the documented prerequisite is not followed.

@vkolli Let us know if this is acceptable from a product standpoint.

@prsurve We would need to test scaling down the RBD mirror daemon prior to failover and scaling it back up after failover is complete, to ensure this works as desired in ODF.

As part of the testing and documentation we would mostly use bz #2007376 and leave this one open.

--------------------
<tagging @madhu for some CSI context>

Additionally, the current force-promote timeout in ceph-csi is 1 minute; we may want to bump it up to 2 minutes to give enough time for the rollback to complete (a retry sketch follows this list):
  - As the rollback is performed by the force-promote command, it may at times take more than a minute (roughly proportional to the number of dirty blocks that need to be rolled back)
  - However, the rollback continues in the next invocation of the command and so would eventually complete
  - The added minute is still useful to avoid multiple calls to complete the rollback and, in extreme corner cases, to avoid failures on the first call when the mirror watcher has not yet been removed (after scaling down the RBD mirror instance)
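
As a hypothetical illustration of the "continues in the next invocation" behavior above, re-running the promote until the rollback has finished; the pool/image placeholders and the retry interval are made up for the example:

  # Retry the force promote; each invocation resumes the rollback until it eventually succeeds
  until rbd mirror image promote --force <pool>/<image>; do
      sleep 30
  done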

An ideal future fix would be for ceph-csi to detect that the rollback process has started and wait, instead of killing the call.

Comment 35 Venkat Kolli 2022-03-30 22:56:02 UTC
This workaround indeed impacts the core value proposition of our DR solution:
- 'Single Click' App failover
- Quick, error-free App recovery with the lowest RTO in chaotic DR situations

But given that this issue does not seem to occur frequently, we might be able to 'live' with this workaround for some time. We should strive to fix this issue soon.

Customers will most commonly face two failover situations:
- Single App failover/migration, when the cluster is otherwise healthy
- The entire cluster is unstable or has failed

In the Single App failover case, preemptively bringing down the RBD mirror daemon would be a heavy-handed solution and would impact other Apps, so we should resort to the post-failover recovery procedure if recovery does not succeed on its own. Since this is not a cluster DR situation, it might be manageable, but we should make this workaround as easy as we can and document it prominently.

For the cluster failure situation, we should have the mirror brought down preemptively, again making it easy and quick for users. It cannot be more than a single command.

Comment 38 Scott Ostapovicz 2022-04-20 14:44:18 UTC
This is not a blocker for ODF 4.10 (DR TP), so I am moving it to 6.0 (DR GA).

Comment 40 Mudit Agarwal 2022-06-07 09:17:19 UTC
Is it a DR GA blocker? If yes, then this should be fixed in 5.2, because DR GA is planned for ODF 4.11, which is consuming 5.2.

Comment 41 Scott Ostapovicz 2022-06-09 09:05:41 UTC
Updated target

Comment 43 Ilya Dryomov 2022-11-22 10:30:46 UTC
*** Bug 2144733 has been marked as a duplicate of this bug. ***