Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.
This is a blocker for ODF 4.9, hence targeting it for 5.0 z1; please prioritize.
Last time I talked with Scott, we agreed on 5.0 z2.
We could not get it into the 5.0 z2 release on time. 5.0 z3 is the earliest release we could get it into before 5.1
Not completed in time for 5.0 z4, moving to 5.1
Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.
(In reply to Josh Durgin from comment #27)
> Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.

Is there an update?
(In reply to Yaniv Kaul from comment #28)
> (In reply to Josh Durgin from comment #27)
> > Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.
>
> Is there an update?

We're testing the simplest fix and will update when there's more news.
*** Bug 2011791 has been marked as a duplicate of this bug. ***
(In reply to Josh Durgin from comment #29)
> (In reply to Yaniv Kaul from comment #28)
> > (In reply to Josh Durgin from comment #27)
> > > Deepika and I investigated the most pressing observation here - namely 'force promote' hanging indefinitely when the primary cluster is down. We understand the root cause now (rbd-mirror was not designed to handle this kind of use) and are working on a fix.
> >
> > Is there an update?
>
> We're testing the simplest fix and will update when there's more news.

Let's keep the bug updated on a regular basis with the latest findings, please.
After further testing we found that we can work around this problem by restarting the rbd-mirror daemon. We would like to propose this as a way forward - Shyam, can you fill in how this would work from an ODF perspective? The issue is that rbd-mirror needs to close the image that is being force promoted, and it may get stuck if the primary cluster is inaccessible. rbd-mirror will pick up where it left off after restart, as much as it can, but having it shut down during the force promote will allow the force promote to succeed.
(In reply to Josh Durgin from comment #32)
> After further testing we found that we can work around this problem by restarting the rbd-mirror daemon. We would like to propose this as a way forward - Shyam can you fill in how this would work from an ODF perspective?

Thanks Josh. We would need to test and document this in 2 ways:
- Initially, for the Tech Preview release (ODF 4.10), we can add this to the troubleshooting section: if pods do not reach the running state and report "in-use" errors post failover, scale down the RBD mirror daemon deployment, wait for failover to complete, then scale the deployment back up.
- For GA, we can make this a prerequisite prior to failover, so that it does not impact RTO (i.e., failover, detect/hit the in-use issue, and then take further action to rectify the situation by scaling down the RBD mirror daemon deployment).

These additional steps do impact the automated manner of performing these actions, but they are single oc/kubectl commands that would need to be executed prior to failover, so the impact is minimal. Even if a user fails to scale the RBD mirror daemon deployment down prior to failover, it is safe to do so after failover has been initiated and an "in-use" error has been observed; the workaround is thus viable even if the documented prerequisite is not followed.

@vkolli Let us know if this is acceptable from a product standpoint.

@prsurve We would need to test scaling down the RBD mirror daemon prior to failover and scaling it back up after failover completes, to ensure this works as desired in ODF. As part of the testing and documentation we would mostly use bz #2007376 and leave this one open.
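For the docs, the scale-down/scale-up workaround could be sketched roughly as below. The deployment name (`rook-ceph-rbd-mirror-a`) and namespace (`openshift-storage`) are assumptions based on common Rook/ODF defaults and may differ per install; these commands require access to the affected cluster.

```shell
# Assumed deployment name and namespace - verify with:
#   oc get deployments -n openshift-storage | grep rbd-mirror

# 1. Scale down the rbd-mirror daemon, either before initiating failover
#    or after observing "in-use" errors during failover:
oc scale deployment rook-ceph-rbd-mirror-a --replicas=0 -n openshift-storage

# 2. Wait for the failover / force promote to complete, then restore it:
oc scale deployment rook-ceph-rbd-mirror-a --replicas=1 -n openshift-storage
```

Since these are single oc commands, they could also be folded into any failover automation later.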
--------------------
<tagging @madhu for some CSI context>

Additionally, the current force-promote timeout in ceph-csi is 1 minute; we may want to bump it up to 2 minutes to give rollback enough time to complete:
- As rollback is performed by the force-promote command, it may at times take more than a minute (depending roughly on the number of dirty blocks that need to be rolled back)
- Rollback will, however, continue in the next invocation of the command and so would eventually complete
- The added minute is still useful to avoid multiple calls to complete the rollback and, in extreme corner cases, to avoid failures on the first instance of the call when the mirror watcher is not yet removed (after scaling down the RBD mirror instance)

An ideal future fix would be for ceph-csi to detect that the rollback process has started and wait, instead of killing the call.
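For reference, the operation ceph-csi issues under the hood corresponds to the `rbd` force-promote command, which can also be run manually once the mirror daemon is scaled down. A rough sketch (the pool/image names below are placeholders, and this requires a reachable secondary cluster):

```shell
# Placeholder pool/image names - on an ODF cluster the image would be a
# csi-vol-* RBD image in the mirrored pool.
rbd mirror image promote --force ocs-storagecluster-cephblockpool/csi-vol-example

# If the call is killed mid-rollback (e.g. by the ceph-csi timeout),
# re-running the same command resumes the rollback until it completes.
```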
This workaround indeed impacts the core value proposition of our DR solution:
- 'Single click' app failover
- Quick, error-free app recovery with the lowest RTO in chaotic DR situations

But given that this issue does not seem to occur frequently, we might be able to 'live' with this workaround for some time. We should still strive to fix it soon.

There are two failover situations customers would most commonly face:
- Single app failover/migration, when the cluster is otherwise healthy
- The entire cluster is unstable or has failed

In the single-app failover case, preemptively bringing down the RBD mirror daemon would be a heavy-handed solution that impacts other apps, so we should resort to a post-failover recovery procedure if recovery does not succeed on its own. Since this is not a cluster DR situation, it might be manageable, but we should make this workaround as easy as we can and document it prominently.

For the cluster-failure situation, we should have the mirror daemon brought down preemptively, again making it easy and quick for users. It cannot be more than a single command.
This is not a blocker for ODF 4.10 (DR TP), so I am moving it to 6.0 (DR GA).
Is it a DR GA blocker? If yes, then this should be fixed in 5.2, because DR GA is planned for ODF 4.11, which consumes 5.2.
Updated target
*** Bug 2144733 has been marked as a duplicate of this bug. ***