Rook Ceph allows setting cephcluster.spec.healthCheck.daemonHealth.mon.timeout to a custom value. It can also be set to 0, which disables mon failover. We would like this value to be configurable in ODF, including the option to disable it.

ODF currently uses a default mon failover timeout of 10 minutes, and it does not appear to be changeable. The 10-minute value is too low for our use case: we deploy ODF on bare metal clusters with OpenShift Virtualization. During node draining, the virtual machines are live migrated away from the node. Live migration can take 40-60 minutes, depending on how many virtual machines are on the node and how fast their memory can be copied over the network to another cluster node. Because the mon failover timeout is much shorter than that, all three monitors fail over on every OpenShift upgrade.

We would also like the option to disable mon failover entirely. Recently, we had a scenario (https://bugzilla.redhat.com/show_bug.cgi?id=2292435) where the mon failover likely caused a Ceph mon outage. In the interim, until this issue is confirmed and fixed, we would like to disable mon failover.
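For reference, the upstream Rook setting described above would look roughly like the fragment below. This is only a sketch of the CephCluster field, not something ODF exposes today. The resource name, the namespace, and the 5400s value are assumptions chosen for illustration, and since ODF manages this resource, a direct edit may be reverted.

```yaml
# Sketch only: upstream Rook CephCluster fragment, not an ODF-supported knob today.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: example-cephcluster      # hypothetical name
  namespace: openshift-storage   # assumed namespace
spec:
  healthCheck:
    daemonHealth:
      mon:
        # Illustrative value (90 minutes), chosen to exceed the
        # 40-60 minute node drains described in this report.
        timeout: 5400s
        # Alternative requested here: a value of 0 to disable mon failover.
        # timeout: 0s
```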
This bug was originally filed in Red Hat's Jira: https://issues.redhat.com/browse/RHSTOR-5939