Red Hat Bugzilla – Bug 1251525
galera ocf agent fail fast if sync fails during promote
Last modified: 2017-11-27 15:34:36 EST
Description of problem:
There are scenarios where a galera instance times out during the promote operation even though we could detect much earlier that the promote is going to fail. The galera agent's promote action involves syncing the local galera state with a donor node elsewhere in the cluster. If that sync fails to initialize, we should be able to fail the promote early rather than waiting the full duration of the promote timeout (which can be several minutes).
The above link identifies the loop the agent waits in while waiting to sync. We need a way to back out of that loop and fail early if we are 100% certain there's no way the sync will occur.
Also, when we fail early, we need to set the ocf_exit_reason to indicate that the 'fast fail' condition was hit. This will help us differentiate why the promotion failed when looking at the cluster status.
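The behaviour requested above can be sketched roughly as follows. This is only an illustration of the fail-fast pattern, not the agent's actual code: wait_for_sync, exit_reason and the two check commands passed to it are hypothetical names, and exit_reason is a local stand-in for the real ocf_exit_reason helper so that the snippet runs without ocf-shellfuncs.

```shell
#!/bin/sh
# Hypothetical sketch of a fail-fast wait loop. The function names below are
# illustrative assumptions and do not come from the galera agent.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stand-in for ocf_exit_reason (normally provided by ocf-shellfuncs):
# records why the operation failed.
exit_reason() {
    echo "exit-reason: $*" >&2
}

# wait_for_sync IS_SYNCED_CMD SYNC_DEAD_CMD [INTERVAL]
# Polls until IS_SYNCED_CMD succeeds. If SYNC_DEAD_CMD succeeds first, the
# sync can no longer happen, so return an error immediately instead of
# looping until the cluster manager enforces the promote timeout.
wait_for_sync() {
    is_synced=$1
    sync_dead=$2
    interval=${3:-1}

    while ! "$is_synced"; do
        if "$sync_dead"; then
            exit_reason "fast fail: sync cannot complete, aborting promote"
            return $OCF_ERR_GENERIC
        fi
        sleep "$interval"
    done
    return $OCF_SUCCESS
}
```

The two checks are passed in as command names so that the loop itself stays generic. What counts as "100% certain the sync will not occur" is left to the SYNC_DEAD_CMD check, which is the hard part of this bug and is not addressed by the sketch.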
Please add steps to reproduce.
Fixed in https://github.com/ClusterLabs/resource-agents/pull/684
After further testing, the fix referenced in comment 7 does not solve the issue. The new way of tracking the sync can time out during the monitor operation.
Patch removed from newer 7.3 builds.
Set to POST when new patch is ready for a build.
*** Bug 1372616 has been marked as a duplicate of this bug. ***
*** Bug 1376084 has been marked as a duplicate of this bug. ***