Bug 1251525 - galera ocf agent fail fast if sync fails during promote
Status: ASSIGNED
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: pre-dev-freeze
Target Release: 7.4
Assigned To: Damien Ciabrini
QA Contact: Ofer Blaut
Keywords: ZStream
Duplicates: 1372616 1376084
Depends On:
Blocks: 1299878
Reported: 2015-08-07 11:04 EDT by David Vossel
Modified: 2017-08-02 02:49 EDT
CC List: 10 users

Clone Of:
Clones: 1299878
Type: Bug


Attachments: None
Description David Vossel 2015-08-07 11:04:13 EDT
Description of problem:

There are scenarios where a galera instance times out during the promote operation even though we could detect much earlier that the promote is going to fail. The galera agent's promote action involves syncing the local galera state with a donor node elsewhere in the cluster. If that sync fails to initialize, we should fail the promote early rather than wait the full duration of the promote timeout (which can be several minutes).

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/galera#L300

The link above points to the loop in which the agent waits for the sync to complete. We need a way to break out of that loop and fail early when we are 100% certain that the sync cannot occur.

Also, when we fail early, we need to set the ocf_exit_reason to indicate that the 'fast fail' condition was hit. This will help us differentiate why the promotion failed when looking at the cluster status.
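
For illustration, the requested behavior could look roughly like the sketch below. It is written in Python rather than the agent's shell and is not the agent's actual code. The names `wait_for_sync`, `SyncState` and `probe` are hypothetical; the returned reason string stands in for what the agent would pass to ocf_exit_reason. In a real agent the promote timeout is normally enforced by the cluster manager, so the deadline here only models it.

```python
import time
from enum import Enum
from typing import Callable, Optional, Tuple

# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


class SyncState(Enum):
    """Hypothetical states reported by a sync probe."""
    SYNCED = "synced"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"  # the sync can no longer succeed


def wait_for_sync(
    probe: Callable[[], SyncState],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Optional[str]]:
    """Poll until synced, failing fast if the sync is known to be dead.

    Returns (return code, exit reason). The exit reason is None on success.
    """
    deadline = clock() + timeout_s
    while True:
        state = probe()
        if state is SyncState.SYNCED:
            return OCF_SUCCESS, None
        if state is SyncState.FAILED:
            # Fast-fail path: do not wait for the promote timeout.
            return OCF_ERR_GENERIC, "fast fail: sync with donor failed to initialize"
        if clock() >= deadline:
            return OCF_ERR_GENERIC, "timed out waiting for sync"
        sleep(poll_interval_s)
```

Keeping distinct reason strings for the fast-fail and timeout paths is what would let an operator tell the two failures apart in the cluster status.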
Comment 6 Ofer Blaut 2015-12-21 09:03:30 EST
Please add steps to reproduce.
Comment 7 Damien Ciabrini 2016-01-18 03:53:40 EST
Fixed in https://github.com/ClusterLabs/resource-agents/pull/684
Comment 9 Damien Ciabrini 2016-02-05 03:11:26 EST
After further testing, the fix referenced in comment 7 does not solve the issue. The new way of tracking the sync can time out during the monitor operation.
Comment 10 Oyvind Albrigtsen 2016-02-22 07:47:24 EST
Patch removed from newer 7.3 builds.

Set to POST when new patch is ready for a build.
Comment 15 Damien Ciabrini 2016-10-12 10:14:42 EDT
*** Bug 1372616 has been marked as a duplicate of this bug. ***
Comment 16 Damien Ciabrini 2016-10-12 10:16:20 EDT
*** Bug 1376084 has been marked as a duplicate of this bug. ***
