Bug 1251525

Summary: galera ocf agent fail fast if sync fails during promote
Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.2
Status: CLOSED WONTFIX
Severity: high
Priority: high
Target Milestone: rc
Target Release: 7.7
Hardware: Unspecified
OS: Unspecified
Keywords: Triaged, ZStream
Whiteboard:
Reporter: David Vossel <dvossel>
Assignee: Damien Ciabrini <dciabrin>
QA Contact: Ofer Blaut <oblaut>
Docs Contact:
CC: agk, cfeist, cluster-maint, dciabrin, fahmed, fdinitto, jmelvin, lmiksik, mbayer, oalbrigt, pzimek, royoung
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1299878 (view as bug list)
Environment:
Type: Bug
Last Closed: 2019-02-22 08:50:02 UTC
Bug Blocks: 1299878    

Description David Vossel 2015-08-07 15:04:13 UTC
Description of problem:

There are scenarios where a galera instance times out during the promote operation even though we could detect much earlier that the promote is going to fail. The galera agent's promote action involves syncing the local galera state with a donor node somewhere else in the cluster. If that sync fails to initialize, we should be able to fail the promote early rather than waiting out the full promote timeout (which can be several minutes).

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/galera#L300

The above link points to the loop the agent sits in while waiting for the sync to complete. We need a way to back out of that loop and fail early when we are 100% certain the sync will never happen.
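
For illustration, a minimal sketch of that kind of wait loop with the sort of bail-out we are after (this is not the actual agent code; is_synced and sync_has_failed are hypothetical helpers):

  # minimal sketch, not the actual galera agent code
  while true; do
      if is_synced; then            # hypothetical helper: sync completed
          break
      fi
      if sync_has_failed; then      # hypothetical helper: sync can never complete
          # bail out instead of burning the whole promote timeout
          return $OCF_ERR_GENERIC
      fi
      sleep 2                       # otherwise keep polling
  done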

Also, when we fail early, we need to set the ocf_exit_reason to indicate that the 'fast fail' condition was hit. This will help us differentiate why the promotion failed when looking at the cluster status.
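
For example, the bail-out path sketched above could record its cause before returning (the message wording is illustrative only); the exit reason then shows up next to the failed promote action in crm_mon / pcs status:

  # sketch: attach an explanatory exit reason to the fast-fail path
  ocf_exit_reason "fast-fail: sync from the donor node could not be started"
  return $OCF_ERR_GENERIC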

Comment 6 Ofer Blaut 2015-12-21 14:03:30 UTC
Please add reproduction steps.

Comment 7 Damien Ciabrini 2016-01-18 08:53:40 UTC
Fixed in https://github.com/ClusterLabs/resource-agents/pull/684

Comment 9 Damien Ciabrini 2016-02-05 08:11:26 UTC
After further testing, the fix referenced in comment 7 does not solve the issue. The new way of tracking the sync can still time out during the monitor operation.

Comment 10 Oyvind Albrigtsen 2016-02-22 12:47:24 UTC
Patch removed from newer 7.3 builds.

Set to POST when new patch is ready for a build.

Comment 15 Damien Ciabrini 2016-10-12 14:14:42 UTC
*** Bug 1372616 has been marked as a duplicate of this bug. ***

Comment 16 Damien Ciabrini 2016-10-12 14:16:20 UTC
*** Bug 1376084 has been marked as a duplicate of this bug. ***

Comment 21 Damien Ciabrini 2018-07-20 07:52:30 UTC
An update here for tracking purposes...

Throughout the many iterations of this bug fix, we ended up with a working fix, but the drawback is that the galera resource agent becomes more complex because it has to track new states via a couple of additional CRM attributes.
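
For the record, that extra state tracking amounts to the agent maintaining transient node attributes, roughly along these lines (the attribute name and value are made up for illustration):

  # sketch only: typical handling of a transient (reboot-lifetime) node attribute
  NODENAME=$(crm_node -n)
  crm_attribute -N "$NODENAME" -l reboot --name "galera-sync-in-progress" -v "true"   # set
  crm_attribute -N "$NODENAME" -l reboot --name "galera-sync-in-progress" --query -q  # read back
  crm_attribute -N "$NODENAME" -l reboot --name "galera-sync-in-progress" -D          # clear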

Meanwhile, the urgency of this bugzilla has dropped: nowadays on OpenStack, Keystone uses Fernet tokens, which means the amount of data stored in the database has become quite stable, so we no longer run into the situation where a missing DB cleanup makes the DB grow unbounded.

To summarize, this bz can be fixed, but given the time constraints and the priority it won't be fixed in the short term, so I'm keeping it open a little longer for tracking purposes.

Comment 23 Damien Ciabrini 2019-02-22 08:50:02 UTC
So a long overdue update on that one.

For context, in earlier versions of OpenStack we used to have very large or ever-growing MySQL databases.
During the promote operation, the entire DB could be synced over rsync, which would sometimes exceed the configured promote operation timeout.
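
(For reference, the promote timeout in question is the per-operation timeout configured on the galera resource, e.g. with pcs; the resource name and value below are just an example.)

  pcs resource update galera op promote timeout=300s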

The approach for fixing it consisted of running additional Slave monitor operations to track the rsync progress in a recurring fashion, but that yielded a considerable increase in complexity (additional attributes, many monitor operations, more tests...).
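
Concretely, that approach implied configuring extra recurring monitor operations on the Slave role, something along these lines (intervals and timeouts are illustrative only):

  pcs resource update galera \
      op monitor interval=20s timeout=30s role=Slave \
      op monitor interval=10s timeout=30s role=Master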

Meanwhile, in OpenStack we switched to what we call Fernet tokens, which essentially means we now have a small, bounded database to deal with. So this development is no longer needed.

Balancing the complexity of the code against the real need for it (we no longer need it for OpenStack), I'm going to drop this bz instead of letting it slip indefinitely.