Bug 1251525 - galera ocf agent fail fast if sync fails during promote
Summary: galera ocf agent fail fast if sync fails during promote
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Damien Ciabrini
QA Contact: Ofer Blaut
URL:
Whiteboard:
Duplicates: 1372616 1376084 (view as bug list)
Depends On:
Blocks: 1299878
 
Reported: 2015-08-07 15:04 UTC by David Vossel
Modified: 2019-12-16 04:51 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1299878 (view as bug list)
Environment:
Last Closed: 2019-02-22 08:50:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3554411 0 None None None 2018-08-07 13:18:11 UTC

Description David Vossel 2015-08-07 15:04:13 UTC
Description of problem:

There are scenarios where a galera instance times out during the promote operation even though we could have detected much earlier that the promote was going to fail. The galera agent's promote action involves syncing the local galera state with a donor node elsewhere in the cluster. If that sync fails to initialize, we should be able to fail the promote early rather than waiting out the full promote timeout (which can be several minutes).

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/galera#L300

The above link identifies the loop the agent waits in while waiting to sync. We need a way to back out of that loop and fail early if we are 100% certain there's no way the sync will occur. 

Also, when we fail early, we need to set the ocf_exit_reason to indicate that the 'fast fail' condition was hit. This will help us differentiate why the promotion failed when looking at the cluster status.
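The shape of such a fail-fast check can be sketched in bash (the language the galera agent is written in). This is a minimal, self-contained illustration only: the helper names `is_synced` and `sync_has_failed`, the environment-variable stubs, and the stubbed `ocf_exit_reason` are hypothetical stand-ins for the agent's real wsrep-state checks and the `ocf-shellfuncs` helper.

```shell
#!/bin/bash
# Hypothetical sketch of a fail-fast promote wait loop, modeled on the
# wait loop in heartbeat/galera. Helper names below are illustrative,
# not the agent's actual functions.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stub for ocf_exit_reason from ocf-shellfuncs: records why the action failed.
ocf_exit_reason() { echo "galera: $*" >&2; }

# Stubs: a real agent would query wsrep_local_state / the joiner process.
is_synced()       { [ "${GALERA_SYNCED:-0}" -eq 1 ]; }
sync_has_failed() { [ "${GALERA_SYNC_FAILED:-0}" -eq 1 ]; }

wait_for_sync() {
    local tries=0
    while [ "$tries" -lt 150 ]; do   # bounded here so the sketch cannot hang
        if is_synced; then
            return "$OCF_SUCCESS"
        fi
        # Fail fast: if the sync can provably never complete, bail out now
        # instead of burning the whole promote timeout.
        if sync_has_failed; then
            ocf_exit_reason "local node failed to sync with its donor"
            return "$OCF_ERR_GENERIC"
        fi
        sleep 1
        tries=$((tries + 1))
    done
    ocf_exit_reason "timed out waiting for sync"
    return "$OCF_ERR_GENERIC"
}

# Demo: with the sync already complete, the loop returns immediately.
GALERA_SYNCED=1
wait_for_sync && echo "promote can proceed"   # prints "promote can proceed"
```

The key point is that the failure branch returns before a single full timeout elapses, and the `ocf_exit_reason` call makes the fast-fail condition visible in cluster status output.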

Comment 6 Ofer Blaut 2015-12-21 14:03:30 UTC
Please add reproduction steps.

Comment 7 Damien Ciabrini 2016-01-18 08:53:40 UTC
fixed in https://github.com/ClusterLabs/resource-agents/pull/684

Comment 9 Damien Ciabrini 2016-02-05 08:11:26 UTC
After further testing, the fix referenced in comment 7 does not solve the issue. The new way of tracking the sync can time out during the monitor operation.

Comment 10 Oyvind Albrigtsen 2016-02-22 12:47:24 UTC
Patch removed from newer 7.3 builds.

Set to POST when new patch is ready for a build.

Comment 15 Damien Ciabrini 2016-10-12 14:14:42 UTC
*** Bug 1372616 has been marked as a duplicate of this bug. ***

Comment 16 Damien Ciabrini 2016-10-12 14:16:20 UTC
*** Bug 1376084 has been marked as a duplicate of this bug. ***

Comment 21 Damien Ciabrini 2018-07-20 07:52:30 UTC
An update here for tracking purposes...

Throughout the many iterations of this bug fix, we ended up with a working fix, but the drawback is that the galera resource agent becomes more complex, because it has to track new states via a couple of additional CRM attributes.

Meanwhile, the urgency of this bugzilla has dropped, because nowadays on OpenStack, Keystone uses Fernet tokens, which means the amount of data stored in the database has become very stable, so we no longer run into situations where a missing DB cleanup would make the DB grow unbounded.

So to summarize: this bz can be fixed, but given the time constraints and the priority, it won't be fixed in the short term, so I'm keeping it open a little longer for tracking purposes.

Comment 23 Damien Ciabrini 2019-02-22 08:50:02 UTC
So a long overdue update on that one.

For context: in earlier versions of OpenStack we used to have very large or ever-growing MySQL databases.
During the promote operation, the entire DB could be synced over rsync, which would sometimes exceed the configured promote operation timeout.

The approach for fixing it consisted of running distinct Slave monitor operations to track the rsync in a recurring fashion. But that yielded a considerable increase in complexity (additional attributes, many monitor operations, more tests...).
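For readers unfamiliar with per-role monitors: pacemaker lets a master/slave resource carry separate recurring monitor operations for each role, which is the mechanism the abandoned approach would have leaned on. A hedged illustration with pcs (the resource name `galera` and the intervals are examples, not a recommended configuration):

```shell
# Illustrative only: give an existing galera master/slave resource distinct
# recurring monitors per role, so slave-side monitors could observe a
# long-running sync between promote attempts.
pcs resource update galera \
    op monitor interval=20s role=Slave \
    op monitor interval=10s role=Master
```

These commands require a running pacemaker cluster, so they are shown as a configuration fragment rather than something runnable in isolation.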

Meanwhile, in OpenStack we switched to what we call Fernet tokens, which essentially means we now have a small, bounded database to deal with. So this development is no longer needed.

Balancing the complexity of the code against the real need for it (we no longer need it for OpenStack), I think I'm going to drop this bz instead of letting it slip indefinitely.

