1656368 – rabbitmq-cluster: regression when restarting inside a bundle

Bug 1656368 - rabbitmq-cluster: regression when restarting inside a bundle

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.6
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Oyvind Albrigtsen
QA Contact:	pkomarov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1637626 1657138

Reported:	2018-12-05 10:50 UTC by Michele Baldessari
Modified:	2020-03-27 09:43 UTC
CC List:	20 users
Fixed In Version:	resource-agents-4.1.1-15.el7
Doc Type:	If docs needed, set a value
Doc Text:	When a containerized RabbitMQ cluster was stopped entirely, but the containers were not stopped, the RabbitMQ resource agent failed to update the Pacemaker view of the RabbitMQ cluster. Consequently, RabbitMQ servers failed to restart the cluster. With this update, the RabbitMQ resource agent cleans up cluster attributes on RabbitMQ shutdown, and, as a result, the described problem no longer occurs.
Clone Of:
Clones:	1657138
Environment:
Last Closed:	2019-08-06 12:01:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3990761	0	Troubleshoot	None	Regression in resource-agents 4.1.1-12.el7_6.6 causes rabbitmq-bundle failures after certain restarts	2019-04-23 16:52:24 UTC
Red Hat Product Errata	RHBA-2019:2012	0	None	None	None	2019-08-06 12:02:09 UTC

Description Michele Baldessari 2018-12-05 10:50:07 UTC

Description of problem:
There is a regression in resource-agents-4.1.1-12.el7_6.6.x86_64.rpm (the issue is not seen in resource-agents-4.1.1-12.el7_6.4.x86_64.rpm) when stopping the OCF rabbitmq resource inside a bundle.

To reproduce this issue, simply trigger a restart of the OCF resource inside the rabbitmq-bundle. We did this by tweaking the following line:
<nvpair id="rabbitmq-instance_attributes-set_policy" name="set_policy" value="ha-all ^(?!amq\.).* {&quot;ha-mode&quot;:&quot;exactly&quot;,&quot;ha-params&quot;:2}"/>

(Just changing ha-params from 2 to 3 or vice versa is enough.) Once we inject a CIB with a change to the rabbitmq OCF resource, Pacemaker tries to restart only the inner resource, and that restart fails. With the old resource-agents-4.1.1-12.el7_6.4.x86_64.rpm, everything works correctly.
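
A minimal sketch of that reproduction step, assuming the CIB has been saved to a file first. The helper name toggle_ha_params and the use of pcs in the usage comments are illustrative assumptions, not what was necessarily run for this report:

```shell
#!/bin/sh
# Hypothetical helper: flip ha-params between 2 and 3 in a saved CIB file.
# The CIB stores the policy with XML-escaped quotes (&quot;), so we match those.
toggle_ha_params() {
    cib_file=$1
    if grep -q 'ha-params&quot;:2}' "$cib_file"; then
        # '&' must be escaped in the sed replacement, where it means "the match".
        sed 's/ha-params&quot;:2}/ha-params\&quot;:3}/' "$cib_file" > "$cib_file.tmp"
    else
        sed 's/ha-params&quot;:3}/ha-params\&quot;:2}/' "$cib_file" > "$cib_file.tmp"
    fi
    mv "$cib_file.tmp" "$cib_file"
}

# Assumed usage on a cluster node (not executed here):
#   pcs cluster cib > /tmp/cib.xml
#   toggle_ha_params /tmp/cib.xml
#   pcs cluster cib-push /tmp/cib.xml
```

Pushing the modified CIB changes only the parameters of the inner resource, which is what makes Pacemaker restart the resource without restarting the bundle containers.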

We tried adding a few 'killall -9 epmd' calls to the rmq_stop action (and correctly observed that epmd was no longer running), but it did not help, which suggests that the failure is due to some attributes not being cleaned up.

Comment 3 John Eckersberg 2018-12-05 16:37:51 UTC

Damien and I ran this down this morning.  We discovered a few places where the stop action might not remove the rabbitmq node attribute from Pacemaker.  What ends up happening is:

- change ocf resource, triggers restart
- nodes 3 and 2 stop, but do *not* delete their attribute
- node 1 errors out in some fashion[1] during monitor/notify/stop and the node attr *is* deleted and the service stopped
- node 1 starts back up but attempts to join the cluster with nodes 2 and 3 because their attributes are still present.  This fails, and thus the cluster does not bootstrap properly.

I will submit a PR with the two minor tweaks we made, which seem to address this.

[1] When you start trying to do too much in the middle of a failover... exact results are less-than-predictable.  What *is* important is that it gets marked as down.
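
The general idea of the fix (always drop the node attribute when the resource stops) can be sketched as follows. This is not the code from the PR: the attribute name and both function names are assumptions made for illustration, and only the crm_attribute options are real.

```shell
#!/bin/sh
# Assumed attribute name; the real agent derives its own.
RMQ_ATTR_NAME="rmq-node-attr-${OCF_RESOURCE_INSTANCE:-rabbitmq}"

# Remove this node's marker attribute from Pacemaker.
rmq_delete_node_attr() {
    crm_attribute --node "$1" --lifetime reboot --name "$RMQ_ATTR_NAME" --delete
}

# Hypothetical stop wrapper: run the given stop command, then delete the
# attribute regardless of how the stop went, and report the stop's own status.
rmq_stop_with_cleanup() {
    node=$1
    shift
    rc=0
    "$@" || rc=$?
    rmq_delete_node_attr "$node" || true
    return "$rc"
}

# Assumed usage inside an agent's stop action (not executed here):
#   rmq_stop_with_cleanup "$(hostname -s)" rabbitmqctl stop
```

Deleting the attribute unconditionally means a node that later starts first sees no stale peers and bootstraps a new cluster instead of trying to join stopped ones.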

Comment 4 John Eckersberg 2018-12-05 16:45:18 UTC

Another note: I think this may only be a problem with bundles.  The attributes have a "reboot" lifetime.  I think in the non-bundle case, stopping the resource may be enough to cause the attributes to be cleaned up.  However, with bundles, the resource stop only stops the service inside the bundle, while the bundle itself stays up the entire time, so the attribute remains.
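
To check whether stale attributes are still around after the inner resource has stopped, something like the sketch below could help. The function name is hypothetical, the attribute name passed in is an assumption, and the sketch assumes crm_attribute exits non-zero when the queried attribute does not exist:

```shell
#!/bin/sh
# Hypothetical diagnostic: print every node that still carries the given
# reboot-lifetime attribute.
rmq_nodes_with_attr() {
    attr=$1
    shift
    for n in "$@"; do
        if crm_attribute --node "$n" --lifetime reboot --name "$attr" \
                --query --quiet >/dev/null 2>&1; then
            echo "$n"
        fi
    done
}

# Assumed usage (node and attribute names are placeholders, not executed here):
#   rmq_nodes_with_attr rmq-node-attr-rabbitmq node1 node2 node3
```

Any node printed while RabbitMQ is stopped there still holds a leftover attribute of the kind described above.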

Comment 5 John Eckersberg 2018-12-05 16:52:42 UTC

https://github.com/ClusterLabs/resource-agents/pull/1274

Comment 8 Michele Baldessari 2018-12-10 13:46:41 UTC

*** Bug 1655764 has been marked as a duplicate of this bug. ***

Comment 9 Daniel Alvarez Sanchez 2018-12-24 11:32:54 UTC

Hi folks, do you think this BZ [0] could be a duplicate?

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1661806

Comment 12 pkomarov 2019-01-21 10:25:33 UTC

Verified; tested in:
https://bugzilla.redhat.com/show_bug.cgi?id=1657138#c3

Comment 14 errata-xmlrpc 2019-08-06 12:01:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2012
