1362117 – RabbitMQ fails to rejoin the cluster after some consecutive start/stop actions

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1362117 - RabbitMQ fails to rejoin the cluster after some consecutive start/stop actions

Summary: RabbitMQ fails to rejoin the cluster after some consecutive start/stop actions

Keywords:
Status:	CLOSED DUPLICATE of bug 1397393
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	7.2
Assignee:	John Eckersberg
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-08-01 10:21 UTC by Raoul Scarazzini
Modified:	2016-12-07 14:00 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-12-07 14:00:10 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1343905	0	urgent	CLOSED	RabbitMQ fails to rejoin the cluster during overcloud update	2021-02-22 00:41:40 UTC

Internal Links: 1343905

Description Raoul Scarazzini 2016-08-01 10:21:44 UTC

Description of problem:

There's a race condition that prevents rabbit cluster to start correctly inside OSP environments deployed via Director.
Problem is present with both the actual release of resource-agents (3.9.5-54.el7) and the patched one that comes from [1] (3.9.5-80.el7).
Once you stop and start rabbitmq component, sometimes it fails to start on one node with this error:

rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=21217, status=complete, exitreason='none', last-rc-change='Sat Jul 30
 06:44:32 2016', queued=0ms, exec=7128ms

Looking inside application logs of the failed node problem seems to be related to the joining of the cluster:

Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1343905

Version-Release number of selected component (if applicable):

resource-agents-3.9.5-54.el7 and resource-agents-3.9.5-80.el7

How reproducible:

The race happens when you repeat indefinitely cycles of stop/start of the resource. With the non patched release it happened after 16 runs, with the patched release after 265.

Actual results:

Rabbit cluster fails on one node, cluster's resource is failed, cluster overall status is unclean.

Expected results:

Starts and stops should always succeed in a consecutive undefinable number of times.

Additional info:

Tests were made also using CentOS and Newton, with the same results.

Comment 1 Raoul Scarazzini 2016-08-01 10:27:10 UTC

All sosreports are available at this address [1] with this directory tree:

BZ1362117
|-osp9
| |- with_resource-agents-3.9.5-54.el7
| \- with_resource-agents-3.9.5-80.el7
\-newton
  |- with_resource-agents-3.9.5-54.el7
  \- with_resource-agents-3.9.5-80.el7

In every try the result is the same, what changes is just the number of starts/stops that precedes the failure.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1362117/

Comment 3 Raoul Scarazzini 2016-08-03 07:30:06 UTC

As an additional test I installed the exact rpm version from zstream (version resource-agents-3.9.5-54.el7_2.16 as from here [1]), and after 88 tries we got the same error:

rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=27889, status=complete, exitreason='none', last-rc-change='Tue Aug 2 
17:41:29 2016', queued=0ms, exec=8124ms

And from the cluster log the error seems pretty much clear:

Aug 02 17:41:37 [40741] overcloud-controller-0.localdomain        cib:     info: crm_compress_string:   Compressed 271070 bytes into 11028 (ratio 24:1) in 32ms
rabbitmq-cluster(rabbitmq)[23454]:      2016/08/02_17:41:37 INFO: node failed to join even after reseting local data. Check SELINUX policy
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-con
troller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'
] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * connected to epmd (port 4369) on overcloud-control
ler-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [                   no other nodes on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-15@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * connected to epmd (port 4369) on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [                   no other nodes on overcloud-controll:
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-77@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:     info: log_finished:  finished - rsc:rabbitmq action:start call_id:27889 pid:23454 exit-code:1 exec-time:8124ms queue-time:0ms

I attached all the new sosreport here [2], these were taken after while so you need to check the time while the error happened (Aug 02 17:41:37).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1351446
[2] http://file.rdu.redhat.com/~rscarazz/BZ1362117/osp9/with_resource-agents-3.9.5-54.el7_2.16/

Comment 5 Peter Lemenkov 2016-12-07 14:00:10 UTC


*** This bug has been marked as a duplicate of bug 1397393 ***

Note You need to log in before you can comment on or make changes to this bug.