1351547 – Rabbitmq clone starts on just one node in a HA deploy

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	RDO
Classification:	Community
Component:	openstack-tripleo
Sub Component:
Version:	trunk
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	trunk
Assignee:	James Slagle
QA Contact:	Shai Revivo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-06-30 10:03 UTC by Raoul Scarazzini
Modified:	2021-06-10 11:22 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-06-18 11:47:26 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	336072	0	None	None	None	2016-07-05 10:46:58 UTC

Description Raoul Scarazzini 2016-06-30 10:03:03 UTC

Description of problem:

The rabbitmq-clone resource starts on just one node and not on the other two.
Looking at the logs of the failed nodes, the problem seems to be this one:

Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ DIAGNOSTICS ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ =========== ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ attempted to contact: ['rabbit@overcloud-controller-1'] ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ rabbit@overcloud-controller-1: ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [   * unable to connect to epmd (port 4369) on overcloud-controller-1: address (cannot connect to host/port) ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ current node details: ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - node name: 'rabbitmq-cli-54@overcloud-controller-0' ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - cookie hash: JkmanA6ZXihL18UoE5q6aw== ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                             [{{internal_user,'$1','_','_'}, ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                               [{'/=','$1',<<"guest">>}], ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                               ['$_']}]]}} ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]

So RabbitMQ seems to be unable to contact epmd on port 4369 on controller-1 (the logs are from controller-0).
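
The epmd connection failure in the log can be checked independently of RabbitMQ. The following is only a minimal sketch, assuming bash and coreutils `timeout` are available on the controllers; `check_tcp_port` is a hypothetical helper name and is not part of any package mentioned in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): returns 0 if a TCP
# connection to host:port can be opened within 3 seconds, non-zero otherwise.
# It relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
check_tcp_port() {
    local host="$1" port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, run on overcloud-controller-0 (host name and port taken from the log):
#   check_tcp_port overcloud-controller-1 4369 && echo "epmd reachable" || echo "epmd blocked"
```

A non-zero result only shows that the port could not be reached from this node; it does not by itself say whether a firewall rule or a stopped epmd is the cause.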

Version-Release number of selected component (if applicable):

rabbitmq-server-3.6.2-4.el7ost.noarch
resource-agents-3.9.5-76.el7.x86_64

How reproducible:

Just deploy an RDO/Newton overcloud.

Actual results:

There are failed actions on the cluster:

* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=925, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 09:33:28 2016', queued=1ms, exec=10237ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=943, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 09:33:16 2016', queued=0ms, exec=10338ms

Expected results:

Correct start.

Additional info:

Following these other bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1348700
https://bugzilla.redhat.com/show_bug.cgi?id=1348276
https://bugzilla.redhat.com/show_bug.cgi?id=1343905
https://bugzilla.redhat.com/show_bug.cgi?id=1343027

I made sure the package versions were upgraded, but the problem persisted.

Comment 1 Raoul Scarazzini 2016-06-30 10:06:40 UTC

SOS reports for all the controllers: http://file.rdu.redhat.com/rscarazz/BZ1351547/

Comment 2 Raoul Scarazzini 2016-06-30 10:09:42 UTC

I also tried a clean startup after removing all the stale data inside /var/lib/rabbitmq/mnesia, but it did not help.

Comment 3 Peter Lemenkov 2016-06-30 13:26:03 UTC

From what I see, iptables blocks TCP connections between nodes on port 4369. This prevents the RabbitMQ cluster from assembling.

Maybe there are some other issues.

Comment 4 Raoul Scarazzini 2016-06-30 14:24:00 UTC

I confirm the connection problem is due to a missing iptables rule.
Doing these steps on each controller:

sudo sed -i -e 's/--dports 5672,35672/--dports 4369,5672,35672/g' /etc/sysconfig/iptables
sudo systemctl restart iptables

And then cleaning up rabbitmq-clone on one of the three controllers:

sudo pcs resource cleanup rabbitmq-clone

solves the problem.
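
The substitution above can be tried on a scratch copy of the rules file before touching /etc/sysconfig/iptables. This is a sketch only, assuming GNU sed and grep; `add_epmd_port` and `has_epmd_port` are hypothetical helper names, and the sample rule in the usage comment is illustrative, not copied from a real overcloud.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the same sed expression as in this comment
# to the rules file given as the first argument.
add_epmd_port() {
    local rules_file="$1"
    sed -i -e 's/--dports 5672,35672/--dports 4369,5672,35672/g' "$rules_file"
}

# Hypothetical helper: returns 0 if the file contains the rewritten
# multiport rule that includes port 4369, non-zero otherwise.
has_epmd_port() {
    grep -q -- '--dports 4369,5672,35672' "$1"
}

# Example on a scratch copy (the sample rule is an assumption for illustration):
#   scratch=$(mktemp)
#   echo '-A INPUT -p tcp -m multiport --dports 5672,35672 -j ACCEPT' > "$scratch"
#   add_epmd_port "$scratch" && has_epmd_port "$scratch" && echo "rule updated"
```

Running the substitution a second time leaves the file unchanged, because the rewritten rule no longer contains the text `--dports 5672,35672`.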

Comment 5 Raoul Scarazzini 2016-07-01 08:57:06 UTC

An upstream patch [1] was submitted to solve this problem and should be merged quickly.

[1] https://review.openstack.org/#/c/336072/

Comment 6 Christopher Brown 2017-06-18 11:47:26 UTC

The fix was merged upstream, so closing.
