Bug 1351547

Summary: Rabbitmq clone starts on just one node in a HA deploy
Product: [Community] RDO Reporter: Raoul Scarazzini <rscarazz>
Component: openstack-tripleo Assignee: James Slagle <jslagle>
Status: CLOSED UPSTREAM QA Contact: Shai Revivo <srevivo>
Severity: high Docs Contact:
Priority: high    
Version: trunk CC: chris.brown, jschluet, plemenko, rscarazz
Target Milestone: ---   
Target Release: trunk   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-18 11:47:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Raoul Scarazzini 2016-06-30 10:03:03 UTC
Description of problem:

The rabbitmq-clone resource starts on just one node and not on the other two.
Looking at the logs of the failed nodes, the problem seems to be this one:

Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ DIAGNOSTICS ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ =========== ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ attempted to contact: ['rabbit@overcloud-controller-1'] ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ rabbit@overcloud-controller-1: ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [   * unable to connect to epmd (port 4369) on overcloud-controller-1: address (cannot connect to host/port) ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ current node details: ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - node name: 'rabbitmq-cli-54@overcloud-controller-0' ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ - cookie hash: JkmanA6ZXihL18UoE5q6aw== ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [  ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                             [{{internal_user,'$1','_','_'}, ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                               [{'/=','$1',<<"guest">>}], ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [                               ['$_']}]]}} ]
Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]

So RabbitMQ seems to be unable to contact epmd on port 4369 on controller-1 (the logs above are from controller-0).
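To pull the unreachable host and port out of a long log quickly, something like the following sketch may help. The function name `epmd_failures` is hypothetical (it is not part of RabbitMQ or Pacemaker), and the pattern only assumes the diagnostic wording shown in the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper, not part of any RabbitMQ or Pacemaker tooling:
# reads log text on stdin and prints "host port" for every
# "unable to connect to epmd" diagnostic, with duplicates removed.
epmd_failures() {
    sed -n 's/.*unable to connect to epmd (port \([0-9][0-9]*\)) on \([^: ]*\):.*/\2 \1/p' | sort -u
}

# Example input: the diagnostic line quoted in this report.
sample='Jun 30 07:32:47 [41714] overcloud-controller-0       lrmd:   notice: operation_finished:        rabbitmq_start_0:86381:stderr [   * unable to connect to epmd (port 4369) on overcloud-controller-1: address (cannot connect to host/port) ]'
printf '%s\n' "$sample" | epmd_failures
```

On a controller, the same function could be fed the cluster log on stdin; the location of that log depends on the deployment and is not stated in this report.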

Version-Release number of selected component (if applicable):

rabbitmq-server-3.6.2-4.el7ost.noarch
resource-agents-3.9.5-76.el7.x86_64

How reproducible:

Just deploy an RDO/Newton overcloud.

Actual results:

There are failed actions on the cluster:

* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=925, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 09:33:28 2016', queued=1ms, exec=10237ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=943, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 09:33:16 2016', queued=0ms, exec=10338ms

Expected results:

Correct start.

Additional info:

Following these other bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1348700
https://bugzilla.redhat.com/show_bug.cgi?id=1348276
https://bugzilla.redhat.com/show_bug.cgi?id=1343905
https://bugzilla.redhat.com/show_bug.cgi?id=1343027

I ensured the package versions were upgraded, but that did not solve the problem.

Comment 1 Raoul Scarazzini 2016-06-30 10:06:40 UTC
SOS reports for all the controllers: http://file.rdu.redhat.com/rscarazz/BZ1351547/

Comment 2 Raoul Scarazzini 2016-06-30 10:09:42 UTC
I also tried a clean startup after removing all the stale data inside /var/lib/rabbitmq/mnesia, but it did not help.

Comment 3 Peter Lemenkov 2016-06-30 13:26:03 UTC
From what I see, iptables blocks TCP connections between nodes on port 4369. This prevents the RabbitMQ cluster from assembling.

There may be other issues as well.

Comment 4 Raoul Scarazzini 2016-06-30 14:24:00 UTC
I confirm the connection problem is due to a missing iptables rule.
Doing these steps on each controller:

sudo sed -i -e 's/--dports 5672,35672/--dports 4369,5672,35672/g' /etc/sysconfig/iptables
sudo systemctl restart iptables

And then cleaning up rabbitmq-clone on one of the three controllers:

sudo pcs resource cleanup rabbitmq-clone

solves the problem.
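The effect of the sed substitution above can be previewed on a single rule line before editing the real file. This is only a sketch: the function name `add_epmd_port` is hypothetical, and the sample rule is an assumed, simplified line, since the actual rule in /etc/sysconfig/iptables is not quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: applies the substitution from this comment to text
# on stdin, so nothing under /etc/sysconfig is modified.
add_epmd_port() {
    sed -e 's/--dports 5672,35672/--dports 4369,5672,35672/g'
}

# Assumed rule line for illustration only; the real rule on a controller
# may carry extra match options or comments.
rule='-A INPUT -p tcp -m multiport --dports 5672,35672 -j ACCEPT'
printf '%s\n' "$rule" | add_epmd_port
```

Because the pattern requires `--dports` to be followed directly by `5672`, running the substitution a second time leaves an already patched line unchanged.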

Comment 5 Raoul Scarazzini 2016-07-01 08:57:06 UTC
An upstream patch [1] was submitted to solve this problem and should be merged quickly.

[1] https://review.openstack.org/#/c/336072/

Comment 6 Christopher Brown 2017-06-18 11:47:26 UTC
The fix was merged upstream, so closing.