Description of problem:

The rabbitmq-clone resource starts on only one node and fails to start on the other two. Looking at the logs of the failed nodes, the problem seems to be this one:

Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ DIAGNOSTICS ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ =========== ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ attempted to contact: ['rabbit@overcloud-controller-1'] ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ rabbit@overcloud-controller-1: ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ * unable to connect to epmd (port 4369) on overcloud-controller-1: address (cannot connect to host/port) ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ current node details: ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ - node name: 'rabbitmq-cli-54@overcloud-controller-0' ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ - cookie hash: JkmanA6ZXihL18UoE5q6aw== ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ [{{internal_user,'$1','_','_'}, ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ [{'/=','$1',<<"guest">>}], ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ ['$_']}]]}} ]
Jun 30 07:32:47 [41714] overcloud-controller-0 lrmd: notice: operation_finished: rabbitmq_start_0:86381:stderr [ Error: unable to connect to nodes ['rabbit@overcloud-controller-1']: nodedown ]

So RabbitMQ seems to be unable to contact epmd on port 4369 on controller-1 (the logs are from controller-0).

Version-Release number of selected component (if applicable):

rabbitmq-server-3.6.2-4.el7ost.noarch
resource-agents-3.9.5-76.el7.x86_64

How reproducible:

Just deploy an RDO/Newton overcloud.

Actual results:

There are failed actions on the cluster:

* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=925, status=complete, exitreason='none', last-rc-change='Thu Jun 30 09:33:28 2016', queued=1ms, exec=10237ms
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=943, status=complete, exitreason='none', last-rc-change='Thu Jun 30 09:33:16 2016', queued=0ms, exec=10338ms

Expected results:

Correct start.
Additional info:

Following these other bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1348700
https://bugzilla.redhat.com/show_bug.cgi?id=1348276
https://bugzilla.redhat.com/show_bug.cgi?id=1343905
https://bugzilla.redhat.com/show_bug.cgi?id=1343027

I ensured that the package versions were upgraded, but the problem persisted.
SOS reports for all the controllers: http://file.rdu.redhat.com/rscarazz/BZ1351547/
I also tried a clean startup after removing all the stale data inside /var/lib/rabbitmq/mnesia, but it did not help.
From what I see, iptables blocks TCP connections between nodes on port 4369. This prevents the RabbitMQ cluster from assembling. There may be other issues as well.
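A quick way to confirm this kind of blockage from an affected controller is to probe the epmd port on the peers. The following is only a minimal sketch: `check_port` is a hypothetical helper name, it relies on bash's built-in /dev/tcp redirection and on `timeout` from coreutils, and the peer hostnames are expected as arguments.

```shell
#!/usr/bin/env bash
# Sketch: probe a TCP port using bash's built-in /dev/tcp redirection.
# check_port is a hypothetical helper, not part of RabbitMQ or Pacemaker.
check_port() {
    local host=$1 port=$2
    # timeout(1) bounds the attempt so a silently dropped packet does not hang;
    # exit status 0 means the TCP connect succeeded.
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Peer hostnames are taken from the command line; with no arguments nothing is probed.
for host in "$@"; do
    if check_port "$host" 4369; then
        echo "${host}: epmd port 4369 reachable"
    else
        echo "${host}: epmd port 4369 NOT reachable"
    fi
done
```

Saved under a name such as check_epmd.sh (a made-up file name), it could be run on controller-0 as `bash check_epmd.sh overcloud-controller-1 overcloud-controller-2`. A failed probe only shows that the connection could not be made; it does not by itself prove that iptables is the cause.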
I confirm that the connection problem is due to a missing iptables rule. Doing these steps on each controller:

sudo sed -i -e 's/--dports 5672,35672/--dports 4369,5672,35672/g' /etc/sysconfig/iptables
sudo systemctl restart iptables

and then cleaning up rabbitmq-clone on one of the three controllers:

sudo pcs resource cleanup rabbitmq-clone

solves the problem.
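Before editing /etc/sysconfig/iptables in place, the effect of the sed expression can be previewed on a sample line. This is a sketch under stated assumptions: the sample rule below is hypothetical (the real rule on a controller may be worded differently), and `add_epmd_port` is just an illustrative wrapper around the same substitution used in the workaround.

```shell
#!/usr/bin/env bash
# Hypothetical sample rule; the actual line in /etc/sysconfig/iptables may differ.
sample='-A INPUT -p tcp -m multiport --dports 5672,35672 -j ACCEPT'

# Same substitution as in the workaround, applied to stdin instead of in place.
add_epmd_port() {
    sed -e 's/--dports 5672,35672/--dports 4369,5672,35672/g'
}

# Show the rule before and after the substitution.
printf 'before: %s\n' "$sample"
printf 'after:  %s\n' "$(printf '%s\n' "$sample" | add_epmd_port)"
```

Because the rewritten port list no longer matches the search pattern, running the substitution a second time leaves the line unchanged, so repeating the workaround should not add 4369 twice.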
An upstream patch [1] was submitted to solve this problem and should be merged quickly.

[1] https://review.openstack.org/#/c/336072/
The fix was merged upstream, so closing.