Description of problem:
RabbitMQ resources fail to start on 3 controller nodes with a pacemaker deployment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-7.el7ost.noarch
2016-05-22.1 puddle

How reproducible:

Steps to Reproduce:
1. Deploy a 3 controller nodes environment with pacemaker

Actual results:
pcs status shows:

 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): FAILED overcloud-controller-0 (unmanaged)
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): FAILED overcloud-controller-1 (unmanaged)
     rabbitmq (ocf::heartbeat:rabbitmq-cluster): FAILED overcloud-controller-2 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=87, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:17 2016', queued=0ms, exec=1242ms
* rabbitmq_stop_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:16 2016', queued=0ms, exec=1150ms
* rabbitmq_stop_0 on overcloud-controller-2 'unknown error' (1): call=82, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:14 2016', queued=0ms, exec=1137ms

Expected results:
The rabbitmq resources show up as started.

Additional info:
Attaching sosreports from the controller nodes.
I managed to get it started by running 'pcs resource debug-start rabbitmq' on each of the controller nodes and then 'pcs resource cleanup rabbitmq':

[root@overcloud-controller-0 ~]# pcs resource debug-start rabbitmq
Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned 0
 >  stdout: Waiting for 'rabbit@overcloud-controller-0' ...
 >  stdout: pid is 27210 ...
 >  stderr: ERROR: Unexpected return code from '/usr/sbin/rabbitmqctl cluster status' exit code: 69
 >  stderr: INFO: Bootstrapping rabbitmq cluster
 >  stderr: INFO: Waiting for server to start
 >  stderr: DEBUG: RabbitMQ server is running normally
 >  stderr: INFO: cluster bootstrapped
 >  stderr: INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}
 >  stderr: DEBUG: rabbitmq:0 start : 0

Still, I wasn't able to proceed with the installation; the logs in /var/log/rabbitmq show errors like this:

=ERROR REPORT==== 23-May-2016::11:39:23 ===
Error on AMQP connection <0.1000.0> (10.0.0.15:54125 -> 10.0.0.15:5672, state: starting):
AMQPLAIN login refused: user 'guest' can only connect via localhost

This is how rabbitmq.config looks on one of the nodes:

[root@overcloud-controller-0 heat-admin]# cat /etc/rabbitmq/rabbitmq.config
% This file managed by Puppet
% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {tcp_listen_options,
         [binary,
         {packet,        raw},
         {reuseaddr,     true},
         {backlog,       128},
         {nodelay,       true},
         {exit_on_close, false}]
    },
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 128}, {nodelay, true}, {exit_on_close, false}, {keepalive, true}]},
    {default_user, <<"guest">>},
    {default_pass, <<"zdbXdPQmW47Yfw9wewr3pJZjQ">>}
  ]},
  {kernel, [
    {inet_dist_listen_max, 35672},
    {inet_dist_listen_min, 35672}
  ]}
,
  {rabbitmq_management, [
    {listener, [
      {port, 15672}
      ,{ip, "10.0.0.15"}
    ]}
  ]}
].
% EOF
Current status: I've got a fix which changes the resource-agents script, but I'm not going to push it. Instead, I'm working on a patch which changes rabbitmq-server only.
Ok, I've found the root cause. Starting from ver. 3.6.x, RabbitMQ returns more error codes, and unfortunately some of the existing values were changed. I think it's safe to say that the API, or even the ABI, was changed. Here is a workaround for the current resource agent script: https://github.com/lemenkov/resource-agents/commit/5bd3a0b Meanwhile, expect a fixed RabbitMQ build (with the changes partially reverted) soon.
I've just made a build - please try: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11089199
Ok, here is the news. This is still just a workaround, since it changes the API back to the old version by removing the fine-grained error reporting. This package returns just "2" as an error code instead of different codes that properly reflect the actual issue. Use this build for now, but I'm going to roll another one, coupled with a resource-agents build patched to handle both the new error codes (from this build and up) and the old ones (ver. 3.3.5). There shouldn't be any user-visible change from a pacemaker user's point of view.
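For illustration only, here is a minimal sketch of how a resource agent could accept both the old and the new exit codes. It is not the actual resource-agents patch. The function name classify_rmq_rc is hypothetical, and the mapping is an assumption: 2 is the old "node down" code mentioned in this comment, 69 is the code seen in the debug-start output earlier in this bug, and the numeric OCF values (0 success, 1 generic error, 7 not running) are defined locally so the snippet runs on its own.

```shell
#!/bin/sh
# Hypothetical sketch, not the shipped rabbitmq-cluster agent code.
# OCF return codes, defined locally to keep the snippet self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Print the OCF code that corresponds to a rabbitmqctl exit status.
# Assumption: 2 (old, ver. 3.3.5) and 69 (new, ver. 3.6.x) both mean
# "the node is not running"; any other non-zero status is a generic error.
classify_rmq_rc() {
    case "$1" in
        0)    echo "$OCF_SUCCESS" ;;
        2|69) echo "$OCF_NOT_RUNNING" ;;
        *)    echo "$OCF_ERR_GENERIC" ;;
    esac
}
```

In a monitor action, one would run the rabbitmqctl command, pass its exit status to this function, and return the printed value to pacemaker, so a stopped node is reported as "not running" with either RabbitMQ version.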
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1597.html