Bug 1338657 - RabbitMQ resources fail to start on 3 controller nodes with pacemaker deployment
Summary: RabbitMQ resources fail to start on 3 controller nodes with pacemaker deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-23 08:24 UTC by Marius Cornea
Modified: 2016-08-11 12:22 UTC
CC List: 15 users

Fixed In Version: rabbitmq-server-3.6.2-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-11 12:22:23 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1597 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 9 Release Candidate Advisory 2016-08-11 16:06:52 UTC

Description Marius Cornea 2016-05-23 08:24:45 UTC
Description of problem:
RabbitMQ resources fail to start on 3 controller nodes with pacemaker deployment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-7.el7ost.noarch
2016-05-22.1 puddle

How reproducible:


Steps to Reproduce:
1. Deploy an environment with 3 controller nodes and pacemaker (a sketch of such a deploy command follows).
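For reference, a minimal sketch of the kind of TripleO deploy command that produces this topology; the scale flags and environment file path are assumptions and are not taken from this report:

openstack overcloud deploy --templates \
  --control-scale 3 --compute-scale 1 \
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml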

Actual results:
pcs status shows:

 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-0 (unmanaged)
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-1 (unmanaged)
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-2 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=87, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:17 2016', queued=0ms, exec=1242ms
* rabbitmq_stop_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:16 2016', queued=0ms, exec=1150ms
* rabbitmq_stop_0 on overcloud-controller-2 'unknown error' (1): call=82, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:14 2016', queued=0ms, exec=1137ms


Expected results:
The rabbitmq resources show up as started.

Additional info:
Attaching sosreports from the controller nodes.

Comment 3 Marius Cornea 2016-05-23 11:51:20 UTC
I managed to get it started by running 'pcs resource debug-start rabbitmq' on each of the controller nodes and then 'pcs resource cleanup rabbitmq'.

[root@overcloud-controller-0 ~]# pcs resource debug-start rabbitmq
Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned 0
 >  stdout: Waiting for 'rabbit@overcloud-controller-0' ...
 >  stdout: pid is 27210 ...
 >  stderr: ERROR: Unexpected return code from '/usr/sbin/rabbitmqctl cluster status' exit code: 69
 >  stderr: INFO: Bootstrapping rabbitmq cluster
 >  stderr: INFO: Waiting for server to start
 >  stderr: DEBUG: RabbitMQ server is running normally
 >  stderr: INFO: cluster bootstrapped
 >  stderr: INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}
 >  stderr: DEBUG: rabbitmq:0 start : 0
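
Putting the above together, a sketch of the manual recovery sequence, assuming the heat-admin user can reach each controller over ssh (the hostnames are the ones from this deployment):

# run debug-start on each controller
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    ssh heat-admin@$node 'sudo pcs resource debug-start rabbitmq'
done
# then, on any one controller, clear the failed actions
pcs resource cleanup rabbitmq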

Still, I wasn't able to proceed with the installation; the logs in /var/log/rabbitmq show errors like the following:

=ERROR REPORT==== 23-May-2016::11:39:23 ===
Error on AMQP connection <0.1000.0> (10.0.0.15:54125 -> 10.0.0.15:5672, state: starting):
AMQPLAIN login refused: user 'guest' can only connect via localhost


This is how the rabbitmq.config looks on one of the nodes:

[root@overcloud-controller-0 heat-admin]# cat /etc/rabbitmq/rabbitmq.config 
% This file managed by Puppet
% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {tcp_listen_options,
         [binary,
         {packet,        raw},
         {reuseaddr,     true},
         {backlog,       128},
         {nodelay,       true},
         {exit_on_close, false}]
    },
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 128}, {nodelay, true}, {exit_on_close, false}, {keepalive, true}]},
    {default_user, <<"guest">>},
    {default_pass, <<"zdbXdPQmW47Yfw9wewr3pJZjQ">>}
  ]},
  {kernel, [
    {inet_dist_listen_max, 35672},
    {inet_dist_listen_min, 35672}
  ]}
,
  {rabbitmq_management, [
    {listener, [
      {port, 15672}
      ,{ip, "10.0.0.15"}
    ]}
  ]}
].
% EOF
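
For context: the "can only connect via localhost" refusal above is RabbitMQ's default loopback restriction on the built-in guest account (in effect since RabbitMQ 3.3), and the config shown here does not clear loopback_users, so remote AMQP logins as guest are rejected. A sketch of the rabbit-section entry that lifts the restriction; this is illustrative only and not taken from this deployment's puppet templates:

  {rabbit, [
    %% An empty list disables the loopback-only restriction for all users,
    %% including the default guest account.
    {loopback_users, []}
  ]}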

Comment 4 Peter Lemenkov 2016-05-24 12:32:32 UTC
Current status: I've got a fix which changes the resource-agents script, but I'm not going to push it. Instead, I'm working on a patch which changes rabbitmq-server only.

Comment 8 Peter Lemenkov 2016-05-25 16:09:24 UTC
Ok, I've found the root cause. RabbitMQ returns more error codes starting from ver. 3.6.x, and unfortunately some values were changed. I think it's safe to say that the API, or even the ABI, was changed.

Here is a workaround for the current resource-agent script:

https://github.com/lemenkov/resource-agents/commit/5bd3a0b

Meanwhile, expect a fixed RabbitMQ build (with the changes partially reverted) soon.
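
For readers following along, a hedged illustration of the kind of exit-code tolerance involved; this is not the linked patch, and the only specific codes handled are the ones that appear in this report: 2 (the old generic error code mentioned in comment 12) and 69 (the code logged in comment 3).

rmq_cluster_status_ok() {
    # Probe the broker; treat both the old generic failure code (2) and the
    # sysexits-style code seen in this bug (69) as "broker not running/clustered".
    /usr/sbin/rabbitmqctl cluster_status >/dev/null 2>&1
    case $? in
        0)    return 0 ;;   # broker responds and cluster status is readable
        2|69) return 1 ;;   # broker down or not yet part of a cluster
        *)    return 2 ;;   # anything else is unexpected
    esac
}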

Comment 9 Peter Lemenkov 2016-05-25 20:00:21 UTC
I've just made a build - please try:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11089199

Comment 12 Peter Lemenkov 2016-05-26 14:49:44 UTC
Ok, here is the news.

This is still just a workaround, since it changes the API back to the old version by removing the fine-grained error reporting. This package returns just "2" as an error code instead of different codes that properly reflect the actual issue.

Use this build for now, but I'm going to roll another one, coupled with a resource-agents build patched to accept both the new error codes (from this build onwards) and the old ones (ver. 3.3.5).

There shouldn't be any user-visible change from the pacemaker user's point of view.

Comment 15 errata-xmlrpc 2016-08-11 12:22:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html

