Bug 1338657

Summary: RabbitMQ resources fail to start on 3 controller nodes with a pacemaker deployment
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: urgent
Priority: urgent
Version: 9.0 (Mitaka)
CC: apevec, dbecker, jeckersb, jjoyce, lhh, mburns, morazi, ohochman, plemenko, rhel-osp-director-maint, royoung, rscarazz, sasha, srevivo, tvignaud
Target Milestone: ga
Keywords: Automation, AutomationBlocker
Target Release: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rabbitmq-server-3.6.2-2.el7ost
Last Closed: 2016-08-11 12:22:23 UTC
Type: Bug

Description Marius Cornea 2016-05-23 08:24:45 UTC
Description of problem:
RabbitMQ resources fail to start on 3 controller nodes with a pacemaker deployment.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-7.el7ost.noarch
2016-05-22.1 puddle

How reproducible:


Steps to Reproduce:
1. Deploy an environment with 3 controller nodes and pacemaker

Actual results:
pcs status shows:

 Clone Set: rabbitmq-clone [rabbitmq]
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-0 (unmanaged)
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-1 (unmanaged)
     rabbitmq	(ocf::heartbeat:rabbitmq-cluster):	FAILED overcloud-controller-2 (unmanaged)
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_stop_0 on overcloud-controller-0 'unknown error' (1): call=87, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:17 2016', queued=0ms, exec=1242ms
* rabbitmq_stop_0 on overcloud-controller-1 'unknown error' (1): call=86, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:16 2016', queued=0ms, exec=1150ms
* rabbitmq_stop_0 on overcloud-controller-2 'unknown error' (1): call=82, status=complete, exitreason='none',
    last-rc-change='Mon May 23 08:06:14 2016', queued=0ms, exec=1137ms


Expected results:
The rabbitmq resources show up as started.

Additional info:
Attaching sosreports from the controller nodes.

Comment 3 Marius Cornea 2016-05-23 11:51:20 UTC
I managed to get it started by running 'pcs resource debug-start rabbitmq' on each of the controller nodes and then 'pcs resource cleanup rabbitmq'; a sketch of the full sequence follows the output below.

[root@overcloud-controller-0 ~]# pcs resource debug-start rabbitmq
Operation start for rabbitmq:0 (ocf:heartbeat:rabbitmq-cluster) returned 0
 >  stdout: Waiting for 'rabbit@overcloud-controller-0' ...
 >  stdout: pid is 27210 ...
 >  stderr: ERROR: Unexpected return code from '/usr/sbin/rabbitmqctl cluster status' exit code: 69
 >  stderr: INFO: Bootstrapping rabbitmq cluster
 >  stderr: INFO: Waiting for server to start
 >  stderr: DEBUG: RabbitMQ server is running normally
 >  stderr: INFO: cluster bootstrapped
 >  stderr: INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}
 >  stderr: DEBUG: rabbitmq:0 start : 0
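
A minimal sketch of that full sequence (the controller hostnames come from the pcs output above; the heat-admin ssh user and the use of sudo are assumptions about this environment):

    for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
        # start the resource outside of pacemaker's control on each node
        ssh heat-admin@$node 'sudo pcs resource debug-start rabbitmq'
    done
    # clear the failed actions so pacemaker re-probes and manages the clone again
    pcs resource cleanup rabbitmq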

Still, I wasn't able to proceed with the installation; the logs in /var/log/rabbitmq show this kind of error:

=ERROR REPORT==== 23-May-2016::11:39:23 ===
Error on AMQP connection <0.1000.0> (10.0.0.15:54125 -> 10.0.0.15:5672, state: starting):
AMQPLAIN login refused: user 'guest' can only connect via localhost


This is how the rabbitmq.config looks on one of the nodes:

[root@overcloud-controller-0 heat-admin]# cat /etc/rabbitmq/rabbitmq.config 
% This file managed by Puppet
% Template Path: rabbitmq/templates/rabbitmq.config
[
  {rabbit, [
    {tcp_listen_options,
         [binary,
         {packet,        raw},
         {reuseaddr,     true},
         {backlog,       128},
         {nodelay,       true},
         {exit_on_close, false}]
    },
    {cluster_partition_handling, pause_minority},
    {tcp_listen_options, [binary, {packet, raw}, {reuseaddr, true}, {backlog, 128}, {nodelay, true}, {exit_on_close, false}, {keepalive, true}]},
    {default_user, <<"guest">>},
    {default_pass, <<"zdbXdPQmW47Yfw9wewr3pJZjQ">>}
  ]},
  {kernel, [
    {inet_dist_listen_max, 35672},
    {inet_dist_listen_min, 35672}
  ]}
,
  {rabbitmq_management, [
    {listener, [
      {port, 15672}
      ,{ip, "10.0.0.15"}
    ]}
  ]}
].
% EOF
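
A side note on the 'guest' login refusal above: since RabbitMQ 3.3 the default 'guest' user is restricted to loopback connections by the loopback_users setting, and the config above doesn't override it. A minimal fragment that would lift that restriction (assuming that's the behavior intended here, which is a separate question) is an extra entry in the rabbit section:

      {rabbit, [
        ...
        %% empty list: no users are restricted to loopback-only connections
        {loopback_users, []}
      ]}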

Comment 4 Peter Lemenkov 2016-05-24 12:32:32 UTC
Current status: I've got a fix which changes the resource-agents script, but I'm not going to push it. Instead, I'm working on a patch which changes rabbitmq-server only.

Comment 8 Peter Lemenkov 2016-05-25 16:09:24 UTC
Ok, I've found the root cause. RabbitMQ returns more error codes starting from version 3.6.x, and unfortunately some existing values were changed as well. I think it's safe to say the API, or even the ABI, was changed.

Here is a workaround for the current resource agent script:

https://github.com/lemenkov/resource-agents/commit/5bd3a0b
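
The gist of it, as a rough sketch (not the actual commit; the exact set of accepted codes is what the patch adjusts), is to treat the new 3.6.x exit codes from '/usr/sbin/rabbitmqctl cluster_status' the same way the agent treated the old ones:

    /usr/sbin/rabbitmqctl cluster_status > /dev/null 2>&1
    rc=$?
    case "$rc" in
        0)
            # node is up and reports cluster status
            ;;
        2|69)
            # "not running": 2 on rabbitmq-server 3.3.5, 69 on 3.6.x
            # (the code seen in the debug-start output in comment 3)
            ;;
        *)
            # anything else is a genuine failure
            ;;
    esac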

Meanwhile, expect a fixed RabbitMQ build (with the changes partially reverted) soon.

Comment 9 Peter Lemenkov 2016-05-25 20:00:21 UTC
I've just made a build - please try:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11089199

Comment 12 Peter Lemenkov 2016-05-26 14:49:44 UTC
Ok, here is the news.

This is still just a workaround, since it changes the API back to the old behavior by removing the fine-grained error reporting. This package returns just "2" as an error code instead of distinct values that properly reflect the actual issue.

Use this build for now, but I'm going to roll another one, coupled with a resource-agents build patched to accept both the new error codes (from this build and up) and the old ones (version 3.3.5).

There shouldn't be any visible change from the Pacemaker user's point of view.

Comment 15 errata-xmlrpc 2016-08-11 12:22:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html