Description of problem:
We are running sanity tests on RabbitMQ (HA deployment with OSPD 9) in order to make sure everything is OK after the RabbitMQ rebase. The version of rabbitmq-server under test is 3.6.2-3.

After a hard reset of the controllers (the reset was done using IPMI), rabbitmq failed to start on two controllers. First, one controller was reset and, after it came back up, another one was reset.

pcs status shows the following:
-------------------------------
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]

I can see the following in overcloud-controller-0-sasl.log:
----------------------------------------
=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.56.0>
    registered_name: []
    exception exit: {{shutdown,
                         {failed_to_start_child,mnesia_kernel_sup,killed}},
                     {mnesia_sup,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.55.0>]
    messages: [{'EXIT',<0.57.0>,normal}]
    links: [<0.55.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 114
  neighbours:

=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: mnesia_monitor:init/1
    pid: <0.61.0>
    registered_name: mnesia_monitor
    exception exit: killed
      in function  gen_server:terminate/7 (gen_server.erl, line 826)
    ancestors: [mnesia_kernel_sup,mnesia_sup,<0.57.0>]
    messages: []
    links: [<0.96.0>,<14314.25129.8>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 2593
  neighbours:

The rabbit log shows:
----------------------------------------
=ERROR REPORT==== 5-Jun-2016::10:29:07 ===
Error on AMQP connection <0.389.0> (10.35.169.15:37475 -> 10.35.169.15:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

The environment is available for debugging:
-------------------------------------------
Undercloud - puma42.scl.lab.tlv.redhat.com (root/qum5net) - bare metal.
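For reference, a minimal sketch of the commands used to collect the state shown above on a controller node. The SASL log path is the default RabbitMQ location and the node-name pattern is an assumption about this deployment:
-------------------------------------------
    # Pacemaker's view of the rabbitmq clone resource
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'

    # RabbitMQ's own view of cluster membership (on a node where it is running)
    rabbitmqctl cluster_status

    # Erlang/SASL crash reports (default log location, assumed here)
    tail -n 100 /var/log/rabbitmq/$(hostname -s)-sasl.log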
Status report - we believe this issue might be related to another one we're working on right now (see bug 1343905). As soon as we deliver a fix for it, we will ask you to re-test.
Leonid, please, test this package: resource-agents-3.9.5-76.el7
(In reply to Peter Lemenkov from comment #2)
> Leonid, please, test this package:
>
> resource-agents-3.9.5-76.el7

I deployed the latest OSPD 9 and updated the resource-agents packages to resource-agents-3.9.5-76.el7. Marian should run the automation sanity suite on this deployment to see if the issue reproduces. I will update as soon as I have results.
OK, I have tested it manually and it looks like the issue reproduces. The scenario is the following:

1. Controller-0 is hard rebooted using IPMI.
2. Controller-0 comes up, but it takes some time for it to start all the pcs resources.
3. While the resources have not yet all started, controller-1 is hard rebooted using IPMI.

So, from what I have noticed, rabbit can rejoin the rabbitmq cluster if controller-0 is rebooted, comes up, starts all of its resources, and only once everything works does controller-1 go for its reboot. In other words, everything works when rebooting one controller at a time. If, for some reason, controller-1 reboots while controller-0 has not yet fully recovered from its own reboot, things go wrong. A sketch of this sequence is shown below.
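A sketch of the reproduction sequence, assuming ipmitool access to the controllers' BMCs; the BMC addresses and credentials are placeholders, not values from this environment:
-------------------------------------------
    # 1. Hard-reset controller-0 via its BMC (address/credentials are placeholders)
    ipmitool -I lanplus -H <controller-0-bmc> -U <user> -P <password> power reset

    # 2. Before controller-0 has finished starting its Pacemaker resources
    #    (i.e. while 'pcs status' still shows resources stopped/starting there),
    #    hard-reset controller-1 as well
    ipmitool -I lanplus -H <controller-1-bmc> -U <user> -P <password> power reset

    # 3. Once both controllers are back, check whether rabbitmq rejoined the cluster
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'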
This build should fix the issue. Leonid, please test.

The problem was that we introduced a regression with one of our patches merged into 3.6.1, which caused rabbitmq to fail during slave->master promotion.

When you shut down two nodes, the remaining one (a slave) had to declare itself the master. Unfortunately it failed to do so, so the two starting nodes could not re-sync with anyone. I've backported a patch which should fix this.
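To see whether the surviving node actually promoted itself to master for the mirrored queues, something like the following can be run on the remaining node (a sketch; the queue info items are the ones documented for RabbitMQ 3.6):
-------------------------------------------
    # For each queue, show which Erlang process holds the master and which mirrors are synchronised
    rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids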
(In reply to Peter Lemenkov from comment #6)
> This build should fix the issue. Leonid, please test.
>
> The problem was that we introduced a regression with one of our patches
> merged into 3.6.1, which caused rabbitmq to fail during slave->master
> promotion.
>
> When you shut down two nodes, the remaining one (a slave) had to declare
> itself the master. Unfortunately it failed to do so, so the two starting
> nodes could not re-sync with anyone. I've backported a patch which should
> fix this.

Re-tested with the updated RabbitMQ server (rabbitmq-server-3.6.2-4.el7ost.noarch). The issue still reproduces.
The bug still reproduces with rabbitmq-server-3.6.2-4.el7ost.noarch and resource-agents-3.9.5-76.el7.
This issue is likely the same as bug 1343905. As soon as we address that ticket, we'll return to this one with a proposed fix.
Leonid, could you please retest with the latest rabbitmq-server and resource-agents packages? You need to deploy:

* resource-agents-3.9.5-80.el7 or resource-agents-3.9.5-54.el7_2.16
* rabbitmq-server-3.6.3-5.el7ost
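A minimal sketch of updating and verifying the packages on each controller, assuming the builds above are available in an enabled yum repository:
-------------------------------------------
    # Pull in the newer builds of both packages
    yum update -y resource-agents rabbitmq-server

    # Verify the installed versions match the requested builds
    rpm -q resource-agents rabbitmq-server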
Tested with rabbitmq-server-3.6.3-5.el7ost and resource-agents-3.9.5-54.el7_2.16, following the scenario from the description. One controller was hard rebooted using IPMI and, after it came up, another controller was rebooted. After all controllers were up, rabbitmq was running on all of them. A pcs resource cleanup rabbitmq was required.
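For completeness, a sketch of the cleanup and verification steps used after the reboots, based on the description above:
-------------------------------------------
    # Clear the failed-action history so Pacemaker re-evaluates the resource
    pcs resource cleanup rabbitmq

    # Confirm the clone is started on all three controllers
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'

    # Confirm RabbitMQ itself sees all three nodes in the cluster
    rabbitmqctl cluster_status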
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1597.html