Bug 1343027

Summary: RabbitMQ refuses to start after a hard reset (using IPMI) of a controller.
Product: Red Hat OpenStack
Component: rabbitmq-server
Version: 9.0 (Mitaka)
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Reporter: Leonid Natapov <lnatapov>
Assignee: Peter Lemenkov <plemenko>
QA Contact: Leonid Natapov <lnatapov>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: AutomationBlocker, Regression
Hardware: Unspecified
OS: Unspecified
CC: apevec, fdinitto, jeckersb, jjoyce, lhh, lnatapov, michele, mkrcmari, oblaut, plemenko, sclewis, srevivo
Fixed In Version: rabbitmq-server-3.6.3-5.el7os, resource-agents-3.9.5-54.el7_2.16
Last Closed: 2016-08-11 12:24:14 UTC
Type: Bug
Bug Depends On: 1343905
Bug Blocks: 1309684

Description Leonid Natapov 2016-06-06 11:13:03 UTC
Description of problem:

We are running sanity tests on rabbitmq (an HA deployment with OSPD 9) in order to make sure everything is OK after the RabbitMQ rebase. The version of rabbitmq-server under test is 3.6.2-3.

After a hard reset of the controllers (the reset was done using IPMI), rabbitmq failed to start on two controllers. First, one controller was reset; after it was UP, another one was reset.
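
For reference, a hard reset of this kind is typically issued out of band with ipmitool; a sketch, with placeholder BMC address and credentials rather than the lab's actual values:

# Hard power-cycle a controller via its BMC (all values are placeholders):
ipmitool -I lanplus -H <controller-bmc-address> -U <user> -P <password> chassis power reset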

PCS STATUS shows the following:
-------------------------------
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]

I can see the following in the overcloud-controller-0-sasl.log
----------------------------------------

=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.56.0>
    registered_name: []
    exception exit: {{shutdown,
                         {failed_to_start_child,mnesia_kernel_sup,killed}},
                     {mnesia_sup,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.55.0>]
    messages: [{'EXIT',<0.57.0>,normal}]
    links: [<0.55.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 114
  neighbours:

=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: mnesia_monitor:init/1
    pid: <0.61.0>
    registered_name: mnesia_monitor
    exception exit: killed
      in function  gen_server:terminate/7 (gen_server.erl, line 826)
    ancestors: [mnesia_kernel_sup,mnesia_sup,<0.57.0>]
    messages: []
    links: [<0.96.0>,<14314.25129.8>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 2593
  neighbours:


The rabbit log shows:
----------------------------------------
=ERROR REPORT==== 5-Jun-2016::10:29:07 ===
Error on AMQP connection <0.389.0> (10.35.169.15:37475 -> 10.35.169.15:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"


The environment is available for debugging:
-------------------------------------------
Undercloud - puma42.scl.lab.tlv.redhat.com (root/qum5net) - bare metal.

Comment 1 Peter Lemenkov 2016-06-23 13:08:04 UTC
Status report - we believe this issue might be related to another one we're working on right now (see bug 1343905). As soon as we deliver a fix for it, we will ask you to re-test.

Comment 2 Peter Lemenkov 2016-06-24 14:03:35 UTC
Leonid, please, test this package:

resource-agents-3.9.5-76.el7

Comment 3 Leonid Natapov 2016-06-28 10:07:40 UTC
(In reply to Peter Lemenkov from comment #2)
> Leonid, please, test this package:
> 
> resource-agents-3.9.5-76.el7

I deployed the latest OSPD 9 and updated the resource-agents packages to resource-agents-3.9.5-76.el7. Marian should run the automation sanity on this deployment to see if the issue reproduces. Will update as soon as I have results.

Comment 4 Leonid Natapov 2016-06-28 16:40:27 UTC
OK, I have tested it manually, and it looks like the issue reproduces.
The scenario is the following:

Controller-0 is hard rebooted using IPMI.

Controller-0 comes up, but it takes some time for it to start all the pcs resources. While the resources have not all started yet, controller-1 is hard rebooted using IPMI.

So, from what I have noticed, rabbit can rejoin the rabbitmq cluster if controller-0 is rebooted, comes up, and starts all its resources, and controller-1 goes for its reboot only once everything works. In other words, everything works when one controller at a time is rebooted. If, for some reason, controller-1 reboots while controller-0 has not fully recovered from its own reboot, things go wrong.
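
A minimal sketch of the failing sequence, with placeholder BMC addresses and credentials:

# 1. Hard-reset controller-0 out of band:
ipmitool -I lanplus -H <controller-0-bmc> -U <user> -P <pass> chassis power reset

# 2. While controller-0 is still bringing its resources up, watch the cluster:
pcs status

# 3. Before rabbitmq-clone reports Started on controller-0, reset controller-1:
ipmitool -I lanplus -H <controller-1-bmc> -U <user> -P <pass> chassis power reset

# Result: rabbitmq ends up Stopped on both rebooted controllers.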

Comment 6 Peter Lemenkov 2016-06-29 15:45:39 UTC
This build should fix the issue. Leonid, please test.

The problem was that we introduced a regression with one of our patches merged into 3.6.1, which caused rabbitmq to fail during slave->master promotion.

When you shut down two nodes, the remaining one (a slave) had to declare itself the master. Unfortunately it failed to do so, so the two starting nodes could not re-sync with anyone. I've backported a patch which should fix this.
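
For reference, the promotion can be inspected from the surviving node with standard rabbitmqctl commands; a sketch:

# Show which nodes the surviving broker still considers cluster members:
rabbitmqctl cluster_status

# For mirrored queues, list the master pid and the (synchronised) slave pids;
# after a successful promotion the former slave should now appear as master:
rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids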

Comment 7 Leonid Natapov 2016-06-30 12:49:36 UTC
(In reply to Peter Lemenkov from comment #6)
> This build should fix the issue. Leonid, please test.
> 
> The problem was that we introduced a regression with one of our patches
> merged into 3.6.1, which caused rabbitmq to fail during slave->master
> promotion.
> 
> When you shut down two nodes, the remaining one (a slave) had to declare
> itself the master. Unfortunately it failed to do so, so the two starting
> nodes could not re-sync with anyone. I've backported a patch which should
> fix this.

Re-tested with the updated rabbitmq-server (rabbitmq-server-3.6.2-4.el7ost.noarch); still reproduces.

Comment 12 Leonid Natapov 2016-07-04 09:15:05 UTC
The bug still reproduces with rabbitmq-server-3.6.2-4.el7ost.noarch and resource-agents-3.9.5-76.el7.

Comment 13 Peter Lemenkov 2016-07-26 12:11:33 UTC
This issue is likely the same as bug 1343905. As soon as we address that ticket, we'll return to this one with a proposed fix.

Comment 14 Peter Lemenkov 2016-07-27 13:12:41 UTC
Leonid, could you please retest with the latest rabbitmq-server and resource-agents packages?

You need to deploy:

* resource-agents-3.9.5-80.el7 or resource-agents-3.9.5-54.el7_2.16
* rabbitmq-server-3.6.3-5.el7ost
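
A sketch of the update, assuming the builds are available in the enabled repositories (run on each controller):

# Update both packages, then let the cluster re-probe the resource:
yum update resource-agents rabbitmq-server
pcs resource cleanup rabbitmq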

Comment 15 Leonid Natapov 2016-07-28 14:17:36 UTC
Tested with rabbitmq-server-3.6.3-5.el7os and resource-agents-3.9.5-54.el7_2.16

Tested as described in the description.
One controller was hard rebooted using IPMI, and after it came up another controller was rebooted. After all controllers were UP, rabbitmq was running on all of them. A pcs resource cleanup rabbitmq was required.
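
For reference, the cleanup and verification commands ('rabbitmq' is the resource name from the clone set shown in the description):

# Clear the failed actions so Pacemaker refreshes the resource state:
pcs resource cleanup rabbitmq

# Confirm rabbitmq-clone is Started on all three controllers:
pcs status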

Comment 17 errata-xmlrpc 2016-08-11 12:24:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html