Description of problem:
We are running sanity tests on RabbitMQ (HA deployment with OSPD 9) in order to make sure everything is OK after the RabbitMQ rebase. The version of rabbitmq-server under test is 3.6.2-3.

After a hard reset of the controllers (the reset was done using IPMI), rabbitmq failed to start on two controllers. First, one controller was reset and, after it came back up, another one was reset.

pcs status shows the following:
-------------------------------
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]

I can see the following in overcloud-controller-0-sasl.log:
----------------------------------------
=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.56.0>
    registered_name: []
    exception exit: {{shutdown,
                         {failed_to_start_child,mnesia_kernel_sup,killed}},
                     {mnesia_sup,start,[normal,[]]}}
      in function  application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.55.0>]
    messages: [{'EXIT',<0.57.0>,normal}]
    links: [<0.55.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 114
  neighbours:

=CRASH REPORT==== 5-Jun-2016::10:29:00 ===
  crasher:
    initial call: mnesia_monitor:init/1
    pid: <0.61.0>
    registered_name: mnesia_monitor
    exception exit: killed
      in function  gen_server:terminate/7 (gen_server.erl, line 826)
    ancestors: [mnesia_kernel_sup,mnesia_sup,<0.57.0>]
    messages: []
    links: [<0.96.0>,<14314.25129.8>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 2586
    stack_size: 27
    reductions: 2593
  neighbours:

The rabbit log shows:
----------------------------------------
=ERROR REPORT==== 5-Jun-2016::10:29:07 ===
Error on AMQP connection <0.389.0> (10.35.169.15:37475 -> 10.35.169.15:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

The environment is available for debugging:
-------------------------------------------
Undercloud - puma42.scl.lab.tlv.redhat.com (root/qum5net) - bare metal.
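For reference, a minimal sketch of the commands used to collect the state shown above on a controller node. The SASL log path is the default RabbitMQ location and the node-name pattern is an assumption about this deployment:
-------------------------------------------
    # Pacemaker's view of the rabbitmq clone resource
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'

    # RabbitMQ's own view of cluster membership (on a node where it is running)
    rabbitmqctl cluster_status

    # Erlang/SASL crash reports (default log location, assumed here)
    tail -n 100 /var/log/rabbitmq/$(hostname -s)-sasl.log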
Status report - we believe this issue might be related to another one we're working on right now (see bug 1343905). As soon as we deliver a fix for it, we will ask you to re-test.
Leonid, please, test this package: resource-agents-3.9.5-76.el7
(In reply to Peter Lemenkov from comment #2)
> Leonid, please, test this package:
>
> resource-agents-3.9.5-76.el7

I deployed the latest OSPD 9 and updated the resource-agents packages to resource-agents-3.9.5-76.el7. Marian should run the automation sanity suite on this deployment to see if the issue reproduces. I will update as soon as I have results.
OK, I have tested it manually and it looks like the issue reproduces. The scenario is the following:

1. Controller-0 is hard rebooted using IPMI.
2. Controller-0 comes up, but it takes some time for it to start all the pcs resources.
3. While the resources have not yet all started, controller-1 is hard rebooted using IPMI.

So, from what I have noticed, rabbit can rejoin the rabbitmq cluster if controller-0 is rebooted, comes up, starts all of its resources, and only once everything works does controller-1 go for its reboot. In other words, everything works when rebooting one controller at a time. If, for some reason, controller-1 reboots while controller-0 has not yet fully recovered from its own reboot, things go wrong. A sketch of this sequence is shown below.
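A sketch of the reproduction sequence, assuming ipmitool access to the controllers' BMCs; the BMC addresses and credentials are placeholders, not values from this environment:
-------------------------------------------
    # 1. Hard-reset controller-0 via its BMC (address/credentials are placeholders)
    ipmitool -I lanplus -H <controller-0-bmc> -U <user> -P <password> power reset

    # 2. Before controller-0 has finished starting its Pacemaker resources
    #    (i.e. while 'pcs status' still shows resources stopped/starting there),
    #    hard-reset controller-1 as well
    ipmitool -I lanplus -H <controller-1-bmc> -U <user> -P <password> power reset

    # 3. Once both controllers are back, check whether rabbitmq rejoined the cluster
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'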
This build should fix the issue. Leonid, please test.

The problem was that we introduced a regression with one of our patches merged into 3.6.1, which caused rabbitmq to fail during slave->master promotion.

When you shut down two nodes, the remaining one (a slave) had to declare itself the master. Unfortunately it failed to do so, so the two starting nodes could not re-sync with anyone. I've backported a patch which should fix this.
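To see whether the surviving node actually promoted itself to master for the mirrored queues, something like the following can be run on the remaining node (a sketch; the queue info items are the ones documented for RabbitMQ 3.6):
-------------------------------------------
    # For each queue, show which Erlang process holds the master and which mirrors are synchronised
    rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids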
(In reply to Peter Lemenkov from comment #6)
> This build should fix the issue. Leonid, please test.
>
> The problem was that we introduced a regression with one of our patches
> merged into 3.6.1, which caused rabbitmq to fail during slave->master
> promotion.
>
> When you shut down two nodes, the remaining one (a slave) had to declare
> itself the master. Unfortunately it failed to do so, so the two starting
> nodes could not re-sync with anyone. I've backported a patch which should
> fix this.

Re-tested with the updated RabbitMQ server (rabbitmq-server-3.6.2-4.el7ost.noarch). The issue still reproduces.
The bug still reproduces with rabbitmq-server-3.6.2-4.el7ost.noarch and resource-agents-3.9.5-76.el7.
This issue is likely the same as bug 1343905. As soon as we address that ticket, we'll return to this one with a proposed fix.
Leonid, could you please retest with the latest rabbitmq-server and resource-agents packages? You need to deploy:

* resource-agents-3.9.5-80.el7 or resource-agents-3.9.5-54.el7_2.16
* rabbitmq-server-3.6.3-5.el7ost
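A minimal sketch of updating and verifying the packages on each controller, assuming the builds above are available in an enabled yum repository:
-------------------------------------------
    # Pull in the newer builds of both packages
    yum update -y resource-agents rabbitmq-server

    # Verify the installed versions match the requested builds
    rpm -q resource-agents rabbitmq-server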
Tested with rabbitmq-server-3.6.3-5.el7ost and resource-agents-3.9.5-54.el7_2.16, following the scenario from the description. One controller was hard rebooted using IPMI and, after it came up, another controller was rebooted. After all controllers were up, rabbitmq was running on all of them. A pcs resource cleanup rabbitmq was required.
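For completeness, a sketch of the cleanup and verification steps used after the reboots, based on the description above:
-------------------------------------------
    # Clear the failed-action history so Pacemaker re-evaluates the resource
    pcs resource cleanup rabbitmq

    # Confirm the clone is started on all three controllers
    pcs status | grep -A 3 'Clone Set: rabbitmq-clone'

    # Confirm RabbitMQ itself sees all three nodes in the cluster
    rabbitmqctl cluster_status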
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1597.html