Bug 1397393 - RabbitMQ server crash with a core dump after fencing of one or more controllers
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.4
Hardware: x86_64  OS: Linux
Priority: high  Severity: high
Target Milestone: pre-dev-freeze
Target Release: ---
Assigned To: Peter Lemenkov
QA Contact: Asaf Hirshberg
Keywords: ZStream
Duplicates: 1362117 1397110 1398335
Depends On:
Blocks: 1305654 1402511
 
Reported: 2016-11-22 07:43 EST by Udi Shkalim
Modified: 2017-12-07 08:25 EST
CC List: 18 users

See Also:
Fixed In Version: resource-agents-3.9.5-86.el7
Doc Type: Bug Fix
Doc Text:
Previously, when fencing controllers on Red Hat OpenStack Platform, the RabbitMQ resource terminated unexpectedly with a core dump. This update fixes how the RabbitMQ resource agent handles fencing, and thus avoids the described crash.
Story Points: ---
Clone Of:
: 1402511 (view as bug list)
Environment:
Last Closed: 2017-08-01 10:57:40 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker  Tracker ID  Priority  Status  Summary  Last Updated
Launchpad 1620649 None None None 2016-11-22 10:52 EST
Github ClusterLabs/resource-agents/pull/897 None None None 2016-12-07 09:44 EST
Github rabbitmq/rabbitmq-server/pull/946 None None None 2016-11-22 10:52 EST
Red Hat Product Errata RHBA-2017:1844 normal SHIPPED_LIVE resource-agents bug fix and enhancement update 2017-08-01 13:49:20 EDT

Description Udi Shkalim 2016-11-22 07:43:34 EST
Description of problem:
During HA recovery testing we reboot one or more controllers and verify RabbitMQ recovery. In one of the scenarios we see a long recovery time, which causes the rabbitmq resource to fail. The RabbitMQ server crashes with the following:
=ERROR REPORT==== 17-Nov-2016::12:56:17 ===
Mnesia('rabbit@overcloud-controller-0'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@overcloud-controller-0_1479_387377_90720")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@overcloud-controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-0','rabbit@overcloud-controller-2','rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479384494757167930,-576460752303422247,1},'rabbit@overcloud-controller-1'},{{7,0},{'rabbit@overcloud-controller-0',{1479,386787,22419}}}}, 'rabbit@overcloud-controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479387361761853749,-576460752303422248,1},'rabbit@overcloud-controller-1'},{{2,0},[]} 

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-5.el7ost.noarch
puppet-rabbitmq-5.6.0-1.057a013git.el7ost.noarch
resource-agents-3.9.5-82.el7.x86_64


How reproducible:
90%

Steps to Reproduce:
1. Deploy OpenStack in HA with 3 controllers
2. Configure fencing
3. Fence multiple controllers (see the pcs sketch below)
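
A hedged sketch of how step 3 might be driven with pcs from one of the surviving controllers (node names are examples only, not taken from this report):

  # Confirm fencing (stonith) is configured and enabled
  pcs stonith show
  pcs property show stonith-enabled

  # Fence two of the three controllers close together
  pcs stonith fence overcloud-controller-1 &
  pcs stonith fence overcloud-controller-2 &
  wait

  # Watch the rabbitmq resource while the fenced nodes come back
  pcs status resources | grep -i rabbitmq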

Actual results:
Rabbitmq resource is down

Expected results:
rabbitmq resource should be up

Additional info:
http://pastebin.test.redhat.com/431829
Comment 2 Peter Lemenkov 2016-11-22 10:53:24 EST
See also bug 1397110 comment 8.
Comment 3 Peter Lemenkov 2016-11-22 10:56:10 EST
This is a resource-agents related issue. It seems that there are more situations where we should do a full Mnesia cleanup.
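
For orientation, a rough sketch of what a "full Mnesia cleanup" amounts to on an affected node (paths are the RabbitMQ defaults on RHEL; the actual fix lives in the resource agent, see the ClusterLabs pull request in the external trackers):

  # Stop the RabbitMQ application if it is still responding (best effort)
  rabbitmqctl stop_app 2>/dev/null || true
  # Wipe the node's local Mnesia state so it rejoins with a clean schema
  rm -rf /var/lib/rabbitmq/mnesia/*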
Comment 4 Peter Lemenkov 2016-11-24 12:03:00 EST
Related issue is described in bug 1362117.

This is what happened.

Node-0 starts fresh:

==============================================

rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Bootstrapping rabbitmq cluster
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:17 INFO: cluster bootstrapped
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:18 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}

==============================================

Node-1 decides to join this cluster (consisting of a single node so far):

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Waiting for server to start
...

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-0'}

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_914182_938424")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479896607054171156,-576460752303422935,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,896643,993197}}}}, 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479914173599223089,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}


=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,undefined,undefined,undefined,0,false,
                               true,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],0,mnesia_snmp_sup,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,110616}
** Reason for termination ==
** killed

=INFO REPORT==== 23-Nov-2016::15:16:32 ===
Error description:
   {could_not_start,mnesia,
       {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
        {mnesia_sup,start,[normal,[]]}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@controller-1.log
   /var/log/rabbitmq/rabbit@controller-1-sasl.log

...

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: node failed to join, wiping data directory and trying again
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-0  ].
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-0.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:48 INFO: Attempting to join cluster with target node rabbit@controller-0
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:49 INFO: Joined cluster by connecting to node rabbit@controller-0, starting app
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:53 INFO: Successfully joined existing rabbitmq cluster
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ DIAGNOSTICS ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ =========== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ rabbit@controller-1: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * connected to epmd (port 4369) on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * epmd reports: node 'rabbit' not running at all ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [                   no other nodes on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * suggestion: start the node ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ current node details: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - node name: 'rabbitmq-cli-29@controller-1' ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - home dir: /var/lib/rabbitmq ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - cookie hash: 5dCLhityLmt2jfyxtOkPpw== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
...

==============================================

Node-2 starts almost normally:

==============================================

=INFO REPORT==== 23-Nov-2016::15:16:57 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@controller-2' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@controller-2.log
   /var/log/rabbitmq/rabbit@controller-2-sasl.log

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
                   [{file,"src/rabbit_mnesia.erl"},{line,589}]},
    {rabbit,'-boot/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,275}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,296}]},
    {init,start_it,1,[]},
    {init,start_em,1,[]}]


=INFO REPORT==== 23-Nov-2016::15:17:06 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.4
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/


==============================================

"Almost normally" means it should better not even try to join a cluster which was fully reset, and instead reset itself.
Comment 5 Peter Lemenkov 2016-12-07 09:00:10 EST
*** Bug 1362117 has been marked as a duplicate of this bug. ***
Comment 6 Udi Shkalim 2016-12-07 09:05:47 EST
*** Bug 1397110 has been marked as a duplicate of this bug. ***
Comment 7 Peter Lemenkov 2016-12-07 09:45:02 EST
Patch provided. We're testing it now.
Comment 12 Asaf Hirshberg 2017-02-27 08:11:26 EST
Verified using automation.
Comment 13 John Eckersberg 2017-04-13 12:04:52 EDT
*** Bug 1398335 has been marked as a duplicate of this bug. ***
Comment 14 errata-xmlrpc 2017-08-01 10:57:40 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844
