Bug 1397393 - RabbitMQ server crash with a core dump after fencing of one or more controllers
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.4
Hardware: x86_64  OS: Linux
Priority: high  Severity: high
Target Milestone: pre-dev-freeze
Target Release: ---
Assigned To: Peter Lemenkov
QA Contact: Asaf Hirshberg
Keywords: ZStream
Duplicates: 1362117 1397110 1398335
Depends On:
Blocks: 1305654 1402511
 
Reported: 2016-11-22 07:43 EST by Udi Shkalim
Modified: 2017-12-07 08:25 EST
CC List: 18 users

See Also:
Fixed In Version: resource-agents-3.9.5-86.el7
Doc Type: Bug Fix
Doc Text:
Previously, when fencing controllers on Red Hat OpenStack Platform, the RabbitMQ resource terminated unexpectedly with a core dump. This update fixes how the RabbitMQ resource agent handles fencing, and thus avoids the described crash.
Story Points: ---
Clone Of:
: 1402511 (view as bug list)
Environment:
Last Closed: 2017-08-01 10:57:40 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker  Tracker ID  Priority  Status  Summary  Last Updated
Launchpad 1620649 None None None 2016-11-22 10:52 EST
Github ClusterLabs/resource-agents/pull/897 None None None 2016-12-07 09:44 EST
Github rabbitmq/rabbitmq-server/pull/946 None None None 2016-11-22 10:52 EST
Red Hat Product Errata RHBA-2017:1844 normal SHIPPED_LIVE resource-agents bug fix and enhancement update 2017-08-01 13:49:20 EDT

Description Udi Shkalim 2016-11-22 07:43:34 EST
Description of problem:
During HA recovery testing we reboot one or more controllers and verify RabbitMQ recovery. In one of the scenarios we see a long recovery time, which causes the rabbitmq resource to fail. The RabbitMQ server crashes with the following:
=ERROR REPORT==== 17-Nov-2016::12:56:17 ===
Mnesia('rabbit@overcloud-controller-0'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@overcloud-controller-0_1479_387377_90720")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@overcloud-controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-0','rabbit@overcloud-controller-2','rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479384494757167930,-576460752303422247,1},'rabbit@overcloud-controller-1'},{{7,0},{'rabbit@overcloud-controller-0',{1479,386787,22419}}}}, 'rabbit@overcloud-controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479387361761853749,-576460752303422248,1},'rabbit@overcloud-controller-1'},{{2,0},[]} 

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-5.el7ost.noarch
puppet-rabbitmq-5.6.0-1.057a013git.el7ost.noarch
resource-agents-3.9.5-82.el7.x86_64


How reproducible:
90%

Steps to Reproduce:
1. Deploy OpenStack in HA with 3 controllers
2. Configure fencing
3. Fence multiple controllers (see the pcs sketch below)
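
A hedged sketch of how step 3 might be driven with pcs from one of the surviving controllers (node names are examples only, not taken from this report):

  # Confirm fencing (stonith) is configured and enabled
  pcs stonith show
  pcs property show stonith-enabled

  # Fence two of the three controllers close together
  pcs stonith fence overcloud-controller-1 &
  pcs stonith fence overcloud-controller-2 &
  wait

  # Watch the rabbitmq resource while the fenced nodes come back
  pcs status resources | grep -i rabbitmq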

Actual results:
Rabbitmq resource is down

Expected results:
rabbitmq resource should be up

Additional info:
http://pastebin.test.redhat.com/431829
Comment 2 Peter Lemenkov 2016-11-22 10:53:24 EST
See also bug 1397110 comment 8.
Comment 3 Peter Lemenkov 2016-11-22 10:56:10 EST
This is a resource-agents related issue. It seems that there are more situations where we should do a full Mnesia cleanup.
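
For orientation, a rough sketch of what a "full Mnesia cleanup" amounts to on an affected node (paths are the RabbitMQ defaults on RHEL; the actual fix lives in the resource agent, see the ClusterLabs pull request in the external trackers):

  # Stop the RabbitMQ application if it is still responding (best effort)
  rabbitmqctl stop_app 2>/dev/null || true
  # Wipe the node's local Mnesia state so it rejoins with a clean schema
  rm -rf /var/lib/rabbitmq/mnesia/*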
Comment 4 Peter Lemenkov 2016-11-24 12:03:00 EST
Related issue is described in bug 1362117.

This is what happened.

Node-0 starts fresh:

==============================================

rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Bootstrapping rabbitmq cluster
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:17 INFO: cluster bootstrapped
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:18 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}

==============================================

Node-1 decides to join this cluster (consisting of a single node so far):

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Waiting for server to start
...

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-0'}

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_914182_938424")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479896607054171156,-576460752303422935,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,896643,993197}}}}, 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479914173599223089,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}


=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,undefined,undefined,undefined,0,false,
                               true,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],0,mnesia_snmp_sup,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,110616}
** Reason for termination ==
** killed

=INFO REPORT==== 23-Nov-2016::15:16:32 ===
Error description:
   {could_not_start,mnesia,
       {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
        {mnesia_sup,start,[normal,[]]}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@controller-1.log
   /var/log/rabbitmq/rabbit@controller-1-sasl.log

...

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: node failed to join, wiping data directory and trying again
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-0  ].
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-0.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:48 INFO: Attempting to join cluster with target node rabbit@controller-0
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:49 INFO: Joined cluster by connecting to node rabbit@controller-0, starting app
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:53 INFO: Successfully joined existing rabbitmq cluster
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ DIAGNOSTICS ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ =========== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ rabbit@controller-1: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * connected to epmd (port 4369) on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * epmd reports: node 'rabbit' not running at all ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [                   no other nodes on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * suggestion: start the node ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ current node details: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - node name: 'rabbitmq-cli-29@controller-1' ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - home dir: /var/lib/rabbitmq ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - cookie hash: 5dCLhityLmt2jfyxtOkPpw== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
...

==============================================

Node-2 starts almost normally:

==============================================

=INFO REPORT==== 23-Nov-2016::15:16:57 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@controller-2' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@controller-2.log
   /var/log/rabbitmq/rabbit@controller-2-sasl.log

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
                   [{file,"src/rabbit_mnesia.erl"},{line,589}]},
    {rabbit,'-boot/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,275}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,296}]},
    {init,start_it,1,[]},
    {init,start_em,1,[]}]


=INFO REPORT==== 23-Nov-2016::15:17:06 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.4
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/


==============================================

"Almost normally" means it should better not even try to join a cluster which was fully reset, and instead reset itself.
Comment 5 Peter Lemenkov 2016-12-07 09:00:10 EST
*** Bug 1362117 has been marked as a duplicate of this bug. ***
Comment 6 Udi Shkalim 2016-12-07 09:05:47 EST
*** Bug 1397110 has been marked as a duplicate of this bug. ***
Comment 7 Peter Lemenkov 2016-12-07 09:45:02 EST
Patch provided. We're testing it now.
Comment 12 Asaf Hirshberg 2017-02-27 08:11:26 EST
Verified using automation.
Comment 13 John Eckersberg 2017-04-13 12:04:52 EDT
*** Bug 1398335 has been marked as a duplicate of this bug. ***
Comment 14 errata-xmlrpc 2017-08-01 10:57:40 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844
