Bug 1397393

Summary: RabbitMQ server crash with a core dump after fencing of one or more controllers
Product: Red Hat Enterprise Linux 7
Reporter: Udi Shkalim <ushkalim>
Component: resource-agents
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Asaf Hirshberg <ahirshbe>
Severity: high
Docs Contact:
Priority: high
Version: 7.4
CC: agk, apevec, cfeist, cluster-maint, fdinitto, jeckersb, jherrman, lhh, mjuricek, mnovacek, oalbrigt, plemenko, royoung, rscarazz, srevivo, surya_prabhakar, ushkalim, yprokule
Target Milestone: pre-dev-freeze
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: resource-agents-3.9.5-86.el7
Doc Type: Bug Fix
Doc Text:
Previously, when fencing controllers on Red Hat OpenStack Platform, the RabbitMQ resource terminated unexpectedly with a core dump. This update fixes how the RabbitMQ resource agent handles fencing, and thus avoids the described crash.
Story Points: ---
Clone Of:
: 1402511 (view as bug list)
Environment:
Last Closed: 2017-08-01 14:57:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1305654, 1402511    

Description Udi Shkalim 2016-11-22 12:43:34 UTC
Description of problem:
During HA recovery testing we reboot one or more controllers and verify RabbitMQ recovery. In one of the scenarios we see a long recovery time, which causes the rabbit resource to fail. The RabbitMQ server crashes with the following:
=ERROR REPORT==== 17-Nov-2016::12:56:17 ===
Mnesia('rabbit@overcloud-controller-0'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@overcloud-controller-0_1479_387377_90720")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@overcloud-controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-0','rabbit@overcloud-controller-2','rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479384494757167930,-576460752303422247,1},'rabbit@overcloud-controller-1'},{{7,0},{'rabbit@overcloud-controller-0',{1479,386787,22419}}}}, 'rabbit@overcloud-controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@overcloud-controller-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479387361761853749,-576460752303422248,1},'rabbit@overcloud-controller-1'},{{2,0},[]} 

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-5.el7ost.noarch
puppet-rabbitmq-5.6.0-1.057a013git.el7ost.noarch
resource-agents-3.9.5-82.el7.x86_64


How reproducible:
90%

Steps to Reproduce:
1. Deploy openstack in HA - 3 controllers
2. Configure fencing 
3. Fence multiple controllers 

Actual results:
Rabbitmq resource is down

Expected results:
rabbitmq resource should be up

Additional info:
http://pastebin.test.redhat.com/431829

Comment 2 Peter Lemenkov 2016-11-22 15:53:24 UTC
See also bug 1397110 comment 8.

Comment 3 Peter Lemenkov 2016-11-22 15:56:10 UTC
This is a resource-agents issue. It seems there are more situations where we should do a full Mnesia cleanup.
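The "full Mnesia cleanup" mentioned above amounts to wiping the node's local Mnesia database so it rejoins with a clean schema instead of a stale one. A minimal sketch of that step follows; the function name and `RMQ_DATA_DIR` variable are illustrative, not the resource agent's actual code:

```shell
#!/bin/sh
# Sketch only: wipe the local Mnesia state so a rejoining RabbitMQ node
# cannot hit the "Bad cookie in table definition" schema-merge failure.
# Path and function name are assumptions for illustration.
RMQ_DATA_DIR="${RMQ_DATA_DIR:-/var/lib/rabbitmq/mnesia}"

wipe_mnesia_data() {
    # Remove the on-disk schema and tables; the node will rebuild them
    # from the surviving cluster members on the next join attempt.
    rm -rf "${RMQ_DATA_DIR}"
}
```

In the logs below this corresponds to the agent's "wiping data directory and trying again" message after a failed join.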

Comment 4 Peter Lemenkov 2016-11-24 17:03:00 UTC
Related issue is described in bug 1362117.

This is what happened.

Node-0 starts fresh:

==============================================

rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Bootstrapping rabbitmq cluster
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:11 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:17 INFO: cluster bootstrapped
rabbitmq-cluster(rabbitmq)[18315]:	2016/11/23_15:16:18 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}

==============================================

Node-1 decides to join this cluster (consisting of a single node so far):

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:20 INFO: Waiting for server to start
...

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-0'}

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_914182_938424")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479896607054171156,-576460752303422935,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,896643,993197}}}}, 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479914173599223089,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}


=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,undefined,undefined,undefined,0,false,
                               true,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],0,mnesia_snmp_sup,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,110616}
** Reason for termination ==
** killed

=INFO REPORT==== 23-Nov-2016::15:16:32 ===
Error description:
   {could_not_start,mnesia,
       {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
        {mnesia_sup,start,[normal,[]]}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit
   /var/log/rabbitmq/rabbit

...

rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:35 INFO: node failed to join, wiping data directory and trying again
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:38 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-0  ].
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-0.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Joining existing cluster with [ rabbit@controller-0  ] nodes.
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:39 INFO: Waiting for server to start
...
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:48 INFO: Attempting to join cluster with target node rabbit@controller-0
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:49 INFO: Joined cluster by connecting to node rabbit@controller-0, starting app
rabbitmq-cluster(rabbitmq)[14205]:	2016/11/23_15:16:53 INFO: Successfully joined existing rabbitmq cluster
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ DIAGNOSTICS ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ =========== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ rabbit@controller-1: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * connected to epmd (port 4369) on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * epmd reports: node 'rabbit' not running at all ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [                   no other nodes on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [   * suggestion: start the node ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ current node details: ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - node name: 'rabbitmq-cli-29@controller-1' ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - home dir: /var/lib/rabbitmq ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ - cookie hash: 5dCLhityLmt2jfyxtOkPpw== ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [  ]
Nov 23 15:16:53 [31339] controller-1.localdomain       lrmd:   notice: operation_finished:	rabbitmq_start_0:14205:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
...

==============================================

Node-2 starts almost normally:

==============================================

=INFO REPORT==== 23-Nov-2016::15:16:57 ===
Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@controller-2' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit
   /var/log/rabbitmq/rabbit

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
                   [{file,"src/rabbit_mnesia.erl"},{line,589}]},
    {rabbit,'-boot/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,275}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,296}]},
    {init,start_it,1,[]},
    {init,start_em,1,[]}]


=INFO REPORT==== 23-Nov-2016::15:17:06 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.4
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/


==============================================

"Almost normally" means it would be better not to even attempt joining a cluster that was fully reset, and instead to reset itself first.
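The check being argued for here can be sketched as a small decision helper: before (re)joining, ask a running cluster node for its membership list, and if this node is no longer in it, the stale local state must be reset first. The function name and the way membership is passed in are assumptions for illustration, not the agent's actual code:

```shell
#!/bin/sh
# Sketch only: decide whether a node holds stale cluster state.
# Usage: needs_local_reset <this_node> <remote_member_list>
# Returns 0 (true) when the live cluster no longer lists this node,
# i.e. a local reset/wipe is required before joining.
needs_local_reset() {
    this_node="$1"
    remote_members="$2"   # space-separated, e.g. "rabbit@controller-0 rabbit@controller-2"
    case " ${remote_members} " in
        *" ${this_node} "*) return 1 ;;  # still a member: a plain restart is fine
        *)                  return 0 ;;  # forgotten by a reset cluster: reset locally first
    esac
}
```

Applied to the log above: controller-2 still "thinks it's clustered with" controller-1, but the reset cluster disagrees, so the check would return true and trigger a local reset instead of the inconsistent_cluster error.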

Comment 5 Peter Lemenkov 2016-12-07 14:00:10 UTC
*** Bug 1362117 has been marked as a duplicate of this bug. ***

Comment 6 Udi Shkalim 2016-12-07 14:05:47 UTC
*** Bug 1397110 has been marked as a duplicate of this bug. ***

Comment 7 Peter Lemenkov 2016-12-07 14:45:02 UTC
Patch provided. We're testing it now.

Comment 12 Asaf Hirshberg 2017-02-27 13:11:26 UTC
Verified using automation.

Comment 13 John Eckersberg 2017-04-13 16:04:52 UTC
*** Bug 1398335 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2017-08-01 14:57:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844