Bug 1397393
| Field | Value |
|---|---|
| Summary | RabbitMQ server crash with a core dump after fencing of one or more controllers |
| Product | Red Hat Enterprise Linux 7 |
| Reporter | Udi Shkalim <ushkalim> |
| Component | resource-agents |
| Assignee | Peter Lemenkov <plemenko> |
| Status | CLOSED ERRATA |
| QA Contact | Asaf Hirshberg <ahirshbe> |
| Severity | high |
| Docs Contact | |
| Priority | high |
| Version | 7.4 |
| CC | agk, apevec, cfeist, cluster-maint, fdinitto, jeckersb, jherrman, lhh, mjuricek, mnovacek, oalbrigt, plemenko, royoung, rscarazz, srevivo, surya_prabhakar, ushkalim, yprokule |
| Target Milestone | pre-dev-freeze |
| Keywords | ZStream |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | resource-agents-3.9.5-86.el7 |
| Doc Type | Bug Fix |
| Doc Text | Previously, when fencing controllers on Red Hat OpenStack Platform, the RabbitMQ resource terminated unexpectedly with a core dump. This update fixes how the RabbitMQ resource agent handles fencing, and thus avoids the described crash. |
| Story Points | --- |
| Clone Of | |
| Clones | 1402511 |
| Environment | |
| Last Closed | 2017-08-01 14:57:40 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1305654, 1402511 |
Description Udi Shkalim 2016-11-22 12:43:34 UTC
See also bug 1397110 comment 8.

This is the resource-agents side of the issue. It seems there are more situations where we should do a full Mnesia cleanup. A related issue is described in bug 1362117.
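For reference, the "full Mnesia cleanup" mentioned above amounts to discarding the node's local Mnesia database so that it can rejoin the cluster with a fresh schema. A minimal sketch of that idea follows; it is an illustration only, not the resource agent's actual code, and it assumes the default Mnesia location under /var/lib/rabbitmq/mnesia:

    # Illustration only: discard this node's local cluster state before rejoining.
    # Assumes the default Mnesia directory /var/lib/rabbitmq/mnesia.
    rabbitmqctl stop_app                  # stop the RabbitMQ application on this node (the Erlang VM keeps running)
    rm -rf /var/lib/rabbitmq/mnesia/*     # wipe the local Mnesia schema and data
    # On the next start attempt the node has no stale schema to merge, so the
    # "Bad cookie in table definition" failure shown below cannot occur.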
This is what happened.

Node-0 starts fresh:
==============================================
rabbitmq-cluster(rabbitmq)[18315]: 2016/11/23_15:16:11 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[18315]: 2016/11/23_15:16:11 INFO: Bootstrapping rabbitmq cluster
rabbitmq-cluster(rabbitmq)[18315]: 2016/11/23_15:16:11 INFO: Waiting for server to start ...
rabbitmq-cluster(rabbitmq)[18315]: 2016/11/23_15:16:17 INFO: cluster bootstrapped
rabbitmq-cluster(rabbitmq)[18315]: 2016/11/23_15:16:18 INFO: Policy set: ha-all ^(?!amq\.).* {"ha-mode":"all"}
==============================================

Node-1 decides to join this cluster (consisting of a single node so far):

rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:20 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:20 INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:20 INFO: Waiting for server to start ...

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-0'}

=ERROR REPORT==== 23-Nov-2016::15:16:22 ===
Mnesia('rabbit@controller-1'): ** ERROR ** (core dumped to file: "/var/lib/rabbitmq/MnesiaCore.rabbit@controller-1_1479_914182_938424")
 ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec:
 'rabbit@controller-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-2','rabbit@controller-1','rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479896607054171156,-576460752303422935,1},'rabbit@controller-0'},{{4,0},{'rabbit@controller-2',{1479,896643,993197}}}},
 'rabbit@controller-0' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@controller-0'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1479914173599223089,-576460752303422975,1},'rabbit@controller-0'},{{2,0},[]}}

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,[],[],true,[],undefined,[],[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,undefined,undefined,undefined,0,false,true,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state, {local,mnesia_snmp_sup}, simple_one_for_one, [{child,undefined,mnesia_snmp_sup, {mnesia_snmp_hook,start,[]}, transient,3000,worker, [mnesia_snmp_sup,mnesia_snmp_hook, supervisor]}], undefined,0,86400000,[],0,mnesia_snmp_sup,[]}
** Reason for termination ==
** killed

=ERROR REPORT==== 23-Nov-2016::15:16:32 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.91.0>,killed}
** When Server state == {state,<0.91.0>,110616}
** Reason for termination ==
** killed

=INFO REPORT==== 23-Nov-2016::15:16:32 ===
Error description: {could_not_start,mnesia,
                       {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
                        {mnesia_sup,start,[normal,[]]}}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit
   /var/log/rabbitmq/rabbit
...

rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:35 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:35 INFO: node failed to join, wiping data directory and trying again
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:38 INFO: RabbitMQ server is not running
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:38 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-0 ].
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:39 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-0.
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:39 INFO: Joining existing cluster with [ rabbit@controller-0 ] nodes.
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:39 INFO: Waiting for server to start ...
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:48 INFO: Attempting to join cluster with target node rabbit@controller-0
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:49 INFO: Joined cluster by connecting to node rabbit@controller-0, starting app
rabbitmq-cluster(rabbitmq)[14205]: 2016/11/23_15:16:53 INFO: Successfully joined existing rabbitmq cluster

Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ DIAGNOSTICS ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ =========== ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ rabbit@controller-1: ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ * connected to epmd (port 4369) on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ * epmd reports: node 'rabbit' not running at all ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ no other nodes on controller-1 ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ * suggestion: start the node ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ current node details: ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ - node name: 'rabbitmq-cli-29@controller-1' ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ - home dir: /var/lib/rabbitmq ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ - cookie hash: 5dCLhityLmt2jfyxtOkPpw== ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ ]
Nov 23 15:16:53 [31339] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:14205:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
...
==============================================
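The "Failed to forget node" message and the eventual successful rejoin above correspond to the agent asking a surviving peer to drop the stale membership record and then joining that peer again. Roughly, as a sketch only (node names are taken from the log above; this is not the resource agent's actual code):

    # Illustration only: forget the stale membership on the surviving peer, then rejoin it.
    # Run on the recovering node (controller-1); assumes its data directory was already wiped.
    rabbitmqctl -n rabbit@controller-0 forget_cluster_node rabbit@controller-1   # executed on the peer via -n
    rabbitmqctl stop_app                                                         # the local app must be stopped before joining
    rabbitmqctl join_cluster rabbit@controller-0
    rabbitmqctl start_app

In the log above the forget step fails with not_a_cluster_node because controller-0 was bootstrapped from scratch and no longer knows about controller-1; the join still succeeds afterwards because the local data directory had already been wiped.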
Node-2 starts almost normally:
==============================================
=INFO REPORT==== 23-Nov-2016::15:16:57 ===
Error description: {error,{inconsistent_cluster,"Node 'rabbit@controller-2' thinks it's clustered with node 'rabbit@controller-1', but 'rabbit@controller-1' disagrees"}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit
   /var/log/rabbitmq/rabbit

Stack trace:
   [{rabbit_mnesia,check_cluster_consistency,0,
        [{file,"src/rabbit_mnesia.erl"},{line,589}]},
    {rabbit,'-boot/0-fun-0-',0,[{file,"src/rabbit.erl"},{line,275}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,296}]},
    {init,start_it,1,[]},
    {init,start_em,1,[]}]

=INFO REPORT==== 23-Nov-2016::15:17:06 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.4
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
==============================================

"Almost normally" means it would be better if it did not even try to join a cluster that was fully reset, and instead reset itself (see the sketch at the end of this report).

*** Bug 1362117 has been marked as a duplicate of this bug. ***

*** Bug 1397110 has been marked as a duplicate of this bug. ***

Patch provided. We're testing it now.

Verified using automation.

*** Bug 1398335 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844
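Regarding the "reset itself" suggestion above: the idea is that a node which finds the cluster has been rebuilt from scratch should drop its own state rather than attempt a schema merge. A sketch of that approach (illustration only, not the shipped fix; the peer name is taken from the logs above):

    # Illustration only: return this node to a blank state and rejoin the rebuilt cluster.
    rabbitmqctl stop_app
    rabbitmqctl force_reset                        # discard all local cluster state and data
    rabbitmqctl join_cluster rabbit@controller-0   # join the freshly bootstrapped peer
    rabbitmqctl start_app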