Bug 1350073
| Summary: | rabbitmq: HA-Config crash with "exception exit" with multiple error | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Faiaz Ahmed <fahmed> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | adhingra, apevec, chlong, cmedeiro, dmaley, fahmed, fdinitto, ggillies, jeckersb, lhh, mschuppe, pbandark, pcaruana, plemenko, srevivo, ushkalim |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rabbitmq-server-3.3.5-23.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1357991 (view as bug list) | Environment: | |
| Last Closed: | 2016-08-31 17:37:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1311180, 1319334 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Faiaz Ahmed
2016-06-25 06:10:42 UTC
Ok, what I see in the logs so far. The client has 3 nodes - let's name them R0, R1, R2. At around 23-Jun-2016::20:28:00, R0 was partitioned from R1 and R2, while those two nodes could still talk to each other. This is very clearly visible in the logs - just search for "23-Jun-2016::20:28:". You can easily find messages like:

On R0: Slave <rabbit.6125.0> saw deaths of mirrors <rabbit.7390.0> <rabbit.8421.0>
On R1: Slave <rabbit.25814.0> saw deaths of mirrors <rabbit.23661.0>
On R2: Slave <rabbit.15416.0> saw deaths of mirrors <rabbit.14124.0>

After that we have a partitioned cluster. Node R0 (the minority) paused itself, but was also hit by this: https://github.com/rabbitmq/rabbitmq-server/commit/b0f2c4644801a1116fbe17adba57c65e87f863ea

A couple of hours later the networking issues (or whatever else disconnected R0 from R1 and R2) were resolved and R0 rejoined the cluster. Unfortunately, the same event happened again, this time at 24-Jun-2016::15:40 - again, just search the rabbit logs for that timestamp. R0 was once more disconnected from the other two nodes, but this time R1 was disconnected as well. All three nodes were disconnected from each other and normal operation was halted.

My first guess - something weird is going on with their networking. Nodes are getting disconnected and reconnected several times, and this disrupts normal cluster operation. The sockets value (16384) is low (we recommend 65536), but the client's configuration seems to cope with this value.

Faiaz Ahmed (comment #11):

RabbitMQ crashed but pcs did not pick that up.

RabbitMQ crashed: http://pastebin.test.redhat.com/387358

PCS is still showing the resources as OK:
-----------------
Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ rhqe-bare-ctrl-0 rhqe-bare-ctrl-1 rhqe-bare-ctrl-2 ]
-----------------
http://pastebin.test.redhat.com/387357

(In reply to Faiaz Ahmed from comment #11)

Status report. The issue mentioned above is the main problem. RabbitMQ doesn't actually crash in the sense of the process failing, dumping core, and exiting uncleanly. Instead, connectivity between one node and the rest of the cluster is lost, so the rabbitmq instance is still there but the cluster is degraded.

1. The node should remain in the cluster as long as networking is fine.
2. Pacemaker should repair (restart) a partitioned node as soon as it leaves the cluster.

Regarding no. 2, I'd like to propose upgrading at least to resource-agents-3.9.5-54.el7_2.7 (see bug 1311180), which should handle partitioned nodes more gracefully. I'd like to emphasize that this is not a final fix (the main issue - the node being partitioned suddenly - is still there), but it might help while waiting for one.

(In reply to Peter Lemenkov from comment #13)

Hi Petro, resource-agents were updated.
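For reference, a minimal sketch of how the updated agent and the current cluster behaviour can be checked on each controller. The resource name rabbitmq-clone is taken from the pcs output above; the config path is the RHEL default and may differ on this deployment, so treat this as an illustration rather than the customer's actual procedure:

~~~~~~~~~~~~~
# Sketch only: run on each controller node
rpm -q resource-agents                    # expect resource-agents-3.9.5-54.el7_2.7 or newer
pcs resource show rabbitmq-clone          # review the rabbitmq resource agent parameters
pcs status resources                      # pacemaker's view of the clone set
# which partition-handling mode is configured (pause_minority matches the "paused" behaviour above)
grep -r cluster_partition_handling /etc/rabbitmq/rabbitmq.config
~~~~~~~~~~~~~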
Now back to the logs.

ctrl-0/cluster-log.txt:
~~~~~~~~~~~~~
116789:Jun 24 16:53:26 [4837] ctrl-0.localdomain crmd: info: process_graph_event: Detected action (1061.127) rabbitmq_start_0.3977=unknown error: failed
116794:Jun 24 16:53:26 [4837] ctrl-0.localdomain crmd: info: process_graph_event: Detected action (1061.127) rabbitmq_start_0.3977=unknown error: failed
~~~~~~~~~~~~~

rhqe-bare-ctrl-1.localdomain/cluster-log.txt:
~~~~~~~~~~~~~
16055:Jun 24 16:53:26 [4900] rhqe-bare-ctrl-1.localdomain crmd: notice: process_lrm_event: Operation rabbitmq_start_0: unknown error (node=rhqe-bare-ctrl-1, call=3977, rc=1, cib-update=2664, confirmed=true)
~~~~~~~~~~~~~

Down the rabbit hole, the relevant entries show the broker already having issues when triggering the shutdown, because channel_termination_timeout reached 70000.

grep -hEr 'SUPERVISOR REPORT.*Jun' -A3 */var/log/rabbitmq/*.log | grep Reason: | sort | uniq -c | sort -n
~~~~~~~~~~~~~
 590 Reason: {{noproc,
4122 Reason: {noproc,{gen_server2,call,
4125 Reason: {{{noproc,
4715 Reason: reached_max_restart_intensity
~~~~~~~~~~~~~

Here is an example of the dead mirrors:

~~~~~~~~~~~~~
=CRASH REPORT==== 25-Jun-2016::02:05:09 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.29856.274>
    registered_name: []
    exception exit: {noproc,
                     {gen_server2,call,
                      [worker_pool,{next_free,<0.29856.274>},infinity]}}
    in function gen_server2:terminate/3 (src/gen_server2.erl, line 1133)
    ancestors: [rabbit_mirror_queue_slave_sup,rabbit_sup,<0.692.0>]
    messages: []
    links: [<0.878.0>,<0.25470.274>]
    dictionary: [{process_name,
                  {rabbit_mirror_queue_slave,
                   {resource,<<"/">>,queue,
                    <<"engine_fanout_7a05f045d3d64f47a01b0771b8a5dead">>}}}]
~~~~~~~~~~~~~

And, to document the "4715 Reason: reached_max_restart_intensity" kind better, an example like:

~~~~~~~~~~~~~
=SUPERVISOR REPORT==== 25-Jun-2016::02:05:14 ===
  Supervisor: {<0.19973.249>,rabbit_channel_sup}
  Context:    shutdown
  Reason:     reached_max_restart_intensity
  Offender:   [{pid,<0.2611.275>},
               {name,channel},
               {mfargs,
                {rabbit_channel,start_link,
                 [1,<0.2806.275>,<0.2585.275>,<0.2806.275>,
                  <<"172.16.104.16:39745 -> 172.16.104.17:5672">>,
                  rabbit_framing_amqp_0_9_1,
                  {user,<<"guest">>,
                   [administrator],
                   rabbit_auth_backend_internal,
                   {internal_user,<<"guest">>,
                    <<237,53,33,118,185,118,167,187,209,36,247,
                      205,120,84,167,191,30,63,238,170>>,
                    [administrator]}},
                  <<"/">>,
                  [{<<"connection.blocked">>,bool,true},
                   {<<"consumer_cancel_notify">>,bool,true}],
                  <0.10229.272>,<0.2649.275>]}},
               {restart_type,intrinsic},
               {shutdown,4294967295},
               {child_type,worker}]
~~~~~~~~~~~~~

Are those really side effects, or part of the issue? Or is this something related to RabbitMQ's supervisor2, or to a crash or long-running tasks in a gen_server init callback? This is interesting for confirming or discarding patterns, especially when expanding the knowledge base.

Regards,
Pablo.
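As a side note, a per-node breakdown of the same counts can show whether the noproc storm is confined to a single broker. A hedged variation of the one-liner above, assuming the same sosreport directory layout (one */var/log/rabbitmq per controller):

~~~~~~~~~~~~~
# Sketch: per-node counts of crash/supervisor "Reason:" lines (same layout as above)
for d in */var/log/rabbitmq; do
    echo "== ${d%%/*} =="
    grep -hE -A3 'SUPERVISOR REPORT.*Jun' "$d"/*.log | grep 'Reason:' | sort | uniq -c | sort -rn | head -n 5
done
~~~~~~~~~~~~~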
(In reply to Pablo Caruana from comment #14)

The "Max restart intensity" message isn't relevant here.
It just says that the main issue (which prevents RabbitMQ from starting up) is still there, so RabbitMQ gave up trying to recover and is waiting for manual intervention.

(In reply to Peter Lemenkov from comment #16)

Thanks for the confirmation; this is usually a confusing part.

Created attachment 1178204 [details]
Fresh rabbitmqctl_cluster_status.out.gz
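The attachment appears to contain repeated cluster_status samples. A minimal sketch of how such a capture could be collected (an assumption - the report does not say how the file was actually produced):

~~~~~~~~~~~~~
# Sketch: sample cluster membership once a minute; filename matches the attachment
while true; do
    rabbitmqctl cluster_status >> rabbitmqctl_cluster_status.out 2>&1
    sleep 60
done
~~~~~~~~~~~~~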
(In reply to Faiaz Ahmed from comment #18)

Cool! I can see that we've lost node connectivity several times:

lemenkov ~/Downloads/erl_crash_dumps: sed ':a;N;$!ba;s/,\n\s*/,/g' rabbitmqctl_cluster_status.out | grep -o -P "{running_nodes.*?}" | LANG=C sort | uniq -c
  65 {running_nodes,['rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']}
6410 {running_nodes,['rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-0']}
 147 {running_nodes,['rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']}

A normally running cluster should consist of these three items (not necessarily sorted alphabetically):

['rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']

We can see that node 'rabbit@rhqe-bare-ctrl-2' sometimes leaves the cluster. I am now almost certain that there is something wrong with the connectivity between the nodes. Could you upload fresh RabbitMQ logs from all three nodes? Just tar the entire /var/log/rabbitmq directory. I'd also like to see the journal and syslog logs from these machines.

Faiaz, please try this build - rabbitmq-server-3.3.5-23.el7ost

[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1',
                 'rabbit@overcloud-controller-0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[root@overcloud-controller-0 ~]# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-0' ...
[{pid,5939},
 {running_applications,[{rabbit,"RabbitMQ","3.3.5"},
                        {mnesia,"MNESIA CXC 138 12","4.11"},
                        {os_mon,"CPO CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.6"},
                        {sasl,"SASL CXC 138 11","2.3.4"},
                        {stdlib,"ERTS CXC 138 10","1.19.4"},
                        {kernel,"ERTS CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,315832296},
          {connection_procs,11743080},
          {queue_procs,9151048},
          {plugins,0},
          {other_proc,14356040},
          {mnesia,1660976},
          {mgmt_db,0},
          {msg_index,295536},
          {other_ets,1482912},
          {binary,248563960},
          {code,16705858},
          {atom,654217},
          {other_system,11218669}]},
 {alarms,[]},
 {listeners,[{clustering,35672,"::"},{amqp,5672,"10.35.174.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13423173632},
 {disk_free_limit,50000000},
 {disk_free,466386874368},
 {file_descriptors,[{total_limit,65436},
                    {total_used,227},
                    {sockets_limit,58890},
                    {sockets_used,225}]},
 {processes,[{limit,1048576},{used,3646}]},
 {run_queue,0},
 {uptime,2524}]
...done.
[root@overcloud-controller-0 ~]#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1793.html
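For completeness, a short sketch of the checks used throughout this report that can be re-run on any controller after updating to rabbitmq-server-3.3.5-23.el7ost (node and resource names as shown in the outputs above; this is an illustration, not part of the verification record):

~~~~~~~~~~~~~
# Sketch: post-update health check on a controller node
rabbitmqctl cluster_status                       # expect all three nodes under running_nodes and {partitions,[]}
rabbitmqctl status | grep -A4 file_descriptors   # limits should be well above the low 16384 value noted earlier
pcs status resources                             # pacemaker's view should agree with rabbitmqctl
~~~~~~~~~~~~~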