Bug 1350073
| Summary: | rabbitmq: HA-Config crash with "exception exit" with multiple error | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Faiaz Ahmed <fahmed> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED ERRATA | QA Contact: | Asaf Hirshberg <ahirshbe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | adhingra, apevec, chlong, cmedeiro, dmaley, fahmed, fdinitto, ggillies, jeckersb, lhh, mschuppe, pbandark, pcaruana, plemenko, srevivo, ushkalim |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rabbitmq-server-3.3.5-23.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1357991 (view as bug list) | Environment: | |
| Last Closed: | 2016-08-31 17:37:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1311180, 1319334 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Faiaz Ahmed
2016-06-25 06:10:42 UTC
Ok, what I see in the logs so far. The client has 3 nodes - let's name them R0, R1, R2. At around 23-Jun-2016::20:28:00, R0 was partitioned from R1 and R2, while those two nodes could still talk to each other. This is very clearly visible in the logs - just search for "23-Jun-2016::20:28:". You can easily find messages like:

On R0: Slave <rabbit.6125.0> saw deaths of mirrors <rabbit.7390.0> <rabbit.8421.0>
On R1: Slave <rabbit.25814.0> saw deaths of mirrors <rabbit.23661.0>
On R2: Slave <rabbit.15416.0> saw deaths of mirrors <rabbit.14124.0>

After that we have a partitioned cluster. Node R0 (the minority) paused itself, but was also hit by this: https://github.com/rabbitmq/rabbitmq-server/commit/b0f2c4644801a1116fbe17adba57c65e87f863ea

A couple of hours later the networking issues (or whatever else disconnected R0 from R1 and R2) were resolved and R0 rejoined the cluster. Unfortunately, the same event happened again, this time at 24-Jun-2016::15:40 - again, just search the rabbit logs for that timestamp. R0 was once more disconnected from the other two nodes, but this time R1 was disconnected as well. All three nodes were disconnected from each other and normal operation was halted.

My first guess - something weird is going on with their networking. Nodes are getting disconnected and reconnected several times, and this disrupts normal cluster operation. The sockets value (16384) is low (we recommend 65536), but the client's configuration seems to cope with this value.

Faiaz Ahmed (comment #11):

RabbitMQ crashed but pcs did not pick that up.

RabbitMQ crashed: http://pastebin.test.redhat.com/387358

PCS is still showing the resources as OK:
-----------------
Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ rhqe-bare-ctrl-0 rhqe-bare-ctrl-1 rhqe-bare-ctrl-2 ]
-----------------
http://pastebin.test.redhat.com/387357

(In reply to Faiaz Ahmed from comment #11)

Status report. The issue mentioned above is the main problem. RabbitMQ doesn't actually crash in the sense of the process failing, dumping core, and exiting uncleanly. Instead, connectivity between one node and the rest of the cluster is lost, so the rabbitmq instance is still there but the cluster is degraded.

1. The node should remain in the cluster as long as networking is fine.
2. Pacemaker should repair (restart) a partitioned node as soon as it leaves the cluster.

Regarding no. 2, I'd like to propose upgrading at least to resource-agents-3.9.5-54.el7_2.7 (see bug 1311180), which should handle partitioned nodes more gracefully. I'd like to emphasize that this is not a final fix (the main issue - the node being partitioned suddenly - is still there), but it might help while waiting for one.

(In reply to Peter Lemenkov from comment #13)

Hi Petro, resource-agents were updated.
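For reference, a minimal sketch of how the updated agent and the current cluster behaviour can be checked on each controller. The resource name rabbitmq-clone is taken from the pcs output above; the config path is the RHEL default and may differ on this deployment, so treat this as an illustration rather than the customer's actual procedure:

~~~~~~~~~~~~~
# Sketch only: run on each controller node
rpm -q resource-agents                    # expect resource-agents-3.9.5-54.el7_2.7 or newer
pcs resource show rabbitmq-clone          # review the rabbitmq resource agent parameters
pcs status resources                      # pacemaker's view of the clone set
# which partition-handling mode is configured (pause_minority matches the "paused" behaviour above)
grep -r cluster_partition_handling /etc/rabbitmq/rabbitmq.config
~~~~~~~~~~~~~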
Now back to the logs.

ctrl-0/cluster-log.txt:
~~~~~~~~~~~~~
116789:Jun 24 16:53:26 [4837] ctrl-0.localdomain crmd: info: process_graph_event: Detected action (1061.127) rabbitmq_start_0.3977=unknown error: failed
116794:Jun 24 16:53:26 [4837] ctrl-0.localdomain crmd: info: process_graph_event: Detected action (1061.127) rabbitmq_start_0.3977=unknown error: failed
~~~~~~~~~~~~~

rhqe-bare-ctrl-1.localdomain/cluster-log.txt:
~~~~~~~~~~~~~
16055:Jun 24 16:53:26 [4900] rhqe-bare-ctrl-1.localdomain crmd: notice: process_lrm_event: Operation rabbitmq_start_0: unknown error (node=rhqe-bare-ctrl-1, call=3977, rc=1, cib-update=2664, confirmed=true)
~~~~~~~~~~~~~

Down the rabbit hole, the relevant entries show the broker already having issues when triggering the shutdown, because channel_termination_timeout reached 70000.

grep -hEr 'SUPERVISOR REPORT.*Jun' -A3 */var/log/rabbitmq/*.log | grep Reason: | sort | uniq -c | sort -n
~~~~~~~~~~~~~
 590 Reason: {{noproc,
4122 Reason: {noproc,{gen_server2,call,
4125 Reason: {{{noproc,
4715 Reason: reached_max_restart_intensity
~~~~~~~~~~~~~

Here is an example of the dead mirrors:

~~~~~~~~~~~~~
=CRASH REPORT==== 25-Jun-2016::02:05:09 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.29856.274>
    registered_name: []
    exception exit: {noproc,
                     {gen_server2,call,
                      [worker_pool,{next_free,<0.29856.274>},infinity]}}
    in function gen_server2:terminate/3 (src/gen_server2.erl, line 1133)
    ancestors: [rabbit_mirror_queue_slave_sup,rabbit_sup,<0.692.0>]
    messages: []
    links: [<0.878.0>,<0.25470.274>]
    dictionary: [{process_name,
                  {rabbit_mirror_queue_slave,
                   {resource,<<"/">>,queue,
                    <<"engine_fanout_7a05f045d3d64f47a01b0771b8a5dead">>}}}]
~~~~~~~~~~~~~

And, to document the "4715 Reason: reached_max_restart_intensity" kind better, an example like:

~~~~~~~~~~~~~
=SUPERVISOR REPORT==== 25-Jun-2016::02:05:14 ===
  Supervisor: {<0.19973.249>,rabbit_channel_sup}
  Context:    shutdown
  Reason:     reached_max_restart_intensity
  Offender:   [{pid,<0.2611.275>},
               {name,channel},
               {mfargs,
                {rabbit_channel,start_link,
                 [1,<0.2806.275>,<0.2585.275>,<0.2806.275>,
                  <<"172.16.104.16:39745 -> 172.16.104.17:5672">>,
                  rabbit_framing_amqp_0_9_1,
                  {user,<<"guest">>,
                   [administrator],
                   rabbit_auth_backend_internal,
                   {internal_user,<<"guest">>,
                    <<237,53,33,118,185,118,167,187,209,36,247,
                      205,120,84,167,191,30,63,238,170>>,
                    [administrator]}},
                  <<"/">>,
                  [{<<"connection.blocked">>,bool,true},
                   {<<"consumer_cancel_notify">>,bool,true}],
                  <0.10229.272>,<0.2649.275>]}},
               {restart_type,intrinsic},
               {shutdown,4294967295},
               {child_type,worker}]
~~~~~~~~~~~~~

Are those really side effects, or part of the issue? Or is this something related to RabbitMQ's supervisor2, or to a crash or long-running tasks in a gen_server init callback? This is interesting for confirming or discarding patterns, especially when expanding the knowledge base.

Regards,
Pablo.
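As a side note, a per-node breakdown of the same counts can show whether the noproc storm is confined to a single broker. A hedged variation of the one-liner above, assuming the same sosreport directory layout (one */var/log/rabbitmq per controller):

~~~~~~~~~~~~~
# Sketch: per-node counts of crash/supervisor "Reason:" lines (same layout as above)
for d in */var/log/rabbitmq; do
    echo "== ${d%%/*} =="
    grep -hE -A3 'SUPERVISOR REPORT.*Jun' "$d"/*.log | grep 'Reason:' | sort | uniq -c | sort -rn | head -n 5
done
~~~~~~~~~~~~~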
(In reply to Pablo Caruana from comment #14)

The "Max restart intensity" message isn't relevant here.
It just says that the main issue (which prevents RabbitMQ from starting up) is still there, so RabbitMQ gave up trying to recover and is waiting for manual intervention.

(In reply to Peter Lemenkov from comment #16)

Thanks for the confirmation; this is usually a confusing part.

Created attachment 1178204 [details]
Fresh rabbitmqctl_cluster_status.out.gz
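The attachment appears to contain repeated cluster_status samples. A minimal sketch of how such a capture could be collected (an assumption - the report does not say how the file was actually produced):

~~~~~~~~~~~~~
# Sketch: sample cluster membership once a minute; filename matches the attachment
while true; do
    rabbitmqctl cluster_status >> rabbitmqctl_cluster_status.out 2>&1
    sleep 60
done
~~~~~~~~~~~~~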
(In reply to Faiaz Ahmed from comment #18)

Cool! I can see that we've lost node connectivity several times:

lemenkov ~/Downloads/erl_crash_dumps: sed ':a;N;$!ba;s/,\n\s*/,/g' rabbitmqctl_cluster_status.out | grep -o -P "{running_nodes.*?}" | LANG=C sort | uniq -c
  65 {running_nodes,['rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']}
6410 {running_nodes,['rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-0']}
 147 {running_nodes,['rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']}

A normally running cluster should consist of these three items (not necessarily sorted alphabetically):

['rabbit@rhqe-bare-ctrl-2','rabbit@rhqe-bare-ctrl-1','rabbit@rhqe-bare-ctrl-0']

We can see that node 'rabbit@rhqe-bare-ctrl-2' sometimes leaves the cluster. I am now almost certain that there is something wrong with the connectivity between the nodes. Could you upload fresh RabbitMQ logs from all three nodes? Just tar the entire /var/log/rabbitmq directory. I'd also like to see the journal and syslog logs from these machines.

Faiaz, please try this build - rabbitmq-server-3.3.5-23.el7ost

[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1',
                 'rabbit@overcloud-controller-0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[root@overcloud-controller-0 ~]# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-0' ...
[{pid,5939},
 {running_applications,[{rabbit,"RabbitMQ","3.3.5"},
                        {mnesia,"MNESIA CXC 138 12","4.11"},
                        {os_mon,"CPO CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.6"},
                        {sasl,"SASL CXC 138 11","2.3.4"},
                        {stdlib,"ERTS CXC 138 10","1.19.4"},
                        {kernel,"ERTS CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,315832296},
          {connection_procs,11743080},
          {queue_procs,9151048},
          {plugins,0},
          {other_proc,14356040},
          {mnesia,1660976},
          {mgmt_db,0},
          {msg_index,295536},
          {other_ets,1482912},
          {binary,248563960},
          {code,16705858},
          {atom,654217},
          {other_system,11218669}]},
 {alarms,[]},
 {listeners,[{clustering,35672,"::"},{amqp,5672,"10.35.174.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13423173632},
 {disk_free_limit,50000000},
 {disk_free,466386874368},
 {file_descriptors,[{total_limit,65436},
                    {total_used,227},
                    {sockets_limit,58890},
                    {sockets_used,225}]},
 {processes,[{limit,1048576},{used,3646}]},
 {run_queue,0},
 {uptime,2524}]
...done.
[root@overcloud-controller-0 ~]#

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1793.html
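For completeness, a short sketch of the checks used throughout this report that can be re-run on any controller after updating to rabbitmq-server-3.3.5-23.el7ost (node and resource names as shown in the outputs above; this is an illustration, not part of the verification record):

~~~~~~~~~~~~~
# Sketch: post-update health check on a controller node
rabbitmqctl cluster_status                       # expect all three nodes under running_nodes and {partitions,[]}
rabbitmqctl status | grep -A4 file_descriptors   # limits should be well above the low 16384 value noted earlier
pcs status resources                             # pacemaker's view should agree with rabbitmqctl
~~~~~~~~~~~~~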