Bug 1357991 - rabbitmq: HA-Config crash with "exception exit" with multiple errors
Summary: rabbitmq: HA-Config crash with "exception exit" with multiple errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: async
Target Release: ---
Assignee: Peter Lemenkov
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On: 1311180 1319334
Blocks:
 
Reported: 2016-07-19 18:42 UTC by Peter Lemenkov
Modified: 2019-11-14 08:47 UTC
CC List: 14 users

Fixed In Version: rabbitmq-server-3.3.5-23.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1350073
Cloned To: 1370082 1387985
Environment:
Last Closed: 2016-08-31 17:37:59 UTC
Target Upstream Version:
Embargoed:




Links:
GitHub rabbitmq/rabbitmq-server issue 812 (closed): Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion (last updated 2019-11-20 07:18:57 UTC)
Red Hat Knowledge Base Solution 2619181 (last updated 2016-09-12 08:15:12 UTC)
Red Hat Product Errata RHBA-2016:1792, SHIPPED_LIVE: Red Hat OpenStack Platform 8 Bug Fix Advisory (last updated 2016-08-31 21:35:27 UTC)

Comment 3 Asaf Hirshberg 2016-08-25 08:38:33 UTC
Testing on OSPD-8 with the desired RPM, I ran some automation (running Rally, rebooting controllers, etc.) and got crash reports like:

=CRASH REPORT==== 24-Aug-2016::17:57:07 ===
  crasher:
    initial call: gen:init_it/6
    pid: <0.623.0>
    registered_name: []
    exception exit: {undef,
                        [{rabbit_misc,get_env,
                             [rabbit,slave_wait_timeout,15000],
                             []},
                         {rabbit_mirror_queue_master,
                             promote_backing_queue_state,8,
                             [{file,"src/rabbit_mirror_queue_master.erl"},
                              {line,452}]},
                         {rabbit_mirror_queue_slave,promote_me,2,
                             [{file,"src/rabbit_mirror_queue_slave.erl"},
                              {line,615}]},
                         {rabbit_mirror_queue_slave,handle_call,3,
                             [{file,"src/rabbit_mirror_queue_slave.erl"},
                              {line,220}]},
                         {gen_server2,handle_msg,2,
                             [{file,"src/gen_server2.erl"},{line,1001}]},
                         {proc_lib,wake_up,3,
                             [{file,"proc_lib.erl"},{line,249}]}]}
      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1133)
    ancestors: [rabbit_mirror_queue_slave_sup,rabbit_sup,<0.105.0>]
    messages: [{'$gen_cast',policy_changed}]

But I'm not sure what the success/fail criteria are. Is there something specific I should look for? How can I know whether the crash is related to a controller reboot? Are there any steps to reproduce?
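For reference, the reports above came from the broker logs. A minimal way to scan for them, assuming the default RabbitMQ log path for this node (paths are deployment-specific):

    # look for crash reports, with trailing context, in the RabbitMQ log
    # on each controller
    grep -A 25 'CRASH REPORT' /var/log/rabbitmq/rabbit@overcloud-controller-0.log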

Comment 4 Peter Lemenkov 2016-08-25 09:16:05 UTC
(In reply to Asaf Hirshberg from comment #3)
> Testing on OSPD-8 with the desired RPM, I ran some automation (running
> Rally, rebooting controllers, etc.) and got crash reports like:
> 
> =CRASH REPORT==== 24-Aug-2016::17:57:07 ===
> [...]
>       in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1133)
> 
> But I'm not sure what the success/fail criteria are. Is there something
> specific I should look for? How can I know whether the crash is related to
> a controller reboot? Are there any steps to reproduce?

That's a different (and unrelated) issue, introduced during the backport: promote_backing_queue_state/8 calls rabbit_misc:get_env/3, a helper that was only added to rabbit_misc in a later upstream version, so the call terminates with undef. I'll provide a fixed build shortly.
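For context, the missing function is just a small wrapper around application:get_env/2 that falls back to a default value. A minimal sketch of what the backport needs (the module name below is illustrative; upstream the helper lives in rabbit_misc, and its exact body may differ):

    %% Illustrative sketch of the helper the backported promotion code
    %% expects. Look up Key in Application's environment, returning
    %% Default when the key is unset.
    -module(rabbit_misc_env_sketch).
    -export([get_env/3]).

    get_env(Application, Key, Default) ->
        case application:get_env(Application, Key) of
            {ok, Value} -> Value;
            undefined   -> Default
        end.

With such a helper in place, promote_backing_queue_state/8 can read slave_wait_timeout (defaulting to 15000 ms, as in the stack trace above) without hitting undef.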

Comment 5 Asaf Hirshberg 2016-08-25 11:56:26 UTC
[root@overcloud-controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-0' ...
[{nodes,[{disc,['rabbit@overcloud-controller-0',
                'rabbit@overcloud-controller-1',
                'rabbit@overcloud-controller-2']}]},
 {running_nodes,['rabbit@overcloud-controller-2',
                 'rabbit@overcloud-controller-1',
                 'rabbit@overcloud-controller-0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.
[root@overcloud-controller-0 ~]# rabbitmqctl status 
Status of node 'rabbit@overcloud-controller-0' ...
[{pid,5939},
 {running_applications,[{rabbit,"RabbitMQ","3.3.5"},
                        {mnesia,"MNESIA  CXC 138 12","4.11"},
                        {os_mon,"CPO  CXC 138 46","2.2.14"},
                        {xmerl,"XML parser","1.3.6"},
                        {sasl,"SASL  CXC 138 11","2.3.4"},
                        {stdlib,"ERTS  CXC 138 10","1.19.4"},
                        {kernel,"ERTS  CXC 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R16B03-1 (erts-5.10.4) [source] [64-bit] [smp:12:12] [async-threads:30] [hipe] [kernel-poll:true]\n"},
 {memory,[{total,315832296},
          {connection_procs,11743080},
          {queue_procs,9151048},
          {plugins,0},
          {other_proc,14356040},
          {mnesia,1660976},
          {mgmt_db,0},
          {msg_index,295536},
          {other_ets,1482912},
          {binary,248563960},
          {code,16705858},
          {atom,654217},
          {other_system,11218669}]},
 {alarms,[]},
 {listeners,[{clustering,35672,"::"},{amqp,5672,"10.35.174.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13423173632},
 {disk_free_limit,50000000},
 {disk_free,466386874368},
 {file_descriptors,[{total_limit,65436},
                    {total_used,227},
                    {sockets_limit,58890},
                    {sockets_used,225}]},
 {processes,[{limit,1048576},{used,3646}]},
 {run_queue,0},
 {uptime,2524}]
...done.
[root@overcloud-controller-0 ~]#
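As an extra sanity check after the reboots, one could also confirm that the HA policy is applied and that every mirrored queue has synchronised mirrors. A sketch using standard rabbitmqctl queue info items (output will vary per deployment):

    # show the HA policies configured on the cluster
    rabbitmqctl list_policies
    # per queue: its policy, master pid, mirror pids, and in-sync mirrors
    rabbitmqctl list_queues name policy pid slave_pids synchronised_slave_pids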

Comment 7 errata-xmlrpc 2016-08-31 17:37:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1792.html

