From what I can see so far, at least the RabbitMQ configuration seems fine.
The Erlang crash dumps aren't for the main rabbitmq-server process; instead, they are for rabbitmqctl processes. It's not clear exactly what happened: the state in the crash dump is from after the error has occurred, while the process is terminating, so the original context is missing. I'm guessing the actual error would be in the stdout/stderr of the rabbitmqctl command, but we don't have that available. If things get stuck again, run the following on *each* controller:

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

and attach the output to this BZ or the case for review.
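To make collection easier, here is a minimal sketch of gathering that output from all controllers in one go; the controller hostnames and the heat-admin user are assumptions, adjust them to the actual environment:

# Run the stuck-process diagnostic on every controller and save the output
for host in controller0 controller1 controller2; do
    ssh heat-admin@$host "sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'" \
        > maybe_stuck-$host.txt 2>&1
done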
John, for the Nov 17 occurrence, not all of the rabbitmqctl commands were responsive:

# rabbitmqctl eval '[process_info(P) || P <- rabbit_networking:connections()].'
Error: {badarg,[{erlang,process_info,[<6150.3725.0>],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
                {erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,595}]},
                {erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,624}]},
                {erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,581}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,205}]}]}

# rabbitmqctl eval 'rabbit_mnesia:running_clustered_nodes().'
Error: {undef,[{rabbit_mnesia,running_clustered_nodes,[],[]},
               {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
               {rpc,'-handle_call_call/6-fun-0-',5,
                    [{file,"rpc.erl"},{line,205}]}]}

[root@ost-ah1-controller2 heat-admin]# rabbitmqctl eval 'process_info(whereis(rabbit_tcp_client_sup)).'
[{registered_name,rabbit_tcp_client_sup},
 {current_function,{supervisor2,wait_dynamic_children,5}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[]},
 {dictionary,[{'$ancestors',[rabbit_sup,<5333.789.0>]},
              {'$initial_call',{supervisor2,init,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<5333.788.0>},
 {total_heap_size,364609},
 {heap_size,46422},
 {stack_size,24},
 {reductions,7387464},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]
...done.

The full rabbitmqctl report, however, seemed to hang when reaching the Connections section:

Connections:
<...... more than 10 minutes at this point ..... and still counting ......>

Usually in those scenarios, if a plain rabbitmqctl stop_app or stop fails, the processes need to be terminated in a non-graceful way. Having the

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

command looks like a generally good approach (KCS diagnostic material); thanks for the hint.
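For the badarg above, a hedged guess is that process_info/1 was handed a pid it could not inspect directly (one that had already exited, or one living on another cluster node). If that query is needed again, a variant using rpc:pinfo/1, which is location-transparent and returns undefined for dead processes instead of crashing, might be more robust:

# Same connections query, but tolerant of remote or already-dead pids
rabbitmqctl eval '[{P, rpc:pinfo(P)} || P <- rabbit_networking:connections()].'

# The undef error suggests this RabbitMQ version no longer exports
# rabbit_mnesia:running_clustered_nodes/0; 3.x releases expose the same
# information via cluster_nodes(running) (worth verifying on this build):
rabbitmqctl eval 'rabbit_mnesia:cluster_nodes(running).'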
Robin, Pablo, could the customer increase the fd limits up to 16k? From the report above:

{file_descriptors,[{total_limit,3996},
                   {total_used,2},
                   {sockets_limit,3594},
                   {sockets_used,0}]},

This effectively throttles RabbitMQ operations and very likely also affects rabbitmqctl (which actually uses sockets to communicate with RabbitMQ). We discussed this in bug 1282491 a few months ago, and I'm really surprised to see it coming back.
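For the record, one common way to raise that limit on a systemd-managed host is a drop-in for the rabbitmq-server unit; this is only a sketch, since a Pacemaker-managed RabbitMQ in OSP may pick up its limits elsewhere (e.g. /etc/security/limits.d/), and 16384 simply matches the 16k suggested above:

# Raise the open-files limit for the rabbitmq-server systemd unit
mkdir -p /etc/systemd/system/rabbitmq-server.service.d
printf '[Service]\nLimitNOFILE=16384\n' \
    > /etc/systemd/system/rabbitmq-server.service.d/limits.conf
systemctl daemon-reload

# After restarting the broker, confirm the new limit is visible to RabbitMQ
rabbitmqctl status | grep -A4 file_descriptors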
OK, this looks like the relevant part of the maybe_stuck output:

[{pid,<5333.6084.0>},
 {registered_name,[]},
 {current_stacktrace,
     [{timer,sleep,1,[{file,"timer.erl"},{line,152}]},
      {rabbit_amqqueue,'-with/3-fun-0-',4,
          [{file,"src/rabbit_amqqueue.erl"},{line,377}]},
      {rabbit_channel,basic_consume,8,
          [{file,"src/rabbit_channel.erl"},{line,1178}]},
      {rabbit_channel,'-handle_consuming_queue_down/2-fun-0-',4,
          [{file,"src/rabbit_channel.erl"},{line,1257}]},
      {gb_sets,fold_1,3,[{file,"gb_sets.erl"},{line,891}]},
      {rabbit_channel,handle_info,2,
          [{file,"src/rabbit_channel.erl"},{line,371}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {dictionary,
     [{delegate,delegate_13},
      {{xtype_to_module,topic},rabbit_exchange_type_topic},
      {'$ancestors',
          [<5333.6081.0>,<5333.6080.0>,<5333.6075.0>,<5333.6074.0>,
           rabbit_tcp_client_sup,rabbit_sup,<5333.621.0>]},
      {{xtype_to_module,fanout},rabbit_exchange_type_fanout},
      {process_name,
          {rabbit_channel,{<<"172.24.13.11:56883 -> 172.24.13.9:5672">>,1}}},
      {'$initial_call',{gen,init_it,6}}]},
 {message_queue_len,3},
 {links,[]},
 {monitors,[{process,<6403.1977.0>},{process,<5333.2210.0>}]},
 {monitored_by,[<5333.7944.0>,<5333.6081.0>,<5333.1037.0>]},
 {heap_size,6772}]

The important part is the presence of rabbit_amqqueue:with/3 in the stacktrace. This was fixed in a newer upstream version here:

https://github.com/rabbitmq/rabbitmq-common/pull/26
(with reference to https://github.com/rabbitmq/rabbitmq-server/issues/349)

with two additional follow-up commits adjusting the response and the timeout:

https://github.com/rabbitmq/rabbitmq-common/commit/f74ef2ec9118ffd62fc82d29c56bc0fc7422aee0
https://github.com/rabbitmq/rabbitmq-common/commit/a374fe0ae4bfeba1c7b988b7e3c6501c3289cf24
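As a rough field check (assuming the maybe_stuck output has been captured to files such as maybe_stuck-controller0.txt, an illustrative name only), counting the lines that mention rabbit_amqqueue gives a crude idea of how many suspicious processes show this signature:

# Crude proxy for channels blocked in rabbit_amqqueue:with/3
grep -c 'rabbit_amqqueue' maybe_stuck-controller0.txt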
This issue looks like a combination of the timeout issue mentioned in comment 13 and an ordering issue (the nodes must be stopped in a specific order). I'm trying to reproduce it.
The customer is reporting that the patch provided with the test package is working, based on their tests. Peter, please prepare a build that we can deliver as a hotfix.
Verified (sanity only) on ospd-7.3 with the package containing the fix: rabbitmq-server-3.3.5-22.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0641.html