Bug 1319334

Summary: erlang crashes during a compute scaled out operation. [osp7]
Product: Red Hat OpenStack Reporter: Pablo Caruana <pcaruana>
Component: rabbitmq-server    Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA QA Contact: Udi Shkalim <ushkalim>
Severity: urgent Docs Contact:
Priority: high    
Version: 7.0 (Kilo)    CC: apevec, fahmed, fdinitto, jbiao, jeckersb, lhh, mfuruta, morazi, mschuppe, pcaruana, plemenko, rcernin, scorcora, srevivo, ushkalim
Target Milestone: async    Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rabbitmq-server-3.3.5-22.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1323194 1323195 1323196 1323197 (view as bug list)    Environment:
Last Closed: 2016-04-18 13:44:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1348276, 1370082    
Bug Blocks: 1323194, 1323195, 1323196, 1323197, 1350073, 1357991, 1387474, 1387985, 1391186, 1391188, 1391190    

Comment 5 Peter Lemenkov 2016-03-22 10:50:58 UTC
From what I see so far, at least the rabbitmq configuration seems fine.

Comment 7 John Eckersberg 2016-03-22 16:57:30 UTC
The erlang crash dumps aren't for the main rabbitmq-server process; they are for rabbitmqctl processes. It's not clear exactly what happened: the state in the crash dump is from after the error occurred, while the process was already terminating, so the original context is missing. I'm guessing the actual error would be in the stdout/stderr of the rabbitmqctl command, but we don't have that available.

If things get stuck again, on *each* controller run:

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

And attach to this bz or the case for review.

Comment 8 Pablo Caruana 2016-03-22 17:50:24 UTC
John, for the Nov 17 occurrence, not all of the rabbitmqctl commands were unresponsive:

# rabbitmqctl eval '[process_info(P) || P <-
> rabbit_networking:connections()].' 
Error: {badarg,[{erlang,process_info,[<6150.3725.0>],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
                {erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,595}]},
                {erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,624}]},
                {erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,581}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,205}]}]}
# rabbitmqctl eval 'rabbit_mnesia:running_clustered_nodes().'
Error: {undef,[{rabbit_mnesia,running_clustered_nodes,[],[]},
               {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
               {rpc,'-handle_call_call/6-fun-0-',5,
                    [{file,"rpc.erl"},{line,205}]}]}
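A side note on the two failures above: erlang:process_info/1 only accepts pids local to the node evaluating it, so calling it on connection pids owned by other cluster nodes raises badarg; and rabbit_mnesia:running_clustered_nodes/0 no longer exists in this RabbitMQ version, hence the undef. Cluster-safe equivalents might look like this (a sketch; these command forms are assumptions, not taken from the case):

```
# Run process_info on the node that actually owns each connection pid:
rabbitmqctl eval '[rpc:call(node(P), erlang, process_info, [P]) || P <- rabbit_networking:connections()].'

# Likely replacement for the removed running_clustered_nodes/0:
rabbitmqctl eval 'rabbit_mnesia:cluster_nodes(running).'
```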
[root@ost-ah1-controller2 heat-admin]# rabbitmqctl eval 'process_info(whereis(rabbit_tcp_client_sup)).'
[{registered_name,rabbit_tcp_client_sup},
 {current_function,{supervisor2,wait_dynamic_children,5}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[]},
 {dictionary,[{'$ancestors',[rabbit_sup,<5333.789.0>]},
              {'$initial_call',{supervisor2,init,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<5333.788.0>},
 {total_heap_size,364609},
 {heap_size,46422},
 {stack_size,24},
 {reductions,7387464},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]
...done.

But when running the full rabbitmqctl report, it seems to hang on reaching the Connections section:

Connections:

<...... more than 10 minutes at this point  ..... and still counting ......>

Usually in those scenarios, if the general rabbitmqctl stop_app (or just stop) fails, the processes have to be terminated in a non-graceful way.

Having the # rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' command looks like a good general approach (useful as KCS diagnostic material); thanks for the hint.

Comment 11 Peter Lemenkov 2016-03-24 09:35:42 UTC
Robin, Pablo, could the customer increase the fd limit to 16k? From the report above:

 {file_descriptors,[{total_limit,3996},
                    {total_used,2},
                    {sockets_limit,3594},
                    {sockets_used,0}]},

This effectively throttles RabbitMQ operations, and it very likely also affects rabbitmqctl (which actually uses sockets to communicate with RabbitMQ).

We discussed this in bug 1282491 a few months ago, and I'm really surprised to see it coming back.
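If the customer agrees, one way to raise the limit is a systemd drop-in for the rabbitmq-server unit. The path and value below are illustrative (an assumption, not taken from this case), and on OSP 7 the broker is typically pacemaker-managed, so the equivalent setting may instead belong in /etc/security/limits.d/ or the resource configuration:

```ini
# Hypothetical drop-in: /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=16384
```

After adding it, a systemctl daemon-reload and a broker restart would be needed for the new limit to take effect.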

Comment 13 John Eckersberg 2016-03-24 19:04:08 UTC
OK, this looks like the relevant part of the stuck output:

[{pid,<5333.6084.0>},
 {registered_name,[]},
 {current_stacktrace,
     [{timer,sleep,1,[{file,"timer.erl"},{line,152}]},
      {rabbit_amqqueue,'-with/3-fun-0-',4,
          [{file,"src/rabbit_amqqueue.erl"},{line,377}]},
      {rabbit_channel,basic_consume,8,
          [{file,"src/rabbit_channel.erl"},{line,1178}]},
      {rabbit_channel,'-handle_consuming_queue_down/2-fun-0-',4,
          [{file,"src/rabbit_channel.erl"},{line,1257}]},
      {gb_sets,fold_1,3,[{file,"gb_sets.erl"},{line,891}]},
      {rabbit_channel,handle_info,2,
          [{file,"src/rabbit_channel.erl"},{line,371}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {dictionary,
     [{delegate,delegate_13},
      {{xtype_to_module,topic},rabbit_exchange_type_topic},
      {'$ancestors',
          [<5333.6081.0>,<5333.6080.0>,<5333.6075.0>,<5333.6074.0>,
           rabbit_tcp_client_sup,rabbit_sup,<5333.621.0>]},
      {{xtype_to_module,fanout},rabbit_exchange_type_fanout},
      {process_name,
          {rabbit_channel,{<<"172.24.13.11:56883 -> 172.24.13.9:5672">>,1}}},
      {'$initial_call',{gen,init_it,6}}]},
 {message_queue_len,3},
 {links,[]},
 {monitors,[{process,<6403.1977.0>},{process,<5333.2210.0>}]},
 {monitored_by,[<5333.7944.0>,<5333.6081.0>,<5333.1037.0>]},
 {heap_size,6772}]

The important part is the presence of rabbit_amqqueue:with/3 on the
stack. This was fixed in a newer upstream version here:

https://github.com/rabbitmq/rabbitmq-common/pull/26
(with reference to https://github.com/rabbitmq/rabbitmq-server/issues/349)

Two additional follow-up commits adjust the response and timeout:

https://github.com/rabbitmq/rabbitmq-common/commit/f74ef2ec9118ffd62fc82d29c56bc0fc7422aee0
https://github.com/rabbitmq/rabbitmq-common/commit/a374fe0ae4bfeba1c7b988b7e3c6501c3289cf24
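The gist of the upstream fix is to bound the retry loop in rabbit_amqqueue:with/3 instead of sleeping and retrying indefinitely when a queue process is down. A minimal sketch of that bounded-retry pattern in shell (names and numbers are illustrative, not RabbitMQ code):

```shell
# retry_with_timeout SECS CMD...: retry CMD until it succeeds or SECS elapse.
# Mirrors the shape of the fix: a bounded wait that eventually reports
# failure instead of spinning forever on a process that never comes back.
retry_with_timeout() {
  local deadline=$(( SECONDS + $1 ))
  shift
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1        # give up: surface an error to the caller
    fi
    sleep 0.1         # brief pause between retries
  done
}

# Example: a command that never succeeds now fails after ~1s instead of hanging.
retry_with_timeout 1 false || echo "timed out"
```

With the original unbounded loop, the channel process in the stack trace above would keep cycling through timer:sleep/1 indefinitely, which is exactly what the current_stacktrace shows.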

Comment 23 Peter Lemenkov 2016-03-29 06:44:40 UTC
This issue looks like a combination of the timeout issue mentioned in comment 13 and an ordering issue (the nodes must be stopped in a specific order). I'm trying to reproduce it.

Comment 30 Fabio Massimo Di Nitto 2016-03-31 11:49:03 UTC
Customer is reporting that the patch provided with the test package is working based on their tests.

Peter, please prepare a build that we can deliver as hotfix.

Comment 35 Udi Shkalim 2016-04-11 13:12:40 UTC
Verified (sanity only) on OSPd 7.3 with the package containing the fix: rabbitmq-server-3.3.5-22.el7ost.noarch

Comment 37 errata-xmlrpc 2016-04-18 13:44:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0641.html