Bug 1319334
| Summary: | erlang crashes during a compute scaled out operation. [osp7] | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Pablo Caruana <pcaruana> | |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> | |
| Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | 7.0 (Kilo) | CC: | apevec, fahmed, fdinitto, jbiao, jeckersb, lhh, mfuruta, morazi, mschuppe, pcaruana, plemenko, rcernin, scorcora, srevivo, ushkalim | |
| Target Milestone: | async | Keywords: | ZStream | |
| Target Release: | 7.0 (Kilo) | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | rabbitmq-server-3.3.5-22.el7ost | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1323194 1323195 1323196 1323197 (view as bug list) | Environment: | ||
| Last Closed: | 2016-04-18 13:44:46 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1348276, 1370082 | |||
| Bug Blocks: | 1323194, 1323195, 1323196, 1323197, 1350073, 1357991, 1387474, 1387985, 1391186, 1391188, 1391190 | |||
|
Comment 5
Peter Lemenkov
2016-03-22 10:50:58 UTC
The Erlang crash dumps aren't for the main rabbitmq-server process; they are for rabbitmqctl processes. It's not clear exactly what happened: the state in the crash dump is from after the error occurred, while the process was terminating, so the original context is missing. I'm guessing the actual error was in the stdout/stderr of the rabbitmqctl command, but we don't have that available.

If things get stuck again, run the following on *each* controller and attach the output to this bz or the case for review:

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

John, for Nov 17, not all of the rabbitmqctl commands were responsive:
# rabbitmqctl eval '[process_info(P) || P <-
> rabbit_networking:connections()].'
Error: {badarg,[{erlang,process_info,[<6150.3725.0>],[]},
{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
{erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,595}]},
{erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,624}]},
{erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,581}]},
{rpc,'-handle_call_call/6-fun-0-',5,
[{file,"rpc.erl"},{line,205}]}]}
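Note that the badarg here is largely an artifact of the probe itself: erlang:process_info/1 only accepts local pids, and in a cluster rabbit_networking:connections() can return pids owned by other nodes (the <6150.3725.0> pid above is a remote one). A hedged alternative, assuming the connection_info_all helper present in the 3.3.x rabbit_networking API, avoids the node-locality problem:

```shell
# process_info/1 is node-local and raises badarg on remote pids; the
# connection_info_all helper gathers connection details cluster-wide
# (assumption: available in this 3.3.x API -- verify on the installed version):
rabbitmqctl eval 'rabbit_networking:connection_info_all().'
```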
# rabbitmqctl eval 'rabbit_mnesia:running_clustered_nodes().'
Error: {undef,[{rabbit_mnesia,running_clustered_nodes,[],[]},
{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
{rpc,'-handle_call_call/6-fun-0-',5,
[{file,"rpc.erl"},{line,205}]}]}
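The undef above is a stale API call rather than a symptom: rabbit_mnesia:running_clustered_nodes/0 no longer exists in RabbitMQ 3.x. A rough equivalent, assuming the 3.x rabbit_mnesia API, would be:

```shell
# running_clustered_nodes/0 was removed in RabbitMQ 3.x; cluster_nodes/1
# with the 'running' atom is the rough equivalent (an assumption based on
# the 3.x API -- verify against the installed version):
rabbitmqctl eval 'rabbit_mnesia:cluster_nodes(running).'
```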
[root@ost-ah1-controller2 heat-admin]# rabbitmqctl eval 'process_info(whereis(rabbit_tcp_client_sup)).'
[{registered_name,rabbit_tcp_client_sup},
{current_function,{supervisor2,wait_dynamic_children,5}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[]},
{dictionary,[{'$ancestors',[rabbit_sup,<5333.789.0>]},
{'$initial_call',{supervisor2,init,1}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<5333.788.0>},
{total_heap_size,364609},
{heap_size,46422},
{stack_size,24},
{reductions,7387464},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,10}]},
{suspending,[]}]
...done.
But it probably hangs when running the full rabbitmqctl report, on reaching the Connections section:
Connections:
<...... more than 10 minutes at this point ..... and still counting ......>
Usually in those scenarios, if the general rabbitmqctl stop_app (or just stop) fails, the processes have to be terminated ungracefully.
Having the # rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' output looks like a generally good approach (KCS diagnostic material); thanks for the hint.
Robin, Pablo, could the customer increase the fd limits up to 16k? From the report above:
{file_descriptors,[{total_limit,3996},
{total_used,2},
{sockets_limit,3594},
{sockets_used,0}]},
This effectively throttles RabbitMQ operations and very likely that affects the rabbitmqctl (which actually uses sockets for communications with RabbitMQ).
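On RHEL 7 raising the limit usually means both a PAM limits entry and, when rabbitmq-server is systemd-managed, a unit override (under pacemaker the limit is set through the resource agent instead). A minimal sketch, with paths and the 16k value illustrative rather than taken from this case:

```shell
# /etc/security/limits.d/rabbitmq.conf -- raise the open-file limit
# for the rabbitmq user (16k per the suggestion above):
#   rabbitmq  soft  nofile  16384
#   rabbitmq  hard  nofile  16384

# Under systemd, the service limit must be raised as well, since the
# unit does not read PAM limits:
mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat > /etc/systemd/system/rabbitmq-server.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=16384
EOF
systemctl daemon-reload
systemctl restart rabbitmq-server
```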
We discussed this in bug 1282491 a few months ago, and I'm really surprised to see it coming back.
OK, this looks like the relevant part of the stuck output:
[{pid,<5333.6084.0>},
{registered_name,[]},
{current_stacktrace,
[{timer,sleep,1,[{file,"timer.erl"},{line,152}]},
{rabbit_amqqueue,'-with/3-fun-0-',4,
[{file,"src/rabbit_amqqueue.erl"},{line,377}]},
{rabbit_channel,basic_consume,8,
[{file,"src/rabbit_channel.erl"},{line,1178}]},
{rabbit_channel,'-handle_consuming_queue_down/2-fun-0-',4,
[{file,"src/rabbit_channel.erl"},{line,1257}]},
{gb_sets,fold_1,3,[{file,"gb_sets.erl"},{line,891}]},
{rabbit_channel,handle_info,2,
[{file,"src/rabbit_channel.erl"},{line,371}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]},
{initial_call,{proc_lib,init_p,5}},
{dictionary,
[{delegate,delegate_13},
{{xtype_to_module,topic},rabbit_exchange_type_topic},
{'$ancestors',
[<5333.6081.0>,<5333.6080.0>,<5333.6075.0>,<5333.6074.0>,
rabbit_tcp_client_sup,rabbit_sup,<5333.621.0>]},
{{xtype_to_module,fanout},rabbit_exchange_type_fanout},
{process_name,
{rabbit_channel,{<<"172.24.13.11:56883 -> 172.24.13.9:5672">>,1}}},
{'$initial_call',{gen,init_it,6}}]},
{message_queue_len,3},
{links,[]},
{monitors,[{process,<6403.1977.0>},{process,<5333.2210.0>}]},
{monitored_by,[<5333.7944.0>,<5333.6081.0>,<5333.1037.0>]},
{heap_size,6772}]
The important part is the presence of rabbit_amqqueue:with/3, which was fixed in a newer upstream version here:
https://github.com/rabbitmq/rabbitmq-common/pull/26
(with reference to https://github.com/rabbitmq/rabbitmq-server/issues/349)
with two additional follow-up commits adjusting the response and
timeout:
https://github.com/rabbitmq/rabbitmq-common/commit/f74ef2ec9118ffd62fc82d29c56bc0fc7422aee0
https://github.com/rabbitmq/rabbitmq-common/commit/a374fe0ae4bfeba1c7b988b7e3c6501c3289cf24
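The gist of the upstream change, as a simplified sketch in Python rather than the actual Erlang code: rabbit_amqqueue:with/3 used to retry indefinitely when the queue process was down, sleeping between attempts (the timer:sleep/1 frame visible at the top of the stacktrace above), and the fix bounds the retries so the caller eventually gets an error instead of hanging forever. The names below are illustrative, not from the RabbitMQ source:

```python
import time

class QueueDown(Exception):
    """Raised when the queue process cannot be reached."""

def with_queue(lookup, fn, retries=3, delay=0.03):
    """Bounded-retry sketch: apply fn to the queue returned by lookup().

    The pre-fix behaviour was effectively retries=infinity, which is how
    a channel could sleep and retry forever against a dead queue.
    """
    for _ in range(retries):
        queue = lookup()
        if queue is not None:
            return fn(queue)
        time.sleep(delay)  # corresponds to the timer:sleep in the stacktrace
    raise QueueDown("queue process unavailable after %d retries" % retries)

# Usage: a lookup that never succeeds now fails fast instead of hanging.
try:
    with_queue(lambda: None, lambda q: q, retries=2, delay=0.0)
except QueueDown as e:
    print("gave up:", e)
```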
This issue looks like a combination of the timeout issue mentioned in comment 13 and an ordering issue (the nodes must be stopped in a specific order). I'm trying to reproduce it.

The customer reports that the patch provided with the test package is working, based on their tests. Peter, please prepare a build that we can deliver as a hotfix.

Verified (sanity only) on ospd 7.3 with the package containing the fix: rabbitmq-server-3.3.5-22.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0641.html