From what I can see so far, at least the RabbitMQ configuration seems fine.
The Erlang crash dumps aren't for the main rabbitmq-server process; instead, they are for rabbitmqctl processes. It's not clear exactly what happened: the state in the crash dump is from after the error has occurred, while the process is terminating, so the original context is missing. I'm guessing the actual error would be in the stdout/stderr of the rabbitmqctl command, but we don't have that available. If things get stuck again, run the following on *each* controller:

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

and attach the output to this BZ or the case for review.
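To make collection easier, here is a minimal sketch of gathering that output from all controllers in one go; the controller hostnames and the heat-admin user are assumptions, adjust them to the actual environment:

# Run the stuck-process diagnostic on every controller and save the output
for host in controller0 controller1 controller2; do
    ssh heat-admin@$host "sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'" \
        > maybe_stuck-$host.txt 2>&1
done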
John, for the Nov 17 occurrence, not all of the rabbitmqctl commands were responsive:

# rabbitmqctl eval '[process_info(P) || P <- rabbit_networking:connections()].'
Error: {badarg,[{erlang,process_info,[<6150.3725.0>],[]},
                {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
                {erl_eval,eval_lc1,6,[{file,"erl_eval.erl"},{line,595}]},
                {erl_eval,eval_generate,7,[{file,"erl_eval.erl"},{line,624}]},
                {erl_eval,eval_lc,6,[{file,"erl_eval.erl"},{line,581}]},
                {rpc,'-handle_call_call/6-fun-0-',5,
                     [{file,"rpc.erl"},{line,205}]}]}

# rabbitmqctl eval 'rabbit_mnesia:running_clustered_nodes().'
Error: {undef,[{rabbit_mnesia,running_clustered_nodes,[],[]},
               {erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,569}]},
               {rpc,'-handle_call_call/6-fun-0-',5,
                    [{file,"rpc.erl"},{line,205}]}]}

[root@ost-ah1-controller2 heat-admin]# rabbitmqctl eval 'process_info(whereis(rabbit_tcp_client_sup)).'
[{registered_name,rabbit_tcp_client_sup},
 {current_function,{supervisor2,wait_dynamic_children,5}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[]},
 {dictionary,[{'$ancestors',[rabbit_sup,<5333.789.0>]},
              {'$initial_call',{supervisor2,init,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<5333.788.0>},
 {total_heap_size,364609},
 {heap_size,46422},
 {stack_size,24},
 {reductions,7387464},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]
...done.

The full rabbitmqctl report, however, seemed to hang when reaching the Connections section:

Connections:
<...... more than 10 minutes at this point ..... and still counting ......>

Usually in those scenarios, if a plain rabbitmqctl stop_app or stop fails, the processes need to be terminated in a non-graceful way. Having the

# rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

command looks like a generally good approach (KCS diagnostic material); thanks for the hint.
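For the badarg above, a hedged guess is that process_info/1 was handed a pid it could not inspect directly (one that had already exited, or one living on another cluster node). If that query is needed again, a variant using rpc:pinfo/1, which is location-transparent and returns undefined for dead processes instead of crashing, might be more robust:

# Same connections query, but tolerant of remote or already-dead pids
rabbitmqctl eval '[{P, rpc:pinfo(P)} || P <- rabbit_networking:connections()].'

# The undef error suggests this RabbitMQ version no longer exports
# rabbit_mnesia:running_clustered_nodes/0; 3.x releases expose the same
# information via cluster_nodes(running) (worth verifying on this build):
rabbitmqctl eval 'rabbit_mnesia:cluster_nodes(running).'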
Robin, Pablo, could the customer increase the fd limits up to 16k? From the report above:

{file_descriptors,[{total_limit,3996},
                   {total_used,2},
                   {sockets_limit,3594},
                   {sockets_used,0}]},

This effectively throttles RabbitMQ operations and very likely also affects rabbitmqctl (which actually uses sockets to communicate with RabbitMQ). We discussed this in bug 1282491 a few months ago, and I'm really surprised to see it coming back.
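For the record, one common way to raise that limit on a systemd-managed host is a drop-in for the rabbitmq-server unit; this is only a sketch, since a Pacemaker-managed RabbitMQ in OSP may pick up its limits elsewhere (e.g. /etc/security/limits.d/), and 16384 simply matches the 16k suggested above:

# Raise the open-files limit for the rabbitmq-server systemd unit
mkdir -p /etc/systemd/system/rabbitmq-server.service.d
printf '[Service]\nLimitNOFILE=16384\n' \
    > /etc/systemd/system/rabbitmq-server.service.d/limits.conf
systemctl daemon-reload

# After restarting the broker, confirm the new limit is visible to RabbitMQ
rabbitmqctl status | grep -A4 file_descriptors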
OK, this looks like the relevant part of the maybe_stuck output:

[{pid,<5333.6084.0>},
 {registered_name,[]},
 {current_stacktrace,
     [{timer,sleep,1,[{file,"timer.erl"},{line,152}]},
      {rabbit_amqqueue,'-with/3-fun-0-',4,
          [{file,"src/rabbit_amqqueue.erl"},{line,377}]},
      {rabbit_channel,basic_consume,8,
          [{file,"src/rabbit_channel.erl"},{line,1178}]},
      {rabbit_channel,'-handle_consuming_queue_down/2-fun-0-',4,
          [{file,"src/rabbit_channel.erl"},{line,1257}]},
      {gb_sets,fold_1,3,[{file,"gb_sets.erl"},{line,891}]},
      {rabbit_channel,handle_info,2,
          [{file,"src/rabbit_channel.erl"},{line,371}]},
      {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1022}]},
      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {dictionary,
     [{delegate,delegate_13},
      {{xtype_to_module,topic},rabbit_exchange_type_topic},
      {'$ancestors',
          [<5333.6081.0>,<5333.6080.0>,<5333.6075.0>,<5333.6074.0>,
           rabbit_tcp_client_sup,rabbit_sup,<5333.621.0>]},
      {{xtype_to_module,fanout},rabbit_exchange_type_fanout},
      {process_name,
          {rabbit_channel,{<<"172.24.13.11:56883 -> 172.24.13.9:5672">>,1}}},
      {'$initial_call',{gen,init_it,6}}]},
 {message_queue_len,3},
 {links,[]},
 {monitors,[{process,<6403.1977.0>},{process,<5333.2210.0>}]},
 {monitored_by,[<5333.7944.0>,<5333.6081.0>,<5333.1037.0>]},
 {heap_size,6772}]

The important part is the presence of rabbit_amqqueue:with/3 in the stacktrace. This was fixed in a newer upstream version here:

https://github.com/rabbitmq/rabbitmq-common/pull/26
(with reference to https://github.com/rabbitmq/rabbitmq-server/issues/349)

with two additional follow-up commits adjusting the response and the timeout:

https://github.com/rabbitmq/rabbitmq-common/commit/f74ef2ec9118ffd62fc82d29c56bc0fc7422aee0
https://github.com/rabbitmq/rabbitmq-common/commit/a374fe0ae4bfeba1c7b988b7e3c6501c3289cf24
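As a rough field check (assuming the maybe_stuck output has been captured to files such as maybe_stuck-controller0.txt, an illustrative name only), counting the lines that mention rabbit_amqqueue gives a crude idea of how many suspicious processes show this signature:

# Crude proxy for channels blocked in rabbit_amqqueue:with/3
grep -c 'rabbit_amqqueue' maybe_stuck-controller0.txt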
This issue looks like a combination of the timeout issue mentioned in comment 13 and an ordering issue (the nodes must be stopped in a specific order). I'm trying to reproduce it.
The customer is reporting that the patch provided with the test package is working, based on their tests. Peter, please prepare a build that we can deliver as a hotfix.
Verified (sanity only) on ospd-7.3 with the package containing the fix: rabbitmq-server-3.3.5-22.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0641.html