Description of problem:

One of our customers running RHOSP 16.2 (GA) with a pacemaker cluster that has no STONITH configured reported the following problem:

- From time to time rabbitmq stops working: it simply stops writing to its logs. We don't have a sosreport from that period, so we can't tell what exactly happened.
- When restarted, rabbitmq first fails with error [1].
- Subsequent restarts fail with error [2]. We have a sosreport collected on the controller node while it was blocked by error [2].

rabbitmq starts fine after the controller node is rebooted.

Please help us understand this problem. More information and timestamps will be provided privately.

[1]
10:07:28.073 [error] BOOT FAILED
10:07:28.073 [error] ===========
10:07:28.073 [error] ERROR: distribution port 25672 in use by another node: rabbit@controller3
-> failed because rabbit was already running
10:07:28.073 [error]
10:07:29.074 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {dist_port_already_used,25672,"rabbit","controller3"} in context start_error
10:07:29.075 [error] CRASH REPORT Process <0.153.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,"rabbit","controller3"}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,\"rabbit\",\"controller3\"}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}

[2]
20:00:13.442 [error] supervisor:children_map/4 line 1171
20:00:13.442 [error] supervisor:'-start_children/2-fun-0-'/3 line 355
20:00:13.442 [error] supervisor:do_start_child/2 line 371
20:00:13.442 [error] supervisor:do_start_child_i/3 line 385
20:00:13.442 [error] rabbit_prelaunch:run_prelaunch_first_phase/0 line 27
20:00:13.442 [error] rabbit_prelaunch:do_run/0 line 111
20:00:13.442 [error] rabbit_prelaunch_dist:setup/1 line 15
20:00:13.443 [error] rabbit_prelaunch_dist:duplicate_node_check/1 line 51
-> net_kernel:start failed, failed_to_start_child
20:00:13.443 [error] error:{badmatch,
20:00:13.443 [error] {error,
20:00:13.443 [error] {{shutdown,
20:00:13.443 [error] {failed_to_start_child,net_kernel,{'EXIT',nodistribution}}},
20:00:13.443 [error] {child,undefined,net_sup_dynamic,
20:00:13.443 [error] {erl_distribution,start_link,
20:00:13.443 [error] [[rabbit_prelaunch_224320@localhost,shortnames],
20:00:13.443 [error] false,net_sup_dynamic]},
20:00:13.443 [error] permanent,1000,supervisor,
20:00:13.443 [error] [erl_distribution]}}}}
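
Error [1] reports that distribution port 25672 is already in use. A quick way to confirm that on the affected controller before restarting rabbitmq might look like the sketch below. It is only an illustration: the helper name `port_is_free` is hypothetical, and the check simply attempts a TCP bind, so it tells you that something holds the port but not which process.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # permission or bad-address errors are not "port in use"
        return True


if __name__ == "__main__":
    # 25672 is the distribution port named in error [1].
    print("25672 free:", port_is_free(25672))
```

If this prints `False` while rabbitmq is stopped, some other socket is holding the port and a restart would be expected to fail the same way.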
This has been root-caused to neutron-server opening thousands of connections to memcached and blocking rabbitmq, preventing it from (re)starting. We switched memcache advanced_pool to true for all services in 16.2 via https://bugzilla.redhat.com/show_bug.cgi?id=2046185, so closing this.

*** This bug has been marked as a duplicate of bug 2046185 ***
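
To check whether a node shows the same pattern of thousands of connections to memcached, one option is to count TCP connections per remote port from `ss -tn` output. The following is a minimal sketch under stated assumptions: the function name `count_peer_ports` is hypothetical, the sample addresses are made up, and 11211 is assumed to be the memcached port (its default).

```python
from collections import Counter

# Made-up sample in the column layout printed by `ss -tn`.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      172.17.1.10:40112   172.17.1.10:11211
ESTAB  0      0      172.17.1.10:40113   172.17.1.10:11211
ESTAB  0      0      172.17.1.10:40114   172.17.1.20:3306
"""


def count_peer_ports(ss_output):
    """Count connections per remote port in `ss -tn`-style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # The fifth column is the peer "address:port". The header line is
        # skipped because its port part is not numeric.
        port = fields[4].rsplit(":", 1)[-1]
        if port.isdigit():
            counts[int(port)] += 1
    return counts


if __name__ == "__main__":
    counts = count_peer_ports(SAMPLE)
    print("connections to memcached (assumed port 11211):", counts[11211])
```

On a live controller the sample text would be replaced by the captured output of `ss -tn`; a count in the thousands for the memcached port would match the behaviour described above.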