2034931 – rabbitmq crashes and fails to start after strange sequence of events

Summary: rabbitmq crashes and fails to start after strange sequence of events

Keywords:
Status:	CLOSED DUPLICATE of bug 2046185
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rabbitmq-server
Sub Component:
Version:	16.2 (Train)
Hardware:	All
OS:	All
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Peter Lemenkov
QA Contact:	pkomarov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-12-22 13:58 UTC by Alex Stupnikov
Modified:	2022-06-08 07:27 UTC
CC List:	5 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-06-08 07:18:55 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-15577	0	None	None	None	2022-06-08 07:27:46 UTC

Description Alex Stupnikov 2021-12-22 13:58:40 UTC

Description of problem:

One of our customers, running RHOSP 16.2 (GA) with a pacemaker cluster that has no STONITH configured, reported the following problem:

- From time to time rabbitmq stops working: it simply stops writing logs. We don't have a sosreport from such a period, so we can't tell what exactly happened.
- When restarted, rabbitmq first fails with error [1].
- Subsequent restarts fail with error [2]. We have a sosreport collected on the controller node while it was blocked by error [2].

rabbitmq starts fine after the controller node is rebooted.

Please help us understand this problem. More information and timestamps will be provided privately.


[1]
10:07:28.073 [error] BOOT FAILED
10:07:28.073 [error] ===========
10:07:28.073 [error] ERROR: distribution port 25672 in use by another node: rabbit@controller3  -> failed because rabbit was already running
10:07:28.073 [error]
10:07:29.074 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {dist_port_already_used,25672,"rabbit","controller3"} in context start_error
10:07:29.075 [error] CRASH REPORT Process <0.153.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,"rabbit","controller3"}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,\"rabbit\",\"controller3\"}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}

[2]
20:00:13.442 [error]     supervisor:children_map/4 line 1171
20:00:13.442 [error]     supervisor:'-start_children/2-fun-0-'/3 line 355
20:00:13.442 [error]     supervisor:do_start_child/2 line 371
20:00:13.442 [error]     supervisor:do_start_child_i/3 line 385
20:00:13.442 [error]     rabbit_prelaunch:run_prelaunch_first_phase/0 line 27
20:00:13.442 [error]     rabbit_prelaunch:do_run/0 line 111
20:00:13.442 [error]     rabbit_prelaunch_dist:setup/1 line 15
20:00:13.443 [error]     rabbit_prelaunch_dist:duplicate_node_check/1 line 51  -> net_kernel:start failed, failed_to_start_child
20:00:13.443 [error] error:{badmatch,
20:00:13.443 [error]           {error,
20:00:13.443 [error]               {{shutdown,
20:00:13.443 [error]                    {failed_to_start_child,net_kernel,{'EXIT',nodistribution}}},
20:00:13.443 [error]                {child,undefined,net_sup_dynamic,
20:00:13.443 [error]                    {erl_distribution,start_link,
20:00:13.443 [error]                        [[rabbit_prelaunch_224320@localhost,shortnames],
20:00:13.443 [error]                         false,net_sup_dynamic]},
20:00:13.443 [error]                    permanent,1000,supervisor,
20:00:13.443 [error]                    [erl_distribution]}}}}
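
Error [1] reports that distribution port 25672 is already held by another process. As a rough diagnostic sketch (not RabbitMQ's own prelaunch check, only an analogue of it), the Python below tries to bind to a port and reports whether the bind would succeed. The helper name `can_bind` and the default host are assumptions made for illustration.

```python
import errno
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a new TCP listener could bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # EADDRINUSE means some other socket already owns the port.
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # anything else (e.g. permissions) is a different problem
        return True
```

Calling `can_bind(25672)` on the affected controller while rabbitmq is supposedly stopped would show whether a leftover process still owns the distribution port. It does not say which process that is; `ss -ltnp` or similar would be needed for that.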

Comment 6 Luca Miccini 2022-06-08 07:18:55 UTC

This has been root-caused to neutron-server opening thousands of connections to memcached and blocking rabbitmq, preventing it from (re)starting.

We switched memcache advanced_pool to true for all the services in 16.2 via https://bugzilla.redhat.com/show_bug.cgi?id=2046185, so closing this.
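
For anyone checking whether a node is affected by the same connection build-up, here is a minimal sketch that counts established TCP connections to a given remote port from the text of `/proc/net/tcp`. It assumes memcached listens on its default port 11211 (the bug does not state the port), and the function name `count_established` is hypothetical. It covers IPv4 only; `/proc/net/tcp6` has the same layout for IPv6.

```python
ESTABLISHED = "01"  # TCP state code for ESTABLISHED in /proc/net/tcp


def count_established(proc_net_tcp_text, remote_port):
    """Count ESTABLISHED connections whose remote port equals remote_port."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        remote_addr, state = fields[2], fields[3]
        # Addresses look like "0100007F:2BCB"; the port is hex after the colon.
        port_hex = remote_addr.rsplit(":", 1)[1]
        if state == ESTABLISHED and int(port_hex, 16) == remote_port:
            count += 1
    return count
```

Usage on a controller would be along the lines of `count_established(open("/proc/net/tcp").read(), 11211)`, run in the network namespace of the service being inspected. A count in the thousands would be consistent with the behaviour described above, though it would not prove the same root cause.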

*** This bug has been marked as a duplicate of bug 2046185 ***
