1574250 – After a global network outage, it appears like rabbitmq didn't recover properly

Bug 1574250 - After a global network outage, it appears like rabbitmq didn't recover properly

Summary: After a global network outage, it appears like rabbitmq didn't recover properly

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rabbitmq-server
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Peter Lemenkov
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-05-02 22:07 UTC by David Hill
Modified:	2021-09-09 13:56 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-29 21:17:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	rabbitmq rabbitmq-server issues 959	0	'None'	closed	GM - crash in calculate_activity	2020-06-25 02:09:06 UTC
Red Hat Bugzilla	1466896	0	high	CLOSED	Rabbitmq - three node cluster crash after after two nodes become network isolated from each other	2022-08-09 14:14:16 UTC

Internal Links: 1612815

Description David Hill 2018-05-02 22:07:20 UTC

Description of problem:
After a global network outage, it appears like rabbitmq didn't recover properly .   We can see that rabbitmq was stopped and restarted globally around 1:14AM on Apr 28 so we should've properly recovered from that situation but when we look in the rabbitmq logs, we can see the following:


=ERROR REPORT==== 30-Apr-2018::15:33:22 ===
Channel error on connection <0.2394.14> (192.168.1.1:54680 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
Channel error on connection <0.2412.14> (192.168.1.1:43120 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
closing AMQP connection <0.3018.12> (192.168.1.1:33562 -> 192.168.1.1:5672):
missed heartbeats from client, timeout: 60s


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Global network outage 
2. Flaky LACP bond on one of the controllers
3. Rabbitmq flaky

Actual results:
Many openstack API calls were failing and restarting rabbitmq on the controllers solved the issue

Expected results:
Shouldn't it have recovered automatically?  

Additional info:

Comment 15 Andrew Beekhof 2018-08-07 06:39:16 UTC

Hi David, can you report if the config change addressed the issue?

Eck: Do we need to consider releasing the change?

Comment 16 David Hill 2018-08-07 12:29:18 UTC

Hey Andrew, 

   Let me poke the customer and see if they tried it ... 

Thanks,
Dave

Note You need to log in before you can comment on or make changes to this bug.