Bug 1574250

Summary:	After a global network outage, it appears like rabbitmq didn't recover properly
Product:	Red Hat OpenStack	Reporter:	David Hill <dhill>
Component:	rabbitmq-server	Assignee:	Peter Lemenkov <plemenko>
Status:	CLOSED WONTFIX	QA Contact:	Udi Shkalim <ushkalim>
Severity:	high	Docs Contact:
Priority:	high
Version:	10.0 (Newton)	CC:	abeekhof, apevec, asoni, cfields, dhill, dvd, jeckersb, lhh, marjones, michele, plemenko, srevivo
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-05-29 21:17:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description David Hill 2018-05-02 22:07:20 UTC

Description of problem:
After a global network outage, it appears like rabbitmq didn't recover properly .   We can see that rabbitmq was stopped and restarted globally around 1:14AM on Apr 28 so we should've properly recovered from that situation but when we look in the rabbitmq logs, we can see the following:


=ERROR REPORT==== 30-Apr-2018::15:33:22 ===
Channel error on connection <0.2394.14> (192.168.1.1:54680 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
Channel error on connection <0.2412.14> (192.168.1.1:43120 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
closing AMQP connection <0.3018.12> (192.168.1.1:33562 -> 192.168.1.1:5672):
missed heartbeats from client, timeout: 60s


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Global network outage 
2. Flaky LACP bond on one of the controllers
3. Rabbitmq flaky

Actual results:
Many openstack API calls were failing and restarting rabbitmq on the controllers solved the issue

Expected results:
Shouldn't it have recovered automatically?  

Additional info:

Comment 15 Andrew Beekhof 2018-08-07 06:39:16 UTC

Hi David, can you report if the config change addressed the issue?

Eck: Do we need to consider releasing the change?

Comment 16 David Hill 2018-08-07 12:29:18 UTC

Hey Andrew, 

   Let me poke the customer and see if they tried it ... 

Thanks,
Dave