Bug 1574250 - After a global network outage, it appears like rabbitmq didn't recover properly
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-02 22:07 UTC by David Hill
Modified: 2020-10-26 11:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-29 21:17:25 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rabbitmq rabbitmq-server issues 959 | 0 | 'None' | closed | GM - crash in calculate_activity | 2020-06-25 02:09:06 UTC
Red Hat Bugzilla | 1466896 | 0 | high | CLOSED | Rabbitmq - three node cluster crash after after two nodes become network isolated from each other | 2021-02-22 00:41:40 UTC

Internal Links: 1612815

Description David Hill 2018-05-02 22:07:20 UTC
Description of problem:
After a global network outage, it appears that RabbitMQ did not recover properly. We can see that RabbitMQ was stopped and restarted globally around 1:14 AM on Apr 28, so it should have recovered from that situation, but when we look at the RabbitMQ logs from Apr 30, we see the following:


=ERROR REPORT==== 30-Apr-2018::15:33:22 ===
Channel error on connection <0.2394.14> (192.168.1.1:54680 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
Channel error on connection <0.2412.14> (192.168.1.1:43120 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
closing AMQP connection <0.3018.12> (192.168.1.1:33562 -> 192.168.1.1:5672):
missed heartbeats from client, timeout: 60s
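
To gauge how widespread the queue timeouts are, the errors above can be tallied per queue. The following is a minimal Python sketch, not a supported tool: it assumes the log lines have exactly the format shown in the excerpt above, and the names `TIMEOUT_RE`, `count_queue_timeouts`, and `SAMPLE` are illustrative only.

```python
import re
from collections import Counter

# Assumption: the channel-exception line looks exactly like the excerpt above.
TIMEOUT_RE = re.compile(
    r"operation (?P<op>[\w.]+) caused a channel exception not_found: "
    r"\"failed to perform operation on queue '(?P<queue>[^']+)' "
    r"in vhost '(?P<vhost>[^']+)' due to timeout\""
)


def count_queue_timeouts(log_text):
    """Count timed-out queue operations, keyed by (vhost, queue)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("vhost"), match.group("queue"))] += 1
    return counts


# Sample input copied from the log excerpt in this report.
SAMPLE = """\
=ERROR REPORT==== 30-Apr-2018::15:33:22 ===
Channel error on connection <0.2394.14> (192.168.1.1:54680 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
Channel error on connection <0.2412.14> (192.168.1.1:43120 -> 192.168.1.1:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'versioned_notifications.info' in vhost '/' due to timeout"

=ERROR REPORT==== 30-Apr-2018::15:33:26 ===
closing AMQP connection <0.3018.12> (192.168.1.1:33562 -> 192.168.1.1:5672):
missed heartbeats from client, timeout: 60s
"""

if __name__ == "__main__":
    for (vhost, queue), n in count_queue_timeouts(SAMPLE).most_common():
        print(f"{n:5d}  vhost={vhost}  queue={queue}")
```

Queues that show up repeatedly in the tally are candidates for closer inspection on the affected controllers; the heartbeat-timeout entry is deliberately not counted by this sketch.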


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Global network outage 
2. Flaky LACP bond on one of the controllers
3. Rabbitmq flaky

Actual results:
Many OpenStack API calls were failing; restarting RabbitMQ on the controllers resolved the issue.

Expected results:
Shouldn't it have recovered automatically?  

Additional info:

Comment 15 Andrew Beekhof 2018-08-07 06:39:16 UTC
Hi David, can you report if the config change addressed the issue?

Eck: Do we need to consider releasing the change?

Comment 16 David Hill 2018-08-07 12:29:18 UTC
Hey Andrew, 

   Let me poke the customer and see if they tried it ... 

Thanks,
Dave

