Bug 1466896 - Rabbitmq - three node cluster crash after two nodes become network isolated from each other
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-06-30 17:04 UTC by Matt Flusche
Modified: 2022-08-09 14:02 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-21 09:14:11 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rabbitmq rabbitmq-server issues 959 | 0 | 'None' | closed | GM - crash in calculate_activity | 2020-07-29 05:40:11 UTC
Red Hat Issue Tracker | OSP-8588 | 0 | None | None | None | 2022-08-09 14:02:34 UTC

Internal Links: 1574250 1602820 1612815

Description Matt Flusche 2017-06-30 17:04:52 UTC
Description of problem:
In a three-node cluster configured with pause_minority, nodes 1 and 2 become network isolated from each other; however, both of them can still communicate with node 3.

As a result, nodes 1 and 2 both stop because of the pause_minority setting, and node 3 appears to crash. The RabbitMQ cluster is then unavailable.
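
For illustration, the topology described above can be modeled with a short Python sketch. This is not RabbitMQ code: the function name and the pause rule are assumptions, loosely inferred from the "Partial partition detected" reports quoted in the logs below (a node pauses when it has lost a peer that a still-reachable third node can see).

```python
# Hypothetical model of the reported topology; not RabbitMQ's implementation.
NODES = {"rabbit@node1", "rabbit@node2", "rabbit@node3"}


def partial_partition_pausers(nodes, broken_links):
    """Return the nodes that would pause under the simplified (assumed) rule:
    a node pauses if it lost a peer that a still-reachable third node can see."""

    def linked(a, b):
        return a != b and frozenset((a, b)) not in broken_links

    pausers = set()
    for me in nodes:
        lost = {peer for peer in nodes if peer != me and not linked(me, peer)}
        for down in lost:
            # Mirrors "We can still see X which can see <down>" in the logs.
            others = nodes - {me, down}
            if any(linked(me, via) and linked(via, down) for via in others):
                pausers.add(me)
    return pausers


# Only the node1 <-> node2 link is broken; both still reach node3.
broken = {frozenset(("rabbit@node1", "rabbit@node2"))}
paused = partial_partition_pausers(NODES, broken)
print(sorted(paused))          # ['rabbit@node1', 'rabbit@node2']
print(sorted(NODES - paused))  # ['rabbit@node3']
```

In this simplified model only rabbit@node3 is left running, so the availability of the whole cluster depends on that single node staying up.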

Will provide full logs.

###########
From node1
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node2 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node2
 * We can still see rabbit@node3 which can see rabbit@node2
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node2
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node1 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node1
 * We can still see rabbit@node3 which can see rabbit@node1
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node3
###########

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it 

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node2 but still can communicate with it 

These warnings are followed by 150+ "Generic server ... terminating" error reports, for example:

=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14877.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22412.0> terminating
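
To check the "150+" figure against a full node3 log, a small script along the following lines could tally the reports. It is only a sketch: it assumes the report header format seen in the excerpts above (`=ERROR REPORT==== <timestamp> ===` with the message on the next line), and the function name and sample text are hypothetical.

```python
import re
from collections import Counter

# Header and message patterns, assumed from the excerpts in this report.
HEADER = re.compile(r"^=(?P<level>[A-Z]+) REPORT==== (?P<ts>\S+) ===\s*$")
TERMINATING = re.compile(r"^\*\* Generic server (?P<pid><[\d.]+>) terminating")


def count_terminations(lines):
    """Count 'Generic server ... terminating' error reports per timestamp."""
    counts = Counter()
    current_ts = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Only track the timestamp while inside an ERROR report.
            is_error = header.group("level") == "ERROR"
            current_ts = header.group("ts") if is_error else None
            continue
        if current_ts and TERMINATING.match(line):
            counts[current_ts] += 1
    return counts


sample = """\
=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it

=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
""".splitlines()

print(count_terminations(sample))  # Counter({'28-Jun-2017::13:41:20': 2})
```

For a real log, an open file object can be passed in place of `sample`, since the function only iterates over lines.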

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-6.el7ost.noarch

How reproducible:
Unknown; it has happened twice in this specific environment.

Steps to Reproduce:
1. Unknown

Actual results:
The cluster crashes and RabbitMQ is unavailable.

Expected results:
A degraded but available RabbitMQ service.

Additional info:
Will provide full logs and sosreports

Comment 6 Andrew Beekhof 2018-05-21 09:14:11 UTC
Based on the last comment and the lack of further feedback from the customer, it sounds like we should close this as "can't fix".

