Bug 1466896 - Rabbitmq - three node cluster crash after two nodes become network isolated from each other
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 10.0 (Newton)
Hardware: x86_64 Linux
Priority: high
Severity: high
Assigned To: Peter Lemenkov
QA Contact: Udi Shkalim
Keywords: FutureFeature
Reported: 2017-06-30 13:04 EDT by Matt Flusche
Modified: 2017-10-02 09:59 EDT

Type: Bug
External Trackers:
Github rabbitmq/rabbitmq-server/issues/959 (Priority: None, Status: None, Summary: None, Last Updated: 2017-07-14 12:33 EDT)

Description Matt Flusche 2017-06-30 13:04:52 EDT
Description of problem:
In a three-node cluster configured with pause_minority, nodes 1 and 2 become network-isolated from each other; however, both nodes can still talk to node 3.

Nodes 1 and 2 both stop because of the pause_minority config, and node 3 appears to crash. As a result, the RabbitMQ cluster is unavailable.
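
To make the topology concrete, here is a minimal Python sketch of the situation described above. It is not RabbitMQ's actual code: the names `LINKS`, `can_see`, `partial_partition_witnesses` and `pauses` are hypothetical, and the pause rule is only an assumption read off the "Partial partition detected" log messages in this report. It models just the first step (which nodes decide to pause) and says nothing about why node 3 crashed.

```python
# Hypothetical model of the reported topology; not RabbitMQ's implementation.
NODES = ["node1", "node2", "node3"]

# Links that are still up. The node1 <-> node2 link is the one that is cut.
LINKS = {frozenset(("node1", "node3")), frozenset(("node2", "node3"))}


def can_see(a, b):
    """Return True if nodes a and b still have a working link."""
    return frozenset((a, b)) in LINKS


def partial_partition_witnesses(node):
    """Return (lost_peer, witness) pairs for `node`.

    A pair means: `node` lost contact with `lost_peer`, but still sees
    `witness`, and `witness` still sees `lost_peer`.
    """
    pairs = []
    for lost in NODES:
        if lost == node or can_see(node, lost):
            continue
        for witness in NODES:
            if witness in (node, lost):
                continue
            if can_see(node, witness) and can_see(witness, lost):
                pairs.append((lost, witness))
    return pairs


def pauses(node):
    """Assumed rule, taken from the log text: with pause_minority enabled,
    a node that detects a partial partition pauses itself."""
    return bool(partial_partition_witnesses(node))


paused = [n for n in NODES if pauses(n)]
print(paused)
```

Under these assumptions nodes 1 and 2 each find node 3 as a witness and pause, while node 3 itself detects no lost peer at this stage.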

Will provide full logs.

###########
From node1
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node2 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node2
 * We can still see rabbit@node3 which can see rabbit@node2
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node2
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node1 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node1
 * We can still see rabbit@node3 which can see rabbit@node1
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node3
###########

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it 

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node2 but still can communicate with it 

These are followed by 150+ "Generic server terminating" messages, for example:

=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14877.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22412.0> terminating
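
When the full logs are available, a small script could help quantify the burst of terminations on node 3. The following is a rough Python sketch that assumes the report header and "Generic server ... terminating" line formats are exactly as shown in the excerpts above; the function name `summarize` and the `SAMPLE` text are illustrative only.

```python
import re
from collections import Counter

# Report header, e.g. "=ERROR REPORT==== 28-Jun-2017::13:41:20 ==="
HEADER = re.compile(r"^=(\w+) REPORT==== (\S+) ===\s*$")
# Termination line, e.g. "** Generic server <0.14879.0> terminating"
TERMINATING = re.compile(r"^\*\* Generic server (<[\d.]+>) terminating")


def summarize(log_text):
    """Count report headers per level and collect terminated process ids."""
    levels = Counter()
    terminated = []
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            levels[header.group(1)] += 1
            continue
        term = TERMINATING.match(line)
        if term:
            terminated.append(term.group(1))
    return levels, terminated


# Sample lines copied from the node3 excerpt in this report.
SAMPLE = """\
=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it

=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
"""

levels, terminated = summarize(SAMPLE)
print(dict(levels), terminated)
```

Running `summarize` over the complete node3 log would give the exact number of terminated servers instead of the "150+" estimate.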

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-6.el7ost.noarch

How reproducible:
Unknown; it has happened twice in this specific environment.

Steps to Reproduce:
1. Unknown

Actual results:
The cluster crashes and RabbitMQ is unavailable.

Expected results:
A degraded but available RabbitMQ service.

Additional info:
Will provide full logs and sosreports
