1466896 – Rabbitmq - three node cluster crash after two nodes become network isolated from each other

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rabbitmq-server
Sub Component:
Version:	10.0 (Newton)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Peter Lemenkov
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-06-30 17:04 UTC by Matt Flusche
Modified:	2022-08-09 14:02 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-21 09:14:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	rabbitmq rabbitmq-server issues 959	0	'None'	closed	GM - crash in calculate_activity	2020-07-29 05:40:11 UTC
Red Hat Issue Tracker	OSP-8588	0	None	None	None	2022-08-09 14:02:34 UTC

Internal Links: 1574250 1602820 1612815

Description Matt Flusche 2017-06-30 17:04:52 UTC

Description of problem:
In a three-node cluster configured with pause_minority, nodes 1 and 2 became network isolated from each other; however, both nodes could still talk to node 3.

The resulting behavior is that nodes 1 and 2 both stopped because of the pause_minority config, and node 3 appears to have crashed. The RabbitMQ cluster became unavailable.
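
The decision that nodes 1 and 2 log below ("We saw DOWN from ... We can still see rabbit@node3 which can see ...") can be modelled roughly as follows. This is only a sketch of the behavior as it appears in the logs, not RabbitMQ's actual implementation; the helper names `make_links`, `can_see` and `decide` are hypothetical, and the model covers only the first decision each node makes after the link between node1 and node2 breaks.

```python
# Hypothetical model of the scenario in this report; not RabbitMQ's code.
from itertools import combinations

NODES = ["node1", "node2", "node3"]


def make_links(nodes, broken):
    """Return the set of working bidirectional links, minus the broken ones."""
    links = {frozenset(pair) for pair in combinations(nodes, 2)}
    return links - {frozenset(pair) for pair in broken}


def can_see(links, a, b):
    """True if the link between a and b is working."""
    return frozenset((a, b)) in links


def decide(node, nodes, links):
    """Model what `node` does with pause_minority enabled (assumed logic)."""
    peers = [n for n in nodes if n != node]
    down = [p for p in peers if not can_see(links, node, p)]
    up = [p for p in peers if can_see(links, node, p)]
    # Partial partition: a peer looks down to us, but a peer we can
    # still reach is able to reach it.
    for lost in down:
        if any(can_see(links, witness, lost) for witness in up):
            return "pause (partial partition)"
    # Plain minority: we see at most half of the cluster, counting ourselves.
    if (len(up) + 1) * 2 <= len(nodes):
        return "pause (minority)"
    return "run"


# Only the node1 <-> node2 link is broken, as in this report.
links = make_links(NODES, broken=[("node1", "node2")])
for n in NODES:
    print(n, "->", decide(n, NODES, links))
```

In this model node1 and node2 both pause and node3 keeps running, which corresponds to the expected result stated in this report rather than to the crash that was observed on node3.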

Will provide full logs.

###########
From node1
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node2 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node2
 * We can still see rabbit@node3 which can see rabbit@node2
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node2
###########

=INFO REPORT==== 28-Jun-2017::13:41:17 ===
rabbit on node rabbit@node1 down


=ERROR REPORT==== 28-Jun-2017::13:41:21 ===
Partial partition detected:
 * We saw DOWN from rabbit@node1
 * We can still see rabbit@node3 which can see rabbit@node1
 * pause_minority mode enabled
We will therefore pause until the *entire* cluster recovers

=WARNING REPORT==== 28-Jun-2017::13:41:21 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 28-Jun-2017::13:41:21 ===
Stopping RabbitMQ


###########
From node3
###########

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it 

=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node2 but still can communicate with it 

These were followed by more than 150 "Generic server ... terminating" error reports, for example:

=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14877.0> terminating
--
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22412.0> terminating
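
For triage, the number of such reports can be counted directly from a log file. The following is a minimal sketch that assumes the report header format seen in the excerpts in this report (`=LEVEL REPORT==== timestamp ===`); the function name `summarize` and the sample text are illustrative only.

```python
import re
from collections import Counter

# Header format of the reports quoted in this bug, e.g.
# "=ERROR REPORT==== 28-Jun-2017::13:41:20 ==="
HEADER = re.compile(r"^=(\w+) REPORT==== (\S+) ===$")


def summarize(log_text):
    """Count reports per level and count 'Generic server ... terminating' lines."""
    levels = Counter()
    terminating = 0
    for line in log_text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            levels[match.group(1)] += 1
        elif line.startswith("** Generic server") and line.endswith("terminating"):
            terminating += 1
    return levels, terminating


# Small sample built from lines quoted in this report.
sample = """\
=WARNING REPORT==== 28-Jun-2017::13:41:18 ===
Received a 'DOWN' message from rabbit@node1 but still can communicate with it
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.14879.0> terminating
=ERROR REPORT==== 28-Jun-2017::13:41:20 ===
** Generic server <0.22414.0> terminating
"""
levels, terminating = summarize(sample)
print(dict(levels), terminating)
```

To run it against a real log, read the file contents into a string and pass that string to `summarize`.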

Version-Release number of selected component (if applicable):
rabbitmq-server-3.6.3-6.el7ost.noarch

How reproducible:
Unknown; it has happened twice in this specific environment.

Steps to Reproduce:
1. Unknown

Actual results:
The cluster crashes and RabbitMQ is unavailable.

Expected results:
A degraded but available RabbitMQ service.

Additional info:
Will provide full logs and sosreports

Comment 6 Andrew Beekhof 2018-05-21 09:14:11 UTC

Based on the last comment and the lack of further feedback from the customer, it sounds like we should close this as "can't fix".