Bug 1356169

Summary: RabbitMQ gets stuck for a few seconds if another node fails.
Product: Red Hat OpenStack                Reporter: Peter Lemenkov <plemenko>
Component: rabbitmq-server                Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA                     QA Contact: Marian Krcmarik <mkrcmari>
Severity: high                            Docs Contact:
Priority: high
Version: 9.0 (Mitaka)                     CC: apevec, fdinitto, jeckersb, jjoyce, lhh, michele, mkrcmari, oblaut, royoung, srevivo, ushkalim
Target Milestone: ga                      Keywords: AutomationBlocker
Target Release: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rabbitmq-server-3.6.3-5.el7ost    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:                                 Environment:
Last Closed: 2016-08-15 07:19:23 UTC      Type: Bug
Regression: ---                           Mount Type: ---
Documentation: ---                        CRM:
Verified Versions:                        Category: ---
oVirt Team: ---                           RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                      Target Upstream Version:
Embargoed:
Bug Depends On: 1300728
Bug Blocks:

Description Peter Lemenkov 2016-07-13 14:30:18 UTC
Description of problem:

1. Run on node-0:

# watch -n 0.2 --no-title "date | tee --append out.txt  &&  rabbitmqctl cluster_status 2>&1 | tee --append out.txt"

2. Run on node-2 (sysrq 'b' reboots the node immediately, without syncing or unmounting filesystems):

# echo b > /proc/sysrq-trigger

3. Soon you'll see the following message on node-0 (it lasts for just a couple of seconds; a quick way to measure the window is sketched after the output below):

Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@overcloud-controller-0']

rabbit@overcloud-controller-0:
  * connected to epmd (port 4369) on overcloud-controller-0
  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

current node details:
- node name: 'rabbitmq-cli-31@overcloud-controller-0'
- home dir: /var/lib/rabbitmq
- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==
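
Since the failure window is short, it helps to quantify it. A small sketch, assuming the watch loop from step 1 is still appending to out.txt on node-0:

# grep -c nodedown out.txt
# grep -B2 nodedown out.txt

The first command counts how many of the 0.2-second polls failed; in the second, the date line printed a couple of lines above each match shows when the window opened and closed.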


How reproducible: always


Although it lasts just a few seconds, that is enough for Pacemaker to conclude that the monitor of the local RabbitMQ resource failed:

Failed Actions:
* rabbitmq_monitor_10000 on overcloud-controller-1 'not running' (7): call=1469, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 14:08:54 2016', queued=0ms, exec=0ms
* rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=980, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 13:51:18 2016', queued=0ms, exec=0ms
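
For completeness, the fail count that Pacemaker records for the monitor can be inspected and, once the transient error has passed, cleared again. A sketch, assuming the resource is named rabbitmq as in the log above:

# pcs resource failcount show rabbitmq
# pcs resource cleanup rabbitmq

The first command shows how many monitor failures Pacemaker has accumulated per node; the second clears the failed actions so the transient glitch doesn't keep affecting resource placement.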

Comment 2 Marian Krcmarik 2016-07-13 15:03:39 UTC
I would like to add some of the consequences this has: when the rabbitmq monitor fails on a host that was not actually reset (i.e. not the one in failover), Pacemaker assumes that rabbitmq-server is broken on that host and starts stopping other resources on the node, most likely because of the Pacemaker constraints set on the rabbitmq resource. If such a failed monitor happens on all live nodes, the cluster ends up with stopped resources on all nodes (I have seen that happen: services such as httpd were stopped on all controller nodes even though 2 of the 3 nodes were completely okay).

Comment 3 Peter Lemenkov 2016-07-18 14:13:35 UTC
Set priority/severity to high: easily reproduced, and it causes visible damage.

Comment 4 Peter Lemenkov 2016-07-21 21:27:39 UTC
It seems that this issue isn't a new one. Fortunately (or unfortunately) we've seen this in the past. See bug 1300728.

Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Comment 5 Peter Lemenkov 2016-07-22 15:05:07 UTC
(In reply to Peter Lemenkov from comment #4)
> It seems that this issue isn't a new one. Fortunately (or unfortunately)
> we've seen this in the past. See bug 1300728.
> 
> Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Caught the offender. Working on a patch.

Comment 6 Peter Lemenkov 2016-07-22 16:39:35 UTC
Build provided. Marian, please test

Comment 7 Peter Lemenkov 2016-07-22 17:02:31 UTC
Here is the story behind this ticket. First of all, here is the offending commit:

https://github.com/rabbitmq/rabbitmq-server/commit/93b9e37

The issue is this added line: 

    [alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

RabbitMQ handles failed nodes this way:

0. Node failure occurs.
1. RabbitMQ removes the failed node from the DB cluster (Mnesia tables). Think of it as the "physical" representation.
2. RabbitMQ removes the failed node from the running-nodes list. A somewhat higher-level "physical" representation.
3. RabbitMQ removes the failed node from the clustered-nodes list. The "logical" representation.

Obviously one shouldn't remove a node from the cluster just because its connection dropped for a second or so, so the last operation can take a while (up to a few seconds). This means that the nodes_in_cluster(...) function will still return a list of nodes that includes the failed ones. Unfortunately, the alarms_by_node(...) function calls nodes at their "physical" address and raises an exception (which is caught and translated into a cluster failure of the called node).
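
The window is easy to observe from the shell. A hedged sketch (I'm assuming nodes_in_cluster(...) and alarms_by_node(...) boil down to rabbit_mnesia:cluster_nodes/1 and an RPC to rabbit_alarm:get_alarms/0; the expressions below call those lower-level functions directly):

# rabbitmqctl eval 'rabbit_mnesia:cluster_nodes(all) -- rabbit_mnesia:cluster_nodes(running).'
# rabbitmqctl eval '[{N, rpc:call(N, rabbit_alarm, get_alarms, [])} || N <- rabbit_mnesia:cluster_nodes(running)].'

During the few seconds after the crash the first expression prints the node that is still a cluster member but no longer running. The second illustrates one tolerant way of collecting the alarms: only the running nodes are called, and rpc:call/4 returns {badrpc, Reason} instead of raising, so a node disappearing mid-call can't be mistaken for a failure of the local node. (This is just an illustration of the idea, not the shipped patch.)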

So we get a surprising result: we call the local node, it calls a remote node and gets an exception, catches it properly, throws the exception back up, and the user-space application (rabbitmqctl) believes that the local node has failed.

Funnily enough, rabbitmqctl then calls the "default" handler (under the "DIAGNOSTICS" label), which probes the supposedly failed (i.e. local) node and reports that everything is OK. See comment 1.

Surprisingly, we're observing the same pattern in bug 1356169, so there is at least one more race condition somewhere in the code.

Comment 8 Marian Krcmarik 2016-07-25 14:28:18 UTC
(In reply to Peter Lemenkov from comment #6)
> Build provided. Marian, please test

It seems to solve the problem; I am not able to reproduce it with the fixed build.

Comment 10 Udi Shkalim 2016-07-26 10:45:20 UTC
Verified based on comment #8

Comment 12 errata-xmlrpc 2016-08-15 07:19:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html