Bug 1356169 - RabbitMQ gets stuck for a few seconds if another node fails.
Summary: RabbitMQ gets stuck for a few seconds if another node fails.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Peter Lemenkov
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On: 1300728
Blocks:
 
Reported: 2016-07-13 14:30 UTC by Peter Lemenkov
Modified: 2016-11-01 16:23 UTC
CC List: 11 users

Fixed In Version: rabbitmq-server-3.6.3-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-15 07:19:23 UTC
Target Upstream Version:
Embargoed:




Links:
* GitHub rabbitmq/rabbitmq-server pull 894 (closed): "Don't die in case of faulty node" - last updated 2020-06-04 08:24:56 UTC
* Red Hat Product Errata RHEA-2016:1597 (SHIPPED_LIVE): Red Hat OpenStack Platform 9 Release Candidate Advisory - last updated 2016-08-11 16:06:52 UTC

Internal Links: 1300728

Description Peter Lemenkov 2016-07-13 14:30:18 UTC
Description of problem:

1. Run on node-0:

# watch -n 0.2 --no-title "date | tee --append out.txt  &&  rabbitmqctl cluster_status 2>&1 | tee --append out.txt"

2. Run on node-2

echo b > /proc/sysrq-trigger

3. You'll soon see the following message on node-0 (it appears for just a couple of seconds):

Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@overcloud-controller-0']

rabbit@overcloud-controller-0:
  * connected to epmd (port 4369) on overcloud-controller-0
  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

current node details:
- node name: 'rabbitmq-cli-31@overcloud-controller-0'
- home dir: /var/lib/rabbitmq
- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==


How reproducible: always


Although it lasts just a few seconds, that is enough for pacemaker to conclude that monitoring of the local RabbitMQ node failed:

Failed Actions:
* rabbitmq_monitor_10000 on overcloud-controller-1 'not running' (7): call=1469, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 14:08:54 2016', queued=0ms, exec=0ms
* rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=980, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 13:51:18 2016', queued=0ms, exec=0ms

Comment 2 Marian Krcmarik 2016-07-13 15:03:39 UTC
I would like to add some of the consequences this has: when a rabbitmq monitor fails on a host which was not actually reset (i.e. not the one in failover), pacemaker assumes that rabbitmq-server is broken on that host and starts stopping other resources on the node, most likely due to the pacemaker constraints set on the rabbitmq resource. If such a failed monitor happens on all live nodes, the cluster ends up with stopped resources on all nodes (I saw that happen when services like httpd were stopped on all controller nodes even though 2 of 3 nodes were completely okay).

Comment 3 Peter Lemenkov 2016-07-18 14:13:35 UTC
Setting priority/severity to high - easily reproduced, causes visible damage.

Comment 4 Peter Lemenkov 2016-07-21 21:27:39 UTC
It seems that this issue isn't a new one. Fortunately (or unfortunately) we've seen this in the past. See bug 1300728.

Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Comment 5 Peter Lemenkov 2016-07-22 15:05:07 UTC
(In reply to Peter Lemenkov from comment #4)
> It seems that this issue isn't a new one. Fortunately (or unfortunately)
> we've seen this in the past. See bug 1300728.
> 
> Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Caught the offender. Working on a patch.

Comment 6 Peter Lemenkov 2016-07-22 16:39:35 UTC
Build provided. Marian, please test

Comment 7 Peter Lemenkov 2016-07-22 17:02:31 UTC
Here is the story behind this ticket. First of all, here is the offending commit:

https://github.com/rabbitmq/rabbitmq-server/commit/93b9e37

The issue is this added line: 

    [alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

RabbitMQ handles failed nodes this way:

0. Node failure occurs.
1. RabbitMQ removes the failed node from the DB cluster (Mnesia tables). Think of it as the "physical" representation.
2. RabbitMQ removes the failed node from the running nodes list. A higher-level "physical" representation.
3. RabbitMQ removes the failed node from the clustered nodes. The "logical" representation.

Obviously one shouldn't remove a node from the cluster just because the connection dropped for a second or so, so the latter operation can take a while (up to a few seconds). This means that the nodes_in_cluster(...) function will still return a list of nodes that includes the failed ones. Unfortunately, the alarms_by_node(...) function contacts nodes at their "physical" address and raises an exception (which is caught and translated into a cluster failure of the called node).
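
To make that window concrete, here is a generic Erlang sketch (not RabbitMQ source; cluster_members/0 is a hypothetical stand-in for the persisted "logical" membership) of what happens when every listed member is probed while one of them has just died:

    %% Illustration only, not RabbitMQ source. cluster_members/0 is a
    %% hypothetical stand-in for the persisted "logical" membership,
    %% which still lists the dead node for a few seconds, while nodes()
    %% (the live-connection view) has already dropped it.
    probe_all() ->
        [probe(Member) || Member <- cluster_members()].

    probe(Member) ->
        %% An RPC to a member that has just died fails; depending on the
        %% wrapper in use this surfaces as {badrpc, nodedown} or as an
        %% exception that aborts the whole comprehension.
        case rpc:call(Member, erlang, node, [], 5000) of
            {badrpc, Reason} -> {Member, {unreachable, Reason}};
            _NodeName        -> {Member, up}
        end.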

So we get a surprising result - we call the local node, it calls the remote node, gets an exception, catches it properly, rethrows it up the stack, and the user-space application (rabbitmqctl) believes that the local node failed.
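
For comparison, here is a minimal sketch of the defensive approach in the spirit of the linked pull request ("Don't die in case of faulty node"): catch the per-node failure so that a single unreachable member cannot fail the whole status call. alarms_by_node/1 and nodes_in_cluster/1 are the helpers named above; the wrapper and the {Name, []} fallback are assumptions for illustration, not the actual upstream patch.

    %% Sketch only, not the actual fix. alarms_by_node/1 and
    %% nodes_in_cluster/1 are the helpers referenced in the commit above.
    alarms(Node) ->
        [safe_alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

    safe_alarms_by_node(Name) ->
        try
            alarms_by_node(Name)
        catch
            %% A node that has just died still shows up in the cluster
            %% list; report "no alarms known" for it instead of letting
            %% the exception propagate and fail the status call for the
            %% local node.
            _:_ -> {Name, []}
        end.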

Funnily enough, rabbitmqctl then calls the "default" handler (under the "DIAGNOSTICS" label), which probes the supposedly failed (i.e. local) node and reports that everything is ok. See comment 1.

Surprisingly, we're observing the same pattern in bug 1356169. So there is at least one more race condition somewhere in the code.

Comment 8 Marian Krcmarik 2016-07-25 14:28:18 UTC
(In reply to Peter Lemenkov from comment #6)
> Build provided. Marian, please test

It seems to solve the problem; I am not able to reproduce it with the fixed build.

Comment 10 Udi Shkalim 2016-07-26 10:45:20 UTC
Verified based on comment #8

Comment 12 errata-xmlrpc 2016-08-15 07:19:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html

