Bug 1356169 - RabbitMQ gets stuck for a few seconds if another node fails.
Summary: RabbitMQ gets stuck for a few seconds if another node fails.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Peter Lemenkov
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On: 1300728
Blocks:
 
Reported: 2016-07-13 14:30 UTC by Peter Lemenkov
Modified: 2016-11-01 16:23 UTC
CC List: 11 users

Fixed In Version: rabbitmq-server-3.6.3-5.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-15 07:19:23 UTC
Target Upstream Version:
Embargoed:




Links:
* GitHub rabbitmq/rabbitmq-server pull 894 (closed): "Don't die in case of faulty node" - last updated 2020-06-04 08:24:56 UTC
* Red Hat Product Errata RHEA-2016:1597 (SHIPPED_LIVE): Red Hat OpenStack Platform 9 Release Candidate Advisory - last updated 2016-08-11 16:06:52 UTC

Internal Links: 1300728

Description Peter Lemenkov 2016-07-13 14:30:18 UTC
Description of problem:

1. Run on node-0:

# watch -n 0.2 --no-title "date | tee --append out.txt  &&  rabbitmqctl cluster_status 2>&1 | tee --append out.txt"

2. Run on node-2

echo b > /proc/sysrq-trigger

3. You'll soon see the following message on node-0 (it appears for just a couple of seconds):

Cluster status of node 'rabbit@overcloud-controller-0' ...
Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@overcloud-controller-0']

rabbit@overcloud-controller-0:
  * connected to epmd (port 4369) on overcloud-controller-0
  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

current node details:
- node name: 'rabbitmq-cli-31@overcloud-controller-0'
- home dir: /var/lib/rabbitmq
- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==


How reproducible: always


Although it lasts just a few seconds, that is enough for pacemaker to conclude that monitoring of the local RabbitMQ node failed:

Failed Actions:
* rabbitmq_monitor_10000 on overcloud-controller-1 'not running' (7): call=1469, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 14:08:54 2016', queued=0ms, exec=0ms
* rabbitmq_monitor_10000 on overcloud-controller-0 'not running' (7): call=980, status=complete, exitreason='none',
    last-rc-change='Wed Jul 13 13:51:18 2016', queued=0ms, exec=0ms

Comment 2 Marian Krcmarik 2016-07-13 15:03:39 UTC
I would like to add some of the consequences this has: when a rabbitmq monitor fails on a host which was not actually reset (i.e. not the one in failover), pacemaker assumes that rabbitmq-server is broken on that host and starts stopping other resources on the node, most likely due to the pacemaker constraints set on the rabbitmq resource. If such a failed monitor happens on all live nodes, the cluster ends up with stopped resources on all nodes (I saw that happen when services like httpd were stopped on all controller nodes even though 2 of 3 nodes were completely okay).

Comment 3 Peter Lemenkov 2016-07-18 14:13:35 UTC
Setting priority/severity to high - easily reproduced, causes visible damage.

Comment 4 Peter Lemenkov 2016-07-21 21:27:39 UTC
It seems that this issue isn't a new one. Fortunately (or unfortunately) we've seen this in the past. See bug 1300728.

Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Comment 5 Peter Lemenkov 2016-07-22 15:05:07 UTC
(In reply to Peter Lemenkov from comment #4)
> It seems that this issue isn't a new one. Fortunately (or unfortunately)
> we've seen this in the past. See bug 1300728.
> 
> Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Caught the offender. Working on a patch.

Comment 6 Peter Lemenkov 2016-07-22 16:39:35 UTC
Build provided. Marian, please test

Comment 7 Peter Lemenkov 2016-07-22 17:02:31 UTC
Here is the story behind this ticket. First of all, here is the offending commit:

https://github.com/rabbitmq/rabbitmq-server/commit/93b9e37

The issue is this added line: 

    [alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

RabbitMQ handles failed nodes this way:

0. Node failure occurs.
1. RabbitMQ removes the failed node from the DB cluster (Mnesia tables). Think of it as the "physical" representation.
2. RabbitMQ removes the failed node from the running nodes list. A higher-level "physical" representation.
3. RabbitMQ removes the failed node from the clustered nodes. The "logical" representation.

Obviously one shouldn't remove a node from the cluster just because the connection dropped for a second or so, so the latter operation can take a while (up to a few seconds). This means that the nodes_in_cluster(...) function will still return a list of nodes that includes the failed ones. Unfortunately, the alarms_by_node(...) function contacts nodes at their "physical" address and raises an exception (which is caught and translated into a cluster failure of the called node).
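
To make that window concrete, here is a generic Erlang sketch (not RabbitMQ source; cluster_members/0 is a hypothetical stand-in for the persisted "logical" membership) of what happens when every listed member is probed while one of them has just died:

    %% Illustration only, not RabbitMQ source. cluster_members/0 is a
    %% hypothetical stand-in for the persisted "logical" membership,
    %% which still lists the dead node for a few seconds, while nodes()
    %% (the live-connection view) has already dropped it.
    probe_all() ->
        [probe(Member) || Member <- cluster_members()].

    probe(Member) ->
        %% An RPC to a member that has just died fails; depending on the
        %% wrapper in use this surfaces as {badrpc, nodedown} or as an
        %% exception that aborts the whole comprehension.
        case rpc:call(Member, erlang, node, [], 5000) of
            {badrpc, Reason} -> {Member, {unreachable, Reason}};
            _NodeName        -> {Member, up}
        end.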

So we get a surprising result - we call the local node, it calls the remote node, gets an exception, catches it properly, rethrows it up the stack, and the user-space application (rabbitmqctl) believes that the local node failed.
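
For comparison, here is a minimal sketch of the defensive approach in the spirit of the linked pull request ("Don't die in case of faulty node"): catch the per-node failure so that a single unreachable member cannot fail the whole status call. alarms_by_node/1 and nodes_in_cluster/1 are the helpers named above; the wrapper and the {Name, []} fallback are assumptions for illustration, not the actual upstream patch.

    %% Sketch only, not the actual fix. alarms_by_node/1 and
    %% nodes_in_cluster/1 are the helpers referenced in the commit above.
    alarms(Node) ->
        [safe_alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

    safe_alarms_by_node(Name) ->
        try
            alarms_by_node(Name)
        catch
            %% A node that has just died still shows up in the cluster
            %% list; report "no alarms known" for it instead of letting
            %% the exception propagate and fail the status call for the
            %% local node.
            _:_ -> {Name, []}
        end.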

Funnily enough, rabbitmqctl then calls the "default" handler (under the "DIAGNOSTICS" label), which probes the supposedly failed (i.e. local) node and reports that everything is ok. See comment 1.

Surprisingly, we're observing the same pattern in bug 1356169. So there is at least one more race condition somewhere in the code.

Comment 8 Marian Krcmarik 2016-07-25 14:28:18 UTC
(In reply to Peter Lemenkov from comment #6)
> Build provided. Marian, please test

It seems to solve the problem; I am not able to reproduce it with the fixed build.

Comment 10 Udi Shkalim 2016-07-26 10:45:20 UTC
Verified based on comment #8

Comment 12 errata-xmlrpc 2016-08-15 07:19:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html

