Bug 1356169
| Summary: | RabbitMQ got stuck for a few seconds if some other node fails. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Peter Lemenkov <plemenko> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED ERRATA | QA Contact: | Marian Krcmarik <mkrcmari> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 9.0 (Mitaka) | CC: | apevec, fdinitto, jeckersb, jjoyce, lhh, michele, mkrcmari, oblaut, royoung, srevivo, ushkalim |
| Target Milestone: | ga | Keywords: | AutomationBlocker |
| Target Release: | 9.0 (Mitaka) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rabbitmq-server-3.6.3-5.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-15 07:19:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1300728 | | |
| Bug Blocks: | | | |
Description
Peter Lemenkov
2016-07-13 14:30:18 UTC
I would like to add some consequences of this issue: when a rabbitmq monitor fails on a host which was not actually reset (not in failover), Pacemaker assumes that rabbitmq-server is broken on that host and starts stopping other resources on the node, most likely due to the constraints set on the rabbitmq resource. If such a failed monitor happens on all live nodes, the cluster ends up with stopped resources on all nodes (I saw that happen when services like httpd were stopped on all controller nodes even though 2 of 3 nodes were completely okay). Set priority/severity to high - easily reproduced, makes visible damage.

It seems that this issue isn't a new one. Fortunately (or unfortunately) we've seen this in the past. See bug 1300728.

Thanks to Marian, I've got a reproducer, so expect a fix soon (a few days).

(In reply to Peter Lemenkov from comment #4)
> It seems that this issue isn't a new one. Fortunately (or unfortunately)
> we've seen this in the past. See bug 1300728.
>
> Thanks to Marian, I've got a reproducer, so expect a fix soon (few days).

Caught the offender. Working on a patch.

Build provided. Marian, please test.

Here is the story behind this ticket. First of all, here is the offending commit: https://github.com/rabbitmq/rabbitmq-server/commit/93b9e37

The issue is this added line:

    [alarms_by_node(Name) || Name <- nodes_in_cluster(Node)].

RabbitMQ handles failed nodes this way:

0. Node failure occurs.
1. RabbitMQ removes the failed node from the DB cluster (Mnesia tables). Think of this as the "physical" representation.
2. RabbitMQ removes the failed node from the running nodes list. A more high-level "physical" representation.
3. RabbitMQ removes the failed node from the clustered nodes. The "logical" representation.

Obviously a node shouldn't be removed from the cluster just because its connection dropped for a second or so, so the last operation can take a while (up to a few seconds). This means that the nodes_in_cluster(...) function will still return a list of nodes that includes the failed ones. Unfortunately, the alarms_by_node(...) function calls nodes by their "physical" address and raises an exception (which is caught and translated into a cluster failure of the called node).

So we get a surprising result: we call the local node, it calls the remote node, gets an exception, catches it properly, throws it back up, and the user-space application (rabbitmqctl) believes that the local node failed. Funny, but later rabbitmqctl calls the "default" handler (under the "DIAGNOSTIC" label), which probes the supposedly failed (i.e. local) node and reports that everything is ok. See comment 1.

Surprisingly, we're observing the same pattern in bug 1356169, so there is at least one more race condition somewhere in the code.

(In reply to Peter Lemenkov from comment #6)
> Build provided. Marian, please test

It seems to solve the problem; I am not able to reproduce it with the fixed build.

Verified based on comment #8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1597.html
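
To make the race easier to see, here is a minimal, self-contained Erlang sketch of the pattern described in the comments above. The module name and function bodies are hypothetical stand-ins written for illustration, not RabbitMQ's actual implementation; only the shape of the offending list comprehension comes from the commit discussed above.

```erlang
%% Sketch of the race described in this ticket (hypothetical code, not RabbitMQ's).
-module(alarm_race_sketch).
-export([status/0]).

%% Stand-in for the "logical" cluster membership list: it lags behind the
%% "physical" node removal, so a freshly failed node can still appear here
%% for a few seconds after it went down.
nodes_in_cluster() ->
    [node() | nodes()] ++ ['rabbit@failed-node'].   %% hypothetical stale entry

%% Stand-in for alarms_by_node/1: an RPC to an unreachable node comes back
%% as {badrpc, nodedown}, and turning that into an exception lets the failure
%% bubble up through the caller on the healthy local node.
alarms_by_node(Name) ->
    case rpc:call(Name, erlang, node, []) of
        {badrpc, Reason} -> error({node_down, Name, Reason});
        _Other           -> {Name, []}
    end.

%% The offending pattern: the comprehension evaluates alarms_by_node/1 for
%% every node still listed in the cluster, so one stale entry fails the
%% whole status call even though the local node is perfectly healthy.
status() ->
    [alarms_by_node(Name) || Name <- nodes_in_cluster()].
```

Calling alarm_race_sketch:status() while the stale entry is still listed raises {node_down, 'rabbit@failed-node', nodedown}, so a caller in the position of rabbitmqctl sees its request against the healthy local node fail, which matches the behaviour reported above.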