Bug 1343027
Summary: | RabbitMQ refuses to start after a hard reset (using IPMI) of a controller. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Leonid Natapov <lnatapov> |
Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 9.0 (Mitaka) | CC: | apevec, fdinitto, jeckersb, jjoyce, lhh, lnatapov, michele, mkrcmari, oblaut, plemenko, sclewis, srevivo |
Target Milestone: | ga | Keywords: | AutomationBlocker, Regression |
Target Release: | 9.0 (Mitaka) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | rabbitmq-server-3.6.3-5.el7os,resource-agents-3.9.5-54.el7_2.16 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-08-11 12:24:14 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1343905 | ||
Bug Blocks: | 1309684 |
Description
Leonid Natapov
2016-06-06 11:13:03 UTC
Status report - we believe this issue might be related to another one we're working on right now (see bug 1343905). As soon as we deliver a fix for it, we will ask you to re-test. Leonid, please, test this package: resource-agents-3.9.5-76.el7 (In reply to Peter Lemenkov from comment #2) > Leonid, please, test this package: > > resource-agents-3.9.5-76.el7 I deployed latest ospd9 and updated the resource agent packages to the resource-agents-3.9.5-76.el7. Marian should run automation sanity on this deployment to see if it's reproduces. Will update as soon as I have results. OK,I have tested it manually and it looks like issue reproduces. The scenario is the following: Controller-0 hard boot using IPMI Controller-0 comes up but it takes some time to it to start all the pcs resources. While all the resources has not been yet started controller-1 hard rebooted using IPMI. So,from what I have noticed rabbit can join the rabbitmq cluster if the controller-0 was rebooted,came up,started all the resources and only when everything works controller-1 goes for the reboot. In other words everything should work when rebooting one of the controllers. If,for some reason,controller-1 reboots while controller-0 not fully recovered after its reboot - things go wrong. This build should fix the issue. Leonid, please test. The problem was that we've agged a regression with one of our patches merged into 3.6.1 which caused rabbitmq to fail during slave->master promotion. When you shut down two nodes the remaining one (slave) had to declare itself as a master. Unfortnately it failed to do so, so two starting nodes cannot re-sync with anyone. I've backported a patch which should fix this. (In reply to Peter Lemenkov from comment #6) > This build should fix the issue. Leonid, please test. > > The problem was that we've agged a regression with one of our patches merged > into 3.6.1 which caused rabbitmq to fail during slave->master promotion. > > When you shut down two nodes the remaining one (slave) had to declare itself > as a master. Unfortnately it failed to do so, so two starting nodes cannot > re-sync with anyone. I've backported a patch which should fix this. Re-tested with the updated rabbitmq server (rabbitmq-server-3.6.2-4.el7ost.noarch), Still reproduces. Bug still reproduces with the rabbitmq-server-3.6.2-4.el7ost.noarch and resource-agents-3.9.5-76.el7 This issue is likely the same as bug 1343905. As soon as we address that ticket we'll return to this one with proposed fix. Leonid, could you please retest with the latest rabbitmq-server and resource-agents packages? You need to deploy * resource-agents-3.9.5-80.el7 or resource-agents-3.9.5-54.el7_2.16 * rabbitmq-server-3.6.3-5.el7ost Tested with rabbitmq-server-3.6.3-5.el7os and resource-agents-3.9.5-54.el7_2.16 Tested as described in the description. One controllers was hard rebooted using IPMI and after it came up another controller was rebooted. After all controllers were UP rabbitmq was running on all of them. pcs resource cleanup rabbitmq was required. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1597.html |