Bug 1656368
| Field | Value |
| --- | --- |
| Summary | rabbitmq-cluster: regression when restarting inside a bundle |
| Product | Red Hat Enterprise Linux 7 |
| Component | resource-agents |
| Version | 7.6 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Michele Baldessari <michele> |
| Assignee | Oyvind Albrigtsen <oalbrigt> |
| QA Contact | pkomarov |
| CC | abeekhof, agk, agurenko, aherr, astupnik, cfeist, chjones, cluster-maint, dalvarez, dciabrin, fdinitto, jeckersb, michele, mlisik, phagara, pkomarov, plemenko, salmy, sasha, sbradley |
| Target Milestone | rc |
| Keywords | Triaged, ZStream |
| Fixed In Version | resource-agents-4.1.1-15.el7 |
| Doc Type | If docs needed, set a value |
| Doc Text | When a containerized RabbitMQ cluster was stopped entirely but the containers were not stopped, the RabbitMQ resource agent failed to update the Pacemaker view of the RabbitMQ cluster. Consequently, RabbitMQ servers failed to restart the cluster. With this update, the RabbitMQ resource agent cleans up cluster attributes on RabbitMQ shutdown, and, as a result, the described problem no longer occurs. |
| Clones | 1657138 (view as bug list) |
| Bug Blocks | 1637626, 1657138 |
| Last Closed | 2019-08-06 12:01:38 UTC |
| Type | Bug |
Description (Michele Baldessari, 2018-12-05 10:50:07 UTC)

Damien and I ran this down this morning. We discovered a few places where the stop action might not remove the rabbitmq node attribute from pacemaker. So what ends up happening is:

- a change to the ocf resource triggers a restart
- nodes 3 and 2 stop, but do *not* delete their attribute
- node 1 errors out in some fashion [1] during monitor/notify/stop, and its node attribute *is* deleted and the service stopped
- node 1 starts back up but attempts to join a cluster with nodes 2 and 3, because their attributes are still present; this fails, and so the cluster does not bootstrap properly

I will submit a PR with the two minor tweaks we made that seem to address this. (A sketch of the attribute-handling pattern involved follows the comments below.)

[1] When you start trying to do too much in the middle of a failover, the exact results are less than predictable. What *is* important is that the node gets marked as down.

Another note: I think this may only be a problem with bundles. The attributes have a "reboot" lifetime, so in the non-bundle case stopping the resource may be enough to cause the attributes to be cleaned up. With bundles, however, the resource stop only stops the service inside the bundle; the bundle itself stays up the entire time, so the attribute remains.

*** Bug 1655764 has been marked as a duplicate of this bug. ***

Hi folks, do you think this BZ [0] could be a duplicate?

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1661806

Verified; tested in: https://bugzilla.redhat.com/show_bug.cgi?id=1657138#c3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2012
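For readers unfamiliar with the mechanism, here is a minimal shell sketch of the attribute-handling pattern described in the description above, assuming an OCF-style agent that tracks cluster membership via a Pacemaker node attribute with "reboot" lifetime. The helper names, the attribute name, and the rmq_monitor function are illustrative stand-ins modeled on the rabbitmq-cluster agent; this is not the actual patch.

```sh
# Minimal sketch, not the shipped agent code. Assumes Pacemaker's
# crm_attribute/crm_node CLIs are available and that an OCF environment
# (hypothetically) provides OCF_SUCCESS and a working rmq_monitor.

RMQ_ATTR="rmq-node-attr-rabbitmq"   # illustrative; the real agent derives
                                    # this from OCF_RESOURCE_INSTANCE
NODENAME="$(crm_node -n)"

rmq_write_nodename() {
    # Advertise this node's Rabbit name with "reboot" lifetime. Outside a
    # bundle the attribute dies with the node; inside a bundle it lives as
    # long as the *container*, which outlives a plain resource stop.
    crm_attribute -N "$NODENAME" -l reboot --name "$RMQ_ATTR" \
        -v "rabbit@${NODENAME}"
}

rmq_delete_nodename() {
    # Drop the attribute so peers stop treating this node as a member.
    crm_attribute -N "$NODENAME" -l reboot --name "$RMQ_ATTR" -D
}

rmq_stop() {
    # Delete the attribute unconditionally, before any early return.
    # Deleting it only on the successful-stop path is the kind of gap that
    # leaves stale attributes behind and breaks the next cluster bootstrap.
    rmq_delete_nodename

    rmq_monitor || return $OCF_SUCCESS   # already stopped; attr now clean too
    rabbitmqctl stop
}
```

The point of the ordering is that the attribute deletion happens before the "already stopped" early return, so the attribute is cleaned up even when the server inside the bundle is already down and the stop action otherwise has nothing to do.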