Bug 1595753
| Summary: | [UPGRADES][10] RMQ resource-agent should handle stopped node | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Yurii Prokulevych <yprokule> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | pkomarov |
| Severity: | urgent | Docs Contact: | Marek Suchánek <msuchane> |
| Priority: | urgent | Version: | 7.5 |
| Target Milestone: | pre-dev-freeze | Target Release: | 7.7 |
| Hardware: | Unspecified | OS: | Unspecified |
| Keywords: | ReleaseNotes, Triaged, ZStream | Whiteboard: | GSSApproved |
| Type: | Bug | Last Closed: | 2019-03-21 11:57:59 UTC |
| Clones: | 1641944, 1641945 (view as bug list) | Bug Blocks: | 1639826, 1641944, 1641945 |
| CC: | agk, aherr, apevec, augol, ccamacho, cchen, cfeist, cluster-maint, ctowsley, fdinitto, jeckersb, lhh, lmiccini, michele, mkrcmari, morazi, oalbrigt, pkomarov, sbradley, sgolovat, srevivo, toneata, yprokule | | |

Doc Text:

Previously, the `rabbitmqctl cluster_status` command read the cached cluster status from disk and returned 0 when the mnesia service was not running, for example after `rabbitmqctl stop_app` had been called, or after the service paused during a partition due to the pause_minority strategy. As a consequence, RabbitMQ could report stale status from disk while the rabbit app was stopped. With this update, the resource agent reads the cluster status from mnesia during the monitor action. As a result, the described problem no longer occurs.
Description
Yurii Prokulevych 2018-06-27 13:27:40 UTC
Here's what's going on here... We changed the clustering port out of the ephemeral range (35672 -> 25672): https://review.openstack.org/#/c/345851/6/puppet/services/rabbitmq.yaml

So rabbitmq is originally up and bound to 35672. The upgrade runs, and the firewall rules change so 35672 closes and 25672 opens. On each controller, when 35672 closes, the existing clustered connection dies. So the pause_minority code kicks in and the rabbit app stops. This happens on all three controllers. Each is paused waiting for the others to return (they never do). You can see this in the status because the rabbit app is not running:

```
[root@controller-0 ~]# rabbitmqctl status
Status of node 'rabbit@controller-0' ...
[{pid,20759},
 {running_applications,[{xmerl,"XML parser","1.3.10"},
                        {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
                        {sasl,"SASL CXC 138 11","2.7"},
                        {stdlib,"ERTS CXC 138 10","2.8"},
                        {kernel,"ERTS CXC 138 10","4.2"}]},
[...snip...]
```

Also, and more importantly, the resource agent health check is buggy in this situation. It reports that the service is up:

```
[root@controller-0 ~]# pcs status | grep -A1 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
```

That is because the health check just runs cluster_status, which returns success:

```
[root@controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0' ...
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-1',
                'rabbit@controller-2']}]},
 {alarms,[]}]
[root@controller-0 ~]# echo $?
0
```

However, this isn't really accurate when the node is paused. When mnesia is stopped, cluster_status reads from the cached status on disk:

```
[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/cluster_nodes.config
{['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2'],['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2']}.
[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/nodes_running_at_shutdown
['rabbit@controller-0'].
```

But if you force it to read from mnesia, you can see things are not ok:

```
[root@controller-0 ~]# rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'
{error,mnesia_not_running}
```

So this is not really a doc bug; we should fix the resource agent to handle this better.

*** Bug 1628524 has been marked as a duplicate of this bug. ***
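The proposed fix boils down to querying mnesia for the live cluster view instead of trusting the cached cluster_status output. A minimal sketch of that idea, not the actual rabbitmq-cluster agent code: `rmq_monitor_sketch` and the `RABBITMQCTL` override variable are hypothetical names for illustration.

```shell
# Sketch of a monitor check that asks the running Erlang VM for the live
# mnesia cluster status, rather than the cached cluster_status output.
rmq_monitor_sketch() {
    local status
    # cluster_status_from_mnesia returns {error,mnesia_not_running} when
    # the rabbit app is stopped or paused (e.g. by pause_minority),
    # instead of the stale on-disk cache that cluster_status falls back to.
    status=$("${RABBITMQCTL:-rabbitmqctl}" eval \
        'rabbit_mnesia:cluster_status_from_mnesia().' 2>&1)
    if printf '%s\n' "$status" | grep -q 'mnesia_not_running'; then
        echo "rabbit app not running on this node"
        return 7    # OCF_NOT_RUNNING
    fi
    echo "cluster status from mnesia: $status"
    return 0        # OCF_SUCCESS
}
```

Returning 7 versus 0 follows the OCF convention (OCF_NOT_RUNNING / OCF_SUCCESS), so Pacemaker would see a paused node as stopped and could recover it, instead of reporting `Started` as in the transcript above.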