
Bug 1595753

Summary: [UPGRADES][10]RMQ resource-agent should handle stopped node
Product: Red Hat Enterprise Linux 7
Reporter: Yurii Prokulevych <yprokule>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED CURRENTRELEASE
QA Contact: pkomarov
Severity: urgent
Docs Contact: Marek Suchánek <msuchane>
Priority: urgent
Version: 7.5
CC: agk, aherr, apevec, augol, ccamacho, cchen, cfeist, cluster-maint, ctowsley, fdinitto, jeckersb, lhh, lmiccini, michele, mkrcmari, morazi, oalbrigt, pkomarov, sbradley, sgolovat, srevivo, toneata, yprokule
Target Milestone: pre-dev-freeze
Keywords: ReleaseNotes, Triaged, ZStream
Target Release: 7.7
Hardware: Unspecified
OS: Unspecified
Whiteboard: GSSApproved
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, the rabbitmqctl cluster_status command read cached cluster status from disk and returned 0 when the mnesia service was not running, for example after rabbitmqctl stop_app was called or after the service paused during a partition due to the pause_minority strategy. As a consequence, RabbitMQ could report stale, cached status from disk. With this update, the cluster status is read from mnesia during the monitor action. As a result, the described problem no longer occurs.
Story Points: ---
Clone Of:
Clones (view as bug list): 1641944, 1641945
Environment:
Last Closed: 2019-03-21 11:57:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1639826, 1641944, 1641945

Description Yurii Prokulevych 2018-06-27 13:27:40 UTC
Description of problem:
-----------------------
RabbitMQ failed to come up after a controller that had been upgraded to RHOS-10 was rebooted.
Excerpt from logs:

Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ rabbit@controller-2: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * connected to epmd (port 4369) on controller-2 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * epmd reports node 'rabbit' running on port 35672 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * can't establish TCP connection, reason: timeout (timed out) ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * suggestion: blocked by firewall? ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ current node details: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - node name: 'rabbitmq-cli-42@controller-0' ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - cookie hash: T6QXguOieCu9zpMaS+Vohw== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ Error: unable to connect to nodes ['rabbit@controller-1']: nodedown ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ DIAGNOSTICS ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ =========== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ rabbit@controller-1: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * connected to epmd (port 4369) on controller-1 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * epmd reports node 'rabbit' running on port 35672 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * can't establish TCP connection, reason: timeout (timed out) ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * suggestion: blocked by firewall? ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ current node details: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - node name: 'rabbitmq-cli-55@controller-0' ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - cookie hash: T6QXguOieCu9zpMaS+Vohw== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:     info: log_finished: finished - rsc:rabbitmq action:start call_id:67 pid:17517 exit-code:1 exec-time:94900ms queue-time:0ms
Jun 27 12:39:38 [3306] controller-0       crmd:   notice: process_lrm_event:    Result of start operation for rabbitmq on controller-0: 1 (unknown error) | call=67 key=rabbitmq_start_0 confirmed=true cib-update=63
Jun 27 12:39:38 [3306] controller-0       crmd:   notice: process_lrm_event:    controller-0-rabbitmq_start_0:67 [ Error: unable to connect to node 'rabbit@controller-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@controller-0']\n\nrabbit@controller-0:\n  * connected to epmd (port 4369) on controller-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on controller-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq

Restarting the service before reboot helped:
  pcs resource restart rabbitmq-clone

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puppet-rabbitmq-5.6.0-2.el7ost.noarch
rabbitmq-server-3.6.3-9.el7ost.noarch
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64
python-testresources-0.2.7-6.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade RHOS-9 to RHOS-10
2. Reboot controller-0 and observe RabbitMQ's status

Comment 2 John Eckersberg 2018-08-28 15:47:58 UTC
Here's what's going on here...

We changed the clustering port out of the ephemeral range here (35672->25672):

  https://review.openstack.org/#/c/345851/6/puppet/services/rabbitmq.yaml

So rabbitmq is originally up and bound to 35672.

The upgrade runs, and the firewall rules change so 35672 closes and 25672 opens.

On each controller, when 35672 closes, the existing clustered connection dies.  So the pause_minority code kicks in and the rabbit app stops.
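For reference, the pause_minority behavior described above comes from RabbitMQ's partition-handling setting. In a deployment like this it would typically appear in /etc/rabbitmq/rabbitmq.config roughly as follows (illustrative excerpt, not copied from the affected systems):

```erlang
%% When a node loses contact with the majority of the cluster,
%% pause the rabbit application until the majority returns.
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
```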

This happens on all three controllers.  Each is paused waiting for the others to return (they never do).  You can see this in the status because the rabbit app is not running:

[root@controller-0 ~]# rabbitmqctl status
Status of node 'rabbit@controller-0' ...
[{pid,20759},
 {running_applications,[{xmerl,"XML parser","1.3.10"},
                        {ranch,"Socket acceptor pool for TCP protocols.",
                               "1.2.1"},
                        {sasl,"SASL  CXC 138 11","2.7"},
                        {stdlib,"ERTS  CXC 138 10","2.8"},
                        {kernel,"ERTS  CXC 138 10","4.2"}]},
[...snip...]

Also, and more importantly, the resource agent health check is buggy in this situation.  It's reporting the service is up:

[root@controller-0 ~]# pcs status | grep -A1 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]

Because the health check just checks cluster_status, which returns success:

[root@controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0' ...
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-1',
                'rabbit@controller-2']}]},
 {alarms,[]}]
[root@controller-0 ~]# echo $?
0

However, this isn't really accurate when the node is paused. When mnesia is stopped, cluster_status reads the cached status from disk here:

[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/cluster_nodes.config 
{['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2'],['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2']}.
[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/nodes_running_at_shutdown 
['rabbit@controller-0'].

But if you force it to read from mnesia, you can see things are not ok:

[root@controller-0 ~]# rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'
{error,mnesia_not_running}

So this is not really a doc bug; we should fix the resource agent to handle this better.
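To sketch what "handle this better" could look like: instead of trusting `rabbitmqctl cluster_status` (which can return cached on-disk state with exit code 0 while the rabbit app is paused), the monitor action could decide health from the output of `rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'`. The helper name below is illustrative, not the actual function in the shipped agent:

```shell
# Hypothetical helper for a stricter monitor action.
# $1: output of `rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'`
# Returns 0 if the rabbit app is really running, 1 otherwise.
rmq_app_running() {
    case "$1" in
        *'{ok,'*)    return 0 ;;  # live cluster status came back from mnesia
        *'{error,'*) return 1 ;;  # e.g. {error,mnesia_not_running}: app paused/stopped
        *)           return 1 ;;  # anything unexpected: treat as not running
    esac
}

# In the monitor action one might then do (sketch):
#   status=$(rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().')
#   rmq_app_running "$status" || exit 7   # OCF_NOT_RUNNING
```

On the paused node from this bug, the eval output is `{error,mnesia_not_running}`, so this check would correctly report the resource as down instead of Started.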

Comment 5 John Eckersberg 2018-09-13 14:04:52 UTC
*** Bug 1628524 has been marked as a duplicate of this bug. ***