Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1576533

Summary: Need to analyze rabbitmq erlang log, as rabbitmq not starting on one of the nodes in cluster
Product: Red Hat OpenStack
Reporter: Ganesh Kadam <gkadam>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 10.0 (Newton)
CC: apevec, dhill, gkadam, jeckersb, lhh, mkrcmari, pkomarov, plemenko, srevivo
Target Milestone: z9
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: All
OS: All
Whiteboard:
Fixed In Version: rabbitmq-server-3.6.3-10.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-17 16:59:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ganesh Kadam 2018-05-09 17:04:01 UTC
Description of problem:

- RabbitMQ is not starting on one of the nodes in the cluster.
- This is a freshly deployed RHOSP 10 environment.

Below is the error log from the failed controller node (c1f-ops-ctlc22):
~~~
=ERROR REPORT==== 9-May-2018::11:10:25 ===
Error on AMQP connection <0.405.0> (10.20.x.3x:47814 -> 10.20.x.3x:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

=ERROR REPORT==== 9-May-2018::11:10:25 ===
Error on AMQP connection <0.429.0> (10.20.x.3x:47942 -> 10.20.x.3x:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
~~~

~~~
[root@c1f-ops-ctlc21 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: c1f-ops-ctlc20 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Wed May  9 11:36:27 2018
Last change: Wed May  9 11:23:15 2018 by hacluster via crmd on c1f-ops-ctlc20

3 nodes configured
19 resources configured

Online: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]

Full list of resources:

 ip-10.20.184.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc20
 ip-10.20.185.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc21
 ip-10.20.186.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc22
 Clone Set: haproxy-clone [haproxy]
     Started: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 ip-10.20.182.33        (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc20
 Master/Slave Set: galera-master [galera]
     Masters: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 ip-10.20.176.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc21
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ c1f-ops-ctlc20 c1f-ops-ctlc21 ]
     Stopped: [ c1f-ops-ctlc22 ]
 ip-10.20.186.12        (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc22
 Master/Slave Set: redis-master [redis]
     Masters: [ c1f-ops-ctlc20 ]
     Slaves: [ c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started c1f-ops-ctlc20

Failed Actions:
* rabbitmq_start_0 on c1f-ops-ctlc22 'unknown error' (1): call=108, status=complete, exitreason='none',
    last-rc-change='Wed May  9 11:23:22 2018', queued=0ms, exec=10393ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@c1f-ops-ctlc21 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@c1f-ops-ctlc21' ...
[{nodes,[{disc,['rabbit@c1f-ops-ctlc20','rabbit@c1f-ops-ctlc21']}]},
 {running_nodes,['rabbit@c1f-ops-ctlc20','rabbit@c1f-ops-ctlc21']},
 {cluster_name,<<"rabbit.tesoro.it">>},
 {partitions,[]},
 {alarms,[{'rabbit@c1f-ops-ctlc20',[]},{'rabbit@c1f-ops-ctlc21',[]}]}]
~~~



Version-Release number of selected component (if applicable):

[gkadam@collab-shell c1f-ops-ctlc22.coll.tesoro.it]$ grep -ri rabbit installed-rpms 
puppet-rabbitmq-5.6.0-2.el7ost.noarch                       Thu Feb 22 18:26:45 2018
rabbitmq-server-3.6.3-7.el7ost.noarch                       Thu Feb 22 18:21:40 2018


Actual results:

RabbitMQ fails to start on one of the controller nodes.

Expected results:

RabbitMQ should start on all the controller nodes.

Additional info/Steps performed:

We tried to restart the rabbitmq-clone resource in the following manner:

- Unmanaged rabbitmq-clone on one of the controller nodes.
- Killed all rabbitmq processes on all the controller nodes.
- Managed rabbitmq-clone again and tried a pcs resource cleanup.
- We found a stale epmd process on the failed controller node, so we killed all epmd processes and repeated the procedure. rabbitmq-clone then started successfully on all nodes, but it failed again on the same node (c1f-ops-ctlc22) once we restarted the OpenStack nova services on it.

We later tried again:

- Unmanaged rabbitmq-clone.
- Stopped all the rabbitmq processes.
- rm -rf /var/lib/rabbitmq/mnesia/*
- pcs resource manage rabbitmq-clone
- This did not work, so we unmanaged rabbitmq-clone once more and tried to start the rabbitmq application on the faulty node (c1f-ops-ctlc22) by hand; this did not work either.
- The node was unable to start the rabbitmq application and join the master node manually.
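The recovery attempt above can be sketched as a shell session. This is a reconstruction from the steps described in this report, not commands taken verbatim from it; hostnames, resource names, and the `pkill` patterns (`beam.smp` is the Erlang VM process name) are assumptions based on a standard RHOSP 10 Pacemaker/RabbitMQ setup:

```shell
# Take the resource out of Pacemaker's control (run on any controller).
pcs resource unmanage rabbitmq-clone

# On each controller: stop all RabbitMQ Erlang VM processes, and also kill
# any stale epmd daemon -- a leftover epmd was blocking the restart here.
pkill -u rabbitmq beam.smp || true
pkill epmd || true

# On the faulty node only: wipe the local Mnesia database so the node can
# rejoin the cluster with a clean state (this destroys local queue data).
rm -rf /var/lib/rabbitmq/mnesia/*

# Hand the resource back to Pacemaker and clear its failure history.
pcs resource manage rabbitmq-clone
pcs resource cleanup rabbitmq-clone
```

As described above, this sequence brought the node back only temporarily; it failed again after the nova services on it were restarted.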


We have asked for the Erlang logs to be uploaded, and we request that the Engineering team analyze them to narrow down the issue. We have tried every recovery procedure we know of, but RabbitMQ keeps failing on the same node.

Comment 9 Peter Lemenkov 2018-05-14 13:11:51 UTC
Ganesh, please try this build - rabbitmq-server-3.6.3-10.el7ost. It should fix your issue.

Comment 10 Peter Lemenkov 2018-05-14 13:15:31 UTC
A yum repository for the build of rabbitmq-server-3.6.3-10.el7ost (task 16274921) is available at:

http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/

You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory:

http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/rabbitmq-server-3.6.3-10.el7ost.repo

RPMs and build logs can be found in the following locations:
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/

The full list of available rpms is:
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/rabbitmq-server-3.6.3-10.el7ost.src.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/rabbitmq-server-3.6.3-10.el7ost.noarch.rpm
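Consuming the scratch build amounts to the following (a sketch; the brew-task-repos URLs are internal to Red Hat, and the `pcs resource restart` step is an assumption about how the updated package would be rolled into the Pacemaker-managed cluster):

```shell
# Drop the build's .repo file into yum's config directory (internal URL
# quoted from the comment above).
curl -o /etc/yum.repos.d/rabbitmq-server-3.6.3-10.el7ost.repo \
  http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/rabbitmq-server-3.6.3-10.el7ost.repo

# Update the package on each controller, then restart the clone resource
# so all nodes run the fixed build.
yum update rabbitmq-server
pcs resource restart rabbitmq-clone
```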

Build output will be available for the next 21 days.

If you wish to stop receiving these emails, please email:
Mike Bonnet <mikeb>

Thank you,
The Brew Task Repos System

Comment 19 Alex McLeod 2018-09-03 07:58:33 UTC
Hi there,

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field.

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Thanks,
Alex

Comment 20 Peter Lemenkov 2018-09-03 12:49:27 UTC
I don't think this particular ticket requires any documentation, so I'm going to set requires_doc_text to '-'.

Comment 22 errata-xmlrpc 2018-09-17 16:59:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671