Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1576533

Summary: Need to analyze rabbitmq erlang log, as rabbitmq not starting on one of the nodes in cluster
Product: Red Hat OpenStack
Reporter: Ganesh Kadam <gkadam>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 10.0 (Newton)
CC: apevec, dhill, gkadam, jeckersb, lhh, mkrcmari, pkomarov, plemenko, srevivo
Target Milestone: z9
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: All
OS: All
Whiteboard:
Fixed In Version: rabbitmq-server-3.6.3-10.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-17 16:59:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ganesh Kadam 2018-05-09 17:04:01 UTC
Description of problem:

- RabbitMQ is not starting on one of the nodes in the cluster.
- This is a freshly deployed RHOSP 10 environment.

Below is the error log from the failed controller node (c1f-ops-ctlc22):
~~~
=ERROR REPORT==== 9-May-2018::11:10:25 ===
Error on AMQP connection <0.405.0> (10.20.x.3x:47814 -> 10.20.x.3x:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

=ERROR REPORT==== 9-May-2018::11:10:25 ===
Error on AMQP connection <0.429.0> (10.20.x.3x:47942 -> 10.20.x.3x:5672, vhost: '/', user: 'guest', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
~~~

~~~
[root@c1f-ops-ctlc21 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: c1f-ops-ctlc20 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Wed May  9 11:36:27 2018
Last change: Wed May  9 11:23:15 2018 by hacluster via crmd on c1f-ops-ctlc20

3 nodes configured
19 resources configured

Online: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]

Full list of resources:

 ip-10.20.184.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc20
 ip-10.20.185.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc21
 ip-10.20.186.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc22
 Clone Set: haproxy-clone [haproxy]
     Started: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 ip-10.20.182.33        (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc20
 Master/Slave Set: galera-master [galera]
     Masters: [ c1f-ops-ctlc20 c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 ip-10.20.176.250       (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc21
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ c1f-ops-ctlc20 c1f-ops-ctlc21 ]
     Stopped: [ c1f-ops-ctlc22 ]
 ip-10.20.186.12        (ocf::heartbeat:IPaddr2):       Started c1f-ops-ctlc22
 Master/Slave Set: redis-master [redis]
     Masters: [ c1f-ops-ctlc20 ]
     Slaves: [ c1f-ops-ctlc21 c1f-ops-ctlc22 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started c1f-ops-ctlc20

Failed Actions:
* rabbitmq_start_0 on c1f-ops-ctlc22 'unknown error' (1): call=108, status=complete, exitreason='none',
    last-rc-change='Wed May  9 11:23:22 2018', queued=0ms, exec=10393ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@c1f-ops-ctlc21 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@c1f-ops-ctlc21' ...
[{nodes,[{disc,['rabbit@c1f-ops-ctlc20','rabbit@c1f-ops-ctlc21']}]},
 {running_nodes,['rabbit@c1f-ops-ctlc20','rabbit@c1f-ops-ctlc21']},
 {cluster_name,<<"rabbit.tesoro.it">>},
 {partitions,[]},
 {alarms,[{'rabbit@c1f-ops-ctlc20',[]},{'rabbit@c1f-ops-ctlc21',[]}]}]
~~~



Version-Release number of selected component (if applicable):

[gkadam@collab-shell c1f-ops-ctlc22.coll.tesoro.it]$ grep -ri rabbit installed-rpms 
puppet-rabbitmq-5.6.0-2.el7ost.noarch                       Thu Feb 22 18:26:45 2018
rabbitmq-server-3.6.3-7.el7ost.noarch                       Thu Feb 22 18:21:40 2018


Actual results:

RabbitMQ fails to start on one of the controller nodes.

Expected results:

RabbitMQ should start on all the controller nodes.

Additional info/Steps performed:

We tried to restart the rabbitmq-clone resource in the following manner:

- Unmanaged rabbitmq-clone on one of the controller nodes.
- Killed all rabbitmq processes on all the controller nodes.
- Managed rabbitmq-clone again and tried a pcs resource cleanup.
- We found a stale epmd process on the failed controller node, so we killed all epmd processes and repeated the procedure. rabbitmq-clone then started successfully on all nodes, but it failed again on the same node (c1f-ops-ctlc22) once we restarted the OpenStack nova services on it.

We later tried again:

- Unmanaged rabbitmq-clone.
- Stopped all the rabbitmq processes.
- rm -rf /var/lib/rabbitmq/mnesia/*
- pcs resource manage rabbitmq-clone
- This did not work, so we unmanaged rabbitmq-clone once more and tried to start the rabbitmq application on the faulty node (c1f-ops-ctlc22) by hand; this did not work either.
- The node was unable to start the rabbitmq application and join the master node manually.
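The recovery attempt above can be sketched as a shell session. This is a reconstruction from the steps described in this report, not commands taken verbatim from it; hostnames, resource names, and the `pkill` patterns (`beam.smp` is the Erlang VM process name) are assumptions based on a standard RHOSP 10 Pacemaker/RabbitMQ setup:

```shell
# Take the resource out of Pacemaker's control (run on any controller).
pcs resource unmanage rabbitmq-clone

# On each controller: stop all RabbitMQ Erlang VM processes, and also kill
# any stale epmd daemon -- a leftover epmd was blocking the restart here.
pkill -u rabbitmq beam.smp || true
pkill epmd || true

# On the faulty node only: wipe the local Mnesia database so the node can
# rejoin the cluster with a clean state (this destroys local queue data).
rm -rf /var/lib/rabbitmq/mnesia/*

# Hand the resource back to Pacemaker and clear its failure history.
pcs resource manage rabbitmq-clone
pcs resource cleanup rabbitmq-clone
```

As described above, this sequence brought the node back only temporarily; it failed again after the nova services on it were restarted.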


We have asked for the Erlang logs to be uploaded, and we request that the Engineering team analyze them to narrow down the issue. We have tried every recovery procedure we know of, but RabbitMQ keeps failing on the same node.

Comment 9 Peter Lemenkov 2018-05-14 13:11:51 UTC
Ganesh, please try this build - rabbitmq-server-3.6.3-10.el7ost. It should fix your issue.

Comment 10 Peter Lemenkov 2018-05-14 13:15:31 UTC
A yum repository for the build of rabbitmq-server-3.6.3-10.el7ost (task 16274921) is available at:

http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/

You can install the rpms locally by putting this .repo file in your /etc/yum.repos.d/ directory:

http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/rabbitmq-server-3.6.3-10.el7ost.repo

RPMs and build logs can be found in the following locations:
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/

The full list of available rpms is:
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/rabbitmq-server-3.6.3-10.el7ost.src.rpm
http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/noarch/rabbitmq-server-3.6.3-10.el7ost.noarch.rpm
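Consuming the scratch build amounts to the following (a sketch; the brew-task-repos URLs are internal to Red Hat, and the `pcs resource restart` step is an assumption about how the updated package would be rolled into the Pacemaker-managed cluster):

```shell
# Drop the build's .repo file into yum's config directory (internal URL
# quoted from the comment above).
curl -o /etc/yum.repos.d/rabbitmq-server-3.6.3-10.el7ost.repo \
  http://brew-task-repos.usersys.redhat.com/repos/official/rabbitmq-server/3.6.3/10.el7ost/rabbitmq-server-3.6.3-10.el7ost.repo

# Update the package on each controller, then restart the clone resource
# so all nodes run the fixed build.
yum update rabbitmq-server
pcs resource restart rabbitmq-clone
```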

Build output will be available for the next 21 days.

If you wish to stop receiving these emails, please email:
Mike Bonnet <mikeb>

Thank you,
The Brew Task Repos System

Comment 19 Alex McLeod 2018-09-03 07:58:33 UTC
Hi there,

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field.

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Thanks,
Alex

Comment 20 Peter Lemenkov 2018-09-03 12:49:27 UTC
I don't think this particular ticket requires any documentation, so I'm going to set requires_doc_text to '-'.

Comment 22 errata-xmlrpc 2018-09-17 16:59:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671