Description of problem:
The rabbitmq resource agent appears to misbehave in at least one scenario in an OpenStack HA setup. The setup has three controllers, all of which are supposed to be members of the rabbitmq cluster. Initially everything worked fine. I then did the following: a non-graceful reset of the first controller; once it came back online, all was good. Next came a non-graceful reset of the second controller. Once that controller was back online, all resources on it came up and were marked as successfully started by pacemaker. However, the neutron l3 agent on the reset controller was reported as down, and its log kept repeating the same message in a loop:

2016-02-12 15:03:16.388 4596 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 615, in _report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     self.use_call)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=self.retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     timeout=timeout, retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     'to message ID %s' % msg_id)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 79eaa4c3a09944a0963f5bb5a2173c32
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent
2016-02-12 15:03:16.389 4596 WARNING neutron.openstack.common.loopingcall [-] task <bound method L3NATAgentWithStateReport._report_state of <neutron.agent.l3.agent.L3NATAgentWithStateReport object at 0x47e1e10>> run outlasted interval by 30.01 sec

It turns out that the reset node is not part of the rabbitmq cluster:

# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-1' ...
[{nodes,[{disc,['rabbit@overcloud-controller-1']}]},
 {running_nodes,['rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

The node remains in this state, and the neutron l3 agent likewise remains dead.
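As a quick post-reboot sanity check, the running_nodes line of the cluster_status output can be parsed to count cluster members. This is only an illustrative sketch, not official tooling; the expected member count (3 in this setup) and the parsing are assumptions:

```shell
# Count the members listed in running_nodes of `rabbitmqctl cluster_status`
# output read from stdin (sketch; assumes the tuple fits on one line, as in
# the output above).
count_running_nodes() {
    sed -n 's/.*{running_nodes,\[\(.*\)\]}.*/\1/p' | grep -o 'rabbit@' | wc -l
}

# Example usage on a live node, comparing against the expected size:
# [ "$(rabbitmqctl cluster_status | count_running_nodes)" -eq 3 ] \
#     || echo "node is stranded outside the cluster"
```

Against the output captured above, this reports 1 rather than the expected 3, which is exactly the stranded state described.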
The log /var/log/rabbitmq/startup_err is empty.

The log /var/log/rabbitmq/rabbit@overcloud-controller-1.log keeps repeating entries like:

=ERROR REPORT==== 12-Feb-2016::15:06:08 ===
connection <0.2086.0>, channel 2 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_2c2f7eac82f34702a8ad5cc4aef7b45f' in vhost '/'",
            'exchange.declare'}

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.1.x86_64
python-oslo-messaging-1.8.3-3.el7ost.noarch
python-neutron-2015.1.2-8.el7ost.noarch
openstack-neutron-2015.1.2-8.el7ost.noarch

How reproducible:
Once (will drop a comment if it happens again).

Steps to Reproduce:
1. Reboot an OpenStack controller that is part of the rabbitmq cluster (for example, reboot one controller at a time; if it recovers successfully, continue with the next one).
2. Check the status of the neutron agents and rabbitmq's cluster_status.

Actual results:
The node is not a member of the rabbitmq cluster after it comes back online. Pacemaker does not report any problem, and all resources are started.

Additional info:
I am not sure about the component. I am also not sure why the neutron l3 agent was not reconnected to a different amqp server while the one that was reset was down, staying connected to that server afterwards.
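For completeness, the standard manual recovery for a node stranded outside the cluster would look roughly like the following. These are plain RabbitMQ clustering commands, not a fix for the resource agent; the seed node name is an example, and whether a reset is needed depends on the node's local state:

```shell
# Hypothetical manual recovery, run on the node that failed to rejoin.
# `seed` must be a healthy cluster member. A `rabbitmqctl reset` between
# stop_app and join_cluster may be required if the node still considers
# itself a standalone single-node cluster, as in the output above.
rejoin_cluster() {
    seed="$1"
    rabbitmqctl stop_app &&
    rabbitmqctl join_cluster "rabbit@${seed}" &&
    rabbitmqctl start_app
}

# Example:
# rejoin_cluster overcloud-controller-0
# rabbitmqctl cluster_status   # verify running_nodes now lists all members
```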