Bug 1307056

Summary: rabbitmq node does not rejoin cluster after node reset.
Product: Red Hat Enterprise Linux 7
Reporter: Marian Krcmarik <mkrcmari>
Component: resource-agents
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED DUPLICATE
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 7.2
CC: agk, cluster-maint, fdinitto, michele
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Last Closed: 2016-02-13 07:17:18 UTC

Description Marian Krcmarik 2016-02-12 15:14:50 UTC
Description of problem:
It seems that the rabbitmq resource agent may be buggy in some scenarios within an Openstack HA setup.
The HA setup has three controllers, all of which are supposed to be part of the rabbitmq cluster. Initially everything worked fine. I then did a non-graceful reset of the first controller; once it came back online, all was good. The next step was a non-graceful reset of the second controller. Once that controller was back online, all resources on it went up and were marked as started successfully by pacemaker, but the neutron l3 agent on the reset controller was reported as down - it kept logging the same message in the l3 agent log in a loop:
2016-02-12 15:03:16.388 4596 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 615, in _report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     self.use_call)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=self.retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     timeout=timeout, retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     'to message ID %s' % msg_id)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 79eaa4c3a09944a0963f5bb5a2173c32
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent 
2016-02-12 15:03:16.389 4596 WARNING neutron.openstack.common.loopingcall [-] task <bound method L3NATAgentWithStateReport._report_state of <neutron.agent.l3.agent.L3NATAgentWithStateReport object at 0x47e1e10>> run outlasted interval by 30.01 sec

It turns out that the reset node is not part of the rabbitmq cluster:
# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-1' ...
[{nodes,[{disc,['rabbit@overcloud-controller-1']}]},
 {running_nodes,['rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

The node remains in this state, and the neutron l3 agent stays dead.

The log /var/log/rabbitmq/startup_err is empty.

The log /var/log/rabbitmq/rabbit@overcloud-controller-1.log keeps repeating something like:
=ERROR REPORT==== 12-Feb-2016::15:06:08 ===
connection <0.2086.0>, channel 2 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_2c2f7eac82f34702a8ad5cc4aef7b45f' in vhost '/'",
            'exchange.declare'}
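
As a rough manual workaround (not what the resource agent itself does), the reset node can probably be rejoined by hand. This is only a sketch: it assumes rabbit@overcloud-controller-0 is still a healthy cluster member, and note that "rabbitmqctl reset" wipes the node's local rabbit state:

# Run on the reset node (overcloud-controller-1);
# rabbit@overcloud-controller-0 is assumed to be a healthy cluster member.
rabbitmqctl stop_app
rabbitmqctl reset                                        # clears the node's stand-alone cluster state
rabbitmqctl join_cluster rabbit@overcloud-controller-0
rabbitmqctl start_app
rabbitmqctl cluster_status                               # running_nodes should now list all three controllers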

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.1.x86_64
python-oslo-messaging-1.8.3-3.el7ost.noarch
python-neutron-2015.1.2-8.el7ost.noarch
openstack-neutron-2015.1.2-8.el7ost.noarch

How reproducible:
Once (I will add a comment if it happens again).

Steps to Reproduce:
1. Reboot an openstack controller which is part of the rabbitmq cluster (for example, reboot one controller at a time; if it recovers successfully, continue with the next controller).
2. Check the status of the neutron agents and the cluster_status of rabbitmq (see the check sketch below).
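
For step 2, a quick check could look like the following - just a sketch, with the node names assumed from the overcloud naming above:

# On the rebooted controller, once it is back online:
pcs status                   # pacemaker view: the rabbitmq resource reported as Started
rabbitmqctl cluster_status   # running_nodes should list all three rabbit@overcloud-controller-* nodes
neutron agent-list           # the L3 agent on the rebooted controller should be reported as alive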

Actual results:
The node is not a member of the rabbitmq cluster after it comes back online; pacemaker does not report any problem and all resources are started.

Additional info:
I am not sure about the component. I am also not sure why the neutron l3 agent did not reconnect to a different amqp server while the one that was reset was down, and then stay connected to that different amqp server.
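
Regarding the failover question, it might be worth checking how many brokers the agents are configured with - assuming the standard oslo.messaging rabbit_host/rabbit_hosts options are used in neutron.conf:

# Show which AMQP broker(s) the neutron agents point at; with several
# rabbit_hosts entries oslo.messaging should be able to fail over to another broker.
grep -E '^rabbit_host' /etc/neutron/neutron.conf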