Description of problem:
The rabbitmq resource agent appears to misbehave in at least one scenario in an OpenStack HA setup. The setup has three controllers, all of which are supposed to be members of the rabbitmq cluster. Initially everything worked fine. I then did the following: a non-graceful reset of the first controller; once it came back online, all was good. Next came a non-graceful reset of the second controller. Once that controller was back online, all resources on it came up and were marked as successfully started by pacemaker. However, the neutron l3 agent on the reset controller was reported as down, and its log kept repeating the same message in a loop:

2016-02-12 15:03:16.388 4596 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 615, in _report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     self.use_call)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=self.retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     timeout=timeout, retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     'to message ID %s' % msg_id)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 79eaa4c3a09944a0963f5bb5a2173c32
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent
2016-02-12 15:03:16.389 4596 WARNING neutron.openstack.common.loopingcall [-] task <bound method L3NATAgentWithStateReport._report_state of <neutron.agent.l3.agent.L3NATAgentWithStateReport object at 0x47e1e10>> run outlasted interval by 30.01 sec

It turns out that the reset node is not part of the rabbitmq cluster:

# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-1' ...
[{nodes,[{disc,['rabbit@overcloud-controller-1']}]},
 {running_nodes,['rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

The node remains in this state, and the neutron l3 agent likewise remains dead.
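As a quick post-reboot sanity check, the running_nodes line of the cluster_status output can be parsed to count cluster members. This is only an illustrative sketch, not official tooling; the expected member count (3 in this setup) and the parsing are assumptions:

```shell
# Count the members listed in running_nodes of `rabbitmqctl cluster_status`
# output read from stdin (sketch; assumes the tuple fits on one line, as in
# the output above).
count_running_nodes() {
    sed -n 's/.*{running_nodes,\[\(.*\)\]}.*/\1/p' | grep -o 'rabbit@' | wc -l
}

# Example usage on a live node, comparing against the expected size:
# [ "$(rabbitmqctl cluster_status | count_running_nodes)" -eq 3 ] \
#     || echo "node is stranded outside the cluster"
```

Against the output captured above, this reports 1 rather than the expected 3, which is exactly the stranded state described.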
The log /var/log/rabbitmq/startup_err is empty.

The log /var/log/rabbitmq/rabbit@overcloud-controller-1.log keeps repeating entries like:

=ERROR REPORT==== 12-Feb-2016::15:06:08 ===
connection <0.2086.0>, channel 2 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_2c2f7eac82f34702a8ad5cc4aef7b45f' in vhost '/'",
            'exchange.declare'}

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.1.x86_64
python-oslo-messaging-1.8.3-3.el7ost.noarch
python-neutron-2015.1.2-8.el7ost.noarch
openstack-neutron-2015.1.2-8.el7ost.noarch

How reproducible:
Once (will drop a comment if it happens again).

Steps to Reproduce:
1. Reboot an OpenStack controller that is part of the rabbitmq cluster (for example, reboot one controller at a time; if it recovers successfully, continue with the next one).
2. Check the status of the neutron agents and rabbitmq's cluster_status.

Actual results:
The node is not a member of the rabbitmq cluster after it comes back online. Pacemaker does not report any problem, and all resources are started.

Additional info:
I am not sure about the component. I am also not sure why the neutron l3 agent was not reconnected to a different amqp server while the one that was reset was down, staying connected to that server afterwards.
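For completeness, the standard manual recovery for a node stranded outside the cluster would look roughly like the following. These are plain RabbitMQ clustering commands, not a fix for the resource agent; the seed node name is an example, and whether a reset is needed depends on the node's local state:

```shell
# Hypothetical manual recovery, run on the node that failed to rejoin.
# `seed` must be a healthy cluster member. A `rabbitmqctl reset` between
# stop_app and join_cluster may be required if the node still considers
# itself a standalone single-node cluster, as in the output above.
rejoin_cluster() {
    seed="$1"
    rabbitmqctl stop_app &&
    rabbitmqctl join_cluster "rabbit@${seed}" &&
    rabbitmqctl start_app
}

# Example:
# rejoin_cluster overcloud-controller-0
# rabbitmqctl cluster_status   # verify running_nodes now lists all members
```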