Bug 1307056 - rabbitmq node does not rejoin cluster after node reset.
Status: CLOSED DUPLICATE of bug 1299923
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Peter Lemenkov
QA Contact: cluster-qe@redhat.com
Reported: 2016-02-12 10:14 EST by Marian Krcmarik
Modified: 2016-02-13 02:17 EST
CC List: 4 users

Doc Type: Bug Fix
Last Closed: 2016-02-13 02:17:18 EST
Type: Bug

Attachments: None
Description Marian Krcmarik 2016-02-12 10:14:50 EST
Description of problem:
It seems that the rabbitmq resource agent may be buggy in some scenarios within an OpenStack HA setup.
The OpenStack HA setup has three controllers, all of which are supposed to be part of the rabbitmq cluster. Initially everything worked fine. I then did a non-graceful reset of the first controller; once it came back online, everything was still good. The next step was a non-graceful reset of the second controller. Once that controller was back online, all resources on it came up and were marked as started successfully by pacemaker, but the neutron L3 agent on the reset controller was reported as down. It kept repeating the same message in the L3 agent log in a loop:
2016-02-12 15:03:16.388 4596 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 615, in _report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     self.use_call)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=self.retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     timeout=timeout, retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     retry=retry)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent     'to message ID %s' % msg_id)
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 79eaa4c3a09944a0963f5bb5a2173c32
2016-02-12 15:03:16.388 4596 TRACE neutron.agent.l3.agent 
2016-02-12 15:03:16.389 4596 WARNING neutron.openstack.common.loopingcall [-] task <bound method L3NATAgentWithStateReport._report_state of <neutron.agent.l3.agent.L3NATAgentWithStateReport object at 0x47e1e10>> run outlasted interval by 30.01 sec

It turns out that the reset node is not part of the rabbitmq cluster:
# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller-1' ...
[{nodes,[{disc,['rabbit@overcloud-controller-1']}]},
 {running_nodes,['rabbit@overcloud-controller-1']},
 {cluster_name,<<"rabbit@overcloud-controller-1.localdomain">>},
 {partitions,[]}]
...done.

and it remains in this state, just as the neutron L3 agent remains dead.
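
For reference, the dead agent state can also be seen from the neutron CLI (a rough sketch; exact output columns vary by release):

# neutron agent-list | grep 'L3 agent'

An agent whose "alive" column shows "xxx" instead of ":-)" is considered dead.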

The log /var/log/rabbitmq/startup_err is empty.

The log /var/log/rabbitmq/rabbit@overcloud-controller-1.log keeps repeating something like:
=ERROR REPORT==== 12-Feb-2016::15:06:08 ===
connection <0.2086.0>, channel 2 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_2c2f7eac82f34702a8ad5cc4aef7b45f' in vhost '/'",
            'exchange.declare'}

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.1.x86_64
python-oslo-messaging-1.8.3-3.el7ost.noarch
python-neutron-2015.1.2-8.el7ost.noarch
openstack-neutron-2015.1.2-8.el7ost.noarch

How reproducible:
Once (I will add a comment if it happens again).

Steps to Reproduce:
1. Reboot (non-gracefully) an OpenStack controller that is part of the rabbitmq cluster (for example, reboot one controller at a time; if it recovers successfully, continue with the next controller).
2. Check the status of the neutron agents and the cluster_status of rabbitmq (a sketch of these checks is below).
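
A sketch of the above, assuming the non-graceful reset is simulated with sysrq (just one way to do it, not necessarily how the original resets were triggered):

# echo 1 > /proc/sys/kernel/sysrq
# echo b > /proc/sysrq-trigger

Then, once the node is back and pacemaker (pcs status) reports all resources as started:

# rabbitmqctl cluster_status
# neutron agent-list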

Actual results:
The node is not a member of the rabbitmq cluster after it comes back online; pacemaker does not report any problem and shows all resources as started.

Additional info:
I am not sure about the component. I am also not sure why the neutron L3 agent did not reconnect to a different AMQP server while the one that was reset was down, and then stay connected to that server.
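
For what it is worth, a node that ends up as a standalone cluster like the one above can usually be rejoined by hand with the standard rabbitmq procedure (the peer name rabbit@overcloud-controller-0 below is an assumption; any surviving cluster member would do). As far as I understand, the resource agent is supposed to do the equivalent of this automatically on start:

# rabbitmqctl stop_app
# rabbitmqctl reset
# rabbitmqctl join_cluster rabbit@overcloud-controller-0
# rabbitmqctl start_app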
