Description of problem:

The openvswitch Neutron agents are reported as down on compute nodes for a significant time (about 10 minutes) after all OpenStack controllers (in an HA setup) come back online from a previous failover.

The test is as follows:
1. Take down all the controllers at once.
2. Keep them down for around 5 minutes.
3. Check the status of all Neutron agents.

The result is that the openvswitch agent on compute nodes is reported as down for about 10 minutes after all controllers are back online (after step 2).

Example from the log: the reboot of all controllers at once happened at 15:19:49 (visible as the agent losing its connection to the rabbitmq server). The controllers came back online at about 15:25:53, and the agent, as visible in the log, re-established its connection to the rabbitmq server at 15:28. Then nothing happened for about 10 minutes, and the agent was finally reported as alive at 15:38. That is a ten-minute gap of downtime for the agent, which I did not observe with the same test on RHOS 8 setups.

2016-07-25 15:19:54.528 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:55.456 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:55.567 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:56.240 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:56.271 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:57.455 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:58.448 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:58.655 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:59.199 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:20:00.538 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: timed out. Trying again in 1 seconds.
...
(a lot of "timed out. Trying again in 1 seconds." messages)
...
2016-07-25 15:20:29.593 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.16:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 4 seconds.
...
(a lot of EHOSTUNREACH messages)
...
2016-07-25 15:27:12.102 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
...
(some ECONNREFUSED messages)
...
2016-07-25 15:28:20.003 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.026 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.049 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.082 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.195 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.223 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.224 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:21.023 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.112 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.259 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.260 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:30.916 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:44.913 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:45.939 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:38:30.921 6854 ERROR neutron.common.rpc [-] Timeout in RPC method report_state. Waiting for 1 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 313, in _report_state
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 86, in report_state
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 155, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     time.sleep(wait)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.force_reraise()
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     six.reraise(self.type_, self.value, self.tb)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 136, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return self._original_context.call(ctxt, method, **kwargs)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID b09d51da5c454dd09318a328495c6bd5
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2016-07-25 15:38:32.105 6854 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 1087.76 sec
2016-07-25 15:38:32.429 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Agent has just been revived. Doing a full sync.
2016-07-25 15:38:33.750 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-df730a39-d95d-4316-9ede-db28887f765a - - - - -] rpc_loop doing a full sync.
2016-07-25 15:38:33.801 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-df730a39-d95d-4316-9ede-db28887f765a - - - - -] Agent out of sync with plugin!

Version-Release number of selected component (if applicable):
python-neutron-lib-0.0.2-1.el7ost.noarch
openstack-neutron-ml2-8.1.2-1.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.3.8-1.el7ost.noarch
openstack-neutron-lbaas-8.0.0-1.el7ost.noarch
openstack-neutron-metering-agent-8.1.2-1.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
openstack-neutron-common-8.1.2-1.el7ost.noarch
openstack-neutron-bigswitch-agent-2015.3.8-1.el7ost.noarch
python-neutron-8.1.2-1.el7ost.noarch
python-neutron-lbaas-8.0.0-1.el7ost.noarch
openstack-neutron-openvswitch-8.1.2-1.el7ost.noarch
openstack-neutron-8.1.2-1.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Take down all the controllers at once.
2. Keep them down for around 5 minutes.
3. Check the status of all Neutron agents (a minimal check sketch is included under "Additional info" below).

Actual results:
About 10 minutes of downtime of the OpenStack Neutron openvswitch agent on the compute nodes.

Expected results:
No such downtime (it was not observed on RHOS 8 setups).

Additional info:
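For reference, this is a minimal sketch of the check in step 3, written against the python-neutronclient listed above. The credentials and endpoint are placeholders (generic overcloud admin values, not taken from this setup):

# Minimal sketch: list the Open vSwitch agents and their liveness.
# Placeholders: username/password/tenant_name/auth_url must come from the
# actual overcloud admin credentials (e.g. the overcloudrc file).
from neutronclient.v2_0 import client

neutron = client.Client(username='admin',
                        password='PASSWORD',
                        tenant_name='admin',
                        auth_url='http://<keystone-vip>:5000/v2.0')

# An agent is considered alive only if its last report_state reached the
# Neutron server recently enough; during the window described above the
# compute-node OVS agents show alive=False although the nodes are healthy.
for agent in neutron.list_agents(agent_type='Open vSwitch agent')['agents']:
    print('%s on %s alive=%s' % (agent['binary'], agent['host'], agent['alive']))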
Please attach an SOS report from a controller and include the timestamps for the reboots.
Or give us access to the system.
Assigned to John to chase the info down and to root cause the bug.
Logs from the initial report show that rabbitmq had restarted a number of times (connection-refused errors and successful reconnections a few times in a row), and then the connection finally succeeded and a report_state call was initiated. My guess is that rabbitmq might then have restarted again without oslo.messaging noticing. It would be interesting to see whether this reproduces with the same Neutron version but an earlier (rhos-8) oslo.messaging version.

Either way, I talked to Marian and the setup was reprovisioned; they are trying to reproduce this on another setup. Marian - if and when this is reproduced again, please post the oslo.messaging version in addition to the sosreport. Also, could you try downgrading the oslo.messaging package to the rhos-8 version on the problematic node and see whether that fixes things?
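For context on the "run outlasted interval by 1087.76 sec" warning in the log above, here is a rough, simplified model (not the actual oslo.service code) of the fixed-interval loop that sends the state reports. The point is that the next heartbeat is only scheduled after the previous call returns, so a single stuck report_state call starves every subsequent heartbeat and the server keeps seeing the agent as down:

import time

REPORT_INTERVAL = 30  # agent-side report_interval; 30 seconds is the usual default


def report_state():
    # Stand-in for OVSNeutronAgent._report_state: a synchronous RPC call.
    # If the reply never arrives, it blocks until the RPC timeout fires.
    time.sleep(1088)  # simulate the stuck call from the log; shrink to experiment


def fixed_interval_loop(func, interval):
    # Simplified model of a fixed-interval looping call: the next run is
    # scheduled only after the previous one returns, so one blocked call
    # delays all following heartbeats ("run outlasted interval by ... sec").
    while True:
        start = time.time()
        func()
        elapsed = time.time() - start
        if elapsed > interval:
            print('run outlasted interval by %.2f sec' % (elapsed - interval))
        time.sleep(max(0, interval - elapsed))


if __name__ == '__main__':
    fixed_interval_loop(report_state, REPORT_INTERVAL)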
Marian gave me another setup and we debugged it together. The fault is ultimately https://review.openstack.org/#/c/280595/, which introduces an exponential backoff for RPC timeouts (actually it's linear, but never mind). This patch was ported upstream to mitaka, and thus we are hitting it in osp-9. In short, the patch was intended for potentially long-running RPC calls (like "get_routers" when there are a lot of routers), not for short RPC calls that don't involve much computation. The solution I have in mind is to introduce a way for code to decide for itself whether it wants to use the exponential backoff or not. I'll begin work on this upstream in the coming days.
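To make the failure mode concrete, below is a rough sketch of the kind of per-method timeout escalation the patch introduces. This is not the actual neutron.common.rpc code; the names, growth step, and cap are illustrative only:

import time

import oslo_messaging as messaging

DEFAULT_TIMEOUT = 60   # rpc_response_timeout; 60 seconds is the usual default
TIMEOUT_CAP = 600      # illustrative cap, not the upstream value

# Illustrative per-RPC-method timeout, grown every time that method times out.
_METHOD_TIMEOUTS = {}


def call_with_backoff(client, context, method, **kwargs):
    """Illustrative only: retry an RPC call, raising its timeout each time.

    The problem described above is that a cheap periodic call such as
    report_state goes through the same escalation as heavy calls like
    get_routers, so after a controller outage one stuck heartbeat can wait
    many minutes before the agent is reported alive again.
    """
    timeout = _METHOD_TIMEOUTS.get(method, DEFAULT_TIMEOUT)
    while True:
        try:
            cctxt = client.prepare(timeout=timeout)
            return cctxt.call(context, method, **kwargs)
        except messaging.MessagingTimeout:
            # Grow the timeout for this method and retry; the exact growth
            # curve upstream differs (roughly linear, as noted above).
            timeout = min(timeout + DEFAULT_TIMEOUT, TIMEOUT_CAP)
            _METHOD_TIMEOUTS[method] = timeout
            time.sleep(1)  # matches "Waiting for 1 seconds before next attempt"

The fix proposed upstream is, roughly, to let callers opt out of this escalation for short calls such as report_state.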
Added external trackers for the proposed upstream solution and bug report.
This only happens if you shut down all three controllers at the same time. I don't think that scenario qualifies this bug as a blocker; my recommendation would be not to delay the release because of it.
The patch has been merged upstream and I've begun a backport for rhos-9. Once that is done I'll package a new RPM.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1759.html