Bug 1359894 - openvswitch agents are being reported as down for 10 minutes after all reset controllers come back online
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: John Schwarz
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks: 1365378
 
Reported: 2016-07-25 16:16 UTC by Marian Krcmarik
Modified: 2016-08-24 12:56 UTC
CC List: 13 users

Fixed In Version: openstack-neutron-8.1.2-2.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when all controllers were restarted at the same time, all AMQP (RabbitMQ) servers also restarted; this caused the connection between the neutron agents and the AMQP servers to appear to hang until it timed out, and the time-out grew with each retry (60 seconds, then 120, then 240, and so on, capped at 600 seconds). Consequently, an agent was considered `down` (it did not receive any events) until the timeout expired. With this update, the time-out for the specific RPC call that reports agent state to the neutron-server has been changed to always be 60 seconds. As a result, if the connection hangs because of a restart of all the controllers, the agents recover more quickly (roughly 60 seconds after the controllers fully start again). A small illustrative sketch of this timeout behaviour follows the metadata fields below.
Clone Of:
: 1365378 (view as bug list)
Environment:
Last Closed: 2016-08-24 12:56:40 UTC
Target Upstream Version:
Embargoed:
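
The following is an illustrative Python sketch only (not the shipped neutron/oslo.messaging code) of the growing per-attempt timeout described in the Doc Text above, starting at 60 seconds and doubling up to the 600-second cap; the values are taken from this report.

# Illustrative sketch, not the shipped code: models the timeout growth
# described in the Doc Text (60 s doubling per retry, capped at 600 s).
import itertools

def growing_timeouts(start=60, cap=600):
    """Yield per-attempt timeouts: 60, 120, 240, 480, 600, 600, ..."""
    timeout = start
    while True:
        yield timeout
        timeout = min(timeout * 2, cap)

print(list(itertools.islice(growing_timeouts(), 6)))  # [60, 120, 240, 480, 600, 600]
# A single attempt stuck at the 600 s cap already accounts for roughly the
# 10 minutes of agent downtime reported here; with the fix, report_state
# waits at most 60 s per attempt instead.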


Attachments


Links
System                  ID              Private  Priority  Status        Summary                                           Last Updated
Launchpad               1606827         0        None      None          None                                              2016-07-27 09:13:44 UTC
OpenStack gerrit        347708          0        None      MERGED        Don't use exponential back-off for report_state   2020-12-07 19:47:53 UTC
Red Hat Product Errata  RHBA-2016:1759  0        normal    SHIPPED_LIVE  openstack-neutron bug fix advisory                2016-08-24 16:50:04 UTC

Description Marian Krcmarik 2016-07-25 16:16:43 UTC
Description of problem:
openvswitch neutron agents are reported as down on compute nodes for a significant time (about 10 minutes) after all OpenStack controllers (in an HA setup) come back online from a previous failover.
The test is as follows:
1. Take down all the controllers at once.
2. Keep them down for around 5 minutes.
3. Check the status of all neutron agents.

The result is that the openvswitch agent on the compute nodes is reported as down for about 10 minutes after all the controllers are back online (after step 2).

Example from the log: the reboot of all controllers at once happened at 15:19:49 (visible in the log as the agent losing its connection to the rabbitmq server). The controllers came back online at about 15:25:53 and, as the log shows, the agent re-established its connection to the rabbitmq server at 15:28. Then nothing happened for about 10 minutes, and the agent was finally reported as alive at 15:38. That is a ten-minute gap of downtime for the agent which I did not observe with the same test on RHOS8 setups.

2016-07-25 15:19:54.528 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:55.456 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:55.567 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:56.240 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:56.271 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:57.455 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:58.448 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:58.655 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.14:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:19:59.199 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds.
2016-07-25 15:20:00.538 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: timed out. Trying again in 1 seconds.
.
.
A lot of "timed out. Trying again in 1 seconds. messages"
.
.
2016-07-25 15:20:29.593 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.16:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 4 seconds.
.
.
A lot of EHOSTUNREACH messages
.
.
2016-07-25 15:27:12.102 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
.
.
Some ECONNREFUSED messages
.
.
2016-07-25 15:28:20.003 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.026 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.049 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.082 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.195 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:20.223 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:20.224 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:21.023 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.112 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.259 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:21.260 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:30.916 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:28:44.913 6854 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.17.1.13:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-07-25 15:28:45.939 6854 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.17.1.14:5672 via [amqp] client
2016-07-25 15:38:30.921 6854 ERROR neutron.common.rpc [-] Timeout in RPC method report_state. Waiting for 1 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 313, in _report_state
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 86, in report_state
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 155, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     time.sleep(wait)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.force_reraise()
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     six.reraise(self.type_, self.value, self.tb)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 136, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return self._original_context.call(ctxt, method, **kwargs)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID b09d51da5c454dd09318a328495c6bd5
2016-07-25 15:38:32.103 6854 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
2016-07-25 15:38:32.105 6854 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 1087.76 sec
2016-07-25 15:38:32.429 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Agent has just been revived. Doing a full sync.
2016-07-25 15:38:33.750 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-df730a39-d95d-4316-9ede-db28887f765a - - - - -] rpc_loop doing a full sync.
2016-07-25 15:38:33.801 6854 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-df730a39-d95d-4316-9ede-db28887f765a - - - - -] Agent out of sync with plugin!

Version-Release number of selected component (if applicable):
python-neutron-lib-0.0.2-1.el7ost.noarch
openstack-neutron-ml2-8.1.2-1.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.3.8-1.el7ost.noarch
openstack-neutron-lbaas-8.0.0-1.el7ost.noarch
openstack-neutron-metering-agent-8.1.2-1.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
openstack-neutron-common-8.1.2-1.el7ost.noarch
openstack-neutron-bigswitch-agent-2015.3.8-1.el7ost.noarch
python-neutron-8.1.2-1.el7ost.noarch
python-neutron-lbaas-8.0.0-1.el7ost.noarch
openstack-neutron-openvswitch-8.1.2-1.el7ost.noarch
openstack-neutron-8.1.2-1.el7ost.noarch

How reproducible:
Often

Steps to Reproduce:
1. Take down all the controllers at once.
2. Keep them down around 5 minutes.
3. Check the status of all neutron agents.

Actual results:
About 10 minutes of downtime for the openstack neutron openvswitch agent on the compute nodes.

Expected results:
The agents should recover promptly once the controllers are back; this downtime was not observed on RHOS8 setups.

Additional info:

Comment 2 Assaf Muller 2016-07-26 11:45:31 UTC
Please attach an SOS report from a controller and include the timestamps for the reboots.

Comment 3 Assaf Muller 2016-07-26 11:46:05 UTC
Or give us access to the system.

Comment 4 Assaf Muller 2016-07-26 13:54:26 UTC
Assigned to John to chase the info down and to root cause the bug.

Comment 5 John Schwarz 2016-07-26 14:18:24 UTC
Logs from the initial report show that rabbitmq had restarted a number of times (connection-refused errors followed by successful reconnections a few times in a row), and then the connection finally succeeded and a report_state call was initiated. My guess is that rabbitmq might have restarted again afterwards without oslo.messaging noticing.

It would be interesting to see whether this reproduces with the same Neutron versions but an earlier (rhos-8's) oslo.messaging version.

Either way, I talked to Marian and the setup was reprovisioned. They are trying to reproduce this on another setup.

Marian - if/when this is reproduced again, please post the oslo.messaging version in addition to the sosreport. Also, could you try downgrading the oslo.messaging package to rhos-8's version on the problematic node and see if that fixes things?

Comment 6 John Schwarz 2016-07-26 15:46:18 UTC
Marian gave me another setup and we debugged it together. The fault is ultimately https://review.openstack.org/#/c/280595/, which introduces an exponential backoff (actually it's linear, but never mind). This patch was ported upstream to Mitaka, and thus we are encountering this in osp-9.

In short, the patch was intended for potentially long-lasting RPC calls (like "get_routers" when there are a lot of routers), not for short RPC calls that don't involve much computation. The solution I have in mind is to introduce a way for the code to decide for itself whether it wants to use the exponential backoff or not.

I'll begin work on this upstream in the coming days.
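
Below is a minimal, purely illustrative sketch of that direction (the class and attribute names and the 60/600-second values are assumptions for illustration, not the merged patch): each RPC chooses whether it uses the growing back-off or a fixed timeout, so a short call such as report_state always waits at most 60 seconds.

# Illustrative sketch only -- not the merged upstream change.
class TimeoutPolicy(object):
    """Tracks the timeout to use for the next RPC attempt."""

    def __init__(self, start=60, cap=600, use_backoff=True):
        self.timeout = start
        self.cap = cap
        self.use_backoff = use_backoff

    def next_timeout(self):
        """Return the timeout for the next attempt, growing it if enabled."""
        current = self.timeout
        if self.use_backoff:
            self.timeout = min(self.timeout * 2, self.cap)
        return current

# Long, potentially heavy calls (e.g. get_routers) keep the back-off ...
heavy = TimeoutPolicy(use_backoff=True)
# ... while report_state opts out and always waits at most 60 seconds.
report_state = TimeoutPolicy(use_backoff=False)

assert [heavy.next_timeout() for _ in range(4)] == [60, 120, 240, 480]
assert [report_state.next_timeout() for _ in range(4)] == [60, 60, 60, 60]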

Comment 7 John Schwarz 2016-07-27 09:14:12 UTC
Added external trackers for the proposed upstream solution and bug report.

Comment 8 Assaf Muller 2016-07-27 14:39:28 UTC
This only happens if you shut down all three controllers at the same time. I don't think that situation qualifies this bug as a blocker. My recommendation would be to not delay the release because of this bug.

Comment 14 John Schwarz 2016-08-09 06:46:12 UTC
The patch has been merged upstream and I've begun a backport for rhos-9. Once that is done I'll package a new RPM.

Comment 24 errata-xmlrpc 2016-08-24 12:56:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1759.html

