openstack-neutron-2015.1.2-9.el7ost.noarch             Mar 5 2016
openstack-neutron-common-2015.1.2-9.el7ost.noarch      Mar 5 2016
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch       Mar 5 2016
openstack-neutron-ml2-2015.1.2-9.el7ost.noarch         Mar 5 2016
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch Mar 5 2016
openvswitch-2.4.0-1.el7.x86_64                         Feb 22 2016
python-neutron-2015.1.2-9.el7ost.noarch                Mar 5 2016
python-neutron-lbaas-2015.1.2-1.el7ost.noarch          Mar 5 2016
python-neutronclient-2.4.0-2.el7ost.noarch             Mar 5 2016
python-openvswitch-2.4.0-1.el7.noarch                  Mar 5 2016

The following error occurred with over 2,400 ports. The customer hit the issue while port status was being updated for over 2,000 ports every minute:

~~~~~
$ perl -pe "s/, u/\n, u/g" AR_general_timesearch_2016-03-07.log.\ because\ of\ error\:\ Timed\ out\ waiting\ for\ a\ reply\ to\ message\ ID | sed -e "s/, u.*/, u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'/g" | uniq -c | head -100 | tail -57
      1
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log - because of error: Timed out waiting for a reply to message ID
      1 2016-03-06 15:45:38.083 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:11:06.983 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:12:10.258 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:13:13.606 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log-20160306 - because of error: Timed out waiting for a reply to message ID
      1 2016-03-05 20:39:47.640 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:40:51.383 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:41:54.503 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:42:57.800 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:44:01.160 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:45:04.344 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:46:07.848 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:47:11.346 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
~~~~~
Suspecting this could be the cause: there is a UnixDomainWSGIServer in the l3-agent listening for those notifications. When they are received, it calls this method in neutron/agent/l3/ha.py:

~~~~~
    def enqueue_state_change(self, router_id, state):
        LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
                 {'router_id': router_id,
                  'state': state})

        try:
            ri = self.router_info[router_id]
        except AttributeError:
            LOG.info(_LI('Router %s is not managed by this agent. It was '
                         'possibly deleted concurrently.'), router_id)
            return

        self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
        self._update_metadata_proxy(ri, router_id, state)
        self._update_radvd_daemon(ri, state)
        self.state_change_notifier.queue_event((router_id, state))
~~~~~
The error messages went away after the following change in /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini:

~~~~~
[DEFAULT]
rpc_response_timeout = 300
~~~~~
I believe this is simply a case of neutron-server taking more than the default 60 seconds (controlled by rpc_response_timeout, described in comment 4) to respond to the RPC query. The query times out and is sent again, so neutron-server now has to process two queries instead of one. This can go on indefinitely, putting more and more load on the server. As an immediate-term solution you can increase the timeout, as has already been done. Long term, there's a patch upstream that adds automatic exponential back-off for queries such as this: https://review.openstack.org/#/c/280595/
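To illustrate the idea behind the upstream patch: instead of retrying with the same fixed timeout (which just piles duplicate queries onto an already slow server), each timed-out call doubles the timeout used for the next attempt, up to a cap. This is a minimal sketch of the technique only; the class and method names here are hypothetical, not the actual oslo.messaging/Neutron API.

```python
class BackoffTimeout:
    """Sketch of exponential back-off for RPC timeouts (hypothetical names).

    Each timeout doubles the wait for the next attempt, up to a cap, so a
    slow neutron-server sees fewer duplicate queries instead of more.
    """

    def __init__(self, base=60, cap=600):
        self.base = base      # default rpc_response_timeout, in seconds
        self.cap = cap        # upper bound on the back-off
        self.current = base

    def on_timeout(self):
        # A call timed out: double the timeout for the next attempt.
        self.current = min(self.current * 2, self.cap)
        return self.current

    def on_success(self):
        # A reply arrived: reset to the default timeout.
        self.current = self.base
```

With the 60-second default, three consecutive timeouts raise the timeout to 480 seconds, giving the server room to catch up rather than receiving the same query again every minute.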
(In reply to Faiaz Ahmed from comment #2)
> Suspecting this could be the cause
> There is a UnixDomainWSGIServer in the l3-agent listening for those
> notifications. When they are received, they call this method in
> neutron/agent/l3/ha.py:
>
>     def enqueue_state_change(self, router_id, state):
>         LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
>                  {'router_id': router_id,
>                   'state': state})
>
>         try:
>             ri = self.router_info[router_id]
>         except AttributeError:
>             LOG.info(_LI('Router %s is not managed by this agent. It was '
>                          'possibly deleted concurrently.'), router_id)
>             return
>
>         self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
>         self._update_metadata_proxy(ri, router_id, state)
>         self._update_radvd_daemon(ri, state)
>         self.state_change_notifier.queue_event((router_id, state))

(For future reference, the L3 code really has nothing to do with this issue. The request that's timing out is fired from the OVS agent and is processed by neutron-server ML2 code.)
https://review.openstack.org/#/c/280595/ was merged and will be available in the next release of OSP 8.
According to our records, this should be resolved by openstack-neutron-7.2.0-5.el7ost. This build is available now.
Please add steps to reproduce to this issue.
This RHBZ is tracking https://bugs.launchpad.net/neutron/+bug/1554332. One way to assert the new back-off functionality of agent state reports would be to inject a fault (such as shutting down the rabbitmq services), observe that the OVS agents (or DHCP/L3 agents) cannot send state reports, and confirm via DEBUG-level logging that the retries happen with increasing delays (i.e. not every 30 seconds, but with a growing interval between attempts).
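When eyeballing the DEBUG logs during such a verification, the check is simply that the gap between consecutive retry timestamps keeps growing. A small helper like the following could make that check mechanical; it is a sketch under the assumption that the retry timestamps have already been extracted from the log (the function names are mine, not from any Neutron tooling).

```python
from datetime import datetime

def retry_intervals(timestamps, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Return the delays in seconds between consecutive retry timestamps."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

def is_backing_off(timestamps):
    """True if each retry waited longer than the previous one."""
    deltas = retry_intervals(timestamps)
    return all(later > earlier for earlier, later in zip(deltas, deltas[1:]))
```

For example, retries at 15:45:00, 15:46:00, and 15:48:00 (gaps of 60 s then 120 s) indicate back-off, whereas a fixed 30-second cadence does not.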
I sat down with jschwarz on this. As per rhbz#1365378, exponential back-off is disabled for state reports, so the reproduction steps above don't work. However, since that bug also confirmed that exponential back-off is working in general (and merely needed to be disabled for one specific case), I'm fine with verifying this bug on that basis. Checked on version openstack-neutron-7.2.0-5.el7ost.noarch (GA).