Bug 1315156 - neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices
Summary: neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError:...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Assaf Muller
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-07 05:44 UTC by Faiaz Ahmed
Modified: 2020-04-15 14:24 UTC (History)
11 users (show)

Fixed In Version: openstack-neutron-7.1.1-3.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: If a server operation takes long enough to trigger a timeout on an agent call to the server, the agent will just give up and issue a new call immediately.

Consequence: First, if the server is busy and the requests take more than the timeout window to fulfill, the agent will just continually hammer the server with calls that are bound to fail until the server load is reduced enough to fulfill the query. If the load is a result of calls from agents, this leads to a stampeding effect where the server will be unable to fulfill requests until operator intervention. Second, the server will build a backlog of call requests that makes the window of time to process a message smaller as the backlog grows. With enough clients making calls, the timeout threshold can be crossed before a call even starts to process. For example, if it takes the server 6 seconds to process a given call and the clients are configured with a 60 second timeout, 30 agents making the call simultaneously will result in a situation where 20 of the agents will never get a response. The first 10 will get their calls filled and the last 20 will end up in a loop where the server is just spending time replying to calls that are expired by the time it processes them.

Fix: This adds an exponential backoff mechanism for timeout values on any RPC calls in Neutron that don't explicitly request a timeout value.

Result: This will prevent the clients from DDoSing the server by giving up on requests and retrying them before they are fulfilled.
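The 30-agent example in the doc text can be checked with a quick back-of-the-envelope script (the numbers are the ones given there; the strictly sequential processing model is a simplification for illustration):

```python
# Server processes one RPC call at a time, taking SERVICE_TIME seconds
# per call; CLIENTS agents fire their calls simultaneously, each with
# a fixed TIMEOUT. These constants match the doc text's example.
SERVICE_TIME = 6
TIMEOUT = 60
CLIENTS = 30

# The i-th queued call begins processing at i * SERVICE_TIME seconds.
# If that is at or past the timeout, the caller has already given up
# and re-sent the call, so the server's work on it is wasted.
served = sum(1 for i in range(CLIENTS) if i * SERVICE_TIME < TIMEOUT)
wasted = CLIENTS - served

print(served, wasted)  # prints: 10 20
```

This reproduces the doc text's figures: 10 agents get replies in time, and the remaining 20 retry forever while the server answers already-expired calls.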
Clone Of:
Environment:
Last Closed: 2017-01-11 19:26:51 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1554332 0 None None None 2017-01-04 12:44:38 UTC
OpenStack gerrit 314319 0 None MERGED Add exponential back-off RPC client 2020-06-05 19:50:12 UTC

Description Faiaz Ahmed 2016-03-07 05:44:12 UTC
openstack-neutron-2015.1.2-9.el7ost.noarch Mar 5 2016
openstack-neutron-common-2015.1.2-9.el7ost.noarch Mar 5 2016
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch Mar 5 2016
openstack-neutron-ml2-2015.1.2-9.el7ost.noarch Mar 5 2016
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch Mar 5 2016
openvswitch-2.4.0-1.el7.x86_64 Feb 22 2016
python-neutron-2015.1.2-9.el7ost.noarch Mar 5 2016
python-neutron-lbaas-2015.1.2-1.el7ost.noarch Mar 5 2016
python-neutronclient-2.4.0-2.el7ost.noarch Mar 5 2016
python-openvswitch-2.4.0-1.el7.noarch Mar 5 2016

- The following error occurs with over 2400 ports

~~~~~
Here is what the customer saw: the issue occurs when updating port status for over 2000 ports, recurring roughly every minute.

$ perl -pe "s/, u/\n, u/g" AR_general_timesearch_2016-03-07.log.\ because\ of\ error\:\ Timed\ out\ waiting\ for\ a\ reply\ to\ message\ ID |sed -e "s/, u.*/, u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'/g"|uniq -c |head -100|tail -57
      1 
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log -  because of error: Timed out waiting for a reply to message ID
      1 2016-03-06 15:45:38.083 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:11:06.983 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:12:10.258 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:13:13.606 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log-20160306 -  because of error: Timed out waiting for a reply to message ID
      1 2016-03-05 20:39:47.640 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:40:51.383 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:41:54.503 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:42:57.800 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:44:01.160 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:45:04.344 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:46:07.848 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:47:11.346 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
~~~~~

Comment 2 Faiaz Ahmed 2016-03-07 05:46:18 UTC
Suspecting this could be the cause:
There is a UnixDomainWSGIServer in the l3-agent listening for those notifications.  When they are received, they call this method in neutron/agent/l3/ha.py:

def enqueue_state_change(self, router_id, state):
    LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
             {'router_id': router_id,
              'state': state})

    try:
        ri = self.router_info[router_id]
    except AttributeError:
        LOG.info(_LI('Router %s is not managed by this agent. It was '
                     'possibly deleted concurrently.'), router_id)
        return

    self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
    self._update_metadata_proxy(ri, router_id, state)
    self._update_radvd_daemon(ri, state)
    self.state_change_notifier.queue_event((router_id, state))

Comment 4 Faiaz Ahmed 2016-03-07 07:09:18 UTC
The error message went away after the following change:

/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
[DEFAULT]
rpc_response_timeout = 300

Comment 5 Assaf Muller 2016-03-07 23:39:42 UTC
I believe this is simply a case of neutron-server taking more than the default of 60 seconds (controlled by rpc_response_timeout, described in comment 4) to respond to the RPC query. The query times out and is sent again, and now neutron-server has to process two queries instead of one. This goes on potentially forever, causing more and more load on the server.

As an immediate-term solution you can increase the timeout, as has already been done. Long term, there is a patch upstream that performs automatic exponential back-off for queries such as this:
https://review.openstack.org/#/c/280595/
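The back-off approach in that patch can be sketched roughly as follows. This is a minimal illustration only; the class and method names here are hypothetical, not the actual Neutron/oslo.messaging implementation:

```python
class BackoffRPCClient:
    """Sketch of per-method exponential timeout back-off.

    Each RPC method starts at a base timeout; every timeout failure
    doubles it (up to a cap), so a busy server stops being hammered
    with retries that are bound to expire again. A successful reply
    resets the method back to the base timeout.
    """

    def __init__(self, base_timeout=60, max_timeout=600):
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self._timeouts = {}  # current timeout per RPC method name

    def timeout_for(self, method):
        # Timeout to use for the next call of this method.
        return self._timeouts.get(method, self.base_timeout)

    def record_timeout(self, method):
        # Call timed out: double this method's timeout, capped.
        current = self.timeout_for(method)
        self._timeouts[method] = min(current * 2, self.max_timeout)

    def record_success(self, method):
        # Call succeeded: fall back to the base timeout.
        self._timeouts.pop(method, None)
```

With this in place, each failed retry waits longer for its reply instead of immediately re-queueing work on a server that is already behind, which breaks the stampede described in comment 5.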

Comment 6 Assaf Muller 2016-03-07 23:41:04 UTC
(In reply to Faiaz Ahmed from comment #2)
> Suspecting this could be the cause
> There is a UnixDomainWSGIServer in the l3-agent listening for those
> notifications.  When they are received, they call this method in
> neutron/agent/l3/ha.py:
> 
>  def enqueue_state_change(self, router_id, state):
>         LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
>                  {'router_id': router_id,
>                   'state': state})
> 
>         try:
>             ri = self.router_info[router_id]
>         except AttributeError:
>             LOG.info(_LI('Router %s is not managed by this agent. It was '
>                          'possibly deleted concurrently.'), router_id)
>             return
> 
>         self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
>         self._update_metadata_proxy(ri, router_id, state)
>         self._update_radvd_daemon(ri, state)
>         self.state_change_notifier.queue_event((router_id, state))

(For future reference, the L3 code really has nothing to do with this issue. The request that is timing out is fired from the OVS agent and is processed by neutron-server ML2 code.)

Comment 7 Assaf Muller 2016-06-01 00:31:03 UTC
https://review.openstack.org/#/c/280595/ was merged and will be available in the next release of OSP 8.

Comment 11 Jon Schlueter 2017-01-03 13:16:30 UTC
According to our records, this should be resolved by openstack-neutron-7.2.0-5.el7ost.  This build is available now.

Comment 12 Eran Kuris 2017-01-04 09:56:26 UTC
Please add steps to reproduce to this issue.

Comment 13 Assaf Muller 2017-01-04 12:44:39 UTC
This RHBZ is tracking https://bugs.launchpad.net/neutron/+bug/1554332. One way to verify the new back-off functionality of agent state reports would be to inject some sort of fault (such as shutting down the rabbitmq services), observe that OVS agents (or DHCP/L3 agents) cannot send state reports, and confirm with DEBUG-level logging that retries happen with increasing delays (i.e., not every 30 seconds, but an increasing amount of time between each retry).

Comment 14 Eran Kuris 2017-01-09 09:00:44 UTC
I sat down with jschwarz on this. As per rhbz#1365378, exponential back-off is disabled for state reports, so the reproduction steps don't work.

However, since that bug also complained that exponential back-off is working (and needed to be disabled for a specific case), I'm fine with verifying this bug on that basis.

Checked on version openstack-neutron-7.2.0-5.el7ost.noarch (GA).

