| Summary: | neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Faiaz Ahmed <fahmed> |
| Component: | openstack-neutron | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 (Kilo) | CC: | amuller, apevec, chrisw, dchia, jraju, lhh, mfuruta, nyechiel, srevivo, tfreger, tquinlan |
| Target Milestone: | async | Keywords: | TestOnly, ZStream |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-7.1.1-3.el7ost | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
If a server operation takes long enough to trigger a timeout on an agent call, the agent simply gives up and immediately issues a new call.
Consequence:
First, if the server is busy and requests take longer than the timeout window to fulfill, the agent continually hammers the server with calls that are bound to fail until server load drops enough to fulfill the query. If that load is itself the result of agent calls, this produces a stampede in which the server cannot fulfill any requests until an operator intervenes.
Second, the server builds up a backlog of call requests, and the window of time left to process each message shrinks as the backlog grows. With enough clients making calls, the timeout threshold can be crossed before a call even starts to be processed. For example, if the server takes 6 seconds to process a given call and the clients are configured with a 60-second timeout, 30 agents making the call simultaneously leaves 20 agents that never get a response: the first 10 have their calls answered, and the last 20 end up in a loop in which the server spends its time replying to calls that have already expired by the time it processes them.
Fix:
An exponential back-off mechanism is added to the timeout values of any RPC calls in Neutron that do not explicitly request a timeout value.
Result:
This prevents the clients from DDoSing the server by giving up on requests and retrying them before they are fulfilled.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-11 19:26:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
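The worked example in the Doc Text (6-second server-side calls, a 60-second client timeout, 30 simultaneous agents) can be checked with a quick calculation. The helper below is purely illustrative and not part of Neutron; it assumes the server drains the queue serially:

```python
def served_before_timeout(agents, service_time, timeout):
    """Count how many of `agents` simultaneous callers get a reply
    before their client-side timeout, assuming the server answers
    the queued calls one at a time.

    The i-th call in the queue completes (i + 1) * service_time
    seconds after the calls arrive.
    """
    return sum(1 for i in range(agents)
               if (i + 1) * service_time <= timeout)


# 30 agents at 6 s per call with a 60 s timeout: only the first 10
# replies land in time; the other 20 agents give up and retry.
print(served_before_timeout(30, 6, 60))  # -> 10
```

Raising the timeout to 300 seconds, as done below, lets all 30 calls complete (30 * 6 = 180 s < 300 s), which is why the workaround stops the error.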
I suspect this could be the cause. There is a UnixDomainWSGIServer in the l3-agent listening for those notifications. When one is received, it calls this method in neutron/agent/l3/ha.py:
~~~~~
def enqueue_state_change(self, router_id, state):
    LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
             {'router_id': router_id,
              'state': state})
    try:
        ri = self.router_info[router_id]
    except AttributeError:
        LOG.info(_LI('Router %s is not managed by this agent. It was '
                     'possibly deleted concurrently.'), router_id)
        return

    self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
    self._update_metadata_proxy(ri, router_id, state)
    self._update_radvd_daemon(ri, state)
    self.state_change_notifier.queue_event((router_id, state))
~~~~~
The error message went away after setting the following in /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini:

[DEFAULT]
rpc_response_timeout = 300

I believe this is simply a case of neutron-server taking more than the default 60 seconds (controlled by rpc_response_timeout, described in comment 4) to respond to the RPC query. The query times out and is sent again, so neutron-server now has to process two queries instead of one. This can go on indefinitely, putting ever more load on the server. As an immediate-term workaround you can increase the timeout, as has already been done. Long term, there is an upstream patch that applies an automatic exponential back-off to queries such as this one: https://review.openstack.org/#/c/280595/

(In reply to Faiaz Ahmed from comment #2, which quoted the enqueue_state_change code above.) For future reference, the L3 code really has nothing to do with this issue. The request that is timing out is fired from the OVS agent and is processed by neutron-server's ML2 code.

https://review.openstack.org/#/c/280595/ was merged and will be available in the next release of OSP 8.

According to our records, this should be resolved by openstack-neutron-7.2.0-5.el7ost. This build is available now.

Please add steps to reproduce to this issue.
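The upstream patch referenced above adds an exponential back-off to RPC call timeouts. The sketch below is a minimal illustration of the idea only, not Neutron's actual implementation; the function name and parameters are invented for this example:

```python
def call_with_backoff(call, base_timeout=60, max_timeout=600, attempts=5):
    """Retry `call`, doubling the timeout passed to it after each
    failure, so that a slow server sees fewer, more patient retries
    instead of an immediate re-send at the same 60-second timeout.
    """
    timeout = base_timeout
    for _ in range(attempts):
        try:
            return call(timeout=timeout)
        except TimeoutError:
            # Back off: double the timeout, capped at max_timeout.
            timeout = min(timeout * 2, max_timeout)
    raise TimeoutError('no reply after %d attempts' % attempts)
```

With this scheme a call that keeps timing out is retried at 60 s, 120 s, 240 s, and so on, rather than hammering the server every 60 seconds with requests that are bound to expire in its backlog.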
This RHBZ is tracking https://bugs.launchpad.net/neutron/+bug/1554332.

One way to assert the new back-off functionality of agent state reports would be to inject some sort of fault (like shutting down the rabbitmq services), observe that the OVS agents (or DHCP/L3 agents) cannot send state reports, and then, with DEBUG-level logging, confirm that the retries happen with increasing delays (i.e. not every 30 seconds, but with a growing amount of time between each retry).

I sat down with jschwarz on this. As per rhbz#1365378, exponential back-off is disabled for state reports, so the reproduction steps above do not work. However, since that bug also complained that exponential back-off was working (and needed to be disabled for a specific case), I'm fine with verifying this bug on that basis. Checked on version openstack-neutron-7.2.0-5.el7ost.noarch (GA).
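The verification approach above, confirming from DEBUG logs that retries are spaced at increasing rather than fixed intervals, can be automated with a small check on the retry timestamps. This helper is hypothetical and not part of any Neutron tooling:

```python
def gaps_are_increasing(timestamps):
    """Return True if the gaps between consecutive retry timestamps
    (in seconds) strictly grow -- the signature of exponential
    back-off in the logs, as opposed to a fixed 30-second cadence.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(late > early for early, late in zip(gaps, gaps[1:]))
```

For example, retries at 0, 30, 90, and 210 seconds (gaps of 30, 60, 120) pass the check, while a fixed cadence of 0, 30, 60, 90 seconds fails it.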
Installed versions:

- openstack-neutron-2015.1.2-9.el7ost.noarch (Mar 5 2016)
- openstack-neutron-common-2015.1.2-9.el7ost.noarch (Mar 5 2016)
- openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch (Mar 5 2016)
- openstack-neutron-ml2-2015.1.2-9.el7ost.noarch (Mar 5 2016)
- openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch (Mar 5 2016)
- openvswitch-2.4.0-1.el7.x86_64 (Feb 22 2016)
- python-neutron-2015.1.2-9.el7ost.noarch (Mar 5 2016)
- python-neutron-lbaas-2015.1.2-1.el7ost.noarch (Mar 5 2016)
- python-neutronclient-2.4.0-2.el7ost.noarch (Mar 5 2016)
- python-openvswitch-2.4.0-1.el7.noarch (Mar 5 2016)

The following error was seen with over 2400 ports. Here is what the customer saw when updating port status on over 2000 ports every minute:

~~~~~
$ perl -pe "s/, u/\n, u/g" AR_general_timesearch_2016-03-07.log.\ because\ of\ error\:\ Timed\ out\ waiting\ for\ a\ reply\ to\ message\ ID | sed -e "s/, u.*/, u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'/g" | uniq -c | head -100 | tail -57
      1
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log - because of error: Timed out waiting for a reply to message ID
      1 2016-03-06 15:45:38.083 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:11:06.983 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:12:10.258 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-06 16:13:13.606 10474 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2291 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1
      1 ** sosreport-ttbosneutron0002-20160306161323/sosreport-ttbosneutron0002-20160306161323/var/log/neutron/openvswitch-agent.log-20160306 - because of error: Timed out waiting for a reply to message ID
      1 2016-03-05 20:39:47.640 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:40:51.383 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:41:54.503 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:42:57.800 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:44:01.160 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:45:04.344 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:46:07.848 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
      1 2016-03-05 20:47:11.346 20303 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent DeviceListRetrievalError: Unable to retrieve port details for devices: set([u'9301ae97-842d-41ad-9b97-84957b00841b'
   2248 , u'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxxxx'
~~~~~