Bug 1315114

Summary: UnixDomainWSGIServer keepalived state change listener in L3 agent has an uncapped number of threads, overloading node
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: openstack-neutron
Assignee: anil venkata <vkommadi>
Status: CLOSED ERRATA
QA Contact: Eran Kuris <ekuris>
Severity: urgent
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: amuller, bperkins, bschmaus, chris.fields, chrisw, dchia, dhill, jraju, mflusche, mlopes, nyechiel, srevivo, vkommadi
Target Milestone: async
Keywords: ZStream
Target Release: 7.0 (Kilo)
Flags: clincoln: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-2015.1.4-10.el7ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, the keepalived state change server in the L3 agent handled up to wsgi_default_pool_size (100) concurrent requests, and that many concurrent requests would create a heavy CPU load on the L3 agent node. With this update, a new option, `ha_keepalived_state_change_server_threads`, configures the number of concurrent threads spawned for keepalived state change server connection requests. Higher values increase the CPU load on the agent nodes; the default is half the number of CPUs present on the node. Operators can therefore tune the number of threads to suit their environment: with more threads, simultaneous state change requests for multiple HA routers are handled faster, while a lower value of ha_keepalived_state_change_server_threads avoids high load on the L3 agents.
Story Points: ---
Clone Of:
: 1381618 (view as bug list)
Environment:
Last Closed: 2017-01-19 13:25:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1381618, 1381619, 1381620    

Description David Hill 2016-03-06 23:00:00 UTC
Description of problem:
When restarting neutron-openvswitch-agent on network nodes with a large number of ports, the server crashes when accept_ra is set on the qrouter namespaces for IPv6, due to the number of concurrent sysctl calls.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Restart neutron-openvswitch-agent and wait for neutron-l3-agent to do its work

Actual results:
Server crashes

Expected results:
Server survives

Additional info:

Comment 4 David Hill 2016-03-06 23:18:12 UTC
As a quick workaround, we've turned off ipv6 with the following KCS article: https://access.redhat.com/solutions/8709
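
(The KCS article above requires portal access and its exact steps are not reproduced here. A typical way to disable IPv6 via sysctl on RHEL 7 - an assumption for illustration, not a quote from the article - looks like this:)

# Assumed example only; the KCS article above is the authoritative procedure.
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
# To persist across reboots, add the same keys to a file under /etc/sysctl.d/
# and run 'sysctl --system'.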

Comment 5 Chris Fields 2016-04-06 16:07:35 UTC
The code changes we made to work around the neutron server crash are in support case https://access.redhat.com/support/cases/#/case/01594598 but are reproduced here:

--- neutron/agent/l3/ha.py-orig 2016-03-07 13:20:36.128333368 -0600
+++ neutron/agent/l3/ha.py 2016-03-06 21:25:21.198378877 -0600
@@ -79,7 +79,7 @@
     def run(self):
         server = agent_utils.UnixDomainWSGIServer(
-            'neutron-keepalived-state-change')
+            'neutron-keepalived-state-change', num_threads=4)
         server.start(KeepalivedStateChangeHandler(self.agent),
                      self.get_keepalived_state_change_socket_path(self.conf),
                      workers=0,

--- neutron/agent/linux/utils.py-orig 2016-03-07 13:23:25.476246117 -0600
+++ neutron/agent/linux/utils.py 2016-03-07 13:26:31.587545191 -0600
@@ -379,11 +379,11 @@
 class UnixDomainWSGIServer(wsgi.Server):
-    def __init__(self, name):
+    def __init__(self, name, num_threads=8):
         self._socket = None
         self._launcher = None
         self._server = None
-        super(UnixDomainWSGIServer, self).__init__(name)
+        super(UnixDomainWSGIServer, self).__init__(name, num_threads)

     def start(self, application, file_socket, workers, backlog, mode=None):
         self._socket = eventlet.listen(file_socket,

Comment 6 Chris Fields 2016-04-06 16:19:20 UTC
We got hit with this bug again when we applied 
https://rhn.redhat.com/errata/RHBA-2016-0499.html.

Comment 8 Assaf Muller 2016-04-07 13:59:56 UTC
I'm assuming you're using Pacemaker to restart the OVS agent, which then also restarts the L3 agent, and that by "Server crashes" you mean that the node itself (the hardware) crashes? Can we get any logs, a kernel panic, or anything like that?



(In reply to Chris Fields from comment #5)
> The code changes we made to work around the neutron server crash are in
> support case https://access.redhat.com/support/cases/#/case/01594598 but are
> re-produced here:
> 
> --- neutron/agent/l3/ha.py-orig 2016-03-07 13:20:36.128333368 -0600 
> +++ neutron/agent/l3/ha.py 2016-03-06 21:25:21.198378877 -0600 
> @@ -79,7 +79,7 @@
> def run(self): 
> server = agent_utils.UnixDomainWSGIServer( 
> - 'neutron-keepalived-state-change') 
> + 'neutron-keepalived-state-change', num_threads=4)     
> server.start(KeepalivedStateChangeHandler(self.agent),
> self.get_keepalived_state_change_socket_path(self.conf), 
> workers=0, 
> 
> --- neutron/agent/linux/utils.py-orig 2016-03-07 13:23:25.476246117 -0600 
> +++ neutron/agent/linux/utils.py 2016-03-07 13:26:31.587545191 -0600 
> @@ -379,11 +379,11 @@ 
> class UnixDomainWSGIServer(wsgi.Server): 
> - def __init__(self, name): 
> + def __init__(self, name, num_threads=8): 
> self._socket = None
> self._launcher = None
> self._server = None
> - super(UnixDomainWSGIServer, self).__init__(name) 
> + super(UnixDomainWSGIServer, self).__init__(name, num_threads) 
> 
> def start(self, application, file_socket, workers, backlog, mode=None): 
> self._socket = eventlet.listen(file_socket,

I don't see anything in the case that explains this change. Can you explain the rationale behind it and how you arrived at it?

Comment 9 Chris Fields 2016-04-07 20:05:01 UTC
We do not use Pacemaker to start neutron services. We are very prescriptive about the order and the timing in which they start.

By reducing the number of threads in the code we prevent the load average on the neutron nodes from going so high that they have to be power cycled. Here are the descriptions of the code changes above from support case https://access.redhat.com/support/cases/#/case/01594598:

"Christenson, Mark on Mar 07 2016 at 01:29 PM -06:00
Patches to mitigate the load average issue caused by keepalived state transitions, as noted earlier.  Note that the default and explicit values for num_threads are somewhat arbitrarily chosen, and likely should be higher.  Should also ultimately be a configuration item."

"Christenson, Mark on Mar 06 2016 at 08:14 PM -06:00
It would seem that all of those 'ip netns exec ... sysctl ...' processes were caused by the neutron-l3-agent getting state change notification from the neutron-keepalived-state-change processes.

There is a UnixDomainWSGIServer in the l3-agent listening for those notifications.  When they are received, they call this method in neutron/agent/l3/ha.py:

    def enqueue_state_change(self, router_id, state):
        LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
                 {'router_id': router_id,
                  'state': state})

        try:
            ri = self.router_info[router_id]
        except AttributeError:
            LOG.info(_LI('Router %s is not managed by this agent. It was '
                         'possibly deleted concurrently.'), router_id)
            return

        self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
        self._update_metadata_proxy(ri, router_id, state)
        self._update_radvd_daemon(ri, state)
        self.state_change_notifier.queue_event((router_id, state))


Note that the first thing it does is update ipv6 ra.  It looks like the default number of threads for that WSGI is 1000, so it would explain why we saw so many of those processes.  One question I have is whether the problem is just because of the number of processes, or is there some other kind of contention which is locking them up permanently.

I am going to attempt to change that thread count and see what happens."

Lastly, there are sosreports attached to this support case.
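
(One quick way - an assumed convenience, not something taken from the support case - to watch the burst of 'ip netns exec ... sysctl' processes described above while reproducing:)

# Assumed helper: count the concurrent sysctl calls made inside router namespaces.
watch -n 1 "ps -ef | grep '[s]ysctl' | grep -c 'ip netns exec'"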

Comment 17 anil venkata 2016-09-27 13:38:29 UTC
https://review.openstack.org/#/c/317616/ is in review. It makes num_threads for the state change server configurable.
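
For context, a rough sketch of what such a configurable option could look like, following the shape of the workaround diff in comment 5. The module layout, oslo.config registration, and default calculation below are assumptions for illustration, not the content of the review; only the option name and the "half the CPUs" default come from the doc text above.

import multiprocessing

from oslo_config import cfg

OPTS = [
    cfg.IntOpt('ha_keepalived_state_change_server_threads',
               default=max(1, multiprocessing.cpu_count() // 2),
               help='Number of concurrent threads for the keepalived state '
                    'change server. Higher values increase the CPU load on '
                    'the agent nodes.'),
]

# Registered against the L3 agent config, the option would then replace the
# hard-coded value from the workaround:
#
#     server = agent_utils.UnixDomainWSGIServer(
#         'neutron-keepalived-state-change',
#         num_threads=self.conf.ha_keepalived_state_change_server_threads)

Operators would then tune it in the L3 agent configuration, e.g. ha_keepalived_state_change_server_threads = 4 under [DEFAULT] in l3_agent.ini (section placement assumed).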

Comment 18 anil venkata 2016-11-21 09:36:48 UTC
Change [1] has been merged into the rhos-7.0-patches branch. Will create the build soon.

[1] https://code.engineering.redhat.com/gerrit/#/c/86087/

Comment 20 Eran Kuris 2016-12-01 10:28:47 UTC
Verified on OSP7:
[root@controller-2 heat-admin]# rpm -qa | grep neutron
openstack-neutron-openvswitch-2015.1.4-11.el7ost.noarch

Created 10 routers.
Rebooted all controllers except controller 1.
Checked that each keepalived-state-change process has only one thread.
Did not see any with more than one.


[root@controller-2 heat-admin]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'
5890
6167
6574
6862
7058
7426
7756
8056
8363
8738
31537
[root@controller-2 heat-admin]# sudo gdb --pid 5890
(gdb) info threads 
  Id   Target Id         Frame 
* 1    Thread 0x7f7c6ba13740 (LWP 5890) "neutron-keepali" 0x00007f7c6a84bcf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
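
(An equivalent check that avoids attaching gdb to every PID - an assumed convenience, not part of the original verification - is to read the NLWP (kernel thread count) column from ps:)

# Assumed alternative to gdb: NLWP is the kernel thread count per process.
for pid in $(ps -ef | grep '[k]eepalived-state-change' | awk '{ print $2 }'); do
    printf '%s: %s thread(s)\n' "$pid" "$(ps -o nlwp= -p "$pid" | tr -d ' ')"
done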

Comment 23 errata-xmlrpc 2017-01-19 13:25:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0159.html