Bug 1381620

Summary: UnixDomainWSGIServer keepalived state change listener in L3 agent has an uncapped number of threads, overloading node
Product: Red Hat OpenStack
Reporter: anil venkata <vkommadi>
Component: openstack-neutron
Assignee: anil venkata <vkommadi>
Status: CLOSED ERRATA
QA Contact: Alexander Stafeyev <astafeye>
Severity: urgent
Priority: high
Version: 10.0 (Newton)
CC: amuller, bperkins, bschmaus, chris.fields, chrisw, dchia, ddomingo, dhill, jraju, jschluet, jschwarz, mflusche, nyechiel, srevivo, tfreger, vkommadi
Target Milestone: rc
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-9.0.0-1.2.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the maximum number of client connections (i.e., greenlets spawned at a time) that the WSGI server could open at any time was set to 100 by 'wsgi_default_pool_size'. While this setting was adequate for the OpenStack Networking API server, it allowed the state change server to create heavy CPU load on the L3 agent, which caused the agent to crash. With this release, you can use the new 'ha_keepalived_state_change_server_threads' setting to configure the number of threads in the state change server. Client connections are no longer limited by 'wsgi_default_pool_size', thereby avoiding an L3 agent crash when many state change server threads are spawned.
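
For reference, a minimal configuration sketch for the new option (the section and the value shown here are illustrative only; per comment 5 the default appears to scale with the host's CPU count, so check the packaged l3_agent.ini for the actual default):

# /etc/neutron/l3_agent.ini
[DEFAULT]
# Cap the number of keepalived state change server threads in the L3 agent.
ha_keepalived_state_change_server_threads = 4

The neutron-l3-agent service must be restarted for the change to take effect.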
Clone Of: 1381619
Last Closed: 2016-12-14 16:07:37 UTC
Type: Bug
Bug Depends On: 1315114, 1381618, 1381619

Comment 3 Toni Freger 2016-11-20 12:00:14 UTC
Hi Anil,

Please provide steps to reproduce.
I have the latest version, with 3 controllers and 2 computes.
Do I need an IPv6 network?
How many ports should openvswitch host?
Any additional info you can provide in order to verify the fix? 

Thanks,
Toni

Comment 4 anil venkata 2016-11-21 09:13:43 UTC
Hi Toni

There are no direct steps to reproduce this issue.
This [1] is the original bug reported on OSP7. The bug description says: "When restarting neutron-openvswitch-agent on network nodes with lots of ports, the server crashes". In [2] they worked out code changes to avoid the crash. These changes are related to limiting the spawning of keepalived state change workers, so I tried to test the effect on CPU load of spawning a large number of keepalived state change threads. For this:

1) I created many HA routers (close to 150, I think).
2) Made sure controller1 had all the "master" HA routers (and the HA routers on the other node, controller2, were backups).
3) Used a script to bring down the HA interface in all of controller1's HA routers (a sketch of such a script is below). The keepalived state change server threads are then spawned simultaneously on controller2, increasing the CPU load on that node.

I tried this with 2 controller nodes.
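
For reference, step 3 can be scripted roughly like this (a sketch, not the exact script used; it assumes the default qrouter- namespace and ha- interface naming of the L3 agent, and is run on controller1):

# Bring down the HA (VRRP) interface in every router namespace on this node.
for ns in $(ip netns list | awk '/^qrouter-/ {print $1}'); do
    ha_if=$(ip netns exec "$ns" ip -o link show | grep -o 'ha-[0-9a-f-]*' | head -n1)
    [ -n "$ha_if" ] && ip netns exec "$ns" ip link set dev "$ha_if" down
done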

I am not sure this bug is worth testing.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1315114
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1315114#c5

Comment 5 Assaf Muller 2016-11-21 13:26:56 UTC
There should be a way to check how many threads are running in a Python process. One way to verify is to use such a tool to assert that we don't have more than $num_cpus / 2 keepalived_state_change_monitor threads in the L3 agent process.
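
For instance, on Linux the kernel-level thread count of a process can be read without any extra tooling (a sketch; <l3-agent-pid> is a placeholder for the actual L3 agent pid):

ps -o nlwp= -p <l3-agent-pid>

or equivalently, by looking at the "Threads:" line in /proc/<l3-agent-pid>/status.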

Comment 6 John Schwarz 2016-11-22 06:55:52 UTC
One can use gdb (the GNU debugger) to display this information:

1. Run "gdb --pid <pid>" (install gdb if it's not already present),
2. Run "info threads" to display the number of threads running. An example output can be:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fc25be0c700 (LWP 20895) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
  2    Thread 0x7fc253594700 (LWP 20896) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6


which signifies 2 threads.

Alternatively, one can use this one-liner to display the information (still needs gdb):

echo "info threads" > /tmp/cmd; gdb --pid 21415 < /tmp/cmd

Comment 7 Alexander Stafeyev 2016-11-23 09:11:55 UTC
(In reply to John Schwarz from comment #6)

Created 10 routers and rebooted all controllers except controller-1.
Checked that the keepalived-state-change processes each had only one thread; did not see any with more than one.

[root@controller-1 ~]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'

[root@controller-1 ~]# sudo gdb --pid 453603

(gdb) info threads 
  Id   Target Id         Frame 
* 1    Thread 0x7f1e813c5740 (LWP 219061) "neutron-keepali" 0x00007f1e801ffcf3 in __epoll_wait_nocancel ()
   from /lib64/libc.so.6
(gdb) quit


Same behavior with 30 and 80 routers. 
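
To repeat the check across all of the monitor processes in one go, something like this should work (a sketch; assumes pgrep is available and matches the keepalived-state-change process name used above):

for pid in $(pgrep -f keepalived-state-change); do
    echo "$pid: $(ps -o nlwp= -p "$pid") threads"
done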


If this is enough for verification, we can verify.

Comment 8 anil venkata 2016-11-23 13:49:42 UTC
That verification is fine.

Comment 10 errata-xmlrpc 2016-12-14 16:07:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html