1381620 – UnixDomainWSGIServer keepalived state change listener in L3 agent has an uncapped number of threads, overloading node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	rc
Target Release:	10.0 (Newton)
Assignee:	anil venkata
QA Contact:	Alexander Stafeyev
Docs Contact:
URL:
Whiteboard:
Depends On:	1315114 1381618 1381619
Blocks:

Reported:	2016-10-04 14:59 UTC by anil venkata
Modified:	2016-12-14 16:07 UTC
CC List:	16 users
Fixed In Version:	openstack-neutron-9.0.0-1.2.el7ost
Doc Type:	Bug Fix
Doc Text:	Previously, the maximum number of client connections (i.e., greenlets spawned at a time) that the WSGI server could open was set to 100 by 'wsgi_default_pool_size'. While this setting was adequate for the OpenStack Networking API server, the state change server created heavy CPU loads on the L3 agent, which caused the agent to crash. With this release, you can use the new 'ha_keepalived_state_change_server_threads' setting to configure the number of threads in the state change server. Client connections to the state change server are no longer governed by 'wsgi_default_pool_size', which avoids an L3 agent crash when many state change server threads are spawned.
Clone Of:	1381619
Environment:
Last Closed:	2016-12-14 16:07:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1581580	None	None	None	2016-10-04 14:59:31 UTC
OpenStack gerrit	379578	None	None	None	2016-10-04 14:59:31 UTC
Red Hat Product Errata	RHEA-2016:2948	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 enhancement update	2016-12-14 19:55:27 UTC

Comment 3 Toni Freger 2016-11-20 12:00:14 UTC

Hi Anil,

Please provide steps to reproduce.
I have the latest version, with 3 controllers and 2 compute nodes.
Do I need IPv6 network?
How many ports should openvswitch host?
Any additional info you can provide in order to verify the fix? 

Thanks,
Toni

Comment 4 anil venkata 2016-11-21 09:13:43 UTC

Hi Toni

There are no direct steps to reproduce this issue.
The original bug [1] was reported on OSP7. Its description says: "When restarting neutron-openvswitch-agent on networks nodes with lots of port, server crashes". In [2], they worked out code changes to avoid the crash. These changes limit the spawning of keepalived state change workers. So I tried spawning a large number of keepalived state change threads to check the effect on CPU load. To do this:

1) I created many HA routers (close to 150, I think).
2) I made sure controller1 hosted all the "Master" HA routers (so the HA routers on controller2 were slaves).
3) I used a script to bring down the HA interface in all of controller1's HA routers. Keepalived state change server threads were then spawned simultaneously on controller2, which also increased the CPU load on that node.
I tried this with 2 controller nodes.
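
The script used in step 3 is not attached to this bug. A minimal sketch of what such a script could look like is below. It assumes the caller already knows which network namespace holds each router and the name of the HA interface inside it; the namespace and device names in the example are hypothetical placeholders, and the helper names are made up for illustration. By default it only prints the commands.

```python
import subprocess


def link_down_commands(ha_devices):
    """Build one 'ip link set ... down' command per router namespace.

    ha_devices maps a namespace name to the HA interface inside it.
    Both are supplied by the caller; nothing is discovered here.
    """
    return [
        ["ip", "netns", "exec", namespace,
         "ip", "link", "set", "dev", device, "down"]
        for namespace, device in sorted(ha_devices.items())
    ]


def bring_down(ha_devices, dry_run=True):
    """Print the commands (dry run) or execute them as root."""
    commands = link_down_commands(ha_devices)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


# Hypothetical placeholder names, not taken from a real deployment.
EXAMPLE = {
    "qrouter-example-1": "ha-example-1",
    "qrouter-example-2": "ha-example-2",
}

if __name__ == "__main__":
    bring_down(EXAMPLE, dry_run=True)
```

Running with dry_run=False on the controller that hosts the master routers would bring all listed interfaces down in quick succession, which is what triggers the simultaneous state changes on the other controller.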

I am not sure this bug is worth testing.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1315114
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1315114#c5

Comment 5 Assaf Muller 2016-11-21 13:26:56 UTC

There should be a way to check how many threads are running in a Python process. One way to verify is to use such a tool to assert that we don't have more than $num_cpus / 2 keepalived_state_change_monitor threads in the L3 agent process.
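
One possible way to do such a check on Linux, as a rough sketch: count the entries under /proc/<pid>/task, which lists the kernel threads of a process, and compare the result with $num_cpus / 2. The function names and the proc_root parameter are made up for this example. Note that this counts OS-level threads only; greenlets scheduled inside a single OS thread would not show up here.

```python
import os


def count_threads(pid, proc_root="/proc"):
    """Return the number of OS-level threads (kernel tasks) of a process."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())


def thread_limit(num_cpus=None):
    """Return num_cpus / 2 (integer division), but never less than 1."""
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    return max(1, num_cpus // 2)


def within_limit(pid, num_cpus=None, proc_root="/proc"):
    """True if the process has no more threads than the limit allows."""
    return count_threads(pid, proc_root) <= thread_limit(num_cpus)


if __name__ == "__main__":
    # Inspect the current process as a harmless demonstration.
    me = os.getpid()
    print(me, count_threads(me), thread_limit())
```

To use it for verification, pass the pid of the process under inspection instead of os.getpid().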

Comment 6 John Schwarz 2016-11-22 06:55:52 UTC

One can use gdb (the C debugger) to display this information:

1. Run "gdb --pid <pid>" (install gdb if it is not already installed),
2. Run "info threads" to display the number of threads running. An example output can be:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fc25be0c700 (LWP 20895) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
  2    Thread 0x7fc253594700 (LWP 20896) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6


which signifies 2 threads.

Alternatively, one can use this one-liner to display the information (still needs gdb):

echo "info threads" > /tmp/cmd; gdb --pid 21415 < /tmp/cmd
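
If the count needs to be checked from a script, the gdb output can be parsed instead of read by eye. The sketch below is one way to do that, assuming the output has the same row layout as the example above; the function names are made up for illustration. Attaching gdb to a process normally requires root or matching ptrace permissions.

```python
import re
import subprocess

# A thread row starts with an optional '*', then the gdb thread Id,
# then the word "Thread" (as in the example output above).
_THREAD_ROW = re.compile(r"^\*?\s*\d+\s+Thread\b")


def count_gdb_threads(output):
    """Count thread rows in the text printed by gdb's 'info threads'."""
    return sum(1 for line in output.splitlines() if _THREAD_ROW.match(line))


def gdb_info_threads(pid):
    """Attach gdb to pid in batch mode and return the 'info threads' text."""
    result = subprocess.run(
        ["gdb", "--batch", "--pid", str(pid), "-ex", "info threads"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


SAMPLE = """(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7fc25be0c700 (LWP 20895) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
  2    Thread 0x7fc253594700 (LWP 20896) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
"""

if __name__ == "__main__":
    print(count_gdb_threads(SAMPLE))  # prints 2
```

On a live system, count_gdb_threads(gdb_info_threads(pid)) gives the same number that "info threads" shows interactively.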

Comment 7 Alexander Stafeyev 2016-11-23 09:11:55 UTC

(In reply to John Schwarz from comment #6)

Created 10 routers.
Rebooted all controllers except controller 1.
Checked that each keepalived state change process has only one thread.
Did not see any with more than one.

[root@controller-1 ~]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'

[root@controller-1 ~]# sudo gdb --pid 453603

(gdb) info threads 
  Id   Target Id         Frame 
* 1    Thread 0x7f1e813c5740 (LWP 219061) "neutron-keepali" 0x00007f1e801ffcf3 in __epoll_wait_nocancel ()
   from /lib64/libc.so.6
(gdb) quit


Same behavior with 30 and 80 routers. 


If this is enough for verification, we can mark the bug as verified.
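
With 30 or 80 routers, attaching gdb to each process by hand gets tedious. A rough sketch of a bulk check is below: it scans /proc for processes whose command line contains a given string and reports the thread count of each. This is an alternative to the gdb session above, not what was actually run; the function names and the proc_root parameter are made up for illustration.

```python
import os


def find_pids(pattern, proc_root="/proc"):
    """Return pids whose command line contains pattern."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if pattern in cmdline:
            pids.append(int(name))
    return sorted(pids)


def threads_per_process(pattern, proc_root="/proc"):
    """Map each matching pid to its number of OS-level threads."""
    counts = {}
    for pid in find_pids(pattern, proc_root):
        try:
            tasks = os.listdir(os.path.join(proc_root, str(pid), "task"))
        except OSError:
            continue  # process exited between the two reads
        counts[pid] = len(tasks)
    return counts


if __name__ == "__main__":
    for pid, n in threads_per_process("keepalived-state-change").items():
        print(pid, n)
```

Any pid reported with a count above 1 would be worth a closer look with gdb.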

Comment 8 anil venkata 2016-11-23 13:49:42 UTC

That verification is fine.

Comment 10 errata-xmlrpc 2016-12-14 16:07:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
