Bug 1381620
| Summary: | UnixDomainWSGIServer keepalived state change listener in L3 agent has an uncapped number of threads, overloading node | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | anil venkata <vkommadi> |
| Component: | openstack-neutron | Assignee: | anil venkata <vkommadi> |
| Status: | CLOSED ERRATA | QA Contact: | Alexander Stafeyev <astafeye> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | amuller, bperkins, bschmaus, chris.fields, chrisw, dchia, ddomingo, dhill, jraju, jschluet, jschwarz, mflusche, nyechiel, srevivo, tfreger, vkommadi |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-9.0.0-1.2.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Previously, the maximum number of client connections (i.e., greenlets spawned at a time) opened at any time by the WSGI server was set to 100 with 'wsgi_default_pool_size'. While this setting was adequate for the OpenStack Networking API server, the state change server created heavy CPU loads on the L3 agent, which caused the agent to crash. With this release, you can use the new 'ha_keepalived_state_change_server_threads' setting to configure the number of threads in the state change server. Client connections are no longer limited by 'wsgi_default_pool_size', thereby avoiding an L3 agent crash when many state change server threads are spawned. | | |
| Story Points: | --- | | |
| Clone Of: | 1381619 | Environment: | |
| Last Closed: | 2016-12-14 16:07:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1315114, 1381618, 1381619 | | |
| Bug Blocks: | | | |
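The Doc Text above mentions the new 'ha_keepalived_state_change_server_threads' option. A minimal l3_agent.ini sketch follows; the value shown is illustrative only (upstream derives the default from the host's CPU count, per the $num_cpus / 2 discussion in the comments below):

```ini
# /etc/neutron/l3_agent.ini (value illustrative)
[DEFAULT]
# Number of concurrent threads for the keepalived state change server.
# This pool is dedicated to the state change server and is independent
# of wsgi_default_pool_size, which continues to govern the API server.
ha_keepalived_state_change_server_threads = 10
```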
Comment 3
Toni Freger
2016-11-20 12:00:14 UTC
Hi Toni,

There are no direct steps to reproduce this issue. [1] is the original bug reported on OSP7. Its description says: "When restarting neutron-openvswitch-agent on networks nodes with lots of port, server crashes". In [2] they worked out code changes to avoid the crash; those changes are related to limiting the spawning of keepalived state change workers.

So I tried to test the effect of spawning many keepalived state change threads on CPU load:

1) Created many HA routers (close to 150, I think).
2) Made sure controller1 held all the "master" HA routers (the HA routers on controller2 were all slaves).
3) Used a script to bring down the HA interface on all of controller1's HA routers.

Keepalived state change server threads were then simultaneously spawned on controller2, increasing the CPU load on that node. I tried this with 2 controller nodes. I am not sure this bug is worth testing.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1315114
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1315114#c5

There should be a way to check how many threads are running in a Python process. One way to verify is to use such a tool to assert that we don't have more than $num_cpus / 2 keepalived_state_change_monitor threads in the L3 agent process.

One can use gdb (the C debugger) to display this information:

1. Run "gdb --pid <pid>" (install gdb if it's not already).
2. Run "info threads" to display the number of threads running. An example output can be:

```
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7fc25be0c700 (LWP 20895) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
  2    Thread 0x7fc253594700 (LWP 20896) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
```

which signifies 2 threads.
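For completeness, the same thread count can be read without gdb on Linux, since every OS thread of a process appears as a directory under /proc/<pid>/task. This is a small gdb-free sketch (the `thread_count` helper is made up for illustration, not part of neutron):

```python
# Linux only: count a process's kernel threads by listing /proc/<pid>/task.
import os

def thread_count(pid):
    """Return the number of OS threads of the process with the given pid."""
    return len(os.listdir("/proc/{}/task".format(pid)))

# Example: inspect the current interpreter process.
print(thread_count(os.getpid()))
```

Unlike attaching gdb, this does not stop the target process, so it is safe to run against a live L3 agent.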
Alternatively, one can use this one-liner to display the information (it still needs gdb):

```
echo "info threads" > /tmp/cmd; gdb --pid 21415 < /tmp/cmd
```

(In reply to John Schwarz from comment #6)
> One can use gdb (the C debugger) to display this information:
> [...]
> echo "info threads" > /tmp/cmd; gdb --pid 21415 < /tmp/cmd

Created 10 routers and rebooted all controllers except controller-1. Checked that the keepalived state change processes have only one thread each; did not see any with more than one.

```
[root@controller-1 ~]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'
[root@controller-1 ~]# sudo gdb --pid 453603
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f1e813c5740 (LWP 219061) "neutron-keepali" 0x00007f1e801ffcf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) quit
```

Same behavior with 30 and 80 routers. If this is enough for verification, we can verify.

That verification is fine.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
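As background on the fix being verified above: the change gives the state change server a bounded worker pool instead of letting it inherit the shared 'wsgi_default_pool_size'. The effect of such a cap can be sketched with Python's standard library (ThreadPoolExecutor stands in for the eventlet GreenPool the real agent uses; the handler name and router ids are invented for illustration):

```python
# Sketch (not neutron code): a bounded pool caps how many state change
# handlers run at once, no matter how many events arrive simultaneously.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4       # analogous to ha_keepalived_state_change_server_threads
_lock = threading.Lock()
_active = 0
peak_concurrency = 0  # highest number of handlers observed running at once

def handle_state_change(router_id):
    """Pretend to process one keepalived state transition."""
    global _active, peak_concurrency
    with _lock:
        _active += 1
        peak_concurrency = max(peak_concurrency, _active)
    time.sleep(0.001)  # simulate the notification round-trip
    with _lock:
        _active -= 1
    return router_id

# 100 simultaneous state change events, but at most MAX_WORKERS run at once.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    handled = list(pool.map(handle_state_change, range(100)))

print(peak_concurrency, len(handled))
```

With an uncapped pool (the pre-fix behavior), all 100 handlers would run concurrently, which is the CPU overload this bug describes.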