Hi Anil, please provide steps to reproduce. I have the latest version, with 3 controllers and 2 compute nodes. Do I need an IPv6 network? How many ports should openvswitch host? Is there any additional info you can provide to help verify the fix? Thanks, Toni
Hi Toni,
There are no direct steps to reproduce this issue. [1] is the original bug, reported on OSP7. The bug description says: "When restarting neutron-openvswitch-agent on networks nodes with lots of port, server crashes". In [2] they worked out code changes to avoid the crash. These changes limit the spawning of keepalived state change workers.
So I tried spawning a large number of keepalived state change threads and observing the effect on CPU load. To do this:
1) I created many HA routers (close to 150, I think).
2) I made sure controller1 held all the "Master" HA routers (with the HA routers on controller2 as slaves).
3) I used a script to bring down the HA interface in all of controller1's HA routers.
Keepalived state change server threads were then spawned simultaneously on controller2, which also increased the CPU load on that node. I tried this with 2 controller nodes. I am not sure this bug is worth testing.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1315114
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1315114#c5
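The script used in step 3 is not included in this thread. A minimal sketch of what such a script could look like follows. It assumes that router namespaces are named "qrouter-<id>" and that HA interfaces inside them start with "ha-"; neither detail is stated above, so check both on the node first. All function names are hypothetical, and the default is a dry run that only returns the commands it would execute.

```python
import subprocess


def router_namespaces(netns_output):
    """Extract qrouter-* namespace names from 'ip netns list' output.

    Assumption: router namespaces use the 'qrouter-' prefix.
    """
    names = []
    for line in netns_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("qrouter-"):
            names.append(fields[0])
    return names


def ha_interfaces(link_output):
    """Extract ha-* interface names from 'ip -o link show' output.

    Assumption: HA interfaces use the 'ha-' prefix.
    """
    names = []
    for line in link_output.splitlines():
        # One-line format: "<index>: <name>[@peer]: <flags> ..."
        parts = line.split(":", 2)
        if len(parts) < 3:
            continue
        name = parts[1].strip().split("@")[0]
        if name.startswith("ha-"):
            names.append(name)
    return names


def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def bring_down_ha_interfaces(dry_run=True):
    """Build (and, unless dry_run, execute) 'ip link set <dev> down' per HA interface."""
    commands = []
    for ns in router_namespaces(run(["ip", "netns", "list"])):
        links = run(["ip", "netns", "exec", ns, "ip", "-o", "link", "show"])
        for dev in ha_interfaces(links):
            cmd = ["ip", "netns", "exec", ns, "ip", "link", "set", dev, "down"]
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)
    return commands
```

Running bring_down_ha_interfaces() as root on controller1 returns the list of commands without changing anything; passing dry_run=False executes them.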
There should be a way to check how many threads are running in a Python process. One way to verify is to use such a tool to assert that we don't have more than $num_cpus / 2 keepalived_state_change_monitor threads in the L3 agent process.
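One gdb-free way to do this check on Linux is to count the entries under /proc/<pid>/task, which holds one directory per OS-level thread. The sketch below is only an illustration: the function names are hypothetical, the floor of 1 for single-CPU hosts is an assumption, and the count covers every OS thread in the process, not only keepalived_state_change_monitor threads. Green threads are not visible this way.

```python
import os
from pathlib import Path


def count_threads(pid, proc_root="/proc"):
    """Count OS-level threads of a process via <proc_root>/<pid>/task (Linux only)."""
    task_dir = Path(proc_root) / str(pid) / "task"
    return sum(1 for entry in task_dir.iterdir() if entry.name.isdigit())


def within_limit(thread_count, num_cpus=None):
    """Return True if thread_count does not exceed num_cpus / 2.

    Assumption: the limit is floored at 1 so that a single-CPU host
    still allows one thread.
    """
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    return thread_count <= max(1, num_cpus // 2)
```

For example, within_limit(count_threads(pid)) with the L3 agent's pid gives a rough upper-bound check, since the total thread count can only be higher than the number of state change monitor threads.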
One can use gdb (the C debugger) to display this information:
1. Run "gdb --pid <pid>" (install gdb if it is not already installed).
2. Run "info threads" to display the number of threads running.
An example output can be:
    (gdb) info threads
      Id   Target Id         Frame
    * 1    Thread 0x7fc25be0c700 (LWP 20895) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
      2    Thread 0x7fc253594700 (LWP 20896) "python" 0x00007fc25ac331c3 in select () from /lib64/libc.so.6
which signifies 2 threads.
Alternatively, one can use this one-liner to display the information (still needs gdb):
    echo "info threads" > /tmp/cmd; gdb --pid 21415 < /tmp/cmd
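To turn the gdb output into a number that a test can assert on, the "info threads" listing can be parsed for its LWP ids. The following is a rough sketch, not a tested tool: the function names are hypothetical, and it assumes gdb is installed and that each thread line has the "Thread 0x... (LWP n)" form shown above. LWP ids are de-duplicated because gdb may also mention a thread in its attach messages.

```python
import re
import subprocess

# Matches e.g. 'Thread 0x7fc25be0c700 (LWP 20895)'
LWP_RE = re.compile(r"\bThread\s+0x[0-9a-fA-F]+\s+\(LWP\s+(\d+)\)")


def parse_lwps(gdb_output):
    """Return the sorted, unique LWP ids found in gdb 'info threads' output."""
    return sorted({int(m.group(1)) for m in LWP_RE.finditer(gdb_output)})


def gdb_thread_count(pid):
    """Attach gdb in batch mode, run 'info threads', and count the threads."""
    result = subprocess.run(
        ["gdb", "--batch", "-p", str(pid), "-ex", "info threads"],
        capture_output=True,
        text=True,
        check=True,
    )
    return len(parse_lwps(result.stdout))
```

Batch mode detaches and exits once the command has run, so no temporary command file is needed. Attaching normally requires root or ptrace permission on the target process.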
(In reply to John Schwarz from comment #6)
Created 10 routers and rebooted all controllers except controller 1. Checked that the keepalived state change processes have only one thread each. Did not see any with more than one.
    [root@controller-1 ~]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'
    [root@controller-1 ~]# sudo gdb --pid 453603
    (gdb) info threads
      Id   Target Id         Frame
    * 1    Thread 0x7f1e813c5740 (LWP 219061) "neutron-keepali" 0x00007f1e801ffcf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
    (gdb) quit
Same behavior with 30 and 80 routers. If this is enough for verification, we can verify.
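With 30 or 80 routers, attaching gdb to each process by hand gets tedious. A possible shortcut, sketched here under assumptions, is to ask ps for the per-process thread count (the nlwp column in procps) and flag any keepalived-state-change process with more than one thread. The function names are hypothetical, and the match string is simply the one used in the ps command above.

```python
import subprocess


def multi_threaded(ps_output, pattern="keepalived-state-change"):
    """Return (pid, nlwp) for matching processes that have more than one thread.

    Expects lines of the form '<pid> <nlwp> <command line>'.
    """
    offenders = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip blank, header, or otherwise malformed lines.
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, nlwp, args = int(fields[0]), int(fields[1]), fields[2]
        if pattern in args and nlwp > 1:
            offenders.append((pid, nlwp))
    return offenders


def check_host():
    """Run ps on this host and return the offending processes, if any."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "nlwp=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return multi_threaded(out)
```

An empty list from check_host() would match the result reported above: no process with more than one thread.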
That verification is fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html