Bug 1315114
| Summary: | UnixDomainWSGIServer keepalived state change listener in L3 agent has an uncapped number of threads, overloading node | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Hill <dhill> |
| Component: | openstack-neutron | Assignee: | anil venkata <vkommadi> |
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | amuller, bperkins, bschmaus, chris.fields, chrisw, dchia, dhill, jraju, mflusche, mlopes, nyechiel, srevivo, vkommadi |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | 7.0 (Kilo) | Flags: | clincoln: needinfo- |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-2015.1.4-10.el7ost | Doc Type: | Bug Fix |
| Story Points: | --- | Clone Of: | |
| : | 1381618 (view as bug list) | Environment: | |
| Last Closed: | 2017-01-19 13:25:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 1381618, 1381619, 1381620 | | |

Doc Text:

> Prior to this update, with `wsgi_default_pool_size` (=100) concurrent requests, the state change server could create a heavy CPU load on the L3 agent.
> With this update, a new option `ha_keepalived_state_change_server_threads` has been added to configure the number of concurrent threads spawned for keepalived state change server connection requests. Higher values increase the CPU load on the agent nodes; the default value is half the number of CPUs present on the node. This allows operators to tune the number of threads to suit their environment. With more threads, simultaneous state change requests for multiple HA routers can be handled faster.
> As a result, `ha_keepalived_state_change_server_threads` can be configured to avoid high load on L3 agents.
Description (David Hill, 2016-03-06 23:00:00 UTC)
As a quick workaround, we've turned off ipv6 with the following KCS article: https://access.redhat.com/solutions/8709

The code changes we made to work around the neutron server crash are in support case https://access.redhat.com/support/cases/#/case/01594598 but are reproduced here:

```diff
--- neutron/agent/l3/ha.py-orig	2016-03-07 13:20:36.128333368 -0600
+++ neutron/agent/l3/ha.py	2016-03-06 21:25:21.198378877 -0600
@@ -79,7 +79,7 @@
     def run(self):
         server = agent_utils.UnixDomainWSGIServer(
-            'neutron-keepalived-state-change')
+            'neutron-keepalived-state-change', num_threads=4)
         server.start(KeepalivedStateChangeHandler(self.agent),
                      self.get_keepalived_state_change_socket_path(self.conf),
                      workers=0,

--- neutron/agent/linux/utils.py-orig	2016-03-07 13:23:25.476246117 -0600
+++ neutron/agent/linux/utils.py	2016-03-07 13:26:31.587545191 -0600
@@ -379,11 +379,11 @@
 class UnixDomainWSGIServer(wsgi.Server):
-    def __init__(self, name):
+    def __init__(self, name, num_threads=8):
         self._socket = None
         self._launcher = None
         self._server = None
-        super(UnixDomainWSGIServer, self).__init__(name)
+        super(UnixDomainWSGIServer, self).__init__(name, num_threads)

     def start(self, application, file_socket, workers, backlog, mode=None):
         self._socket = eventlet.listen(file_socket,
```

We got hit with this bug again when we applied https://rhn.redhat.com/errata/RHBA-2016-0499.html.

I'm assuming you're using Pacemaker to restart the OVS agent, which then also restarts the L3 agent, and that by "server crashes" you mean that the node itself (the hardware) crashes? Can we get any logs, kernel panic, or anything like that?
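The essence of the patch above is simply to cap how many state change notifications are handled concurrently. Neutron does this through an eventlet green-thread pool; the standalone sketch below illustrates the same bounding behavior with the stdlib `ThreadPoolExecutor` instead, so all names here are illustrative, not Neutron's:

```python
# Sketch of the idea behind the patch: bound concurrent handling of
# state-change notifications. Uses stdlib ThreadPoolExecutor in place
# of Neutron's eventlet pool; handler name and workload are made up.
from concurrent.futures import ThreadPoolExecutor
import threading
import time

NUM_THREADS = 4  # analogous to the hard-coded num_threads=4 in the patch

peak = 0      # highest number of handlers observed running at once
active = 0
lock = threading.Lock()

def handle_state_change(router_id, state):
    """Stand-in for the agent's per-notification work."""
    global peak, active
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # simulate the 'ip netns exec ... sysctl ...' work
    with lock:
        active -= 1
    return (router_id, state)

# Flood the pool with 50 notifications, far more than the cap.
with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    futures = [pool.submit(handle_state_change, rid, 'master')
               for rid in range(50)]
    results = [f.result() for f in futures]

# However many notifications arrive, at most NUM_THREADS run at once,
# which is what keeps the load average bounded.
assert peak <= NUM_THREADS
```

With an unbounded (or very large, e.g. 1000-thread) pool, a burst of keepalived transitions spawns that many concurrent handlers, each forking netns/sysctl processes, which is what drove the load average up.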
(In reply to Chris Fields from comment #5)
> The code changes we made to work around the neutron server crash are in support case https://access.redhat.com/support/cases/#/case/01594598 [patch reproduced above]

I don't see anything in the case that explains this change or how you got to it. Can you explain the rationale behind the change and how you arrived at it?

We do not use pacemaker to start neutron services. We are very prescriptive with the order and the timing in which they start. By reducing the number of threads in the code we prevent the load average on the neutron nodes from going so high that they have to be power cycled.

Here are the descriptions of the code changes above from support case https://access.redhat.com/support/cases/#/case/01594598:

"Christenson, Mark on Mar 07 2016 at 01:29 PM -06:00
Patches to mitigate the load average issue caused by keepalived state transitions, as noted earlier.
Note that the default and explicit values for num_threads are somewhat arbitrarily chosen, and likely should be higher. This should also ultimately be a configuration item."

"Christenson, Mark on Mar 06 2016 at 08:14 PM -06:00
It would seem that all of those 'ip netns exec ... sysctl ...' processes were caused by the neutron-l3-agent getting state change notifications from the neutron-keepalived-state-change processes. There is a UnixDomainWSGIServer in the l3-agent listening for those notifications. When they are received, they call this method in neutron/agent/l3/ha.py:

```python
def enqueue_state_change(self, router_id, state):
    LOG.info(_LI('Router %(router_id)s transitioned to %(state)s'),
             {'router_id': router_id,
              'state': state})
    try:
        ri = self.router_info[router_id]
    except AttributeError:
        LOG.info(_LI('Router %s is not managed by this agent. It was '
                     'possibly deleted concurrently.'), router_id)
        return

    self._configure_ipv6_ra_on_ext_gw_port_if_necessary(ri, state)
    self._update_metadata_proxy(ri, router_id, state)
    self._update_radvd_daemon(ri, state)
    self.state_change_notifier.queue_event((router_id, state))
```

Note that the first thing it does is update ipv6 RA. It looks like the default number of threads for that WSGI server is 1000, so that would explain why we saw so many of those processes. One question I have is whether the problem is just the number of processes, or whether there is some other kind of contention that is locking them up permanently. I am going to attempt to change that thread count and see what happens."

Lastly, there are sosreports attached to this support case.

https://review.openstack.org/#/c/317616/ is in review. This makes num_threads for the state change server configurable.

Change [1] has been merged into the rhos-7.0-patches branch. Will create the build soon.
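The doc text describes the merged option's default as half the number of CPUs on the node. A minimal sketch of computing such a default (the helper name is hypothetical; the actual Neutron implementation may compute it differently):

```python
import multiprocessing

def default_state_change_threads():
    """Hypothetical helper: default to half the node's CPUs, minimum 1.

    Mirrors the behavior described in the doc text ("half of the number
    of CPUs present on the node"); the real Neutron code may differ.
    """
    return max(1, multiprocessing.cpu_count() // 2)

print(default_state_change_threads())
```

Clamping to a minimum of 1 matters on small (single- or dual-CPU) nodes, where an unclamped `cpu_count() // 2` could otherwise yield 0 worker threads.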
[1] https://code.engineering.redhat.com/gerrit/#/c/86087/

Verified on OSP7:

```
[root@controller-2 heat-admin]# rpm -qa | grep neutron
openstack-neutron-openvswitch-2015.1.4-11.el7ost.noarch
```

Created 10 routers and rebooted all controllers except controller 1. Checked that the keepalived-state-change processes each have only one thread. Did not see any with more than one.

```
[root@controller-2 heat-admin]# ps -ef | grep keepalived-state-change | awk '{ print $2 }'
5890
6167
6574
6862
7058
7426
7756
8056
8363
8738
31537
[root@controller-2 heat-admin]# sudo gdb --pid 5890
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f7c6ba13740 (LWP 5890) "neutron-keepali" 0x00007f7c6a84bcf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0159.html
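The per-process thread check done above with gdb can also be done without attaching a debugger, by reading the `Threads:` field from `/proc/<pid>/status` (Linux-specific). A hedged sketch, demonstrated on the current process rather than an actual keepalived-state-change PID:

```python
import os

def os_thread_count(pid):
    """Return the kernel thread count for a PID from /proc (Linux only)."""
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('Threads:'):
                return int(line.split()[1])
    raise ValueError('no Threads: line for pid %d' % pid)

# Demonstrated on our own process; for the verification above you would
# loop over the PIDs printed by `ps -ef | grep keepalived-state-change`.
print(os_thread_count(os.getpid()))
```

Note that eventlet green threads are not OS threads, so a busy state change server can still show a single kernel thread here, as the gdb output above does.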