Created attachment 1209689 [details]
guru-neutron-server.log

Description of problem:
Services with oslo.messaging enabled are busy-looping on epoll_wait() calls with no timeout. This leads to high (cumulative) CPU usage and heavy context-switching activity for every oslo-enabled service.

Version-Release number of selected component (if applicable):
python2-oslo-messaging.noarch 5.10.0-1.el7

How reproducible:
Always. On an all-in-one install the effect is even more pronounced because all the services run on a single machine. I have reduced the number of workers for all services to 1, since my setup is not under heavy load, but there are still about 35k epoll_wait()/s system-wide.

Steps to Reproduce:
1. Start nova/cinder/neutron services
2. Observe ~350 epoll_wait()s with 0 timeout per second per service process with strace, or ~35k epoll_wait()/s system-wide with sysdig

Actual results:
~35k epoll_wait()/s system-wide with sysdig

Expected results:
Far fewer epoll_wait() calls.

Additional info:
From a 30s sysdig capture on an idle cluster with no instances:

# sysdig -c topscalls -r trace.scap
# Calls     Syscall
--------------------------------------------------------------------------------
1053676     epoll_wait

# sysdig -r trace.scap evt.type=epoll_wait | cut -f4 -d' ' | sort | uniq -c | sort -n
...
  70960 neutron-dhcp-ag
  71440 cinder-volume
  73956 neutron-metadat
  74230 neutron-l3-agen
  77322 neutron-openvsw
 145352 nova-compute
 147996 cinder-schedule
 150146 nova-conductor
 233788 neutron-server

This bug seems to have been around a while, since Liberty:
https://bugs.launchpad.net/oslo.messaging/+bug/1518430

CPU usage is ~2% per process. Cumulatively, it matters. It also causes system-wide scheduling latency and a large volume of user<->kernel traffic across the syscall boundary.
  PID USER     PR  NI    VIRT    RES   SHR S %CPU %MEM   TIME+  COMMAND
 1709 nova     20   0 2057220 130564  9848 S  2.3  0.5 0:55.61  nova-compute
 1213 neutron  20   0  320824  57512  5472 S  2.0  0.2 0:47.76  neutron-metadat
 1216 nova     20   0  390748 121384  5700 S  2.0  0.5 1:00.23  nova-conductor
 1717 neutron  20   0  341852  78624  5484 S  2.0  0.3 1:10.24  neutron-openvsw
 1958 cinder   20   0  560728 118180  2452 S  2.0  0.5 0:47.80  cinder-volume
 1979 neutron  20   0  408652 107568  2548 S  2.0  0.4 0:49.68  neutron-server
 1980 neutron  20   0  387208  87712  2156 S  2.0  0.4 0:46.39  neutron-server
 1982 neutron  20   0  392020  92216  2172 S  2.0  0.4 0:44.42  neutron-server
 1211 cinder   20   0  561472 129136 11796 S  1.7  0.5 0:50.24  cinder-schedule
 1706 neutron  20   0  325948  62640  5484 S  1.7  0.3 0:58.58  neutron-l3-agen
 1708 neutron  20   0  325968  62824  5496 R  1.7  0.3 0:57.32  neutron-dhcp-ag
 1195 nova     20   0  438904 141136  6648 S  0.7  0.6 0:20.43  nova-api
 1215 cinder   20   0  558304 125208 11792 S  0.7  0.5 0:18.25  cinder-volume
  873 root     10 -10  418920  52404  9388 S  0.3  0.2 0:05.86  ovs-vswitchd
 1209 cinder   20   0  573832 141432 11792 S  0.3  0.6 0:19.19  cinder-api
 1217 rabbitmq 20   0 4600604 114676  4120 S  0.3  0.5 0:14.40  beam.smp
Created attachment 1209690 [details] systemctl-status.log
Ick. Thanks for filing this, I'll look into it.
Created attachment 1209692 [details] strace.log about 3 seconds of strace on neutron-server
Is this a regression vs OSP9 or OSP8? If so, I agree it's a blocker, but if it's not a regression, then we should drop blocker.
I didn't have time to analyze the issue yet, but it doesn't seem like a regression or a major bug. To me, it's more of a missed optimization: I also expect processes to be 100% idle (whereas Seth reports only 98% idle) when no RPC calls are being made. eventlet calls epoll_wait(0) when it has callbacks ready to be called; that's normal behaviour. I'm not sure that the issue comes from Oslo Messaging either. We need to run a Python profiler on nova/cinder/neutron to see exactly what is running and why.
A few quick comments here. I have a whole lot more to braindump after this, but we'll get the easier stuff out of the way first :)

I think this is a regression, but I haven't gone back to prove it yet. Anecdotally, I have noticed this behavior on Newton for a while now (but never had a good explanation). I can't reliably run Newton on my mini-dell; Mitaka and before worked fine. So there's that.

As for idle%, I think it's misleading. The system spends so much time jumping across the syscall boundary, without doing any "real" work on either side, that it doesn't drive up the user/system CPU usage much.

For a workaround, set:

    heartbeat_timeout_threshold = 0

in the config files (nova.conf/neutron.conf/etc.)
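For anyone else trying the workaround, it would look roughly like this in each service's config file (the section name matches oslo.messaging's rabbit driver options; verify against the options registered by your installed version):

```ini
# /etc/nova/nova.conf (likewise neutron.conf, cinder.conf, ...)
[oslo_messaging_rabbit]
# 0 disables the AMQP heartbeat thread entirely, which sidesteps
# the busy-looping threading.Event.wait() discussed in this bug
heartbeat_timeout_threshold = 0
```

Note that disabling heartbeats also means dead rabbit connections are detected more slowly, so this is a mitigation rather than a fix.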
I can confirm that heartbeat_timeout_threshold = 0 largely mitigates this issue system-wide. Over 30s, epoll_wait() calls are down to ~2200/s:

# sysdig -c topscalls -r trace.scap
# Calls     Syscall
--------------------------------------------------------------------------------
67230       epoll_wait

# sysdig -r trace.scap evt.type=epoll_wait | cut -f4 -d' ' | sort | uniq -c | sort -n
...
   116 nova-novncproxy
   194 nova-conductor
   342 cinder-volume
   598 nova-consoleaut
   644 nova-scheduler
   768 cinder-schedule
   781 1_scheduler
  1214 nova-cert
  1679 2_scheduler
  1712 nova-compute
  2842 neutron-l3-agen
  3066 neutron-metadat
  4694 neutron-server
 15638 neutron-openvsw
 32786 neutron-dhcp-ag

It seems that neutron-openvswitch-agent and neutron-dhcp-agent might not be honoring this tunable, or they have some other code path that results in the same high epoll_wait() call rate.

I can say if this is a regression or not. Reports online make me think this has been around since Liberty, but I haven't confirmed. I'll comment back if I have time to figure it out. Blocker or no blocker is up to you.
(In reply to Seth Jennings from comment #7) > I can say if this is a regression or not. s/can/can't/
I think neutron is just busy for other reasons. It's historically burned more cycles than the other services and I've never looked into why. Glad to hear that the other ones are behaving now though. I'm still working on the writeup explaining the details, stay tuned.
Ok, here's the deal...

First off, I've verified this is the case on both devstack/master as well as on OSP10/Newton, so it's not something specific to our downstream.

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically, the three methods _heartbeat_start, _heartbeat_stop, and _heartbeat_thread_job.

_heartbeat_start creates a threading.Event (this event ends up being the root of the problem) and a threading.Thread for the heartbeat thread, then starts that thread. The heartbeat thread does its work, and at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is stopped; this wait() is where the heartbeat thread receives that signal and shuts down.

With the default heartbeat configuration, _heartbeat_wait_timeout ends up being 15.0 seconds. This sounds reasonable, but explodes under the hood. Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of code:

    # Balancing act:  We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive.  The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005  # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)

Now, the problem is that _sleep here is time.sleep, which has been previously monkey-patched by eventlet to be eventlet.greenthread.sleep instead. When you call sleep() in eventlet, it adds a timer onto an internal heap that it maintains[2].
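To put a number on it, here's a back-of-the-envelope simulation of that backoff loop with a virtual clock (my own sketch, not code from oslo.messaging or CPython), counting how many times _sleep() fires during one 15-second wait when the event is never set:

```python
def count_wakeups(timeout):
    """Simulate CPython 2.7 Condition.wait()'s backoff loop with a
    virtual clock, assuming the event is never set before the timeout."""
    now = 0.0
    delay = 0.0005
    wakeups = 0
    while True:
        remaining = timeout - now
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, 0.05)
        now += delay   # stands in for _sleep(delay)
        wakeups += 1   # each _sleep() becomes one eventlet timer -> epoll_wait
    return wakeups

# roughly 300 wakeups per 15 s wait, i.e. ~20 epoll_wait()s per second
# per waiting thread, even when nothing is happening
print(count_wakeups(15.0))
```

Multiply that by every heartbeat thread across every worker process and the system-wide epoll_wait() counts in the original report start to look plausible.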
When the main event loop runs, it checks whatever the shortest timer is, and uses that delta as the argument for waiting (which ultimately ends up in epoll_wait). Here's the crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times (between 0.0005 and 0.05 seconds), these tiny timers are almost always the next timer to fire, and sleep_time above ends up being negative (or small enough to be practically zero anyway). The self.wait() call propagates down to epoll_wait() and you get the behavior we're seeing.

So the root of the problem is that using threading.Event inside of eventlet is bad. I think this should be a pretty easy fix. My first thought is to just use a pipe to signal the heartbeat thread to stop, and let the read on the pipe time out at _heartbeat_wait_timeout. This moves everything into the regular I/O handling parts of eventlet, and will still work in case an executor other than eventlet is used.

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348
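As a rough illustration of the pipe idea (a hypothetical sketch of my own, not the actual upstream patch), the stop signal becomes a readable file descriptor, so waiting turns into a single blocking poll that the hub can service with one epoll_wait per timeout interval:

```python
import os
import select

class PipeExitEvent(object):
    """Hypothetical Event replacement: signal shutdown via a pipe so that
    wait() is one blocking poll on an fd instead of a sleep/backoff loop."""

    def __init__(self):
        self._rfd, self._wfd = os.pipe()

    def set(self):
        # make the read end readable, waking up any waiter
        os.write(self._wfd, b'x')

    def wait(self, timeout):
        # returns True if set() was called, False if the timeout expired
        readable, _, _ = select.select([self._rfd], [], [], timeout)
        return bool(readable)
```

Under eventlet monkey-patching, select.select is green-aware, so the heartbeat greenthread parks on one fd/timer for the whole 15 seconds rather than scheduling hundreds of sub-50 ms timers.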
Note that the crazy Condition.wait() code has been fixed in Python 3 - https://hg.python.org/cpython/file/3.6/Lib/threading.py#l263
By the way, I worked on a performance bug in asyncio: epoll_wait() was called with a timeout of 0 because of how the timeout was rounded. epoll_wait() has a resolution of 1 ms (the timeout is an integer number of milliseconds), so asyncio entered a busy loop when a timer was due in less than 1 ms. Full story:

* http://bugs.python.org/issue20311
* http://bugs.python.org/issue20320
* http://bugs.python.org/issue20505

The fix was made in two parts:

* Round the timeout away from zero (ROUND_UP, not ROUND_DOWN, which rounds towards zero).
* Use a "granularity" constant to check whether a timer should be fired, instead of using the exact current date: see the "end_time = self.time() + self._clock_resolution" line in asyncio/base_events.py. In asyncio, clock_resolution comes from time.get_clock_info('monotonic').resolution, which is usually 1 nanosecond on UNIX but 18.6 ms on Windows.

I quickly checked eventlet: the timeout is rounded towards zero (ROUND_DOWN) in eventlet/hubs/poll.py. So maybe it has the same bug?
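The rounding difference is easy to demonstrate in isolation (my own sketch; eventlet's actual seconds-to-milliseconds conversion lives in its hub code):

```python
import math

def ms_round_down(seconds):
    # truncation toward zero: a timer due in 0.4 ms yields epoll_wait(0),
    # which returns immediately -> busy loop until the timer finally fires
    return int(seconds * 1000.0)

def ms_round_up(seconds):
    # rounding away from zero: the same timer sleeps at least 1 ms
    return int(math.ceil(seconds * 1000.0))

print(ms_round_down(0.0004))  # 0 -> zero-timeout poll, spins
print(ms_round_up(0.0004))    # 1 -> one short sleep instead
```

With round-down, every timer that lands inside the current millisecond degenerates into a zero-timeout poll, which is exactly the epoll_wait(0) storm described above.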
https://review.openstack.org/#/c/386656/
ON_DEV until upstream merges.
(In reply to Mike Burns from comment #14) > ON_DEV until upstream merges. I need to print a giant banner to hang over my desk because I always mix up POST/ON_DEV :)
Hey John, I'd like to test this upstream patch but I think I'm missing something. I patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py" and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to get the .pyc and .pyo, but everything freaks out when I do that. All openstack components eat 100% cpu doing something, I know not what. How would I go about testing this upstream patch on my openstack install?
(In reply to Seth Jennings from comment #16) > Hey John, > > I'd like to test this upstream patch but I think I'm missing something. I > patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py" > and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to > get the .pyc and .pyo, but everything freaks out when I do that. All > openstack components eat 100% cpu doing something, I know not what. > > How would I go about testing this upstream patch on my openstack install? Hm that seems like it should work? I never generate those explicitly, if you just rm the .pyc and .pyo files it will regenerate the .pyc on startup. You don't really need the .pyo. Try to just let it generate it itself and see if it makes any difference on the behavior.
https://review.openstack.org/#/c/389739
Newton backport - https://review.openstack.org/#/c/394963/
Merged to Newton, finally.
FailedQA - The patch is not included in python-oslo-messaging-5.10.0-3.el7ost, so the behaviour described in the bug description is still present. I am switching back to POST (correct me if I am wrong) so that a new build can be made with the patch included.
Verified

I can observe a significant drop in epoll_wait() calls over 30-second timeframes on my setup for the neutron-*, nova-* and cinder-* services.

Mostly around the following numbers:

5.10.0-2  5.10.0-5
      12        14  memcached
      16     76392  cinder-schedule
      28         8  auditd
      28        20  nova-scheduler
      52        16  systemd
      64        20  systemd-logind
      74         8  NetworkManager
     142       208  mysqld
     176       734  neutron-metadat
     184     30685  neutron-openvsw
     202        54  dbus-daemon
     370       236  heat-engine
     398       336  redis-server
     462    261063  nova-conductor
    1108    289802  neutron-server
    1108      1250  nova-novncproxy
    2020      3351  systemd-journal
    3976      4826  1_scheduler
    4788      4823  swift-container
    9306     14264  neutron-dhcp-ag
   80224    105754  corosync
   81640    163418  sahara-api
   89535    161516  sahara-engine
(In reply to Marian Krcmarik from comment #24)
> Verified
>
> I can observe significant drop of epoll_wait() calls over 30seconds
> timeframes on my setup for neutron-*, nova-* and cinder-* services.
>
> Mostly around following numbers:
> 5.10.0-2 5.10.0-5

urgh, the version numbers should be vice versa.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html