Bug 1384183
| Summary: | busy looping on epoll_wait() |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | python-oslo-messaging |
| Version: | 10.0 (Newton) |
| Target Release: | 10.0 (Newton) |
| Target Milestone: | rc |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | unspecified |
| Keywords: | Triaged |
| Reporter: | Seth Jennings <sjenning> |
| Assignee: | John Eckersberg <jeckersb> |
| QA Contact: | Marian Krcmarik <mkrcmari> |
| CC: | apevec, fdinitto, fpercoco, jschluet, lhh, mburns, michele, morazi, oblaut, sjenning, srevivo, vstinner |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | python-oslo-messaging-5.10.0-4.el7ost |
| Type: | Bug |
| Last Closed: | 2016-12-14 16:17:21 UTC |
| Attachments: | systemctl-status.log, strace.log |
Description (Seth Jennings, 2016-10-12 18:30:23 UTC)
Created attachment 1209690 [details]
systemctl-status.log
Ick. Thanks for filing this, I'll look into it.

Created attachment 1209692 [details]
strace.log
about 3 seconds of strace on neutron-server
Is this a regression vs OSP9 or OSP8? If so I agree it's a blocker, but if it's not a regression, then we should drop blocker.

I didn't have time to analyze the issue yet, but it doesn't seem like a regression nor a major bug. For me, it's more a missed optimization: I also expect processes to be 100% idle (whereas Seth reports only 98% idle) when no RPC call is done. eventlet calls epoll_wait(0) when it has callbacks ready to be called; that's the normal behaviour. I'm not sure that the issue comes from Oslo Messaging either. We need to run a Python profiler on nova/cinder/neutron to see exactly what is running and why.

A few quick comments here. I have a whole lot more to braindump after this, but we'll get the easier stuff out of the way first :)

I think this is a regression, but I haven't gone back to prove it yet. Anecdotally, I have noticed this behavior on Newton for a while now (but never had a good explanation). I can't reliably run Newton on my mini-dell; Mitaka and before worked fine. So there's that.

As for idle%, I think it's misleading. The system spends so much time jumping across the syscall boundary and not doing any "real" work on either side, so it's not driving up the user/system CPU usage.

For a workaround, set heartbeat_timeout_threshold = 0 in the config files (nova.conf/neutron.conf/etc.).

I can confirm that heartbeat_timeout_threshold = 0 largely mitigates this issue system-wide:
Over 30s, epoll_wait() calls are down to ~2200/s:
# sysdig -c topscalls -r trace.scap
# Calls Syscall
--------------------------------------------------------------------------------
67230 epoll_wait
# sysdig -r trace.scap evt.type=epoll_wait | cut -f4 -d' ' | sort | uniq -c | sort -n
...
116 nova-novncproxy
194 nova-conductor
342 cinder-volume
598 nova-consoleaut
644 nova-scheduler
768 cinder-schedule
781 1_scheduler
1214 nova-cert
1679 2_scheduler
1712 nova-compute
2842 neutron-l3-agen
3066 neutron-metadat
4694 neutron-server
15638 neutron-openvsw
32786 neutron-dhcp-ag
Seems that neutron-openvswitch-agent and neutron-dhcp-agent might not be honoring this tunable, or they have some other code path that ends up in the same place and produces the high epoll_wait() call rate.
I can say if this is a regression or not. Reports online make me think this has been around since Liberty, but I haven't confirmed.
I'll comment back if I have time to figure it out. blocker or no blocker is up to you.
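For reference, the heartbeat_timeout_threshold = 0 workaround mentioned above would look roughly like the snippet below. This is only a sketch; it assumes the option is read from the [oslo_messaging_rabbit] section of each service's config in Newton, and as far as I understand, setting it to 0 disables AMQP heartbeats entirely.

    # nova.conf / neutron.conf / cinder.conf etc. (hypothetical placement)
    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 0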
(In reply to Seth Jennings from comment #7)
> I can say if this is a regression or not.

s/can/can't/

I think neutron is just busy for other reasons. It's historically burned more cycles than the other services, and I've never looked into why. Glad to hear that the other ones are behaving now, though. I'm still working on the writeup explaining the details, stay tuned.

Ok, here's the deal...

First off, I've verified this is the case on both devstack/master as well as on OSP10/Newton, so it's not something specific to our downstream.

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically the three methods _heartbeat_start, _heartbeat_stop, and _heartbeat_thread_job.

In _heartbeat_start, it creates a threading.Event (this event ends up being the root of the problem), creates a threading.Thread for the heartbeat thread, and then starts the heartbeat thread. The heartbeat thread does its work, and then at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is stopped; this wait is where the heartbeat thread receives that and shuts down.

With the default heartbeat configuration, _heartbeat_wait_timeout ends up being 15.0 seconds. This sounds reasonable, but explodes under the hood. Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of code:

    # Balancing act:  We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive.  The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)

Now, the problem is that _sleep here is time.sleep, which has been previously monkey-patched by eventlet to be eventlet.greenthread.sleep instead. When you call sleep() in eventlet, it adds a timer onto an internal heap that it maintains[2]. When the main event loop runs, it checks whatever the shortest timer is and uses that delta as the argument for waiting (which ultimately ends up in epoll_wait). Here's the crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times, between 0.0005 and 0.05 seconds, these tiny timers are almost always the next timer to fire, and sleep_time above ends up being negative (or small enough to be practically zero anyway). The self.wait() call propagates down to epoll_wait(), and you get the behavior we're seeing.

So, the root of the problem is that using threading.Event inside of eventlet is bad. I think this should be a pretty easy fix. My first thought is to just use a pipe to signal the heartbeat thread to stop, and then the read on the pipe can time out at _heartbeat_wait_timeout. This moves everything into the regular I/O handling parts of eventlet, and it will still work in the case that some executor other than eventlet is used.

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348
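To make the pipe idea above concrete, here is a minimal sketch (not the actual upstream patch; the HeartbeatExitPipe name and its usage are made up for illustration). Blocking in select() on a pipe for the full timeout gives eventlet a single I/O wait to manage instead of a stream of sub-50 ms sleep timers:

    import os
    import select


    class HeartbeatExitPipe(object):
        """Pipe-based stand-in for the threading.Event used today."""

        def __init__(self):
            self._rfd, self._wfd = os.pipe()

        def wait(self, timeout):
            # Block until stop is signalled or the timeout expires.
            # Under eventlet monkey-patching, the green select() registers
            # one timer for the whole timeout, so the hub is not woken
            # every few milliseconds.
            readable, _, _ = select.select([self._rfd], [], [], timeout)
            return bool(readable)

        def set(self):
            # Called from the stop path to wake the heartbeat thread.
            os.write(self._wfd, b'x')

The heartbeat loop would then do something like `while not exit_pipe.wait(self._heartbeat_wait_timeout): ...send heartbeat...`, with _heartbeat_stop() calling exit_pipe.set(), so the read side times out at _heartbeat_wait_timeout just as described above.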
Note that the crazy Condition.wait() code has been fixed in Python 3 - https://hg.python.org/cpython/file/3.6/Lib/threading.py#l263

By the way, I worked on a performance bug in asyncio: epoll_wait() was called with a timeout of 0 because of how the timeout was rounded. epoll_wait() has a resolution of 1 ms (the timeout is an integer number of milliseconds), so asyncio entered a busy loop when a timer was due in less than 1 ms. Full story:

* http://bugs.python.org/issue20311
* http://bugs.python.org/issue20320
* http://bugs.python.org/issue20505

The fix was made in two parts:

* Round the timeout away from zero (ROUND_UP, not ROUND_DOWN, which rounds towards zero).
* Use a "granularity" constant to check whether a timer should be fired, instead of using the exact current date: see the "end_time = self.time() + self._clock_resolution" line in asyncio/base_events.py. In asyncio, clock_resolution comes from time.get_clock_info('monotonic').resolution, which is usually 1 nanosecond on UNIX but 18.6 ms on Windows.

I checked eventlet quickly; the timeout is rounded towards zero (ROUND_DOWN) in eventlet/hubs/poll.py. So maybe it has the bug?
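To illustrate the rounding point with a toy example (this is not eventlet's or asyncio's actual code): if a poll timeout in seconds is converted to whole milliseconds by truncating, a timer due in less than 1 ms becomes epoll_wait(0) and the loop spins; rounding away from zero avoids that.

    import math

    def poll_timeout_ms(seconds):
        # ROUND_UP (away from zero): a timer due in 0.3 ms must become
        # epoll_wait(1), not epoll_wait(0), or the hub busy-loops until
        # the timer is actually due.
        return int(math.ceil(seconds * 1000.0))

    assert poll_timeout_ms(0.0003) == 1   # truncation would give 0
    assert poll_timeout_ms(2.0) == 2000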
ON_DEV until upstream merges.

(In reply to Mike Burns from comment #14)
> ON_DEV until upstream merges.

I need to print a giant banner to hang over my desk because I always mix up POST/ON_DEV :)

Hey John,

I'd like to test this upstream patch, but I think I'm missing something. I patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py" and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to get the .pyc and .pyo, but everything freaks out when I do that. All openstack components eat 100% cpu doing something, I know not what.

How would I go about testing this upstream patch on my openstack install?

(In reply to Seth Jennings from comment #16)
> I patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py"
> and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to
> get the .pyc and .pyo, but everything freaks out when I do that.

Hm, that seems like it should work? I never generate those explicitly; if you just rm the .pyc and .pyo files, it will regenerate the .pyc on startup. You don't really need the .pyo. Try letting it regenerate them itself and see if it makes any difference in the behavior.

Newton backport - https://review.openstack.org/#/c/394963/

Merged to Newton, finally.

FailedQA - The patch is not included in python-oslo-messaging-5.10.0-3.el7ost, so the behaviour described in the description of the bug is still present. I am switching back to POST (correct me if I am wrong) for a new build to be made with the patch included.

Verified

I can observe a significant drop in epoll_wait() calls over 30-second timeframes on my setup for the neutron-*, nova-* and cinder-* services.

Mostly around the following numbers:

    5.10.0-2  5.10.0-5
    12        14      memcached
    16        76392   cinder-schedule
    28        8       auditd
    28        20      nova-scheduler
    52        16      systemd
    64        20      systemd-logind
    74        8       NetworkManager
    142       208     mysqld
    176       734     neutron-metadat
    184       30685   neutron-openvsw
    202       54      dbus-daemon
    370       236     heat-engine
    398       336     redis-server
    462       261063  nova-conductor
    1108      289802  neutron-server
    1108      1250    nova-novncproxy
    2020      3351    systemd-journal
    3976      4826    1_scheduler
    4788      4823    swift-container
    9306      14264   neutron-dhcp-ag
    80224     105754  corosync
    81640     163418  sahara-api
    89535     161516  sahara-engine

(In reply to Marian Krcmarik from comment #24)
> Mostly around the following numbers:
> 5.10.0-2  5.10.0-5

urgh, the version numbers should be the vice versa.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html