Bug 1384183 - busy looping on epoll_wait()
Summary: busy looping on epoll_wait()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: rc
: 10.0 (Newton)
Assignee: John Eckersberg
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-10-12 18:30 UTC by Seth Jennings
Modified: 2016-12-14 16:17 UTC (History)
12 users (show)

Fixed In Version: python-oslo-messaging-5.10.0-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 16:17:21 UTC


Attachments (Terms of Use)
guru-neutron-server.log (12.09 KB, text/x-vhdl)
2016-10-12 18:30 UTC, Seth Jennings
no flags Details
systemctl-status.log (14.66 KB, text/plain)
2016-10-12 18:31 UTC, Seth Jennings
no flags Details
strace.log (35.26 KB, text/plain)
2016-10-12 18:36 UTC, Seth Jennings
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 386656 None None None 2016-10-14 15:06:06 UTC
OpenStack gerrit 389739 None None None 2016-10-21 15:02:26 UTC
OpenStack gerrit 394963 None None None 2016-11-08 13:53:47 UTC
Launchpad 1518430 None None None 2016-10-14 14:53:22 UTC

Description Seth Jennings 2016-10-12 18:30:23 UTC
Created attachment 1209689 [details]
guru-neutron-server.log

Description of problem:

Services with oslo messaging enabled are busy looping on epoll_wait() calls with no timeout.  This leads to high (cumulative) CPU usage and heavy context switching activity on a per oslo-enabled service basis.

Version-Release number of selected component (if applicable):

python2-oslo-messaging.noarch   5.10.0-1.el7

How reproducible:

Always.  On an all-in-one install, the effect is even more pronounced because all the services are running on a single machine.  I have reduced the number of workers for all services to 1, since my setup is not under heavy load, but there are still about 35k epoll_wait()/s system wide.

Steps to Reproduce:
1.  Start nova/cinder/neutron services
2.  Observe ~350 epoll_wait()s with 0 timeout per second per service process with strace or 35k poll_waits()/s system-wide with sysdig
3.

Actual results:

35k epoll_waits()/s system-wide with sysdig

Expected results:

Much fewer epoll_waits().

Additional info:

On a 30s sysdig capture on idle cluster with no instances

# sysdig -c topscalls -r trace.scap 
# Calls             Syscall             
--------------------------------------------------------------------------------
1053676             epoll_wait

# sysdig -r trace.scap evt.type=epoll_wait | cut -f4 -d' ' | sort | uniq -c | sort -n
...
  70960 neutron-dhcp-ag
  71440 cinder-volume
  73956 neutron-metadat
  74230 neutron-l3-agen
  77322 neutron-openvsw
 145352 nova-compute
 147996 cinder-schedule
 150146 nova-conductor
 233788 neutron-server

Seems that this bug has been around a while, since Liberty:
https://bugs.launchpad.net/oslo.messaging/+bug/1518430

CPU usage is ~2% per process.  Cumulatively, it matters.  It also causes system wide scheduling latency and tons of traffic on user<->kernel across the syscall boundary.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
 1709 nova      20   0 2057220 130564   9848 S   2.3  0.5   0:55.61 nova-compute                                                                                            
 1213 neutron   20   0  320824  57512   5472 S   2.0  0.2   0:47.76 neutron-metadat                                                                                         
 1216 nova      20   0  390748 121384   5700 S   2.0  0.5   1:00.23 nova-conductor                                                                                          
 1717 neutron   20   0  341852  78624   5484 S   2.0  0.3   1:10.24 neutron-openvsw                                                                                         
 1958 cinder    20   0  560728 118180   2452 S   2.0  0.5   0:47.80 cinder-volume                                                                                           
 1979 neutron   20   0  408652 107568   2548 S   2.0  0.4   0:49.68 neutron-server                                                                                          
 1980 neutron   20   0  387208  87712   2156 S   2.0  0.4   0:46.39 neutron-server                                                                                          
 1982 neutron   20   0  392020  92216   2172 S   2.0  0.4   0:44.42 neutron-server                                                                                          
 1211 cinder    20   0  561472 129136  11796 S   1.7  0.5   0:50.24 cinder-schedule                                                                                         
 1706 neutron   20   0  325948  62640   5484 S   1.7  0.3   0:58.58 neutron-l3-agen                                                                                         
 1708 neutron   20   0  325968  62824   5496 R   1.7  0.3   0:57.32 neutron-dhcp-ag                                                                                         
 1195 nova      20   0  438904 141136   6648 S   0.7  0.6   0:20.43 nova-api                                                                                                
 1215 cinder    20   0  558304 125208  11792 S   0.7  0.5   0:18.25 cinder-volume                                                                                           
  873 root      10 -10  418920  52404   9388 S   0.3  0.2   0:05.86 ovs-vswitchd                                                                                            
 1209 cinder    20   0  573832 141432  11792 S   0.3  0.6   0:19.19 cinder-api                                                                                              
 1217 rabbitmq  20   0 4600604 114676   4120 S   0.3  0.5   0:14.40 beam.smp

Comment 1 Seth Jennings 2016-10-12 18:31:48 UTC
Created attachment 1209690 [details]
systemctl-status.log

Comment 2 John Eckersberg 2016-10-12 18:34:49 UTC
Ick.  Thanks for filing this, I'll look into it.

Comment 3 Seth Jennings 2016-10-12 18:36:27 UTC
Created attachment 1209692 [details]
strace.log

about 3 seconds of strace on neutron-server

Comment 4 Fabio Massimo Di Nitto 2016-10-13 07:54:48 UTC
Is this a regression vs OSP9 or OSP8?

If so I agree it´s a blocker, but if it´s not a regression, then we should drop blocker.

Comment 5 Victor Stinner 2016-10-13 09:29:41 UTC
I didn't have time to analyze the issue yet, but it doesn't seem like a regression nor a major bug. For me, it's more a missed optimization: I also expect processes to be 100% idle (whereas Seth reports only 98% idle) when no RPC call is done.

eventlet calls epoll_wait(0) when it has callbacks ready to be called, it's the normal behaviour. I'm not sure that the issue comes from Oslo Messaging neither. We need to run a Python profiler on nova/cinder/neutron to see exactly what is running and why.

Comment 6 John Eckersberg 2016-10-13 14:00:38 UTC
A few quick comments here.  I have a whole lot more to braindump after this but we'll get the easier stuff out of the way first :)

I think this is a regression, but I haven't gone back to prove it yet.  Anecdotally, I have noticed this behavior on Newton for a while now (but never had a good explaination).  I can't reliably run Newton on my mini-dell.  Mitaka and before worked fine.  So there's that.

As for idle%, I think it's misleading.  The system spends so much time jumping across the syscall boundary and not doing any "real" work on either side.  So it's not driving up the user/system cpu usage.

For a workaround, set:

heartbeat_timeout_threshold = 0

in the config files (nova.conf/neutron.conf/etc.)

Comment 7 Seth Jennings 2016-10-13 14:37:33 UTC
I can confirm that heartbeat_timeout_threshold = 0 largely mitigates this issue system-wide:

Over 30s, epoll_wait() are down to ~2200/s

# sysdig -c topscalls -r trace.scap
can't open file trace.scap
# Calls             Syscall             
--------------------------------------------------------------------------------
[root@openstack neutron]# cd
[root@openstack ~]# sysdig -c topscalls -r trace.scap
# Calls             Syscall             
--------------------------------------------------------------------------------
67230               epoll_wait

# sysdig -r trace.scap evt.type=epoll_wait | cut -f4 -d' ' | sort | uniq -c | sort -n
...
    116 nova-novncproxy
    194 nova-conductor
    342 cinder-volume
    598 nova-consoleaut
    644 nova-scheduler
    768 cinder-schedule
    781 1_scheduler
   1214 nova-cert
   1679 2_scheduler
   1712 nova-compute
   2842 neutron-l3-agen
   3066 neutron-metadat
   4694 neutron-server
  15638 neutron-openvsw
  32786 neutron-dhcp-ag

Seems that neutron-openvswitch-agent and neutron-dhcp-agent might not be honoring this tunable.  Or they have some other path that calls the same path that results in the high epoll_wait() call rate.

I can say if this is a regression or not.  Reports online make me think this has been around since Liberty, but I haven't confirmed.

I'll comment back if I have time to figure it out.  blocker or no blocker is up to you.

Comment 8 Seth Jennings 2016-10-13 14:42:16 UTC
(In reply to Seth Jennings from comment #7)
> I can say if this is a regression or not.

s/can/can't/

Comment 9 John Eckersberg 2016-10-13 14:45:08 UTC
I think neutron is just busy for other reasons.  It's historically burned more cycles than the other services and I've never looked into why.  Glad to hear that the other ones are behaving now though.  I'm still working on the writeup explaining the details, stay tuned.

Comment 10 John Eckersberg 2016-10-13 17:12:43 UTC
Ok, here's the deal...

First off, I've verified this is the case on both devstack/master as
well as on OSP10/Newton.  So it's not something specific to our
downstream.

In the oslo.messaging rabbit driver, there is a thread for heartbeats:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/_drivers/impl_rabbit.py?id=293084f9b5ef59c4b2dcd8c7180f7053b47337a9#n968

specifically, the three methods, _heartbeat_start, _heartbeat_stop,
and _heartbeat_thread_job.

In _heartbeat_start, it creates a threading.Event (this event ends up
being the root of the problem), a threading.Thread for the heartbeat
thread, and then starts the heartbeat thread.

The heartbeat thread does work, and then at the end executes:

    self._heartbeat_exit_event.wait(
        timeout=self._heartbeat_wait_timeout)

The _heartbeat_stop method sets the exit event when the connection is
stopped, this is where it receives that and shuts down.

With the default heartbeat configuration, _heartbeat_wait_timeout ends
up being 15.0 seconds.  This sounds reasonable, but explodes under the
hood.

Event.wait() ultimately calls Condition.wait(), which runs this gem[1] of
code:

    # Balancing act:  We can't afford a pure busy loop, so we
    # have to sleep; but if we sleep the whole timeout time,
    # we'll be unresponsive.  The scheme here sleeps very
    # little at first, longer as time goes on, but never longer
    # than 20 times per second (or the timeout time remaining).
    endtime = _time() + timeout
    delay = 0.0005 # 500 us -> initial delay of 1 ms
    while True:
        gotit = waiter.acquire(0)
        if gotit:
            break
        remaining = endtime - _time()
        if remaining <= 0:
            break
        delay = min(delay * 2, remaining, .05)
        _sleep(delay)

Now, the problem is that _sleep here is time.sleep, which has been
previously monkey-patched by eventlet to be eventlet.greenthread.sleep
instead.

When you call sleep() in eventlet, it adds a timer onto an internal
heap that it maintains[2].  When the main event loop runs, it checks
whatever the shortest timer is, and uses that delta as the argument
for waiting (this ultimately ends up to epoll_wait).  Here's the
crucial part of the code[3]:

    wakeup_when = self.sleep_until()
    if wakeup_when is None:
        sleep_time = self.default_sleep()
    else:
        sleep_time = wakeup_when - self.clock()
    if sleep_time > 0:
        self.wait(sleep_time)
    else:
        self.wait(0)

Because Condition.wait() uses such small sleep times between
0.0005..0.05, these tiny timers are almost always the next timer to
fire, and sleep_time above ends up being negative (or small enough to
practically be zero anyways).  The self.wait() call propagates down to
epoll_wait() and you get the behavior we're seeing.

So, the root of the problem is that using threading.Event inside of
eventlet is bad.

I think this should be a pretty easy fix.  My first thought is to just
use a pipe to signal the heartbeat thread to stop, and then the read
on the pipe can timeout at _heartbeat_wait_timeout.  This moves
everything into the regular i/o handling parts of eventlet, and will
still work in the case that some other executor than eventlet is used.

[1] https://hg.python.org/cpython/file/v2.7.12/Lib/threading.py#l344
[2] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L389-L392
[3] https://github.com/eventlet/eventlet/blob/7343f2e817fbc09d07c36727dfb2e4655a37901f/eventlet/hubs/hub.py#L340-L348

Comment 11 John Eckersberg 2016-10-13 19:50:07 UTC
Note that the crazy Condition.wait() code has been fixed in Python 3 - https://hg.python.org/cpython/file/3.6/Lib/threading.py#l263

Comment 12 Victor Stinner 2016-10-14 07:48:08 UTC
By the way, I worked on a performance bug in asyncio: epoll_wait() was called with a timeout of 0 because of how the timeout was rounded. epoll_wait() has a resolution of 1 ms (the timeout is a integer number of millisecond), so asyncio entered a busy when a timer was in less than 1 ms. Full story:

* http://bugs.python.org/issue20311
* http://bugs.python.org/issue20320
* http://bugs.python.org/issue20505

The fix is made in two parts:

* Round the timeout away from zero (ROUND_UP, not ROUND_DOWN which rounds towards zero)
* Use a "granularity" constant to check if a timer should be fired or not, instead of using the exact current date: see the "end_time = self.time() + self._clock_resolution" line in asyncio/base_events.py. In asyncio, clock_resolution comes from time.get_clock_info('monotonic').resolution which is usually 1 nanosecond on UNIX but 18.6 ms on Windows.

I checked quickly eventlet, the timeout is rounded towards zero (ROUND_DOWN) in eventlet/hubs/poll.py. So maybe it has the bug?

Comment 13 John Eckersberg 2016-10-14 15:06:07 UTC
https://review.openstack.org/#/c/386656/

Comment 14 Mike Burns 2016-10-14 17:33:30 UTC
ON_DEV until upstream merges.

Comment 15 John Eckersberg 2016-10-14 17:45:47 UTC
(In reply to Mike Burns from comment #14)
> ON_DEV until upstream merges.

I need to print a giant banner to hang over my desk because I always mix up POST/ON_DEV :)

Comment 16 Seth Jennings 2016-10-19 15:40:40 UTC
Hey John,

I'd like to test this upstream patch but I think I'm missing something.  I patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py" and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to get the .pyc and .pyo, but everything freaks out when I do that.  All openstack components eat 100% cpu doing something, I know not what.

How would I go about testing this upstream patch on my openstack install?

Comment 17 John Eckersberg 2016-10-19 17:19:51 UTC
(In reply to Seth Jennings from comment #16)
> Hey John,
> 
> I'd like to test this upstream patch but I think I'm missing something.  I
> patch my impl_rabbit.py and do a "python -O -m compileall impl_rabbit.py"
> and "python -m compileall impl_rabbit.py" (which might be TOTALLY wrong) to
> get the .pyc and .pyo, but everything freaks out when I do that.  All
> openstack components eat 100% cpu doing something, I know not what.
> 
> How would I go about testing this upstream patch on my openstack install?

Hm that seems like it should work?  I never generate those explicitly, if you just rm the .pyc and .pyo files it will regenerate the .pyc on startup.  You don't really need the .pyo.  Try to just let it generate it itself and see if it makes any difference on the behavior.

Comment 18 John Eckersberg 2016-10-21 15:02:26 UTC
https://review.openstack.org/#/c/389739

Comment 19 John Eckersberg 2016-11-08 13:53:47 UTC
Newton backport - https://review.openstack.org/#/c/394963/

Comment 20 John Eckersberg 2016-11-11 04:03:32 UTC
Merged to Newton, finally.

Comment 23 Marian Krcmarik 2016-11-15 17:30:43 UTC
FailedQA - The patch is not included in python-oslo-messaging-5.10.0-3.el7ost so the behaviour described in description of the bug is still present.
I am switching back to POST (Correct me If I am wrong) for a new build to be made with patch included.

Comment 24 Marian Krcmarik 2016-11-16 17:30:52 UTC
Verified

I can observe significant drop of epoll_wait() calls over 30seconds timeframes on my setup for neutron-*, nova-* and cinder-* services.

Mostly around following numbers:

5.10.0-2	5.10.0-5
12		14		memcached
16		76392		cinder-schedule
28		8		auditd
28		20		nova-scheduler
52		16		systemd
64		20		systemd-logind
74		8		NetworkManager
142		208		mysqld
176		734		neutron-metadat
184		30685		neutron-openvsw
202		54		dbus-daemon
370		236		heat-engine
398		336		redis-server
462		261063		nova-conductor
1108		289802		neutron-server
1108		1250		nova-novncproxy
2020		3351		systemd-journal
3976		4826		1_scheduler
4788		4823		swift-container
9306		14264		neutron-dhcp-ag
80224		105754		corosync
81640		163418		sahara-api
89535		161516		sahara-engine

Comment 25 Marian Krcmarik 2016-11-16 17:32:26 UTC
(In reply to Marian Krcmarik from comment #24)
> Verified
> 
> I can observe significant drop of epoll_wait() calls over 30seconds
> timeframes on my setup for neutron-*, nova-* and cinder-* services.
> 
> Mostly around following numbers:
> 
 5.10.0-2	5.10.0-5
urgh, the version numbers should be the vice versa.

Comment 27 errata-xmlrpc 2016-12-14 16:17:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html


Note You need to log in before you can comment on or make changes to this bug.