Description of problem:
The epoll subsystem allows users to create large nested epoll structures,
which the kernel will then walk with preemption disabled, causing a denial of
service via excessive CPU consumption in the kernel.
Red Hat would like to thank Nelson Elhage for reporting this issue.
This issue affected the versions of Linux kernel as shipped with Red Hat Enterprise Linux 4, 5, 6, and Red Hat Enterprise MRG. It was addressed in Red Hat Enterprise Linux 5 and 6 via RHSA-2012:0150 and RHSA-2012:0862 respectively. There is no plan to address this flaw in Red Hat Enterprise Linux 4. Future updates may address this issue in Red Hat Enterprise MRG.
Created kernel tracking bugs for this issue
Affects: fedora-all [bug 748668]
This issue has been addressed in following products:
Red Hat Enterprise Linux 5
Via RHSA-2012:0150 https://rhn.redhat.com/errata/RHSA-2012-0150.html
It may be useful for others to know this patch caused significant problems with dovecot 2.0.13/epoll on our CentOS 5.7 machines (2.6.18-238.19.1.el5, 2.6.18-274.3.1.el5, 2.6.18-274.17.1.el5). Other dovecot versions may also be affected, but we've not determined that yet.
IMAP & POP were affected and we isolated which servers were affected via the following message in our logs:
Panic: epoll_ctl(add, 6) failed: Invalid argument
More information in the link to the dovecot mailing list above
The patch does put additional limits on the amount of nesting and the number of epoll fds that can be attached to an fd. I thought the limits were higher than anybody would hit in practice, but perhaps not. I can re-spin a 'debug' patch that will print more info in this situation, if you are willing to test that. Otherwise, if you have a reproducer that I can try, that would be appreciated. Do you have any sense of how many epoll fds are required per fd?
Another question here is that I see that this was caused in the context of a ksplice update. I'm wondering if this issue can be re-produced without ksplice too? So we can narrow down if this is ksplice or not.
Thanks for getting back to us on this. We're going to be attempting to see if the latest EL5 kernel will have an issue sans-ksplice, since the dovecot folks also recommended doing that. Will let you know what we find.
If we do still see the issue, we'd definitely be willing to re-produce with the debug patch to get more info.
Just wondering if you got a chance to test this?
We haven't been able to tackle it yet, however I should get to testing it this week. We have to verify that we can reproduce the problem with enough generated traffic using ksplice in a test environment, then swap out with the kernel update to see if the issue persists.
Will provide an update soon on the results.
Well, it looks like I can reproduce the issue in a testing environment with the latest CentOS 5.x kernel and no ksplice uptrack.
Here's the info:
Linux n99.XXXX 2.6.18-308.1.1.el5 #1 SMP Wed Mar 7 04:16:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ sudo /usr/sbin/uptrack-show
Warning: The cron output configuration options have been removed.
Please visit <http://www.ksplice.com/uptrack/notification-options>
for more information.
Effective kernel version is 2.6.18-308.1.1.el5
And...flipping back to:
Linux n99.XXXX 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
and indeed, the problems go away.
Used mstone to generate load, using 1000 IMAP logins, 1000 Maildir folders consisting of 21 IMAP folders with cur/new/tmp sub-directories. No actual mail, indexes, cache files in any of the mail folders. Just a uidvalidity file that gets generated by dovecot.
Consistently dovecot becomes unusable and must be restarted to function normally again.
From our dovecot logs:
2012-03-10T07:20:42.019732-05:00 n99 dovecot: imap-login: proxy(bob@579): started proxying to XXX.XX.XX.XX:143: user=<bob@579>, method=PLAIN, rip=127.0.0.1, lip=127.0.0.1, secured
2012-03-10T07:20:42.027678-05:00 n99 dovecot: imap-login: Panic: epoll_ctl(add, 6) failed: Invalid argument
2012-03-10T07:20:42.028028-05:00 n99 dovecot: imap-login: Error: Raw backtrace: /usr/lib64/dovecot/libdovecot.so.0 [0x370503baa0] -> /usr/lib64/dovecot/libdovecot.so.0 [0x370503baf6] -> /usr/lib64/dovecot/libdovecot.so.0 [0x370503afb3] -> /usr/lib64/dovecot/libdovecot.so.0(io_loop_handle_add+0x118) [0x3705047708] -> /usr/lib64/dovecot/libdovecot.so.0(io_add+0xa5) [0x3705046e15] -> /usr/lib64/dovecot/libdovecot.so.0(master_service_init_finish+0x1c6) [0x37050355a6] -> /usr/lib64/dovecot/libdovecot-login.so.0(main+0x136) [0x3f6100bdf6] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x370301d994] -> dovecot/imap-login(main+0x39) [0x402069]
2012-03-10T07:20:42.028434-05:00 n99 dovecot: master: Error: service(imap-login): child 10947 killed with signal 6 (core dumps disabled)
2012-03-10T07:20:42.028471-05:00 n99 dovecot: master: Error: service(imap-login): command startup failed, throttling
Let me know if you'd like me to try this with a different kernel, enable core dumps for dovecot, etc.
Created attachment 569503 [details]
increase the ep path limits and add some debug output
So it looks like we have run up against some of the new epoll path limits code. I have a kernel patch that doubles the limits, and should print to the log when we would have hit the old soft limits. If it's possible, can you apply this patch to the latest RHEL5 kernel and report back? If you can't rebuild the kernel, I can look into supplying you with a rebuilt kernel. Thanks!
Thanks Jason. I'll try to give the patch a whirl later this week.
Dovecot master process creates one pipe whose read side it passes to all the child processes. All of the Dovecot child processes listen on this pipe to find out when the master process dies. There are systems with tens of thousands of Dovecot child processes, so simply increasing the limit to 2000 won't help.
Postfix behaves in a similar way by polling on a pipe fd shared by all the same service processes. So if e.g. Postfix starts up over 1000 smtp or smtpd processes it'll fail in the same way.
I'd think there are also other software using a pipe to find out when the other side is dead.
Can this shared pipe be handled as a special case somehow?
Sorry, I meant write side of the pipe is passed to Dovecot child processes. They are listening for EPOLLERR | EPOLLHUP from the pipe.
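The shared-pipe pattern described above can be sketched roughly as follows. This is only an illustration, not Dovecot's code: the function name `watch_shared_pipe` is hypothetical, and the "children" are simulated as several epoll instances in one process rather than as separate processes. Each watcher registers the write end of the same pipe for EPOLLERR | EPOLLHUP, and closing the read end stands in for the master process dying.

```python
import os
import select


def watch_shared_pipe(num_children=5):
    """Simulate num_children watchers attached to one shared pipe write end.

    Hypothetical helper: real children would be separate processes that
    inherited the write end from the master.
    """
    read_fd, write_fd = os.pipe()
    watchers = []
    for _ in range(num_children):
        ep = select.epoll()
        # Each watcher adds one depth-1 path from the pipe to an epoll fd.
        ep.register(write_fd, select.EPOLLERR | select.EPOLLHUP)
        watchers.append(ep)

    # Stand-in for the master dying: the only read end goes away.
    os.close(read_fd)

    # Every watcher is now woken for the same fd.
    results = [ep.poll(1.0) for ep in watchers]

    for ep in watchers:
        ep.close()
    os.close(write_fd)
    return write_fd, results


if __name__ == "__main__":
    fd, results = watch_shared_pipe()
    for events in results:
        print(fd, events)
```

With tens of thousands of children, the number of epoll instances attached to that single pipe grows one per child, which is how a per-file path limit can be reached without any nesting at all.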
Just so I understand, you're saying that all of the 'child' processes will do an epoll_create followed by an epoll_ctl() to attach to a single pipe? And then they are all woken up at once?
If so, this bug fix is about preventing deep nesting. So in this case it sounds like the depth is '1'. We could probably increase the level '1' nesting really large - and then leave the level '2' and beyond where they currently are, to fix this.
Or maybe there is no level '1' limit?
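To make the depth terminology concrete, here is a small sketch (an illustration only; the function name `nested_epoll_demo` and the labels "inner" and "outer" are made up for this example). A pipe registered in one epoll instance is a depth-1 path; registering that epoll instance's own fd in a second epoll instance extends the path to depth 2.

```python
import os
import select


def nested_epoll_demo():
    """Build pipe -> inner epoll -> outer epoll and trigger one wakeup."""
    read_fd, write_fd = os.pipe()

    inner = select.epoll()
    inner.register(read_fd, select.EPOLLIN)         # depth 1: pipe -> inner

    outer = select.epoll()
    outer.register(inner.fileno(), select.EPOLLIN)  # depth 2: pipe -> inner -> outer

    inner_fd = inner.fileno()
    os.write(write_fd, b"x")

    # The write makes the pipe readable, which makes 'inner' readable,
    # which in turn is reported by 'outer'.
    outer_events = outer.poll(1.0)
    inner_events = inner.poll(1.0)

    outer.close()
    inner.close()
    os.close(read_fd)
    os.close(write_fd)
    return read_fd, inner_fd, inner_events, outer_events


if __name__ == "__main__":
    print(nested_epoll_demo())
```

The Dovecot case only ever builds the first kind of link, many times over, so it stays at depth 1 no matter how many processes are involved.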
Created attachment 570690 [details]
allow unlimited number of depth 1 paths
Here's a simple patch, which allows all depth 1 paths. The argument being that you are limited to the number of open files and processes that you can create by the sysadmin. This should still prevent the nasty infinite wakeup paths. Do the limits seem sane then?
unlimited depth 1 paths (limited by open files and processes you can create)
500 depth 2
100 depth 3
50 depth 4
10 depth 5
I didn't think there were any apps with > 1000 depth 1 paths. But I guess I was wrong :(
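The proposed limits can be read as a per-depth lookup table. The sketch below is not the kernel patch; it only models the numbers listed above. The names `PATH_LIMITS` and `path_allowed` are hypothetical, and two details are assumptions: that a new path is rejected once the count has reached the limit, and that anything deeper than depth 5 is rejected outright.

```python
# Proposed per-depth limits from the comment above; depth 1 is unlimited
# (bounded only by the open-file and process limits set by the sysadmin).
PATH_LIMITS = {2: 500, 3: 100, 4: 50, 5: 10}


def path_allowed(depth, existing_paths):
    """Return True if one more path of this depth would be accepted.

    existing_paths is the number of paths of that depth already present.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return True  # no limit on non-nested paths
    limit = PATH_LIMITS.get(depth)
    if limit is None:
        return False  # assumption: deeper than the table allows
    return existing_paths < limit  # assumption: reject once the limit is reached


if __name__ == "__main__":
    print(path_allowed(1, 50000))  # Dovecot-style shared pipe
    print(path_allowed(2, 500))    # one path too many at depth 2
```

Under this model, the shared-pipe case with tens of thousands of depth-1 paths is accepted, while deep structures are still capped.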
Yes, tons of processes each doing a single epoll_create() and then an epoll_ctl() to add a single pipe to it, all woken up at once when the read end of the pipe closes. A depth of 1 sounds like it, but I don't really understand what the nesting stuff in this bug is about.
So this bug is about the ability to attach epoll fds to other fds. Even though the max depth of these paths is 4, you can get 1000^5 or more wakeups, which effectively brings down the box.
The fix is to limit these deep paths, such that you can't get all these wakeups. So we probably could be ok with an unlimited number of depth 1 paths. And impose the limitations just on deeper paths.
The other point here is that sane software can't be creating these 'infinite' wakeup problems, b/c otherwise they wouldn't work. So fundamentally, there has to be a sane limit we can impose.
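A quick back-of-the-envelope calculation shows why the wakeup count matters. This is only arithmetic on the 1000^5 figure quoted above; the function name `worst_case_wakeups` is made up, and the model (every level fanning out to the same number of epoll fds) is a simplifying assumption.

```python
def worst_case_wakeups(fan_out, levels):
    """Wakeups from one event if each level fans out to fan_out epoll fds."""
    return fan_out ** levels


if __name__ == "__main__":
    for levels in range(1, 6):
        print(levels, worst_case_wakeups(1000, levels))
```

At one level the cost is linear in the number of watchers, which is the Dovecot and Postfix case. Each extra level multiplies the total by the fan-out again, so five levels of 1000 give 10^15 wakeups from a single event.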
Also, if somebody could verify the patch from comment #31, that would be greatly appreciated. This is an important fix, which should probably be included asap. Thanks!
I'll be sure to test with the comment #33 patch in as well. I'm hoping to get to it over the weekend.
Just to be clear, I think we just want to test the patch from comment #31. I wouldn't bother with the patch from comment #25.
Good news. It does indeed appear that dovecot is happier, at the same load tested as before, with this new patch. FYI, I simply built the patched kernel with the following options, the patch in #31 added to the spec file, using configs/kernel-2.6.18-x86_64.config and the kernel-2.6.18-308.el5 kernel source:
rpmbuild -bb --with baseonly --target=`uname -m` kernel.spec
So, as for the issues that were causing us problems (Panic: epoll_ctl...), this patch looks to prevent them.
Thanks for taking the time to confirm the fix. Thanks for helping us get to the bottom of this issue!
I've created a new bug:
Bug 804778 - Excessive epoll nesting fix too restrictive
For tracking purposes. Please re-direct all future comments about this to that bz.
Hmm trying to read https://bugzilla.redhat.com/show_bug.cgi?id=804778 gives me an Access Denied error. Is that bz intended to be marked private?
Bug 804778 remains unreadable, after nearly an additional month. Is it a big secret?
I have cross-filed case 00635459 in the Red Hat customer portal, because we
see Postfix watchdog timeouts that seem to be related to that from our point
of view. And as bug #804778 is still marked as private, I'm abusing this one
for the time being.
Jason: Is there a test or hotfix kernel with this patch onboard?
This also appears to hang Apache under any sort of reasonable load.
Also, see bug: https://bugzilla.redhat.com/show_bug.cgi?id=807860
Some other info from the Net for Apache: http://bugs.centos.org/view.php?id=5634
In response to comment #44, yes, there is a kernel with a fix for this issue which is currently undergoing testing. It should be released shortly (I realize that this is a critical issue). The fix, as mentioned, is being tracked under bz #804778. Thanks.
It would probably avoid repeated questions if either bz #804778 were readable/public or there were a brief comment as to why it isn't visible...
There really isn't any additional info in 804778, I was only referring to it b/c it should be updated to indicate when the fix is released. There is no reason for it not to be more public short of my lack of knowledge on how to make bugzillas more visible. I just tried to open it up more. Let me know if you can't view it. Thanks.
Sorry for the noise. Sadly, #804778 still shows "You are not authorized to access bug #804778." It doesn't give any reason.
I can see that the patch is now in the latest kernel 2.6.18-308.8.1.el5 released on 2012-05-29.
* Mon Apr 30 2012 Alexander Gordeev <agordeev> [2.6.18-308.7.1.el5]
- [fs] epoll: Don't limit non-nested epoll paths (Jason Baron) [809380 804778]
This issue has been addressed in following products:
Red Hat Enterprise Linux 6
Via RHSA-2012:0862 https://rhn.redhat.com/errata/RHSA-2012-0862.html
This issue has been addressed in following products:
Red Hat Enterprise Linux 6.2 EUS - Server Only
Via RHSA-2012:1129 https://rhn.redhat.com/errata/RHSA-2012-1129.html