Red Hat Bugzilla – Bug 611434
Crashing when going over open file limits.
Last modified: 2011-07-21 03:47:04 EDT
Description of problem:
When openais runs out of available open files, it segfaults.
- poll_handler_accept gets EMFILE from accept() and returns 0.
- poll_run is the main poll loop. Because a return of 0 from poll_handler_accept indicates no error, the poll loop continues.
- We hit EMFILE again; rinse, repeat.
- The loop is very tight, so we easily flood the log handler through log_printf.
- The openais logger drops log messages due to overflow.
- The exact mechanism is not clear, but we eventually end up with corrupted log message strings or a corrupted va_list.
- syslog eventually runs strlen on a bad pointer, which triggers the segfault.
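The tight loop described above can be reproduced outside openais. The following is a minimal sketch, not the openais code: it assumes Linux, a writable /tmp, and a hard descriptor limit of at least 64. The function name, the socket path and DEMO_FD_LIMIT are made up for illustration. It queues one client on a listening socket, uses up every descriptor, and counts how often poll() followed by accept() fails with EMFILE.

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define DEMO_FD_LIMIT 64 /* arbitrary low soft limit, chosen for the demo */

/* Returns how many of `rounds` poll()+accept() attempts failed with EMFILE,
 * or -1 if the demo could not be set up. All names are hypothetical. */
int count_emfile_spins(int rounds)
{
    struct sockaddr_un addr;
    struct rlimit saved, lowered;
    int filler[DEMO_FD_LIMIT];
    int nfill = 0, spins = -1, listener, client, fd, i;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path, sizeof(addr.sun_path),
             "/tmp/emfile-demo-%ld.sock", (long)getpid());
    unlink(addr.sun_path);

    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener < 0 || client < 0 ||
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(listener, 8) != 0 ||
        connect(client, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        getrlimit(RLIMIT_NOFILE, &saved) != 0)
        goto out;

    /* Lower the soft limit, then use up every descriptor below it. */
    lowered = saved;
    lowered.rlim_cur = DEMO_FD_LIMIT;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        goto out;
    while (nfill < DEMO_FD_LIMIT && (fd = dup(listener)) >= 0)
        filler[nfill++] = fd;

    /* One client is pending. Each accept() fails with EMFILE and leaves it
     * queued, so poll() reports the listener readable again at once. */
    for (spins = 0, i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = listener, .events = POLLIN };

        if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLIN))
            break;
        fd = accept(listener, NULL, NULL);
        if (fd >= 0) {          /* unexpected: a descriptor was free */
            close(fd);
            break;
        }
        if (errno != EMFILE)
            break;
        spins++;
    }

    /* Release the filler descriptors and restore the original limit. */
    while (nfill > 0)
        close(filler[--nfill]);
    setrlimit(RLIMIT_NOFILE, &saved);
out:
    if (listener >= 0)
        close(listener);
    if (client >= 0)
        close(client);
    unlink(addr.sun_path);
    return spins;
}
```

Calling count_emfile_spins(1000) should report that every round failed, without poll() ever blocking. A handler that logs on each of those failures would emit one message per round, which matches the flood described above.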
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Artificially reduce the open files limit.
2. Mount 255 GFS file systems in a 4-node cluster.
3. Watch the system crash.
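For step 1, the limit can be lowered from a shell with ulimit -n before the daemon starts, or from inside a process with setrlimit(). The sketch below is only an illustration of the second option and is not taken from openais. The function name and MAX_DEMO_FDS are hypothetical, and it assumes /dev/null can be opened. It lowers the soft limit, opens files until open() fails, and returns the errno it saw.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_DEMO_FDS 256 /* upper bound on descriptors the demo will hold */

/* Lowers the soft RLIMIT_NOFILE to `new_limit`, opens /dev/null until
 * open() fails, and returns the errno of the failing call.
 * Returns -1 if the limit could not be changed. Hypothetical helper. */
int errno_after_exhausting_fds(rlim_t new_limit)
{
    struct rlimit saved, lowered;
    int fds[MAX_DEMO_FDS];
    int count = 0, result = 0, fd;

    if (new_limit >= MAX_DEMO_FDS || getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return -1;

    lowered = saved;
    lowered.rlim_cur = new_limit; /* only the soft limit is touched */
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return -1;

    /* new_limit < MAX_DEMO_FDS, so open() must fail before the array fills. */
    while (count < MAX_DEMO_FDS) {
        fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            result = errno;
            break;
        }
        fds[count++] = fd;
    }

    /* Clean up and restore the original soft limit. */
    while (count > 0)
        close(fds[--count]);
    setrlimit(RLIMIT_NOFILE, &saved);
    return result;
}
```

With a limit of 32, for example, the returned value should be EMFILE, the same errno that shows up in the log as "Too many open files".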
Aug 29 02:18:45 pe1950-1 openais: [MAIN ] ERROR: Could not accept Library connection: Too many open files
Aug 29 02:18:45 pe1950-1 last message repeated 1034 times
Aug 29 02:18:45 pe1950-1 openais: [MAIN ] ERROR: Could not accept Library connection: (null) - prior to this log entry,
openais logger dropped '1337' messages because of overflow.
Aug 29 02:18:45 pe1950-1 kernel: aisexec general protection rip:33ece78d50 rsp:7fffcb110ed8 error:0
Expected behavior: aisexec may log that it could not accept the library connection, but it should not crash.
While this is artificially induced by reducing the open files limit (ulimit modification), customers are seeing it in production environments.
We will address the file limitation issue in 5.6 by removing the IPC connection in this case; ignore comment #2.
If a customer hits this limit, I recommend increasing their system file limits.
Created attachment 436714
handle maximum use of file descriptors gracefully.
If we get EMFILE from accept() then withdraw the published
server listening socket. Then when a connection closes
see if we need to re-publish the server socket.
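The withdraw and re-publish idea can be sketched as follows. This is not the attached patch. It assumes the poll loop keeps a struct pollfd per listener, and it relies on the POSIX rule that poll() ignores entries whose fd is negative. struct listener_slot and both function names are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one listening socket in a poll set. */
struct listener_slot {
    int listen_fd;      /* the real listening socket */
    struct pollfd *pfd; /* its entry in the array passed to poll() */
};

static bool listener_is_published(const struct listener_slot *s)
{
    return s->pfd->fd >= 0;
}

/* Call when accept() on the listener fails; `err` is the saved errno. */
void listener_on_accept_error(struct listener_slot *s, int err)
{
    if (err == EMFILE)
        s->pfd->fd = -1; /* withdraw: poll() skips negative fds */
}

/* Call after any connection has been closed and its descriptor freed. */
void listener_on_connection_closed(struct listener_slot *s)
{
    if (!listener_is_published(s))
        s->pfd->fd = s->listen_fd; /* re-publish the listening socket */
}
```

While the listener is withdrawn, pending clients stay in the kernel backlog, so the loop no longer wakes up just to fail in accept() again.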
Created attachment 440812
this is a better fix (plus the next patch)
Created attachment 440814
return NO_RESOURCES when approaching fd limit
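One way to read "approaching fd limit" is a headroom check against the soft RLIMIT_NOFILE. The sketch below is an assumption about the general shape of such a check, not the attached patch. The enum stands in for whatever NO_RESOURCES code the real IPC layer returns, and the caller is assumed to track its own count of open descriptors.

```c
#include <sys/resource.h>

/* Stand-in status codes; the real patch uses its own NO_RESOURCES value. */
enum demo_status { DEMO_OK = 0, DEMO_ERR_NO_RESOURCES = 1 };

/* Refuse a new connection once fewer than `reserve` descriptors remain. */
enum demo_status admit_connection(unsigned long open_fds,
                                  unsigned long soft_limit,
                                  unsigned long reserve)
{
    if (soft_limit <= reserve || open_fds >= soft_limit - reserve)
        return DEMO_ERR_NO_RESOURCES;
    return DEMO_OK;
}

/* Same check, using the current soft limit of this process. */
enum demo_status admit_connection_now(unsigned long open_fds,
                                      unsigned long reserve)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return DEMO_ERR_NO_RESOURCES; /* be conservative on failure */
    if (rl.rlim_cur == RLIM_INFINITY)
        return DEMO_OK;
    return admit_connection(open_fds, (unsigned long)rl.rlim_cur, reserve);
}
```

Keeping a reserve means the daemon can still open files it needs for itself after it has started refusing new library connections.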
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.
This request was erroneously denied for the current release of
Red Hat Enterprise Linux. The error has been fixed and this
request has been re-proposed for the current release.
Created attachment 489344
The corosync versions have been committed upstream, but the whitetank one
hasn't. I have re-tested the original whitetank patch and it works fine.
So I'll re-post it here and to the ML.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.