Description of problem:
When openais runs out of available open files, it segfaults.

- poll_handler_accept experiences EMFILE errno, and return(0)
- poll_run is main poll loop
- because return(0) from poll_handler_accept indicates NO ERROR, poll loop continues
- We hit EMFILE again, rinse repeat
- Loop is very tight, and thus we easily flood the log handler through log_printf
- openais logger drops log messages due to overflow
- Not quite sure of the exact mechanism, but we eventually end up with corrupted log message strings or va_list
- syslog eventually runs strlen on a bad pointer, which triggers the SEGFAULT

Version-Release number of selected component (if applicable):
openais-0.80.3-15.el5_2.2

How reproducible:
Every time.

Steps to Reproduce:
1. Artificially reduce the amount of open files limit.
2. Mount 255 GFS file systems in a 4 node cluster.
3. Watch system crash.

Actual results:
Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: Too many open files
Aug 29 02:18:45 pe1950-1 last message repeated 1034 times
Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: (null) - prior to this log entry, openais logger dropped '1337' messages because of overflow.
Aug 29 02:18:45 pe1950-1 kernel: aisexec[7495] general protection rip:33ece78d50 rsp:7fffcb110ed8 error:0

Expected results:
Perhaps logging the error message that it couldn't accept library connection, but no crashing of aisexec.

Additional info:
While this is artificially induced by reducing the amount of open files (ulimit modification) customers are seeing this in production environments.
We will address file limitation issue in 5.6 by removing ipc connection in this case - ignore comment #2. I would recommend if a customer hits this limit to increase their system file limits.
Created attachment 436714 [details] handle maximum use of file descriptors gracefully. If we get EMFILE from accept() then withdraw the published server listening socket. Then when a connection closes see if we need to re-publish the server socket.
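For readers without access to the attachment, here is a minimal sketch of the idea described above. It is not the attached patch. It assumes a poll(2)-based loop and relies on the documented behaviour that poll() ignores entries whose fd is negative. The names `struct listener`, `listener_withdraw`, `listener_republish`, `listener_accept` and `connection_close` are hypothetical and do not come from openais.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one listening socket in a poll set. */
struct listener {
    struct pollfd *slot; /* this listener's entry in the array given to poll() */
    int fd;              /* the listening socket itself */
    bool withdrawn;      /* true while the listener is removed from polling */
};

/* Stop polling the listener. poll() skips entries with a negative fd,
 * so the loop no longer wakes up for connections it cannot accept. */
void listener_withdraw(struct listener *l)
{
    l->slot->fd = -1;
    l->slot->revents = 0;
    l->withdrawn = true;
}

/* Start polling the listener again, if it was withdrawn. */
void listener_republish(struct listener *l)
{
    if (l->withdrawn) {
        l->slot->fd = l->fd;
        l->slot->events = POLLIN;
        l->withdrawn = false;
    }
}

/* Accept one connection. On EMFILE/ENFILE, withdraw the listener instead
 * of returning to the loop with the socket still readable.
 * Returns the new descriptor, or -1 with errno set. */
int listener_accept(struct listener *l)
{
    int conn;

    do {
        conn = accept(l->fd, NULL, NULL);
    } while (conn == -1 && errno == EINTR);

    if (conn == -1 && (errno == EMFILE || errno == ENFILE)) {
        int saved = errno;
        listener_withdraw(l);
        errno = saved;
    }
    return conn;
}

/* Closing a connection frees a descriptor, so re-check the listener. */
void connection_close(struct listener *l, int conn)
{
    close(conn);
    listener_republish(l);
}
```

With this shape the error is logged once per exhaustion event rather than once per loop iteration, because the listening socket stops reporting POLLIN until a descriptor has been released.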
Created attachment 440812 [details] this is a better fix (plus the next patch)
Created attachment 440814 [details] return NO_RESOURCES when approaching fd limit
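A rough sketch of what "return NO_RESOURCES when approaching fd limit" could look like, again not the attached patch. It assumes the server tracks how many descriptors it has open and compares that count against the RLIMIT_NOFILE soft limit, keeping some headroom. `FD_RESERVE`, `enum conn_result`, `admit_connection` and `fd_soft_limit` are hypothetical names, and the reserve of 10 is an arbitrary illustrative value.

```c
#include <limits.h>
#include <sys/resource.h>

/* Hypothetical headroom kept free for the daemon's own needs
 * (log files, internal sockets). The value is illustrative only. */
#define FD_RESERVE 10UL

/* Hypothetical result codes for a new IPC connection request. */
enum conn_result { CONN_OK, CONN_NO_RESOURCES };

/* Policy check: refuse new connections once the number of open
 * descriptors comes within FD_RESERVE of the soft limit. */
enum conn_result admit_connection(unsigned long open_fds,
                                  unsigned long soft_limit)
{
    if (soft_limit <= FD_RESERVE || open_fds >= soft_limit - FD_RESERVE) {
        return CONN_NO_RESOURCES;
    }
    return CONN_OK;
}

/* Read the current soft limit on open files.
 * Returns 0 if the limit cannot be read. */
unsigned long fd_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return 0;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return ULONG_MAX;
    }
    return (unsigned long)rl.rlim_cur;
}
```

Refusing early lets the client receive an explicit error while the server still has descriptors left to send it, instead of failing inside accept() with EMFILE.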
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
Created attachment 489344 [details] whitetank patch The corosync versions have been committed upstream, but the whitetank one hasn't. I have re-tested the original whitetank patch and it works fine. So I'll re-post it here and to the ML.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-1012.html