Bug 611434 - Crashing when going over open file limits.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.6
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 694181 707860
Reported: 2010-07-05 05:17 EDT by Wade Mealing
Modified: 2011-07-21 03:47 EDT

See Also:
Fixed In Version: openais-0.80.6-29.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 707860
Environment:
Last Closed: 2011-07-21 03:47:04 EDT


Attachments
- handle maximum use of file descriptors gracefully. (6.83 KB, patch), 2010-08-04 22:19 EDT, Angus Salkeld
- this is a better fix (plus the next patch) (5.24 KB, patch), 2010-08-24 21:22 EDT, Angus Salkeld
- return NO_RESOURCES when approaching fd limit (5.35 KB, patch), 2010-08-24 21:23 EDT, Angus Salkeld
- whitetank patch (6.83 KB, patch), 2011-04-01 06:17 EDT, Angus Salkeld

Description Wade Mealing 2010-07-05 05:17:12 EDT
Description of problem:

When openais runs out of available open files, it segfaults.

- poll_handler_accept gets EMFILE from accept() and returns 0.
- poll_run is the main poll loop. Because a return value of 0 from poll_handler_accept indicates no error, the poll loop continues.
- We hit EMFILE again, and the cycle repeats.
- The loop is very tight, so we easily flood the log handler through log_printf.
- The openais logger drops log messages due to overflow.
- The exact mechanism is unclear, but we eventually end up with corrupted log message strings or a corrupted va_list.
- syslog eventually runs strlen on a bad pointer, which triggers the segfault.
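
The tight loop in the list above can be reproduced outside openais. What follows is a minimal sketch in C, not openais code: count_emfile_spins and the socket name are hypothetical, and the abstract AF_UNIX address is Linux-specific. It queues one connection, lowers the soft RLIMIT_NOFILE so that accept() fails with EMFILE, and counts how often poll() immediately reports the listener as readable again.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: returns how many of max_iters poll() wake-ups ended
 * with accept() failing with EMFILE, or -1 if the setup itself failed. */
int count_emfile_spins(int max_iters)
{
    struct sockaddr_un addr;
    struct rlimit saved, tight;
    struct pollfd pfd;
    socklen_t len;
    int lfd, cfd, spins = -1, i;

    assert(max_iters >= 0);

    /* Linux abstract-namespace address: leading NUL, no filesystem entry. */
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
             "emfile-demo-%ld", (long)getpid());
    len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 +
                      strlen(addr.sun_path + 1));

    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    cfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0)
        goto out;
    if (bind(lfd, (struct sockaddr *)&addr, len) < 0 || listen(lfd, 8) < 0)
        goto out;
    /* The connection is queued on the listener without any accept(). */
    if (connect(cfd, (struct sockaddr *)&addr, len) < 0)
        goto out;
    if (getrlimit(RLIMIT_NOFILE, &saved) < 0)
        goto out;

    /* Soft limit 1: descriptor 0 is already in use (socket() hands out the
     * lowest free number), so no new descriptor can be allocated, while
     * poll() on a single entry is still permitted. */
    tight = saved;
    tight.rlim_cur = 1;
    if (setrlimit(RLIMIT_NOFILE, &tight) < 0)
        goto out;

    spins = 0;
    pfd.fd = lfd;
    pfd.events = POLLIN;
    for (i = 0; i < max_iters; i++) {
        int fd;

        pfd.revents = 0;
        if (poll(&pfd, 1, 0) != 1 || !(pfd.revents & POLLIN))
            break;
        fd = accept(lfd, NULL, NULL);
        if (fd >= 0) {
            close(fd);
            break;
        }
        if (errno != EMFILE)
            break;
        /* A handler that only logs and returns 0 ends up here every pass. */
        spins++;
    }
    setrlimit(RLIMIT_NOFILE, &saved);  /* restore the original limit */
out:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return spins;
}
```

On Linux, count_emfile_spins(100) should return 100. The pending connection is never dequeued, so every pass through the loop hits EMFILE again, which matches the log flood described above.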


Version-Release number of selected component (if applicable):

openais-0.80.3-15.el5_2.2

How reproducible:

Every time.


Steps to Reproduce:
1. Artificially reduce the open files limit (ulimit).
2. Mount 255 GFS file systems in a 4-node cluster.
3. Watch the system crash.
  
Actual results:

Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: Too many open files
Aug 29 02:18:45 pe1950-1 last message repeated 1034 times
Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: (null) - prior to this log entry,
openais logger dropped '1337' messages because of overflow.
Aug 29 02:18:45 pe1950-1 kernel: aisexec[7495] general protection rip:33ece78d50 rsp:7fffcb110ed8 error:0

Expected results:

Perhaps log the error that the library connection could not be accepted, but aisexec should not crash.

Additional info:

While this is artificially induced here by reducing the open files limit (a ulimit modification), customers are seeing it in production environments.
Comment 4 Steven Dake 2010-07-12 13:24:56 EDT
We will address the file limit issue in 5.6 by removing the IPC connection in this case. Ignore comment #2.

If a customer hits this limit, I recommend increasing their system file limits.
Comment 7 Angus Salkeld 2010-08-04 22:19:23 EDT
Created attachment 436714
handle maximum use of file descriptors gracefully.

If we get EMFILE from accept(), withdraw the published
server listening socket. Then, when a connection closes,
check whether we need to re-publish the server socket.
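
One way to read this approach, sketched in C. This is not the attached patch: struct server and the function names are hypothetical, and "withdraw" is modelled here as leaving the listener out of the poll set. It relies on poll() ignoring entries whose fd is negative.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one listening socket. */
struct server {
    int listen_fd;
    int published;  /* non-zero while listen_fd takes part in poll() */
};

/* Record the outcome of an accept() call. When the process (EMFILE) or the
 * system (ENFILE) is out of descriptors, withdraw the listener so that the
 * poll loop stops waking up for a connection it cannot accept. */
void server_note_accept_result(struct server *s, int fd, int err)
{
    assert(s != NULL);
    if (fd < 0 && (err == EMFILE || err == ENFILE))
        s->published = 0;
}

/* Call after a client connection has been closed: a descriptor is free
 * again, so the listener can be re-published. */
void server_note_connection_closed(struct server *s)
{
    assert(s != NULL);
    s->published = 1;
}

/* Fill the listener's slot in the pollfd array before each poll() call.
 * A negative fd makes poll() skip the entry. */
void server_fill_pollfd(const struct server *s, struct pollfd *pfd)
{
    assert(s != NULL && pfd != NULL);
    pfd->fd = s->published ? s->listen_fd : -1;
    pfd->events = POLLIN;
    pfd->revents = 0;
}
```

With the listener withdrawn, pending clients wait in the listen backlog instead of driving the poll loop, and they are picked up once a descriptor becomes available.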
Comment 8 Angus Salkeld 2010-08-24 21:22:45 EDT
Created attachment 440812
this is a better fix (plus the next patch)
Comment 9 Angus Salkeld 2010-08-24 21:23:54 EDT
Created attachment 440814
return NO_RESOURCES when approaching fd limit
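
A rough sketch in C of what such a check could look like. The attachment itself is not reproduced here, so everything below is an assumption: struct fd_budget, the reserve margin, and ADMIT_NO_RESOURCES (a stand-in for whatever error code the patch returns) are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

enum admit_result { ADMIT_OK = 0, ADMIT_NO_RESOURCES = 1 };

/* Hypothetical accounting of descriptors for the daemon. */
struct fd_budget {
    unsigned long in_use;      /* descriptors currently open */
    unsigned long soft_limit;  /* soft RLIMIT_NOFILE */
    unsigned long reserve;     /* descriptors kept free for internal use */
};

/* Read the current soft limit into the budget; returns 0 on success.
 * An unlimited soft limit (RLIM_INFINITY) is stored as-is. */
int fd_budget_init(struct fd_budget *b, unsigned long reserve)
{
    struct rlimit rl;

    assert(b != NULL);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    b->in_use = 0;
    b->soft_limit = (unsigned long)rl.rlim_cur;
    b->reserve = reserve;
    return 0;
}

/* Refuse a new connection once fewer than `reserve` descriptors remain,
 * instead of waiting for accept() to fail with EMFILE. */
enum admit_result admit_connection(const struct fd_budget *b)
{
    assert(b != NULL);
    if (b->soft_limit <= b->reserve)
        return ADMIT_NO_RESOURCES;
    if (b->in_use >= b->soft_limit - b->reserve)
        return ADMIT_NO_RESOURCES;
    return ADMIT_OK;
}
```

For example, with a soft limit of 1024 and a reserve of 16, connections are admitted while fewer than 1008 descriptors are in use and refused from 1008 onward.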
Comment 10 RHEL Product and Program Management 2011-01-11 14:50:08 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.
Comment 11 RHEL Product and Program Management 2011-01-11 18:16:02 EST
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.
Comment 13 Angus Salkeld 2011-04-01 06:17:20 EDT
Created attachment 489344
whitetank patch

The corosync versions have been committed upstream, but the whitetank one
hasn't. I have re-tested the original whitetank patch and it works fine.

So I'll re-post it here and to the ML.
Comment 21 errata-xmlrpc 2011-07-21 03:47:04 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html
