Bug 611434

Summary: Crashing when going over open file limits.
Product: Red Hat Enterprise Linux 5
Reporter: Wade Mealing <wmealing>
Component: openais
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 5.6
CC: cluster-maint, djansa, edamato, jfriesse, jkortus, jwest, sdake, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: openais-0.80.6-29.el5
Doc Type: Bug Fix
Clone Of:
: 707860
Last Closed: 2011-07-21 07:47:04 UTC
Bug Blocks: 694181, 707860    
Attachments:
Description                                          Flags
handle maximum use of file descriptors gracefully.   none
this is a better fix (plus the next patch)           none
return NO_RESOURCES when approaching fd limit        none
whitetank patch                                      none

Description Wade Mealing 2010-07-05 09:17:12 UTC
Description of problem:

When openais runs out of available file descriptors, it segfaults.

- poll_handler_accept gets EMFILE from accept() and does return(0).
- poll_run is the main poll loop. Because return(0) from poll_handler_accept indicates no error, the poll loop continues.
- We hit EMFILE again; rinse, repeat.
- The loop is very tight, so we easily flood the log handler through log_printf.
- The openais logger drops log messages due to overflow.
- I am not sure of the exact mechanism, but we eventually end up with corrupted log message strings or a corrupted va_list.
- syslog eventually runs strlen on a bad pointer, which triggers the segfault.
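
The first three steps above can be reproduced outside openais. The following is a minimal sketch, not the openais code: count_emfile_wakeups, DEMO_FD_LIMIT and the /tmp socket path are names made up for this illustration. It leaves one client waiting on a listening socket, uses up the process's file descriptors, and then counts how often a poll/accept loop wakes up only to get EMFILE, because the connection that cannot be accepted stays pending.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define DEMO_FD_LIMIT 64  /* hypothetical low limit, used only for this demo */

/*
 * Runs `iterations` rounds of a poll/accept loop while the process is out of
 * file descriptors. Returns how many rounds woke up for a connection that
 * accept() then refused with EMFILE, or -1 if the setup failed.
 */
int count_emfile_wakeups(int iterations)
{
    struct sockaddr_un addr;
    struct rlimit saved, lowered;
    int filler[DEMO_FD_LIMIT];
    int nfiller = 0, wakeups = 0, i;
    int listener, client;

    assert(iterations >= 0);

    /* One listening UNIX socket with one client waiting in its backlog. */
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    snprintf(addr.sun_path, sizeof(addr.sun_path),
             "/tmp/emfile-demo-%ld.sock", (long)getpid());
    unlink(addr.sun_path);
    listener = socket(AF_UNIX, SOCK_STREAM, 0);
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener < 0 || client < 0 ||
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 8) < 0 ||
        connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getrlimit(RLIMIT_NOFILE, &saved) < 0)
        return -1;

    /* Lower the soft limit, then use up every descriptor that is left. */
    lowered = saved;
    if (lowered.rlim_cur > DEMO_FD_LIMIT)
        lowered.rlim_cur = DEMO_FD_LIMIT;
    if (setrlimit(RLIMIT_NOFILE, &lowered) < 0)
        return -1;
    while (nfiller < DEMO_FD_LIMIT) {
        int fd = dup(listener);
        if (fd < 0)
            break;
        filler[nfiller++] = fd;
    }

    for (i = 0; i < iterations; i++) {
        struct pollfd pfd = { .fd = listener, .events = POLLIN };
        int conn;

        if (poll(&pfd, 1, 0) != 1 || !(pfd.revents & POLLIN))
            continue;
        conn = accept(listener, NULL, NULL);
        if (conn < 0 && errno == EMFILE) {
            /* Equivalent of return(0): nothing changes, the loop goes on. */
            wakeups++;
            continue;
        }
        if (conn >= 0)
            close(conn);
    }

    /* Undo everything so the caller's process is left as it was. */
    while (nfiller > 0)
        close(filler[--nfiller]);
    setrlimit(RLIMIT_NOFILE, &saved);
    close(client);
    close(listener);
    unlink(addr.sun_path);
    return wakeups;
}
```

With a zero poll timeout every round returns immediately, so a real dispatch loop in this state spins without blocking and logs one error per round.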


Version-Release number of selected component (if applicable):

openais-0.80.3-15.el5_2.2

How reproducible:

Every time.


Steps to Reproduce:
1. Artificially reduce the open-file limit (ulimit).
2. Mount 255 GFS file systems in a 4-node cluster.
3. Watch the system crash.
  
Actual results:

Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: Too many open files
Aug 29 02:18:45 pe1950-1 last message repeated 1034 times
Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: (null) - prior to this log entry, openais logger dropped '1337' messages because of overflow.
Aug 29 02:18:45 pe1950-1 kernel: aisexec[7495] general protection rip:33ece78d50 rsp:7fffcb110ed8 error:0

Expected results:

Perhaps a logged error saying that the library connection could not be accepted, but no crash of aisexec.

Additional info:

While this is artificially induced by reducing the open-file limit (ulimit modification), customers are seeing this in production environments.

Comment 4 Steven Dake 2010-07-12 17:24:56 UTC
We will address the file limit issue in 5.6 by removing the IPC connection in this case; ignore comment #2.

If a customer hits this limit, I would recommend increasing their system file limits.

Comment 7 Angus Salkeld 2010-08-05 02:19:23 UTC
Created attachment 436714
handle maximum use of file descriptors gracefully.

If we get EMFILE from accept(), then withdraw the published
server listening socket. Then, when a connection closes,
see if we need to re-publish the server socket.
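
The patch itself is not quoted in this report, so the following is only a sketch of the withdraw/re-publish idea under stated assumptions: struct listener and the listener_* functions are hypothetical names, and "withdrawing" is modelled by setting the pollfd entry's fd to -1, which poll() is specified to ignore.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one listening socket in a poll set. */
struct listener {
    int fd;              /* the real listening descriptor */
    struct pollfd slot;  /* its entry in the array passed to poll() */
};

void listener_init(struct listener *l, int fd)
{
    assert(fd >= 0);
    l->fd = fd;
    l->slot.fd = fd;
    l->slot.events = POLLIN;
    l->slot.revents = 0;
}

int listener_is_withdrawn(const struct listener *l)
{
    return l->slot.fd < 0;
}

/* On EMFILE, stop polling the listener instead of retrying in a tight loop. */
void listener_on_accept_error(struct listener *l, int err)
{
    if (err == EMFILE)
        l->slot.fd = -1;  /* poll() skips entries with a negative fd */
}

/* Called when poll() reports the listener readable. Returns the new fd or -1. */
int listener_accept(struct listener *l)
{
    int conn = accept(l->fd, NULL, NULL);

    if (conn < 0)
        listener_on_accept_error(l, errno);
    return conn;
}

/* Called after a connection has been closed, i.e. a descriptor is free again. */
void listener_on_connection_closed(struct listener *l)
{
    if (listener_is_withdrawn(l))
        l->slot.fd = l->fd;  /* re-publish: poll() watches it again */
}
```

While the listener is withdrawn, pending clients simply wait in the listen backlog, so the dispatch loop blocks in poll() instead of spinning and flooding the log.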

Comment 8 Angus Salkeld 2010-08-25 01:22:45 UTC
Created attachment 440812
this is a better fix (plus the next patch)

Comment 9 Angus Salkeld 2010-08-25 01:23:54 UTC
Created attachment 440814
return NO_RESOURCES when approaching fd limit
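
Only the title of this attachment appears here, so the following is a guess at the general idea rather than the patch: refuse new connections while some descriptors are still free, so that the server never reaches the hard failure. The enum, admit_connection and the reserve parameter are hypothetical; CONN_ERR_NO_RESOURCES stands in for whatever NO_RESOURCES error the real code returns to the library.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical result codes for this sketch only. */
enum conn_result {
    CONN_OK,
    CONN_ERR_NO_RESOURCES
};

/*
 * Decide whether to admit one more connection.
 * open_fds: descriptors the process currently holds (tracked by the caller).
 * reserve:  descriptors to keep free for the server's own needs.
 */
enum conn_result admit_connection(rlim_t open_fds, rlim_t reserve)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return CONN_ERR_NO_RESOURCES;  /* cannot tell, so be conservative */
    if (rl.rlim_cur == RLIM_INFINITY)
        return CONN_OK;
    assert(reserve <= rl.rlim_cur);    /* a larger reserve would admit nothing */
    if (open_fds >= rl.rlim_cur - reserve)
        return CONN_ERR_NO_RESOURCES;  /* too close to the soft limit */
    return CONN_OK;
}
```

A caller would check this before accepting a library connection and hand the error back to the client, which can then retry later.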

Comment 10 RHEL Program Management 2011-01-11 19:50:08 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 11 RHEL Program Management 2011-01-11 23:16:02 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 13 Angus Salkeld 2011-04-01 10:17:20 UTC
Created attachment 489344
whitetank patch

The corosync versions have been committed upstream, but the whitetank one
hasn't. I have re-tested the original whitetank patch and it works fine.

So I'll re-post it here and to the ML.

Comment 21 errata-xmlrpc 2011-07-21 07:47:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html