611434 – Crashing when going over open file limits.

Summary: Crashing when going over open file limits.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	openais
Sub Component:
Version:	5.6
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	694181 707860

Reported:	2010-07-05 09:17 UTC by Wade Mealing
Modified:	2018-11-14 20:35 UTC
CC List:	8 users
Fixed In Version:	openais-0.80.6-29.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	707860
Environment:
Last Closed:	2011-07-21 07:47:04 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
handle maximum use of file descriptors gracefully. (6.83 KB, patch) 2010-08-05 02:19 UTC, Angus Salkeld
this is a better fix (plus the next patch) (5.24 KB, patch) 2010-08-25 01:22 UTC, Angus Salkeld
return NO_RESOURCES when approaching fd limit (5.35 KB, patch) 2010-08-25 01:23 UTC, Angus Salkeld
whitetank patch (6.83 KB, patch) 2011-04-01 10:17 UTC, Angus Salkeld

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1012	0	normal	SHIPPED_LIVE	openais bug fix update	2011-07-20 15:44:37 UTC

Description Wade Mealing 2010-07-05 09:17:12 UTC

Description of problem:

When openais runs out of available open files, it segfaults.

- poll_handler_accept gets EMFILE from accept() and does return(0).
- poll_run is the main poll loop. Because return(0) from poll_handler_accept indicates NO ERROR, the poll loop continues.
- We hit EMFILE again; rinse, repeat.
- The loop is very tight, so we easily flood the log handler through log_printf.
- The openais logger drops log messages due to overflow.
- Not quite sure of the exact mechanism, but we eventually end up with corrupted log message strings or a corrupted va_list.
- syslog eventually runs strlen on a bad pointer, which triggers the SEGFAULT.
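
The first steps of this chain can be modeled in a few lines. This is a minimal Python sketch, not the openais code: only the names poll_handler_accept and poll_run come from the report, while fake_accept, BoundedLog, the buffer capacity, and the iteration cap are hypothetical stand-ins. It shows how a handler that returns 0 after EMFILE keeps the loop spinning and overflows a bounded logger. It does not attempt to reproduce the later memory corruption.

```python
import errno


class BoundedLog:
    """Hypothetical stand-in for a fixed-size log buffer that drops on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.messages = []
        self.dropped = 0

    def log(self, message):
        if len(self.messages) < self.capacity:
            self.messages.append(message)
        else:
            self.dropped += 1


def fake_accept():
    """Simulates accept() failing because the process is out of descriptors."""
    raise OSError(errno.EMFILE, "Too many open files")


def poll_handler_accept(log):
    try:
        fake_accept()
    except OSError as exc:
        log.log("ERROR: Could not accept Library connection: " + exc.strerror)
    # Returning 0 tells the caller nothing went wrong, even after EMFILE.
    return 0


def poll_run(log, max_iterations):
    """Bounded model of the poll loop; the real loop has no iteration cap."""
    iterations = 0
    while iterations < max_iterations:
        # The pending connection is never accepted, so the listening socket
        # stays readable and the handler fires again immediately.
        if poll_handler_accept(log) != 0:
            break
        iterations += 1
    return iterations
```

Running poll_run with a small BoundedLog fills the buffer almost at once, and every later message only increments the dropped counter, which matches the "dropped '1337' messages" pattern in the log excerpt under "Actual results".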

Version-Release number of selected component (if applicable):

openais-0.80.3-15.el5_2.2

How reproducible:

Every time.

Steps to Reproduce:
1. Artificially reduce the open files limit.
2. Mount 255 GFS file systems in a 4 node cluster.
3. Watch system crash.
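
Step 1 can be approximated inside a single process, without a cluster, to see what the EMFILE condition looks like. The sketch below is an illustration only and is not part of the original reproduction: it lowers the soft RLIMIT_NOFILE limit (the limit that `ulimit -n` adjusts) for the current process, opens descriptors until the kernel refuses, and then restores the limit. The function name exhaust_fds and the default limit of 64 are hypothetical, and the `resource` module is only available on Unix-like systems.

```python
import errno
import os
import resource


def exhaust_fds(soft_limit=64):
    """Lower the soft fd limit, open files until failure, return (errno, opened)."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != resource.RLIM_INFINITY:
        # Never try to raise the limit; only lower it.
        soft_limit = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))

    opened = []
    failure = None
    try:
        # At most soft_limit descriptors can exist, so one extra attempt
        # is guaranteed to hit the limit.
        for _ in range(soft_limit + 1):
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        failure = exc.errno
    finally:
        for fd in opened:
            os.close(fd)
        # Restore the original limit so the rest of the process is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return failure, len(opened)
```

On Linux the returned errno is expected to be EMFILE, the same error that accept() reports in the log excerpt below.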

Actual results:

Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: Too many open files
Aug 29 02:18:45 pe1950-1 last message repeated 1034 times
Aug 29 02:18:45 pe1950-1 openais[7495]: [MAIN ] ERROR: Could not accept Library connection: (null) - prior to this log entry,
openais logger dropped '1337' messages because of overflow.
Aug 29 02:18:45 pe1950-1 kernel: aisexec[7495] general protection rip:33ece78d50 rsp:7fffcb110ed8 error:0

Expected results:

Perhaps log an error message saying that the library connection could not be accepted, but aisexec should not crash.

Additional info:

While this was artificially induced here by reducing the open files limit (ulimit modification), customers are seeing it in production environments.

Comment 4 Steven Dake 2010-07-12 17:24:56 UTC

We will address the file limit issue in 5.6 by removing the IPC connection in this case; ignore comment #2.

If a customer hits this limit, I would recommend increasing their system file limits.

Comment 7 Angus Salkeld 2010-08-05 02:19:23 UTC

Created attachment 436714
handle maximum use of file descriptors gracefully.

If we get EMFILE from accept(), withdraw the published
server listening socket. Then, when a connection closes,
check whether we need to re-publish the server socket.
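
The approach described in this comment can be sketched as follows. This is a hypothetical Python model and not the attached patch: the Listener and LimitedAccept classes and the `published` flag are invented for illustration, and "withdrawing" is modeled simply as taking the listener out of the poll set, which may differ from what the patch actually does.

```python
import errno


class LimitedAccept:
    """Fake accept() that fails with EMFILE once `limit` connections are open."""

    def __init__(self, limit):
        self.limit = limit
        self.open_count = 0

    def __call__(self):
        if self.open_count >= self.limit:
            raise OSError(errno.EMFILE, "Too many open files")
        self.open_count += 1
        return object()  # placeholder for a connection

    def close(self):
        self.open_count -= 1


class Listener:
    """Hypothetical model of a listening socket registered with a poll loop."""

    def __init__(self, accept_fn):
        self.accept_fn = accept_fn  # returns a connection or raises OSError
        self.published = True       # True while the listener is in the poll set
        self.connections = []

    def on_readable(self):
        """Called by the poll loop when the listening socket is readable."""
        try:
            conn = self.accept_fn()
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                # Withdraw the listener so the loop stops spinning on it.
                self.published = False
                return None
            raise
        self.connections.append(conn)
        return conn

    def on_connection_closed(self, conn):
        """A closed connection frees a descriptor, so re-publish if withdrawn."""
        self.connections.remove(conn)
        if not self.published:
            self.published = True
```

Because the listener is no longer polled after EMFILE, the handler is not invoked again until a descriptor has been freed, which removes the tight loop and the resulting log flood.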

Comment 8 Angus Salkeld 2010-08-25 01:22:45 UTC

Created attachment 440812
this is a better fix (plus the next patch)

Comment 9 Angus Salkeld 2010-08-25 01:23:54 UTC

Created attachment 440814
return NO_RESOURCES when approaching fd limit

Comment 10 RHEL Program Management 2011-01-11 19:50:08 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 11 RHEL Program Management 2011-01-11 23:16:02 UTC

This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 13 Angus Salkeld 2011-04-01 10:17:20 UTC

Created attachment 489344
whitetank patch

The corosync versions have been committed upstream, but the whitetank one
hasn't. I have re-tested the original whitetank patch and it works fine.

So I'll re-post it here and to the ML.

Comment 21 errata-xmlrpc 2011-07-21 07:47:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html
